Latest Updates: Our Blog

Category: Documents

WRAL builds award-winning app with DocumentCloud API

Posted
Apr 7th, 2016

Tags
Code,Documents,People

Author
Anthony DeBarros

A big congratulations from the DocumentCloud team to Tyler Dukes, public records reporter at TV station WRAL in North Carolina. Dukes received a 2016 Sunshine Award from the North Carolina Open Government Coalition for work including a custom document-search application he built using the DocumentCloud API.

The API is a powerful piece of the DocumentCloud platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.

When the University of North Carolina at Chapel Hill released hundreds of thousands of pages of documents gathered during an independent investigation into academic fraud involving faculty, staff and student athletes, Dukes turned to the search method of DocumentCloud’s API to build a web application that let users find and read documents by keyword or key people in the investigation.

“We wanted to build something to allow users to browse and search hundreds of thousands of pages of documents all in one place,” Dukes said. “DocumentCloud’s existing embeddable search was close, but because the documents were arbitrarily spread across hundreds of batches, I was concerned it would be too confusing for the average user.

“The API allowed us to very quickly prototype and roll out exactly what we wanted for this very specific circumstance,” he said. “We used the API to pull every page from documents stored in a single project and display them in an intuitive application that allows users to read page by page (or even random pages) or search everything at once. We’ve updated the application twice now, and we’re currently up to 680,000 pages and counting.”
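For developers curious what a call to the search method might look like, here is a minimal sketch in Ruby that only builds the request URL. It is not WRAL's code. The endpoint path and the q, page and per_page parameter names are assumptions based on our recollection of the API at the time, so check the help documentation before relying on them; search_url is a hypothetical helper name.

```ruby
require "uri"

# Assumption: the search method is served from this path and accepts
# q, page and per_page. Confirm against the API help pages.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"

# Build (but do not send) a search request URL for a keyword.
# search_url is a hypothetical helper, not part of any DocumentCloud library.
def search_url(keyword, page: 1, per_page: 10)
  query = URI.encode_www_form({ "q" => keyword, "page" => page, "per_page" => per_page })
  "#{SEARCH_ENDPOINT}?#{query}"
end

puts search_url("academic fraud", page: 2)
```

An application like the one described above would page through the results by increasing page until no more documents come back.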

Dukes’ project is one of several in recent months that have used our API or components to give readers custom search and viewing, including a Wall Street Journal application to let readers tag Hillary Clinton’s emails and La Nacion’s election crowdsourcing application VozData.

If you’re interested in using the DocumentCloud API, check out our help documentation and don’t be shy about getting in touch.

Storytelling with improved DocumentCloud notes

Posted
Feb 17th, 2016

Tags
Documents

Author
Anthony DeBarros

Starting this week, DocumentCloud notes are sporting a subtle facelift that aligns their type style and color palette with the more modern aesthetic developed for our recently launched page embed.

Notes, like pages, are fully responsive, lightweight and a great choice for websites viewed on a variety of screen sizes. Plus, each note includes a link to view the full document. With today’s update, notes adapt better to varying device widths and rely less on expensive JavaScript calculations.

Here’s a note in action:

A key procedure — doing a 360-degree scan of the structure — was not followed, according to the report. Also, personnel did not observe the fire on the first floor of the house.

Using notes to strengthen the narrative

Improved storytelling is one of DocumentCloud’s aims, and weaving notes into a story can build a stronger narrative. For example, the Chicago Tribune recently reported details on when aides to Chicago Mayor Rahm Emanuel became aware of facts in the police shooting of Laquan McDonald. Reporters placed notes at key points in the story to highlight portions of emails, calendar items and other documents showing when officials discussed the shooting.

Oregon Public Broadcasting also used notes to highlight phrases in potentially confusing earthquake insurance policies. By weaving notes into the story, readers could quickly view the specific contract language the story discussed.

Our work on notes and pages is leading towards improvements to our main document viewer and is part of an effort to improve overall publishing performance. Our code is open source, so if you’re a developer you can follow progress on notes, pages, and our whole platform on GitHub. As always, we welcome your thoughts at support@documentcloud.org.

WordPress DocumentCloud 0.4.0 supports page embed

Posted
Dec 17th, 2015

Tags
Code,Documents

Author
Anthony DeBarros

We’re excited to announce that DocumentCloud’s custom WordPress plugin now features support for Page Embed, our lightweight, responsive viewer. Version 0.4.0 — which includes additional enhancements and bug fixes — is available for download now, and we recommend upgrading as soon as practical.

The plugin makes publishing documents, notes and now pages as simple as dropping a shortcode in your WordPress post. As with documents and notes, the embed wizard in the DocumentCloud workspace generates the shortcode for a page at the same time it creates our traditional embed code. Here’s an example:

[documentcloud url="https://www.documentcloud.org/documents/1659580-economic-analysis-of-the-south-pole-traverse.html#document/p4"]

Download, install and activate the plugin, drop the shortcode into your story, and it renders the page including annotations:

For more information on embedding, check out our Help docs.

Our WordPress plugin — as with the entire DocumentCloud platform — is an open-source project, and we welcome your contributions, ideas and feedback! Visit the GitHub repository or get in touch via support@documentcloud.org

Celebrating one million public documents

Posted
Nov 24th, 2015

Tags
Documents

Author
Anthony DeBarros

Dear DocumentCloud users:

Pat yourselves on the back!

On Monday evening, the number of documents available in the DocumentCloud public catalog passed one million. All told, the number of public pages now exceeds 13 million.

These public documents — plus another 1.4 million private documents in our database — represent a lot of your hard work. Often, they’re the result of hours of dogged reporting, persistent requests to government agencies, the scraping of websites, and a determination to treat “no” as an unacceptable answer. 

Thanks to you, DocumentCloud’s public catalog has become a deep well representing an amazing diversity of topics. In November’s uploads alone, you’ll find subjects ranging from New York state’s lawsuit against the fantasy sports site DraftKings to a recent announcement by the National Institutes of Health that it will no longer support biomedical research on chimpanzees to tens of thousands of pages of Argentinian election results.

It’s a moment to celebrate. We at DocumentCloud and Investigative Reporters and Editors applaud you. We’re grateful both for your reporting that shines a light and for what, collectively, you’re building along with us. Thank you.

Easier publishing with WordPress and oEmbed

Posted
May 5th, 2015

Tags
Code,Documents

Author
Ted Han

Today, we’re making it easier for you to embed documents and notes with an updated WordPress plugin and an oEmbed service to power it.

The WordPress plugin – which builds upon an earlier version developed by Chris Amico for NPR’s StateImpact project – adds the ability to embed notes as well as documents using shortcodes. And we’ve updated our embed wizard so it will generate the WordPress shortcodes for you like this:

[documentcloud url="http://www.documentcloud.org/documents/1699074-sb0101-05-enrs/annotations/210824.html" ]

Which embeds like this:

The plugin is powered by our new oEmbed API. It's our next step in helping you integrate DocumentCloud embed codes into your content management system so embedding documents and notes is as simple as pasting in a URL.

You can install our plugin right now by visiting its WordPress page. Developers interested in adding simple embedding with our oEmbed API can find its documentation in our help pages.

Read on for more details:

WordPress

Since 2011, WordPress users have been able to embed documents on their blogs thanks to a plugin created by Chris Amico for NPR’s StateImpact project.

Chris’ plugin let users embed documents with shortcodes by translating the shortcodes into HTML embed codes.

With Chris’ help, and input from Adam Schweigert of the Project Largo team, we’ve released a new version of the plugin that adds:

  1. Note Embeds: You can now embed individual notes as easily as documents. Just pass a note URL into the shortcode with the url attribute.
  2. Raw URL Support: You can now paste the URL for a document or note onto its own line, and the plugin will translate that into an embed.
  3. oEmbed Support: Embed codes are now fetched from our oEmbed service rather than generated internally, so they’ll always be up to date.
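To make the difference between document, page and note URLs concrete, here is a small Ruby sketch that pulls the url attribute out of a shortcode and guesses the embed type. The plugin itself is written for WordPress, and this is not its code. The URL shapes are inferred only from the example shortcodes in our posts, and the helper names shortcode_url and embed_type are hypothetical.

```ruby
# Extract the url attribute from a [documentcloud url="..."] shortcode.
# Returns nil when the text is not a documentcloud shortcode.
def shortcode_url(text)
  text[/\[documentcloud\s+url="([^"]+)"\s*\]/, 1]
end

# Guess the embed type from the URL shape. The patterns are assumptions
# drawn from the example URLs in these posts, not from the plugin source.
def embed_type(url)
  case url
  when %r{/documents/[^/]+/annotations/\d+\.html\z} then :note
  when %r{/documents/[^/#]+\.html#document/p\d+\z}  then :page
  when %r{/documents/[^/#]+\.html\z}                then :document
  else :unknown
  end
end

shortcode = '[documentcloud url="http://www.documentcloud.org/documents/1699074-sb0101-05-enrs/annotations/210824.html" ]'
puts embed_type(shortcode_url(shortcode))
```

The note pattern is checked first because a note URL also contains a document path.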

Full installation and usage instructions are available on the plugin page.

Note for users of the existing plugin: Because we’re releasing a whole new plugin, prior installations of Navis DocumentCloud won’t automatically update. You’ll have to deactivate/delete Navis DocumentCloud from your site and install the new plugin. Sorry! This should be a one-time process, and future updates will be delivered through the normal WordPress update mechanism.

oEmbed

We’ve added an oEmbed API to make it easier for developers to get embed codes for documents and notes. oEmbed has become a standard for easily embedding Web content, and it has long been one of our users’ top feature requests. Now, instead of having to reverse engineer the format of our embed codes, developers can just send a request to the DocumentCloud API to ask for a correctly formatted embed code.

CMSes that support oEmbed can turn DocumentCloud URLs for documents and notes directly into embeds. Users can be assured that their CMS won’t eat or mangle an HTML/JavaScript embed code, and developers can use existing oEmbed tools to support DocumentCloud’s current embeds as well as future types of embeds we have in the works.
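As a rough illustration of what an oEmbed consumer sends, the Ruby sketch below builds a request URL for a note. The url, maxwidth and maxheight parameters come from the oEmbed standard. The endpoint path is an assumption on our part, and oembed_request_url is a hypothetical helper, so confirm the details in the API help pages.

```ruby
require "uri"

# Assumption: the oEmbed service is served from this path.
OEMBED_ENDPOINT = "https://www.documentcloud.org/api/oembed.json"

# Build (but do not send) an oEmbed request for a document or note URL.
# maxwidth and maxheight are optional size hints defined by the oEmbed standard.
def oembed_request_url(resource_url, maxwidth: nil, maxheight: nil)
  params = { "url" => resource_url }
  params["maxwidth"] = maxwidth if maxwidth
  params["maxheight"] = maxheight if maxheight
  "#{OEMBED_ENDPOINT}?#{URI.encode_www_form(params)}"
end

note_url = "http://www.documentcloud.org/documents/1699074-sb0101-05-enrs/annotations/210824.html"
puts oembed_request_url(note_url, maxwidth: 600)
```

A CMS would fetch that URL and, per the oEmbed standard, read the embed markup from the html field of the JSON response.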

If you’d like your CMS to support embedding documents as easily as DocumentCloud’s WordPress plugin, you or your developers can read more about our oEmbed service in our API help pages.

Thanks for reading. Follow us on Twitter for more updates.

Hungarian, Norwegian, Swedish OCR support added

Posted
Mar 12th, 2015

Tags
Documents,Workspace

Author
Anthony DeBarros

Starting today, DocumentCloud users can choose three additional languages to OCR uploaded documents: Hungarian, Norwegian and Swedish. We’ve added the three based on support requests and feedback we heard during last week’s NICAR conference in Atlanta.

The addition brings the number of languages available for OCR to 17. To see them all, click “Manage account” beneath your user name and click the “New Documents” dropdown under “Language Defaults.”

We believe that journalists around the world should have access to tools that enable better reporting, and growing our language support is critical to that. DocumentCloud’s language support falls into three independent categories (so partial support of your language may be possible):

Text Search: DocumentCloud fully supports Unicode, allowing users to search documents in a variety of character sets. For scanned documents that require OCR, DocumentCloud uses the battle-tested open-source Tesseract engine, which also powers Google Books and is maintained by Google. The open-source community has contributed language packages for many widely used languages, which allows DocumentCloud to enable them on our platform. So, if DocumentCloud does not yet support your language, please reach out and let us know about your interest!

Entity extraction: DocumentCloud supports identifying people, places, organizations and other entities through OpenCalais. As of now, OpenCalais only supports English, French and Spanish. We are evaluating other tools that would allow us to bring entity extraction to other languages.

Interface translation: Accessibility to our tools is more than being able to process non-English documents. Our users have already collaborated with us to translate DocumentCloud’s Workspace user interface into four languages: English, Spanish, Russian and Ukrainian. If you are interested in bringing DocumentCloud to your language, please email us at info@documentcloud.org.

Shearing PDFs with PDFShaver at DocumentCloud

Posted
Mar 7th, 2015

Tags
Documents

Author
Ted Han

As of this week, documents uploaded to DocumentCloud will process much faster thanks to a new tool we’ve written called PDFShaver that wraps Google Chrome’s PDFium library.

How much faster? From our preliminary statistics, a lot.

Under the covers, DocumentCloud uses our Docsplit open source library to disassemble documents. Prior to PDFShaver, Docsplit relied upon GraphicsMagick and Ghostscript (GM+GS) to render PDFs and save pages as images.

GraphicsMagick and Ghostscript have served DocumentCloud well, but we’ve had trouble processing some poorly constructed documents that journalists receive from sources — governments, companies and non-profits, for example. Our search for a replacement led us to PDFium, and we found that not only did it solve a number of our issues but it also provided substantial gains in speed.

Testing PDFShaver and GraphicsMagick on 50 documents picked at random from DocumentCloud’s public collection shows that PDFShaver can render documents an order of magnitude faster (here’s our raw data). These data are a preliminary sample, but we’re excited about what they show about the kinds of speed gains we can make to our processing pipeline. We’ll continue to track PDFShaver and DocumentCloud’s performance as we make improvements, so look forward to more updates!

Rendering PDF pages with PDFShaver & PDFium

PDFShaver works by connecting PDFium to Ruby with a C/C++ extension inside a Ruby gem.  PDFium itself is an open source library and the software that powers Google Chrome’s PDF viewer. And aside from taking advantage of the speed and capabilities Chrome’s tools provide, we’re happy to be able to make open source PDF processing easier to access through a programming language such as Ruby.

For example, picking the landscape-oriented pages out of a document and rendering them is as easy as these three lines:

document = PDFShaver::Document.new("./path/to/document.pdf")
landscape_pages = document.pages.select{ |page| page.aspect > 1 }
landscape_pages.each{ |page| page.render("page_#{page.number}.gif") }

We plan to keep improving PDFShaver and make more of PDFium’s features accessible to give Rubyists, data scientists and journalists a boost for overcoming the impediments that PDFs present.

If you’re interested in installing and using PDFShaver, you can read how on our GitHub repository. And if you’d like to help journalists and others free information from PDFs, your contributions are welcome!

Catch DocumentCloud at NICAR 2015 in Atlanta

Posted
Feb 26th, 2015

Tags
Documents,People

Author
Anthony DeBarros

If you’re coming to IRE’s annual data journalism conference March 5-8 in Atlanta, be sure to stop by and say hello to the DocumentCloud team!

We’ve made a few ways for you to learn more about the platform, tell us your ideas, and hear about what’s next for DocumentCloud:

— Saturday at 3:20 p.m., join us for a hands-on class, “Reporting and Presentation with DocumentCloud.” Get to know the suite of tools that DocumentCloud offers to help you better organize, analyze and present public documents.

— Sunday at 11:20 a.m. in the Demo Room, we’ll offer “Advanced DocumentCloud: Examples and Suggestions.” Take a deeper dive into DocumentCloud and its API and bring your ideas for features you’d like to see in the platform. Plus, see some of the best uses of DocumentCloud in the last year!

— Throughout the conference, you can find the DocumentCloud team on hand at its booth outside the conference rooms. We’ll be set up to give you a demo, answer questions about accounts and hear how you use the platform.

During the conference, reach us by email or Twitter. Find Ted Han via ted@documentcloud.org and @knowtheory; Anthony DeBarros via anthony@documentcloud.org and @anthonydb; and Lauren Grandestaff via lauren@ire.org and @lgrandestaff.

We’ll look forward to seeing you!

Ahead for 2015: A Faster, More Productive DocumentCloud

Posted
Jan 20th, 2015

Tags
Documents,Workspace

Author
Anthony DeBarros

If you’d dropped into the DocumentCloud workspace in Columbia, Mo., at the start of January, you’d have found at least two things: a team actively avoiding the single-digit temps outside the office and a whiteboard that we frequently filled with ideas, photographed for posterity, erased and filled up again.

We ended up ignoring the temps. The ideas generated enough heat to last us until summer – and beyond!

So, what’s ahead for 2015? We’ve spent the last year researching and reflecting on what you want, both as a buildup to the recent $1.4 million Knight Foundation grant and to make sure you’re as happy as possible with the service. We want to be sure the platform is fast, reliable, and enables you to do your best work.

There’s a legacy to continue. DocumentCloud was founded by and for journalists to support in-depth reporting around public documents. Today, more than 900 news organizations worldwide use the platform. Whether it’s publishing the documents related to doubts about the guilt of a Texas man executed for murder or the grand jury testimony regarding the death of Michael Brown in Ferguson, Mo., journalists use DocumentCloud to give their readers a first-hand view of the primary source documents they gather.

We plan to build from that success. The DocumentCloud team’s expanding, and we’re locking in a roadmap to grow and improve the platform. We’ve recently hired a director of product development, and we’ve just posted a job description for a front-end developer. With a bigger team and lots of focus, we believe that by the end of the year, you’ll see substantial improvements and expanded offerings that will maintain DocumentCloud’s place as an essential reporting and presentation tool.

So, here are some things we’re going to do to help you:

— Improved processing. If you use DocumentCloud, you upload documents to the platform. You want them processed quickly, without errors, so you can get right to publishing or annotating. Over the last year, we’ve made substantial improvements to our processing cluster and sped up imports of popular documents uploaded by multiple users. And we’re glad that you’re noticing the results. We have more work planned – look for a blog post soon detailing the changes.

— Go mobile. We know (and so do you) that more of your readers view more of your content on their phones. So, we’re planning mobile-specific changes to the viewer to improve scrolling, zooming and the experience in general.

— More storytelling tools. We’re exploring ideas for expanding the display options for the viewer, such as presentation templates, social media sharing, notes displays and more. Many of you have asked for oEmbed support, and we’re looking closely at that!

— Telling DocumentCloud’s story. We’ll bring you more blog posts like this, keeping you up to date on our progress and listening for your ideas. We’ll also tell you more about how to make better use of all the site has to offer — such as deeper search options — and highlight great examples of your storytelling.

— Expanded reach and premium offerings. We want to make sure DocumentCloud is going to be available to journalists for years to come, and one way to do that is for the platform to begin generating revenue. This goal is part of the Knight grant, and so we’ll be exploring options – premium features, opening the tool to additional types of users on a fee basis, donations, and other ideas.

Beyond those, we have many more ideas, among them better feedback on document processing, ability to rotate pages, better organization of the site and workspace, and batch processing options.

That’s a lot to chew in one year, but with our team expanding – and, we hope, with continued contributions from the open source community – we’re pretty excited about the prospects.

As always, let us know your thoughts!  You can reach us on UserVoice, Twitter, or email.

How we made DocumentCloud note embeds responsive

Posted
Sep 25th, 2014

Tags
Documents

Author
Ted Han

By Tom Meagher, data editor at The Marshall Project,
Emily Yount, interaction designer at The Washington Post,
Matt DeLong, national digital projects editor at The Washington Post
and Ted Han, lead developer at IRE/DocumentCloud

On Aug. 3, The Marshall Project, a new nonprofit journalism organization focused on criminal justice issues, published an investigation in partnership with The Washington Post that revealed new evidence raising doubts about a high-profile Texas execution.

Tom: Our reporter, Maurice Possley, began working on this story months before most of the rest of our newsroom at the Marshall Project was even hired. By the time we were able to start helping, the story was mostly reported, so we dove into the documents to bring ourselves up to speed.

The case against Cameron Todd Willingham — who was executed in Texas for the murder of his three daughters — had been written about extensively over the last 22 years, but a lot of new information was uncovered, and it was all in the documents. We knew we wanted to be able to explore and highlight the correspondence that cast this case in an entirely new light. DocumentCloud was clearly the answer.

In the course of his reporting, Possley, who has covered this case for more than a decade, was given access to copies of dozens of primary source documents that tell the backstory of Cameron Todd Willingham and the informer who helped convict him. In filing its grievance with the State Bar of Texas against the former prosecutor in the case, the Innocence Project had acquired these documents and assembled them into a series of appendices. They gave us eight PDF files that added up to nearly 400 pages. We used DocumentCloud to stitch them all back together into one large file.

We then combed through the appendices and dozens of other records of court testimony and correspondence. As we saw the various typefaces and handwriting styles that made up the key passages, we knew we wanted to use DocumentCloud notes to present excerpts directly in the story.

Matt: I started working on the story in earnest a couple of weeks before it published. We were very excited about having so many primary-source documents to enrich the narrative. The Post has been using DocumentCloud for years, but we’ve long been frustrated by one of its biggest limitations: it isn’t mobile-friendly. This isn’t really DocumentCloud’s fault; these scanned documents are a set size, so when you scale them down, at some point words will become too small to read.

We had seen how the New York Times addressed the problem, by putting up the text and linking to the original document in DocumentCloud. That’s totally logical and fine if the words are all that you care about, but in this case we have official letters and handwritten notes between the characters in the story. The pages themselves are interesting, and many readers will want to see them with their own eyes.

We decided at the outset that however we ended up displaying the documents we included in the story, they had to be responsive. But this meant we’d have to come up with our own hack. Emily and I had both been thinking about this problem individually for a while, and we had some time to work on it, so we decided to try to figure out a solution that we could use in this project.

Emily: At the time of publication, DocumentCloud’s note embed code already resized and repositioned the note based on the width of the DC-note-container div, so I knew we only needed to solve for when the note is wider than the note embed and the right side of the note is cut off (see image below).

image01

To solve this problem, when the embed first loads, the code stores the coordinates, width and height of the note relative to an image of the page of the annotated document. When the page loads, the browser resizes or the orientation of your device changes, JavaScript media queries (matchMedia) detect whether the note is wider than the embed, and the code then resizes and repositions the document image.

The original coordinates, width and height allow us to determine how wide the note is in relation to the document image and resize the document just enough to make the note 100% of the embed, instead of the document image 100% of the embed. This helps with readability by making the text as large as it can be. At times, depending on the width of the note and the size of the text, there will still be readability issues, so cropping the annotations carefully and testing to make sure they are readable is really important.
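The arithmetic behind that resize can be sketched in a few lines of Ruby. This is an illustration of the idea as described above, not the code that shipped in the stories or in documentcloud-notes, and every name in it (NoteFit, fit_note_to_embed and the parameters) is hypothetical.

```ruby
# Result of fitting a note to an embed: the scale factor, the scaled page
# image width, and how far to shift the image so the note starts at the
# embed's left edge. All names are illustrative.
NoteFit = Struct.new(:scale, :image_width, :image_offset_x)

# page_width, note_x and note_width are measured on the original page image;
# embed_width is the current width of the embed container.
def fit_note_to_embed(page_width:, note_x:, note_width:, embed_width:)
  scale = embed_width.to_f / note_width   # make the note 100% of the embed
  NoteFit.new(scale, page_width * scale, -note_x * scale)
end

fit = fit_note_to_embed(page_width: 1000, note_x: 100, note_width: 640, embed_width: 320)
puts "scale=#{fit.scale} width=#{fit.image_width} offset=#{fit.image_offset_x}"
```

With a 1000-pixel-wide page image, a 640-pixel-wide note starting 100 pixels from the left and a 320-pixel embed, the page image is scaled by 0.5 to 500 pixels wide and shifted 50 pixels to the left, so the note exactly fills the embed.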

Here’s an example from the Willingham story of a responsive note on an iPhone 5:

image00

Ted: We were thrilled when Ben Chartoff (OpenNews fellow at the Washington Post) reached out to put Emily in touch with us.

We believe deeply in DocumentCloud as an open source project as well as the service to which journalists post documents relevant to the public interest. Emily and Matt’s motivation to extend the behavior which DocumentCloud already provides and to share their code back is exactly the kind of effort we love to see and encourage.

Technology in the world of news is a means toward the end of better reporting. Especially in competitive industries like ours, an open source ethos around the tools we all share is an avenue for us to work together to improve the state of all reporting. Anyone who solves an issue for their own needs can help to solve that issue for everyone.

In that spirit we were excited to incorporate Emily’s code into our own. To do so, we spun our note code off into its own repository to make it easier for anyone to contribute (you can find the code on GitHub as documentcloud-notes). Then with the Washington Post’s & Marshall Project’s stories as a basis we began incorporating the changes. Ultimately, we ended up rewriting much of Emily’s code in the process, but what she had written served as the design criteria to anchor the code we wrote.

Our responsive notes code is already live on DocumentCloud now, and journalists needn’t take any additional steps to use it. Any embedded note from DocumentCloud will now behave responsively.

When Documents Are Challenged

Posted
Apr 5th, 2012

Tags
Documents,IdeaLab

Author
Mark Horvit

Last week, DocumentCloud received a complaint seeking the removal of a collection of emails posted by journalists with the Australian Financial Review. The emails involved a company called NDS, which hired a law firm to try and have the documents pulled from public view. This kind of thing is rare, but it happens. This case in particular has a couple of wrinkles that make it unusual, and it presents a good opportunity to remind all of our members that DocumentCloud has policies and options in place that allow you to keep all documents processed through our service available to the public for as long as you desire.

I’ll detail those below. But first, a little context.

DocumentCloud was created as a 501(c)(3) nonprofit organization and remains so as part of Investigative Reporters and Editors (IRE). The service is offered free of charge and all expenses and manpower are covered through grants and IRE’s normal operations. We provide a suite of tools that allow you to analyze and publish documents, and we don’t control what you post. And, we don’t have a large budget to fight legal challenges to items you post.

There have been only a handful of cases in which DocumentCloud has received legal challenges to material posted on the site, where we now host more than 4 million pages.

Typically those challenges have involved allegations of copyright violation. In every case, we have contacted the posting news organization and asked them how they would like to handle the complaint. Our terms of service detail how we handle those cases, using a process based on the Digital Millennium Copyright Act (DMCA). DocumentCloud is a neutral party hosting content on behalf of users and is protected by the DMCA’s safe-harbor provisions. If we receive a formal complaint, we contact the organization that uploaded the material. If they assert their right to publish, the documents remain public and the matter is resolved between the complainant and the posting organization.

We also offer an alternative for organizations that would prefer to host their own documents and still use DocumentCloud’s viewer. A number of news organizations have chosen this option, for a variety of reasons. We make document data and our viewer code available for download to journalists directly through our workspace. Downloading a viewer will provide a news organization with an html file that is functionally indistinguishable from the viewers we host.

It’s also worth noting that all of our software is available free and open source to any journalist or software developer who wishes to use or improve upon it (and members of both groups have done so).

The case that came up last week involving the Australian Financial Review presented some new issues. The company filing the complaint over the posted emails alleged a variety of issues, but didn’t cite the DMCA. AFR opted to take down the documents rather than provide us with a letter asserting their right to publish and offering indemnity for DocumentCloud. The company said it did so because it believes that action is more appropriate in Australia, so it did not wish to become involved in a U.S. dispute with NDS. They opted to download the viewer, and AFR plans to repost the documents using the DocumentCloud software.

Dealing with such challenges is an inevitable byproduct of hosting documents. If you have questions about our policies or suggestions on how we can improve our service, please get in touch; my email is mhorvit@ire.org.

Printing Document Annotations

Posted
Sep 26th, 2011

Tags
Documents

Author
Ted Han

We’ve been hard at work during our short Columbia, Missouri hackathon at DocumentCloud’s new home at the Investigative Reporters & Editors office. As a result we’ve rolled out a new feature for readers and journalists to print annotations made on documents.

Journalists have been publishing documents through DocumentCloud for a while now as well as annotating documents both for readers and for their own story writing processes. We think it’s just as important for DocumentCloud to make story writing quicker and easier as it is to help readers find primary source material.

So, when Marshall Allen of ProPublica told us that he would like to try using DocumentCloud to take his story notes, we did our best to help out. As a result, you can now select one or more documents in the workspace and choose “Print Notes” under the “Publish” menu.

This way you can annotate your sources in DocumentCloud, and have a single copy of all your research ready at hand for your copy editor or to read when your flight attendant announces that all power switches should be in the off position.

And readers can find a “Print Notes” link in the sidebar footer of the document viewer too.

We hope this will help readers and journalists alike note and collect information in the format that best suits their workflows. Happy Printing (and remember to recycle)!

FAQ: Should I Try Again?

Posted
Apr 1st, 2011

Tags
Documents,Workspace

Author
Amanda Hickman

Every once in a while, DocumentCloud gets hit with the kind of document stash that really slows us down. We can take a lot, but if one newsroom finally gets a 25,000-page FOIA turned over to them and another gets a hold of 30,000 pages of documents for a breaking news story on the same afternoon, that’s a volume that will tax our servers.

We recently established a “fast lane” to ensure that smaller documents don’t have to get in line behind behemoths, but that doesn’t help if you’ve got a few MB of documents about a local scandal — you’ll still have to shuffle into line with the big sets. Continue reading »

Embed a Set of Documents

Posted
Mar 30th, 2011

Tags
Documents,Workspace

Author
documentcloud

Sets of documents are nothing new to DocumentCloud.

The Las Vegas Sun published hundreds of pages of legislation, emails, court filings and medical records alongside their award-winning package on hospital care in Las Vegas. The Sun’s Marshall Allen assembled each document collection by hand to produce that page. Plenty of other newsrooms have used our API to do likewise. Even with the API, it isn’t trivial to assemble and publish a set of documents.

Some document sets are living creatures that continue to grow: Chad Skelton at The Vancouver Sun has been adding documents retrieved from the local ferry authority’s website to a growing cache of public records on DocumentCloud. The only way to ensure his readers will find new documents as they roll in is to point the public straight to DocumentCloud to find ferry authority FOIAs. It should be easier to embed that growing set of public documents right at The Vancouver Sun.

Samuel Clay has been working very hard to make that possible for every DocumentCloud newsroom. Continue reading »

Improved Document Collaboration

Posted
Mar 9th, 2011

Tags
Documents,Workspace

Author
Amanda Hickman

From inviting a law professor to help Arizona readers understand recent legislation to asking some top-notch designers to review New York’s new ballot, DocumentCloud users have already found some great ways to bring in experts from outside the newsroom, and we thought it was time to make it much easier to do just that.

We spent some time at ONA last year, brainstorming with the good folks from the Public Insight Network — they really helped us distill this into a workable feature. We’re looking forward to seeing PIN newsrooms do some great reporting aided by this new feature. Continue reading »

A Million Pages

Posted
Feb 28th, 2011

Tags
Documents

Author
Jeremy Ashkenas

This morning, not quite one year since we opened our beta to newsrooms at NICAR 2010, the millionth page of primary source material was uploaded to DocumentCloud. Reaching this milestone so soon is a tribute to our users and the amazing document-driven investigative reporting you have published over the past year.

Most of the thousands of documents in our catalog have arrived in small batches: five documents here, 20 there, most often accompanying a breaking story. Take a look for yourself: browse through recently published documents by searching for “filter: published” or read up on other searches you can run.

Now is a good moment to highlight some notable recent stories:

Last week, the Center for Public Integrity launched a series of articles on hidden hazards at oil refineries in the United States. Readers of Regulatory Flaws, Repeated Violations Put Oil Refinery Workers at Risk can review a dozen citations and court filings that the Center’s journalists used in the reporting.

Sunday, The New York Times published the first installment of an investigation into lax regulation of natural-gas drilling across the US, accompanied by a large cache of E.P.A. and industry documents.

The Seattle Times reported last week on evidence of financial abuse in Seattle public schools, based on documents released by state auditors. The documents detail over-billing, intimidation, and ethics violations that add up to $1.8 million in potentially fraudulent expenses.

Thanks for a great first year, and here’s hoping that the next year brings millions more pages, and more great document-driven reporting.

Going Public

Posted
Jan 26th, 2011

Tags
Documents,Workspace

Author
Amanda Hickman

With close to 200 newsrooms contributing documents and thousands of documents in our catalog, we decided it was time to open DocumentCloud to public searches.

Wondering who is still covering the Deepwater Horizon oil spill? Try a search for “deepwater horizon” organization: transocean, and see documents that reference both the rig by name and the drilling contractor, Transocean. Then, click on the “Entities” tab to see more data provided by OpenCalais’ entity extraction.

Did you miss Memphis Commercial Appeal’s coverage of Ernest Withers? Catch up with a search for group: commercial-appeal withers, and find every document uploaded by reporters in the Commercial Appeal newsroom that mentions Withers by name. Curious to see the annotations journalists have been making on the documents they’re sharing? Try a search for filter: annotated and you’ll skip any documents that were published without annotations.

There’s plenty more you can do with DocumentCloud’s search syntax. Check out our primer and try a few searches.
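For anyone who wants to script searches like the ones above, here is a minimal Python sketch that assembles the same query strings and turns them into a request URL. The helper names (`build_query`, `search_url`) are hypothetical, and the endpoint address is an assumption made for illustration rather than something stated in this post, so check the API documentation before relying on it.

```python
from urllib.parse import urlencode

# Assumption: illustrative endpoint only; confirm the real address in the API docs.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"


def build_query(*parts):
    """Join search parts into one query string.

    A plain string is a text term (quoted if it contains a space);
    a (field, value) tuple becomes a "field: value" filter.
    """
    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            field, value = part
            pieces.append(f"{field}: {value}")
        elif " " in part:
            pieces.append(f'"{part}"')
        else:
            pieces.append(part)
    return " ".join(pieces)


def search_url(query, endpoint=SEARCH_ENDPOINT):
    """Return the endpoint URL with the query URL-encoded as the q parameter."""
    return f"{endpoint}?{urlencode({'q': query})}"


# The three searches described in this post:
deepwater = build_query("deepwater horizon", ("organization", "transocean"))
withers = build_query(("group", "commercial-appeal"), "withers")
annotated = build_query(("filter", "annotated"))
```

Passing any of these strings to `search_url` gives a URL you could fetch with the HTTP client of your choice. The sketch only builds the strings and makes no network request.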

We’d love to know what you think, and what you’ve found.

PS. Finding bugs rather than documents? We want to know about those, too.

Embedding Documents on Your Site (UPDATED)

Posted
Jun 4th, 2010

Tags
Documents

Author
Jeremy Ashkenas

Over the past few months, you might have noticed a handful of news organizations using embedded documents to complement their reporting.

This morning, we’re opening up the ability to embed documents to all of the newsrooms participating in DocumentCloud. When you log into your workspace, you’ll notice a new menu: “Publish”.

From here, you can grab an embed code (a short snippet of HTML) that can be dropped onto a web page to create a document viewer. You may be familiar with such snippets from embedding YouTube videos: this works in a similar fashion. For guidelines on setting up a template and other help, check out our documentation.
If you still have questions about the process, we’re listening at support@documentcloud.org.

Note: we know you’re eager to host documents yourself, and you can do that now, but we recommend that you stick with embedded documents so that you can take advantage of bug fixes and other improvements to the viewer. We don’t know yet whether we plan to offer embedding as a long-term service. Keep in mind, as well, that this is still a beta. As described in our terms, our capacity to commit to uninterrupted service is limited, as is our liability if service is interrupted in some way.

For those news organizations that want to host documents on their own servers, we’re now offering that as an alternative too. Click on “Download Document Viewer” to get a zipped up folder with all the code, text, and images bundled together as a web page. Drop the folder into any web server (no special software required), and voila, it’s online.

Search of the document’s text is provided by DocumentCloud as a service, but everything else in the package is completely static — just HTML, images, JavaScript and CSS. If you choose to use this alternative, there is a caveat: If you edit your annotations, or want to make any changes to the document, you’ll have to download it again.

Here at DocumentCloud, we’re looking forward to seeing the great reporting you do with embedded documents — don’t forget to use the workspace to add a “Related Article” link.

Documents Rolling In

Posted
Apr 12th, 2010

Tags
Documents

Author
Amanda Hickman

Reblogged from the PBS IdeaLab.

Eagle-eyed followers of the DocumentCloud Twitter feed have already picked up on the fact that we began adding users to our beta last month.

We made a strategic decision to peg our beta to NICAR’s March 2010 computer-assisted reporting conference, where we knew we’d be able to gather a sizable group of just the sort of investigative reporters we hope to support with DocumentCloud, and get them excited about using our tools to do more with their documents. Nothing beats hands-on support when you’re using a new tool. Plus, we identified dozens of quick fixes we could make after watching over journalists’ shoulders as they explored DocumentCloud.

In the month since NICAR, we’ve added more than 150 users who’ve uploaded a cumulative 54,000 pages of text, and made close to 300 documents available in DocumentCloud. Our repository is already home to police reports from New Orleans, a confirmation hearing transcript that adds context to coverage of Justice Stevens’ resignation, and disaster preparedness plans from Haiti. There’s even a collection of emails that document how some hedge funds not only saw the mortgage crash coming, but wagered on the collapse and won big. (The hedge fund that these reporters investigated argues it never had the hands-on role ascribed to it; that’s in DocumentCloud, too.) Eventually, anyone will be able to connect with those documents right through our website.

Want to be part of the beta? Get in touch and tell us a bit about the documents you’re working with.

We’re still adding beta testers and actively listening to the users we’ve got as we prioritize and refine our to-do lists, but we think we’re off to a great start.