We are thrilled to announce the latest addition to the DocumentCloud family, Dylan Freedman, who has assumed the role of lead developer.
Dylan comes to us fresh off earning his master’s in journalism at Stanford University, but he’s no typical hack. Prior to Stanford, Dylan worked for several years at Google, first as an intern and then as a full-time member of the research team.
Dylan worked extensively on projects that involved machine learning and language, but he wasn’t fully satisfied solving technical problems alone. He wanted, as he put it, a “more tangible connection to the world and its social topics.”
So he decided to get a master’s in journalism at Stanford and, after graduation, to join us at DocumentCloud. We couldn’t be happier to have him!
Dylan is filling some size 15 sneakers in taking over for Ted Han, who has moved into a part-time, emeritus role. Ted, as most of you know, is virtually synonymous with DocumentCloud, and words truly cannot express the gratitude we all feel for all he did to make DocumentCloud what it is today. We cannot thank Ted enough.
You can reach Dylan at firstname.lastname@example.org and find him patrolling our support and Slack channels. Please join me in welcoming him to the team!
Two sites will expand work with journalists, non-profits, and the public
We are thrilled to announce that DocumentCloud and MuckRock are merging.
The reason is simple: Mission. Our organizations share a core belief that institutions should be open, transparent and accountable to the people they serve.
This merger will strengthen both organizations and allow us to serve our users and that mission far better than either of us could alone. In fact, as we have been discussing this privately with friends and advisors, the most common reaction has been, “Why didn’t you do this sooner?”
Over time, our goal is to make the sites and services we run work seamlessly together, helping guide our combined 30,000-plus users through every step of the reporting process, from finding initial ideas to obtaining documents to analysis and, finally, publishing reporting critical to an informed democracy.
By the end of the year, you should start to see some visible changes: Easier document search and analysis in MuckRock. The ability to log into both sites with a single account. Better account management. A shared design. Enhanced stability, scalability, and speed for DocumentCloud.
The first example of this is already live: MuckRock’s Assignment tool, a powerful crowdsourcing platform that allows users to ask the public for help turning caches of records into structured data for easier analysis. DocumentCloud users can easily select a collection of documents on that site and import them into a new Assignment with just a click.
We’ll continue to build out this suite of integrated tools for accountability and transparency. In collaboration with our friends at Quartz, we’ve launched one already: Quackbot, a Slack bot that performs useful tasks for journalists.
We’re excited to help supercharge Quackbot with new “superpowers” — taking what’s traditionally done with one-off scripts or specialized services and packaging them for easy use by everyone. (If you’re a newsroom developer with a superpower you’d like Quackbot to learn, be sure to apply to be our News Nerd in Residence.)
On an administrative level, the two organizations have been working as one since January, and we’ve been delighted with how smoothly the process has gone. Going forward, Michael Morisy will act as Executive Director of the combined organization, with Aron Pilhofer serving as Chief Strategy Officer and Mitchell Kotler as Chief Technology Officer.
The organization’s board will draw from DocumentCloud and MuckRock’s existing boards.
Some important things won’t change: Both services will continue to be open source. The parent organization will continue to be a 501(c)(3) nonprofit with a public service mission to help the public better understand their government, their communities and their world.
And, most importantly, both services will continue to be dedicated to serving their incredible communities that work every day to leave us all a little more informed and aware. We’re excited to get to work together.
What this will mean for MuckRock users
Whether you’ve known it or not, if you’ve used MuckRock much you’ve already used DocumentCloud; the site has powered the embedded document viewer since launch. But because the feature relies on DocumentCloud’s public API, there are serious limitations on the feature set available to users.
The merger means we will have better access to each site’s APIs, meaning as soon as you get documents back through MuckRock you can take advantage of all DocumentCloud’s analysis and collaboration features. Take a peek at some of those features here. We’re also working on Quackbot integration, so hopefully soon you’ll be able to get updates on your requests right in Slack … and possibly even file new ones.
If you’re a MuckRock Pro or Organizational subscriber, we’ll be adding even more premium features so that you’ll get even more value out of your account. We have a roadmap we’re very excited about, but we’d also love to hear what you want to see so if you have ideas or feedback, let us know.
What this will mean for DocumentCloud users
The merger will mean a better, more stable, more functional platform. It will mean you will have access to a vast, and growing, set of tools, including MuckRock, Quackbot and FOIA Machine.
This merger is more than tools. Although documents are at the heart of both platforms, MuckRock adds a layer of workflow management, the notion that there may be many steps to a desired outcome. That could mean managing a large set of FOI requests or a crowdsourcing effort through the assignments tool. Workflow management is something DocumentCloud has never had, and now we will. Assignments and Quackbot are just the beginning, in other words.
There is another big change: By the end of the year, we plan to begin asking our users to support us financially, and will likely restrict some advanced features to those who do. This is nothing new. From the day we launched DocumentCloud in 2009 we said we would begin charging at some point, and that time has come.
We now host 4.1 million documents and serve more than 200 million document views per year. Our users are regularly uploading 40,000 or more documents per week. And as a result, the cost to run the platform has quadrupled over the past few years, and we think it’s fair to start asking for help defraying those costs.
There will be a limited free account available. That won’t change as a result of the merger. But for users who need to upload more than a few documents per month, and/or need access to advanced features, like bulk processing or publishing via our API, we will be asking for support.
There’s one more exciting change coming: DocumentCloud has historically limited access to newsrooms and freelance journalists. In part this was to limit our legal liability, since any user can publish to the documentcloud.org domain. However, our shared technical roadmap includes some key changes that will allow us to offer a DocumentCloud account to anyone who wants one: journalists, yes, but also bloggers, citizen journalists, activists, academics. This is something we have always wanted to do, and now we can.
These are just a few of the improvements we are planning. Much, much more to come. We are both excited beyond words for this next chapter and look forward to having you along for the ride. If you have any questions, concerns or suggestions, don’t hesitate to get in touch with us.
Executive Director, MuckRock
Come help build the future of journalism. DocumentCloud.org is looking for a developer to help us expand our open-source platform designed to make journalism more impactful, transparent, and trustworthy.
DocumentCloud.org is a global platform that now hosts more than 4 million documents and serves hundreds of millions of pageviews each year. It has been used by tens of thousands of journalists worldwide to upload, analyze, annotate and publish primary source documents.
DocumentCloud has helped the largest and most innovative news organizations in the world tackle high-profile stories such as WikiLeaks, Panama Papers and the Snowden documents.
The role is for a full-stack programmer who can continue to evolve and run a large-scale Ruby on Rails site in an AWS environment where uptime matters. The job requires being involved with the entire stack, from front-end to document indexing, but one needn’t be an expert in everything.
You’ll be working closely with an experienced team to help you grow in the areas you’re interested in, including opportunities to experiment with machine learning, chatbots, crowdsourcing and more.
As important, this is for someone who cares about making better journalism but knows that code isn’t the solution to every problem. The developer will work hand-in-hand with the rest of the team to assess new opportunities, forge key partnerships, evolve the product roadmap and overall strategy for DocumentCloud.
We are extremely flexible with location and remote working. DocumentCloud is a distributed team, based coast-to-coast, with people in Philadelphia, the Bay Area and Boston.
We believe deeply in the importance of diversity and inclusion. We strongly encourage people of color and LGBTQIA candidates to apply. We’re committed to structuring our hiring process and organization to be welcoming to applicants of all backgrounds.
This is a full-time position with paid vacation; health, vision and dental benefits; and a 401(k).
The ideal candidate will have:
Experience building and scaling Ruby on Rails applications;
Experience working with Amazon Web Services or a similar cloud-based infrastructure;
Experience with front-end technologies and frameworks;
A desire to work for a mission-driven organization.
Other skills we’d like to see but are not a requirement:
Familiarity with indexing engines, such as Solr;
Experience with Python and Django;
Experience working with distributed teams;
Experience with using data and documents to tell a story, either through traditional reporting or in other ways.
If this role sounds interesting to you, drop Aron a line at email@example.com or apply directly at firstname.lastname@example.org
A big congratulations from the DocumentCloud team to Tyler Dukes, public records reporter at TV station WRAL in North Carolina. Dukes received a 2016 Sunshine Award from the North Carolina Open Government Coalition for work including a custom document-search application he built using the DocumentCloud API.
The API is a powerful piece of the DocumentCloud platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.
When the University of North Carolina at Chapel Hill released hundreds of thousands of pages of documents gathered during an independent investigation into academic fraud involving faculty, staff and student athletes, Dukes turned to the search method of DocumentCloud’s API to build a web application that let users find and read documents by keyword or key people in the investigation.
“We wanted to build something to allow users to browse and search hundreds of thousands of pages of documents all in one place,” Dukes said. “DocumentCloud’s existing embeddable search was close, but because the documents were arbitrarily spread across hundreds of batches, I was concerned it would be too confusing for the average user.
“The API allowed us to very quickly prototype and roll out exactly what we wanted for this very specific circumstance,” he said. “We used the API to pull every page from documents stored in a single project and display them in an intuitive application that allows users to read page by page (or even random pages) or search everything at once. We’ve updated the application twice now, and we’re currently up to 680,000 pages and counting.”
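For developers curious what a request like this might look like, here is a minimal Python sketch that only builds search URLs and makes no network calls. The endpoint path, the "projectid:" query prefix and the project identifier are assumptions made for illustration, not details taken from Dukes’ application, so check them against the API documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint for the search method; confirm against the API docs.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"


def build_search_url(keyword, project_id=None, page=1, per_page=100):
    """Return the URL for one page of search results.

    The "projectid:" prefix is an assumed query syntax for limiting
    a search to the documents stored in a single project.
    """
    terms = []
    if project_id:
        terms.append(f"projectid: {project_id}")
    if keyword:
        terms.append(keyword)
    params = {"q": " ".join(terms), "page": page, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_count(total_documents, per_page=100):
    """Number of result pages needed to walk every matching document."""
    return -(-total_documents // per_page)  # ceiling division


# "123-example-project" is a made-up project identifier.
print(build_search_url("transcript", project_id="123-example-project", page=2))
```

An application could call build_search_url once per result page, using page_count to know when to stop, and fetch each URL with whatever HTTP client it prefers.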
Dukes’ project is one of several in recent months that have used our API or components to give readers custom search and viewing, including a Wall Street Journal application to let readers tag Hillary Clinton’s emails and La Nacion’s election crowdsourcing application VozData.
Starting this week, DocumentCloud notes are sporting a subtle facelift that aligns their type style and color palette with the more modern aesthetic developed for our recently launched page embed.
Improved storytelling is one of DocumentCloud’s aims, and weaving notes into a story can build a stronger narrative. For example, the Chicago Tribune recently reported details on when aides to Chicago Mayor Rahm Emanuel became aware of facts in the police shooting of Laquan McDonald. Reporters placed notes at key points in the story to highlight portions of emails, calendar items and other documents showing when officials discussed the shooting.
Oregon Public Broadcasting also used notes to highlight phrases in potentially confusing earthquake insurance policies. By weaving notes into the story, readers could quickly view the specific contract language the story discussed.
Our work on notes and pages is leading towards improvements to our main document viewer and is part of an effort to improve overall publishing performance. Our code is open source, so if you’re a developer you can follow progress on notes, pages, and our whole platform on GitHub. As always, we welcome your thoughts at email@example.com.
We’re excited to announce that DocumentCloud’s custom WordPress plugin now features support for Page Embed, our lightweight, responsive viewer. Version 0.4.0 — which includes additional enhancements and bug fixes — is available for download now, and we recommend upgrading as soon as practical.
The plugin makes publishing documents, notes and now pages as simple as dropping a shortcode in your WordPress post. As with documents and notes, the embed wizard in the DocumentCloud workspace generates the shortcode for a page at the same time it creates our traditional embed code. Here’s an example:
For more information on embedding, check out our Help docs.
Our WordPress plugin — as with the entire DocumentCloud platform — is an open-source project, and we welcome your contributions, ideas and feedback! Visit the GitHub repository or get in touch via firstname.lastname@example.org
On Monday evening, the number of documents available in the DocumentCloud public catalog passed one million. All told, the number of public pages now exceeds 13 million.
These public documents — plus another 1.4 million private documents in our database — represent a lot of your hard work. Often, they’re the result of hours of dogged reporting, persistent requests to government agencies, the scraping of websites, and a determination to treat “no” as an unacceptable answer.
It’s a moment to celebrate. We at DocumentCloud and Investigative Reporters and Editors applaud you. We’re grateful both for your reporting that shines a light and for what, collectively, you’re building along with us. Thank you.
The WordPress plugin – which builds upon an earlier version developed by Chris Amico for NPR’s StateImpact project – adds the ability to embed notes as well as documents using shortcodes. And we’ve updated our embed wizard so it will generate the WordPress shortcodes for you like this:
The plugin is powered by our new oEmbed API. It's our next step in helping you integrate DocumentCloud embed codes into your content management system so embedding documents and notes is as simple as pasting in a URL.
Since 2011, WordPress users have been able to embed documents on their blogs thanks to a plugin created by Chris Amico for NPR’s StateImpact project.
Chris’ plugin let users embed documents with shortcodes by translating the shortcodes into HTML embed codes.
With Chris’ help, and input from Adam Schweigert of the Project Largo team, we’ve released a new version of the plugin that adds:
Note Embeds: You can now embed individual notes as easily as documents. Just pass a note URL into the shortcode with the url attribute.
Raw URL Support: You can now paste the URL for a document or note onto its own line, and the plugin will translate that into an embed.
oEmbed Support: Embed codes are now fetched from our oEmbed service rather than generated internally, so they’ll always be up to date.
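To make the first two features concrete, here is a rough Python sketch of how a CMS might spot embeddable URLs in a post body. It is not the plugin’s actual code (the plugin is written for WordPress); the shortcode name "documentcloud", the regular expressions and the sample URLs are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the real plugin's parsing rules may differ.
SHORTCODE = re.compile(r'\[documentcloud\s+url="([^"]+)"[^\]]*\]')
RAW_URL = re.compile(r'^https?://www\.documentcloud\.org/documents/\S+$')


def find_embeddable_urls(post_body):
    """Return DocumentCloud URLs found in shortcodes or alone on a line."""
    urls = []
    for line in post_body.splitlines():
        stripped = line.strip()
        match = SHORTCODE.search(stripped)
        if match:
            # Shortcode form: the URL is carried in the url attribute.
            urls.append(match.group(1))
        elif RAW_URL.match(stripped):
            # Raw form: the URL must be the only thing on its line.
            urls.append(stripped)
    return urls


# Made-up document and note URLs, used only to exercise the function.
sample_post = """Intro paragraph.
[documentcloud url="https://www.documentcloud.org/documents/1234-example.html"]
See https://www.documentcloud.org/documents/1234-example.html for more.
https://www.documentcloud.org/documents/1234-example/annotations/5678.html
"""
print(find_embeddable_urls(sample_post))
```

The URL mentioned mid-sentence is deliberately ignored, which mirrors the rule that a raw URL only becomes an embed when it sits on its own line.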
Full installation and usage instructions are available on the plugin page.
Note for users of the existing plugin: Because we’re releasing a whole new plugin, prior installations of Navis DocumentCloud won’t automatically update. You’ll have to deactivate/delete Navis DocumentCloud from your site and install the new plugin. Sorry! This should be a one-time process, and future updates will be delivered through the normal WordPress update mechanism.
We’ve added an oEmbed API to make it easier for developers to get embed codes for documents and notes. oEmbed has become a standard for easily embedding Web content, and it has long been one of our users’ top feature requests. Now, instead of having to reverse engineer the format of our embed codes, developers can just send a request to the DocumentCloud API to ask for a correctly formatted embed code.
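As a sketch of what such a request could look like from a CMS, the Python below builds an oEmbed request URL and pulls the embed code out of a response. The url, maxwidth and maxheight parameters and the "rich" response type come from the oEmbed standard; the endpoint path, the sample document URL and the sample response body are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm against the API help pages.
OEMBED_ENDPOINT = "https://www.documentcloud.org/api/oembed.json"


def build_oembed_url(resource_url, maxwidth=None, maxheight=None):
    """Build an oEmbed request for a document or note URL."""
    params = {"url": resource_url}
    if maxwidth is not None:
        params["maxwidth"] = maxwidth
    if maxheight is not None:
        params["maxheight"] = maxheight
    return f"{OEMBED_ENDPOINT}?{urlencode(params)}"


def extract_embed_html(response_body):
    """Return the embed code from an oEmbed JSON response."""
    payload = json.loads(response_body)
    if payload.get("type") != "rich" or "html" not in payload:
        raise ValueError("response does not carry embeddable HTML")
    return payload["html"]


# A made-up response body standing in for what a server might return.
sample_response = json.dumps(
    {"type": "rich", "version": "1.0", "html": "<div>embed code here</div>"}
)
print(build_oembed_url("https://www.documentcloud.org/documents/1234-example.html", maxwidth=600))
print(extract_embed_html(sample_response))
```

Because the embed code comes back from the service on each request, a CMS built this way never has to hard-code the embed format.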
If you’d like your CMS to support embedding documents as easily as DocumentCloud’s WordPress plugin, you or your developers can read more about our oEmbed service in our API help pages.
Starting today, DocumentCloud users can choose three additional languages to OCR uploaded documents: Hungarian, Norwegian and Swedish. We’ve added the three based on support requests and feedback we heard during last week’s NICAR conference in Atlanta.
The addition brings the number of languages available for OCR to 17. To see them all, click “Manage account” beneath your user name and click the “New Documents” dropdown under “Language Defaults.”
We believe that journalists around the world should have access to tools that enable better reporting, and growing our language support is critical to that. DocumentCloud’s language support falls into three independent categories (so partial support of your language may be possible):
Text Search: DocumentCloud fully supports Unicode, allowing users to search documents in a variety of character sets. For scanned documents that require OCR, DocumentCloud uses the battle-tested open-source Tesseract engine, which also powers Google Books and is maintained by Google. The open-source community has contributed language packages for many widely used languages, which allows DocumentCloud to enable them on our platform. So, if DocumentCloud does not yet support your language, please reach out and let us know about your interest!
Entity extraction: DocumentCloud supports identifying people, places, organizations and other entities through OpenCalais. As of now, OpenCalais only supports English, French and Spanish. We are evaluating other tools that would allow us to bring entity extraction to other languages.
Interface translation: Accessibility to our tools is more than being able to process non-English documents. Our users have already collaborated with us to translate DocumentCloud’s Workspace user interface into four languages: English, Spanish, Russian and Ukrainian. If you are interested in bringing DocumentCloud to your language, please email us at email@example.com
As of this week, documents uploaded to DocumentCloud will process much faster thanks to a new tool we’ve written called PDFShaver that wraps Google Chrome’s PDFium library.
How much faster? From our preliminary statistics, a lot.
Under the covers, DocumentCloud uses our Docsplit open source library to disassemble documents. Prior to PDFShaver, Docsplit relied upon GraphicsMagick and Ghostscript (GM+GS) to render PDFs and save pages as images.
GraphicsMagick and Ghostscript have served DocumentCloud well, but we’ve had trouble processing some poorly constructed documents that journalists receive from sources — governments, companies and non-profits, for example. Our search for a replacement led us to PDFium, and we found that not only did it solve a number of our issues but it also provided substantial gains in speed.
Testing PDFShaver and GraphicsMagick on 50 documents picked at random from DocumentCloud’s public collection shows that PDFShaver can render documents an order of magnitude faster (here’s our raw data). These data are a preliminary sample, but we’re excited about what they show about the kinds of speed gains we can make to our processing pipeline. We’ll continue to track PDFShaver and DocumentCloud’s performance as we make improvements, so look forward to more updates!
Rendering PDF pages with PDFShaver & PDFium
PDFShaver works by connecting PDFium to Ruby with a C/C++ extension inside a Ruby gem. PDFium itself is an open source library and the software that powers Google Chrome’s PDF viewer. And aside from taking advantage of the speed and capabilities Chrome’s tools provide, we’re happy to be able to make open source PDF processing easier to access through a programming language such as Ruby.
For example, picking the landscape-oriented pages out of a document and rendering them is as easy as these three lines:
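The selection step itself is simple in any language: a page counts as landscape when it is wider than it is tall. Here is a small Python sketch of that test. It does not use PDFShaver’s API; the Page type and the page sizes are hypothetical stand-ins for dimensions a PDF library would report.

```python
from typing import NamedTuple


class Page(NamedTuple):
    """Hypothetical stand-in for a page object from a PDF library."""
    number: int
    width: float
    height: float


def landscape_pages(pages):
    """Keep only pages that are wider than they are tall."""
    return [page for page in pages if page.width > page.height]


# Made-up sizes in points: portrait letter, landscape letter, square.
pages = [Page(1, 612, 792), Page(2, 792, 612), Page(3, 612, 612)]
print([page.number for page in landscape_pages(pages)])
```

A square page is treated as not landscape here; a different cutoff would be just as reasonable.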
If you’re coming to IRE’s annual data journalism conference March 5-8 in Atlanta, be sure to stop by and say hello to the DocumentCloud team!
We’ve made a few ways for you to learn more about the platform, tell us your ideas, and hear about what’s next for DocumentCloud:
— Saturday at 3:20 p.m., join us for a hands-on class, “Reporting and Presentation with DocumentCloud.” Get to know the suite of tools that DocumentCloud offers to help you better organize, analyze and present public documents.
— Sunday at 11:20 a.m. in the Demo Room, we’ll offer “Advanced DocumentCloud: Examples and Suggestions.” Take a deeper dive into DocumentCloud and its API and bring your ideas for features you’d like to see in the platform. Plus, see some of the best uses of DocumentCloud in the last year!
— Throughout the conference, you can find the DocumentCloud team on hand at its booth outside the conference rooms. We’ll be set up to give you a demo, answer questions about accounts and hear how you use the platform.
If you’d dropped into the DocumentCloud workspace in Columbia, Mo., at the start of January, you’d have found at least two things: a team actively avoiding the single-digit temps outside the office and a whiteboard that we frequently filled with ideas, photographed for posterity, erased and filled up again.
We ended up ignoring the temps. The ideas generated enough heat to last us until summer – and beyond!
So, what’s ahead for 2015? We’ve spent the last year researching and reflecting on what you want, both as a buildup to the recent $1.4 million Knight Foundation grant and to make sure you’re as happy as possible with the service. We want to be sure the platform is fast, reliable, and enables you to do your best work.
There’s a legacy to continue. DocumentCloud was founded by and for journalists to support in-depth reporting around public documents. Today, more than 900 news organizations worldwide use the platform. Whether it’s publishing the documents related to doubts about the guilt of a Texas man executed for murder or the grand jury testimony regarding the death of Michael Brown in Ferguson, Mo., journalists use DocumentCloud to give their readers a first-hand view of the primary source documents they gather.
We plan to build from that success. The DocumentCloud team’s expanding, and we’re locking in a roadmap to grow and improve the platform. We’ve recently hired a director of product development, and we’ve just posted a job description for a front-end developer. With a bigger team and lots of focus, we believe that by the end of the year, you’ll see substantial improvements and expanded offerings that will maintain DocumentCloud’s place as an essential reporting and presentation tool.
So, here are some things we’re going to do to help you:
— Improved processing. If you use DocumentCloud, you upload documents to the platform. You want them processed quickly, without errors, so you can get right to publishing or annotating. Over the last year, we’ve made substantial improvements to our processing cluster and sped up imports of popular documents uploaded by multiple users. And we’re glad that you’re noticing the results. We have more work planned – look for a blog post soon detailing the changes.
— Go mobile. We know (and so do you) that more of your readers view more of your content on their phones. So, we’re planning mobile-specific changes to the viewer to improve scrolling, zooming and the experience in general.
— More storytelling tools. We’re exploring ideas for expanding the display options for the viewer, such as presentation templates, social media sharing, notes displays and more. Many of you have asked for oEmbed support, and we’re looking closely at that!
— Telling DocumentCloud’s story. We’ll bring you more blog posts like this, keeping you up to date on our progress and listening for your ideas. We’ll also tell you more about how to make better use of all the site has to offer — such as deeper search options — and highlight great examples of your storytelling.
— Expanded reach and premium offerings. We want to make sure DocumentCloud is going to be available to journalists for years to come, and one way to do that is for the platform to begin generating revenue. This goal is part of the Knight grant, and so we’ll be exploring options – premium features, opening the tool to additional types of users on a fee basis, donations, and other ideas.
Beyond those, we have many more ideas, among them better feedback on document processing, ability to rotate pages, better organization of the site and workspace, and batch processing options.
That’s a lot to chew in one year, but with our team expanding – and, we hope, with continued contributions from the open source community – we’re pretty excited about the prospects.
By Tom Meagher, data editor at The Marshall Project,
Emily Yount, interaction designer at The Washington Post,
Matt DeLong, national digital projects editor at The Washington Post
and Ted Han, lead developer at IRE/DocumentCloud
On Aug. 3, The Marshall Project, a new nonprofit journalism organization focused on criminal justice issues, published an investigation in partnership with The Washington Post that revealed new evidence raising doubts about a high-profile Texas execution.
Tom: Our reporter, Maurice Possley, began working on this story months before most of the rest of our newsroom at the Marshall Project was even hired. By the time we were able to start helping, the story was mostly reported, so we dove into the documents to bring ourselves up to speed.
The case against Cameron Todd Willingham — who was executed in Texas for the murder of his three daughters — had been written about extensively over the last 22 years, but a lot of new information was uncovered, and it was all in the documents. We knew we wanted to be able to explore and highlight the correspondence that cast this case in an entirely new light. DocumentCloud was clearly the answer.
In the course of his reporting, Possley, who has covered this case for more than a decade, was given access to copies of dozens of primary source documents that tell the backstory of Cameron Todd Willingham and the informer who helped convict him. In filing its grievance with the State Bar of Texas against the former prosecutor in the case, the Innocence Project had acquired these documents and assembled them into a series of appendices. They gave us eight PDF files that added up to nearly 400 pages. We used DocumentCloud to stitch them all back together into one large file.
We then combed through the appendices and dozens of other records of court testimony and correspondence. As we saw the various typefaces and handwriting styles that made up the key passages, we knew we wanted to use DocumentCloud notes to present excerpts directly in the story.
Matt: I started working on the story in earnest a couple of weeks before it published. We were very excited about having so many primary-source documents to enrich the narrative. The Post has been using DocumentCloud for years, but we’ve long been frustrated by one of its biggest limitations: it isn’t mobile-friendly. This isn’t really DocumentCloud’s fault; these scanned documents are a set size, so when you scale them down, at some point words will become too small to read.
We had seen how the New York Times addressed the problem, by putting up the text and linking to the original document in DocumentCloud. That’s totally logical and fine if the words are all that you care about, but in this case we have official letters and handwritten notes between the characters in the story. The pages themselves are interesting, and many readers will want to see them with their own eyes.
We decided at the outset that however we ended up displaying the documents we included in the story, they had to be responsive. But this meant we’d have to come up with our own hack. Emily and I had both been thinking about this problem individually for a while, and we had some time to work on it, so we decided to try to figure out a solution that we could use in this project.
Emily: At the time of publication, DocumentCloud’s note embed code already resized and repositioned the note based on the width of the DC-note-container div, so I knew we only needed to solve for when the note is wider than the note embed and the right side of the note is cut off (see image below).
The original coordinates, width and height allow us to determine how wide the note is in relation to the document image and resize the document just enough to make the note 100% of the embed, instead of the document image 100% of the embed. This helps with readability by making the text as large as it can be. At times, depending on the width of the note and the size of the text, there will still be readability issues, so cropping the annotations carefully and testing to make sure they are readable is really important.
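The arithmetic described above can be sketched in a few lines of Python. This is an illustration of the idea rather than the code that shipped; the function name, the returned fields and the sample pixel values are assumptions.

```python
def fit_note(container_width, image_width, note_left, note_right):
    """Scale a document image so a note fits the width of its embed.

    Coordinates are in pixels of the original page image. The image is
    only ever shrunk, and only as far as needed for the note to span
    the full width of the container.
    """
    note_width = note_right - note_left
    if note_width <= 0:
        raise ValueError("note must have a positive width")
    # Shrink only when the note is wider than the embed.
    scale = min(1.0, container_width / note_width)
    return {
        "scale": scale,
        "image_width": image_width * scale,
        # Shift the image left so the note's left edge meets the container's.
        "offset_left": -note_left * scale,
        "note_width": note_width * scale,
    }


# Made-up numbers: a 320px-wide phone, a 700px page image and a note
# running from x=50 to x=690.
print(fit_note(320, 700, 50, 690))
```

With these numbers the note is 640 pixels wide, so the image is scaled by half and the note fills the 320-pixel embed exactly. Text that was already small in the scan will be half that size again, which is why careful cropping still matters.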
Here’s an example from the Willingham story of a responsive note on an iPhone 5:
Ted: We were thrilled when Ben Chartoff (OpenNews fellow at the Washington Post) reached out to put Emily in touch with us.
We believe deeply in DocumentCloud as an open source project as well as the service to which journalists post documents relevant to the public interest. Emily and Matt’s motivation to extend the behavior which DocumentCloud already provides and to share their code back is exactly the kind of effort we love to see and encourage.
Technology in the world of news is a means toward the end of better reporting. Especially in competitive industries like ours, an open source ethos around the tools we all share is an avenue for us to work together to improve the state of all reporting. Anyone who solves an issue for their own needs can help to solve that issue for everyone.
In that spirit we were excited to incorporate Emily’s code into our own. To do so, we spun our note code off into its own repository to make it easier for anyone to contribute (you can find the code on GitHub as documentcloud-notes). Then, with the Washington Post’s and Marshall Project’s stories as a basis, we began incorporating the changes. Ultimately, we ended up rewriting much of Emily’s code in the process, but what she had written served as the design criteria to anchor the code we wrote.
Our responsive notes code is already live on DocumentCloud, and journalists needn’t take any additional steps to use it. Any embedded note from DocumentCloud will now behave responsively.
Last week, DocumentCloud received a complaint seeking the removal of a collection of emails posted by journalists with the Australian Financial Review. The emails involved a company called NDS, which hired a law firm to try to have the documents pulled from public view. This kind of thing is rare, but it happens. This case in particular has a couple of wrinkles that make it unusual, and it presents a good opportunity to remind all of our members that DocumentCloud has policies and options in place that allow you to keep all documents processed through our service available to the public for as long as you desire.
I’ll detail those below. But first, a little context.
DocumentCloud was created as a 501(c)(3) nonprofit organization and remains so as part of Investigative Reporters and Editors (IRE). The service is offered free of charge and all expenses and manpower are covered through grants and IRE’s normal operations. We provide a suite of tools that allow you to analyze and publish documents, and we don’t control what you post. Nor do we have a large budget to fight legal challenges to items you post.
There have been only a handful of cases in which DocumentCloud has received legal challenges to material posted on the site, where we now host more than 4 million pages.
Typically those challenges have involved allegations of copyright violation. In every case, we have contacted the posting news organization and asked them how they would like to handle the complaint. Our terms of service detail how we handle those cases, using a process based on the Digital Millennium Copyright Act (DMCA). DocumentCloud is a neutral party hosting content on behalf of users and is protected by the DMCA’s safe-harbor provisions. If we receive a formal complaint, we contact the organization that uploaded the material. If they assert their right to publish, the documents remain public and the matter is resolved between the complainant and the posting organization.
We also offer an alternative for organizations that would prefer to host their own documents and still use DocumentCloud’s viewer. A number of news organizations have chosen this option, for a variety of reasons. We make document data and our viewer code available for download to journalists directly through our workspace. Downloading a viewer will provide a news organization with an html file that is functionally indistinguishable from the viewers we host.
The case that came up last week involving the Australian Financial Review presented some new issues. The company filing the complaint over the posted emails alleged a variety of issues, but didn’t cite the DMCA. AFR opted to take down the documents rather than provide us with a letter asserting their right to publish and offering indemnity for DocumentCloud. AFR said it did so because it believes legal action is more appropriate in Australia and it did not wish to become involved in a U.S. dispute with NDS. AFR opted to download the viewer and plans to repost the documents using the DocumentCloud software.
Dealing with such challenges is an inevitable byproduct of hosting documents. If you have questions about our policies or suggestions on how we can improve our service, please get in touch; my email is firstname.lastname@example.org.
We’ve been hard at work during our short Columbia, Missouri hackathon at DocumentCloud’s new home at the Investigative Reporters & Editors office. As a result we’ve rolled out a new feature for readers and journalists to print annotations made on documents.
Journalists have been publishing documents through DocumentCloud for a while now as well as annotating documents both for readers and for their own story writing processes. We think it’s just as important for DocumentCloud to make story writing quicker and easier as it is to help readers find primary source material.
So, when Marshall Allen of ProPublica told us that he would like to try using DocumentCloud to take his story notes, we did our best to help out. As a result, you can now select one or more documents in the workspace and choose “Print Notes” under the “Publish” menu.
This way you can annotate your sources in DocumentCloud and have a single copy of all your research ready at hand for your copy editor, or to read when your flight attendant announces that all power switches should be in the off position.
And readers can find a “Print Notes” link in the sidebar footer of the document viewer too.
We hope this will help readers and journalists alike note and collect information in the format that best suits their workflows. Happy Printing (and remember to recycle)!
Every once in a while, DocumentCloud gets hit with the kind of document stash that really slows us down. We can take a lot, but if one newsroom finally gets a 25,000-page FOIA turned over to them and another gets hold of 30,000 pages of documents for a breaking news story on the same afternoon, that’s a volume that will tax our servers.
We recently established a “fast lane” to ensure that smaller documents don’t have to get in line behind behemoths, but that doesn’t help if you’ve got a few MB of documents about a local scandal: you’ll still have to shuffle into line with the big sets.
Sets of documents are nothing new to DocumentCloud.
The Las Vegas Sun published hundreds of pages of legislation, emails, court filings and medical records alongside their award-winning package on hospital care in Las Vegas. The Sun’s Marshall Allen assembled each document collection by hand to produce that page. Plenty of other newsrooms have used our API to do likewise. Even with the API it isn’t trivial to assemble and publish a set of documents.
Some document sets are living creatures that continue to grow: Chad Skelton at The Vancouver Sun has been adding documents retrieved from the local ferry authority’s website to a growing cache of public records on DocumentCloud. The only way to ensure his readers will find new documents as they roll in is to point the public straight to DocumentCloud to find ferry authority FOIAs. It should be easier to embed that growing set of public documents right at The Vancouver Sun.
We spent some time at ONA last year, brainstorming with the good folks from the Public Insight Network, who really helped us distill this into a workable feature. We’re looking forward to seeing PIN newsrooms do some great reporting aided by this new feature.
This morning, not quite one year since we opened our beta to newsrooms at NICAR 2010, the millionth page of primary source material was uploaded to DocumentCloud. Reaching this milestone so soon is a tribute to our users and the amazing document-driven investigative reporting you have published over the past year.
Most of the thousands of documents in our catalog have arrived in small batches: five documents here, 20 there, most often accompanying a breaking story. Take a look for yourself: browse through recently published documents by searching for “filter: published” or read up on other searches you can run.
Now is a good moment to highlight some notable recent stories:
With close to 200 newsrooms contributing documents and thousands of documents in our catalog, we decided it was time to open DocumentCloud to public searches.
Wondering who is still covering the Deepwater Horizon oil spill? Try a search for “deepwater horizon” organization: transocean, and see documents that reference both the rig by name and the drilling contractor, Transocean. Then, click on the “Entities” tab to see more data provided by OpenCalais’ entity extraction.
Did you miss the Memphis Commercial Appeal’s coverage of Ernest Withers? Catch up with a search for group: commercial-appeal withers, and find every document uploaded by reporters in the Commercial Appeal newsroom that mentions Withers by name. Curious to see the annotations journalists have been making on the documents they’re sharing? Try a search for filter: annotated and you’ll skip any documents that were published without annotations.
There’s plenty more you can do with DocumentCloud’s search syntax. Check out our primer and try a few searches.
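For anyone who wants to assemble these searches programmatically, here is a minimal sketch in Python. It only builds the query string in the style of the examples above. The field names (organization, group, filter) come from this post; the `build_query` helper is hypothetical and is not part of DocumentCloud's code or API.

```python
def build_query(text="", **fields):
    """Combine free text with "field: value" terms, as in the examples above.

    Hypothetical helper for illustration only; it does not call DocumentCloud.
    """
    parts = []
    if text:
        # Quote multi-word phrases so they read as a single search phrase.
        parts.append(f'"{text}"' if " " in text else text)
    for name, value in fields.items():
        # Fielded terms follow the "name: value" style used in the post.
        parts.append(f"{name}: {value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query("deepwater horizon", organization="transocean"))
    print(build_query("withers", group="commercial-appeal"))
    print(build_query(filter="annotated"))
```

The resulting strings can be pasted into the search box. The helper puts free text before fielded terms, which differs in order from the Commercial Appeal example above.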
We’d love to know what you think, and what you’ve found.
This morning, we’re opening up the ability to embed documents to all of the newsrooms participating in DocumentCloud. When you log into your workspace, you’ll notice a new menu: “Publish”.
From here, you can grab an embed code (a short snippet of HTML) that can be dropped onto a web page to create a document viewer. You may be familiar with such snippets from embedding YouTube videos: this works in a similar fashion. For guidelines on setting up a template and other help, check out our documentation.
If you still have questions about the process, we’re listening at email@example.com.
Note: we know you’re eager to host documents yourself, and you can do that now, but we recommend that you stick with embedded documents so that you can take advantage of bug fixes and other improvements to the viewer. We don’t know yet whether we plan to offer embedding as a long-term service. Keep in mind, as well, that this is still a beta. As described in our terms, our capacity to commit to uninterrupted service is limited, as is our liability if service is interrupted in some way.
For those news organizations that want to host documents on their own servers, we’re now offering that as an alternative too. Click on “Download Document Viewer” to get a zipped-up folder with all the code, text, and images bundled together as a web page. Drop the folder into any web server (no special software required), and voilà, it’s online.
Here at DocumentCloud, we’re looking forward to seeing the great reporting you do with embedded documents — don’t forget to use the workspace to add a “Related Article” link.
Eagle-eyed followers of the DocumentCloud Twitter feed have already picked up on the fact that we began adding users to our beta last month.
We made a strategic decision to peg our beta to NICAR’s March 2010 computer-assisted reporting conference, where we knew we’d be able to gather a sizable group of just the sort of investigative reporters we hope to support with DocumentCloud, and get them excited about using our tools to do more with their documents. Nothing beats hands-on support when you’re using a new tool. Plus, we identified dozens of quick fixes we could make after watching over journalists’ shoulders as they explored DocumentCloud.
In the month since NICAR, we’ve added more than 150 users who’ve uploaded a cumulative 54,000 pages of text, and made close to 300 documents available in DocumentCloud. Our repository is already home to police reports from New Orleans, a confirmation hearing transcript that adds context to coverage of Justice Stevens’ resignation, and disaster preparedness plans from Haiti. There’s even a collection of emails that document how some hedge funds not only saw the mortgage crash coming, but wagered on the collapse and won big. (The hedge fund that these reporters investigated argues it never had the hands-on role ascribed to it; that’s in DocumentCloud, too.) Eventually, anyone will be able to connect with those documents right through our website.
Want to be part of the beta? Get in touch and tell us a bit about the documents you’re working with.
We’re still adding beta testers and actively listening to the users we’ve got as we prioritize and refine our to-do lists, but we think we’re off to a great start.