DocumentCloud

Archive for September, 2009

Two Dozen Media Outlets and Others Join Us as Beta Testers

without comments

We have some more news: About two dozen news and other organizations have signed on as beta-testers. They’ll be contributing documents to DocumentCloud, and giving us feedback as we work out the kinks. It’s a wide-ranging list:

  • ACLU National Security Project
  • Arizona Republic
  • The Atlantic
  • Center for Democracy and Technology / OpenCRS
  • Centre for Investigative Journalism, City University London
  • Center for Investigative Reporting / California Watch
  • Center for Public Integrity
  • Chicago Tribune
  • Dallas Morning News
  • The Investigative Reporting Workshop at American University
  • The New Yorker
  • NewsHour
  • MinnPost
  • MSNBC
  • Mother Jones
  • Public.Resource.Org
  • St. Petersburg Times
  • Sunlight Foundation
  • Voice of San Diego
  • Washington Post
  • WNYC

These organizations will be joining our original set of contributors — The New York Times, ProPublica, Talking Points Memo, The National Security Archive, and Gotham Gazette — all of whom will of course be working with us during the testing too.

Earlier this morning we also announced that we’re working with Thomson Reuters’ OpenCalais service to extract and make available information from the documents contributed to DocumentCloud.

E-mail us if you’d like to participate in the testing. We’re interested in any organization, including non-profits and academic institutions, that has obtained documents during its research.

If you’re new here, the goal of DocumentCloud is to super-charge investigations by making documents, and the information in them, easier to find and share. Readers will be able to search documents on DocumentCloud and then will be pointed to the documents themselves on contributing organizations’ Web sites. (Here’s a FAQ with more details.)

Finally, you can keep following our progress on this blog, on Twitter, or via RSS. And we’re releasing our code each step of the way.

Written by Scott

September 24th, 2009 at 8:00 am

Posted in Code

Thomson Reuters and OpenCalais

with one comment

This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.

If you’ve seen us do a presentation about DocumentCloud, you already know that OpenCalais is going to be a key part of what makes DocumentCloud great.

Written by Scott

September 24th, 2009 at 7:30 am

Posted in Code

CloudCrowd — Parallel Processing for the Rest of Us

without comments

As we began to prototype DocumentCloud, it quickly became apparent that we would need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these tasks are computationally expensive, keeping a laptop hot and busy for minutes at a time, especially when the documents run into the hundreds or thousands of pages.

Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.

[Screenshot: CloudCrowd Operations Center]

CloudCrowd is intended for a moderate volume of highly expensive tasks — things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This sort of work is the kind of thing that always needs to be extracted from a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should help provide a convenient and scalable way to offload the work.

We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up, the more machines you add to the cluster, the faster you’ll be able to process them. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
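The split-and-parallelize idea described above can be sketched in a few lines. This is only an illustration in Python, not CloudCrowd's actual Ruby API: the function names `split`, `process`, `merge`, and `run_job` are assumptions made for this example, a document is modeled as a plain list of page strings, and a local thread pool stands in for worker daemons spread across machines.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 10  # pages per work unit, mirroring the 10-page chunks mentioned above


def split(document):
    """Split one input (a list of pages) into work units of up to CHUNK_SIZE pages."""
    return [document[i:i + CHUNK_SIZE] for i in range(0, len(document), CHUNK_SIZE)]


def process(chunk):
    """Hypothetical stand-in for expensive work: count the words on each page."""
    return [len(page.split()) for page in chunk]


def merge(results):
    """Recombine per-chunk results, in their original order, into one flat list."""
    return [count for chunk_result in results for count in chunk_result]


def run_job(inputs, workers=4):
    """Submit every work unit of every input to the pool, then merge per input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # All chunks of all inputs are queued up front so they can run in parallel.
        pending = [[pool.submit(process, chunk) for chunk in split(doc)]
                   for doc in inputs]
        # Collect results input by input, preserving chunk order within each.
        return [merge(future.result() for future in futures)
                for futures in pending]
```

Calling `run_job([pages_of_first_pdf, pages_of_second_pdf])` returns one merged result per input. Raising `workers` plays the role of spinning up more workers: no single work unit gets faster, but more of them run at once.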

This is DocumentCloud’s first release of open-source code, and hopefully the first of many to follow. If you have batch-processing needs similar to ours, I encourage you to give CloudCrowd a try. The source is available on GitHub, along with a wiki and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!

Written by Jeremy Ashkenas

September 14th, 2009 at 9:29 am

Posted in Code

Our First Hire

without comments

We’re excited to announce that Jeremy Ashkenas has joined the team as the lead developer for DocumentCloud. His previous job was at Zenbe Inc., a provider of online email and collaboration software. He’s the creator of the Ruby-Processing visualization toolkit, and a two-time winner of the Sunlight Foundation’s Apps for America competition. Jeremy graduated from Brown University with a degree in Literary Systems.

Over the past few weeks, he’s been working on the central processing system for a DocumentCloud prototype. We are planning to open source this tool shortly … so stay tuned.

Written by Scott

September 14th, 2009 at 12:29 am

Posted in People