Announcing Docsplit: Break Documents into Images, Pages, and Plain Text
We’ve been spending a lot of time in the DocumentCloud Lab researching the best way to break apart documents into their component parts, to make it easier to index them for searching and to display them on the web. The latest open-source piece of DocumentCloud is a tool to help you extract images, thumbnails, plain text, and individual pages from any kind of document. It wraps up the PDFBox, GraphicsMagick, and JODConverter libraries, providing you with a command-line utility and a Ruby API for breaking apart documents.
Docsplit is our fourth open-source project, but is perhaps the most immediately useful in the newsroom. We’ve been talking to the Guardian and the New York Times about techniques for pulling images and text out of documents, and Docsplit synthesizes some of the best practices into a single package with a simple interface. We’re hoping it comes in handy the next time you need to analyze a pile of documents.
Seeking Consultants (updated)
update: we have what we need for now, thanks.
Have you been watching DocumentCloud roll out code releases and wishing you could be part of it all? You can! We’re looking for a couple of consultants to help us build out DocumentCloud: we need a JavaScript consultant to work with us on an ongoing basis over the next few months and a Postgres expert to do some intense consulting with us.
We’re building a research tool for reporters, a semantic search engine, an index of primary source documents with our grant from the Knight Foundation. DocumentCloud will be free and open source software.
We need a JavaScript developer to help build out a rich, web-based tool that journalists will use to search and organize documents, as well as visualize the relationships between documents. A strong foundation in HTML and CSS is required, bonus points for comfort in Ruby. If you think that doing full JavaScript MVC in the browser doesn’t sound like a crazy idea, then we want to hear from you.
We also need an expert-level PostgreSQL consultant to sit down with us and review and refine our architecture plans. We’re looking for someone with plenty of experience working with sharded Postgres installations, someone skilled at tuning Postgres for full-text searches over very large datasets (potentially approaching hundreds of thousands of documents) and well versed in best practices for deploying Postgres on EC2.
If either of these sounds like you, send your resume, a rate quote and a short description of particularly relevant work to: jobs@documentcloud.org with “JavaScript Developer” or “Postgres Consultant” in the subject line.
Hint: the subject line matters more than you’d think. Our “jobs” inbox has a procmail filter and three folders: JavaScript, Postgres and Trash.
Announcing Jammit: DocumentCloud’s Asset Packager
The DocumentCloud prototype includes a “Journalist Workspace,” a tool for searching, organizing, and visualizing the relationships among documents. We’re building the workspace as a modern web application, which means that there are a lot of static assets behind the scenes (JavaScript, templates, CSS, and images). The problem arises: how do you keep all of these assets organized while still delivering them as efficiently as possible to a web browser?
Our answer, Jammit, is a Rails gem that takes care of merging and compressing all of a website’s static assets. It runs JavaScript and CSS through the excellent YUI Compressor, zips them up for speedy downloads, and can embed small images right into the stylesheets. Using it in the DocumentCloud prototype has cut the time that it takes to load the workspace in half.
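To make the image-embedding idea concrete, here is a rough sketch in plain Ruby of one common way to inline a small image into a stylesheet: rewriting its `url(...)` reference as a base64 `data:` URI. This is an illustration only, not Jammit's code. The `data_uri` and `embed_images` helpers, the `MAX_EMBED_BYTES` cutoff, and the in-memory `images` hash are all hypothetical names made up for the example.

```ruby
# Hypothetical size cutoff, chosen for illustration only.
MAX_EMBED_BYTES = 32 * 1024

MIME_TYPES = {
  ".png" => "image/png",
  ".gif" => "image/gif",
  ".jpg" => "image/jpeg"
}.freeze

# Build a data: URI for a small image, or return nil when the file type
# is unknown or the image is too large to be worth inlining.
def data_uri(path, bytes)
  mime = MIME_TYPES[File.extname(path).downcase]
  return nil if mime.nil? || bytes.bytesize > MAX_EMBED_BYTES
  # pack("m0") is base64 encoding without line breaks.
  "data:#{mime};base64,#{[bytes].pack('m0')}"
end

# Replace url(...) references in a stylesheet with data: URIs.
# `images` maps a path as written in the CSS to that file's raw bytes;
# references with no entry, or with oversized images, are left untouched.
def embed_images(css, images)
  css.gsub(/url\((['"]?)([^'")]+)\1\)/) do
    original = Regexp.last_match(0)
    path     = Regexp.last_match(2)
    bytes    = images[path]
    uri      = bytes && data_uri(path, bytes)
    uri ? "url(#{uri})" : original
  end
end

css    = ".logo { background: url(images/logo.png); }"
images = { "images/logo.png" => "abc".b } # stand-in bytes, not a real PNG
puts embed_images(css, images)
```

The point of the trade is fewer HTTP requests: every inlined image is one less round trip, at the cost of a somewhat larger stylesheet, which is why a size cutoff of some kind makes sense.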
The project page contains a complete overview of Jammit, including installation instructions, documentation, and examples. We hope you can use it to help speed up your Rails applications.
Our Second Hire
Here at DocumentCloud we’ve finally hired ourselves a Program Director to keep Jeremy, our lead developer, company. Someone to manage our impressive and growing list of document partners and help them get the most out of DocumentCloud. Someone to develop some training materials and help our beta testers get started beta testing. For her first challenge, we asked her to write a blog post in the third person.
Amanda Hickman joins us from Gotham Gazette where, as the Director of Technology, she managed development of a series of games about public policy issues, built a pretty cool database of candidates for local office and shared an ONA award for General Excellence with her colleagues there. Prior to joining Gotham Gazette, she worked as a Circuit Rider, providing technology assistance and training to low-income grassroots groups in the U.S. working on anti-poverty issues and as a consultant to foundations looking for ways to support their grantees’ use of technology in organizing work. She taught an undergraduate course at NYU’s Gallatin School on using the Internet as an organizing tool. An active local organizer, she’s got her hands in a few community composting and gardening projects, too. If you ever tire of hearing about semantic analysis of primary source documents, try asking her about the dwarf crab apple trees at Greene Acres or what she does with 1300 lbs of compost every week.
She’ll be back here answering all your questions just as soon as she can manage.
Underscore.js: Our Second Open-Source Release
We released the first open-source component of DocumentCloud a little over a month ago. Since then CloudCrowd has picked up a lot of steam, with hundreds of developers watching it on GitHub, and many patches and features being contributed by the community. Among other uses, it’s running gene sequence analysis on strains of influenza virus — something we certainly never expected to see. Since anything worth doing is worth doing twice, this morning I’m pleased to announce the release of the second open-source component of DocumentCloud: Underscore.js.
Underscore is a JavaScript library that provides a lot of the functional programming support that users of Prototype.js or Ruby expect, but does so by introducing a single object, the underscore: “_”. It’s a partial adaptation of many of the utility methods from the Prototype.js project, in order to use them without touching the prototypes of any of the core JavaScript objects. This is important because it means you can use Underscore right alongside jQuery without having to worry about conflicting variables, redundant functionality, or differences in expected coding style. For JavaScript 1.6 compliant browsers, it delegates to the native implementations of the functional methods, so that you can enjoy them at full speed where available.
This release has a much smaller scope than the previous one, but we think that it’s a helpful bit of code for any team that takes JavaScript seriously, especially in conjunction with jQuery. The production version of the library weighs in at only 4kb when gzipped, a relatively fat-free download that you can add to your page without worrying too much about load time. We’re using it to develop our “journalist workspace”, the area in which researchers can search and organize documents, and visualize the relationships between them. We hope you find it useful.
Two Dozen Media Outlets and Others Join Us as Beta Testers
We have some more news: About two dozen news and other organizations have signed on as beta-testers. They’ll be contributing documents to DocumentCloud, and giving us feedback as we work out the kinks. It’s a wide-ranging list:
- ACLU National Security Project
- Arizona Republic
- The Atlantic
- Center for Democracy and Technology / OpenCRS
- Centre for Investigative Journalism, City University London
- Center for Investigative Reporting / California Watch
- Center for Public Integrity
- Chicago Tribune
- Dallas Morning News
- The Investigative Reporting Workshop at American University
- The New Yorker
- NewsHour
- MinnPost
- MSNBC
- Mother Jones
- Public.Resource.Org
- St. Petersburg Times
- Sunlight Foundation
- Voice of San Diego
- Washington Post
- WNYC
These organizations will be joining our original set of contributors — The New York Times, ProPublica, Talking Points Memo, The National Security Archive, and Gotham Gazette — all of whom will of course be working with us during the testing too.
Earlier this morning we also announced that we’re working with Thomson Reuters’ OpenCalais service to extract and make available information from the documents contributed to DocumentCloud.
E-mail us if you’d like to participate in the testing. We’re interested in any organization, including non-profits and academic institutions, that has obtained documents during its research.
If you’re new here, the goal of DocumentCloud is to super-charge investigations by making documents, and the information in them, easier to find and share. Readers will be able to search documents on DocumentCloud and then will be pointed to the documents themselves on contributing organizations’ Web sites. (Here’s a FAQ with more details.)
Finally, you can keep following our progress on this blog, on Twitter, or via RSS. And we’re releasing our code each step of the way.
Thomson Reuters and OpenCalais
This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.
If you’ve seen us do a presentation about DocumentCloud, you already know that OpenCalais is going to be a key part of what makes DocumentCloud great.
CloudCrowd — Parallel Processing for the Rest of Us
As we began to prototype DocumentCloud, it quickly became apparent that we’re going to need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.
Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.

CloudCrowd is intended for a moderate volume of highly expensive tasks — things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This sort of work is the kind of thing that always needs to be extracted from a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should help provide a convenient and scalable way to offload the work.
We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up, the more machines you add to the cluster, the faster you’ll be able to process them. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
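The split-and-parallelize model described above can be sketched in a few lines of plain Ruby. Treat this as a rough illustration under stated assumptions, not as CloudCrowd's actual API: the `WordCountAction` class, the `merge` step, the `run_action` helper, and the word-counting task are all hypothetical, and Ruby threads stand in for worker daemons that would really be running on separate machines.

```ruby
# Hypothetical stand-in for an "action". Not CloudCrowd's real base class.
class WordCountAction
  CHUNK_SIZE = 10 # pages per work unit, mirroring the 10-page chunks above

  # Split one input (a document modeled as an array of page strings)
  # into work units of at most CHUNK_SIZE pages each.
  def split(pages)
    pages.each_slice(CHUNK_SIZE).to_a
  end

  # Process a single work unit: count the words on its pages.
  def process(chunk)
    chunk.sum { |page| page.split.size }
  end

  # Combine the per-unit results back into one result for the input.
  def merge(results)
    results.sum
  end
end

# Run every input concurrently, and every work unit within an input
# concurrently as well. Threads play the role of the workers here.
def run_action(action, inputs)
  inputs.map do |input|
    Thread.new do
      units   = action.split(input)
      results = units.map { |unit| Thread.new { action.process(unit) } }
                     .map(&:value)
      action.merge(results)
    end
  end.map(&:value)
end

doc_a = Array.new(25) { "one two three" } # 25 pages, split into 3 work units
doc_b = Array.new(4)  { "four five" }     # 4 pages, a single work unit
totals = run_action(WordCountAction.new, [doc_a, doc_b])
puts totals.inspect # [75, 8]
```

The shape matters more than the task: because each work unit is independent, adding workers shortens the wall-clock time for a large document without changing the result.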
This is DocumentCloud’s first release of open-source code, hopefully the first of many to follow. If you have similar batch-processing needs to ours, I encourage you to give CloudCrowd a try. The source is available on GitHub, there’s a wiki, and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!
Our First Hire
We’re excited to announce that Jeremy Ashkenas has joined the team as the lead developer for DocumentCloud. His previous job was at Zenbe Inc., a provider of online email and collaboration software. He’s the creator of the Ruby-Processing visualization toolkit, and a winner — twice — of the Sunlight Foundation’s Apps for America competition. Jeremy graduated from Brown University with a degree in Literary Systems.
Over the past few weeks, he’s been working on the central processing system for a DocumentCloud prototype. We are planning to open source this tool shortly … so stay tuned.
