As we began to prototype DocumentCloud, it quickly became apparent that we were going to need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these tasks are computationally expensive and can keep a laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.
Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.
CloudCrowd is intended for a moderate volume of highly expensive tasks: things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This kind of work always needs to be moved out of a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should provide a convenient and scalable way to offload it.
We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up and the more machines you add to the cluster, the faster you’ll be able to process them. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
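To make the split-and-parallelize idea concrete, here is a minimal, self-contained Ruby sketch. It does not use the CloudCrowd gem and is not its API: the `PageCountAction` class, the `merge` step, and the `run_job` helper are hypothetical names invented for illustration, and local threads stand in for worker daemons. Each input is just a page count, split into 10-page work units that are processed concurrently and then recombined per input.

```ruby
# Hypothetical sketch of the split/process pattern described above.
# This is NOT CloudCrowd's API; all names here are invented for illustration.
class PageCountAction
  CHUNK_SIZE = 10

  # Split one input (a document with `page_count` pages) into
  # page ranges of at most CHUNK_SIZE pages each.
  def split(page_count)
    (1..page_count).each_slice(CHUNK_SIZE).map { |pages| pages.first..pages.last }
  end

  # Stand-in for the expensive per-chunk work (text extraction, etc.).
  # Here it simply reports how many pages the chunk covered.
  def process(page_range)
    page_range.size
  end

  # Combine the per-chunk results for a single input.
  def merge(results)
    results.sum
  end
end

# Runs every work unit of every input on a small pool of threads,
# which play the role of workers in this sketch.
def run_job(action, inputs, worker_count: 4)
  units = Queue.new
  inputs.each_with_index do |input, input_index|
    action.split(input).each { |unit| units << [input_index, unit] }
  end
  units.close # once closed and drained, pop returns nil and workers stop

  results = Array.new(inputs.size) { [] }
  lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (item = units.pop)
        input_index, unit = item
        output = action.process(unit)
        lock.synchronize { results[input_index] << output }
      end
    end
  end
  workers.each(&:join)

  # One merged result per original input, in input order.
  results.map { |outputs| action.merge(outputs) }
end
```

With this sketch, `PageCountAction.new.split(25)` yields the ranges `1..10`, `11..20`, and `21..25`, and `run_job(PageCountAction.new, [25, 7])` returns `[25, 7]`. In a real deployment the work units would be handed to separate worker processes or machines rather than threads in a single process.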
This is DocumentCloud’s first release of open-source code, hopefully the first of many to follow. If your batch-processing needs are similar to ours, we encourage you to give CloudCrowd a try. The source is available on GitHub, along with a wiki and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!