Latest Updates: Our Blog

December 2009

Big Apps Are Here

Posted
Dec 18th, 2009

Tags
IdeaLab

Author
Amanda Hickman

Cross-posted from PBS IdeaLab.

I’ve already voiced my own suspicion that New York City’s Big Apps competition is a deft end run around an actual open data bill in New York City. Nonetheless, some 85 applications built on the city’s currently public data sets are available now to explore and vote on through early January. They include a handful of legislator lookup tools and an unexpected number of park spot finders. There’s also a graffiti finder designed for the curious dual purpose of steering both Wildstyle fans and the city’s Anti-Graffiti Unit paint trucks straight to new throw-ups.

Other gems that I’ve been watching include ProPublica’s great and very offline crowdsourcing efforts as part of their coverage of police shootings in New Orleans in the days following Hurricane Katrina.

Discuss Big Apps Are Here on PBS’s IdeaLab.

DocumentCloud Releases More Code, Continues to Attract Developer Interest

Posted
Dec 10th, 2009

Tags
IdeaLab

Author
Amanda Hickman

Cross-posted from PBS IdeaLab.

A public beta of DocumentCloud, one that journalists can kick the tires on and upload documents to, won’t be ready for a few more months, but work is continuing apace in our corner of the cloud.

We’ve released a handful of projects that make up some of the components of our big picture, and it is great to see how well received our work has been by the Ruby and JavaScript communities. Last week we hit a little milestone: more than 1,000 developers are watching DocumentCloud projects on GitHub, which is pretty cool. The advantage for us is that many of these developers are actually trying out our software releases and helping us make them stronger.

Gregg Pollack included a great review of CloudCrowd in a recent episode of his show, Scaling Rails. CloudCrowd will still be Greek to the truly non-technical readers out there, but if you have enough of a handle on software development to wish you understood “scaling” better, his review just might help.

Our latest release, Docsplit, is a command-line utility and Ruby library for splitting documents into distinct components such as raw text (which you need for searches), page thumbnails, and document metadata (details like the document’s author or the number of pages it contains).

Splitting documents apart is a key function for DocumentCloud: everything else DocumentCloud does depends on the presence of one or another of these pieces. Docsplit got a lot of attention when we released it on Monday, and we’re all looking forward to seeing what other folks do with it.

Discuss DocumentCloud Releases More Code, Continues to Attract Developer Interest on PBS’s IdeaLab.

Announcing Docsplit: Break Documents into Images, Pages, and Plain Text

Posted
Dec 7th, 2009

Tags
Code

Author
Jeremy Ashkenas

We’ve been spending a lot of time in the DocumentCloud Lab researching the best way to break apart documents into their component parts, to make it easier to index them for searching and to display them on the web. The latest open-source piece of DocumentCloud is a tool to help you extract images, thumbnails, plain text, and individual pages from any kind of document. It wraps up the PDFBox, GraphicsMagick, and JODConverter libraries, providing you with a command-line utility and a Ruby API for breaking apart documents.
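To give a rough feel for the Ruby API, here is a minimal sketch that pulls text, page images, and single-page files out of one document. Treat it as an illustration under assumptions rather than the definitive usage: split_document is a hypothetical helper of our own invention, the option values (the 700x size, the jpg format, the output folder names) are examples only, and the exact options may vary between Docsplit versions. It also assumes the docsplit gem and its external dependencies are installed.

```ruby
# Sketch only: running it for real needs the docsplit gem and the tools it wraps.
begin
  require 'docsplit'
rescue LoadError
  # The gem is optional here so the sketch can still be loaded and read without it.
end

# Hypothetical helper (not part of Docsplit itself): runs three kinds of
# extraction on one document and returns the names of the steps it ran.
# The splitter argument exists only so a stand-in object can be passed in.
def split_document(path, output_dir, splitter = nil)
  splitter ||= Docsplit # raises NameError if the gem could not be loaded

  # Plain text, the piece you need for search indexing.
  splitter.extract_text(path, :output => File.join(output_dir, 'text'))

  # One image per page; size and format here are example values.
  splitter.extract_images(path, :size => '700x', :format => :jpg,
                                :output => File.join(output_dir, 'images'))

  # One single-page file per page of the original document.
  splitter.extract_pages(path, :output => File.join(output_dir, 'pages'))

  [:text, :images, :pages]
end

# Example invocation: ruby split_sketch.rb memo.pdf output_folder
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  split_document(ARGV[0], ARGV[1])
end
```

Keeping each kind of output in its own folder is just one way to organize the results; nothing in Docsplit requires it.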

Docsplit is our fourth open-source project, but is perhaps the most immediately useful in the newsroom. We’ve been talking to the Guardian and the New York Times about techniques for pulling images and text out of documents, and Docsplit synthesizes some of the best practices into a single package with a simple interface. We’re hoping it comes in handy the next time you need to analyze a pile of documents.