Announcing Docsplit: Break Documents into Images, Pages, and Plain Text

Dec 7th, 2009


Jeremy Ashkenas

We’ve been spending a lot of time in the DocumentCloud Lab researching the best way to break documents apart into their component parts, so that they are easier to index for search and to display on the web. The latest open-source piece of DocumentCloud, Docsplit, is a tool that helps you extract images, thumbnails, plain text, and individual pages from any kind of document. It wraps the PDFBox, GraphicsMagick, and JODConverter libraries, and provides both a command-line utility and a Ruby API for breaking apart documents.
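To give a rough feel for how the command-line utility might be scripted from Ruby, here is a minimal sketch. It only builds the argument list and does not run anything. The helper `docsplit_command` is hypothetical, and the subcommand names (`images`, `pages`, `text`) and the `--output` flag are assumptions drawn from the description above, so check them against the Docsplit documentation before relying on them.

```ruby
# Hypothetical helper: builds (but does not run) a docsplit command line.
# ASSUMPTION: the subcommand names and the --output flag mirror the kinds of
# extraction described in this post; they are not confirmed by it.
SUBCOMMANDS = %w[images pages text].freeze

def docsplit_command(kind, paths, output: nil)
  kind = kind.to_s
  raise ArgumentError, "unknown subcommand: #{kind}" unless SUBCOMMANDS.include?(kind)

  documents = Array(paths)
  raise ArgumentError, "no documents given" if documents.empty?

  command = ["docsplit", kind]
  command += ["--output", output] if output
  command + documents
end

# To actually run it (requires docsplit to be installed and on your PATH):
#   system(*docsplit_command(:text, "report.pdf", output: "out"))
```

Passing the result to `system` as separate arguments, rather than as one interpolated string, avoids shell quoting problems with document names that contain spaces.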

Docsplit is our fourth open-source project, but is perhaps the most immediately useful in the newsroom. We’ve been talking to the Guardian and the New York Times about techniques for pulling images and text out of documents, and Docsplit synthesizes some of the best practices into a single package with a simple interface. We’re hoping it comes in handy the next time you need to analyze a pile of documents.

