Latest Updates: Our Blog

Shearing PDFs with PDFShaver at DocumentCloud

Mar 7th, 2015


Ted Han

As of this week, documents uploaded to DocumentCloud will process much faster thanks to a new tool we’ve written called PDFShaver that wraps Google Chrome’s PDFium library.

How much faster? From our preliminary statistics, a lot.

Under the covers, DocumentCloud uses our Docsplit open source library to disassemble documents. Prior to PDFShaver, Docsplit relied upon Graphicsmagick and Ghostscript (GM+GS) to render PDFs and save pages as images.

GraphicsMagick and Ghostscript have served DocumentCloud well, but we’ve had trouble processing some poorly constructed documents that journalists receive from sources — governments, companies and non-profits, for example. Our search for a replacement led us to PDFium, and we found that not only did it solve a number of our issues but it also provided substantial gains in speed.

Testing PDFShaver and Graphicsmagick on 50 documents picked at random from DocumentCloud’s public collection shows that PDFShaver can render documents an order of magnitude faster (here’s our raw data). These data are a preliminary sample, but we’re excited about what it shows about the kinds of speed gains we can make to our processing pipeline. We’ll continue to track PDFShaver and DocumentCloud’s performance as we make improvements, so look forward to more updates!

Rendering PDF pages with PDFShaver & PDFium

PDFShaver works by connecting PDFium to Ruby with a C/C++ extension inside a Ruby gem.  PDFium itself is an open source library and the software that powers Google Chrome’s PDF viewer. And aside from taking advantage of the speed and capabilities Chrome’s tools provide, we’re happy to be able to make open source PDF processing easier to access through a programming language such as Ruby.

For example, picking the landscape-oriented pages out of a document and rendering them is as easy as these three lines:

document ="./path/to/document.pdf")
landscape_pages ={ |page| page.aspect > 1 }
landscape_pages.each{ |page| page.render("page_#{page.number}.gif") }

We plan to keep improving PDFShaver and make more of PDFium’s features accessible to give Rubyists, data scientists and journalists a boost for overcoming the impediments that PDFs present.

If you’re interested in installing and using PDFShaver, you can read how on our Github repository. And if you’d like to help journalists and others free information from PDFs, your contributions are welcome!


Leave a Reply