When you upload a document to DocumentCloud and the file does not contain text, we attempt to perform OCR (optical character recognition) on it using the open source Tesseract project. Tesseract is a venerable piece of software, originally developed at Hewlett-Packard between 1985 and 1995. HP released it as open source in 2005, and Google has been sponsoring work on it since 2006. A few months ago, Tesseract 3.0 was released, and this morning we’ve deployed the new version as part of DocumentCloud.
At the same time, we’ve added a layer of OCR post-processing. One unfortunate characteristic of OCR software is its tendency to produce “garbage” letters and words when it encounters stray markings or non-textual images on a page. Fortunately, these garbage words are usually easy to identify and remove. We implemented a number of algorithms described in two academic papers, “Automatic Removal of ‘Garbage Strings’ in OCR Text: An Implementation” and “Improving Search and Retrieval Performance through Shortening Documents, Detecting Garbage, and Throwing out Jargon”, with some tweaks and modifications. The OCR cleanup code is available as part of the latest release of our Docsplit project. If you know Ruby, the rules for garbage detection are worth a read.
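To give a flavor of what rule-based garbage detection looks like, here is a minimal sketch in Ruby. It is not Docsplit's actual code: the module name `GarbageSketch`, the thresholds, and the particular rules are assumptions chosen for illustration, loosely in the spirit of the heuristics such papers describe (overlong strings, runs of a repeated character, too much punctuation, long words with no vowels).

```ruby
# Hypothetical sketch of garbage-word heuristics for OCR output.
# The rules and thresholds are illustrative assumptions, not Docsplit's.
module GarbageSketch
  MAX_LENGTH = 30            # assumed cutoff: real words are rarely longer
  REPEAT     = /(.)\1{3,}/   # the same character four or more times in a row

  # Returns true if the word looks like OCR noise rather than text.
  def self.garbage?(word)
    return true if word.length > MAX_LENGTH
    return true if word.match?(REPEAT)

    # More punctuation and symbols than letters and digits.
    alnum = word.count("a-zA-Z0-9")
    return true if alnum < word.length - alnum

    # A purely alphabetic word longer than three letters with no vowels.
    letters = word.count("a-zA-Z")
    vowels  = word.count("aeiouyAEIOUY")
    return true if letters == word.length && letters > 3 && vowels.zero?

    false
  end

  # Drops garbage words and rejoins the rest with single spaces.
  def self.clean(text)
    text.split(/\s+/).reject { |word| garbage?(word) }.join(" ")
  end
end
```

With these rules, `GarbageSketch.clean("The ~~:;' report iiiiii arrived")` returns `"The report arrived"`. Rules like these are deliberately conservative: it is better to leave an occasional garbage string in place than to delete a real word from the text.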
The upshot is noticeably better results for your OCR’d documents. Let’s take this line of muddy text as an example:
Before today, DocumentCloud’s OCR would have given you fairly garbled results for this:
But with Tesseract 3.0 and the OCR cleanup, we get this:
If you want to take advantage of this improved OCR for a document you already uploaded, we’ve added a menu item to make it possible. Select the documents you wish to reprocess, click on the “Edit” menu, and choose “Reprocess Text”.