Starting today, DocumentCloud users can choose three additional languages to OCR uploaded documents: Hungarian, Norwegian and Swedish. We’ve added the three based on support requests and feedback we heard during last week’s NICAR conference in Atlanta.
The addition brings the number of languages available for OCR to 17. To see them all, click “Manage account” beneath your user name and click the “New Documents” dropdown under “Language Defaults.”
We believe that journalists around the world should have access to tools that enable better reporting, and growing our language support is critical to that. DocumentCloud’s language support falls into three independent categories (so partial support of your language may be possible):
Text Search: DocumentCloud fully supports Unicode, allowing users to search documents in a variety of character sets. For scanned documents that require OCR, DocumentCloud uses the battle-tested open-source Tesseract engine, which also powers Google Books and is maintained by Google. The open-source community has contributed language packages for many widely used languages, which allows DocumentCloud to enable them on our platform. So, if DocumentCloud does not yet support your language, please reach out and let us know about your interest!
Entity extraction: DocumentCloud supports identifying people, places, organizations and other entities through OpenCalais. As of now, OpenCalais only supports English, French and Spanish. We are evaluating other tools that would allow us to bring entity extraction to other languages.
Interface translation: Accessibility to our tools is more than being able to process non-english documents. Our users have already collaborated with us to translate DocumentCloud’s Workspace user interface into four languages: English, Spanish, Russian and Ukrainian. If you are interested in bringing DocumentCloud to your language, please email us at email@example.com