Improving DocumentCloud’s speed and reliability has been a major focus of our team’s development time in the last year, with as much attention spent on back-end processing as on more-visible changes such as re-styled notes, a mobile-ready page embed, and enhancements to our WordPress plugin.
We thought you’d like to know about some of that work — from caching to a beefier database server to compressing the files we serve — and how it plays out every day in faster load times, a more stable platform, and improved performance during high-traffic situations.
Here’s a rundown:
Content Delivery Network: Last fall, we began serving PDFs, text, and images from Amazon’s CloudFront content delivery network instead of directly from Amazon’s S3 storage. You might have noticed the change when we switched URL subdomains from
assets.*, as in https://assets.documentcloud.org/documents/282753/lefler-thesis.pdf.
One benefit of using CloudFront is that we now serve files from multiple places around the globe. That means our users and readers in Europe, Asia and South America now see faster load times. We’re very happy to support better performance this way as our worldwide user base grows.
Compressed assets: DocumentCloud assets are now served compressed, a step developers had requested to improve performance. That means web browsers don’t have to download as many bytes to load our viewer and the code around it. This speeds up viewing and saves us some money on data costs — double win.
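To give a feel for the savings, here is a minimal sketch of what compression does to a repetitive text asset. It assumes gzip, which this post does not actually name, and the `asset` bytes are a made-up stand-in for viewer JavaScript rather than real DocumentCloud code.

```python
import gzip

# Hypothetical stand-in for a JavaScript asset; real viewer code is far larger.
asset = b"function renderPage(page) { return page.text; }\n" * 200

# The server sends the compressed bytes to browsers that accept gzip.
compressed = gzip.compress(asset)

# The browser decompresses transparently and gets the original bytes back.
restored = gzip.decompress(compressed)

print(len(asset), "bytes uncompressed")
print(len(compressed), "bytes compressed")
```

Text assets such as JavaScript, CSS, and JSON tend to compress well because they repeat themselves; already-compressed formats such as JPEG gain much less.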
Caching: We’ve improved caching on our main application server, which frees it up to do other work. We did this by using its NGINX web server as a reverse proxy cache, which keeps track of requests for documents and other resources and saves the responses. If we see multiple requests for the same document or resource, we serve the response from the cache rather than making the server generate the response all over again.
Previously, the server relied on Ruby on Rails’ built-in page caching. While that worked well in most instances, it wasn’t able to handle resources over a certain length, typically those containing a large number of options and parameters (such as certain API calls). Using NGINX as a reverse proxy cache removes that limit and considerably improves caching coverage and performance. For example, our new caching setup came in very handy when publicity around the Panama Papers pushed traffic to documentcloud.org to 25 times its usual level.
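The idea behind a reverse proxy cache can be sketched in a few lines. This is a toy model, not our NGINX configuration: the `ResponseCache` class, the `render` function, and the example URL are all hypothetical, and a real cache also needs expiry and invalidation, which are left out here.

```python
from typing import Callable, Dict


class ResponseCache:
    """Toy reverse-proxy cache keyed by the full request URL."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        # Serve a saved response if this exact URL has been seen before.
        if url in self.store:
            self.hits += 1
            return self.store[url]
        # Otherwise ask the application server once and remember the answer.
        self.misses += 1
        response = self.backend(url)
        self.store[url] = response
        return response


def render(url: str) -> str:
    # Hypothetical stand-in for the application server doing expensive work.
    return f"response for {url}"


cache = ResponseCache(render)

# A long URL with many parameters is cached like any other key.
long_url = "/api/search.json?q=" + "x" * 500
first = cache.get(long_url)   # miss: the backend generates the response
second = cache.get(long_url)  # hit: served from the cache
```

Because the key is simply the request URL, the length of the URL does not matter in this model, which is the property the paragraph above describes.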
Database upgrade: Finally, we recently upgraded our database to PostgreSQL 9.4 and moved it to a more powerful server. Our database server gets a considerable workout recording data about each document uploaded, processed and served, and as our user base grows we were starting to hit limits on memory, storage and processing. Now, we have room to spare.
We have plenty of work ahead as we embark on improving the data model for handling accounts and create a new workflow for signing up. But we hope these less-visible back-end improvements help you get your work done faster and improve your readers’ experience with DocumentCloud.
Thanks to Ryan Murphy of the Texas Tribune, there’s a new way to simplify using DocumentCloud’s API – a Node.js library aptly called node-documentcloud.
“Why should Ruby and Python get to have all the fun?” Murphy said, referring to the fact that coders for some time have been able to use python-documentcloud and the documentcloud RubyGem wrappers to work with the DocumentCloud API.
“The more I use Node.js, the more I like having the option to complete tasks in the language,” Murphy said. “DocumentCloud also has a relatively straightforward API structure, so it also seemed like a good opportunity to try building a client for the first time (something I’ve wanted to attempt for a while).”
DocumentCloud’s API is a powerful piece of the platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.
You can interact with our API via the programming language of your choice, but if you’re a user of Python, Ruby, or now Node.js, the wrappers around the API contributed by the open-source community provide many shortcuts over coding all the interactions yourself.
Murphy sees his library as a gateway to additional features and platforms around DocumentCloud.
“For example, one of the first spinoffs I’ve begun work on is a command line interface on top of node-documentcloud — something that would allow you to interface with the DocumentCloud client from your terminal,” Murphy said.
“Then, it could be as simple as something like
documentcloud-cli upload <name_of_folder> to send a bunch of documents to the service. Or,
documentcloud-cli download <document_id> to pull down a file. It’s still early going!”
Read all about the wrappers
You can learn more about node-documentcloud by visiting its documentation on GitHub or npm.
If you’re a Python or Ruby coder, take a look at:
python-documentcloud: From Ben Welsh of the Los Angeles Times’ data desk comes this full-featured API client for Python programmers. In addition to covering the basics, this library goes deep with providing details such as the location of annotations in a document. Documentation.
pneumatic: A Python bulk-upload library for DocumentCloud, written by Anthony DeBarros of the DocumentCloud team. Provides features including cataloging all the files uploaded and their URLs in a database. Documentation.
DocumentCloud: A RubyGem for interacting with the DocumentCloud API, created by Miles Zimmerman. Upload, search, retrieve data about documents. GitHub. RubyGems.
A big congratulations from the DocumentCloud team to Tyler Dukes, public records reporter at TV station WRAL in North Carolina. Dukes received a 2016 Sunshine Award from the North Carolina Open Government Coalition for work including a custom document-search application he built using the DocumentCloud API.
The API is a powerful piece of the DocumentCloud platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.
When the University of North Carolina at Chapel Hill released hundreds of thousands of pages of documents gathered during an independent investigation into academic fraud involving faculty, staff and student athletes, Dukes turned to the search method of DocumentCloud’s API to build a web application that let users find and read documents by keyword or key people in the investigation.
“We wanted to build something to allow users to browse and search hundreds of thousands of pages of documents all in one place,” Dukes said. “DocumentCloud’s existing embeddable search was close, but because the documents were arbitrarily spread across hundreds of batches, I was concerned it would be too confusing for the average user.
“The API allowed us to very quickly prototype and roll out exactly what we wanted for this very specific circumstance,” he said. “We used the API to pull every page from documents stored in a single project and display them in an intuitive application that allows users to read page by page (or even random pages) or search everything at once. We’ve updated the application twice now, and we’re currently up to 680,000 pages and counting.”
Dukes’ project is one of several in recent months that have used our API or components to give readers custom search and viewing, including a Wall Street Journal application to let readers tag Hillary Clinton’s emails and La Nacion’s election crowdsourcing application VozData.
If you’re interested in using the DocumentCloud API, check out our help documentation and don’t be shy about getting in touch.
We’re excited to announce that DocumentCloud’s custom WordPress plugin now features support for Page Embed, our lightweight, responsive viewer. Version 0.4.0 — which includes additional enhancements and bug fixes — is available for download now, and we recommend upgrading as soon as practical.
The plugin makes publishing documents, notes and now pages as simple as dropping a shortcode in your WordPress post. As with documents and notes, the embed wizard in the DocumentCloud workspace generates the shortcode for a page at the same time it creates our traditional embed code. Here’s an example:
Download, install and activate the plugin, drop the shortcode into your story, and it renders the page including annotations:
For more information on embedding, check out our Help docs.
Our WordPress plugin — as with the entire DocumentCloud platform — is an open-source project, and we welcome your contributions, ideas and feedback! Visit the GitHub repository or get in touch via firstname.lastname@example.org
With content consumption continuing to shift to mobile, we’ve been spending a considerable amount of time thinking about how to help our users tell stories with documents across a range of devices. Loading and reading a 50-page PDF on a desktop computer is an experience that, for many reasons, doesn’t translate well to a smart phone. So, what’s a better way to point readers to what matters in a document when it appears on one of your mobile pages?
Today, we’re pleased to announce our first step in moving toward a better mobile PDF experience: DocumentCloud Page Embed. It’s a lightweight, responsive viewer that highlights a single page, along with your annotations, and works across desktop and mobile. It’s available in our workspace right now. Look for “Embed a Page” under the Publish Menu and use the wizard to generate the code. See our Help documentation for details.
Here’s an example of Page Embed showing a page from an investigative report on a fire that injured several firefighters in Northern Virginia. The page has one note highlighted:
Behind the development
Why a page-focused embed? We noticed that publishers often embed simple screenshots of document pages. While easy to use and natively responsive, images deny readers the rich context of an embedded DocumentCloud document: annotations, searchable page text, and of course access to the original source document itself. While our full document viewer offers all these options, it’s overkill for presenting a single page.
Our new page embed strips the cruft from the document viewer and lets the reader focus on the page (and your annotations, if you’ve added any). We know how important mobile has become for document publishing, so we’ve made the page embed natively responsive. The entire interface resizes and changes capabilities with its surrounding context.
Extending across the platform
In fact, we’re so happy with the page embedder, we’re using it as the foundation for our next-generation full document viewer. Our goal is a responsive, extensible viewer with minimal interface chrome that lets readers view the document and your annotations and then get right back into the story. In other words, our focus is keeping their focus on you.
We still have plenty of improvements and additions in the pipeline. We’ll soon add the navigation to all the pages and text in the document, more immersive notes, more customization options, and better performance. We also plan to support page embeds with our oEmbed API and WordPress plugin. You can go ahead and start using it today, but keep an eye on the wizard for new options and capabilities in the near future.
Your thoughts are welcome! Please send any feedback to email@example.com or open an issue on our GitHub repo.
Today, we’re making it easier for you to embed documents and notes with an updated WordPress plugin and an oEmbed service to power it.
The WordPress plugin – which builds upon an earlier version developed by Chris Amico for NPR’s StateImpact project – adds the ability to embed notes as well as documents using shortcodes. And we’ve updated our embed wizard so it will generate the WordPress shortcodes for you like this:
[documentcloud url="http://www.documentcloud.org/documents/1699074-sb0101-05-enrs/annotations/210824.html" ]
Which embeds like this:
The plugin is powered by our new oEmbed API. It's our next step in helping you integrate DocumentCloud embed codes into your content management system so embedding documents and notes is as simple as pasting in a URL.
You can install our plugin right now by visiting its WordPress page. Developers interested in adding simple embedding with our oEmbed API can find its documentation in our help pages.
Read on for more details:
Since 2011, WordPress users have been able to embed documents on their blogs thanks to a plugin created by Chris Amico for NPR’s StateImpact project.
Chris’ plugin let users embed documents with shortcodes by translating the shortcodes into HTML embed codes.
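As a rough illustration of the first half of that translation, the sketch below pulls the attributes out of a `[documentcloud ...]` shortcode. It is written in Python for brevity and is not the plugin's code: the plugin itself relies on WordPress's shortcode handling, which covers more cases than these two regular expressions, and `parse_shortcodes` is a name made up for this example.

```python
import re

# Matches a whole [documentcloud ...] shortcode and captures its attribute text.
SHORTCODE = re.compile(r"\[documentcloud\s+([^\]]*)\]")
# Matches name="value" pairs inside the shortcode.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_shortcodes(post_text):
    """Return one dict of attributes per documentcloud shortcode found."""
    return [
        dict(ATTRIBUTE.findall(match.group(1)))
        for match in SHORTCODE.finditer(post_text)
    ]


post = (
    "Read the bill below.\n"
    '[documentcloud url="http://www.documentcloud.org/documents/'
    '1699074-sb0101-05-enrs/annotations/210824.html" ]\n'
)
found = parse_shortcodes(post)
```

Once the `url` attribute is in hand, the remaining step is turning it into embed HTML.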
With Chris’ help, and input from Adam Schweigert of the Project Largo team, we’ve released a new version of the plugin that adds:
- Note Embeds: You can now embed individual notes as easily as documents. Just pass a note URL into the shortcode with the
- Raw URL Support: You can now paste the URL for a document or note onto its own line, and the plugin will translate that into an embed.
- oEmbed Support: Embed codes are now fetched from our oEmbed service rather than generated internally, so they’ll always be up to date.
Full installation and usage instructions are available on the plugin page.
Note for users of the existing plugin: Because we’re releasing a whole new plugin, prior installations of Navis DocumentCloud won’t automatically update. You’ll have to deactivate/delete Navis DocumentCloud from your site and install the new plugin. Sorry! This should be a one-time process, and future updates will be delivered through the normal WordPress update mechanism.
We’ve added an oEmbed API to make it easier for developers to get embed codes for documents and notes. oEmbed has become a standard for easily embedding Web content, and it has long been one of our users’ top feature requests. Now, instead of having to reverse engineer the format of our embed codes, developers can just send a request to the DocumentCloud API to ask for a correctly formatted embed code.
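Here is a minimal sketch of building such a request. The `url` and `maxwidth` parameters come from the oEmbed specification; the endpoint address and the `oembed_request_url` helper are assumptions made for this example, so check the API help pages for the real details.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm the real address in the API help pages.
OEMBED_ENDPOINT = "https://www.documentcloud.org/api/oembed.json"


def oembed_request_url(resource_url, maxwidth=None):
    """Build an oEmbed request for a document or note URL."""
    params = {"url": resource_url}
    if maxwidth is not None:
        # maxwidth is an optional consumer hint defined by the oEmbed spec.
        params["maxwidth"] = maxwidth
    return OEMBED_ENDPOINT + "?" + urlencode(params)


note_url = (
    "http://www.documentcloud.org/documents/"
    "1699074-sb0101-05-enrs/annotations/210824.html"
)
request_url = oembed_request_url(note_url, maxwidth=600)
print(request_url)
```

Under the oEmbed specification, a response of type `rich` carries the markup to insert in its `html` field, so a CMS only has to fetch this URL and drop that field into the page.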
If you’d like your CMS to support embedding documents as easily as DocumentCloud’s WordPress plugin, you or your developers can read more about our oEmbed service in our API help pages.
Thanks for reading. Follow us on Twitter for more updates.
“The crux of it is that it makes it safe to drop a document into WordPress and be certain that it won’t be broken,” said Amico, of the plugin he authored as an application developer for NPR’s StateImpact project.
Users can post their document using a DocumentCloud button on the post’s toolbar in Visual mode, or a shortcode in HTML mode.
The plugin also allows users to configure the width and height of the document viewer in an administrative panel in the Settings menu. The “Full-width” option is designed to make the document viewer as wide as the post content.
“We wanted to give reporters the ability to make a post that is basically just the document,” Amico said.
The plugin is available on GitHub and the Project Argo site. StateImpact is a spinoff of Project Argo, and both are run by NPR. Amico said he wanted to help bloggers use DocumentCloud because he used it as a reporter for PBS NewsHour. “At some point, my plan is to put it into WordPress’s plugin directory,” Amico said.
Samantha Sunne volunteers with DocumentCloud at its hub in Columbia, Missouri. She studies investigative and multimedia reporting at the University of Missouri.
Jeremy Ashkenas, engineer emeritus of DocumentCloud, opened the conference with the first State of the Backbone keynote. We’ve recorded it for those who weren’t able to join us at BackboneConf.
Along with a slew of tweaks and bug fixes, the most notable new feature is HTML5 “pushState” support, which you can see in action by trying a search in DocumentCloud’s public archive. This enables the use of true URLs, but also requires you to do a bit of extra work on the back end to be sure that your application is capable of serving these pages, so it’s strictly on an opt-in basis.
Of course, not all browsers currently in popular use (ahem, Internet Explorer) support the “pushState” function yet. Older browsers will continue to use hash-based URLs, and if hash-based links are shared with modern browsers, they’ll be transparently upgraded to the “pushState” version of the URL.
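The upgrade amounts to moving the route out of the URL fragment and into the path. The sketch below shows that idea in Python rather than JavaScript; it is not Backbone's implementation, `upgrade_hash_url` is a hypothetical helper, and it assumes the path before the `#` is the application root.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_hash_url(url):
    """Rewrite a hash-based route as a true path; leave other URLs alone."""
    parts = urlsplit(url)
    if not parts.fragment:
        return url
    # Drop any leading "!" or "/" so "#!/search" and "#search" behave alike.
    route = parts.fragment.lstrip("!/")
    # Treat the existing path as the application root (an assumption).
    root = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, root + route, parts.query, ""))


shared = "https://example.org/public/#search/palin"
upgraded = upgrade_hash_url(shared)
print(upgraded)
```

This also shows why the feature is opt-in: once the route lives in the path, the server has to be able to answer a direct request for that path.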
Other changes include renaming Controller to Router for clarity, the refresh function to reset, and replacing saveLocation with a more flexible navigate API. There are instructions for upgrading from 0.3.3 to 0.5.0 that should help with these.
The full change log is also available.
Over the past two years, we have released much of our toolset as open-source code: Backbone.js, Underscore.js, Jammit, CloudCrowd, and others. Today, we’re launching another piece of DocumentCloud — both on DocumentCloud.org and as a component you can integrate into your own projects. VisualSearch.js is a rich search box for real data. It enhances ordinary search boxes with the ability to autocomplete facets and values for sophisticated searches.
For example, here’s a query that filters The New York Times’ copies of Sarah Palin’s recently-released emails. First we filter just the annotated emails, then just the emails from 2007, and then drill down to a specific date. Visual search works with the arbitrary metadata you’ve already added to your documents.
We are excited to not only see what clever uses developers come up with for VisualSearch, but also what additions you write that can be merged back into the main repository.
When you upload a document to DocumentCloud, and the file does not contain text, we attempt to perform OCR (optical character recognition) on the document, using the open source Tesseract project. Tesseract is a venerable piece of software, originally developed at Hewlett-Packard between 1985 and 1995. Google acquired the project in 2006, and has been sponsoring work on it since then. A few months ago, Tesseract 3.0 was released; and this morning, we’ve deployed the new version of Tesseract as part of DocumentCloud.
The project is hosted on GitHub; annotated source code is available, as is an online test suite.
Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2, and our CloudCrowd parallel-processing system. Since the new Micro instances were just announced, we thought it would be wise to try them out by benchmarking some real world work on these new servers. If they proved cost-effective, it would be beneficial for us to use them as worker machines for our document processing.
Benchmarking with Docsplit
To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).
At the end of our first summer, over 30 newsrooms are using DocumentCloud to augment their reporting by publishing selected source documents. You can see some examples of DocumentCloud in action on our list of featured documents or our recent MediaShift post. We’ll soon be allowing the general public to search the catalog of primary source documents, and when someone runs a search, we’d like to send readers to the embedded version of a document on the contributing organization’s site, if it’s available. So we need to know the location of the page where the document is being embedded. In order to help automate this, we created Pixel Ping.
The Document Viewer has always supported the ability to create “page notes” — annotations that sit between two pages and provide commentary about a specific page as a whole or an introduction to a new section of a document. This morning, we released an update to DocumentCloud that provides a way for you to create page notes from within the viewer.
To try it out, open a document you’ve uploaded and click on one of the “Add a Note” links in the sidebar. Hover your crosshair over the margin in between pages, and you’ll see a dotted line appear, with a note tab on the left:
If you click, you’ll create a page note in between the two pages. Add a title and some text and click the “Save” button. The note’s title will appear in the navigation on the right. Of course, page notes are viewable and editable from the workspace, just like any other.
If you’re logged in, you can take a look at the sample document shown here.
Monday morning we rolled out SSL support on DocumentCloud.org — visit https://www.documentcloud.org to view, browse and edit documents in your workspace over an encrypted connection. When you use HTTPS, all traffic between your computer and DocumentCloud is encrypted before it’s sent over the internet. If you’re working on a public wireless connection, are on an unsecured network or are dealing with highly-sensitive documents, we recommend using HTTPS.
You can tell if you’re on a secure connection by looking at your browser. When visiting a secure website, all browsers display a lock icon somewhere on the window. Here’s what the lock looks like in Google Chrome:
More Search Parameters
We’ve also added new ways to filter your DocumentCloud searches. You can now use “access” to filter your documents by their access level, and “projectid” to designate a specific project when you’re using our search API. (Access to searches and the API is limited to registered users during the beta.)
To view only your private documents in a particular project, you can add “access: private” to your search terms. Searching by “access: public” will show you only public documents, while “access: organization” will show you those documents shared within your organization.
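For API users, the filters are just extra terms in the query string. The sketch below assembles one. The three access levels come from this post; the endpoint address, the `q` parameter, and the `projectid: <id>` spelling are assumptions for illustration, and `build_query` and `search_url` are hypothetical helpers.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter name; check the API documentation.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"
ACCESS_LEVELS = {"public", "private", "organization"}


def build_query(terms, access=None, project_id=None):
    """Combine free-text terms with the access and projectid filters."""
    parts = [terms] if terms else []
    if access is not None:
        if access not in ACCESS_LEVELS:
            raise ValueError(f"unknown access level: {access}")
        parts.append(f"access: {access}")
    if project_id is not None:
        # Spelled like the access filter; this form is an assumption.
        parts.append(f"projectid: {project_id}")
    return " ".join(parts)


def search_url(query):
    """URL-encode the query and attach it to the assumed endpoint."""
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query})


query = build_query("madoff", access="private", project_id=123)
url = search_url(query)
print(query)
print(url)
```

Validating the access level up front catches a typo such as `access: privat` before it reaches the server.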
Already using the search API? We’ve added search terms that let you limit public results to a single project. Drop a line to support AT documentcloud DOT org if you’d like to take advantage of this one.
Still waiting for an important feature? Let us know!
These improvements are only available to users who have an account on DocumentCloud. If you’re a reporter who works with primary source documents, and you’re not using DocumentCloud yet, contact us to find out how to start.
IE6 has long been the bane of web developers: developing web applications that work as well in IE6 as in other browsers is substantially more difficult than bypassing the ten-year-old browser.
IE6 users will still be able to download an original PDF of any document and will see a landing page that encourages them to upgrade their browser or install Chromeframe.
The New York Times, with whom we continue to collaborate closely on development of the viewer component of DocumentCloud, has long planned to phase out support for IE6. They don’t test new tools against the browser and will soon update to the same version of their document viewer that DocumentCloud is running on. The Times isn’t alone: YouTube began phasing out support for IE6 in March and other Google products are expected to follow suit. We’re certainly open to feedback on our implementation.
Meantime, take a look at some of the great things reporters are doing with DocumentCloud.
Since we launched DocumentCloud’s beta, one of the most common requests has been: “How can I share documents with reporters from other organizations?”
Now you can share a project with any other DocumentCloud user — in any newsroom.
How does it work?
Let’s say I have a project with documents relating to the Madoff Ponzi scheme, and I want to share them with Scott. To open the project for editing, I click on its edit icon.
Inside of the project, I click on the “Add a collaborator to this project” link, and I type in Scott’s email address — the one that he uses to log in to DocumentCloud.
After clicking the “Add” button, Scott now appears as a collaborator on this project.
The next time Scott logs in to DocumentCloud, “The Madoff Files” will show up as one of the projects in his sidebar. He can now view, edit and annotate all of the documents inside of it. He can add documents of his own to the project and I’ll be able to see and edit those as well.
Project collaborators can do anything with the documents in a project that you can do: they can edit public notes, change settings like the document’s title or source, and add “related article links.” Collaborators can also add people to the project or remove them from it. You can only collaborate with fellow DocumentCloud users, though: if you’re collaborating with a newsroom that isn’t yet part of DocumentCloud, send them our way and we’ll get them set up.
We’d love it if you would give it a spin and let us know what you think: write to firstname.lastname@example.org or suggest improvements in our support forum, where fellow users can weigh in as well.
We’ve been spending a lot of time in the DocumentCloud Lab researching the best way to break apart documents into their component parts, to make it easier to index them for searching and to display them on the web. The latest open-source piece of DocumentCloud is a tool to help you extract images, thumbnails, plain text, and individual pages from any kind of document. It wraps up the PDFBox, GraphicsMagick, and JODConverter libraries, providing you with a command-line utility and a Ruby API for breaking apart documents.
Docsplit is our fourth open-source project, but is perhaps the most immediately useful in the newsroom. We’ve been talking to the Guardian and the New York Times about techniques for pulling images and text out of documents, and Docsplit synthesizes some of the best practices into a single package with a simple interface. We’re hoping it comes in handy the next time you need to analyze a pile of documents.
The project page contains a complete overview of Jammit, including installation instructions, documentation, and examples. We hope you can use it to help speed up your Rails applications.
We released the first open-source component of DocumentCloud a little over a month ago. Since then CloudCrowd has picked up a lot of steam, with hundreds of developers watching it on GitHub, and many patches and features being contributed by the community. Among other uses, it’s running gene sequence analysis on strains of influenza virus — something we certainly never expected to see. Since anything worth doing is worth doing twice, this morning I’m pleased to announce the release of the second open-source component of DocumentCloud: Underscore.js.
We have some more news: About two dozen news and other organizations have signed on as beta-testers. They’ll be contributing documents to DocumentCloud, and giving us feedback as we work out the kinks. It’s a wide-ranging list:
- ACLU National Security Project
- Arizona Republic
- The Atlantic
- Center for Democracy and Technology / OpenCRS
- Centre for Investigative Journalism, City University London
- Center for Investigative Reporting / California Watch
- Center for Public Integrity
- Chicago Tribune
- Dallas Morning News
- The Investigative Reporting Workshop at American University
- The New Yorker
- Mother Jones
- St. Petersburg Times
- Sunlight Foundation
- Voice of San Diego
- Washington Post
These organizations will be joining our original set of contributors — The New York Times, ProPublica, Talking Points Memo, The National Security Archive, and Gotham Gazette — all of whom will of course be working with us during the testing too.
Earlier this morning we also announced that we’re working with Thomson Reuters’ OpenCalais service to extract and make available information from the documents contributed to DocumentCloud.
E-mail us if you’d like to participate in the testing. We’re interested in any organization, including non-profits and academic institutions, that has obtained documents during its research.
If you’re new here, the goal of DocumentCloud is to super-charge investigations by making documents, and the information in them, easier to find and share. Readers will be able to search documents on DocumentCloud and then will be pointed to the documents themselves on contributing organizations’ Web sites. (Here’s a FAQ with more details.)
Finally, you can keep following our progress on this blog — or follow us on Twitter, or RSS. And we’re releasing our code each step of the way.
This morning we’re excited to announce a partnership with Thomson Reuters, which is contributing its OpenCalais service to DocumentCloud. OpenCalais uses natural language processing to extract information from documents, instantly identifying and tagging the relevant people, places, companies, facts and events. This will make it easy for readers and journalists to explore connections between documents and across the full collection of source materials.
If you’ve seen us do a presentation about DocumentCloud, you already know it’s going to be a key part of what makes DocumentCloud great.
As we began to prototype DocumentCloud, it quickly became apparent that we’re going to need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.
Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.
CloudCrowd is intended for a moderate volume of highly expensive tasks — things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This sort of work is the kind of thing that always needs to be extracted from a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should help provide a convenient and scalable way to offload the work.
We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up, the more machines you add to the cluster, the faster you’ll be able to process them. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
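The split-then-process pattern can be sketched as follows. This is a Python illustration of the idea, not CloudCrowd's Ruby API: the 10-page chunk size and the "split" step come from the paragraph above, while `process`, `merge`, `run`, and the thread pool standing in for worker machines are labels and choices made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 10  # pages per work unit, as described above


def split(page_count, chunk_size=CHUNK_SIZE):
    """Return (first_page, last_page) ranges, 1-indexed and inclusive."""
    return [
        (start, min(start + chunk_size - 1, page_count))
        for start in range(1, page_count + 1, chunk_size)
    ]


def process(chunk):
    """Hypothetical stand-in for expensive per-chunk work such as OCR."""
    first, last = chunk
    return [f"text of page {n}" for n in range(first, last + 1)]


def merge(results):
    """Flatten per-chunk results back into one list of pages."""
    return [page for chunk in results for page in chunk]


def run(page_count, workers=4):
    # Threads stand in for separate worker machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so pages stay in sequence.
        return merge(pool.map(process, split(page_count)))


pages = run(25)
print(split(25))
print(len(pages), "pages processed")
```

Each chunk is independent of the others, which is what lets additional workers shorten the total time.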
This is DocumentCloud’s first release of open-source code, hopefully the first of many to follow. If you have batch-processing needs similar to ours, I encourage you to give CloudCrowd a try. The source is available on GitHub, along with a wiki and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!