Whenever you upload a document to DocumentCloud we send the contents to OpenCalais, a service that discovers the entities (people, places, organizations, terms, etc.) that are present in plain text. OpenCalais can tell us that “Barack Obama” is the same person as “President Obama”, “Senator Obama”, “Mr. President” … and even “he” or “his” in clauses like “his policy proposals”.
Last month, we stopped indexing entities for faceting because DocumentCloud has reached the point where our search index can no longer support the strain of keeping track of the millions of unique entities stored in our database. We still hope to bring back some form of entity faceting — a feature you may remember as the “Entities” tab — using a different implementation in the future. But for the time being, we have added a new feature that allows you to easily browse through all of the entities associated with a document:
The entities are displayed in a chart that shows how often each entity occurs across each page. Using this chart, you can see which companies and individuals tend to be mentioned together frequently, or which parts of a long document concern a certain topic. Hover over any mention (the small gray boxes) to see the surrounding context, and click on it to jump directly to that mention within the document itself.
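To make the idea behind the chart concrete, here is a minimal sketch of counting how often an entity is mentioned on each page. It is not DocumentCloud's implementation: the function name `mentions_by_page`, the alias lists, and the sample pages are all hypothetical, and plain word matching stands in for the entity resolution that OpenCalais performs.

```python
import re

def mentions_by_page(pages, entities):
    """Map each entity name to {page number: mention count}.

    pages    -- list of page texts, in order (page 1 first)
    entities -- dict of entity name -> list of aliases to look for
    """
    chart = {}
    for name, aliases in entities.items():
        per_page = {}
        for number, text in enumerate(pages, start=1):
            hits = 0
            for alias in aliases:
                # Whole-word, case-insensitive match for each alias.
                pattern = r"\b" + re.escape(alias) + r"\b"
                hits += len(re.findall(pattern, text, re.IGNORECASE))
            if hits:
                per_page[number] = hits
        chart[name] = per_page
    return chart

# Hypothetical three-page document.
pages = [
    "The Environmental Protection Agency fined the plant.",
    "No mention here.",
    "EPA inspectors returned; the EPA report followed.",
]
entities = {
    "Environmental Protection Agency": ["Environmental Protection Agency", "EPA"],
}
chart = mentions_by_page(pages, entities)
```

Each row of the chart corresponds to one key in the result, and each gray box to one counted mention on a page.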
If you want to try out an example, here is a link to a recent document that ran with a disability fraud story in today’s New York Times. Right-click on the document and choose View Entities from the context menu, or select the document and choose View Entities from the Analyze menu.
We’re still polishing these charts, so let us know if you have any ideas for improving them, or ideas for other ways that we can make extracted entities more useful for your reporting.
DocumentCloud now supports advanced boolean search queries, allowing you to more easily perform searches that hone right in on the documents you’re trying to find. You may be familiar with boolean operators from other search engines, but here’s a quick refresher on the available options:
- and: both terms must exist in the document. Example: Perry and Romney
- or: either term may match. Example: indicted or accused
- !: the term must not exist in the document. Example: obama !barack
- *: a wildcard to match any sequence of letters. Example: J*e Smith (matches Joe, Jane or Jake Smith)
- ( ): group together words into a term. Example: (Perry or Romney) and governor
Here’s an example of what that last search looks like in action:
Behind the scenes, we’re using the latest stable release of the open-source Solr/Lucene search engine (3.4.0). It includes a new query parser called “edismax” that adds boolean operators to the previous implementation of full text search.
Give boolean searches a spin, and let us know if they’re working well for your ongoing projects.
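If it helps to see the semantics of the operators listed above spelled out, here is a rough sketch in Python. It does not use Solr or the edismax parser; the helper `has` and the sample sentences are hypothetical, and each search is written out by hand as a Python expression rather than parsed from a query string.

```python
import re
from fnmatch import fnmatchcase

def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def has(term, text):
    """True if any word in the text matches the term; * matches any run of letters."""
    pattern = term.lower()
    return any(fnmatchcase(word, pattern) for word in words(text))

doc = "Governor Perry was accused of ignoring a report by Jane Smith."
other = "Michelle Obama visited the school."

# Perry and Romney: both terms must be present.
perry_and_romney = has("Perry", doc) and has("Romney", doc)

# indicted or accused: either term may match.
indicted_or_accused = has("indicted", doc) or has("accused", doc)

# obama !barack: the first term is present, the second is absent.
obama_not_barack = has("obama", other) and not has("barack", other)

# J*e Smith: the wildcard term and the plain term must both match.
wildcard_smith = has("J*e", doc) and has("Smith", doc)

# (Perry or Romney) and governor: the parentheses group the "or".
grouped = (has("Perry", doc) or has("Romney", doc)) and has("governor", doc)
```

With these sample sentences, only the first search fails to match, because "Romney" never appears.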
On large document-driven projects, newsrooms often bring together teams of collaborators that include independent researchers who aren’t formally part of the newsroom. Newsrooms that want a research team to evaluate thousands of documents — more than our collaboration tools are designed to accommodate — can take advantage of our new access level: the freelancer. A “freelancer” can upload, annotate, and edit documents like any other user, but they can only access documents you’ve explicitly shared with them.
To add a user (or ten) who is going to be contributing reporting but shouldn’t have access to the rest of your newsroom’s documents, you can create an account for a freelancer.
Freelancer accounts are good for anyone that you regularly work with, but who doesn’t actually work for your organization, or for folks you’re bringing together on a single reporting project.
For more information, check out our accounts documentation.
Along with a slew of tweaks and bug fixes, the most notable new feature is HTML5 “pushState” support, which you can see in action by trying a search in DocumentCloud’s public archive. This enables the use of true URLs, but also requires you to do a bit of extra work on the back end to be sure that your application is capable of serving these pages, so it’s strictly on an opt-in basis.
Of course, not all browsers currently in popular use (ahem, Internet Explorer) support the “pushState” function yet. Older browsers will continue to use hash-based URLs, and if hash-based links are shared with modern browsers, they’ll be transparently upgraded to the “pushState” version of the URL.
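As an illustration of what "transparently upgraded" means, here is a small sketch that rewrites a hash-based URL into its path-based equivalent. It is not the library's own code, which runs in the browser; the function name `upgrade_hash_url` and the example URLs are made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def upgrade_hash_url(url):
    """Turn a hash-based URL into the equivalent path-based URL.

    URLs without a fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment:
        return url
    # Append the fragment to the existing path, avoiding doubled slashes.
    path = parts.path.rstrip("/") + "/" + parts.fragment.lstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

upgraded = upgrade_hash_url("https://example.org/public/#search/obama")
```

The server still has to be able to answer a direct request for the upgraded URL, which is the extra back-end work mentioned above.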
Other changes include renaming Controller to Router for clarity, renaming the refresh function to reset, and replacing saveLocation with a more flexible navigate API. There are instructions for upgrading from 0.3.3 to 0.5.0 that should help with these.
The full change log is also available.
As DocumentCloud becomes more deeply embedded into the reporting workflow for many newsrooms, we hear more and more requests for improved document redaction tools. We expect each newsroom to adhere to their own policies about what kind of information is or is not suitable to reveal, but if you’ve used DocumentCloud to analyze records that contain home phone numbers, private details about minor children or personal information that isn’t appropriate for publication, you probably want to redact those documents before you publish them. Continue reading »
This morning, not quite one year since we opened our beta to newsrooms at NICAR 2010, the millionth page of primary source material was uploaded to DocumentCloud. Reaching this milestone so soon is a tribute to our users and the amazing document-driven investigative reporting you have published over the past year.
Most of the thousands of documents in our catalog have arrived in small batches: five documents here, 20 there, most often accompanying a breaking story. Take a look for yourself: browse through recently published documents by searching for “filter: published” or read up on other searches you can run.
Now is a good moment to highlight some notable recent stories:
Last week, Center for Public Integrity launched a series of articles on hidden hazards at oil refineries in the United States. Readers of Regulatory Flaws, Repeated Violations Put Oil Refinery Workers at Risk can review a dozen citations and court filings that the Center’s journalists used in the reporting.
Sunday, The New York Times published the first installment of an investigation into lax regulation of natural-gas drilling across the US, accompanied by a large cache of E.P.A. and industry documents.
The Seattle Times reported last week on evidence of financial abuse in Seattle public schools, based on documents released by state auditors. The documents detail over-billing, intimidation, and ethics violations that add up to $1.8 million in potentially fraudulent expenses.
Thanks for a great first year, and here’s hoping that the next year brings millions more pages, and more great document-driven reporting.
We’ve long allowed users to access DocumentCloud’s tools over an encrypted HTTPS connection. Now, we’ve made it mandatory. Next time you log in to your DocumentCloud account, you will be redirected to the secure version of your workspace.
When you use an unencrypted HTTP connection to access a website, your request and the site’s response are both sent over the network in readable clear text, which is trivially easy to intercept and read. Without HTTPS, it is actually possible for someone to hijack your connection to DocumentCloud, inserting or altering the content that you’re viewing. So we think HTTPS is worth using.
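In outline, making HTTPS mandatory means answering every plain-HTTP request with a redirect to the same address over HTTPS. The sketch below shows only that rule and is not our server configuration; the function name `https_redirect` and the choice of a 301 status are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

def https_redirect(url):
    """Return (status, location) for a plain-HTTP URL, or None if already secure."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        return None
    # Keep host, path, query and fragment; change only the scheme.
    secure = urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))
    return 301, secure  # 301 (permanent redirect) is an assumption here

redirect = https_redirect("http://www.documentcloud.org/search?q=madoff")
```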
If you’re interested in the technical subtleties of implementing SSL, read on. Continue reading »
You already know you can link directly to any page or annotation. Now you can embed documents so that they’ll open to any page or annotation, too. If you want to point your readers to the shocking revelation on page seventy-five or open the viewer directly to a key annotation, check out our new embed dialog.
Select any document and choose “Embed Document Viewer” from the “Publish” menu, and you’ll find a new configuration option:
We build features like this because our users ask for them. What do you need DocumentCloud to do?
Ever since we added the “pages” tab to the document viewer, we’ve wanted to find a way to bring the convenience of browsing through a document’s pages back into the workspace itself. If you log in to DocumentCloud this afternoon, there is now a grid of page images that can be displayed by clicking on the page number at the bottom of each document, or by right-clicking a document and choosing “View Pages.” We flag each page that contains a note with a yellow tab, so you can easily spot them among hundreds of other pages.
Try it on a few of your documents, and let us know what you think!
When you upload a document to DocumentCloud, and the file does not contain text, we attempt to perform OCR (optical character recognition) on the document, using the open source Tesseract project. Tesseract is a venerable piece of software, originally developed at Hewlett-Packard between 1985 and 1995. Google acquired the project in 2006, and has been sponsoring work on it since then. A few months ago, Tesseract 3.0 was released; and this morning, we’ve deployed the new version of Tesseract as part of DocumentCloud. Continue reading »
This morning we rolled out a much-requested update: the ability to set the date on which a document will become public. Until now, users have been manually making documents public when a story was ready to go live. Newsrooms that run new reporting late into the night were left to make documents public before the reporting was available. As DocumentCloud’s user base grows, and as we inch towards public searches of the catalog, we knew reporters needed to be able to set the hour of publication and know that no one will get a sneak peek at their reporting. Continue reading »
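The rule behind scheduled publication can be sketched in a few lines: a periodic job flips a document to public once its publication time has passed. This is an illustration under assumed names, not DocumentCloud's code; the `Document` class, its `publish_at` field, and `publish_due` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Document:
    title: str
    access: str = "private"
    publish_at: Optional[datetime] = None  # None means "no scheduled date"

def publish_due(documents, now=None):
    """Make public every document whose scheduled time has arrived.

    Returns the documents that were switched to public on this run.
    """
    now = now or datetime.now(timezone.utc)
    published = []
    for doc in documents:
        due = doc.publish_at is not None and doc.publish_at <= now
        if doc.access != "public" and due:
            doc.access = "public"
            published.append(doc)
    return published
```

Run on a schedule, a job like this means nobody has to stay up to click a button when the story goes live.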
The project is hosted on GitHub; annotated source code is available, as is an online test suite.
I’m pleased to announce that today (finally) we’re releasing one of our most highly requested features: a smaller-format document viewer that can be embedded inline within the text of an article. A few newsrooms have been beta-testing these fixed size document viewers over the past couple of weeks, so you can see them in action by visiting this Des Moines Register article about Fred Hoiberg’s contract, or this WNYC annotation of New York’s new ballot design. Continue reading »
At this point at the end of our first summer, over 30 newsrooms are using DocumentCloud to augment their reporting by publishing selected source documents. You can see some examples of DocumentCloud in action on our list of featured documents or our recent MediaShift post. We’ll soon be allowing the general public to search the catalog of primary source documents, and when someone runs a search, we’d like to send readers to the embedded version of a document on the contributing organization’s site, if it’s available. So we need to know the location of the page where the document is being embedded. In order to help automate this, we created Pixel Ping. Continue reading »
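The general technique is a tracking pixel: the embedded viewer requests a tiny image and passes along an identifier, and the server tallies the requests. The sketch below shows only that tallying step and is not Pixel Ping itself; the `/pixel.gif` path, the `key` parameter, and the function name `record_ping` are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def record_ping(request_url, hits):
    """Count one pixel request under its 'key' parameter.

    Returns the key that was counted, or None if the request had no key.
    """
    query = parse_qs(urlsplit(request_url).query)
    keys = query.get("key")
    if not keys:
        return None
    hits[keys[0]] += 1
    return keys[0]

hits = Counter()
# Hypothetical requests; the key carries the URL of the embedding page.
record_ping("/pixel.gif?key=https%3A%2F%2Fexample.org%2Fstory", hits)
record_ping("/pixel.gif?key=https%3A%2F%2Fexample.org%2Fstory", hits)
record_ping("/pixel.gif", hits)
```

After these three requests, `hits` holds a count of two for the one embedding page, and the request without a key is ignored.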
We added excerpts to our timelines and entities tab today. We are using OpenCalais to parse every document you upload and extract the names of people, organizations, terms and places in the text. We display these under the Entities tab. We also extract date information from each document that you upload and plot those on a timeline which you can access from the Analyze menu.
DocumentCloud’s entities show you, at a glance, the people who are mentioned the most times in a given project, or the organizations that are named in each of the documents you’ve selected. Here’s how it works now:
Click on the “show pages” link next to any entity to reveal a thumbnail of each page in each document that contains that term, alongside excerpts highlighting the mention of the entity in the text. Clicking on the highlighted phrase will take you directly to the term within the document itself. In the screenshot below, you can see how the Environmental Protection Agency was correctly identified by both its proper name and its acronym. Continue reading »
The Document Viewer has always supported the ability to create “page notes” — annotations that sit between two pages and provide commentary about a specific page as a whole or an introduction to a new section of a document. This morning, we released an update to DocumentCloud that provides a way for you to create page notes from within the viewer.
To try it out, open a document you’ve uploaded and click on one of the “Add a Note” links in the sidebar. Hover your crosshair over the margin in between pages, and you’ll see a dotted line appear, with a note tab on the left:
If you click, you’ll create a page note in between the two pages. Add a title and some text and click the “Save” button. The note’s title will appear in the navigation on the right. Of course, page notes are viewable and editable from the workspace, just like any other.
If you’re logged in, you can take a look at the sample document shown here.
Monday morning we rolled out SSL support on DocumentCloud.org — visit https://www.documentcloud.org to view, browse and edit documents in your workspace over an encrypted connection. When you use HTTPS, all traffic between your computer and DocumentCloud is encrypted before it’s sent over the internet. If you’re working on a public wireless connection, are on an unsecured network or are dealing with highly-sensitive documents, we recommend using HTTPS.
You can tell if you’re on a secure connection by looking at your browser. When visiting a secure website, all browsers display a lock icon somewhere on the window. Here’s what the lock looks like in Google Chrome:
More Search Parameters
We’ve also added new ways to filter your DocumentCloud searches. You can now use “access” to filter your documents by their access level, and “projectid” to designate a specific project when you’re using our search API. (Access to searches and the API is limited to registered users during the beta.)
To view only your private documents in a particular project, you can add “access: private” to your search terms. Searching by “access: public” will show you only public documents, while “access: organization” will show you those documents shared within your organization.
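For anyone building search strings in their own scripts, here is a rough sketch of how a query with filters like these can be separated into its parts. It is not our search code: the function name `parse_search` and the project id value are hypothetical, and it handles only simple unquoted values.

```python
import re

# A filter is a word, a colon, optional spaces, then a single value.
FILTER = re.compile(r"(\w+):\s*(\S+)")

def parse_search(query):
    """Split a search string into (filters, free text)."""
    filters = {key.lower(): value for key, value in FILTER.findall(query)}
    # Remove the filters, then collapse the leftover whitespace.
    text = " ".join(FILTER.sub(" ", query).split())
    return filters, text

filters, text = parse_search("access: private madoff ponzi")
```

Here `filters` comes back as `{"access": "private"}` and `text` as `"madoff ponzi"`.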
Already using the search API? We’ve added search terms that let you limit public results to a single project. Drop a line to support AT documentcloud DOT org if you’d like to take advantage of this one.
Still waiting for an important feature? Let us know!
These improvements are only available to users who have an account on DocumentCloud. If you’re a reporter who works with primary source documents, and you’re not using DocumentCloud yet, contact us to find out how to start.
For some time now, instructions about how to use the DocumentCloud workspace have been available through our wiki. This morning, we released an update that pulls the help pages right into the workspace for easy access, and hopefully makes it faster to get your questions answered. Continue reading »
Since we launched DocumentCloud’s beta, one of the most common requests has been: “How can I share documents with reporters from other organizations?”
Now you can share a project with any other DocumentCloud user — in any newsroom.
How does it work?
Let’s say I have a project with documents relating to the Madoff Ponzi scheme, and I want to share them with Scott. To open the project for editing, I click on its edit icon.
Inside of the project, I click on the “Add a collaborator to this project” link, and I type in Scott’s email address — the one that he uses to log in to DocumentCloud.
After clicking the “Add” button, Scott now appears as a collaborator on this project.
The next time Scott logs in to DocumentCloud, “The Madoff Files” will show up as one of the projects in his sidebar. He can now view, edit and annotate all of the documents inside of it. He can add documents of his own to the project and I’ll be able to see and edit those as well.
Project collaborators can do anything with the documents in a project that you can do: they can edit public notes, change settings like the document’s title or source, add “related article links.” Collaborators can also add or remove additional people to the project. You can only collaborate with fellow DocumentCloud users, though: if you’re collaborating with a newsroom that isn’t yet part of DocumentCloud, send them our way and we’ll get them set up.
We’d love it if you would give it a spin and let us know what you think: write to firstname.lastname@example.org or suggest improvements where fellow users can weigh in as well: in our support forum.
Over the past few months, you might have noticed a handful of news organizations using embedded documents to complement their reporting.
This morning, we’re opening up the ability to embed documents to all of the newsrooms participating in DocumentCloud. When you log into your workspace, you’ll notice a new menu: “Publish”.
From here, you can grab an embed code (a short snippet of HTML) that can be dropped onto a web page to create a document viewer. You may be familiar with such snippets from embedding YouTube videos: this works in a similar fashion. For guidelines on setting up a template and other help, check out our documentation.
If you still have questions about the process, we’re listening at email@example.com.
Note: we know you’re eager to host documents yourself, and you can do that now, but we recommend that you stick with embedded documents so that you can take advantage of bug fixes and other improvements to the viewer. We don’t know yet whether we plan to offer embedding as a long term service. Keep in mind, as well, that this is still a beta. As described in our terms, our capacity to commit to uninterrupted service is limited, as is our liability if service is interrupted in some way.
For those news organizations that want to host documents on their own servers, we’re now offering that as an alternative too. Click on “Download Document Viewer” to get a zipped up folder with all the code, text, and images bundled together as a web page. Drop the folder into any web server (no special software required), and voila, it’s online.
Here at DocumentCloud, we’re looking forward to seeing the great reporting you do with embedded documents — don’t forget to use the workspace to add a “Related Article” link.
We’ve been spending a lot of time in the DocumentCloud Lab researching the best way to break apart documents into their component parts, to make it easier to index them for searching and to display them on the web. The latest open-source piece of DocumentCloud is a tool to help you extract images, thumbnails, plain text, and individual pages from any kind of document. It wraps up the PDFBox, GraphicsMagick, and JODConverter libraries, providing you with a command-line utility and a Ruby API for breaking apart documents.
Docsplit is our fourth open-source project, but is perhaps the most immediately useful in the newsroom. We’ve been talking to the Guardian and the New York Times about techniques for pulling images and text out of documents, and Docsplit synthesizes some of the best practices into a single package with a simple interface. We’re hoping it comes in handy the next time you need to analyze a pile of documents.
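If your own tooling is in Python rather than Ruby, one option is to drive the command-line utility from a script. The sketch below is an assumption-laden illustration, not part of Docsplit: the helper names are hypothetical, it relies on the `images`, `text`, and `pages` subcommands taking a file path, and it passes no other options. Check the project documentation before relying on it.

```python
import shutil
import subprocess

# Subcommands this sketch assumes the docsplit utility provides.
SUBCOMMANDS = {"images", "text", "pages"}

def docsplit_command(kind, path):
    """Build the argument list for one docsplit extraction."""
    if kind not in SUBCOMMANDS:
        raise ValueError(f"unsupported extraction: {kind}")
    return ["docsplit", kind, path]

def extract(kind, path):
    """Run the command if docsplit is installed; return True on success."""
    if shutil.which("docsplit") is None:
        return False  # docsplit is not on the PATH, so nothing was run
    return subprocess.run(docsplit_command(kind, path)).returncode == 0

command = docsplit_command("text", "report.pdf")
```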
The project page contains a complete overview of Jammit, including installation instructions, documentation, and examples. We hope you can use it to help speed up your Rails applications.
We released the first open-source component of DocumentCloud a little over a month ago. Since then CloudCrowd has picked up a lot of steam, with hundreds of developers watching it on GitHub, and many patches and features being contributed by the community. Among other uses, it’s running gene sequence analysis on strains of influenza virus — something we certainly never expected to see. Since anything worth doing is worth doing twice, this morning I’m pleased to announce the release of the second open-source component of DocumentCloud: Underscore.js.
As we began to prototype DocumentCloud, it quickly became apparent that we were going to need a heavy-duty system for document processing. Our PDFs need to have their text extracted, their images scaled and converted, and their entities extracted for later cataloging. All of these things are computationally expensive, keeping your laptop hot and busy for minutes, especially when the documents run into the hundreds or thousands of pages.
Today, we’re pleased to release CloudCrowd, the parallel processing system that we’re using to power DocumentCloud’s document import. It’s a Ruby Gem that includes a central server with a REST-JSON API, worker daemons so you can parcel out the jobs, and a web interface to help keep an eye on your work queue. The screenshot below is an example of what the web interface looks like, showing a series of brief jobs being rapidly dispatched by the workers.
CloudCrowd is intended for a moderate volume of highly expensive tasks — things like PDF processing, image scaling and conversion, video encoding, and migrating data sets. It comes with a couple of example “actions”, including one that serves as a scalable interface to GraphicsMagick, a program that allows you to programmatically apply Photoshop-like transformations and adjustments to images. This sort of work is the kind of thing that always needs to be extracted from a web application; it’s simply too slow to be done in the middle of a request. CloudCrowd should help provide a convenient and scalable way to offload the work.
We’ve been inspired by Google’s MapReduce framework for distributed processing. CloudCrowd provides explicit hooks to help exploit the potential parallelism in your jobs. All “actions” take in a list of inputs: think of a list of PDFs that need to be imported, or a list of images that need to be cropped. Every input is run in parallel. The more workers you spin up, the more machines you add to the cluster, the faster you’ll be able to process them. In addition, if you define an optional “split” method in your action, each input will be split up into multiple work units, all running in parallel. For DocumentCloud, that means the ability to split up large PDFs into 10-page chunks, each of which will be handed off to a different worker.
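The split, process, and merge pattern described above can be sketched in a few lines. This is a Python illustration of the idea, not CloudCrowd's Ruby API: the function names are hypothetical, a thread pool stands in for a cluster of worker machines, and counting words stands in for the expensive per-page work.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 10  # pages per work unit, matching the 10-page chunks above

def split(pages, size=CHUNK_SIZE):
    """Break one large input into independent work units."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def process(chunk):
    """Stand-in for expensive work: count the words on each page."""
    return [len(page.split()) for page in chunk]

def merge(results):
    """Stitch the per-chunk results back together in page order."""
    return [count for chunk in results for count in chunk]

def run_job(pages, workers=4):
    """Process every chunk in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so page order is preserved.
        results = list(pool.map(process, split(pages)))
    return merge(results)

# A hypothetical 25-page document where page n contains n words.
pages = ["word " * n for n in range(25)]
word_counts = run_job(pages)
```

A 25-page input becomes three work units (10, 10, and 5 pages), and adding workers shortens the wall-clock time without changing the merged result.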
This is DocumentCloud’s first release of open-source code, hopefully the first of many to follow. If you have similar batch-processing needs to ours, I encourage you to give CloudCrowd a try. The source is available on GitHub, there’s a wiki, and inline documentation. We’re hoping to get contributions back from the community (there’s even a wish list on the wiki). We hope you find CloudCrowd to be both pleasant and useful. Enjoy!