Ever since we added the “pages” tab to the document viewer, we’ve wanted to find a way to bring the convenience of browsing through a document’s pages back into the workspace itself. If you log in to DocumentCloud this afternoon, there is now a grid of page images that can be displayed by clicking on the page number at the bottom of each document, or by right-clicking a document and choosing “View Pages.” We flag each page that contains a note with a yellow tab, so you can easily spot them among hundreds of other pages.
Try it on a few of your documents, and let us know what you think!
Cross posted from PBS Idealab.
When we embarked on the DocumentCloud project, tools for altering documents were the furthest thing from our minds. After all, a responsible journalist doesn’t tweak source documents!
But one of the first papers to embed material using DocumentCloud needed to do just that. The Chicago Tribune accompanied their coverage of a troubled foster home with a collection of letters and court orders. Though the documents offered an excellent illustration of the state child services agency’s lax oversight and slipped follow-ups, they were predictably full of personal information about children in the foster care system, individual agency staff names and other personal and identifying details about private individuals that the Tribune opted to …
Fine-tuning text, adding, removing and reordering pages: when we embarked on this project, tools for altering documents were the furthest thing from our minds. A responsible journalist doesn’t tweak source documents! But one of the first papers to embed material using DocumentCloud needed to do just that. The Chicago Tribune accompanied their coverage of a troubled foster home with a collection of letters and court orders. Though the documents offered an excellent illustration of the state child services agency’s lax oversight and slipped follow-ups, they were predictably full of personal information about children in the foster care system, individual agency staff names and other personal and identifying details about private individuals that The Trib opted to omit from their reporting. That decision, however, left the news apps team replacing the whole stack of letters multiple times before the package was finally ready to post.
A tool, right inside of DocumentCloud, for replacing, removing and reordering the pages of a document would have helped them a lot. Continue reading »
We’ve added a “pages” tab to document viewers in our workspace and embedded on news sites. The new tab, which now appears in your document viewer right next to the “document” tab, offers a bird’s-eye view of an entire document and lets you browse more quickly by showing you thumbnail images of every page. For long documents, it lets you pinpoint exactly where you want to go without having to scroll and search repeatedly for a specific section. Continue reading »
Cross posted from PBS Idealab.
Planning to spend the long weekend finalizing your Knight News Challenge application? It’s too late for my favorite bit of advice (“don’t wait until the last minute!”), but as someone who’s been involved with three different winning projects, I like to fancy that I’ve got some insight into what makes a good project.
A half dozen prospective applicants have sat down with me to workshop their News Challenge ideas, and I think I’ve helped them think through their projects to get them to a more viable place. The application process isn’t hard, but you do need to give some sincere thought to your project or you’re just wasting your time. Here’s the advice I keep giving people:
When you upload a document to DocumentCloud, and the file does not contain text, we attempt to perform OCR (optical character recognition) on the document, using the open source Tesseract project. Tesseract is a venerable piece of software, originally developed at Hewlett-Packard between 1985 and 1995. Google acquired the project in 2006, and has been sponsoring work on it since then. A few months ago, Tesseract 3.0 was released; and this morning, we’ve deployed the new version of Tesseract as part of DocumentCloud. Continue reading »
This morning we rolled out a much-requested update: the ability to set the date on which a document will become public. Until now, users have been manually making documents public when a story was ready to go live. Newsrooms that run new reporting late into the night were left to make documents public before the reporting was available. As DocumentCloud’s user base grows, and as we inch towards public searches of the catalog, we knew reporters needed to be able to set the hour of publication and know that no one will get a sneak peek at their reporting. Continue reading »
Cross posted from PBS Idealab.
When we make lists of the kinds of source documents users can upload to DocumentCloud, they can get pretty long. DocumentCloud is court filings, hearing transcripts, testimony, legislation, lab reports, memos, meeting minutes, correspondence. I can say with absolute confidence that in all of our planning, “ballots” never once came up as the sort of document a news organization might want to annotate for readers. Our relentlessly creative users have shown us otherwise.
This summer, the Memphis Commercial Appeal rounded out its guide to August’s primary elections with a sample ballot. Their digital content editor told us that many readers who’d missed the sample ballot in the print edition …
The project is hosted on GitHub; annotated source code is available, as is an online test suite.
Cross posted from PBS Idealab.
ProPublica used DocumentCloud to develop an excellent story they published Friday. I’d planned to write it up, but Krista Kjellman Schmidt, the news applications editor who worked on the story, put it much better than I ever could have. Here’s the opening of her post:
On Oct. 8, we published an investigation examining how a judicial opinion in a pivotal lawsuit brought by a Guantanamo detainee vanished, only to be replaced weeks later by an entirely different opinion. At the center of our reporting are two documents representing separate versions of that same opinion: the original opinion written by Judge Henry H. Kennedy, and a second opinion quietly put in the original’s place …
When Krista Kjellman Schmidt was putting together the chart illustrating key deletions in a set of documents, she needed a quick way to grab thumbnail images of particular pages. She was lucky: I just happened to be at ProPublica’s offices and just happened to overhear a few snips of conversation. You might be able to use Firebug, someone suggested, and then reduce the images? another followed. I’m not sure what prompted me to look up (possibly the fact that I’m a busybody?) but I did, and when I realized what she was trying to do I had a far better suggestion. It went something like this: Continue reading »
Last week, Investigative Reporters and Editors had us over for a webinar all about DocumentCloud. A video of the webinar will be available very soon from IRE. In the meantime, I did promise to answer every last question from the back channel.
We covered a lot of ground but there were plenty of questions that I couldn’t get to in the hour allotted. I realized, as I read through the questions that folks were asking, that a good many of our current users have been hesitant to ask questions when our documentation isn’t clear. Part of being in beta means that DocumentCloud is changing fast and often. Sometimes our documentation lags behind, sometimes we miss things. Sometimes what we think is clear makes no sense at all to you. So we’re trying something new: on top of all the other ways in which you’re more than welcome to contact us, we’re setting up virtual office hours. Stop by: we’ll take all questions, no matter how technical or mundane. As for the questions you’ve already asked … Continue reading »
Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2, and our CloudCrowd parallel-processing system. Since the new Micro instances were just announced, we thought it would be wise to try them out by benchmarking some real world work on these new servers. If they proved cost-effective, it would be beneficial for us to use them as worker machines for our document processing.
Benchmarking with Docsplit
To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…). Continue reading »
I’m pleased to announce that today (finally) we’re releasing one of our most highly-requested features: a smaller-format document viewer that can be embedded inline within the text of an article. A few newsrooms have been beta-testing these fixed-size document viewers over the past couple of weeks, so you can see them in action by visiting this Des Moines Register article about Fred Hoiberg’s contract, or this WNYC annotation of New York’s new ballot design. Continue reading »
At this point at the end of our first summer, over 30 newsrooms are using DocumentCloud to augment their reporting by publishing selected source documents. You can see some examples of DocumentCloud in action on our list of featured documents or our recent MediaShift post. We’ll soon be allowing the general public to search the catalog of primary source documents, and when someone runs a search, we’d like to send readers to the embedded version of a document on the contributing organization’s site, if it’s available. So we need to know the location of the page where the document is being embedded. In order to help automate this, we created Pixel Ping. Continue reading »
Cross posted from PBS Idealab.
Since we last updated readers on DocumentCloud’s progress, we’ve made it much easier to upload a lot of documents at once, and introduced a related documents search that uses data about names and places provided by OpenCalais to find documents that are probably related to the one you’re looking at. We’ve also added a bit more context to the data we help reporters comb through. Most of this work is happening inside the gates of the DocumentCloud workspace, but it is resulting in some lively reporting. For example…
Using Documents to Tell the Story
This summer, as the federal 5th Circuit Court of Appeals prepared to hear arguments in a challenge to the University …
We added excerpts to our timelines and entities tab today. We are using OpenCalais to parse every document you upload and extract the names of people, organizations, terms and places in the text. We display these under the Entities tab. We also extract date information from each document that you upload and plot those on a timeline which you can access from the Analyze menu.
DocumentCloud’s entities show you, at a glance, the people who are mentioned the most times in a given project, or the organizations that are named in each of the documents you’ve selected. Here’s how it works now:
Click on the “show pages” link next to any entity to reveal a thumbnail of each page in each document that contains that term, alongside excerpts highlighting the mention of the entity in the text. Clicking on the highlighted phrase will take you directly to the term within the document itself. In the screenshot below, you can see how the Environmental Protection Agency was correctly identified by both its proper name and its acronym. Continue reading »
You’ve always been able to script batch uploads using our API, but for users without coding skills, uploads were one at a time. Today we’re rolling out an improved document uploading dialog that will let you upload as many documents as you want, all in one fell swoop.
You’ll still use the “New Documents” button, but now that button takes you straight to a file selection dialog.
Use the control (on MS Windows) or command (on Macs) key to select additional documents, just like you would in your file browser.
We’ll start you off by suggesting a title, based on each file’s name, but you can edit that name and add additional information, including the source of each document and a description. As with the old upload dialog, you can decide right when you upload your document whether or not you’re ready to share it with the world yet. As ever, you can edit all of these fields again later.
If your documents share a common source or description, use the “Apply to All Files” link to copy your metadata to each document in this batch. Note: this new upload interface requires Flash. If that’s an issue for you, let us know ASAP and we’ll whip up an alternate interface that doesn’t require any plugins. Promise.
As the files upload from your computer to DocumentCloud, you’ll see the progress of each transfer.
Better Processing, Too
This week’s release is more than just a new upload dialog. We’ve made some big changes to Docsplit and the RightAWS gem, and we’re hoping this means the dreaded “import failed” error will be a thing of the past. If looking under the hood is your thing, both Docsplit and our fork of RightAWS are on GitHub for your viewing (and reusing) pleasure.
Don’t be a Stranger
If you have gigabytes worth of documents to upload, get in touch before you start uploading so we can add more horsepower to handle your job. Otherwise, happy uploading! And don’t forget to tell us about what you’re publishing with DocumentCloud.
Cross posted from PBS Idealab.
We opened the DocumentCloud floodgates less than six months ago and we’re still working hard to make DocumentCloud a better tool. We’re rolling out improvements at a healthy clip including SSL support, better documentation, and support for cross-newsroom collaboration. We continue to listen to feedback from our really incredible crop of beta testers (who now number close to 500!).
There are nearly 100 newsrooms participating in the DocumentCloud beta and requests are still pouring in. We’ve been doing a fair amount of outreach and more is in the works, but it turns out that our users are our best advocates: After John Addams in Great Falls, Montana, blogged about his experiences …
If you log in to DocumentCloud this morning, you’ll notice a new menu, entitled “Analyze”. We’ve gathered the various analytic tools under one roof here — to view the entities for selected documents, or display a timeline — and added a major new one: Related Documents. If you’re working on a story, and you just uploaded a material document, you can use that document as a jumping-off point to find other public documents about the same subject. Select a document, open the “Analyze” menu, and click “Find Related Documents”:
Related documents can span all documents visible to your account, including documents that other organizations have made public.
Under the hood, we are using a technique known as tf/idf, which measures document similarity by looking at the “important” words across a set of documents. The importance of each word is evaluated by taking the frequency of the word in a particular document and dividing it by the frequency of the word in the collection of documents as a whole. In this manner, commonly used words drop out of the index, and distinctive words obtain greater importance. The search engine we use for DocumentCloud, Lucene, has this type of search built in.
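For the curious, here is a minimal Python sketch of the tf/idf idea described above. It is an illustration only, not DocumentCloud's code and not Lucene's actual scoring formula: it assumes the common logarithmic variant of inverse document frequency and ranks documents by cosine similarity. The function names (`tokenize`, `tf_idf_vectors`, `cosine`, `related`) and the three sample texts are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document.

    Weight = (word count / document length) * log(N / document frequency).
    This log-idf form is an assumption; real search engines use variants.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(docs, index):
    """Indexes of the other documents, most similar first; zero scores dropped."""
    vectors = tf_idf_vectors(docs)
    scored = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]


# Hypothetical sample texts, invented for this sketch.
docs = [
    "the court issued an order on foster care",
    "the foster care agency ignored the court order",
    "the ballot lists candidates for the primary election",
]
print(related(docs, 0))
```

In this toy collection the word "the" appears in every document, so its weight is zero and it contributes nothing to the ranking, while words like "foster" and "court" tie the first two texts together.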
We’re still at work on improving this feature. At the moment, you’ll notice two things: there is a long tail of barely-related documents that follows the first page of results, and shorter documents (1-3 pages) may find no related documents whatsoever. But for most documents with high-quality text, you’ll find that the related documents at the top are very relevant.
There’s one more thing that we released at the same time: a panel that allows you to edit all the information that describes your documents (title, source, description, access level) at a stroke. To use it, click on the pencil icon that now appears next to any document. Naturally, you can also select multiple documents and edit all of their attributes simultaneously.
The Document Viewer has always supported the ability to create “page notes” — annotations that sit between two pages and provide commentary about a specific page as a whole or an introduction to a new section of a document. This morning, we released an update to DocumentCloud that provides a way for you to create page notes from within the viewer.
To try it out, open a document you’ve uploaded and click on one of the “Add a Note” links in the sidebar. Hover your crosshair over the margin in between pages, and you’ll see a dotted line appear, with a note tab on the left:
If you click, you’ll create a page note in between the two pages. Add a title and some text and click the “Save” button. The note’s title will appear in the navigation on the right. Of course, page notes are viewable and editable from the workspace, just like any other.
If you’re logged in, you can take a look at the sample document shown here.
Monday morning we rolled out SSL support on DocumentCloud.org — visit https://www.documentcloud.org to view, browse and edit documents in your workspace over an encrypted connection. When you use HTTPS, all traffic between your computer and DocumentCloud is encrypted before it’s sent over the internet. If you’re working on a public wireless connection, are on an unsecured network or are dealing with highly-sensitive documents, we recommend using HTTPS.
You can tell if you’re on a secure connection by looking at your browser. When visiting a secure website, all browsers display a lock icon somewhere on the window. Here’s what the lock looks like in Google Chrome:
More Search Parameters
We’ve also added new ways to filter your DocumentCloud searches. You can now use “access” to filter your documents by their access level, and “projectid” to designate a specific project when you’re using our search API. (Access to searches and the API is limited to registered users during the beta.)
To view only your private documents in a particular project, you can add “access: private” to your search terms. Searching by “access: public” will show you only public documents, while “access: organization” will show you those documents shared within your organization.
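If you assemble search strings in a script, a small helper can keep the filters consistent. The Python sketch below only builds the query text; it does not call the API. The function name `build_search_query` is hypothetical, and writing the project filter as “projectid: 42” is an assumption that mirrors the “access: private” form shown above, so check it against the API documentation before relying on it.

```python
# The three access levels named in this post.
ACCESS_LEVELS = {"private", "public", "organization"}


def build_search_query(terms, access=None, project_id=None):
    """Combine free-text terms with optional access and project filters.

    The "projectid: <id>" syntax is assumed by analogy with "access: <level>".
    """
    if access is not None and access not in ACCESS_LEVELS:
        raise ValueError("unknown access level: %r" % (access,))
    parts = [terms] if terms else []
    if access is not None:
        parts.append("access: %s" % access)
    if project_id is not None:
        parts.append("projectid: %s" % project_id)
    return " ".join(parts)


print(build_search_query("foster care", access="private"))
```

Rejecting unknown access levels up front means a typo fails loudly in your script instead of quietly returning an unfiltered result set.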
Already using the search API? We’ve added search terms that let you limit public results to a single project. Drop a line to support AT documentcloud DOT org if you’d like to take advantage of this one.
Still waiting for an important feature? Let us know!
These improvements are only available to users who have an account on DocumentCloud. If you’re a reporter who works with primary source documents, and you’re not using DocumentCloud yet, contact us to find out how to start.
For some time now, instructions about how to use the DocumentCloud workspace have been available through our wiki. This morning, we released an update that pulls the help pages right into the workspace for easy access, and hopefully makes it faster to get your questions answered. Continue reading »
IE6 has long been the bane of web developers: developing web applications that work as well in IE6 as in other browsers is substantially more difficult than bypassing the ten-year-old browser.
IE6 users will still be able to download the original PDF of any document and will see a landing page that encourages them to upgrade their browser or install Chrome Frame.
The New York Times, with whom we continue to collaborate closely on development of the viewer component of DocumentCloud, has long planned to phase out support for IE6. They don’t test new tools against the browser and will soon update to the same version of their document viewer that DocumentCloud is running on. The Times isn’t alone: YouTube began phasing out support for IE6 in March and other Google products are expected to follow suit. We’re certainly open to feedback on our implementation.
Meantime, take a look at some of the great things reporters are doing with DocumentCloud.
Our third hire! Developer Samuel Clay joins DocumentCloud today, bringing our full time staff to a total of three.
Samuel joins us from Storybird, a collaborative storytelling startup which works with artists to give children access to high quality narrative art that they can use to publish their own original stories. He’s also the mastermind behind NewsBlur, an open source feed reader that uses artificial intelligence to suggest stories you might want to read. Think of it as an RSS reader with intelligence.
Samuel lives in Brooklyn with his dog and guinea pigs, where he photographs historic districts for New York Field Guide. Find him at firstname.lastname@example.org or on twitter.