With content consumption continuing to shift to mobile, we’ve been spending a considerable amount of time thinking about how to help our users tell stories with documents across a range of devices. Loading and reading a 50-page PDF on a desktop computer is an experience that, for many reasons, doesn’t translate well to a smartphone. So, what’s a better way to point readers to what matters in a document when it appears on one of your mobile pages?
Today, we’re pleased to announce our first step in moving toward a better mobile PDF experience: DocumentCloud Page Embed. It’s a lightweight, responsive viewer that highlights a single page, along with your annotations, and works across desktop and mobile. It’s available in our workspace right now. Look for “Embed a Page” under the Publish Menu and use the wizard to generate the code. See our Help documentation for details.
Here’s an example of Page Embed showing a page from an investigative report on a fire that injured several firefighters in Northern Virginia. The page has one note highlighted:
Behind the development
Why a page-focused embed? We noticed that publishers often embed simple screenshots of document pages. While easy to use and natively responsive, images deny readers the rich context of an embedded DocumentCloud document: annotations, searchable page text, and of course access to the original source document itself. While our full document viewer offers all these options, it’s overkill for presenting a single page.
Our new page embed strips the cruft from the document viewer and lets the reader focus on the page (and your annotations, if you’ve added any). We know how important mobile has become for document publishing, so we’ve made the page embed natively responsive. The entire interface resizes and changes capabilities with its surrounding context.
Extending across the platform
In fact, we’re so happy with the page embedder, we’re using it as the foundation for our next-generation full document viewer. Our goal is a responsive, extensible viewer with minimal interface chrome that lets readers view the document and your annotations and then get right back into the story. In other words, our focus is keeping their focus on you.
We still have plenty of improvements and additions in the pipeline. We’ll soon add navigation to all the pages and text in the document, more immersive notes, more customization options, and better performance. We also plan to support page embeds with our oEmbed API and WordPress plugin. You can go ahead and start using it today, but keep an eye on the wizard for new options and capabilities in the near future.
Your thoughts are welcome! Please send any feedback to email@example.com or open an issue on our GitHub repo.
Hello and happy summer! We’ve been busy here at Team DocumentCloud, using the weeks since meeting many of you at IRE 2015 and SRCCON to focus on building a stronger platform and getting in position to ensure the long-term sustainability of DocumentCloud.
There’s lots in motion, and so here’s a quick update on the highlights:
Milestone: 2 million docs
Thank you for keeping us busy! In July, the total number of files uploaded to DocumentCloud passed 2 million, and our platform now holds more than 27 million pages of the documents you’ve gathered. The numbers keep growing as more news organizations join us – more than 1,400 worldwide right now – and as more people use our API for bulk uploads. Keep those documents coming (and we always appreciate a tip at firstname.lastname@example.org if you’re planning a big drop)!
A mobile-optimized viewer embed
Whenever we chat with our users, the most-requested feature for DocumentCloud is a better experience for viewing our embeds on phones. Well, we’ve heard you and have been busy developing documentcloud-pages, a new responsive embed type that displays a page with minimal chrome but also allows navigation through the entire document. We’re aiming to launch an early version in late August or September; if you’d like to contribute code or issues, please visit the project repository. In fact, we’re excited that the folks at La Nacion are already using the new embed in their Doc2Media project.
Get tips and updates in your mailbox!
We’re here to help, and soon we’ll launch two newsletters filled with info on how to get more out of your DocumentCloud account. News & Tips will highlight new features (and ones you might not know about) plus tips for publishing, collaborating and working with documents. App Developers will offer information for developers working with our API or building news apps based on our open source components. Both newsletters also will highlight great uses of DocumentCloud from around the world. You can sign up now.
An OpenCalais update
Since the launch of DocumentCloud, we’ve used Thomson Reuters’ OpenCalais API for our entity extraction. There’s a new version of the API, and we’re migrating to it this month. There won’t be any immediate difference in how we display entities, but we’re looking at whether the new API may offer us some new features. Stay tuned.
Welcome, Clay Selby, to the team!
There’s a new face at Team DocumentCloud’s daily scrum: Clay Selby of Austin, Texas, joined us in July as a part-time developer (thereby doubling our Texas staff). Clay’s the founder of email marketing startup SocialRest.com and brings a good dose of entrepreneurial experience along with his coding chops. Initially, Clay’s been working on moving us to the new OpenCalais API, but he’ll also be focusing on a lot of our back-end processing improvements.
Out and about
We’ve seen many of you in the last few months at various places on the map, from IRE 2015 in Philadelphia – where we held a hands-on class and talked to hundreds who stopped at our table – to SRCCON in Minneapolis. We had the fortune to show off DocumentCloud to students at Medill/IRE’s National Security Journalism Data/Watchdog Workshop in Washington, D.C., and we checked in with a couple of our user newsrooms as well. If we’re in your area and you’d like to get together, let us know!
A sustainable DocumentCloud
Last but not least, our current Knight Foundation grant directs us to find ways to make DocumentCloud financially sustainable. Since launch in 2010, thanks to Knight, our service has been offered to journalists for free. In the spirit of improving journalism and making reporting more transparent, we’re intent on maintaining a level of free access to DocumentCloud for journalists while developing a pricing model around features of the platform. In addition, thanks to a new account signup page, we’re hearing from many outside of journalism who’d like to use DocumentCloud, and we’re exploring that option. In the weeks ahead, we’ll be reaching out to many of you to discuss our plans.
Thanks for reading, and as always, we have several ways for you to get in touch or follow our progress:
Good news – or, if you speak Danish, gode nyheder! Starting today, Danish-speakers can set the DocumentCloud workspace to default to their native language, thanks to translation help from Nils Mulvad, editor at Kaas & Mulvad and associate professor at The Danish School of Media and Journalism.
The addition of Danish increases the number of workspace translations to five. Along with English, we also support Spanish (thanks to work by Fernando Diaz), and Russian and Ukrainian (thanks to Roman Kolgushev for both).
Widening our language support remains an ongoing mission at DocumentCloud, part of our commitment to making our platform accessible to journalists around the world. As we wrote in March when we added OCR support for three additional languages, DocumentCloud language support falls into three categories: text search, entity extraction and workspace translation. We also have work under way to support additional languages in the document viewer.
Thanks to recent work by our development team, we’ve made it easier for collaborators to translate our workspace into more languages – and we’re looking for help! If you’re interested in helping bring DocumentCloud’s workspace to your language, please email us at email@example.com.
Starting today, DocumentCloud users can choose three additional languages to OCR uploaded documents: Hungarian, Norwegian and Swedish. We’ve added the three based on support requests and feedback we heard during last week’s NICAR conference in Atlanta.
The addition brings the number of languages available for OCR to 17. To see them all, click “Manage account” beneath your user name and click the “New Documents” dropdown under “Language Defaults.”
We believe that journalists around the world should have access to tools that enable better reporting, and growing our language support is critical to that. DocumentCloud’s language support falls into three independent categories (so partial support of your language may be possible):
Text Search: DocumentCloud fully supports Unicode, allowing users to search documents in a variety of character sets. For scanned documents that require OCR, DocumentCloud uses the battle-tested open-source Tesseract engine, which also powers Google Books and is maintained by Google. The open-source community has contributed language packages for many widely used languages, which allows DocumentCloud to enable them on our platform. So, if DocumentCloud does not yet support your language, please reach out and let us know about your interest!
Entity extraction: DocumentCloud supports identifying people, places, organizations and other entities through OpenCalais. As of now, OpenCalais only supports English, French and Spanish. We are evaluating other tools that would allow us to bring entity extraction to other languages.
Interface translation: Accessibility to our tools is more than being able to process non-English documents. Our users have already collaborated with us to translate DocumentCloud’s Workspace user interface into four languages: English, Spanish, Russian and Ukrainian. If you are interested in bringing DocumentCloud to your language, please email us at firstname.lastname@example.org.
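To make the OCR side of this a little more concrete, here is a minimal sketch of how a language choice might be passed along to Tesseract. It only builds a command line for the `tesseract` CLI, which takes the language through its `-l` flag; the `TESSERACT_CODES` mapping and the `ocr_command` helper are hypothetical names used for illustration, not DocumentCloud's actual processing code.

```python
# Hypothetical mapping from a language name to the three-letter
# code Tesseract uses for its language packs.
TESSERACT_CODES = {
    "English": "eng",
    "Hungarian": "hun",
    "Norwegian": "nor",
    "Swedish": "swe",
}


def ocr_command(image_path, output_base, language):
    """Build (but do not run) a tesseract command for one page image."""
    code = TESSERACT_CODES[language]
    # tesseract <input image> <output base name> -l <language code>
    return ["tesseract", image_path, output_base, "-l", code]


command = ocr_command("page-1.png", "page-1", "Swedish")
```

The resulting list could be handed to `subprocess.run` on a machine where Tesseract and the matching language pack are installed.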
If you’d dropped into the DocumentCloud workspace in Columbia, Mo., at the start of January, you’d have found at least two things: a team actively avoiding the single-digit temps outside the office and a whiteboard that we frequently filled with ideas, photographed for posterity, erased and filled up again.
We ended up ignoring the temps. The ideas generated enough heat to last us until summer – and beyond!
So, what’s ahead for 2015? We’ve spent the last year researching and reflecting on what you want, both as a buildup to the recent $1.4 million Knight Foundation grant and to make sure you’re as happy as possible with the service. We want to be sure the platform is fast, reliable, and enables you to do your best work.
There’s a legacy to continue. DocumentCloud was founded by and for journalists to support in-depth reporting around public documents. Today, more than 900 news organizations worldwide use the platform. Whether it’s publishing the documents related to doubts about the guilt of a Texas man executed for murder or the grand jury testimony regarding the death of Michael Brown in Ferguson, Mo., journalists use DocumentCloud to give their readers a first-hand view of the primary source documents they gather.
We plan to build from that success. The DocumentCloud team’s expanding, and we’re locking in a roadmap to grow and improve the platform. We’ve recently hired a director of product development, and we’ve just posted a job description for a front-end developer. With a bigger team and lots of focus, we believe that by the end of the year, you’ll see substantial improvements and expanded offerings that will maintain DocumentCloud’s place as an essential reporting and presentation tool.
So, here are some things we’re going to do to help you:
— Improved processing. If you use DocumentCloud, you upload documents to the platform. You want them processed quickly, without errors, so you can get right to publishing or annotating. Over the last year, we’ve made substantial improvements to our processing cluster and sped up imports of popular documents uploaded by multiple users. And we’re glad that you’re noticing the results. We have more work planned – look for a blog post soon detailing the changes.
— Go mobile. We know (and so do you) that more of your readers view more of your content on their phones. So, we’re planning mobile-specific changes to the viewer to improve scrolling, zooming and the experience in general.
— More storytelling tools. We’re exploring ideas for expanding the display options for the viewer, such as presentation templates, social media sharing, notes displays and more. Many of you have asked for oEmbed support, and we’re looking closely at that!
— Telling DocumentCloud’s story. We’ll bring you more blog posts like this, keeping you up to date on our progress and listening for your ideas. We’ll also tell you more about how to make better use of all the site has to offer — such as deeper search options — and highlight great examples of your storytelling.
— Expanded reach and premium offerings. We want to make sure DocumentCloud is going to be available to journalists for years to come, and one way to do that is for the platform to begin generating revenue. This goal is part of the Knight grant, and so we’ll be exploring options – premium features, opening the tool to additional types of users on a fee basis, donations, and other ideas.
Beyond those, we have many more ideas, among them better feedback on document processing, ability to rotate pages, better organization of the site and workspace, and batch processing options.
That’s a lot to chew on in one year, but with our team expanding – and, we hope, with continued contributions from the open source community – we’re pretty excited about the prospects.
As always, let us know your thoughts! You can reach us on UserVoice, Twitter, or email.
Whenever you upload a document to DocumentCloud we send the contents to OpenCalais, a service that discovers the entities (people, places, organizations, terms, etc.) that are present in plain text. OpenCalais can tell us that “Barack Obama” is the same person as “President Obama”, “Senator Obama”, “Mr. President” … and even “he” or “his” in clauses like “his policy proposals”.
Last month, we stopped indexing entities for faceting because DocumentCloud has reached the point where our search index can no longer support the strain of keeping track of the millions of unique entities stored in our database. We still hope to bring back some form of entity faceting — a feature you may remember as the “Entities” tab — using a different implementation in the future. But for the time being, we have added a new feature that allows you to easily browse through all of the entities associated with a document:
The entities are displayed in a chart that shows how often each entity occurs across each page. Using this chart, you can see which companies and individuals tend to be mentioned together frequently, or which parts of a long document concern a certain topic. Hover over any mention (the small gray boxes) to see the surrounding context, and click on it to jump directly to that mention within the document itself.
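The data behind a chart like this can be thought of as a count of mentions per entity per page. The sketch below is only an illustration of that idea, not the code that draws the chart: the `mentions_by_page` and `pages_shared` helpers and the sample data are made up for the example.

```python
from collections import Counter, defaultdict


def mentions_by_page(mentions):
    """Count how often each entity is mentioned on each page.

    `mentions` is an iterable of (entity, page_number) pairs.
    """
    counts = defaultdict(Counter)
    for entity, page in mentions:
        counts[entity][page] += 1
    return counts


def pages_shared(counts, first, second):
    """Return the sorted page numbers on which both entities are mentioned."""
    return sorted(set(counts[first]) & set(counts[second]))


# Made-up sample data, for illustration only.
sample = [
    ("Acme Corp", 1), ("Jane Doe", 1), ("Acme Corp", 1),
    ("Acme Corp", 3), ("Jane Doe", 3), ("Acme Corp", 7),
]
chart = mentions_by_page(sample)
together = pages_shared(chart, "Acme Corp", "Jane Doe")
```

Entities that keep landing on the same pages are the ones that "tend to be mentioned together," which is the pattern the chart makes visible at a glance.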
If you want to try out an example, here is a link to a recent document that ran with a disability fraud story in today’s New York Times. Right-click on the document and choose View Entities from the context menu, or select the document and choose View Entities from the Analyze menu.
We’re still polishing these charts, so let us know if you have any ideas for improving them, or ideas for other ways that we can make extracted entities more useful for your reporting.
Users who tried to search for pretty much anything on DocumentCloud this morning noticed pretty quickly that there was something not quite right on our servers. The short story is this: the problem was caused by human error and our servers are in the process of rebuilding the index that failed.
The longer story, for those of you who’ve been tracking updates about our search outage, is this: Continue reading »
DocumentCloud now supports advanced boolean search queries, allowing you to more easily perform searches that hone right in on the documents you’re trying to find. You may be familiar with boolean operators from other search engines, but here’s a quick refresher on the available options:
- and: both terms must exist in the document. Example: Perry and Romney
- or: either term may match. Example: indicted or accused
- !: the term must not exist in the document. Example: obama !barack
- *: a wildcard to match any sequence of letters. Example: J*e Smith (matches Joe, Jane or Jake Smith)
- ( ): group words together into a term. Example: (Perry or Romney) and governor
Here’s an example of what that last search looks like in action:
Behind the scenes, we’re using the latest stable release of the open-source Solr/Lucene search engine (3.4.0). It includes a new query parser called “edismax” that adds boolean operators to the previous implementation of full text search.
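If you assemble searches in code, the operators described above compose as plain strings. Here is a small sketch under that assumption; the `all_of`, `any_of` and `exclude` helpers are hypothetical conveniences, not part of DocumentCloud, and the `q` parameter name is likewise assumed for illustration.

```python
from urllib.parse import urlencode


def all_of(*terms):
    """Join terms so that every one must appear in the document."""
    return " and ".join(terms)


def any_of(*terms):
    """Group terms so that any one of them may match."""
    return "(" + " or ".join(terms) + ")"


def exclude(term):
    """Mark a term that must not appear in the document."""
    return "!" + term


grouped = all_of(any_of("Perry", "Romney"), "governor")
negated = "obama " + exclude("barack")

# URL-encode the query for use as a (hypothetical) `q` parameter.
encoded = urlencode({"q": grouped})
```

Here `grouped` is the same grouped search shown in the list of operators, and `encoded` is that search made safe to place in a URL.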
Give boolean searches a spin, and let us know if they’re working well for your ongoing projects.
Updated! How I left MuckRock out is beyond me. There may be more updates as appropriate.
If you’re new to programming, looking at what others have done is probably the best way to get your bearings. DocumentCloud is no exception. You asked for more, better API examples. We’re long overdue for a roundup of some of the great tools DocumentCloud users have built on our API or otherwise poked their heads under the hood. Continue reading »
On large document-driven projects, newsrooms often bring together teams of collaborators that include independent researchers who aren’t formally part of the newsroom. Newsrooms that want a research team to evaluate thousands of documents — more than our collaboration tools are designed to accommodate — can take advantage of our new access level: the freelancer. A “freelancer” can upload, annotate, and edit documents like any other user, but they can only access documents you’ve explicitly shared with them.
To add a user (or ten) who is going to be contributing reporting but shouldn’t have access to the rest of your newsroom’s documents, you can create an account for a freelancer.
Freelancer accounts are good for anyone that you regularly work with, but who doesn’t actually work for your organization, or for folks you’re bringing together on a single reporting project.
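In code terms, the rule for freelancers boils down to one extra check. The sketch below is a simplified model for illustration only: the `User` and `Document` classes, the role names, and the `can_access` function are all hypothetical, and it ignores the other access levels a document can have.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    organization: str
    role: str = "contributor"  # hypothetical roles: "contributor" or "freelancer"


@dataclass
class Document:
    title: str
    organization: str
    # Names of users this document has been explicitly shared with.
    shared_with: set = field(default_factory=set)


def can_access(user, doc):
    """Return True if the user may open the document."""
    if user.name in doc.shared_with:
        return True
    if user.role == "freelancer":
        # Freelancers see only what was explicitly shared with them.
        return False
    return user.organization == doc.organization


staffer = User("ana", "Daily Planet")
freelancer = User("ben", "Daily Planet", role="freelancer")
budget = Document("City budget", "Daily Planet")
```

With this model the staffer can open the budget right away, while the freelancer can open it only after `budget.shared_with.add("ben")`.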
For more information, check out our accounts documentation.
DocumentCloud’s users are a diverse lot, to say the least. In some newsrooms skilled programmers are busy writing Python wrappers for our API, while in others users are embedding documents with no programmers to be found.
If you’re trying to integrate DocumentCloud with blogging software like WordPress or Blogger, we can help. Continue reading »
We’re thrilled to release a feature that has been simmering on a back burner since we launched DocumentCloud: more metadata! Using our document data tools, DocumentCloud users can tag documents with any values you need to store or search by.
Organize hearing transcripts by committee. Tag a stack of emails with information about who sent and received them. Add FOIA request numbers or the date a published document was originally retrieved, and you’ll know much more about document provenance at a glance. Continue reading »
As DocumentCloud becomes more deeply embedded into the reporting workflow for many newsrooms, we hear more and more requests for improved document redaction tools. We expect each newsroom to adhere to their own policies about what kind of information is or is not suitable to reveal, but if you’ve used DocumentCloud to analyze records that contain home phone numbers, private details about minor children or personal information that isn’t appropriate for publication, you probably want to redact those documents before you publish them. Continue reading »
Every once in a while, DocumentCloud gets hit with the kind of document stash that really slows us down. We can take a lot, but if one newsroom finally gets a 25,000-page FOIA turned over to them and another gets hold of 30,000 pages of documents for a breaking news story on the same afternoon, that’s a volume that will tax our servers.
We recently established a “fast lane” to ensure that smaller documents don’t have to get in line behind behemoths, but that doesn’t help if you’ve got a few MB of documents about a local scandal — you’ll still have to shuffle into line with the big sets. Continue reading »
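For readers curious about the general idea of a fast lane, here is a minimal sketch with two queues, where small jobs are always picked up first. It is an illustration under stated assumptions rather than our actual processing code: the `ProcessingQueue` class and the `FAST_LANE_MAX_PAGES` cutoff are hypothetical.

```python
from collections import deque

# Hypothetical cutoff; the real threshold is not stated in this post.
FAST_LANE_MAX_PAGES = 50


class ProcessingQueue:
    """Two first-in, first-out lanes: small documents skip ahead of big ones."""

    def __init__(self):
        self.fast = deque()
        self.slow = deque()

    def submit(self, name, pages):
        """Queue a document in the lane that matches its page count."""
        lane = self.fast if pages <= FAST_LANE_MAX_PAGES else self.slow
        lane.append(name)

    def next_job(self):
        """Return the next document to process, or None if both lanes are empty."""
        if self.fast:
            return self.fast.popleft()
        if self.slow:
            return self.slow.popleft()
        return None


queue = ProcessingQueue()
queue.submit("foia-dump", 25000)
queue.submit("memo", 3)
first = queue.next_job()
```

Even though the memo arrived second, it is processed first. Note that anything over the cutoff, however modest, still waits in the same lane as the behemoths.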
Sets of documents are nothing new to DocumentCloud.
The Las Vegas Sun published hundreds of pages of legislation, emails, court filings and medical records alongside their award-winning package on hospital care in Las Vegas. The Sun’s Marshall Allen assembled each document collection by hand to produce that page. Plenty of other newsrooms have used our API to do likewise. Even with the API it isn’t trivial to assemble and publish a set of documents.
Some document sets are living creatures that continue to grow: Chad Skelton at The Vancouver Sun has been adding documents retrieved from the local ferry authority’s website to a growing cache of public records on DocumentCloud. The only way to ensure his readers will find new documents as they roll in is to point the public straight to DocumentCloud to find ferry authority FOIAs. It should be easier to embed that growing set of public documents right at The Vancouver Sun.
Samuel Clay has been working very hard to make that possible for every DocumentCloud newsroom. Continue reading »
From inviting a law professor to help Arizona readers understand recent legislation to asking some top-notch designers to review New York’s new ballot, DocumentCloud users have already found some great ways to bring experts from outside the newsroom in, and we thought it was time to make it much easier to do just that.
We spent some time at ONA last year, brainstorming with the good folks from the Public Insight Network — they really helped us distill this into a workable feature. We’re looking forward to seeing PIN newsrooms do some great reporting aided by this new feature. Continue reading »
The document upload dialog we rolled out in January allowed users to upload multiple documents at once and incorporated some nice touches like a progress bar that tracked your uploads. The Flash uploader was a great move forward for us, but a handful of our users were having trouble with it, so we’ve rewritten it in open standards-based HTML5.
Today, we pushed out an update to the file uploader. Continue reading »
We’ve long allowed users to access DocumentCloud’s tools over an encrypted HTTPS connection. Now, we’ve made it mandatory. Next time you log in to your DocumentCloud account, you will be redirected to the secure version of your workspace.
When you use an unencrypted HTTP connection to access a website, your request and the site’s response are all sent over the network in readable clear text, which is trivially easy to intercept and read. Without HTTPS, it is actually possible for someone to hijack your connection to DocumentCloud, inserting or altering the content that you’re viewing. So we think HTTPS is worth using.
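The redirect itself amounts to swapping the scheme of the requested address and sending the browser there. As a rough sketch of that one step (not our server configuration), the hypothetical `force_https` helper below rewrites a URL with Python's standard library, assuming the address carries no explicit port.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return the same URL with an http scheme upgraded to https."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a named tuple, so _replace returns an updated copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


secure = force_https("http://www.documentcloud.org/home?tab=search")
```

A URL that already uses HTTPS passes through unchanged, so the helper is safe to apply to every incoming address.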
If you’re interested in the technical subtleties of implementing SSL, read on. Continue reading »
You already know you can link directly to any page or annotation. Now you can embed documents so that they’ll open to any page or annotation, too. If you want to point your readers to the shocking revelation on page seventy-five or open the viewer directly to a key annotation, check out our new embed dialog.
Select any document and choose “Embed Document Viewer” from the “Publish” menu, and you’ll find a new configuration option:
We build features like this because our users ask for them. What do you need DocumentCloud to do?
With close to 200 newsrooms contributing documents and thousands of documents in our catalog, we decided it was time to open DocumentCloud to public searches.
Wondering who is still covering the Deepwater Horizon oil spill? Try a search for “deepwater horizon” organization: transocean, and see documents that reference both the rig by name and the drilling contractor, Transocean. Then, click on the “Entities” tab to see more data provided by OpenCalais’ entity extraction.
Did you miss Memphis Commercial Appeal’s coverage of Ernest Withers? Catch up with a search for group: commercial-appeal withers, and find every document uploaded by reporters in the Commercial Appeal newsroom that mentions Withers by name. Curious to see the annotations journalists have been making on the documents they’re sharing? Try a search for filter: annotated and you’ll skip any documents that were published without annotations.
There’s plenty more you can do with DocumentCloud’s search syntax. Check out our primer and try a few searches.
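To show how searches like the ones above break down, here is a small sketch that separates the field: value parts of a query from the free text. It is an illustration only, not how our search engine actually parses queries; the `FIELD` pattern and `parse_query` function are hypothetical.

```python
import re

# A field name, a colon, optional spaces, then a quoted phrase or a single word.
FIELD = re.compile(r'(\w+):\s*("[^"]*"|\S+)')


def parse_query(query):
    """Split a query into (field, value) pairs and the remaining free text."""
    fields = [(name, value.strip('"')) for name, value in FIELD.findall(query)]
    leftover = FIELD.sub("", query)
    # Collapse the whitespace left behind by the removed fields.
    return fields, " ".join(leftover.split())


rig = parse_query('"deepwater horizon" organization: transocean')
paper = parse_query("group: commercial-appeal withers")
notes = parse_query("filter: annotated")
```

In each case the field narrows the pool of documents (by entity, by newsroom, or by whether notes exist) and whatever is left over is treated as ordinary text to search for.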
We’d love to know what you think, and what you’ve found.
PS. Finding bugs rather than documents? We want to know about those, too.
We get a decent number of inquiries from journalism schools interested in incorporating DocumentCloud into their coursework. That’s great, it really is. If you take a look at our list of document contributors, you’ll see a nice collection of journalism schools, student reporting projects and investigative reporting institutes. We absolutely welcome journalism schools.
That said, there are a few things worth knowing before you contact us. Continue reading »
With WikiLeaks in the news, there are a few questions (two, actually) that we’ve been asked rather frequently of late, questions we hadn’t anticipated in our original list of frequently asked questions. Questions like …
Is DocumentCloud the new WikiLeaks? Isn’t OpenLeaks just a Swedish DocumentCloud?
No, not really. We’re both nonprofits dedicated to publishing data and documents, but that’s about it.
To join DocumentCloud, you need to be a journalist, or work a lot like one. Our goal is to help reporters publish more source documents and to build a catalog of primary (and secondary) source documents that individual journalists have researched and written about: we expect our users to be uploading documents they’re reporting on. Document contributors make a commitment to us that they’re confident of the authenticity of the documents they upload. And every user tells us their name — it goes right on every document. Continue reading »
Ever since we added the “pages” tab to the document viewer, we’ve wanted to find a way to bring the convenience of browsing through a document’s pages back into the workspace itself. If you log in to DocumentCloud this afternoon, there is now a grid of page images that can be displayed by clicking on the page number at the bottom of each document, or by right-clicking a document and choosing “View Pages.” We flag each page that contains a note with a yellow tab, so you can easily spot them among hundreds of other pages.
Try it on a few of your documents, and let us know what you think!
Fine-tuning text, adding, removing and reordering pages: when we embarked on this project, tools for altering documents were the furthest thing from our minds. A responsible journalist doesn’t tweak source documents! Yet one of the first papers to embed material using DocumentCloud needed to do just that. The Chicago Tribune accompanied their coverage of a troubled foster home with a collection of letters and court orders. Though the documents offered an excellent illustration of the state child services agency’s lax oversight and slipped follow-ups, they were predictably full of personal information about children in the foster care system, individual agency staff names and other personal and identifying details about private individuals that The Trib opted to omit from their reporting. That decision, however, left the news apps team replacing the whole stack of letters multiple times before the package was finally ready to post.
A tool, right inside of DocumentCloud, for replacing, removing and reordering the pages of a document would have helped them a lot. Continue reading »
We’ve added a “pages” tab to document viewers in our workspace and embedded on news sites. The new tab offers a bird’s-eye view of an entire document. This new tab, which now appears in your document viewer right next to the “document” tab, allows you to browse a document more quickly by showing you thumbnail images of every page. For long documents, this tab allows you to identify exactly where you want to go in the document without having to scroll and search repeatedly until you find a specific section in the document. Continue reading »