Latest Updates: Our Blog

Author Archive

How We’re Speeding Up DocumentCloud

Posted
May 2nd, 2016

Tags
Code

Author
Anthony DeBarros

Improving DocumentCloud’s speed and reliability has been a major focus of our team’s development time in the last year, with as much attention spent on back-end processing as on more-visible changes such as re-styled notes, a mobile-ready page embed, and enhancements to our WordPress plugin.

We thought you’d like to know about some of that work — from caching to a beefier database server to compressing the files we serve — and how it plays out every day in faster load times, a more stable platform, and improved performance during high-traffic situations.

Here’s a rundown:

Content Delivery Network: Last fall, we began serving PDFs, text, and images from Amazon’s CloudFront content delivery network instead of directly from Amazon’s S3 storage. You might have noticed the change when we switched URL subdomains from s3.* to assets.*, as in https://assets.documentcloud.org/documents/282753/lefler-thesis.pdf.

One benefit of using CloudFront is that we now serve files from multiple places around the globe. That means our users and readers in Europe, Asia and South America now see faster load times. We’re very happy to support better performance this way as our worldwide user base grows.

Compressed assets: DocumentCloud assets are now served compressed, a step developers had requested to improve performance. That means web browsers don’t have to download as many bytes to load our viewer and the code around it. This speeds up viewing and saves us some money on data costs — double win.
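To give a rough sense of the savings, here is a minimal sketch in Python. It assumes gzip as the compression scheme, which the post does not specify, and it uses a made-up snippet of repetitive JavaScript as a stand-in for viewer code.

```python
import gzip
from typing import Tuple


def compressed_size(payload: bytes) -> Tuple[int, int]:
    """Return (original_size, gzipped_size) in bytes for a payload."""
    return len(payload), len(gzip.compress(payload))


# Hypothetical sample: repetitive text such as JavaScript or CSS
# tends to compress well.
sample = b"function renderPage(page) { return page; }\n" * 200

original, compressed = compressed_size(sample)
print(f"{original} bytes uncompressed, {compressed} bytes gzipped")
```

Actual savings depend on the file type. Text-based assets such as scripts and stylesheets usually shrink far more than formats that are already compressed.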

Caching: We’ve improved caching on our main application server, which frees it up to do other work. We did this by using its NGINX web server as a reverse proxy cache, which keeps track of requests for documents and other resources and saves the responses. If we see multiple requests for the same document or resource, we serve the response from the cache rather than making the server generate the response all over again.

Previously, the server relied on Ruby on Rails’ built-in page caching. While that worked well in most instances, it couldn’t handle requests over a certain length, typically those containing a large number of options and parameters (such as certain API calls). Using NGINX as a reverse proxy cache removes that limit and considerably improves caching coverage and performance. For example, our new caching setup came in very handy when publicity around the Panama Papers pushed traffic to documentcloud.org to 25 times its usual level.
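The idea behind a reverse proxy cache can be sketched in a few lines of Python. This is a toy model rather than our NGINX configuration: the class, the `render` backend and the example path are all hypothetical, and a real cache also handles expiry, headers and storage limits.

```python
from typing import Callable, Dict


class ResponseCache:
    """Toy stand-in for a reverse proxy cache, keyed on the full request URL."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend            # the application server (hypothetical)
        self.store: Dict[str, str] = {}   # saved responses, one per URL
        self.backend_calls = 0            # how often the backend had to do work

    def get(self, url: str) -> str:
        # Only ask the backend when this exact URL has not been seen before.
        if url not in self.store:
            self.backend_calls += 1
            self.store[url] = self.backend(url)
        return self.store[url]


def render(url: str) -> str:
    """Pretend application server that generates a response for a URL."""
    return f"response for {url}"


cache = ResponseCache(render)

# A long URL with many parameters is cached like any other key.
long_url = "/example/resource?filter=" + "x" * 5000
for _ in range(3):
    cache.get(long_url)

print(cache.backend_calls)  # 1
```

Three requests for the same URL reach the backend only once. That is the work the cache takes off the application server.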

Database upgrade: Finally, we recently upgraded our database to PostgreSQL 9.4 and moved it to a more powerful server. Our database server gets a considerable workout recording data about each document uploaded, processed and served, and as our user base grew we were starting to hit limits on memory, storage and processing. Now, we have room to spare.

We have plenty of work ahead as we embark on improving the data model for handling accounts and create a new workflow for signing up. But we hope these less-visible back-end improvements help you get your work done faster and improve your readers’ experience with DocumentCloud.

Updating DocumentCloud’s Terms of Service

Posted
Apr 21st, 2016

Tags
Accounts

Author
Anthony DeBarros

DocumentCloud is updating its Terms of Service, and we’re giving our users time to review the update before it takes effect on May 25, 2016.

Before we highlight what’s new, we’d like to express our deepest gratitude to Dalia Topelson Ritvo and students at Harvard Law School’s Cyberlaw Clinic, based at Harvard’s Berkman Center for Internet & Society. The Cyberlaw Clinic gave our team invaluable assistance in writing our updated Terms of Service, along the way giving us outstanding guidance on legal best practices and answering many questions. Thank you!

DocumentCloud’s Terms of Service has remained relatively unchanged since we launched in 2010. Over time, we’ve added a number of features, and consequently users have found new ways to use DocumentCloud. This update better reflects DocumentCloud’s services today, and we’ll continue to update it as our services evolve.

Highlights of changes and additions:

  • Overall, we’ve updated the document’s organization and its language to reflect best practices for software products as well as current DocumentCloud features and capabilities.
  • Real names and email addresses are required for accounts. We allow one shared account per organization for organization-wide use, which can have an organization name. We also allow one machine account for automation or API use. (Section 2.2)
  • We specify that you may not sell, rent, or otherwise offer the Services to others without DocumentCloud’s prior written consent. (Section 3, paragraph (b))
  • We prohibit using DocumentCloud in any way that interferes with the operation of the service, impacts or harasses any other user, or circumvents our security protections. (Section 3, paragraph (m))
  • You represent that you have the right to contribute your content to DocumentCloud, and you grant us a license to that content in order for us to deliver our services. (Section 5)
  • We clarify that when you delete a document, we also delete it from our platform. If you redact a document, we erase all data related to the redacted information, create a new redacted document, and delete the original. (Section 7.1)

Thank you for reading through these points. We encourage you to read our updated Terms of Service plus our Privacy Policy and API Guidelines and Terms of Service. We’re proud to provide a platform that contributes to greater transparency in reporting and helps you find and tell great stories in documents. We expect our Terms to continue to evolve as the platform does, so check back to stay informed about updates.

A Node.js wrapper for the DocumentCloud API

Posted
Apr 7th, 2016

Tags
Code,People

Author
Anthony DeBarros

Thanks to Ryan Murphy of the Texas Tribune, there’s a new way to simplify using DocumentCloud’s API – a Node.js library aptly called node-documentcloud.

“Why should Ruby and Python get to have all the fun?” Murphy said, referring to the fact that coders for some time have been able to use python-documentcloud and the documentcloud RubyGem wrappers to work with the DocumentCloud API.

“The more I use Node.js, the more I like having the option to complete tasks in the language,” Murphy said. “DocumentCloud also has a relatively straightforward API structure, so it also seemed like a good opportunity to try building a client for the first time (something I’ve wanted to attempt for a while).”

DocumentCloud’s API is a powerful piece of the platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.

You can interact with our API via the programming language of your choice, but if you’re a user of Python, Ruby, or now Node.js, the wrappers around the API contributed by the open-source community provide many shortcuts over coding all the interactions yourself.

Murphy sees his library as a gateway to additional features and platforms around DocumentCloud.

“For example, one of the first spinoffs I’ve begun work on is a command line interface on top of node-documentcloud — something that would allow you to interface with the DocumentCloud client from your terminal,” Murphy said.

“Then, it could be as simple as something like documentcloud-cli upload <name_of_folder> to send a bunch of documents to the service. Or, documentcloud-cli download <document_id> to pull down a file. It’s still early going!”

Read all about the wrappers

You can learn more about node-documentcloud by visiting its documentation on GitHub or npm.

If you’re a Python or Ruby coder, take a look at:

python-documentcloud: From Ben Welsh of the Los Angeles Times’ data desk comes this full-featured API client for Python programmers. In addition to covering the basics, this library goes deep with providing details such as the location of annotations in a document. Documentation.

pneumatic: A Python bulk-upload library for DocumentCloud, written by Anthony DeBarros of the DocumentCloud team. Provides features including cataloging all the files uploaded and their URLs in a database. Documentation.

DocumentCloud: A RubyGem for interacting with the DocumentCloud API, created by Miles Zimmerman. Upload, search, retrieve data about documents. GitHub. RubyGems.

WRAL builds award-winning app with DocumentCloud API

Posted
Apr 7th, 2016

Tags
Code,Documents,People

Author
Anthony DeBarros

A big congratulations from the DocumentCloud team to Tyler Dukes, public records reporter at TV station WRAL in North Carolina. Dukes received a 2016 Sunshine Award from the North Carolina Open Government Coalition for work including a custom document-search application he built using the DocumentCloud API.

The API is a powerful piece of the DocumentCloud platform, a web service that lets you interact programmatically with resources such as documents, projects and entities. Various API methods let you upload files, create projects, update document data and embed assets via oEmbed, among other tasks.

When the University of North Carolina at Chapel Hill released hundreds of thousands of pages of documents gathered during an independent investigation into academic fraud involving faculty, staff and student athletes, Dukes turned to the search method of DocumentCloud’s API to build a web application that let users find and read documents by keyword or key people in the investigation.

“We wanted to build something to allow users to browse and search hundreds of thousands of pages of documents all in one place,” Dukes said. “DocumentCloud’s existing embeddable search was close, but because the documents were arbitrarily spread across hundreds of batches, I was concerned it would be too confusing for the average user.

“The API allowed us to very quickly prototype and roll out exactly what we wanted for this very specific circumstance,” he said. “We used the API to pull every page from documents stored in a single project and display them in an intuitive application that allows users to read page by page (or even random pages) or search everything at once. We’ve updated the application twice now, and we’re currently up to 680,000 pages and counting.”
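For developers curious what a call to the search method might look like, here is a minimal Python sketch that only builds request URLs and does not contact the service. The endpoint path and the `q`, `page` and `per_page` parameter names are assumptions to check against our help documentation. The helper functions are hypothetical and are not part of WRAL’s application.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the API help documentation.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"


def search_url(query: str, page: int = 1, per_page: int = 100) -> str:
    """Build a search URL for one page of results (parameter names assumed)."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_count(total_documents: int, per_page: int = 100) -> int:
    """Number of result pages needed to cover every matching document."""
    return -(-total_documents // per_page)  # ceiling division


print(search_url("academic fraud", page=2))
print(page_count(250))  # 3
```

An application like the one Dukes describes would request each page in turn and collect the documents returned.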

Dukes’ project is one of several in recent months that have used our API or components to give readers custom search and viewing, including a Wall Street Journal application to let readers tag Hillary Clinton’s emails and La Nacion’s election crowdsourcing application VozData.

If you’re interested in using the DocumentCloud API, check out our help documentation and don’t be shy about getting in touch.

Storytelling with improved DocumentCloud notes

Posted
Feb 17th, 2016

Tags
Documents

Author
Anthony DeBarros

Starting this week, DocumentCloud notes are sporting a subtle facelift that aligns their type style and color palette with the more modern aesthetic developed for our recently launched page embed.

Notes, like pages, are fully responsive, lightweight and a great choice for websites viewed on a variety of screen sizes. Plus, each note includes a link to view the full document. With today’s update, notes adapt better to varying device widths and rely less on expensive JavaScript calculations.

Here’s a note in action:

A key procedure — doing a 360-degree scan of the structure — was not followed, according to the report. Also, personnel did not observe the fire on the first floor of the house.

Using notes to strengthen the narrative

Improved storytelling is one of DocumentCloud’s aims, and weaving notes into a story can build a stronger narrative. For example, the Chicago Tribune recently reported details on when aides to Chicago Mayor Rahm Emanuel became aware of facts in the police shooting of Laquan McDonald. Reporters placed notes at key points in the story to highlight portions of emails, calendar items and other documents showing when officials discussed the shooting.

Oregon Public Broadcasting also used notes to highlight phrases in potentially confusing earthquake insurance policies. Because the notes were woven into the story, readers could quickly view the specific contract language the story discussed.

Our work on notes and pages is leading towards improvements to our main document viewer and is part of an effort to improve overall publishing performance. Our code is open source, so if you’re a developer you can follow progress on notes, pages, and our whole platform on GitHub. As always, we welcome your thoughts at support@documentcloud.org.

WordPress DocumentCloud 0.4.0 supports page embed

Posted
Dec 17th, 2015

Tags
Code,Documents

Author
Anthony DeBarros

We’re excited to announce that DocumentCloud’s custom WordPress plugin now features support for Page Embed, our lightweight, responsive viewer. Version 0.4.0 — which includes additional enhancements and bug fixes — is available for download now, and we recommend upgrading as soon as practical.

The plugin makes publishing documents, notes and now pages as simple as dropping a shortcode in your WordPress post. As with documents and notes, the embed wizard in the DocumentCloud workspace generates the shortcode for a page at the same time it creates our traditional embed code. Here’s an example:

[documentcloud url="https://www.documentcloud.org/documents/1659580-economic-analysis-of-the-south-pole-traverse.html#document/p4"]
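If you need many shortcodes at once, the pattern in the example above is simple enough to generate in a script. The Python helper below is a hypothetical sketch that only mirrors the format shown here. The embed wizard remains the authoritative source for shortcodes.

```python
def page_shortcode(document_url: str, page: int) -> str:
    """Build a page-embed shortcode in the format shown in the example above."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Drop any existing fragment so the page anchor is not duplicated.
    base = document_url.split("#", 1)[0]
    return f'[documentcloud url="{base}#document/p{page}"]'


url = (
    "https://www.documentcloud.org/documents/"
    "1659580-economic-analysis-of-the-south-pole-traverse.html"
)
print(page_shortcode(url, 4))
```

For the document URL and page number used above, this prints the same shortcode as the example.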

Download, install and activate the plugin, drop the shortcode into your story, and it renders the page including annotations:

For more information on embedding, check out our Help docs.

Our WordPress plugin, like the entire DocumentCloud platform, is an open-source project, and we welcome your contributions, ideas and feedback! Visit the GitHub repository or get in touch via support@documentcloud.org.

Update: Making DocumentCloud Sustainable

Posted
Dec 9th, 2015

Tags
Accounts,Sustainability

Author
Anthony DeBarros

In the last year, the DocumentCloud team has completed a lot of work that’s been visible, such as our responsive page embed and WordPress plugin, and even more that’s been behind the scenes, including faster image rendering and improved reliability.

Our goal is simple: We want DocumentCloud to give you the tools you need to find news in documents and tell stories with them.

At the same time, we’ve been having a lot of conversations to lay groundwork for perhaps our biggest challenge: making sure DocumentCloud has the financial resources to continue its civic mission and develop technically to match our users’ changing needs. What started in 2009 as a great idea has grown up into a full-fledged software product serving thousands of journalists worldwide. In recognition of this, our most recent grant from the Knight Foundation directs us to find ways to generate the revenue needed to ensure DocumentCloud’s future.

Many of those conversations have been with you — our users. Whether casually at a conference or in a formal interview by phone, we’ve talked with reporters, web producers, editors and application developers to find out how you use DocumentCloud, the features you like best, the improvements you’d like, and — relative to DocumentCloud’s sustainability — the extent to which newsrooms value DocumentCloud and would consider paying for its services.

We’ve learned some things. We’re glad to know that DocumentCloud remains valuable to you, and if the price is reasonable and our platform’s competitive, there’s a good chance you’ll talk with your managers about supporting us. But you’ve also made it clear that newsrooms are watching budgets more closely than ever, and we need to keep proving our value. We hear you.

We’ve also learned more about you — for example, that DocumentCloud users aren’t monolithic. For some, we’re the go-to tool for amassing and researching hundreds or hundreds of thousands of documents for investigations and news applications. For others, we’re mainly a publishing platform — a great way to enrich stories with notes, complete documents and now responsive pages. We’re happy being both, and we’ve made sure this past year to make improvements to DocumentCloud that serve everyone.

So, out of those conversations — as well as ongoing study of analytics around user and platform activity — we’ve placed several efforts in motion:

  • We’re developing a pricing model for the platform and will begin charging for certain levels of access at some time in 2016. We’ve yet to land on specifics, but we are committed to providing a free usage tier as well as discounts for non-profit news organizations.
  • We’ve engaged University of Miami assistant professor Vamsi Kanuri to help us conduct an analysis of DocumentCloud user preferences related to features and pricing. Later this week, we’ll send a survey to a set of DocumentCloud users. If you get one, please help us by completing it!
  • We’ve been opening DocumentCloud on a trial basis to paying customers from fields including education, research and libraries/archives. Many have long believed that DocumentCloud’s value extends beyond journalism, and the early results are encouraging. We’ll share more about this in the months to come.

Thank you for using DocumentCloud and for sharing your thoughts with us. What we learn from you informs our plans on how to build better tools for making the news.

Also thanks to our advisers, who have shared their thoughts on the business, technology and civics of news. Each has been full of ideas and wise counsel.

We love hearing from you. Please reach out to us any time at support@documentcloud.org or via Twitter.

Celebrating one million public documents

Posted
Nov 24th, 2015

Tags
Documents

Author
Anthony DeBarros

Dear DocumentCloud users:

Pat yourselves on the back!

On Monday evening, the number of documents available in the DocumentCloud public catalog passed one million. All told, the number of public pages now exceeds 13 million.

These public documents — plus another 1.4 million private documents in our database — represent a lot of your hard work. Often, they’re the result of hours of dogged reporting, persistent requests to government agencies, the scraping of websites, and a determination to treat “no” as an unacceptable answer. 

Thanks to you, DocumentCloud’s public catalog has become a deep well representing an amazing diversity of topics. In November’s uploads alone, you’ll find subjects ranging from New York state’s lawsuit against the fantasy sports site DraftKings to a recent announcement by the National Institutes of Health that it will no longer support biomedical research on chimpanzees to tens of thousands of pages of Argentinian election results.

It’s a moment to celebrate. We at DocumentCloud and Investigative Reporters and Editors applaud you. We’re grateful both for your reporting that shines a light and for what, collectively, you’re building along with us. Thank you.

Introducing DocumentCloud page embeds

Posted
Nov 12th, 2015

Tags
Code,Workspace

Author
Anthony DeBarros

With content consumption continuing to shift to mobile, we’ve been spending a considerable amount of time thinking about how to help our users tell stories with documents across a range of devices. Loading and reading a 50-page PDF on a desktop computer is an experience that, for many reasons, doesn’t translate well to a smart phone. So, what’s a better way to point readers to what matters in a document when it appears on one of your mobile pages?

Today, we’re pleased to announce our first step in moving toward a better mobile PDF experience: DocumentCloud Page Embed. It’s a lightweight, responsive viewer that highlights a single page, along with your annotations, and works across desktop and mobile. It’s available in our workspace right now. Look for “Embed a Page” under the Publish Menu and use the wizard to generate the code. See our Help documentation for details.

Here’s an example of Page Embed showing a page from an investigative report on a fire that injured several firefighters in Northern Virginia. The page has one note highlighted:

Behind the development

Why a page-focused embed? We noticed that publishers often embed simple screenshots of document pages. While easy to use and natively responsive, images deny readers the rich context of an embedded DocumentCloud document: annotations, searchable page text, and of course access to the original source document itself. While our full document viewer offers all these options, it’s overkill for presenting a single page.

Our new page embed strips the cruft from the document viewer and lets the reader focus on the page (and your annotations, if you’ve added any). We know how important mobile has become for document publishing, so we’ve made the page embed natively responsive. The entire interface resizes and changes capabilities with its surrounding context.

Extending across the platform

In fact, we’re so happy with the page embedder, we’re using it as the foundation for our next-generation full document viewer. Our goal is a responsive, extensible viewer with minimal interface chrome that lets readers view the document and your annotations and then get right back into the story. In other words, our focus is keeping their focus on you.

We still have plenty of improvements and additions in the pipeline. We’ll soon add navigation to all the pages and text in the document, more immersive notes, more customization options, and better performance. We also plan to support page embeds with our oEmbed API and WordPress plugin. You can go ahead and start using it today, but keep an eye on the wizard for new options and capabilities in the near future.

Your thoughts are welcome! Please send any feedback to support@documentcloud.org or open an issue on our GitHub repo.

A summer day’s worth of DocumentCloud updates

Posted
Aug 10th, 2015

Tags
Workspace

Author
Anthony DeBarros

Hello and happy summer! We’ve been busy here at Team DocumentCloud, using the weeks since meeting many of you at IRE 2015 and SRCCON to focus on building a stronger platform and getting in position to ensure the long-term sustainability of DocumentCloud.

There’s lots in motion, and so here’s a quick update on the highlights:

Milestone: 2 million docs

Thank you for keeping us busy! In July, the total number of files uploaded to DocumentCloud passed 2 million, and our platform now holds more than 27 million pages of the documents you’ve gathered. The numbers keep growing as more news organizations join us – more than 1,400 worldwide right now – and as more people use our API for bulk uploads. Keep those documents coming (and we always appreciate a tip at support@documentcloud.org if you’re planning a big drop)!

A mobile-optimized viewer embed

Whenever we chat with our users, the most-requested feature for DocumentCloud is a better experience for viewing our embeds on phones. Well, we’ve heard you and have been busy developing documentcloud-pages, a new responsive embed type that displays a page with minimal chrome but also allows navigation through the entire document. We’re aiming to launch an early version in late August or September; if you’d like to contribute code or issues, please visit the project repository. In fact, we’re excited that the folks at La Nacion are already using the new embed in their Doc2Media project.

Get tips and updates in your mailbox!

We’re here to help, and soon we’ll launch two newsletters filled with info on how to get more out of your DocumentCloud account. News & Tips will highlight new features (and ones you might not know about) plus tips for publishing, collaborating and working with documents. App Developers will offer information for developers working with our API or building news apps based on our open source components. Both newsletters also will highlight great uses of DocumentCloud from around the world. You can sign up now.

An OpenCalais update

Since the launch of DocumentCloud, we’ve used Thomson Reuters’ OpenCalais API for our entity extraction. There’s a new version of the API, and we’re migrating to it this month. There won’t be any immediate difference in how we display entities, but we’re looking at whether the new API may offer us some new features. Stay tuned.

Welcome, Clay Selby, to the team!

There’s a new face at Team DocumentCloud’s daily scrum: Clay Selby of Austin, Texas, joined us in July as a part-time developer (thereby doubling our Texas staff). Clay’s the founder of email marketing startup SocialRest.com and brings a good dose of entrepreneurial experience along with his coding chops. Initially, Clay’s been working on moving us to the new OpenCalais API, but he’ll also be focusing on a lot of our back-end processing improvements.

Out and about

We’ve seen many of you in the last few months at various places on the map, from IRE 2015 in Philadelphia – where we held a hands-on class and talked to hundreds who stopped at our table – to SRCCON in Minneapolis. We had the good fortune to show off DocumentCloud to students at Medill/IRE’s National Security Journalism Data/Watchdog Workshop in Washington, D.C., and we checked in with a couple of our user newsrooms as well. If we’re in your area and you’d like to get together, let us know!

A sustainable DocumentCloud

Last but not least, our current Knight Foundation grant directs us to find ways to make DocumentCloud financially sustainable. Since launch in 2010, thanks to Knight, our service has been offered to journalists for free. In the spirit of improving journalism and making reporting more transparent, we’re intent on maintaining a level of free access to DocumentCloud for journalists while developing a pricing model around features of the platform. In addition, thanks to a new account signup page, we’re hearing from many outside of journalism who’d like to use DocumentCloud, and we’re exploring that option. In the weeks ahead, we’ll be reaching out to many of you to discuss our plans.

Thanks for reading. As always, we have several ways for you to get in touch or follow our progress:

A new language for our Workspace: Danish

Posted
May 18th, 2015

Tags
Workspace

Author
Anthony DeBarros

Good news – or, if you speak Danish, gode nyheder! Starting today, Danish-speakers can set the DocumentCloud workspace to default to their native language, thanks to translation help from Nils Mulvad, editor at Kaas & Mulvad and associate professor at The Danish School of Media and Journalism.

The addition of Danish increases the number of workspace translations to five. Along with English, we also support Spanish (thanks to work by Fernando Diaz), and Russian and Ukrainian (thanks for both to Roman Kolgushev).

Widening our language support remains an ongoing mission at DocumentCloud, part of our commitment to making our platform accessible to journalists around the world. As we wrote in March when we added OCR support for three additional languages, DocumentCloud language support falls into three categories: text search, entity extraction and workspace translation. We also have work under way to support additional languages in the document viewer.

Thanks to recent work by our development team, we’ve made it easier for collaborators to translate our workspace into more languages – and we’re looking for help! If you’re interested in helping bring DocumentCloud’s workspace to your language, please email us at info@documentcloud.org.

DocumentCloud adds five to group of advisers

Posted
Apr 16th, 2015

Tags
People

Author
Anthony DeBarros

DocumentCloud is pleased today to welcome five media and technology professionals to its group of advisers. The expanded group, which includes two of the platform’s founders, will help guide DocumentCloud as it develops new offerings and plans for sustainability.

DocumentCloud, which is a service of Investigative Reporters and Editors, serves thousands of journalists worldwide with tools for organizing, researching, annotating and publishing documents gathered while reporting. The expanded advisory group is one of several efforts under way as part of a 2014 Knight Foundation grant that is enabling DocumentCloud to add staff, improve the platform’s efficiency, and implement new features.

The advisers will help the team weigh questions related to technology, market opportunities, product development and revenue models.

“We’re excited to have a strong group of experts who are willing to share their expertise with us,” said Mark Horvit, executive director of IRE. “Their guidance will play a key role in helping us chart the future of DocumentCloud.”

The DocumentCloud advisers include:

Penelope (Penny) Muse Abernathy, the Knight Chair in Journalism and Digital Media Economics at the University of North Carolina and a journalism professional with more than 30 years of experience as a reporter, editor and media executive. @businessofnews

Matthew de Ganon, Senior Vice President of Product Management & Commerce at Softcard, the mobile wallet joint venture of AT&T, T-Mobile and Verizon. @deganon

Eric Gundersen, CEO of Mapbox, a leading provider of custom online mapping solutions. @ericg

Jacqueline Kazil, an Innovation Specialist working on cross-agency platforms for the federal government. @JackieKazil

Scott Klein, an assistant managing editor at ProPublica and a co-founder of DocumentCloud. @kleinmatic

T. Christian Miller, a member of the Investigative Reporters and Editors board of directors, is a senior reporter at ProPublica, which he joined in 2008. @txiatianmiller

Aron Pilhofer, Executive Editor of Digital at The Guardian and a co-founder of DocumentCloud. @pilhofer

Biographies of the advisers are available on our staff page.

DocumentCloud welcomes Justin Reese

Posted
Mar 24th, 2015

Tags
People

Author
Anthony DeBarros

We’re happy to announce that Justin Reese is joining the DocumentCloud development team. Hailing (and working remotely) from Tyler, Texas, Justin will focus on building the next generation of our platform’s front-end components, from the document workspace to embeds to the overall site experience.

Justin comes to DocumentCloud after spending years translating complicated business requirements into simple, usable web apps for companies such as Essilor Labs and Bon-Ton. We’re excited to have Justin as a collaborator. His work shows a thoughtful consideration of users and attention to detail, and as a contributor to projects such as Hack Tyler he shares DocumentCloud’s eagerness to create software that serves the public good. Justin’s artistry extends beyond software: he also makes short films and tolerable Neapolitan-style pizza (which we expect to taste asap).

The addition of Justin is part of our current funding from the Knight Foundation, a grant intended to expand and improve the platform. Our goal is to make DocumentCloud the best document reporting, research and publishing platform for journalists and those who work with public documents and to also ensure the long-term sustainability of the platform. In addition, we’re eager to continue DocumentCloud’s legacy of making elements of the platform available as open-source components, such as our recent release of PDFShaver. Justin will play a key role in helping us make that happen.

Please welcome him to our team. You can reach Justin at justin@documentcloud.org or follow him on Twitter.

Hungarian, Norwegian, Swedish OCR support added

Posted
Mar 12th, 2015

Tags
Documents,Workspace

Author
Anthony DeBarros

Starting today, DocumentCloud users can choose three additional languages to OCR uploaded documents: Hungarian, Norwegian and Swedish. We’ve added the three based on support requests and feedback we heard during last week’s NICAR conference in Atlanta.

The addition brings the number of languages available for OCR to 17. To see them all, click “Manage account” beneath your user name and click the “New Documents” dropdown under “Language Defaults.”

We believe that journalists around the world should have access to tools that enable better reporting, and growing our language support is critical to that. DocumentCloud’s language support falls into three independent categories (so partial support of your language may be possible):

Text search: DocumentCloud fully supports Unicode, allowing users to search documents in a variety of character sets. For scanned documents that require OCR, DocumentCloud uses the battle-tested open-source Tesseract engine, which also powers Google Books and is maintained by Google. The open-source community has contributed language packages for many widely used languages, which allows DocumentCloud to enable them on our platform. So, if DocumentCloud does not yet support your language, please reach out and let us know about your interest!

Entity extraction: DocumentCloud supports identifying people, places, organizations and other entities through OpenCalais. As of now, OpenCalais only supports English, French and Spanish. We are evaluating other tools that would allow us to bring entity extraction to other languages.

Interface translation: Making our tools accessible means more than being able to process non-English documents. Our users have already collaborated with us to translate DocumentCloud’s Workspace user interface into four languages: English, Spanish, Russian and Ukrainian. If you are interested in bringing DocumentCloud to your language, please email us at info@documentcloud.org.
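
For readers who run Tesseract on their own machines, here is a minimal sketch of how a language choice might translate into a Tesseract command. It is illustrative only and is not DocumentCloud's processing code: the helper name `build_ocr_command`, the lookup table and the file names are assumptions made for this example. The `-l` option and the `hun`, `nor` and `swe` codes are Tesseract's own.

```python
# Illustrative sketch only; not DocumentCloud's actual processing code.
# Maps a language name to its Tesseract language code and builds the
# command line that would OCR a single page image.

# Tesseract language codes for the three newly added languages.
TESSERACT_CODES = {
    "Hungarian": "hun",
    "Norwegian": "nor",
    "Swedish": "swe",
}


def build_ocr_command(image_path, output_base, language):
    """Return the argument list for a Tesseract run (hypothetical helper)."""
    try:
        code = TESSERACT_CODES[language]
    except KeyError:
        raise ValueError(f"No OCR language pack configured for {language!r}")
    # Tesseract CLI form: tesseract <image> <output base> -l <lang>
    return ["tesseract", image_path, output_base, "-l", code]


if __name__ == "__main__":
    # Prints the command. Pass the list to subprocess.run() to execute it
    # on a machine with Tesseract and the language pack installed.
    print(" ".join(build_ocr_command("page-1.tif", "page-1", "Swedish")))
```

Building the command as a list rather than a single string keeps file names with spaces intact when the list is handed to `subprocess.run()`.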

Job posting: DocumentCloud seeks data engineer

Posted
Mar 7th, 2015

Tags
Jobs

Author
Anthony DeBarros

We’re looking for a data engineer to join the growing team at DocumentCloud! If you’d enjoy a chance to help develop the next generation of our service — an open-source civic platform that more than 1,000 news organizations use to analyze, annotate and publish documents for the public good — we’d love to hear from you.

This is a full-time, two-year position with full University of Missouri benefits funded by a grant from the Knight Foundation. We’re a nimble, tightly knit team that works remotely — we stay connected via Slack and video chats — so you can live where you’d like and work flexible hours.

You’ll work to improve the extraction and analysis capabilities of DocumentCloud’s processing pipeline, which makes searching and analyzing document collections accessible to journalists. The pipeline consists of several open source tools wrapped up in our Ruby-based infrastructure (a Rails-driven API and our CloudCrowd parallel processing toolkit). You’ll also play a key role in developing our production API capabilities, with a particular focus on what information we extract from documents for users and how best to do so.

Our ideal candidate would have the following skills and qualities:

— Independent problem-solver who values learning, keeps current on trends, and knows how to pick the right set of tools for a problem.
— Able to write clean, well-documented code; you know your way around Git, and your GitHub account shows activity.
— Strong ability to collaborate and communicate with a distributed team.
— Ruby and Rails.
— Experience with Unix-based systems.
— Some knowledge of data science, linguistics, information extraction or search. Solr experience is a bonus.
— An interest in language and data processing.
— Knowledge of SQL (Postgres preferred).

You’ll join DocumentCloud at a significant time. We’re enjoying widespread use of our platform, and our tools have been used to investigate and publish stories from the grand jury decision in Ferguson, Missouri, to the Guardian’s NSA spying leaks. We collaborate with organizations such as the Washington Post, The Associated Press and Mozilla’s OpenNews fellows to build better ways to present the news, and you’ll have the chance to be part of the community exploring this intersection of news, data and technology.

To apply, please contact us at jobs@documentcloud.org.

Catch DocumentCloud at NICAR 2015 in Atlanta

Posted
Feb 26th, 2015

Tags
Documents, People

Author
Anthony DeBarros

If you’re coming to IRE’s annual data journalism conference March 5-8 in Atlanta, be sure to stop by and say hello to the DocumentCloud team!

We’ve made a few ways for you to learn more about the platform, tell us your ideas, and hear about what’s next for DocumentCloud:

— Saturday at 3:20 p.m., join us for a hands-on class, “Reporting and Presentation with DocumentCloud.” Get to know the suite of tools that DocumentCloud offers to help you better organize, analyze and present public documents.

— Sunday at 11:20 a.m. in the Demo Room, we’ll offer “Advanced DocumentCloud: Examples and Suggestions.” Take a deeper dive into DocumentCloud and its API and bring your ideas for features you’d like to see in the platform. Plus, see some of the best uses of DocumentCloud in the last year!

— Throughout the conference, you can find the DocumentCloud team on hand at its booth outside the conference rooms. We’ll be set up to give you a demo, answer questions about accounts and hear how you use the platform.

During the conference, reach us by email or Twitter. Find Ted Han via ted@documentcloud.org and @knowtheory; Anthony DeBarros via anthony@documentcloud.org and @anthonydb; and Lauren Grandestaff via lauren@ire.org and @lgrandestaff.

We’ll look forward to seeing you!

Ahead for 2015: A Faster, More Productive DocumentCloud

Posted
Jan 20th, 2015

Tags
Documents, Workspace

Author
Anthony DeBarros

If you’d dropped into the DocumentCloud workspace in Columbia, Mo., at the start of January, you’d have found at least two things: a team actively avoiding the single-digit temps outside the office and a whiteboard that we frequently filled with ideas, photographed for posterity, erased and filled up again.

We ended up ignoring the temps. The ideas generated enough heat to last us until summer – and beyond!

So, what’s ahead for 2015? We’ve spent the last year researching and reflecting on what you want, both as a buildup to the recent $1.4 million Knight Foundation grant and to make sure you’re as happy as possible with the service. We want to be sure the platform is fast, reliable, and enables you to do your best work.

There’s a legacy to continue. DocumentCloud was founded by and for journalists to support in-depth reporting around public documents. Today, more than 900 news organizations worldwide use the platform. Whether it’s publishing the documents related to doubts about the guilt of a Texas man executed for murder or the grand jury testimony regarding the death of Michael Brown in Ferguson, Mo., journalists use DocumentCloud to give their readers a first-hand view of the primary source documents they gather.

We plan to build from that success. The DocumentCloud team’s expanding, and we’re locking in a roadmap to grow and improve the platform. We’ve recently hired a director of product development, and we’ve just posted a job description for a front-end developer. With a bigger team and lots of focus, we believe that by the end of the year, you’ll see substantial improvements and expanded offerings that will maintain DocumentCloud’s place as an essential reporting and presentation tool.

So, here are some things we’re going to do to help you:

— Improved processing. If you use DocumentCloud, you upload documents to the platform. You want them processed quickly, without errors, so you can get right to publishing or annotating. Over the last year, we’ve made substantial improvements to our processing cluster and sped up imports of popular documents uploaded by multiple users. And we’re glad that you’re noticing the results. We have more work planned – look for a blog post soon detailing the changes.

— Go mobile. We know (and so do you) that more of your readers view more of your content on their phones. So, we’re planning mobile-specific changes to the viewer to improve scrolling, zooming and the experience in general.

— More storytelling tools. We’re exploring ideas for expanding the display options for the viewer, such as presentation templates, social media sharing, notes displays and more. Many of you have asked for oEmbed support, and we’re looking closely at that!

— Telling DocumentCloud’s story. We’ll bring you more blog posts like this, keeping you up to date on our progress and listening for your ideas. We’ll also tell you more about how to make better use of all the site has to offer — such as deeper search options — and highlight great examples of your storytelling.

— Expanded reach and premium offerings. We want to make sure DocumentCloud is going to be available to journalists for years to come, and one way to do that is for the platform to begin generating revenue. This goal is part of the Knight grant, and so we’ll be exploring options – premium features, opening the tool to additional types of users on a fee basis, donations, and other ideas.

Beyond those, we have many more ideas, among them better feedback on document processing, ability to rotate pages, better organization of the site and workspace, and batch processing options.

That’s a lot to chew on in one year, but with our team expanding – and, we hope, with continued contributions from the open source community – we’re pretty excited about the prospects.

As always, let us know your thoughts! You can reach us on UserVoice, Twitter, or email.

Job Posting: Come work on DocumentCloud’s front end!

Posted
Jan 20th, 2015

Tags
Jobs

Author
Anthony DeBarros

DocumentCloud, the platform journalists use to analyze, annotate and publish documents, is growing! We have an immediate need for a JavaScript developer/architect who can help build the next evolution of our platform. This is a full-time, two-year position with full University of Missouri benefits funded by a grant from the Knight Foundation.

You can live where you’d like and work flexible hours. We’re a nimble, tightly knit team that works remotely. We scrum daily and stay connected via Slack and video chats. Our code is open source, so your commits to our GitHub will be seen by the growing community that depends on our platform.

You’ll join DocumentCloud at a significant time. We build a civic platform that more than 1,000 news organizations worldwide use for the public good, and we value transparency, accountability and the preservation of a free press. Our tools have been used to investigate and publish stories from the grand jury decision in Ferguson, Missouri, to the Guardian’s NSA spying leaks. We collaborate with organizations like the Washington Post, The Associated Press and Mozilla’s OpenNews fellows to build better ways to present the news, and you’ll have the chance to be part of the community exploring this intersection of news, data and technology.

You’ll be at the center of several of our immediate goals: we plan on improving the experience of reading documents on mobile devices; developing templates for displaying documents; and refreshing our website and user workspace. And we’re interested in your creative input as we navigate DocumentCloud’s path forward.

Here’s what we’re looking for:

In this front-end role, you’ll focus on DocumentCloud’s Backbone/Rails components, which let users upload, organize, research and embed documents – making it easy for users to highlight and publish the newsy parts of documents. You’ll develop for desktop and mobile web across browsers, creating a smart, easy workflow for journalists investigating and publishing documents, as well as for the readers who view them.

We’d like you to have some or most of these skills and qualities:

— Strong ability to collaborate and communicate with a distributed team.

— Independent problem-solver who values learning, keeps current on new trends, and knows how to pick the right set of tools for a problem.

— Able to write clean, well-documented code; you know your way around Git, and your GitHub account shows activity.

— Familiar with Ruby/Rails and SQL/databases or willing to dive in and gain enough knowledge to contribute when needed.

To apply, please contact us at jobs@documentcloud.org.