Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2 and our CloudCrowd parallel-processing system. Since Amazon just announced the new Micro instances, we thought it would be wise to try them out by benchmarking some real-world work on these new servers. If they proved cost-effective, we could use them as worker machines for our document processing.
Benchmarking with Docsplit
To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).
For source material, we used a 51-page PDF from The Commercial Appeal’s recent story on civil rights photographer and FBI informant Ernest Withers: an FBI report that describes the events preceding the assassination of Dr. Martin Luther King Jr.
To benchmark the relative speeds of the instance types, we used Docsplit’s OCR-based text extraction, which is a single-threaded call to Tesseract, as well as Docsplit’s image extraction, which is a multi-threaded call to GraphicsMagick and Ghostscript for PDF to GIF conversion and image resizing.
Here are the commands we ran to download the PDF and extract the images at three different sizes, as well as the full text:
time docsplit images --size 1000x,700x,60x75 --format gif --rolling informant-details-invaders-history-and-activities-part-two.pdf
time docsplit text --ocr informant-details-invaders-history-and-activities-part-two.pdf
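These invocations are easy to script. As a sketch (the helper names here are our own, not part of Docsplit), the same two commands can be assembled programmatically so a worker process can shell them out per document:

```ruby
# Hypothetical helpers that build the docsplit commands shown above,
# for a CloudCrowd-style worker to run once per document.

def docsplit_images_cmd(pdf, sizes: %w[1000x 700x 60x75], format: "gif")
  # --rolling generates each smaller size from the previous, larger image
  "docsplit images --size #{sizes.join(',')} --format #{format} --rolling #{pdf}"
end

def docsplit_text_cmd(pdf)
  # --ocr forces OCR via Tesseract rather than relying on an embedded text layer
  "docsplit text --ocr #{pdf}"
end
```

Shelling these out with Ruby’s `system` keeps the heavy lifting in the same external tools the command line uses.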
We then used screen to run two Docsplit image extractions at the same time, since the high-CPU medium instances are dual-core machines.
$ time docsplit images .pdf
$ time docsplit images .pdf
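Instead of screen, the same pair of runs can be launched from a short Ruby snippet (a sketch; the commands you pass in are whatever two extractions you want to race):

```ruby
# Run several shell commands concurrently; returns whether each succeeded.
# On a dual-core High-CPU Medium, passing two docsplit invocations here
# keeps both cores busy, just like the two screen sessions above.
def run_in_parallel(commands)
  pids = commands.map { |cmd| Process.spawn(cmd) }    # one child per command
  pids.map { |pid| Process.wait2(pid).last.success? } # wait for all to finish
end
```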
Let’s look at the results:
| Instance Type | Image extraction | Text extraction (OCR) | Base Cost per Hour |
| --- | --- | --- | --- |
| High-CPU Medium | 5.4 minutes | 11.7 minutes | $0.17 |
| Small | 9.6 minutes | 15.0 minutes | $0.085 |
| Micro | 21.5 minutes | 52.0 minutes | $0.02 |
Graphing these values:
It’s not hard to see that the Micro instances are about twice as cost-effective as the Medium instances. However, the Medium instance is a dual-core machine, and if we run two Docsplit processes at the same time (which we are already doing), the cost-effectiveness of the High-CPU Medium nearly doubles, bringing it to rough parity with a Micro instance.
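The arithmetic behind that claim, as a quick back-of-the-envelope check in Ruby using the table’s numbers:

```ruby
# Cost of a single image-extraction run: (minutes / 60) * hourly rate.
def cost_per_run(minutes, hourly_rate)
  minutes / 60.0 * hourly_rate
end

medium = cost_per_run(5.4, 0.17)   # High-CPU Medium, one core: ~$0.0153
micro  = cost_per_run(21.5, 0.02)  # Micro: ~$0.0072

puts (medium / micro).round(2)      # => 2.13: Micro ~2x as cost-effective
puts (medium / 2 / micro).round(2)  # => 1.07: two parallel runs reach parity
```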
There is a crucial difference, however. The Micro instance, despite being cheaper, has a faster CPU: it needs only 4:35 of actual CPU time to do the same work that takes the High-CPU Medium instance 5:49. But because you’re sharing the Micro instance’s resources with other EC2 customers, the High-CPU Medium ends up processing the documents nearly four times faster in wall-clock terms: the Micro takes 21:32 to process the images, whereas the High-CPU Medium finishes in 5:25.
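Checking the wall-clock ratio from those times:

```ruby
# Convert an "m:ss" time into seconds.
def seconds(mmss)
  minutes, secs = mmss.split(":").map(&:to_i)
  minutes * 60 + secs
end

puts (seconds("21:32").to_f / seconds("5:25")).round(1)  # => 4.0
```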
Our Recommendation: If raw speed is important to you, the High-CPU Medium makes more financial sense than the Small or Micro instances. But if speed is not an issue, the Micro instance actually wins out on cost for single-threaded workloads: processing takes longer, but costs less overall. It all depends on your setup. With our parallel document imports, we could switch to all Micro instances and process the same number of pages per day for the same price, but each individual document would take nearly four times longer to finish. So we’re sticking with the High-CPU Medium instances.
Micro instances come with an optional 64-bit configuration, which is very useful if you ever work with large files, like a MongoDB database, large images or PDFs, or anything beyond 2GB in size. Additionally, Micro instances use Amazon’s EBS service for persistent storage. Because EBS is the same cost no matter the instance size, it’s very convenient if you decide to move up or down in instance size. This is comparable to many competing VPS services like Slicehost and Linode, just a different way of combining the various storage and compute components.
Also, there are many other VPS comparison blog posts which describe the differences between CPU-bound, memory-bound, and I/O-bound application performance. Eivind Uggedal compares a number of different applications on a few hosts, including Amazon. The Bit Source compares CPU performance between Amazon and Rackspace.