Here at DocumentCloud, we’re constantly turning PDF files and Office documents into embeddable document viewers. We extract text from the documents with OCR and generate images at multiple sizes for each of the thousands of pages we process every day. To crunch all of this data, we rely on High-CPU Medium instances on Amazon EC2 and our CloudCrowd parallel-processing system. Since the new Micro instances were just announced, we thought it would be wise to try them out by benchmarking some real-world work on these new servers. If they proved cost-effective, we could use them as worker machines for our document processing.
Benchmarking with Docsplit
To benchmark EC2 Micros, Smalls, and High-CPU Mediums, we used Docsplit. Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).
Configuration
For source material, we used a 51-page PDF from The Commercial Appeal’s recent story on civil rights photographer and FBI informant Ernest Withers: an FBI report that describes the events preceding the assassination of Dr. Martin Luther King Jr.
To benchmark the relative speeds of the instance types, we used Docsplit’s OCR-based text extraction, which is a single-threaded call to Tesseract, as well as Docsplit’s image extraction, which is a multi-threaded call to GraphicsMagick and Ghostscript for PDF to GIF conversion and image resizing.
Here are the commands we ran to download the PDF and extract the images at three different sizes, as well as the full text:
wget http://s3.documentcloud.org/documents/7240/informant-details-invaders-history-and-activities-part-two.pdf
time docsplit images --size 1000x,700x,60x75 --format gif --rolling informant-details-invaders-history-and-activities-part-two.pdf
time docsplit text --ocr informant-details-invaders-history-and-activities-part-two.pdf
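For repeated benchmark runs, the same wall-clock timings can be collected from Ruby with a small harness. This is just a sketch; it assumes the `docsplit` executable is on your PATH, so the actual Docsplit invocations are left commented out:

```ruby
require 'benchmark'

# Time a shell command and report its wall-clock duration,
# similar to what `time` prints in the transcripts above.
def time_command(cmd)
  elapsed = Benchmark.realtime { system(cmd) }
  puts "%-60s %.1fs" % [cmd, elapsed]
  elapsed
end

# time_command("docsplit images --size 1000x,700x,60x75 --format gif --rolling informant-details-invaders-history-and-activities-part-two.pdf")
# time_command("docsplit text --ocr informant-details-invaders-history-and-activities-part-two.pdf")
```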
Raw Results
High-CPU Medium | Small | Micro |
---|---|---|
$ time docsplit images <SNIP>.pdf | $ time docsplit images <SNIP>.pdf | $ time docsplit images <SNIP>.pdf |
We then used screen to run two Docsplit image extractions at the same time, since the High-CPU Medium instances are dual-core machines.
$ screen
<Screen 1>
$ time docsplit images .pdf
real 6m30.978s
user 5m51.920s
sys 0m11.230s
<Screen 2>
$ time docsplit images .pdf
real 6m26.808s
user 5m50.730s
sys 0m11.180s
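Since both screens finish in roughly six and a half minutes, the effective per-document time on the dual-core Medium works out to about half the wall-clock time. A quick sanity check in Ruby, using the two `real` times above:

```ruby
# Real (wall-clock) times of the two concurrent extractions, in seconds.
run1 = 6 * 60 + 30.978
run2 = 6 * 60 + 26.808

# Both documents finish within the longer of the two runs,
# so the effective time per document is roughly half of that.
wall = [run1, run2].max
effective_per_doc = wall / 2.0

puts "Effective time per document: %.1f seconds (~%.1f minutes)" %
     [effective_per_doc, effective_per_doc / 60.0]
```

That comes to about 3.3 minutes per document, versus the 5.4 minutes of a single-threaded run.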
Results
Let’s look at the results:
Instance Type | Image extraction | Text extraction (OCR) | Base Cost per Hour |
---|---|---|---|
High-CPU Medium | 5.4 minutes | 11.7 minutes | $0.17 |
Small | 9.6 minutes | 15.0 minutes | $0.085 |
Micro | 21.5 minutes | 52.0 minutes | $0.02 |
Graphing these values:
Conclusion
It’s not hard to see that the cost-effectiveness of the Micro instances is about twice that of the Medium instances. However, the Medium instance is a dual-core machine, and if we run two Docsplit processes at the same time (which we are already doing), the cost-effectiveness of the High-CPU Medium instance nearly doubles, raising it to the level of a Micro instance.
There is a crucial difference, however. The Micro instance, despite being cheaper, has a faster CPU and takes only 4:35 of actual CPU time to do the same work that the High-CPU Medium instance takes 5:49 to accomplish. But because you’re sharing the resources of that Micro instance with other EC2 customers, the High-CPU Medium instance ends up processing the documents nearly four times faster than the Micro instance: the Micro takes 21:32 to process images, whereas the High-CPU Medium finishes in 5:25.
Our Recommendation: If raw speed is important to you, the High-CPU Medium makes more financial sense than the Small or Micro instances. But if speed is not an issue, then the Micro instance actually wins out on cost for single-threaded workloads: processing takes longer, but costs less overall. It all depends on your setup. With our parallel document imports, we could switch to using all Micro instances and end up processing the same number of pages per day for the same price, but each individual document would take nearly four times longer to finish. So we’re sticking with the High-CPU Medium instances.
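As a sanity check on the cost comparison, here is the back-of-the-envelope arithmetic from the results table above (single-threaded runs, base hourly prices):

```ruby
# Cost of one image-extraction job on each instance type:
# (hourly price) * (job time in hours), from the results table.
instances = {
  "High-CPU Medium" => { minutes: 5.4,  hourly: 0.17  },
  "Small"           => { minutes: 9.6,  hourly: 0.085 },
  "Micro"           => { minutes: 21.5, hourly: 0.02  },
}

costs = instances.map { |name, i| [name, i[:hourly] * i[:minutes] / 60.0] }.to_h
costs.each { |name, cost| puts "%-16s $%.4f per document" % [name, cost] }

# The Micro works out to roughly half the per-document cost of a
# single-threaded Medium run -- about 2.1x as cost-effective.
puts "ratio: %.1f" % (costs["High-CPU Medium"] / costs["Micro"])
```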
Other Notes
Micro instances come with an optional 64-bit configuration, which is very useful if you ever work with large files, like a MongoDB database, large images or PDFs, or anything beyond 2GB in size. Additionally, Micro instances use Amazon’s EBS service for persistent storage. Because EBS is the same cost no matter the instance size, it’s very convenient if you decide to move up or down in instance size. This is comparable to many competing VPS services like Slicehost and Linode, just a different way of combining the various storage and compute components.
Also, there are many other VPS comparison blog posts which describe the differences between CPU-bound, memory-bound, and I/O-bound application performance. Eivind Uggedal compares a number of different applications on a few hosts, including Amazon. The Bit Source compares CPU performance between Amazon and Rackspace.
Enjoyed the post. FYI: 1st graph y-axis is mislabeled.
notaddicted
15 Sep 10 at 5:19 pm
Thanks for the note. The graphs have been fixed and updated.
Samuel Clay
15 Sep 10 at 6:38 pm
The Micro instances seem to be CPU-throttled to prevent continuous usage; I can’t even compile anything sizable (e.g. Perl, Postgres, Apache) on them without the hypervisor stealing 99% of the CPU time (as shown in the rightmost column of vmstat).
Nice thing about EC2 is that I can stop the instance and resize it to small or medium to build out the server, then drop it to micro to handle sporadic requests.
Nic
15 Sep 10 at 10:14 pm
http://cloudharmony.com/ has benchmarks of most of the cloud hosting providers in their blogs. It’s a far more exhaustive set of tests than what I’ve seen from anyone else over the years.
Josiah
16 Sep 10 at 3:18 am
One thing I’ve been pondering in these early days of t1.micro is whether there is segregation between the machines running those and other instance types. My thinking is that these metrics may be more favorable because t1.micro has barely existed and the hardware is less subscribed, giving better results now.
It would be interesting if you could re-run these benchmarks in 2-3 months and see whether the values are still roughly the same.
Great information though and solid work!
Mark Stanislav
16 Sep 10 at 9:29 am
You are drawing the wrong conclusions from your benchmark. The problem is that you are comparing the normal instances (fixed CPU) to the micro instances (not fixed CPU).
The micro instances can “burst CPU capacity when additional cycles are available”. That means the amount of CPU you get can vary. It’s quite likely you ran your benchmark on an unloaded server (because they are new), so it looks like 2 full CPUs. If you ran it on a loaded server, the performance would be pathetic (maybe 1/10 of the small CPU).
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?concepts_micro_instances.html
Anonymous
16 Sep 10 at 11:31 am
The burst that you are referring to is allowed to last no more than a few hundred milliseconds, far below the capacity we needed here. This means that the CPU was not bursting while we were running the benchmark. While it is quite probable that the load on the server would have an effect on the benchmark, both servers are already capped, so the effect would only cause Micro instances to perform slightly worse, which means they still wouldn’t suit our needs.
But load from adjacent instances has not caused noticeable effects on our existing instances.
Samuel Clay
16 Sep 10 at 11:45 am
The burst on micro instances is not milliseconds. It is more like 15 seconds. The micro instance’s CPU is reasonably fast while bursting, but when the burst runs out then the rate limit is pretty brutal. The rate limited speed is roughly 1/3 of the burst speed that you get for the first 15 seconds.
A simple test, showing compute power per second with some sleep in-between runs to allow the rate-limiter’s bucket to refill:
#!/usr/bin/perl
my $firsttime = my $time = time;
for (my $x = 0; time - $firsttime < 30; $x++) {
    if ($time != time) {
        printf "%2d %d\n", time - $firsttime, $x;
        $x = 0;
        $time = time;
    }
}
# sleep 300; ./throttleme.pl; sleep 300; ./throttleme.pl
1 3050483
2 4499169
3 4351002
4 4480768
5 4491703
6 4495259
7 4502143
8 4494198
9 4174903
10 4097267
11 4259348
12 4370439
13 4216742
14 4379620
15 4499622
16 448604
17 132731
19 133197
20 132758
22 132523
23 129993
24 127614
25 133869
27 132596
28 133385
1 3637552
2 4357062
3 4086175
4 4352176
5 4357643
6 4044038
7 4353554
8 4356628
9 51296
10 129492
12 128712
13 126456
15 129196
16 125337
18 129433
19 111697
21 129684
22 128626
24 128390
25 129025
27 128435
28 128914
30 110801
One important observation here is that it appears to be skipping seconds once rate limiting kicks in. That implies the rate limiter is doing a few very long pauses to rate limit me (as opposed to doing lots of small pauses). So, I get really bad CPU jitter once the rate limiter kicks in.
Matt Buford
17 Sep 10 at 12:47 am
Pausing for 5 minutes just to get 8-15 seconds of CPU burst means that you can only run bursty jobs every so often, and if you are doing any sort of processing, you’re probably going to need bursts far more often than that.
By running the CPU for the entire 5-20 minutes in our tests, we tested under real-world constraints. Unfortunately for us, the bursts are so infrequent and so limited that we can’t rely on them for raw CPU power; we have to consider the sustained performance of the whole CPU instead.
Your Perl script to measure CPU bursting, while simple, is pretty clever.
Samuel Clay
17 Sep 10 at 10:07 am
I did some additional long-term benchmarks. I simply used distributed.net’s client running for several days on ec2-micro as well as a variety of slow machines to compare against. This is by no means scientific, and is also only really reflective of CPU speed (no disk IO), but here are the results:
7,048,706 nodes/s = Atom 330, 1.6 ghz dual core
2,161,104 nodes/s = ec2 micro instance
2,116,629 nodes/s = AMD Geode LX800, 500 mhz
540,693 nodes/s = AMD Geode SC1100, 266 mhz
General conclusion: ec2 micro instance is slow for long-term CPU intensive tasks. However, the CPU is reasonably fast if your CPU needs are bursty.
However, because of the pausing behavior of the rate limiter, a micro instance is never going to be appropriate for anything interactive or where response latency is important. To illustrate this, just ping an ec2-micro instance while doing anything that burns CPU (such as distributed.net or even just an empty while loop). On my slow 266 mhz system you won’t even notice the CPU usage. On an ec2-micro instance, the pings will become horrible (1500ms peaks) and you’ll find it’s hard to even type at a bash prompt.
Don’t get me wrong – they’re great for dev work or just playing around. I’m just not sure they’re good for most production work unless the CPU load is very lightweight.
Matt Buford
20 Sep 10 at 5:51 pm
I’m using a free-tier micro instance to run a couple of websites. They seem to be doing rather well despite the limited memory.
Steven Stern
15 Nov 10 at 6:54 pm
Very nice writeup. I too am a huge fan of the EC2 Micro instance, and your work has helped. I’m curious if you looked into any of the larger instances for high cpu tasks. I’m also curious if one would ever see any performance difference on EC2 in 32-bit vs. 64-bit.
Jon Zobrist
30 Sep 11 at 4:11 am
Great post. DocSplit has been an excellent library for parsing through documents on my current project. Have you written anything about how you configured your EC2 instances to run DocSplit? I’d love to know more about this.
Paul Zaich
11 Oct 12 at 2:56 am