The Value and Perils of Performance Benchmarks in the Wake of TechEmpower's Web Framework Benchmark

TechEmpower's Web Framework Benchmark is quite interesting, but the comments on it at Hacker News are even more so - at least the constructively critical ones, which highlight many of the issues with benchmarks while also reminding us of their value. One could formulate the benchmark paradox:

Benchmarks are important for rational technological choices, yet it is very hard, if not impossible, to perform them in a sensible way.


I would like to record some of the important points here, mainly as a future reference for myself for whenever I am dealing with benchmarking.



Some background



The benchmark published by TechEmpower's developers compares the performance of many web frameworks, written in different languages and run on two different hardware configurations, in three tests: returning a simple, hard-coded JSON response, performing a single random database select and returning the result, and performing an increasing number of selects per request. They have tried to configure the frameworks for best performance and asked the community to help them tune the setups (as nobody can know everything). The goal was to establish a "baseline performance," i.e. one from which everything can only get worse, not to assess the performance of the frameworks in general, which is impossible without real usage scenarios.

The benchmark and setups are available on GitHub. It might also be interesting to check the history to see how individual frameworks have been tuned over time based on advice from experts.
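To make the "baseline performance" idea concrete, the hard-coded JSON test essentially boils down to a handler like the minimal sketch below. It uses the JDK's built-in com.sun.net.httpserver purely for illustration (the benchmarked frameworks of course run on their own servers), and the response body is just a hard-coded greeting:

```java
// A minimal sketch of what the "hard-coded JSON" test boils down to,
// using the JDK's built-in HTTP server; illustration only, not the benchmark's code.
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class JsonBaselineServer {
    public static void main(String[] args) throws IOException {
        byte[] body = "{\"message\":\"Hello, World!\"}".getBytes(StandardCharsets.UTF_8);
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/json", exchange -> {
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start(); // the benchmark then measures peak responses/sec against such an endpoint
    }
}
```

Anything a real application adds on top of such a handler (routing, templating, database access) can only push the numbers down, which is exactly what "baseline" means here.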

The value and perils of benchmarking



Key points



  • General: Constructive criticism, preferably accompanied by pull requests, is valuable; "this is bullshit" comments are not
  • Include metrics other than just peak resp./sec (e.g. latency, max and avg response time from the end user's point of view)
  • The configuration of the framework, server etc. and the choice of server/... matters, occasionally a lot => tune for your use cases
    • Details may matter a lot as well - for example whether the JSON ObjectMapper is created only once or anew for each request (see the first sketch below this list)
  • If the difference between frameworks A and B is 10 ms vs 100 ms per request: does it mean that B is ten times slower, or that it is slower by 90 ms? That is a big difference once the request processing itself gets more resource-intensive => beware of (raw performance) stats & conclusions (a worked example follows below this list)
  • The benchmark compares apples with oranges, i.e. rich frameworks such as Play! with raw infrastructural elements such as Netty; be aware of that
  • Raw performance for edge cases (such as returning a hard-coded string) matters little for real-world cases (compare how much the differences shrink between the simple-response test and the multiple-DB-queries test)
  • Mean is not enough. There should also be at least the standard deviation, preferably also the key percentiles
  • The more realistic the test scenario, the less we know what we actually measure, as the chance of other/invisible influences (confounding) grows rapidly
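To illustrate the ObjectMapper detail mentioned above: with Jackson (used here only as an example), a mapper is relatively expensive to construct and builds internal caches lazily, but it is thread-safe once configured, so the idiomatic approach is to create it once and reuse it. A minimal sketch of the two variants:

```java
// Sketch of the "ObjectMapper created once vs. per request" detail, using Jackson.
// Jackson is only an example; the same principle applies to any expensive-to-build,
// reusable component.
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

public class JsonHandlers {

    // Variant 1: one shared instance. ObjectMapper is thread-safe after configuration,
    // so it can be created once and reused across requests.
    private static final ObjectMapper SHARED_MAPPER = new ObjectMapper();

    static String handleRequestShared(Map<String, Object> payload) throws JsonProcessingException {
        return SHARED_MAPPER.writeValueAsString(payload);
    }

    // Variant 2: a new instance per request. Construction (and the caches Jackson builds
    // lazily per mapper) is repeated every time, which shows up directly in a
    // "serialize a trivial object" benchmark.
    static String handleRequestPerRequest(Map<String, Object> payload) throws JsonProcessingException {
        return new ObjectMapper().writeValueAsString(payload);
    }

    public static void main(String[] args) throws JsonProcessingException {
        Map<String, Object> payload = Map.of("message", "Hello, World!");
        System.out.println(handleRequestShared(payload));
        System.out.println(handleRequestPerRequest(payload));
    }
}
```

In a test that does nothing but serialize a trivial object, the per-request variant pays the construction cost on every call, so this single detail can visibly shift a framework's position in the ranking.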

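And a worked example of the 10 ms vs. 100 ms point, with made-up numbers: the absolute gap between the frameworks stays the same, while the relative gap collapses as soon as the request does real work.

```java
// Made-up numbers illustrating relative vs. absolute differences.
// Framework overhead alone: A = 10 ms, B = 100 ms  =>  B looks 10x slower.
// Add 400 ms of actual application work (DB queries, rendering, ...):
public class RelativeVsAbsolute {
    public static void main(String[] args) {
        double overheadA = 10, overheadB = 100, appWork = 400;  // milliseconds
        double totalA = overheadA + appWork;                    // 410 ms
        double totalB = overheadB + appWork;                    // 500 ms
        System.out.printf("B/A overhead ratio: %.1fx%n", overheadB / overheadA);    // 10.0x
        System.out.printf("B/A total ratio:    %.2fx%n", totalB / totalA);          // ~1.22x
        System.out.printf("Absolute difference stays: %.0f ms%n", totalB - totalA); // 90 ms
    }
}
```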

Me: Asynchronous isn't always better; it has its own overhead from switching the tasks executed by a thread, and it is only beneficial when the number of threads is a limiting factor.
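A small sketch of that difference (my own illustration, not part of the benchmark): the asynchronous variant has to submit a task, schedule it and compose a continuation, which only pays off when request threads themselves are the scarce resource (many slow, mostly-waiting requests).

```java
// Blocking vs. asynchronous handling of the same (simulated) blocking call.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingVsAsync {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    static String loadFromDb() {
        // stand-in for a blocking I/O call
        try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return "row";
    }

    // Blocking: the request thread waits, but there is no extra machinery.
    static String handleBlocking() {
        return loadFromDb().toUpperCase();
    }

    // Asynchronous: the request thread is freed, at the cost of submitting a task,
    // scheduling it on the pool and composing the continuation.
    static CompletableFuture<String> handleAsync() {
        return CompletableFuture.supplyAsync(BlockingVsAsync::loadFromDb, POOL)
                                .thenApply(String::toUpperCase);
    }

    public static void main(String[] args) {
        System.out.println(handleBlocking());
        System.out.println(handleAsync().join());
        POOL.shutdown();
    }
}
```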

Quotes & comments



bradleyland #1:

I fully respect the JVM family of languages as well. I just think that Mark Twain said it best when he said: "There are three kinds of lies: lies, damned lies, and statistics." It's not that the numbers aren't true, it's that they may not matter as much, and in the way, that we initially perceive them. ... Issue #1) The 30-50x performance difference only exists in a very limited scenario that you're unlikely to encounter in the real world. ... The error would be in extrapolating that a move to Gemeni or Node.js would give you a 37x or 15x performance increase in your application. To understand why this is an error, we jump down to the "Database access test (multiple queries)" benchmark. [JH: the difference in that test was much smaller] ... Issue #2) Performance differences for one task doesn't always correlate proportionally with performance differences for all tasks.


bradleyland #2:

In statistics, there exists something called confounding variables. Confounding variables are factors that affect your outcome, but are hidden, or are outside your control. As your example becomes more complex the opportunity for confounding to impact your result goes up significantly .... Basing your choice of framework on speed alone is a pretty bad idea. When you select a framework, you need to optimize for success, and the factors that determine the success of a project are often less related to the speed of a framework and more related to the availability of talent and good tools.


Side note: You need to know statistics for benchmarking



Zed Shaw has published an excellent rant "Programmers Need To Learn Statistics Or I Will Kill Them All" pointing out some common errors when doing benchmarking and performance testing:

  1. Beware of your confidence level (e.g. 95% => the true value of the measured metric lies with 95% probability within the confidence interval, roughly mean ± 2 × std. dev. / √n) and calculate the sample size accordingly; beware of different sampling strategies (one batch of 1000 vs. 10 batches of 100 experiments). Don't just guesstimate that e.g. 1000 repetitions is enough.
  2. An average without a std. dev. is useless (it says nothing about the consistency of the behavior) - a small example follows below this list
  3. Confounding (a hidden variable influencing both the variable you control and the one you measure) => it is impossible to compare two things unless you can either control or account for the influence of all other factors, which you usually can't
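As a minimal illustration of points 1 and 2 (my own sketch, not Zed Shaw's code), reporting the standard deviation, an approximate 95% confidence interval for the mean and a couple of percentiles alongside the mean is little extra work:

```java
// Minimal sketch of the statistics you want alongside the mean of benchmark samples:
// standard deviation, an approximate 95% confidence interval for the mean
// (mean ± ~1.96 * stddev / sqrt(n), assuming enough roughly independent samples),
// and a few percentiles. The sample values are made up.
import java.util.Arrays;

public class BenchmarkStats {
    public static void main(String[] args) {
        double[] latenciesMs = {12.1, 11.8, 13.0, 12.4, 55.2, 12.0, 11.9, 12.6, 12.2, 48.7};

        int n = latenciesMs.length;
        double mean = Arrays.stream(latenciesMs).average().orElse(Double.NaN);
        double variance = Arrays.stream(latenciesMs).map(x -> (x - mean) * (x - mean)).sum() / (n - 1);
        double stdDev = Math.sqrt(variance);
        double ci95 = 1.96 * stdDev / Math.sqrt(n); // half-width of the ~95% CI for the mean

        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        double p50 = sorted[(int) Math.ceil(0.50 * n) - 1]; // nearest-rank percentiles
        double p95 = sorted[(int) Math.ceil(0.95 * n) - 1];

        System.out.printf("mean = %.1f ms, stddev = %.1f ms, 95%% CI = [%.1f, %.1f] ms%n",
                mean, stdDev, mean - ci95, mean + ci95);
        System.out.printf("p50 = %.1f ms, p95 = %.1f ms%n", p50, p95);
        // The two outlier samples barely move the median but inflate mean and stddev -
        // exactly the inconsistency that a bare mean hides.
    }
}
```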


Conclusion



Use benchmarks, but always be very cautious about them. Try to benchmark in multiple different ways, using different hardware and different tools. Always also perform performance testing and monitoring on your real application, with real scenarios and setup (again preferably in multiple ways) - the results are likely to differ a lot from all benchmarks.

For a Java app you can start by profiling it with VisualVM's sampler (which is based on periodic thread dump snapshots), then start tracking performance at important spots using a performance monitoring library while correlating it with CPU and memory usage.
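As one possible concrete setup (Dropwizard Metrics is my example here, not something the article prescribes; any monitoring library with timers and percentiles will do), wrapping the important spots in a timer gives you rates and latency percentiles that you can then correlate with CPU and memory usage:

```java
// Sketch of tracking performance at an important spot with Dropwizard Metrics.
// The service and metric names are hypothetical.
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class OrderServiceMonitoring {
    private static final MetricRegistry REGISTRY = new MetricRegistry();
    // hypothetical "important spot" in the application
    private static final Timer PLACE_ORDER_TIMER = REGISTRY.timer("orders.place");

    public static void placeOrder() {
        try (Timer.Context ignored = PLACE_ORDER_TIMER.time()) {
            doWork(); // the actual business logic / DB calls being measured
        }
    }

    private static void doWork() {
        try { Thread.sleep(10); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        // Report rates and latency percentiles (p50, p75, p95, ...) to the console;
        // in a real app you would ship them to your monitoring system instead.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(REGISTRY)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .convertRatesTo(TimeUnit.SECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS);

        for (int i = 0; i < 100; i++) {
            placeOrder();
        }
        reporter.report(); // print one final snapshot
    }
}
```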

See also




Tags: opinion performance

