Kristian Lyngstol's Blog

A free software-hacker's blog

The many pitfalls of benchmarking

I was made aware of a synthetic benchmark that concerned Varnish today, and it looked rather suspicious. The services tested were Varnish, nginx, Apache and G-Wan, and G-Wan came out an order of magnitude faster than Varnish. This made me question the result. The first thing I noticed was AB, a tool I’ve long since given up trying to make behave properly. As there was no detailed data, I decided to give it a spin myself.

You will not find graphs. You will not find “this is best!”-quotes. I’m not even backing up my statements with httperf-output.


This is not a comparison of G-Wan versus Varnish. It is not complete. It is not even a vague attempt at making either G-Wan or Varnish perform better or worse. It is not realistic, and it is in no way a reflection on the overall functionality, usability or performance of G-Wan.

Why not? Because I would be stupid to publicize such things without directly consulting the developers of G-Wan so that the comparison would be fair. I am a Varnish-developer.

This is a text about stress testing. Not the result of stress testing. Nothing more.

The basic idea

So G-Wan was supposedly much faster than Varnish. Its feature set is also very narrow, as it goes about things differently. The test showed that Varnish, Apache and nginx were almost comparable in performance, whereas G-Wan was ridiculously much faster. The test was also conducted on a local machine (so no networking) and using AB. As I know it’s hard to get nginx, Apache and Varnish to perform at the same level, this indicated to me that G-Wan was doing something differently that skewed the test.

I installed G-Wan and Varnish on a virtual machine and started playing with httperf.
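For those unfamiliar with httperf, a run looks something like the following. The numbers and addresses here are purely for illustration, not the parameters I actually used:

    # Open 10000 connections at a rate of 1000 per second, one request per
    # connection, against a server listening locally on port 8080.
    httperf --hog --server 127.0.0.1 --port 8080 --uri /index.html \
            --rate 1000 --num-conns 10000 --num-calls 1 --timeout 5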

What to test

The easiest number to demonstrate in a test is the maximum request rate. It tells you what the server can do under maximum load. However, it is also the hardest test to do precisely and fairly across daemons of vastly different nature.

Another thing I have rarely written about is the response time of Varnish for average requests. This is often much more interesting to the end user, as your server isn’t going to be running at full capacity anyway. Fairness and concurrency are also highly relevant. A user doing a large download shouldn’t adversely affect other users.

I wasn’t going to bother with all that.

First test

The first test I did was “max req/s”-like. It quickly showed that G-Wan was very fast, and in fact faster than Varnish. At first glance. The actual request rate was higher. The CPU usage was lower. However, Varnish is massively multi-threaded, which skews the CPU measurements greatly, so I wasn’t about to trust them.

Looking closer I realized that the real bottleneck was in fact httperf. With Varnish, it was able to keep more connections open and busy at the same time, and thus hit the upper limit of concurrency. This in turn gave subtle and easily ignored errors on the client which Varnish can do little about. It seemed G-Wan was dealing with fewer sessions at the same time, but faster, which gave httperf an easier time. This does not benefit G-Wan in the real world (nor does it necessarily detract from the performance), but it does create an unbalanced synthetic test.

I experimented with this quite a bit, and quickly concluded that the level of concurrency was much higher with Varnish. But it was difficult to measure. Really difficult. Because I did not want to test httperf.

The hardware I used was my home-computer, which is ridiculously overpowered. The VM (KVM) was running with two CPU cores and I executed the clients from the host-OS instead of booting up physical test-servers. (… That 275k req/s that’s so much quoted? Spotify didn’t skip a beat while it was running (on the same machine). ;))


The more I tested this, the more I was able to produce any result I wanted by tweaking the level of concurrency, the degree of load, the amount of bandwidth required and so forth.

The response time of G-Wan seemed to deteriorate with load, but that might just as well have been the test environment. As the load went up, it took a long time to get a response. This is just not the case with Varnish at all. I ended up doing a little hoodwinking at the end to see how far this went, and the results varied extremely with tiny variations of the test parameters. Concurrency is a major factor, and the speed of Varnish on each individual connection played a huge part. With large amounts of parallel requests, Varnish would handle all the connections fast enough that httperf never ran into problems, while G-Wan would be more uneven and thus trigger failures (and look slower)…

My only conclusion is that it will take me several days to properly map out the performance patterns of Varnish compared to G-Wan. They treat concurrent connections vastly differently and perform very differently depending on the load pattern you throw at them. Relating this to real traffic is very hard.

But this confirms my suspicion about the bogus-ness of the blog post that led me to perform these tests. It’s not that I mind Varnish losing performance tests if we are actually slower, but it’s very hard to stomach when the nature of the test is so dubious. The art of measuring realistic performance with synthetic testing is not one that can be mastered in an afternoon.

Lessons learned

(I think conclusions are supposed to be last, but never mind)

First: Be skeptical of unbalanced results. And of even results.

Second: Measure more than one factor. I’ve mainly focused on request rate in my posts because I do not compare Varnish to anything but itself. Without a comparison it doesn’t make that much sense to provide reply latency (though I suppose I should start supplying a measure of concurrency, since that’s one of the huge strong points of Varnish).

Third: Conclude carefully. This is an extension of the first lesson.

A funny detail: While I read the license for the non-free G-Wan, which I always do for proprietary software, I was happy to see that it didn’t have a benchmark clause (Oracle, anyone?). But it does forbid removing or modifying the Server:-header. It also forces me to give the G-Wan guys permission to use the fact that I use G-Wan in their marketing… Hmm — maybe I should … — err, never mind.

Varnish Seminar in Paris

I will be in Paris next week to participate in a seminar on Varnish at Capgemini’s premises. If you are in the area and interested in Varnish, take a look. The nature of the event is informational, aimed at technical minds.

(This must be my shortest blog-post by far)

High-End Varnish – 275 thousand requests per second.

Varnish is known to be quite fast. But how fast? My very first Varnish-job was to design a stress testing scheme, and I did so. But it was never really able to push things to the absolute max. Because Varnish is quite fast.

In previous posts I’ve written about hitting 27k requests per second on an aging Opteron, and then about reaching 143k requests per second on a more modern quad-core using a ton of test clients.

Recently, we were going to do a stress test at a customer setup before putting it live. The setup consisted of two dual Xeon X5670 machines. The X5670 is a 2.93GHz six-core CPU with hyperthreading, giving these machines 12 CPU cores and 24 CPU threads. Quite fast. During our tests, I discovered some httperf secrets (sigh…) and was able to push things quite far. This is what we learned.

The hardware and software

As described above, we only had two machines for the test. One was the target and the other the originating machine. The network was gigabit.

Varnish 2.1.3 on 64bit Linux.

Httperf for client load.

The different setups to be tested

Our goal was not to reach the maximum limits of Varnish, but to ensure the site was ready for production. That’s quite tricky on many accounts.

The machines were originally configured with heartbeat and haproxy.

One test I’m quite fond of is site traversal while hitting a “hot” set at the same time. The intention is to test how your site fares if a ruthless search bot hits your site. Does your front page slow down? As far as Varnish goes, it tests the LRU-capabilities and how it deals with possibly overloaded backend servers.

We also switched out haproxy in favor of a dual-varnish setup. Why? Two reasons: 1. Our expertise is within the realm of Varnish. 2. Varnish is fast and does keep-alive.

When testing a product like Varnish we also have to take the balance between requests and connections into account. You’ll see shortly that this is very important.

During our tests, we also finally got httperf to stress the threading model of Varnish. With a tool like siege, concurrency is defined by the threading level. That’s not the case with httperf, and we were able to do several thousand _concurrent_ connections.
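With httperf, concurrency follows from the connection rate and how long each connection stays busy, not from a thread count. A hypothetical invocation along these lines (server address and URI are made up) gives a 1:10 connection-to-request ratio, and if responses are slow, concurrent sessions simply pile up:

    # 2000 new connections per second, 10 requests on each connection.
    # If the server lags, httperf keeps opening connections on schedule,
    # so the number of concurrent sessions grows on its own.
    httperf --hog --server 192.168.1.100 --port 80 --uri /redirect \
            --rate 2000 --num-conns 100000 --num-calls 10 --timeout 5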

As the test progressed, we reduced the size of the content and it became more theoretical in nature.

As the specifics of the backends are not that relevant, I’ll keep to the Varnish-specific bits for now.

I ended up using a 301 redirect as a test this time. Mostly because it was there. Towards the end, I had to remove various varnish-headers to free up bandwidth.
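Stripping those headers is a one-liner per header in vcl_deliver. Something along these lines; exactly which headers you can afford to drop depends on your setup:

    sub vcl_deliver {
        # Drop response headers that only cost bandwidth in a synthetic test.
        remove resp.http.Via;
        remove resp.http.X-Varnish;
        remove resp.http.Age;
    }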

Possible bottlenecks

The most obvious bottleneck during a test like this is bandwidth. That is the main reason for reducing the size of objects served during testing.

Another bottleneck is how fast your web servers are. A realistic test requires cache misses, and cache misses require responsive web servers.

Slow clients are a problem too. Unfortunately testing that synthetically isn’t easy. Lack of test-clients has been an issue in the past, but we’ve solved this now.

CPU? Traditionally, the cpu-speed isn’t much of an issue with Varnish, but when you rule out slow backends, bandwidth and slow clients, the cpu is the next barrier.

One thing that’s important in this test is that the sheer amount of parallel execution threads is staggering. My last “big” test had 4 execution threads, this one has 24. This means we get to test contention points that only occur if you have massive parallelization. The most obvious bottleneck is the acceptor thread: the thread charged with accepting connections and delegating them to worker threads. Even though multiple thread pools are designed to alleviate this problem, the actual accept() call is done in a single thread of execution.


As Artur Bergman of Wikia has already demonstrated, the number of TCP connections Varnish is able to accept per second is currently our biggest bottleneck. Fortunately for most users, Artur’s workload is very different from most other Varnish users’. We (Varnish Software) typically see a 1:10 ratio between connections and requests. Artur suggested he’s closer to 1:3 or 1:4.

During this round of tests I was easily able to reach about 40k connections/s. However, going much above that is hard. For a “normal” workload, that would allow 400k requests/second, which is more than enough. However, it should be noted that the accept-rate goes somewhat down as the general load increases.

It was interesting to note that this was largely unaffected by having two varnishes in front of each other. This essentially confirms that the acceptor is the bottleneck.

There wasn’t much we could do to affect this limit either. Increasing the listen_depth isn’t going to help you in a synthetic test. The listen_depth defines how many outstanding connections are allowed to queue up before the kernel starts dropping them. In the real world, the connection rate will be sporadic, and on an almost-overloaded system it might help to increase the listen depth, but in a synthetic test the connection rate is close to constant. That means increasing the listen depth just means there’s a bigger queue to fill – and it will fill anyway.
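For reference, listen_depth is an ordinary varnishd run-time parameter, so trying different values is easy. Addresses and values below are only examples, and depending on the version a change like this may only take effect once the child process is restarted:

    # Set it at startup ...
    varnishd -a :80 -b localhost:8080 -p listen_depth=4096

    # ... or adjust it through the management interface.
    varnishadm -T localhost:6082 param.set listen_depth 4096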

The number of thread pools had little effect either. By the time the connection is delegated to a thread pool, it’s already past the accept() bottleneck.

Now, keep in mind that this is still a staggering number. But it’s also an obvious bottleneck for us.

Request rate

The raw request rate is essentially defined by how big the request is compared to bandwidth, how much CPU power is available and how fast you can get the requests into a thread.

As we have already established that the acceptor-thread is a bottleneck, we needed to up the number of requests per connection. I tested mostly with a 1:10 ratio. This is the result of one such test:

202,832 requests per second while doing roughly 20,000 connections/s. Quite a number.

It proved difficult to exceed this.

At about 226k req/s the bandwidth limit of 1gbit was easily hit. To reach that, I had to reduce the connection-rate somewhat. The main reason for that, I suspect, is increased latency when the network is saturated.

At this point, Varnish was not saturating the CPU. It still had 30-50% idle CPU power.

Just for kicks and giggles, I wanted to see how far we could really get, so I threw in a local httperf, thereby ignoring large parts of the network issue. At that point, Varnish was serving roughly 1gbit of traffic over the network and a few hundred mbit locally.

So that’s 275k requests/s. The connection rate at that point was lousy, so not very interesting. And because httperf was running locally, the load on the machine wasn’t very predictable. Still, the machine was snappy.

But what about the varnish+varnish setup?

The above numbers are for a single Varnish server. However, when we tested with varnish as a load balancer in front of Varnish, the results were pretty identical – except divided by two.

It was fairly easy to do 100k requests/second on both the load balancer and the varnish server behind it – even though both were running on the same machine.

The good thing about Varnish as a load balancer is its keep-alive nature, speed and flexibility. The contention point of Varnish is long before any balancing is actually done, so you can have a ton of logic in your “Varnish load balancer” without worrying about load increasing with complexity.
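As a rough sketch, a Varnish load balancer in front of two Varnish caches needs little more than this (backend names and addresses are made up for the example):

    backend cache1 { .host = "192.168.1.11"; .port = "80"; }
    backend cache2 { .host = "192.168.1.12"; .port = "80"; }

    director lb round-robin {
        { .backend = cache1; }
        { .backend = cache2; }
    }

    sub vcl_recv {
        # Any amount of extra logic can go here; the accept() bottleneck
        # sits before VCL is ever run.
        set req.backend = lb;
    }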

We did, however, discover that the number of HTTP header overflows would spike on the second varnish server. We’re investigating this. The good news is that it was not visible on the user-side.

The next step

I am re-doing part of our internal test infrastructure (or rather: shifting it around a bit) to test the acceptor thread regularly.

I also discovered an assert issue during some sort of race at around 220k req/s, but that was only under certain very very specific situations. It was not possible to reproduce on anything that wasn’t massively parallel and almost saturated on CPU.

We’re also constantly improving our test routines both for customer setups and internal quality assurance on the Varnish code base. I’ve already written several load-generating scripts for httperf to allow us to test even more realistic work loads on a regular basis.

What YOU should care about

The only thing that made a real difference while tuning Varnish was the number of threads. And making sure it actually caches.
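The thread settings are plain varnishd parameters. A hypothetical starting point could look like this; the right values depend entirely on your traffic:

    varnishd -a :80 -f /etc/varnish/default.vcl \
             -p thread_pools=4 \
             -p thread_pool_min=100 \
             -p thread_pool_max=4000 \
             -p thread_pool_add_delay=2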

Beyond that, it really doesn’t matter much. Our defaults are good.

However, keep in mind that this does NOT address what happens when you start hitting disk. That’s a different matter entirely.

The Client director

Some time in the middle of the night before 2.1.0, I implemented a director that used the client IP to direct traffic. The goal was to direct the same machine to the same backend – cheap session stickiness. It took about an hour to hack up and another to pretty up.

Some time during roughly the same night, PHK refactored all the director infrastructure and in the process implemented a client director and a hash director as a sort of side-job. PHK’s version obviously entered trunk and mine never saw the light of day (possibly because it was December and there isn’t much daylight to see here in December anyway). At least the duplicated effort wasn’t significant.

We’ve kept that somewhat hidden, mostly because the actual code for the hash and client director is about a screenful of text once the VCL bits are taken care of. And it’s such a small feature. Truth be told, they both live within the random director and are just special exceptions. The VCL syntax is the same, except now it’s called a client director instead of random.

First up, the client director will use the ASCII representation of the client IP to pick a “random” backend (there is a separate explanation of why it’s the ASCII representation).

There isn’t much to say about it. Your VCL will be the same as if it was a random director. If the “canonical backend” for the client is sick, the regular random algorithm will be used to pick a backend from the healthy ones.
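For illustration, a client director declaration could look like this (backend names and addresses are made up; as noted above, the syntax mirrors the random director):

    backend web1 { .host = "10.0.0.1"; .port = "80"; }
    backend web2 { .host = "10.0.0.2"; .port = "80"; }

    director sticky client {
        { .backend = web1; .weight = 1; }
        { .backend = web2; .weight = 1; }
    }

    sub vcl_recv {
        # The same client (by default: the same IP) ends up on the same
        # backend, as long as that backend is healthy.
        set req.backend = sticky;
    }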

In Varnish 2.1.4 you will also get a “client.identifier” VCL variable that you can use to tell varnish what identifies this as a unique client. In theory, you could pass a cookie to it and avoid wondering what happens if an IP changes. If it is set, the client director will use that instead of the IP. The only problem I have come up with using that approach is bootstrapping.

The very first request a client makes will have no cookies, so no data can be passed to client.identifier. The backend will presumably set a session cookie of some sort, so the next request could use that session cookie, but that means the first and second requests are likely to go to different backends, even if all following requests go to the same backend.

Typical load balancers solve this by just setting the cookie themselves. I suspect that is what it will come to, and I also suspect that it will be a lot nicer to do that in Varnish 3.0.0 when vmods are in place, which could easily deal with this sort of thing without messing up your regular VCL…
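For completeness, here is a minimal sketch of the cookie variant, reusing the “sticky” director from the example above and assuming a 2.1.4 setup where the backend sets some session cookie. It does not solve the bootstrapping problem described above, and a real setup would extract just the relevant cookie rather than use the whole header:

    sub vcl_recv {
        if (req.http.Cookie) {
            # Identify the client by its cookies instead of its IP.
            set client.identifier = req.http.Cookie;
        }
        # With no cookie set, the client director falls back to the client IP.
        set req.backend = sticky;
    }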

But until then, try out the client director for normal IPs at least. The client director is available in 2.1.3 (I think it’s even in 2.1.0, though I can’t remember. Definitely not documented in 2.1.0, though).

I’m very interested in your take on how to deal with sessions, whether just basing it on IP is “close enough”, and whether you think the client director will be able to solve most session-stickiness scenarios.

