I was made aware of a synthetic benchmark that concerned Varnish today, and it looked rather suspicious. The services tested were Varnish, nginx, Apache and G-Wan, and G-Wan came out an order of magnitude faster than Varnish. This made me question the result. The first thing I noticed was AB, a tool I’ve long since given up trying to make behave properly. As there was no detailed data, I decided to give it a spin myself.
You will not find graphs. You will not find “this is best!”-quotes. I’m not even backing up my statements with httperf-output.
Disclaimer
This is not a comparison of G-Wan versus Varnish. It is not complete. It is not even a vague attempt at making either G-Wan or Varnish perform better or worse. It is not realistic, and it is in no way a reflection on the overall functionality, usability or performance of G-Wan.
Why not? Because I would be stupid to publicize such things without directly consulting the developers of G-Wan so that the comparison would be fair. I am a Varnish-developer.
This is a text about stress testing. Not the result of stress testing. Nothing more.
The basic idea
So G-Wan was supposedly much faster than Varnish. Its feature-set is also very narrow, as it goes about things differently. The test showed that Varnish, Apache and nginx were almost comparable in performance, whereas G-Wan was ridiculously much faster. The test was also conducted on a local machine (so no networking) and using AB. As I know that it’s hard to get nginx, Apache and Varnish to perform at the same level, this suggested to me that G-Wan did something differently that affected the test.
I installed G-Wan and Varnish on a virtual machine and started playing with httperf.
What to test
The easiest number to demonstrate in a test is the maximum request rate. It tells you what the server can do under maximum load. However, it is also the hardest test to do precisely and fairly across daemons of vastly different nature.
Another thing I have rarely written about is the response time of Varnish for average requests. This is often much more interesting to the end user, as your server isn’t going to be running at full capacity anyway. Fairness and concurrency are also highly relevant. A user doing a large download shouldn’t adversely affect other users.
I wasn’t going to bother with all that.
First test
The first test I did was “max req/s”-like. It quickly showed that G-Wan was very fast, and in fact faster than Varnish. At first glance. The actual request-rate was faster. The CPU-usage was lower. However, Varnish is massively multi-threaded, which skews the CPU measurements greatly and I wasn’t about to trust them.
Looking closer I realized that the real bottleneck was in fact httperf. With Varnish, it was able to keep more connections open and busy at the same time, and thus hit the upper limit of concurrency. This in turn gave subtle and easily ignored errors on the client which Varnish can do little about. It seemed G-Wan was dealing with fewer sessions at the same time, but faster, which gave httperf an easier time. This does not benefit G-Wan in the real world (nor does it necessarily detract from the performance), but it does create an unbalanced synthetic test.
I experimented with this quite a bit, and quickly concluded that the level of concurrency was much higher with Varnish. But it was difficult to measure. Really difficult. Because I did not want to test httperf.
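To make the concurrency point a bit more concrete, here is a minimal sketch using Little's law (average requests in flight = request rate × mean response time). The function name and the numbers are made up for illustration; they are not measurements of Varnish, G-Wan or httperf.

```python
def requests_in_flight(requests_per_second: float, mean_response_ms: float) -> float:
    """Little's law: average concurrency = arrival rate * mean time in system."""
    return requests_per_second * mean_response_ms / 1000


# Same request rate, different response times (illustrative numbers only).
# The slower-per-request case keeps ten times as many requests open at once,
# which is exactly the kind of load that can push the *client* to its limits.
few_sessions = requests_in_flight(10_000, 2)    # 2 ms per request
many_sessions = requests_in_flight(10_000, 20)  # 20 ms per request
print(few_sessions, many_sessions)
```

Two servers can therefore report the same req/s while asking very different things of the test tool, which is one way a client-side limit ends up deciding the result.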
The hardware I used was my home-computer, which is ridiculously overpowered. The VM (KVM) was running with two CPU cores and I executed the clients from the host-OS instead of booting up physical test-servers. (… That 275k req/s that’s so much quoted? Spotify didn’t skip a beat while it was running on the same machine.)
Conclusion
The more I tested this, the more I was able to produce any result I wanted by tweaking the level of concurrency, the degree of load, the amount of bandwidth required and so forth.
The response time of G-Wan seemed to deteriorate with load. But that might as well be the test environment. As the load went up, it took a long time to get a response. This is just not the case with Varnish at all. I ended up doing a little hoodwinking at the end to see how far this went, and the results varied extremely with tiny variations of test-parameters. The concurrency is a major factor. And the speed of Varnish at each individual connection played a huge part. At large amounts of parallel requests Varnish would be sufficiently fast with all the connections that httperf never ran into problems, while G-Wan would be more uneven and thus trigger failures (and look slower)…
My only conclusion is that it will take me several days to properly map out the performance patterns of Varnish compared to G-Wan. They treat concurrent connections vastly differently and perform very differently depending on the load-pattern you throw at them. Relating this to real traffic is very hard.
But this confirms my suspicion of the bogus-ness of the blog post that led me to perform these tests. It’s not that I mind Varnish losing performance tests if we are actually slower, but it’s very hard to stomach when the nature of the test is so dubious. The art of measuring realistic performance with synthetic testing is not one that can be mastered in an afternoon.
Lessons learned
(I think conclusions are supposed to be last, but never mind)
First: Be skeptical of unbalanced results. And of even results.
Second: Measure more than one factor. I’ve mainly focused on request-rate in my posts because I do not compare Varnish to anything but itself. Without a comparison it doesn’t make that much sense to provide reply latency (though I suppose I should start supplying a measure of concurrency, since that’s one of the huge strong-points of Varnish).
Third: Conclude carefully. This is an extension of the first lesson.
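As a rough illustration of the second lesson, the sketch below reports several factors from one test run instead of just the request rate. Everything in it is an assumption made for the example: the function name, the input format (a list of per-request latencies in milliseconds plus an error count) and the choice of a nearest-rank 99th percentile. It is not how httperf or any other tool reports its results.

```python
import statistics


def summarize_run(latencies_ms, errors, duration_s):
    """Summarize one test run with more than a single number."""
    completed = len(latencies_ms)
    if completed == 0 or duration_s <= 0:
        raise ValueError("need at least one completed request and a positive duration")
    ordered = sorted(latencies_ms)
    # Nearest-rank 99th percentile, using integer arithmetic for ceil(0.99 * n).
    rank = (99 * completed + 99) // 100
    return {
        "req_per_s": completed / duration_s,
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[rank - 1],
        "error_ratio": errors / (completed + errors),
    }


# Made-up data: most requests are fast, a few are very slow, some failed.
run = summarize_run([1.0] * 98 + [50.0] * 2, errors=25, duration_s=10)
print(run)
```

With data like this the request rate and the median both look fine, while the 99th percentile and the error ratio tell a different story. Looking at only one of them invites the wrong conclusion.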
…
A funny detail: While I read the license for the non-free G-Wan, which I always do for proprietary software, I was happy to see that it didn’t have a benchmark-clause (Oracle, anyone?). But it does forbid removing or modifying the Server:-header. It also forces me to give the G-Wan-guys permission to use my using of G-Wan in their marketing… Hmm — maybe I should … — err, never mind.
I actually ran a test of these 2 products a while back on Amazon AWS.
I tested on micro instances with 256 concurrent connections at once to retrieve static pages. When one connection was done, it was closed and instantly reopened. I ran the test for 6 hours; in the end Varnish won on both total connections handled and lowest average page response time. I also collected data on I/Os and dropped connections: Varnish uses many more I/Os than G-Wan, and G-Wan dropped many, many more connections than Varnish. Varnish actually had a very small number of dropped connections when compared to G-Wan.
The autoperf frontend for httperf is quite good for replaying access logs and finding the point where your app starts to fail.
https://github.com/igrigorik/autoperf
Nice article!
We actually did a comparison between Varnish and Citrix Netscaler a month back where we loadtested them both using our usual loadtesting tool, proxySniffer, on our loadtest clusters.
Varnish performed really well but was beaten by the Netscaler when it came to serving the most requests/s using a single byte file with a pre-filled cache containing 50k objects.
The CPU was constantly at 100% on the Varnish machine, which resulted in the classic behavior where the response time increased.
We pushed the varnish machine up to 12,600 req/s and the Netscaler up to around 44,500 req/s.
Kristian, do you know the maximum number of requests/s that has been pushed through a single Varnish machine and the specs of the machine?
/Erik
12,600 req/s is pretty low for Varnish. How was the ratio between connections and requests?
I’ve done 275k req/s myself on my xeon x5650 (single) without much hassle – the biggest challenge was finding power in the clients. At around 50-60k connections/s (connections, not requests) you’ll run into a bottleneck in the acceptor thread, but the request-rate can go far higher.
On a more modest computer (single opteron 148, I think) I did 27k req/s and 143k req/s on an aging quad xeon. See http://kristianlyng.wordpress.com/2010/01/13/pushing-varnish-even-further/ though (and a few other related posts).
Oh, and I’m pretty sure we hit 300k req/s using browsermob+wikia at VUG3, but I’m not sure, you’d have to ask Arthur Bergman. He can certainly handle a lot more load than my home computer.
You say that you “did 143k req/s on an aging quad xeon”. Great performances!
Can you explain why, in the test you are commenting on, Varnish is showing only 1/5th of these 143k req/s while other servers like nginx or Apache TS gave better results?
It would be interesting to know what caused the problem.
Hmm, not sure what you’re asking. Is this regarding some different test?
The 143k req/s was on a Quad core xeon, while the 27k req/s was on a single-core Opteron 148, and I haven’t compared this publicly to nginx or TS.
But we’re straying a bit off topic… The intention of this post was to illustrate the complexities of thorough benchmarking. I’ve posted results in the past regarding Varnish, but never comparing it to other solutions as that would require me to be much more thorough in my analysis. And I don’t want to comment on nginx or TS-performance for just those reasons. I can only comment on Varnish-performance and the complexity involved in benchmarking.
Generating the amount of traffic we are talking about here is not trivial. And very often the nature of the tool in use will determine the outcome of the test. So when two servers behave slightly differently, it’s natural to assume that the difference affects the test tool (unintentionally) in a way that offsets the result. A simple example would be a tool which only does one request per connection — a server optimized for fast connection-handling will then handle itself better than a server optimized for concurrent connections, each with several requests before they disconnect.
As it happens, a lot of tools DON’T do keep-alive (siege comes to mind) and most of the tools don’t do it by default or don’t do it very well.
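To put numbers on the keep-alive point, here is a small sketch of how the same request rate translates into very different connection rates depending on how many requests the tool sends per connection. The function name and the figures are illustrative assumptions, not measurements; the only figure taken from this thread is the rough 50-60k connections/s at which the acceptor thread becomes a bottleneck.

```python
def connections_per_second(requests_per_second: float, requests_per_connection: int) -> float:
    """Connection rate a server must accept to sustain a given request rate."""
    if requests_per_connection < 1:
        raise ValueError("a connection carries at least one request")
    return requests_per_second / requests_per_connection


# A tool without keep-alive: every request needs its own connection.
no_keepalive = connections_per_second(50_000, 1)
# A tool doing 10 requests per connection (an assumed value).
with_keepalive = connections_per_second(50_000, 10)
print(no_keepalive, with_keepalive)
```

In the first case the test is mostly exercising connection setup, right around the acceptor limit mentioned above. In the second the server sees a tenth of the connections for the same request rate, so the two runs are measuring rather different things.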
Another example is how the tool behaves when it’s cpu-starved. Is it better at dealing with a server which will focus mainly on one connection at a time and thus letting the tool avoid context switching, or does it handle servers that answer all connections fairly just as well? I don’t have an example for that, though, but it’s something I’d have to measure if I ever were to compare one server to another at the level of performance we’re talking about.
Kristian, how were the tests done when you reached 300k req/s (or 275k req/s)? We tested to push data as fast as possible against Varnish and Netscaler. So it was basically to see how many requests/s it could handle.
The test where we reached 12,600 req/s had around 20-100 concurrent users, where connections and users are one-to-one. Every user made 1 request before it left the loop and started a new one. So each connection and user made 1 request.
I’m curious to see how the tests were performed when you reached 270-300k req/s. What kind of tool did you use to generate the load? How big was the cluster and what were the specifications of the server?
/E
Seeing the references to Apache in your blog post (e.g. hard to get nginx, Apache and Varnish to perform in the same ballpark), I thought I’d point out that the blog post in question included testing on Apache Traffic Server, a very different thing than Apache httpd. Just being cautious, since you only refer in your blog post to “Apache”, and most people will assume “Apache” = httpd.
en.wikipedia.org/wiki/Traffic_Server
Apache HTTPd vs Apache TS is a valid point… But given that I explicitly pointed out that I didn’t actually want to comment on any specific blog post or service, and that I didn’t test anything related to Apache, I hardly think it matters whether I am specific as to whether it’s Apache Traffic Server or Apache HTTPd I’m not talking about. Yes, this post was written after I read a specific blog post, but it’s not in any way meant as a direct commentary or critique. It’s only a commentary on the general practice of benchmarking (which I don’t do – I do stress testing, which is not done for comparative purposes).
Just for the record: I’ve gotten a few similar comments (unlike the previous comment, the others were from the same AS, and I’ll leave it to the reader to decide if that’s enough to conclude on anything), either commenting on findings or asking for details on my setup… Given that the entire blog post is about being cautious when benchmarking, and not how to do it, I’ve chosen to not accept them to avoid derailing from the intent of the blog post.
Just a small reminder to all those of you posting requests for details on the “benchmarking”: Please read the post again.
I say again: I am a Varnish developer, I can not with any credibility post benchmarks that compare Varnish to anything except Varnish. The blog post is, as I’ve stated several times, a reflection on the pitfalls of benchmarking.
If you want hardware details, notes on whether I’ve used jmeter instead or for me to “defend” my position: Please give up.
Oh, and also, changing your name and e-mail doesn’t fool me much when I see numerous different comments from the same AS all with the rather obvious trolling intent. At least use a proxy, if not outside your country, then at least outside your AS. It’s not that it offends, it’s just pathetic. The first comment which was made several months before this blog post wasn’t accepted because I honestly couldn’t distinguish it from spam.
Since the blog post is ancient by now and getting rather weird comments, I’m closing the comments.