Kristian Lyngstol's Blog

A free software-hacker's blog

Tag Archives: tuning

High-End Varnish – 275 thousand requests per second.

Varnish is known to be quite fast. But how fast? My very first Varnish job was to design a stress-testing scheme, and I did. But it was never really able to push things to the absolute max, because Varnish is quite fast.

In previous posts I’ve written about hitting 27k requests per second on an aging Opteron (see https://kristianlyng.wordpress.com/2009/10/19/high-end-varnish-tuning/) and about reaching 143k requests per second on a more modern quad-core using a ton of test clients (see https://kristianlyng.wordpress.com/2010/01/13/pushing-varnish-even-further/).

Recently, we were going to do a stress test on a customer setup before putting it live. The setup consisted of two dual Xeon X5670 machines. The X5670 is a 2.93GHz six-core CPU with hyperthreading, giving these machines 12 CPU cores and 24 CPU threads each. Quite fast. During our tests, I discovered some httperf secrets (sigh…) and was able to push things quite far. This is what we learned.

The hardware and software

As described above, we only had two machines for the test. One was the target and the other the originating machine. The network was gigabit.

Varnish 2.1.3 on 64bit Linux.

Httperf for client load.

The different setups to be tested

Our goal was not to reach the maximum limits of Varnish, but to ensure the site was ready for production. That’s quite tricky on several counts.

The machines were originally configured with heartbeat and haproxy.

One test I’m quite fond of is site traversal while hitting a “hot” set at the same time. The intention is to test how your site fares if a ruthless search bot hits your site. Does your front page slow down? As far as Varnish goes, it tests the LRU-capabilities and how it deals with possibly overloaded backend servers.

We also switched out haproxy in favor of a dual-varnish setup. Why? Two reasons: 1. Our expertise is within the realm of Varnish. 2. Varnish is fast and does keep-alive.

When testing a product like Varnish we also have to take the balance between requests and connections into account. You’ll see shortly that this is very important.

During our tests, we also finally got httperf to stress the threading model of Varnish. With a tool like siege, concurrency is defined by the threading level. That’s not the case with httperf, and we were able to do several thousand _concurrent_ connections.

As the test progressed, we reduced the size of the content and it became more theoretical in nature.

As the specifics of the backends are not that relevant, I’ll keep to the Varnish-specific bits for now.

I ended up using a 301 redirect as a test this time. Mostly because it was there. Towards the end, I had to remove various varnish-headers to free up bandwidth.

Possible bottlenecks

The most obvious bottleneck during a test like this is bandwidth. That is the main reason for reducing the size of objects served during testing.

Another bottleneck is how fast your web servers are. A realistic test requires cache misses, and cache misses require responsive web servers.

Slow clients are a problem too. Unfortunately testing that synthetically isn’t easy. Lack of test-clients has been an issue in the past, but we’ve solved this now.

CPU? Traditionally, CPU speed isn’t much of an issue with Varnish, but when you rule out slow backends, bandwidth and slow clients, the CPU is the next barrier.

One thing that’s important in this test is that the sheer number of parallel execution threads is staggering. My last “big” test had 4 execution threads, this one has 24. This means we get to test contention points that only occur if you have massive parallelization. The most obvious bottleneck is the acceptor thread: the thread charged with accepting connections and delegating them to worker threads. Even though multiple thread pools are designed to alleviate this problem, the actual accept() call is done in a single thread of execution.

Connections

As Artur Bergman of Wikia has already demonstrated, the number of TCP connections Varnish is able to accept per second is currently our biggest bottleneck. Fortunately for most users, Artur’s workload is very different from that of most other Varnish users. We (Varnish Software) typically see a 1:10 ratio between connections and requests. Artur suggested he’s closer to 1:3 or 1:4.

During this round of tests I was easily able to reach about 40k connections/s, but going much above that is hard. For a “normal” workload, that would allow 400k requests/second, which is more than enough. It should be noted, though, that the accept rate drops somewhat as the general load increases.

It was interesting to note that this was largely unaffected by having two varnishes in front of each other. This essentially confirms that the acceptor is the bottleneck.

There wasn’t much we could do to affect this limit either. Increasing the listen_depth isn’t going to help you in a synthetic test. The listen_depth parameter defines how many outstanding connections are allowed to queue up before the kernel starts dropping them. In the real world, the connection rate is sporadic, and on an almost-overloaded system it might help to increase the listen depth, but in a synthetic test the connection rate is close to constant. That means increasing the listen depth just means there’s a bigger queue to fill – and it will fill anyway.

The number of thread pools had little effect too. By the time the connection is delegated to a thread pool, it’s already past the accept() bottleneck.

Now, keep in mind that this is still a staggering number. But it’s also an obvious bottleneck for us.

Request rate

The raw request rate is essentially defined by how big the request is compared to bandwidth, how much CPU power is available and how fast you can get the requests into a thread.

As we have already established that the acceptor thread is a bottleneck, we needed to up the number of requests per connection. I tested mostly with a 1:10 ratio. This is the result of one such test:

202 832 requests per second while doing roughly 20 000 connections/s. Quite a number.
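
For reference, the 1:10 ratio is mostly a matter of telling httperf how many requests to issue per connection. A rough sketch of the kind of invocation used, with the server address, URI and counts made up for illustration (several clients run this in parallel to reach the aggregate rates above):

# --rate is new connections per second, --num-calls is requests per connection
httperf --server 192.168.0.10 --port 80 --uri /redirect-test \
        --rate 2000 --num-conns 100000 --num-calls 10 --timeout 5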

It proved difficult to exceed this.

At about 226k req/s the bandwidth limit of 1gbit was easily hit. To reach that, I had to reduce the connection-rate somewhat. The main reason for that, I suspect, is increased latency when the network is saturated.

At this point, Varnish was not saturating the CPU. It still had 30-50% idle CPU power.

Just for kicks and giggles, I wanted to see how far we could really get, so I threw in a local httperf, thereby ignoring large parts of the network issue. That had Varnish serving roughly 1gbit of traffic over the network and a few hundred mbit locally.

So that’s 275k requests/s. The connection rate at that point was lousy, so not very interesting. And because httperf was running locally, the load on the machine wasn’t very predictable. Still, the machine was snappy.

But what about the varnish+varnish setup?

The above numbers are for a single Varnish server. However, when we tested with varnish as a load balancer in front of Varnish, the results were pretty identical – except divided by two.

It was fairly easy to do 100k requests/second on both the load balancer and the varnish server behind it – even though both were running on the same machine.

The good thing about Varnish as a load balancer is the keep-alive nature, speed and flexibility. The contention point of Varnish is long before any balancing is actually done, so you can have a ton of logic in your “Varnish load balancer” without worrying about load increasing with complexity.

We did, however, discover that the number of HTTP header overflows would spike on the second varnish server. We’re investigating this. The good news is that it was not visible on the user-side.

The next step

I am re-doing part of our internal test infrastructure (or rather: shifting it around a bit) to test the acceptor thread regularly.

I also discovered an assert issue during some sort of race at around 220k req/s, but only under certain very, very specific situations. It was not possible to reproduce on anything that wasn’t massively parallel and almost saturated on CPU.

We’re also constantly improving our test routines both for customer setups and internal quality assurance on the Varnish code base. I’ve already written several load-generating scripts for httperf to allow us to test even more realistic work loads on a regular basis.

What YOU should care about

The only thing that made a real difference while tuning Varnish was the number of threads. And making sure it actually caches.

Beyond that, it really doesn’t matter much. Our defaults are good.

However, keep in mind that this does NOT address what happens when you start hitting disk. That’s a different matter entirely.

Varnish best practices

A while ago I wrote about common Varnish issues, and I think it’s time for an updated version. This time, I’ve decided to include a few somewhat uncommon issues that, if set, can be difficult to spot or track down. A sort of pitfall-avoidance, if you will. I’ll add a little summary with parameters and such at the end.

1. Run Varnish on a 64 bit operating system

Varnish works on 32-bit, but was designed for 64-bit. It’s all about virtual memory: things like stack size suddenly matter on 32-bit. If you must use Varnish on 32-bit, you’re somewhat on your own. However, try to fit it within 2GB. I wouldn’t recommend a cache larger than 1GB, and no more than a few hundred threads… (Why are you on 32-bit again?)

2. Watch /var/log/syslog

Varnish is flexible, and has a relatively robust architecture. If a Varnish worker thread were to do something Bad and Varnish noticed, an assert would be triggered, Varnish would shut down and the management process would start it up again almost instantly. This is logged. If it weren’t, there’s a decent chance you wouldn’t notice, since the downtime is often sub-second. However, your cache is emptied. We’ve had several customers contact us about performance issues, only to realize they were essentially restarting Varnish several times per minute.

This might make it sound like Varnish is unstable: It’s not. But there are bugs, and I happen to see a lot of them, since that’s my job.

An extra note: on Debian-based systems, /var/log/messages and /var/log/syslog are not the same. Varnish will log the restart in /var/log/messages, but the actual assert error is only found in /var/log/syslog, so make sure you look there too.
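
A quick way to check whether this is happening to you is to grep the logs. The exact wording of the messages varies between versions, so treat these patterns as a starting point rather than gospel:

# Debian-based: the restart shows up in messages...
grep -i "child" /var/log/messages | grep -iE "died|start"
# ...while the assert itself ends up in syslog
grep -i "assert" /var/log/syslog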

The best way to deal with assert errors is to search our bug tracker for the relevant function-name.

3. Threads

The default values for threads are based on a philosophy I’ve since come to realize isn’t optimal. The idea was to minimize the memory footprint of Varnish, so by default Varnish uses 5 threads per thread pool. With the default two thread pools, that’s 10 threads minimum. The maximum is far higher, but in reality threads are fairly cheap. If you expect to handle 500 concurrent requests, tune Varnish for that.

A little clarification on the thread parameters: thread_pool_min is the minimum number of threads for each thread pool, while thread_pool_max is the maximum total number of threads. That means the values are not on the same scale. The thread_pools parameter can safely be ignored (tests have indicated that it doesn’t matter as much as we thought), but if you do want to modify it, one thread pool for each CPU core is the rule of thumb.

You also do not want more than 5000 as the thread_pool_max. Going above that is dangerous, though the underlying issue is fixed in trunk. More often than not, needing that many threads is an indication that something else is wrong. If you find yourself using 5000 threads, the solution is to find out why it’s happening, not to increase the number of threads.

To reduce the startup time, you also want to reduce the thread_pool_add_delay parameter. ‘2’ is a good value (as opposed to 20 which makes for a slow start).
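
As a concrete illustration, the thread parameters can be passed to varnishd at startup or changed on a running instance through the management CLI. The numbers below match the 500-concurrent-requests example above (4 pools, for a four-core machine, of at least 125 threads each), and the -T address should be whatever you started varnishd with; none of this is a universal recommendation:

# At startup (other options omitted):
varnishd ... -p thread_pools=4 -p thread_pool_min=125 \
             -p thread_pool_max=4000 -p thread_pool_add_delay=2

# On a running instance:
varnishadm -T localhost:6082 param.set thread_pool_min 125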

4. Tune based on necessity

I often look at sites where someone has tried to tune Varnish to get the most out of it, but taken it a bit too far. After working with Varnish I’ve realized that you do not really need to tune Varnish much: the defaults are tuned. The only real exception I’ve found is the number of threads and possibly the work spaces.

Varnish is – by default – tuned for high performance on the vast majority of real-life production sites. And it scales well, in most directions. By default. Do yourself a favor and don’t fix a problem which isn’t there. Of all the issues I’ve dealt with on Varnish, the vast majority have been related to finding out the real problem and either using Varnish to work around it, or fix it on the related system. Off the top of my head, I can really only remember one or two cases where Varnish itself has been the problem with regards to performance.

To be more specific:

  • Do not modify lru_interval. I often see the value “3600”, which is a 180 000% (one hundred and eighty thousand percent) increase from the default. This is downright dangerous if you suddenly need the LRU list, and so far my tests haven’t been able to prove any noticeable performance improvement.
  • Setting sess_timeout to a higher value increases your file descriptor consumption, and there’s little to gain by doing it. You risk running out of file descriptors, at least until we can get the fix into a released version.

So the rule of thumb is: Adjust your threads, then leave the rest until you see a reason to change it.

5. Pay attention to work spaces

To avoid locking, Varnish allocates a chunk of memory to each thread, session and object. While keeping the object workspace small is a good thing to reduce the memory footprint (this has been improved vastly in trunk), sometimes the session workspace is a bit too small, especially when ESI is in use. The default sess_workspace is 16kB, but I know we have customers running with 5MB sess_workspace without trouble. We’re obviously looking to fix this, but so far it seems that having some extra sess_workspace isn’t that bad. The way to tell is by asserts (unfortunately), typically something related to “(p != NULL) Condition not true” (though there can obviously be other reasons for that). Look for it in our bug tracker, then try to increase the session workspace.
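
Bumping the session workspace is a one-parameter change. The value below (256kB, given in bytes) is just an example somewhere between the 16kB default and the extremes mentioned above; adjust the -T address to your own management interface:

# At startup:
varnishd ... -p sess_workspace=262144
# Or on a running instance:
varnishadm -T localhost:6082 param.set sess_workspace 262144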

6. Keep your VCL simple

Most of your VCL-work should be focused around vcl_recv and vcl_fetch. That’s where you define the majority of your caching policies. If that’s where you do your work, you’re fairly safe.

If you want to add extra headers, do it in vcl_deliver. Adding a header in vcl_hit is not safe. You can use the “obj.hits” variable in vcl_deliver to determine if it was a cache hit or not.
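
A minimal sketch of that pattern, in 2.1-style VCL (the X-Cache header name is just a convention, use whatever fits your setup):

sub vcl_deliver {
    # obj.hits is 0 for a miss, and counts the hits otherwise
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }
}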

You should also review the default VCL, and if you can, let Varnish fall through to it. When you define your VCL, Varnish appends the default VCL, but if you terminate a function, the default is never run. This is an important detail in vcl_recv, where requests with cookies or Authorization headers are passed if present. That’s far safer than forcing a lookup. The default vcl_recv code also ensures that only GET and HEAD requests go through the cache.

Focus on caching policy and remember that the default VCL is appended to your own VCL – and use it.

7. Choosing storage backend (malloc or file?)

If you can contain your cache in memory, use malloc. If you have 32GB of physical memory, using -smalloc,30G is a good choice. The size you specify is for the cache, and does not include session workspace and such; that’s why you don’t want to specify -smalloc,32G on a 32GB system.

If you can not contain your cache in memory, first consider if you really need that big of a cache. Then consider buying more memory. Then sleep on it. Then, if you still think you need to use disk, use -sfile. On Linux, -sfile performs far better than -smalloc once you start hitting disk. We’re talking pie-chart material. You should also make sure the filesystem is mounted with noatime, though it shouldn’t be necessary. On Linux, my cold-hit tests (a cold hit being a cache hit that has to be read from disk, as opposed to a hot hit, which is read from memory) take about 6000 seconds to run on -smalloc, while it takes 4000 seconds on -sfile with the same hardware. Consistently. However, your mileage may vary with things such as kernel version, so test both anyway. My tests are easy enough: run httperf through x-thousand URLs in order, then do it again in the same order.
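
In varnishd terms the two choices look like this; the sizes and the file path are purely illustrative, and remember to leave headroom for workspaces and the OS when sizing malloc:

# Cache fits in memory (on a 32GB machine):
varnishd ... -s malloc,30G

# Cache larger than memory, backed by a file on disk:
varnishd ... -s file,/var/lib/varnish/varnish_storage.bin,100G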

Some of the most challenging setups we work with are disk-intensive setups, so try to avoid it. SSD is a relatively cheap way to buy yourself out of disk-issues though.

8. Use packages and supplied scripts

While it may seem easier to just write your own script and/or install from source, it rarely pays off in the long run. Varnish usually runs on machines where downtime has to be planned, and you don’t want a surprise when you upgrade it. Nor do you want to risk missing that little bug we realized was a problem on your distro but not others. If you do insist on running home-brew, make sure you at least get the ulimit commands from the startup scripts.
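
The ulimit lines in question look something like the following; the exact values and variable names differ between distributions, so take this as an illustration of what the packaged scripts do rather than a copy-paste recipe:

# Raise the number of open file descriptors available to varnishd
ulimit -n ${NFILES:-131072}
# Raise the locked-memory limit so the shared memory log can be locked
ulimit -l ${MEMLOCK:-82000}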

This is really something you want regardless of what sort of software you run, though.

9. Firewall and sysctl-tuning

Do not set “tw_reuse” to 1 (sysctl). It will work perfectly fine for everyone. Except thousands of people behind various NAT-based firewalls. And it’s a pain to track down. Unfortunately, this has been given as advice in the past.

Avoid connection-tracking on the Varnish server too. If you need it, you’ll need to tune it for high performance, but the best approach is simply to not do connection-tracking on a server with potentially thousands of new connections per second.
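
What this boils down to on a typical Linux box is roughly the following; the iptables rules assume the raw table is available and that Varnish listens on port 80, so adjust to your own layout:

# Leave tw_reuse at its default of 0:
sysctl -w net.ipv4.tcp_tw_reuse=0
# (The related tcp_tw_recycle is also known to break clients behind NAT.)

# If conntrack must stay loaded, at least exempt the HTTP traffic:
iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK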

10. Service agreements

(Service agreements are partly responsible for my salary, so with that “conflict of interest” in mind….)

You do not need a service agreement to run Varnish. It’s free software.

However, if you intend to run Varnish and your site is business critical, it’s sound financial advice to invest some money in it. We are the best at finding potential problems with your Varnish-setup before they occur, and solving them fast when they do occur.

We typically start out by doing a quick sanity-test of your configuration. This is something we can do fast, both with regards to parameters, VCL and system configuration. Some of our customers only contact us when there’s something horribly wrong, others more frequently to sanity-check their plans or check up on how to use varnishncsa with their particular logging tool, and so on. It’s all up to you.

We also have a public bug tracker anyone can access and submit to. We do not have a private bug tracker, though there are bugs that never hit the public one – but that’s because we fix them immediately. Just like any other free software project, really. We have several public mailing lists, and we answer them to the best of our ability, but there is no guarantee and our time is far more limited. As a customer, if you run into a bug, my work on other bugs will be postponed until your problems are solved. Better yet: if you run into something you don’t know is a bug, we can track it down.

A service agreement gives you safety. And your needs will get priority when we decide where we want to take Varnish in the future.

We also offer training on Varnish, if you prefer not to rely on outside competence.

Oh, and I get to eat. Yum.

Summary

Keep it simple and clean. Do not use connection tracking or tw_reuse. Try to fit your cache into memory on a 64-bit system.

Watch your logs.

Parameters:

thread_pool_add_delay=2
thread_pools = <Number of cpu cores>
thread_pool_min = <800/number of cpu cores>
thread_pool_max = 4000
session_linger = 50
sess_workspace = <16k to 5m>

So if you have a dual quad-core machine, you have 8 CPU cores. This would make sense: thread_pools=8, thread_pool_min=100, thread_pool_max=4000. The number 800 is semi-random: it seems to cover most use cases. I added session_linger into the mix because it’s a default in Varnish 2.0.5 and 2.0.6 but not in prior versions, and it makes good sense.
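
Put together, a startup line for such a machine might look roughly like this; the listen address, management port, backend address and malloc size are placeholders for your own values, and the sess_workspace is just one example within the range above:

varnishd -a :80 -T localhost:6082 -b 192.168.1.2:8080 -s malloc,8G \
    -p thread_pool_add_delay=2 \
    -p thread_pools=8 \
    -p thread_pool_min=100 \
    -p thread_pool_max=4000 \
    -p session_linger=50 \
    -p sess_workspace=262144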

Pushing Varnish even further

A while ago I did a little writeup on high-end Varnish tuning, where I noted that I made our single-core 2.2GHz Opteron reach 27k requests/second. This raised the question of how well Varnish scales with hardware. So I went ahead and tried to overload our quad-core Xeon at 2.4GHz. It would obviously take some extra firepower. At the very least, four times as much as the last batch of tests.

Hardware involved

Our main set of test servers for Varnish are called varnish1, varnish2, varnish3, varnish4, varnish6 and varnish7. These have mostly different software and hardware, which is intentional so we can perform tests under different circumstances. We routinely run tests against Varnish2 and Varnish4, which run CentOS and FreeBSD, respectively. For my last test, I used Varnish2 as the server and the remaining servers as test nodes. By any normal math, I would need about four times more firepower to overload a 2.4GHz quad-core than a single-core Opteron at 2.2GHz.

To sum it up as far as this round of tests go:

  • Varnish1 – Single core Opteron
  • Varnish2 – Single core Opteron at 2.2GHz (used in the last round of tests)
  • Varnish3 – Single core Xeon (if I’m not much mistaken). It’s also the nginx server used as backend, but that just means 1 request every X minutes.
  • Varnish4 – Single core Opteron (FreeBSD)
  • Varnish6 – Dual core Xeon of some kind
  • Varnish7 – Quad-core Xeon at 2.4GHz

So I needed more power. As it happens, we do a lot of training and have three classrooms full of computers for students, so I borrowed two of these classrooms, adding the following to the mix:

  • 10 x single core Pentium Celerons at 2.9x GHz
  • 10 x Core 2 Duos at 2.4ish GHz

As you might notice – a large part of the challenge when you want to test Varnish is getting your test systems to keep up.

Basic test procedures

Same as last time, more or less: 1-byte pages and httperf. I’ve tried ab, siege and curl… and they simply do not offer the raw power of httperf combined with the control – if anyone cares to enlighten me on how to get the most out of them, I’m more than willing to listen.

Ideally I wanted to test with 10 requests for each connection, and with a mixed data set size. As it turns out, I ended up using 100 requests per connection and bursting all of them, which is far from realistic. More on this later.
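
In httperf terms, that sort of bursting is done with the --burst-length option. A rough sketch, with the host name, URI and counts made up for illustration:

# 100 requests per connection, sent as a single burst per connection
httperf --server varnish7 --port 80 --uri /1byte.txt \
        --num-conns 50000 --num-calls 100 --burst-length 100 --rate 1000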

I have an intricate script system for the nightly tests, but that’s a story for another time. For these tests I simply used clusterssh to replicate my input on 37-ish shells. This allowed me to run identical tests on all the nodes at once, and to quickly review their status. I probably ran a thousand or more different variants of the same test this time around.

I’ve used varnishstat to monitor the request rate and other relevant stats, and top to monitor general load.

The backend I use is hosted on varnish3, which runs nginx and a simple rewrite to ‘current.txt’, which for this occasion was linked to a 1byte file.

Results

Varnish uses a lot of threads, and as such, when it does finally saturate the CPU, the load average will skyrocket. In the last test, Varnish2 had a load of 600-700. Under that load, Varnish2 would take 10-15 seconds to start ‘top’.

During this round of tests I had roughly 87GHz worth of clients, spread over 25 physical computers. All of the tests systems were running at full load. Varnish7 had a load average around 45. Logging in and starting top was close to instant. And Varnish was serving 143k requests per second.

Based on the load and general snappiness, I think it is safe to conclude that while Varnish was close to the breaking point, it hadn’t actually reached it. To put it simply: My clients were not fast enough. Before I told httperf to burst 100 requests for each connection, Varnish was serving 110-120k requests per second with a load less than 1.0, and the clients were still using all their fire power. I ended up stress testing my clients. Dammit.

However, as I came fairly close to the breaking point, I still believe there are a few interesting things to look at.

The scaling nature of Varnish

It’s very rare that you see an application scale so well just by throwing CPU power and CPU cores at it. Varnish was essentially unaffected by the extra work needed to synchronize work across 4 CPU cores. In fact, if you look at the math, the raw performance on 4 CPU cores was actually BETTER than on one CPU core, when you look at it cycle by cycle.

I think it’s reasonably safe to say that when it comes to raw performance, we’ve nailed it with Varnish.

In fact, scaling Varnish is far more difficult when you increase your active data set beyond physical memory. Or when you introduce latency. Or when you have a low cache hit rate. Or any number of other corner cases. There will always be bottlenecks.

What you can learn from this is actually simple: Do not focus on the CPU when you want to scale your Varnish setup. I know it’s tempting to buy the biggest baddest server around for a high-traffic site, but if your active data set can fit within physical memory and you have a 64-bit CPU, Varnish will thrive. And for the record: All CPU-usage graphs I’ve seen from Varnish installations confirm this. Most of the time, those sexy CPUs are just sitting idle regardless of traffic.

Myths and further research

Since I didn’t reach the breaking point, there’s not much I can say conclusively. However, I can repeat a few points.

Adjusting the lru_interval had little impact regardless of data set and access patterns. If I repeat this often enough, perhaps I’ll stop seeing new installations with an lru_interval of 3600: DO NOT SET lru_interval TO 3600. There. I didn’t even add the usual “unless you know what you are doing” part. I might’ve explained it before, but the problem is that it leaves you with a really badly sorted LRU list that will cause Bad Things once you need to LRU-nuke something. Possibly really, really bad things. Like throwing out the 200 most popular objects on your site at the same time.

And the size of your VCL has little impact on the performance. I have not tested this extensively, but I’ve never registered a difference, and since your cpu will be idle most of the time anyway, you should NOT worry about CPU cycles in VCL.

Another important detail is that your shmlog shouldn’t trigger disk activity. On my setup it didn’t sync to disk to begin with, but you may want to stick it on a tmpfs just to be sure. I suspect this has improved throughout the 2.0-series of Varnish, but it’s an easy insurance. Typically the shmlog is found in /usr/var/varnish, /usr/local/var/varnish or similar (“ls -l /proc/*/fd | grep _.vsl” is the lazy way to find it).
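
If you want that insurance, putting a tmpfs over the shmlog directory is a one-liner; the path and size below are examples, use whatever directory the lazy command above reports on your system, and add a matching /etc/fstab entry if you want it to survive reboots:

# Mount a tmpfs over the directory holding _.vsl, then restart Varnish
mount -t tmpfs -o size=128m tmpfs /usr/local/var/varnish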

I tried several different settings of thread pools, acceptors, listen depth, shmlog parameters, rush exponent and such, but none of it revealed much – most likely because I never pressured Varnish enough. This will be what I want to investigate further. But it should tell you something about how far you have to go before these obscure settings start to matter.

Feedback wanted

I figure this must be some sort of record, but I’m interested in what sort of numbers others have seen or are seeing. Has anyone even come close to the numbers above – synthetic or otherwise – from a single server? Regardless of software or hardware? This is not meant as a challenge or a boast, but I’m genuinely curious about what sort of traffic people are able to push. I’m interested in more “normal” request rates too – I’m a sucker for numbers. What are you seeing on your site? Have you had scaling issues?

High-end Varnish-tuning

Most of the time when I tune Varnish servers, the main problem is hit rate. That’s mostly a matter of whack-the-weasel, and fairly straightforward. However, once you go beyond that, things get fun. I’ll take you through a few common tuning tricks. This is all based on there being no disk I/O, so either sort that out first or expect different results.

The big ones

The first thing you want to do is sort your threads out. One thread pool for each CPU core. Never run with less than, say, 800 threads. If you think that’s a lot, then you don’t need these tips. For max, I don’t advise going over 6000; I’ll explain that shortly. So if you have 8 CPU cores, you will want to set:

thread_pools 8
thread_pool_min 100
thread_pool_max 5000
thread_pool_add_delay 2

Note that I also set the thread_pool_add_delay to 2ms. That should drastically reduce the startup time for your threads, and is fairly safe. The reason we don’t create everything instantly is to avoid bombing the kernel.

The main danger with threads – if we rule out I/O – is file descriptors. Currently the log format we use has a 16-bit field reserved for file descriptors, which I believe is fixed in trunk, but that limits us to 64k file descriptors. And your kernel will clean them up periodically, so running out is very, very relevant, and please keep in mind that synthetic tests are horrible at testing this. You can probably use 40 000 threads in a synthetic test without running into file descriptor issues, but do not use that in production. 6000 might be high, and unless you really, really, really need it, I wouldn’t go beyond 2000 or 3000. I’ve done quite a bit of testing and tried out different options on production sites, and have found that 800 is a sane minimum; I’ve rarely seen max threads be an issue until you hit the fd limit. You can watch /proc/<PID of varnish child>/fd/ to see how many fds varnish has allocated at any given time.
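
A quick way to keep an eye on that is to count the entries in those fd directories; pidof returns both the management and the child process, and the child is the one with the big number:

for pid in $(pidof varnishd); do
    echo "$pid: $(ls /proc/$pid/fd | wc -l) file descriptors"
done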

The next issue you are likely to run into is cli_timeout. If your Varnish is heavily loaded, it might not answer the management process in a timely fashion, which in turn will make the management process kill the child off. To avoid that, set cli_timeout to 20 seconds or more. Yes, 20. That’s the extreme, but I have gradually increased this over months of routine tests. I’m currently running these tests with a cli_timeout of 25 seconds, which so far has worked; 23 worked until today. For most sites and most real workloads, I doubt this is necessary, but if it is and you actually hit this in production, your Varnish will restart when it’s most busy – which is probably the worst possible scenario you have. Set it to at least 10-15 seconds (we increased the default to 10 seconds a while ago; it’s a sane compromise, but a tad low for an overloaded Varnish).

Last but not least of the common tricks is a well-kept secret: session_linger. When you have a bunch of threads and Varnish becomes CPU-bound, you are likely to get killed by context switching and whatnot. To reduce this, setting session_linger can help. You may have to experiment a bit, as it depends on your content. I recently had to set it to 120ms to really get it to do the trick. The site load would climb to 60k req/s then crumble to a measly 2-5k req/s during tests. Session linger did the trick. However, don’t set it too high. That will leave your threads idling.

Session_linger has been improved in trunk, and will be enabled by default in 2.0.5, but it’s still useful in 2.0.4.

[Update] Session linger causes your threads to wait around for more data from the client they are currently working with. Without it, you risk switching threads between pipelined requests, which requires moving a lot of data around and allocating/freeing threads. It’s better to have spare threads than to constantly switch the ones you have around.

Misc

Another value you may want to change is lru_interval, which controls how often an object’s position on the LRU list is updated; the default is 2 seconds. There are several pages that mention an lru_interval of 3600, but we’ve seen such values cause problems in the past. I would consider something like 20 seconds. It’s not going to have a huge impact on your performance.

People also increase the listen depth. This might be necessary, but I’ve not seen any solid evidence that it helps, so I generally avoid it.

Another thing to consider is using critbit instead of the classic hashing. That is more relevant for huge data sets, and I’ve not seen any significant performance gain in my synthetic tests yet, but I know some people have, so it’s something you might want to look into.

Session timeout is generally fine at the default (4s), but you should not increase it, or you might run into file descriptor issues.

Then there’s your load balancer. We’ve had several cases where Varnish has run into issues because of an enormous number of connections. You do NOT want to make a connection for every single request.

Summary

thread_pools 8
thread_pool_min 100
thread_pool_max 5000
thread_pool_add_delay 2
cli_timeout 25
session_linger 50/100/150
lru_interval 20

Testing

Testing all of this is a different story, but I will point out a few common pitfalls:

  • Testing your stress-testing tool. You need a number of machines to test Varnish – otherwise Varnish isn’t going to be the bottleneck, your stress-testing system is. I use a cluster of 6 servers to test Varnish: one is the Varnish server and the other 5 hammer it – and that’s barely enough, even though the Varnish server is not specced for high performance compared to the other nodes.
  • Using too few connections or too many – Real life seems to suggest that 10 requests per connection is fairly realistic.
  • Testing only cache hits. This is great for getting huge numbers, but obviously not all that realistic. For a proper test, you may want to generate URLs from log files and balance them accordingly (a sketch of one way to do this follows after this list).
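
For the log-file approach, httperf’s --wlog option reads URIs from a NUL-separated file. A rough sketch, assuming a common/combined access log format where the request path is the seventh field, and with the host name made up:

# Build a NUL-separated workload file from an access log
awk '{printf "%s\0", $7}' access.log > urls.wlog

# Replay the URLs against the cache
httperf --server varnish2 --port 80 --wlog=n,urls.wlog \
        --num-conns 10000 --num-calls 10 --rate 500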

Results?

Our single-core Opteron at 2.2GHz handles 27k requests/s consistently. Sure, the load can hit 400-600, but hey, it works. This scales fairly well too, so if that were a dual quad-core I wouldn’t be surprised if we could reach 180k req/s (but I have no idea where we’d get the firepower from to test that – or the bandwidth; I assume there’d be some completely different issues at that point). This is with 1-byte pages, mind you. I’ve seen Varnish deliver favicon.ico at 60k req/s on a dual quad, but that was an underachiever ;)
