Hugging Websites
…very hard.
KDE relies heavily on web services, and many of them need to be kept responsive even under strenuous load. I’ve recently had the opportunity to spend some time on load testing one of our websites and would like to share how that worked out.
To properly test things I wanted to have multiple computers make concurrent requests to the service and ensure that the service still performs within acceptable limits. To that end I needed a bunch of servers, software that can pressure the web service, and software that can make sure the service works.
The server bit is the easiest task there… any cloud provider will do.
The software also seemed easy. After some very quick research Locust seemed as good a choice as any to poke at the service and make sure it responds. Except, after some pondering, I came to realize that this is actually not so. You see, Locust does HTTP performance testing. That is: it makes HTTP requests and tracks their response time / error rate. That is amazing for testing an API service, but when dealing with a website we also care about the javascripty bits on top being responsive. Clearly a two-prong approach was necessary here: Locust can put pressure on the backend while something else pokes the frontend with a stick to see if it is dead. Enter our old friend: Selenium. An obvious choice given my recent work on a Selenium-based application testing framework. The advantage here is that Selenium can more or less accurately simulate a user using the website, giving us a fairly good idea of whether perceived performance is up to spec. Better yet, both Locust and Selenium have master/client architectures, whereby we can utilize the cloud to do the work while a master machine just sits there orchestrating the show.
The three building blocks I’ve arrived at are:
- A cloud to scale in
- Locust for performance testing (that the thing stays responsive)
- Selenium for interaction testing (that the thing actually “works”)
I actually thought about showing you some code here, but it’s exceptionally boring. You can go look at it at https://invent.kde.org/sitter/load.
At first I needed to write some simple tests for Locust and Selenium. They were fairly straightforward: a bit of login, a bit of logout. Just enough to start putting pressure on the server-under-test.
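To still give a rough idea of the shape the Locust side takes, a sketch could look something like the following. This is purely illustrative: the /login and /logout paths, form fields, and credentials are placeholder assumptions, and the real tests live in the repository linked above.

```python
# Minimal Locust sketch: log in, log out, repeat.
# The endpoint paths and form fields are placeholders, not the real tests.
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # Pause 1-5 seconds between tasks to roughly mimic a human user.
    wait_time = between(1, 5)

    @task
    def login_logout(self):
        # Hypothetical form-based login followed by a logout.
        self.client.post("/login", data={"username": "tester", "password": "secret"})
        self.client.get("/logout")
```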
With simple tests out of the way it was time to glue everything together. For this I needed a couple more puzzle pieces. I’ve mentioned that both Locust and Selenium have “server” components that can manage a number of clients. For Locust that is distributed load generation, and for Selenium it’s called a Grid. For convenience I’ve opted to manage them using docker-compose.
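To sketch how the Selenium side plugs into that: a test does not start a local browser but asks the Grid for a session, and the Grid hands it to whichever node has free capacity. The hub URL and the page check below are placeholder assumptions, not the actual test.

```python
# Sketch of a Selenium test running against a Grid instead of a local browser.
# The hub URL and the title check are placeholders.
from selenium import webdriver

options = webdriver.FirefoxOptions()
driver = webdriver.Remote(
    command_executor="http://localhost:4444",  # the Grid hub, e.g. as started via docker-compose
    options=options,
)
try:
    driver.get("https://example.org/")
    assert "Example" in driver.title  # stand-in for the real login/logout flow
finally:
    driver.quit()
```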
The last piece of the puzzle was some provisioning logic for the cloud server to install and start Selenium as well as Locust Workers.
When all the pieces were in place, amazing things started happening!
On my local machine I had a Selenium Grid and a Locust master running. Magically, cloud servers started connecting to them as workers and after a while I didn’t have the Selenium and Locust power of one machine, no, UNLIMITED POWER! (cheeky Star Wars reference).
When I started a load test in Locust, it was distributed across all available nodes, simulating more concurrent access than one machine ordinarily would or could generate.
A simple loop also starts a number of concurrent Selenium tests that get distributed across the available grid nodes.
for i in {1..5}; do python3 test.py & done
The end result is a Locust making hundreds of requests per second while Selenium is trying to use the UI. Well, for a while anyway… I naturally wanted to know the limits, so I kept bumping the request numbers. At first all was cool.
Response times below 100 ms at 300 users are pretty good, I think. CPU usage was also at a comfortable level.
So I increased it to 600 users, which was still OKish. But when I started going towards 1200 users, problems started to appear.
In the bottom graph you can see the two bumps from 300 to 600 and then to 1200. What you can see super clearly is how the response time keeps getting poorer and poorer; the difference is so enormous that you can almost no longer make out the change from 300 to 600 users. Eventually the service started having intermittent interruptions while the Selenium tests were also trying to get their work done. Yet CPU and memory were not at full capacity, and the intermittent failure spike in particular was very suspicious. A look at the failures gave the hint: the service was running into software limits. I bumped up the limits because the hardware still had leeway, and presto: no more failure spikes even at 2048 users. Hooray! Response time does suffer, though, so in the end there would need to be more reshuffling if that were an expected user count.
Conclusion
Knowing the limits of our services is very useful for a number of reasons, ranging from knowing how many users we can support, to how oversized our servers are for the task they are performing, to whether our service is susceptible to malicious use. Knowing the capabilities and weaknesses of our systems helps us ensure high availability.
To discuss this blog post check out KDE Discuss.