Load testing in production with Grafana Loki, Kubernetes and Golang

George Malamidis
Published in loveholidays tech
4 min read · Jan 28, 2022


By David Annez, David Dios, Dmitri Lerko, George Malamidis and Mate Szvoboda

“Owlbot — It hunts at night”

Proving our services can scale with traffic

The period between the 26th of December and early February is one of increased holiday trading activity in the UK. At loveholidays we affectionately refer to it as Peaks. During Peaks, loveholidays.com’s throughput exceeds 10 times the average. To ensure our services can withstand the load, we continually test them by replaying production traffic from our access logs against our staging and production environments at multiples of the original throughput. Load tests are run against production at night, when UK and Ireland traffic is low. The system that runs these tests is built around Grafana Loki, a Kubernetes CronJob and an HTTP traffic replay tool we’ve open sourced called ripley. We call this system Owlbot because it hunts for performance regressions at night.

Having access logs available for replay

Replaying an organic traffic profile from access logs keeps the distribution of requests realistic, e.g. how many users hit the homepage vs search result pages, or the ratio of users searching across all destinations vs those searching only for holidays in Majorca. This distribution between different types of requests has an effect on performance, and it can be hard to replicate with scripted synthetic load testing.

We store access logs in Loki, together with all of our services’ logs, for its efficiency and because it’s native to the Prometheus / Grafana ecosystem, which is an integral part of our monitoring stack. Using Loki for this, as opposed to a custom solution, adheres to our Focus on differentiation engineering principle, as well as Focus on simplicity; there is minimal setup required and no intermediate systems such as GCS, which we used in previous versions of this system.

Storing all of our access logs in Loki also captures periods of performance degradation or events that caused disruptions, so we can replay them to prove that our follow-up improvements work.

We use access logs from NGINX, the entry point to our production cluster. When collecting these logs, we exclude sensitive data such as Personally Identifiable Information (PII).
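The article doesn’t detail how PII is kept out of the logs, but as a hypothetical sketch, one common approach in NGINX is to define a dedicated `log_format` that records only the fields needed for replay. The format name and file path below are illustrative:

```nginx
# Hypothetical replay-oriented log format: ISO 8601 timestamp, method
# and path. Logging $uri rather than $request_uri drops the query
# string, one common place PII (emails, names) leaks into access logs.
log_format replay '$time_iso8601 $request_method $uri';
access_log /var/log/nginx/replay.log replay;
```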

How we replay access logs with ripley

Ripley is a custom Go utility we wrote, inspired by the Vegeta HTTP load testing tool. Other load testing tools usually generate load at a set rate, such as 100 requests per second. Such constant load doesn’t accurately represent user behaviour. By default, ripley replays requests at exactly the same rate they happened in production. It also allows for fast (or slow) playback at multiples of the recorded rate. This is closer to the behaviour of organic traffic, which, in loveholidays.com’s case, moderately ramps up rather than being aggressively bursty. This realistic simulation of traffic is useful for tuning Kubernetes’ Horizontal Pod Autoscaler (HPA) which we use to elastically scale our services as throughput goes up and down.

As an example of an HPA tuning related discovery, during a run we noticed one of our services struggling to handle increased load. The HPA for this service is based on CPU utilisation. As load increased during the test, CPU utilisation increased with it. Several new pods would come up, CPU utilisation would drop, Kubernetes would take the pods down, and the cycle would repeat, with pods flapping and performance ultimately degrading. This prompted us to tune the service’s scaleUp and scaleDown policies, including setting stabilizationWindowSeconds parameters to ensure smooth handling of traffic fluctuations.
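A minimal sketch of that kind of tuning, using the autoscaling/v2 `behavior` fields; the service name, replica counts and window lengths here are illustrative, not our production values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      # A longer scale-down window stops Kubernetes removing pods the
      # moment CPU dips, which is what caused the flapping described above.
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
```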

Orchestrating load tests with a Kubernetes CronJob

Load tests run periodically against production, without human intervention unless they uncover a performance regression, in which case we get notified by our monitoring systems. The orchestration happens with a Kubernetes CronJob which:

  1. Fetches access logs from Loki using LogCLI
  2. Pipes the access logs into a tool that converts them to ripley’s JSON Lines input format
  3. Ripley replays the access logs against our production cluster

The results are recorded by Prometheus and can be viewed there or in Grafana, alongside our normal application metrics and OpenTelemetry traces in Tempo.

Closing thoughts

Load testing is invaluable in understanding our systems’ ability to handle varying levels of traffic. Performing it in an isolated staging environment allows for repeatable tests with results that are easier to reason about, and doesn’t carry the risk of disrupting live applications. Performing it against production systems is a straightforward option for achieving realistic tests, as it removes the need for cross-environment alignment. In the future, we would like to explore what would give us enough confidence to run load tests against production at any time, and to introduce chaos engineering into the process.
