Linkerd at loveholidays — Our journey to a production service mesh

Dan Williams
Published in loveholidays tech · 7 min read · Mar 3, 2023


We’ve recently posted about loveholidays’ open-source observability stack and the success we’ve had scaling up to 5 million timeseries.

But it doesn’t matter whether you have 5, 10 or 100 million timeseries available from your applications if you don’t know how to expose that information to your engineering teams in a meaningful way.

We run our production workloads in a Google Kubernetes Engine cluster: roughly 250 Deployments, 50 StatefulSets, 30 DaemonSets and 80 CronJobs, adding up to around 2,100 pods, give or take, depending on what our HorizontalPodAutoscalers are doing.

These workloads are made up of applications written in Java, Go, Python, Rust, JavaScript, TypeScript and more. Each language, application and framework has its own set of metrics it reports, all scraped by Prometheus (managed by prometheus-operator) and persisted in Grafana Mimir.

The problem is: Each language, application and framework has its own set of metrics it reports. Every application Dashboard has to reinvent the wheel, with each application reporting golden signals like HTTP throughput, latency and error rate in its own way. This means our Dashboards are at best inconsistent with one another, and at worst completely wrong, where queries are copied and pasted without understanding the nuances of the metrics being used.

At loveholidays we strive for reuse and uniformity by identifying problems common to all services and solving them with Infrastructure. We standardised log collection with Loki, we standardised observability by integrating our Logs, Traces and Metrics into Grafana, we standardised content management with Sanity, and next we want to standardise metrics.

Introducing a service mesh will enable us to standardise the golden metrics we collect for our applications, and to replace a large number of our operational Grafana Dashboards with a single Dashboard showing the health of any application.

A service mesh is a dedicated infrastructure layer that you can add to your applications. It allows you to transparently add capabilities like observability, traffic management, and security, without adding them to your own code. — Istio

Why Linkerd? Choosing a service mesh

We set out to run proofs of concept against the popular service mesh offerings to decide which mesh would suit us best, and we began our testing with Linkerd.

We installed it using the Linkerd CLI, added the Viz extension (an excellent Dashboard showing real-time traffic information within your cluster, see here for more details), added the linkerd.io/inject: enabled annotation to a few Deployments and... that was it. It just worked. We had standardised metrics coming from the Viz extension’s Prometheus, we had mTLS, we could watch live traffic using the tap feature, and it had taken us all of about 15 minutes to get up and running with a couple of workloads in a dev cluster. Linkerd delivered everything we had hoped to achieve with very little config or knowledge required, so we didn’t see a reason to continue evaluating other meshes and instead used the time to learn more about scaling Linkerd from development to production.

Deploying Linkerd to production

We deploy Linkerd using GitOps instead of the Linkerd CLI. For a detailed view of how we’ve accomplished this, please see our accompanying post: Linkerd at loveholidays — Deploying Linkerd in a GitOps world.

Onboarding applications into the mesh

Once we had established how to deploy a highly available Linkerd control plane that could handle the scale of our production cluster, we needed to start onboarding applications.

A lot of the Linkerd blog posts and content we’ve seen online talk about going from zero to full coverage in hours / days. This is definitely possible, but I’ll present a very different timeline: we took 6+ months to get to 100% coverage of our core microservices.

Onboarding an application to the mesh is as simple as adding the linkerd.io/inject: enabled annotation to the Deployment / StatefulSet / DaemonSet or Namespace.
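To make that concrete, here’s a minimal sketch of what it looks like on a Deployment (the name, labels and image are placeholders, not one of our real workloads):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app                       # placeholder workload name
    spec:
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
          annotations:
            linkerd.io/inject: enabled   # the proxy sidecar is injected when the pod is created
        spec:
          containers:
            - name: my-app
              image: my-app:1.0.0        # placeholder image

Note that the annotation goes on the pod template, not on the Deployment’s own metadata, because injection happens when the pod itself is created.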

Side note: I do not recommend Namespace-level injection due to incompatibilities with K8s CronJobs / Jobs. Jobs can be included in the mesh, but careful consideration must be given to how they terminate. You can read more here, and we will likely share a follow-up post on a situation where meshing a CronJob solved a problem we faced.

So that leaves us with setting the annotation on each workload. This is where you could rush ahead and mesh everything in no time at all, but we opted to onboard just a few services per week, monitoring and tweaking as necessary. For each application, you must consider:

  • Are the default proxy CPU/memory requests and limits enough for this application? Throughput, request size and other factors massively impact the resources each proxy requires, and we saw instances of the linkerd-proxy container being OOMKilled on the default settings. The proxy CPU/memory requests and limits can be set via annotations on each workload (see the sketch after this list). We have previously posted about our performance testing, which enabled us to tailor each application’s proxy requests/limits to withstand our highest expected load.
  • Does my application terminate gracefully? Once the proxy receives a SIGTERM, it blocks new requests (inbound and outbound) and handles the in-flight requests before shutting down. If your main container is still running after the proxy terminates, your application is effectively severed from the network and you’ll lose any remaining in-flight requests. This caused us a lot of pain, with some applications not listening for SIGTERM, and others taking upwards of 30 minutes to shut down gracefully. This topic could fill a blog post in itself, but you can find Linkerd’s documentation on the subject here.
  • Is protocol detection working correctly for this app? Protocol detection is a complex subject in Linkerd and needs special attention to avoid 10-second delays in establishing connections; one common fix is to mark non-HTTP ports as opaque (see the sketch after this list). Read more about this here.
  • ServiceProfiles. You get overall metrics for your HTTP / gRPC applications out of the box, but if you want metrics per endpoint/route, you need to define a ServiceProfile per K8s Service (see the sketch after this list). We have a ServiceProfile for each of our applications, with routes defined for every endpoint we care about.
  • Does everything look normal? This is a tricky one to define, and we ended up writing an internal runbook of log messages we’ve seen and what they likely mean. We’ve seen connection leaking, requests dropped due to non-HTTP/2 headers, TCP connection timeouts and various other network-level errors that were not visible before being meshed. We’ve also run into edge cases with GKE that stopped us onboarding more applications while we worked with the maintainers to resolve them. As described in Linkerd’s Debugging 502s:

Linkerd turns connection errors into HTTP 502 responses. This can make issues which were previously undetected suddenly visible. This is a good thing.
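To illustrate the proxy resource and protocol detection tuning mentioned in the list above, here is a hedged sketch of the pod-template annotations involved. The values are purely illustrative, not our production settings:

    spec:
      template:
        metadata:
          annotations:
            linkerd.io/inject: enabled
            # Per-workload proxy resources; tune these from load testing
            # rather than relying on the cluster-wide defaults
            config.linkerd.io/proxy-cpu-request: 100m
            config.linkerd.io/proxy-cpu-limit: "1"
            config.linkerd.io/proxy-memory-request: 64Mi
            config.linkerd.io/proxy-memory-limit: 256Mi
            # Mark non-HTTP, server-speaks-first ports this pod serves
            # (e.g. a database port) as opaque so protocol detection
            # doesn't stall the connection
            config.linkerd.io/opaque-ports: "5432"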
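And here is a minimal ServiceProfile sketch for per-route metrics. The service name, namespace and routes are hypothetical; the important part is that the resource is named after the Service’s FQDN and lists the routes you want broken out:

    apiVersion: linkerd.io/v1alpha2
    kind: ServiceProfile
    metadata:
      # Must be the fully qualified name of the Kubernetes Service
      name: my-app.my-namespace.svc.cluster.local
      namespace: my-namespace
    spec:
      routes:
        - name: GET /api/holidays/{id}   # hypothetical route
          condition:
            method: GET
            pathRegex: /api/holidays/[^/]+
        - name: POST /api/bookings       # hypothetical route
          condition:
            method: POST
            pathRegex: /api/bookings

With a ServiceProfile in place, linkerd-proxy (and Viz) report request volume, success rate and latency per route rather than only per service.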

Monitoring our applications with Linkerd

So we now have full coverage of our core microservices: everything is secured with mTLS, Linkerd’s advanced load-balancing voodoo-magic is reducing request latency between our services, metrics are being scraped by Prometheus and persisted in Mimir, and there’s a whole host of other features that we’ve barely scratched the surface of at this point.

The next step is to use all of this capability to solve the problem in the opening statement: uniform metrics. We developed two Grafana dashboards that cover 90% of our monitoring needs; you can read more about this in our accompanying post Linkerd at loveholidays — Monitoring our apps using Linkerd metrics.

Linkerd’s resource usage

Running a sidecar container in every pod in the cluster requires additional compute: extra memory, extra CPU and, of course, extra cost.

With Linkerd, we found the impact on our production cluster to be minimal. At the time of writing, our stats are as follows:

  • 12GB memory and 11.5 CPU cores used across ~500 linkerd-proxy sidecar containers (34GB and 51 cores requested, so we can certainly reduce our default requests by some margin, although some of this will be based on our max load performance testing)
  • 3GB memory and ~200m CPU used across all Linkerd control-plane components (5GB and 3 CPU requested, so again, we can trim some fat here!)
  • 1.25GB memory and ~50m CPU used across linkerd-viz components. Remember, we don’t use the Viz extension’s bundled Prometheus any more, which reduces overhead significantly.
  • 12k RPS combined through all of our meshed microservices
  • 112 onboarded services

As you can see, across our cluster with around 500 pods meshed at the time of writing (only our core microservices are meshed, and these numbers increase in the evenings as pods scale up under higher traffic), the Memory and CPU footprint of Linkerd is tiny. With around 2M active timeseries from linkerd-proxy, Prometheus uses considerably more resources scraping Linkerd metrics than the entire rest of the Linkerd stack consumes.

What’s next on our Linkerd journey?

Well, we are still in the early stages! We’ve done the hard part of getting it into production and onboarding our applications in a safe, controlled way. We have met our original objective of unified metrics, so what’s next?

  • We are currently trialling https://sloth.dev/ combined with Linkerd metrics to give us super easy SLIs and SLOs (see the sketch after this list).
  • We are not yet making the most of the extra features Linkerd provides, like Retries, Timeouts, Authorization Policy, Traffic Split and more. As more of our engineers gain familiarity with what’s on offer, we will start to leverage these features more and more.
  • Linkerd 2.13 is coming, bringing a range of new features like circuit breaking and more advanced routing. buoyant.io have an upcoming webinar on the 2.13 roadmap; register here.
  • Explore progressive delivery with Flagger to automatically roll back when an application regresses.
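As a rough illustration of the Sloth idea above, an SLO built on the linkerd-proxy response_total metric might look something like the sketch below. The service name, objective and label selectors are assumptions, not our actual SLOs:

    apiVersion: sloth.slok.dev/v1
    kind: PrometheusServiceLevel
    metadata:
      name: my-app-slo             # hypothetical
      namespace: my-namespace
    spec:
      service: my-app
      slos:
        - name: requests-availability
          objective: 99.9
          description: Proportion of successful inbound requests, from Linkerd proxy metrics.
          sli:
            events:
              # response_total is emitted by linkerd-proxy; classification is success/failure
              errorQuery: sum(rate(response_total{deployment="my-app", direction="inbound", classification="failure"}[{{.window}}]))
              totalQuery: sum(rate(response_total{deployment="my-app", direction="inbound"}[{{.window}}]))
          alerting:
            name: MyAppHighErrorRate
            pageAlert:
              disable: true
            ticketAlert:
              disable: true

Because every meshed workload exposes the same metrics, the same SLO template can be stamped out for any service just by changing the labels.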

We have a few more posts planned on the Linkerd topic, as it’s been a great journey to get to where we are. You’ll find us over on the Linkerd Slack both asking and answering questions, and if this sort of thing interests you, we have an open vacancy for a Senior Site Reliability Engineer.
