
Out of the data Tar Pit: How loveholidays escaped from a legacy reporting system and kept the data flowing


Two years ago at loveholidays, daily reports were often late, and the system which produced them was very hard to understand and work with. The Data Engineering team was struggling to keep the system running, let alone improve it. Business users were not getting important information reliably, which affected their daily decision-making. By investing in technology simplification (facilitated by a move to Google Cloud Platform) and building a brand new ETL (Extract-Transform-Load) system, we have drastically improved the situation. Here’s how we did it.

Two years ago…

“Hi team, the daily reports are late, what’s the ETA?” chirped on our Slack channel. My heart sank. I’d just taken over as the Head of Data Engineering, and our team was responsible for processing and delivering daily Tableau reports, which detailed session counts, marketing costs, sales and attribution data for the whole company.

Our tech stack was quite old, with a Java app responsible for running and coordinating jobs. The ETL (Extract, Transform, Load) logic was defined in a large number of untested MariaDB stored procedures, some of which were thousands of lines long. All of this was supposed to feed aggregated data into our warehouse, which was MariaDB running on a single large VM. Only 2 out of the team of 4 really knew how the whole thing worked (I wasn’t one of those 2 people), and fixing errors meant logging into the database, querying log tables, then manually running stored procedures in exactly the right order until the chaos subsided.

The end result was an unreliable system with wildly unpredictable delivery times, which regularly woke team members up at horrible hours of the morning when something went wrong.

Figure 1 — Daily report delivery times with the old system. We almost never got reports out before 9am (horizontal dashed line), and the delivery times were highly variable. Exciting, but bad for stress levels.

The instability in the system regularly caused data to be delivered late, if at all. This had a serious impact on pretty much every other team in the company, all of whom needed to see the previous day’s financial and marketing performance as soon as possible. Our aim was therefore “to predictably deliver accurate daily reporting by 9am every day”.

Before I’d arrived, a project had started to migrate the data warehouse away from MariaDB and onto BigQuery, so we had made a start. But the road ahead was daunting…

Problems

It’s often tempting to look at a system which has been around for a while and say “it’s old, it needs to go”. But that’s not really good enough. For a start, most of us are being paid for tangible outcomes, and it pays to begin there when thinking about these kinds of situations. So what were the actual problems and costs of the situation described above?

Mental overload

The team spent a large portion of their mental bandwidth battling with complexity:

  1. The codebases were convoluted, with lots of duplicated sections and very long SQL stored procedures which were hard to reason about and had no automated tests (as an aside, the difficulty of writing automated tests for SQL is a good argument for keeping critical business logic outside of SQL altogether).
  2. Each codebase had its own deployment process which worked in a different way to the others.
  3. There was no easy way to visualise dependencies between jobs which formed part of a Directed Acyclic Graph (DAG). We spent a lot of time drawing out the dependencies on a whiteboard, or keeping them in our heads. Not easy.
  4. There was no easy way to visualise what had happened during a job run. This required trawling through log tables to see error messages and job outputs.

This mental overload is a vicious circle — the less time you have to think, the less time you spend improving quality and simplicity. Which leads to more instability, which means less time to think… We had to break out of this cycle.

Late and inaccurate data delivery

The complexity of the system, combined with the lack of automated testing, led to many instances of late and/or incorrect data which were identified by our users instead of by Data Engineering. This wasted time and effort as the teams reported the problem back to us, we dropped what we were doing, and then tried to fix the issue. This fed back into the Mental Overload cycle, reducing our ability to step back and think about things clearly.

If allowed to continue, this kind of data downtime can lead to a long-term lack of trust in data, which can’t be allowed to happen in a fast-moving business.

Principles

At loveholidays we have a set of company principles, one of which is “invest in simplicity”. In many ways everything that we have done in data engineering for the last 2 years comes back to this principle. Simple systems are easy to reason about and change. They are often easier to scale. And they are always more pleasant to work with.

In addition to simplicity, there are a number of other system attributes which we believe contribute to higher agility and delivery speed:

  1. Systems should have appropriate test coverage, and tests should be locally executable. In data processing systems especially, it is almost always faster and cheaper to catch a bug on your laptop than to wait for it to manifest in production.
  2. Systems should be easy and fast to deploy to production. I’m a big fan of trunk-based development, allied with a robust automated test + deploy pipeline. Reducing time to production is a key driver of agility.
  3. Systems should be easy to monitor in production. Self-describing systems are easier to improve, easier to debug, and nicer to work with.

With these principles in mind, we started to take action and introduce new technologies to get us to our goal of delivering daily reports reliably and on time.

Actions

Simplify our scheduling with Airflow

Our initial attempts to escape from the in-house Java scheduler went down a bit of a blind alley. We attempted to pull portions of our ETL out of MariaDB and into BigQuery scripts, which were then coordinated by a series of Google Cloud Functions. With hindsight this was a mistake — Cloud Functions are not meant for this kind of workload and it showed. We had problems with runtime visibility, and ended up trawling through logs again — but at least they were cloud logs ;-). In addition, we saw some strange behaviours like duplicate function invocations which made us wary. Lesson learned.

Our next port of call was to look around to see what other people were using, which led us to Airflow. After some experimenting we started shifting our workflows onto our shiny new GCP Cloud Composer cluster and…wow, what a difference.

Figure 2 — an Airflow DAG. It’s very easy to see what has happened historically, and how various job dependencies link together.
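To give a sense of what this looks like in code, a daily DAG is just a short Python file. The real DAGs aren’t shown in this post, so the task names, bucket paths and cluster details below are invented, and the sketch assumes the PySpark jobs are submitted to Dataproc, which is one common way to run Spark jobs from Composer on GCP:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Retry and ownership behaviour shared by every task in the DAG
default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_reporting",         # hypothetical DAG name
    schedule_interval="0 4 * * *",    # once a day, early in the morning (UTC)
    start_date=datetime(2021, 1, 1),
    default_args=default_args,
    catchup=False,
) as dag:

    # Each transformation is a PySpark job submitted to a Dataproc cluster
    sessions = DataProcPySparkOperator(
        task_id="transform_sessions",
        main="gs://our-etl-bucket/jobs/transform_sessions.py",  # hypothetical path
        cluster_name="etl-cluster",
        region="europe-west1",
    )

    marketing_costs = DataProcPySparkOperator(
        task_id="transform_marketing_costs",
        main="gs://our-etl-bucket/jobs/transform_marketing_costs.py",
        cluster_name="etl-cluster",
        region="europe-west1",
    )

    attribution = DataProcPySparkOperator(
        task_id="build_attribution",
        main="gs://our-etl-bucket/jobs/build_attribution.py",
        cluster_name="etl-cluster",
        region="europe-west1",
    )

    # Dependencies are declared explicitly and rendered in the Airflow UI,
    # instead of living on a whiteboard or in someone's head
    [sessions, marketing_costs] >> attribution
```

That last line is what Figure 2 visualises: Airflow draws the graph, tracks every run, and retries failed tasks automatically, so the old ritual of querying log tables and re-running procedures in exactly the right order goes away.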

No more trawling through logs or having to keep all of the job dependencies in our heads — a simple and intuitive way to build and monitor DAGs and scheduled jobs. What next?

Move transformations to Spark

In the legacy system our transformation logic was in the form of untested SQL stored procedures, which were difficult to understand, and could not be tested locally. So we started with the assumption that we wanted to be able to verify transformation behaviour before deployment. After some experimentation we decided on PySpark, because we could write locally-runnable unit tests with dummy datasets.
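To give a flavour of what that looks like (the real jobs aren’t shown in this post, so the table and column names here are invented), a transformation becomes a plain function from DataFrames to a DataFrame:

```python
# transformations/sessions.py (hypothetical module and column names)
from pyspark.sql import DataFrame, functions as F


def daily_session_counts(sessions: DataFrame) -> DataFrame:
    """Aggregate raw session events into one row per day and marketing channel."""
    return (
        sessions
        .withColumn("session_date", F.to_date("session_start"))
        .groupBy("session_date", "channel")
        .agg(F.countDistinct("session_id").alias("session_count"))
    )
```

Because the logic lives in a function rather than a stored procedure, a unit test can run on a laptop against a handful of hand-written rows, with no cluster or warehouse in sight:

```python
# tests/test_sessions.py (runs locally with a throwaway SparkSession)
from datetime import datetime

from pyspark.sql import SparkSession

from transformations.sessions import daily_session_counts


def test_daily_session_counts():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    # Dummy dataset: three sessions on the same day across two channels
    dummy = spark.createDataFrame(
        [
            ("s1", datetime(2021, 3, 1, 9, 0), "ppc"),
            ("s2", datetime(2021, 3, 1, 17, 30), "ppc"),
            ("s3", datetime(2021, 3, 1, 17, 30), "seo"),
        ],
        ["session_id", "session_start", "channel"],
    )

    result = {row["channel"]: row["session_count"]
              for row in daily_session_counts(dummy).collect()}

    assert result == {"ppc": 2, "seo": 1}
```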

Simplify testing and deployment

Being able to write local tests was a game-changer for us, because it gave rapid feedback on our system quality, whilst at the same time acting as a safety net to prevent bugs creeping into production. Once we’d hooked up our tests to a Google Cloud Build pipeline, we had a full CI/CD setup which allowed us to build, test and deploy new functionality with confidence. And yes, we are doing trunk-based development ;-)
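The exact pipeline isn’t described here, but a Cloud Build configuration along the following lines captures the idea: every push to trunk runs the unit tests, and only if they pass do the job and DAG files get copied to the buckets that the scheduler reads from (image versions, bucket names and paths below are all made up for illustration):

```yaml
# cloudbuild.yaml (illustrative sketch; names and paths are hypothetical)
steps:
  # 1. Run the locally-runnable unit tests inside the build
  - name: 'python:3.8'
    entrypoint: 'bash'
    args: ['-c', 'pip install -r requirements.txt && python -m pytest tests/']

  # 2. Ship the PySpark jobs to the bucket the scheduled jobs read from
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['-m', 'cp', '-r', 'transformations', 'gs://our-etl-bucket/jobs/']

  # 3. Deploy the Airflow DAGs by copying them into the Composer DAGs bucket
  - name: 'gcr.io/cloud-builders/gsutil'
    args: ['-m', 'cp', '-r', 'dags', 'gs://our-composer-bucket/dags/']
```

A failing test stops the build, so nothing reaches production until the fix lands on trunk.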

Move reports to Looker

We found Tableau tricky to work with — we needed a desktop client to alter our reports, and the data engineering team was running the Tableau server internally, which added to their workload. Sometimes it would take days to find the person who had the correct licence and desktop client to make a change, then have the change made and deployed.

After a rapid prototyping period, we shifted our daily reports over to Looker, which we are very happy with. Analysts can make experimental changes directly in the web interface and test them without interfering with live users. This rapid development cycle allows us to create new dashboards and alter existing ones with great speed, and we don’t have to run any infrastructure.

Start adding data health checks

Data reliability is hard. Manual checks soon become impossible once you are dealing with a reasonable number of datasets, and we have long since moved past that point. We also have some processes with multiple upstream data dependencies, and when bad data creeps in upstream, debugging the issue gets really difficult.

With our Airflow / Spark setup it now became simple to add scheduled jobs to check our core datasets for anomalies. It’s early days, and we have a long way to go with this effort. But now that our attention is not taken up by unstable systems, we can focus on the real job, which is to reliably deliver accurate data to the business.
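To make the idea concrete (the checks shown here are simplified, and the project, dataset and table names are invented), these are small Airflow tasks that query BigQuery and fail the run loudly when something looks wrong, rather than letting a quiet problem surface in a dashboard:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

with DAG(
    dag_id="data_health_checks",      # hypothetical DAG name
    schedule_interval="0 6 * * *",    # run after the daily loads have finished
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:

    # Fail if yesterday's partition is empty: catches a silent upstream outage
    sessions_not_empty = BigQueryCheckOperator(
        task_id="sessions_not_empty",
        sql="""
            SELECT COUNT(*) > 0
            FROM `our-project.reporting.daily_sessions`
            WHERE session_date = '{{ ds }}'
        """,
        use_legacy_sql=False,
    )

    # Fail if session ids are duplicated: usually means an upstream job ran twice
    sessions_unique = BigQueryCheckOperator(
        task_id="sessions_unique",
        sql="""
            SELECT COUNT(*) = COUNT(DISTINCT session_id)
            FROM `our-project.reporting.sessions`
            WHERE session_date = '{{ ds }}'
        """,
        use_legacy_sql=False,
    )
```

A check operator fails its task if any value in the first row it gets back is falsy, so a bad result shows up as a red task in Airflow and can block downstream reports instead of silently propagating.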

Conclusion

This post could have been five times longer; there are so many other improvements and changes we’ve made in the last two years. And to be honest, the process was far more messy and complicated than I’ve described here. But in essence, we have been driving our systems in the direction of greater simplicity and faster feedback loops, and in the end — what was the outcome?

Figure 3 — reports are coming out much more reliably, and before 9am on most days (above times are in UTC). More recently they have been arriving at 7.30am! Life is more boring, but far less stressful.

So, we did what we set out to do. Reports are being delivered on time and reliably. The team is no longer waking up in the early hours to fix issues, and instead can work on performance, stability, and fun things like upgrading to Spark 3 in a single day. The main thing is that we can now focus on data quality and availability at a much wider scale, which will benefit all parts of the business.

We’re excited about Data Mesh, Data Downtime detection, Airflow 2.0, plus a whole load of other good stuff. The journey is just beginning; more blog posts to come…

If you like the sound of what we are doing, we’re hiring!
