90% cost saving using Terraform to create BigQuery subscriptions to read from PubSub

Mark Duce
Published in loveholidays tech · 5 min read · Nov 4, 2022

At loveholidays, we deal with substantial amounts of data. To borrow a phrase from Google, we “favour data not opinion”. With that in mind, many of our applications record data that helps us reach data-assisted decisions. A lot of that data is stored in BigQuery, which feeds the suite of reporting tools that the business uses.

For a long time it has been surprisingly complicated to get data from PubSub to BigQuery, but this year Google launched BigQuery subscriptions. In this article, we’ll talk you through how our architecture has evolved, as well as how to implement a BigQuery subscription using Terraform.

We’ve reduced our costs by continually optimising our architecture

If you’d like to get straight to the good part of how to do it, head down to “How to set up a BigQuery subscription using Terraform”; otherwise, read on for how we got there.

How we previously got data from PubSub to BigQuery

One of our principles is to “think big, deliver incrementally”, and that’s certainly been the case on the journey to where we are now.

Iteration 1

Our first iteration involved deploying two applications: the application that produced the data we wanted to record, alongside a second application responsible for writing that data to BigQuery. In our case, both were deployed to our Kubernetes cluster and the table was created manually in BigQuery. This flow worked OK, but it meant there were three copies of the schema to maintain: one in the initial app recording the data, one in the second app responsible for writing it, and one managed by hand in BigQuery.

Iteration 2

Our second iteration used Dataflow to take the data to BigQuery. There was one big problem with this flow: cost. Dataflow was eye-wateringly expensive, costing us around £4.5k a month.

Iteration 3

So we looked for a different approach. We dramatically lowered the cost by implementing a Cloud Function in Go that subscribed to PubSub and wrote the data into BigQuery. We also automated the provisioning of our BigQuery tables; we now use Terraform to create and manage them. This served us well in production for over a year, but there were some downsides. We still had three copies of the schema to maintain in three quite separate places, and adding a new column required changes in three repositories to get it live. The upside is that it only cost us around £1k a month.

Iteration 4

This year, we heard the good news that Google had launched BigQuery subscriptions, allowing you to take data straight from PubSub to BigQuery. So we wanted to check that out, and even better, we wanted to use Terraform to do it.

How to set up a BigQuery subscription using Terraform

First of all, make sure you’re using a fairly recent version of the Google Terraform provider. You need to be on at least version 4.32.0 for the bigquery_config block on the subscription resource to be available.
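
As a rough sketch, a provider constraint along these lines (the layout here is just illustrative) will keep you on a recent enough version:

terraform {
  required_providers {
    google = {
      # bigquery_config on google_pubsub_subscription needs at least 4.32.0
      source  = "hashicorp/google"
      version = ">= 4.32.0"
    }
  }
}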

In our example, we’re going to set up a BigQuery dataset, a BigQuery table, a PubSub topic, and the BigQuery subscription itself.

Here’s how to set up the BigQuery dataset and table:

resource "google_bigquery_dataset" "dataset_bigquery_subscription_demo" {
  project       = "your-project-name"
  dataset_id    = "bigquery_subscription_demo"
  friendly_name = "bigquery subscription demo"
  description   = "A demo dataset to show how it works"
}

resource "google_bigquery_table" "table_event_demo" {
  project     = "your-project-name"
  dataset_id  = google_bigquery_dataset.dataset_bigquery_subscription_demo.dataset_id
  table_id    = "demo_event"
  description = "An example table"
  schema      = <<EOF
[
  {"name": "creation_time", "type": "DATETIME", "mode": "REQUIRED"},
  {"name": "event_name", "type": "STRING", "mode": "REQUIRED"}
]
EOF
}

Then here’s how to set up the topic. Note that because the subscription will use the topic schema, the topic must have a schema applied, and that schema should conform to your BigQuery table. In this example we’re using AVRO, but you can use PROTOCOL_BUFFER instead.

resource "google_pubsub_schema" "schema_event_demo" {
  project    = "your-project-name"
  name       = "event_demo_schema"
  type       = "AVRO"
  definition = <<EOF
{
  "type": "record",
  "name": "event_demo",
  "fields": [
    {
      "name": "creation_time",
      "type": "string"
    },
    {
      "name": "event_name",
      "type": "string"
    }
  ]
}
EOF
}

resource "google_pubsub_topic" "topic_event_demo" {
  project    = "your-project-name"
  name       = "demo_event_topic"
  depends_on = [google_pubsub_schema.schema_event_demo]

  schema_settings {
    schema   = "projects/your-project-name/schemas/${google_pubsub_schema.schema_event_demo.name}"
    encoding = "JSON"
  }
}

Next, we need to give the PubSub service account (a Google-managed service account that exists at project level) permission to see the BigQuery table’s metadata and to insert data into it.

resource "google_project_iam_member" "permissions_event_demo" {
  for_each = toset(["roles/bigquery.dataEditor", "roles/bigquery.metadataViewer"])

  project = "your-project-name"
  role    = each.value
  member  = "serviceAccount:service-${data.google_project.project.number}@gcp-sa-pubsub.iam.gserviceaccount.com"
}
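
One thing to note: the member string above references a google_project data source, which isn’t shown in the snippet. A minimal sketch of that lookup (assuming the same project as the rest of the example) looks like this:

# Looks up the project so we can build the PubSub service agent's email
# address from the project number.
data "google_project" "project" {
  project_id = "your-project-name"
}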

Finally, we need to create the subscription. The bigquery_config block provides the magic here.

resource "google_pubsub_subscription" "subscription_event_demo" {
  project = "your-project-name"
  name    = "event_demo_subscription"
  topic   = google_pubsub_topic.topic_event_demo.name

  bigquery_config {
    table            = "your-project-name:${google_bigquery_table.table_event_demo.dataset_id}.${google_bigquery_table.table_event_demo.table_id}"
    use_topic_schema = true
  }

  depends_on = [google_project_iam_member.permissions_event_demo]
}

And that’s it. Once you’ve set all of these up, you’ll have data going directly from PubSub to BigQuery. There’s still one thing we’re not comfortable with: there are more copies of the schema to maintain than we’d like. Happily, we only have the model in two repositories instead of three, but we still have it in three places across those two repositories. Hopefully in the future we’ll find a way to improve that, but in the meantime we’re happy that we’re following one of our principles by investing in the simplicity of this architecture.

And here’s the really good news: it looks like we can save around 90% of what we were spending on Cloud Functions by using this. Win!
