Xiaoxu Gao

Summary

This text discusses building a real-time BigQuery pipeline for cost saving and capacity planning, focusing on evaluating query jobs, setting up alerting systems, and monitoring resources in Google Cloud Platform.

Abstract

Cost and resource management in cloud-based big data stacks like BigQuery is crucial for Data Engineers who want to avoid high bills or a slow environment. This article focuses on BigQuery and provides insights into evaluating query jobs from three perspectives: cost, slots, and duration. The author explains the importance of monitoring and alerting in achieving cost-saving goals and understanding resource utilization.

Furthermore, the article discusses setting up a real-time pipeline to monitor BigQuery jobs using BigQuery Audit Logs, PubSub, and Cloud Functions. The author shares tips on parsing BigQuery audit logs and sending alerts to communication channels within the team. The goal is to find problematic jobs and optimize their performance, ultimately saving costs and improving capacity planning.

Bullet points

  • Evaluating query jobs from three perspectives: cost, slots, and duration
  • Importance of monitoring and alerting in BigQuery for cost saving and capacity planning
  • Explanation of slots in BigQuery and monitoring slot usage for capacity planning
  • Setting up a real-time pipeline to monitor BigQuery jobs using BigQuery Audit Logs, PubSub, and Cloud Functions
  • Parsing BigQuery audit logs and sending alerts to communication channels for optimizing problematic jobs
  • Saving costs and improving capacity planning by optimizing query performance
  • Using Sentry for centralized issue tracking and monitoring, and sending alerts to Slack
  • The article aims to inspire readers to try different monitoring and alerting strategies in BigQuery.

How I Build a Real-time BigQuery Pipeline for Cost Saving and Capacity Planning

Cost and resource management in BigQuery

Photo by Morgan Housel on Unsplash

Cost and resource control is a significant concern for Data Engineers working with cloud-based big data stacks. Without it, you can easily end up with an extremely high cloud service bill or receive complaints from the organization about a slow environment. Besides a cheat sheet of cost-saving tips and tricks, another best practice is to have monitoring and alerting in place to achieve the following goals:

  • spot expensive/heavy queries executed by anyone in the organization. The data warehouse may be shared across the entire organization, including people who don’t necessarily understand SQL but still try to look for information. An alert warns them about the low quality of the query, and Data Engineers can help them adopt good SQL practices.
  • spot expensive/heavy scheduled queries at an early stage. An expensive scheduled query is risky because its cost recurs with every run. Having alerting in place can prevent a high bill at the end of the month.
  • understand resource utilization and do a better job of capacity planning.

Each cloud provider has its own house rules; in this article, we will focus on BigQuery. I will talk about how to evaluate a query job and how to set up a real-time alerting system in GCP. Whether you are an engineer who manages the resources or a manager who receives the bills, this is the right article for you!

Three dimensions — cost, slot, duration

3 dimensions of BQ query

We evaluate a query from 3 perspectives — cost, slot, and duration.

Cost

Most organizations, especially startups, put a lot of effort into cost saving. Believe me, it’s an interesting topic on its own. Each time I save money by optimizing queries, I get super excited, as if I had “earned” money from Google. If the query is a scheduled one, the benefit is amplified even further. It’s worth noting that BigQuery provides two pricing models with different cost-saving strategies, and there is an article from Google to help you choose a pricing model.

On-demand pricing: You pay for the number of bytes processed by each query ($5 per TB). You can directly reduce costs by reducing the number of bytes the query processes. Cost monitoring and alerting are particularly important for the on-demand pricing model because there is technically no upper limit on the cost.

On-demand pricing suits workloads where the volume of data is predictable (e.g. reporting workloads). Here are a few sample alerting rules to find expensive scheduled jobs and expensive ad-hoc requests that are not predictable (a small Python sketch after the list shows how to turn the first rule into code). If the cost of ad-hoc requests is too high, you may want to move them to a project with flat-rate pricing.

  • > X USD/query
  • > X USD/user/day
  • > X USD/job/day
  • > X USD for ad-hoc requests/month
  • …
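
To make these rules concrete, the per-query cost under on-demand pricing can be estimated straight from the bytes a query bills. A minimal sketch in Python (the threshold value is a placeholder, not a recommendation):

    # Rough on-demand cost estimate for a single query,
    # using the $5/TB rate mentioned above (1 TB treated as 2**40 bytes).
    ON_DEMAND_USD_PER_TB = 5.0

    def query_cost_usd(total_billed_bytes: int) -> float:
        """Convert billed bytes into an approximate on-demand cost."""
        return total_billed_bytes / 2 ** 40 * ON_DEMAND_USD_PER_TB

    # A ">X USD/query" rule is then a simple comparison.
    COST_ALERT_THRESHOLD_USD = 10.0  # hypothetical threshold

    def is_expensive(total_billed_bytes: int) -> bool:
        return query_cost_usd(total_billed_bytes) > COST_ALERT_THRESHOLD_USD

    # Example: a query billing 2.5 TB costs ~12.5 USD and would trigger an alert.
    assert is_expensive(int(2.5 * 2 ** 40))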

Flat-rate pricing: You pay for dedicated query-processing capacity, measured in slots ($2,000 per month per 100 slots). Monitoring costs is less important here because the bill is the same each month. But you can reduce costs by making your queries run more efficiently so that they use fewer slots. In that case, monitoring your slot usage makes more sense.

Flat-rate pricing suits workloads with SLAs, or ad-hoc requests that need cost predictability and a capacity guarantee. Via BigQuery Reservations, slots can be shared across the organization. For example, projectA with high SLAs reserves 1,000 slots and projectB, which serves ad-hoc requests, reserves 300 slots. If there are fewer ad-hoc requests at the moment, projectA can borrow idle slots from projectB, and vice versa. But as soon as the lending project needs its capacity back, BigQuery automatically and non-disruptively preempts the slots that have been shared with other projects. And you don’t care how much data the ad-hoc requests scan.

Resources — Slot and duration

Whichever pricing model you choose, the number of slots is fixed: 2,000 for the on-demand pricing model, and for flat-rate pricing, however many slots you purchase. So what is a slot and why do we need to monitor it?

A BigQuery slot is a virtual CPU used by BigQuery to execute SQL queries. BigQuery automatically calculates how many slots each query requires, depending on query size and complexity.

If too many complex queries run at the same time, demand will exceed the slots you have committed to. BigQuery won’t charge for additional slots, but your individual units of work will queue up. You can find their performance in the Execution Details of a query job. Each job is broken into several stages: Wait, Read, Compute, and Write. A spike in the Wait stage means that BigQuery has probably run out of slots.

BQ job execution details

Monitoring slot usage is essential for capacity planning. If you see too many queries queueing, you might want to change your pricing model, purchase more slots, or think about how to make your queries more efficient. The Optimize query computation page in the BigQuery documentation is worth checking.

Ok, so we know what a slot is. But how should we set up alerting rules? We can start with something simple: a hardcoded value based on the slot capacity. Let’s say we have a fixed 1,000 slots. If a query uses more than 500 slots, it’s definitely a red flag and needs further investigation.

However, this might result in many false-positive alarms because we are missing another dimension: time. What if a query uses 600 slots on average but runs for only 10 seconds? And vice versa, what if a query runs for 30 minutes but uses only 50 slots on average? Do we want to be alerted in either case?

It all depends on your team’s SRE strategy. I recommend the Google SRE book to learn SRE principles and best practices. The rule of thumb is that an alert should be actionable.

In this graph, I visualized 3 queries and their time/slot usage. The green query uses a lot of slots but runs fast. The yellow query takes a long time but uses only a few slots. The red query is somewhere in between.

Queries with different resource usage (Created by Xiaoxu Gao)

Let’s say my alerting strategy is to find queries that occupy too much capacity for too long, because they might get in the way of critical workloads. A tip for finding the threshold is to take a holistic view of historical jobs with their time and slot information (I will show how to do this in the next section; a quick sketch also follows the figure below). In the following example, my threshold would be (duration > 30min AND slots > 500). A conservative team may take the yellow zone into account as well.

BQ resource threshold (Created by Xiaoxu Gao)
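
One quick way to get that holistic view is to query INFORMATION_SCHEMA, one of the monitoring options mentioned in the next section. A minimal sketch, assuming the google-cloud-bigquery client library and jobs running in the US region:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Duration and average slot usage of last week's query jobs,
    # sorted by total slot consumption.
    sql = """
    SELECT
      job_id,
      user_email,
      TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_s,
      SAFE_DIVIDE(
        total_slot_ms,
        TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)
      ) AS avg_slots
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    ORDER BY total_slot_ms DESC
    LIMIT 100
    """
    for row in client.query(sql):
        print(row.job_id, row.user_email, row.duration_s, row.avg_slots)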

Another perspective on the threshold is to use the area of the query: totalSlotMs = duration × average slots. That is the total number of slot-milliseconds consumed by the query over its entire execution time. For example, a query that averages 500 slots for 30 minutes consumes 500 × 1,800,000 ms = 9 × 10^8 slot-milliseconds.

BQ resource threshold based on totalSlotMs (Created by Xiaoxu Gao)

Alright! We’ve learned how to measure a query from 3 different dimensions and how to define the threshold. Next, we will see how to retrieve information from BigQuery.

Real-time pipeline to monitor BigQuery jobs

Real-time system to receive BQ alerts (Created by Xiaoxu Gao)

BigQuery provides different options for monitoring queries, such as INFORMATION_SCHEMA or the BigQuery monitoring chart. This article focuses on BigQuery Audit Logs, which are the best data source for a real-time pipeline. The pipeline contains 3 stages.

LogsRouter — Route BigQuery data access logs to PubSub

Cloud Audit Logs are a collection of logs that provide insight into operational concerns related to your use of GCP services, including BigQuery. Each data access job is streamed into the audit logs.

Example of BigQuery data access logs (Created by Xiaoxu Gao)

You can route logs to different destinations such as Cloud Storage, Pub/Sub, and BigQuery via sinks. Among them, Pub/Sub is the best candidate because it natively fits real-time use cases. In the sink, you can include a filter that matches only the log entries you want; otherwise, the Pub/Sub topic will be flooded with irrelevant messages.

Example of sink filter
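
For reference, with the legacy AuditData log format, a filter along these lines captures completed query jobs (treat it as a starting point rather than the exact filter used here):

    resource.type="bigquery_resource"
    protoPayload.methodName="jobservice.jobcompleted"
    protoPayload.serviceData.jobCompletedEvent.eventName="query_job_completed"

The sink itself can be created in the console or via gcloud, for example:

    gcloud logging sinks create bq-jobs-to-pubsub \
      pubsub.googleapis.com/projects/PROJECT_ID/topics/TOPIC_ID \
      --log-filter='resource.type="bigquery_resource" AND protoPayload.methodName="jobservice.jobcompleted"'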

PubSub — Route logs to Cloud Function for processing

Pub/Sub is essentially a queue. We need subscribers, such as a Cloud Function, to consume the logs and process them. After creating the topic, you can click the “Trigger Cloud Function” button to create a Cloud Function in any supported programming language. How easy is that!

Button to create cloud function

CloudFunction — Process logs and send alerts

Each filtered log entry triggers the Cloud Function, and the log is presented in JSON format like the example above. You can check the latest audit log schema here. We will leverage the JobStatistics section for alerting; a minimal trigger skeleton follows.
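
In Python, a Pub/Sub-triggered background function receives the log entry as a base64-encoded payload. A minimal skeleton (the function name is mine):

    import base64
    import json

    def process_bq_log(event, context):
        """Background Cloud Function triggered by a Pub/Sub message."""
        # The routed log entry arrives base64-encoded in event["data"].
        log_entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        print(log_entry["protoPayload"]["methodName"])  # e.g. jobservice.jobcompleted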

Example of job statistics

Here is a minimal sketch of such a parser, assuming the legacy AuditData format shown above; the thresholds are placeholders, and the field names differ if you enable the newer BigQueryAuditMetadata format.
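
    from datetime import datetime

    # Hypothetical thresholds from the earlier discussion.
    SLOT_THRESHOLD = 500          # average slots
    DURATION_THRESHOLD_S = 30 * 60

    def _ts(value: str) -> datetime:
        # jobStatistics timestamps are RFC 3339 strings, e.g. "2021-05-01T12:00:00.123Z".
        return datetime.fromisoformat(value.replace("Z", "+00:00"))

    def evaluate_job(log_entry: dict) -> dict:
        """Extract cost/slot/duration figures from a jobCompletedEvent entry."""
        payload = log_entry["protoPayload"]
        job = payload["serviceData"]["jobCompletedEvent"]["job"]
        stats = job["jobStatistics"]

        duration_s = (_ts(stats["endTime"]) - _ts(stats["startTime"])).total_seconds()
        total_slot_ms = int(stats.get("totalSlotMs", 0))
        avg_slots = total_slot_ms / (duration_s * 1000) if duration_s else 0.0
        cost_usd = int(stats.get("totalBilledBytes", 0)) / 2 ** 40 * 5  # on-demand $5/TB

        return {
            "job_id": job["jobName"]["jobId"],
            "user": payload["authenticationInfo"]["principalEmail"],
            "duration_s": duration_s,
            "avg_slots": avg_slots,
            "cost_usd": cost_usd,
            # The (duration > 30min AND slots > 500) rule from the threshold section.
            "alert": duration_s > DURATION_THRESHOLD_S and avg_slots > SLOT_THRESHOLD,
        }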

Based on the threshold, jobs that fall into the alerting zone are sent to whatever communication channel your team uses. In my team, we use Sentry as a centralized issue-tracking and monitoring platform. Each problematic job is sent to Sentry, and Sentry then forwards it to Slack in a better format. This is our final result!
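
With the Sentry Python SDK, forwarding such a job can be a one-liner per alert. A sketch (the DSN and message format are placeholders):

    import sentry_sdk

    sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

    def report_job(result: dict) -> None:
        # `result` is the dict returned by evaluate_job() above.
        if result["alert"]:
            sentry_sdk.capture_message(
                f"Heavy BigQuery job {result['job_id']} by {result['user']}: "
                f"{result['avg_slots']:.0f} avg slots for {result['duration_s']:.0f}s "
                f"(~{result['cost_usd']:.2f} USD)",
                level="warning",
            )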

Sample alert received in Slack via Sentry (Created by Xiaoxu Gao)

Conclusion

As always, I hope you find this article useful and inspiring. Once you have the infrastructure in place, there is a lot of room for you to try out different monitoring and alerting strategies. Cheers!
