How I Built a Real-Time BigQuery Pipeline for Cost Saving and Capacity Planning
Cost and resource management in BigQuery
Cost and resource control is a significant challenge for Data Engineers working with cloud-based big data stacks. Without it, you can easily end up with an extremely high cloud service bill or receive complaints from the organization about a slow environment. Besides a cheat sheet of cost-saving tips and tricks, another best practice is to have monitoring and alerting in place to achieve the following goals:
- spot expensive/heavy queries executed by anyone in the organization. The data warehouse can be shared across the entire organization, including people who don’t necessarily understand SQL but still look for information. An alert warns them about a low-quality query, and Data Engineers can help them with good SQL practices.
- spot expensive/heavy scheduled queries at an early stage. A very expensive scheduled query is risky. Having alerting in place can prevent a high bill at the end of the month.
- understand resource utilization and do a better job of capacity planning.
Each cloud provider has different house rules; in this article, we will focus on BigQuery. I will talk about how to evaluate a query job and how to set up a real-time alerting system in GCP. Whether you are an engineer who manages the resources or a manager who receives the bills, this is the right article for you!
Three dimensions — cost, slot, duration
We evaluate a query from 3 perspectives — cost, slot, and duration.
Cost
Most organizations, especially startups, put a lot of effort into cost saving. Believe me, it’s an interesting topic on its own. Each time I save money by optimizing queries, I get super excited as if I “earned” money from Google. If the query is a scheduled one, the benefit is amplified even more. It’s worth noting that BigQuery provides 2 pricing models with different cost-saving strategies. There is an article from Google to help you choose a pricing model.
On-demand pricing: You pay for the number of bytes processed by each query ($5 per TB). You can directly reduce costs by reducing the number of bytes that a query processes. Cost monitoring and alerting are especially important for the on-demand pricing model because there is technically no upper limit on the cost.
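If you want to estimate a query’s on-demand cost before running it, a dry run reports the bytes it would scan without executing it. Here is a minimal sketch using the google-cloud-bigquery Python client; the project ID and table are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# A dry run validates the query and reports bytes scanned, without running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my-project.my_dataset.events`",  # hypothetical table
    job_config=job_config,
)

tb = job.total_bytes_processed / 1024**4
print(f"Estimated scan: {tb:.4f} TB, ~${tb * 5:.2f} at on-demand pricing")
```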
On-demand pricing is for workloads where the volume of data is predictable (e.g. reporting workloads). Here are a few sample alerting rules to find expensive scheduled jobs and expensive ad-hoc requests, which are not predictable (see the sketch after this list). If the cost of ad-hoc requests is too high, you may want to move them to a project with flat-rate pricing.
- >X USD/query
- >X USD/user/day
- >X USD/job/day
- >X USD for ad-hoc/month
- …
Flat-rate pricing: You pay for dedicated query processing capacity, measured in slots ($2,000/month per 100 slots). Monitoring costs matters less here because the bill is the same every month. But you can reduce costs by making your queries run more efficiently so they use fewer slots. In that case, monitoring your slot usage makes more sense.
Flat-rate pricing is for workloads with SLAs, or ad-hoc requests that need cost predictability and a capacity guarantee. Via BigQuery Reservations, slots can be shared across the organization. For example, projectA with high SLAs reserves 1,000 slots and projectB for ad-hoc requests reserves 300 slots. If there are fewer ad-hoc requests at the moment, projectA can borrow idle slots from projectB and vice versa. But when projectA runs, BigQuery will automatically and non-disruptively preempt the slots that have been shared with other projects. And you don’t care how much data the ad-hoc requests scan.
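For illustration, such a reservation could be set up programmatically with the google-cloud-bigquery-reservation client; the admin project, location, names, and slot counts below are all hypothetical, and in practice you may prefer the console or CLI:

```python
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = client.common_location_path("admin-project", "US")  # hypothetical

# ignore_idle_slots=False lets this reservation borrow idle slots from others.
res = client.create_reservation(
    parent=parent,
    reservation_id="project-a-high-sla",  # hypothetical name
    reservation=reservation.Reservation(slot_capacity=1000, ignore_idle_slots=False),
)

# Route projectA's query jobs to the reservation.
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee="projects/projectA",
    ),
)
```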
Resources — Slot and duration
Whichever pricing model you choose, the number of slots is fixed (2,000 for the on-demand pricing model, and X for flat-rate pricing depending on how many slots you purchase). So what is a slot and why do we need to monitor it?
A BigQuery slot is a virtual CPU used by BigQuery to execute SQL queries. BigQuery automatically calculates how many slots each query requires, depending on query size and complexity.
If too many complex queries run at the same time, query demand will exceed the slots you committed to. BigQuery won’t charge for additional slots, but your individual units of work will queue up. You can find their performance in the Execution Details of a query job. Each job is broken into several stages: Wait, Read, Compute, and Write. A peak in the Wait stage means that BigQuery has probably run out of slots.
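You can also read these per-stage timings programmatically from a finished job’s query plan. A minimal sketch, assuming the google-cloud-bigquery client and a known job ID:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID
job = client.get_job("bquxjob_123abc", location="US")  # hypothetical job ID

# One entry per execution stage; long average waits hint at a slot shortage.
for stage in job.query_plan:
    print(
        f"{stage.name}: wait={stage.wait_ms_avg}ms read={stage.read_ms_avg}ms "
        f"compute={stage.compute_ms_avg}ms write={stage.write_ms_avg}ms"
    )
```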
Monitoring slot usage is required for capacity planning. If you see too many queries queueing, you might want to change your pricing model, purchase more slots, or think about how to make your queries more efficient. The Optimize query computation page from BigQuery is worth checking.
Ok, so we know what a slot is. But how should we set up alerting rules? We can start with something simple — a hardcoded value based on the slot capacity. Let’s say we have a fixed 1,000 slots. If a query uses more than 500 slots, then it’s definitely a red flag and needs further investigation.
However, this might result in many false positive alarms because we are missing another dimension: time. What if a query uses 600 slots on average but runs for only 10 seconds, do we want to be alerted? And vice versa, what if a query runs for 30 minutes but uses only 50 slots on average?
It’s all up to your team’s SRE strategy. I recommend the Google SRE book to learn SRE principles and best practices. The rule of thumb is that an alert should be actionable.
In this graph, I visualized 3 queries and their time/slot usage. The green query uses a lot of slots but runs fast. The yellow query takes a long time but uses only a few slots. The red query is somewhere in between.
Let’s say my alerting strategy is to find queries that occupy too many slots for too long, because they might get in the way of critical workloads. A tip for finding the threshold is to take a holistic view of historical jobs with their time and slot information (I will show how to do this in the next section). In the following example, my threshold would be (duration > 30min AND slots > 500). A conservative team may take the yellow zone into account as well.
Another perspective on the threshold is to use the area of the query, which is totalSlotMs = duration (ms) × average slots. That is the total number of slot-milliseconds the query consumes over its entire execution.
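Both views of the threshold can be computed from INFORMATION_SCHEMA, since total_slot_ms is recorded per job. A sketch using the example thresholds above (region and project ID are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Average slots = total_slot_ms / duration_ms; flag jobs that are both long and slot-hungry.
sql = """
SELECT * FROM (
  SELECT
    job_id,
    user_email,
    total_slot_ms,
    TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS duration_ms,
    SAFE_DIVIDE(total_slot_ms,
                TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE job_type = 'QUERY'
    AND state = 'DONE'
    AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
)
WHERE duration_ms > 30 * 60 * 1000  -- duration > 30 min
  AND avg_slots > 500               -- > 500 slots on average
"""
for row in client.query(sql):
    print(row.job_id, row.user_email, row.duration_ms, row.avg_slots)
```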
Alright! We’ve learned how to measure a query from 3 different dimensions and how to define the threshold. Next, we will see how to retrieve information from BigQuery.
Real-time pipeline to monitor BigQuery jobs
BigQuery provides different options for monitoring queries, such as INFORMATION_SCHEMA or the BigQuery monitoring charts. This article focuses on BigQuery Audit Logs, which are the best data source for a real-time pipeline. The pipeline contains 3 stages.
LogsRouter — Route BigQuery data access logs to PubSub
Cloud Audit Logs are a collection of logs that provide insight into operational concerns related to your use of GCP services, including BigQuery. Each data access job is streamed into the Audit Logs.
You can route logs to different destinations like Cloud Storage, Pub/Sub, and BigQuery via sinks. Among them, Pub/Sub is the best candidate because it natively fits real-time use cases. In the sink, you can include a filter that matches only the log entries you want to keep. Otherwise, the Pub/Sub topic will be overloaded with irrelevant messages.
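As a sketch, such a sink can be created with the google-cloud-logging client. The filter below targets completed BigQuery jobs in the BigQueryAuditMetadata format; the project, topic, and sink names are placeholders, and you should verify the filter against your own log entries:

```python
from google.cloud import logging

client = logging.Client(project="my-project")  # hypothetical project ID

# Keep only completed BigQuery job events out of the whole audit stream.
log_filter = (
    'protoPayload.metadata."@type"='
    '"type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" '
    'AND protoPayload.metadata.jobChange.after="DONE"'
)

sink = client.sink(
    "bq-jobs-to-pubsub",  # hypothetical sink name
    filter_=log_filter,
    destination="pubsub.googleapis.com/projects/my-project/topics/bq-audit-jobs",
)
sink.create()  # the sink's writer identity then needs publish rights on the topic
```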