avatarChristianlauer

Summarize

3 Best Practices for Data Engineers in BigQuery

How to easily improve Query Performance in Google’s Data Warehouse

Photo by Karsten Winegeart on Unsplash

Especially during data integration from source systems into Google BigQuery but also when creating views when directly accessing a data source, there are three key best practices that Data Engineers but also other experts such as Data Analysts and Data Scientists should follow to increase performance and to reduce costs.

Best Practice 1: Cluster Tables

Like in other databases it’s useful to create indexing. You can use indexes to improve query performance, for example, by creating clustered indexes for frequently searched columns.

Example of Clustering — Image Source: Google [1]

As you can see above, the content of the data in the table is sorted by a column, so to speak, which can improve the performance of certain types of queries, such as those using filter (scans of unnecessary data blocks can be eliminated) clauses and queries aggregating (sorted blocks collocate rows with similar values) data[2]. You can also cluster multiple columns as shown in the example above.

Best Practice 2: Use Partitions

With partitions you can divide large tables into smaller partitions to reduce the amount of data that must be scanned during a query.

Example of Partition in BigQuery — Image Source: Google[2]

Normally, you would use aTIMESTAMP/DATE column or an INTEGER column as a partition column. Also, it often makes sense to use partition and clustering together. Here, is a small example of how to use the two mechanisms in BigQuery SQL:

CREATE CLUSTERED TABLE 
`your_dataset.clustered_table` 
PARTITION BY DATE(timestamp_column) 
CLUSTER BY column1 
AS SELECT * FROM `your_dataset.your_table`

Best Practice 3: Working with Materialized Views

In my opinion, the third must do in BigQuery as a Data Engineer is to use materialized views. They can significantly improve query performance by pre-computing and storing the results of a query, which can save time and resources and are useful when you have use cases like[3]:

  • Pre-aggregating or pre-filtering of large data sets or streaming data.
  • Pre-joining data especially when joining between large and small tables.
  • Running queries from a clustering scheme that differs from the base tables.

In addition, materialized views also can be used to simplify data access for users by providing a simplified view of the data without requiring them to understand the underlying data structure. Most important to me in regards to costs is also the case if you have many users that frequently run the same query, because creating a materialized view for that query can save many resources and therefore also costs.

These three approaches in BigQuery are kind of the be-all and end-all for me if you’re a Data Engineer working with BigQuery. Of course there are also other ways to increase performance and save costs in BigQuery, but these three are the most well-known and should be taken into account already during data integration, so before you even load data, they should be considered in your concept. If you are also looking for tips and tricks for working in BigQuery, you might find the following article interesting:

Sources and Further Readings

[1] Google, Introduction to clustered tables (2023)

[2] Google, BigQuery explained: Storage overview, and how to partition and cluster your data for optimal performance (2023)

[3] Google, Introduction to materialized views (2023)

Data Science
Programming
Google
Bigquery
Technology
Recommended from ReadMedium