Christianlauer

Summary

Google's BigQuery now supports Change Data Capture (CDC) using the BigQuery Storage Write API, allowing real-time data tracking and replication for better business performance.

Abstract

Google's BigQuery, a SaaS Data Warehouse, has introduced Change Data Capture (CDC) support using the BigQuery Storage Write API. CDC is a process that captures changes made to a database in real-time, enabling organizations to make informed decisions based on up-to-date information. The BigQuery Storage Write API is a unified data ingest API that combines streaming ingest and batch loading into a single high-performance API. To use BigQuery CDC, certain conditions must be met, such as using the Storage Write API in the default stream, declaring primary keys for the destination table, and having sufficient BigQuery compute resources available. Although the feature is currently in preview, it is a significant addition for Data Engineers using BigQuery.

Bullet points

  • Google's BigQuery now supports Change Data Capture (CDC) using the BigQuery Storage Write API.
  • CDC allows real-time data tracking and replication for better business performance.
  • The BigQuery Storage Write API is a unified data ingest API that combines streaming ingest and batch loading.
  • To use BigQuery CDC, certain conditions must be met:
    • Use the Storage Write API in the default stream.
    • Declare primary keys for the destination table in BigQuery.
    • The destination table in BigQuery must be clustered.
    • Sufficient BigQuery compute resources must be available.
  • The feature is currently in preview.

Google launches CDC for BigQuery

Change Data Capture in Real Time using BigQuery API

Photo by Cedric Letsch on Unsplash

Google’s SaaS data warehouse BigQuery now supports Change Data Capture (CDC): it processes streamed changes and applies them to existing data in real time using the BigQuery Storage Write API [1].

Change Data Capture (CDC) is a process that captures changes made to a database so that they can be tracked and replicated in real time. CDC is commonly used in data integration, data warehousing, and data analytics to keep data in sync between different systems. One of the main benefits of CDC is that it allows organizations to make data-driven decisions based on the most up-to-date information.

With CDC, companies can access real-time data that helps them make more informed decisions, respond quickly to changes, and improve their overall business performance.
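To make the capture-and-replicate idea concrete, here is a minimal sketch in plain Python of how a stream of change events could be applied to a keyed table. It does not call BigQuery at all. The change types UPSERT and DELETE mirror the ones BigQuery's CDC documentation describes, while the `ChangeEvent` class and the `apply_changes` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical type, for illustration only)."""
    change_type: str             # "UPSERT" or "DELETE"
    key: Hashable                # primary key value of the affected row
    row: Optional[dict] = None   # new row content; unused for DELETE


def apply_changes(table: Dict[Hashable, dict],
                  events: Iterable[ChangeEvent]) -> Dict[Hashable, dict]:
    """Return a copy of `table` with all events applied in order."""
    result = dict(table)
    for event in events:
        if event.change_type == "UPSERT":
            # Insert the row if the key is new, otherwise overwrite it.
            result[event.key] = event.row
        elif event.change_type == "DELETE":
            # Deleting a missing key is treated as a no-op in this sketch.
            result.pop(event.key, None)
        else:
            raise ValueError(f"unknown change type: {event.change_type}")
    return result


replica = apply_changes(
    {1: {"id": 1, "name": "Alice"}},
    [
        ChangeEvent("UPSERT", 2, {"id": 2, "name": "Bob"}),
        ChangeEvent("UPSERT", 1, {"id": 1, "name": "Alicia"}),
        ChangeEvent("DELETE", 2),
    ],
)
print(replica)
```

The primary key decides whether an upsert inserts or overwrites, which is why a CDC target needs a declared key in the first place.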

CDC within BigQuery (image source: Google [2])

The BigQuery Storage Write API is a unified data ingestion API for BigQuery that combines streaming ingestion and batch loading into a single high-performance API. You can use it to stream records into BigQuery in real time, or to process an arbitrarily large number of records in a batch and commit them in a single atomic operation. To use BigQuery CDC, your data workflow and schema must meet the following conditions [3]:

  • You must use the Storage Write API in the default stream.
  • You have to declare primary keys for the destination table in BigQuery. Composite primary keys that include up to 16 columns are supported.
  • Your destination table in BigQuery must be clustered.
  • Sufficient BigQuery compute resources must be available to perform the CDC row operations.
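The table-side conditions above can be sketched as a small Python helper that assembles a `CREATE TABLE` statement with a primary key and a clustering clause, and that builds the resource path of a table's default stream. This is a sketch under assumptions, not an official recipe: the helper names, the table `my-project.sales.customers` and its columns are hypothetical, and the stream path follows the format given in the Storage Write API reference as I understand it. Nothing here talks to BigQuery; the helper only produces strings you could inspect before running them.

```python
from typing import List, Sequence, Tuple

# Limit taken from the conditions listed above.
MAX_PRIMARY_KEY_COLUMNS = 16


def default_stream_path(project: str, dataset: str, table: str) -> str:
    """Resource path of a table's default write stream (assumed format)."""
    return (f"projects/{project}/datasets/{dataset}"
            f"/tables/{table}/streams/_default")


def build_cdc_table_ddl(table_id: str,
                        columns: Sequence[Tuple[str, str]],
                        primary_key: Sequence[str],
                        cluster_by: Sequence[str]) -> str:
    """Build DDL for a clustered table with a declared primary key."""
    names = {name for name, _ in columns}
    if not primary_key or len(primary_key) > MAX_PRIMARY_KEY_COLUMNS:
        raise ValueError("primary key needs between 1 and 16 columns")
    if not cluster_by:
        raise ValueError("the destination table must be clustered")
    for col in list(primary_key) + list(cluster_by):
        if col not in names:
            raise ValueError(f"unknown column: {col}")

    lines: List[str] = [f"  {name} {sql_type}," for name, sql_type in columns]
    # BigQuery primary keys are declared but not enforced.
    lines.append(f"  PRIMARY KEY ({', '.join(primary_key)}) NOT ENFORCED")
    body = "\n".join(lines)
    return (f"CREATE TABLE `{table_id}` (\n{body}\n)\n"
            f"CLUSTER BY {', '.join(cluster_by)}")


ddl = build_cdc_table_ddl(
    "my-project.sales.customers",           # hypothetical table
    [("id", "INT64"), ("name", "STRING")],  # hypothetical schema
    primary_key=["id"],
    cluster_by=["id"],
)
print(ddl)
print(default_stream_path("my-project", "sales", "customers"))
```

Clustering on the primary key columns is a natural choice here, since CDC row operations look rows up by key, but the source only states that the table must be clustered.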

Once again, Google has added a great feature for BigQuery users, one that should make Data Engineers especially happy. Keep in mind, however, that the feature is still in preview.

Sources and Further Readings

[1] Google, BigQuery release notes (2023)

[2] Google, Database replication using change data capture (2020)

[3] Google, Stream table updates with change data capture (2023)
