avatarNnaemezue Obi-Eyisi

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

1830

Abstract

e. This is useful when your Delta Tables are sources to an event driven application.</li></ol><p id="8980">Prerequisites for using Change Data feed</p><ol><li>Create a Delta Table registered in Hive metastore or Unity Catalog</li><li>Enable the Change Data feed property for that table</li></ol><div id="d394"><pre><span class="hljs-keyword">ALTER</span> <span class="hljs-keyword">TABLE</span> myDeltaTable <span class="hljs-keyword">SET</span> TBLPROPERTIES (delta.enableChangeDataFeed <span class="hljs-operator">=</span> <span class="hljs-literal">true</span>)</pre></div><p id="7627"><b>Querying Change Data Feed Delta Tables</b></p><p id="e9f8">My preferred method of querying the Change Data Feed is to use the version number like below</p><div id="1195"><pre>-- version <span class="hljs-keyword">as</span> ints <span class="hljs-built_in">or</span> longs e.g. changes <span class="hljs-keyword">from</span> version <span class="hljs-number">0</span> <span class="hljs-keyword">to</span> <span class="hljs-number">10</span> <span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> table_changes(<span class="hljs-comment">'tableName', 0, 10)</span></pre></div><p id="e1cd">However, for us to know the correct version to query we need to check the History of the Delta table to see the Min and Max Version values. This will help us in designing our ELT pipeline for transmitting changes across our bronze, silver, and gold tables. See sample query below</p><div id="48b1"><pre><span class="hljs-keyword">DESCRIBE</span> HISTORY MYDELTATABLE</pre></div><p id="7415">Below are the Pros and Cons of Delta table change data feed.</p><p id="37c5">Pros</p><ol><li>Will improve ELT pipeline performance by allowing you faster access to the changed data since last processed batch. This will be better than

Options

filtering on a watermark field for really big tables</li><li>Allow you to keep track of deleted records in your delta lake house (if that were happening). Without change data feed, tracking and propagating physical deletes will be difficult.</li><li>You will save cluster compute cost because your pipelines runs faster</li></ol><p id="e608">Cons.</p><ol><li>You will accrue more storage cost because the change data feed creates a new folder in your Delta Lake house where it stores and tracks the changes in the Delta tables. However, storage is cheaper than compute in general…</li><li>You will need to register every single delta table in the hive metastore. I just wish there was a way we could have leverage this feature even as an external delta table.</li></ol><p id="f90f">Finally, I made a YouTube video demoing the real world application of this change data feed in processing incremental loads across bronze, silver and gold tables in our Delta Lake House. In the demo, I describe the best way to design the pipeline to automatically propagate the latest changes and avoid reprocessing already processed data changes. A beginner data engineer can also find this helpful in practicing their data engineering trainings and using this project to impress a potential employer.</p><p id="f472"><b>YouTube Video: <a href="https://www.youtube.com/watch?v=0gkHHokTg0U&amp;t=29s">https://www.youtube.com/watch?v=0gkHHokTg0U&amp;t=29s</a></b></p><p id="c383">Follow me on LinkedIn for more content on data engineering</p><p id="c618"><a href="https://www.linkedin.com/in/nobieyisi/">https://www.linkedin.com/in/nobieyisi/</a></p><p id="29b5">Reference:</p><p id="f51a"><a href="https://docs.databricks.com/delta/delta-change-data-feed.html">https://docs.databricks.com/delta/delta-change-data-feed.html</a></p></article></body>

My Review of Databricks Delta Lake Change Data Feed

In this article, I will give my unbiased assessment of Databricks Delta Lake change data feed, analyzing the potential use cases, pros and cons.

One of the recent features introduced by Databricks is Delta Table Change Data feed. Databricks describes it as below

Change data feed allows Databricks to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records change events for all the data written into the table. This includes the row data along with metadata indicating whether the specified row was inserted, deleted, or updated.” Reference

This means that in Delta Lake Tables we can now access the audit trail of the changes that happened in the table.

Before this we would need to use Databricks time travel feature to see what changed from the previous version of the table to the current version of the table.

How is change data feed useful

  1. Let’s say you want to implement an incremental load data extraction from your bronze delta table to your silver and gold tables. You can identify the changed records with a watermark field. However, Change Data Feed provides this information in a much cleaner and robust way. We can use this to implement incremental data extraction across our Delta Lake house platform.
  2. We can use the Change Data Feed files as a streaming source to propagate the changes to a target of our choice. This is useful when your Delta Tables are sources to an event driven application.

Prerequisites for using Change Data feed

  1. Create a Delta Table registered in Hive metastore or Unity Catalog
  2. Enable the Change Data feed property for that table
ALTER TABLE myDeltaTable SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

Querying Change Data Feed Delta Tables

My preferred method of querying the Change Data Feed is to use the version number like below

-- version as ints or longs e.g. changes from version 0 to 10
SELECT * FROM table_changes('tableName', 0, 10)

However, for us to know the correct version to query we need to check the History of the Delta table to see the Min and Max Version values. This will help us in designing our ELT pipeline for transmitting changes across our bronze, silver, and gold tables. See sample query below

DESCRIBE HISTORY MYDELTATABLE

Below are the Pros and Cons of Delta table change data feed.

Pros

  1. Will improve ELT pipeline performance by allowing you faster access to the changed data since last processed batch. This will be better than filtering on a watermark field for really big tables
  2. Allow you to keep track of deleted records in your delta lake house (if that were happening). Without change data feed, tracking and propagating physical deletes will be difficult.
  3. You will save cluster compute cost because your pipelines runs faster

Cons.

  1. You will accrue more storage cost because the change data feed creates a new folder in your Delta Lake house where it stores and tracks the changes in the Delta tables. However, storage is cheaper than compute in general…
  2. You will need to register every single delta table in the hive metastore. I just wish there was a way we could have leverage this feature even as an external delta table.

Finally, I made a YouTube video demoing the real world application of this change data feed in processing incremental loads across bronze, silver and gold tables in our Delta Lake House. In the demo, I describe the best way to design the pipeline to automatically propagate the latest changes and avoid reprocessing already processed data changes. A beginner data engineer can also find this helpful in practicing their data engineering trainings and using this project to impress a potential employer.

YouTube Video: https://www.youtube.com/watch?v=0gkHHokTg0U&t=29s

Follow me on LinkedIn for more content on data engineering

https://www.linkedin.com/in/nobieyisi/

Reference:

https://docs.databricks.com/delta/delta-change-data-feed.html

Data
Data Engineering
Analytics
Databricks
Data Science
Recommended from ReadMedium