avatarMatthew Salminen

Summary

The web content discusses the integration of Databricks Auto Loader with Delta Live Tables to create efficient, real-time data pipelines using a medallion architecture.

Abstract

The article details the synergy between Databricks Auto Loader and Delta Live Tables, emphasizing their combined ability to handle large-scale data ingestion and processing in real-time. It explains the concept of Delta Live Tables (DLTs) as a framework for managing data pipelines that deal with high-volume and high-velocity data. The author distinguishes between Delta Tables, which are storage tables in Databricks, and Delta Live Tables, which define the flow of data within these tables. The article provides a practical example of implementing a multi-hop architecture—bronze, silver, and gold layers—using DLTs within Databricks notebooks, demonstrating how to define tables, enforce data quality, and execute pipelines. The author also points out the importance of the LIVE keyword in activating Delta Live Tables in the data pipeline and concludes by directing readers to official Databricks documentation for comprehensive tutorials and references on Delta Live Tables.

Opinions

  • The author suggests that Delta Live Tables simplify ETL development and maintenance, particularly for data with high volume and velocity.
  • They highlight the importance of understanding the distinction between Delta Tables and Delta Live Tables to effectively use the Databricks platform.
  • The author expresses that Delta Live Tables facilitate data quality checks through constraints and expectations, which can help ensure the integrity of the data pipeline.
  • It is implied that the medallion architecture, when combined with Delta Live Tables, provides a structured approach to data transformation and refinement, leading to more reliable and actionable data insights.
  • The author emphasizes the benefits of using Delta Live Tables, such as simplifying architecture, ensuring data quality, and delivering high performance, which enable informed decision-making.
  • The article subtly encourages readers to explore further by providing links to official Databricks documentation and tutorials, indicating a supportive stance towards continuous learning and exploration of the platform's capabilities.

The Power Duo: Databricks Auto Loader and Delta Live Tables

In my last two posts, I explained the benefits of using autoloader for your data pipelines with Databricks. A lot can be said that this is a very efficient way to handle incoming data batches for your ingestion process. The power of your data pipeline doesn’t stop there. Even if autoloader simplifies the process of ingesting large volumes of data into Databricks Delta Lake, we have the ability to take batch processing and allow real-time data streams instead. This is where Databricks Delta Live Tables comes in.

What are Delta Live Tables or DLTs?

Delta Live Tables or DLTs for short allows you to create data pipelines that are managed real-time. This will make ETL development easier to execute and maintain when the goal of your ETL involves data coming in with high volume and velocity. It is important to understand the difference between Delta Tables and Delta “Live” Tables before you proceed.

Delta Tables are a way to store your data as tables in Databricks while Delta Live Tables are a way to introduce how data flows within these tables. Whereas Delta Tables provide a table architecture, DLTs are a data pipeline framework that you can use when building your ETL pipeline.

Source: https://techcommunity.microsoft.com/t5/analytics-on-azure-blog/easier-data-model-management-for-power-bi-using-delta-live/ba-p/3500698

How can we incorporate Delta Live Tables into our data pipelines?

Creating Delta Live Tables involves working with the medallion or multi-hop architecture that I explained in my previous post, Databricks Autoloader and Medallion Architecture… Pt 2. Below is an example incorporating the bronze, silver, and gold layers of this architecture within your Databricks notebooks. SQL will be used in this example although the same can be accomplished with pyspark/python.

  1. Create Bronze Layer Tables: Here we are creating a live table called bronze_table and defining the schema/data source. You can also add comments for reference as you create each live table.
-- Bronze Layer Table

CREATE OR REFRESH STREAMING LIVE TABLE bronze_table
COMMENT "This is a sample table"
AS SELECT * FROM cloud_files("${file_path}/table_name_1", "delta",
                            map("schema", "col_name_1 STRING, col_name_2 LONG"))

2. Create Silver Layer Tables: The next level table involves adding a constraint, expectation, and join to pull specific information for our silver_table. Please note, we are using constrant violation DROP ROW and FAIL UPDATE to discard records or fail pipeline. Databricks allows us to define our constraints with expectations for data quality checks.

-- Silver Layer Table

CREATE OR REFRESH STREAMING LIVE TABLE silver_table (
  CONSTRAINT valid_col_name EXPECT (col_name_3 IS NOT NULL) ON VIOLATION DROP ROW
)
TBLPROPERTIES (
  "comment" = "Only valid column names with valid_col_name"
)
AS
SELECT
  col_name_3,
  b.col_name_4,
  col_name_5,
  c.col_name_6 AS name,
  c.col_name_7 AS number,
  CAST(time_period(timestamp, 'yyyy-MM-dd HH:mm:ss') AS TIMESTAMP) AS new_timestamp
FROM
  STREAM(LIVE.silver_table) a
LEFT JOIN
  LIVE.name b
ON
  a.col_name_4 = b.col_name_4;

3. Create Gold Layer Tables: The last level table involves selecting a specific subset of data from our bronze and silver layers that contains particular columns we would like to select for given requirements. Please note, the LIVE in LIVE.silver_table is what activates the Delta Live Table in our data pipeline and not having that provided will not allow you to execute your pipeline.

-- Gold Layer Table

CREATE OR REFRESH LIVE TABLE gold_table
COMMENT "Final results from Bronze/Silver Layers"
AS
  SELECT col_name_3, name, number, col_name_5, TIMESTAMP
  FROM LIVE.silver_table
  WHERE id = 'x'
  GROUP BY col_name_3, name, number

This is a brief example of how you can incorporate your medallion or multi-hop architecture from autoloader’s cloud files into Delta Live Tables. This can open up a realm of possibilities for your real-time data. By simplifying architecture, ensuring data quality, and delivering high performance, Delta Live Tables can allow you to make informed decisions swiftly and accurately.

Please note, this was an example of Delta Live Table creation in your Databricks notebooks and not covering the aspect of running your pipeline in Databricks UI. More information and documentation regarding that can be found here. Below is an example snapshot from Databricks of a pipeline in Databricks UI:

Source: https://www.databricks.com/product/delta-live-tables

(Disclaimer: Please refer to the official Databricks documentation and resources for the most up-to-date information on Delta Live Tables and related features.)

Delta Live Tables SQL Reference: https://docs.databricks.com/en/delta-live-tables/sql-ref.html

Delta Live Tables Python Reference: https://docs.databricks.com/en/delta-live-tables/python-ref.html

SQL Tutorial: https://docs.databricks.com/en/delta-live-tables/tutorial-sql.html

Python Tutorial: https://docs.databricks.com/en/delta-live-tables/tutorial-python.html

Databricks
Delta Live Tables
Data Engineering
Real Time Data
Streaming
Recommended from ReadMedium