This blog post discusses implementing Spark ETL (Extract, Transform, Load) processes using Apache HUDI to create a Lakehouse, detailing the steps to read data from MySQL, load it into HUDI format, and manage data with HUDI's features.
Abstract
The article provides a comprehensive guide on using Apache HUDI with Spark for ETL tasks to build a Lakehouse. It begins by referencing a previous blog on Delta Lake and then delves into the specifics of Apache HUDI, explaining its role in data management within Hadoop-based data lakes. The author outlines key features of HUDI, such as support for upserts, deletes, increments, transactional writes, delta storage, schema evolution, and data indexing. The post walks through practical steps, including reading data from a MySQL server into Spark, creating HIVE temporary views, loading filtered data into HUDI format, and exploring the data. It also covers the configuration of Spark with HUDI packages and the process of reading HUDI tables using Spark data frames. The article concludes with a summary of the concepts learned, emphasizing the practical applications of HUDI in managing big data effectively.
Opinions
The author positions Apache HUDI as a valuable tool for managing big data in a Hadoop-based data lake, highlighting its efficiency and reliability.
There is an emphasis on the practicality of HUDI's features, such as ACID transactions and schema evolution, for real-world ETL tasks.
The blog suggests that HUDI's approach to data storage and processing, including delta files and indexing capabilities, significantly enhances query performance.
The author implies that the integration of HUDI with Spark and HIVE is straightforward and beneficial for data engineers looking to implement a Lakehouse architecture.
By providing code examples and screenshots, the author conveys a hands-on, tutorial-like approach to learning, which can be particularly useful for developers and data engineers.
The mention of companies like Uber, Alibaba, and Verizon Media using HUDI suggests an industry-wide endorsement of its capabilities.
The inclusion of a video explanation at the end of the blog post indicates a preference for multimedia learning resources to complement the written content.
Spark ETL Chapter 8 with Lakehouse | Apache HUDI
Previous blog/Context:
In an earlier blog, we discussed Spark ETL with Lakehouse (with Delta Lake). Please find below blog post for more details.
In this blog, we will discuss Spark ETL with Apache HUDI. We will first understand what Apache HUDI is and why Apache HUDI is used for creating Lake houses. We will source data from one of the source systems which we have learned till now and load that data into Apache HUDI format. We will create an on-premise lake house and load all data into it.
What is Apache HUDI?
Apache Hudi is an open-source data management framework for Apache Hadoop-based data lakes. Hudi stands for “Hadoop Upserts Deletes and Incrementals.” It provides a way to manage data in a big data environment with features like data ingestion, data processing, and data serving. Hudi was originally developed by Uber and was later contributed to the Apache Software Foundation as an open-source project.
Hudi provides several key features that make it useful for managing big data, including:
Upserts, deletes, and increments: Hudi supports efficient updates and deletes existing data in a Hadoop-based data lake, allowing for incremental data processing.
Transactional writes: Hudi supports ACID transactions, ensuring that data is consistent and reliable.
Delta storage: Hudi stores data as delta files, which allows for fast querying and processing of data changes.
Schema evolution: Hudi supports schema evolution, enabling changes to the schema without requiring a full reload of the data.
Data indexing: Hudi provides indexing capabilities that make it easy to query data in a Hadoop-based data lake.
Overall, Hudi provides a flexible and efficient way to manage big data in a Hadoop-based data lake. It enables efficient data processing and querying while ensuring data consistency and reliability through ACID transactions. Hudi is used by a variety of companies and organizations, including Uber, Alibaba, and Verizon Media
Spark ETL with different Data Sources (Image by Author)
Today, we will be doing the operations below ETL and with this, we will also be learning about the Apache iceberg and how to build a lake house.
Read data from MySQL server into Spark
Create a HIVE temp view from a data frame
Load filtered data into HUDI format (create initial table)
Load filtered data again into HUDI format in the same table
Read HUDI tables using Spark data frame
Create Temp HIVE of HUDI tables
Explore data
First, clone below GitHub repo, where we have all the required sample files and solution
If you don’t have a setup for Spark instance follow the earlier blog for setting up Data Engineering tools in your system. (Data Engineering suite will setup Spark, MySQL, PostgreSQL, and MongoDB in your system) In that Spark instance, we already have packages installed for Azure blog storage and Azure Data Lake Services.
Start the Spark application with all required packages
First, we will start the Spark session with all the required packages and configurations for Apache HUDI. We know that with our spark instance, we don’t have packages (jar file) available for Apache HUDI, so when we start the spark session, we need to externally specify that. We will also be using MySQL so we will specify package requirements for MySQL also.
With Apache Iceberg, we also need to pass the configurations below.
Starting Spark application with required Apache Hudi Packages (Image by Author)
Now, we have our spark session available with all the required packages and configuration, so we can start the ETL process.
Read data from MySQL server into Spark
(If you have already completed chapter 7, you can skip reading data from MySQL and create the HIVE table, and can directly go to create HUDI table section)
For this ETL, we are also using the same MySQL as the source system and are loading the same table. We will not discuss much on how to load data from MySQL and how to create a HIVE table as we have already discussed in detail in Chapter 7.
If you don’t have already uploaded data into MySQL, please follow the earlier blog for the same.
MySQL Table which we will load into Spark (Image by Author)
We will read this data from Spark and we will create a spark data frame and HIVE table on this.
Creating connection with MySQL from Spark (Image by Author)
Create a HIVE temp view from a data frame
We will create a HIVE temp view from a data frame.
Create a HIVE table and explore Spark SQL (Image by Author)
We will explore data and check the highest food group and filter with one group.
Load filtered data into HUDI format (create initial table)
We have data frame “newdf” available in which we have only one group food is there. We will use that to create the first hudi table.
Create first Apache Hudi table (Image by Author)
With the HUDI format, we need to pass a few options. And some of the options are mandatory, if we don’t pass that it will not create a table. Few mandatory columns have passed in our example.
We also need to pass the base path.
Once, we prepare options parameters and base path, using the format “hudi” we can create a hudi table.
It will create a folder named “hudi_food” and create a parquet file in which it will store data and metadata.
Hudi first table on local drive (Image by Author)
Inside “hudI_food”, we have metadata and a parquet file in which we have data. The folder structure will be as below
Hudi Data and Metadata format (Image by Author)
Which looks like this in the folder
Image by Author
Inside the “.hoddie” folder
Hudi metadata folder (Image by Author)
properties file has all the properties of the HUDI table.
Properties in metadata file (Image by Author)
parquet file with actual data.
Data file (Image by Author)
If in case, if we also specify the partition property, it will create a folder as below
Creating a Hudi table with more properties (Image by Author)
It will create a folder as below
Hudi table with more properties (Image by Author)
Inside “hudi_food”
Data and Metadata at the file server level (Image by Author)
Load filtered data again into hudi format in the same table
We will create one more data frame by filtering one more food group and then appending data into the same hudi table.
It will create one more parquet file in the same folder.
Data files at file server level (Image by Author)
Read HUDI tables using Spark data frame
Now, we will read the HUDI table into the Spark data frame.
Read a Hudi table from Apache Spark (Image by Author)
Here, we see that there are extra columns added. The first 5 columns were added by the HUDI data frame. Those are the metadata. As we discussed, HUDI is not only creating separate files for storing metadata but it is also storing in the same file only.
Now, if we print data
Printing data from Hudi table (Image by Author)
If I do “truncate=False” to check the commit time and sequence.
Hudi properties (Image by Author)
Commit time is: 20230322103927766
Which is 2023/03/22 10:39:27.766 (YYYY/MM/DD HH:mm:SS.sss)
and commit sequence number: 20230322103927766_0_0
Create Temp HIVE of HUDI tables
We have data available in the data frame. Now we will create a HIVE temp table so that we can write Spark SQL.
Create a HIVE table from a Hudi data frame (Image by Author)
Explore data
We can Spark SQL queries and explore data.
Spark SQL on Hive table based on Hudi data frame (Image by Author)