avatarKalpan Shah

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5316

Abstract

ages (jar file) available for Apache iceberg, so when we start the spark session, we need to externally specify that. We will also be using MySQL so we will specify package requirements for MySQL also.</p><p id="c8ae">With Apache Iceberg, we also need to pass the configurations below.</p> <figure id="764d"> <div> <div>

            <iframe class="gist-iframe" src="/gist/developershomes/91451c744dbf6fec55dcb977034fa1ef.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="2373">By default, warehouse location is $PWD (it means the location where we start the spark session) but here we are giving warehouse so it will create a folder in our folder and put all tables inside it.</p>
    <figure id="5eb3">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/da903a373774a659c700d58a136b2e03.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="794f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*DkXPvpiC7U1HGxd-"><figcaption></figcaption></figure><p id="8c80">Now, we have our spark session available with all the required packages and configuration, so we can start the ETL process.</p><h1 id="9b44">Read data from MySQL server into Spark</h1><p id="415f">(If you have already completed <a href="https://readmedium.com/spark-etl-chapter-7-with-lakehouse-delta-lake-7fbbd66e0f87">chapter 7,</a> you can skip reading data from MySQL and create the HIVE table, and can directly go to create HUDI table section)</p><p id="a0bc">For this ETL, we are also using the same MySQL as the source system and are loading the same table. We will not discuss much on how to load data from MySQL and how to create a HIVE table as we have already discussed in detail in Chapter 7.</p><p id="1e61">If you don’t have already uploaded data into MySQL, please follow the earlier blog for the same.</p><figure id="49f6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*44LMlwp7v-WEtq3h"><figcaption></figcaption></figure><p id="7546">We will read this data from Spark and we will create a spark data frame and HIVE table on this.</p>
    <figure id="f6d6">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/ea34b59389695bd1f0c3152ed2e3fc46.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="bf4a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*Xn9czX41U5vZlmnd"><figcaption></figcaption></figure><h1 id="1339">Create a HIVE temp view from a data frame</h1><p id="7af6">We will create a HIVE temp view from a data frame.</p>
    <figure id="f179">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/aff6fa516a281b7007ecda8b55fcf243.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="37ce"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*x6XmnM-3Bdf2YSvP"><figcaption></figcaption></figure><p id="afc9">We will explore data and check the highest food group and filter with one group.</p>
    <figure id="dcd3">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/883388173428dd2b833657c5fa3be751.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h1 id="420d">Load filtered data into iceberg format (create initial table)</h1><p id="b3da">We have data frame “newdf” available in which we have only one group food is there. We will use that to create the first iceberg table.</p>
    <figure id="5f2f">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/cecd7140c9f5c5290334d124f2329bc1.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="e733">This will create a folder with the name “iceberg_food” inside the warehouse folder.</p><figure id="3e8e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*uavlTJwA9CJP3-nd"><figcaption></figcaption></figure><p id="03f8">Inside “Iceberg_food” folder, has all the data and metadata. It has two separate folders for data and metadata as below.</p><figure id="a504"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*EgRwqFccxcM-Ufiy"><figcaption></figcaption></figure><p id="da76">In the data folder, it creates a parquet file that stores all data.</p><figure id="4196"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*yWnc1_zVGvqF7NzK"><figcaption></figcaption></figure><p id="7b84">And in the metadat

Options

a folder, has all metadata files.</p><figure id="bfba"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*01jUYvKMGIuNhVK8"><figcaption></figcaption></figure><p id="8783">Metadata stored in JSON file as below</p><figure id="a636"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*LZ_hBQ4_GqkRYbAR"><figcaption></figcaption></figure><h1 id="214f">Load filtered data again into iceberg format in same table</h1><p id="db15">We will create one more data frame by filtering one more food group and then appending data to the same iceberg table.</p> <figure id="6617"> <div> <div>

            <iframe class="gist-iframe" src="/gist/developershomes/d945d9ae7e88efb24ecffdd8ef368545.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="ae2f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*zDZCuFTzTdOmjHVr"><figcaption></figcaption></figure><p id="a807">Now, if we check folders, we will see more files for data and metadata.</p><figure id="355f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*k7UI2R3Aau4Q-pKi"><figcaption></figcaption></figure><p id="c380">For metadata</p><figure id="573a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*tnRiqjWLg5TFjA8g"><figcaption></figcaption></figure><h1 id="58e0">Read iceberg tables using Spark data frame</h1><p id="3c76">Now, we will read iceberg tables into the Spark data frame.</p>
    <figure id="8d4b">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/e606a898af37fa71b59df02f3a6eebb8.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="084f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*7UXzOj6NlkvASSMp"><figcaption></figcaption></figure><h1 id="61d3">Create Temp HIVE of delta tables</h1><p id="6c08">We will create a HIVE temp view from a data frame. So that we can write Spark SQL and explore data available in Iceberg format.</p>
    <figure id="abfc">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/b9a98e14601385f1f44077b22473ad8d.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="9864"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*DnEgJ-_utjgYqwlk"><figcaption></figcaption></figure><h1 id="1db2">Explore data</h1><p id="7cbc">We can Spark SQL queries and explore data.</p>
    <figure id="8830">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/developershomes/143b1e68af9e40fcf4d07cfbebf09567.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><figure id="53c9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*o79PhK_SpXM5dRJr"><figcaption></figcaption></figure><h1 id="bea2">Conclusion:</h1><p id="7108">Here, we have learned the concepts below.</p><ul><li>Understating of Apache Iceberg</li><li>How to install Iceberg packages from the Maven repo</li><li>How to configure Spark parameters for Iceberg</li><li>How to create an Iceberg table and load data</li><li>How data is stored in Iceberg format</li><li>How to read data from Iceberg tables</li><li>How to write Spark SQL queries on Iceberg</li></ul><h1 id="0f31">Video Explanation:</h1>
    <figure id="aedb">
        <div>
          <div>
            <img class="ratio" src="http://placehold.it/16x9">
            <iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F_aLjZCVNt44%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D_aLjZCVNt44&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F_aLjZCVNt44%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" allowfullscreen="" frameborder="0" height="480" width="640">
          </div>
        </div>
    </figure></iframe></div></div></figure><h1 id="1a3d">Level Up Coding</h1><p id="1032">Thanks for being a part of our community! Before you go:</p><ul><li>👏 Clap for the story and follow the author 👉</li><li>📰 View more content in the <a href="https://levelup.gitconnected.com/?utm_source=pub&amp;utm_medium=post">Level Up Coding publication</a></li><li>💰 Free coding interview course ⇒ <a href="https://skilled.dev/?utm_source=luc&amp;utm_medium=article">View Course</a></li><li>🔔 Follow us: <a href="https://twitter.com/gitconnected">Twitter</a> | <a href="https://www.linkedin.com/company/gitconnected">LinkedIn</a> | <a href="https://newsletter.levelup.dev">Newsletter</a></li></ul><p id="9b82">🚀👉 <a href="https://jobs.levelup.dev/talent/welcome?referral=true"><b>Join the Level Up talent collective and find an amazing job</b></a></p></article></body>

Spark ETL Chapter 9 with Lakehouse | Apache Iceberg

Previous blog/Context:

In an earlier blog, we discussed Spark ETL with Lakehouse (with HUDI). Please find below blog post for more details.

Introduction:

In this blog, we will discuss Spark ETL with Apache iceberg. We will first understand what Apache iceberg is and why to use Apache iceberg for creating a Lake house. We will source data from one of the source systems which we have learned till now and load that data into Apache iceberg format. We will create an on-premise lake house and load all data into it.

What is Apache Iceberg?

Apache Iceberg is an open-source table format that aims to provide a more scalable and efficient way of managing large, complex datasets in big data environments. It was originally developed at Netflix to address some of the limitations of existing data storage formats, such as Apache Parquet and Apache ORC.

Iceberg is designed to support features that are common in modern data systems, such as schema evolution, ACID transactions, and time travel (the ability to query data as it existed at some point in the past). It accomplishes this by separating the table’s data files (which contain the actual data) from its metadata (which describes the table’s schema, partitions, and other properties).

Iceberg uses several performance optimizations to improve query performance, such as column-level statistics and Bloom filters. It also integrates with a variety of big data processing engines, including Apache Spark, Apache Hive, and Presto.

Overall, Apache Iceberg is a powerful and flexible tool for managing large, complex datasets in big data environments. It provides several advanced features that make it well-suited to a wide range of use cases, including data warehousing, analytics, and machine learning.

Why use Apache Iceberg?

There are several reasons why you might want to use Apache Iceberg:

  • Scalability: Iceberg is designed to scale petabytes of data, making it an ideal choice for large data sets.
  • Flexibility: Iceberg is designed to be flexible, making it easy to add or remove data without having to rewrite the entire table.
  • Performance: Iceberg is designed to provide a balance between performance and flexibility, making it an ideal choice for data warehousing and analytics workloads.
  • Open-source: Iceberg is an open-source project, which means that it is free to use and can be customized to meet your specific needs.

Today, we will be doing the operations below ETL and with this, we will also be learning about the Apache iceberg and how to build a lake house.

  1. Read data from MySQL server into Spark
  2. Create a HIVE temp view from a data frame
  3. Load filtered data into iceberg format (create initial table)
  4. Load filtered data again into iceberg format in the same table
  5. Read iceberg tables using Spark data frame
  6. Create Temp HIVE of delta tables
  7. Explore data

First, clone below GitHub repo, where we have all the required sample files and solution

If you don’t have a setup for Spark instance follow the earlier blog for setting up Data Engineering tools in your system. (Data Engineering suite will setup Spark, MySQL, PostgreSQL, and MongoDB in your system) In that Spark instance, we already have packages installed for Azure blog storage and Azure Data Lake Services.

Start the Spark application with all required packages

First, we will start the Spark session with all the required packages and configurations for Apache Iceberg. We know that with our spark instance, we don’t have packages (jar file) available for Apache iceberg, so when we start the spark session, we need to externally specify that. We will also be using MySQL so we will specify package requirements for MySQL also.

With Apache Iceberg, we also need to pass the configurations below.

By default, warehouse location is $PWD (it means the location where we start the spark session) but here we are giving warehouse so it will create a folder in our folder and put all tables inside it.

Now, we have our spark session available with all the required packages and configuration, so we can start the ETL process.

Read data from MySQL server into Spark

(If you have already completed chapter 7, you can skip reading data from MySQL and create the HIVE table, and can directly go to create HUDI table section)

For this ETL, we are also using the same MySQL as the source system and are loading the same table. We will not discuss much on how to load data from MySQL and how to create a HIVE table as we have already discussed in detail in Chapter 7.

If you don’t have already uploaded data into MySQL, please follow the earlier blog for the same.

We will read this data from Spark and we will create a spark data frame and HIVE table on this.

Create a HIVE temp view from a data frame

We will create a HIVE temp view from a data frame.

We will explore data and check the highest food group and filter with one group.

Load filtered data into iceberg format (create initial table)

We have data frame “newdf” available in which we have only one group food is there. We will use that to create the first iceberg table.

This will create a folder with the name “iceberg_food” inside the warehouse folder.

Inside “Iceberg_food” folder, has all the data and metadata. It has two separate folders for data and metadata as below.

In the data folder, it creates a parquet file that stores all data.

And in the metadata folder, has all metadata files.

Metadata stored in JSON file as below

Load filtered data again into iceberg format in same table

We will create one more data frame by filtering one more food group and then appending data to the same iceberg table.

Now, if we check folders, we will see more files for data and metadata.

For metadata

Read iceberg tables using Spark data frame

Now, we will read iceberg tables into the Spark data frame.

Create Temp HIVE of delta tables

We will create a HIVE temp view from a data frame. So that we can write Spark SQL and explore data available in Iceberg format.

Explore data

We can Spark SQL queries and explore data.

Conclusion:

Here, we have learned the concepts below.

  • Understating of Apache Iceberg
  • How to install Iceberg packages from the Maven repo
  • How to configure Spark parameters for Iceberg
  • How to create an Iceberg table and load data
  • How data is stored in Iceberg format
  • How to read data from Iceberg tables
  • How to write Spark SQL queries on Iceberg

Video Explanation:

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job

Apache Iceberg
Spark
Etl
Data Lakehouse
Databricks
Recommended from ReadMedium