Robert Sanders

AWS Glue + Apache Iceberg

Bringing ACID operations to AWS Glue and SparkSQL

Motivation

At Clairvoyant, we work with a large number of customers that use AWS Glue for their daily ETL processes. Many of these Glue jobs leverage SparkSQL statements to make transformations easier to understand and more readable.

We’ve been looking for ways to make these SparkSQL operations even easier, mainly by providing ACID operations (UPSERTs, INSERTs, UPDATEs, and DELETEs) directly in SparkSQL. For example, performing a DELETE is simpler to understand and execute than doing an INSERT OVERWRITE back into a table, as you would typically do in Spark. There is a table format that provides just that: Apache Iceberg.

This can also potentially help reduce AWS S3 costs and improve storage efficiency, since Iceberg only needs to store delta data and metadata rather than rewriting full copies of a table.

This post describes how you can configure your AWS Glue job to use Iceberg in SparkSQL through some simple examples.

Glue Job Configurations

Iceberg JARs

First, we will need the proper JARs loaded into S3 for use in the Glue job.

The following JARs should be downloaded and uploaded to an S3 bucket:

  1. iceberg-spark3-runtime-0.13.1.jar (the Iceberg runtime for Spark 3)
  2. bundle-2.15.40.jar (the AWS SDK for Java v2 bundle)
  3. url-connection-client-2.15.40.jar (the AWS SDK for Java v2 URL connection HTTP client)

Glue Job Details

  1. Ensure the Glue Version is set to “3.0”. This ensures we’re using Spark 3.1, which supports the SparkSQL keywords that enable ACID operations such as MERGE, DELETE, and UPDATE (these are not available in Spark <= 2.4).
  2. Update both the “Python library path” and “Dependent JARs path” to include a comma-separated list of S3 paths to the required JARs (an equivalent scripted setup is sketched after this list): s3://{bucket}/{key}/iceberg-spark3-runtime-0.13.1.jar,s3://{bucket}/{key}/bundle-2.15.40.jar,s3://{bucket}/{key}/url-connection-client-2.15.40.jar
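
The same configuration can also be applied when creating the job programmatically. Below is a minimal sketch using boto3; the job name, IAM role, script location, and {bucket}/{key} paths are placeholders for illustration and should be replaced with your own values:

import boto3

glue = boto3.client("glue")

# Create a Glue 3.0 (Spark 3.1) job and attach the Iceberg / AWS SDK JARs
glue.create_job(
    Name="iceberg-example-job",                          # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",   # hypothetical IAM role
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://{bucket}/{key}/iceberg_job.py",  # placeholder script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        # "--extra-py-files" mirrors the "Python library path" console field,
        # "--extra-jars" mirrors the "Dependent JARs path" console field
        "--extra-py-files": "s3://{bucket}/{key}/iceberg-spark3-runtime-0.13.1.jar,"
                            "s3://{bucket}/{key}/bundle-2.15.40.jar,"
                            "s3://{bucket}/{key}/url-connection-client-2.15.40.jar",
        "--extra-jars": "s3://{bucket}/{key}/iceberg-spark3-runtime-0.13.1.jar,"
                        "s3://{bucket}/{key}/bundle-2.15.40.jar,"
                        "s3://{bucket}/{key}/url-connection-client-2.15.40.jar",
    },
)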

Spark Code Setup

The following code is used to set up the Glue job.
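
A minimal sketch of that setup, assuming a standard PySpark Glue job that receives a JOB_NAME argument (the bucket name is a placeholder):

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

AWS_BUCKET = "my-bucket"  # placeholder: the S3 bucket backing the Iceberg warehouse

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Standard Glue job boilerplate, plus the Iceberg configurations described below
conf = SparkConf()
conf.setAll([
    ("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
    ("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog"),
    ("spark.sql.catalog.iceberg_catalog.warehouse", f"s3://{AWS_BUCKET}/iceberg_catalog/"),
    ("spark.sql.catalog.iceberg_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"),
])

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session

job = Job(glue_context)
job.init(args["JOB_NAME"], args)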

To enable Iceberg, you can see that we’ve added 4 new configurations to the SparkConf:

("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"), ("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog"), ("spark.sql.catalog.iceberg_catalog.warehouse", f"s3://{AWS_BUCKET}/iceberg_catalog/"), ("spark.sql.catalog.iceberg_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")

These options enable a custom catalog in Spark by the name of iceberg_catalog. Note: this can be changed to a different name if desired.

Example:

spark.sql.catalog.my_catalog.warehouse

spark.sql.catalog.custom_catalog.warehouse

etc.

This catalog will be used to reference the configurations needed to read and write a particular table in Iceberg format. You will see examples of this in the implementation below.

Parquet tables can be referenced as normal:

SELECT * FROM some_db.parquet_table WHERE id = "1"

Iceberg tables are referenced with the catalog name:

SELECT * FROM iceberg_catalog.some_db.iceberg_table WHERE id = "1"

If desired, you can use this as the default catalog by adding the following configuration:

("spark.sql.defaultCatalog", "iceberg_catalog")

See the Spark Configuration (Catalog) documentation for more details on catalogs.

Implementation
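
As a minimal sketch of these ACID operations, assuming a hypothetical database some_db with a staging Parquet table parquet_table (holding incoming changes) and a target Iceberg table iceberg_table with hypothetical columns id, name, and updated_at, the SparkSQL inside the Glue job might look like this:

# Create the target table in Iceberg format (one-time setup)
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg_catalog.some_db.iceberg_table (
        id STRING,
        name STRING,
        updated_at TIMESTAMP
    )
    USING iceberg
""")

# UPSERT: merge staged changes into the Iceberg table
spark.sql("""
    MERGE INTO iceberg_catalog.some_db.iceberg_table AS target
    USING some_db.parquet_table AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# UPDATE rows in place
spark.sql("""
    UPDATE iceberg_catalog.some_db.iceberg_table
    SET name = 'unknown'
    WHERE name IS NULL
""")

# DELETE rows without rewriting the whole table
spark.sql("""
    DELETE FROM iceberg_catalog.some_db.iceberg_table
    WHERE id = '1'
""")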

AWS
Aws Glue
Spark
Cloud
Iceberg