Summary

The web content provides a guide on how to configure AWS Glue jobs to enable Spark logging and set up a local Spark History Server on a Windows machine to analyze performance metrics.

Abstract

The article outlines the process of capturing and utilizing Spark event logs from AWS Glue jobs for performance analysis. It details the necessary Terraform configuration to enable logging within AWS Glue jobs and persist the logs to an S3 bucket. Additionally, the post describes the steps to locally install and configure the Spark History Server on a Windows machine, including the required libraries, specific configuration settings, and environment variables. This setup allows users to monitor Spark job progress and metrics without relying on Docker or additional AWS infrastructure, such as an EC2 instance. The author emphasizes the importance of understanding the configuration and the benefits of accessing detailed Spark UI insights for resource allocation and performance optimization.

Opinions

The author prefers managing infrastructure as code, specifically using Terraform for AWS Glue job creation.
Docker is not a viable option for the author due to environment restrictions, and using an EC2 instance for Spark History Server is considered wasteful.
The author values the detailed metrics provided by the Spark Web UI for job performance analysis.
Setting up the Spark History Server locally is seen as a cost-effective and controlled alternative to cloud-based solutions.
The author suggests that the real value lies in extracting and interpreting the most useful information from the Spark event logs to optimize resource usage.

Display Spark logs from AWS Glue jobs in a local Spark History Server

With AWS Glue it is easy to create and run Spark ETL jobs without having to think about the underlying infrastructure. Unfortunately, it does not provide access to the Spark Web UI, which displays useful metrics and information. I rely heavily on Spark event logs for job performance analysis, so in this post I will explain how to create Glue jobs that enable logging and how to set up and run a Spark History Server locally on your Windows machine to display them.

Create a Glue job that enables logging

Spark event logging can be enabled through the Glue job configuration, and the logs can be persisted to an S3 bucket. The configuration can be done directly in the AWS console, but I prefer to manage all my infrastructure as code, so all my Glue jobs are created using Terraform.

Here is an example of a job definition with a minimal configuration to enable logging (the full list of available parameters can be found here):

resource "aws_glue_job" "example-etl" {
  name = "example-etl-job"
  // specify the IAM role that will be used to run the job
  role_arn = "arn:aws:iam::xxxxxxxxxxx:role/glue-job-role"
  
  glue_version = "3.0"
  worker_type = "G.1X"
  number_of_workers = 2
  
  // the location of the pyspark script containing the ETL definition
  command {
      script_location = "s3://bucket.name/src/job-script.py"
  }
  
  default_arguments = {
      "--enable-continuous-cloudwatch-log" = "true"
      "--enable-job-insights" = "true"
      "--enable-metrics" = "true"
      "--enable-spark-ui" = "true"
      "--spark-event-logs-path" = "s3://bucket.name/logs"
  }
}

This ensures that the Spark event logs from every run of this job are persisted to my S3 bucket.
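To confirm that a run actually wrote its event logs, the log location can be listed from Python. This is only a minimal sketch: `split_s3_uri` and `iter_event_logs` are hypothetical helper names, and the client passed in is assumed to behave like `boto3.client("s3")` (boto3 is not part of the original setup and would need to be installed separately).

```python
from typing import Iterator, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into ('bucket', 'prefix')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, prefix


def iter_event_logs(s3_client, logs_uri: str) -> Iterator[Tuple[str, int]]:
    """Yield (key, size in bytes) for every object under the event log path.

    s3_client is assumed to behave like boto3.client("s3").
    """
    bucket, prefix = split_s3_uri(logs_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]
```

With boto3 installed and credentials available, iterating over `iter_event_logs(boto3.client("s3"), "s3://bucket.name/logs")` after a job run should yield one entry per log object.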

Local configuration for the Spark History Server

While a Spark job is running, the SparkContext serves a Web UI that lets you monitor the job’s progress. In Glue jobs the SparkContext is wrapped in a GlueContext object, which does not provide the same functionality. The AWS documentation proposes two options for setting up the Spark UI for your Glue jobs:

locally, using a Docker container
on AWS, using a CloudFormation template that configures an EC2 instance to run your Spark History Server

I work in an environment where I have very little control over what I can install and Docker is unfortunately not an option for me. Setting up and paying for another EC2 instance just to access the logs seemed wasteful. So I started looking into what the Docker container actually does (this older article also helped with understanding the configuration).

To start the Spark UI you don’t actually need to start a cluster. However, you do need to install Java and some libraries locally and to set a few configuration options carefully.

Necessary libraries

Install Spark locally: download the version you need from https://archive.apache.org/dist/spark/ and un-tar it (this assumes Java is already installed).
If you are working on Windows, also download the Winutils build matching the Hadoop version of your Spark build from https://github.com/cdarlint/winutils (3.2.0 in my case).
Download the extra AWS-specific libraries: hadoop-aws, matching the Hadoop version included in the chosen Spark build, and aws-java-sdk, matching the chosen hadoop-aws build (check its dependencies list). Add them to Spark’s jars folder.
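Picking the right hadoop-aws version is the step most likely to go wrong, so here is a small sketch that infers the bundled Hadoop version from the file names in Spark’s jars folder. The function names are hypothetical, and the jar naming patterns (`hadoop-common-<version>.jar`, `hadoop-client-api-<version>.jar` and similar) are an assumption about how Spark builds name their Hadoop jars; check your own jars folder if nothing is detected.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# Assumed naming of the Hadoop jars bundled with a Spark build.
_HADOOP_JAR = re.compile(
    r"^hadoop-(?:client-api|client-runtime|common|client)-(\d+(?:\.\d+)+)\.jar$"
)


def hadoop_version_from_jar_names(names: Iterable[str]) -> Optional[str]:
    """Return the Hadoop version found in the first matching jar name, if any."""
    for name in sorted(names):
        match = _HADOOP_JAR.match(name)
        if match:
            return match.group(1)
    return None


def bundled_hadoop_version(jars_dir: str) -> Optional[str]:
    """Inspect a Spark jars folder and return the bundled Hadoop version."""
    return hadoop_version_from_jar_names(p.name for p in Path(jars_dir).glob("*.jar"))


def matching_hadoop_aws_jar(jars_dir: str) -> Optional[str]:
    """Name of the hadoop-aws jar that matches the bundled Hadoop version."""
    version = bundled_hadoop_version(jars_dir)
    return f"hadoop-aws-{version}.jar" if version else None
```

Calling `matching_hadoop_aws_jar` with the path to Spark’s jars folder gives the file name to look for; the aws-java-sdk version still has to be read from that hadoop-aws build’s dependency list.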

Spark specific configuration

Create a spark-defaults.conf file in Spark’s conf/ folder using the existing template. Add the following parameters:

# path to your logs on S3
spark.history.fs.logDirectory s3a://bucket.name/logs

# Specific s3 file system configuration
spark.hadoop.com.amazonaws.services.s3.enableV4     true
spark.hadoop.fs.s3a.impl                            org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.AbstractFileSystem.s3a.impl         org.apache.hadoop.fs.s3a.S3A

# Configure the credentials provider to get the authentication parameters from the environment variables
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.EnvironmentVariableCredentialsProvider

# Configure proxy settings in case you are passing through one (I am)
spark.hadoop.fs.s3a.proxy.host      xxxxxxxxxxx
spark.hadoop.fs.s3a.proxy.port      xxxx
spark.hadoop.fs.s3a.proxy.username  xxxxxxxx
spark.hadoop.fs.s3a.proxy.password  xxxxxxxxxxx
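A typo in this file usually only shows up as an obscure error when the server starts, so a quick sanity check can save time. The sketch below is an assumption-laden helper, not part of Spark: it parses the simple "key, whitespace, value" layout used above (the real file format accepts more than that) and flags missing keys and a log directory that does not use the s3a:// scheme.

```python
from typing import Dict, List

# Keys from the configuration above that the history server setup relies on.
REQUIRED_KEYS = (
    "spark.history.fs.logDirectory",
    "spark.hadoop.fs.s3a.impl",
    "spark.hadoop.fs.s3a.aws.credentials.provider",
)


def parse_spark_conf(text: str) -> Dict[str, str]:
    """Parse 'key<whitespace>value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def check_history_conf(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    conf = parse_spark_conf(text)
    problems = [f"missing or empty: {key}" for key in REQUIRED_KEYS if not conf.get(key)]
    log_dir = conf.get("spark.history.fs.logDirectory", "")
    if log_dir and not log_dir.startswith("s3a://"):
        problems.append(f"log directory should use the s3a:// scheme, got {log_dir}")
    return problems
```

Passing the contents of conf/spark-defaults.conf to `check_history_conf` returns the problems it could detect. Note that the Glue job writes to an s3:// path while the local configuration reads the same location through s3a://.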

Environment variables

Set the HADOOP_HOME variable and add its bin folder to the system path (make sure it points to the Winutils version that matches the Hadoop version used in the Spark build):

SET HADOOP_HOME=path/to/winutils
SET PATH=%PATH%;%HADOOP_HOME%\bin

Set the AWS credentials as environment variables:

SET AWS_DEFAULT_REGION=xxxxxxx
SET AWS_ACCESS_KEY_ID=xxxxxxxxx
SET AWS_SECRET_ACCESS_KEY=xxxxxxxxx
SET AWS_SESSION_TOKEN=xxxxxxxxx
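Since these variables only live in the current console session, it is easy to launch the server from a window where they were never set. The following sketch checks an environment mapping before launching; the helper names are hypothetical, and treating AWS_SESSION_TOKEN as optional is an assumption (it is only needed for temporary credentials).

```python
import ntpath
from typing import List, Mapping

# Variables the setup above depends on; AWS_SESSION_TOKEN is left out on purpose.
REQUIRED_VARIABLES = ("HADOOP_HOME", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_variables(environ: Mapping[str, str]) -> List[str]:
    """Names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name, "").strip()]


def _normalize(path: str) -> str:
    # Windows paths are case-insensitive and accept both / and \ separators.
    return ntpath.normcase(ntpath.normpath(path.strip()))


def hadoop_bin_on_path(environ: Mapping[str, str]) -> bool:
    """True when %HADOOP_HOME%\\bin appears among the PATH entries."""
    home = environ.get("HADOOP_HOME", "").strip()
    if not home:
        return False
    expected = _normalize(ntpath.join(home, "bin"))
    entries = environ.get("PATH", "").split(";")
    return any(_normalize(entry) == expected for entry in entries if entry.strip())
```

In practice both functions would be called with `os.environ` from the same console that will start the history server.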

Launching the Spark History Server

If you are working on Linux, you can simply use the sbin/start-history-server.sh script to start the Spark History Server.

There isn’t an equivalent script for Windows, but internally the start-history-server.sh script starts the org.apache.spark.deploy.history.HistoryServer standalone application. We can start it directly using the bin\spark-class.cmd script with the following arguments:

C:\path\to\spark\folder\bin\spark-class.cmd -Xmx8G org.apache.spark.deploy.history.HistoryServer

The spark-class script is the Spark command-line launcher that is responsible for setting up the JVM environment and executing a Spark application. Ultimately, any Spark shell script, including spark-submit, calls the spark-class script. It can receive JVM setup options and the name of the class to run as parameters.
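If you prefer to start the server from a script rather than typing the command each time, the same call can be wrapped in Python. This is a sketch under assumptions: the function names are hypothetical, the arguments simply mirror the command shown above, and launching a .cmd file through `subprocess.Popen` without a shell is assumed to work on your Windows setup.

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List

HISTORY_SERVER_CLASS = "org.apache.spark.deploy.history.HistoryServer"


def history_server_command(spark_home: str, max_heap: str = "8G") -> List[str]:
    """Build the same command line as above for a given Spark folder."""
    launcher = PureWindowsPath(spark_home) / "bin" / "spark-class.cmd"
    return [str(launcher), f"-Xmx{max_heap}", HISTORY_SERVER_CLASS]


def start_history_server(spark_home: str, max_heap: str = "8G") -> subprocess.Popen:
    """Start the history server as a child process and return its handle."""
    # The child inherits the current environment, including HADOOP_HOME
    # and the AWS credentials set earlier.
    return subprocess.Popen(history_server_command(spark_home, max_heap))
```

Calling `start_history_server` with the path to your Spark folder returns a process handle; calling `terminate()` on it stops the server again.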

By default, this starts the Spark History Server at http://localhost:18080/
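Besides the web pages, the history server also exposes Spark’s monitoring REST API under /api/v1, which gives a quick way to check that it has picked up the Glue event logs. The sketch below lists the applications it knows about; the helper names are hypothetical, and only the `id`, `name` and `attempts[].completed` fields of the response are used.

```python
import json
from typing import List, Tuple
from urllib.request import urlopen


def summarize_applications(payload: str) -> List[Tuple[str, str, bool]]:
    """Return (id, name, completed) for each application in the JSON payload."""
    summary = []
    for app in json.loads(payload):
        attempts = app.get("attempts", [])
        # An application counts as completed only if every attempt is completed.
        completed = bool(attempts) and all(a.get("completed", False) for a in attempts)
        summary.append((app["id"], app.get("name", ""), completed))
    return summary


def fetch_applications(
    base_url: str = "http://localhost:18080", timeout: float = 10.0
) -> List[Tuple[str, str, bool]]:
    """Query a running history server and summarize the applications it lists."""
    url = f"{base_url.rstrip('/')}/api/v1/applications"
    with urlopen(url, timeout=timeout) as response:
        return summarize_applications(response.read().decode("utf-8"))
```

An empty result from `fetch_applications()` usually means the server is running but has not found any logs in the configured directory, which points back to the log path or the credentials.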

After creating Glue jobs that enable logging and setting up your local Spark History Server, this is where the real fun can begin. In my next post I will show you how to extract the most useful information from the Spark event logs and evaluate performance in terms of allocated resources versus used resources.