Display Spark logs from AWS Glue jobs in a local Spark History Server
With AWS Glue it is easy to create and run Spark ETL jobs without having to think about the underlying infrastructure. Unfortunately, it does not provide access to the Spark Web UI, which displays useful metrics and information. I rely heavily on Spark event logs to analyze job performance, so in this post I will explain how to create Glue jobs that enable event logging, and how to set up and run a Spark History Server locally on a Windows machine to display those logs.
Create a Glue job that enables logging
Spark event logging can be enabled through the Glue job configuration, and the logs can be persisted to S3. The configuration can be done directly in the AWS console, but I prefer to manage all my infrastructure as code, so all my Glue jobs are created using Terraform.
Here is an example job definition with a minimal configuration to enable logging (the full list of available parameters can be found here):
resource "aws_glue_job" "example-etl" {
  name = "example-etl-job"
  // specify the IAM role that will be used to run the job
  role_arn          = "arn:aws:iam::xxxxxxxxxxx:role/glue-job-role"
  glue_version      = "3.0"
  worker_type       = "G.1X"
  number_of_workers = 2
  // the location of the pyspark script containing the ETL definition
  command {
    script_location = "s3://bucket.name/src/job-script.py"
  }
  default_arguments = {
    "--enable-continuous-cloudwatch-log" = "true"
    "--enable-job-insights"              = "true"
    "--enable-metrics"                   = "true"
    // write the Spark event logs to the S3 path below
    "--enable-spark-ui"                  = "true"
    "--spark-event-logs-path"            = "s3://bucket.name/logs"
  }
}
This ensures that the Spark event logs from all runs of this job are persisted to my S3 bucket.
Local configuration for the Spark History Server
While a Spark job is running, the SparkContext creates a Web UI which enables you to monitor the job’s progress. The SparkContext in Glue jobs is wrapped in a GlueContext object which does not provide the same functionality. The AWS documentation proposes two options for setting up the Spark UI for your Glue jobs:
- locally, using a Docker container
- on AWS, using a CloudFormation template that configures an EC2 instance to run your Spark History Server
I work in an environment where I have very little control over what I can install and Docker is unfortunately not an option for me. Setting up and paying for another EC2 instance just to access the logs seemed wasteful. So I started looking into what the Docker container actually does (this older article also helped with understanding the configuration).
To start the Spark UI you don’t actually need to start a cluster. However, you do need to install Java and some libraries locally, and to set a few configuration options carefully.
Necessary libraries
- install Spark locally: download the version you need from https://archive.apache.org/dist/spark/ and extract the archive (this assumes Java is already installed)
- if you are working on Windows, also download the Winutils build that matches the Hadoop version of your Spark build from https://github.com/cdarlint/winutils (3.2.0 in my case)
- download the extra AWS-specific libraries: the hadoop-aws build matching the Hadoop version included in the chosen Spark build, and the aws-java-sdk build matching the chosen hadoop-aws build (check its dependencies list). Add them to Spark’s jars folder
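Matching the jar versions is the step where a mistake is easiest to make, so here is a minimal Python sketch that reads the Hadoop version from the jar names in Spark’s jars folder and builds the Maven Central URL of the matching hadoop-aws jar. The function names are hypothetical, and the sketch assumes that your Spark build ships either a hadoop-client-api-<version>.jar or a hadoop-common-<version>.jar. The aws-java-sdk version still has to be looked up in the hadoop-aws dependencies list.

```python
import re
import sys
from pathlib import Path
from typing import Iterable, Optional

# Jar names assumed to carry the Hadoop version (this is an assumption about the Spark build).
_HADOOP_JAR = re.compile(r"^hadoop-(?:client-api|common)-(\d+\.\d+\.\d+)\.jar$")

# Standard Maven Central layout for the org.apache.hadoop:hadoop-aws artifact.
MAVEN_BASE = "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws"


def detect_hadoop_version(jar_names: Iterable[str]) -> Optional[str]:
    """Return the Hadoop version found in the given jar file names, or None."""
    for name in sorted(jar_names):
        match = _HADOOP_JAR.match(name)
        if match:
            return match.group(1)
    return None


def hadoop_aws_url(version: str) -> str:
    """Build the download URL of the hadoop-aws jar for the given Hadoop version."""
    return f"{MAVEN_BASE}/{version}/hadoop-aws-{version}.jar"


# Run as: python find_hadoop_aws.py C:\path\to\spark\folder\jars
if __name__ == "__main__" and len(sys.argv) == 2:
    jars = [path.name for path in Path(sys.argv[1]).glob("*.jar")]
    version = detect_hadoop_version(jars)
    if version is None:
        print("No Hadoop jar found, check the folder path")
    else:
        print(f"Hadoop {version}: download {hadoop_aws_url(version)}")
```

The script only prints the URL; downloading the jar and copying it into the jars folder is left as a manual step so that nothing is changed in the Spark installation without you seeing it.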
Spark specific configuration
Create a spark-defaults.conf file in Spark’s conf/ folder using the existing template. Add the following parameters:
# path to your logs on S3
spark.history.fs.logDirectory s3a://bucket.name/logs
# Specific s3 file system configuration
spark.hadoop.com.amazonaws.services.s3.enableV4 true
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.AbstractFileSystem.s3a.impl org.apache.hadoop.fs.s3a.S3A
# Configure the credentials provider to get the authentication parameters from the environment variables
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
# Configure proxy settings in case you are passing through one (I am)
spark.hadoop.fs.s3a.proxy.host xxxxxxxxxxx
spark.hadoop.fs.s3a.proxy.port xxxx
spark.hadoop.fs.s3a.proxy.username xxxxxxxx
spark.hadoop.fs.s3a.proxy.password xxxxxxxxxxx
Environment variables
Set the HADOOP_HOME variable and add its bin folder to the system path (make sure it points to the Winutils version that matches the Hadoop version used in the Spark build):
SET HADOOP_HOME=path/to/winutils
SET PATH=%PATH%;%HADOOP_HOME%\bin
Set the AWS credentials in the environment variables:
SET AWS_DEFAULT_REGION=xxxxxxx
SET AWS_ACCESS_KEY_ID=xxxxxxxxx
SET AWS_SECRET_ACCESS_KEY=xxxxxxxxx
SET AWS_SESSION_TOKEN=xxxxxxxxx
Launching the Spark History Server
If you are working on Linux, you can simply use the sbin/start-history-server.sh script to start the Spark History Server.
There isn’t an equivalent script for Windows, but internally the start-history-server.sh script starts the org.apache.spark.deploy.history.HistoryServer standalone application. We can start it directly using the bin\spark-class.cmd script with the following arguments:
C:\path\to\spark\folder\bin\spark-class.cmd -Xmx8G org.apache.spark.deploy.history.HistoryServer
The spark-class script is the Spark command-line launcher, responsible for setting up the JVM environment and executing a Spark application. Ultimately every Spark shell script, including spark-submit, calls the spark-class script. It accepts JVM setup options and the name of the class to run as parameters.
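If you prefer to launch the server from a script, the command can be wrapped in a small pre-flight check. This is a minimal Python sketch, not part of Spark: the helper names are hypothetical, and the list of required variables is my assumption based on the settings described earlier in this post.

```python
from pathlib import PureWindowsPath
from typing import List, Mapping

HISTORY_SERVER_CLASS = "org.apache.spark.deploy.history.HistoryServer"

# Assumed minimum set of variables; AWS_SESSION_TOKEN is only needed for temporary credentials.
REQUIRED_VARIABLES = ("HADOOP_HOME", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


def history_server_command(spark_home: str, max_heap: str = "8G") -> List[str]:
    """Build the launch command: launcher script, JVM heap option, main class."""
    launcher = PureWindowsPath(spark_home) / "bin" / "spark-class.cmd"
    return [str(launcher), f"-Xmx{max_heap}", HISTORY_SERVER_CLASS]


# Example use on the Windows machine (not executed here):
#   import os, subprocess
#   problems = missing_variables(os.environ)
#   if problems:
#       raise SystemExit(f"Missing environment variables: {problems}")
#   subprocess.run(history_server_command("C:/path/to/spark/folder"), check=True)
```

Checking the variables first gives a readable error message instead of a Java stack trace when the credentials or HADOOP_HOME are missing.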
By default, the Spark History Server is then available at http://localhost:18080/.
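To confirm that the server has actually picked up the event logs from S3, you can query its REST API, which is served on the same port under /api/v1. Below is a minimal Python sketch using only the standard library; the function names are hypothetical, and it assumes the default address above and that requests to localhost are not routed through your proxy.

```python
import json
from typing import Any, Dict, List, Tuple
from urllib.request import urlopen


def applications_url(base_url: str = "http://localhost:18080", limit: int = 5) -> str:
    """Build the URL of the applications endpoint of the history server REST API."""
    return f"{base_url.rstrip('/')}/api/v1/applications?limit={limit}"


def summarize(applications: List[Dict[str, Any]]) -> List[Tuple[str, str, bool]]:
    """Return (id, name, completed) per application; completed means all attempts finished."""
    rows = []
    for app in applications:
        attempts = app.get("attempts", [])
        completed = bool(attempts) and all(a.get("completed", False) for a in attempts)
        rows.append((app["id"], app.get("name", ""), completed))
    return rows


def fetch_applications(base_url: str = "http://localhost:18080", limit: int = 5,
                       timeout: float = 10.0) -> List[Dict[str, Any]]:
    """Download and decode the list of applications known to the history server."""
    with urlopen(applications_url(base_url, limit), timeout=timeout) as response:
        return json.load(response)


# Example use once the server is running (not executed here):
#   for app_id, name, completed in summarize(fetch_applications()):
#       print(app_id, name, "completed" if completed else "incomplete")
```

An empty list usually means that the server started but found no logs, in which case the log directory and the credentials in the configuration above are the first things to check.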
After creating Glue jobs that enable logging and setting up your local Spark History Server, the real fun can begin. In my next post I will show you how to extract the most useful information from the Spark event logs and evaluate performance in terms of allocated versus used resources.





