Running PySpark on EKS Fargate: Part 3(Last)
Running a PySpark job on EKS to access files stored on AWS S3
It is 3rd part of the series on how to run PySpark jobs on AWS EKS Fargate. In Part 1, we completed our setup w.r.t. the prerequisites listed. We also learned about the Docker file/image and some important docker commands. In Part 2 of the series, we saw how to merge an x-version of Spark with the y-version of Hadoop. We built an image and pushed it to DockerHub public repo. In this Part 3 of the blog, we will use the image and run a PySpark job.

Before getting started, let's recap and verify whether we are on the same page wrt. requirements.
- An EKS Cluster with Fargate profile setup done. If you still don’t have one, follow this AWS DOC for creating one. We need to have two Fargate Profile with different namespaces namely “default” and “kube-system”. We will use the “default” namespace for running our job whereas the “kube-system” will be used by coredns. coredns is a Kubernetes server DNS deployment.
@Reference awsdoc: By default, CoreDNS is configured to run on Amazon EC2 infrastructure on Amazon EKS clusters. If you want to only run your pods on Fargate in your cluster, You would also need to create a Fargate profile to target the CoreDNS pods.
To be precise, we need to have an EKS cluster + Fargate Profile (namespace = default) + coredns pod running(check by “kubectl get pods -n kube-system” if you see it pending create a new profile and redeploy the coredns pod with “kubectl rollout restart -n kube-system deployment coredns”)
- The next important requirement is a Spark docker image from part 2 either on Dockerhub or AWS ECR.
Each Job on an EKS cluster is in the form of an image. As the first step, we will define a spark job and build an image.
Defining PySpark Job
We will run a PySpark script that reads a CSV file from the AWS S3 bucket.
from pyspark import SparkContext, SQLContext
from pyspark.conf import SparkConfAWS_ACCESS_KEY_ID = "************"
AWS_SECRET_ACCESS_KEY = "*****************"
AWS_SESSION_TOKEN = "*************************" # optionalSPARK_CONF = SparkConf()
SC = SparkContext(conf=SPARK_CONF, appName="SparkOfferDistributor")SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")#Optional
SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(AWS_ACCESS_KEY_ID))
SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(AWS_SECRET_ACCESS_KEY))
SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(AWS_SESSION_TOKEN)) # Optional
SC._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")SQLCONTEXT = SQLContext(SC)
df = SQLCONTEXT.read.csv("s3a://bucket/key")print(df.show())
SC.stop()We will save this file in a folder ABC with file_name = job_to_run.py. In this case, we have directly used AWS credentials, but if you want to use an IAM role follow this blog.
Creating Image
Then we will go to the ABC directory and create a text file with the name Dockerfile(no extension) and paste the content below. From part 2, I assume that the base Spark image is on the docker hub.
FROM repo_name:v1
COPY job_to_run.py /tmp/job_to_run.pyNext, we will create a new public repository on DockerHub and run the below set of commands.
$ docker login
$ docker build -t new_repo_name .
$ docker tag new_repo_name:latest new_repo_name:v1
$ docker push new_repo_name:v1With these commands, we have successfully built our job image and pushed it to the recently created public repo on DockerHub. Now we will validate our image whether it has been properly built.
Validation
## List out all the images
$ docker images## Make sure you mention the recent image.
$ docker run -it new_repo_name:v1 sh
$ cd /tmp/## It should print job_to_run.py
$ ls
job_to_run.pyRunning the Job
Before running the job we require some information on the EKS cluster and Spark Binary file(on local).
EKS Cluster
Connect terminal to EKS Cluster
$ export KUBECONFIG=$KUBECONFIG:~/.kube/config-fargate
$ aws eks --region region-code update-kubeconfig --name cluster_nameGet Cluster Info
$ Kubectl cluster-infoKubernetes master is running at https://123456789.gr7.region.eks.amazonaws.comkube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxySpark Binary File
If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case, it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don't have a Spark binary get one form here and store at some location.
$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin
$ ls
spark-submit, many_other_files, . . . Running the Job
$ ./spark-submit --deploy-mode cluster \
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
--conf spark.executor.instances=5 \
--conf spark.app.name=my_pyspark_job \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.driver.container.image=new_repo_name:v1
--conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.pyAfter we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:
Check Job-status
$ kubectl get pods
1_spark_pods_driver, some_spark_pods_executerSTATUS
- CreatingContainer: The cluster is pulling the image to create a container.
- Running: The job is still running.
- Completed: If your job executed successfully, you will see completed.
- Pending: Pods are pending because of the unavailability of compute nodes.
- Error: Job has failed while execution.
## get log
$ kubectl logs 1_spark_pods_driver## Describe Pod
$ kubectl describe pods 1_spark_pods_driver$ kubectl get nodes
farget_nodes## Describe Node
$ kubectl describe nodes farget_node_1Now that we are done with the series of blogs on running the PySpark job on EKS with the Fargate profile, I am hopeful that you had no difficulty following the blog. In case you faced any difficulty in following it, please comment below. I would be happy to help resolve the issues.
You may be interested in reading this:
