Access AWS Services From EKS using IAM Role with PySpark
In this blog, we will read a CSV file stored in AWS S3 from an EKS cluster using an IAM role. Before we get started, there are certain prerequisites that must be fulfilled.

Prerequisites:
- Install all the prerequisites mentioned in this link. Go to the validation section of the same link and check if the installations were successful.
- Prepare a docker image of (PySpark 2.4.5 + Hadoop 3.1.2) with hadoop-aws-3.1.2.jar and aws-java-sdk-bundle-1.11.271.jar following link1 and link2.
- Create and attach an IAM role to the cluster
- Add policies to the IAM role for allowing permission to access different AWS Services(here S3).
Validate if the prerequisites are installed correctly:
It is very important to have the prerequisites installed and configured correctly before running the actual job.
Docker
Login to docker hub through the terminal, pls use the same credential that is used on DockerHub console.
$ docker loginKubectl
Configure Kubectl to run the commands on the EKS cluster.
$ export KUBECONFIG=$KUBECONFIG:~/.kube/config-aws-services
$ aws eks update-kubeconfig — name eks_cluster_name — region us-east-1Run these basic Kubectl commands,
$ kubectl get ns # namespace -> default
$ kubectl get sa # namespace -> default$ kubectl get nodes # If a nodegroup is configured with EKS, nodes should be listed hereIAM Role and Policies
Create a file named iam_pod.yaml and fill it with the below content(take care of indentation).
apiVersion: apps/v1
kind: Deployment
metadata:
name: eks-iam-test
spec:
replicas: 1
selector:
matchLabels:
app: eks-iam-test
template:
metadata:
labels:
app: eks-iam-test
spec:
serviceAccountName: default
containers:
- name: eks-iam-test
image: sdscello/awscli:latest
ports:
- containerPort: 80Run these commands
$ kubectl apply -f iam_pod.yaml
$ kubectl get pods # Note down <pod-name> from output
$ kubectl exec -it <pod-name> /bin/bash
$ aws s3 ls # This should work if the IAM role is configured correctlyNow that the setup is done, let’s move to run the actual PySpark job to access files stored in the AWS S3 bucket.
Steps
- Create a folder XYZ
- Under XYZ, create a python file job_to_run.py and fill it with the content below.
import os
import boto3
from pyspark import SparkContext, SQLContext
from pyspark.conf import SparkConfsts_connection = boto3.client('sts')
role_arn = os.getenv("AWS_ROLE_ARN")
web_identity_token = None
with open(os.getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), 'r') as content_file:
web_identity_token = content_file.read()
assume_role_object = sts_connection.assume_role_with_web_identity(
RoleArn=role_arn,
RoleSessionName="any_name",
WebIdentityToken=web_identity_token
)ACCESS_KEY = assume_role_object['Credentials']['AccessKeyId']
SECRET_KEY = assume_role_object['Credentials']['SecretAccessKey']
SESSION_TOKEN = assume_role_object['Credentials']['SessionToken']
SPARK_CONF = SparkConf()
SC = SparkContext(conf=SPARK_CONF, appName="test_app_1")
try:
SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(ACCESS_KEY))
SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(SECRET_KEY))
SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(SESSION_TOKEN))
SQLCONTEXT = SQLContext(SC)
df = SQLCONTEXT.read.csv("s3a://bucket_name/path/to/csv_file")
print(df.head(1))
except Exception as e:
pass
finally:
SC.stop()3. Creating Image
Under XYZ, create a text file with exact name Dockerfile(no extension) and paste this content. Assuming that the base Spark image is on the docker hub.
FROM spark_repo_name:tag_name
RUN pip install boto3
COPY job_to_run.py /tmp/job_to_run.pyNext, create a new public repository(new_repo_name) on the DockerHub console and run the below set of commands.
$ docker login
$ docker build -t new_repo_name .
$ docker tag new_repo_name:latest new_repo_name:v1
$ docker push new_repo_name:v1With these commands, we have successfully built our job image and pushed it to one of the public repo on DockerHub. Now we will validate our image whether it has been created correctly.
4. Validation
## List out all the images
$ docker images## Make sure you mention the recent image.
$ docker run -it new_repo_name:v1 sh
$ cd /tmp/## It should print job_to_run.py
$ ls
job_to_run.pyRunning the Job
Before running the job we require a few information on the EKS cluster and Spark Binary file(on local).
EKS Cluster Info
$ Kubectl cluster-info
Kubernetes master is running at https://123456789.gr7.region.eks.amazonaws.com
kube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxySpark Binary File
If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case(on Mac), it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don’t have a Spark binary get one form here and store at some location.
$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin
$ ls
spark-submit, many_other_files, . . .Running the Job
$ ./spark-submit --deploy-mode cluster \
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
--conf spark.executor.instances=5 \
--conf spark.app.name=my_pyspark_job \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.driver.container.image=new_repo_name:v1
--conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.pyAfter we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:
Check Job-status
$ kubectl get pods
1_spark_pods_driver, some_spark_pods_executerSTATUS
- CreatingContainer: The cluster is pulling the image to create a container.
- Running: The job is still running.
- Completed: If your job executed successfully, you will see completed.
- Pending: Pods are pending because of the unavailability of compute nodes.
- Error: Job has failed while execution.
## get log
$ kubectl logs 1_spark_pods_driver## Describe Pod
$ kubectl describe pods 1_spark_pods_driver
$ kubectl get nodes
farget_nodes1, some more## Describe Node
$ kubectl describe nodes node_nameNow that we are done with the blog, I am hopeful that you had no difficulty following it. In case you faced any difficulty, comment below. I would be happy to help resolve the issues.
