Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5745

Abstract

an> image: sdscello/awscli:latest ports: - containerPort: 80</pre></div>Run these commands<div id="523f"><pre> $kubectl apply -f iam_pod.yaml $ kubectl get pods # Note down <pod-name> from output $kubectl exec -it <pod-name> /bin/bash $ aws s3 ls # This should work if the IAM role is configured correctly</pre></div>Now that the setup is done, let’s move to run the actual PySpark job to access files stored in the AWS S3 bucket.<h2 id="5574">Steps</h2><ol><li>Create a folder XYZ</li><li>Under XYZ, create a python file job_to_run.py and fill it with the content below.</li></ol><div id="ac90"><pre>import os import boto3 from pyspark import SparkContext, SQLContext from pyspark.conf import SparkConf</pre></div><div id="5b59"><pre>sts_connection = boto3.client('sts') role_arn = os.getenv("AWS_ROLE_ARN") web_identity_token = None with open(os.getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), 'r') as content_file: web_identity_token = content_file.read() assume_role_object = sts_connection.assume_role_with_web_identity( RoleArn=role_arn, RoleSessionName="any_name", WebIdentityToken=web_identity_token )</pre></div><div id="f00d"><pre>ACCESS_KEY = assume_role_object['Credentials']['AccessKeyId'] SECRET_KEY = assume_role_object['Credentials']['SecretAccessKey'] SESSION_TOKEN = assume_role_object['Credentials']['SessionToken']

SPARK_CONF = SparkConf() SC = SparkContext(conf=SPARK_CONF, appName="test_app_1") try: SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(ACCESS_KEY)) SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(SECRET_KEY)) SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(SESSION_TOKEN))

SQLCONTEXT = <span class="hljs-built_in">SQLContext</span>(SC)
df = SQLCONTEXT<span class="hljs-selector-class">.read</span><span class="hljs-selector-class">.csv</span>(<span class="hljs-string">"s3a://bucket_name/path/to/csv_file"</span>)
<span class="hljs-built_in">print</span>(df<span class="hljs-selector-class">.head</span>(<span class="hljs-number">1</span>))

except Exception as e: pass finally: SC.stop()</pre></div>3. Creating ImageUnder XYZ, create a text file with exact name Dockerfile(no extension) and paste this content. Assuming that the base Spark image is on the docker hub.<div id="fc5c"><pre>FROM spark_repo_name:tag_name RUN pip install boto3 COPY job_to_run.py /tmp/job_to_run.py</pre></div>Next, create a new public repository(new_repo_name) on the DockerHub console and run the below set of commands.<div id="206f"><pre> $docker login$ docker build -t new_repo_name . $ docker tag <span class="hlj

Options

s-keyword">new_repo_name:latest new_repo_name:v1 $docker push new_repo_name:v1</pre></div>With these commands, we have successfully built our job image and pushed it to one of the public repo on DockerHub. Now we will validate our image whether it has been created correctly.4. Validation<div id="e860"><pre>## List out all the images $ docker images## Make sure you mention the recent image. $ docker run -it new_repo_name:v1 sh

$cd /tmp/## It should print job_to_run.py $ ls job_to_run.py</pre></div><h1 id="f4bc">Running the Job</h1>Before running the job we require a few information on the EKS cluster and Spark Binary file(on local).EKS Cluster Info<div id="d82a"><pre> $Kubectl cluster-info Kubernetes master is running at https://123456789.gr7.region.eks.amazonaws.com kube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxy</pre></div><h1 id="3ca8">Spark Binary File</h1>If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case(on Mac), it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don’t have a Spark binary get one form <a href="https://spark.apache.org/downloads.html">here</a> and store at some location.<div id="a33d"><pre>$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin $ls spark-submit, many_other_files, . . .</pre></div><h1 id="0e49">Running the Job</h1><div id="cbfc"><pre>$ ./spark-submit --deploy-mode cluster
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \ --conf spark.executor.instances=5
--conf spark.app.name=my_pyspark_job
--conf spark.kubernetes.namespace=default
--conf spark.kubernetes.driver.container.image=new_repo_name:v1 --conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.py</pre></div>After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:<h1 id="e761">Check Job-status</h1><div id="6a70"><pre> $kubectl get pods 1_spark_pods_driver, some_spark_pods_executer</pre></div>STATUS<ul><li>CreatingContainer: The cluster is pulling the image to create a container.</li><li>Running: The job is still running.</li><li>Completed: If your job executed successfully, you will see completed.</li><li>Pending: Pods are pending because of the unavailability of compute nodes.</li><li>Error: Job has failed while execution.</li></ul><div id="85a2"><pre>## get log $ kubectl logs 1_spark_pods_driver## Describe Pod $kubectl describe pods 1_spark_pods_driver $ kubectl get nodes farget_nodes1, some more</pre></div><div id="f324"><pre>## Describe Node $ kubectl describe nodes node_name</pre></div>Now that we are done with the blog, I am hopeful that you had no difficulty following it. In case you faced any difficulty, comment below. I would be happy to help resolve the issues.</article></body>

Access AWS Services From EKS using IAM Role with PySpark

In this blog, we will read a CSV file stored in AWS S3 from an EKS cluster using an IAM role. Before we get started, there are certain prerequisites that must be fulfilled.

AWS EKS | Docker | Node Group | IAM Role | AWS S3 | PySpark

Prerequisites:

Install all the prerequisites mentioned in this link. Go to the validation section of the same link and check if the installations were successful.
Prepare a docker image of (PySpark 2.4.5 + Hadoop 3.1.2) with hadoop-aws-3.1.2.jar and aws-java-sdk-bundle-1.11.271.jar following link1 and link2.
Create and attach an IAM role to the cluster

Creating an IAM role for ServiceAccount

Learn how to create a web identity IAM role and attach it with any Service account in EKS cluster.

medium.com

Add policies to the IAM role for allowing permission to access different AWS Services(here S3).

Validate if the prerequisites are installed correctly:

It is very important to have the prerequisites installed and configured correctly before running the actual job.

Docker

$ docker login

Kubectl

Configure Kubectl to run the commands on the EKS cluster.

$ export KUBECONFIG=$KUBECONFIG:~/.kube/config-aws-services
$ aws eks update-kubeconfig — name eks_cluster_name — region us-east-1

Run these basic Kubectl commands,

$ kubectl get ns # namespace -> default
$ kubectl get sa # namespace -> default

$ kubectl get nodes # If a nodegroup is configured with EKS, nodes should be listed here

IAM Role and Policies

Create a file named iam_pod.yaml and fill it with the below content(take care of indentation).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-iam-test
spec:
  replicas: 1
  selector:
   matchLabels:
     app: eks-iam-test
  template:
   metadata:
     labels:
        app: eks-iam-test
    spec:
      serviceAccountName: default
      containers:
      - name: eks-iam-test
        image: sdscello/awscli:latest
        ports:
        - containerPort: 80

Run these commands

$ kubectl apply -f iam_pod.yaml
$ kubectl get pods # Note down <pod-name> from output
$ kubectl exec -it <pod-name> /bin/bash
$ aws s3 ls # This should work if the IAM role is configured correctly

Now that the setup is done, let’s move to run the actual PySpark job to access files stored in the AWS S3 bucket.

Steps

Create a folder XYZ
Under XYZ, create a python file job_to_run.py and fill it with the content below.

import os
import boto3
from pyspark import SparkContext, SQLContext
from pyspark.conf import SparkConf

sts_connection = boto3.client('sts')
role_arn = os.getenv("AWS_ROLE_ARN")
web_identity_token = None
with open(os.getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), 'r') as content_file:
    web_identity_token = content_file.read()
assume_role_object = sts_connection.assume_role_with_web_identity(
       RoleArn=role_arn,
       RoleSessionName="any_name",
       WebIdentityToken=web_identity_token
)

ACCESS_KEY = assume_role_object['Credentials']['AccessKeyId']
SECRET_KEY = assume_role_object['Credentials']['SecretAccessKey']
SESSION_TOKEN = assume_role_object['Credentials']['SessionToken']

SPARK_CONF = SparkConf()
SC = SparkContext(conf=SPARK_CONF, appName="test_app_1")
try:
    SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(ACCESS_KEY))
    SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(SECRET_KEY))
    SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(SESSION_TOKEN))
   
    SQLCONTEXT = SQLContext(SC)
    df = SQLCONTEXT.read.csv("s3a://bucket_name/path/to/csv_file")
    print(df.head(1))
except Exception as e:
    pass
finally:
    SC.stop()

3. Creating Image

Under XYZ, create a text file with exact name Dockerfile(no extension) and paste this content. Assuming that the base Spark image is on the docker hub.

FROM spark_repo_name:tag_name
RUN pip install boto3
COPY job_to_run.py /tmp/job_to_run.py

Next, create a new public repository(new_repo_name) on the DockerHub console and run the below set of commands.

$ docker login
$ docker build -t new_repo_name .
$ docker tag new_repo_name:latest new_repo_name:v1
$ docker push new_repo_name:v1

With these commands, we have successfully built our job image and pushed it to one of the public repo on DockerHub. Now we will validate our image whether it has been created correctly.

4. Validation

## List out all the images
$ docker images## Make sure you mention the recent image.
$ docker run -it new_repo_name:v1 sh
    
$ cd /tmp/## It should print job_to_run.py
$ ls 
job_to_run.py

Running the Job

Before running the job we require a few information on the EKS cluster and Spark Binary file(on local).

EKS Cluster Info

$ Kubectl cluster-info
Kubernetes master is running at https://123456789.gr7.region.eks.amazonaws.com
kube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxy

Spark Binary File

If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case(on Mac), it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don’t have a Spark binary get one form here and store at some location.

$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin
$ ls
spark-submit, many_other_files, . . .

Running the Job

$ ./spark-submit --deploy-mode cluster \
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \ 
--conf spark.executor.instances=5 \
--conf spark.app.name=my_pyspark_job \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.driver.container.image=new_repo_name:v1
--conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.py

After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:

Check Job-status

$ kubectl get pods
1_spark_pods_driver, some_spark_pods_executer

STATUS

CreatingContainer: The cluster is pulling the image to create a container.
Running: The job is still running.
Completed: If your job executed successfully, you will see completed.
Pending: Pods are pending because of the unavailability of compute nodes.
Error: Job has failed while execution.

## get log
$ kubectl logs 1_spark_pods_driver## Describe Pod
$ kubectl describe pods 1_spark_pods_driver
$ kubectl get nodes
farget_nodes1, some more

## Describe Node
$ kubectl describe nodes node_name

Now that we are done with the blog, I am hopeful that you had no difficulty following it. In case you faced any difficulty, comment below. I would be happy to help resolve the issues.