avatarAman Ranjan Verma

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5745

Abstract

an> <span class="hljs-attribute">image</span><span class="hljs-punctuation">:</span> <span class="hljs-string">sdscello/awscli:latest</span> <span class="hljs-attribute">ports</span><span class="hljs-punctuation">:</span> <span class="hljs-bullet">-</span> <span class="hljs-string">containerPort: 80</span></pre></div><p id="1a7c">Run these commands</p><div id="523f"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl apply -f iam_pod.yaml</span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl get pods <span class="hljs-comment"># Note down <pod-name> from output</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl <span class="hljs-built_in">exec</span> -it &lt;pod-name&gt; /bin/bash</span> <span class="hljs-meta prompt_"> </span><span class="language-bash">aws s3 <span class="hljs-built_in">ls</span> <span class="hljs-comment"># This should work if the IAM role is configured correctly</span></span></pre></div><p id="fa76">Now that the setup is done, let’s move to run the actual PySpark job to access files stored in the AWS S3 bucket.</p><h2 id="5574">Steps</h2><ol><li>Create a folder <i>XYZ</i></li><li>Under <i>XYZ</i>, create a python file <b>job_to_run.py</b><i> </i>and fill it with the content below.</li></ol><div id="ac90"><pre><span class="hljs-keyword">import</span> os <span class="hljs-keyword">import</span> boto3 <span class="hljs-title">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext, SQLContext <span class="hljs-title">from</span> pyspark.conf <span class="hljs-keyword">import</span> SparkConf</pre></div><div id="5b59"><pre>sts_connection = boto3.client(<span class="hljs-string">'sts'</span>) role_arn = <span class="hljs-built_in">os</span>.<span class="hljs-built_in">getenv</span>(<span class="hljs-string">"AWS_ROLE_ARN"</span>) web_identity_token = None with <span class="hljs-built_in">open</span>(<span class="hljs-built_in">os</span>.<span class="hljs-built_in">getenv</span>(<span class="hljs-string">"AWS_WEB_IDENTITY_TOKEN_FILE"</span>), <span class="hljs-string">'r'</span>) as content_file: web_identity_token = content_file.<span class="hljs-built_in">read</span>() assume_role_object = sts_connection.assume_role_with_web_identity( RoleArn=role_arn, RoleSessionName=<span class="hljs-string">"any_name"</span>, WebIdentityToken=web_identity_token )</pre></div><div id="f00d"><pre>ACCESS_KEY = assume_role_object<span class="hljs-selector-attr">[<span class="hljs-string">'Credentials'</span>]</span><span class="hljs-selector-attr">[<span class="hljs-string">'AccessKeyId'</span>]</span> SECRET_KEY = assume_role_object<span class="hljs-selector-attr">[<span class="hljs-string">'Credentials'</span>]</span><span class="hljs-selector-attr">[<span class="hljs-string">'SecretAccessKey'</span>]</span> SESSION_TOKEN = assume_role_object<span class="hljs-selector-attr">[<span class="hljs-string">'Credentials'</span>]</span><span class="hljs-selector-attr">[<span class="hljs-string">'SessionToken'</span>]</span>

SPARK_CONF = <span class="hljs-built_in">SparkConf</span>() SC = <span class="hljs-built_in">SparkContext</span>(conf=SPARK_CONF, appName=<span class="hljs-string">"test_app_1"</span>) try: SC._jsc<span class="hljs-selector-class">.hadoopConfiguration</span>()<span class="hljs-selector-class">.set</span>(<span class="hljs-string">"fs.s3a.aws.credentials.provider"</span>, <span class="hljs-string">"org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"</span>) SC._jsc<span class="hljs-selector-class">.hadoopConfiguration</span>()<span class="hljs-selector-class">.set</span>(<span class="hljs-string">"fs.s3a.access.key"</span>, <span class="hljs-string">"{}"</span><span class="hljs-selector-class">.format</span>(ACCESS_KEY)) SC._jsc<span class="hljs-selector-class">.hadoopConfiguration</span>()<span class="hljs-selector-class">.set</span>(<span class="hljs-string">"fs.s3a.secret.key"</span>, <span class="hljs-string">"{}"</span><span class="hljs-selector-class">.format</span>(SECRET_KEY)) SC._jsc<span class="hljs-selector-class">.hadoopConfiguration</span>()<span class="hljs-selector-class">.set</span>(<span class="hljs-string">"fs.s3a.session.token"</span>, <span class="hljs-string">"{}"</span><span class="hljs-selector-class">.format</span>(SESSION_TOKEN))

SQLCONTEXT = <span class="hljs-built_in">SQLContext</span>(SC)
df = SQLCONTEXT<span class="hljs-selector-class">.read</span><span class="hljs-selector-class">.csv</span>(<span class="hljs-string">"s3a://bucket_name/path/to/csv_file"</span>)
<span class="hljs-built_in">print</span>(df<span class="hljs-selector-class">.head</span>(<span class="hljs-number">1</span>))

except Exception as e: pass finally: SC<span class="hljs-selector-class">.stop</span>()</pre></div><p id="570f">3. <b>Creating Image</b></p><p id="c0f1">Under <i>XYZ</i>, create a text file with exact name <b>Dockerfile</b>(no extension) and paste this content. <i>Assuming that the base Spark image is on the docker hub.</i></p><div id="fc5c"><pre><span class="hljs-keyword">FROM</span> spark_repo_name:tag_name <span class="hljs-keyword">RUN</span><span class="language-bash"> pip install boto3</span> <span class="hljs-keyword">COPY</span><span class="language-bash"> job_to_run.py /tmp/job_to_run.py</span></pre></div><p id="1fd8">Next, create a new public repository(new_repo_name) on the DockerHub console and run the below set of commands.</p><div id="206f"><pre>docker login docker build -t <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span> . $ docker tag <span class="hlj

Options

s-keyword">new</span><span class="hljs-type">_repo_name</span>:latest <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span>:v1 docker push <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span>:v1</pre></div><p id="27fd">With these commands, we have successfully built our job image and pushed it to one of the public repo on DockerHub. Now we will validate our image whether it has been&nbsp;created&nbsp;correctly.</p><p id="5811"><b>4. Validation</b></p><div id="e860"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># List out all the images</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">docker images<span class="hljs-comment">## Make sure you mention the recent image.</span></span> <span class="hljs-meta prompt_">$ </span><span class="language-bash">docker run -it new_repo_name:v1 sh</span>

<span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">cd</span> /tmp/<span class="hljs-comment">## It should print job_to_run.py</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">ls</span></span> job_to_run.py</pre></div><h1 id="f4bc">Running the Job</h1><p id="0d62">Before running the job we require a few information on the EKS cluster and Spark Binary file(on local).</p><p id="7f51"><b>EKS Cluster Info</b></p><div id="d82a"><pre> Kubectl cluster-info Kubernetes master is running at https:<span class="hljs-regexp">//</span><span class="hljs-number">123456789</span>.gr7.region.eks.amazonaws.com kube-dns is running at https:<span class="hljs-regexp">//</span>{some_ip<span class="hljs-regexp">/url}/</span>api<span class="hljs-regexp">/v1/</span>namespaces<span class="hljs-regexp">/kube-system/</span>services<span class="hljs-regexp">/kube-dns/</span>proxy</pre></div><h1 id="3ca8">Spark Binary File</h1><p id="9a50">If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case(on Mac), it is at “<i>/Users/aman/spark-2.4.3-bin-hadoop2.7</i>”. If you don’t have a Spark binary get one form <a href="https://spark.apache.org/downloads.html">here</a> and store at some location.</p><div id="a33d"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">cd</span> /path/to/spark-2.4.3-bin-hadoop2.7/bin</span> <span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">ls</span></span> spark-submit, many_other_files, . . .</pre></div><h1 id="0e49">Running the Job</h1><div id="cbfc"><pre> ./spark-submit <span class="hljs-attr">--deploy-mode</span> cluster
<span class="hljs-attr">--master</span> k8s:<span class="hljs-comment">//https://123456789.gr7.region.eks.amazonaws.com:443 </span> <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.authenticate</span><span class="hljs-selector-class">.driver</span>.serviceAccountName=default \ <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.executor</span>.instances=<span class="hljs-number">5</span>
<span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.app</span>.name=my_pyspark_job
<span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span>.namespace=default
<span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.driver</span><span class="hljs-selector-class">.container</span>.image=new_repo_name:v1 <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.executor</span><span class="hljs-selector-class">.container</span>.image=new_repo_name:v1 local:<span class="hljs-comment">///tmp/job_to_run.py</span></pre></div><p id="4072">After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:</p><h1 id="e761">Check Job-status</h1><div id="6a70"><pre> kubectl <span class="hljs-built_in">get</span> pods 1_spark_pods_driver, some_spark_pods_executer</pre></div><p id="95b9"><b>STATUS</b></p><ul><li><b>CreatingContainer</b>: The cluster is pulling the image to create a container.</li><li><b>Running: </b>The job is still running.</li><li><b>Completed</b>: If your job executed successfully, you will see completed.</li><li><b>Pending: </b>Pods are pending because of the unavailability of compute nodes.</li><li><b>Error</b>: Job has failed while execution.</li></ul><div id="85a2"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># get log</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl logs 1_spark_pods_driver<span class="hljs-comment">## Describe Pod</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl describe pods 1_spark_pods_driver</span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl get nodes</span> farget_nodes1, some more</pre></div><div id="f324"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># Describe Node</span></span> <span class="hljs-meta prompt_">$ </span><span class="language-bash">kubectl describe nodes node_name</span></pre></div><p id="595a">Now that we are done with the blog, I am hopeful that you had no difficulty following it. In case you faced any difficulty, comment below. I would be happy to help resolve the issues.</p></article></body>

Access AWS Services From EKS using IAM Role with PySpark

In this blog, we will read a CSV file stored in AWS S3 from an EKS cluster using an IAM role. Before we get started, there are certain prerequisites that must be fulfilled.

AWS EKS | Docker | Node Group | IAM Role | AWS S3 | PySpark

Prerequisites:

  • Install all the prerequisites mentioned in this link. Go to the validation section of the same link and check if the installations were successful.
  • Prepare a docker image of (PySpark 2.4.5 + Hadoop 3.1.2) with hadoop-aws-3.1.2.jar and aws-java-sdk-bundle-1.11.271.jar following link1 and link2.
  • Create and attach an IAM role to the cluster
  • Add policies to the IAM role for allowing permission to access different AWS Services(here S3).

Validate if the prerequisites are installed correctly:

It is very important to have the prerequisites installed and configured correctly before running the actual job.

Docker

Login to docker hub through the terminal, pls use the same credential that is used on DockerHub console.

$ docker login

Kubectl

Configure Kubectl to run the commands on the EKS cluster.

$ export KUBECONFIG=$KUBECONFIG:~/.kube/config-aws-services
$ aws eks update-kubeconfig — name eks_cluster_name — region us-east-1

Run these basic Kubectl commands,

$ kubectl get ns # namespace -> default
$ kubectl get sa # namespace -> default
$ kubectl get nodes # If a nodegroup is configured with EKS, nodes should be listed here

IAM Role and Policies

Create a file named iam_pod.yaml and fill it with the below content(take care of indentation).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-iam-test
spec:
  replicas: 1
  selector:
   matchLabels:
     app: eks-iam-test
  template:
   metadata:
     labels:
        app: eks-iam-test
    spec:
      serviceAccountName: default
      containers:
      - name: eks-iam-test
        image: sdscello/awscli:latest
        ports:
        - containerPort: 80

Run these commands

$ kubectl apply -f iam_pod.yaml
$ kubectl get pods # Note down <pod-name> from output
$ kubectl exec -it <pod-name> /bin/bash
$ aws s3 ls # This should work if the IAM role is configured correctly

Now that the setup is done, let’s move to run the actual PySpark job to access files stored in the AWS S3 bucket.

Steps

  1. Create a folder XYZ
  2. Under XYZ, create a python file job_to_run.py and fill it with the content below.
import os
import boto3
from pyspark import SparkContext, SQLContext
from pyspark.conf import SparkConf
sts_connection = boto3.client('sts')
role_arn = os.getenv("AWS_ROLE_ARN")
web_identity_token = None
with open(os.getenv("AWS_WEB_IDENTITY_TOKEN_FILE"), 'r') as content_file:
    web_identity_token = content_file.read()
assume_role_object = sts_connection.assume_role_with_web_identity(
       RoleArn=role_arn,
       RoleSessionName="any_name",
       WebIdentityToken=web_identity_token
)
ACCESS_KEY = assume_role_object['Credentials']['AccessKeyId']
SECRET_KEY = assume_role_object['Credentials']['SecretAccessKey']
SESSION_TOKEN = assume_role_object['Credentials']['SessionToken']

SPARK_CONF = SparkConf()
SC = SparkContext(conf=SPARK_CONF, appName="test_app_1")
try:
    SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(ACCESS_KEY))
    SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(SECRET_KEY))
    SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(SESSION_TOKEN))
   
    SQLCONTEXT = SQLContext(SC)
    df = SQLCONTEXT.read.csv("s3a://bucket_name/path/to/csv_file")
    print(df.head(1))
except Exception as e:
    pass
finally:
    SC.stop()

3. Creating Image

Under XYZ, create a text file with exact name Dockerfile(no extension) and paste this content. Assuming that the base Spark image is on the docker hub.

FROM spark_repo_name:tag_name
RUN pip install boto3
COPY job_to_run.py /tmp/job_to_run.py

Next, create a new public repository(new_repo_name) on the DockerHub console and run the below set of commands.

$ docker login
$ docker build -t new_repo_name .
$ docker tag new_repo_name:latest new_repo_name:v1
$ docker push new_repo_name:v1

With these commands, we have successfully built our job image and pushed it to one of the public repo on DockerHub. Now we will validate our image whether it has been created correctly.

4. Validation

## List out all the images
$ docker images## Make sure you mention the recent image.
$ docker run -it new_repo_name:v1 sh
    
$ cd /tmp/## It should print job_to_run.py
$ ls 
job_to_run.py

Running the Job

Before running the job we require a few information on the EKS cluster and Spark Binary file(on local).

EKS Cluster Info

$ Kubectl cluster-info
Kubernetes master is running at https://123456789.gr7.region.eks.amazonaws.com
kube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxy

Spark Binary File

If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case(on Mac), it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don’t have a Spark binary get one form here and store at some location.

$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin
$ ls
spark-submit, many_other_files, . . .

Running the Job

$ ./spark-submit --deploy-mode cluster \
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \ 
--conf spark.executor.instances=5 \
--conf spark.app.name=my_pyspark_job \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.driver.container.image=new_repo_name:v1
--conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.py

After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:

Check Job-status

$ kubectl get pods
1_spark_pods_driver, some_spark_pods_executer

STATUS

  • CreatingContainer: The cluster is pulling the image to create a container.
  • Running: The job is still running.
  • Completed: If your job executed successfully, you will see completed.
  • Pending: Pods are pending because of the unavailability of compute nodes.
  • Error: Job has failed while execution.
## get log
$ kubectl logs 1_spark_pods_driver## Describe Pod
$ kubectl describe pods 1_spark_pods_driver
$ kubectl get nodes
farget_nodes1, some more
## Describe Node
$ kubectl describe nodes node_name

Now that we are done with the blog, I am hopeful that you had no difficulty following it. In case you faced any difficulty, comment below. I would be happy to help resolve the issues.

Pyspark
Aws Eks
AWS
Fargate
Iam Roles
Recommended from ReadMedium