avatarAman Ranjan Verma

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

4990

Abstract

<span class="hljs-keyword">format</span>(AWS_SESSION_TOKEN)) # Optional SC._jsc.hadoopConfiguratio<span class="hljs-meta">n</span>().<span class="hljs-keyword">set</span>(<span class="hljs-string">"fs.s3a.impl"</span>, <span class="hljs-string">"org.apache.hadoop.fs.s3a.S3AFileSystem"</span>)</pre></div><div id="c57a"><pre><span class="hljs-attribute">SQLCONTEXT</span> <span class="hljs-operator">=</span> SQLContext(SC) <span class="hljs-attribute">df</span> <span class="hljs-operator">=</span> SQLCONTEXT.read.csv(<span class="hljs-string">"s3a://bucket/key"</span>)</pre></div><div id="cb57"><pre><span class="hljs-function"><span class="hljs-title">print</span><span class="hljs-params">(df.show()</span></span>) SC<span class="hljs-selector-class">.stop</span>()</pre></div><p id="1666">We will save this file in a folder <b>ABC</b> with file_name = <b>job_to_run.py. </b>In this case, we have directly used AWS credentials, but if you want to use an IAM role follow this blog.</p><div id="83b1" class="link-block"> <a href="https://readmedium.com/access-aws-services-from-eks-using-iam-role-with-pyspark-7f417d38740a"> <div> <div> <h2>Access AWS Services From EKS using IAM Role with PySpark</h2> <div><h3>In this blog, we will read a CSV file from AWS S3 from an EKS cluster using the IAM role. Before we get started, there…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*uPp7NR2q-rlkRAvjOVAjaQ.png)"></div> </div> </div> </a> </div><h1 id="f178">Creating Image</h1><p id="39ea">Then we will go to the ABC directory and create a text file with the name Dockerfile(no extension) and paste the content below. <i>From <a href="https://medium.com/@aman.ranjanverma/running-pyspark-on-eks-fargate-part-2-cc077d99bd5">part 2</a>, I assume that the base Spark image is on the docker hub.</i></p><div id="1535"><pre><span class="hljs-keyword">FROM</span> repo_name:v1 <span class="hljs-keyword">COPY</span><span class="language-bash"> job_to_run.py /tmp/job_to_run.py</span></pre></div><p id="52ab">Next, we will create a new public repository on DockerHub and run the below set of commands.</p><div id="531f"><pre>docker login docker build -t <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span> . docker tag <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span>:latest <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span>:v1 docker push <span class="hljs-keyword">new</span><span class="hljs-type">_repo_name</span>:v1</pre></div><p id="84cf">With these commands, we have successfully built our job image and pushed it to the recently created public repo on DockerHub. Now we will validate our image whether it has been properly built.</p><h1 id="e7f2">Validation</h1><div id="91cc"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># List out all the images</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">docker images</span></pre></div><div id="5ff7"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># Make sure you mention the recent image.</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">docker run -it new_repo_name:v1 sh</span>

<span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">cd</span> /tmp/</span></pre></div><div id="2efb"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># It should print job_to_run.py</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">ls</span></span> job_to_run.py</pre></div><h1 id="5f01">Running the Job</h1><p id="3d02">Before running the job we require some information on the EKS cluster and Spark Binary file(on local).</p><h2 id="082c">EKS Cluster</h2><p id="5699"><b>Connect terminal to EKS Cluster</b></p><div id="a1f8"><pre><span class="hljs-variable"> </span>export <span class="hljs-title class_">KUBECONFIG</span>=<span class="hljs-variable">KUBECONFIG</span><span class="hljs-symbol">:~/</span>.kube/config-fargate <span class="hljs-variable"> </span>aws eks --region region-code update-kubeconfig --name cluster_name</pre></div><p id="5a99"><b>Get Cluster Info</b></p><div id="8bcf"><pre> Kubectl <span class="hljs-keyword">cluster</span>-<span class="hljs-keyword">info</span></pre></div><div id="3c2b"><pre>Kubernetes <span class="hljs-keyword">master</span> <span class="hljs-title">is</span> running at https://<span class="hljs-number">123456789</span>.gr7.region.eks.amazonaws.com</pre></div><div id="810d"><pre>kube-dns is running at https:<span

Options

class="hljs-regexp">//</span>{some_ip<span class="hljs-regexp">/url}/</span>api<span class="hljs-regexp">/v1/</span>namespaces<span class="hljs-regexp">/kube-system/</span>services<span class="hljs-regexp">/kube-dns/</span>proxy</pre></div><h2 id="cfb0">Spark Binary File</h2><p id="ab62">If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case, it is at “<i>/Users/aman/spark-2.4.3-bin-hadoop2.7</i>”. If you don't have a Spark binary get one form <a href="https://spark.apache.org/downloads.html">here</a> and store at some location.</p><div id="b824"><pre><span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">cd</span> /path/to/spark-2.4.3-bin-hadoop2.7/bin</span> <span class="hljs-meta prompt_"> </span><span class="language-bash"><span class="hljs-built_in">ls</span></span> spark-submit, many_other_files, . . . </pre></div><h2 id="fcfd">Running the Job</h2><div id="f48f"><pre> ./spark-submit <span class="hljs-attr">--deploy-mode</span> cluster \ <span class="hljs-attr">--master</span> k8s:<span class="hljs-comment">//https://123456789.gr7.region.eks.amazonaws.com:443 \</span> <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.authenticate</span><span class="hljs-selector-class">.driver</span>.serviceAccountName=default \ <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.executor</span>.instances=<span class="hljs-number">5</span> \ <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.app</span>.name=my_pyspark_job \ <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span>.namespace=default \ <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.driver</span><span class="hljs-selector-class">.container</span>.image=new_repo_name:v1 <span class="hljs-attr">--conf</span> spark<span class="hljs-selector-class">.kubernetes</span><span class="hljs-selector-class">.executor</span><span class="hljs-selector-class">.container</span>.image=new_repo_name:v1 local:<span class="hljs-comment">///tmp/job_to_run.py</span></pre></div><p id="e56f">After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:</p><h2 id="a390">Check Job-status</h2><div id="fdab"><pre> kubectl <span class="hljs-built_in">get</span> pods 1_spark_pods_driver, some_spark_pods_executer</pre></div><p id="9d6c"><b>STATUS</b></p><ul><li><b>CreatingContainer</b>: The cluster is pulling the image to create a container.</li><li><b>Running: </b>The job is still running.</li><li><b>Completed</b>: If your job executed successfully, you will see completed.</li><li><b>Pending: </b>Pods are pending because of the unavailability of compute nodes.</li><li><b>Error</b>: Job has failed while execution.</li></ul><div id="7d76"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># get log</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl logs 1_spark_pods_driver</span></pre></div><div id="67ed"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># Describe Pod</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl describe pods 1_spark_pods_driver</span></pre></div><div id="cffe"><pre> kubectl <span class="hljs-built_in">get</span> nodes farget_nodes</pre></div><div id="1fad"><pre><span class="hljs-meta prompt_">#</span><span class="language-bash"><span class="hljs-comment"># Describe Node</span></span> <span class="hljs-meta prompt_"> </span><span class="language-bash">kubectl describe nodes farget_node_1</span></pre></div><p id="e0db">Now that we are done with the series of blogs on running the PySpark job on EKS with the Fargate profile<b>, </b>I am hopeful that you had no difficulty following the blog. In case you faced any difficulty in following it, please comment below. I would be happy to help resolve the issues.</p><p id="49ac">You may be interested in reading this:</p><div id="cdbe" class="link-block"> <a href="https://medium.com/@aman.ranjanverma/running-python-job-on-aws-eks-fargate-520e0cf6853a"> <div> <div> <h2>Running Python Job on AWS EKS Fargate</h2> <div><h3>In this article, I will demonstrate to you how to submit a Python job on EKS cluster up and running with Fargate as a…</h3></div> <div><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*CfWMpH9OHMBWucee2LtPxw.png)"></div> </div> </div> </a> </div></article></body>

Running PySpark on EKS Fargate: Part 3(Last)

Running a PySpark job on EKS to access files stored on AWS S3

It is 3rd part of the series on how to run PySpark jobs on AWS EKS Fargate. In Part 1, we completed our setup w.r.t. the prerequisites listed. We also learned about the Docker file/image and some important docker commands. In Part 2 of the series, we saw how to merge an x-version of Spark with the y-version of Hadoop. We built an image and pushed it to DockerHub public repo. In this Part 3 of the blog, we will use the image and run a PySpark job.

Before getting started, let's recap and verify whether we are on the same page wrt. requirements.

  • An EKS Cluster with Fargate profile setup done. If you still don’t have one, follow this AWS DOC for creating one. We need to have two Fargate Profile with different namespaces namely “default” and “kube-system”. We will use the “default” namespace for running our job whereas the “kube-system” will be used by coredns. coredns is a Kubernetes server DNS deployment.

@Reference awsdoc: By default, CoreDNS is configured to run on Amazon EC2 infrastructure on Amazon EKS clusters. If you want to only run your pods on Fargate in your cluster, You would also need to create a Fargate profile to target the CoreDNS pods.

To be precise, we need to have an EKS cluster + Fargate Profile (namespace = default) + coredns pod running(check by “kubectl get pods -n kube-system” if you see it pending create a new profile and redeploy the coredns pod with “kubectl rollout restart -n kube-system deployment coredns”)

  • The next important requirement is a Spark docker image from part 2 either on Dockerhub or AWS ECR.

Each Job on an EKS cluster is in the form of an image. As the first step, we will define a spark job and build an image.

Defining PySpark Job

We will run a PySpark script that reads a CSV file from the AWS S3 bucket.

from pyspark import SparkContext, SQLContext
from pyspark.conf import SparkConf
AWS_ACCESS_KEY_ID = "************"
AWS_SECRET_ACCESS_KEY = "*****************"
AWS_SESSION_TOKEN = "*************************" # optional
SPARK_CONF = SparkConf()
SC = SparkContext(conf=SPARK_CONF, appName="SparkOfferDistributor")
SC._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")#Optional
SC._jsc.hadoopConfiguration().set("fs.s3a.access.key", "{}".format(AWS_ACCESS_KEY_ID))
SC._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "{}".format(AWS_SECRET_ACCESS_KEY))
SC._jsc.hadoopConfiguration().set("fs.s3a.session.token", "{}".format(AWS_SESSION_TOKEN)) # Optional
SC._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
SQLCONTEXT = SQLContext(SC)
df = SQLCONTEXT.read.csv("s3a://bucket/key")
print(df.show())
SC.stop()

We will save this file in a folder ABC with file_name = job_to_run.py. In this case, we have directly used AWS credentials, but if you want to use an IAM role follow this blog.

Creating Image

Then we will go to the ABC directory and create a text file with the name Dockerfile(no extension) and paste the content below. From part 2, I assume that the base Spark image is on the docker hub.

FROM repo_name:v1 
COPY job_to_run.py /tmp/job_to_run.py

Next, we will create a new public repository on DockerHub and run the below set of commands.

$ docker login
$ docker build -t new_repo_name .
$ docker tag new_repo_name:latest new_repo_name:v1
$ docker push new_repo_name:v1

With these commands, we have successfully built our job image and pushed it to the recently created public repo on DockerHub. Now we will validate our image whether it has been properly built.

Validation

## List out all the images
$ docker images
## Make sure you mention the recent image.
$ docker run -it new_repo_name:v1 sh
    
$ cd /tmp/
## It should print job_to_run.py
$ ls 
job_to_run.py

Running the Job

Before running the job we require some information on the EKS cluster and Spark Binary file(on local).

EKS Cluster

Connect terminal to EKS Cluster

$ export KUBECONFIG=$KUBECONFIG:~/.kube/config-fargate
$ aws eks --region region-code update-kubeconfig --name cluster_name

Get Cluster Info

$ Kubectl cluster-info
Kubernetes master is running at https://123456789.gr7.region.eks.amazonaws.com
kube-dns is running at https://{some_ip/url}/api/v1/namespaces/kube-system/services/kube-dns/proxy

Spark Binary File

If you already have Spark installed on your system. Try finding the bin directory of the Spark binary. In my case, it is at “/Users/aman/spark-2.4.3-bin-hadoop2.7”. If you don't have a Spark binary get one form here and store at some location.

$ cd /path/to/spark-2.4.3-bin-hadoop2.7/bin
$ ls
spark-submit, many_other_files, . . . 

Running the Job

$ ./spark-submit --deploy-mode cluster \
--master k8s://https://123456789.gr7.region.eks.amazonaws.com:443 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=default \ 
--conf spark.executor.instances=5 \
--conf spark.app.name=my_pyspark_job \
--conf spark.kubernetes.namespace=default \
--conf spark.kubernetes.driver.container.image=new_repo_name:v1
--conf spark.kubernetes.executor.container.image=new_repo_name:v1 local:///tmp/job_to_run.py

After we run this job command, logs will start appearing on the terminal. Meanwhile, you can check the job status using these commands:

Check Job-status

$ kubectl get pods
1_spark_pods_driver, some_spark_pods_executer

STATUS

  • CreatingContainer: The cluster is pulling the image to create a container.
  • Running: The job is still running.
  • Completed: If your job executed successfully, you will see completed.
  • Pending: Pods are pending because of the unavailability of compute nodes.
  • Error: Job has failed while execution.
## get log
$ kubectl logs 1_spark_pods_driver
## Describe Pod
$ kubectl describe pods 1_spark_pods_driver
$ kubectl get nodes
farget_nodes
## Describe Node
$ kubectl describe nodes farget_node_1

Now that we are done with the series of blogs on running the PySpark job on EKS with the Fargate profile, I am hopeful that you had no difficulty following the blog. In case you faced any difficulty in following it, please comment below. I would be happy to help resolve the issues.

You may be interested in reading this:

Spark
Pyspark
Aws Eks
Aws Eks Fargate
Fargate
Recommended from ReadMedium