Implementing and Integrating Argo Workflow and Spark on Kubernetes

Summary

The web content provides a comprehensive guide on integrating Argo Workflow with Spark on Kubernetes, detailing the setup process for both tools and how to run Spark jobs within Argo Workflows.

Abstract

The article discusses the use of Argo Workflow, a container-native workflow engine for Kubernetes, as an alternative to Airflow for orchestrating tasks. It outlines the steps to deploy Argo Workflow using Helm, including the configuration of ingress for accessing the Argo UI. Additionally, the article explains how to run Apache Spark on Kubernetes, either through spark-submit or the Spark operator, emphasizing the flexibility and scalability of using Kubernetes for Spark jobs. It also provides a detailed example of submitting a Spark job from within an Argo Workflow, demonstrating the integration of both technologies. The author notes the scarcity of resources on this integration and aims to fill the gap with this guide, which includes references to documentation, examples, and performance optimization on AWS EKS.

Opinions

The author suggests that Argo Workflow is a robust solution for those seeking a Kubernetes-native workflow engine and may be a suitable alternative to Airflow.
There is an endorsement of Argo Workflow's ease of use and its detailed documentation, which is described as well-explained.
The author expresses enthusiasm about running Spark on Kubernetes, highlighting the ability to deploy on-demand driver and executor pods without the need for dedicated Spark clusters.
The article implies that the Spark operator, while still in development, offers a native Kubernetes experience for Spark workloads and integrates well with kubectl and sparkctl.
The author indicates a gap in existing online resources for combining Argo Workflow with Spark, which motivates the creation of this comprehensive guide.
A recommendation is made to consult specific resources for creating Spark images, indicating the author's view on the usefulness of these references.
The author encourages readers to explore further optimizations for Spark performance on AWS EKS, suggesting a proactive approach to performance tuning in a cloud environment.

In my first article, I spoke about Argo CD. This time it’s going to be Argo Workflow which is again from Argo project, Spark on Kubernetes and how do we make both work together.

Argo Workflow

Argo Workflow is a container native workflow engine where we can orchestrate jobs having sequence of tasks (each step in a workflow as a container). Using workflow definition we can capture the dependencies between the task using a DAG. An alternative to Airflow? Well.. maybe! If you’re looking at something native to kubernetes, I’m sure Argo Workflow will not disappoint you.

Deploying Argo Workflow to K8s (Please do create a namespace for argo workflow):

1. Install helm:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3

chmod 700 get_helm.sh

./get_helm.sh

2. Add argo repo to helm and install

helm repo add argo https://argoproj.github.io/argo-helm

helm repo update

helm install argo-wf argo/argo -n argo-wf -f values.yaml

In values.yaml, you can enable ingress (if ingress controller is available in your cluster)

ingress:

  enabled: true

  annotations:

    kubernetes.io/ingress.class: nginx

    kubernetes.io/tls-acme: “true”

  hosts:

   - argo.example.com

You will have argo workflow up and running in no time!

The workflow automation in Argo is driven by YAML templates. Argo has provided rich documentation with examples for the same. We can even submit the workflows via REST APIs if you’re looking at automation. I’m not going to deep dive much here as the documentation is detailed and well explained. Let’s move on to the next topic.

Spark on Kubernetes

Starting from Spark 2.3, you can use Kubernetes to run and manage Spark resources. Spark can run on clusters managed by Kubernetes. This feature makes use of native Kubernetes scheduler that has been added to Spark. We can run on demand spark driver and executor pods which means no dedicated spark clusters.. cool right?

There are two ways to run Spark on Kubernetes: by using Spark-submit and Spark operator.

Spark-submit: By using spark-submit CLI, you can submit Spark jobs with various configuration options supported by Kubernetes.

spark-submit delegates the job submission to the Spark on Kubernetes backend which prepares the submission of the driver via a pod in the cluster and finally creates the related Kubernetes resources by communicating to the Kubernetes API server.

Spark-submit is the easiest way to run Spark on Kubernetes. If you review the code snippet, you’ll notice two minor changes. One is to change the Kubernetes cluster endpoint. Second, is the container image that hosts your Spark application.

./bin/spark-submit \ — master k8s://https://<KUBERNETES_CLUSTER_ENDPOINT> \ — deploy-mode cluster \ — name spark-pi \ — class org.apache.spark.examples.SparkPi \ — conf spark.executor.instances=3 \ — conf spark.kubernetes.container.image=aws/spark:2.4.5-SNAPSHOT \ — conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ — conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.11–2.4.5-SNAPSHOT.jar

Please refer this link for all the available parameters/properties.

Spark operator: Spark operator (What are operators?) provides a native Kubernetes experience for Spark workloads. In addition, you can use kubectl and sparkctl to submit Spark jobs. Examples for these can be found here.

Spark operator is still in phase 4 capability level

Now how do we submit a spark job from Argo Workflow?

Spark-submit:

apiVersion: argoproj.io/v1alpha1

kind: Workflow

metadata:

  name: wf-spark-pi

  namespace: argo

spec:

  arguments: {}

  entrypoint: sparkapp

  templates:

  - arguments: {}

    container:

      args:

      - /opt/spark/bin/spark-submit

      - --master

      - k8s://https://kubernetes.default.svc

      - --deploy-mode

      - cluster

      - --conf

      - spark.kubernetes.namespace=spark-apps

      - --conf

      - spark.kubernetes.container.image=sparkimage

      - --conf

      - spark.executor.instances=2

      - --class

      - org.apache.spark.examples.SparkPi

      - local:///opt/spark/examples/examples/jars/spark-examples_2.12-2.4.2.jar

      command:

      - sh

      image: sparkimage

      imagePullPolicy: Always

      name: ""

      resources: {}

  inputs: {}

  metadata: {}

  name: sparkapp

  outputs: {}

Spark operator:

We can use Argo Workflow resource template to achieve this. Examples are provided in the link.

For creating spark images this is a good reference.

Please do read this blog for optimizing spark performance on AWS EKS.

I couldn’t find anything online having both Argo Workflow and Spark working together. It was scattered here and there, so thought of collating into one. Hence this article. Hope it helps!