Installing External Packages in Spark

Introduction

In this blog, we will discuss how to install external packages in Spark. We will cover the following ways:

From a Jupyter notebook
From the terminal using PySpark
From the terminal while submitting jobs (spark-submit)

Pre-installed Packages

A Spark installation already comes with a few packages. Which ones you get depends on how you installed Spark. If you have used our Data Engineering suite, packages for Azure Blob, Azure Data Lake services, AWS S3, Snowflake, and Delta Lake are already installed.

To check which packages are already installed in your Spark, you can go to

/opt/spark/jars
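
The directory can also be inspected from Python. The following is a minimal sketch: the helper name list_spark_jars is hypothetical, and the default path assumes the /opt/spark/jars location mentioned above, which may differ in other installations.

```python
from pathlib import Path


def list_spark_jars(jars_dir="/opt/spark/jars"):
    """Return the sorted file names of all .jar files found in jars_dir."""
    directory = Path(jars_dir)
    if not directory.is_dir():
        # Assumption: Spark may be installed elsewhere, so a missing
        # directory yields an empty list instead of an error.
        return []
    return sorted(path.name for path in directory.glob("*.jar"))


if __name__ == "__main__":
    for jar_name in list_spark_jars():
        print(jar_name)
```

Searching the printed names for a keyword such as "postgresql" or "mysql" gives a quick hint about whether a connector is already present.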

Install Packages

Now suppose, for example, that you want to install a package for MySQL. You can go to the Maven repository.

https://mvnrepository.com/

Search for MySQL packages, which leads to: https://mvnrepository.com/artifact/mysql/mysql-connector-java

This page lists all the available versions, and you can select any of them.

I am selecting the latest version. The page for that version displays the package properties, including the group ID, the artifact ID, and the version.

From these properties, specify the package to Spark in the following format.

groupId:artifactId:version

So, for MySQL, it will become

mysql:mysql-connector-java:8.0.32

In the same way, if you want to install a package for MongoDB, search for MongoDB and select any version.

If we prepare the package name as discussed earlier, it will be as below

org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
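
The groupId:artifactId:version format can be assembled with a few lines of Python. This is only an illustrative sketch: maven_coordinate is a hypothetical helper, not part of Spark, and its validation is deliberately simple.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Build a groupId:artifactId:version string for spark.jars.packages."""
    parts = (group_id, artifact_id, version)
    for part in parts:
        # A colon or comma inside a part would break the coordinate format,
        # and stray whitespace is almost always a copy-paste mistake.
        if not part or ":" in part or "," in part or part != part.strip():
            raise ValueError(f"invalid coordinate part: {part!r}")
    return ":".join(parts)


print(maven_coordinate("mysql", "mysql-connector-java", "8.0.32"))
# prints mysql:mysql-connector-java:8.0.32
print(maven_coordinate("org.mongodb.spark", "mongo-spark-connector_2.12", "3.0.1"))
# prints org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
```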

Install Packages from Jupyter Notebook

With a Jupyter notebook, we need to pass the config as below when starting the session.

The config name is 'spark.jars.packages'.

from pyspark.sql import SparkSession

# Pass the Maven coordinate of the package through spark.jars.packages
spark = SparkSession.builder.appName("packageinstall")\
        .config('spark.jars.packages', 'org.postgresql:postgresql:42.5.4')\
        .getOrCreate()
# Don't show warnings, only errors
spark.sparkContext.setLogLevel("ERROR")

With this configuration, Spark checks whether the package is already available locally before starting the session. If it is not, Spark downloads it first. The startup logs show the package being resolved and downloaded.
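
When more than one package is needed, spark.jars.packages accepts a comma-separated list of coordinates. The sketch below shows one way to build that value. The helper names packages_value and build_session are hypothetical, and the setting is assumed to take effect only when no Spark session is already running in the notebook, since getOrCreate() returns an existing session if there is one.

```python
def packages_value(coordinates):
    """Join Maven coordinates into the comma-separated config value."""
    return ",".join(coordinate.strip() for coordinate in coordinates)


def build_session(app_name, coordinates):
    """Start (or reuse) a Spark session configured with the given packages."""
    # Imported lazily so packages_value can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars.packages", packages_value(coordinates))
        .getOrCreate()
    )
```

For example, build_session("packageinstall", ["org.postgresql:postgresql:42.5.4", "mysql:mysql-connector-java:8.0.32"]) would request both connectors for the session.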

Installing Packages from Terminal (Spark Shell)

Our Spark setup runs in a Docker container, so we will go to the Docker terminal.

Normally, we type pyspark to start the Spark session.

To specify packages, we pass the config below to pyspark

pyspark --conf "spark.jars.packages=org.postgresql:postgresql:42.5.4"

Installing Packages While Submitting Jobs

We can also specify packages while submitting jobs. Normally, we use spark-submit for submitting jobs.

If we need external packages while submitting a job, we can use the command below

spark-submit --conf "spark.jars.packages=org.postgresql:postgresql:42.5.4" sample.py
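
If jobs are launched from a script rather than typed by hand, the same command can be assembled in Python. This is a sketch under assumptions: spark_submit_command is a hypothetical helper, and running the result requires spark-submit to be on the PATH.

```python
import shlex


def spark_submit_command(script, coordinates):
    """Return the spark-submit argument list for a script and its packages."""
    packages = ",".join(coordinates)
    return [
        "spark-submit",
        "--conf",
        f"spark.jars.packages={packages}",
        script,
    ]


command = spark_submit_command("sample.py", ["org.postgresql:postgresql:42.5.4"])
# Show the command as it would be typed in the terminal.
print(shlex.join(command))
```

The returned list can be passed to subprocess.run(command, check=True) to actually submit the job.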

Summary