PySpark ML and XGBoost setup using a docker image
In this tutorial we will build and test a Docker image that lets us run a Jupyter notebook with XGBoost fully integrated.
TL;DR docker image here.
We will start from the jupyter/pyspark-notebook base image, which is compatible with the XGBoost Python wrapper.
FROM jupyter/pyspark-notebook:7f1482f5a136
RUN cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.72/xgboost4j-0.72.jar
RUN cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.72/xgboost4j-spark-0.72.jar
RUN cd work/ && wget https://github.com/dmlc/xgboost/files/2161553/sparkxgb.zip
RUN pip install findspark
This will add the necessary jars and dependencies for XGBoost.
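The two jar URLs in the Dockerfile follow the usual Maven Central layout (group path, artifact, version, then artifact-version.jar), and both jars use the same version. As a small sketch of that pattern, the snippet below rebuilds the URLs from their parts; the helper name maven_jar_url and the constants are hypothetical and only serve the illustration.

```python
# Sketch: rebuild the Maven Central jar URLs used in the Dockerfile.
# maven_jar_url, MAVEN_CENTRAL and XGBOOST_VERSION are illustrative names.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"
XGBOOST_VERSION = "0.72"  # both jars are pinned to the same version


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Return the download URL of a jar in the Maven Central layout."""
    group_path = group_id.replace(".", "/")  # ml.dmlc -> ml/dmlc
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


jar_urls = [
    maven_jar_url("ml.dmlc", artifact, XGBOOST_VERSION)
    for artifact in ("xgboost4j", "xgboost4j-spark")
]

for url in jar_urls:
    print(url)
```

Keeping the version in one place makes it harder to end up with mismatched xgboost4j and xgboost4j-spark jars if the version is changed later.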
To build the image, just run docker build:
docker build -t xgboost-pyspark:latest .
And then we can run the docker image:
docker run -d -p 8888:8888 -e JUPYTER_TOKEN=letmein -e GRANT_SUDO=yes --user root xgboost-pyspark
With JUPYTER_TOKEN we set up a password, and with GRANT_SUDO=yes --user root we make sure we will be able to add and edit notebooks.
Now we can log in to Jupyter by opening localhost:8888 in a browser. Once we enter the password (letmein in my case) we will be able to use it.
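Jupyter also accepts the token as a token query parameter in the URL, which skips the login form. The snippet below is a minimal sketch of how that URL is put together, assuming the port mapping and token from the docker run command above; the helper name jupyter_login_url is hypothetical.

```python
from urllib.parse import urlencode


def jupyter_login_url(token: str, host: str = "localhost", port: int = 8888) -> str:
    """Build a Jupyter URL that carries the token as a query parameter."""
    query = urlencode({"token": token})  # escapes special characters in the token
    return f"http://{host}:{port}/?{query}"


# Values taken from the docker run command: -p 8888:8888 and JUPYTER_TOKEN=letmein
login_url = jupyter_login_url("letmein")
print(login_url)
```

Using urlencode rather than plain string concatenation keeps the URL valid if the token contains characters such as & or spaces.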
We will use findspark to connect Jupyter to the Spark installation, so we import the following dependencies:
import findspark
findspark.init()
import pyspark
from pyspark.sql.session import SparkSession
Then we can create a Spark session:
spark = SparkSession\
.builder\
.appName("PySpark Session")\
.getOrCreate()
Next we need to load the Python wrapper that is available in the image:
spark.sparkContext.addPyFile("sparkxgb.zip")
And we will be able to import the XGBoostEstimator and run any PySpark ML pipeline with it:
from sparkxgb import XGBoostEstimator
Hope you found this tutorial useful. You can check my other Medium article that looks into using the XGBoost API with PySpark in more depth.
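A footnote on the addPyFile step: if the import of sparkxgb fails, a likely cause is that the notebook's working directory does not contain sparkxgb.zip, or that the archive is incomplete. The sketch below is an optional sanity check using only the standard library. It assumes the archive exposes sparkxgb as a top-level package or module, which is what the import statement requires; the helper name zip_has_package is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_has_package(zip_path: str, package: str = "sparkxgb") -> bool:
    """Return True if zip_path is a zip archive with `package` at its root."""
    path = Path(zip_path)
    # A missing file or a broken download is reported as False, not an exception.
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Assumption: the wrapper is importable as a top-level package or module.
    return any(
        name.startswith(f"{package}/") or name == f"{package}.py"
        for name in names
    )


# In the notebook this would be called before addPyFile, for example:
# if zip_has_package("sparkxgb.zip"):
#     spark.sparkContext.addPyFile("sparkxgb.zip")
print(zip_has_package("sparkxgb.zip"))
```

The check only inspects file names inside the archive, so it runs without Spark and can be used in a plain Python shell inside the container.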





