Summary

This tutorial demonstrates how to set up a Docker image with PySpark ML and XGBoost integration for use with Jupyter notebooks.

Abstract

The tutorial outlines the process of creating a Docker image that combines a Jupyter-Spark notebook environment with the XGBoost machine learning library. It provides step-by-step instructions on downloading the necessary XGBoost JAR files, the sparkxgb.zip Python wrapper, and setting up the environment to run PySpark ML pipelines with XGBoost. The image is built using the jupyter/pyspark-notebook base image, and the tutorial includes commands for building and running the Docker container. Once set up, users can access the Jupyter environment via a web browser, authenticate with a predefined token, and begin using PySpark with XGBoost by initializing a Spark session and importing the XGBoostEstimator. The author also references a more detailed article on using XGBoost with PySpark for those interested in a deeper dive.

Opinions

The author believes that integrating XGBoost with PySpark ML is valuable for users, as evidenced by the creation of the Docker image for this purpose.
The use of a pre-built base image (jupyter/pyspark-notebook) is recommended for convenience and compatibility.
Setting up a Jupyter environment with PySpark and XGBoost is presented as a straightforward process, with the expectation that users will find the provided instructions clear and easy to follow.
The author assumes that users will appreciate the inclusion of a password (JUPYTER_TOKEN) for secure access to the Jupyter environment.
Providing root access and the ability to grant sudo within the Docker container is considered important for users to have full control over their notebook environment.
The tutorial is written with the intent to be useful, and the author directs readers to another article for a more comprehensive understanding of using XGBoost with PySpark.

PySpark ML and XGBoost setup using a docker image

In this tutorial we will build and test a Docker image in which we can run a Jupyter notebook with XGBoost fully integrated.

TL;DR docker image here.

We will use a Jupyter PySpark notebook base image that is compatible with the XGBoost Python wrapper:

FROM jupyter/pyspark-notebook:7f1482f5a136

RUN cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j/0.72/xgboost4j-0.72.jar

RUN cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/ml/dmlc/xgboost4j-spark/0.72/xgboost4j-spark-0.72.jar

RUN cd work/ && wget https://github.com/dmlc/xgboost/files/2161553/sparkxgb.zip

RUN pip install findspark

This adds the two XGBoost JARs to Spark's jars directory, downloads the sparkxgb.zip Python wrapper into work/, and installs findspark.
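The two JAR URLs in the Dockerfile follow the standard Maven repository layout, so they can be derived from the coordinates. The sketch below shows that mapping; the helper name maven_jar_url and the XGBOOST_VERSION constant are made up for illustration, and the group id ml.dmlc is read off the URL paths above.

```python
def maven_jar_url(group_id, artifact_id, version,
                  base="https://repo1.maven.org/maven2"):
    """Build the download URL of a JAR from its Maven coordinates."""
    # Maven repositories store artifacts under
    # <group as path>/<artifact>/<version>/<artifact>-<version>.jar
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


# Hypothetical constant: the version pinned in the Dockerfile above.
XGBOOST_VERSION = "0.72"

jar_urls = [
    maven_jar_url("ml.dmlc", name, XGBOOST_VERSION)
    for name in ("xgboost4j", "xgboost4j-spark")
]

for url in jar_urls:
    print(url)
```

Keeping both JARs on the same version, as the Dockerfile does, avoids mixing builds of xgboost4j and xgboost4j-spark.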

To build the image, run docker build from the directory that contains the Dockerfile:

docker build -t xgboost-pyspark:latest .

And then we can run the docker image:

docker run -d -p 8888:8888 -e JUPYTER_TOKEN=letmein -e GRANT_SUDO=yes --user root xgboost-pyspark

JUPYTER_TOKEN sets the access token, which we use here as a password, and GRANT_SUDO=yes together with --user root makes sure we will be able to add and edit notebooks.

Now we can log in to Jupyter by opening localhost:8888 in a browser. Once we enter the password (letmein in this example), we can start using it.
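Jupyter also accepts the token as a token query parameter in the URL, which skips the login form. A minimal sketch that builds such a URL from the values passed to docker run; the helper name jupyter_login_url is hypothetical.

```python
from urllib.parse import urlencode


def jupyter_login_url(token, host="localhost", port=8888):
    """Return a Jupyter URL that carries the access token as a query parameter."""
    # urlencode escapes any characters in the token that are not URL-safe.
    query = urlencode({"token": token})
    return f"http://{host}:{port}/?{query}"


# Values taken from the docker run command above.
print(jupyter_login_url("letmein"))
```

If the container port is mapped to a different host port with -p, pass that host port instead of 8888.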

We will use findspark to connect Jupyter to the Spark installation, so that we can import the following dependencies:

import findspark
findspark.init()

import pyspark
from pyspark.sql.session import SparkSession

Then we can create a spark session:

spark = SparkSession\
        .builder\
        .appName("PySpark Session")\
        .getOrCreate()

Next we need to load the Python wrapper, sparkxgb.zip, that was downloaded into the image's work/ directory:

spark.sparkContext.addPyFile("sparkxgb.zip")
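The relative path "sparkxgb.zip" resolves against the notebook's working directory, so the call above expects the notebook to live next to the archive in work/. If the later import fails, a quick check like the sketch below can help. The helper name wrapper_is_usable is hypothetical, and it assumes the archive holds either a top-level sparkxgb/ package or a sparkxgb.py module.

```python
import zipfile
from pathlib import Path


def wrapper_is_usable(zip_path="sparkxgb.zip", package="sparkxgb"):
    """Return True if zip_path is a zip archive containing the given package."""
    path = Path(zip_path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # Assumption: the wrapper ships as a package directory or a single module.
    return any(
        name.startswith(package + "/") or name == package + ".py"
        for name in names
    )


print(wrapper_is_usable())
```

A False result usually means the notebook was created outside work/, in which case passing the full path to addPyFile is the simpler fix.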

Now we can import XGBoostEstimator and use it in any PySpark ML pipeline:

from sparkxgb import XGBoostEstimator
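As a rough sketch of what such a pipeline could look like, the function below chains a VectorAssembler with the estimator. The function name build_xgb_pipeline is hypothetical, and the featuresCol, labelCol and predictionCol keyword arguments are an assumption: the wrapper's signature is not shown in this tutorial, so check it before relying on this.

```python
def build_xgb_pipeline(feature_cols, label_col="label"):
    """Assemble feature columns into a vector and feed it to XGBoostEstimator."""
    # Imported lazily: sparkxgb only becomes importable after
    # spark.sparkContext.addPyFile("sparkxgb.zip") has run.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from sparkxgb import XGBoostEstimator

    # Combine the raw feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")

    # Assumption: the wrapper accepts the usual Spark ML column parameters.
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol=label_col,
        predictionCol="prediction",
    )
    return Pipeline(stages=[assembler, xgboost])
```

Calling fit on the returned pipeline with a training DataFrame, and then transform on the fitted model, follows the usual Spark ML pattern.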

I hope you found this tutorial useful. You can check my other Medium article, which looks into using the XGBoost API with PySpark in more depth.