Bogdan Cojocar

Summary

The website provides a tutorial on how to integrate XGBoost with the Spark Notebook, a tool for Apache Spark and Scala data analysis.

Abstract

The web content outlines a comprehensive guide for installing XGBoost, an optimized distributed gradient boosting library, within the Spark Notebook environment. It begins by instructing users to clone the XGBoost repository from GitHub and build the project, with additional steps for Mac users. The next step involves creating a JAR file for XGBoost using Maven, with the option to skip tests for a faster installation. Finally, the tutorial explains how to include the XGBoost JAR into the Spark Notebook by adding it as a dependency in the notebook's metadata and restarting the kernel to make XGBoost available for use in data analysis tasks.

Opinions

  • The author provides a clear, step-by-step approach, assuming the reader may not have prior experience with XGBoost or Spark Notebook.
  • There is an emphasis on the importance of setting up JAVA_HOME correctly and having Maven installed, indicating these are common stumbling blocks.
  • The author suggests skipping tests during the Maven installation to save time, implying that the tests may be time-consuming and not critical for users following the tutorial.
  • The inclusion of screenshots for adding dependencies in the Spark Notebook metadata suggests a user-friendly approach, anticipating that visual guidance may be helpful for users.
  • The author notes the specific version of XGBoost used at the time of writing, which could be important for users to achieve the same results or troubleshoot issues.

How to make XGBoost available in the Spark Notebook

This is a step-by-step tutorial on how to install XGBoost (an efficient implementation of gradient boosting) in the Spark Notebook (a tool for Apache Spark and Scala analysis and plotting, similar to the Jupyter notebook).

If you don’t have the Spark Notebook installed, you can follow this quick guide.

Step 1: Build XGBoost

For this step we need to clone the repository from GitHub:

git clone --recursive https://github.com/dmlc/xgboost

Next we need to go into the newly cloned repository and build the project:

cd xgboost
make -j4

Mac users need one additional step, run inside the xgboost directory before building with make:

cp make/config.mk ./config.mk
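Since the Mac-only copy has to happen before make, the order of Step 1 can be sketched as one script. This is a minimal sketch rather than an official build script: the function names is_mac and build_xgboost are hypothetical, and it assumes that uname -s reports Darwin on a Mac.

```shell
#!/bin/sh
# Sketch of Step 1. Nothing is cloned or built until build_xgboost is called.

# Succeeds when the given OS name is macOS ("Darwin" is what `uname -s` prints there).
is_mac() {
    [ "$1" = "Darwin" ]
}

# Clone the repository, copy config.mk on a Mac, then build.
build_xgboost() {
    git clone --recursive https://github.com/dmlc/xgboost || return 1
    cd xgboost || return 1
    if is_mac "$(uname -s)"; then
        # Mac-only step from the tutorial; must run before make.
        cp make/config.mk ./config.mk || return 1
    fi
    make -j4
}

# Usage: uncomment the next line to run the whole step.
# build_xgboost
```

Keeping the build inside a function means the file can be sourced and inspected without starting a clone.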

Step 2: Create the XGBoost jar

Before we can build the jar we need to make sure JAVA_HOME is set and points to the JDK directory. We also need Maven installed. Next we run Maven in the xgboost directory to publish the artifact to the local Maven repository (if there is no pom.xml at the top level of your checkout, the Maven project for the JVM packages lives in the jvm-packages subdirectory):

mvn install

This command also runs the tests, which we can skip to make the installation faster. To do that, run this command instead:

mvn -DskipTests install
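Before running either Maven command, it can save time to confirm the two prerequisites mentioned at the start of this step. The function below is a small sketch with a hypothetical name, check_build_prereqs. It treats the presence of bin/javac under JAVA_HOME as a sign that the variable points to a JDK, which is an assumption about a standard JDK layout.

```shell
#!/bin/sh
# Prints one line per problem found; prints nothing when both checks pass.
check_build_prereqs() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
    elif [ ! -x "$JAVA_HOME/bin/javac" ]; then
        # A JRE-only directory has no javac, so this usually means "not a JDK".
        echo "JAVA_HOME does not point to a JDK (no bin/javac): $JAVA_HOME"
    fi
    if ! command -v mvn >/dev/null 2>&1; then
        echo "mvn not found on PATH"
    fi
}

# Usage: run the check and read any messages before calling mvn install.
# check_build_prereqs
```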

Step 3: Include the XGBoost jar into the Spark Notebook

As we now have the jar available in our local repository we can include it as a dependency in the notebook. In order to do that we need to go to the notebook and open the metadata window in the menu:

Edit -> Edit Notebook Metadata

A window will open containing a configuration JSON file. We will add the xgboost dependency in the customDeps property as shown below:

"customDeps": [
   "ml.dmlc:xgboost4j-spark:0.8-SNAPSHOT"
]

Note that when this tutorial was written I was using XGBoost version 0.8, so adjust the version in the dependency to match the one you built.
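If the notebook later fails to resolve the dependency, it helps to check that the artifact from Step 2 really landed in the local repository. The sketch below converts a groupId:artifactId:version string into the matching directory. The function name local_repo_dir is hypothetical, and the default of ~/.m2/repository assumes Maven's standard local repository location has not been changed in your settings.

```shell
#!/bin/sh
# Convert "groupId:artifactId:version" into the matching directory
# under a Maven local repository (second argument, default ~/.m2/repository).
local_repo_dir() {
    coords=$1
    repo=${2:-$HOME/.m2/repository}
    group=${coords%%:*}      # text before the first colon
    rest=${coords#*:}        # text after the first colon
    artifact=${rest%%:*}
    version=${rest#*:}
    # Dots in the groupId become directory separators.
    group_path=$(printf '%s' "$group" | tr '.' '/')
    printf '%s/%s/%s/%s\n' "$repo" "$group_path" "$artifact" "$version"
}

# Usage: list the installed files for the coordinate used in customDeps.
# ls "$(local_repo_dir ml.dmlc:xgboost4j-spark:0.8-SNAPSHOT)"
```

If the directory is missing or contains no jar, the install in Step 2 most likely did not finish.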

The last step is to restart the kernel of the notebook so that the dependency becomes available. To restart the kernel, open the Kernel menu and choose Restart:

Kernel -> Restart

Now we should be able to import XGBoost. When we add import ml.dmlc.xgboost4j.scala.spark to a cell of the notebook and run it (in the menu Cell -> Run), it should run successfully.

Xgboost
Machine Learning
Apache Spark
Scala
Big Data