Train XGBoost Models in Amazon SageMaker in 4 Simple Steps
How to train & deploy XGBoost models as endpoints using SageMaker
Getting started with Amazon SageMaker can be challenging, as there are many tricks that AWS just expects you to know. In return, once you get a handle on them, you can significantly speed up the deployment of your ML models without having to worry about Docker or setting up compute resources.
The goal of this post is to simplify getting started with SageMaker as much as possible and give you a quick walkthrough of what it takes to deploy an XGBoost classifier, as you typically would in an industry setup. Most tutorials recite the AWS documentation directly and are not very applicable if you want to tailor your models to a realistic problem. For example, the sample XGBoost Customer Churn Notebook only predicts the probability of a class, not the individual classes (0 or 1) themselves.
What we are going to build
If you’re like me and would love to get a puppy, but due to space limitations just can’t at the moment, we can at least do projects that involve them, right? Let’s build a simple XGBoost model that tells people whether they should get a Beagle or a German Shepherd based on how big their home is.

More specifically, we’re going to use a dummy dataset that consists of an X variable representing the house area in square feet and a y target variable that is either 0 (Beagle) or 1 (German Shepherd). For simplicity, we’ve set Beagle to be most suitable for homes smaller than 500 sq. ft. and German Shepherd for those larger than 500 sq. ft.
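To make the setup concrete, here is a minimal sketch of how such a dummy dataset could be generated with NumPy and pandas. The function name `make_dog_dataset`, the row count, the 100 to 1,000 sq. ft. range, and the random seed are all assumptions for illustration. The description above doesn't say which breed a home of exactly 500 sq. ft. gets, so this sketch arbitrarily labels it 0 (Beagle).

```python
import numpy as np
import pandas as pd


def make_dog_dataset(n_rows=1000, seed=42):
    """Build a toy dataset: house area (sq. ft.) -> 0 (Beagle) or 1 (German Shepherd)."""
    rng = np.random.default_rng(seed)
    # Assumed range of house areas; the upper bound of integers() is exclusive.
    area = rng.integers(100, 1001, size=n_rows)
    # Homes larger than 500 sq. ft. get a German Shepherd (1), the rest a Beagle (0).
    # Treating exactly 500 sq. ft. as a Beagle home is an assumption of this sketch.
    label = (area > 500).astype(int)
    # Target column first, feature column second.
    return pd.DataFrame({"y": label, "X": area})


df = make_dog_dataset()
print(df.head())
print(df["y"].value_counts())
```

The target is placed in the first column because that is the layout SageMaker's built-in XGBoost algorithm expects for CSV training data, so the same frame can later be written out without a header row.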

Before we dive in, you might be wondering how much learning SageMaker is going to cost you. According to the AWS Pricing page, and assuming you’ll be working in the US East 1 region and that it takes you about 4 hours to write the training script, 0.5 hours to train the model and 1 hour to test the endpoint, it would end up costing less than $1!
- On-demand Notebook Instance (ml.t2.medium): $0.0464/hr * 4hr ≈ $0.19
- Training Instance (ml.m4.xlarge): $0.24/hr * 0.5hr = $0.12
- Real-time Inference (ml.m4.xlarge): $0.24/hr * 1hr = $0.24
- Total: <$1

Another funny thing is that the majority of SageMaker tutorials expect you to magically know how their infrastructure is set up. To mitigate that, here is a simplified version of the general workflow and how it’ll be applied to our project:

👉 Steps
- Provision the Notebook Instance
- Store the training/validation data inside the S3 bucket
- Train the model & output the ML model artifact to the S3 bucket
- Deploy & Test SageMaker Inference Endpoint
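As a rough illustration of how the middle two steps fit together, here is a small sketch of one possible S3 layout for the project. The bucket name, prefix, folder names and the `s3_layout` helper are all hypothetical; nothing here talks to AWS, it only builds the URIs you would point the training job at.

```python
def s3_layout(bucket, prefix):
    """Return the S3 URIs this sketch assumes for data and model artifacts."""
    base = f"s3://{bucket}/{prefix}"
    return {
        # Step 2: where the training and validation CSVs would be uploaded.
        "train": f"{base}/train/train.csv",
        "validation": f"{base}/validation/validation.csv",
        # Step 3: where the training job would write the model artifact.
        "output": f"{base}/output",
    }


# Hypothetical bucket and prefix, replace with your own.
paths = s3_layout("my-example-bucket", "dog-breed-xgboost")
for name, uri in paths.items():
    print(f"{name}: {uri}")
```

Keeping the inputs and the output under one prefix makes it easier to find (and later clean up) everything that belongs to a single experiment.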
1. Create SageMaker (SM) Jupyter Notebook Instance



Give it a name:

You can leave the default IAM role to allow SM to access your S3 buckets:


Once the Notebook Instance is provisioned, you can start JupyterLab and launch a Conda Python 3 environment.

Side Note (if life was only this easy)
How we’d typically train an XGBoost model locally






