Democratizing Machine Learning with AWS SageMaker AutoML
An Overview
Introduction
AI remains one of the hottest topics right now, especially since the rise of ChatGPT. Many companies are trying to use AI to extract insights from their data that help them optimize processes or build better products.
However, building effective AI models requires expertise in several areas, such as data preprocessing, model selection and hyperparameter tuning. All of these steps are time-consuming and require specialized knowledge.
This is where AutoML comes into play: it automates many of the steps required for building an AI model.
AutoML is rapidly becoming a popular solution for businesses and data scientists. It empowers organizations to leverage ML and AI to make informed decisions, without requiring them to be experts in data science. With the increasing demand for ML in businesses, AutoML provides an easy and efficient way to create accurate models, regardless of one’s expertise.
In this article, we’ll examine one very popular AutoML tool available in the market today, AWS SageMaker AutoML, and demonstrate how it can be used to solve complex ML use cases.
I will train a model with the old-fashioned manual approach and compare the results to the ones that AWS SageMaker AutoML produces.
I will use the credit card fraud detection dataset from Kaggle for this comparison [1]. You can find the dataset here.
By the end of this article, you’ll have a clear understanding of how AutoML can help leverage ML to drive meaningful insights and make informed decisions.
AWS SageMaker AutoML
Figure 1 gives an overview of the different steps that AWS SageMaker AutoML covers.
It includes the following steps:
- Data Preparation: You can easily upload your data to Amazon S3. Once your data is uploaded, SageMaker AutoML automatically analyzes your data in order to detect any missing values, outliers or data types that need to be transformed.
- Automatic Model Creation: AWS SageMaker AutoML automatically trains multiple machine learning models with different hyperparameters and algorithms to determine the best model for your data. It also provides automatic model tuning, which adjusts the hyperparameters of the selected models to further optimize their performance. In addition, it creates the notebooks used for the model selection for you, so that you have full visibility into what is executed during this process.
- Model Deployment: Once the best model has been selected, AWS SageMaker can deploy it to a SageMaker endpoint or a batch transform job, where it can be used to make predictions on new data. On top of that, AWS SageMaker Model Monitor can be used to alert you if any issues arise (like data drift, concept drift, …). It also provides tools for retraining the model with new data, as well as updating the model's hyperparameters or algorithms to improve performance.
AWS SageMaker AutoML offers a Python SDK that can be used to start your AutoML job, and a GitHub repository with various notebook examples showing how to use the AutoML SDK for concrete ML use cases.
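To give a feeling for that SDK, here is a minimal, hedged sketch of what launching an AutoML job with the high-level Python SDK could look like (bucket, prefix and column name are placeholders; the rest of this article uses the lower-level boto3 client instead):
# Minimal sketch with the high-level SageMaker Python SDK (v2); paths and names are placeholders.
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()

automl = AutoML(
    role=role,
    target_attribute_name="Class",            # the column to predict
    problem_type="BinaryClassification",
    job_objective={"MetricName": "F1"},
    max_candidates=20,
    sagemaker_session=session,
)

# launches data analysis, feature engineering and model tuning in one call
automl.fit(inputs="s3://my-bucket/my-prefix/train/", wait=False)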
There are also other powerful and well-known AutoML tools available in the market, such as Google Cloud AutoML and H2O.ai, which also have their own unique strengths and weaknesses.
Google Cloud AutoML is known for its ease of use and intuitive interface, which makes it a good fit if you are new to ML and not deep into coding. It supports image, video, text and tabular data. You can read more about it here.
H2O.ai is known for its speed and scalability, making it a good option for large datasets and complex models. It offers interfaces in R, Python and a web GUI. You can read more about its features here.
Manual Training Approach
Before using AWS SageMaker AutoML to come up with a classifier for the credit card dataset, I first train a model the classic way: doing everything myself from scratch.
This gives me a baseline against which I can compare the AWS AutoML approach, with the expectation that AWS SageMaker AutoML outperforms my manual, semi-optimized approach.
For the manual approach, I make use of Scikit-learn and I will run through the steps highlighted in the next chapters.
You can also find the complete notebook in my GitHub repository here.
Data preparation
I load the dataset from a CSV file and first check the class distribution. This shows that the dataset is highly imbalanced, with only 0.17% of all samples being positive.
The dataset itself doesn’t contain any missing values.
I then split the dataset 80/20 into train and test sets and standardize the features with a StandardScaler (zero mean, unit variance). The scaler is fit only on the training set to avoid data leakage and overly optimistic results.
You can find the code for these steps below.
import sys
import os
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# step 1: Load the dataset from the csv file.
# You can download the dataset from Kaggle
filepath = os.path.join("data", "creditcard.csv")
df = pd.read_csv(filepath)
# step 2: check data imbalance on target
num_samples = len(df)
count_neg_class = np.sum(df["Class"] == 0)
count_pos_class = np.sum(df["Class"] == 1)
print(f"There are {count_neg_class} negative samples ({np.round(100 * count_neg_class / num_samples, 2)} % of total data).")
print(f"There are {count_pos_class} positive samples ({np.round(100 * count_pos_class / num_samples, 2)} % of total data).")
# step 3: split data into train and test set
X = df.drop(columns="Class").to_numpy()
y = df["Class"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# step 4: scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Typically, extensive Exploratory Data Analysis (EDA) would also be part of the data preparation step. But for the sake of this experiment, I did not do extensive EDA, as the dataset is already well-prepared for ML.
But keep in mind that this step is also crucial for the success of your ML training and typically takes a fair amount of time.
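If you still want a quick feel for the data, a minimal EDA sketch (using the imports from above) could look like this:
# Minimal EDA sketch: shape, basic statistics, class balance and correlation with the target.
print(df.shape)
print(df.describe())

# class distribution
sns.countplot(x="Class", data=df)
plt.show()

# correlation of each feature with the target
print(df.corr()["Class"].sort_values(ascending=False))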
Model Selection
The next step is to figure out which ML algorithm is best suited for the data. For this purpose, I first train a very simple baseline model using logistic regression, so that I have something simple to compare more complex algorithms against.
The goal should always be: keep it simple! Don't start with a neural network, which is harder to explain in the end, if a simpler algorithm like logistic regression could do the job.
The logistic regression model achieves an F1-Score of 70.6%. I am using the F1-Score for this dataset because it is highly imbalanced and accuracy would not be a meaningful measure: simply predicting every sample as negative would already lead to an accuracy of more than 99%!
You can find the code for training the baseline model below.
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
preds = log_model.predict(X_test)
print(f"Test Acc: {accuracy_score(y_test, preds)}")
print(f"Test F1-Score: {f1_score(y_test, preds)}")
print(f"Test Precision: {precision_score(y_test, preds)}")
print(f"Test Recall: {recall_score(y_test, preds)}")
Okay, we now have a baseline. Let's try out different classification algorithms with their default hyperparameters and see which one performs best on the data.
I used a 5-fold cross-validation to train each of the following models:
- decision tree
- support vector machine
- k-nearest neighbors
- random forest
- ada-boost
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_predict
dict_models = {
"Decision Tree": DecisionTreeClassifier(),
"SVM": SVC(),
"Nearest Neighbor": KNeighborsClassifier(),
"Random Forest": RandomForestClassifier(),
"Ada Boost": AdaBoostClassifier()
}
# train all models by using the models dictionary
results_dict = {}
for model_name, model in dict_models.items():
    print(f"Start training {model_name}...")
    preds = cross_val_predict(model, X_train, y_train, cv=5)
    f1 = f1_score(y_train, preds)
    precision = precision_score(y_train, preds)
    recall = recall_score(y_train, preds)
    print(f"F1-Score: {f1}")
    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print("\n\n")
    results_dict[model_name] = (f1, precision, recall)
# create a pandas dataframe with the results on sort on f1-score
df_results = (pd.DataFrame.from_dict(results_dict, orient="index", columns=["F1-Score", "Precision", "Recall"])
.sort_values(by="F1-Score", ascending=False))
df_results
The random forest algorithm delivered the best results with an F1-Score of 86.9%, followed by the nearest neighbor algorithm with an F1-Score of 84.8%. Not bad!
The next step is to fine-tune the winner (random forest).
For this, I select a set of hyperparameter values to try and use a randomized cross-validation search to find the combination of hyperparameters that leads to the best model.
The code for this evaluation:
from sklearn.model_selection import RandomizedSearchCV
params = {
"n_estimators": [10, 20, 30, 60, 80, 100],
"criterion" : ["gini", "entropy"],
"max_depth" : [4, 5, 10, None],
"min_samples_split": [2, 4, 6],
"class_weight": [None, "balanced", "balanced_subsample"]
}
clf_rf = RandomizedSearchCV(RandomForestClassifier(), params, n_iter=50, scoring="f1", cv=5, verbose=1, n_jobs=-1)
clf_rf.fit(X_train, y_train)
# let's print the best score and save the best model
print(f"Best f1-score: {clf_rf.best_score_}")
print(f"Best parameters: {clf_rf.best_params_}")
best_random_forest_model = clf_rf.best_estimator_
The best model scored more or less the same as the one that I got without tuning the hyperparameters. What a waste of time ;)
Model Evaluation
Last but not least, I evaluate my final model on the hold-out test set to see how it would perform on real-world data.
final_preds = best_random_forest_model.predict(X_test)
f1 = f1_score(y_test, final_preds)
precision = precision_score(y_test, final_preds)
recall = recall_score(y_test, final_preds)
print(f"F1-Score: {f1}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
This gives me an F1-Score of 82%, which is close to the validation results but a bit below them.
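Since confusion_matrix is already imported above, it is also worth taking a quick look at where the remaining errors come from (a small optional addition):
# rows = actual class (0, 1), columns = predicted class (0, 1)
print(confusion_matrix(y_test, final_preds))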
I know that one could get more out of the model with additional tuning. But the goal of this article is to do some basic ML and compare the results to AutoML, to see how well AutoML performs and how much effort an AutoML job takes compared to only doing basic training myself.
Training with AWS SageMaker AutoML
Okay, now that I have my baseline, I can try to get a better model with less effort using AWS SageMaker AutoML.
I will again guide you through the different steps and provide the code snippets for all of these steps. You can also find the complete notebook in my GitHub repository here.
Data Upload
For SageMaker AutoML to work, the data needs to be stored in S3. Therefore, I first create a bucket and then upload the CSV file to this bucket (Gif 1).
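Gif 1 shows the console flow; the same can be done programmatically, for example with boto3 (the bucket name is the one used later in this article, the region is an assumption you should adapt):
# Sketch: create the S3 bucket and upload the Kaggle CSV with boto3 instead of the console.
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")  # region is an assumption, use your own

# outside us-east-1, a LocationConstraint matching the region is required
s3.create_bucket(
    Bucket="patrick-fraud-detection-ml-kaggle",
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)
s3.upload_file("data/creditcard.csv", "patrick-fraud-detection-ml-kaggle", "creditcard.csv")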
Setup SageMaker Notebook
The next step is to set up the environment in which I can run the AutoML job.
To achieve this, I first create a notebook in SageMaker where I can then write and run the code.
In AWS, access from one service to another is handled with IAM roles. The SageMaker notebook gets an IAM role with some default access rights attached. But to access the S3 bucket I created earlier, I first have to explicitly adapt the policy attached to this role.
Gif 2 shows the complete process of creating the notebook and adapting the policy of the notebook role. The quality is unfortunately not that good, but I still think it is valuable to see the sequence of actions. I also added screenshots of the exact settings used when creating the notebook (Figure 2), and the complete policy I attached to the notebook role can be found in my GitHub repository here.
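As a rough orientation, the additional permissions essentially boil down to read access on that bucket. A minimal policy statement could look like the following (written as a Python dict for convenience; this is an assumption, the actual policy I attached is in the linked repository):
# Sketch of a minimal S3 read-only policy for the notebook role (assumption, not the exact policy used).
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::patrick-fraud-detection-ml-kaggle",
                "arn:aws:s3:::patrick-fraud-detection-ml-kaggle/*"
            ]
        }
    ]
}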
Run AWS SageMaker AutoML Job
Now I can finally start running some code.
The code I am running is mostly copied and adapted from this AWS tutorial notebook.
As a first step, I load the data from S3 into a Pandas dataframe and set up some general variables required for SageMaker AutoML later.
import numpy as np
import pandas as pd
import boto3
import sagemaker
import os, sys
# get some variables required for AutoML later
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.Session().region_name
prefix = 'sagemaker/fraud-detection-auto-ml'
# Role when working on a notebook instance
role = sagemaker.get_execution_role()
# get some sagemaker clients
sm = boto3.Session().client(service_name='sagemaker',region_name=region)
sm_rt = boto3.Session().client('runtime.sagemaker', region_name=region)
# load data from s3
bucket_data = 'patrick-fraud-detection-ml-kaggle'
filename = 'creditcard.csv'
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket_data, Key=filename)
df = pd.read_csv(obj['Body'])  # 'Body' contains the file content
Then I split the dataset into a training set and a hold-out test set. The latter will be used to compare the AutoML model to my own trained model. The training data is then uploaded to the S3 bucket created by SageMaker so that the AutoML job can access it directly from S3.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.2)
# Save to CSV files and upload to S3
train_file = "automl-train.csv"
train_data.to_csv(train_file, index=False, header=True, sep=',') # Need to keep column names
train_data_s3_path = sess.upload_data(path=train_file, key_prefix=prefix + "/train")
print("Train data uploaded to: " + train_data_s3_path)
# save the test set only to a local CSV file
# -> it will be sent to the inference endpoint later
test_file = "automl-test.csv"
test_data.to_csv(test_file, index=False, header=False, sep=',')
Now I can set up the AutoML job and launch it. You can find more about the required input parameters and settings in the SageMaker SDK documentation here.
from time import gmtime, strftime, sleep
# setup config for input data
input_data_config = [{
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': 's3://{}/{}/train'.format(bucket,prefix)  # must match the key_prefix used when uploading the training data
}
},
'TargetAttributeName': 'Class' # the column we want to predict
}
]
# setup config for output data
output_data_config = { 'S3OutputPath': 's3://{}/{}/output'.format(bucket,prefix) }
# Optional parameters
problem_type = 'BinaryClassification'
job_objective = { 'MetricName': 'F1' } # using F1 because of highly imbalanced dataset
# launch the AutoML job
# but: limit to max. 20 candidates to limit overall execution time
timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
auto_ml_job_name = 'fraud-detection-' + timestamp_suffix
sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,
InputDataConfig=input_data_config,
OutputDataConfig=output_data_config,
AutoMLJobConfig={"CompletionCriteria": {"MaxCandidates": 20}},
AutoMLJobObjective=job_objective,
ProblemType=problem_type,
RoleArn=role)
The job now runs in the background and creates the required AWS resources for you.
You can then run the following code in order to track the progress of the AutoML job:
job_run_status = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobStatus']
print(job_run_status)
while job_run_status not in ('Failed', 'Completed', 'Stopped'):
    describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
    job_run_status = describe_response['AutoMLJobStatus']
    print(describe_response['AutoMLJobStatus'] + " - " + describe_response['AutoMLJobSecondaryStatus'])
    sleep(60)
The job is running through the following stages:
- Analyzing Data
- Feature Engineering
- Model Tuning
- Merging AutoML Tasks Reports
AWS SageMaker generates two notebooks for you: one for exploring the data and one for defining the different candidates that are evaluated on the dataset. You can get these notebooks if you are interested in the code AWS SageMaker runs during these stages.
It is also possible to list all the experiments that AWS SageMaker has executed, as well as all the explored candidates. I did not include the full code in this article (you can find it in my notebook), but the sketch below shows roughly what this looks like.
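As a rough sketch (check my notebook for the full version), listing the candidates and locating the generated notebooks boils down to two boto3 calls:
# Sketch: list the candidates explored by the AutoML job, best objective value first.
candidates = sm.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name,
    SortBy="FinalObjectiveMetricValue",
    SortOrder="Descending",
)["Candidates"]
for candidate in candidates:
    metric = candidate.get("FinalAutoMLJobObjectiveMetric", {})
    print(candidate["CandidateName"], metric.get("Value"))

# S3 locations of the two auto-generated notebooks
artifacts = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)["AutoMLJobArtifacts"]
print(artifacts["DataExplorationNotebookLocation"])
print(artifacts["CandidateDefinitionNotebookLocation"])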
Evaluate Best Candidate on the Test Set
Now it is time to test the best candidate on the hold-out test set. This is the most interesting part to me, as it shows whether AutoML can do better than my manual approach.
First, I am retrieving the best candidate from the AWS SageMaker AutoML job.
best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
Next, I host this model as an endpoint in AWS, to which I can send data for inference.
timestamp_suffix = strftime("%d-%H-%M-%S", gmtime())
model_name = best_candidate_name + timestamp_suffix + "-model"
# create a model in SageMaker that can be hosted as endpoint
model_arn = sm.create_model(
Containers=best_candidate["InferenceContainers"], ModelName=model_name, ExecutionRoleArn=role
)
# setup config for endpoint (including instance type)
epc_name = best_candidate_name + timestamp_suffix + "-epc"
ep_config = sm.create_endpoint_config(
EndpointConfigName=epc_name,
ProductionVariants=[
{
"InstanceType": "ml.m5.2xlarge",
"InitialInstanceCount": 1,
"ModelName": model_name,
"VariantName": "main",
}
],
)
# deploy endpoint
ep_name = best_candidate_name + timestamp_suffix + "-ep"
create_endpoint_response = sm.create_endpoint(EndpointName=ep_name, EndpointConfigName=epc_name)
# wait until endpoint is ready for inference
sm.get_waiter("endpoint_in_service").wait(EndpointName=ep_name)
And last but not least, I read the CSV file with the test data and send the data to the final model for inference. I then compare the predictions to the ground truth and count up the true positives, true negatives, false positives and false negatives in a rather manual way.
I honestly was just too lazy to implement something on my own here and adapted the code again from the AWS SageMaker tutorial notebook that you can find here.
tp = tn = fp = fn = count = 0
with open('automl-test.csv') as f:
    lines = f.readlines()
    for l in lines:  # the test file was written without a header row, so no line is skipped
        l = l.split(',')  # split CSV line into features
        label = l[-1]     # store 0/1 label
        l = l[:-1]        # remove label
        l = ','.join(l)   # rebuild CSV line without label
        response = sm_rt.invoke_endpoint(EndpointName=ep_name, ContentType='text/csv', Accept='text/csv', Body=l)
        response = response['Body'].read().decode("utf-8")
        # print("label %s response %s" % (label, response))
        if '1' in label:
            # sample is positive
            if '1' in response:
                # true positive
                tp = tp + 1
            else:
                # false negative
                fn = fn + 1
        else:
            # sample is negative
            if '0' in response:
                # true negative
                tn = tn + 1
            else:
                # false positive
                fp = fp + 1
        count = count + 1
        if count % 100 == 0:
            sys.stdout.write(str(count) + ' ')
# get final scores
# Confusion matrix
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
f1 = (2*precision*recall)/(precision+recall)
print ("Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1: %.4f" % (accuracy, precision, recall, f1))
The final model achieves an F1-Score of 96% on the hold-out test set!
This is awesome! As a comparison, the model I trained with Scikit-Learn only achieved an F1-Score of 82%.
Shortcomings of AutoML
AutoML is definitely powerful and can help speed up the ML development cycle. But there are also shortcomings to using AutoML that should be recognized.
One of the main limitations of AutoML is that it can be somewhat of a black box approach, as it automates much of the process of building a machine learning model. This can make it difficult for data scientists to fully understand how the model works and may limit their ability to fine-tune the model or debug any issues that arise.
AWS tries to counter this by providing Jupyter notebooks that show the code for stages like data exploration and candidate definition. This already helps to get some insight, but if there are issues with the code or with the findings of the AutoML job, the data scientist has no way to change the underlying code, as everything is auto-generated.
Another potential drawback of using AutoML is that it can be less flexible than a more traditional approach to building a machine learning model. AutoML is optimized for efficiency and ease of use, but this may come at the expense of customization options or the ability to work with specialized datasets or models.
For this article, I used a fairly simple dataset, and AWS SageMaker AutoML was able to train a good candidate. For a more challenging dataset, however, it is not clear how well AutoML would perform.
Personal Findings
In this chapter, I want to highlight my personal findings while using AWS SageMaker AutoML.
The first thing I noticed is that it is quite complex to use and to set up. I personally have some experience with AWS in general and with coding, but for someone without much prior knowledge, the learning curve of AWS SageMaker AutoML could be too steep.
I also think the documentation is not that good. In the beginning I had a hard time finding anything about end-to-end use cases in it. I eventually found video trainings and GitHub repositories with example notebooks, but I personally prefer written documentation over videos.
And please be careful about the costs. I initially played around with SageMaker AutoML and ran the AutoML job several times because I wanted to try out different things. That wasn't as cheap as I had hoped, and I ended up paying more for this experiment than planned.
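One thing that helps here: tear down the inference resources as soon as the evaluation is done. With the names from the deployment code above, a cleanup sketch looks like this:
# Clean up the hosted resources to avoid ongoing costs once the evaluation is finished.
sm.delete_endpoint(EndpointName=ep_name)
sm.delete_endpoint_config(EndpointConfigName=epc_name)
sm.delete_model(ModelName=model_name)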
Conclusion
In this article, I compared AWS SageMaker AutoML to manually training a model using Scikit-Learn.
As the dataset, I chose the credit card fraud detection dataset, which is challenging because of its high class imbalance.
I then manually tried to find a good classifier and did the same with AWS SageMaker AutoML.
I finally compared the results of both approaches on a hold-out test set, where AWS SageMaker AutoML outscored my manual approach with an F1-Score of 96% compared to 82%.
This shows that it makes a lot of sense to use AWS SageMaker AutoML to quickly come up with a usable classifier for your dataset.
You don't even need deep ML expertise to make use of AWS SageMaker AutoML.
Of course, my manual approach has to be taken with a grain of salt.
I did not invest a lot of time in fine-tuning my final classifier, and I am pretty sure that with a bit more time invested I would have achieved better results on the hold-out test set as well.
At least I hope so ;)
But the idea of this article was to show how easy it can be to use an AutoML library to come up with a classifier for your dataset.
The result doesn't need to be the final model that you put into a product, but you can at least run a quick initial proof of concept to see whether you can get something useful out of your data.
Outlook
So far I have only taken a deeper look at AWS SageMaker AutoML, but it would definitely be interesting to check out other offerings as well, like Google Cloud AutoML.
I am planning to do a complete evaluation of Google Cloud's Vertex AI offering in the near future and will write about my findings on Medium as well.
Then I can also talk specifically about how SageMaker performs compared to Google Cloud's offering.
So follow me in case you don’t want to miss that!
Thank you for reading my article to the end! I hope you enjoyed it. If you want to read more articles like this in the future, follow me to stay updated.
Join my email list if you want to learn more about machine learning and the cloud.
References
[1]: Machine Learning Group — ULB, “Credit Card Fraud Detection”, Kaggle, 2018, Database Contents License (DbCL) v1.0
[2]: AWS, Amazon SageMaker Autopilot (accessed 4/1/2023)