Prosenjit Chakraborty

Summary

The web content discusses the use of Databricks AutoML for automated machine learning tasks and the deployment of trained models using MLflow model serving.

Abstract

The article provides an overview of Databricks' AutoML capabilities, detailing how it automates the machine learning process by preparing datasets, performing trials, and evaluating multiple models to identify the best one based on performance metrics like the r2 score. It also outlines the steps for model serving, including registering the model in the Model Registry, enabling model serving for online predictions, and testing the deployed model using REST API calls. The process involves creating a Delta table from a sample dataset, configuring the AutoML experiment, and using the generated code to train and deploy the model. The article emphasizes the ease of use and the ability to manage the model lifecycle, suggesting Azure API Management for production API management.

Opinions

  • The author endorses Databricks AutoML for its ability to streamline the machine learning workflow, from data preparation to model evaluation.
  • The article suggests that the AutoML feature is user-friendly, providing a Python notebook with source code for transparency and reproducibility.
  • The author implies that the integration of MLflow with Databricks enhances the model management and deployment process.
  • The recommendation of Azure API Management indicates the author's preference for this platform in managing APIs for production use.

Databricks — AutoML & Model Serving

In our previous blog post, we covered the different MLflow components, concentrating on tracking, managing models and deploying them into the Model Registry. In this post, we’ll look at the Databricks AutoML feature and MLflow model serving.

AutoML

Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

To start, let’s prepare the training dataset (I’ve taken a sample dataset from sklearn) and save it as a Delta table.

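%python
from sklearn.datasets import fetch_california_housing

# as_frame=True returns a Bunch; its .frame attribute is the pandas DataFrame
# holding the features together with the target column
input_pdf = fetch_california_housing(as_frame=True).frame
chDf = spark.createDataFrame(input_pdf)

# Write the data in Delta format, then register it as a table
chDf.write\
    .format("delta")\
    .save("/mnt/delta/california_housing")
spark.sql("CREATE TABLE default.california_housing USING DELTA LOCATION '/mnt/delta/california_housing'")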

Once the training dataset is ready, we can use the Databricks AutoML experience to train multiple models and present the results to us.

Create AutoML Experiment

We’ll select a cluster (I’m using Azure Databricks Runtime for Machine Learning 8.3 ML Beta). As we’re trying to solve a regression problem, we’ll select Regression as the ML problem type (the other available option is Classification).

Next, we’ll browse to and select the training dataset, i.e. the Delta table we’ve created, and then choose the prediction target column.

AutoML Experiment Configuration screen.

Once the right configuration is set, AutoML will start training. The training process will last for 60 minutes. We can stop the process if required.
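
The same experiment can also be started from a notebook instead of the UI. The following is a minimal sketch, assuming a Databricks Runtime ML cluster where the `databricks.automl` package is available. The helper name `run_regression_automl` is hypothetical, and `MedHouseVal` is assumed to be the prediction target because it is the target column in the sklearn California housing frame.

```python
def run_regression_automl(dataset, target_col, timeout_minutes=60):
    """Start an AutoML regression experiment and return its summary object.

    The import is deferred because databricks.automl only exists on
    Databricks Runtime ML clusters.
    """
    from databricks import automl

    return automl.regress(
        dataset=dataset,
        target_col=target_col,
        timeout_minutes=timeout_minutes,
    )


# Usage inside a Databricks notebook (not executed here):
# df = spark.table("default.california_housing")
# summary = run_regression_automl(df, "MedHouseVal", timeout_minutes=60)
```

The `timeout_minutes` argument plays the same role as the 60-minute limit shown in the UI, so a shorter value can be used for a quick trial run.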

While AutoML is training the model.

After an hour, training completes and AutoML identifies the best model based on its r2 score.
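
For readers less familiar with the metric, the r2 score (coefficient of determination) compares the model's squared error with that of a baseline that always predicts the mean. The sketch below is a plain-Python illustration with made-up numbers, not the code AutoML runs internally.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Illustrative values only
actual = [3.0, 2.0, 4.0, 5.0]
predicted = [2.5, 2.0, 4.5, 5.0]
print(round(r2_score(actual, predicted), 4))  # 0.9
```

A score of 1 means perfect predictions, while 0 means the model does no better than predicting the mean, which is why AutoML ranks the trial with the highest value first.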

We can select ‘View notebook for best model’ to open the auto-generated code.

At the end of the notebook, we’ll find the model URI.
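
The model URI follows MLflow's `runs:/<run id>/<artifact path>` scheme. As a small sketch, the hypothetical helper below builds such a URI from a run ID; the artifact path `model` is assumed because it matches the URI used later in this post.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build an MLflow 'runs:/' URI pointing at a model logged by a run."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return f"runs:/{run_id}/{artifact_path}"


model_uri = model_uri_for_run("07e95f5a9c8d4949a5f8ff2c03067605")
print(model_uri)

# With MLflow installed, the URI can be used to load the model for a quick check:
# import mlflow.pyfunc
# model = mlflow.pyfunc.load_model(model_uri)
```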

Register the final model into Model Registry

If we’re happy with this model, we can register it into the Model Registry.

%python
import time
import mlflow
model_name = "california_housing_regression"
model_uri = "runs:/07e95f5a9c8d4949a5f8ff2c03067605/model"
new_model_version = mlflow.register_model(model_uri, model_name)
 
# Registering the model takes a few seconds, 
# so add a delay before continuing with the next cell
time.sleep(5)

Once our model has been registered.

Once our model is registered, we can promote it to a higher stage (refer to my previous blog post for lifecycle management).

Promote the model to production.
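
The promotion can also be done in code rather than through the UI. This is a minimal sketch, assuming an MLflow version whose `MlflowClient` still supports registry stages through `transition_model_version_stage` (newer MLflow releases deprecate stages). The helper `promote_model_version` is hypothetical and takes the client as an argument.

```python
VALID_STAGES = ("None", "Staging", "Production", "Archived")


def promote_model_version(client, model_name, version, stage="Production"):
    """Move a registered model version to the given Model Registry stage."""
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {VALID_STAGES}, got {stage!r}")
    return client.transition_model_version_stage(
        name=model_name,
        version=str(version),
        stage=stage,
    )


# Usage in a notebook with MLflow installed (not executed here):
# from mlflow.tracking import MlflowClient
# promote_model_version(MlflowClient(), "california_housing_regression", 1)
```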

Model Serving

In this section, we’ll see how to deploy the registered model for online serving.

Select the ‘Serving’ tab & press ‘Enable Serving’.

This starts a job cluster and installs the required libraries along with our chosen model.

A new job cluster is instantiated.

We can amend the cluster settings or select a different instance type based on our needs.

‘Cluster Settings’ page.

Once the cluster is in the running state and the model is deployed, we can find multiple model URLs:

  • using the version number, we can call any particular model version: https://<databricks instance>/model/<model name>/<version number>/invocations
  • or, we can call the latest version of each stage: https://<databricks instance>/model/<model name>/<stage>/invocations

Once the model is ready.
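
The two URL patterns above differ only in the path segment that carries either a version number or a stage name. A small sketch makes this concrete; the workspace host used here is hypothetical, and `invocation_url` is an illustrative helper rather than part of any Databricks library.

```python
def invocation_url(instance, model_name, version_or_stage):
    """Return the REST endpoint for a model version (e.g. 1) or a stage (e.g. "Production")."""
    return f"https://{instance}/model/{model_name}/{version_or_stage}/invocations"


# Hypothetical workspace host; replace with your own Databricks instance
instance = "adb-0000000000000000.0.azuredatabricks.net"

by_version = invocation_url(instance, "california_housing_regression", 1)
by_stage = invocation_url(instance, "california_housing_regression", "Production")
print(by_version)
print(by_stage)
```

Calling the stage URL is usually more convenient for clients, since promoting a new version to that stage changes what the URL serves without any change on the caller's side.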

Now, we can test the service by sending a REST request and getting the desired response.

Test the online model

We need a Databricks access token to access the REST endpoint of the deployed model.

Generate an access token.

I’ve used Postman to access the REST API. We’ll use the model URL from the earlier screen, select Bearer Token as the authorization type and enter the Databricks access token.

POST REST call.
A sample request.
A returned response.
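
The same call can be scripted instead of using Postman. The sketch below builds the POST request with the Python standard library and leaves the actual send commented out. Several details are assumptions: the workspace host is hypothetical, the token is read from an environment variable named `DATABRICKS_TOKEN`, the feature values are illustrative, and the body uses the pandas split-oriented shape (`columns` plus `data`) that MLflow 1.x scoring servers accept, which may differ for other MLflow versions.

```python
import json
import os
import urllib.request

# Feature columns of the sklearn California housing dataset (target excluded)
FEATURE_COLUMNS = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]


def build_scoring_request(url, token, rows):
    """Build (but do not send) a POST request for a served model.

    Assumption: the endpoint accepts a pandas split-oriented JSON body,
    i.e. {"columns": [...], "data": [[...], ...]}.
    """
    payload = {"columns": FEATURE_COLUMNS, "data": rows}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace host; replace with your own model URL
url = ("https://adb-0000000000000000.0.azuredatabricks.net"
       "/model/california_housing_regression/Production/invocations")
token = os.environ.get("DATABRICKS_TOKEN", "<access token>")
# One illustrative feature row, in FEATURE_COLUMNS order
request = build_scoring_request(
    url, token, [[8.3252, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -122.23]]
)

# To call the endpoint (requires a running serving cluster and a valid token):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```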

If we want to publish the online model for production use, we should put it behind the Azure API Management platform to manage our APIs.
