MLflow & Azure Databricks

Databricks is one of the top choices among data scientists to run their ML codes. To help them to manage their codes and models, MLflow has been integrated with Databricks.

MLflow is an open source platform for managing the end-to-end machine learning lifecycle..Azure Databricks provides a fully managed and hosted version of MLflow integrated with enterprise security features, high availability, and other Azure Databricks workspace features..

Find below the components of MLFlow and few other important components have been added along with.

Azure Databricks Machine Learning components.

In this blog, we’ll cover few of the components — tracking, models & model registry.

Prerequisite

I’m using Azure Databricks Runtime for Machine Learning specifically, 8.3 ML Beta throughout this blog.

Data Preparation

We have used the Pima Indians Diabetes dataset (download it from here and for details, refer here).

df = spark.table ('pima_indians_diabetes')
print(f"""There are {df.count()} records in the dataset.""")
df.show(5)

Train & Test Datasets

We’ll split the dataset into train and test sets.

# Convert Spark DataFrame into Panda DataFrame
import pandas as pd
dataset = df.toPandas()

# Extract features & labels
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, \
                     test_size = 0.25, \
                     random_state = 0)

Standard Scaling

We’ll use feature scaling to keep the features into a same scale.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Enter MLflow

We’ll start by enabling or configuring logging of all the parameters, metrics and artifacts by calling mlflow.autolog() method. This will avoid to explicitly log models, libraries and parameters.

import mlflow
# Enable MLflow autologging for this notebook
mlflow.autolog()

Start run

We’ll start the MLflow by calling start_run method.

with mlflow.start_run(run_name='Pima_Indians_Diabetes_KNN') as run:
  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(\
                             n_neighbors = 5, \
                             weights = 'uniform', \
                             algorithm = 'auto', \
                             leaf_size = 20, \
                             p = 2, \
                             metric = 'minkowski', \
                             metric_params = None, \
                             n_jobs = None)
  knn.fit(X_train, y_train)
  
  # Predicting the Test set results
  y_pred_knn = knn.predict(X_test)
  
  from sklearn.metrics import accuracy_score
  acc_knn = accuracy_score(y_test, y_pred_knn)
  # Explicitly log the metric (optional)
  mlflow.log_metric("test_accuracy_score", acc_knn)
  print ("Test Accuracy Score: {}".format(acc_knn))

We’ll try passing KNeighborsClassifier with different parameters and compare the accuracy score. All of these experiments will be tracked by Databricks. Select the Experiment icon at top-right of the notebook to see all of the notebook runs.

Otherwise, we can select Experiments option from the left menu to see all experiments for all notebooks. We’ll then select our current notebook.

By selecting our current notebook, all of the experiment runs will be listed with model details, parameters, generated metrics and tags.

Comparing runs manually

We can select two/multiple experiments and compare.

Databricks MLflow will automatically calculate & log different metrics like — training_accuracy_score, training_f1_score, training_log_loss, training_precision_score, training_recall_score, training_roc_auc_score, training_score.

We can also download the runs as CSV.

Retrieve the best model

We’ll use the mlflow.search_runs() to search and retrieve the best model.

# Sort runs by their test accuracy; 
# in case of ties, use the most recent run
best_run = mlflow.search_runs(
 order_by  = ['metrics.test_accuracy_score DESC', \
                 'start_time DESC'], \
 max_results = 10).iloc[0]

print ("Test Accuracy Score: {}"\
       .format(best_run["metrics.test_accuracy_score"]))

Predict inputs with the best model

model_loaded = mlflow.pyfunc.load_model(
  'runs:/{run_id}/model'.format(
    run_id=best_run.run_id
  )
)
 
best_model_predictions = model_loaded.predict(X_test[1:2])
print("Test prediction: {}".format(best_model_predictions))
print("Actual value: {}".format(y_test[1:2]))

Register the final model in Model Registry

Once we’re happy with our model, we can register it in the Model Registry. By default, current stage will be None.

import time
 
model_name = "pima_indians_diabetes"
model_uri = best_run.artifact_uri+"/model"
new_model_version = mlflow.register_model(model_uri, model_name)
 
# Registering the model takes a few seconds, 
# so add a delay before continuing with the next cell
time.sleep(5)

print ("Model Name: {}".format(new_model_version.name))
print ("Model Current Stage: {}"\
       .format(new_model_version.current_stage))
print ("Model Version: {}".format(new_model_version.version))

We can browse the registered model from Models menu.

Managing the Model Lifecycle

Transition model to Staging

We can promote the code from None to Staging or Production using UI or code. Transiting the model from None to Staging using code:

from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage( \
      name=model_name, \
      version=new_model_version.version, \
      stage="Staging"
)

Transition model to Production

We can promote our model to Production programmatically or using Models screen. We’ll use Request transition to rather than direct Transition to.

Select ‘*Request transition to’ -> ‘Production’.*

The request will be pending for approval.

The model has now been promoted to Production.

Sharing Models across Workspaces

In a large enterprise, there are multiple business groups, teams and projects. They use separate instances of Databricks and not share in general. However, it’ll be good if individual data science teams can log & maintain their final machine learning models into a central model registry with proper description so, these can be used by other teams as well.

Or, there can be cases where users use different instances of Databricks for development, staging & production and use a separate centralized model registry.

Whatever is the usecase, find below the steps to access a remote Model Registry.

Step 1 — Generate token in central model registry or remote registry

Access to the centralized registry is maintained using Databricks personal access tokens (PATs). Share the PATs with the teams who want to use the registry.

Step 2 — Store the remote PAT in local workspace

We need to store the given PAT & remote store information in local Databricks secret scope.

We can use Databricks CLI to configure the secret scope.

# Install databricks-cli
pip install databricks-cli

# Configure databricks-cli to connect local workspace
databricks configure

Input local Databricks instance details.

Create the secret scope (I’m using Databricks Standard).

databricks secrets create-scope --scope my_prod_scope --initial-manage-principal users

Once the scope is created, the following keys need to be setup.

databricks secrets put --scope <my_prod_scope> --key <central_model_registry>-host

databricks secrets put --scope <my_prod_scope> --key <central_model_registry>-token

databricks secrets put --scope <my_prod_scope> --key <central_model_registry>-workspace-id

Step 3 — Code to connect to the remote registry & use required model(s)

Construct the remote registry URI:

#registry_uri = f'databricks://<scope>:<prefix>'
registry_uri = f'databricks://my_prod_scope:central_model_registry'

Specify to the remote registry:

import mlflow
mlflow.set_registry_uri(registry_uri)

Verify the current specified registry:

mlflow.get_registry_uri()

List all of the models available in the remote registry.

client = MlflowClient(tracking_uri=None, registry_uri=registry_uri)
registered_models = client.list_registered_models()
for model in registered_models:
  print ("Model Name: {}".format(model.name))

Predict with the production version model (I’ve already scaled the inputs i.e. feature scaling has been applied).

model = mlflow.pyfunc.load_model(f'models:/pima_indians_diabetes/Production')
import numpy as np
input = np.array([[-0.82986389, -1.26778492, 0.12192245, -0.19524251, -0.37696732, -0.70520517, -0.558692, -0.79908332]])
output = model.predict(input)
print ("Model predicted output:{}".format (output))

Conclusion

In this blog, we have seen how to experiment with models & parameters, track our experiments, manage ML models, register final models into Model Registry and share across enterprise. In the next blog, we’ll see how to deploy our model for online serving.

References

Thanks for reading!! If you have enjoyed, Clap & Share it!! To see similar posts, follow me on Medium, Blogger & LinkedIn.