Isaac Kargar

Summary

The provided content is a comprehensive guide on setting up MLflow on Google Cloud Platform (GCP) using both Compute Engine and Cloud Run for machine learning experiment tracking and management.

Abstract

The article discusses the implementation of MLflow on GCP, detailing the architecture for a distributed experiment tracking system. It outlines the steps to deploy MLflow using Google Compute Engine, including setting up a tracking server, a PostgreSQL database as a backend store, and a Google Cloud Storage bucket for artifact storage. The guide also covers the configuration of firewall rules, creation of virtual machines, and installation of necessary software components. Additionally, it presents an alternative approach using Cloud Run with Terraform for resource provisioning, emphasizing the use of secret management and Cloud SQL for metadata storage. The author aims to fill a gap in available resources by providing a step-by-step tutorial for MLflow setup on GCP, ensuring that data scientists can collaborate effectively by centralizing their experimentation data.

Opinions

  • The author believes that a centralized tracking server is beneficial for teams of data scientists to collaborate and maintain a unified record of their experiments.
  • The article suggests that using MLflow's Tracking Server with proxied artifact storage access enhances security by eliminating the need for end-users to have direct access to remote object stores.
  • The author indicates a preference for using private IP addresses for database connections to ensure secure communication between the tracking server and the backend store.
  • The guide emphasizes the importance of using Cloud Run for a more resource-efficient deployment of MLflow, as opposed to traditional virtual machine-based setups.
  • The author values the use of Terraform for infrastructure as code (IaC) practices, which facilitate the reproducibility and manageability of the deployment process.
  • The article promotes the best practice of storing secrets, such as database credentials, in Google Cloud's Secret Manager to enhance security.

Setting Up MLFlow on GCP

I searched the internet and couldn’t find a good step-by-step tutorial on setting up MLflow on GCP, so I decided to write one myself.

In this blog post, I will show the steps to set up MLflow on Google Cloud for distributed experiment tracking. MLflow can be used for machine learning experiment tracking, and there are several ways to use it, which you can check here.

The architecture that we want to implement here is like scenario number 5 and is shown in the following image:


This is useful for teams with multiple data scientists: they share one tracking server, so they can all run their experiments and keep everything in one place. The tracking server is also not dependent on the backend store and the artifact store, and it can be scaled independently. In addition, the scientists will not lose their local data if they want to scale or change their machines, because nothing lives only on a single machine.

Here is the explanation from the MLflow documentation:

MLflow’s Tracking Server supports utilizing the host as a proxy server for operations involving artifacts. Once configured with the appropriate access requirements, an administrator can start the tracking server to enable assumed-role operations involving the saving, loading, or listing of model artifacts, images, documents, and files. This eliminates the need to allow end users to have direct path access to a remote object store (e.g., s3, adls, gcs, hdfs) for artifact handling and eliminates the need for an end-user to provide access credentials to interact with an underlying object store.

Enabling the Tracking Server to perform proxied artifact access in order to route client artifact requests to an object store location:

  • Part 1a and b: The MLflow client creates an instance of a RestStore and sends REST API requests to log MLflow entities. The Tracking Server creates an instance of an SQLAlchemyStore and connects to the remote host for inserting tracking information in the database (i.e., metrics, parameters, tags, etc.)
  • Part 1c and d: Retrieval requests by the client return information from the configured SQLAlchemyStore table
  • Part 2a and b: Logging events for artifacts are made by the client using the HttpArtifactRepository to write files to the MLflow Tracking Server. The Tracking Server then writes these files to the configured object store location with assumed role authentication
  • Part 2c and d: Retrieving artifacts from the configured backend store for a user request is done with the same authorized authentication that was configured at server start. Artifacts are passed to the end user through the Tracking Server through the interface of the HttpArtifactRepository
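To make this concrete, here is a minimal sketch of how such a proxied-artifact tracking server could be started. The flags (--serve-artifacts, --artifacts-destination) come from the MLflow CLI; the connection values are placeholders, and this is not the exact command used later in this post, which uses --default-artifact-root instead:

# Sketch: tracking server that proxies artifact access (placeholder values)
mlflow server \
 --host 0.0.0.0 \
 --port 5000 \
 --backend-store-uri postgresql://<user>:<pass>@<db private ip>:5432/<db name> \
 --serve-artifacts \
 --artifacts-destination gs://<bucket name>/<folder name>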

Deploying MLFlow on GCP using Compute Engine

In this distributed architecture, we will have:

  • one virtual machine as the tracking server
  • one Google Cloud Storage bucket as the artifact store — persists artifacts (files, models, images, in-memory objects, model summaries, etc.)
  • one PostgreSQL database as the backend store — persists MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc.)

The architecture on GCP would be like the following image:

Virtual Machine as The Tracking Server

We need a firewall rule which can be created like the following:

gcloud compute firewall-rules create mlflow-tracking-server \
 --network default \
 --priority 1000 \
 --direction ingress \
 --action allow \
 --target-tags mlflow-tracking-server \
 --source-ranges 0.0.0.0/0 \
 --rules tcp:5000 \
 --enable-logging

Here is the firewall rule after creation:
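You can also inspect the rule from the CLI to confirm it was created with the expected settings (using the rule name from above):

gcloud compute firewall-rules describe mlflow-tracking-server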

We then can create a virtual instance as the tracking server.

gcloud compute instances create mlflow-tracking-server \
 --project=<PROJECT_ID> \
 --zone=europe-west1-b \
 --machine-type=e2-standard-2 \
 --network-interface=network-tier=PREMIUM,subnet=default \
 --maintenance-policy=MIGRATE \
 --provisioning-model=STANDARD \
 --service-account=<PROJECT_NUMBER>-compute@developer.gserviceaccount.com \
 --scopes=https://www.googleapis.com/auth/cloud-platform \
 --tags=mlflow-tracking-server \
 --create-disk=auto-delete=yes,boot=yes,device-name=mlflow-tracking-server,image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20220610,mode=rw,size=10,type=projects/<PROJECT_ID>/zones/europe-west1-b/diskTypes/pd-balanced \
 --no-shielded-secure-boot \
 --shielded-vtpm \
 --shielded-integrity-monitoring \
 --reservation-affinity=any

Change PROJECT_ID based on your project. You can also change other configs like the zone, machine type, etc. if you want; note that you have to change them in multiple places. The service account is the default Compute Engine service account and has the following form:

PROJECT_NUMBER-compute@developer.gserviceaccount.com

Where PROJECT_NUMBER is the project number of the project that owns the service account. You can find it here.

You can also use the UI to simply create the virtual machine. Just make sure you use the default network for the VPC and add the firewall rule's target tag (mlflow-tracking-server) to the network tags in the Network Interfaces section. Also, give the VM "Allow full access to all Cloud APIs" in the Identity and API access section.

Here is the networking section of the VM after creation (other configs can be based on your choice):

Database as the Backend Store

We also need a PostgreSQL database as the backend store.

  • Go to the GCP dashboard, search for SQL, select Create Instance, and then select PostgreSQL.
  • Put a name and password for the instance. Select the database version and region. You can choose an option for zonal availability too.
  • Expand the Customize your instance part and, in Connections, select Private IP, deselect Public IP, and from the drop-down options for Network in the Private IP part, select default. This is the VPC that our virtual machine is also on, so the VM and the database can see each other.
  • You can change other configs for the DB too; I leave them at their default values.
  • Select Create Instance.

It will take you to the overview page and will take some time to create the database instance. Then we can create a database. GCP will create a default one named postgres, but I will create a new one.

Then go to the Databases section and select Create Database and name it mlflow_db, for example.

Then we need to create a user too. Go to the Users section, click on Add User Account, and select a username and password for it.
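If you prefer the CLI, the database and the user can also be created with gcloud. A minimal sketch, assuming a hypothetical instance name mlflow-backend (replace it, the username, and the password with your own values):

# Create the MLflow metadata database on the Cloud SQL instance
gcloud sql databases create mlflow_db --instance=mlflow-backend

# Create the database user that the tracking server will use
gcloud sql users create <USERNAME> --instance=mlflow-backend --password=<PASSWORD>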

Now, you should be able to connect to the tracking server via SSH and run the following commands to install the PostgreSQL client and list the Cloud SQL instances. You should see the created instance with its private IP.

sudo apt-get update
sudo apt-get install postgresql-client
gcloud sql instances list

Then run the following command to see if you can connect to the database.

psql -h CLOUD_SQL_PRIVATE_IP_ADDRESS -U USERNAME DATABASENAME

It will ask you for the password of the user you created before.

Now that you can connect to the database from the tracking server using private IP, let’s go to the next part.

Google Cloud Storage Bucket as Artifact Store

In the Google Cloud dashboard, search for Cloud Storage and then select Create Bucket. Do the required configuration and you are done. You can also create a folder like mlruns in the bucket.
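Equivalently, a minimal CLI sketch with gsutil (bucket name and location are placeholders):

# Create the artifact bucket in the same region as the VM
gsutil mb -l europe-west1 gs://<bucket name>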

Run the MLFlow Server on Tracking Server

Now we have all the resources. Go back to the SSH terminal for the tracking server, or connect to it again. I had some problems installing the required Python packages, so I created a virtual environment and installed the packages there.

sudo apt install python3.8-venv
python3 -m venv mlflow
source mlflow/bin/activate
pip install mlflow boto3 google-cloud-storage psycopg2-binary

Then run the MLFlow server:

mlflow server \
 -h 0.0.0.0 \
 -p 5000 \
 --backend-store-uri postgresql://<user>:<pass>@<db private ip>:5432/<db name> \
 --default-artifact-root gs://<bucket name>/<folder name>

Then you can go to http://<tracking server external IP>:5000 and you should see the MLflow UI!

Now, you can train a model on your machine or another VM and log the data to MLflow.

import mlflow
import os

TRACKING_SERVER_HOST = "<tracking server external IP>"
mlflow.set_tracking_uri(f"http://{TRACKING_SERVER_HOST}:5000")
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

mlflow.set_experiment("my-experiment-1")

with mlflow.start_run():
    X, y = load_iris(return_X_y=True)
    params = {"C": 0.1, "random_state": 42}
    mlflow.log_params(params)
    lr = LogisticRegression(**params).fit(X, y)
    y_pred = lr.predict(X)
    mlflow.log_metric("accuracy", accuracy_score(y, y_pred))
    mlflow.sklearn.log_model(lr, artifact_path="models")

print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'")
 
mlflow.list_experiments()

Note that you need to install google-cloud-storage via pip on your machine.
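For example, on the client machine something like the following should be enough; the exact package list is an assumption based on the snippet above:

pip install mlflow google-cloud-storage scikit-learn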

You should now see my-experiment-1 in the output of the above code and also in UI (refresh the page if you don’t see it).

You can also assign a fixed external IP address to your tracking server, so you don’t need to change it in the code every time you start the VM. You can do this by going to the IP addresses section of the VPC network page, as shown in the below image:

Now if you check the mlflow-tracking-server VM, you should see the External IP even if the VM is stopped.
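The same can also be done from the CLI by promoting the VM's current ephemeral address to a static one; a minimal sketch with placeholder values:

# Reserve the VM's current external IP as a static address
gcloud compute addresses create mlflow-tracking-server-ip \
 --region=europe-west1 \
 --addresses=<current external IP>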

In addition to the above solution and architecture, I found another solution here to deploy MLflow on GCP using Cloud Run, which uses fewer resources.

Deploying MLFlow on GCP using Cloud Run

Here I just repeat the steps with small modifications to complete the content of this blog post. Feel free to refer to the main resource.

This will build an architecture like the following:

  • PostgreSQL on Cloud SQL for metadata
  • Cloud Storage for artifacts
  • A service account to be used with Cloud Run for accessing other GCP services, as a best practice
  • Secrets, such as the user password and the database connection string, stored in Secret Manager

Pre-Requisites

In Local Machine

  • Install docker (To build images and pull)
  • Install gcloud CLI tools and set the project
  • Install Terraform

In GCP

  • Enable the following APIs for the project: Cloud SQL, Cloud SQL Admin, Secret Manager, Cloud Run
  • Create a service account for Terraform with the following roles attached: Cloud Run Admin, Cloud Run Invoker, Cloud SQL Admin, Secret Manager Admin, Secret Manager Secret Version Adder, Secret Manager Secret Version Manager, Security Reviewer, Service Account User, Storage Admin
  • Clone the Repo
#To clone the code
git clone git@github.com:kujalk/MLFlow_GCP_Terraform.git
cd MLFlow_GCP_Terraform/Terraform_Resources
  • Download the JSON key and put it inside the Terraform_Resources folder (a CLI sketch for the API and key steps follows below).
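The API enablement and key download can also be scripted. A minimal sketch, where terraform-sa and the key file name are hypothetical and the service IDs are my assumption for the APIs listed above:

# Enable the required APIs (service IDs assumed)
gcloud services enable sqladmin.googleapis.com secretmanager.googleapis.com run.googleapis.com

# Create and download a JSON key for the Terraform service account
gcloud iam service-accounts keys create Terraform_Resources/terraform-sa-key.json \
 --iam-account=terraform-sa@<PROJECT_ID>.iam.gserviceaccount.com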

Method

Open terraform.tfvars and fill in the values accordingly (an example sketch follows the list):

  • keyfile — Absolute path to the Service Account key file
  • mlflow_tracking_username — username for MLFlow
  • mlflow_tracking_password — password for MLFlow
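For illustration, the resulting terraform.tfvars might look like this; a minimal sketch with placeholder values, assuming these three variables are all that need to be set:

# Write a terraform.tfvars with placeholder values
cat > terraform.tfvars <<'EOF'
keyfile                  = "/absolute/path/to/terraform-sa-key.json"
mlflow_tracking_username = "mlflow-admin"
mlflow_tracking_password = "change-me"
EOF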

Give the commands below,

gcloud auth login
gcloud auth configure-docker

#you might need to set GCP the project
gcloud config set project <project id>

#To create the resources
terraform init
terraform plan
terraform apply -auto-approve

#To destroy the resources
terraform destroy -auto-approve

Note

  • Terraform will print the Cloud Run service URL in its output
  • In total, it takes about 15–20 minutes to create the resources (Cloud SQL takes the longest) and 5–10 minutes to destroy them
  • The Cloud SQL instance name must be different on each deployment, because a deleted instance name cannot be reused for about one week
  • If you get a “failed to delete database” error, wait another 15 minutes and then delete the resources again

Then in the code:

import os
import mlflow

os.environ["MLFLOW_TRACKING_USERNAME"] = "<tracking username set in terraform tfvar>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<tracking password set in terraform tfvar>"

mlflow.set_tracking_uri("<cloud run endpoint generated by terraform - you can also find it under cloud run apps>")
mlflow.set_experiment('<experiment name>')

Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.
