Sascha Heyer

Summary

This text discusses the deployment and serving of machine learning models at scale using Google Vertex AI Endpoints.

Abstract

Google Vertex AI Endpoints provide a dedicated service for deploying machine learning models, offering great flexibility compared to using virtual machines. The article covers the process of deploying models, including using pre-built containers, custom containers, and custom prediction routines. It also explains the requirements for building a custom container, including implementing an HTTP server that listens for requests on 0.0.0.0 on port 8080, providing an HTTP path for health checks, and ensuring the request and response bodies are in JSON format. The article also touches on scaling, limitations, and pricing.

Opinions

  • Using Google Vertex AI Endpoints for deploying models is recommended over using virtual machines or on-prem machines.
  • Custom containers offer more customization options but require more effort than pre-built containers.
  • The model's resource requirements must be understood to properly put the models into production.
  • The prediction request and response cannot be larger than 1.5 MB, which is a limitation.
  • Vertex AI Endpoints do not scale down to zero, which can lead to higher costs for certain use cases.
  • Automating the deployment process is important for a production environment.
  • The costs of using Vertex AI Endpoints depend on the machine type, the time needed to handle requests, and the number of parallel nodes needed.

Serving Machine Learning models with Google Vertex AI

Deploying and serving any kind of machine learning model at any scale.

Companies frequently deploy their models to virtual machines (Google Compute Engine or even on-prem machines). This is something that should be avoided. Google Cloud provides a dedicated service called Vertex AI Endpoints to deploy your models.

Vertex AI Endpoints combine great flexibility with ease of use. You can keep it simple or go all in and customize it to your needs using custom containers.

A Google data center reimagined by Open AI’s DALL·E 2

This article covers everything needed to put your models into production and serve requests at a large scale, including a large section on how to scale your models properly and a few workarounds for the limitations of the service.

Jump Directly to the Notebook and Code

All the code for this article is ready to use in a Google Colab notebook. If you have questions, please reach out to me via LinkedIn or Twitter.

Different ways to serve your models

Google Vertex AI provides three ways to serve your models. Which one to choose depends on your requirements.

  1. Using pre-built containers for prediction
  2. Using custom containers for prediction
  3. Using a custom prediction routine. This way, you don’t need to build the custom container yourself, and Google takes care of providing the HTTP server for you. This is a preview feature.

This article covers option 2 in detail, serving your models with a custom container. Let me know if you’re interested in option 3.

Option 1 — Pre-built container for prediction

Use the pre-built container for prediction if your model is trained with TensorFlow, scikit-learn, or XGBoost. It’s the easiest way. You simply need to upload your model, deploy it to an endpoint, and you’re ready to go.

When uploading the model with the gcloud ai models upload command, we need to specify one of Google’s pre-built containers with the container-image-uri parameter.

The pre-built containers support different versions of the ML frameworks and are available as images optimized for prediction on CPU or GPU. Make sure you choose the right container image: https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers

For example, if our model was trained on TensorFlow 2.8 and we want to use a GPU, we choose us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest as a container image.
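As a small illustration, the image name from the example above can be assembled from its parts. The helper `prebuilt_image_uri` is hypothetical, and the naming pattern is only inferred from that single example, so check the result against the official list of pre-built containers before using it.

```python
def prebuilt_image_uri(framework: str, accelerator: str, version: str,
                       registry: str = "us-docker.pkg.dev") -> str:
    """Assemble a pre-built prediction image URI (pattern assumed from the example)."""
    tag_version = version.replace(".", "-")  # "2.8" becomes "2-8"
    return (f"{registry}/vertex-ai/prediction/"
            f"{framework}-{accelerator}.{tag_version}:latest")


# TensorFlow 2.8 with GPU support, as in the example above.
print(prebuilt_image_uri("tf2", "gpu", "2.8"))
```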

No magic, effortless, and easy to use. But you might have some more specialized requirements. And for that, we can use a custom container.

Option 2 — Custom Container for prediction

Custom containers for prediction are used for more customized use cases. Some of them are:

  1. The pre-built containers do not support your ML framework, for example PyTorch.
  2. You want complete control over the prediction code (e.g., custom logging).
  3. You need to do pre- or post-processing for your predictions, for example, tokenizing your input for use with transformers.
  4. You want to host multiple models with the same endpoint. If you’re using TensorFlow, better use shared resources instead. There can be challenges around scaling when your endpoint serves multiple models.

A custom container is just a docker container containing everything needed to serve your model. You need to create this docker container yourself. We cover that later in the article.

Deploy model

Deploying the model is a three-step process. It can be performed with the API, gcloud, or the available SDKs for different programming languages.

  1. You have to upload your model to the Vertex AI Model Registry.
  2. Create a Vertex AI Endpoint (once)
  3. After that, you can deploy your model to the Vertex AI Endpoint.

Uploading the model requires only a few parameters. The container-image-uri is either the pre-built container or your custom container.

Later in the article, we cover a sample implementation of a custom container for a Hugging Face transformer model.

The artifact-uri points to a Google Cloud Storage location where your model files are stored. This parameter is usually used for the pre-built container, though you can also use it with your custom prediction container and access the path with the environment variable AIP_STORAGE_URI, for example via os.environ.get('AIP_STORAGE_URI').
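A minimal sketch of reading that variable inside a custom container follows. The function name `model_artifact_uri` and the fallback value are assumptions for illustration; only the variable name AIP_STORAGE_URI comes from the text above.

```python
import os


def model_artifact_uri(default: str = "") -> str:
    """Return the artifact location provided by Vertex AI, or a fallback.

    The fallback is useful when running the container locally, where the
    variable is not set.
    """
    return os.environ.get("AIP_STORAGE_URI", default)


# Hypothetical local fallback path, only used outside Vertex AI.
print(model_artifact_uri(default="./sentiment"))
```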

Before we can deploy our model, we need to create a Vertex AI Endpoint. This step is performed only once. Make sure the region is the same as your model’s region.

And finally, we deploy the model to an endpoint. The first ID is that of the endpoint we created in the previous step, and the model is referenced by its ID as well. In our example, the model deployment takes approximately 4 minutes.

Check out the documentation as there are more parameters like the machine type, scaling, traffic-splitting, accelerators, and more. The default machine type is n1-standard-2.
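The three steps can be sketched as gcloud invocations. The snippet below only builds and prints the commands, it does not run them. All values (region, display names, image URI, bucket, and the two IDs) are placeholders, and the flags shown are the ones I would expect to need, so compare them with the gcloud reference before relying on them.

```python
import shlex


def upload_model_cmd(region, display_name, image_uri, artifact_uri=None):
    """Step 1: upload the model to the Vertex AI Model Registry."""
    cmd = ["gcloud", "ai", "models", "upload",
           f"--region={region}",
           f"--display-name={display_name}",
           f"--container-image-uri={image_uri}"]
    if artifact_uri:  # optional, mostly used with pre-built containers
        cmd.append(f"--artifact-uri={artifact_uri}")
    return cmd


def create_endpoint_cmd(region, display_name):
    """Step 2: create the endpoint (only needed once)."""
    return ["gcloud", "ai", "endpoints", "create",
            f"--region={region}",
            f"--display-name={display_name}"]


def deploy_model_cmd(endpoint_id, model_id, region, display_name,
                     machine_type="n1-standard-2"):
    """Step 3: deploy the uploaded model to the endpoint."""
    return ["gcloud", "ai", "endpoints", "deploy-model", endpoint_id,
            f"--region={region}",
            f"--model={model_id}",
            f"--display-name={display_name}",
            f"--machine-type={machine_type}"]


# Placeholder values for illustration only.
REGION = "us-central1"
commands = [
    upload_model_cmd(REGION, "sentiment", "gcr.io/my-project/sentiment:latest"),
    create_endpoint_cmd(REGION, "sentiment-endpoint"),
    deploy_model_cmd("ENDPOINT_ID", "MODEL_ID", REGION, "sentiment"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps them easy to pass to subprocess.run later, for example when automating the deployment.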

The endpoint is now running and ready to serve your prediction requests. I recommend using one of the many client SDKs to get your predictions.

Container requirements

As the name suggests, it has to be a dockerized application. The docker container needs to follow the requirements defined by Google. The most important requirement is to provide an HTTP server that listens for requests on 0.0.0.0 on port 8080.

Additionally, you need to provide an HTTP path for health checks. It has to return a 200 when your container is ready to handle requests. For example, if you need to load the model, ensure you return the 200 status code only after the model is loaded.

Default values can be changed for the health check path, the prediction path, and the port. The default path for the health check is / and for the prediction /predict, both served on port 8080.

To change it, you can provide additional parameters when uploading the model.
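Inside the container, the configured values can be read from environment variables instead of being hard-coded. The sketch below assumes the variables AIP_HTTP_PORT, AIP_HEALTH_ROUTE, and AIP_PREDICT_ROUTE are set by Vertex AI at runtime and falls back to the defaults named above. The function `serving_config` is a hypothetical helper.

```python
import os


def serving_config(env=None) -> dict:
    """Collect port and routes from the environment, with fallbacks.

    The fallbacks mirror the defaults described in the text and are mainly
    useful for local runs where the variables are not set.
    """
    env = os.environ if env is None else env
    return {
        "port": int(env.get("AIP_HTTP_PORT", "8080")),
        "health_route": env.get("AIP_HEALTH_ROUTE", "/"),
        "predict_route": env.get("AIP_PREDICT_ROUTE", "/predict"),
    }


print(serving_config())
```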

The last requirement is around the request and response body. The response and the request have to be in JSON format.

The request body needs to contain an instances key.

The response body needs to contain a predictions key.
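A short sketch of both bodies and a matching validation step follows. The shape of a single instance (a dictionary with a `text` field) and the prediction values are made up for illustration, and treating instances as a JSON list is my reading of the format. Only the two top-level keys come from the requirements above.

```python
import json


def parse_request(raw: str) -> list:
    """Extract the instances from a request body, or raise ValueError."""
    body = json.loads(raw)
    instances = body.get("instances") if isinstance(body, dict) else None
    if not isinstance(instances, list):
        raise ValueError('request body must be a JSON object with an "instances" list')
    return instances


def build_response(predictions: list) -> str:
    """Wrap the model output in the expected response body."""
    return json.dumps({"predictions": predictions})


# Hypothetical payload for a sentiment model.
request_body = '{"instances": [{"text": "I love this"}, {"text": "Not great"}]}'
instances = parse_request(request_body)
print(build_response([{"label": "placeholder"} for _ in instances]))
```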

If you follow those requirements, you can get the container up and running with a few lines of code.

Model

The model we are using in this article was trained previously. You can follow the article or the YouTube video if you want to train this model yourself.

Build the custom container

In this example, we build the custom container using Python, the most used language among ML use cases. Still, remember you are not limited to Python. You only need to make sure you follow the container requirements. I have seen customers running C++ for their model serving, so anything is possible.

Our server implementation contains the logic to load the model and get the prediction. We implement a Python HTTP server. I am using FastAPI, which is just one of many ways to implement an HTTP server.

You can easily implement everything that is needed for your specific use case. This could be additional pre- or post-processing, loading a tokenizer, or adding additional logging. No limitations. The following code is just a scaffold because it is unique to your model. Check the notebook to see a complete example with Hugging Face Transformers.
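To show the moving parts without extra dependencies, here is a minimal scaffold that swaps FastAPI for Python's standard-library http.server. It is not the implementation from the notebook. The `predict` function is a placeholder that returns the length of each instance, and the routes follow the defaults described earlier. A real server would load the model once at startup, before it begins answering the health check.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_ROUTE = "/"
PREDICT_ROUTE = "/predict"


def predict(instances):
    """Placeholder for the real model call; replace with your own logic."""
    return [{"length": len(str(item))} for item in instances]


def handle(method, path, raw_body=b""):
    """Map a request to (status code, JSON-serializable payload)."""
    if method == "GET" and path == HEALTH_ROUTE:
        return 200, {"status": "ok"}
    if method == "POST" and path == PREDICT_ROUTE:
        try:
            instances = json.loads(raw_body)["instances"]
        except (ValueError, KeyError, TypeError):
            return 400, {"error": 'expected JSON with an "instances" key'}
        return 200, {"predictions": predict(instances)}
    return 404, {"error": "not found"}


class PredictionHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around handle()."""

    def _reply(self, raw_body=b""):
        status, payload = handle(self.command, self.path, raw_body)
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._reply()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(self.rfile.read(length))


def make_server(host="0.0.0.0", port=8080):
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), PredictionHandler)
```

Calling make_server().serve_forever() starts the server on 0.0.0.0:8080. Keeping the request logic in the plain function `handle` makes it easy to unit test without opening a socket.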

To build a Docker image, we need a Dockerfile. I am using a base image with FastAPI. Remember, if you want to run your predictions on a GPU, you have to use another base image or install the dependencies needed. Additionally, we install a few required dependencies like Hugging Face Transformers and TensorFlow. And finally, we copy the main.py and the sentiment folder that contains our model artifacts into the Docker image.

Ensure you integrate all your artifacts into the Docker container, like your model, tokenizers, and everything needed to serve your model. Don’t download those artifacts at runtime. Depending on your artifact size, this can lead to scaling issues and increase the cold start times. Additionally, if you, for example, download an artifact from a third-party service, you might run into rate limits during quick scale-ups. I recommend integrating it into the container, no matter how small your artifacts are.

To build the container, we use Google Cloud Build and the corresponding build configuration with the following steps:

  1. Download the model from Google Cloud Storage
  2. Build the container image
  3. Push the container image to Google Container Registry

To run the cloud build job for our container, we use gcloud.

The custom container is now ready to use and stored in the Google Container Registry. The container can be defined with the container-image-uri parameter when uploading the model to the Vertex AI model registry.

But it doesn’t end here. Now the real work starts: you need to make sure your service scales according to your requirements.

Getting predictions

To get predictions, you can use one of the available client SDKs, the API, gcloud, or the UI (for rapid testing).

The request.json follows the requirements we already covered.

As a response, we get a JSON containing the predictions for all instances and additional meta information.
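If you call the API directly instead of using an SDK, the request boils down to a POST against the endpoint's predict URL. The sketch below only builds the URL and the body. The project, region, and endpoint ID are placeholders, and the URL layout reflects my understanding of the v1 REST API, so verify it against the API reference. A real call also needs an OAuth access token in the Authorization header.

```python
import json


def predict_url(project: str, region: str, endpoint_id: str) -> str:
    """Build the REST URL for online predictions (layout assumed, see lead-in)."""
    return (f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
            f"/locations/{region}/endpoints/{endpoint_id}:predict")


def predict_body(instances: list) -> str:
    """Serialize the instances into the expected request body."""
    return json.dumps({"instances": instances})


# Placeholder identifiers for illustration only.
print(predict_url("my-project", "us-central1", "1234567890"))
print(predict_body([{"text": "I love this"}]))
```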

Scaling

To scale your endpoints correctly, you need to understand the resources needed for your model. This defines the machine type and GPU required to serve your model. In addition to that, we also need to serve a specific number of prediction requests. Google takes care of that by automatically scaling the number of nodes based on CPU and GPU usage. The default CPU and GPU threshold is set to 60% and can be changed by setting the AutoscalingMetricSpec.

It is possible to define a maximum number of nodes, to prevent cost explosion with the downside of maybe not being able to deliver all requests.
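As a rough sketch, the scaling settings can be thought of as one small configuration object per deployed model. The field names and the metric name below follow the REST API's dedicated resources message as I understand it. Treat them as assumptions and check the API reference. The helper `dedicated_resources` itself is hypothetical.

```python
CPU_METRIC = "aiplatform.googleapis.com/prediction/online/cpu/utilization"


def dedicated_resources(machine_type: str, min_replicas: int,
                        max_replicas: int, cpu_target: int = 60) -> dict:
    """Describe machine type, replica bounds, and the CPU autoscaling target."""
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    if not 0 < cpu_target <= 100:
        raise ValueError("cpu_target is a percentage between 1 and 100")
    return {
        "machineSpec": {"machineType": machine_type},
        "minReplicaCount": min_replicas,
        "maxReplicaCount": max_replicas,  # upper bound to cap costs
        "autoscalingMetricSpecs": [
            {"metricName": CPU_METRIC, "target": cpu_target},
        ],
    }


# One to five n1-standard-4 nodes, scaling at 60% CPU utilization.
print(dedicated_resources("n1-standard-4", 1, 5))
```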

My recommendation: Deploy your model and run a load test using k6 or any other load-testing framework. This way, you get a feel for how the model behaves under your specific expected production load.

Make sure you’re not over- or underutilizing your chosen machine. Check the metrics for CPU / GPU as well as memory utilization. If, for example, you’re using an n1-standard-4 machine, you get 4 vCPUs out of the box. Your CPU load per node has to stay below 400% (leave some headroom). If you see it is overutilized or close to it, consider using a larger machine. Otherwise, you will most likely run into scaling issues. Also, consider the amount of memory needed to serve your model.

An indication of scaling issues is errors suddenly appearing in the metrics while the configured maximum number of replicas is not fully used during high load.

The same applies to underutilization. If your nodes are constantly very low on CPU or memory usage, consider downscaling your machine to save costs.

It’s always a bit more complicated to handle async processing and workers

FastAPI is very popular for model serving at the time of writing. Combined with uvicorn, it lets us process multiple prediction requests asynchronously. In the example notebook, we are also using uvicorn.

In our example, we are using uvicorn-gunicorn-fastapi-docker as a base image. This specific version sets the configuration automatically based on the server it is running on.

By default, it sets the number of workers to the number of threads available. For example, if we use an n1-standard-8 machine, the number of available threads is 8. Uvicorn will, by default, start 8 workers in parallel, ready to serve requests.

Assume one instance of our model consumes 4 GB of memory, and we are using the n1-standard-8. To serve 8 instances of the model, we would need 32 GB of memory. An n1-standard-8 has “only” 30 GB of memory, so we won’t be able to run this number of workers. You could either take a high-memory machine or manually reduce the number of workers by setting the uvicorn --workers parameter. The model used in this example uses 0.7 GB per model instance.
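The arithmetic above can be captured in a small helper. This is a simplified sketch: it assumes every worker holds its own copy of the model and ignores everything else that needs memory unless you pass `reserved_gb`. The function name `max_workers` is made up for this example.

```python
import math


def max_workers(vcpus: int, machine_memory_gb: float,
                model_memory_gb: float, reserved_gb: float = 0.0) -> int:
    """Largest worker count that fits both the vCPU and the memory budget."""
    by_memory = math.floor((machine_memory_gb - reserved_gb) / model_memory_gb)
    return max(0, min(vcpus, by_memory))


# n1-standard-8 (8 vCPUs, 30 GB) with a 4 GB model: memory is the bottleneck.
print(max_workers(8, 30, 4))    # 7
# Same machine with the 0.7 GB model from this article: CPU is the bottleneck.
print(max_workers(8, 30, 0.7))  # 8
```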

There are more ways to optimize memory usage, such as preloading or optimizing the model. We won’t cover them in today's article.

That brings me to the most important point. If there is one thing you take away from this article, let it be this:

You need to understand the resources needed for your model. Without knowing them, you won’t be able to properly put your models into production.

Limitations

  • The prediction request and response cannot be larger than 1.5 MB. This is a clunky limitation, and I have seen many customers using different services because of it. I would love to see this limitation gone soon. It has followed me since the early days when the product was still called ML Engine. You can work around it by saving a large response, such as an image, to Google Cloud Storage and returning the GCS path instead.
  • Vertex AI Endpoints (as of August 2022) do not scale down to zero. You always have at least one instance running. This is less of an issue if you only have a few models. If your use case requires building models for each user, your cloud costs will explode. As an alternative, you can deploy your models to Cloud Run. You will lose some features like GPUs or explainability. In that case, I recommend Cloud Run for serving your ML models. See the feature request for Vertex AI https://issuetracker.google.com/issues/206042974.
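The workaround for the size limit from the first bullet can be sketched as follows. The `upload` callable is hypothetical: it stands for whatever code writes the payload to Google Cloud Storage and returns the path. The byte limit is an assumption too, since 1.5 MB is interpreted conservatively as 1,500,000 bytes here, and the `gcs_uri` field name is made up.

```python
import json

MAX_BYTES = 1_500_000  # conservative reading of the 1.5 MB limit (assumption)


def respond_within_limit(predictions, upload, limit: int = MAX_BYTES) -> str:
    """Return the predictions inline, or a GCS reference if they are too large."""
    body = json.dumps({"predictions": predictions})
    if len(body.encode("utf-8")) <= limit:
        return body
    gcs_path = upload(body)  # hypothetical uploader returning a gs:// path
    return json.dumps({"predictions": [{"gcs_uri": gcs_path}]})


def fake_upload(payload: str) -> str:
    """Stand-in for a real upload, used only to demonstrate the flow."""
    return "gs://my-bucket/responses/example.json"


print(respond_within_limit(["small"], fake_upload))
print(respond_within_limit(["x" * 2_000_000], fake_upload))
```

The client then has to fetch the referenced object itself, so this only helps when the caller has access to the bucket.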

Pricing

The costs depend on the machine type, the time you need to handle your specific number of requests, as well as the number of parallel nodes needed.

To give you an idea, the smallest machine type, an n1-standard-2, costs $0.123 per node hour. The minimum cost for one instance running 24/7 for one month without automatic scaling is ~$90.
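The estimate is simple multiplication, shown here as a small helper. It assumes a 30-day month and a constant number of nodes, and it uses the node-hour price quoted above, which may have changed since. The function name `monthly_cost` is made up for this example.

```python
def monthly_cost(price_per_node_hour: float, nodes: int = 1,
                 hours_per_day: int = 24, days: int = 30) -> float:
    """Cost of running a fixed number of nodes for the given period."""
    return price_per_node_hour * nodes * hours_per_day * days


# One n1-standard-2 node running around the clock for 30 days.
print(round(monthly_cost(0.123), 2))  # 88.56
```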

What's next?

The steps we performed in this article to deploy our model to an endpoint were done manually. In a production environment, you need to automate those steps. For that, Google provides Vertex AI Pipelines.

Thanks for reading.

Your feedback and questions are highly appreciated. You can find me on Twitter @HeyerSascha or connect with me via LinkedIn. Even better, subscribe to my YouTube channel ❤️.
