CI/CD for Machine Learning Model Training with GitHub Actions
A comprehensive guide to using an EC2 instance as a server for training your Machine Learning model
Proper orchestration of a machine learning pipeline can be performed using multiple open-source tools. GitHub Actions is one of the best-known. It is a built-in GitHub tool primarily developed to automate the development, testing, and deployment of software.
Nowadays, Machine Learning practitioners use it to automate the entire workflow of their projects. These workflows are made of specific jobs that can be executed either on GitHub-hosted servers or on your own servers.
At the end of this conceptual blog, you will understand:
The benefits of hosting your own runners.
How to create an EC2 instance and configure it for the task at hand.
How to implement a machine learning workflow with GitHub actions using your own runners.
How to use DVC to store your model metadata.
How to use MLFlow to track the performance of your model.
Why use self-hosting runners?
Hosting your own runner allows you to execute jobs within a custom hardware environment with the required processing power and memory storage.
Doing so has the following benefits:
The user can easily increase or decrease the number of runners, which can be beneficial when it comes to training models in parallel.
There are no restrictions in terms of operating systems: Linux, Windows, and macOS are all supported.
When using cloud services like AWS, GCP, or Azure, the runner can benefit from all the services depending on the subscription level.
How to configure your AWS EC2 instance?
To complete this section, we need to perform two main tasks. First, get the AWS and GitHub credentials, then set up the encrypted secrets to synchronize EC2 and GitHub.
Get your AWS and Github credentials
Before starting, you first need to create an EC2 instance, which can be done by following this article. Three main credentials are required, and the process of acquiring them is detailed below.
→ Get your PERSONAL_ACCESS_TOKEN from your Github account. This token is used as an alternative to a password and is required to interact with the Github API.
→ The ACCESS_KEY_ID and SECRET_ACCESS_KEY can be retrieved by following these 5 steps:
Click your username near the top right.
Select the security credentials tab.
Select Access keys (access key ID and secret access key).
Create a new access key.
Click Show Access Key to view your Access Key ID & Secret Access Key.
Illustration of the 5 steps to get your AWS Access Key ID and Secret Access Key (Image by Author)
Set up encrypted secrets for synchronization
This step is performed from your project repository on Github. All the steps are described in the Encrypted Secrets section of the following article by Khuyen Tran.
In the end, you should have something similar to this:
Environment secrets added to Github secrets (Image by Author)
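Once added, these secrets are available to any workflow in the repository through the `secrets` context. A minimal fragment showing how they can be mapped to environment variables (the step name and the mapping below are illustrative, not the article's exact workflow):

```yaml
# Fragment of a workflow step reading the encrypted secrets.
# The secret names must match the ones added in the repository settings.
steps:
  - name: Load credentials from encrypted secrets
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.SECRET_ACCESS_KEY }}
      REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
    run: echo "Credentials loaded"
```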
Implement the machine learning workflow with GitHub actions and your self-hosted EC2 runner.
The machine learning task covered in this section is a 3-class classification using the BERT model. It is performed with the following workflow:
Data Acquisition → Data Processing → Model Training & Evaluation → Model & metadata serialization.
We will start by explaining the underlying tasks performed in each step, along with their source code, before explaining how to run the model training using our self-hosted runner.
→ Data acquisition: responsible for collecting data from DVC storage.
→ Data processing: limited to special characters removal for simplicity.
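As a sketch of what this cleaning step could look like (the exact rules used in the article's script are not shown here, so the regex below is an assumption):

```python
import re

def remove_special_characters(text: str) -> str:
    """Replace everything except letters, digits, and whitespace
    with a space, then collapse repeated whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_special_characters("Hello, world! #NLP @2023"))
# → Hello world NLP 2023
```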
→ Model training & evaluation: creates and fine-tunes a BERT transformer model.
The model training consists of two main parts: (1) training and evaluating the model's performance, and (2) using MLFlow to track the metrics.
The training generates two main files:
model/finetuned_BERT_epoch{x}.model: corresponding to the fine-tuned BERT model generated after the epoch n°x.
metrics/metrics.json: containing the precision, recall, and F1-score of the previously generated model.
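As an illustration of how metrics.json might be produced, here is a minimal sketch (the metric values below are made up, not the article's actual results):

```python
import json
from pathlib import Path

# Hypothetical scores; the real values come from evaluating
# the fine-tuned BERT model on the test set.
metrics = {"precision": 0.91, "recall": 0.89, "f1_score": 0.90}

# Serialize the metrics so they can be versioned with DVC later.
Path("metrics").mkdir(exist_ok=True)
with open("metrics/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```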
For the model tracking, we need to acquire the credentials from the top right corner of your DagsHub project repository, as follows.
Steps to get your MLFlow credentials from your DagsHub project (Image by Author)
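Once acquired, these credentials are typically exposed to the training script through MLflow's standard environment variables. The values below are placeholders to replace with the ones shown on DagsHub:

```shell
# MLflow reads these environment variables automatically.
# Replace the placeholder values with your own DagsHub credentials.
export MLFLOW_TRACKING_URI="https://dagshub.com/<user>/<repo>.mlflow"
export MLFLOW_TRACKING_USERNAME="<your-dagshub-username>"
export MLFLOW_TRACKING_PASSWORD="<your-dagshub-token>"

echo "Tracking server: $MLFLOW_TRACKING_URI"
```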
The script below only shows the model tracking section with MLFlow because showing all the source code would be too long. However, the complete code is available at the end of the article.
→ Model & metadata serialization: responsible for saving the following metadata into DVC storage: metrics.json and finetuned_BERT_epoch{x}.model.
Train your model using an EC2 self-hosted runner.
This section focuses on firing the model training after provisioning the self-hosted runner, which is used instead of a default GitHub Actions runner.
The training workflow is implemented in .github/workflows/training.yaml
You can name the training.yaml file whatever you want. However, it must be located in the .github/workflows folder with the .yaml extension so that Github treats it as a workflow file.
Below is the general format of the training workflow:
name: the name of the workflow.
on: push and pull_request are the events responsible for triggering the entire workflow in our case.
jobs: contains the set of jobs in our workflow which are: (1) the deployment of the EC2 runner, and (2) the training of the model.
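Based on those three keys, the skeleton of training.yaml looks roughly like this (the workflow name and the `...` placeholders are illustrative; the job bodies are detailed in the next sections):

```yaml
name: train-model-workflow
on: [push, pull_request]

jobs:
  deploy-runner:
    # (1) provision the EC2 self-hosted runner
    ...
  train-model:
    # (2) run the training on that runner
    ...
```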
What are the steps within each job?
Before diving into the source code, let’s understand the underlying visual illustration of the workflow.
General workflow of the blog scope (Image by Author)
The workflow is started by a push or pull request by the developer/Machine Learning Engineer.
The training is triggered in a provisioned EC2 instance.
The metadata (metrics.json and model) is stored on DVC and the metrics values are tracked by MLFlow on DagsHub.
→ First job: deploy-runner
We start by using the Continuous Machine Learning (CML) library by Iterative to automate the server provisioning and model training. Then we acquire all the credentials required to run the provisioning.
The EC2 instance being used is:
a free-tier t2.micro instance.
located in the us-east-1a region.
tagged with the cml-runner label, which will be used to identify the instance when training the model.
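A sketch of such a job, based on the CML documentation (the step names are assumptions, and the flags should be checked against your CML version):

```yaml
deploy-runner:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3
    - uses: iterative/setup-cml@v1
    - name: Deploy runner on EC2
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.SECRET_ACCESS_KEY }}
      run: |
        cml runner \
          --cloud=aws \
          --cloud-region=us-east-1 \
          --cloud-type=t2.micro \
          --labels=cml-runner
```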
→ Second job: train-model
By using the previously provisioned runner, we are able to perform the training using the CML library.
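A minimal sketch of this job, targeting the runner by its label (the requirements file and training entry point are placeholders for the project's actual scripts):

```yaml
train-model:
  needs: deploy-runner
  runs-on: [self-hosted, cml-runner]
  steps:
    - uses: actions/checkout@v3
    - name: Train the model
      run: |
        pip install -r requirements.txt
        python train.py
```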
Save the metadata in DVC storage.
All the files generated by the training step are automatically stored on the local machine. However, we might want to keep track of all the changes to those files, which is why we use DVC in this section.
There are overall two main steps: (1) get your credentials, and (2) implement the logic to push the data.
→ Get your DVC credentials.
Getting your DVC credentials follows the same process as for MLFlow:
Steps to get your DVC credentials from your DagsHub project (Image by Author)
→ Implement the logic
First, we implement the DVC configuration and data-saving logic in the save_metadata.py file.
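The article's save_metadata.py is not reproduced here; below is a minimal sketch of what its logic could look like, shelling out to the dvc CLI (the tracked paths follow the files generated during training, and the helper name is mine):

```python
import subprocess

# Artifacts produced by the training step.
TRACKED_PATHS = ["metrics/metrics.json", "model/"]

def dvc_command(action: str, paths: list) -> list:
    """Build a dvc CLI command, e.g. ['dvc', 'add', 'metrics/metrics.json']."""
    return ["dvc", action] + paths

def save_metadata():
    # Track the new artifacts, then upload them to the DVC remote.
    subprocess.run(dvc_command("add", TRACKED_PATHS), check=True)
    subprocess.run(["dvc", "push"], check=True)

if __name__ == "__main__":
    save_metadata()
```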
Then, this file is called by the save_metadata job in the workflow.
jobs:
  ... # deploy-runner & train-model jobs
  save_metadata:
    steps:
      - name: Save metadata into DVC
        run: python save_metadata.py
Now I can commit the changes and push the code on Github to be able to perform a pull request.
git add .
git commit -m "Pushing the code for pull request experimentation"
git push -u origin main
After changing the number of epochs to 2, we get a new version of the training script, which triggers a pull request. Below is an illustration.
Pull request
The following illustration shows the model metrics before and after the pull request, with 1 epoch and 2 epochs respectively.
Conclusion
When implementing a complete machine learning pipeline with Github actions, using a self-hosted server can be beneficial in many ways as illustrated at the beginning of the article.
In this conceptual blog, we explored how to provision an EC2 instance, trigger the model training from push and pull request events, save the metadata into DVC storage, and track the model performance using MLFlow.
Mastering these skills will help you take the entire machine learning pipeline of your organization to the next level.
Also, if you like reading my stories and wish to support my writing, consider becoming a Medium member to unlock unlimited access to stories on Medium.
Feel free to follow me on Medium, Twitter, YouTube, or say Hi on LinkedIn. It is always a pleasure to discuss AI, ML, Data Science, NLP, and MLOps stuff!