avatarDaniel Warfield

Summary

The article outlines a method for leveraging Google Colab to create a temporary compute cluster for running machine learning experiments across multiple notebooks.

Abstract

The author presents a guide for utilizing Google Colab to execute distributed machine learning tasks without the need for a traditional computing cluster. The process involves defining an experiment with a custom orchestrator, writing the experiment code, creating worker notebooks that execute the main experiment code, and managing multiple worker instances. The approach is suitable for conducting hyperparameter sweeps across various models and datasets, with the orchestrator assigning tasks and aggregating results. The article emphasizes the importance of trusted code sources when using exec and os.system for running arbitrary code and notes that Google Colab allows up to six simultaneous sessions for a modest compute cluster ready on-demand.

Opinions

  • The author suggests that using multiple Colab notebooks as a cluster is a simple alternative to traditional clusters for certain compute needs.
  • W&B Sweeps is recommended for simpler use cases as an acceptable orchestrator for managing experiments.
  • The custom orchestrator is highlighted as a solution for complex modeling problems, with a reference to a previous article for further details.
  • The use of exec and os.system in the worker notebooks necessitates caution and trust in the source of the main notebook to avoid running potentially harmful code.
  • Google Colab's feature to manage sessions centrally is presented as a convenient way to oversee and control multiple worker notebooks.
  • The author expresses enthusiasm for future posts that will delve into landmark machine learning papers with practical and intuitive explanations.
  • The images created by Daniel Warfield for the article are permitted for non-commercial use, provided proper attribution is given.

Building a Cluster using Google Colab in 5 Minutes

Sometimes you need more compute, but a traditional cluster might be overkill. This is a simple way to use multiple Colab notebooks to run the same code on multiple workers.

“Note Net” by Daniel Warfield. All images by author unless otherwise specified

Step 1) Define an Experiment

In this example we’ll be conducting a hyperparameter sweep across multiple model types and databases in order to explore general performance characteristics. We’ll be using a custom orchestrator (which I discussed in another article) to tell individual workers which model should be used with which hyperparameters on which dataset. This orchestrator will also ingest results and aggregate them into a single database for further analysis.

You are likely working in a simpler use case in which W&B Sweeps will be a perfectly acceptable orchestrator for you to use. W&B sweeps are also described in the previous article.

The relationship between orchestrator and worker

Step 2) Define Experiment Code

Naturally this will change depending on your experiment, but will likely look something like this:

  1. Get API token or authentication credential
  2. Connect to orchestrator
  3. declare an experiment/sweep, or get an existing one
  4. Ask the orchestrator for a task for that experiment/sweep
  5. Execute that task
  6. Send Results
  7. Repeat steps 4–6 until the experiment is complete

In my example, I have a list of databases and models (each model having it’s own hyperparameter space) and I want to execute 60 training/testing jobs for each model-database pair with different hyperparameter values. The experiment code will be a Colab notebook that does all of those things. For the purposes of this post, we will call this document MAIN.

"""MAIN.ipynb
This is pseudo code for the notebook
"""

# Get API Keys and set up communication authentication
dataset_API_Key = readDatasetKey()
orchestrator_API_key = readOrchKey()
setupComs(dataset_API_Key, orchestrator_API_key)

#Declaring a new experiment with the orchestrator, or getting an existing experiment by the same name
exp_name = 'sample_experiment'
exp = declareExp(exp_name)

#running until experiment is complete
while exp.noDone():
  
  #getting a current task to work on
  task = exp.getTask()

  #do whatever you need to do
  task.compute()

  #send data to the cloud
  task.sendData()

  #mark the task as done
  task.done()

Step 3) Define a Worker

We have a notebook which includes all of the code necessary to initiate communication with the orchestrator, declare an experiment, get a task, run that task, and log results. Now, we can run the same exact code in other notebooks. To do this we’ll create a worker notebook which mounts google.colab.drive , reads our MAIN notebook, and executes all of the MAINnotebook’s cells.

""" Setup
mounting the drive and grabbing main
"""

from google.colab import drive
import os
drive.mount('/content/drive')

main_file = 'path/to/MAIN'

""" Running
Parsing the content to get all code cells, and running them sequentially.
"""

import json
import nbformat

for cell in nbformat.read(main_file, as_version=4).cells:
    if cell['cell_type'] == 'code':
        if len(cell['source']):
            src = cell['source']
            if src[0] == '!':
                #only works if ! is the only line in the block
                print('============= running command "{}" =============='.format(src[1:]))
                os.system(src[1:])
            else:
                try:
                    exec(src)
                except Exception as e:
                    print('---------')
                    print(src)
                    print('---------')
                    raise(e)

Naturally, with the use of exec and os.system running arbitrary code, it’s very important that you’re sure that your MAIN file is from a trusted source, preferably yourself. We’ll call this notebook the worker notebook.

Step 4) Copy and Run

Great, now that you have a MAIN notebook with your code, and a worker notebook that reads MAIN and runs the code, you can simply copy your worker notebook to create as many workers as you need!

Naturally, to get your code running, you will have to open all workers and run them. You can use Runtime > Run All to quickly run all cells in a worker node.

Luckily, Colab allows you to centrally manage your workers from here on out. If you click the RAM/disk usage on the top right of any notebook, then “manage session”, you will be able to view and terminate any active session.

Click RAM/Disk
click “Manage Session”
You can view and terminate all running workers from this menu

Based on my experimentation, it appears that Google Colab allows you to run 6 sessions simultaniously. Thats a modest amount of compute ready to go at a moments notice!

Follow For More!

In future posts I’ll be describing several landmark papers in the ML space, with an emphasis on practical and intuitive explanations.

Attribution: All of the images in this document were created by Daniel Warfield, unless a source is otherwise provided. You can use any images in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both.

Data Science
Machine Learning
AI
Recommended from ReadMedium