Egor Howell

Summary

This text provides a guide for data scientists on how to use Make and Makefiles to optimize their machine learning pipeline.

Abstract

The text begins by explaining the background of Make, a powerful Linux command used by developers and increasingly adopted by data scientists. It highlights the benefits of using Make for data science, such as automating the setup of machine learning environments, clearer end-to-end pipeline documentation, easier testing of models with different parameters, and a more obvious structure and execution of projects. The text then explains what a Makefile is, its components, and provides a basic example. The author then describes a machine learning pipeline using Make and Makefiles, based on a previous project to forecast US airline passengers using an ARIMA model. The pipeline includes three stages: read_clean_data.py, model.py, and analysis.py. The text provides a walkthrough of each stage and the corresponding Makefile. The author concludes by summarizing the key takeaways and providing references for further reading.

Bullet points

  • Make is a powerful Linux command used by developers and increasingly adopted by data scientists.
  • Make can be used to simplify workflows and break them down into logical groupings of shell commands.
  • A Makefile is a file that Make commands read and execute from, consisting of targets, dependencies, and commands.
  • Make and Makefiles can be used to build efficient machine learning pipelines.
  • The benefits of using Make for data science include automating the setup of machine learning environments, clearer end-to-end pipeline documentation, easier testing of models with different parameters, and a more obvious structure and execution of projects.
  • The text provides a walkthrough of a machine learning pipeline using Make and Makefiles, based on a previous project to forecast US airline passengers using an ARIMA model.
  • The pipeline includes three stages: read_clean_data.py, model.py, and analysis.py.
  • The author concludes by summarizing the key takeaways and providing references for further reading.

A Data Scientist’s Guide to Make and Makefiles

How to use Make and Makefiles to optimise your machine learning pipeline

Photo by Nubelson Fernandes on Unsplash

Background

Data Scientists are now expected to write production code to deploy their machine learning algorithms. Therefore, we need to be aware of software engineering standards and methods to ensure our models are deployed robustly and effectively. One such tool that is very well known in the developer community is make. This is a powerful command-line tool that has been available to developers on UNIX-like systems such as Linux for a long time, and in this article I want to show how it can be used to build efficient machine learning pipelines.

What Is Make?

make is a terminal command, just like ls or cd, that is available on most UNIX-like operating systems such as macOS and Linux.

make is used to simplify your workflow and break it down into logical groupings of shell commands.

It is used widely by developers and is also being adopted by Data Scientists as it simplifies the machine learning pipeline and enables more robust production deployment.

Why Make For Data Science?

make is a powerful tool that Data Scientists should be utilising for the following reasons:

  • Automate the setup of machine learning environments
  • Clearer end-to-end pipeline documentation
  • Easier to test models with different parameters
  • Obvious structure and execution of your project

What Is A Makefile?

A Makefile is the file that the make command reads its instructions from. Each rule in it has three components:

  • Targets: The files you are trying to build, or a PHONY target if you are just carrying out commands.
  • Dependencies: Files or other targets that must exist and be up to date before this target is built.
  • Commands: As it says on the tin, this is the list of steps to produce the target. In the Makefile, each command line must start with a tab character.
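
To see how the three components fit together, here is a minimal sketch of two rules. The file names report.txt and notes.txt are hypothetical and only serve to illustrate the layout:

```makefile
# General shape of a rule:
#
#   target: dependencies
#   <TAB>command
#
# report.txt is the target and notes.txt is its dependency.
# The command line below starts with a tab character.
report.txt: notes.txt
	sort notes.txt > report.txt

# notes.txt has no dependencies, so its command only runs
# when the file does not exist yet.
notes.txt:
	echo "some notes" > notes.txt
```

Running make report.txt in an empty directory would first create notes.txt, because report.txt depends on it, and then build report.txt.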

Basic Example

Let’s run through a very simple example to make this theory concrete.

Below is a Makefile that has the target hello with the command echo to print 'Hello World' to the screen and it has no dependencies:

# Define our target as PHONY as it does not generate files
.PHONY: hello

# Define our target
hello:
	echo "Hello World!"

We can run this by simply executing make hello in the terminal which will give the following output:

echo "Hello World!"
Hello World!

It essentially just listed and carried out the command. This is the essence of make; there is nothing too complicated going on.

Notice that we declared the target hello as .PHONY because it doesn’t produce a file. Declaring a target as .PHONY tells make to run its commands every time, even if a file with the same name happens to exist, so only use it for targets that don’t spit out a file.

We can add an @ symbol before the echo command if we don’t want the command itself printed to the screen, only its output.
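
As a small sketch, this is the same Makefile with the @ prefix added to the command:

```makefile
# Define our target as PHONY as it does not generate files
.PHONY: hello

# The @ prefix stops make from printing the command before running it
hello:
	@echo "Hello World!"
```

With this version, make hello only prints Hello World! and no longer lists the echo command first.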

We can add another target in the Makefile to generate a file:

# Define some targets as PHONY as they do not generate files
.PHONY: hello

# Define our target
hello:
	echo "Hello World!"

# Define our target to generate a file
data.csv:
	touch data.csv

To run the data.csv target, we execute make data.csv:

touch data.csv

You should now see a data.csv file in your current directory.
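
File targets become more useful once they have dependencies. The sketch below adds a second target that depends on data.csv. The name row_count.txt is hypothetical and not part of the original example:

```makefile
# data.csv has no dependencies: the command runs only when the file is missing
data.csv:
	touch data.csv

# row_count.txt depends on data.csv, so make builds data.csv first and
# re-runs this command whenever data.csv is newer than row_count.txt
row_count.txt: data.csv
	wc -l < data.csv > row_count.txt
```

If you run make row_count.txt twice in a row, the second run does nothing and make reports that the target is up to date. If data.csv is modified afterwards, the next run rebuilds only row_count.txt.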

Machine Learning Pipeline

Overview of a Pipeline

Below is an example pipeline for a machine learning project that we will construct using make and a Makefile. It is based on a previous project where I built an ARIMA model to forecast US airline passengers.

Diagram by author.

So, the read_clean_data.py file will load in and make the time series data stationary. The model.py file will fit an ARIMA model to the cleaned data. Finally, the analysis.py file will compute the performance of our forecast.

Another key thing to notice here is the dependency between files. analysis.py can’t run until model.py has been executed, and model.py in turn needs the output of read_clean_data.py. This is where the dependencies in the Makefile become useful.

Walkthrough

Below is our first file read_clean_data.py:

Data from Kaggle with a CC0 licence.

Here we read our US airline data and make it stationary through differencing and the Box-Cox transform and save it to a file in the local directory called clean_data.csv.

Then, we have the model.py file:

And finally, we have our analysis file, analysis.py:

We can then write the following Makefile for our three-stage pipeline:

.PHONY: all read_clean_data model analysis

all: analysis

read_clean_data:
	python read_clean_data.py

model: read_clean_data
	python model.py

analysis: model
	python analysis.py

.PHONY: clean
clean:
	rm -f clean_data.csv lam.pickle train_data.csv test_data.csv forecasts.csv

Notice how each step declares the previous one as a dependency, which ensures we have the correct files to carry out every step. We have also added the clean target to remove the generated files if needed.
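
Because every stage above is a PHONY target, make re-runs all three scripts each time. A possible variation is to use the generated files as targets so that make can skip steps whose outputs are already up to date. This is only a sketch: the article states that read_clean_data.py writes clean_data.csv, but the assumption that model.py writes forecasts.csv is a guess based on the file names in the clean target.

```makefile
.PHONY: all analysis clean

all: analysis

# Re-run the cleaning step only if the script changed or the output is missing
clean_data.csv: read_clean_data.py
	python read_clean_data.py

# Assumption: model.py writes forecasts.csv.
# Re-run the model only if its script or the cleaned data is newer.
forecasts.csv: model.py clean_data.csv
	python model.py

# analysis stays PHONY because the file it produces (if any) is not stated
analysis: analysis.py forecasts.csv
	python analysis.py

clean:
	rm -f clean_data.csv lam.pickle train_data.csv test_data.csv forecasts.csv
```

With this layout, editing only analysis.py and running make all would re-run the analysis without refitting the model.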

The whole pipeline can be run with the make all command, and the output will look like this:

python read_clean_data.py
python model.py
python analysis.py

It will also generate the following plot:

Plot generated by author in Python.

As you can see, the Makefile pipeline worked and the forecasts look pretty good!
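
Two of the benefits listed earlier, automating environment setup and testing models with different parameters, can be sketched with Makefile variables. This is a standalone illustration rather than part of the original project: the PYTHON and ORDER variables, the requirements.txt file, and the --order argument to model.py are all assumptions.

```makefile
# Illustrative variables; ?= only sets a value if none was given already
PYTHON ?= python
ORDER ?= 1,1,1

.PHONY: setup model

# Assumes a requirements.txt file sits next to the Makefile
setup:
	$(PYTHON) -m pip install -r requirements.txt

# Assumes model.py has been extended to read an --order argument
model:
	$(PYTHON) model.py --order $(ORDER)
```

Variables can be overridden on the command line, so make model ORDER=2,1,2 would run the same target with a different ARIMA order without editing the Makefile.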

Summary & Further Thoughts

That’s it! I hope you enjoyed this short tutorial on make and Makefiles. Of course, there are more complex and fancier things you can do with these tools, but this post can serve as your starting point. The key things to remember are:

  • make is a UNIX command that automates the running of certain workflows
  • A Makefile lets us define several targets, along with their dependencies and commands, to automate the machine learning pipeline

The full code used in this article is available on my GitHub here:
