A Data Scientist’s Guide to Make and Makefiles
How to use Make and Makefiles to optimise your machine learning pipeline
Background
Data Scientists are now expected to write production code to deploy their machine learning algorithms, so we need to be aware of software engineering standards and methods that ensure our models are deployed robustly and effectively. One such tool, very well known in the developer community, is `make`. It is a powerful Linux command that developers have relied on for a long time, and in this article I want to show how it can be used to build efficient machine learning pipelines.
What Is Make?
`make` is a terminal command/executable, just like `ls` or `cd`, that is available in most UNIX-like operating systems such as macOS and Linux.
The purpose of `make` is to simplify your workflow by breaking it down into a logical grouping of shell commands.
It is used widely by developers and is also being adopted by Data Scientists as it simplifies the machine learning pipeline and enables more robust production deployment.
Why Make For Data Science?
`make` is a powerful tool that Data Scientists should be utilising for the following reasons:
- Automate the setup of machine learning environments
- Clearer end-to-end pipeline documentation
- Easier to test models with different parameters
- Obvious structure and execution of your project
What Is A Makefile?
A `Makefile` is the file that the `make` command reads and executes from. It has three components:

- Targets: the files you are trying to build, or a `.PHONY` target if you are just carrying out commands.
- Dependencies: files that the target depends on, which must exist and be up to date before the target's commands are executed.
- Commands: as it says on the tin, the list of steps that produce the target.
Basic Example
Let’s run through a very simple example to make this theory concrete.
Below is a `Makefile` with the target `hello`, which uses the command `echo` to print `Hello World!` to the screen, and has no dependencies:
```makefile
# Define our target as PHONY as it does not generate files
.PHONY: hello

# Define our target
hello:
	echo "Hello World!"
```
We can run this by simply executing `make hello` in the terminal, which gives the following output:
```
echo "Hello World!"
Hello World!
```
It essentially just listed and carried out the command. This is the essence of `make`: there is nothing too complicated going on.
Notice that we declared the target `hello` as `.PHONY`, as it doesn't produce a file. That is the meaning behind `.PHONY`: only use it for targets that don't spit out a file.
We can add an `@` symbol before the `echo` command if we don't want the command itself to be printed to the screen.
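For example, a minimal variant of the target above with command echoing suppressed:

```makefile
.PHONY: hello

hello:
	@echo "Hello World!"
```

Running `make hello` now prints only `Hello World!`, without repeating the command first.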
We can add another target to the `Makefile` to generate a file:
```makefile
# Define some targets as PHONY as they do not generate files
.PHONY: hello

# Define our target
hello:
	echo "Hello World!"

# Define our target to generate a file
data.csv:
	touch data.csv
```
To run the `data.csv` target, we execute `make data.csv`:
```
touch data.csv
```
And you should notice a `data.csv` file in your local directory.
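A useful property of file targets is that `make` compares timestamps and skips work that is already done. A quick sketch, assuming GNU `make` is installed (the scratch directory name is arbitrary):

```shell
# Create a throwaway directory with a one-target Makefile
mkdir -p make-demo && cd make-demo
printf 'data.csv:\n\ttouch data.csv\n' > Makefile

make data.csv   # first run: executes "touch data.csv"
make data.csv   # second run: "make: 'data.csv' is up to date."
```

This is also why only non-file targets should be marked `.PHONY`: declaring `data.csv` as phony would force its command to run every single time.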
Machine Learning Pipeline
Overview of a Pipeline
Below is an example pipeline for a machine learning project that we will construct using `make` and a `Makefile`. It is based on a previous project where I built an ARIMA model to forecast US airline passengers. You can check out more about it here:
So, the `read_clean_data.py` file will load the time series data and make it stationary. The `model.py` file will fit an ARIMA model to the cleaned data. Finally, the `analysis.py` file will compute the performance of our forecast.
Another key thing to notice here is the dependency between the files: `analysis.py` can't run unless `model.py` has been executed. This is where the dependencies in the `Makefile` become useful.
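To make the dependency chain concrete, here is a minimal sketch of how such a `Makefile` could be laid out. The script names come from the pipeline above, but the intermediate file names (`clean_data.csv`, `model.pkl`) are assumptions for illustration:

```makefile
# Hypothetical pipeline Makefile -- intermediate file names are illustrative

# Cleaned data depends on the cleaning script
clean_data.csv: read_clean_data.py
	python read_clean_data.py

# The fitted model depends on the cleaned data
model.pkl: model.py clean_data.csv
	python model.py

# Analysis depends on the fitted model; it produces no file, so it is PHONY
.PHONY: analysis
analysis: analysis.py model.pkl
	python analysis.py
```

Running `make analysis` walks the whole chain, and `make` re-runs only the steps whose inputs have changed.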
Walkthrough
Below is our first file, `read_clean_data.py`:
Data from Kaggle with a CC0 licence.