Gabriel Harris Ph.D.

Python HOW: Using Poetry, Make, and pre-commit-hooks to Setup a Repo Template for your Team

Bring consistency, rigour, and best practices to your messy data science team

Photo by Scott McNiel from Pexels

Last update 05 Aug 2022

If part of your job is to constantly poke your fellow data scientists to isolate project environments, update requirements, clean code, write consistent docstrings, etc., then you should definitely follow along 💊

All the work shown here is for Windows and PowerShell (PS) but you can adapt it for Mac and your favourite Command Line Interface (CLI)

Template is available on GitHub

1. Prerequisites

You need to have Python, Make, and Poetry installed on your machine

Install Python 🐍

Download the latest Python 3.9 release for Windows. Select Customize installation and mark py launcher for installation

You can now use the py launcher in the CLI to list all installed versions of Python with py -0 (you can have as many as you like), and to run a specific version with, for example, py -3.9

Install Make 🐐

We will be using Make from the GNU Project to set up and manage our repo using a Makefile. Think of Make as a tool for automating processes

To install Make for Windows, first install Chocolatey, then use it to install make (choco install make). Open a new CLI and check that make is working by running make --version

Install Poetry 👨‍🏭

We use Poetry to manage the project virtual environment and resolve dependencies. Install Poetry as described here. Open a new CLI and check that Poetry is working by running poetry --version
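
If you would rather check all three prerequisites in one go, here is a minimal Python sketch. It assumes the executables are named py, make, and poetry on your PATH (the names used in this article); the helper name missing_tools is only for illustration.

```python
import shutil

# Executable names assumed from this article: the Windows py launcher, GNU Make, Poetry.
REQUIRED_TOOLS = ("py", "make", "poetry")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisites were found on PATH")
```

Remember to run it from a new CLI, since PATH changes made by an installer are not picked up by shells that were already open.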

2. Initial setup

Clone Repo 🤡

Clone the template locally, and copy it (without the .git folder) to your newly created project’s repo (which has its own .git). The template has the following structure:

Notes on using Cookiecutter are at the end

Run setup 🔨

Make sure you don’t have any virtual environment activated in the CLI. Run the setup target using make (make setup), and you are done!

Three things have happened!

  1. 💻 An isolated .venv is created in the project’s directory
  2. 📦 Some packages are installed in .venv
  3. 🧷 The pre-commit hooks are installed

Let’s look into these things in detail 👇

3. Makefile 📜

make looks for a Makefile in the project’s root that contains a set of rules to run. Each rule has 3 parts: a target, a list of prerequisites, and a recipe in the following format:
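
The original snippet is not reproduced here. As a minimal sketch of the general shape of a rule (the names are placeholders, not taken from the template):

```make
# General shape of a rule. The recipe line must start with a tab character, not spaces.
target: prerequisite-1 prerequisite-2
	recipe command
```

make runs the recipe for a target only after all of its prerequisites are satisfied.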

This is what our Makefile looks like:

The setup target on Makefile > line 8 doesn’t have a recipe. Instead, it has 3 prerequisites that are themselves make targets, so setup is simply a target that runs other targets. Let’s have a look at these 3 targets 👇
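
As a sketch of what such a recipe-less rule looks like (the target names venv, install, and pre-commit are the ones described in the sections that follow; the template’s actual Makefile may differ in detail):

```make
# setup has no recipe of its own; running "make setup" runs its prerequisites in the order listed
setup: venv install pre-commit
```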

3.1 💻Create virtual environment

The venv target in Makefile > line 10 has one prerequisite $(GLOBAL_PYTHON) which is the value of a variable defined earlier in Makefile > line 4. The variable GLOBAL_PYTHON grabs the full path to the python interpreter which we installed earlier. If the prerequisite interpreter path doesn’t exist, you will get an error when running the venv target

Makefile > line 12 is where poetry creates an isolated .venv folder in the project’s root using the interpreter full path. To make sure .venv is created in the root directory of the project, the following configuration is added in the poetry.toml 📃 (where all poetry configurations go):
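
The setting in question is Poetry’s virtualenvs.in-project option. A minimal sketch of the file, assuming nothing else is configured in it:

```toml
# poetry.toml (project-local Poetry configuration)
[virtualenvs]
in-project = true
```

Running poetry config virtualenvs.in-project true --local in the project’s root writes the same entry for you.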

To understand how poetry manages environments check 🔗

3.2 📦Install dependencies

The install target in Makefile > line 14 has one prerequisite $(LOCAL_PYTHON) which is the value of a variable defined earlier in Makefile > line 5. The variable LOCAL_PYTHON checks if there is a path to a python interpreter in .venv. If the prerequisite interpreter path doesn’t exist, you will get an error when running the install target

Makefile > line 16 is where poetry installs the project’s dependencies found in the pyproject.toml file. This is what our pyproject.toml looks like:

Poetry separates packages into dependencies pyproject.toml > line 7 and development dependencies pyproject.toml > line 11. When Poetry has finished installing all packages in .venv, it writes their exact versions to a poetry.lock file that you should commit to the project’s repo 🔗 so that the team working on the project is locked to the same versions of dependencies 🔗

Our packages have different version constraints. For example, "*" allows any version (so Poetry resolves to the latest compatible one), while "^1" means >=1.0.0 <2.0.0. To understand dependency specification, check 🔗
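
To make the caret rule concrete, here is a small Python sketch that expands a caret constraint into its lower and upper bounds. It illustrates the documented caret semantics (the left-most non-zero component may not change) and is not Poetry’s own implementation; the function name caret_range is hypothetical.

```python
def caret_range(constraint):
    """Expand a caret constraint such as "^1.2" into (lower, upper) version strings.

    The allowed range is lower <= version < upper.
    """
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^1.2.3'")
    given = [int(part) for part in constraint[1:].split(".")]
    if not 1 <= len(given) <= 3:
        raise ValueError("expected one to three numeric components")

    # Pad the lower bound to major.minor.patch.
    lower = given + [0] * (3 - len(given))

    # Bump the left-most non-zero component; if every given component is zero,
    # bump the last one that was written (so "^0.0" allows < 0.1.0).
    bump = next((i for i, part in enumerate(given) if part != 0), len(given) - 1)
    upper = lower[:bump] + [lower[bump] + 1] + [0] * (2 - bump)

    return ".".join(map(str, lower)), ".".join(map(str, upper))
```

For example, caret_range("^1") returns ("1.0.0", "2.0.0"), matching the range given above, while caret_range("^0.2.3") returns ("0.2.3", "0.3.0").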

Included Dev Packages 📦📦📦

These are the dev packages I’m currently using for our team:

Black

Black is “the uncompromising Python code formatter” and follows “a strict subset of PEP 8 coding style”. Black has a very opinionated code style 🔗. It defaults to 88 characters per line 🔗. If you would rather change that, set a different number for the line-length option in pyproject.toml > line 26
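
As a sketch, the relevant pyproject.toml section looks like this. The value 100 is only an illustration, not the template’s setting:

```toml
[tool.black]
line-length = 100  # illustrative value; Black's default is 88
```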

Flake8

Flake8 is a code linter that warns you of syntax errors, possible bugs, stylistic errors, etc. “There are a few deviations that cause incompatibilities with black.” To fix this, we can pass a few options in the .flake8 📃 to make flake8 consistent with black (flake8 has not yet adopted pyproject.toml 📃)
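
A minimal sketch of such a .flake8 file, based on the options Black’s documentation recommends for flake8 compatibility (the template’s own file may contain more):

```ini
[flake8]
# Match Black's default line length
max-line-length = 88
# E203 (whitespace before ':') conflicts with how Black formats slices
extend-ignore = E203
```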

iSort

isort sorts imports alphabetically and automatically separates them into sections and by type. “Black also formats imports, but in a different way from isort defaults which leads to conflicting changes” 😵. To fix this, we can tell isort to use black as its profile option in pyproject.toml > line 29

More details on using isort with black 🔗 and a full list of isort CLI flags 🔗
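
As a sketch, the corresponding pyproject.toml section is:

```toml
[tool.isort]
profile = "black"
```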

nbStripOut

nbstripout strips the output from jupyter and ipython notebooks

PyDocStyle

Pydocstyle is a static analysis tool for checking compliance with Python docstring conventions. Three conventions are available: pep257, numpy, and google. The pep257 convention is enabled by default. To change it, use the convention option in pyproject.toml > line 32. You can also ignore specific error codes (e.g., missing docstrings) by using the add-ignore option in pyproject.toml > line 33

More details on supported conventions 🔗 and error codes 🔗
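
As a sketch, a pydocstyle section in pyproject.toml could look like this. Both values are illustrative assumptions, not the template’s actual settings:

```toml
[tool.pydocstyle]
convention = "google"  # one of pep257, numpy, google
add-ignore = "D100"    # D100 = missing docstring in public module
```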

Notebook

Your beloved classic Jupyter notebook

Rich

Rich renders pretty tables, progress bars, markdown, syntax highlighted source code, tracebacks, and more in the terminal. For a video introduction check 🔗

Pre-commit

To be able to install the pre-commit hooks in the next step, first you need to install the pre-commit framework

What are these hooks for? Every time you commit a code change, the hooks are run on it to automatically point out issues (e.g. is it black compliant?). Pointing issues out before a code review allows reviewers to focus on the architecture of a change instead of wasting time on trivial style nitpicks

3.3 🧷Install pre-commit hooks

The pre-commit target in Makefile > line 18 has one additional prerequisite $(LOCAL_PRE_COMMIT) which is the value of a variable defined earlier in Makefile > line 6. The variable LOCAL_PRE_COMMIT checks if there is a path to the pre-commit package in .venv

Makefile > line 20 is where pre-commit installs a git hook script in .git/hooks/pre-commit if it finds the configuration file .pre-commit-config.yaml ⚙️ (which we have 🤘)

The .pre-commit-config.yaml defines the hooks to use. Each hook has the following format:

For example, to add a hook to pydocstyle, go to the package repo, look for .pre-commit-hooks.yaml to find the hook id, look for the release you would like to use, and copy the details over to your .pre-commit-config.yaml ⚙️ with any additional arguments:
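
The original snippet is not reproduced here. As a sketch, a pydocstyle entry in .pre-commit-config.yaml could look like the following. The repo URL is pydocstyle’s GitHub repository; the pinned release and the extra argument are assumptions for illustration:

```yaml
repos:
  - repo: https://github.com/PyCQA/pydocstyle
    rev: 6.1.1  # assumption: pin whichever release tag you want to use
    hooks:
      - id: pydocstyle  # hook id as listed in the repo's .pre-commit-hooks.yaml
        args: ["--convention=google"]  # assumption: optional additional arguments
```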

Included hooks 🧷🧷🧷

These are the hooks I’m currently using for our team:

You can find a list of supported hooks 🔗

Now, pre-commit will run automatically when you git commit and you will see the following git output:

Some hooks will fix your code, while other hooks will only point out an issue that you need to fix

3.4 Cleaning up

The clean target in Makefile > line 20 cleans up your project by:

  • Removing the directory .git\hooks if it exists
  • Removing the directory .venv if it exists
  • Removing the file poetry.lock if it exists
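
The template does this in a Makefile recipe. For readers who prefer to see the logic spelled out, here is an equivalent Python sketch of the three steps above; the function name clean and its return value are illustrative assumptions.

```python
import shutil
from pathlib import Path


def clean(project_root="."):
    """Remove .git/hooks, .venv and poetry.lock under project_root if they exist.

    Returns the relative paths that were actually removed.
    """
    root = Path(project_root)
    removed = []

    # Directories are removed recursively, but only when present.
    for directory in (root / ".git" / "hooks", root / ".venv"):
        if directory.is_dir():
            shutil.rmtree(directory)
            removed.append(directory.relative_to(root).as_posix())

    # The lock file is a single file, so unlink is enough.
    lock_file = root / "poetry.lock"
    if lock_file.is_file():
        lock_file.unlink()
        removed.append(lock_file.name)

    return removed
```

Calling clean() a second time is harmless: nothing is left to remove, so it returns an empty list.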

🧁 Bonus 🧁

Managing packages

To add a package to the project’s dependencies or dev dependencies:

To remove a package from the project’s dependencies or dev dependencies:

These commands will automatically update the pyproject.toml and the poetry.lock files
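
The original commands are not reproduced here. They are built on poetry add and poetry remove; dev dependencies are marked with --dev in Poetry 1.1 and with --group dev from Poetry 1.2 onwards. The Python sketch below just assembles those command lines, for example to pass to subprocess.run; the helper name poetry_command and its parameters are hypothetical.

```python
def poetry_command(action, package, dev=False, use_groups=True):
    """Build the argument list for adding or removing a package with Poetry.

    use_groups=True targets Poetry 1.2+ ("--group dev");
    use_groups=False targets Poetry 1.1 ("--dev").
    """
    if action not in {"add", "remove"}:
        raise ValueError("action must be 'add' or 'remove'")
    command = ["poetry", action]
    if dev:
        command += ["--group", "dev"] if use_groups else ["--dev"]
    command.append(package)
    return command
```

For example, poetry_command("add", "black", dev=True) gives the arguments for poetry add --group dev black, and poetry_command("remove", "rich") gives those for poetry remove rich.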

VSCode integration 💏🏻

If you’re using VSCode, you can integrate some of the dev packages to run automatically on your code on save 🔥

To do this, add the following to the workspace settings (overrides user settings 🔗):
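
The original settings are not reproduced here. As a sketch, a .vscode/settings.json along these lines worked with the VSCode Python extension at the time of writing (2022). Treat the keys as assumptions to verify against your extension version, since later releases moved formatting and linting into separate extensions:

```json
{
  "editor.formatOnSave": true,
  "editor.codeActionsOnSave": {
    "source.organizeImports": true
  },
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true
}
```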

Whenever you save a script (Ctrl + S), it will be formatted with black, linted with flake8, and its imports sorted with isort 🔥🔥🔥

Using Cookiecutter

You could use Cookiecutter if you have a monolithic repo (i.e., one repo for all your projects), as the structure will be:

To do this, change the following in pyproject.toml:

And the top level folder name in the template:

However, this will look ugly for isolated repos (i.e., one repo per project) 💩, such as:

Double (Triple?) Guards

If you are wondering: but Gabriel, we have installed dev packages, added hooks for a few of them, and added VSCode integration. Isn’t this excessive?

YES IT IS! 😈😈😈

What is next?

  • An Agile-Waterfall Hybrid framework that works for a data science team (coming soon)?

Happy coding!
