Docker for Data Science Projects: A Beginner-Friendly Introduction

Elevate Your Data Science Workflow: Harness Docker’s Power for Seamless Project Management

When shipping your machine learning code to the engineering team, encountering compatibility issues with different operating systems and library versions can be frustrating.

Docker addresses this by packaging your code together with the libraries and system tools it depends on, so it runs the same way regardless of the underlying setup.

In this tutorial, we will introduce Docker’s essential concepts, guide you through installation, walk through ten basic commands, and dockerize a small machine learning application step by step.

Docker for Data Science Projects: A Beginner-Friendly Introduction / Image by Author

Table of Contents:

1. Introduction to Docker
1.1. Docker vs Containers vs Docker Images
1.2. Importance of Docker for Data Scientists
2. Getting Started with Docker
2.1. Installing Docker on Your Machine
2.2. 10 Docker Basic Commands
3. Dockerizing a Machine Learning Application
3.1. Defining the environment
3.2. Write a Dockerfile
3.3. Build the Image

If you want to study Data Science and Machine Learning for free, check out these resources:

Free interactive roadmaps to learn Data Science and Machine Learning by yourself. Start here: https://aigents.co/learn/roadmaps/intro
The search engine for Data Science learning resources (FREE). Bookmark your favorite resources, mark articles as complete, and add study notes. https://aigents.co/learn
Want to learn Data Science from scratch with the support of a mentor and a learning community? Join this Study Circle for free: https://community.aigents.co/spaces/9010170/

Are you looking to start a career in data science and AI and need to learn how? I offer data science mentoring sessions and long-term career mentoring:

Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM

Subscribe to my newsletter To Data & Beyond to get full and early access to my articles:

To Data & Beyond | Youssef Hosni | Substack

1. Introduction to Docker

1.1. Docker vs Containers vs Docker Images

Docker is a containerization platform and runtime that helps developers build, deploy, and run containers. It uses a client-server architecture: the docker command-line client sends simple commands to a background service (the Docker daemon) through a single API, which also makes automation possible.

With Docker, developers can create containerized applications by writing a Dockerfile, which is essentially a recipe for building a container image. Docker then provides a set of tools to build and manage these container images, making it easier for developers to package and deploy their applications in a consistent and reproducible way.

A container is a lightweight and portable executable software package that includes everything an application needs to run, including code, libraries, system tools, and settings.

Containers are created from images that define the contents and configuration of the container, and they are isolated from the host operating system and other containers on the same system.

This isolation is made possible by operating-system-level virtualization and process isolation features of the kernel (on Linux, namespaces and control groups), which enable containers to share the resources of a single instance of the host operating system while providing a secure and predictable environment for running applications.

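A Docker image is a read-only template that contains everything needed to create a container: the application code, its dependencies, and the configuration for running it. Images are used to create and start new containers at runtime, and a single image can be used to start any number of containers.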

1.2. Importance of Docker for Data Scientists

Docker lets developers access the operating system's native containerization capabilities using simple commands, and automate them through an application programming interface (API). For data scientists, this means the code, libraries, and system tools behind a model can be packaged once and run the same way on a colleague's or an engineer's machine. Docker offers:

Seamless container portability: Docker containers run without modification across desktop, data center, and cloud environments.
Lightweight and granular updates: Docker encourages running one process per container, so an application can be split into several containers. This makes it possible to build an application that can continue running while one of its parts is taken down for an update or repair.
Automated container creation: Docker can automatically build a container based on application source code.
Container versioning: Docker can track versions of a container image, roll back to previous versions, and trace who built a version and how. It can even upload only the deltas between an existing version and a new one.
Container reuse: Existing containers can be used as base images — essentially like templates for building new containers.
Shared container libraries: Developers can access an open-source registry containing thousands of user-contributed containers.

2. Getting Started with Docker

Now that we have introduced Docker, let's see how we can use it for our data science projects. We will start by installing Docker on your local machine, and after that, we will introduce the basic Docker commands.

2.1. Installing Docker on Your Machine

Installing Docker on your machine is fairly easy. You can follow the instructions available in the official documentation:

Instructions to install Docker for Linux.
Instructions to install Docker for Windows.
Instructions to install Docker for Mac.

It is important to note that if you would like to create your own images and push them to Docker Hub, you must create an account on Docker Hub. Think of Docker Hub as a central place where developers can store and share their Docker images.

2.2. 10 Docker Basic Commands

Now that you have installed Docker on your machine, let's explore some of the basic Docker commands that you should be familiar with.

1. docker run: The “docker run” command is used to create and start a new container based on a Docker image. Here’s the basic syntax for running a container:

docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

OPTIONS: Additional options that can be used to customize the container’s behavior, such as specifying ports, volumes, environment variables, etc.
IMAGE: The name of the Docker image to use for creating the container.
COMMAND: (Optional) The command to be executed inside the container.
ARG: (Optional) Arguments passed to the command inside the container.

For example, to run a container based on the “ubuntu” image and execute the “ls” command inside the container, you would use the following command:

docker run ubuntu ls

This will create a new container using the “ubuntu” image and run the “ls” command, which lists the files and directories inside the container’s file system. Note that if the specified image is not available locally, Docker will automatically pull it from a Docker registry before creating the container.
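To make the argument order concrete, here is a minimal Python sketch that assembles a “docker run” command line from its parts. The helper name docker_run_argv is hypothetical and exists only for illustration; the sketch prints the command and does not execute it. The “--rm” option used in the example asks Docker to remove the container once it exits.

```python
import shlex


def docker_run_argv(image, command=None, args=(), options=()):
    """Assemble `docker run [OPTIONS] IMAGE [COMMAND] [ARG...]` as a list.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    # Options must come before the image name; anything after the image
    # is treated as the command (and its arguments) to run in the container.
    argv = ["docker", "run", *options, image]
    if command is not None:
        argv += [command, *args]
    return argv


argv = docker_run_argv("ubuntu", "ls", args=["-l", "/"], options=["--rm"])
print(shlex.join(argv))  # docker run --rm ubuntu ls -l /
```

Passing the resulting list to subprocess.run(argv, check=True) would execute it on a machine where Docker is installed.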

2. docker ps: The “docker ps” command is used to list the running containers on your Docker host. It provides information such as the container ID, the image used, the command being executed, status, and port mappings. Here’s the basic syntax:

docker ps [OPTIONS]

By default, “docker ps” only shows the running containers. If you want to see all containers, including those that are stopped or exited, you can use the “-a” option:

docker ps -a

3. docker stop: The “docker stop” command is used to stop one or more running containers. It sends a signal to the container’s main process, requesting it to stop gracefully. Here’s the basic syntax:

docker stop [OPTIONS] CONTAINER [CONTAINER...]

OPTIONS: Additional options that can be used to customize the stop behavior. For example, you can specify a timeout period with the “--time” or “-t” option to allow the container more time to stop gracefully before forcefully terminating it.
CONTAINER: The name or ID of the container(s) to stop. You can specify multiple containers separated by spaces.

For example, to stop a container with the name “my-container”, you would use the following command:

docker stop my-container

4. docker rm: The “docker rm” command is used to remove one or more stopped containers from your Docker host. It permanently deletes the specified container(s) and frees up the associated resources. Here’s the basic syntax:

docker rm [OPTIONS] CONTAINER [CONTAINER...]

OPTIONS: Additional options that can be used to customize the removal behavior. For example, you can use the “-f” or “--force” option to force the removal of a running container.
CONTAINER: The name or ID of the container(s) to remove. You can specify multiple containers separated by spaces.

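For example, to remove a stopped container with the name “my-container”, you would use the following command:

docker rm my-container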

If you want to remove multiple containers, you can list their names or IDs separated by spaces:

docker rm container1 container2 container3

5. docker images: The “docker images” command is used to list the Docker images that are available on your Docker host. It displays information about the images, such as the repository, tag, image ID, creation date, and size. Here’s the basic syntax:

docker images [OPTIONS] [REPOSITORY[:TAG]]

OPTIONS: Additional options that can be used to customize the output or filter the images. For example, you can use the “--format” option to specify a format template for the output or the “-a” or “--all” option to show all images, including intermediate image layers.
REPOSITORY: (Optional) The repository name of the image.
TAG: (Optional) The tag of the image.

By default, the “docker images” command lists all images available on your Docker host. For example:

docker images

6. docker rmi: The “docker rmi” command is used to remove one or more Docker images from your Docker host. It permanently deletes the specified image(s) from your local image cache. Here’s the basic syntax:

docker rmi [OPTIONS] IMAGE [IMAGE...]

OPTIONS: Additional options that can be used to customize the removal behavior. For example, you can use the “-f” or “--force” option to force the removal of an image, even if it’s being used by running containers.
IMAGE: The name or ID of the image(s) to remove. You can specify multiple images separated by spaces.

For example, to remove an image with the name “my-image:latest”, you would use the following command:

docker rmi my-image:latest

If you want to remove multiple images, you can list their names or IDs separated by spaces:

docker rmi image1 image2 image3

7. docker build: The “docker build” command is used to build a Docker image from a Dockerfile. It allows you to define the instructions and dependencies required to create a customized image. Here’s the basic syntax:

docker build [OPTIONS] PATH | URL | -

OPTIONS: Additional options that can be used to customize the build process. Some commonly used options include “-t” or “--tag” to specify the name and optional tag for the image, “-f” or “--file” to specify the Dockerfile’s location, and “--build-arg” to pass build-time variables to the Dockerfile.
PATH | URL | -: The path to the directory containing the Dockerfile, a URL to a Git repository, or “-” to build from the standard input.

For example, to build an image using a Dockerfile located in the current directory and tag it as “my-image:latest”, you would use the following command:

docker build -t my-image:latest .

The “.” sets the build context to the current directory; by default, Docker looks for a file named Dockerfile at the root of that context.

8. docker exec: The “docker exec” command is used to execute a command inside a running Docker container. It allows you to run commands interactively or in a detached mode. Here’s the basic syntax:

docker exec [OPTIONS] CONTAINER COMMAND [ARG...]

OPTIONS: Additional options that can be used to customize the execution behavior. Some commonly used options include “-i” or “--interactive” to keep STDIN open for interactive commands, “-t” or “--tty” to allocate a pseudo-TTY, and “-d” or “--detach” to run the command in the background.
CONTAINER: The name or ID of the container where the command should be executed.
COMMAND: The command to be executed inside the container.
ARG: (Optional) Arguments passed to the command inside the container.

For example, to execute the “ls” command inside a container named “my-container”, you would use the following command:

docker exec my-container ls

This will run the “ls” command inside the specified container and display the list of files and directories.

If you want to run an interactive command, such as starting a shell inside the container, you can use the “-it” options together:

docker exec -it my-container bash

This will start an interactive shell session inside the container, allowing you to execute multiple commands interactively.

9. docker pull: The “docker pull” command is used to download Docker images from a Docker registry, such as Docker Hub. It retrieves the specified image or images and saves them to your local image cache. Here’s the basic syntax:

docker pull [OPTIONS] IMAGE[:TAG]

OPTIONS: Additional options that can be used to customize the pull process. Some commonly used options include “--all-tags” to pull all available tags for an image, “--platform” to specify the platform for which to pull the image, and “--quiet” to suppress the progress output.
IMAGE: The name of the image to pull from the Docker registry. It can be in the format “repository/image” or “repository/image:tag”. If the tag is not specified, “latest” is used by default.

For example, to pull the latest version of the “ubuntu” image from Docker Hub, you would use the following command:

docker pull ubuntu

If you want to pull a specific tagged version of the image, you can specify the tag:

docker pull ubuntu:20.04

The specified image will be downloaded from the Docker registry and saved to your local image cache. Once the image is pulled, you can use it to create and run containers on your Docker host.

10. docker push: The “docker push” command is used to upload Docker images to a Docker registry, such as Docker Hub or a private registry. It allows you to share your locally built or modified images with others. Here’s the basic syntax:

docker push [OPTIONS] NAME[:TAG]

OPTIONS: Additional options that can be used to customize the push process. Some commonly used options include “--all-tags” to push all tags for an image, “--disable-content-trust” to skip content trust verification, and “--quiet” to suppress the progress output.
NAME: The name of the image to push. It should include the account (or registry) and the repository name. For example, “username/repository”.
TAG: (Optional) The tag of the image to push. If not specified, the “latest” tag is used by default.

Before pushing an image, you need to ensure that you are authenticated to the Docker registry. You can log in to the registry using the “docker login” command, providing your username, password, and registry URL if necessary.

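For example, to push an image named “my-image” with the “latest” tag to Docker Hub, first make sure the image name includes your Docker Hub username (a local image can be given such a name with docker tag my-image:latest username/my-image:latest). Then, assuming you are logged in to Docker Hub, you would use the following command: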

docker push username/my-image:latest

The specified image will be uploaded to the Docker registry and made available for others to download and use.
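Both “docker pull” and “docker push” fall back to the “latest” tag when none is given. As a rough illustration of that naming rule, the Python sketch below splits an image reference into its name and tag. The function split_image_reference is hypothetical, it is not how Docker itself parses references, and it deliberately ignores digest references of the form name@sha256:...

```python
def split_image_reference(reference):
    """Split 'name[:tag]' into (name, tag), defaulting the tag to 'latest'.

    Hypothetical helper for illustration; digest references are not handled.
    """
    # A tag can only appear after the last "/", so a colon earlier in the
    # string (for example a registry port) is not mistaken for a tag.
    last_segment = reference.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = reference.rsplit(":", 1)
        return name, tag
    return reference, "latest"


for ref in ["ubuntu", "ubuntu:20.04", "username/my-image:latest"]:
    print(ref, "->", split_image_reference(ref))
```

The check on the last path segment is the one design choice worth noting: it keeps a reference such as localhost:5000/my-image (a registry address with a port, used here only as an example) from being split at the port number.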

3. Dockerizing a Machine Learning Application

To dockerize a machine learning application there are three main steps:

Create a requirements.txt file
Write a Dockerfile
Build the Docker image

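Let's take a simple machine learning application and go through a step-by-step guide on how to dockerize it. The application below trains a simple classification model (logistic regression) on the iris dataset. Save it as iris_classification.py, the file name that the Dockerfile in section 3.2 refers to.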

# Load the libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data (fixed seed so the reported accuracy is repeatable)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
# (max_iter raised above the default of 100 to give the solver room to converge)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Print the accuracy of the model
# (stored under a new name so the accuracy_score function is not overwritten)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

3.1. Defining the environment

The first step is to define the current environment precisely so that it can be replicated in another location. The most straightforward way is to create a requirements.txt file that lists all the libraries your project is using, including their versions. To create this file, you can simply run the following command in the command line:

pip3 freeze > requirements.txt  # Python3

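This will generate a requirements.txt file listing every package installed in the current Python environment, each with its exact version. Running the command inside a virtual environment dedicated to the project keeps the list limited to what the project actually uses.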
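Since the point of this step is exact versions, it can help to check that every entry really is pinned with “==”. The Python sketch below is one possible check; the function name unpinned_requirements and the sample package versions are made up for illustration, and it only understands simple one-requirement-per-line files.

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='.

    Hypothetical helper for illustration; it handles only simple entries.
    """
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Sample entries with made-up version numbers
sample = ["scikit-learn==1.3.0", "numpy>=1.24", "pandas", "# a comment", ""]
print(unpinned_requirements(sample))  # ['numpy>=1.24', 'pandas']
```

To check a real file, the lines can be read with open("requirements.txt").read().splitlines() and passed to the same function.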

3.2. Write a Dockerfile

The next step is to create a file named Dockerfile that creates the environment and executes our application in it.

FROM python:3.9

WORKDIR /src

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python","iris_classification.py"]

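This Dockerfile uses the official Python image as the base image, sets the working directory, copies the requirements.txt file, installs the dependencies, copies the application code, and runs the python iris_classification.py command to start the application. Copying requirements.txt and installing the dependencies before copying the rest of the code lets Docker reuse the cached dependency layer when only the application code changes.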

3.3. Build the Image

The final step to create a reproducible environment is to create an image (also known as a template) that can be run to create any number of containers with the same configurations.

You can build the image by running the command docker build -t <image-name> . in the same directory where the Dockerfile is located.
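Once the image is built, running it with “docker run” starts a container that executes the CMD from the Dockerfile, which here prints the model accuracy. The Python sketch below strings the two steps together. The image name iris-classifier and the helper names are assumptions made for this example, and the default is a dry run that only prints the commands, so nothing is executed unless dry_run=False is passed on a machine with Docker installed.

```python
import shlex
import subprocess


def run_step(argv, dry_run=True):
    """Print a command and, unless this is a dry run, execute it."""
    print("$", shlex.join(argv))
    if not dry_run:
        # check=True raises an error if the Docker command fails
        subprocess.run(argv, check=True)


def build_and_run(image_name, context=".", dry_run=True):
    """Build the image from the Dockerfile in `context`, then run it once."""
    steps = [
        ["docker", "build", "-t", image_name, context],
        # --rm removes the container after the script inside it finishes
        ["docker", "run", "--rm", image_name],
    ]
    for argv in steps:
        run_step(argv, dry_run)
    return steps


# "iris-classifier" is a hypothetical image name chosen for this example
build_and_run("iris-classifier")
```

Because the same image produces the same environment every time, anyone who receives it can repeat the “docker run” step without installing Python or scikit-learn locally.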
