Yash Prakash

How To Use Bash To Automate The Boring Stuff For Data Science

A guide for using the command line to write some reusable code for your Data Science projects

Photo by Elena Koycheva on Unsplash

As a Data Scientist, you tend to use some commands via the terminal over and over again. These may be commands to make a new directory for a project, start a new virtual environment and activate it, install a few standard libraries, etc.

The workflow you follow when you begin a new project probably doesn’t differ much from one project to the next.

For instance, I always create a new folder for a new project and move into it, via:

mkdir newproject
cd newproject

and then I make a new virtual environment via:

pipenv shell 
# or
python -m venv newenv

and finally, I install numpy and pandas as boilerplate for the project. For visualisation purposes, I also use matplotlib and plotly frequently.

As a matter of fact, I also like to quickly spin up a web application for my machine learning apps, so I tend to install streamlit along the way too, as it’s, in my opinion, the best library for the job. If you don’t know much about it, here’s a quick introduction I wrote to get you started with it.

Thus, the install command we need so far is:

pip install numpy pandas matplotlib plotly streamlit

If you forget to add a library, you have to go back and install it again via the terminal. Looking at it now, wouldn’t it be a good idea to do all of this automatically, with just one command?

You can write a script once and run it every time you start a project, so that it performs these repetitive commands for you and saves you some valuable time.

In this article, I will demonstrate a simple command line process that you can easily get used to for automating the boring stuff efficiently, one that I tend to use quite often.

Let’s get started! 👇

Checking for bash on the system

A simple way to find out where bash is located on your system is to use:

which bash

The output will be something like:

/bin/bash

Check for the bash version:

bash --version

The output should be like:

bash version on a Mac, for instance
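
If you would rather do both checks in one go, here is a minimal sketch. The function name check_bash is my own invention for illustration; the only things it relies on are the command -v builtin and the bash --version flag used above.

```shell
#!/bin/bash
# check_bash is a hypothetical helper name, not part of bash itself.
check_bash() {
    # command -v prints the path of bash if it is on the PATH
    bash_path="$(command -v bash)" || {
        echo "bash not found on PATH" >&2
        return 1
    }
    echo "bash found at: $bash_path"
    # The first line of `bash --version` holds the version string
    "$bash_path" --version | head -n 1
}

check_bash
```

The exact path and version will differ from machine to machine, so treat whatever it prints as information rather than something to match.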

Great, now that we have that little bit of information, let’s see what we can build with it.

Making a new bash script

We will be making one file which will contain all our boilerplate commands. This will be our bash script, and by common practice it will have the .sh extension.

First, create a new file.

touch createmlapp.sh

Next, let’s add a line to the top of our file so that the system knows to use the bash shell for running our script.

#!/bin/bash

Now, let’s understand what we want to do here. Our process will be as follows:

  1. Create a new directory for our project
  2. Create and activate virtual environment
  3. Install whatever packages we require
  4. Open up VSCode inside the project directory.

Let’s go through them now.

Writing our script

All our commands will be similar to what we run normally in the terminal.

There’s just one difference: in order to make our project, we need a name, which we’ll pass in as an argument.

APP_NAME="$1"

The first argument entered while executing the script will be our $1.
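
If the script is run without a name, $1 is simply empty and the later commands will misbehave. As a rough sketch of how you might guard against that, the hypothetical helper below (require_app_name is a name I made up, and the usage message is only a suggestion) prints a hint and returns a non-zero status when no argument is given.

```shell
#!/bin/bash
# require_app_name is a hypothetical helper for illustration.
require_app_name() {
    # $# is the number of arguments; ${1:-} is empty if none was given
    if [ "$#" -lt 1 ] || [ -z "${1:-}" ]; then
        echo "Usage: ./createmlapp.sh <app-name>" >&2
        return 1
    fi
    echo "$1"
}

# Inside the real script you would pass the script's own arguments: "$@"
APP_NAME="$(require_app_name "my-first-app")" && echo "Project name: $APP_NAME"
```

Inside createmlapp.sh itself, you would call it with "$@" instead of the fixed demo name and stop the script if it fails.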

Now, the rest of the code will be familiar:

cd
cd Desktop/ # use whatever directory you wish here
# Quote the variable so that names containing spaces still work
mkdir "$APP_NAME"
cd "$APP_NAME"
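
One thing the plain commands don’t handle is a project folder that already exists. Here is a small sketch of how that check could look. BASE_DIR and the demo name are assumptions so that the example is safe to run anywhere; in the real script they would be your Desktop folder and the name passed as $1.

```shell
#!/bin/bash
# Demo values: a scratch folder stands in for ~/Desktop, and a fixed
# name stands in for "$1". Both are assumptions for illustration.
BASE_DIR="$(mktemp -d)"
APP_NAME="demo-app"

# Refuse to reuse an existing project folder
if [ -e "$BASE_DIR/$APP_NAME" ]; then
    echo "Error: $BASE_DIR/$APP_NAME already exists" >&2
else
    mkdir -p "$BASE_DIR/$APP_NAME"
    # cd can fail (for example, on permissions), so report it if it does
    cd "$BASE_DIR/$APP_NAME" || echo "Could not enter project directory" >&2
    echo "Now working in: $(pwd)"
fi
```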

Next comes the virtual environment:

python -m venv newenv
source newenv/bin/activate
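
If you want the script to confirm that activation worked before it installs anything, one option is to look at the VIRTUAL_ENV variable, which the activate script exports. The helper name in_virtualenv below is hypothetical, just a sketch of the idea.

```shell
#!/bin/bash
# in_virtualenv is a hypothetical helper name for illustration.
# The activate script exports VIRTUAL_ENV, so a non-empty value
# means an environment is active in the current shell.
in_virtualenv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_virtualenv; then
    echo "Active environment: $VIRTUAL_ENV"
else
    echo "No virtual environment is active" >&2
fi
```

One thing worth knowing: when you run the script with ./createmlapp.sh, the activation happens in the script’s own process, so the terminal you launched it from will not show the environment as active afterwards.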

Finally, we install whatever packages we need:

pip install --upgrade pip
pip install numpy pandas matplotlib streamlit
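
A variation you might prefer is to keep the package list in a requirements.txt file, so that the list travels with the project. This is only a sketch: the helper name install_requirements is made up, and it is defined here without being called so that nothing gets installed until you decide to.

```shell
#!/bin/bash
# Demo only: work in a scratch folder. Inside a real project you would
# drop this line and write the file into the project directory.
cd "$(mktemp -d)"

# Write the package list, one package per line
cat > requirements.txt <<'EOF'
numpy
pandas
matplotlib
streamlit
EOF

# install_requirements is a hypothetical helper; call it only once the
# virtual environment is active.
install_requirements() {
    pip install --upgrade pip
    pip install -r requirements.txt
}

echo "Packages listed: $(wc -l < requirements.txt | tr -d ' ')"
```

Adding a library later then means adding one line to the file instead of editing the script.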

And lastly, as a bonus, let’s open up our favourite code editor, VSCode, to begin our project. This assumes the code command is available on your PATH.

code .

And we’re done! 😄 Your script should now look like this:

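#!/bin/bash
# Project name comes from the first command-line argument
APP_NAME="$1"
# Create the project folder on the Desktop and move into it
cd
cd Desktop/
mkdir "$APP_NAME"
cd "$APP_NAME"
# Create and activate the virtual environment
python -m venv newenv
source newenv/bin/activate
# Install the boilerplate packages
pip install --upgrade pip
pip install numpy pandas matplotlib streamlit
# Open the project in VSCode
code .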
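
Before running a script for real, you can ask bash to parse it without executing anything, which catches typos such as an unclosed quote. The sketch below uses a tiny stand-in script in a temporary file so that it is safe to try; with your own file you would simply run bash -n createmlapp.sh.

```shell
#!/bin/bash
# Demo only: write a tiny stand-in script to a temporary file.
script="$(mktemp)"
printf '%s\n' '#!/bin/bash' 'APP_NAME="$1"' 'mkdir "$APP_NAME"' > "$script"

# -n tells bash to read the commands and check syntax without running them
if bash -n "$script"; then
    echo "Syntax OK"
else
    echo "Syntax error found" >&2
fi
```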

Running our script

First, we run chmod +x on our script to make it executable.

chmod +x createmlapp.sh
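
To see what chmod +x actually changes, here is a small sketch that checks the executable bit before and after. It uses a temporary file as a stand-in for createmlapp.sh, which is my assumption to keep the demo harmless.

```shell
#!/bin/bash
# A temporary file stands in for createmlapp.sh in this demo.
script="$(mktemp)"
echo '#!/bin/bash' > "$script"

# [ -x FILE ] is true only when the file is executable
if [ -x "$script" ]; then
    echo "Before chmod: executable"
else
    echo "Before chmod: not executable"
fi

chmod +x "$script"

if [ -x "$script" ]; then
    echo "After chmod: executable"
fi
```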

Awesome! The only thing left is to move this script into our home directory so that we can run it without cd-ing into any other folder every time.

First, find out what your home directory is by typing cd into the terminal. Wherever that takes you, move the script there so that you can run it from there.

Now, from your home directory, simply type:

./createmlapp.sh yourmlappname

to see the script in action!
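
If you would like to run the script by name from any folder, one common alternative is to keep it in a directory that is on your PATH. The sketch below shows the idea with a temporary folder and a stand-in script that only prints a message; a personal folder such as ~/bin is an assumed location, not something this guide has set up.

```shell
#!/bin/bash
# Demo only: a temporary folder stands in for a personal bin directory
# such as ~/bin, and the script body is a harmless stand-in.
BIN_DIR="$(mktemp -d)"
printf '%s\n' '#!/bin/bash' 'echo "Creating project: $1"' > "$BIN_DIR/createmlapp.sh"
chmod +x "$BIN_DIR/createmlapp.sh"

# Prepend the folder to PATH so the shell can find the script by name
PATH="$BIN_DIR:$PATH"
export PATH

# No ./ prefix and no cd needed now
createmlapp.sh yourmlappname
```

To make this permanent you would add the PATH line to your shell’s startup file, pointing at the real folder.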

Concluding…

Congrats! After this little guide, you should now be able to automate workflows like this one to save yourself some time when starting out with a new project!

All I can recommend now is to explore more and experiment with creating more scripts, possibly to perform more complicated tasks as well, such as running a new app, auto-pushing code to GitHub, etc.

You can find the code repository here.

If you liked this article, I share little bits of helpful tools and techniques from the Data Science world every week here. Follow me to never miss them!

