Data Science 101

8 Things Most Data Science Programs Don’t Teach (But You Should Know) — Part 2

MIT calls this “the missing semester of your CS education”

What data science and software engineering have in common is writing code. But while code is the main outcome of software engineering, data science projects typically end with models, results, and reports. Consequently, in data science the quality, structure, and delivery of code is often an afterthought at best.

The implicit expectation with data science projects is that the results reported at the end can be trusted.

This means that if someone asked you to re-run your own or somebody else’s analysis, you would be able to obtain the same results, regardless of how much time has passed since the analysis was first performed.

Similarly, if you are developing a component for a product, the implicit expectation is that the component you developed represents the best possible performance given what is reasonably possible within the requirements of the product.

These statements may seem obvious, but satisfying both expectations can be quite difficult.

If you don’t believe me, think about your past projects.

Have you ever struggled to run your old code or to figure out which version of your data or which hyperparameters you used to obtain a specific result?

This is the second article in a series about practical data science skills that, in my experience, are rarely covered in data science courses but will occupy much of your day-to-day work as a data scientist. This post is inspired by a course I taught at the University of Tennessee in Knoxville (DSE 511) and by a fantastic MIT course that is aptly called “the missing semester of your CS education.”

8 Things Most Data Science Programs Don’t Teach (But You Should Know) — Part 1

MIT Calls this “the missing semester of your CS education”

towardsdatascience.com

This second post focuses on skills to help you make your results more reliable and your code more reusable.

Here’s what this article series covers:

Part 1 (previous post):

Part 2 (this post):

Help your future self by caring about reproducibility
Version everything including data and models
Organize your experiments with ML experiment tracking tools
Test and document your code

In separate future posts I will do a deep dive into each of these topics.

1. Help your future self by caring about reproducibility

In 2016, the journal Nature ran a survey in which it asked over 1,500 scientists various questions about their experience in reproducing other people’s and their own work. Two results that really stood out in the study:

“More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.” [1]

Yes, you read that right. More than half of the respondents could not reproduce their own work.

If this sounds like a fluke, try to think about your own work. Did you ever go back to your old code and fail to remember how you processed your data or which script needs to be run first?

The idea behind reproducibility is ensuring that results of any experiments are not down to chance, but can be recreated by following the same method. In the context of data science, this means documenting all code and data so that rerunning the analysis will yield identical results.
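One small, concrete step toward identical results is controlling randomness. The sketch below is a toy illustration, not a full recipe: `run_analysis` is a hypothetical stand-in for a real analysis, and fixing a seed addresses only one source of variation (library versions, data changes, and hardware can still differ).

```python
import random


def run_analysis(seed: int, n_samples: int = 5) -> list[float]:
    """Toy 'analysis': draw random samples from a dedicated, seeded generator."""
    # A local generator avoids depending on hidden global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_samples)]


first_run = run_analysis(seed=42)
second_run = run_analysis(seed=42)
different_seed = run_analysis(seed=7)

print(first_run == second_run)      # same seed, same numbers
print(first_run == different_seed)  # different seed, different numbers
```

Passing the seed in as an argument, rather than hard-coding it, also makes it easy to record the seed alongside the result.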

If you are wondering why this matters, ask yourself whether you would trust the conclusions drawn from somebody’s experiments if you weren’t sure they could get the same numbers each time they ran their analysis.

How to achieve computational reproducibility? The bad news is that creating data science code that is fully reproducible takes a lot of work and planning. The good news is that reproducibility doesn’t have to be all or nothing. One way to think about reproducibility is as a “spectrum of possibilities” [2] where each incremental step provides a meaningful improvement. Or as Karl Broman puts it:

“…, you shouldn’t try to do these things all at once; start with one, or part of one. Then in your next project, do that plus another thing.” [3]

What are some things you can do to make your work more reproducible? There are lots of fantastic resources on reproducibility. While the complete list of recommendations for achieving reproducible code can seem overwhelming, there are three things you can implement that can not only make a big difference to the reproducibility of your code but also improve your productivity and the quality of your work. The rest of this article is dedicated to these three topics:

Use version control for everything including data and models.
Keep track of how you obtained each result.
Clearly document everything you are doing and add tests for important parts of your code.

2. Version everything including data and models

Using version control systems like git to track all your code changes can save you a lot of headaches.

If you ever break your code, you can easily go back to a working version.

It makes it very easy to share your code with others or even work on projects collaboratively.

It serves as a backup so you don’t lose your work.

If you are not using git for all your projects, including projects that are not team based, you should start now. Don’t be afraid to use features like branching and pull requests — they help communication and tracking work.

What about data? It could be argued that what code is to a software engineer, data and models are to a data scientist, but data often doesn’t get the same treatment as code. When it comes to data and models, you mainly need to worry about:

Storage — ideally, your data should be stored in a location that provides backup and can be accessed by team members if needed.
Correctly mapping data, code, and model versions — if you ever used a model you developed only to find out that you’re not sure which version of your code and data the model was trained on, you know what I’m talking about.

While using git to version data and models alongside code may seem like an obvious solution, git was not made for storing datasets. In fact, storing files larger than just text files containing code can quickly cause your repository size to blow up, taking unnecessary space and degrading performance of push/pull operations (ask me how I know this).

Fortunately, there are solutions to data and model versioning like git LFS (git Large File Storage) and DVC (Data Version Control).

Both Git LFS and DVC are based on the same idea: instead of storing large files in the repository itself, they create small text files that serve as pointers to the actual data. The difference between them is that Git LFS was designed to work seamlessly with git, while DVC provides a file management system that is independent of git. On GitHub, Git LFS stores data on GitHub servers and the free tier offers up to 1 GB of space, while DVC supports many different types of storage, including Amazon S3, Google Drive, and others. Both are worth exploring, as they can be used alongside each other and can serve different purposes.
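To make the pointer idea concrete, here is a minimal sketch using only the Python standard library. It is not how Git LFS or DVC are actually implemented; the functions `track_file` and `restore_file`, the `.pointer.json` suffix, and the `store` directory are all hypothetical names chosen for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(data_path: Path, store_dir: Path) -> Path:
    """Copy a file into a hash-addressed store and write a small pointer file."""
    # Reads the whole file into memory; fine for a sketch, not for huge files.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    store_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, store_dir / digest)
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer = {"sha256": digest, "size": data_path.stat().st_size}
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def restore_file(pointer_path: Path, store_dir: Path, target: Path) -> None:
    """Recreate the original file from the store using only the pointer."""
    pointer = json.loads(pointer_path.read_text())
    shutil.copyfile(store_dir / pointer["sha256"], target)


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    data = workdir / "train.csv"
    data.write_text("x,y\n1,2\n3,4\n")

    pointer = track_file(data, workdir / "store")
    restored = workdir / "restored.csv"
    restore_file(pointer, workdir / "store", restored)

    round_trip_ok = restored.read_text() == data.read_text()
    pointer_size = pointer.stat().st_size
    print(pointer.name, pointer_size, round_trip_ok)
```

Only the small pointer file would be committed to git; the store could live on any backed-up location. Because the pointer contains a content hash, a given commit identifies exactly one version of the data.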

3. Organize your experiments with ML experiment tracking tools

It’s not uncommon to use a single notebook for everything from data exploration to model training and to do everything ad hoc. It’s also not uncommon to decide to switch to a different model in the middle of experimentation and to do this by simply changing the model used in the existing code. I’m guilty of doing this myself, especially when I try to get a prototype together quickly.

There are many issues that could arise from this approach. Given the theme of this article, by now you probably guessed that this approach doesn’t do well with reproducibility.

But aside from the fact that you can easily lose track of how you trained your model, there are many other downsides:

It can be difficult to compare results across models or to share results and trained models with team members.
The more models you experiment with, the harder it can be to keep track of performance.
You can easily waste compute resources by re-running experiments you already ran but didn’t properly log.
It can be harder to know if you properly explored all hyperparameters.

Properly tracking your ML experiments is not just good practice; it can save you time and help you improve your results.

Tracking ML experiments is the practice of saving all important variables that led to your result. What needs to be tracked can differ based on what models, tools, and languages you are using, and generally includes:

Experiment code
The data used (the specific version, data split, etc.)
The environment your code was run in (system and programming language versions, libraries, library versions)
Model configuration (hyperparameters, etc.)
Model weights
Which metrics were used to measure performance and the metric scores

Ideally, this information should be stored in a way that enables you to reproduce your past experiments (retrain your models and get the same performance scores) and compare results across your experiments.
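The environment is the item on this list that is easiest to forget. As a rough sketch, the standard library can capture the interpreter version, the platform, and the installed versions of the packages you care about. The function name `snapshot_environment` and the package list are assumptions made for this example.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages: list[str]) -> dict:
    """Collect Python, platform, and package versions for later reference."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# The package names here are placeholders; list whatever your project imports.
env = snapshot_environment(["numpy", "pandas", "a-package-that-is-not-installed"])
print(json.dumps(env, indent=2))
```

Saving this dictionary next to each result gives you a starting point for rebuilding the environment later.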

Broadly, there are two main options for experiment tracking: DIY and experiment tracking tools.

Experiment tracking does not have to rely on a tool; you can build a custom workflow and track experiments yourself. In fact, saving results and other relevant information in a spreadsheet or another file used to be, and still is, common practice. It can be a good idea to track all the experiments in one project manually to see whether this approach works for you.
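A DIY tracker can be as simple as appending one JSON record per run to a file. The sketch below assumes a hypothetical `experiments.jsonl` log and hypothetical helpers `log_experiment` and `load_experiments`. The parameter values, version labels, and accuracy scores are made-up placeholders, not real results.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(log_path: Path, params: dict, metrics: dict,
                   data_version: str, code_version: str) -> dict:
    """Append one experiment record to a JSON Lines file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a git commit hash
        "data_version": data_version,  # e.g. a dataset tag or file hash
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_experiments(log_path: Path) -> list[dict]:
    """Read all logged experiments back into a list of dictionaries."""
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "experiments.jsonl"
    # Placeholder values for illustration only.
    log_experiment(log_file, {"model": "logreg", "C": 1.0}, {"accuracy": 0.81}, "v1", "abc123")
    log_experiment(log_file, {"model": "logreg", "C": 0.1}, {"accuracy": 0.84}, "v1", "abc123")

    runs = load_experiments(log_file)
    best = max(runs, key=lambda r: r["metrics"]["accuracy"])
    print(best["params"])
```

An append-only log like this makes it easy to compare runs and hard to overwrite an old result by accident.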

If you would rather rely on a tool, nowadays there are many options to choose from with different features. Some of the most popular ones include free tools like MLflow and DVC (yes, DVC can be used for experiment tracking, not only for data storage) and commercial tools like ClearML, Weights & Biases, and Comet. Which one to choose will depend on your project and your needs.

While it can take some time to understand how to properly track your experiments and find a workflow that works for you, it will be well worth it in the long run.

4. Test and document your code

Guido van Rossum, the creator of Python, shared one of his key insights in PEP 8 (which I can highly recommend reading):

“Code is read much more often than it is written.” [4]

Keeping your code readable benefits your collaborators as much as yourself.

Maybe you have experienced one of these before:

Looking at your own code and not being able to remember what you did
Struggling to use a library that doesn’t have good documentation
Not being able to find the source of a bug in your code

Properly documenting your code and writing unit tests are two great ways to improve the readability of your code. Documentation and unit tests help you define expected behavior and outcomes. Both also help you avoid problems in your code, and when problems do occur they help you solve them faster.

You might be thinking that doing all that, writing documentation and unit tests, sounds like a lot of work that will only slow you down.

While both require some upfront work, ask yourself how many times you have struggled with trying to reuse your old code or with chasing some obscure bug in your code.

If you never struggle with these kinds of problems, then great!

But chances are that, as Guido suggests, you actually do spend much more time reading and editing your code than writing it in the first place. If that is the case, putting in the work upfront can save your future self a lot of time and frustration.

Documenting your Python code

There are different kinds of documentation in Python that you should be familiar with:

First, there are standard Python comments which are prefixed with a hash (#) — these can be helpful for describing the intent of your code.
Then there are docstrings, which are string literals placed at the start of a function, class, method, or module that describe its purpose, inputs, outputs, and other information. There are different docstring formats; popular ones include the Sphinx, NumPy, and Google styles. Many IDEs help you auto-generate docstrings, so you don’t have to remember the exact format.
Then there are Python type hints which help specify expected inputs and outputs without needing to write docstrings or comments.
There are also other kinds of documentation, like readme files, commit messages, and issue trackers.
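The first three kinds of documentation can appear together in a single function. The example below is a sketch: `min_max_scale` is a hypothetical helper, and the docstring follows the NumPy style, which is one of several reasonable choices.

```python
from collections.abc import Sequence


def min_max_scale(values: Sequence[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1].

    Parameters
    ----------
    values : Sequence[float]
        Numbers to rescale. Must contain at least one element.

    Returns
    -------
    list[float]
        Rescaled values in the original order. If all inputs are equal,
        every output is 0.0.

    Raises
    ------
    ValueError
        If ``values`` is empty.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")
    lowest, highest = min(values), max(values)
    span = highest - lowest
    # A constant input has no spread; return zeros rather than divide by zero.
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


print(min_max_scale([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

The type hints say what goes in and what comes out, the docstring explains behavior and edge cases, and the inline comment records why a specific decision was made.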

It can be helpful to become familiar with the different types of documentation listed above to make documenting your code something you do by default rather than retrospectively (or never).

Writing tests for data science

The obvious reason to test your code is that tests help you find issues faster. However, there are many more benefits to writing unit tests beyond debugging. Unit tests help you write more readable and reproducible code because they force you to think about what you expect from your code and help you break it down into smaller pieces. They can also serve as unofficial documentation for your project.

There are many libraries for writing unit tests in Python, from the built-in unittest module (sometimes called PyUnit) to the popular open-source library pytest. You don’t need 100% test coverage to start. A good starting place is creating unit tests for your data processing functions. That way, if you change your data preprocessing while testing different models, failing tests will alert you to changes in behavior that could affect the comparability of your previous experiments.
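As a minimal sketch of what such tests can look like, the example below uses the built-in unittest module to test a small preprocessing function. Both `fill_missing` and the test cases are hypothetical examples, not part of any particular project.

```python
import unittest


def fill_missing(values: list, default: float = 0.0) -> list:
    """Replace None entries with a default value, leaving other entries unchanged."""
    return [default if v is None else v for v in values]


class TestFillMissing(unittest.TestCase):
    def test_replaces_none_with_default(self):
        self.assertEqual(fill_missing([1.0, None, 3.0]), [1.0, 0.0, 3.0])

    def test_respects_custom_default(self):
        self.assertEqual(fill_missing([None], default=-1.0), [-1.0])

    def test_does_not_modify_input(self):
        original = [None, 2.0]
        fill_missing(original)
        self.assertEqual(original, [None, 2.0])


# Run the tests programmatically; in a project you would normally run
# `python -m unittest` from the command line instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFillMissing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test name states one expectation, so the test file doubles as a short description of what the function is supposed to do.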

If you have made it this far, thank you for reading! I hope you enjoyed this article and that with these recommendations, you feel empowered to experiment with tools and techniques to improve the reproducibility and reusability of your data science code. If you would like to receive notifications when I publish a new article, consider signing up for email notifications here. If you missed the first article in the series, you can find it here:

8 Things Most Data Science Programs Don’t Teach (But You Should Know) — Part 1

MIT Calls this “the missing semester of your CS education”

towardsdatascience.com

References

[1] M. Baker, 1,500 scientists lift the lid on reproducibility (2016), Nature

[2] R. D. Peng, Reproducible Research in Computational Science (2011), Science

[3] K. Broman, initial steps toward reproducible research, online (https://kbroman.org/steps2rr/)

[4] G. van Rossum, B. Warsaw and A. Coghlan, PEP 8 — Style Guide for Python Code (2001), Python Enhancement Proposals
