VaticAI

Summary

The website content provides guidance on improving Jupyter Notebook practices for data scientists by addressing common issues such as version control difficulties, lack of IDE features, testing challenges, poor naming conventions, bad variable naming, unused code accumulation, excessive execution counts, underutilization of markdown for documentation, not learning notebook shortcuts, and writing spaghetti code.

Abstract

Jupyter Notebooks are powerful tools for data scientists, but they can be misused, leading to disorganized and inefficient workflows. The content outlines seven practices to enhance notebook usage: adopting a consistent naming convention for files, using meaningful variable names, removing or archiving unused code, keeping execution counts manageable, documenting with markdown, learning Jupyter shortcuts, and avoiding spaghetti code by structuring code clearly. These practices are aimed at improving readability, maintainability, and collaboration, ultimately helping data scientists deliver value more effectively.

Opinions

  • The author emphasizes the importance of a standardized naming convention for notebooks to maintain organization and clarity.
  • Good variable naming is seen as a simple yet effective way to improve code readability and facilitate hand-offs between team members.
  • Accumulation of unused code is discouraged; the author suggests maintaining a separate repository for it instead.
  • Notebooks should be kept concise, with a recommended limit on execution counts to facilitate debugging and structuring of code.
  • Markdown documentation within notebooks is underutilized, and the author encourages its use for better documentation practices.
  • The author believes that learning Jupyter Notebook shortcuts is crucial for efficiency, given the significant time data scientists spend using them.
  • The article advises against writing "spaghetti code" by ensuring that each function has a single responsibility and by using comments and docstrings.
  • The use of version control, such as git, is recommended to keep track of changes and maintain code integrity.

7 Jupyter notebook practices to avoid right away

Jupyter Notebooks come with several problems, including but not limited to:

  1. Hard, if not impossible, to version control
  2. No IDE features such as code correction or auto-styling
  3. Hard to unit test or integration test

But they are still an invaluable tool for data scientists to deliver value.

Here are 7 simple things I learned from reviewing thousands of notebooks over the last 10 years that you, as a data scientist, can use to up your notebook game:

Untitled4.ipynb, final_final_training.ipynb, model_pipeline_try_3.ipynb

You’ve probably seen names like these. Naming notebooks well helps you stay organized. Follow a simple naming convention to avoid this:

[2_word_project_name]_[2_word_functionality]_[1_word_name_initials]_[ISO 8601 date_yyyy_mm_dd].ipynb

For example, if a notebook belongs to the project ‘face_detection’, covers the ‘model training pipeline’, and was created by “elon” on “2022-12-31”, you could name it like this:

face_detection_train_pipeline_elon_2022_12_31.ipynb

Bad variable naming

Possibly the easiest way to improve readability and hand-offs is good variable naming. Don’t fall into the trap of “I’ll rename it later”; do it from the start. It’s the best way to help yourself!
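A quick before-and-after makes the difference concrete. The data and names below are made up for illustration; both halves compute the same thing.

```python
# Hard to hand off: the names say nothing about what the values mean.
l = [0.91, 0.42, 0.77]
t = 0.5
x = [v for v in l if v > t]

# Same logic, but the names describe the contents and the intent.
detection_scores = [0.91, 0.42, 0.77]
min_confidence = 0.5
confident_detections = [
    score for score in detection_scores if score > min_confidence
]
```

A reader who opens the notebook months later can follow the second version without running it.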

Unused code, comments, and variables

“I might need this later” is the most common excuse for not deleting unused code and variables. Instead, you could keep a running folder for all unused code and refer back to it when needed.

suspectus provides one of the best answers for this:

Large execution count

An execution count beyond 30 to 40 is a sign that you need to break the notebook into multiple notebooks. Pick a good milestone where you’re confident about the execution and end the notebook gracefully there. Start another one to continue your pipeline.

1_face_detect_data_prep.ipynb
2_face_detect_model_train.ipynb
3_face_detect_model_predict.ipynb
4_face_detect_model_evaluate.ipynb

This automatically structures your code to some extent and helps people understand the flow of your project. It’s easier to debug and introduces a degree of fault tolerance, since a failure in one stage does not force you to rerun the others. It also provides a path toward converting your notebooks into a Python package.
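If you want to check this threshold automatically, a minimal sketch follows. It assumes the notebook is saved in the version 4 JSON format, where code cells sit in a top-level `cells` list and carry an `execution_count` field. The function names and the default limit of 40 are illustrative choices, not a Jupyter API.

```python
import json
from pathlib import Path


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON, into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def max_execution_count(notebook: dict) -> int:
    """Return the highest execution count among code cells (0 if none ran)."""
    counts = [
        cell.get("execution_count") or 0  # unexecuted cells store null
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    return max(counts, default=0)


def needs_split(notebook: dict, limit: int = 40) -> bool:
    """Flag notebooks whose execution count exceeds the chosen limit."""
    return max_execution_count(notebook) > limit
```

For example, `needs_split(load_notebook("1_face_detect_data_prep.ipynb"))` would tell you whether that notebook has grown past the limit.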

Not using markdown for documentation

Notebooks provide useful markdown functionality that lets you use cells as documentation. It is surprising how few data scientists use this to document what’s going on in large notebooks.

Pro tip: Instead of code, write markdown in a cell for documentation, LaTeX-like equations, headers, or pointers. Press Esc, then M, and voilà.

Not learning Notebooks shortcuts

Data scientists spend almost 80% of their time modeling and exploring in Jupyter notebooks. If you’re terribly slow in Jupyter land, it could work to your disadvantage over the long term.

Learning a few basic keyboard shortcuts goes a long way toward speeding you up. Here’s possibly the best documentation I’ve seen:

https://gist.github.com/kidpixo/f4318f8c8143adee5b40

Spaghetti code

Avoid feeding the spaghetti monster. Here are some tips, in no particular order:

  • 1 function — 1 responsibility
  • Split your files
  • Use comments/docstrings
  • Self-contained methods
  • Use VERSION CONTROL, aka git

One could keep adding pointers to this list, but give these a try!
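As a rough illustration of the “1 function, 1 responsibility” and docstring tips above, here is a small sketch. The function names and sample data are hypothetical; the point is only that each step can be read, tested, and reused on its own.

```python
def load_scores(raw_lines):
    """Parse one score per line, skipping blank lines."""
    return [float(line) for line in raw_lines if line.strip()]


def filter_confident(scores, min_confidence=0.5):
    """Keep only the scores above the confidence threshold."""
    return [score for score in scores if score > min_confidence]


def summarize(scores):
    """Return the count and mean of the scores (mean is None when empty)."""
    if not scores:
        return {"count": 0, "mean": None}
    return {"count": len(scores), "mean": sum(scores) / len(scores)}


# Each function takes its inputs as arguments and returns a value, so the
# pipeline does not rely on hidden notebook state from earlier cells.
raw = ["0.75", "", "0.25", "1.0"]
summary = summarize(filter_confident(load_scores(raw)))
```

Functions written this way are also easy to move out of the notebook and into a module once the logic settles.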
