7 Jupyter notebook practices to avoid right away
Jupyter notebooks come with several problems, including but not limited to:
- Hard if not impossible to version control
- No IDE, code correction, auto styling
- Hard to unit test or integration test
Even so, they are still an invaluable tool for data scientists to deliver value.
Here are 7 simple things I learned from reviewing thousands of notebooks over the last 10 years that you, as a data scientist, can use to make your notebook game better:
Untitled4.ipynb, final_final_training.ipynb, model_pipeline_try_3.ipynb
You’ve probably seen names like these. Naming your notebooks well helps you stay organized. Follow a simple naming convention to avoid this:
[2_word_project_name]_[2_word_functionality]_[1_word_name_initials]_[ISO 8601 date_yyyy_mm_dd].ipynb
e.g. if a notebook belongs to the project ‘face_detection’, covers the ‘model training pipeline’, and was created by “elon” on “2022-12-31”, then you could name your notebook like this:
face_detection_train_pipeline_elon_2022_12_31.ipynb
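The naming pattern above is easy to automate so nobody has to remember it. The following is only a minimal sketch: the helper name `notebook_name` and its parameters are hypothetical, and it assumes each part should be lowercased with spaces and hyphens turned into underscores.

```python
from datetime import date


def notebook_name(project: str, functionality: str, author: str, created: date) -> str:
    """Build a notebook filename as project_functionality_author_yyyy_mm_dd.ipynb."""
    parts = [project, functionality, author, created.strftime("%Y_%m_%d")]
    # Normalize each part: lowercase, underscores instead of spaces or hyphens
    cleaned = [p.strip().lower().replace(" ", "_").replace("-", "_") for p in parts]
    return "_".join(cleaned) + ".ipynb"


# Reproduces the example filename from the text
print(notebook_name("face detection", "train pipeline", "elon", date(2022, 12, 31)))
```

Running this once when you create a notebook keeps names consistent across a team.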
Bad variable naming
Possibly the easiest way to improve readability and hand-offs is descriptive variable naming. Don’t fall into the trap of “I’ll rename it better later on”; do it from the start. It’s the best way to help yourself!
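As a small illustration, here is the same logic written twice. The function names and the sample data are made up for this sketch; the point is only how much the second version explains on its own.

```python
# Hard to follow: the names say nothing about the contents
def f(a, b):
    return [x for x in a if x[1] >= b]


# Easier to hand off: the names carry the meaning
def filter_confident_detections(detections, min_confidence):
    """Keep (label, confidence) pairs at or above the confidence threshold."""
    return [
        (label, confidence)
        for label, confidence in detections
        if confidence >= min_confidence
    ]


# Hypothetical sample data
detections = [("face", 0.92), ("face", 0.41), ("background", 0.77)]
confident_detections = filter_confident_detections(detections, min_confidence=0.5)
print(confident_detections)
```

Both functions return the same result, but only one of them can be read without opening the cell that produced its inputs.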
Unused code, comments, and variables
“I might need this later” is the most common excuse for not deleting unused code and variables. Instead, you could keep a running folder for all unused code and refer back to it when you need it.
suspectus provides one of the best answers for this:
Large execution count
An execution count beyond 30 to 40 is a sign that you need to break the work into multiple notebooks. Pick a milestone where you’re confident in the results so far and end the notebook gracefully there. Start another one to continue your pipeline.
- 1_face_detect_data_prep.ipynb
- 2_face_detect_model_train.ipynb
- 3_face_detect_model_predict.ipynb
- 4_face_detect_model_evaluate.ipynb
This automatically structures your code to some extent and helps people understand the flow of your project. It makes debugging easier and adds a degree of fault tolerance, since a failure in one stage doesn’t force you to rerun everything. It also gives you a path toward converting your notebooks into a Python package.
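If you want to spot notebooks that have grown past that 30 to 40 range, the execution counts can be read straight from the notebook file. This is a rough sketch that assumes the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry an `execution_count`); the function names and the default threshold of 40 are choices made for this example.

```python
import json
from pathlib import Path


def max_execution_count(notebook_path):
    """Return the highest execution count stored in a notebook file (0 if none)."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    counts = [
        cell.get("execution_count") or 0  # unexecuted code cells store null
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    return max(counts, default=0)


def notebooks_to_split(folder, threshold=40):
    """List (filename, count) for notebooks whose execution count exceeds the threshold."""
    flagged = []
    for path in sorted(Path(folder).glob("*.ipynb")):
        count = max_execution_count(path)
        if count > threshold:
            flagged.append((path.name, count))
    return flagged
```

Calling `notebooks_to_split(".")` from a project folder gives you a quick list of candidates to break up.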
Not using markdown for documentation
Notebooks provide useful markdown functionality that lets you use cells as documentation. It is surprising how few data scientists use this to document what’s going on in large notebooks.
Pro Tip: Instead of code, write markdown in a cell for documentation, LaTeX-like equations, headers, or pointers. Press Esc, then M, to turn the selected cell into a markdown cell, and voilà.
Not learning notebook shortcuts
Data scientists spend almost 80% of their time modeling and exploring in Jupyter notebooks. If you’re terribly slow in Jupyter land, it could work to your disadvantage over the long term.
A few basic keyboard shortcuts are worth learning to speed things up. Here’s possibly the best documentation I’ve seen:
https://gist.github.com/kidpixo/f4318f8c8143adee5b40
Spaghetti code
Avoid feeding the spaghetti monster. Here are some tips, in no particular order:
- 1 function — 1 responsibility
- Split your files
- Use comments/docstrings
- Self-contained methods
- Use VERSION CONTROL, aka git
Each of these could be broken down into finer pointers, but give them a try!
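To make the first few tips concrete, here is a minimal sketch of one notebook cell split into small functions, each with a single responsibility and a docstring. The function names and the sample data are hypothetical and only serve the illustration.

```python
def load_rows(lines):
    """Parse 'label,score' text lines into (label, float score) tuples."""
    rows = []
    for line in lines:
        label, score = line.strip().split(",")
        rows.append((label, float(score)))
    return rows


def keep_above(rows, threshold):
    """Return only the rows whose score is at or above the threshold."""
    return [(label, score) for label, score in rows if score >= threshold]


def mean_score(rows):
    """Average the scores, returning 0.0 for an empty input."""
    if not rows:
        return 0.0
    return sum(score for _, score in rows) / len(rows)


# Each step depends only on its arguments, so it can be tested or moved on its own
raw_lines = ["face,0.9", "face,0.3", "background,0.7"]
rows = load_rows(raw_lines)
print(mean_score(keep_above(rows, 0.5)))
```

Because none of these functions reads notebook-level variables, they can later move into a separate file with little more than a copy and paste.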
