Bex T.

Summary

The article introduces six cutting-edge data science libraries that are essential for enhancing a data scientist's skill set in 2023, focusing on MLOps tools that streamline model deployment, experiment tracking, data versioning, and performance monitoring.

Abstract

In the rapidly evolving field of data science, staying current with new libraries and tools is crucial for success. The article "6 New Booming Data Science Libraries You Must Learn To Boost Your Skill Set in 2023" emphasizes the importance of moving beyond traditional libraries like Pandas, NumPy, and Scikit-learn. It highlights BentoML for model deployment, MLFlow for experiment tracking, DVC for data and model versioning, Weights & Biases for experiment tracking with a collaborative UI, NannyML for monitoring model performance, and Poetry for Python dependency management. These tools are designed to facilitate the development of best-performing models and their efficient deployment into production, reflecting the growing significance of MLOps in the data science ecosystem.

Opinions

  • The author suggests that knowing only the classic machine learning libraries is no longer sufficient for data scientists in 2023.
  • BentoML is praised for its ability to deploy models from any framework to any cloud provider with ease.
  • MLFlow is highlighted as a personal favorite of the author for its simplicity and utility in tracking experiments.
  • DVC is described as a "go-to" library for data and model versioning, akin to Git for code.
  • Weights & Biases is noted for its beautiful UI and robust features, backed by significant funding and a prestigious client base.
  • NannyML is recommended for its novel approach to detecting silent model failures in production.
  • The author expresses dissatisfaction with pip for dependency management and endorses Poetry as a superior alternative.
  • The article concludes with a call to action for readers to keep up with the latest tools and libraries, suggesting a Medium membership for access to more insights.

6 New Booming Data Science Libraries You Must Learn To Boost Your Skill Set in 2023

Data science isn't just Pandas, NumPy, and Scikit-learn anymore

Photo by Tobit Nazar Nieto Hernandez

Motivation

With 2023 just in, it is time to discover new data science and machine learning trends. While the old stuff is still essential, knowing Pandas, NumPy, Matplotlib, and Scikit-learn just won't be enough anymore.

Last year's version of this post was more about classic ML, covering libraries such as CatBoost, LightGBM, Optuna, and UMAP.

In 2022, I observed many more "My lovely connections, I am happy to announce I am an MLOps engineer" posts than in 2021. Correspondingly, there was much more MLOps content, and MLOps tools surged in popularity.

So, this year's article covers six rising stars in the MLOps ecosystem: tools focused on producing the best-performing models in the most efficient way possible and then throwing them into production.

1. BentoML

You are probably tired of hearing, "Machine learning models don't live in Jupyter Notebooks". If you aren't, I will go ahead and say it once again:

Machine learning models don't live inside Jupyter, gathering dust.

They live in production, doing what they are actually supposed to do — predicting new data.

One of the best libraries I found last year to deploy models is BentoML. BentoML is an all-in-one framework to maintain, package, and deploy models from any framework to any cloud provider as API services.

It supports saving/loading models in a unified format (versioned and tagged), enabling you to build an organized model registry.

Screenshot by author with permission of BentoML

From there, you can build a Docker image of your best model with a single command and serve it locally:

$ bentoml containerize my_classifier:latest
$ docker run -it --rm -p 3000:3000 my_classifier:6otbsmxzq6lwbgxi serve --production

Or deploy it to any cloud provider with a few commands without leaving the CLI. Here is an example for AWS SageMaker:

$ pip install bentoctl  # Terraform is a separate binary, installed from terraform.io
$ bentoctl operator install aws-sagemaker
$ export AWS_ACCESS_KEY_ID=REPLACE_WITH_YOUR_ACCESS_KEY
$ export AWS_SECRET_ACCESS_KEY=REPLACE_WITH_YOUR_SECRET_KEY
$ bentoctl init
$ bentoctl build -b model_name:latest -f deployment_config.yaml
$ terraform init
$ terraform apply -var-file=bentoctl.tfvars -auto-approve

Here is a step-by-step tutorial where I show how to deploy an XGBoost model to AWS Lambda:

Stats and links:

2. MLFlow

Before deploying your best model into production, you must produce it via experimentation. Typically, this may take dozens or even hundreds of iterations.

As the number of iterations grows, it gets harder and harder to keep track of what configurations you've already tried and which of the past experiments look promising.

To help you with the process, you need a reliable framework to keep track of code, data, models, hyperparameters, and metrics simultaneously.

Building that framework manually (or using Excel like a caveman) is the worst idea in the world, as there are so many superb Python libraries for the job.

One of those is MLFlow, my personal favorite. Add the following line of code to a script that trains a scikit-learn model, and MLFlow will capture everything: the model itself, its hyperparameters, and any metric you calculate using sklearn.metrics functions:

mlflow.sklearn.autolog()

Once you finish tinkering around, run mlflow ui in the terminal, and it brings up an experiments dashboard with controls to sort and visualize your experiments:

Screenshot by author

MLFlow has an mlflow.&lt;framework&gt;.autolog() feature for more frameworks than you can name. It is so simple and useful that you have no excuse not to use it.

Here is my tutorial on the framework, discussing its features and integration with the rest of the tools in the data ecosystem.

Stats and links:

3. DVC

In a sentence, DVC is Git for data.

Screenshot by author

DVC (Data Version Control) is becoming a go-to data and model versioning library. It can:

  1. Track gigabyte-sized datasets or models like Git tracks lightweight scripts.
  2. Create branches of the main code base for safe experimentation without duplicating the large files.

When you track a large file or directory with dvc add directory, a lightweight directory.dvc metadata file is created. Then, DVC manages these light files as placeholders for the original, heavy-weight files.
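For illustration, the generated metadata file looks roughly like this (the hash and sizes below are invented):

```yaml
# directory.dvc: the tiny placeholder that gets committed to Git
outs:
- md5: 1f9a3c0b2d4e5f60718293a4b5c6d7e8.dir
  size: 1387432917
  path: directory
```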

DVC lifts the weights, while Git handles the small stuff like your scripts. Together, they make a perfect duo.

Another selling point of DVC is smart workflow pipelines. A typical machine learning workflow involves steps like collecting data, cleaning it, feature engineering, and training a model.

DVC can create a smart pipeline from all these steps so you can run them all with a single command: dvc repro.

What's the smart part? DVC only executes modified steps of the pipeline, saving you hours of time and computing resources.
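Such a pipeline is declared in a dvc.yaml file. A hypothetical two-stage example (the script and folder names are made up):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - data/raw
      - src/preprocess.py
    outs:
      - data/clean
  train:
    cmd: python src/train.py
    deps:
      - data/clean
      - src/train.py
    outs:
      - models/model.pkl
```

Running dvc repro walks the stages in order but skips any stage whose dependencies haven't changed since the last run.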

Add MLFlow to your training scripts, track the model artifacts with DVC, and you have the perfect trio (Git, DVC, MLFlow).

Check out my beginner-friendly tutorial on DVC to get started:

Stats and links:

4. Weights & Biases

Another popular experiment tracking framework is Weights & Biases (wandb.ai). The only difference? It is built by a company with over $200M in funding and a client base that includes OpenAI, NVIDIA, Lyft, BMW, Samsung, and so on.

Their main selling points are:

  • Excellent integration with the rest of the ML ecosystem, just like MLFlow
  • The most beautiful UI for tracking and comparing experiments (in my opinion)
Screenshot by author
  • Collaborative reports and dashboards
  • Hyperparameter optimization (not possible in MLFlow)

And the best part is, all the above features are available straight through Jupyter. This means you don't have to ditch your favorite IDE and move into scripts just to track experiments.

So, your perfect trio might actually be Git, DVC, and Weights & Biases.

Stats and links:

5. NannyML

Deploying models is only part of the story. To maintain a successful ML-powered product, you must consistently monitor your models' performance.

The problem with monitoring is that you won't get nice, fat, red errors when models fail. Their predictions may simply become worse and worse as time passes, a phenomenon called silent model failure.

For example, you deployed a model that detects Nike clothes from images. Fashion is fast-changing, so Nike constantly improves its designs.

Since your model's training data didn't include the new designs, it starts to miss Nike clothes in images more and more. You won't receive errors, but your model will soon be useless.

NannyML helps solve this exact problem. Using a novel Confidence-Based Performance Estimation (CBPE) algorithm it developed, along with a few other robust statistical tests, NannyML can detect performance drops and silent model failures in production.
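To get a feel for the core idea behind CBPE, here is a toy NumPy sketch (my own simplification, not NannyML's actual implementation): if a binary classifier is well calibrated, its expected accuracy can be estimated from the predicted probabilities alone, with no ground-truth labels needed.

```python
import numpy as np

def estimate_accuracy(probs, threshold=0.5):
    """Estimate accuracy from calibrated P(y=1) scores, without labels.

    Each prediction is correct with probability p when we predict the
    positive class (p >= threshold) and 1 - p otherwise; averaging these
    per-row probabilities gives the expected accuracy.
    """
    probs = np.asarray(probs, dtype=float)
    p_correct = np.where(probs >= threshold, probs, 1.0 - probs)
    return float(p_correct.mean())

# Confident, calibrated scores -> high estimated accuracy
print(estimate_accuracy([0.95, 0.9, 0.05, 0.1]))
# Drifted inputs push scores toward 0.5 -> the estimate drops, no labels needed
print(estimate_accuracy([0.55, 0.6, 0.45, 0.5]))
```

Tracked over time, a falling estimate like this is exactly the kind of signal that flags a silent failure before labels ever arrive.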

NannyML also features smart alerting so you can always stay in tune with what's happening in your production environment.

Here is a hands-on tutorial to get you started:

Stats and links:

6. Poetry

You've probably heard Python programmers whine about pip and its dependency issues a thousand times already, and I was one of those whiners until I saw Poetry.

Poetry is a game-changing open-source Python packaging and dependency management framework. In its simplest use case, Poetry can detect dependency conflicts BEFORE you install libraries so that you can avoid dependency hell entirely.

Screenshot by author

You can also configure your Python projects as packages with pyproject.toml files, and Poetry will take care of virtual environments, building, and publishing the package to PyPI with simple commands.

Screenshot by author
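A hypothetical pyproject.toml for such a project might look like this (the package name, author, and dependencies are all illustrative):

```toml
[tool.poetry]
name = "my-package"
version = "0.1.0"
description = "A demo package managed by Poetry"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

With this in place, poetry install creates and populates the virtual environment, poetry add resolves conflicts before installing a new library, and poetry build / poetry publish handle packaging for PyPI.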

Here is a comprehensive Real Python tutorial on Poetry:

Stats and links:

Conclusion

The field of data science is constantly evolving, and new tools and libraries are being developed at a blazing pace. The pressure to keep up is greater than ever. In this post, I did my best to narrow your focus to one area of machine learning that's poised to skyrocket in 2023. Thank you for reading!

Loved this article and, let’s face it, its bizarre writing style? Imagine having access to dozens more just like it, all written by a brilliant, charming, witty author (that’s me, by the way :).

For only a $4.99 membership, you will get access to not just my stories, but a treasure trove of knowledge from the best and brightest minds on Medium. And if you use my referral link, you will earn my supernova of gratitude and a virtual high-five for supporting my work.

Machine Learning
Data Science
Mlops
Artificial Intelligence
Python