Ismael Araujo

Summary

The article introduces lazy predict, a Python library that simplifies the process of running multiple machine learning models quickly to assess their performance on a given dataset, using the Titanic Survivor Prediction as a case study.

Abstract

The article "How to Run 30 Machine Learning Models with a Few Lines of Code" provides a guide on how to efficiently test a variety of machine learning algorithms on a dataset without the need for extensive coding. It demonstrates the use of the lazy predict library, which can run 30 different models simultaneously, offering a time-saving approach for data scientists when starting a new project. The author illustrates the process by applying it to the Titanic Survivor Prediction dataset from Kaggle, showcasing how to import data, preprocess it, and split it into training and test sets before running the models. The results are then compared with individual models like Random Forest and Logistic Regression to validate the accuracy of the lazy predict outcomes. The article emphasizes that while lazy predict is a powerful tool for initial model assessment, it should not replace a thorough understanding of the models and the statistics behind them.

Opinions

  • The author views lazy predict as a successful tool for quickly evaluating multiple machine learning models, saving time in the initial stages of a project.
  • There is an opinion that lazy predict provides a reliable indication of model performance, as evidenced by the comparison with more complex approaches like Random Forest and Logistic Regression.
  • The article suggests that data scientists should not solely rely on lazy predict for their final model selection; instead, they should use it as a starting point to inform further, more detailed analysis.
  • The author implies that a good grasp of data science principles and the specifics of each model is crucial, as there are no "magic solutions" in the field.
  • lazy predict is recommended for its ease of use and speed, making it a valuable asset for initial model iteration and confirmation of the planned approach.

MACHINE LEARNING

How to Run 30 Machine Learning Models with a Few Lines of Code

Learn how to run multiple machine learning models using lazy predict — code along

Image by Keira Burton. Source: Pexels

When starting a new supervised Machine Learning project, one of the first steps is to analyze the data, understand what we are trying to accomplish, and decide which machine learning algorithms could help us achieve our goals. While the scikit-learn library makes our lives easier by making it possible to run a model with a few lines of code, the process can still be time-consuming when you need to test multiple models. What if we could run multiple vanilla models at once before diving into more complex approaches, and get a better idea of which models deserve our precious time?

That’s what lazy predict tries (successfully) to accomplish. It runs 30 machine learning models in just a few seconds and gives us a sense of how each model will perform on our dataset. To better understand how we can use lazy predict, I created a Titanic Survivor Prediction project so that you can code along with me. You can find the full notebook here. Basic experience with Python, Pandas, and scikit-learn will help you better understand what is going on.

Importing and cleaning data

First, let's import pyforest. PyForest imports the 40 most popular Python libraries with one line of code. I wrote an article about it, and you can find it here. I will turn off some ugly warning messages using the warnings module. I will also import some metrics functions from scikit-learn. We will need them later on.

import pyforest
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.metrics import accuracy_score

Now, let's import the dataset we will be using from Kaggle. You can find the dataset at this link. Note that I didn't import Pandas explicitly. That's because it comes included with pyforest.

# importing .csv files using Pandas
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

I will skip most of the Exploratory Data Analysis in this article because our primary focus is getting started with lazypredict. However, in my initial EDA, which you can find on my GitHub, I noticed that we need to convert the Sex column to numeric values. We can easily do that with a lambda function.

train['Sex'] = train['Sex'].apply(lambda x: 1 if x == 'male' else 2)
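An alternative worth considering is an explicit mapping. The lambda above sends every value that is not 'male' (including missing or misspelled entries) to 2, whereas `Series.map` with a dictionary leaves anything unexpected as NaN so it can be spotted. Here is a minimal sketch on a tiny made-up frame; the `sample` data and the `SEX_CODES` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic 'Sex' column
sample = pd.DataFrame({"Sex": ["male", "female", "female", "Male"]})

# Same encoding as in the article: male -> 1, female -> 2
SEX_CODES = {"male": 1, "female": 2}

# map() yields NaN for values that are not keys of the dictionary
sample["Sex_code"] = sample["Sex"].map(SEX_CODES)

# Count the rows that could not be encoded
unmapped = int(sample["Sex_code"].isna().sum())
print(sample)
print("rows with unexpected values:", unmapped)
```

If the count is zero, both approaches give the same encoding; if not, you know there is something to clean first.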

We can also drop a few columns, mostly categorical, that will not be used in this micro project. For homework, I recommend playing around with these features when you finish this article.

train.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId', 'Parch', 'Embarked'], inplace=True)

Train Test Split

Let's now split our train set into the variables X and y. I will assign all the features to X, except Survived, which is our target label.

X = train.drop(['Survived'], axis=1)
y = train.Survived

And now, let's split the data into train and test sets. I will go with the default test size of 0.25. You can set a different value with the test_size parameter.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
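For instance, a different split size can be requested with the `test_size` argument of `train_test_split`, and `stratify` keeps the class balance similar in both parts. Here is a small sketch on made-up data; the arrays and variable names below are placeholders, not the Titanic set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 3 features, binary target with a 70/30 balance
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)

# Hold out 20% instead of the default 25%; stratify preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train shape:", X_tr.shape, "test shape:", X_te.shape)
print("share of positives in test:", y_te.mean())
```

Whether stratifying is worthwhile depends on how imbalanced the target is, so treat it as an option rather than a requirement.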

Modeling with LazyPredict

Now it's time for some fun. If this is your first time using lazypredict, you will have to install it. To do so, you can type pip install lazypredict in your terminal. Once it is installed, let's import it into the project. Since this is a classification project, let's import LazyClassifier as well.

import lazypredict
from lazypredict.Supervised import LazyClassifier

Finally, let's run the models and see how it goes.

clf = LazyClassifier(verbose=0,ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
models
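To demystify what happens in that call, the idea can be approximated by hand: fit several untuned scikit-learn classifiers in a loop and tabulate their scores. The sketch below only illustrates the concept and is not lazypredict's actual implementation; the candidate list, the synthetic data from `make_classification`, and the helper name `compare_models` are all assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model as given and return a table of test-set scores."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1 Score": f1_score(y_test, y_pred),
        })
    # Best accuracy first, in the spirit of a leaderboard
    table = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
    return table.reset_index(drop=True)


# Synthetic binary classification data (placeholder for the Titanic features)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# A handful of near-default models; the selection is arbitrary
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

leaderboard = compare_models(candidates, X_tr, X_te, y_tr, y_te)
print(leaderboard)
```

Writing the loop once makes it clear why the library is convenient: the value is in not having to maintain this boilerplate for dozens of estimators.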

Voilà. You just iterated over 30 models in less than 2 seconds. That is incredibly fast and effortless. However, how can we make sure that these results are accurate? We can check them by running a few models ourselves and comparing the scores. For this project, I will test the dataset with a Random Forest and a Logistic Regression model. Let's see if we can get anywhere close to the results we just saw. Let's start with Random Forest.

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

As we can see, the accuracy and F1-Score are very similar to what lazypredict reported. Let's now try Logistic Regression.

# use a separate name so the Random Forest model is not overwritten
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

Again, we got a very similar result. It seems that the results are trustworthy. That was easy.
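The comparison relies on accuracy and F1 score, both of which scikit-learn exposes in `sklearn.metrics`. Here is a minimal sketch of how the numbers can be printed for any set of predictions; the `report` helper and the toy labels are illustrative. With the models above you would pass `y_test` together with `y_pred` or `y_pred_lr`. The averaging mode behind lazypredict's F1 column may differ from the default binary F1 used here, so small differences would not be surprising.

```python
from sklearn.metrics import accuracy_score, f1_score


def report(name, y_true, y_pred):
    """Print and return accuracy and (binary) F1 for one model's predictions."""
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"{name}: accuracy={acc:.2f}, F1={f1:.2f}")
    return acc, f1


# Toy labels: three of four predictions are correct, one positive is missed
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
acc, f1 = report("toy model", toy_true, toy_pred)
```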

Conclusion

lazypredict is an easy and fast library that gives a good indication of how models will behave on your dataset. Keep in mind that the results obtained with lazy predict SHOULD NOT be considered final models. Statistics is a large component of Data Science, and since different models take different approaches, you should know how each model works before choosing your final one. More importantly, Data Science is a complex field, and there are no magic solutions without pros and cons. Use lazypredict as your first iteration to confirm your planned approach, and always double-check the results using tools such as scikit-learn. Have fun and let me know if it worked for you!
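One way to do that double-check is k-fold cross-validation, which scores a model on several train/test splits rather than a single one. Here is a hedged sketch with scikit-learn's `cross_val_score`, using synthetic data in place of the Titanic features; the choice of five folds and of Random Forest is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix and target
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=42)

# Score the candidate model on 5 different splits of the data
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_syn, y_syn, cv=5, scoring="accuracy"
)

print("fold accuracies:", scores.round(3))
print("mean:", round(float(scores.mean()), 3), "std:", round(float(scores.std()), 3))
```

A large spread between folds is a hint that a single split, like the one used in a quick lazypredict run, may be flattering or penalizing a model by chance.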

Machine Learning
Python
Projects
Data Science
Libraries