Data Science with Python — Classification Analysis
This article is part of the “Data Science with Python” series.
One of the most important tasks in data science is classification analysis, which involves predicting the class or category of an observation based on its features. Classification analysis has a wide range of applications, from fraud detection to medical diagnosis to image recognition.
Today, we will explore how to perform classification analysis using Python.
Classification Algorithms
Classification is a type of supervised learning in which the goal is to predict discrete values. There are many different types of classification algorithms, each with its own strengths and weaknesses.
- Logistic Regression: Logistic regression is a simple yet powerful classification algorithm that is widely used in data science. It is used to model the probability of a binary outcome (e.g., yes or no) based on one or more predictor variables. The algorithm produces a logistic curve that maps the probability of the outcome to the input variables.
- K-Nearest Neighbors (KNN): KNN is a non-parametric classification algorithm that works by finding the k nearest neighbors to a given observation in the feature space. The class of the new observation is then determined by a majority vote of its k-nearest neighbors.
- Decision Trees: Decision trees are a popular classification algorithm that works by recursively partitioning the feature space into smaller and smaller regions. At each step, the algorithm selects the feature that provides the most information gain (i.e., reduces the entropy or impurity of the dataset the most) and creates a split based on its value. The process continues until a stopping criterion is met.
- Random Forests: Random forests are an extension of decision trees that work by building an ensemble of many trees and averaging their predictions. The individual trees are built using a random subset of the features and a random subset of the data. This helps to reduce overfitting and improve the generalization performance of the model.
- Support Vector Machines (SVMs): SVMs are a powerful classification algorithm that works by finding the hyperplane that separates the two classes in the feature space with the largest margin. The hyperplane is chosen to maximize the distance between the nearest points of each class.
These are just a few of the many classification algorithms available in data science, but they are among the best known and most widely used.
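As a quick preview, here is how each of these could be instantiated in scikit-learn (the hyperparameter values shown are illustrative, not recommendations):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# one instance per algorithm discussed above; all share the same fit/predict API
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
}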
Logistic Regression
Let’s focus on logistic regression, one of the most fundamental classification algorithms.
The logistic regression model is a type of generalized linear model that uses the logistic function (also known as the sigmoid function) to model the probability of the outcome. The logistic function has an S-shaped curve that ranges from 0 to 1, and it can be written as:

p(x) = 1 / (1 + exp(-z))

where p(x) is the probability of the outcome given the input variables x, z = β₀ + β₁x₁ + … + βₙxₙ is a linear combination of the input variables and their coefficients, and exp() is the exponential function.

The logistic regression algorithm works by estimating the values of the coefficients that maximize the likelihood of the observed data.
The logistic regression algorithm uses optimization techniques, such as gradient descent or Newton’s method, to find the values of the coefficients that maximize the log-likelihood function. Once the coefficients are estimated, the model can be used to predict the probability of the outcome for new observations.
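To make this concrete, here is a toy sketch (an illustration, not scikit-learn’s actual implementation) of fitting the coefficients by gradient ascent on the log-likelihood; it assumes X already includes a column of ones for the intercept:

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_iter=1000):
    # gradient ascent on the log-likelihood of a binary logistic model;
    # assumes X includes a column of ones for the intercept term
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)      # predicted probabilities
        gradient = X.T @ (y - p)   # gradient of the log-likelihood w.r.t. beta
        beta += learning_rate * gradient / len(y)
    return beta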
Building Classification Models in Python
The easiest way to build classification models in Python is to use scikit-learn, as it provides implementations of many classification algorithms.
I’ll assume you have already installed scikit-learn. If not, you can install it with pip install scikit-learn
Once Scikit-learn is installed, you can import the necessary modules and load your data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# load your dataset (replace the path with your own file)
data = pd.read_csv('path/to/your/data.csv')
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)
For now, I won’t provide a dataset, as solving a concrete classification problem will be the subject of another article. You can download one online if you want to follow along, or simply read on to understand the workflow.
Next, we create an instance of the LogisticRegression classifier and fit it to the training data:
lr = LogisticRegression()
# train the model using the training data
lr.fit(X_train, y_train)
By using the logistic regression algorithm, we assume that our target variable can take only two values, since this is a binary classifier.
Once the model is trained, we can use it to predict the target variable for new data:
y_pred = lr.predict(X_test)
y_pred might look like the following:
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1])
We only get 0s and 1s because our target variable can take only these two values.
Each value in the array corresponds to the predicted target value for a single observation in the testing set. For example, the first value in the array (1) corresponds to the predicted target value for the first observation in the testing set, the second value in the array (0) corresponds to the predicted target value for the second observation, and so on.
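If you need probabilities instead of hard labels, you can use the classifier’s predict_proba method:

# estimated class probabilities: column 0 is P(y=0), column 1 is P(y=1)
y_proba = lr.predict_proba(X_test)
print(y_proba[:5])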
Finally, we can evaluate the performance of the model using various metrics, such as accuracy, precision, recall, and F1 score:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
Fine-Tuning a Classification Model
After building a classification model, you can fine-tune its parameters to optimize its performance. There are many ways to do this.
Hyperparameter tuning: Many classification algorithms have hyperparameters that can be tuned to improve their performance. Hyperparameters are parameters that are not learned from the data, but are set by the user before the model is trained. Examples of hyperparameters include the regularization parameter in logistic regression, the number of trees in a random forest, and the maximum depth of a decision tree.
One way to fine-tune a classification model is to perform a grid search over a range of hyperparameters and evaluate the model’s performance on a validation set. Scikit-learn provides a convenient GridSearchCV class for this purpose. Here’s an example of how to use GridSearchCV to tune the hyperparameters of a LogisticRegression classifier:
from sklearn.model_selection import GridSearchCV
# define the hyperparameters to tune
hyperparameters = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# create a LogisticRegression instance; the liblinear solver supports both
# the l1 and l2 penalties (the default lbfgs solver rejects l1)
lr = LogisticRegression(solver='liblinear')
# perform a grid search over the hyperparameters
grid_search = GridSearchCV(lr, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)
# print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)
In this example, we define a grid of hyperparameters to tune (C and penalty) and create an instance of the LogisticRegression classifier. We then use GridSearchCV to perform a grid search over the hyperparameters, using a 5-fold cross-validation strategy (cv=5). Finally, we print the best hyperparameters found by the grid search.
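After the search, grid_search.best_estimator_ holds a model refitted on the full training set with the best hyperparameters, which you can evaluate directly:

# evaluate the best model found by the grid search on the held-out test set
best_lr = grid_search.best_estimator_
print("Test accuracy:", best_lr.score(X_test, y_test))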
Feature Selection: Another way to fine-tune a classification model is to perform feature selection, which involves selecting a subset of the input features that are most relevant to the target variable. Feature selection can help to reduce the dimensionality of the data, improve the model’s performance, and make it more interpretable.
Scikit-learn provides several feature selection techniques, such as SelectKBest, SelectPercentile, and RFE (Recursive Feature Elimination). Here’s an example of how to use SelectKBest to select the top k features in a dataset:
from sklearn.feature_selection import SelectKBest, f_classif
# select the 10 features with the highest f_classif (ANOVA F-test) scores
selector = SelectKBest(f_classif, k=10)
selector.fit(X_train, y_train)
# transform the training and testing sets
X_train_new = selector.transform(X_train)
X_test_new = selector.transform(X_test)
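You can check which features were kept with the selector’s get_support method (this assumes X_train is a pandas DataFrame with named columns):

# boolean mask of the retained features, mapped back to column names
selected = X_train.columns[selector.get_support()]
print("Selected features:", list(selected))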
Ensemble Methods: Ensemble methods are techniques that combine multiple classification models to improve their performance. Examples of ensemble methods include bagging, boosting, and stacking.
Scikit-learn provides several ensemble methods, such as RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier. Here’s an example of how to use RandomForestClassifier to build an ensemble of decision trees:
from sklearn.ensemble import RandomForestClassifier
# create an instance of the RandomForestClassifier with 100 trees
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Model Stacking: Model stacking is a technique that combines multiple classification models using a meta-model. The idea is to use the predictions of several base models as input features to a meta-model, which learns how to combine them to make the final prediction.
Scikit-learn provides a built-in StackingClassifier class in the sklearn.ensemble module (available since version 0.22); the mlxtend library offers a similar implementation. Here’s an example of how to use scikit-learn’s StackingClassifier to build a stacked ensemble of logistic regression and decision tree classifiers:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# create instances of the base classifiers
lr = LogisticRegression()
dt = DecisionTreeClassifier()
# combine the base classifiers, with logistic regression as the final (meta) estimator
sc = StackingClassifier(estimators=[('lr', lr), ('dt', dt)], final_estimator=LogisticRegression())
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Final Note
You now have an extra tool in your arsenal: you know how to solve classification problems in Python.
In a future article, we will work through a concrete application. Don’t hesitate to follow me if you don’t want to miss it!