Data Science with Python — Classification Analysis
This article is part of the “Data Science with Python” series.
One of the most important tasks in data science is classification analysis, which involves predicting the class or category of an observation based on its features. Classification analysis has a wide range of applications, from fraud detection to medical diagnosis to image recognition.
Today, we will explore how to perform classification analysis using Python.
Classification Algorithms
Classification is a type of supervised learning in which the goal is to predict discrete values. There are many different types of classification algorithms, each with its own strengths and weaknesses.
- Logistic Regression: Logistic regression is a simple yet powerful classification algorithm that is widely used in data science. It is used to model the probability of a binary outcome (e.g., yes or no) based on one or more predictor variables. The algorithm produces a logistic curve that maps the probability of the outcome to the input variables.
- K-Nearest Neighbors (KNN): KNN is a non-parametric classification algorithm that works by finding the k nearest neighbors to a given observation in the feature space. The class of the new observation is then determined by a majority vote of its k-nearest neighbors.
- Decision Trees: Decision trees are a popular classification algorithm that works by recursively partitioning the feature space into smaller and smaller regions. At each step, the algorithm selects the feature that provides the most information gain (i.e., reduces the entropy or impurity of the dataset the most) and creates a split based on its value. The process continues until a stopping criterion is met.
- Random Forests: Random forests are an extension of decision trees that work by building an ensemble of many trees and averaging their predictions. The individual trees are built using a random subset of the features and a random subset of the data. This helps to reduce overfitting and improve the generalization performance of the model.
- Support Vector Machines (SVMs): SVMs are a powerful classification algorithm that works by finding the hyperplane that separates the two classes in the feature space with the largest margin. The hyperplane is chosen to maximize the distance between the nearest points of each class.
These are just a few of the many classification algorithms available in data science, but they are among the best known and most widely used.
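As a quick preview, here is how each of these could be instantiated in scikit-learn (the hyperparameter values shown are illustrative, not recommendations):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# one instance per algorithm discussed above; all share the same fit/predict API
classifiers = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
}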
Logistic Regression
Let’s focus on logistic regression, one of the most fundamental classification algorithms.
The logistic regression model is a type of generalized linear model that uses the logistic function (also known as the sigmoid function) to model the probability of the outcome. The logistic function has an S-shaped curve that ranges from 0 to 1, and it can be written as:

p(x) = 1 / (1 + exp(-z))

where p(x) is the probability of the outcome given the input variables x, z = β₀ + β₁x₁ + … + βₙxₙ is a linear combination of the input variables and their coefficients, and exp() is the exponential function.

The logistic regression algorithm works by estimating the values of the coefficients that maximize the likelihood of the observed data.
The logistic regression algorithm uses optimization techniques, such as gradient descent or Newton’s method, to find the values of the coefficients that maximize the log-likelihood function. Once the coefficients are estimated, the model can be used to predict the probability of the outcome for new observations.
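To make this concrete, here is a toy sketch (an illustration, not scikit-learn’s actual implementation) of fitting the coefficients by gradient ascent on the log-likelihood; it assumes X already includes a column of ones for the intercept:

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_iter=1000):
    # gradient ascent on the log-likelihood of a binary logistic model;
    # assumes X includes a column of ones for the intercept term
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)      # predicted probabilities
        gradient = X.T @ (y - p)   # gradient of the log-likelihood w.r.t. beta
        beta += learning_rate * gradient / len(y)
    return beta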
Building Classification Models in Python
The easiest way to build classification models in Python is to use scikit-learn, as it provides implementations of many classification algorithms.
I’ll assume you have already installed scikit-learn. If not, you can install it with pip install scikit-learn
Once Scikit-learn is installed, you can import the necessary modules and load your data:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# load your dataset (replace the path with your own file)
data = pd.read_csv('path/to/your/data.csv')
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.2, random_state=42)
For now, I won’t provide a dataset, as solving a concrete classification problem will be the subject of another article. You can download one online if you want to follow along, or simply read on to understand the workflow.
Next, we create an instance of the LogisticRegression classifier and fit it to the training data:
lr = LogisticRegression()
# train the model using the training data
lr.fit(X_train, y_train)
By using the logistic regression algorithm, we assume that our target variable can take only two values, since this is a binary classifier.
Once the model is trained, we can use it to predict the target variable for new data:
y_pred = lr.predict(X_test)
y_pred might look like the following:
array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1])
We only get 0s and 1s because our target variable can take only these two values.
Each value in the array corresponds to the predicted target value for a single observation in the testing set. For example, the first value in the array (1) corresponds to the predicted target value for the first observation in the testing set, the second value in the array (0) corresponds to the predicted target value for the second observation, and so on.
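If you need probabilities instead of hard labels, you can use the classifier’s predict_proba method:

# estimated class probabilities: column 0 is P(y=0), column 1 is P(y=1)
y_proba = lr.predict_proba(X_test)
print(y_proba[:5])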
Finally, we can evaluate the performance of the model using various metrics, such as accuracy, precision, recall, and F1 score:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
Fine-Tuning a Classification Model
After building a classification model, you can fine-tune its parameters to optimize its performance. There are many ways to do this.
Hyperparameter tuning: Many classification algorithms have hyperparameters that can be tuned to improve their performance. Hyperparameters are parameters that are not learned from the data, but are set by the user before the model is trained. Examples of hyperparameters include the regularization parameter in logistic regression, the number of trees in a random forest, and the maximum depth of a decision tree.
One way to fine-tune a classification model is to perform a grid search over a range of hyperparameters and evaluate the model’s performance on a validation set. Scikit-learn provides a convenient GridSearchCV class for this purpose. Here’s an example of how to use GridSearchCV to tune the hyperparameters of a LogisticRegression classifier:
from sklearn.model_selection import GridSearchCV
# define the hyperparameters to tune
hyperparameters = {'C': [0.1, 1, 10], 'penalty': ['l1', 'l2']}
# create a LogisticRegression instance; the liblinear solver supports both
# the l1 and l2 penalties (the default lbfgs solver rejects l1)
lr = LogisticRegression(solver='liblinear')
# perform a grid search over the hyperparameters
grid_search = GridSearchCV(lr, hyperparameters, cv=5)
grid_search.fit(X_train, y_train)
# print the best hyperparameters
print("Best hyperparameters:", grid_search.best_params_)
In this example, we define a grid of hyperparameters to tune (C and penalty) and create an instance of the LogisticRegression classifier. We then use GridSearchCV to perform a grid search over the hyperparameters, using a 5-fold cross-validation strategy (cv=5). Finally, we print the best hyperparameters found by the grid search.
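After the search, grid_search.best_estimator_ holds a model refitted on the full training set with the best hyperparameters, which you can evaluate directly:

# evaluate the best model found by the grid search on the held-out test set
best_lr = grid_search.best_estimator_
print("Test accuracy:", best_lr.score(X_test, y_test))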
Feature Selection: Another way to fine-tune a classification model is to perform feature selection, which involves selecting a subset of the input features that are most relevant to the target variable. Feature selection can help to reduce the dimensionality of the data, improve the model’s performance, and make it more interpretable.
Scikit-learn provides several feature selection techniques, such as SelectKBest, SelectPercentile, and RFE (Recursive Feature Elimination). Here’s an example of how to use SelectKBest to select the top k features in a dataset:
from sklearn.feature_selection import SelectKBest, f_classif
# select the 10 features with the highest f_classif (ANOVA F-test) scores
selector = SelectKBest(f_classif, k=10)
selector.fit(X_train, y_train)
# transform the training and testing sets
X_train_new = selector.transform(X_train)
X_test_new = selector.transform(X_test)
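You can check which features were kept with the selector’s get_support method (this assumes X_train is a pandas DataFrame with named columns):

# boolean mask of the retained features, mapped back to column names
selected = X_train.columns[selector.get_support()]
print("Selected features:", list(selected))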
Ensemble Methods: Ensemble methods are techniques that combine multiple classification models to improve their performance. Examples of ensemble methods include bagging, boosting, and stacking.
Scikit-learn provides several ensemble methods, such as RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier. Here’s an example of how to use RandomForestClassifier to build an ensemble of decision trees:
from sklearn.ensemble import RandomForestClassifier
# create an instance of the RandomForestClassifier with 100 trees
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Model Stacking: Model stacking is a technique that combines multiple classification models using a meta-model. The idea is to use the predictions of several base models as input features to a meta-model, which learns how to combine them to make the final prediction.
Scikit-learn provides a built-in StackingClassifier class in the sklearn.ensemble module (available since version 0.22); the mlxtend library offers a similar implementation. Here’s an example of how to use scikit-learn’s StackingClassifier to build a stacked ensemble of logistic regression and decision tree classifiers:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# create instances of the base classifiers
lr = LogisticRegression()
dt = DecisionTreeClassifier()
# combine the base classifiers, with logistic regression as the final (meta) estimator
sc = StackingClassifier(estimators=[('lr', lr), ('dt', dt)], final_estimator=LogisticRegression())
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Final Note
You now have an extra tool in your arsenal: you know how to solve classification problems in Python.
In a future article, we will work through a concrete application. Don’t hesitate to follow me if you don’t want to miss it!