Summary

The website content provides a comprehensive explanation of logistic regression, a statistical non-linear approach in machine learning, illustrated with Python code examples and visualizations, emphasizing its application in binary classification problems.

Abstract

The article delves into the concept of logistic regression, explaining why it is preferred over linear regression for categorical data. It covers the transition from linear to non-linear models, the use of probability, and the logistic function to handle binary outcomes. The author demonstrates the practical implementation of logistic regression using Python, including data preprocessing, model fitting, evaluation, and visualization. The article also discusses the interpretation of model coefficients and pseudo R-squared values, offering insights into model performance and the statistical underpinnings of the logistic regression approach.

Opinions

The author suggests that a good understanding of logistic regression is crucial for efficiency, implying that it saves time in model development.
The preference for logistic regression over linear regression for categorical data is presented as a given, reflecting a consensus view in the field of machine learning.
The use of maximum likelihood estimates and log(odds) is highlighted as important for the logistic regression model, indicating the author's emphasis on these statistical concepts.
The author implies that the logit function is superior to the ordinary logistic function due to its interpretability.
The article conveys the opinion that certain metrics, such as McFadden’s R-squared, are indicative of a well-fitting logistic model, with values between 0.2 and 0.4 being considered good.
The conclusion of the article expresses the author's belief in the utility of logistic regression for binary classification problems and encourages reader engagement through social media platforms.

Machine Learning

Fully Explained Logistic Regression with Python

A statistical non-linear approach in machine learning

Linear and Non-Linear diagram. A photo by Author

We switch from a linear regression model to a non-linear regression model because of the output variable. While studying linear regression last week, I came across a data set in which the dependent variable is categorical.

In this article, we will discuss the basic concepts of logistic regression and learn about maximum likelihood estimates and log(odds). A good understanding of these ideas is important, and it saves a lot of time.

First, we need to know why linear regression is not suitable for categorical data. Of the two graphs below, the first shows a linear fit to a continuous dependent variable, and the second shows a linear fit to a dependent variable with binary category values. In the first graph the values follow a linear trend, i.e. as the independent variable increases, the dependent variable also increases. The second graph does not show this behavior; the dependent variable takes only two values, “0” and “1”.

The linear approach on a different type of dependent variables. A photo by Author

If we use the linear approach on the second set of values, the error will increase and the model will not fit well. Notice also that the fitted line extends above “1” and below “0”, values that we do not need for prediction. So, we need an approach in which the prediction stays between “0” and “1”.
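
To see the problem in numbers, here is a minimal sketch with a made-up data set (the values of `x` and `y` below are hypothetical and are not the article's data). It fits a straight line to binary labels with `np.polyfit` and evaluates the line slightly outside the observed range.

```python
import numpy as np

# Hypothetical toy data: one feature and a binary outcome
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least-squares straight line: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the line just outside the observed x range
x_new = np.array([0.0, 10.0])
y_line = slope * x_new + intercept

print("Line at x=0 :", y_line[0])   # falls below 0
print("Line at x=10:", y_line[1])   # rises above 1
```

The line returns values below “0” and above “1”, which cannot be read as probabilities.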

Logistic Regression Curve. A photo by Author

From this thought, we can think about probability, whose values range from “0” to “1”. Alright, but we also need to change our prediction line. Many functions produce values between “0” and “1”, which a threshold then maps to one of the two classes. The S-shaped curve shown above is one of them; it can be called a logistic regression curve or a logistic function.

The logit regression model is shown below.

The logistic model. The photo is from sphweb.bumc.bu.edu

So, the log of the odds is equal to the linear model.
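
Written out, the relationship looks like this. It is a sketch in standard notation: p is the probability of the outcome “1”, x_1 … x_k are the independent variables, and the coefficient names β_0 … β_k are an assumed convention rather than symbols taken from the figure.

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left form is the logit (log of odds) and the right form is the same model solved for p.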

The logit form is more interpretable than the ordinary logistic function because it is linear in the independent variables. Its inverse, the logistic function, is nothing but the sigmoid function, which gives values between “0” and “1”.
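
A small sketch of the two functions may help. The helper names `sigmoid` and `logit` are chosen here for illustration; the code only shows that the sigmoid squeezes any real number into the range between “0” and “1”, and that the logit undoes it.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log of odds: maps a probability in (0, 1) back to a real number."""
    return np.log(p / (1.0 - p))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])   # values of the linear model
p = sigmoid(z)                              # probabilities

print("probabilities:", p)
print("back to z    :", logit(p))
```

A linear score of 0 corresponds to a probability of 0.5, which is why 0.5 is the usual threshold between the two classes.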

When we fit the model, the optimizer reports the number of iterations and the function value. The iteration count is the number of steps after which further optimization no longer improves the model, and the function value is the value of the objective function at which convergence was reached.
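
To make the two terms concrete, here is a minimal sketch of maximum likelihood fitting on a made-up data set. It uses Newton's method as one common choice of optimizer; the data, the function names, and the tolerance are all assumptions for illustration and not the internals of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    """Objective function: negative log-likelihood of the logistic model."""
    z = X @ beta
    # np.logaddexp(0, z) is a stable way to compute log(1 + e^z)
    return np.sum(np.logaddexp(0.0, z) - y * z)

def fit_logistic_newton(X, y, tol=1e-8, max_iter=50):
    """Minimize the negative log-likelihood with Newton steps."""
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iter + 1):
        p = sigmoid(X @ beta)
        gradient = X.T @ (p - y)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step
        if np.max(np.abs(step)) < tol:   # converged: the step is negligible
            break
    return beta, iteration, neg_log_likelihood(beta, X, y)

# Hypothetical toy data; the two classes overlap, so the fit is finite
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])   # add the constant column

beta, n_iter, final_value = fit_logistic_newton(X, y)
print("Coefficients  :", beta)
print("Iterations    :", n_iter)
print("Function value:", final_value)
```

The iteration count is how many Newton steps were taken before the update became negligible, and the function value is the objective at that point.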

Let’s now work through a practical example with Python.

We created a small data set to explain the classification method for binary output in logistic regression.

#importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Now read the Excel file and view its first 5 rows.

df = pd.read_excel("logistic.xlsx")
df.head()

First 5 rows of Data set. A photo by Author

Divide the data set into independent and dependent variables.

x = df.iloc[:,[0,1]].values
y = df.iloc[:,2].values

Now, divide the data into training and test sets.

from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

Standardize the features so that each one has zero mean and unit variance, which puts columns with different ranges on the same scale.

from sklearn.preprocessing import StandardScaler 
sc_x = StandardScaler() 
x_train = sc_x.fit_transform(x_train)  
x_test = sc_x.transform(x_test)
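
What the scaler does can be sketched by hand with NumPy. The numbers below are hypothetical; the point is that the mean and standard deviation come from the training rows only and are then reused on the test rows, which mirrors `fit_transform` on the training set followed by `transform` on the test set.

```python
import numpy as np

# Hypothetical rows: [age, bank saving]
train = np.array([[20.0, 1000.0],
                  [30.0, 3000.0],
                  [40.0, 5000.0]])
test = np.array([[25.0, 2000.0]])

# Statistics are learned from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print("Scaled train:\n", train_scaled)
print("Scaled test :\n", test_scaled)
```

After scaling, both columns are on the same footing even though the raw savings values are far larger than the ages.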

Fit the model on the training set.

from sklearn.linear_model import LogisticRegression 
classifier = LogisticRegression(random_state = 0) 
classifier.fit(x_train, y_train)

Make predictions on the test set with the classifier.

y_pred = classifier.predict(x_test)

Generate the confusion matrix.

from sklearn.metrics import confusion_matrix 
conf_matrix = confusion_matrix(y_test, y_pred) 
  
print ("Confusion Matrix : \n", conf_matrix)
#output:
Confusion Matrix : 
 [[10  0]
 [ 0 10]]

Check the accuracy of the logistic model.

from sklearn.metrics import accuracy_score 
print ("Accuracy : ", accuracy_score(y_test, y_pred))
#output:
Accuracy :  1.0
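
To see how the two numbers relate, here is a small sketch that builds the confusion matrix by hand and derives the accuracy from it. The label arrays are hypothetical and deliberately contain a few mistakes, so the matrix is not purely diagonal.

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 0])

tn = np.sum((y_true == 0) & (y_hat == 0))   # true negatives
fp = np.sum((y_true == 0) & (y_hat == 1))   # false positives
fn = np.sum((y_true == 1) & (y_hat == 0))   # false negatives
tp = np.sum((y_true == 1) & (y_hat == 1))   # true positives

# Same layout as sklearn: rows are true classes, columns are predicted classes
conf = np.array([[tn, fp],
                 [fn, tp]])

# Accuracy is the diagonal (correct predictions) over the total
accuracy = (tn + tp) / conf.sum()

print("Confusion Matrix : \n", conf)
print("Accuracy : ", accuracy)
```

Here 6 of the 8 predictions sit on the diagonal, so the accuracy is 0.75.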

Plot the decision regions of the binary classification model.

from matplotlib.colors import ListedColormap 
X_set, y_set = x_test, y_test 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,  
                               stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1,  
                               stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, classifier.predict( 
             np.array([X1.ravel(), X2.ravel()]).T).reshape( 
             X1.shape), alpha = 0.75, cmap = ListedColormap(('red', 'green'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max()) 
  
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = ListedColormap(('red', 'green'))(i), label = j) 
      
plt.title('Classifier (Test set)') 
plt.xlabel('Age') 
plt.ylabel('Bank Saving') 
plt.legend() 
plt.show()

Logistic regression with statsmodels.

import statsmodels.api as sm

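Fit the logistic regression.

# add the intercept (constant) column, since statsmodels does not add one by default
x1 = sm.add_constant(x)
# the model class is sm.Logit (capital L); the dependent variable comes first
log_reg = sm.Logit(y, x1)
log_output = log_reg.fit()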

Now check the summary of the statsmodels fit.

log_output.summary()

Part of Summary of the logistic model. A photo by Author
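
The coefficients in such a summary are on the log(odds) scale, so a common way to read them is to exponentiate them into odds ratios. The sketch below uses hypothetical coefficient values (`b0` and `b1` are made up, not taken from the summary) to show that a one-unit increase in a variable multiplies the odds by exp(coefficient).

```python
import numpy as np

# Hypothetical intercept and slope on the log(odds) scale
b0, b1 = -3.0, 0.05

def odds(x):
    """Odds p / (1 - p) predicted by the logit model at x."""
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)

# Ratio of the odds when x goes up by one unit
ratio = odds(41.0) / odds(40.0)

print("Odds ratio from the model:", ratio)
print("exp(b1)                  :", np.exp(b1))
```

Both numbers agree, so a coefficient of 0.05 would mean roughly 5% higher odds per unit of the variable.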

In this logistic summary, we have the pseudo R-squared. Commonly reported fit measures for a logistic model include AIC, BIC, and McFadden’s R-squared. This fit reports McFadden’s, and its value is 0.3458. A pseudo R-squared between 0.2 and 0.4 is generally considered a good fit. The fitted logit model is shown below:

We created a general model of the logistic regression with the logit model.
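
McFadden’s R-squared compares the log-likelihood of the fitted model with that of a model containing only the constant. Here is a minimal sketch of that formula; the labels and the fitted probabilities are hypothetical, chosen only to illustrate the arithmetic. In statsmodels the corresponding values are exposed on the fit result as `llf`, `llnull`, and `prsquared`.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical labels and fitted probabilities from some logistic model
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_model = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# Null model: only a constant, so every row gets the overall share of 1s
p_null = np.full(len(y), y.mean())

ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, p_null)

mcfadden_r2 = 1.0 - ll_model / ll_null
print("McFadden's pseudo R-squared:", mcfadden_r2)
```

The closer the fitted log-likelihood is to zero compared with the null one, the higher the value.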

Conclusion:

This article shows the basic idea of how logistic regression works in a binary classification problem. The results may vary with the data set, and the run time with the machine on which we run our model.

I hope you liked the article. Reach me on LinkedIn and Twitter.