Machine Learning
Fully Explained Logistic Regression with Python
A Statistical non-linear approach in the machine learning algorithm
We switch from a linear regression model to a non-linear (logistic) regression model because of the nature of the output variable. While studying linear regression last week, I came across a data set in which the dependent variable is categorical.
In this article, we will discuss the basic concepts of logistic regression and learn about maximum likelihood estimation and the log of the odds, log(odds). A good understanding of these ideas is important and saves a lot of time later.
First, we need to know why linear regression is not suitable for categorical data. Of the two graphs below, the first shows a linear regression on a continuous dependent variable, and the second shows a linear fit to a dependent variable with binary categories. In the first graph the values follow a linear trend, i.e. as the independent variable increases, the dependent variable increases too. The second graph does not show this behavior; instead, the dependent variable takes only two values, “0” and “1”.
If we use the linear approach on the second data set, the error will increase and the model won’t fit well. Notice also that the fitted line extends above “1” and below “0”, a range that is meaningless for prediction. So, we need an approach in which the predictions stay between “0” and “1”.
This leads us to probability, whose values range from “0” to “1”. But we also need to change the shape of our prediction line. Several S-shaped functions map any input into the range “0” to “1”, and a threshold then turns that value into a class. The curve used here is called the logistic regression curve, or logistic function.
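To see the problem in numbers, here is a minimal sketch that fits a straight line to 0/1 labels. The toy data is made up for illustration and is not the data set used later in this article.

```python
import numpy as np

# Illustrative toy data (an assumption, not the article's data set):
# one feature, with binary labels that switch from 0 to 1 around x = 5.5.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares straight line: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Predict at the midpoint and a little beyond the observed range
x_new = np.array([-2.0, 0.0, 5.5, 11.0, 13.0])
y_line = slope * x_new + intercept

print(y_line)  # the first values are below 0 and the last ones are above 1
```

The line passes through 0.5 at the midpoint, but its predictions at the extremes cannot be read as probabilities.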
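As a small sketch of that curve, the logistic (sigmoid) function can be written in a few lines of NumPy. The function name `sigmoid` and the 0.5 threshold are choices made here for illustration; 0.5 is a common default rather than a fixed rule.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)

# Turn probabilities into class labels with an assumed 0.5 threshold
labels = (p >= 0.5).astype(int)

print(p)       # values strictly between 0 and 1, with sigmoid(0) = 0.5
print(labels)
```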
In the logit regression model, the log of the odds, log(p / (1 − p)), is equal to the linear model.
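Written out for k independent variables, this relationship takes the standard form below. The symbols p (probability of class “1”), β (coefficients) and x (independent variables) are notation introduced here for illustration.

```latex
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```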
The logit form is easier to interpret than the logistic function itself, because each coefficient is the change in the log of the odds for a one-unit change in its variable. Solving the logit equation for the probability gives the sigmoid function, which returns values between “0” and “1”.
When we fit the model, the optimizer reports the number of iterations and the current function value. The iteration count is the number of optimization steps taken before further steps no longer improved the fit, and the function value is the value of the objective function at the point of convergence.
Let’s now work through a practical example with Python.
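The link between probability, odds and log(odds) can be checked with a short sketch. The helper names `logit` and `sigmoid` are defined here for illustration only.

```python
import numpy as np

def logit(p):
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.8])
odds = p / (1.0 - p)      # p = 0.8 corresponds to odds of about 4 to 1
log_odds = logit(p)       # p = 0.5 corresponds to log-odds of 0

# Round trip: applying sigmoid to the log-odds recovers the probabilities
p_back = sigmoid(log_odds)
print(odds, log_odds, p_back)
```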
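To make the iteration count and function value concrete, here is a minimal sketch of maximum likelihood estimation using Newton's method on a made-up data set. It is an illustration under stated assumptions, not the exact routine a library runs: the toy data, the tolerance of 1e-8 and the choice to report the average negative log-likelihood are all choices made here.

```python
import numpy as np

# Made-up, non-separable toy data (an assumption for illustration)
x = np.arange(1, 11, dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # intercept column + feature

def neg_log_likelihood(beta):
    """Average of -[y*log(p) + (1-y)*log(1-p)], written in a stable form."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, z) - y * z)

beta = np.zeros(2)
history = [neg_log_likelihood(beta)]

for iteration in range(1, 26):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))            # current probabilities
    gradient = X.T @ (p - y)                          # slope of the objective
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])    # curvature of the objective
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step
    history.append(neg_log_likelihood(beta))
    if np.max(np.abs(step)) < 1e-8:                   # converged: steps are negligible
        break

print("Iterations:", iteration)
print("Current function value:", history[-1])
print("Coefficients (intercept, slope):", beta)
```

Maximizing the likelihood is the same as minimizing this objective, so the reported function value is the lowest value the optimizer could reach.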
We created a small data set to explain the classification method for binary output in logistic regression.
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Now read the Excel file and view its first 5 rows.
df = pd.read_excel("logistic.xlsx")
df.head()
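If you do not have the Excel file at hand, a synthetic stand-in with the same layout (two feature columns followed by a binary label) lets you follow along. Everything in this sketch is an assumption: the column names, the sample size and the rule that generates the label are hypothetical, so the results will not match the outputs shown in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 80  # hypothetical sample size

# Hypothetical features, named after the axis labels used in the plot later on
age = rng.integers(20, 60, size=n)
bank_saving = rng.normal(50_000, 15_000, size=n).round(0)

# Hypothetical rule linking the features to the label, plus some noise
score = 0.08 * (age - 40) + 0.00005 * (bank_saving - 50_000) + rng.normal(0, 0.5, size=n)

df = pd.DataFrame({
    "age": age,
    "bank_saving": bank_saving,
    "target": (score > 0).astype(int),   # binary label in the third column
})
print(df.head())
```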
Divide the data set into independent and dependent variables.
x = df.iloc[:,[0,1]].values
y = df.iloc[:,2].values
Now, dividing the data into train and test data.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
Standardize the features so that each has zero mean and unit variance. The scaler is fitted on the training data only and then applied to the test data.
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
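A small sketch of what the scaler does, using made-up numbers: `fit_transform` learns the mean and standard deviation of each column from the training data, and `transform` reuses those same statistics on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration (two features on very different scales)
train = np.array([[25.0, 20_000.0],
                  [35.0, 40_000.0],
                  [45.0, 60_000.0],
                  [55.0, 80_000.0]])
test = np.array([[30.0, 30_000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns per-column mean and std from train
test_scaled = scaler.transform(test)         # applies the training statistics to test

# The same result by hand: subtract the column mean, divide by the column std
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training set alone keeps information from the test set out of the model.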
Fit the model to the training set.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
Make predictions with the classifier.
y_pred = classifier.predict(x_test)
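Behind `predict`, the classifier works with probabilities. The sketch below uses a tiny made-up data set (not the article's data) to show `predict_proba` and how, for a binary problem, the predicted class matches a 0.5 cut on the probability of class “1”.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, already-centered toy data for illustration
X_toy = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

X_new = np.array([[-1.2], [0.1], [1.8]])
proba = clf.predict_proba(X_new)   # columns follow clf.classes_: P(class 0), P(class 1)
pred = clf.predict(X_new)

# Thresholding P(class 1) at 0.5 reproduces the predicted labels
manual = (proba[:, 1] >= 0.5).astype(int)

print(proba)
print(pred, manual)
```

Working with `predict_proba` directly is useful when a threshold other than 0.5 suits the problem better.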
Generate the confusion matrix.
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix : \n", conf_matrix)
#output:
Confusion Matrix :
[[10 0]
[ 0 10]]
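To read the matrix, it helps to unpack its four cells. In this sketch the label arrays are reconstructed to match the matrix printed above (10 correct predictions per class); they are not the actual test labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstructed labels that reproduce the printed matrix (an assumption)
y_true = np.array([0] * 10 + [1] * 10)
y_hat = y_true.copy()   # every prediction is correct

# For binary labels, scikit-learn orders the cells as
# [[true negatives, false positives], [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```

The off-diagonal cells (false positives and false negatives) are both zero here, so every test sample was classified correctly.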
To check the accuracy of the logistic model.
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test, y_pred))
#output:
Accuracy : 1.0
To plot the binary classification model.
from matplotlib.colors import ListedColormap
X_set, y_set = x_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1,
stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(
np.array([X1.ravel(), X2.ravel()]).T).reshape(
X1.shape), alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    # one scatter call per class, colored to match the decision regions
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Bank Saving')
plt.legend()
plt.show()
Logistic regression with statsmodels.
import statsmodels.api as sm
Fit the logistic regression.
# add an intercept column, since statsmodels does not add one by default
x1 = sm.add_constant(x)
log_reg = sm.Logit(y, x1)
log_output = log_reg.fit()
Now check the summary of the statsmodels fit.
log_output.summary()
The logistic summary reports a pseudo R-squared. Several measures are used to judge a logistic fit, such as AIC, BIC and McFadden’s pseudo R-squared. The summary here uses McFadden’s, and its value is 0.3458. A McFadden pseudo R-squared between 0.2 and 0.4 is generally taken to indicate a good fit. The fitted logit model is obtained by substituting the estimated coefficients from the summary into the log-odds equation.
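McFadden’s pseudo R-squared compares the log-likelihood of the fitted model with that of a null model that contains only an intercept: 1 − LL(model) / LL(null). Here is a minimal sketch with made-up labels and hypothetical fitted probabilities; the helper names are defined here for illustration.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y, p_model):
    """1 - LL(model) / LL(null), where the null model predicts mean(y) for everyone."""
    p_null = np.full_like(p_model, y.mean())
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)

# Made-up labels and hypothetical fitted probabilities (not from the article's model)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)
p_model = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9])

r2 = mcfadden_r2(y, p_model)
print(round(r2, 4))
```

A model that does no better than the null model scores 0, and the value moves towards 1 as the fitted probabilities get closer to the observed labels.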
We created a general model of the logistic regression with the logit model.
Conclusion:
This article shows the basic idea of how logistic regression works in a binary classification problem. The results will vary with the data set and with the train/test split used.
I hope you like the article. Reach me on my LinkedIn and twitter.