
Logistic Regression for Binary Classification: Hands-On with SciKit-Learn

Using Python and Google Colab

Table of contents:

1. Introduction
2. What type of problems can be solved with Logistic Regression
3. Mathematical Interpretation
4. The C parameter
5. Hands-On:
   -Import Libraries
   -Create Data
   -Exploratory data analysis
   -Splitting data into train and test data sets
   -Build and evaluate the model
   -Finding the best C value
   -Build a visualisation for the model

Logistic Regression is one of the first algorithms you will read and hear about when the topic is Machine Learning. It is similar to multiple regression, but with a binary (dependent) output variable and continuous or categorical predictive variables. When the dependent variable is binary (1 or 0), linear regression is not appropriate, because its output is not restricted to the range between 0 and 1.

The output in logistic regression is expressed through a probability of occurrence, while in simple regression, a numerical value is obtained. Therefore, logistic regression presents itself as a method of determining the probability of occurrence of the predicted values of a binary variable.

What type of problems can be solved with Logistic Regression?

Logistic Regression is used to solve binary or multi-class classification tasks. In this article we will only discuss binary classification. One good example of using Logistic Regression for classification is disease detection, where several variables are used to predict a binary outcome (presence or absence of disease). The predictive variables can be categorical, continuous or ordinal.

Mathematical interpretation

The general expression for Logistic Regression is:

And the graphical representation is:

As you can see, the function's range is between 0 and 1. When we fit the model, x is replaced by a combination of the coefficients and our data parameters, and the function returns the probability of the outcome being 1. You can also notice by graphical inspection that for x<0 the y value is below 0.5 (closer to 0), and for x>0 it is above 0.5 (closer to 1). So, if you know whether x is positive or negative, you know which class the object belongs to without needing to calculate y. However, you still need the value of y if you want to know how strong the prediction is.
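The formula itself is not reproduced in this copy of the article. Assuming it is the standard logistic (sigmoid) function, which matches the description above (values between 0 and 1, and y = 0.5 at x = 0), it can be written as:

```latex
y = \sigma(x) = \frac{1}{1 + e^{-x}}
```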

For example:

With x = 3:

We will get:

This means a high probability that the object belongs to class 1. In contrast, if x = -4, the object has a high probability of being in class 0.
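The two cases above can be checked numerically. This is a minimal sketch that assumes the standard logistic function; the helper name sigmoid is chosen here for illustration and does not come from the article.

```python
import math

def sigmoid(x):
    """Standard logistic function: maps any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

p_pos = sigmoid(3)    # probability of class 1 when x = 3
p_neg = sigmoid(-4)   # probability of class 1 when x = -4

print(f"sigmoid(3)  = {p_pos:.4f}")   # about 0.9526
print(f"sigmoid(-4) = {p_neg:.4f}")   # about 0.0180
```

With x = 3 the probability of class 1 is roughly 95%, while with x = -4 it is under 2%, so the probability of class 0 is above 98%.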

But how is x calculated in the model?

To fit our model, which means calculating the x value, a linear relationship is assumed between the predictive variables and the log-odds of the outcome (y=1 versus y=0). This way, x is expressed as a linear combination of the predictors, as in linear regression, with an intercept m and coefficient values B1, B2, B3... for the predictive variables.
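As a rough illustration of this step, the sketch below computes x for a single observation and converts it into a probability. The intercept, the coefficients and the feature values are hypothetical numbers picked for the example; they are not fitted to any data.

```python
import math

# Hypothetical values chosen only for illustration (not fitted to any data).
m = -1.0                       # intercept
B = [0.08, -0.9, 0.004]        # coefficients B1, B2, B3
features = [40.0, 3.0, 200.0]  # one observation: FeNO, FEV1, BD

# Linear part: x = m + B1*x1 + B2*x2 + B3*x3
x = m + sum(b * v for b, v in zip(B, features))

# The logistic function turns x into a probability of class 1
probability = 1.0 / (1.0 + math.exp(-x))
predicted_class = 1 if x > 0 else 0

print(f"x = {x:.2f}, P(y=1) = {probability:.3f}, class = {predicted_class}")
```

Here x is slightly positive, so the predicted class is 1, but the probability is only a little above 0.5, which signals a weak prediction.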

I have explained linear regression in previous publications, and you can find it here and here.

The C parameter:

The C value in Logistic Regression is a user-adjustable parameter that controls regularisation. In simple terms, higher values of C instruct the model to fit the training set as closely as possible, while lower values of C favour simpler models with coefficients closer to zero.
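The effect of C on the coefficients can be seen on a toy problem. The sketch below is only an illustration: the data is synthetic and the variable names (X_demo, y_demo, coef_norms) are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, made up for this illustration only.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-1.0, 1.5, size=(100, 2)),
                    rng.normal(1.0, 1.5, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

coef_norms = {}
for c in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=c).fit(X_demo, y_demo)
    # Size of the coefficient vector: smaller means stronger regularisation
    coef_norms[c] = float(np.linalg.norm(model.coef_))
    print(f"C={c:>6}: |coefficients| = {coef_norms[c]:.3f}")
```

The smallest C should give the shortest coefficient vector, because in SciKit-Learn C is the inverse of the regularisation strength.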

Hands-on:

Now we will fully implement and evaluate a Logistic Regression model using SciKit-Learn and Python. We will also learn how to adjust model parameters to control model complexity, over-fitting and/or under-fitting.

Import the necessary libraries:

#Import Libraries:

from random import random
from random import randint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from mlxtend.plotting import plot_decision_regions

Create Data:

We will create data for this case study by building an example dataset. I will name the variables after parameters of lung function (FEV1 and BD) and lung inflammation (FeNO), and the outcome variable will be disease presence or absence, in this case for asthma. But please remember, this is not real data: it is artificially created for this example, based on my expert knowledge in this field. If you are using your own data, you can jump directly to the step “Exploratory Data Analysis”.
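Because the values are drawn at random, each run produces a slightly different dataset and slightly different scores. If repeatable numbers are wanted, a seed can be fixed. The sketch below uses NumPy's Generator API with an arbitrary seed of 42, which is an assumption of this example; for the np.random.normal and np.random.uniform calls used in this article, calling np.random.seed(42) once beforehand has a similar effect.

```python
import numpy as np

# Seeded generators: the seed value 42 is an arbitrary choice.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# normal(mean, standard deviation, size)
sample_a = rng_a.normal(15, 20, 100)
sample_b = rng_b.normal(15, 20, 100)

# Same seed and same sequence of calls give identical "random" values
print(np.array_equal(sample_a, sample_b))
```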

#Fabricating variables:

#Creating values for FeNO with 3 classes:
FeNO_0 = np.random.normal(15,20, 100)
FeNO_1 = np.random.normal(35,20, 100)
FeNO_2 = np.random.normal(65, 20, 100)

#Creating values for FEV1 with 3 classes:
#(np.random.uniform expects low, high, size, so the bounds are given in that order)
FEV1_0 = np.random.normal(4.50, 1, 100)
FEV1_1 = np.random.uniform(1.2, 3.75, 100)
FEV1_2 = np.random.uniform(1.2, 2.35, 100)

#Creating values for Broncho Dilation with 3 classes:
BD_0 = np.random.normal(150,49, 100)
BD_1 = np.random.uniform(50, 250, 100)
BD_2 = np.random.uniform(50, 350, 100)

#Creating labels variable with two classes (1)Disease (0)No disease:
not_asthma = np.zeros((150,), dtype=int)
asthma = np.ones((150,), dtype=int)

Now we will concatenate the previously created values into three predictive variables and one outcome variable:

#Concatenate classes into one variable:

FeNO = np.concatenate([FeNO_0, FeNO_1, FeNO_2])
FEV1 = np.concatenate([FEV1_0, FEV1_1, FEV1_2])
BD = np.concatenate([BD_0, BD_1, BD_2])
dx = np.concatenate([not_asthma, asthma])

The next step is to create a DataFrame and add the variables to the DataFrame:

#Create DataFrame:
df = pd.DataFrame()

#Add variables to DataFrame:
df['FeNO'] = FeNO.tolist()
df['FEV1'] = FEV1.tolist()
df['BD'] = BD.tolist()
df['dx'] = dx.tolist()

We can take a look at our DataFrame by simply typing “df” to check if everything is OK. It is possible to see that our DataFrame has 4 columns (three predictive variables and one outcome variable) and 300 rows.
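A few quick checks can complement the visual inspection. The sketch below runs them on a stand-in DataFrame called demo (a hypothetical name, with made-up values) that has the same columns as df; on the real data the same calls would be applied to df.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame with the same column names (values are made up).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "FeNO": rng.normal(35, 20, 300),
    "FEV1": rng.normal(3.5, 1.0, 300),
    "BD": rng.normal(250, 50, 300),
    "dx": np.concatenate([np.zeros(150, dtype=int), np.ones(150, dtype=int)]),
})

print(demo.shape)                 # (rows, columns)
print(demo.head())                # first five rows
print(demo["dx"].value_counts())  # number of rows per class
print(demo.isna().sum())          # missing values per column
```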

Exploratory Data Analysis:

This is a simple Exploratory Data Analysis (EDA) just to understand how our variables behave and how they relate to each other and to the outcome (disease presence or absence). First we will see how our variables are distributed according to whether the disease is present or not:

#Exploring dataset:

sns.pairplot(df, kind="scatter", hue="dx")
plt.show()

We can check the ability of the different parameters to distinguish between disease presence and absence. The parameter “BD” shows the worst power to distinguish between the two classes, as we can see a higher overlap of values.

sns.boxplot( x=df["dx"], y=df["FEV1"] )

sns.boxplot( x=df["dx"], y=df["FeNO"] )

sns.boxplot( x=df["dx"], y=df["BD"] )

And then we can check correlations between the different parameters. The correlations are always stronger for the cases when asthma is present, and less significant when disease is absent.

sns.lmplot(x="FEV1", y="FeNO", data=df, fit_reg=True, hue='dx', legend=True)

We see a stronger correlation between FEV1 and FeNO when disease is present, and this correlation is negative (seen by the negative slope), which means that for higher values of FeNO we will find lower values of FEV1.

sns.lmplot(x="FEV1", y="BD", data=df, fit_reg=True, hue='dx', legend=True)

When checking the correlation between FEV1 and BD, we find a negative correlation when asthma is present. This way, lower values of FEV1 are associated with higher broncho dilation values.

sns.lmplot(x="FeNO", y="BD", data=df, fit_reg=True, hue='dx', legend=True)

Lastly, the correlation between FeNO and BD shows that higher values of FeNO are associated with higher values of BD when asthma is present (positive correlation).
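The slopes in these plots can be backed up with numbers by computing the correlation coefficient within each class. The sketch below is an illustration on made-up data in which a negative FEV1-FeNO relationship is deliberately built in for the dx = 1 group; the names demo and corr_by_class are hypothetical. On the article's data, the same groupby pattern would be applied to df.

```python
import numpy as np
import pandas as pd

# Made-up data: a negative FEV1-FeNO relationship is built in for dx = 1 only.
rng = np.random.default_rng(2)
fev1 = rng.normal(3.5, 1.0, 200)
dx = np.repeat([0, 1], 100)
feno = np.where(dx == 1, 100 - 20 * fev1, 20.0) + rng.normal(0, 5, 200)
demo = pd.DataFrame({"FEV1": fev1, "FeNO": feno, "dx": dx})

# Pearson correlation between FEV1 and FeNO within each class
corr_by_class = {
    label: group["FEV1"].corr(group["FeNO"])
    for label, group in demo.groupby("dx")
}
for label, r in corr_by_class.items():
    print(f"dx={label}: r = {r:.2f}")
```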

Splitting data into train and test Data sets:

We will use 80% of our data to build the model, and the remaining 20% to test the model. But first, we will create our X and y variables, where X represents the dataset with the predictors, and y an array of values with the outcome.

#Creating X and y:
X = df.drop('dx', axis=1)
y = df['dx']

#Data split into train and test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
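As written, the split is different on every run and the class proportions in the test set can drift slightly away from 50/50. A possible variation, sketched below on stand-in data, uses the stratify and random_state parameters of train_test_split; the seed value 0 and the names X_demo, y_demo, X_tr, X_te, y_tr, y_te are assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as in the article (300 rows, 3 predictors).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 3))
y_demo = np.repeat([0, 1], 150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    stratify=y_demo,   # keep the 50/50 class ratio in both subsets
    random_state=0,    # arbitrary seed, so the split is repeatable
)

print(X_tr.shape, X_te.shape)   # (240, 3) (60, 3)
print(np.bincount(y_te))        # [30 30]
```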

Build and evaluate the model:

#Fit the model:
logisticregression = LogisticRegression().fit(X_train, y_train)

#Evaluate the model:
print("training set score: %f" % logisticregression.score(X_train, y_train))
print("test set score: %f" % logisticregression.score(X_test, y_test))

As we can see, our model performed slightly better on the training set, which may indicate we are over-fitting. Fortunately, we can use the C value to adjust the model and try to find a better one that balances model complexity, over-fitting and under-fitting.
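Before tuning C, it can help to look beyond a single accuracy number, which is what score returns for a classifier. The sketch below, on made-up data with hypothetical variable names, prints a confusion matrix and the predicted probabilities of class 1; with the article's data, the same calls would be made on logisticregression, X_test and y_test.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Made-up two-class data, used only to have something to fit.
rng = np.random.default_rng(4)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
                    rng.normal(1.5, 1.0, size=(150, 3))])
y_demo = np.repeat([0, 1], 150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# Rows: true class (0, 1); columns: predicted class (0, 1)
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)

# Column 1 of predict_proba is the estimated probability of class 1
proba = model.predict_proba(X_te)[:, 1]
print(proba[:5].round(3))
```

In a disease-detection setting the off-diagonal cells matter separately: one counts healthy subjects flagged as diseased, the other counts missed cases.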

Let's see what happens if we set C=100:

#C=100
logisticregression100 = LogisticRegression(C=100).fit(X_train, y_train)

print("training set score: %f" % logisticregression100.score(X_train, y_train))
print("test set score: %f" % logisticregression100.score(X_test, y_test))

Setting C=100 has a small effect on training set score, with no effect on test set score, which means that it does not improve the model.

Setting C=0.01:

#C=0.01
logisticregression001 = LogisticRegression(C=0.01).fit(X_train, y_train)

print("training set score: %f" % logisticregression001.score(X_train, y_train))
print("test set score: %f" % logisticregression001.score(X_test, y_test))

Setting C=0.01 decreases both training set score and test set score, which means that it is not a good value for this parameter.

Finding the best C value:

To find the best C value we should use a more systematic approach than trial and error. One way to do it is by plotting the accuracy scores at different C values for both training and test sets.

training_accuracy = []
test_accuracy = []

# try c values from 0.001 to 100:
c_settings = np.arange(0.001, 100, 0.1)

for i in c_settings:
    # build the model
    clf = LogisticRegression(C=i)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(c_settings, training_accuracy, label="training accuracy")
plt.plot(c_settings, test_accuracy, label="test accuracy")
plt.legend()

In the plot we can see that training and test accuracy values come closer together from a C value of approximately 10. It is not entirely clear in this plot, so we can plot a smaller interval:
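The code for the narrower plot is not preserved in this copy. A minimal sketch of the idea is given below, assuming an interval of C from 1 to 20 in steps of 1 (an assumption, not the author's setting) and using stand-in data with hypothetical variable names. The resulting lists can be passed to plt.plot in the same way as in the loop above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data; in the article X_train, X_test, y_train, y_test already exist.
rng = np.random.default_rng(5)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
                    rng.normal(1.0, 1.0, size=(150, 3))])
y_demo = np.repeat([0, 1], 150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Narrower interval (assumed here: C from 1 to 20 in steps of 1)
c_zoom = np.arange(1.0, 21.0, 1.0)
train_acc, test_acc = [], []
for c in c_zoom:
    clf = LogisticRegression(C=float(c)).fit(X_tr, y_tr)
    train_acc.append(clf.score(X_tr, y_tr))
    test_acc.append(clf.score(X_te, y_te))

# C value where training and test accuracy are closest to each other
gaps = np.abs(np.array(train_acc) - np.array(test_acc))
print("smallest gap at C =", c_zoom[np.argmin(gaps)])
```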

Now it is easy to see that at C=12 the training and test accuracy values are closest to each other, so we take this as our optimal C value. We can check it:
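The output of that check is not preserved in this copy. A sketch of what such a check might look like is shown below on stand-in data with hypothetical variable names. It also adds a 5-fold cross-validation score, a swapped-in technique that is not part of the original article, as an estimate that depends less on one particular train/test split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Stand-in data; in the article X_train, X_test, y_train, y_test already exist.
rng = np.random.default_rng(6)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
                    rng.normal(1.0, 1.0, size=(150, 3))])
y_demo = np.repeat([0, 1], 150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Refit with the chosen value and compare both scores
clf12 = LogisticRegression(C=12).fit(X_tr, y_tr)
train_score = clf12.score(X_tr, y_tr)
test_score = clf12.score(X_te, y_te)
print("training set score: %f" % train_score)
print("test set score: %f" % test_score)

# Mean accuracy over 5 folds of the training data
cv_scores = cross_val_score(LogisticRegression(C=12), X_tr, y_tr, cv=5)
print("5-fold CV accuracy: %f" % cv_scores.mean())
```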

Build a visualisation for the model:

Lastly, we can visualise our model performance by building a graph with the decision regions. For that, we first save our DataFrame in CSV format and load it again (this step is optional, since the DataFrame df could be used directly):

df.to_csv('data.csv', index = False)
data = pd.read_csv('data.csv')

Then we build a function for the graph using two main variables (we will use FEV1 and FeNO, since BD has greater class overlap):

def logisticReg_comparison(data,c):
    #Use two predictors so the decision regions can be drawn in 2D:
    x = data[['FEV1','FeNO']].values
    y = data['dx'].astype(int).values
    LogReg = LogisticRegression(C=c)
    LogReg.fit(x,y)
    #Accuracy on the same data used for fitting:
    print(LogReg.score(x,y))
    #Plot decision region:
    plot_decision_regions(x,y, clf=LogReg, legend=2)
    #Adding axes annotations (axes follow the column order of x):
    plt.xlabel('FEV1')
    plt.ylabel('FeNO')
    plt.title('LogReg with C='+str(c))
    plt.show()

logisticReg_comparison(data,12)

Thank you for reading! Let me know if you have corrections or suggestions and don’t forget to subscribe to receive notifications about my future publications.

If: you liked this article, don’t forget to follow me and thus receive all updates about new publications.

Else If: you want to read more, you can subscribe to Medium membership with my referral link (https://cdanielaam.medium.com/membership). It will not cost you more but will pay me for a coffee.

Else: Thank you!