Dr. Soumen Atta, Ph.D.

Exploring the Logistic Regression Algorithm with Heart Disease Dataset in Python

Photo by Giulia Bertelli on Unsplash

Logistic Regression is a popular classification algorithm used in machine learning. In this tutorial, we will explore how to implement the Logistic Regression algorithm using Python’s scikit-learn library. We will use the Heart Disease dataset as an example and cover the necessary steps, including importing and preprocessing the data, training the model, evaluating its performance, and making predictions. By the end of this tutorial, you will have a good understanding of how to use Logistic Regression for classification problems in Python.

Importing the Dataset

The first step is to import the Heart Disease dataset from the UCI Machine Learning Repository website. We can use the pandas library to read the dataset.

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Read the CSV file from the URL into a pandas dataframe
heart_df = pd.read_csv(url, header=None)

# Print the first 5 rows of the dataframe
print(heart_df.head())

In this example, we read the data file from the URL "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data" into a pandas DataFrame named heart_df. We set the header parameter to None because the file has no header row, so pandas labels the columns with the integers 0 through 13. Finally, we print the first 5 rows of the DataFrame using the head() function. The output is shown below:

     0    1    2      3      4    5    6      7    8    9    10   11   12  13
0  63.0  1.0  1.0  145.0  233.0  1.0  2.0  150.0  0.0  2.3  3.0  0.0  6.0   0
1  67.0  1.0  4.0  160.0  286.0  0.0  2.0  108.0  1.0  1.5  2.0  3.0  3.0   2
2  67.0  1.0  4.0  120.0  229.0  0.0  2.0  129.0  1.0  2.6  2.0  2.0  7.0   1
3  37.0  1.0  3.0  130.0  250.0  0.0  0.0  187.0  0.0  3.5  3.0  0.0  3.0   0
4  41.0  0.0  2.0  130.0  204.0  0.0  2.0  172.0  0.0  1.4  1.0  0.0  3.0   0

You can easily check the dimension of the loaded dataset using the following command:

print(heart_df.shape)

The output will be:

(303, 14)
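
The raw file marks missing entries with a question mark. As a possible alternative to cleaning them up later, read_csv can be asked to treat that marker as missing while loading. The sketch below is only an illustration: it parses two in-memory rows instead of the real URL, the first copied from the output above and the second made up to contain a "?", and the names rows and demo_df are hypothetical.

```python
import io
import pandas as pd

# Two rows in the same 14-column layout as the dataset.
# The first is taken from the head() output above; the second is
# invented for illustration and has a "?" in column 12.
rows = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
50.0,0.0,3.0,120.0,219.0,0.0,0.0,158.0,0.0,1.6,2.0,0.0,?,0
"""

# na_values="?" tells pandas to parse every "?" as NaN,
# so the affected column stays numeric instead of becoming text
demo_df = pd.read_csv(io.StringIO(rows), header=None, na_values="?")

print(demo_df[12].dtype)            # dtype of the column that held "?"
print(demo_df.isna().sum().sum())   # total number of missing entries
```

With the real dataset, the same na_values argument could be passed to the pd.read_csv(url, header=None) call shown earlier.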

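The Heart Disease dataset contains 14 columns: 13 features and a target variable in the last column (0 if the patient has no heart disease, 1 through 4 if heart disease is present). Because the file was read with header=None, the columns are labeled with integers rather than names, so the target is column 13. We can separate the target variable from the features using:

X = heart_df.drop(13, axis=1)
y = heart_df[13]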

In the code snippet above, we split the heart_df DataFrame into input features and an output target.

The drop() function of pandas is used to remove the column with index 13 from the DataFrame heart_df, which represents the target variable. The resulting DataFrame X contains all the input features except the target variable. We are passing axis=1 to specify that we want to drop a column, not a row.

The target variable is stored in the y variable by selecting the column with index 13 from the heart_df DataFrame.

After this step, the input features are stored in X and the target variable is stored in y. These variables are used later for model training and testing.

This separation of the input features and target variable is a common step in supervised machine learning workflows.
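
Integer column labels are easy to mix up, so it can help to name the columns. The sketch below assigns the attribute names listed in the UCI documentation for this dataset and, as an optional step that this tutorial does not take, collapses the target into a binary label (0 for no disease, 1 for any nonzero value). It parses the five rows printed earlier from memory so that it runs without a network connection. The names column_names, demo_df, X_demo, y_demo, and y_binary are illustrative.

```python
import io
import pandas as pd

# Attribute names as listed in the UCI documentation for this file;
# "num" is the diagnosis column used as the target
column_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

# The five rows shown in the head() output earlier in this tutorial
rows = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
"""
demo_df = pd.read_csv(io.StringIO(rows), header=None, names=column_names)

# Same feature/target split as above, but by column name
X_demo = demo_df.drop("num", axis=1)
y_demo = demo_df["num"]

# Optional: treat any nonzero diagnosis value as "disease present"
y_binary = (y_demo > 0).astype(int)
print(y_binary.tolist())  # [0, 1, 1, 0, 0]
```

A binary target would match the "has heart disease or not" framing directly, but it would also change the accuracy figures reported later, so the rest of this tutorial keeps the original five-valued target.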

Preprocessing the Data

Next, we need to preprocess the data before we can train the Logistic Regression model. This includes handling missing values, scaling the features, and splitting the data into training and testing sets.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Handling missing values: the file marks them with '?', so convert every
# column to numeric (which turns '?' into NaN), then fill with the column mean
X = X.apply(pd.to_numeric, errors="coerce")
X = X.fillna(X.mean())

# Scaling the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The above code performs the following steps:

  1. Converts every column of the feature matrix X to a numeric type, which turns each ‘?’ placeholder into NaN.
  2. Fills the missing values in X with the mean of the respective feature column.
  3. Scales the features of X using the StandardScaler() from sklearn.preprocessing.
  4. Splits the data into training and testing sets using train_test_split() from sklearn.model_selection, with a test size of 0.2 (i.e., 20% of the data is reserved for testing). The training and testing data are stored in X_train, X_test, y_train, and y_test.

Note that scaling the features is an important step in many machine learning algorithms, as it can improve their performance and convergence. Also, replacing missing values with the mean is a simple imputation technique that is often used when dealing with missing data. However, it is worth noting that this technique may not always be the best approach, and other imputation techniques may be more appropriate depending on the nature of the data and the problem being solved.

After this preprocessing step, the input features are scaled and the data is split into training and testing sets, which are ready to be used for training and evaluating the logistic regression model.
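
One caveat with the order used above is that the scaler and the column means are computed on the full dataset, so the test rows influence the preprocessing. A common alternative, sketched below, is to split first and wrap imputation, scaling, and the classifier in a scikit-learn pipeline so that every statistic is learned from the training rows only. This is not the approach taken in this tutorial; the data here is synthetic (generated with make_classification, with a few values blanked out), and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data: 300 samples, 13 features
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Blank out roughly 2% of the entries to imitate missing values
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.02] = np.nan

# Split BEFORE any statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Mean imputation -> standardization -> logistic regression.
# fit() learns the means and scales from the training rows only.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)  # accuracy on the held-out rows
print("Test accuracy:", acc)
```

Passing random_state to train_test_split, as done here, also makes the split reproducible between runs.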

Training the Logistic Regression Model

Now that the data is preprocessed, we can train the Logistic Regression model. We can use the LogisticRegression class from scikit-learn's linear_model module:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

The code above imports the LogisticRegression class from scikit-learn's linear_model module and creates an instance of it called model.

Next, we train the logistic regression model using the fit() method. The fit() method takes the training features (X_train) and the corresponding target labels (y_train) as input and fits the model to the training data.

After the model is trained, it can be used to predict the target labels for the test set.
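
To give a sense of what the fitted model computes, the sketch below covers the two-class case: the model forms a weighted sum of the features plus an intercept and passes it through the sigmoid function to obtain a probability. The example uses synthetic binary data rather than the heart dataset, and the names clf, z, p_manual, and p_sklearn are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, used only to illustrate the computation
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: z = w . x + b, using the learned weights and intercept
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid maps the score to a probability between 0 and 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# scikit-learn's own probability for the positive class
p_sklearn = clf.predict_proba(X_demo)[:, 1]

# The two should agree up to floating-point error
print(np.allclose(p_manual, p_sklearn))
```

With more than two classes, as in the five-valued target used in this tutorial, scikit-learn generalizes this idea across classes, so the single-sigmoid formula above applies only to the binary case.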

Evaluating the Model

After training the model, we can evaluate its performance on the testing set. We can use the accuracy_score function from scikit-learn's metrics module to calculate the accuracy of the model:

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

The above code calculates the accuracy of the logistic regression model on the test data.

model.predict(X_test) predicts the target values of the test data using the fitted logistic regression model. These predicted values are stored in y_pred.

accuracy_score(y_test, y_pred) computes the accuracy by comparing the predicted target values y_pred against the actual target values y_test. The accuracy score is defined as the fraction of correctly predicted instances out of the total number of instances in the test set. This score is stored in the accuracy variable.

Finally, the accuracy is printed using the print() function. Because train_test_split shuffles the data randomly when no random_state is given, the exact value varies from run to run. The output of one run is shown below:

Accuracy: 0.5901639344262295
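
The definition of accuracy given above can be checked by hand. The sketch below uses two small made-up label arrays (not output from the heart model) and compares a manual calculation with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for five samples, for illustration only
y_true = np.array([0, 1, 2, 0, 1])
y_hat = np.array([0, 2, 2, 0, 0])

# Fraction of positions where the prediction matches the true label
manual = np.mean(y_true == y_hat)

print(manual)                         # 0.6 (3 of 5 correct)
print(accuracy_score(y_true, y_hat))  # 0.6
```

Accuracy treats every mistake the same way, so with a multi-valued target it does not show which classes are being confused with each other.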

Making Predictions

Finally, we can use the trained Logistic Regression model to make predictions on new data:

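sample = [[54, 1, 4, 110, 214, 0, 0, 158, 0, 1.6, 2, 2, 0]]

# The model was trained on standardized features, so apply the same fitted scaler
sample_scaled = scaler.transform(sample)
prediction = model.predict(sample_scaled)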

print("Prediction:", prediction)

This code snippet uses the trained Logistic Regression model to predict whether a new patient with the given characteristics has heart disease.

The sample variable contains a new observation with 13 features (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, the number of major vessels colored by fluoroscopy, and thalassemia).

The model.predict() method is used to predict the target variable for the given sample. The predicted value is stored in the variable prediction.

Finally, the predicted value is printed to the console using the print() function. The exact result depends on the random train/test split; the output of one run is shown below:

Prediction: [1]

The output Prediction: [1] indicates that the logistic regression model assigned the input sample to class 1. In this dataset, a target of 0 denotes the absence of heart disease and the values 1 through 4 denote its presence, so the model predicts that this patient has heart disease.
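
A model trained on standardized features expects new inputs on the same scale, and it can report class probabilities in addition to a single label. The sketch below illustrates both points on synthetic data, not on the heart dataset; the names demo_scaler, clf, new_sample, and new_scaled are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, shifted and stretched so the raw features are far from
# zero mean and unit variance (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_demo = X_demo * 50 + 100

# Fit the scaler once, then train the model on the scaled features
demo_scaler = StandardScaler().fit(X_demo)
clf = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

# A "new" raw observation (here simply the first row, as a stand-in)
new_sample = X_demo[:1]

# Reuse the already fitted scaler: transform(), never fit_transform()
new_scaled = demo_scaler.transform(new_sample)

print("Predicted class:", clf.predict(new_scaled))
print("Class probabilities:", clf.predict_proba(new_scaled))
```

The probabilities returned by predict_proba are ordered as in clf.classes_ and sum to 1 for each sample, which gives a rough sense of how confident the model is in its label.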

Conclusion

In this tutorial, we explored the Logistic Regression algorithm and its implementation using Python’s scikit-learn library. We used the Heart Disease dataset to train a Logistic Regression classifier and made predictions on a new sample of data.

Logistic Regression is a powerful and widely used classification algorithm in machine learning. By following the steps in this tutorial, you should have a good understanding of how Logistic Regression works and how to use it for your own classification problems.
