Exploring the Logistic Regression Algorithm with Heart Disease Dataset in Python
Logistic Regression is a popular classification algorithm used in machine learning. In this tutorial, we will explore how to implement the Logistic Regression algorithm using Python’s scikit-learn library. We will use the Heart Disease dataset as an example and cover the necessary steps, including importing and preprocessing the data, training the model, evaluating its performance, and making predictions. By the end of this tutorial, you will have a good understanding of how to use Logistic Regression for classification problems in Python.
Importing the Dataset
The first step is to import the Heart Disease dataset from the UCI Machine Learning Repository website. We can use the pandas library to read the dataset.
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# Read the CSV file from the URL into a pandas dataframe
heart_df = pd.read_csv(url, header=None)
# Print the first 5 rows of the dataframe
print(heart_df.head())
In this example, we read the data file from the URL above into a pandas DataFrame named heart_df. We set the header parameter to None because the CSV file does not contain column names in its first row. Finally, we print the first 5 rows of the DataFrame using the head() function. The output is shown below:
      0    1    2      3      4    5    6      7    8    9   10   11   12  13
0  63.0  1.0  1.0  145.0  233.0  1.0  2.0  150.0  0.0  2.3  3.0  0.0  6.0   0
1  67.0  1.0  4.0  160.0  286.0  0.0  2.0  108.0  1.0  1.5  2.0  3.0  3.0   2
2  67.0  1.0  4.0  120.0  229.0  0.0  2.0  129.0  1.0  2.6  2.0  2.0  7.0   1
3  37.0  1.0  3.0  130.0  250.0  0.0  0.0  187.0  0.0  3.5  3.0  0.0  3.0   0
4  41.0  0.0  2.0  130.0  204.0  0.0  2.0  172.0  0.0  1.4  1.0  0.0  3.0   0
You can easily check the dimensions of the loaded dataset using the following command:
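print(heart_df.shape)
The output will be:
(303, 14)
The Heart Disease dataset contains 14 columns: 13 features and a target variable. In this file the target is an integer from 0 (no heart disease) to 4, where any value from 1 to 4 indicates the presence of heart disease. Because the file has no header row, pandas labels the columns with the integers 0 to 13, and the target is column 13. We can separate the target variable from the features using: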
X = heart_df.drop(13, axis=1)
y = heart_df[13]
In the code snippet above, we split the heart_df DataFrame into input features and an output target.
The drop() function of pandas is used to remove the column with index 13 from the DataFrame heart_df, which represents the target variable. The resulting DataFrame X contains all the input features except the target variable. We are passing axis=1 to specify that we want to drop a column, not a row.
The target variable is stored in the y variable by selecting the column with index 13 from the heart_df DataFrame.
After this step, the input features are stored in X and the target variable is stored in y. These variables are used later for model training and testing.
This separation of the input features and target variable is a common step in supervised machine learning workflows.
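Working with integer column labels is easy to get wrong, so it can help to name the columns when loading. The sketch below is one way to do that, not the tutorial's own method. It assumes the column names commonly documented for the processed Cleveland file, calls the last column "target" purely for illustration, and inlines a few rows so it runs offline. The last inlined row is made up to show how a '?' entry is parsed; named_df, X_named, y_named, and y_binary are hypothetical names.

```python
import io
import pandas as pd

# Column names as commonly documented for the processed Cleveland file
# (an assumption here; the raw file itself has no header row).
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# First rows of the file, inlined so the sketch runs without network access.
# The last row is made up to demonstrate a '?' (missing) entry.
raw = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
50.0,0.0,3.0,120.0,220.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0
"""

# names= assigns labels; na_values="?" parses '?' as NaN while reading
named_df = pd.read_csv(io.StringIO(raw), header=None, names=columns, na_values="?")

# Separate features and target by name instead of by position
X_named = named_df.drop("target", axis=1)
y_named = named_df["target"]

# Optional: collapse the 0-4 target into absent (0) versus present (1)
y_binary = (y_named > 0).astype(int)

print(X_named.shape)
print(y_binary.tolist())
```

With the real file, the same read_csv arguments could be applied to the URL instead of the inlined string.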
Preprocessing the Data
Next, we need to preprocess the data before we can train the Logistic Regression model. This includes handling missing values, scaling the features, and splitting the data into training and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Handling missing values: the file marks them with '?', so convert those
# entries to NaN and cast every column to float before filling with the mean
X = X.replace('?', np.nan).astype(float)
X = X.fillna(X.mean())
# Scaling the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The above code performs the following steps:
- Replaces all occurrences of '?' in the feature matrix X with a missing-value marker.
- Fills the missing values in X with the mean of the respective feature column.
- Scales the features of X using the StandardScaler() from sklearn.preprocessing.
- Splits the data into training and testing sets using train_test_split() from sklearn.model_selection, with a test size of 0.2 (i.e., 20% of the data is reserved for testing). The training and testing data are stored in X_train, X_test, y_train, and y_test.
Note that scaling the features is an important step for many machine learning algorithms, as it can improve their performance and convergence. Replacing missing values with the mean is a simple imputation technique, but it is not always the best choice; other techniques (for example, the median for skewed features) may be more appropriate depending on the nature of the data and the problem being solved. Also note that the scaler here is fitted on the full dataset before splitting, which lets statistics from the test set influence training; fitting the scaler on the training set only avoids this leakage.
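To make the imputation step concrete, here is a minimal sketch on a made-up two-column frame. It uses pd.to_numeric with errors="coerce" as one possible way to turn '?' into NaN, then compares mean and median filling. The frame toy and its values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Toy frame (made-up values) with '?' marking missing entries, stored as strings
toy = pd.DataFrame({
    "ca": ["0.0", "3.0", "?", "1.0"],
    "thal": ["3.0", "?", "7.0", "7.0"],
})

# errors="coerce" turns anything non-numeric, including '?', into NaN
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

# Two simple imputation choices: column mean and column median
mean_filled = toy_numeric.fillna(toy_numeric.mean())
median_filled = toy_numeric.fillna(toy_numeric.median())

print(mean_filled)
print(median_filled)
```

The two results differ whenever a column is skewed, which is one reason the choice of imputation method deserves a look at the data first.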
After this preprocessing step, the input features are scaled and the data is split into training and testing sets, which are ready to be used for training and evaluating the logistic regression model.
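A variation on the order of these steps is to split first and fit the scaler on the training portion only, so that no test-set statistics reach the model. The sketch below illustrates that ordering on synthetic data; X_demo, y_demo, and scaler_demo are hypothetical names, and random_state is fixed only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart disease dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first; a fixed random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```

After this, the training features have zero mean and unit variance per column, while the test features are only approximately standardized, which is the expected behavior.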
Training the Logistic Regression Model
Now that the data is preprocessed, we can train the Logistic Regression model. We can use the LogisticRegression class from scikit-learn's linear_model module:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
The code above uses the logistic regression implementation from scikit-learn. We first import the LogisticRegression class from the linear_model module and then create an instance of it called model.
Next, we train the logistic regression model using the fit() method. The fit() method takes the training features (X_train) and the corresponding target labels (y_train) as input and fits the model to the training data.
After the model is trained, it can be used to predict the target labels for the test set.
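To show what the fitted model computes, the following sketch trains LogisticRegression on a synthetic two-class problem and reproduces its predicted probabilities by hand: a weighted sum of the features passed through the sigmoid function. This covers the binary case only, and the data and names (X_syn, y_syn, clf) are made up for illustration; with more than two classes, as in the raw 0 to 4 target, the per-class scores are combined differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X_syn, y_syn)

# Linear score z = w . x + b for every sample
z = X_syn @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps the score to a probability of the positive class
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = clf.predict_proba(X_syn)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```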
Evaluating the Model
After training the model, we can evaluate its performance on the testing set. We can use the accuracy_score function from scikit-learn's metrics module to calculate the accuracy of the model:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
The above code calculates the accuracy of the logistic regression model on the test data.
model.predict(X_test) predicts the target values of the test data using the fitted logistic regression model. These predicted values are stored in y_pred.
accuracy_score(y_test, y_pred) computes the accuracy by comparing the predicted target values y_pred with the actual target values y_test. The accuracy score is the fraction of correctly predicted instances out of the total number of instances in the test set. This score is stored in the accuracy variable.
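The definition of accuracy can be checked by hand. The sketch below uses a small set of made-up labels (y_true_demo and y_pred_demo are hypothetical) to compute the fraction of matches directly, compares it with accuracy_score, and prints a confusion matrix, which shows where the errors fall by class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration
y_true_demo = [0, 1, 0, 2, 1, 0]
y_pred_demo = [0, 1, 1, 2, 0, 0]

# Accuracy by hand: correct predictions divided by total predictions
manual = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)

# The same quantity from scikit-learn
library = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(manual, library)
print(cm)
```

Four of the six predictions match, so both values equal 4/6.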
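Finally, the accuracy variable is printed using the print() function. Because train_test_split() was called without a fixed random_state, the split, and therefore the exact score, changes from run to run. The output of one run is shown below:
Accuracy: 0.5901639344262295
Making Predictions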
Finally, we can use the trained Logistic Regression model to make predictions on new data:
# The model was trained on standardized features, so the new sample must be
# transformed with the same fitted scaler before prediction
sample = scaler.transform([[54, 1, 4, 110, 214, 0, 0, 158, 0, 1.6, 2, 2, 0]])
prediction = model.predict(sample)
print("Prediction:", prediction)
This code predicts the target variable for a new patient using the trained Logistic Regression model.
The sample variable contains a new observation with 13 features (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, the number of major vessels colored by fluoroscopy, and thalassemia).
The model.predict() method is used to predict the target variable for the given sample. The predicted value is stored in the variable prediction.
Finally, the predicted value is printed to the console using the print() function. The output of one run is shown below:
Prediction: [1]
The output Prediction: [1] means that the model assigned the input sample to class 1. In this dataset, 0 denotes the absence of heart disease and the values 1 to 4 denote its presence, so the model predicts that this patient has heart disease.
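A predicted label hides how confident the model is. The sketch below, run on synthetic two-feature data rather than the heart disease dataset, shows one way to apply the fitted scaler to a new sample and then read both the label and the class probabilities from predict_proba. All names and values here (X_raw, y_bin, sc, clf2, new_sample) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on different scales, binary label
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=[55.0, 130.0], scale=[9.0, 18.0], size=(200, 2))
y_bin = (X_raw[:, 0] + 0.2 * X_raw[:, 1] > 81.0).astype(int)

# Fit the scaler and the model on the same (scaled) training data
sc = StandardScaler().fit(X_raw)
clf2 = LogisticRegression().fit(sc.transform(X_raw), y_bin)

# A new, unscaled observation must go through the same scaler
new_sample = np.array([[54.0, 110.0]])
new_scaled = sc.transform(new_sample)

pred = clf2.predict(new_scaled)         # predicted class label
proba = clf2.predict_proba(new_scaled)  # one probability per class

print("Prediction:", pred)
print("Probabilities:", proba)
```

The predicted label is always the class with the highest probability, so the probabilities give a sense of how close the decision was.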
Conclusion
In this tutorial, we explored the Logistic Regression algorithm and its implementation using Python’s scikit-learn library. We used the Heart Disease dataset to train a Logistic Regression classifier and made predictions on a new sample of data.
Logistic Regression is a powerful and widely used classification algorithm in machine learning. By following the steps in this tutorial, you should have a good understanding of how Logistic Regression works and how to use it for your own classification problems.





