Analyzing a Covid-19 dataset using Python
In this article I give an overview of my exploratory data analysis (EDA). I’m going to guide you through the first steps of loading and storing the data, understanding it, checking for N/A values and transforming the data to gain more meaningful insights.
Finally, I’m going to show:
• some crucial insights from the analysis
• key trends and patterns within the data
To begin with, my data source is the COVID-19 Dataset from Kaggle.
This dataset was provided by the Mexican government. The raw dataset consists of 21 unique features and 1,048,576 unique patients with COVID-19 symptoms.
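First of all, it is loaded into a dataframe using the pandas library:
# Libraries used throughout the article
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
covid_data_kaggle_df = pd.read_csv('../Data/covid_data_kaggle.csv')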
covid_data_kaggle_df.columns
Index(['USMER', 'MEDICAL_UNIT', 'SEX', 'PATIENT_TYPE', 'DATE_DIED', 'INTUBED',
'PNEUMONIA', 'AGE', 'PREGNANT', 'DIABETES', 'COPD', 'ASTHMA', 'INMSUPR',
'HIPERTENSION', 'OTHER_DISEASE', 'CARDIOVASCULAR', 'OBESITY',
'RENAL_CHRONIC', 'TOBACCO', 'CLASIFFICATION_FINAL', 'ICU'],
dtype='object')
I didn’t like the all-caps naming, so for convenience I renamed some columns. For the columns “MEDICAL_UNIT” and “USMER” we didn’t have sufficient information beyond the following:
- “USMER” indicates whether the patient was treated in medical units of the first, second or third level.
- “MEDICAL_UNIT” indicates the type of institution of the National Health System that provided the care.
So we decided to drop these columns, since they didn’t really help us find any meaningful insight.
covid_data_kaggle_df.rename(columns = {
'PATIENT_TYPE':'Type of Care',
'CLASIFFICATION_FINAL':'Covid Test Findings',
'INMSUPR':'Immuno-supression',
'ICU': 'Intensive Care Unit Transfer'},
inplace = True)
# Remove the two columns 'MEDICAL_UNIT' and 'USMER' since we don't have sufficient information about them
covid_data_kaggle_df.drop(['MEDICAL_UNIT', 'USMER'], axis=1, inplace=True)
This dataset contained only numeric values in all columns apart from the date-related column (“DATE_DIED”). We transformed many of the numeric values into categorical values to make our visualizations more user-friendly. Based on the description of the dataset, we made the following transformations to our data:
# Convert column titles to lowercase with first letter capitalized
covid_data_kaggle_df.columns = [col.lower().capitalize() for col in covid_data_kaggle_df.columns]
covid_data_kaggle_df['Sex'] = covid_data_kaggle_df['Sex'].replace(1,'Female')
covid_data_kaggle_df['Sex'] = covid_data_kaggle_df['Sex'].replace(2,'Male')
covid_data_kaggle_df['Type of care'] = covid_data_kaggle_df['Type of care'].replace(1,'Returned Home')
covid_data_kaggle_df['Type of care'] = covid_data_kaggle_df['Type of care'].replace(2,'Hospitalization')
covid_data_kaggle_df['Date_died'] = covid_data_kaggle_df['Date_died'].replace('9999-99-99','-')
exclude_columns = ['Age', 'Covid test findings']
for col in covid_data_kaggle_df.columns:
    if col not in exclude_columns:
        covid_data_kaggle_df[col] = covid_data_kaggle_df[col].replace(1, 'Yes')
        covid_data_kaggle_df[col] = covid_data_kaggle_df[col].replace(2, 'No')
        covid_data_kaggle_df[col] = covid_data_kaggle_df[col].replace(97, '-')
        covid_data_kaggle_df[col] = covid_data_kaggle_df[col].replace(98, '-')
        covid_data_kaggle_df[col] = covid_data_kaggle_df[col].replace(99, '-')
The code above:
- Replaced values 1 and 2 with “Female” and “Male” respectively in the “Sex” column
- Replaced values 1 and 2 with “Returned Home” and “Hospitalization” respectively in the “Type of care” column
- Replaced the value ‘9999-99-99’ with “-” in the “Date_died” column
- For all the other columns except “Age” and “Covid test findings”, replaced 1 with “Yes”, 2 with “No”, and 97, 98 and 99 with “-”, since we didn’t have information about what those values mean
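The same recoding can also be expressed with dictionaries, which keeps each mapping in one place. This is only a minimal sketch on a made-up toy frame (`toy_df`, `sex_labels` and `yes_no_labels` are illustrative names, not part of the original notebook); it also shows why “Age” has to stay out of the Yes/No mapping, since ages 1 and 2 are real values there.

```python
import pandas as pd

# Toy frame using the same numeric coding as the raw data (values are made up).
toy_df = pd.DataFrame({
    "Sex": [1, 2, 2],
    "Diabetes": [1, 2, 98],
    "Age": [34, 1, 2],
})

# One dictionary per coding scheme (hypothetical helper names).
sex_labels = {1: "Female", 2: "Male"}
yes_no_labels = {1: "Yes", 2: "No", 97: "-", 98: "-", 99: "-"}

toy_df["Sex"] = toy_df["Sex"].replace(sex_labels)
toy_df["Diabetes"] = toy_df["Diabetes"].replace(yes_no_labels)
# "Age" is deliberately left alone: 1 and 2 are genuine ages, not Yes/No codes.

print(toy_df)
```

A dictionary per scheme makes it harder for the code and its description to drift apart, because the mapping is readable at a glance.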
After that we checked for NaN values and found that there are none in this dataset. That’s probably why this dataset gets such good reviews :P
# Checking for NaN in each column
covid_data_kaggle_df.isna().sum()
Sex 0
Type of care 0
Date_died 0
Intubed 0
Pneumonia 0
Age 0
Pregnant 0
Diabetes 0
Copd 0
Asthma 0
Immuno-supression 0
Hipertension 0
Other_disease 0
Cardiovascular 0
Obesity 0
Renal_chronic 0
Tobacco 0
Covid test findings 0
Intensive care unit transfer 0
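One caveat worth keeping in mind: the codes 97, 98 and 99 were turned into the string “-”, and `isna()` does not treat that string as missing. The sketch below, on a made-up toy frame (`toy_df` is an illustrative name), shows how the placeholder could be counted per column to see how much information is actually unknown.

```python
import pandas as pd

# Toy frame with the "-" placeholder used in this article (values are made up).
toy_df = pd.DataFrame({
    "Pregnant": ["Yes", "-", "No", "-"],
    "Icu": ["-", "-", "No", "Yes"],
    "Age": [30, 45, 61, 8],
})

# isna() finds nothing, because "-" is an ordinary string.
nan_counts = toy_df.isna().sum()

# Counting the placeholder directly reveals the hidden gaps.
placeholder_counts = (toy_df == "-").sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on the real dataframe would show, column by column, how many records carry no usable answer.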
Then we decided to perform outlier detection on “Age” in relation to “Pregnant” and “Tobacco” usage. The same could be done with other variables like “Diabetes” and “Hipertension”.
# Detecting outliers first based on age and pregnancy
# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 4))
# Subplot for 'Pregnant' == 'Yes'
sns.boxplot(x='Pregnant', y='Age', data=covid_data_kaggle_df[covid_data_kaggle_df['Pregnant'] == 'Yes'], ax=axes[0])
axes[0].set_title('Box Plot of Age when Pregnant is Yes')
# Subplot for 'Tobacco' == 'Yes'
sns.boxplot(x='Tobacco', y='Age', data=covid_data_kaggle_df[covid_data_kaggle_df['Tobacco'] == 'Yes'], ax=axes[1])
axes[1].set_title('Box Plot of Age when Tobacco is Yes')
# Subplot for Age
sns.boxplot(y='Age', data=covid_data_kaggle_df, ax=axes[2])
axes[2].set_title('Box Plot of Age')
# Adjust layout
plt.tight_layout()
plt.show()
This shows some interesting findings. It’s biologically implausible for a woman to be pregnant under the age of 12 or when she is older than 55. It’s also implausible for someone younger than 11 to smoke. The pregnancy outliers were removed using the interquartile range (IQR), and the tobacco records with an age of 10 or less were removed with a fixed threshold.
# Remove outliers using IQR method
Q1 = covid_data_kaggle_df[covid_data_kaggle_df['Pregnant'] == 'Yes']['Age'].quantile(0.25)
Q3 = covid_data_kaggle_df[covid_data_kaggle_df['Pregnant'] == 'Yes']['Age'].quantile(0.75)
IQR = Q3 - Q1
covid_data_kaggle_df = covid_data_kaggle_df[~((covid_data_kaggle_df['Pregnant'] == 'Yes') & ((covid_data_kaggle_df['Age'] <= (Q1 - 1.5 * IQR)) | (covid_data_kaggle_df['Age'] > (Q3 + 1.5 * IQR))))]
covid_data_kaggle_df[covid_data_kaggle_df['Pregnant'] == 'Yes']['Age'].unique()
# Remove tobacco users aged 10 or younger (fixed threshold rather than IQR)
covid_data_kaggle_df = covid_data_kaggle_df[~((covid_data_kaggle_df['Tobacco'] == 'Yes') & ((covid_data_kaggle_df['Age'] <= 10)))]
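For readers unfamiliar with the IQR rule used above, here is a minimal, standalone sketch of how the fences are computed. The helper `iqr_bounds` and the sample ages are made up for illustration; unlike the filter above, this sketch keeps values that sit exactly on a fence.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages with two obviously implausible entries (3 and 95).
ages = pd.Series([22, 24, 25, 27, 28, 30, 31, 33, 3, 95])

lower, upper = iqr_bounds(ages)

# between() is inclusive on both ends, so values exactly on a fence are kept.
kept = ages[ages.between(lower, upper)]

print(lower, upper)
print(kept.tolist())
```

The multiplier `k=1.5` is the conventional choice; a larger value would make the rule more tolerant of extreme ages.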
After all this preprocessing, we started drawing our charts using matplotlib and seaborn.
# Filter the DataFrame for rows where "Date_died" is not '-'
deaths_df = covid_data_kaggle_df[covid_data_kaggle_df['Date_died'] != '-']
death_counts = deaths_df['Sex'].value_counts()
infection_counts = covid_data_kaggle_df['Sex'].value_counts()
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Define colors for females and males
colors = {'Female': 'green', 'Male': 'blue'}
# Plot for number of infected males and females
sns.barplot(x=infection_counts.index, y=infection_counts.values, hue=infection_counts.index, palette=colors, ax=ax1)
ax1.set_title('Number of Infected by Gender')
ax1.set_xlabel('Gender')
ax1.set_ylabel('Number of Infected')
# Plot for number of deaths by gender
sns.barplot(x=death_counts.index, y=death_counts.values, hue=death_counts.index, palette=colors, ax=ax2)
ax2.set_title('Number of Deaths by Gender')
ax2.set_xlabel('Gender')
ax2.set_ylabel('Number of Deaths')
plt.show()
We can see here that even though the number of infected people is almost equal across genders, males account for considerably more deaths, which suggests higher mortality among men.
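Because the charts compare raw counts, a death rate per gender makes the comparison more direct. The following is a minimal sketch on made-up data (the frame `toy_df` and its dates are invented for illustration), assuming, as in this article, that “-” in “Date_died” means the patient survived.

```python
import pandas as pd

# Made-up records: "-" in Date_died means the patient did not die.
toy_df = pd.DataFrame({
    "Sex": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "Date_died": ["-", "-", "2020-05-03", "-", "2020-06-11", "2020-07-01", "-"],
})

# Boolean flag per record: True if a death date is present.
died = toy_df["Date_died"] != "-"

# The mean of a boolean Series is the share of True values, i.e. the death rate.
death_rate_by_sex = died.groupby(toy_df["Sex"]).mean() * 100

print(death_rate_by_sex.round(1))
```

A rate normalizes for group size, so it stays meaningful even when one gender has noticeably more records than the other.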
Then we decided to plot the number of deaths per age group. To divide our records into age groups, we made the following transformation:
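# Define the age bins and labels
# Note: with right=False (used below) each bin is left-closed: [0, 18), [18, 34), [34, 49), [49, max),
# so the labels are approximate and the maximum age itself falls outside the last bin
age_bins = [0, 18, 34, 49, int(covid_data_kaggle_df['Age'].max())]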
age_labels = ['0-18', '19-34', '35-49', '50+']
# Create the "Age Group" column using pd.cut
covid_data_kaggle_df['Age Group'] = pd.cut(covid_data_kaggle_df['Age'], bins=age_bins, labels=age_labels, right=False)
covid_data_kaggle_df['Age Group'].value_counts()
Age Group
35-49 349535
50+ 342375
19-34 302200
0-18 54246
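If the groups should match their labels exactly, the bin edges can be shifted by one and the last bin left open-ended. This is only a sketch of that alternative on made-up ages; the counts printed above come from the original bins and would change slightly with these edges.

```python
import pandas as pd

# Made-up ages sitting on the group boundaries.
ages = pd.Series([0, 18, 19, 34, 35, 49, 50, 120])

# Left-closed bins: [0, 19), [19, 35), [35, 50), [50, inf)
age_bins = [0, 19, 35, 50, float("inf")]
age_labels = ["0-18", "19-34", "35-49", "50+"]

age_group = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

print(pd.DataFrame({"Age": ages, "Age Group": age_group}))
```

Using an infinite upper edge also means the oldest patient is never dropped from the last group.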
Then to display the infected and deaths per age group we execute this:
infection_counts_by_age_group = covid_data_kaggle_df['Age Group'].value_counts()
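# Count the number of deaths per age group
# (deaths_df was created before the 'Age Group' column existed, so rebuild it first)
deaths_df = covid_data_kaggle_df[covid_data_kaggle_df['Date_died'] != '-']
death_counts_age_group = deaths_df['Age Group'].value_counts()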
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Plot for number of infected per age group
sns.barplot(x=infection_counts_by_age_group.index, y=infection_counts_by_age_group.values, hue=infection_counts_by_age_group.index, palette='viridis', ax=ax1)
ax1.set_title('Number of Infected by Age Group')
ax1.set_xlabel('Age Group')
ax1.set_ylabel('Number of Infected')
# Plot for number of deaths per age group
sns.barplot(x=death_counts_age_group.index, y=death_counts_age_group.values, hue=death_counts_age_group.index, palette='viridis', ax=ax2)
ax2.set_title('Number of Deaths by Age Group')
ax2.set_xlabel('Age Group')
ax2.set_ylabel('Number of Deaths')
plt.show()
Well, another interesting insight. If you don’t count the 0-18 age group, the other three age groups have an almost equal number of infected people. However, the vast majority of deaths come from the 50+ age group. Therefore, we can conclude that individuals aged 50+ face significantly higher risks compared to younger age groups.
Another interesting insight was the percentage of deceased patients who had other conditions along with COVID. For this we created another column, “Passed_away”, to indicate whether a patient has passed away or not. Then we executed the following code snippet to display, for each condition, the percentage of deaths in which that condition was present.
total_deaths = covid_data_kaggle_df[covid_data_kaggle_df['Passed_away']=='Yes'].Passed_away.count()
percentage = []
columns=['Intubed', 'Pneumonia', 'Pregnant', 'Hipertension', 'Obesity', 'Cardiovascular', 'Renal_chronic', 'Tobacco', 'Other_disease', 'Copd', 'Diabetes', 'Asthma', 'Intensive care unit transfer', 'Immuno-supression']
for column in columns:
    percentage_per_column = covid_data_kaggle_df[(covid_data_kaggle_df['Passed_away']=='Yes') & (covid_data_kaggle_df[column]=='Yes')].Passed_away.count()/total_deaths * 100
    percentage.append(percentage_per_column)
# Create a DataFrame for plotting
plot_data = pd.DataFrame({'Column': columns, 'Percentage': percentage})
# Sort the DataFrame by Percentage in descending order
plot_data = plot_data.sort_values(by='Percentage', ascending=False)
# Plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='Percentage', y='Column', data=plot_data, palette='viridis')
plt.title('Percentage of Deaths with Each Condition')
plt.xlabel('Percentage of Deaths')
plt.ylabel('Condition')
plt.show()
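The snippet above relies on a “Passed_away” column whose creation is not shown in this article. One way it could be derived, assuming it simply flags whether “Date_died” holds a real date rather than the “-” placeholder, is sketched here on a made-up frame (`toy_df` and its dates are invented).

```python
import pandas as pd

# Made-up records: "-" in Date_died means no death date was recorded.
toy_df = pd.DataFrame({"Date_died": ["-", "2020-06-21", "-", "2020-07-02"]})

# Flag each record as "Yes" when a death date is present, otherwise "No".
toy_df["Passed_away"] = (toy_df["Date_died"] != "-").map({True: "Yes", False: "No"})

print(toy_df)
```

Keeping the flag as “Yes”/“No” matches the coding used for the other categorical columns.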
We’ve observed that pneumonia was present in more than 70% of the COVID deaths. Note that this is the share of deceased patients who had pneumonia, not the probability of dying for a patient with pneumonia.
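The two quantities are easy to mix up, so here is a small sketch on made-up data (the frame `toy_df` and its values are invented) that computes both: the share of deaths in which pneumonia was present, which is what the chart shows, and the death rate among patients with pneumonia.

```python
import pandas as pd

# Made-up records: 5 patients with pneumonia, 5 without.
toy_df = pd.DataFrame({
    "Pneumonia":   ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Passed_away": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No"],
})

died = toy_df["Passed_away"] == "Yes"
has_pneumonia = toy_df["Pneumonia"] == "Yes"
both = (died & has_pneumonia).sum()

# Share of all deaths in which pneumonia was present (the quantity plotted above).
share_of_deaths_with_pneumonia = both / died.sum() * 100

# Death rate among patients who had pneumonia (a different question).
death_rate_given_pneumonia = both / has_pneumonia.sum() * 100

print(share_of_deaths_with_pneumonia, death_rate_given_pneumonia)
```

The numerator is the same in both cases; only the denominator changes, and with it the interpretation.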
This was my EDA on the aforementioned dataset using Python. Feel free to leave a comment.
Feel free to contact me for further discussion or any inquiries. I am genuinely excited and eager to engage in meaningful conversations regarding the analysis, findings, or related topics.