Data Science with Python — Breast Cancer Detection using Ensemble Methods
This article is part of the “Data Science with Python” series; you can find the other stories of the series linked at the end.
The Breast Cancer Wisconsin (Diagnostic) Dataset is a well-known benchmark for breast cancer classification tasks. It contains 569 instances described by 30 features, such as the mean radius, mean texture, and mean smoothness of the cell nuclei, extracted from digitized images of fine needle aspirates of breast masses. The dataset also includes the corresponding diagnosis (malignant or benign) for each instance.
The objective today is to apply what we saw in the previous article and predict whether a diagnosis is malignant or benign from these features. So we’ll use ensemble methods.
Loading the Dataset
This dataset is included in scikit-learn. So we can load it easily this way:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
However, load_breast_cancer returns a Bunch, a dictionary-like object; you can print it to see what it looks like. It’s more convenient to work with a pd.DataFrame, so let’s convert it (and add the target as a column, which we’ll need later):
import pandas as pd

df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target  # add the labels as a column
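As a quick sanity check, we can confirm the shape and the label encoding; in scikit-learn’s version of this dataset, 0 means malignant and 1 means benign:
print(dataset.target_names)  # ['malignant' 'benign'] -> 0 = malignant, 1 = benign
print(df.shape)              # (569, 31): 30 features plus the target column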
Exploratory Data Analysis
The first step in any data science task is to understand our dataset. Let’s start by looking at it with df.head():
print(df.head())
mean radius mean texture ... worst fractal dimension target
0 17.99 10.38 ... 0.11890 0
1 20.57 17.77 ... 0.08902 0
2 19.69 21.25 ... 0.08758 0
3 11.42 20.38 ... 0.17300 0
4 20.29 14.34 ... 0.07678 0
We can then have a look at some statistics about our dataset:
print(df.describe())
mean radius mean texture ... worst fractal dimension target
count 569.000000 569.000000 ... 569.000000 569.000000
mean 14.127292 19.289649 ... 0.083946 0.627417
std 3.524049 4.301036 ... 0.018061 0.483918
min 6.981000 9.710000 ... 0.055040 0.000000
25% 11.700000 16.170000 ... 0.071460 0.000000
50% 13.370000 18.840000 ... 0.080040 1.000000
75% 15.780000 21.800000 ... 0.092080 1.000000
max 28.110000 39.280000 ... 0.207500 1.000000
We can see that 62.7% of the diagnoses are benign (in this dataset the target variable is 1 when benign, 0 when malignant). We can also see this another way:
print(df["target"].value_counts())
target
1 357
0 212
Name: count, dtype: int64
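Sanity check: 357 benign samples out of 569 is indeed about 62.7%, which is exactly the mean of the target column we saw in df.describe():
print(df["target"].mean())  # ~0.6274, i.e. 62.7% of the samples are benign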
Now, we can create a heatmap to visualize the relationships between the different features. I’ve already shown how to create a basic heatmap in the previous articles, so today I will make it a bit different:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .7})
f.tight_layout()
f.subplots_adjust(top=0.9)
plt.show()

We now have a beautiful plot, and we can see that some features are highly correlated.
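To put numbers on what the heatmap shows, we can extract the most correlated feature pairs directly from the matrix. Here is a small sketch:
# Keep only the strictly lower triangle (k=-1 drops the diagonal), then rank the pairs
corr_pairs = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1)).stack()
print(corr_pairs.abs().sort_values(ascending=False).head(5))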
We can also visualize the feature distributions. Here is the code:
fig, axs = plt.subplots(5, 6, figsize=(20, 20))
for feature, ax in zip(dataset.feature_names, axs.flatten()):
    sns.histplot(df[feature], kde=True, ax=ax)  # distplot is deprecated in recent seaborn
plt.show()
Below is the figure. At this size the individual plots are hard to read, but if you reproduce this on your own machine, you can zoom in and inspect each distribution.

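A related view that is often more telling is to overlay the two classes for a single feature. Here is a quick sketch using mean radius (any other feature works the same way):
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x="mean radius", hue="target", kde=True)
plt.show()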
Data Preprocessing
Let’s get to the second step: data preprocessing. First, we can split the data into features and target variable:
X = df.drop('target', axis=1)
y = df['target']
Then, we can apply feature scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
The data looks pretty clean, so no further preprocessing is needed. We can now split the data into training and testing sets, and check the shapes:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(455, 30)
(114, 30)
(455,)
(114,)
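One caveat worth knowing: since the scaler above was fit on the full dataset, the test rows leak a little information into the scaling statistics. Tree-based ensembles are largely insensitive to feature scaling anyway, so it doesn’t matter here, but the leak-free pattern looks like this (shown with separate variable names; the rest of the article keeps the split above):
# Split first, then fit the scaler on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(df.drop('target', axis=1), df['target'],
                                          test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)  # reuse the training means and standard deviations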
Training the Models
I’ll use several models to find the one that performs best. So let’s start by importing the models:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
We can now initialize our models:
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    AdaBoostClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(n_estimators=100, random_state=42),
    ExtraTreesClassifier(n_estimators=100, max_depth=5, random_state=42)
]
Let’s now train our models and see the results:
scores = []
for model in models:
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

plt.figure(figsize=(10, 5))
sns.barplot(x=[type(model).__name__ for model in models], y=scores)
plt.ylim(0.9, 1)
plt.show()

print(f"Best model: {type(models[np.argmax(scores)]).__name__} with score {np.max(scores)}")

Best model: AdaBoostClassifier with score 0.9736842105263158
Our models seem to perform well. The best is the AdaBoostClassifier.
Now, we can try to combine all our models using a voting classifier. Let’s do this and see what we get:
voting_clf = VotingClassifier(
    estimators=[(type(model).__name__, model) for model in models],
    voting='hard'
)
voting_clf.fit(X_train, y_train)
print(f"Voting classifier score: {voting_clf.score(X_test, y_test)}")
Voting classifier score: 0.9649122807017544
Our AdaBoostClassifier is still better, so let’s stick with it.
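Before moving on, one variant worth knowing: voting='hard' takes a majority vote over the predicted classes, while voting='soft' averages the predicted class probabilities, which can help when the base models are well calibrated. A quick sketch (scores will vary):
voting_clf_soft = VotingClassifier(
    estimators=[(type(model).__name__, model) for model in models],
    voting='soft'  # average predict_proba outputs instead of majority voting
)
voting_clf_soft.fit(X_train, y_train)
print(f"Soft voting classifier score: {voting_clf_soft.score(X_test, y_test)}")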
Evaluating our Model
To evaluate our model, we can use a confusion matrix: the closer it looks to a diagonal matrix, meaning all counts on the diagonal and no off-diagonal misclassifications, the better our model is.
from sklearn.metrics import confusion_matrix, classification_report

model = models[np.argmax(scores)]
y_pred = model.predict(X_test)

plt.figure(figsize=(10, 10))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

print(classification_report(y_test, y_pred))

It looks nice! There are still a few errors though, and in the medical field we want as few errors as possible, so let’s see whether we can improve our model.
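If you prefer exact counts to colors, you can also unpack the matrix directly. A minimal sketch (remember that 0 is malignant here, so the recall on class 0 is the fraction of malignant cases we catch):
# For binary labels [0, 1], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"Accuracy: {(tn + tp) / (tn + fp + fn + tp):.3f}")
print(f"Recall on class 0 (malignant): {tn / (tn + fp):.3f}")
print(f"Recall on class 1 (benign): {tp / (tp + fn):.3f}")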
Improving the Model
To improve the model, we can try hyperparameter tuning. Hyperparameters are settings chosen before training rather than learned by the model, so we can simply try various combinations and keep the one that performs best:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'algorithm': ['SAMME', 'SAMME.R'],
    'learning_rate': [0.1, 0.5, 1, 1.5]
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)

y_pred = grid_search.predict(X_test)
plt.figure(figsize=(10, 10))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

print(classification_report(y_test, y_pred))
{'algorithm': 'SAMME.R', 'learning_rate': 1.5, 'n_estimators': 500}
0.9846153846153847
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
We get essentially the same confusion matrix, so the tuned model isn’t really performing better on the test set. Let’s store it anyway:
improved_model = grid_search.best_estimator_
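A note on cost: grid search tries every combination (5 × 2 × 4 = 40 candidates here, each cross-validated 5 times). If the grid grows, RandomizedSearchCV samples a fixed number of combinations instead. A sketch over the same grid:
from sklearn.model_selection import RandomizedSearchCV

# Try 10 random combinations instead of all 40
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                                   n_iter=10, cv=5, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)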
Another way to improve the model could be feature selection. Indeed, maybe some features are just noise and aren’t really informative. Let’s start by plotting the importance of each feature:
from sklearn.feature_selection import SelectFromModel

feature_importance = pd.DataFrame(model.feature_importances_,
                                  index=dataset.feature_names,
                                  columns=['importance']).sort_values('importance', ascending=False)
print(feature_importance)

plt.figure(figsize=(10, 15))
sns.barplot(x=feature_importance.index, y=feature_importance['importance'])
plt.xticks(rotation=90)
plt.show()

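A side note on choosing a cutoff: rather than picking a fixed value, you can check how many features are needed to cover most of the total importance. A small sketch:
# Feature importances sum to 1, so the cumulative sum tells us the coverage
importances = np.sort(model.feature_importances_)[::-1]
cumulative = np.cumsum(importances)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{n_keep} features account for 95% of the total importance")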
Now let’s perform feature selection:
# Keep only the features whose importance is at least 0.05
sfm = SelectFromModel(improved_model, threshold=0.05)
sfm.fit(X_train, y_train)
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

# Retrain on the reduced feature set and evaluate again
improved_model.fit(X_important_train, y_train)
y_pred = improved_model.predict(X_important_test)
plt.figure(figsize=(10, 10))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
print(classification_report(y_test, y_pred))

Our confusion matrix is now worse! So the model really does need all the features. We can’t do much better for now: we would need either other kinds of models, such as deep learning models, or more training data.
Final Note
As you can see, ensemble methods provide a way to build robust models with good scores. However, they may require more resources to train than a simple logistic regression.
I hope you found this article useful. I wanted to show you things I hadn’t shown before, such as the triangle heatmap visualization or the confusion matrix, so that you’ve still learned something new!
To explore the other stories of this series, click below!
To explore more of my Python stories, click here! You can also access all my content by checking this page.
If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!
If you’re not subscribed to Medium yet and wish to support me or get access to all my stories, you can use my link: