Data Science with Python — Breast Cancer Detection using Ensemble Methods

Photo by National Cancer Institute on Unsplash

This article is part of the “Data Science with Python” series. You can find the other stories of this series here: https://readmedium.com/data-science-with-python-32da1e5c3d2f

The Breast Cancer Wisconsin (Diagnostic) Dataset is a well-known benchmark dataset for breast cancer classification tasks. It contains 569 instances with various features, such as the mean radius, mean texture, and mean smoothness of cell nuclei extracted from digitized images of fine needle aspirates of breast mass. The dataset also includes the corresponding diagnosis (malignant or benign) for each instance.

The objective today is to apply what we saw in the previous article and predict whether a diagnosis is malignant or benign based on various measured features. So we’ll use ensemble methods.

Loading the Dataset

This dataset is included in scikit-learn. So we can load it easily this way:

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

However, load_breast_cancer returns a Bunch object, which behaves like a dict. You can print it to see how it looks. It’s better to convert it to a pd.DataFrame, so let’s do this:

import pandas as pd

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target  # add the diagnosis column, which we'll use throughout the article
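If you’re curious about what the raw object returned by load_breast_cancer contains, you can inspect it before (or after) the conversion. A quick sketch:

# A quick look at the Bunch returned by load_breast_cancer
print(dataset.keys())         # e.g. data, target, target_names, feature_names, DESCR, ...
print(dataset.target_names)   # ['malignant' 'benign']
print(dataset.data.shape)     # (569, 30)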

Exploratory Data Analysis

The first step in any data science task is to understand our dataset. Let’s start by looking at it with df.head():

print(df.head())
   mean radius  mean texture  ...  worst fractal dimension  target
0        17.99         10.38  ...                  0.11890       0
1        20.57         17.77  ...                  0.08902       0
2        19.69         21.25  ...                  0.08758       0
3        11.42         20.38  ...                  0.17300       0
4        20.29         14.34  ...                  0.07678       0

We can then have a look at some statistics about our dataset:

print(df.describe())
       mean radius  mean texture  ...  worst fractal dimension      target
count   569.000000    569.000000  ...               569.000000  569.000000
mean     14.127292     19.289649  ...                 0.083946    0.627417
std       3.524049      4.301036  ...                 0.018061    0.483918
min       6.981000      9.710000  ...                 0.055040    0.000000
25%      11.700000     16.170000  ...                 0.071460    0.000000
50%      13.370000     18.840000  ...                 0.080040    1.000000
75%      15.780000     21.800000  ...                 0.092080    1.000000
max      28.110000     39.280000  ...                 0.207500    1.000000

We can see that 62.7% of the diagnoses are benign (in this dataset, the target variable is 1 when benign and 0 when malignant). We can also see this another way:

print(df["target"].value_counts())
target
1    357
0    212
Name: count, dtype: int64
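If you want to double-check which label corresponds to which diagnosis, you can map the target values to their names. A one-line sketch:

print(dict(enumerate(dataset.target_names)))  # {0: 'malignant', 1: 'benign'}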

Now, we can create a heatmap to visualize the relationships between the different features. I’ve already shown how to create a basic heatmap in the previous articles, so today I will make it a bit different:

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Compute correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    f, ax = plt.subplots(figsize=(11, 9))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .7})

    f.tight_layout()
    f.subplots_adjust(top=0.9)

    plt.show()

We now have a beautiful plot, and we can see that some features are highly correlated.
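If you prefer numbers to colors, you can also rank the feature pairs by absolute correlation. A small sketch reusing the corr and mask computed above:

    # Rank feature pairs by absolute correlation (reusing corr and mask from above)
    lower = corr.mask(mask)  # keep the lower triangle only, diagonal excluded
    top_pairs = lower.abs().stack().dropna().sort_values(ascending=False).head(10)
    print(top_pairs)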

We can also visualize the features distribution. Here is the code:

    fig, axs = plt.subplots(5, 6, figsize=(20, 20))
    for feature, ax in zip(dataset.feature_names, axs.flatten()):
        # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE is the replacement
        sns.histplot(df[feature], kde=True, ax=ax)
    plt.show()

Below is the figure. It’s hard to read at this size because there are so many features, but if you reproduce it on your own machine you can zoom in and inspect each distribution.

Data Preprocessing

Let’s get to the second step: data preprocessing. First, we can split the data into features and target variable:

    X = df.drop('target', axis=1)
    y = df['target']

Then, we can apply feature scaling:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X = scaler.fit_transform(X)

The data looks pretty clean, so we don’t need to perform other data preprocessing techniques. We can now split the data into training and testing sets, and check the shapes:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
(455, 30)
(114, 30)
(455,)
(114,)
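A side note on the scaling step: since the scaler above was fit on the whole dataset before splitting, a little information from the test set leaks into it. It doesn’t change much here (tree-based models are largely insensitive to scaling), but the stricter pattern is to wrap the scaler and the model in a pipeline fitted on the training data only. A minimal sketch of that pattern, applied to the raw features:

    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import RandomForestClassifier

    # Split the raw (unscaled) features; the scaler is then fitted on the training part only
    X_raw_train, X_raw_test, y_train_, y_test_ = train_test_split(
        df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
    )
    pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42))
    pipe.fit(X_raw_train, y_train_)
    print(pipe.score(X_raw_test, y_test_))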

Training the Models

I’ll use several models and find the one that performs best. Let’s start by importing them:

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier

We can now initialize our models:

    models = [
        RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
        AdaBoostClassifier(n_estimators=100, random_state=42),
        GradientBoostingClassifier(n_estimators=100, random_state=42),
        ExtraTreesClassifier(n_estimators=100, max_depth=5, random_state=42)
    ]

Let’s now train our models and see the results:

    scores = []

    for model in models:
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))

    plt.figure(figsize=(10, 5))
    sns.barplot(x=[type(model).__name__ for model in models], y=scores)
    plt.ylim(0.9, 1)
    plt.show()

    print(f"Best model: {type(models[np.argmax(scores)]).__name__} with score {np.max(scores)}")
Best model: AdaBoostClassifier with score 0.9736842105263158

Our models seem to perform well. The best is the AdaBoostClassifier.
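Keep in mind that these scores come from a single train/test split, which can be a bit noisy. If you want a more robust comparison, a quick cross-validation check is cheap (a sketch, not used for the rest of the article):

    from sklearn.model_selection import cross_val_score

    for model in models:
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)
        print(f"{type(model).__name__}: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")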

Now, we can try to combine all our models using a voting classifier. Let’s do this and see what we get:

    voting_clf = VotingClassifier(
        estimators=[(type(model).__name__, model) for model in models],
        voting='hard'
    )

    voting_clf.fit(X_train, y_train)
    print(f"Voting classifier score: {voting_clf.score(X_test, y_test)}")
Voting classifier score: 0.9649122807017544

Our AdaBoostClassifier is still better, so let’s stick with it.
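If you want to experiment a bit more, soft voting averages the predicted class probabilities instead of counting hard votes; since all four models expose predict_proba, it’s only a parameter change. A sketch:

    soft_voting_clf = VotingClassifier(
        estimators=[(type(model).__name__, model) for model in models],
        voting='soft'
    )
    soft_voting_clf.fit(X_train, y_train)
    print(f"Soft voting classifier score: {soft_voting_clf.score(X_test, y_test)}")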

Evaluating our Model

To evaluate our model, we can use a confusion matrix. The more the matrix looks like a diagonal matrix, the better our model is.

    from sklearn.metrics import confusion_matrix, classification_report

    model = models[np.argmax(scores)]
    y_pred = model.predict(X_test)

    plt.figure(figsize=(10, 10))
    sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

    print(classification_report(y_test, y_pred))

It looks nice! There are still a few errors, though, and in the medical field we want to tolerate as few errors as possible, so let’s see whether we can improve our model.
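Since a missed malignant case (a false negative on class 0) is the costliest kind of error here, it’s also worth looking at the recall on the malignant class and the ROC AUC rather than accuracy alone. A quick sketch:

    from sklearn.metrics import recall_score, roc_auc_score

    # In this dataset, 0 = malignant, so report recall with pos_label=0
    print("Recall (malignant):", recall_score(y_test, y_pred, pos_label=0))
    # ROC AUC uses the predicted probability of the positive class (1 = benign)
    print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))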

Improving the Model

To improve the model, we can try hyperparameter tuning. Hyperparameters are parameters that are not learned by the model but set before training, so we can try various combinations of them and keep the one that works best:

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [100, 200, 300, 400, 500],
        "algorithm": ["SAMME", "SAMME.R"],
        "learning_rate": [0.1, 0.5, 1, 1.5]
    }

    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)

    print(grid_search.best_params_)
    print(grid_search.best_score_)

    y_pred = grid_search.predict(X_test)

    plt.figure(figsize=(10, 10))
    sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

    print(classification_report(y_test, y_pred))
{'algorithm': 'SAMME.R', 'learning_rate': 1.5, 'n_estimators': 500}
0.9846153846153847
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

We still have the same confusion matrix, so our model isn’t really performing better.

But let’s store it anyway:

    improved_model = grid_search.best_estimator_
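As an aside, if the grid grew much larger, an exhaustive search would quickly get expensive; RandomizedSearchCV samples a fixed number of combinations instead and is often a good trade-off. A sketch with an arbitrary budget of 20 candidates:

    from sklearn.model_selection import RandomizedSearchCV

    random_search = RandomizedSearchCV(
        estimator=model,
        param_distributions=param_grid,  # the same grid, sampled instead of enumerated
        n_iter=20,                       # arbitrary budget for this sketch
        cv=5,
        n_jobs=-1,
        random_state=42,
    )
    random_search.fit(X_train, y_train)
    print(random_search.best_params_, random_search.best_score_)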

Another way to improve the model could be to try feature selection. Indeed, maybe some features are just noise and aren’t really important.

Let’s start by plotting the importance of each feature.

    from sklearn.feature_selection import SelectFromModel

    feature_importance = pd.DataFrame(model.feature_importances_, index=dataset.feature_names, columns=['importance']).sort_values('importance', ascending=False)
    print(feature_importance)

    plt.figure(figsize=(10, 15))
    sns.barplot(x=feature_importance.index, y=feature_importance['importance'])
    plt.xticks(rotation=90)
    plt.show()

Now let’s perform feature selection:

    sfm = SelectFromModel(improved_model, threshold=0.05)
    sfm.fit(X_train, y_train)

    X_important_train = sfm.transform(X_train)
    X_important_test = sfm.transform(X_test)

    improved_model.fit(X_important_train, y_train)
    y_pred = improved_model.predict(X_important_test)

    plt.figure(figsize=(10, 10))
    sns.heatmap(pd.DataFrame(confusion_matrix(y_test, y_pred)), annot=True, cmap="YlGnBu", fmt='g')
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    plt.show()

    print(classification_report(y_test, y_pred))

Our confusion matrix is now worse! So it seems we need all the features… We can’t do much better for now; improving further would require either other kinds of models, such as deep learning models, or more data to use as training data.
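One last practical step: once you’ve settled on a model, it’s convenient to persist it so you can reuse it without retraining. A minimal sketch with joblib (the file name is arbitrary), refitting the tuned model on the full feature set first since feature selection didn’t help:

    import joblib

    # Refit the tuned model on the full (scaled) training features, then save it
    improved_model.fit(X_train, y_train)
    joblib.dump(improved_model, "breast_cancer_adaboost.joblib")

    # Later, load it back without retraining
    loaded_model = joblib.load("breast_cancer_adaboost.joblib")
    print(loaded_model.score(X_test, y_test))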

Final Note

As you can see, ensemble methods provide a way to build robust models, with good scores. However, they may require more resources to be trained than a simple logistic regression.
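For reference, here is what that simpler baseline would look like on the same split; it’s a useful sanity check of how much the ensembles actually buy you (a sketch):

    from sklearn.linear_model import LogisticRegression

    # A plain logistic regression baseline on the same scaled split
    baseline = LogisticRegression(max_iter=1000, random_state=42)
    baseline.fit(X_train, y_train)
    print(f"Logistic regression score: {baseline.score(X_test, y_test)}")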

I hope you found this article useful. I wanted to show you things I haven’t shown you before, such as the triangle heatmap visualization or the confusion matrix, so that you’ve still learned something new!

To explore the other stories of this series, click here: https://readmedium.com/data-science-with-python-32da1e5c3d2f

To explore more of my Python stories, click here! You can also access all my content by checking this page.

If you want to be notified every time I publish a new story, subscribe to me via email by clicking here!

If you’re not subscribed to medium yet and wish to support me or get access to all my stories, you can use my link:
