1.readmedium.com/v2/resize:fit:800/1*iHnyaj21QzV2wD78vDNKtQ.png"><figcaption></figcaption></figure><div id="4f0f"><pre><span class="hljs-attribute">0</span> <span class="hljs-number">319553</span>
<span class="hljs-attribute">1</span> <span class="hljs-number">62601</span>
<span class="hljs-attribute">Name</span>: Response, dtype: int64</pre></div><p id="856d">Examples: <b><i>Total: 382154 Positive: 62601 (16.38% of total)</i></b></p><p id="8292">Check the seaborn pairplot<b><i>.</i></b></p><figure id="944a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ma4vHnEof-raXe0NwjVJyw.png"><figcaption></figcaption></figure><p id="5c17">A heatmap:</p><figure id="80bd"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*TqcKMxhkUa4S1-1vDzaMPA.png"><figcaption></figcaption></figure><p id="3730"><b>Data Preprocessing and Feature Engineering:</b></p><figure id="a7f1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5wU9JidtVLQC70g-GABaAw.png"><figcaption></figcaption></figure><ul><li>The ID column is not needed and can be dropped.</li><li>The driving_license column can also be dropped. Very few people lack a license, and very few of those are interested in buying vehicle insurance, so this feature adds little.</li><li>Convert the categorical columns that are stored as numbers to strings, then one-hot encode them with pandas <code>pd.get_dummies</code>.</li></ul>
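<p>The three steps above can be sketched on a toy frame. This is a minimal illustration, not the author's gist: the column names <code>id</code>, <code>Driving_License</code>, and <code>Region_Code</code> and the four rows are assumptions made up for the example.</p>

```python
import pandas as pd

# Toy data; column names and values are hypothetical, for illustration only.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Driving_License": [1, 1, 1, 0],
    "Region_Code": [28, 3, 28, 11],   # categorical, but stored as a number
    "Response": [0, 1, 0, 0],
})

# Steps 1 and 2: drop the identifier and the near-constant license flag.
df = df.drop(columns=["id", "Driving_License"])

# Step 3: cast the numeric category to string, then one-hot encode it.
df["Region_Code"] = df["Region_Code"].astype(str)
encoded = pd.get_dummies(df, columns=["Region_Code"])

print(encoded.columns.tolist())
```

<p>Casting to string first makes it explicit that the codes are labels rather than quantities, so they are not treated as an ordered numeric feature.</p>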
<figure id="0c8f">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/f2e273c87c68b928f055c49ae79b5ebd.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure>
<figure id="f35d">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/361d992b421be6711717d33fdae67207.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="34c9"><b>Split the dataset into training, validation, and test sets:</b></p><ul><li>Use scikit-learn to split the dataset into training, validation, and test sets.</li></ul>
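<p>One common way to get three sets is to call scikit-learn's <code>train_test_split</code> twice. The sketch below uses synthetic data, and the 80/20 ratios, the <code>stratify</code> option, and the <code>random_state</code> are assumptions for illustration; the article does not state which values the author used.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (about 16% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.16).astype(int)

# First split off the test set, then carve a validation set out of the rest.
# stratify keeps the class ratio similar in every split (an assumed choice).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```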
<figure id="95be">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/b03ae048e1de6b3661798af849a156da.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="b6ae">Normalize the input features using the sklearn <code>StandardScaler</code>. This sets each feature's mean to 0 and standard deviation to 1. The <code>StandardScaler</code> is fit only on the <code>train_features</code> so that the model is not peeking at the validation or test sets.</p>
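<p>A minimal sketch of that fit-on-train-only pattern, using synthetic arrays in place of the real features. The names <code>val_features</code> and <code>test_features</code> are assumed here by analogy with <code>train_features</code>.</p>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the three feature matrices.
rng = np.random.default_rng(0)
train_features = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
val_features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test_features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

scaler = StandardScaler()
# fit_transform learns the mean and std from the training data only...
train_features = scaler.fit_transform(train_features)
# ...and transform reuses those training statistics on the other sets.
val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)

print(train_features.mean(axis=0).round(3), train_features.std(axis=0).round(3))
```

<p>The validation and test sets will not have a mean of exactly 0 after scaling, which is expected: they are scaled with the training statistics.</p>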
<figure id="9380">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/7c73ec80de2f97e00c5e947e18a3b516.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="e739">Now the dataset is ready for training. We know the dataset is imbalanced, so let’s build three models:</p><ul><li><b>Baseline model:</b> Train a model that does nothing to handle the class imbalance.</li><li><b>Class weights:</b> Train a model with class weights.</li><li><b>Oversampling:</b> Oversample the minority class.</li></ul><p id="74bb">Because the dataset is imbalanced, accuracy is not a helpful metric for this task. You can reach about 84% accuracy by predicting False all the time, simply because about 84% of the examples are negative. In this case, Recall and F1 are better metrics to evaluate.</p><p id="769b"><b>Baseline Model:</b></p>
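<p>As a quick check on why accuracy misleads here, the sketch below scores a hypothetical classifier that always predicts the negative class, using the class counts reported earlier (319553 negative, 62601 positive). The helper name <code>classification_metrics</code> is illustrative and not part of the author's code.</p>

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Always predicting "not interested": every negative is right, every positive is missed.
always_negative = classification_metrics(tp=0, fp=0, tn=319553, fn=62601)
print(always_negative)
```

<p>The accuracy looks respectable while recall and F1 are zero, which is why the models below are compared mainly on recall.</p>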
<figure id="9ebb">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/d221a7a759d40f222d011233b23c4911.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure>
<figure id="1743">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/3fe62aa57dfddc6ae81cce2d9d808a9b.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><ul><li>A simple neural network with a densely connected hidden layer</li><li>A <b><i>dropout layer</i></b> to reduce overfitting</li><li>An output layer with a<b><i> sigmoid activation </i></b>that returns the probability that the customer is interested in vehicle insurance.</li><li><b><i>Accuracy = Correctly Classified Samples / Total Samples</i></b></li><li><b><i>Precision = True Positives / (True Positives + False Positives)</i></b></li><li><b><i>Recall = True Positives / (True Positives + False Negatives)</i></b></li><li><b>AUC</b> refers to the Area Under the Curve of a Receiver Operating Characteristic curve (ROC-AUC). This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.</li></ul><p id="97ed"><b>Baseline Model Metrics:</b></p><div id="6c67"><pre>loss : 0.2792171537876129
tp : 3849.0
fp : 3122.0
tn : 73830.0
fn : 11285.0
accuracy : 0.8435484170913696
precision : 0.5521445870399475
recall : 0.25432801246643066
auc : 0.8915838599205017
(<span class="hljs-literal">True</span> Negatives): <span class="hljs-number">73830</span>
(<span class="hljs-literal">False</span> Positives): <span class="hljs-number">3122</span>
(<span class="hljs-literal">False</span> Negatives): <span class="hljs-number">11285</span>
(<span class="hljs-literal">True</span> Positives): <span class="hljs-number">3849</span>
Total : 15134</pre></div><figure id="1cb7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*VHnOgGYCX9BM7w8KNxbcIw.png"><figcaption></figcaption></figure><p id="2ed1"><b>Model with Class Weights:</b></p>
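<p>One way to derive such weights is to scale each class inversely to its frequency, as in the TensorFlow imbalanced-data tutorial listed in the references. It is an assumption that the author used exactly this formula, but applied to the class counts reported earlier it gives the weights printed in this section.</p>

```python
# Class counts from the value_counts output earlier in the article.
neg, pos = 319553, 62601
total = neg + pos

# Inverse-frequency weights; dividing by 2 keeps the overall loss magnitude
# roughly comparable to the unweighted case.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}

print(f"Weight for class 0: {weight_for_0:.2f}")
print(f"Weight for class 1: {weight_for_1:.2f}")
```

<p>A dictionary of this shape can be passed to Keras through the <code>class_weight</code> argument of <code>model.fit</code>, so that errors on the minority class contribute more to the loss.</p>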
<figure id="d485">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/c078bdecbd5e12abf8d4084a5d59bcbf.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><div id="5fc8"><pre><span class="hljs-attribute">Weight</span> for class <span class="hljs-number">0</span>: <span class="hljs-number">0</span>.<span class="hljs-number">60</span>
<span class="hljs-attribute">Weight</span> for class <span class="hljs-number">1</span>: <span class="hljs-number">3</span>.<span class="hljs-number">05</span></pre></div><p id="2eba">Apply the weights:</p>
<figure id="eba6">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/e5aa113dd92bcfe9b45da8e21b18b05b.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="0ea5"><b>Model With Class Weights Metrics:</b></p><div id="94ca"><pre>loss : 0.3881891071796417
tp : 13815.0
fp : 18721.0
tn : 58231.0
fn : 1319.0
accuracy : 0.7823773622512817
precision : 0.424606591463089
recall : 0.912845253944397
auc : 0.8915590047836304
(<span class="hljs-literal">True</span> Negatives): <span class="hljs-number">58231</span>
(<span class="hljs-literal">False</span> Positives): <span class="hljs-number">18721</span>
(<span class="hljs-literal">False</span> Negatives): <span class="hljs-number">1319</span>
(<span class="hljs-literal">True</span> Positives): <span class="hljs-number">13815</span>
Total : 15134</pre></div><figure id="1f39"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*G1OihX2cCJBfzXBH15Xk5A.png"><figcaption></figcaption></figure><p id="af9d"><b>Model with Oversampling of the Minority Class:</b></p><p id="5f81">This approach resamples the dataset by oversampling the minority class.</p><p id="c8d5"><b>tf.data</b> provides two methods for this:</p><ul><li><code><b>experimental.sample_from_datasets</b></code></li><li><code><b>experimental.rejection_resample</b></code></li></ul><p id="c6d2">We will be using <code>tf.data.experimental.sample_from_datasets</code>.</p><p id="9704">With <code>tf.data</code>, the easiest way to produce balanced examples is to start with a <code>positive</code> and a <code>negative</code> dataset and merge them. One approach to resampling a dataset is to use <code>sample_from_datasets</code>, which is most applicable when you have a separate <code>data.Dataset</code> for each class.</p><p id="6981">First, create the datasets with <code>tf.data</code>.</p>
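<p>The effect of mixing a positive and a negative source with equal probability can be illustrated without TensorFlow. The following is a plain-Python analogue of the idea, not the <code>tf.data</code> API itself; the function name <code>balanced_stream</code> and the toy pools are made up for this sketch.</p>

```python
import random

def balanced_stream(pos_examples, neg_examples, n, seed=0):
    """Draw n examples, choosing the positive or negative pool with equal probability."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        pool = pos_examples if rng.random() < 0.5 else neg_examples
        # Sampling with replacement, so minority examples are repeated (oversampled).
        out.append(rng.choice(pool))
    return out

# Toy pools with roughly the article's imbalance: 16 positives, 84 negatives.
pos = [("pos", i) for i in range(16)]
neg = [("neg", i) for i in range(84)]

sample = balanced_stream(pos, neg, n=10_000)
pos_share = sum(1 for label, _ in sample if label == "pos") / len(sample)
print(f"positive share after resampling: {pos_share:.2f}")
```

<p>Because the positive pool is much smaller, each positive example is drawn far more often than each negative one, which is what oversampling means in practice.</p>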
<figure id="92ee">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/579e7452203105d08aa934d7a79bba5c.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="f475">To use <code>tf.data.experimental.sample_from_datasets</code>, pass the datasets and the weight for each. In our case the code is:</p>
<figure id="fe04">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/4efb3627c4c7c35a1d008356a5432a9c.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="10af">Now the dataset produces examples of each class with 50/50 probability.</p><p id="3fdb">One problem with the above <code>experimental.sample_from_datasets</code> approach is that it needs a separate <code>tf.data.Dataset</code> per class. Using <code>Dataset.filter</code> works, but results in all the data being loaded twice. The <code>data.experimental.rejection_resample</code> function can be applied to a dataset to rebalance it while loading it only once. Elements are dropped from the dataset to achieve balance. <code>data.experimental.rejection_resample</code> takes a <code>class_func</code> argument, which is applied to each dataset element to determine which class the example belongs to for the purposes of balancing.</p><p id="383d"><b>Model With Resampled Dataset Metrics:</b></p><div id="ad86"><pre>loss : 0.39975816011428833
tp : 14057.0
fp : 20528.0
tn : 56573.0
fn : 928.0
accuracy : 0.7670004367828369
precision : 0.4064478874206543
recall : 0.9380714297294617
auc : 0.8911738395690918
(<span class="hljs-literal">True</span> Negatives): <span class="hljs-number">56573</span>
(<span class="hljs-literal">False</span> Positives): <span class="hljs-number">20528</span>
(<span class="hljs-literal">False</span> Negatives): <span class="hljs-number">928</span>
(<span class="hljs-literal">True</span> Positives): <span class="hljs-number">14057</span>
Total : 14985</pre></div><figure id="343d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-MaASJ1CdVKlYsjNRPvPhg.png"><figcaption></figcaption></figure><p id="4bd4"><b>Models Comparison:</b></p><figure id="ebe0"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*z15hLwB6k3wr3O5DjOtqXg.png"><figcaption></figcaption></figure><p id="0e0a">Without class weights or resampling, accuracy is highest at about 84%, which is close to the share of the negative class in the dataset (about 84% negative and 16% positive), but recall is only about 25%. Adding class weights raised recall to about 91%, and oversampling the minority class raised it to about 94%. In both cases precision and accuracy dropped, because the model produces more false positives.</p><p id="9fe6">The full code is below:</p>
<figure id="9ba2">
<div>
<div>
<iframe class="gist-iframe" src="/gist/esenthil2018/6a8eac37098636765aef922c3781aefa.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><h2 id="d39e">Conclusion:</h2><p id="bfa9">It is difficult to predict using an imbalanced dataset. If it is not too expensive or time-consuming, collecting more data is usually the better option. Feature engineering also matters more than usual, so that your features extract as much signal as possible from the minority class. Accuracy alone is misleading on imbalanced datasets, so choose evaluation metrics that suit the data, such as recall and F1, and try different methods like class weights, resampling, or SMOTE to address the imbalance.</p><p id="6f45">Please feel free to connect with me on <a href="http://linkedin.com/in/esenthil"><b>LinkedIn</b></a></p><p id="2c91"><b>References:</b></p><p id="454e">[1] <a href="https://www.tensorflow.org/tutorials/structured_data/imbalanced_data">https://www.tensorflow.org/tutorials/structured_data/imbalanced_data</a>
[2] <a href="https://www.tensorflow.org/guide/data">https://www.tensorflow.org/guide/data</a>
[3] <a href="https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28">https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28</a></p></article></body>