Robert Shaneyfelt



Python Machine Learning

Part 1 of a series showing how to build neural networks from scratch.

Photo by David Clode on Unsplash

This story is meant to be an introduction to artificial intelligence (AI). Python is a great language to learn since most of the tools are built using it. Deep learning is a technique used to make predictions using data, and it relies heavily on neural networks. This story is part of a series leading up to building a neural network from scratch.

In a production setting, you would use a deep learning framework like TensorFlow or PyTorch instead of building your own neural network from scratch. That said, knowing how neural networks work is helpful because it lets you design better deep learning models.

Artificial Intelligence Overview

In basic terms, the goal of using AI is to make computers think as humans do. This may seem like something new, but the field was born in the 1950s.

Machine Learning

Machine learning is a technique in which you train the system to solve a problem instead of explicitly programming the rules.

A common machine learning task is supervised learning, in which you have a dataset with inputs and known outputs. The task is to use this dataset to train a model that predicts the correct outputs based on the inputs.

One of the key aspects of supervised machine learning is model evaluation and validation. When you evaluate the predictive performance of your model, the process must be unbiased. Using train_test_split() from the data science library scikit-learn, you can split your dataset into subsets that minimize the potential for bias in your evaluation and validation process.

The Importance of Data Splitting

Supervised machine learning is about creating models that precisely map the given inputs (independent variables, or predictors) to the given outputs (dependent variables, or responses).

The goal of supervised learning tasks is to make predictions for new, unseen data. To do that, you assume that this unseen data follows a probability distribution similar to the distribution of the training dataset. If in the future this distribution changes, then you need to train your model again using the new training dataset.

Training, Validation, and Test Sets

Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into three subsets:

  1. The training set is applied to train, or fit, your model. For example, you use the training set to find the optimal weights, or coefficients, for linear regression, logistic regression, or neural networks.
  2. The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each considered setting of hyperparameters, you fit the model with the training set and assess its performance with the validation set.
  3. The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.

In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets.
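As a rough sketch of the three-way split described above: train_test_split() produces two subsets per call, so one common approach (an assumption here, not something this article's example does) is to call it twice. The toy arrays and the 60/20/20 proportions are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 observations with one feature each.
x = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# First call: set aside 20% of the data (4 samples) as the final test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Second call: 25% of the remaining 16 samples (4) become the validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(len(x_train), len(x_val), len(x_test))  # 12 4 4
```

The test set is carved out first so that nothing done during hyperparameter tuning can touch it.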

Underfitting and Overfitting

Splitting a dataset might also be important for detecting if your model suffers from one of two very common problems, called underfitting and overfitting:

  1. Underfitting is usually the consequence of a model being unable to encapsulate the relations among data. For example, this can happen when trying to represent nonlinear relations with a linear model. Underfitted models will likely have poor performance with both training and test sets.
  2. Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations among data and noise. Such models often have bad generalization capabilities. Although they work well with training data, they usually yield poor performance with unseen (test) data.
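The two problems can be made visible by comparing training and test scores. The sketch below is only an illustration under assumed conditions: the quadratic toy data, the noise level, and the choice of DecisionTreeRegressor as the "excessively complex" model are not from the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a quadratic relation plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=1.0, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# A straight line cannot follow the curve, so it tends to underfit.
linear = LinearRegression().fit(x_train, y_train)
# An unrestricted tree can memorize every training point, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# Collect (training R², test R²) for each model.
scores = {
    "linear": (linear.score(x_train, y_train), linear.score(x_test, y_test)),
    "tree": (tree.score(x_train, y_train), tree.score(x_test, y_test)),
}
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

A low score on both sets points toward underfitting, while a near-perfect training score paired with a clearly lower test score points toward overfitting.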

Prerequisites for Using train_test_split()

scikit-learn, or sklearn, has many packages for data science and machine learning. We will focus on the model_selection package, specifically on the function train_test_split().
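Before the regression example, here is a minimal sketch of how train_test_split() behaves on its own. The 12-element toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 observations with one feature each.
x = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# With no test_size or train_size given, 25% of the samples are held out.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
print(x_train.shape, x_test.shape)  # (9, 1) (3, 1)

# An integer test_size is an absolute number of samples;
# shuffle=False keeps the original order, so the last 4 rows form the test set.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=4, shuffle=False)
print(y_te.tolist())  # [8, 9, 10, 11]
```

Passing random_state makes the shuffled split reproducible from run to run.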

Application of train_test_split() in Supervised Machine Learning

Start with a small regression problem that can be solved with linear regression.

Example of Linear Regression

This example will show how to solve a regression problem. Start by importing NumPy, the LinearRegression class, and the train_test_split() function:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Then create two small arrays, x and y, to represent the observations, and split them into training and test sets. I had to install scikit-learn version 0.24.2 to get the linear regression algorithm.

$ python -m pip install -U "scikit-learn==0.24.2"

LinearRegression creates the object that represents the model, while .fit() trains, or fits, the model and returns it. With linear regression, fitting the model means determining the best intercept (model.intercept_) and slope (model.coef_) of the regression line.
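To see what fitting determines, the sketch below uses made-up points that lie exactly on the line y = 2x + 1, so the fitted values are easy to check. The data is hypothetical; model.coef_ holds the slope alongside model.intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.arange(5).reshape(-1, 1)
y = 2 * x.ravel() + 1

model = LinearRegression()
fitted = model.fit(x, y)  # .fit() returns the same model object

print(fitted is model)   # True
print(model.intercept_)  # approximately 1.0
print(model.coef_)       # approximately [2.]
```

Because .fit() returns the model itself, creation and fitting can be chained in one expression, as in LinearRegression().fit(x, y).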

In the listing below, each statement is followed by the output it produces, if any.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54, 74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

x

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14],
       [15],
       [16],
       [17],
       [18],
       [19]])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=8, random_state=0)

model = LinearRegression().fit(x_train, y_train)

model.intercept_

3.1617195496417523

model.score(x_train, y_train)

0.9868175024574795

model.score(x_test, y_test)

0.9465896927715023

.score() returns the coefficient of determination, or R², for the data passed. Its maximum is 1. The higher the R² value, the better the fit. In this case, the training data yields a slightly higher coefficient. However, the R² calculated with test data is an unbiased measure of your model’s prediction performance.
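For a regressor, the coefficient of determination can also be computed by hand as R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below reuses the x and y arrays and the split settings from the example and checks the manual value against .score(). The helper names ss_res, ss_tot, and r2_manual are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

# Manual R² on the test set.
y_pred = model.predict(x_test)
ss_res = np.sum((y_test - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x_test, y_test))  # the two values should agree
```

The formula also shows why 1 is the maximum: R² reaches it only when every residual is zero.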

The combination of the training data with the machine learning algorithm creates the model. Then, with this model, you can make predictions for new data.
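As a small sketch of that last step, the fitted model can be handed inputs it has never seen through .predict(). The new inputs 20 and 25 are hypothetical values chosen for illustration. For a one-feature linear regression, each prediction is the intercept plus the slope times the input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

# Hypothetical new observations, shaped (n_samples, n_features) like x.
x_new = np.array([[20], [25]])
y_new = model.predict(x_new)

# The same predictions computed directly from the fitted line.
y_manual = model.intercept_ + model.coef_[0] * x_new.ravel()

print(y_new)
print(np.allclose(y_new, y_manual))  # True
```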

Tune in for the next part in this series, where I will discuss deep learning.
