Summary

The content describes the process of preparing data for a recurrent neural network, including normalizing the data and writing a Python generator to yield batches of data for training, validation, and testing.

Abstract

The content is a tutorial on preparing data for a recurrent neural network. It explains the problem formulation, which involves predicting the temperature in a certain number of timesteps given past data. The tutorial then describes the process of normalizing the data and writing a Python generator to yield batches of data for training, validation, and testing. The generator takes several arguments, including the original data, the number of timesteps to look back, the delay for the target, the indices to draw from, and the batch size. The tutorial provides code examples for the generator function and for instantiating the generators for training, validation, and testing.

Opinions

The tutorial assumes that the reader has some familiarity with recurrent neural networks and Python programming.
The tutorial emphasizes the importance of normalizing the data and using a generator to yield batches of data, rather than explicitly allocating every sample.
The tutorial provides clear code examples and explanations for each step of the process.
The tutorial does not provide much context or motivation for the specific problem formulation, beyond stating that it is a temperature forecasting problem.
The tutorial does not discuss any evaluation metrics or criteria for success.
The tutorial does not provide any discussion of potential challenges or limitations of the approach.
The tutorial does not provide any discussion of alternative approaches or techniques for preparing data for recurrent neural networks.

Advanced Use of Recurrent Neural Networks: Part 2

Preparing the Data

Fourth Section in a Series of Python Deep Learning Posts.

Previous Sections

Python Deep Learning

Convolutional Neural Networks

Recurrent Neural Networks

Additionally, you can check out the series of posts on Apache Spark

Data Analysis with Scala and Spark

Apache Spark and Hadoop on an AWS Cluster with Flintrock

Preparing the Data

The exact formulation of the problem will be as follows: given data going as far back as lookback timesteps (a timestep is 10 minutes) and sampled every step timesteps, can you predict the temperature in delay timesteps? You’ll use the following parameter values:

lookback = 1440: observations will go back 10 days.
step = 6: observations will be sampled at one data point per hour.
delay = 144: targets will be 24 hours in the future.
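
To sanity-check these values, here is a small sketch that converts timestep counts into hours and days. The helper name timesteps_to_hours is hypothetical; the only fact taken from the text is that one timestep is 10 minutes.

```python
MINUTES_PER_TIMESTEP = 10  # stated in the text: one timestep is 10 minutes


def timesteps_to_hours(n_timesteps: int) -> float:
    """Convert a count of 10-minute timesteps into hours (hypothetical helper)."""
    return n_timesteps * MINUTES_PER_TIMESTEP / 60


print(timesteps_to_hours(6))          # 1.0  -> one data point per hour
print(timesteps_to_hours(144))        # 24.0 -> target is 24 hours ahead
print(timesteps_to_hours(720) / 24)   # 5.0  -> 720 timesteps span 5 days
print(timesteps_to_hours(1440) / 24)  # 10.0 -> 1440 timesteps span 10 days
```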

To get started, you need to do two things:

Preprocess the data to a format a neural network can ingest. This is easy: the data is already numerical, so you don’t need to do any vectorization. But each timeseries in the data is on a different scale (for example, temperature is typically between -20 and +30, but atmospheric pressure, measured in mbar, is around 1,000). You’ll normalize each timeseries independently so that they all take small values on a similar scale.
Write a Python generator that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future. Because the samples in the dataset are highly redundant (sample N and sample N + 1 will have most of their timesteps in common), it would be wasteful to explicitly allocate every sample. Instead, you’ll generate the samples on the fly using the original data.
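
To get a feel for why allocating every sample up front is wasteful, here is a rough back-of-the-envelope sketch. The row count, the 14 features, and the float64 storage are assumptions made for illustration, and materialized_bytes is a hypothetical helper.

```python
def materialized_bytes(n_samples, lookback, step, n_features, itemsize=8):
    """Bytes needed to store every (lookback // step, n_features) window explicitly."""
    return n_samples * (lookback // step) * n_features * itemsize


# Assumed sizes, for illustration only: 200,000 rows, 14 features, float64.
n_rows, n_features = 200_000, 14

raw = n_rows * n_features * 8
full = materialized_bytes(n_rows, lookback=1440, step=6, n_features=n_features)

print(raw / 1e6, "MB for the raw array")
print(full / 1e9, "GB if every window were stored")
print(full // raw, "times larger")
```

Under these assumptions, the materialized windows take lookback // step times as much memory as the array they are cut from, which is the cost the generator avoids.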

You’ll preprocess the data by subtracting the mean of each timeseries and dividing by the standard deviation. You’re going to use the first 200,000 timesteps as training data, so compute the mean and standard deviation only on this fraction of the data.
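
As a minimal sketch of this step, the snippet below normalizes a synthetic array using statistics from its training rows only. The synthetic data and the cutoff n_train = 600 are stand-ins for illustration; with the real data the cutoff would be 200,000.

```python
import numpy as np

# Synthetic stand-in for the real data: 1,000 timesteps, 3 series on different scales.
rng = np.random.default_rng(0)
float_data = rng.normal(loc=[10.0, 1000.0, 50.0], scale=[8.0, 5.0, 20.0], size=(1000, 3))

n_train = 600  # hypothetical training cutoff (the text uses 200,000)

# Compute the statistics on the training rows only, so nothing from the
# validation or test segments influences the preprocessing.
mean = float_data[:n_train].mean(axis=0)
std = float_data[:n_train].std(axis=0)

# Apply the same shift and scale to every row, including the held-out ones.
float_data = (float_data - mean) / std

print(float_data[:n_train].mean(axis=0))  # close to 0 for each column
print(float_data[:n_train].std(axis=0))   # close to 1 for each column
```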

The data generator you’ll use yields a tuple (samples, targets), where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:

data — The original array of floating-point data, which you normalized in Part 1.
lookback — How many timesteps back the input data should go.
delay — How many timesteps in the future the target should be.
min_index and max_index — Indices in the data array that delimit which timesteps to draw from. This is useful for keeping a segment of the data for validation and another for testing.
shuffle — Whether to shuffle the samples or draw them in chronological order.
batch_size — The number of samples per batch.
step — The period, in timesteps, at which you sample data. You’ll set it to 6 in order to draw one data point every hour.
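
To make the roles of lookback, step, and delay concrete, here is a sketch of how a single sample and its target could be sliced out of the data. The toy array and the small parameter values are chosen for readability and are not the ones used in this tutorial; column 1 stands in for the temperature.

```python
import numpy as np

# Toy data: 100 timesteps, 2 features; column 1 plays the role of temperature.
data = np.arange(200, dtype=float).reshape(100, 2)

lookback, step, delay = 12, 3, 4  # small hypothetical values for readability
row = 50                          # index of the "current" timestep

# Every step-th timestep from the lookback window that ends just before row.
indices = np.arange(row - lookback, row, step)
sample = data[indices]            # shape: (lookback // step, n_features)
target = data[row + delay, 1]     # temperature delay timesteps ahead

print(indices)       # [38 41 44 47]
print(sample.shape)  # (4, 2)
print(target)        # 109.0
```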

>>> import numpy as np
>>> def generator(data, lookback, delay, min_index, max_index,
...               shuffle=False, batch_size=128, step=6):
...     # With max_index=None, go up to the last row that still has a target.
...     if max_index is None:
...         max_index = len(data) - delay - 1
...     i = min_index + lookback
...     while True:
...         if shuffle:
...             # Draw batch_size random rows that have a full lookback window.
...             rows = np.random.randint(
...                 min_index + lookback, max_index, size=batch_size)
...         else:
...             # Walk forward in time and wrap around at the end of the segment.
...             if i + batch_size >= max_index:
...                 i = min_index + lookback
...             rows = np.arange(i, min(i + batch_size, max_index))
...             i += len(rows)
...         samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
...         targets = np.zeros((len(rows),))
...         for j, row in enumerate(rows):
...             # Every step-th timestep from the lookback window ending at row.
...             indices = np.arange(row - lookback, row, step)
...             samples[j] = data[indices]
...             # Column 1 is the target (temperature) column.
...             targets[j] = data[row + delay][1]
...         yield samples, targets
...

Now, let’s use the abstract generator function to instantiate three generators: one for training, one for validation, and one for testing. Each will look at different temporal segments of the original data: the training generator looks at the first 200,000 timesteps, the validation generator looks at the following 100,000, and the test generator looks at the remainder.

>>> lookback = 1440
>>> step = 6
>>> delay = 144
>>> batch_size = 128
>>> train_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=0,
...     max_index=200000,
...     shuffle=True,
...     step=step,
...     batch_size=batch_size)
>>> val_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=200001,
...     max_index=300000,
...     step=step,
...     batch_size=batch_size)
>>> test_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=300001,
...     max_index=None,
...     step=step,
...     batch_size=batch_size)
>>> # Number of batches to draw to see each held-out segment once.
>>> val_steps = (300000 - 200001 - lookback) // batch_size
>>> test_steps = (len(float_data) - 300001 - lookback) // batch_size
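
If these step counts are later used as the number of batches to draw from a generator, the following sketch shows one way to count how many full batches the chronological (non-shuffled) generator yields before it wraps around to the start of its segment. The helper batches_to_cover is hypothetical and mirrors the wrap-around condition in the generator above.

```python
def batches_to_cover(min_index: int, max_index: int, lookback: int, batch_size: int) -> int:
    """Full batches yielded before the generator resets to the start of its segment."""
    n_rows = max_index - (min_index + lookback)  # rows that have a full lookback window
    return n_rows // batch_size


# Validation segment used above: rows 200001 to 300000, lookback of 1440, batches of 128.
print(batches_to_cover(200001, 300000, lookback=1440, batch_size=128))  # 769
```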

Chollet, François. Deep learning with Python. Shelter Island, NY: Manning Publications Co, 2018. Print.

Next

Advanced Use of Recurrent Neural Networks: Part 3

A Non-Machine-Learning Baseline

Previous

Advanced Use of Recurrent Neural Networks: Part 1

A Temperature-Forecasting Problem