
Advanced Use of Recurrent Neural Networks: Part 2
Preparing the Data
Fourth Section in a Series of Python Deep Learning Posts.
Preparing the Data
The exact formulation of the problem is as follows: given data going as far back as lookback timesteps (a timestep is 10 minutes) and sampled every step timesteps, can you predict the temperature delay timesteps in the future? You’ll use the following parameter values:
- lookback = 1440: observations will go back 10 days.
- step = 6: observations will be sampled at one data point per hour.
- delay = 144: targets will be 24 hours in the future.
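Since one timestep is 10 minutes, these values are easy to check by hand. The short sketch below does the conversion; the helper name is made up for illustration.

```python
MINUTES_PER_TIMESTEP = 10  # one record every 10 minutes, as stated above


def timesteps_to_hours(timesteps):
    """Convert a count of 10-minute timesteps to hours."""
    return timesteps * MINUTES_PER_TIMESTEP / 60


print(timesteps_to_hours(6))          # 1.0  -> step = 6 is one point per hour
print(timesteps_to_hours(144))        # 24.0 -> delay = 144 is 24 hours ahead
print(timesteps_to_hours(720) / 24)   # 5.0  -> 720 timesteps span 5 days
print(timesteps_to_hours(1440) / 24)  # 10.0 -> 1440 timesteps span 10 days
```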
To get started, you need to do two things:
- Preprocess the data to a format a neural network can ingest. This is easy: the data is already numerical, so you don’t need to do any vectorization. But each timeseries in the data is on a different scale (for example, temperature is typically between -20 and +30, but atmospheric pressure, measured in mbar, is around 1,000). You’ll normalize each timeseries independently so that they all take small values on a similar scale.
- Write a Python generator that takes the current array of float data and yields batches of data from the recent past, along with a target temperature in the future. Because the samples in the dataset are highly redundant (sample N and sample N + 1 will have most of their timesteps in common), it would be wasteful to explicitly allocate every sample. Instead, you’ll generate the samples on the fly using the original data.
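To get a feel for why allocating every sample would be wasteful, here is a rough count on toy numbers. The sizes (10,000 timesteps, 3 features) and both helper names are hypothetical, and the count ignores the delay at the end of the array for simplicity.

```python
def raw_value_count(n_timesteps, n_features):
    """Values stored when you keep only the original array."""
    return n_timesteps * n_features


def materialized_value_count(n_timesteps, lookback, step, n_features):
    """Values stored if every sample window were allocated up front."""
    n_samples = n_timesteps - lookback  # one sample per usable end position
    return n_samples * (lookback // step) * n_features


raw = raw_value_count(10_000, 3)
materialized = materialized_value_count(10_000, lookback=120, step=6, n_features=3)
print(raw)                 # 30000
print(materialized)        # 592800
print(materialized / raw)  # roughly 20x more values to store
```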
You’ll preprocess the data by subtracting the mean of each timeseries and dividing by the standard deviation. You’re going to use the first 200,000 timesteps as training data, so compute the mean and standard deviation only on this fraction of the data.
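A minimal sketch of that normalization is shown below, run on a small random array instead of the real data. The array, its three columns, and the 600-row training split are stand-ins chosen only for illustration; with the real data you would slice the first 200,000 rows instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 1,000 timesteps, 3 features on very different scales.
toy_data = rng.normal(loc=[10.0, 1000.0, 50.0], scale=[8.0, 5.0, 20.0], size=(1000, 3))
n_train = 600  # placeholder for the training fraction

# Statistics come from the training rows only, one value per feature (column).
mean = toy_data[:n_train].mean(axis=0)
std = toy_data[:n_train].std(axis=0)

# The same statistics are then applied to every row, including later ones.
normalized = (toy_data - mean) / std

print(normalized[:n_train].mean(axis=0))  # close to 0 for each feature
print(normalized[:n_train].std(axis=0))   # close to 1 for each feature
```

Computing the statistics on the training rows alone keeps information from the validation and test segments out of the preprocessing step.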
The data generator you’ll use yields a tuple (samples, targets), where samples is one batch of input data and targets is the corresponding array of target temperatures. It takes the following arguments:
- data: the original array of floating-point data, which you normalized in Part 1.
- lookback: how many timesteps back the input data should go.
- delay: how many timesteps in the future the target should be.
- min_index and max_index: indices in the data array that delimit which timesteps to draw from. This is useful for keeping a segment of the data for validation and another for testing.
- shuffle: whether to shuffle the samples or draw them in chronological order.
- batch_size: the number of samples per batch.
- step: the period, in timesteps, at which you sample data. You’ll set it to 6 in order to draw one data point every hour.
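To make the roles of lookback, step, and delay concrete, the sketch below lists which timesteps a single sample covers. It uses small toy values, and the helper name is made up for illustration.

```python
def sample_indices(row, lookback, step, delay):
    """Return (input timestep indices, target timestep index) for one sample."""
    inputs = list(range(row - lookback, row, step))  # the window ends just before `row`
    target = row + delay                             # the target lies `delay` steps ahead
    return inputs, target


inputs, target = sample_indices(row=20, lookback=12, step=3, delay=4)
print(inputs)       # [8, 11, 14, 17]
print(len(inputs))  # 4, that is lookback // step
print(target)       # 24
```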
>>> import numpy as np
>>> def generator(data, lookback, delay, min_index, max_index,
...               shuffle=False, batch_size=128, step=6):
...     # With no upper bound, stop early enough that every target exists.
...     if max_index is None:
...         max_index = len(data) - delay - 1
...     i = min_index + lookback
...     while True:
...         if shuffle:
...             # Draw random end-of-window rows from the allowed range.
...             rows = np.random.randint(
...                 min_index + lookback, max_index, size=batch_size)
...         else:
...             # Walk forward in order, wrapping around near the end.
...             if i + batch_size >= max_index:
...                 i = min_index + lookback
...             rows = np.arange(i, min(i + batch_size, max_index))
...             i += len(rows)
...         samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
...         targets = np.zeros((len(rows),))
...         for j, row in enumerate(rows):
...             indices = range(row - lookback, row, step)
...             samples[j] = data[indices]
...             # Column 1 is used as the temperature target.
...             targets[j] = data[row + delay][1]
...         yield samples, targets
Now, let’s use the abstract generator function to instantiate three generators: one for training, one for validation, and one for testing. Each will look at different temporal segments of the original data: the training generator looks at the first 200,000 timesteps, the validation generator looks at the following 100,000, and the test generator looks at the remainder.
>>> lookback = 1440
>>> step = 6
>>> delay = 144
>>> batch_size = 128
>>> train_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=0,
...     max_index=200000,
...     shuffle=True,
...     step=step,
...     batch_size=batch_size)
>>> val_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=200001,
...     max_index=300000,
...     step=step,
...     batch_size=batch_size)
>>> test_gen = generator(
...     float_data,
...     lookback=lookback,
...     delay=delay,
...     min_index=300001,
...     max_index=None,
...     step=step,
...     batch_size=batch_size)
>>> # Number of batches to draw from val_gen to see the whole validation set
>>> val_steps = (300000 - 200001 - lookback) // batch_size
>>> # Number of batches to draw from test_gen to see the whole test set
>>> test_steps = (len(float_data) - 300001 - lookback) // batch_size
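As a rough check on how many batches one sequential pass over the validation segment produces, the sketch below replays only the index arithmetic of the generator’s non-shuffled branch, so no data is touched. The helper name is made up for illustration, and the numbers are the validation settings used above.

```python
def count_batches_before_wrap(min_index, max_index, lookback, batch_size):
    """Count batches yielded in order before the index resets to the start."""
    i = min_index + lookback
    batches = 0
    # The sequential branch resets once i + batch_size reaches max_index.
    while i + batch_size < max_index:
        i += batch_size
        batches += 1
    return batches


val_batches = count_batches_before_wrap(200001, 300000, lookback=1440, batch_size=128)
approx = (300000 - 200001 - 1440) // 128

print(val_batches)  # 769
print(approx)       # 769
```

The step count passed to a validation or evaluation loop is a number of batches, not a number of samples, which is why the span of usable rows is divided by the batch size.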




