DEEP REINFORCEMENT LEARNING EXPLAINED — 07
Cross-Entropy Method Performance Analysis
Implementation of the Cross-Entropy Training Loop
In this post, we describe in detail the training loop of the Cross-Entropy method, which we skipped in the previous post, and see how the Agent's learning improves with more complex neural networks. We also present an improved variant of the method that keeps “elite” episodes for several iterations of the training process. Finally, we show the limitations of the Cross-Entropy method to motivate other approaches.
1. Overview of the Training Loop
Next, we will present in detail the code that makes up the training loop that we presented in the previous post.
The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.
Main variables
The code begins by defining the main parameters of the method.
BATCH_SIZE = 100
GAMMA = 0.9
PERCENTILE = 30
REWARD_GOAL = 0.8
Helper classes
We suggest using two helper classes to make the code easier to follow:
from collections import namedtuple
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep',
                         field_names=['observation', 'action'])
Both helper classes are named tuples from the collections package in the standard library:
EpisodeStep: represents one single step that our Agent made in the episode. It stores the state observed from the Environment and the action the Agent took. The reward is not recorded because it is always 0.0 except for the last transition. Remember that we will use episode steps from “elite” episodes as training data.
Episode: a single episode, stored with its total Reward and a collection of EpisodeStep. The total Reward is stored undiscounted; the discount is applied later, in step 2 of the training loop.
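As a small illustration of how these two named tuples fit together, the sketch below builds one episode by hand. The observations and actions are made-up integers chosen only for brevity; in the actual training loop they come from the Environment and the Agent.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

# Hypothetical three-step episode (values invented for illustration)
steps = [
    EpisodeStep(observation=0, action=2),
    EpisodeStep(observation=1, action=1),
    EpisodeStep(observation=5, action=1),
]
episode = Episode(reward=1.0, steps=steps)

# Fields are accessed by name, which keeps the training loop readable
print(episode.reward)
print(len(episode.steps))
print(episode.steps[0].action)
```

Because named tuples are immutable, a finished episode cannot be modified by accident once it has been appended to the batch.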
Initialization of variables
At this point, a set of variables that we will use in the training loop are initialized. We will present each of them as they are required in the loop:
iter_no = 0
reward_mean = 0
full_batch = []
batch = []
episode_steps = []
episode_reward = 0.0
state = env.reset()
The training loop
We learned in the previous post that the training loop of our Agent that implements the Cross-Entropy algorithm repeats 4 main steps until we become satisfied with the result:
1 — Play N number of episodes
2 — Calculate the Expected Return for every episode and decide on a return boundary
3 — Throw away all episodes with a return below the boundary.
4 — Train the neural network using episode steps from the “elite” episodes
We have decided that the Agent must be trained until a certain Reward threshold is reached. Specifically, we set a threshold of 80% (a mean batch reward of 0.8, that is, 80% of the episodes in a batch reach the goal), indicated in the variable REWARD_GOAL:
while reward_mean < REWARD_GOAL:
STEP 1 — Play N number of episodes
The next piece of code is the one that generates the batches with episodes:
    action = select_action(state)
    next_state, reward, episode_is_done, _ = env.step(action)
    episode_steps.append(EpisodeStep(observation=state, action=action))
    episode_reward += reward
    if episode_is_done:  # Episode finished
        batch.append(Episode(reward=episode_reward,
                             steps=episode_steps))
        next_state = env.reset()
        episode_steps = []
        episode_reward = 0.0
        <STEP 2>
        <STEP 3>
        <STEP 4>
    state = next_state
The main variables we will use are:
batch: accumulates the list of Episode instances (BATCH_SIZE=100).
episode_steps: accumulates the list of steps in the current episode.
episode_reward: maintains a reward counter for the current episode (in our case we only have a Reward at the end of the episode, but the algorithm is described for the more general situation where Rewards can arrive at any step).
The list of episode steps is extended with an (observation, action) pair. It is important to note that we save the observed state that was used to choose the action (not the observation next_state returned by the Environment as a result of the action):
episode_steps.append(EpisodeStep(observation=state,action=action))
The reward is added to the current episode’s total reward:
episode_reward += reward
When the current episode is over (hole or goal state), we append the finished episode to the batch, saving the total reward and the steps we have taken. Then we reset the Environment to start over, and we reset the variables episode_steps and episode_reward to start tracking the next episode:
batch.append(Episode(reward=episode_reward, steps=episode_steps))
next_state = env.reset()
episode_steps = []
episode_reward = 0.0
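To see step 1 run end to end without Gym, here is a minimal sketch with a toy environment. ToyCorridorEnv and collect_batch are hypothetical names introduced only for this illustration; the environment merely imitates the reset/step interface (with the four-value return of step) used in the code above, and a random policy stands in for select_action.

```python
import random
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])


class ToyCorridorEnv:
    """Hypothetical stand-in for FrozenLake: walk right from cell 0 to the last cell."""

    def __init__(self, length=4, max_steps=10):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps_taken = 0
        return self.position

    def step(self, action):
        # Action 1 moves right; any other action moves left (clamped at cell 0)
        self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps_taken += 1
        reached_goal = self.position == self.length - 1
        done = reached_goal or self.steps_taken >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        return self.position, reward, done, {}


def collect_batch(env, batch_size, select_action):
    """Play episodes until batch_size of them have finished."""
    batch, episode_steps, episode_reward = [], [], 0.0
    state = env.reset()
    while len(batch) < batch_size:
        action = select_action(state)
        next_state, reward, episode_is_done, _ = env.step(action)
        # Store the state the action was chosen from, not next_state
        episode_steps.append(EpisodeStep(observation=state, action=action))
        episode_reward += reward
        if episode_is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            next_state = env.reset()
            episode_steps = []
            episode_reward = 0.0
        state = next_state
    return batch


random.seed(0)
batch = collect_batch(ToyCorridorEnv(), batch_size=5,
                      select_action=lambda state: random.choice([0, 1]))
print([(episode.reward, len(episode.steps)) for episode in batch])
```

Every episode in the batch starts with observation 0, because the state stored in each step is the one observed before the action was taken.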
STEP 2 — Calculate the Return for every episode and decide on a return boundary
The next piece of code implements step 2:
if len(batch) == BATCH_SIZE:
    reward_mean = float(np.mean(list(map(lambda s:
        s.reward, batch))))
    elite_candidates = batch
    ExpectedReturn = list(map(lambda s: s.reward * (GAMMA **
        len(s.steps)), elite_candidates))
    reward_bound = np.percentile(ExpectedReturn, PERCENTILE)
The training loop executes this step when a number of episodes equal to BATCH_SIZE has been played:
if len(batch) == BATCH_SIZE:
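First, the code calculates the expected return for all the episodes in the current batch. Each episode's total reward is multiplied by GAMMA raised to the number of steps, so shorter successful episodes get a higher return:
elite_candidates = batch
ExpectedReturn = list(map(lambda s: s.reward * (GAMMA **
    len(s.steps)), elite_candidates))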
In this step, from the given batch of episodes and the percentile value, we calculate a boundary return, which will be used to filter the “elite” episodes used to train the Agent’s neural network:
reward_bound = np.percentile(ExpectedReturn, PERCENTILE)
To obtain the boundary, we use NumPy’s percentile function, which, given a list of values and the desired percentile, calculates that percentile’s value. With PERCENTILE set to 30, the boundary is the 30th percentile of the returns, and the episodes strictly above it become the “elite” episodes. Early in training most episodes end with a return of 0, so the boundary is 0 and only the successful episodes pass the filter.
During this step we also compute reward_mean, which is used to decide when to finish the training loop:
reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
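The boundary computation of step 2 can be worked through on a small made-up batch. The sketch below is only an illustration: the ten episodes are invented, and the helper percentile is a hypothetical pure-Python stand-in that uses linear interpolation, as np.percentile does by default, so that the example runs without NumPy.

```python
GAMMA = 0.9
PERCENTILE = 30


def percentile(values, q):
    """Percentile with linear interpolation, as np.percentile does by default."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


# Hypothetical batch of 10 episodes as (total reward, number of steps):
# seven failures and three successes of different lengths
episodes = [(0.0, 5)] * 7 + [(1.0, 10), (1.0, 8), (1.0, 6)]

# Same formula as in the training loop: reward * GAMMA ** number_of_steps
expected_return = [reward * GAMMA ** steps for reward, steps in episodes]
reward_bound = percentile(expected_return, PERCENTILE)
elite = [ret for ret in expected_return if ret > reward_bound]

print(reward_bound)
print(len(elite))
```

In this batch the boundary is 0.0, so exactly the three successful episodes pass the strict comparison, and the 6-step success has a higher return than the 10-step one.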
STEP 3 — Throw away all episodes with a return below the boundary
Next, we filter out the non-elite episodes with the following code:
train_obs = []
train_act = []
elite_batch = []
for example, discounted_reward in zip(elite_candidates,
                                      ExpectedReturn):
    if discounted_reward > reward_bound:
        train_obs.extend(map(lambda step: step.observation,
                             example.steps))
        train_act.extend(map(lambda step: step.action,
                             example.steps))
        elite_batch.append(example)
full_batch = elite_batch
state = train_obs
acts = train_act
For every episode in the batch:
for example, discounted_reward in zip(elite_candidates,
ExpectedReturn):
we check that the episode has a higher discounted return than our boundary:
if discounted_reward > reward_bound:
and if it has, we will populate the list of observed states and actions that we will train on, and keep track of the elite episodes:
train_obs.extend(map(lambda step: step.observation,example.steps))
train_act.extend(map(lambda step: step.action, example.steps))
elite_batch.append(example)
Then we update these three variables with the “elite” episodes and the lists of states and actions with which we will train our neural network:
full_batch=elite_batch
state=train_obs
acts=train_act
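The filtering of step 3 can be tried in isolation on two hand-made episodes. In this sketch, filter_elite is a hypothetical helper that wraps the same loop as above, and the episodes and the boundary of 0.0 are invented for illustration.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

GAMMA = 0.9


def filter_elite(candidates, returns, reward_bound):
    """Keep episodes strictly above the boundary and flatten their steps."""
    train_obs, train_act, elite_batch = [], [], []
    for example, discounted_reward in zip(candidates, returns):
        if discounted_reward > reward_bound:
            train_obs.extend(step.observation for step in example.steps)
            train_act.extend(step.action for step in example.steps)
            elite_batch.append(example)
    return elite_batch, train_obs, train_act


# One failed and one successful episode (values invented for illustration)
failed = Episode(reward=0.0, steps=[EpisodeStep(0, 1), EpisodeStep(4, 0)])
solved = Episode(reward=1.0, steps=[EpisodeStep(0, 2), EpisodeStep(1, 1),
                                    EpisodeStep(5, 1)])
candidates = [failed, solved]
returns = [ep.reward * GAMMA ** len(ep.steps) for ep in candidates]

elite_batch, train_obs, train_act = filter_elite(candidates, returns,
                                                 reward_bound=0.0)
print(train_obs, train_act)
```

Only the steps of the successful episode end up in train_obs and train_act, so the network is trained exclusively on (state, action) pairs that led to the goal.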
STEP 4 — Train the neural network using episode steps from the “elite” episodes
Every time our loop accumulates enough episodes (BATCH_SIZE), we compute the “elite” episodes, and in the same iteration the loop trains the Agent’s neural network with this code:
state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)
optimizer.zero_grad()
action_scores_t = net(state_t)
loss_t = objective(action_scores_t, acts_t)
loss_t.backward()
optimizer.step()
iter_no += 1
batch = []
This code trains the neural network using episode steps from the “elite” episodes, with the state s as the input and the issued action a as the label (desired output). Let’s comment on the code lines in more detail:
First, we transform the variables to tensors:
state_t = torch.FloatTensor(state)
acts_t = torch.LongTensor(acts)
We zero the gradients of our neural network:
optimizer.zero_grad()
and pass the observed state to the neural network, obtaining its action scores:
action_scores_t = net(state_t)
These scores are passed to the objective function, which calculates the cross-entropy between the neural network output and the actions that the Agent took:
loss_t = objective(action_scores_t, acts_t)
Remember that we only consider “elite” actions. The idea of this is to reinforce our neural network to carry out those “elite” actions that have led to good rewards.
Finally, we calculate the gradients of the loss using the backward method and adjust the parameters of our neural network using the step method of the optimizer:
loss_t.backward()
optimizer.step()
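To make the loss value more tangible, the following sketch computes by hand what a cross-entropy objective such as nn.CrossEntropyLoss evaluates: a softmax over the raw action scores, followed by the mean negative log-probability of the action that was taken. The function cross_entropy and the score values are illustrative assumptions, written in plain Python so the arithmetic is visible.

```python
import math


def cross_entropy(action_scores, actions):
    """Mean of -log(softmax(scores)[action]) over the batch."""
    total = 0.0
    for scores, action in zip(action_scores, actions):
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[action]
    return total / len(actions)


# Four actions scored equally: the loss is ln(4), about 1.386
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])

# A strong preference for the action that was taken gives a much lower loss
confident_loss = cross_entropy([[0.0, 0.0, 5.0, 0.0]], [2])

print(round(uniform_loss, 3), round(confident_loss, 3))
```

With four possible actions, a network that has no preference yet produces a loss close to ln 4, which is roughly the level at which a training run on FrozenLake would be expected to start; the loss falls as the network assigns higher scores to the “elite” actions.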
Monitor the progress of the Agent
In order to monitor the progress of the Agent’s learning performance, we included this print in the training loop:
print("%d: loss=%.3f, reward_mean=%.3f" %
      (iter_no, loss_t.item(), reward_mean))
With it we show the iteration number, the loss and the mean reward of the batch (in the next section we also write the same values to TensorBoard to get a nice chart):
0: loss=1.384, reward_mean=0.020
1: loss=1.353, reward_mean=0.040
2: loss=1.332, reward_mean=0.010
3: loss=1.362, reward_mean=0.020
4: loss=1.337, reward_mean=0.020
5: loss=1.378, reward_mean=0.020
. . .
639: loss=0.471, reward_mean=0.730
640: loss=0.511, reward_mean=0.730
641: loss=0.472, reward_mean=0.760
642: loss=0.481, reward_mean=0.650
643: loss=0.472, reward_mean=0.750
644: loss=0.492, reward_mean=0.720
645: loss=0.480, reward_mean=0.660
646: loss=0.479, reward_mean=0.740
647: loss=0.474, reward_mean=0.660
648: loss=0.517, reward_mean=0.830
We can check that the last value of the reward_mean variable is the one that ended the training loop, since 0.830 is no longer below REWARD_GOAL.
2. Improving the Agent with a better neural network
In a previous post, we already introduced TensorBoard, a tool that helps with data visualization. Instead of the “print” used in the previous section, we could use these two statements to plot the behavior of these two variables:
writer.add_scalar("loss", loss_t.item(), iter_no)
writer.add_scalar("reward_mean", reward_mean, iter_no)
In this case, the output is:
More complex Neural Network
One question that arises is whether we could improve the Agent’s neural network. For instance, what happens if we consider a hidden layer with more neurons, let’s say 128:
HIDDEN_SIZE = 128
net = nn.Sequential(
    nn.Linear(obs_size, HIDDEN_SIZE),
    nn.Sigmoid(),
    nn.Linear(HIDDEN_SIZE, n_actions)
)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)
train_loop()
The result can be seen here (or by executing the GitHub code):
We can see that this network learns faster than the previous one.
ReLU activation function
What happens if we change the activation function, for example a ReLU instead of a Sigmoid?
Below you can see what happens: the network converges much earlier, and training completes after only 200 iterations.
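The change of activation function amounts to swapping one layer in the network definition. The sketch below shows what that could look like; the values of obs_size and n_actions are assumptions that correspond to the default 4x4 FrozenLake map with one-hot observations (16 states, 4 actions).

```python
import torch
import torch.nn as nn

# Assumed sizes for the default 4x4 FrozenLake map with one-hot observations
obs_size, n_actions = 16, 4
HIDDEN_SIZE = 128

net = nn.Sequential(
    nn.Linear(obs_size, HIDDEN_SIZE),
    nn.ReLU(),  # replaces nn.Sigmoid()
    nn.Linear(HIDDEN_SIZE, n_actions),
)

# One-hot encoding of state 0, shaped as a batch with a single observation
state_t = torch.zeros(1, obs_size)
state_t[0, 0] = 1.0
action_scores_t = net(state_t)
print(action_scores_t.shape)
```

The objective, the optimizer, and the training loop stay the same as in the Sigmoid version; only the hidden activation changes.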
3. Improving the Cross-Entropy algorithm
So far we have shown how to improve the neural network architecture. But we can also improve the algorithm itself: we can keep “elite” episodes for a longer time. The previous version of the algorithm samples episodes from the Environment, trains on the best ones, and throws them away. However, when the number of successful episodes is small, the “elite” episodes can be kept longer, for several iterations, to train on them. We need to change only one line in the code:
elite_candidates= full_batch + batch
#elite_candidates= batch
The result seen through TensorBoard is:
We can see that the number of iterations required is reduced again.
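The effect of keeping the “elite” episodes can be illustrated with a simplified sketch. Here keep_elite is a hypothetical helper, the episodes are invented, and the boundary is fixed at 0.0 for clarity, whereas the real loop recomputes it with the percentile over elite_candidates at every iteration.

```python
from collections import namedtuple

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
GAMMA = 0.9


def discounted_return(episode):
    return episode.reward * GAMMA ** len(episode.steps)


def keep_elite(full_batch, batch, reward_bound):
    """Merge the previously kept elite episodes with the new batch, then filter."""
    elite_candidates = full_batch + batch
    return [ep for ep in elite_candidates
            if discounted_return(ep) > reward_bound]


# Placeholder steps: only the number of steps matters for the discount
success = Episode(reward=1.0, steps=[None] * 6)
failure = Episode(reward=0.0, steps=[None] * 4)

# Iteration 1: one success among the new episodes
full_batch = keep_elite([], [failure, success, failure], reward_bound=0.0)

# Iteration 2: no new success, but the earlier one is still available
full_batch = keep_elite(full_batch, [failure, failure, failure],
                        reward_bound=0.0)
print(len(full_batch))
```

Without the merge, the second iteration would have no successful episode to train on; with it, the earlier success is reused.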
4. Limitations of the Cross-Entropy method
So far we have seen that, with the proposed improvements, we can find a good neural network in very few iterations of the training loop. But this is because we are dealing with a very simple “non-slippery” Environment. What if we have a “slippery” Environment?
slippedy_env = gym.make('FrozenLake-v0', is_slippery=True)
class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(0.0, 1.0,
            (env.observation_space.n, ), dtype=np.float32)
    def observation(self, observation):
        r = np.copy(self.observation_space.low)
        r[observation] = 1.0
        return r
env = OneHotWrapper(slippedy_env)
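To make the effect of the wrapper concrete, the following sketch reproduces what its observation method computes, using a plain list instead of a NumPy array so that it runs without Gym. The function name one_hot and the state count of 16 (the default 4x4 map) are assumptions for illustration.

```python
def one_hot(observation, n_states):
    """Encode a discrete state index as a vector with a single 1.0."""
    encoded = [0.0] * n_states
    encoded[observation] = 1.0
    return encoded


# State 5 on an assumed 16-state map
vector = one_hot(5, 16)
print(vector)
```

This turns the integer state returned by the Environment into a vector that can be fed directly to the input layer of the network.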
Again TensorBoard is a big help. In the following figure, we see the behavior of the algorithm during the first iterations. The Reward is not able to take off:
If we wait for 5,000 more iterations, we see that it can improve, but from there it stagnates and is no longer able to surpass a threshold:
And although we have waited more than two hours, it fails to improve and does not surpass the threshold of 60%:
5. Summary
In these two posts about the Cross-Entropy method, the reader became familiar with the method. We chose this method because it is a good warm-up: it is simple but quite powerful, despite its limitations, and it merges Reinforcement Learning and Deep Learning.
We applied it to the FrozenLake Environment. We have seen that we can find a good neural network for the simple “non-slippery” Environment. But if we consider a “slippery” Environment, the Cross-Entropy method cannot find the solution (that is, it cannot train a neural network that solves it). Later in the series, you will become familiar with other methods that address these limitations.
In the next post, which starts a new part of this series, we will switch to a more systematic study of RL methods and discuss the value-based family of methods.
See you in the next post!
Acknowledgments: The code presented in this post was inspired by the code of Maxim Lapan, who has written an excellent practical book on the subject.
Deep Reinforcement Learning Explained Series
by UPC Barcelona Tech and Barcelona Supercomputing Center
A relaxed introductory series that gradually and with a practical approach introduces the reader to this exciting technology that is the real enabler of the latest disruptive advances in the field of Artificial Intelligence.
About this series
I started to write this series in May, during the period of lockdown in Barcelona. Honestly, writing these posts in my spare time helped me to #StayAtHome because of the lockdown. Thank you for reading this publication in those days; it justifies the effort I made.
Disclaimers — These posts were written during this period of lockdown in Barcelona as a personal distraction and dissemination of scientific knowledge, in case it could be of help to someone, but without the purpose of being an academic reference document in the DRL area. If the reader needs a more rigorous document, the last post in the series offers an extensive list of academic resources and books that the reader can consult. The author is aware that this series of posts may contain some errors and suffers from a revision of the English text to improve it if the purpose were an academic document. But although the author would like to improve the content in quantity and quality, his professional commitments do not leave him free time to do so. However, the author agrees to refine all those errors that readers can report as soon as he can.