Tips and Tricks

Hindsight Experience Replay (HER) Implementation

An Explanation of the Algorithm and Code

I recently implemented the HER algorithm for my research reinforcement learning library: Pearl. In doing so, I found that whilst there are articles discussing the theory of the algorithm, there isn’t anything discussing how to implement it in code. So be it, that’s where I come in! In this article, I’ll be discussing how to implement the HER buffer in Python 😊

A Quick Algorithm Recap

As a reminder, here’s the algorithm from the original paper:

Source: https://arxiv.org/abs/1707.01495

As a summary, in a binary and sparse reward environment (an environment where the only non-negative reward is achieved when the agent wins), it can be difficult for an agent to appropriately explore the environment and learn the sequence of steps required to achieve the desired goal if it never receives any variability in reward. The HER buffer attempts to overcome this by copying each trajectory experienced and replacing the actual rewards with rewards calculated assuming the goals are the steps achieved at the end of the trajectories. The idea is this will help the agent explore better and also learn intermediate goals that build up to the actual desired goal. In this article, I will refer to the new sampled goals as HER goals, whilst the resultant trajectories will be the HER trajectories which contain the calculated HER rewards.

Implementation and Code

The implementation will diverge slightly from the theoretical algorithm in the paper. The key differences are below:

Instead of sampling goals and calculating HER rewards immediately after each trajectory is added, we will wait until sampling time when we can do a vectorized calculation all at once which is much faster in Python ⚡
We will also use a trick to use only one buffer to store observations and next observations to cut down the memory requirements for observation storage by almost half! This can be done assuming the observations stored are sequential; that is, the observation at index i+1 is the next observation from the one at index i 💾

Initialization

Counter-intuitively, let’s start with point 2 since this is defined in the class initialization below, notice how there is no self.next_observations defined.

A key thing to note here is that this buffer only accepts GoalEnv environments!! This is what allows us to easily keep track of episode goals. See the original description from OpenAI below:

At the end of a trajectory, the next observation will be the end state and the desired goal should be constant throughout the episode. Therefore, we only really need to keep track of the observation desired_goal and the next observation achieved_goal in order to calculate HER rewards. Also of note are the self.episode_end_indices and self.index_episode_map buffers:

self.episode_end_indices: keeps track of episode end indices in the trajectory buffers.
self.index_episode_map: keeps track of which transitions belong to which episodes.

Finally, the her_ratio variable indicates the fraction of trajectories to sample with the new HER rewards vs the standard replay buffer trajectories.

Adding trajectories

As mentioned earlier, we don’t sample the HER goals and rewards when adding trajectories, so this method isn’t so complicated.

Note that adding the next_observation["observation"] is a little more complicated algorithmically, since we store it in self.observations just like observation["observation"].

Sampling

This is where the real processing happens and we batch calculate the HER goals and resultant HER rewards. Let’s start with the high-level method defining which indices from the trajectory buffers we want to return:

The key thing to note here is that we want to sample indices from complete episodes, which is why we need to define end_idx as the index of the last complete episode in the trajectory buffers.

Next, let’s look at the method getting the trajectories to return. This gets both the standard replay and HER trajectories in the ratio defined by self.her_ratio:

Fortunately, the GoalEnvhas a method compute_reward(achieved_goal, desired_goal, info) which means we don’t have to worry about calculating the HER rewards ourselves. In this method, we essentially separate the indices we need to calculate HER rewards and goals for before recombining everything at the end to return. All observations are also concatenated with the desired goals as specified in the algorithm. Note again the slightly more complex sampling for the next_observations since we get these from the same self.observations buffer.

Finally, there is the issue of getting the HER goals themselves to calculate the HER rewards. This is defined below:

Remember at the algorithm recap when I said that the HER goals are assumed to be the trajectory end observations? Well that was only partly true, there are actually many different ways of sampling HER goals, I’ve done two of them above. The simplest way is to get the final observation, but the paper shows the best results actually come when you sample HER goals as a random state in the same episode but after the current transition. In this strategy, there is a failure mode. If the episode sampled ‘overlaps’ the buffer; that is, the episode starts at the end of the buffer and ends at the beginning as the buffer overflows, it is possible to pick a trajectory in this episode at the end of the buffer. Therefore, when calling np.random.randint() the low value will be the index at the end of the buffer whilst the high value will be the episode end index at the beginning, leading to an error. Since this is rare if the buffer size is relatively large (which it normally should be, in the order of 1e6), as a quick fix we can resort to the simpler HER goal strategy of picking the episode end index when this occurs.

Phew! Made it to the end 🎉🎉

If you found this article useful, do consider:

giving me a follow 🙌
subscribing to my email notifications to never miss an upload 📧
using my Medium referral link to directly support me and get access to unlimited premium articles 🤗

Promotions out the way, let me know your thoughts on this topic and happy learning!!