Rainbow DQN — The Best Reinforcement Learning Has to Offer?
What happens if the most successful techniques in Deep Q-Learning are combined into a single algorithm?
Reinforcement Learning has come a long way since the era of classical tabular Q-learning. Deep Q-Networks (DQN) drove a revolution in the field, enabling powerful generalization across states.
By replacing a Q-table (storing the values of all state-action pairs) with a powerful neural network (in theory able to map any input state to an output value per action), we could handle problems with massive state spaces, marking a breakthrough in the field.
Working with neural networks also brings a myriad of problems though. In response, a plethora of techniques has been developed to improve the performance of DQN algorithms. Common techniques include replay buffers, target networks and entropy bonuses. Incorporating such techniques often yields considerable performance improvements.
If one technique can so drastically improve performance, why not use all of them?
This is precisely what the DeepMind team (their 2017 paper has 10 authors, so I’ll just refer to ‘the team’) must have wondered. If you combined all these individual techniques into a single algorithm, would you get a Dream Team or Frankenstein’s monster?
Techniques used in Rainbow DQN
We’ll suspend the performance question for a bit, and first dive into the six techniques that are deployed and combined in Rainbow DQN.
I. Double Q-Learning
Q-learning selects the action that maximizes the state-action value. However, this comes with a fundamental flaw. As the Q-values are mere approximations, the algorithm favors actions whose Q-values happen to be inflated. This means that we systematically overestimate Q-values! Naturally, this bias does not help when trying to find the best decision-making policy. Worse, with Q-learning being a bootstrapping method, the overestimation may persist even after many iterations.
Double Q-learning combats this flaw by deploying one network to select an action and another one to update the values. Thus, even if we use an inflated Q-value to select the action, we update with one that (probably) does not. Each time, we randomly choose one network for action selection and the other one for the update. Thus, we decouple the selection mechanism from the evaluation mechanism. The update procedure for network A is given below (obviously, network B deploys the reverse):
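With learning rate α and discount factor γ, the update from the original Double Q-learning paper reads:

Q^A(s,a) \leftarrow Q^A(s,a) + \alpha \Big[ r + \gamma \, Q^B\big(s', \arg\max_{a'} Q^A(s',a')\big) - Q^A(s,a) \Big]

Network A proposes the greedy action at the next state, but network B supplies the value that enters the target.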
Empirically, the Double Q architecture successfully reduces this bias and leads to better decision-making policies.
II. Prioritized Experience Replay
Collecting observations is often costly. Q-learning — being an off-policy method — allows us to store observations in a replay buffer, from which we can subsequently sample experiences to update our policy. Replay also breaks the correlation between states — as consecutive states tend to be strongly correlated, neural networks trained on them directly tend to overfit locally.
However, not every past experience is equally relevant to revisit. To maximize learning, we might like to sample tuples that yielded high (absolute) value function errors in the past, with the idea that we want to evaluate these experiences again and reduce the error.
Simply selecting experiences with the highest errors would introduce sampling bias. To remedy this, a form of weighted importance sampling is used to perform the weight updates:
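In the proportional variant of the paper, a transition i with TD error δ_i receives priority p_i = |δ_i| + ε; it is sampled with probability P(i) and its update is scaled by an importance-sampling weight w_i:

P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta}

Here, α controls how strongly prioritization is applied, β (annealed towards 1) controls how much of the sampling bias is corrected, and the weights are typically normalized by their maximum for stability.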
Prioritized experience replay was shown to consistently outperform uniform replay sampling.
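To make this concrete, here is a minimal sketch of a proportionally prioritized buffer in Python (class and parameter names are illustrative, not taken from any particular library; a production implementation would typically use a sum tree for efficient sampling):

```python
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.eps = eps              # keeps every priority strictly positive
        self.storage = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        max_prio = self.priorities.max() if self.storage else 1.0
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.storage)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.storage) * probs[idx]) ** (-beta)
        weights /= weights.max()
        batch = [self.storage[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority is proportional to the absolute TD error of the last update.
        self.priorities[idx] = np.abs(td_errors) + self.eps
```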
III. Dueling Networks
Deep Q-networks typically return a Q-value Q(s,a), serving as an approximation of the value function corresponding to a given state-action pair (s,a). In the original Bellman equations this makes perfect sense, as computing the optimal value functions yields the optimal policy. In Q-learning, however, entangling state and action values can be problematic, as it may be hard to distinguish whether a value stems primarily from the state, the action, or both.
The Q-value can be decomposed as follows: Q(s,a)=V(s)+A(s,a). Given that we operate under the best known policy (i.e., the policy returns the best action according to its estimates), the expected advantage is 0 and the Q-value equals the state value.
In particular, for some states the action selection simply doesn’t matter that much — e.g., consider a self-driving car in a state without obstacles — and advantage values are negligible. Or, to give another example, a state could be poor regardless of the action we take in it. For such states, we prefer to not waste computational effort by exploring all actions.
A dueling architecture allows us to determine good actions more quickly, as poor actions can be discarded more easily and comparable actions can be identified sooner. By decoupling the advantage value from the state value, we gain a more granular perspective on the action value.
The network has a joint set of layers parameterized by θ, a stream parameterized by α to determine the advantage function, and a stream parameterized by β to determine the value function. Added up, the two streams return the Q-value.
To enforce that the value function and advantage function indeed rely on separately parameterized streams, the mean advantage over all actions is subtracted from the selected action’s advantage:
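With the parameterization above, the aggregation used in the dueling paper is:

Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big( A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta,\alpha) \Big)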
Empirical results show that the proposed architecture quickly identifies good actions, as identifying action values has become an explicit part of the learning problem.
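As an illustration, a dueling head could be sketched in PyTorch as follows (layer sizes and names are illustrative assumptions, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # θ
        self.value_stream = nn.Linear(hidden, 1)               # β: V(s)
        self.advantage_stream = nn.Linear(hidden, n_actions)   # α: A(s, a)

    def forward(self, state):
        features = self.shared(state)
        value = self.value_stream(features)                    # shape [batch, 1]
        advantage = self.advantage_stream(features)            # shape [batch, n_actions]
        # Subtract the mean advantage so the two streams remain identifiable.
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```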
IV. Multi-Step Learning
The impact of an action is not always directly visible. Many individual actions may not even yield an immediate reward. For instance, a bad turn in a racing game might result in losing pole position only some moments later. Typical Q-learning updates — or more generally, TD(0) updates — only consider the reward directly following the action, bootstrapping the downstream value of the decision.
At the other end of the spectrum, Monte Carlo / TD(1) learning first observes all rewards before updating Q-values, but this often results in high variance and slow convergence. Multi-step learning — also known as n-step Q-learning — strikes a balance between bias and variance, observing a number of time steps before performing an update. For instance, if n=3, the update would combine three observed rewards with a bootstrapped value function for the remainder of the time horizon.
In general form, the multi-step reward can be computed as follows:
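With discount factor γ, the truncated n-step return starting at time t is:

R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} \, r_{t+k+1}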
The corresponding update scheme is defined as follows:
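In tabular form, with learning rate α, the n-step Q-learning update bootstraps the value at state s_{t+n} for the remaining horizon:

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \Big[ R_t^{(n)} + \gamma^{n} \max_{a'} Q(s_{t+n},a') - Q(s_t,a_t) \Big]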
For many problem settings, multi-step learning yields a substantial improvement over TD(0) and TD(1) learning, mitigating bias and variance effects and acknowledging delayed impact of decisions. The main challenge is to appropriately tune n for the problem at hand.
V. Distributional Reinforcement Learning
Most reinforcement learning algorithms work with expected return values. Instead, we could attempt to learn the distribution of returns (for which the Q-value would represent the expected value). Although traditionally used mainly to derive insights (e.g., riskiness of returns), it appears this style of learning actually improves performance as well.
The main use case for learning distributions rather than expected values is the application in stochastic environments. Even if rewards are noisy and do not directly aid in a converging expected value, we can still utilize them to get a better grasp on the underlying reward-generating distribution.
Whereas typical Q-values are defined by Q(s,a)=r(s,a)+γQ(s’,a’), the distributional variant is defined as follows:
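Following Bellemare et al., the distributional Bellman equation states that the return distribution equals, in distribution, the immediate reward plus the discounted return distribution of the next state-action pair:

Z(s,a) \overset{D}{=} R(s,a) + \gamma \, Z(S',A')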
Here, Z denotes the distribution of returns rather than the expected (Q-)value. Furthermore, R, S’ and A’ are random variables.
Instead of reducing the TD error, we aim to minimize the sampled Kullback-Leibler divergence with the target distribution. The loss we minimize is the following:
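D_{KL}\Big( \Phi \, \mathcal{T} Z_{\bar{\theta}}(s,a) \;\Big\|\; Z_{\theta}(s,a) \Big)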
Here, 𝒯 is a contraction mapping, Φ a distributional projection, and bar θ a frozen parameter copy to represent the target network. I fear you really need the full paper to make sense out of it.
Perhaps more intuitively: the network outputs a discretized probability mass function, where each output (‘atom’) represents the probability of a small reward interval. With each update, we try to bring the predicted reward distribution closer to the observed one.
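For intuition, here is a tiny sketch of how the familiar Q-value is recovered from such a discretized distribution (the 51 atoms and value range follow the C51 defaults; the network output below is random, purely for illustration):

```python
import torch

n_atoms, v_min, v_max = 51, -10.0, 10.0
atoms = torch.linspace(v_min, v_max, n_atoms)   # support of the return distribution

# Suppose the network outputs logits of shape [batch, n_actions, n_atoms].
logits = torch.randn(1, 4, n_atoms)
probs = torch.softmax(logits, dim=-1)           # one PMF per action

# The expectation of each PMF is the usual Q-value, used for action selection.
q_values = (probs * atoms).sum(dim=-1)          # shape [batch, n_actions]
best_action = q_values.argmax(dim=-1)
```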
Distributional RL yielded good results, especially in problem settings with rare and noisy rewards.
VI. Noisy Exploration Layers
Traditional Q-learning uses epsilon-greedy exploration, randomly selecting an action with probability ϵ (usually 0.05 or 0.10) instead of the action with the maximum expected value. This mechanism helps escape local optima by ensuring some exploration. Entropy bonuses are another common technique to reward actions that deviate from the current policy.
In noisy environments we might need much more exploration though. We can achieve this by adding noisy exploration layers to our neural network. We replace a standard linear layer — defined in canonical form by y=Wx+b, where x is the input, W the weights, b the bias, and y the output — with a layer composed of a deterministic stream and a noisy stream:
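Following the noisy networks paper, with ⊙ the element-wise product and ε noise variables resampled at every forward pass:

y = (W + W_{noisy} \odot \epsilon^{W})\,x + (b + b_{noisy} \odot \epsilon^{b})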
As you’d imagine, the noise generated by the layer impacts the resulting Q-values, so the action with the highest Q-value can vary from pass to pass, driving explorative behavior. Over time, the algorithm may learn to ignore the noisy stream, as b_noisy and W_noisy are learnable parameters. However, the signal-to-noise ratio can vary with the state, encouraging exploration where necessary and converging to near-deterministic actions where appropriate.
Noisy networks appear to be a more robust exploration mechanism than, e.g., epsilon-greedy exploration or entropy bonuses. Like the other techniques mentioned here, they have been empirically shown to improve performance on many benchmark problems.
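To illustrate, a minimal noisy linear layer could be sketched in PyTorch roughly as follows, using the factorised Gaussian noise proposed in the paper (initialization constants and names are illustrative):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Deterministic stream: learnable mean weights and biases.
        self.w_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.b_mu = nn.Parameter(torch.empty(out_features))
        # Noisy stream: learnable noise scales.
        self.w_sigma = nn.Parameter(torch.empty(out_features, in_features))
        self.b_sigma = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.w_mu, -bound, bound)
        nn.init.uniform_(self.b_mu, -bound, bound)
        nn.init.constant_(self.w_sigma, sigma0 / math.sqrt(in_features))
        nn.init.constant_(self.b_sigma, sigma0 / math.sqrt(in_features))

    @staticmethod
    def _f(x):
        # Factorised-noise transform: sign(x) * sqrt(|x|).
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh noise every forward pass drives state-dependent exploration.
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        w = self.w_mu + self.w_sigma * torch.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return F.linear(x, w, b)
```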
So…does it work?
The DeepMind team put their Rainbow algorithm — combining the aforementioned six techniques — to the test against DQN algorithms using a single technique at a time. The figure below represents the average performance over 57 Atari games, with 100% indicating human-level performance.
As seen, Rainbow DQN strongly outperforms all single-component DQN benchmarks. Also interesting is that it reaches good performance much more quickly than the others. It rivals human-level performance after about 15 million frames, a point four of the benchmarks never reach at all. After 44 million of the 200 million frames, Rainbow DQN already performs at a level that consistently beats all competitors, regardless of how long they run. After the full 200 million frames, it is the undisputed winner of the lot.
The DeepMind team also checked whether all components are necessary, by omitting individual techniques from the rainbow. Some techniques contribute more than others, but — simplifying the paper’s conclusions a bit — each technique adds something to the algorithm in terms of performance. See the comparison below.
Overall, the experimental results look absolutely stunning, so why don’t we all use Rainbow DQN all the time?
Drawbacks?
First of all, because it deploys so many elaborate techniques, Rainbow DQN is slow. Every technique requires additional computations, and naturally these add up. The algorithm may learn a lot per frame, but processing a frame takes substantially longer. Thus, a time-based (rather than frame-based) comparison with the other algorithms would be less favorable.
A second drawback of Rainbow DQN is the extensive tuning it requires. The number of steps in the multi-step procedure, the amount of noise in the noisy layers, the update frequency of the target network — everything has to be tuned appropriately. Knowing how quickly a grid search explodes, tuning such a heavily parameterized algorithm is not easy, especially given how slow the algorithm is.
Finally, in part due to the previous points, Rainbow DQN can be unstable and hard to debug. A single coding mistake in one technique can ruin the entire algorithm — good luck finding the bug. Behavior not as expected? It could be anything, even some peculiar interplay between various techniques.
Given these difficulties, Rainbow DQN is not widely available in popular RL libraries, and therefore cannot easily be used off-the-shelf. In a sense, the rainbow is indeed a Dream Team of RL techniques, but winning, it seems, is not everything.
Closing words
As we saw, Rainbow DQN reports very impressive results with respect to its benchmarks. However, it should be noted that the original paper only compares it to other DQN techniques. In particular, the entire class of policy-based methods — including popular algorithms such as TRPO and PPO — has been left out of the comparison.
In a study by OpenAI (not on the same problems though), performance was compared to PPO, the present go-to algorithm in continuous control. With standard implementations, Rainbow DQN handily beat PPO. However, when deploying a joint training mechanism — essentially transfer learning while keeping the algorithms themselves the same — joint PPO narrowly beat joint Rainbow DQN. Given that PPO is much simpler and faster than Rainbow DQN, its higher popularity is not surprising.
Ultimately, performance is highly important, but it is not the only relevant aspect of an RL algorithm. We should be able to work with it, modify it, debug it, and most of all — understand it. Rainbow DQN demonstrated that advances in the field can strengthen each other and the algorithm yielded state-of-the-art results, but ultimately did not convince human users to wholeheartedly embrace it.
References
Rainbow DQN
Rainbow DQN paper: Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., … & Silver, D. (2018, April). Rainbow: Combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence.
Comparison Rainbow DQN and PPO: Nichol, A., Pfau, V., Hesse, C., Klimov, O., & Schulman, J. (2018). Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720.
DQN techniques
I. Double Q-learning: Hasselt, H. (2010). Double Q-learning. Advances in neural information processing systems, 23.
II. Prioritized experience replay: Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
III. Dueling networks: Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., & Freitas, N. (2016, June). Dueling network architectures for deep reinforcement learning. In International conference on machine learning (pp. 1995–2003). PMLR.
IV. Multi-step learning: Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (Section 7.1, n-Step TD Prediction). MIT Press.
V. Distributional learning: Bellemare, M. G., Dabney, W., & Munos, R. (2017, July). A distributional perspective on reinforcement learning. In International Conference on Machine Learning (pp. 449–458). PMLR.
VI. Noisy networks: Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., … & Legg, S. (2017). Noisy networks for exploration. arXiv preprint arXiv:1706.10295.