When to use Reinforcement Learning (and when not to)

Summary

Reinforcement Learning (RL) is a powerful tool with remarkable achievements, but it is important to apply it judiciously by considering whether the problem allows for trial-and-error learning, has well-defined state variables, and can be guided by a concrete reward function.

Abstract

Reinforcement Learning (RL) has demonstrated superior performance in various domains, including video games and Go, contributing to its current hype. However, it is crucial to recognize that RL is not a universal solution and should be applied selectively. To determine RL's suitability, three key questions must be addressed: whether the problem context can tolerate mistakes during learning, whether the state of the environment can be accurately captured and accessed by the agent, and whether a clear and computable reward function can be defined. RL's sample inefficiency and the complexity of state representation and reward function design are significant challenges that require careful consideration and may benefit from advanced research techniques aimed at improving RL's robustness and efficiency.

Opinions

The author acknowledges the potential and current limitations of RL, emphasizing the importance of critical evaluation before applying RL to a problem.
Sample inefficiency is a notable drawback of RL, particularly deep RL, which necessitates a considerable amount of trial and error to learn effective policies.
The choice of state variables is critical and problem-dependent; it requires a balance between providing enough information for good decision-making and avoiding an overwhelming number of variables that could lead to inefficient learning.
The design of the reward function is paramount, as it directly influences the policy that the RL agent learns; poorly designed reward functions can lead to unintended and undesirable behaviors.
The author suggests that while automatic reward modeling is an active research area, its practical application remains limited, implying that manual reward function design is still necessary.
The author provides a pragmatic approach to applying RL, recommending a quick prototype after evaluating the problem against the three questions, and offers a list of RL environments for those in need of a testing ground for their RL algorithms.

When to use Reinforcement Learning (and when not to)

RL has achieved better-than-human performance in many video games and has also beaten world-champion Go players. It is a general framework that can solve very different tasks without any prior knowledge, and even achieve stellar performance at them. This is why there is so much hype around RL nowadays, and it is certainly a very important framework that still has lots of potential.

However, RL cannot solve every problem, at least not yet.

This is important to keep in mind, especially for RL enthusiasts like myself. To limit myself to applying RL only when it makes sense, I have written down three questions to ask myself before using it. If I can answer all of them satisfactorily, I can go ahead and apply RL to the task at hand; otherwise, I should look elsewhere for a way to deal with it.

Hoping these questions might be useful for anyone else trying to decide if RL is the right way to go, I’ve listed them here:

Can I afford to make mistakes?

Possible effects of using RL when you shouldn’t [screenshot from this video].

RL can be sample inefficient, especially deep RL. This means that it will take a long time for the RL agent to learn which actions are good and which ones are bad (i.e., which actions lead to a positive reward and which to a negative one), and it will make many mistakes along the way. This is why you shouldn’t use RL to control an airplane with passengers on board :)) However, if you can afford to make mistakes, as in a flight simulator, RL is a good option. Making RL more robust (more efficient and less prone to errors) is currently a highly researched topic, and many different approaches exist to reduce errors in RL: maximizing sample efficiency, learning from demonstrations, using external knowledge to guide and constrain the agent (this is a paper I authored ;) ) and more.
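To make the cost of trial and error concrete, here is a minimal sketch of an epsilon-greedy agent on a two-armed bandit that counts how often it picks the worse arm while learning. The environment (`win_prob`), the function name `run_bandit` and all parameter values are assumptions made up for illustration; they are not taken from any particular benchmark.

```python
import random

def run_bandit(steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a 2-armed bandit; returns (mistakes, estimates)."""
    rng = random.Random(seed)
    win_prob = [0.3, 0.7]      # hypothetical environment: arm 1 is the better one
    best_arm = 1
    estimates = [0.0, 0.0]     # running estimate of each arm's average reward
    counts = [0, 0]
    mistakes = 0
    for t in range(steps):
        if t < len(win_prob):
            arm = t            # try every arm once before trusting the estimates
        elif rng.random() < epsilon:
            arm = rng.randrange(len(win_prob))                            # explore
        else:
            arm = max(range(len(win_prob)), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < win_prob[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        if arm != best_arm:
            mistakes += 1      # every pull of the worse arm is a "mistake"
    return mistakes, estimates

mistakes, estimates = run_bandit()
print("suboptimal actions taken while learning:", mistakes)
```

Even in this toy problem the agent has to take the worse action at least once to find out that it is worse. In a simulator those pulls are free; in a real system each one has a cost.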

What variables describe the state of the environment?

At each time step, the agent observes its environment and based on its state, it will decide what is the best action to perform at that time step. However, the definition of state in RL is not as clear as one might think at first because there are several things to take into account:

What are the state variables and how can we quantify them? As an example, let’s say we are applying RL to stock trading. We can decide that the state will comprise the closing price of the stock for the last 10 days, the volume traded on those days and the difference between the minimum and maximum price of the stock on those days. However, we could also decide to use the percentage change in the price of that stock over the last 10 days, the volume traded and the difference between the minimum and maximum as a percentage of the closing price. Which option is better? This depends on the problem we want to solve: the second approach seems better because it generalizes more easily, but the first one might perform better for a given stock, since it can identify specific minimum and maximum price levels that influence how the price moves. And there are still many other options to take into account: we could use a binary variable where 0 represents a decline in the price that day and 1 represents a price increase, we could observe the RSI or the moving average price, we could use the sentiment of news about that stock, or even the price of other stocks or commodities, etc.
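The two state definitions discussed above can be sketched as follows. The function names and the toy numbers are hypothetical, and the example uses 3 days instead of 10 to stay short. Note that N closing prices only yield N - 1 percentage changes, so a real implementation would need one extra day of history.

```python
def raw_state(closes, volumes, highs, lows):
    """Option 1: absolute prices, volumes and high-low ranges per day."""
    ranges = [h - l for h, l in zip(highs, lows)]
    return list(closes) + list(volumes) + ranges

def relative_state(closes, volumes, highs, lows):
    """Option 2: percentage changes and ranges relative to the closing price."""
    pct_change = [(c1 - c0) / c0 * 100 for c0, c1 in zip(closes, closes[1:])]
    pct_range = [(h - l) / c * 100 for h, l, c in zip(highs, lows, closes)]
    return pct_change + list(volumes) + pct_range

# Toy data (made up): the second stock has exactly twice the prices of the first.
volumes = [1000, 1500, 1200]
stock_a = dict(closes=[100.0, 110.0, 99.0], highs=[105.0, 112.0, 111.0], lows=[95.0, 100.0, 98.0])
stock_b = {name: [2 * p for p in prices] for name, prices in stock_a.items()}

print(raw_state(volumes=volumes, **stock_a))
print(raw_state(volumes=volumes, **stock_b))       # differs from stock_a
print(relative_state(volumes=volumes, **stock_a))
print(relative_state(volumes=volumes, **stock_b))  # matches stock_a
```

The relative version produces the same state for both stocks, which is what makes it more generalizable; the raw version keeps the absolute price levels that the relative one throws away.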

Does the agent have access to these variables at each time step? This is a very important point to consider, and it is often taken for granted. However, not having access to a variable on a given day can mean that our RL agent performs badly and even learns the wrong behavior over time. In the previous case, imagine the RL agent is trading a little-known stock and one of the state variables is the news sentiment over the last 10 days. Since this stock is not well known, 10 days could pass without any news about it showing up. What is the sentiment on this stock then?
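One possible way to handle the missing-news case, sketched here as an assumption rather than a recommendation, is to give the agent an explicit availability flag next to the sentiment value, so that "no news" is not confused with "neutral news". The function name `sentiment_features` and the convention that `None` means "no news that day" are hypothetical.

```python
def sentiment_features(daily_sentiment):
    """Map per-day sentiment scores (None = no news that day) to
    (mean_sentiment, has_news), so 'unknown' stays distinguishable from 'neutral'."""
    observed = [s for s in daily_sentiment if s is not None]
    if not observed:
        return 0.0, 0.0   # placeholder value plus an explicit "no data" flag
    return sum(observed) / len(observed), 1.0

# Ten days without any news: value 0.0, but flagged as missing.
print(sentiment_features([None] * 10))
# Mixed news that averages out to neutral: value 0.0, flagged as observed.
print(sentiment_features([0.5, None, -0.5, None]))
```

Both calls return a sentiment of 0.0, and only the flag tells the two situations apart. Whether the agent actually learns to use the flag is a separate question that would have to be checked empirically.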

On the one hand, the more information is available to the agent, the better the decisions it will be able to make, because it will have a more precise idea of what is happening. This is the same for us humans: the more information we have, the better the decisions we can make. E.g., we could use all the variables listed in the first point for our RL trader, so that it has access to everything that’s happening.

On the other hand, the more information is available to the agent, the longer it will take for it to map states to actions, because the state space grows exponentially with the number of variables and values we take into account. We can also relate to this: when we have access to too much information, we feel overwhelmed because it is hard to process it all. Continuing with the previous example, the agent would have access to many variables that do not actually matter for trading a certain stock, and it could be misled into treating correlation as causation (even humans fall into this trap sometimes!).
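The growth described above can be made concrete by counting the distinct states of a discretized observation: with n variables that each take k values, there are k^n possible states. The sketch below assumes the simplest case from the trading example, one binary up/down flag per day; the function name is made up for illustration.

```python
def num_discrete_states(n_variables, values_per_variable):
    """Count distinct states when every variable takes the same number of values."""
    return values_per_variable ** n_variables

# Assumed setup: one binary up/down flag per day of history.
for n_days in (1, 3, 10, 30):
    print(n_days, num_discrete_states(n_days, 2))
# prints:
# 1 2
# 3 8
# 10 1024
# 30 1073741824
```

Going from 10 to 30 days of a single binary variable already moves the table from about a thousand entries to about a billion, and every additional variable multiplies that number again.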

Can I define a concrete reward function and compute that reward after taking an action?

An important part of RL is the definition of the reward. The reward determines the behavior of the RL agent, i.e., its policy, so its design is paramount for RL to work as we desire. In the video below, you can see how an agent finds a way to achieve a higher score than it would by following normal gameplay, despite catching fire, crashing into other boats, and going the wrong way on the track [you can see the full post on OpenAI’s website].

Automatic reward modelling/shaping is a very popular research topic. As an example, in Florensa et al. (2018), a generator neural network is used to propose tasks for the agent, where each task is a configuration of state variables that the agent must reach with its actions. Many other methods to define reward functions automatically have been developed, but their use in practice is very limited.
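For the stock trading example used earlier, a hand-written reward that can be computed right after each action might look like the sketch below. This is only one possible choice, made up for illustration: the function name `step_reward`, the fee model and the default `fee_rate` are all assumptions.

```python
def step_reward(value_before, value_after, traded_amount, fee_rate=0.001):
    """Reward for one step: change in portfolio value minus transaction fees.

    value_before / value_after: portfolio value before and after the action.
    traded_amount: cash value bought (positive) or sold (negative) in this step.
    fee_rate: assumed proportional transaction cost.
    """
    return (value_after - value_before) - fee_rate * abs(traded_amount)

# The portfolio gained 10 after trading 500 worth of stock.
print(step_reward(1000.0, 1010.0, 500.0))
# Holding still while the portfolio loses 5.
print(step_reward(1000.0, 995.0, 0.0))
```

Every input is known as soon as the step is over, so the reward is both concrete and computable. The fee term is a design choice: without it, nothing in the reward would discourage the agent from trading at every step.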

Have you answered these questions?

If so, then your problem might be a good fit for RL. At this point, I would just make a quick prototype and apply it to see how it works. If you need an environment for applying your RL algo, then feel free to check this list of RL environments I’ve made.

In case you think I am missing a question or some consideration when answering one of them, please let me know with a response!