When to use Reinforcement Learning (and when not to)
RL has achieved superhuman performance in many video games and has also beaten the best Go player in the world. It is a general framework that can solve very different tasks with little or no prior knowledge, and even reach stellar performance at them. This is why there is so much hype around RL nowadays, and it is certainly a very important framework that still has lots of potential.
However, RL cannot solve every problem, at least not yet.
This is important to keep in mind, especially for RL enthusiasts like myself. To make sure I only apply RL when it makes sense, I have written down three questions to ask myself beforehand. If I can answer all of them satisfactorily, I can go ahead and apply RL to the task at hand; otherwise, I should look elsewhere for a way to deal with it.
Hoping these questions might be useful for anyone else trying to decide if RL is the right way to go, I’ve listed them here:
Can I afford to make mistakes?

RL can be sample inefficient, especially deep RL. This means that the RL agent needs a lot of interaction with the environment to learn which actions are good and which ones are bad (i.e., which actions lead to positive rewards and which to negative ones), and it will make many mistakes along the way. This is why you shouldn’t use RL to control an airplane with passengers on board :)) However, if you can afford to make mistakes, as in a flight simulator, RL is a good option. Making RL more robust (more sample efficient and less prone to errors) is currently a highly researched topic, and many different approaches exist to reduce errors in RL: maximizing sample efficiency, learning from demonstrations, using external knowledge to guide and constrain the agent (this is a paper I authored ;) ) and more.
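To get a feeling for what "making mistakes on the way" means, here is a minimal sketch of an epsilon-greedy agent on a toy three-armed bandit. Everything in it is an assumption made for illustration: the function name `run_bandit`, the reward probabilities, the number of steps and the value of epsilon. It simply counts how often the agent picks an action other than the best one while it is learning.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli bandit (toy example).

    Returns the number of suboptimal actions taken and the final
    reward estimate for each arm.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    # The agent does not know this; we only use it to count mistakes.
    best_arm = max(range(n_arms), key=lambda a: reward_probs[a])

    counts = [0] * n_arms       # how many times each arm was pulled
    estimates = [0.0] * n_arms  # running average reward per arm
    mistakes = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit

        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0

        # Incremental update of the average reward for this arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

        if arm != best_arm:
            mistakes += 1

    return mistakes, estimates


# Made-up reward probabilities: the third arm is the best one.
mistakes, estimates = run_bandit([0.2, 0.5, 0.8])
print(f"suboptimal actions while learning: {mistakes} of 2000")
```

The exact count depends on the seed and on epsilon, but it is never zero: the agent has to try bad actions to find out that they are bad. In a simulator each of those is just a number; in the real world each one could be costly.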
What variables describe the state of the environment?
At each time step, the agent observes its environment and, based on the state it observes, decides which action is best to perform at that time step. However, the definition of state in RL is not as clear as one might think at first, because there are several things to take into account:
- What are the state variables and how can we quantify them? As an example, let’s say we are applying RL to stock trading. We can decide that the state will comprise the close price of the stock for the last 10 days, the volume traded on those days and the difference between the minimum and maximum price of the stock on those days. However, we could also decide to use the percentage change in the price of that stock over the last 10 days, the volume traded and the difference between the minimum and maximum as a percentage of the close price. Which option is better? This depends on the problem we want to solve: the second approach seems better because it generalizes more easily, but the first one might perform better for a given stock, since it can pick up on specific minimum and maximum price levels that can influence the growth of the price. And there are still many other options to take into account: we could use a binary variable where 0 represents a decline in the price that day and 1 represents a price increase, we could observe the RSI or the moving average price, we could use the sentiment of news about that stock, or even the price of other stocks or commodities, etc.
- Does the agent have access to these variables at each time step? This is a very important point to consider, and it is often taken for granted. However, not having access to a variable on a given day can mean that our RL agent will perform badly and even learn the wrong behavior over time. In the previous case, imagine the RL agent is trading a little-known stock and one of the state variables is the news sentiment over the last 10 days. Since this stock is not well known, it could happen that 10 days pass and no news about it shows up. What is the sentiment on this stock then?
- On one hand, the more information available to the agent, the better the decisions it will be able to make, because it will have a more precise idea of what is happening. This is the same for us humans: the more information we have, the better the decisions we can make. E.g., we could use all the variables listed in the first point for our RL trader, so that it has access to everything that’s happening.
- On the other hand, the more information available to the agent, the longer it will take for it to learn how to map states to actions, because the state space grows exponentially with the number of variables and the number of values each of them can take. We can also relate to this: when we have access to too much information, we feel overwhelmed because it is hard to process all of it in our brain. Continuing with the previous example, the agent would have access to many variables that do not actually matter for trading a certain stock, and it could get confused by thinking correlation implies causation (even humans fall into this trap sometimes!).
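
The points above can be made a bit more concrete with a small sketch in plain Python. It is only an illustration under my own assumptions: the function names (`absolute_state`, `relative_state`, `sentiment_features`, `table_size`) are hypothetical, the prices are made up, and representing a missing sentiment as a value plus an "is available" flag is just one possible choice, not the only or necessarily the best one.

```python
from typing import List, Optional


def absolute_state(closes, volumes, highs, lows) -> List[float]:
    """First option: raw close prices, volumes and high-low ranges."""
    ranges = [h - l for h, l in zip(highs, lows)]
    return list(closes) + list(volumes) + ranges


def relative_state(closes, volumes, highs, lows) -> List[float]:
    """Second option: % price changes and ranges as % of the close.

    Note: n closes give n - 1 day-over-day changes, so 10 changes
    would need 11 days of prices.
    """
    pct_changes = [
        100.0 * (closes[i] - closes[i - 1]) / closes[i - 1]
        for i in range(1, len(closes))
    ]
    pct_ranges = [
        100.0 * (h - l) / c for h, l, c in zip(highs, lows, closes)
    ]
    return pct_changes + list(volumes) + pct_ranges


def sentiment_features(sentiment: Optional[float]) -> List[float]:
    """Encode a possibly missing news sentiment as [value, is_available].

    Assumption: when there is no news, we tell the agent so explicitly
    instead of silently pretending the sentiment is neutral.
    """
    if sentiment is None:
        return [0.0, 0.0]
    return [sentiment, 1.0]


def table_size(n_variables: int, n_bins: int) -> int:
    """Number of distinct states if every variable is cut into n_bins."""
    return n_bins ** n_variables


# Made-up data for three days (a real state would use ten).
closes = [100.0, 102.0, 101.0]
volumes = [1500.0, 1800.0, 1200.0]
highs = [101.0, 103.0, 102.5]
lows = [99.0, 100.5, 100.0]

print(absolute_state(closes, volumes, highs, lows))
print(relative_state(closes, volumes, highs, lows))
print(sentiment_features(None))  # no news in the window
print(table_size(3, 10))   # 3 variables, 10 bins each
print(table_size(30, 10))  # 3 variables x 10 days, 10 bins each
```

The last two lines show why adding variables is not free: with 10 bins per variable, going from 3 variables to 30 (three variables for each of the 10 days) takes a tabular agent from 1,000 possible states to 10^30. Function approximation helps, but the agent still needs far more experience to cover a larger state space.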




