The Five Building Blocks of Markov Decision Processes

Define and communicate your Reinforcement Learning models by mastering the foundational principles of a Markov Decision Process

The ability to properly formulate a Markov Decision Process (MDP) is imperative for successful Reinforcement Learning (RL) practitioners. A clear mathematical model is needed to (i) properly code it and train the solution algorithm, (ii) communicate with stakeholders in a uniform language, and (iii) explicate abstractions and simplifications.

MDPs serve as a framework to formalize sequential decision-making problems, in which decisions over time are decomposed into a series of subproblems. In canonical form, an MDP is described by a tuple (S,A,R,P,γ). This article expands on each element.

State space (S)

A state s∈S contains all information needed to compute decisions, rewards, and transitions; the state space S is the set containing all states. A core notion of MDPs is that decisions are made solely based on the current state of the system (the memoryless property); past states should not factor into decision-making. A common misunderstanding is that a state cannot incorporate past information; this is in fact allowed (higher-order Markov models, see Salnikov et al.).

In some cases a state is a simple scalar, such as an agent’s position on a one-dimensional line. Typically, the state would be a vector. Consider a robot arm, with the state described by the angles of its three joints s=[x,y,z]. A stock portfolio state may be described with the prevailing price level and amount held per stock s=[p_1,p_2,…,p_n,s_1,s_2,…,s_n]. A large warehouse may have a 10,000 entry state, describing the number of each product held.

There is no need to stop at the boundaries of physical world. Powell distinguishes three state components: (i) physical, (ii) information and (iii) knowledge. The outline follows below.

i. Physical

The term ‘physical’ may be used a bit liberally here; think the location of a truck, money on your bank account, or battery level of a drone. It captures the physical properties of a system or available resources. Unfortunately, many modelers equate a state to physical attributes, as such omitting relevant information.

ii. Information

Aside from properties directly derived from the system, we often use additional information in decision-making. If we have a stock portfolio, the stock prices (market data) would be information. For a wind farm, it might be the wind speed and -direction.

Information can also pertain to the past or future. Past stock prices might reveal relevant trends that cannot be gleaned solely from the present price. If we have a large production order confirmed for next week, that is relevant information when drafting a production schedule this week.

These types of information might seem at conflict with the memoryless property. However: if you knew the Apocalyps was due next Thursday, you probably would not finish that report for Friday, right? We simply include relevant information we know in the state.

The emphasis is on ‘knowing’. From a Monte Carlo simulation or a modeler’s knowledge, we may derive much information not known to the agent. Information should be added from the perspective of the decision-making agent, who does not have clairvoyant skills.

iii. Belief

We might not make decisions based solely on known information, but also on uncertain information or beliefs. We can measure the exact wind speed right now, but for the rest of the week we rely on the weather forecast. A warehouse may utilize a sales forecasting model. Past stock prices may be used to predict future ones. With beliefs, we typically refer to probability distributions that capture certain knowledge.

As a guideline, the state should include exactly all relevant data used to make informed decisions — no less, no more. If certain data does not impact decision-making, rewards or transitions, it should not be incorporated in the state.

A Deep Dive into Problem States

Define appropriate problem states for your Markov Decision Process. Learn about higher-order MDPs and belief state…

wvheeswijk.medium.com

Action space (A)

Integral to MDPs is the ability to exercise some degree of control over the system. The action a∈A — also decision or control in some domains — describes this influence by the agent; the action space A contains all (feasible) actions. As for the state, the action can be a simple scalar (‘exercise option a∈{0,1}’), but also a high-dimensional vector (‘order a_i products of type i’).

With state-independent actions, the same action space A applies for every state. For many problems, actions are state-dependent (action space A(s)): the moves on a chess board depend on the current pieces and their positions, and a water reservoir cannot be drained to negative amounts. Sometimes it is convenient to presume the action space is state-independent (e.g., in Deep Q-learning) and simply mask infeasible actions.

For vector-based decisions that allow all sorts of permutations, enumeration of the action space may become problematic. Additionally, it is often hard to assess whether an action is feasible. Consider decision-making for production lines with limited raw materials, precedence relationships, storage capacity, machine availability, etc. Merely verifying whether a schedule is valid construes a complicated subproblem. In such cases we need to subject the action space to a set of constraints, e.g., by formulating the action problem as a mathematical program. An added bonus is that it vastly increases the sizes of action spaces that may be handled.

Using Linear Programming to Boost Your Reinforcement Learning Algorithms

Large and high-dimensional action spaces are often computational bottlenecks in Reinforcement Learning. Formulating…

towardsdatascience.com

Reward function (R)

MDP models entail optimizing some objective function, typically a (discounted) cumulative reward. The reward function R(s,a) captures the direct rewards, costs or contributions associated with individual decision moments, being dependent on state s and/or action a. Sometimes the reward is evident, e.g., the monetary gains and losses resulting from stock trading. In other cases, more artificial constructs are needed, for instance the reward for winning a game or reaching a target. It is then necessary to align rewards with the objectives of solving the problem.

Non-zero rewards are not necessarily incurred at each epoch, e.g., they might only be received at the end of an episode (winning/losing a game). In RL, we often resort to reward shaping, introducing intermediate rewards to recognize ‘good’ actions and guide the algorithm. Reward shaping should be used with care, as it alters the original problem setting and may introduce unexpected behavior.

Transition function (P)

In an MDP, we take discrete time steps in which we move from one state (s) to the next (s’). This transition function depends on (i) the present state s, (ii) the action a and (iii) random information ω. The transition can be broken down into a deterministic and a stochastic part, in which a describes the deterministic component and ω the stochastic component (although often a mixture, the transition might be fully deterministic or fully stochastic).

In MDP formulations, the transition function is often hidden in the deceptively simple probability P(s’|s,a). Deceptive indeed, as it is often the most difficult component to model. Truthfully, the transition function may range from a trivial deterministic function (e.g., a step in GridWorld) to a highly complex Digital Twin representation of reality, with many uncertainties to model.

Although the transition function is a necessary component of the MDP, in RL we do not necessarily define it explicitly. Instead, we may observe transitions from the environment. If we learn to play a game of Super Mario, we likely don’t know the actual transition functions; we simply try moves and observe what happens. We may even utilize real-world observations (e.g., when controlling wind farm that depends on weather circumstances). In this case, the world itself is our transition function!

What Are Post-Decision States and What Do They Want From Us?

undefined

About Post-Decision States Again

On the (not so) subtle differences between state-action pairs and post-decision states in Reinforcement Learning.

towardsdatascience.com

Discount factor (γ)

The discount factor γ ∈ [0,1] is not always included in the MDP tuple, as it is optional for finite horizon. In short, it indicates to what extent future rewards are factored into current decision-making, with γ=0 completely dismissing future rewards and γ=1 weighing all future rewards equally. There are various rationales for working with discount rates, the chief one is to ensure converging value functions for infinite horizon problems.

Any setting γ<1 suffices to guarantee (theoretical) convergence, but the exact setting is important for optimization purposes. A general rule-of-thumb is that the more uncertain the system is, the higher the discount rate is set, reflecting the limited impact that actions today have on performance in the future.

Why Discount Future Rewards In Reinforcement Learning?

A discussion on discount rates from the perspectives of mathematics, finance, life, and of course Reinforcement…

towardsdatascience.com

Summary

A Markov Decision Process is composed of the following building blocks:

State spaceS — The state contains data needed to make decisions, determine rewards and guide transitions. The state can be divided into physical-, information- and belief attributes, and should contain precisely the attributes needed for the aforementioned purposes.
Action spaceA — The set containing all (feasible) actions. For state-dependent decisions a(s), it may be necessary to subject the action space to a set of constraints, e.g., using mathematical programming.
Reward function R — Denoting the direct reward when taking action a in state s. In many problems, the reward is not self-explanatory, requiring reward shaping to fulfill the objectives of solving the problem. Poor reward function may result in undesired behavior.
Transition function P — The function governing the dynamics of the system over time, guiding the agent from state s to s’. The transition typically involves both a deterministic component (the action a) and a stochastic one (exogenous information ω). Transitions might be observed from an external environment without explicitly modelling them.
Discount factor γ — Defines the degree to which future rewards impact current decisions. When the problem is infinite-horizon and relies on a cumulative reward objective function, a discount rate γ<1 is necessary to ensure convergence. The discount rate is often set in line with the uncertainty of the environment.

Each building block may harbor substantial challenges. The way the MDP is modeled may substantially impact the effectiveness of the solution algorithm. Furthermore, both in business- and academic environments, the ability to define clear MDPs is important to uniformly communicate problem abstractions.

Although MDP formulation might be a bit of a hassle, doing it properly pays off in the long run.

References

Powell, W. B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275(3), 795–821.

Salnikov, V., Schaub, M. & Lambiotte, R. Using higher-order Markov models to reveal flow-based communities in networks. Sci Rep 6, 23194 (2016). https://doi.org/10.1038/srep23194