avatarWouter van Heeswijk, PhD

Summary

The web content provides a comprehensive overview of the five fundamental components of Markov Decision Processes (MDPs), which are essential for formulating and solving sequential decision-making problems in Reinforcement Learning (RL).

Abstract

The article on the website delves into the foundational principles of Markov Decision Processes (MDPs), which are crucial for Reinforcement Learning (RL) practitioners. MDPs are formalized as a tuple ((S,A,R,P,γ)), where (S) represents the state space, (A) the action space, (R) the reward function, (P) the transition function, and (γ) the discount factor. The state space should encapsulate all necessary information for decision-making without superfluous data, including physical attributes, relevant information, and beliefs or probability distributions. The action space comprises all feasible actions, which can be state-dependent or independent, and may require constraints or mathematical programming for complex decision vectors. The reward function quantifies the direct rewards associated with state-action pairs and may need reward shaping to guide the RL algorithm effectively. The transition function governs the system's dynamics, moving the agent from one state to another, and can be fully deterministic, fully stochastic, or a mixture of both. Lastly, the discount factor determines the influence of future rewards on present decisions, with its setting impacting the convergence of value functions in infinite horizon problems. The article emphasizes the importance of a well-defined MDP for both the effectiveness of solution algorithms and clear communication of problem abstractions in various environments.

Opinions

  • The author stresses that a clear mathematical model of an MDP is imperative for proper coding, training of solution algorithms, and communication with stakeholders.
  • A common misunderstanding is addressed, clarifying that states in an MDP can incorporate past information, which is essential for understanding higher-order Markov models.
  • The article highlights the importance of including exactly all relevant data in the state to make informed decisions, avoiding unnecessary complexity.
  • The author points out that many modelers equate the state with physical attributes, potentially omitting relevant information, and emphasizes the need to consider information and knowledge components as well.
  • The use of reward shaping is cautioned, as it can alter the original problem setting and potentially introduce unexpected behavior in the RL algorithm.
  • The transition function is noted to be often the most difficult component to model in an MDP, ranging from simple deterministic functions to complex representations involving many uncertainties.
  • The article suggests that in RL, it is not always necessary to explicitly define the transition function, as transitions can be observed from the environment, which can be particularly useful in real-world applications like controlling a wind farm.
  • The discount factor's role in ensuring the convergence of value functions in infinite horizon problems is underscored, with the recommendation that the discount rate should reflect the uncertainty of the environment.
  • The author concludes that while MDP formulation can be challenging, properly defining the MDP is crucial for the effectiveness of the solution algorithm and for clear communication in both business and academic settings.

The Five Building Blocks of Markov Decision Processes

Define and communicate your Reinforcement Learning models by mastering the foundational principles of a Markov Decision Process

Photo by Mourizal Zativa on Unsplash

The ability to properly formulate a Markov Decision Process (MDP) is imperative for successful Reinforcement Learning (RL) practitioners. A clear mathematical model is needed to (i) properly code it and train the solution algorithm, (ii) communicate with stakeholders in a uniform language, and (iii) explicate abstractions and simplifications.

MDPs serve as a framework to formalize sequential decision-making problems, in which decisions over time are decomposed into a series of subproblems. In canonical form, an MDP is described by a tuple (S,A,R,P,γ). This article expands on each element.

State space (S)

A state s∈S contains all information needed to compute decisions, rewards, and transitions; the state space S is the set containing all states. A core notion of MDPs is that decisions are made solely based on the current state of the system (the memoryless property); past states should not factor into decision-making. A common misunderstanding is that a state cannot incorporate past information; this is in fact allowed (higher-order Markov models, see Salnikov et al.).

In some cases a state is a simple scalar, such as an agent’s position on a one-dimensional line. Typically, the state would be a vector. Consider a robot arm, with the state described by the angles of its three joints s=[x,y,z]. A stock portfolio state may be described with the prevailing price level and amount held per stock s=[p_1,p_2,…,p_n,s_1,s_2,…,s_n]. A large warehouse may have a 10,000 entry state, describing the number of each product held.

There is no need to stop at the boundaries of physical world. Powell distinguishes three state components: (i) physical, (ii) information and (iii) knowledge. The outline follows below.

i. Physical

The term ‘physical’ may be used a bit liberally here; think the location of a truck, money on your bank account, or battery level of a drone. It captures the physical properties of a system or available resources. Unfortunately, many modelers equate a state to physical attributes, as such omitting relevant information.

ii. Information

Aside from properties directly derived from the system, we often use additional information in decision-making. If we have a stock portfolio, the stock prices (market data) would be information. For a wind farm, it might be the wind speed and -direction.

Information can also pertain to the past or future. Past stock prices might reveal relevant trends that cannot be gleaned solely from the present price. If we have a large production order confirmed for next week, that is relevant information when drafting a production schedule this week.

These types of information might seem at conflict with the memoryless property. However: if you knew the Apocalyps was due next Thursday, you probably would not finish that report for Friday, right? We simply include relevant information we know in the state.

The emphasis is on ‘knowing’. From a Monte Carlo simulation or a modeler’s knowledge, we may derive much information not known to the agent. Information should be added from the perspective of the decision-making agent, who does not have clairvoyant skills.

iii. Belief

We might not make decisions based solely on known information, but also on uncertain information or beliefs. We can measure the exact wind speed right now, but for the rest of the week we rely on the weather forecast. A warehouse may utilize a sales forecasting model. Past stock prices may be used to predict future ones. With beliefs, we typically refer to probability distributions that capture certain knowledge.

As a guideline, the state should include exactly all relevant data used to make informed decisions — no less, no more. If certain data does not impact decision-making, rewards or transitions, it should not be incorporated in the state.

Action space (A)

Integral to MDPs is the ability to exercise some degree of control over the system. The action a∈A — also decision or control in some domains — describes this influence by the agent; the action space A contains all (feasible) actions. As for the state, the action can be a simple scalar (‘exercise option a∈{0,1}), but also a high-dimensional vector (‘order a_i products of type i).

With state-independent actions, the same action space A applies for every state. For many problems, actions are state-dependent (action space A(s)): the moves on a chess board depend on the current pieces and their positions, and a water reservoir cannot be drained to negative amounts. Sometimes it is convenient to presume the action space is state-independent (e.g., in Deep Q-learning) and simply mask infeasible actions.

For vector-based decisions that allow all sorts of permutations, enumeration of the action space may become problematic. Additionally, it is often hard to assess whether an action is feasible. Consider decision-making for production lines with limited raw materials, precedence relationships, storage capacity, machine availability, etc. Merely verifying whether a schedule is valid construes a complicated subproblem. In such cases we need to subject the action space to a set of constraints, e.g., by formulating the action problem as a mathematical program. An added bonus is that it vastly increases the sizes of action spaces that may be handled.

Reward function (R)

MDP models entail optimizing some objective function, typically a (discounted) cumulative reward. The reward function R(s,a) captures the direct rewards, costs or contributions associated with individual decision moments, being dependent on state s and/or action a. Sometimes the reward is evident, e.g., the monetary gains and losses resulting from stock trading. In other cases, more artificial constructs are needed, for instance the reward for winning a game or reaching a target. It is then necessary to align rewards with the objectives of solving the problem.

Non-zero rewards are not necessarily incurred at each epoch, e.g., they might only be received at the end of an episode (winning/losing a game). In RL, we often resort to reward shaping, introducing intermediate rewards to recognize ‘good’ actions and guide the algorithm. Reward shaping should be used with care, as it alters the original problem setting and may introduce unexpected behavior.

Transition function (P)

In an MDP, we take discrete time steps in which we move from one state (s) to the next (s’). This transition function depends on (i) the present state s, (ii) the action a and (iii) random information ω. The transition can be broken down into a deterministic and a stochastic part, in which a describes the deterministic component and ω the stochastic component (although often a mixture, the transition might be fully deterministic or fully stochastic).

In MDP formulations, the transition function is often hidden in the deceptively simple probability P(s’|s,a). Deceptive indeed, as it is often the most difficult component to model. Truthfully, the transition function may range from a trivial deterministic function (e.g., a step in GridWorld) to a highly complex Digital Twin representation of reality, with many uncertainties to model.

Although the transition function is a necessary component of the MDP, in RL we do not necessarily define it explicitly. Instead, we may observe transitions from the environment. If we learn to play a game of Super Mario, we likely don’t know the actual transition functions; we simply try moves and observe what happens. We may even utilize real-world observations (e.g., when controlling wind farm that depends on weather circumstances). In this case, the world itself is our transition function!

Discount factor (γ)

The discount factor γ ∈ [0,1] is not always included in the MDP tuple, as it is optional for finite horizon. In short, it indicates to what extent future rewards are factored into current decision-making, with γ=0 completely dismissing future rewards and γ=1 weighing all future rewards equally. There are various rationales for working with discount rates, the chief one is to ensure converging value functions for infinite horizon problems.

Any setting γ<1 suffices to guarantee (theoretical) convergence, but the exact setting is important for optimization purposes. A general rule-of-thumb is that the more uncertain the system is, the higher the discount rate is set, reflecting the limited impact that actions today have on performance in the future.

Summary

A Markov Decision Process is composed of the following building blocks:

  • State spaceS — The state contains data needed to make decisions, determine rewards and guide transitions. The state can be divided into physical-, information- and belief attributes, and should contain precisely the attributes needed for the aforementioned purposes.
  • Action spaceA — The set containing all (feasible) actions. For state-dependent decisions a(s), it may be necessary to subject the action space to a set of constraints, e.g., using mathematical programming.
  • Reward function R — Denoting the direct reward when taking action a in state s. In many problems, the reward is not self-explanatory, requiring reward shaping to fulfill the objectives of solving the problem. Poor reward function may result in undesired behavior.
  • Transition function P — The function governing the dynamics of the system over time, guiding the agent from state s to s’. The transition typically involves both a deterministic component (the action a) and a stochastic one (exogenous information ω). Transitions might be observed from an external environment without explicitly modelling them.
  • Discount factor γ — Defines the degree to which future rewards impact current decisions. When the problem is infinite-horizon and relies on a cumulative reward objective function, a discount rate γ<1 is necessary to ensure convergence. The discount rate is often set in line with the uncertainty of the environment.

Each building block may harbor substantial challenges. The way the MDP is modeled may substantially impact the effectiveness of the solution algorithm. Furthermore, both in business- and academic environments, the ability to define clear MDPs is important to uniformly communicate problem abstractions.

Although MDP formulation might be a bit of a hassle, doing it properly pays off in the long run.

References

Powell, W. B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275(3), 795–821.

Salnikov, V., Schaub, M. & Lambiotte, R. Using higher-order Markov models to reveal flow-based communities in networks. Sci Rep 6, 23194 (2016). https://doi.org/10.1038/srep23194

Markov Decision Process
Reinforcement Learning
Artificial Intelligence
Dynamic Programming
State
Recommended from ReadMedium