Free AI web copilot to create summaries, insights and extended knowledge, download it at here

Abstract

an be thought of as the average reward and is used if the reward given for a particular action can vary.</p><figure id="9e89"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*mKlw4K6LTXEfU68j.png"><figcaption></figcaption></figure><p id="fc44">At time ‘<i>t -1</i>’, starting in a particular state <b><i>s</i></b> and taking action <b><i>a</i></b>, the expected reward that’s received at the next time step is a function of the current state and action.</p><h2 id="f802">Return ‘Gₜ’</h2><p id="8559">The total amount of reward accumulated over an episode, starting at time <b><i>t</i></b> is equal to the sum of future rewards:</p><figure id="59f9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*iXE2EWNr_vy6Jme1.png"><figcaption></figcaption></figure><h1 id="2751">Policy</h1><p id="7bcf">The strategy that is used by the agent to choose an action in a state. In equations the policy is typically represented by ‘<b><i>π</i></b><i>’.</i></p><h2 id="0aea">Greedy Policy</h2><p id="4894">Selects the action with the highest immediate reward.</p><p id="ec09">Reinforcement Learning can be split into two distinct parts, the <b><i>Prediction Problem</i></b> and the <b><i>Control Problem</i></b>:</p><h1 id="7dd3">Prediction Problem</h1><p id="1647">Evaluate the performance of an agent.</p><h2 id="6359">State Value</h2><p id="10ec">A measure of how good it is to be in that state. Given by the expected reward that can be obtained from starting in that state and then following the current policy for all future states.</p><p id="0fd6">The value for state <b><i>s</i></b> under policy <b><i>π</i></b> is the expected return:</p><figure id="60d2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*ffZ_JY_ACzsXxchB.png"><figcaption></figcaption></figure><p id="7083">For a <b><i>deterministic policy</i></b>, where a single action is always selected in each state and when you’re guaranteed of getting the same reward and ending up in the same next state, the value of a state is simply the immediate reward plus the value of the next state:</p><figure id="6180"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*Gl-i8lUcedYxyFg3.png"><figcaption></figcaption></figure><p id="67e7">For a <b><i>stochastic policy, </i></b>where multiple actions are possible, <b><i>π(a|s)</i></b><i> </i>represents the probability of taking action<i> <b>a</b> </i>from state<i> <b>s</b> </i>under policy <b><i>π</i></b><i>.</i></p><p id="5603">In this case the state value is given by the sum of each action’s reward multiplied by the probability of taking that action:</p><figure id="9e10"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*WPpQd4m9r8sUb64n.png"><figcaption></figcaption></figure><h1 id="5f4c">Dynamic Programming</h1><p id="f894">Splits the problem into simpler sub-problems. In this case the state value is calculated by splitting the problem into 2 parts: the immediate reward and the value of the next state.</p><p id="2088">State values are calculated using the values of other states. As a result you only need to look one step ahead and don’t need to know all of the rewards accumulated during the epi

Options

sode.</p><h1 id="86fa">Iterative Policy Evaluation</h1><p id="faaa">By repeatedly applying the above equation for the state value the converged state value can be calculated.</p><p id="1b77">The algorithm for this is:</p><ul><li>Start with all state values set to zero.</li><li>Do a sweep of all states to calculate the state values (using the above equation).</li><li>Repeat until convergence.</li></ul><figure id="3412"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*biHB1PGvzINp6sRC2rRTmw.gif"><figcaption></figcaption></figure><h1 id="8b1a">Control Problem</h1><p id="d719">Modify the policy to improve performance.</p><h2 id="fbd9">Policy Improvement</h2><p id="1c7c">After the state values have been calculated act greedily with respect to these values to form a new policy (i.e. choose the action which will take you to the next state that has the greatest state value).</p><figure id="14e8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5gXPiWgoKjikPlpK3t8Qkg.png"><figcaption></figcaption></figure><h2 id="7cc0">Discounted Rewards</h2><p id="4410">Progressively reduce the contribution of rewards from future time steps by using a discount factor <b><i>γ</i></b> (gamma), where 0 ≤ <i>γ</i> ≤ 1.</p><figure id="0230"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*LP9mF8vf2lNWM6KG.png"><figcaption></figcaption></figure><p id="0125">This allows the calculation of state values for deterministic policies that may not terminate. A value of 0.9 is commonly used as the discount factor.</p><p id="3774">The value of state <b><i>s</i></b> under policy <b><i>π</i></b> with discounted future rewards is given by:</p><figure id="9f48"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*SizUjGI5IEcjPQvW.png"><figcaption></figcaption></figure><h1 id="8d94">Summary</h1><p id="7e49">We’ve now covered all of the basics of Reinforcement Learning (<i>RL</i>) and all in record time!</p><p id="ff6b">We concluded our introduction with an equation that’s a cut-down version of the <b><i>Bellman Equation</i></b>, the key equation in Reinforcement Learning. In <a href="https://readmedium.com/markov-decision-processes-and-bellman-equations-45234cce9d25"><b><i>Part 2</i></b></a> of this series we’ll look at the full Bellman Equation and examine <b><i>Markov Decision Processes</i></b> which are another core concept of <i>RL</i>.</p><blockquote id="a294"><p>For a more detailed explanation of all of the topics covered in this article check out the original post:</p></blockquote><div id="29e9" class="link-block"> <a href="https://towardsdatascience.com/state-values-and-policy-evaluation-ceefdd8c2369"> <div> <div> <h2>State Values and Policy Evaluation</h2> <div><h3>An Introduction to Reinforcement Learning: Part 1</h3></div> <div><p>towardsdatascience.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*mTjWkv3vwEuXTBUnf28U7g.gif)"></div> </div> </div> </a> </div></article></body>

A Baby Robot’s Guide To Reinforcement Learning

State Values and Policy Evaluation in 5 minutes

An Introduction to Reinforcement Learning

This is a summary of the article State Values and Policy Evaluation. It distils all of the key terms and theory from that article down into a single cheat-sheet that can be read in 5 minutes or less. With that in mind, we’d better get started…

You can also open this article as a Jupyter Notebook on Binder which allows you to run the associated code samples using the Baby Robot Custom Gym Environment.

Reinforcement Learning

Reinforcement Learning can be considered to be a problem that takes place in an environment that consists of multiple, independent, states.

A simple example of this would be a grid world, where each square in the grid represents a state:

State

A unique, self-contained, stage in the environment that defines the current situation. Each state is independent of previous states, which means you don’t need to know or remember what has happened before.

Episode

One complete execution of the environment.

Terminal State

The final state in an environment where the episode ends.

Agent

The agent is the thing that interacts with the environment and that makes decisions on how to move through the environment. In our case Baby Robot is the agent.

To move from one state s to the next state s’ (read as s-prime) the agent takes an action a. As a result of taking this action it receives a reward r.

Reward

The numerical value used to measure performance. It can be expressed as a penalty by making it negative. The more negative the worse the performance.

Expected reward for a state-action pair

The expected reward can be thought of as the average reward and is used if the reward given for a particular action can vary.

At time ‘t -1’, starting in a particular state s and taking action a, the expected reward that’s received at the next time step is a function of the current state and action.

Return ‘Gₜ’

The total amount of reward accumulated over an episode, starting at time t is equal to the sum of future rewards:

Policy

The strategy that is used by the agent to choose an action in a state. In equations the policy is typically represented by ‘π’.

Greedy Policy

Selects the action with the highest immediate reward.

Reinforcement Learning can be split into two distinct parts, the Prediction Problem and the Control Problem:

Prediction Problem

Evaluate the performance of an agent.

State Value

A measure of how good it is to be in that state. Given by the expected reward that can be obtained from starting in that state and then following the current policy for all future states.

The value for state s under policy π is the expected return:

For a deterministic policy, where a single action is always selected in each state and when you’re guaranteed of getting the same reward and ending up in the same next state, the value of a state is simply the immediate reward plus the value of the next state:

For a stochastic policy, where multiple actions are possible, π(a|s) represents the probability of taking action a from state s under policy π.

In this case the state value is given by the sum of each action’s reward multiplied by the probability of taking that action:

Dynamic Programming

Splits the problem into simpler sub-problems. In this case the state value is calculated by splitting the problem into 2 parts: the immediate reward and the value of the next state.

State values are calculated using the values of other states. As a result you only need to look one step ahead and don’t need to know all of the rewards accumulated during the episode.

Iterative Policy Evaluation

By repeatedly applying the above equation for the state value the converged state value can be calculated.

The algorithm for this is:

Start with all state values set to zero.
Do a sweep of all states to calculate the state values (using the above equation).
Repeat until convergence.

Control Problem

Modify the policy to improve performance.

Policy Improvement

After the state values have been calculated act greedily with respect to these values to form a new policy (i.e. choose the action which will take you to the next state that has the greatest state value).

Discounted Rewards

Progressively reduce the contribution of rewards from future time steps by using a discount factor γ (gamma), where 0 ≤ γ ≤ 1.

This allows the calculation of state values for deterministic policies that may not terminate. A value of 0.9 is commonly used as the discount factor.

The value of state s under policy π with discounted future rewards is given by:

Summary

We’ve now covered all of the basics of Reinforcement Learning (RL) and all in record time!

We concluded our introduction with an equation that’s a cut-down version of the Bellman Equation, the key equation in Reinforcement Learning. In Part 2 of this series we’ll look at the full Bellman Equation and examine Markov Decision Processes which are another core concept of RL.

For a more detailed explanation of all of the topics covered in this article check out the original post:

State Values and Policy Evaluation

An Introduction to Reinforcement Learning: Part 1

towardsdatascience.com