avatarJonathan Hui

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

6102

Abstract

esize:fit:800/1*QaqDHtuawUwxhEXhZMCFqg.png"><figcaption><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf">Source</a></figcaption></figure><p id="5157"><b>Q-learning with replay buffer and the target network</b></p><p id="9af3">Add a replay buffer and a target network for training stability.</p><figure id="a522"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*954JY0SBz3rb6eLn5ktXMQ.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">Source</a></figcaption></figure><p id="9bea"><b>DQN (Deep Q-learning)</b></p><p id="f92b">DQN is a Q-learning method with a deep network to estimate <b><i>Q</i></b> using a replay buffer and a target network.</p><figure id="5a2e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5E_j9RPD3kNKeVVKKjkEjw.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">Source</a></figcaption></figure><h1 id="b3ab">Policy Gradients</h1><p id="8b98">Maximize the rewards by taking actions with high rewards more likely.</p><figure id="1813"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5wMXST9bl2ETg-tg2CscDw.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">Source</a></figcaption></figure><p id="0972">Example: Policy Gradient using Monte Carlo rollouts and a baseline:</p><figure id="9693"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*urlzccRM18b08dOlkqFIUg.png"><figcaption><a href="https://books.google.com/books/about/Reinforcement_Learning.html?id=CAFR6IBF4xYC">Source</a></figcaption></figure><h1 id="9ee4">Actor-Critic Algorithm</h1><p id="0bbd">Combining policy gradient and value-function learning.</p><figure id="8097"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*J_2SNq9PKtlW1qnF2EKVtw.png"><figcaption>Batch Actor-Critic <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">source</a></figcaption></figure><figure id="22ab"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*dCrEc__YPfvSkMdDAHL-FA.png"><figcaption>Online Actor-Critic <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">source</a></figcaption></figure><p id="d004">A3C algorithm: asynchronous actor-critic using advantage function.</p><figure id="e399"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ej4aAGmlLmmf2xdjERTARA.png"><figcaption><a href="https://arxiv.org/pdf/1602.01783.pdf">Source</a></figcaption></figure><h2 id="e0ed">Deep Deterministic Policy Gradient DDPG (Continuous actions)</h2><p id="af6f">An actor-critic approach for continuous actions.</p><figure id="0e95"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*yXHYnEbDwg5DdsBph13lcA.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">Source</a></figcaption></figure><p id="d4d2">Adding parameter noise for exploration:</p><figure id="fcc3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*NGDaTsJfjp5Kq6WISk-GxA.png"><figcaption></figcaption></figure><figure id="f794"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*aQlKP43WHl-bqAjftir6CA.png"><figcaption><a href="https://arxiv.org/pdf/1509.02971.pdf">Source</a></figcaption></figure><h1 id="aefa">Natural Policy Gradient</h1><p id="21b6">Natural Policy Gradient is a better policy gradient method which guarantees policy improvement. It is invariant of how models are created and parameterized.</p><figure id="a932"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*FrN-ktK6h_7DIAZ8jiP2jQ.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">Source</a></figcaption></figure><p id="a6b1"><b>TRPO</b></p><p id="27c5">Optimize Natural Policy Gradient with Conjugate Gradient.</p><figure id="3e37"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Mctl1Js9OBWjFn9h82PxsA.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">Source</a></figcaption></figure><p id="edf9">Backtracking line search:</p><figure id="99ca"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Rn-79YcAyTsVvEvCGrgpIA.png"><figcaption></figcaption></figure><p id="58b4"><b>PPO</b></p><p id="9e3c">Optimize Natural Policy Gradient by clipping the advantage function.</p><figure id="afab"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*S8e9USP6x28PLXlu-FA_UQ.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf">Source</a></figcaption></figure><h1 id="6f0b">DAgger: Dataset Aggregation</h1><p id="9353">Use human labels to improve imitation learning.</p><figure id="7540"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*RkCKUyRW68fAuysDgWhuMA.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_2_behavior_cloning.pdf">Source</a></figcaption></figure><h1 id="6105">Monte Carlo Tree Search (MCTS)</h1><p id="d1f2">Search discrete action spaces using a search tree with an exploration tree policy.</p><figure id="5545"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Q6S2A_akI9E07q-LdGi-yA.png"><figcaption>Modified from <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_8_model_based_planning.pdf">source</a></figcaption></figure><h1 id="6f84">Model-based Reinforcement Learning</h1><p id="6544"><b>MPC (Model Predictive Control)</b></p><p id="8435">Replan actions in every action taken.</p><figure id="bdd3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gxUsmypxopxGaeSG0dkYgA.png"><figcaption></figcaption></figure><p id="59fd"><b>Policy search with backpropagation</b></p><p id="5d3c">Use backpropagation to build a policy under a model-based method.</p><figure id="30b5"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Tt7Xy13it5sPYvtB1v24tQ.png"><figcaption></figcaption></figure><p id="7fae"><b>PILCO</b></p><p id="cc71">Policy search with backpropagati

Options

on:</p><figure id="17a4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*0SY5yZ8-tc90BeKHE8KvuA.png"><figcaption>Simplified from <a href="http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf">source</a></figcaption></figure><p id="3ccc"><b>Policy search with Gaussian Process (GP)</b></p><p id="78cf">Using a Gaussian Process for the dynamic model.</p><figure id="8e7f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5y1RgDJGwWgoaWamxMjmDw.png"><figcaption></figcaption></figure><h1 id="7bc2">Guided Policy Search (GPS)</h1><p id="e6bd"><b>General scheme</b></p><p id="536e">Use locally optimized trajectory to guide the training of a policy.</p><figure id="fd66"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9HgOV5AbGObzxKjZuYVBPA.png"><figcaption></figcaption></figure><p id="8f81"><b>Deterministic GPS</b></p><p id="887a">Guided Policy Search on deterministic policy.</p><figure id="5cfa"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*nVBSvT-Znzu65LgwlDcZdg.png"><figcaption></figcaption></figure><figure id="61f5"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*iVSOHUt1MbAiGsKxdvLt3A.png"><figcaption></figcaption></figure><p id="c11a"><b>Stochastic GPS</b></p><p id="677c">Guided Policy Search on stochastic policy.</p><figure id="70ad"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*LCHYR6VVMLZwEoOJ4lz7qw.png"><figcaption></figcaption></figure><figure id="f02f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9o3TkEDquGfubAFw8BR-lw.png"><figcaption></figcaption></figure><h1 id="6608">Imitating optimal control</h1><p id="00bc">Use computer generated samples (<i>s</i>,<i> a</i>) from locally optimized trajectories to train a policy.</p><p id="fc00"><b>PLATO algorithm</b></p><p id="8a51">With a Linear Gaussian controller:</p><figure id="4b35"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EIGTTbwr_J9QHIPzc4eDTQ.png"><figcaption></figcaption></figure><p id="776d">We replan with:</p><figure id="95e4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fMR7gmKH-d8_q2DkxAi0YQ.png"><figcaption></figcaption></figure><figure id="213e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*uhONVLy70zU_CixhUYN4zA.png"><figcaption></figcaption></figure><h1 id="655c">Dyna-Q</h1><p id="9fbc">Dyna-Q is a constant loop of learning the value function or policy from real and simulated experience. We use the real experience to build and refine the dynamic model, and use this model to produce simulated experience to complement the training of the value function and/or the policy.</p><figure id="2fad"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*u7MpjbYsJcxcXS8Tr80Szw.jpeg"><figcaption><a href="https://books.google.com/books/about/Reinforcement_Learning.html?id=CAFR6IBF4xYC">Source</a></figcaption></figure><figure id="68fd"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*qNC4nz8b_juRwAUyi2O3mw.png"><figcaption><a href="https://books.google.com/books/about/Reinforcement_Learning.html?id=CAFR6IBF4xYC">Source</a></figcaption></figure><p id="d314">Learned from sampled experience:</p><figure id="7584"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*31eGNmtCp1G35rUujjTH6g.png"><figcaption><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">Source</a></figcaption></figure><h1 id="d035">Cross-entropy method (CEM)</h1><p id="5af5">Take “educated” random guesses on actions. Select top performers of guesses and use them as seeds for next round of guessing.</p><figure id="2bfd"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*lvWNYqYP_puZPykqf-DLEw.png"><figcaption><a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/index.html">Source</a></figcaption></figure><h1 id="1c77">Double Gradient Descent (DGD)</h1><p id="fd6c">Optimize an objective under constraints.</p><figure id="38f7"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Ni6AzrRcWQrnn-z2YKyOVw.png"><figcaption></figcaption></figure><h1 id="b6b5">ε-greedy policy</h1><p id="e7e2">To sample the kth episode, we use a ε-greedy algorithm below to pick which action to sample next. Here is the data distribution used in selecting the action:</p><figure id="55a9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*fI0IAWpzkcIEjFpRD_9eyQ.png"><figcaption></figcaption></figure><h1 id="4c43">Generalized advantage estimation (GAE)</h1><p id="6aeb">An n-step look ahead advantage function is defined as:</p><figure id="05a4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*Pe5kcwn5pef6_O7PYxm7bQ.png"><figcaption></figcaption></figure><p id="792d">Advantage function with 1 to k-step lookahead.</p><figure id="58d3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*EDJU3mUUU5YKDb7XaPxwkg.png"><figcaption><a href="https://arxiv.org/pdf/1506.02438.pdf">Source</a></figcaption></figure><p id="d1ab">The final advantage function for GAE is</p><figure id="fa3c"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*CNLtrsHS8uAwBFl5aXqvnA.png"><figcaption><a href="https://arxiv.org/pdf/1506.02438.pdf">Source</a></figcaption></figure><p id="ec89">where <b><i>λ </i></b>is a hyperparameter from 0 to 1. When λ is 1, it is Monte Carlo. When λ is 0, it is TD with one step look ahead.</p><h1 id="a275">Credits and references</h1><p id="82b6"><a href="http://rail.eecs.berkeley.edu/deeprlcourse/">UC Berkeley Reinforcement Learning</a></p><p id="b381"><a href="https://sites.google.com/view/deep-rl-bootcamp/lectures">UC Berkeley Reinforcement Learning Bootcamp</a></p><p id="6aa9"><a href="http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html">UCL Reinforcement Learning</a></p><p id="e967"><a href="https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation-ebook/dp/B008H5Q8VA/ref=sr_1_3?s=digital-text&amp;ie=UTF8&amp;qid=1535569242&amp;sr=1-3&amp;keywords=Reinforcement+Learning%3A+An+Introduction">Reinforcement Learning: An Introduction</a></p></article></body>

Photo by Daniel Cheung

RL — Reinforcement Learning Algorithms Quick Overview

This article overviews the major algorithms in reinforcement learning. Each algorithm will be explained briefly in a single context for an easy and quick overview.

Terms

Action-value functions Q, state-value function V and advantage function A are defined as:

Policy Evaluation

Policy evaluation computes the value functions for a policy π using the Bellman equations.

For example,

Policy evaluation using Monte Carlo Rollouts

Monte Carlo plays out the whole episode until the end to calculate the total rewards.

Source

Policy Iteration

Policy iteration evaluates (step 1)

and improves policy (step 2) continuously.

Source

Example: The following maze exits on the top-left or the bottom-right corner with a reward of -1 for every step taken. We evaluate and improve the policy continuously. Since there are only finite actions, the policy will eventually converge to the optimal policy π*.

Modified from source

Value Iteration

Compute the value functions iteratively under the Bellman equation using dynamic programming:

Example: This maze exits on the top-left corner with a reward of -1 for every step taken.

Source

Algorithms:

Source

This method ignores the policy and focuses on the value function until the end. The final policy is derived from the value function as:

Fitted value iteration

Estimate the value function V with a model parameterized by Φ:

Source

Fitted Q iteration

The Bellman equation for the action-value function is:

Estimate the action-value function Q with a model parameterized by Φ.

Full fitted Q-iteration source
Online Q-iteration source

Q-learning (SARSA max)

Q-learning is an online action-value function learning with an exploration policy (like epsilon-greedy).

Modified from source

Pseudocode:

Source

Q-learning with replay buffer and the target network

Add a replay buffer and a target network for training stability.

Source

DQN (Deep Q-learning)

DQN is a Q-learning method with a deep network to estimate Q using a replay buffer and a target network.

Source

Policy Gradients

Maximize the rewards by taking actions with high rewards more likely.

Source

Example: Policy Gradient using Monte Carlo rollouts and a baseline:

Source

Actor-Critic Algorithm

Combining policy gradient and value-function learning.

Batch Actor-Critic source
Online Actor-Critic source

A3C algorithm: asynchronous actor-critic using advantage function.

Source

Deep Deterministic Policy Gradient DDPG (Continuous actions)

An actor-critic approach for continuous actions.

Source

Adding parameter noise for exploration:

Source

Natural Policy Gradient

Natural Policy Gradient is a better policy gradient method which guarantees policy improvement. It is invariant of how models are created and parameterized.

Source

TRPO

Optimize Natural Policy Gradient with Conjugate Gradient.

Source

Backtracking line search:

PPO

Optimize Natural Policy Gradient by clipping the advantage function.

Source

DAgger: Dataset Aggregation

Use human labels to improve imitation learning.

Source

Monte Carlo Tree Search (MCTS)

Search discrete action spaces using a search tree with an exploration tree policy.

Modified from source

Model-based Reinforcement Learning

MPC (Model Predictive Control)

Replan actions in every action taken.

Policy search with backpropagation

Use backpropagation to build a policy under a model-based method.

PILCO

Policy search with backpropagation:

Simplified from source

Policy search with Gaussian Process (GP)

Using a Gaussian Process for the dynamic model.

Guided Policy Search (GPS)

General scheme

Use locally optimized trajectory to guide the training of a policy.

Deterministic GPS

Guided Policy Search on deterministic policy.

Stochastic GPS

Guided Policy Search on stochastic policy.

Imitating optimal control

Use computer generated samples (s, a) from locally optimized trajectories to train a policy.

PLATO algorithm

With a Linear Gaussian controller:

We replan with:

Dyna-Q

Dyna-Q is a constant loop of learning the value function or policy from real and simulated experience. We use the real experience to build and refine the dynamic model, and use this model to produce simulated experience to complement the training of the value function and/or the policy.

Source
Source

Learned from sampled experience:

Source

Cross-entropy method (CEM)

Take “educated” random guesses on actions. Select top performers of guesses and use them as seeds for next round of guessing.

Source

Double Gradient Descent (DGD)

Optimize an objective under constraints.

ε-greedy policy

To sample the kth episode, we use a ε-greedy algorithm below to pick which action to sample next. Here is the data distribution used in selecting the action:

Generalized advantage estimation (GAE)

An n-step look ahead advantage function is defined as:

Advantage function with 1 to k-step lookahead.

Source

The final advantage function for GAE is

Source

where λ is a hyperparameter from 0 to 1. When λ is 1, it is Monte Carlo. When λ is 0, it is TD with one step look ahead.

Credits and references

UC Berkeley Reinforcement Learning

UC Berkeley Reinforcement Learning Bootcamp

UCL Reinforcement Learning

Reinforcement Learning: An Introduction

Machine Learning
Reinforcement Learning
Artificial Intelligence
Deep Learning
Algorithms
Recommended from ReadMedium