Jonathan Hui



RL — Appendix: Proof for the article in TRPO & PPO

Difference of discounted rewards

The difference in the discounted rewards for two policies is:
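For reference, the identity is usually stated as below. The notation is an assumption following the common convention in the TRPO literature: J(π) is the expected discounted reward of policy π, A^π is the advantage function of π, γ is the discount factor, and τ is a trajectory sampled by running π′.

```latex
J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'} \left[ \sum_{t=0}^{\infty} \gamma^{t} A^{\pi}(s_t, a_t) \right]
```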

Proof:
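A sketch of the standard telescoping argument follows. The symbols are assumptions for illustration: J(π) is the expected discounted reward, V^π and A^π are the value and advantage functions of π, R is the reward, and s_0 is drawn from a start distribution that does not depend on the policy.

```latex
\begin{aligned}
\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t,a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R(s_t,a_t,s_{t+1}) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\big)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty} \gamma^{t+1} V^\pi(s_{t+1}) - \sum_{t=0}^{\infty} \gamma^t V^\pi(s_t)\Big] \\
&= J(\pi') - \mathbb{E}_{s_0}\big[V^\pi(s_0)\big] \\
&= J(\pi') - J(\pi)
\end{aligned}
```

The two sums in the second line cancel term by term except for the t = 0 value, which is why only V^π(s_0) survives.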


Natural Policy Gradient is covariant


Approximate the difference of discounted rewards with 𝓛

Proof (assuming both policies are similar):
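A sketch of the usual argument, assuming the notation d^π for the discounted state visitation distribution of π (a symbol not defined in this appendix): rewrite the trajectory expectation over states and actions, replace d^{π′} with d^π because the two policies are close, then use importance sampling so that actions are drawn from π.

```latex
\begin{aligned}
J(\pi') - J(\pi)
&= \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^{\pi'},\, a\sim\pi'}\big[A^\pi(s,a)\big] \\
&\approx \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^{\pi},\, a\sim\pi'}\big[A^\pi(s,a)\big] \\
&= \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^{\pi},\, a\sim\pi}\Big[\frac{\pi'(a\mid s)}{\pi(a\mid s)}\, A^\pi(s,a)\Big]
= \mathcal{L}_\pi(\pi')
\end{aligned}
```

Only the second step is an approximation; the first and last are exact.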

𝓛 matches K to the first order

i.e.

  • K(π) = 𝓛(π), and
  • K’(π) = 𝓛’(π)

Proof:

Approximate the expected rewards as a quadratic equation

For the objective

We can use a Taylor series to expand both terms above up to second order. The second-order term of 𝓛 is much smaller than the KL-divergence term and is ignored.

After dropping the terms that evaluate to zero:

where g is the policy gradient and H measures the sensitivity (curvature) of the policy relative to the model parameter θ.
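Written out, the expansions around the current parameters take the following form. The names θ_k (current parameters), Δθ = θ − θ_k, and the state-averaged KL-divergence D̄_KL are assumptions for illustration.

```latex
\begin{aligned}
\mathcal{L}_{\theta_k}(\theta) &\approx \mathcal{L}_{\theta_k}(\theta_k) + g^{T}\Delta\theta,
\qquad g = \nabla_\theta \mathcal{L}_{\theta_k}(\theta)\big|_{\theta=\theta_k} \\
\bar{D}_{KL}(\theta \,\|\, \theta_k) &\approx \bar{D}_{KL}(\theta_k \,\|\, \theta_k)
+ \nabla_\theta \bar{D}_{KL}\big|_{\theta=\theta_k}^{T}\,\Delta\theta
+ \tfrac{1}{2}\,\Delta\theta^{T} H\, \Delta\theta,
\qquad H = \nabla^2_\theta \bar{D}_{KL}\big|_{\theta=\theta_k}
\end{aligned}
```

Three terms vanish at θ = θ_k: 𝓛 itself, because the expected advantage of a policy under its own actions is zero; the KL-divergence, because the two policies coincide; and the gradient of the KL-divergence, because θ_k is its minimum. What remains is gᵀΔθ for 𝓛 and ½ ΔθᵀHΔθ for the KL-divergence.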

Our objective can therefore be approximated as:

or
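As a minimal sketch of how this quadratic approximation is typically used, assume the constrained form: maximize gᵀΔθ subject to ½ ΔθᵀHΔθ ≤ δ. Its maximizer has the closed form Δθ = sqrt(2δ / (gᵀH⁻¹g)) · H⁻¹g. The function name `natural_gradient_step`, the trust-region size `delta`, and the toy values of `g` and `H` are hypothetical and chosen only for illustration.

```python
import numpy as np


def natural_gradient_step(g, H, delta):
    """Maximize g^T d subject to 0.5 * d^T H d <= delta.

    Assumes H is symmetric positive definite and g is non-zero.
    """
    x = np.linalg.solve(H, g)                # x = H^{-1} g, without forming the inverse
    scale = np.sqrt(2.0 * delta / (g @ x))   # rescale so the constraint is tight
    return scale * x


# Toy values, purely illustrative.
g = np.array([1.0, 2.0])
H = np.array([[2.0, 0.0],
              [0.0, 1.0]])
delta = 0.01

step = natural_gradient_step(g, H, delta)
kl_quadratic = 0.5 * step @ H @ step         # quadratic estimate of the KL-divergence
improvement = g @ step                       # linear estimate of the objective gain
```

The step lands exactly on the trust-region boundary, so `kl_quadratic` equals `delta` up to floating-point error. For large models, H⁻¹g is usually approximated with conjugate gradient instead of a direct solve.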

𝓛 & M function

We want to prove

During the proof, we will also show M approximates the following terms locally (a requirement for the MM method).

  1. The difference in the discounted rewards between two different policies can be computed as (see the proof above):

  2. 𝓛 can be approximated as (see the proof above):

  3. When π’ = π, the L.H.S. above is zero and we can show (see the proof above):

The claim in (3) is particularly important for us. Since the KL-divergence is zero when both policies are the same, the R.H.S. below approximates our objective function locally at π’ = π. This is one requirement for the MM algorithm.

Proof:

becomes

So M approximates our objective locally.
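To make the local-approximation claim concrete, one common way to write M is shown below. The constant C, the state distribution d^π, and the exact form of the penalty term are assumptions here. This is a sketch and not necessarily the form used in the original figures.

```latex
\begin{aligned}
M(\pi') &= \mathcal{L}_\pi(\pi') - C \sqrt{\mathbb{E}_{s\sim d^\pi}\big[D_{KL}(\pi' \,\|\, \pi)[s]\big]},
\qquad J(\pi') - J(\pi) \ge M(\pi') \\
M(\pi) &= \mathcal{L}_\pi(\pi) - C \cdot 0 = 0 = J(\pi) - J(\pi)
\end{aligned}
```

The first line says M is a lower bound on the improvement. The second says the bound is tight at π’ = π. Together these are the two properties a surrogate needs in the MM method.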
