
RL — Appendix: Proof for the article in TRPO & PPO

Difference of discounted rewards

The difference in the discounted rewards for two policies is:

Proof:
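The figure holding this identity and its proof is missing here. As a hedged reconstruction, the block below sketches the standard telescoping argument for the performance difference identity. The notation is an assumption, not taken from the original figures: J(π) is the expected discounted return, V^π and A^π are the value and advantage functions of π, γ is the discount factor, and R is the reward.

```latex
% Claim (assumed notation):
%   J(\pi') - J(\pi) = E_{\tau \sim \pi'} [ \sum_t \gamma^t A^{\pi}(s_t, a_t) ]
\begin{align*}
\mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi}(s_t,a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t}
   \big(R(s_t,a_t,s_{t+1}) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\big)\Big] \\
&= J(\pi') + \mathbb{E}_{\tau\sim\pi'}\Big[\sum_{t=0}^{\infty}\gamma^{t+1}V^{\pi}(s_{t+1})
   - \sum_{t=0}^{\infty}\gamma^{t}V^{\pi}(s_t)\Big] \\
&= J(\pi') - \mathbb{E}_{s_0}\big[V^{\pi}(s_0)\big] \\
&= J(\pi') - J(\pi).
\end{align*}
% The two sums telescope and leave only -V^{\pi}(s_0); the start-state
% distribution does not depend on the policy, so E[V^{\pi}(s_0)] = J(\pi).
```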

Natural Policy Gradient is covariant

Approximate the difference of discounted rewards with 𝓛

Proof (assuming both policies are similar):
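As a numerical sanity check rather than a proof, the sketch below compares the true difference J(π’) − J(π) with a surrogate on a tiny MDP. It assumes the usual form 𝓛_π(π’) = 1/(1−γ) · E_{s∼d^π, a∼π}[ (π’(a|s)/π(a|s)) · A^π(s,a) ], where d^π is the discounted state distribution of the old policy. The MDP, the policies, and all function names are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9
N_S, N_A = 2, 2
# Hypothetical 2-state, 2-action MDP: P[s][a][s'] and R[s][a].
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]
MU = [0.5, 0.5]  # start-state distribution


def q_value(s, a, v):
    """One-step lookahead: R(s,a) + gamma * E[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_S))


def state_values(pi, sweeps=2000):
    """Iterative policy evaluation of V^pi."""
    v = [0.0] * N_S
    for _ in range(sweeps):
        v = [sum(pi[s][a] * q_value(s, a, v) for a in range(N_A))
             for s in range(N_S)]
    return v


def expected_return(pi):
    """J(pi) = E_{s0 ~ MU}[V^pi(s0)]."""
    v = state_values(pi)
    return sum(MU[s] * v[s] for s in range(N_S))


def advantages(pi):
    """A^pi(s,a) = Q^pi(s,a) - V^pi(s)."""
    v = state_values(pi)
    return [[q_value(s, a, v) - v[s] for a in range(N_A)] for s in range(N_S)]


def discounted_state_dist(pi, sweeps=2000):
    """Fixed point of d = (1 - gamma) * MU + gamma * d @ P_pi."""
    d = list(MU)
    for _ in range(sweeps):
        d = [(1 - GAMMA) * MU[t]
             + GAMMA * sum(d[s] * pi[s][a] * P[s][a][t]
                           for s in range(N_S) for a in range(N_A))
             for t in range(N_S)]
    return d


def weighted_advantage(d, pi_new, adv):
    """1/(1-gamma) * sum_s d(s) * sum_a pi_new(a|s) * A(s,a)."""
    total = sum(d[s] * pi_new[s][a] * adv[s][a]
                for s in range(N_S) for a in range(N_A))
    return total / (1 - GAMMA)


pi_old = [[0.6, 0.4], [0.5, 0.5]]
adv_old = advantages(pi_old)
d_old = discounted_state_dist(pi_old)


def compare(eps):
    """Shift eps of probability from action 0 to action 1 in every state."""
    pi_new = [[p[0] - eps, p[1] + eps] for p in pi_old]
    true_diff = expected_return(pi_new) - expected_return(pi_old)
    # Exact identity: states drawn from the NEW policy's distribution.
    exact = weighted_advantage(discounted_state_dist(pi_new), pi_new, adv_old)
    # Surrogate: states drawn from the OLD policy's distribution.
    surrogate = weighted_advantage(d_old, pi_old and pi_new, adv_old)
    return true_diff, exact, surrogate


for eps in (0.1, 0.05):
    print(eps, compare(eps))
```

In this toy setting the exact identity matches the true difference up to numerical error, while the gap between the surrogate and the true difference shrinks roughly fourfold when the policy change is halved, which is what one would expect from a first-order approximation.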

𝓛 matches K to the first order

i.e.

K(π) = 𝓛(π), and
K’(π) = 𝓛’(π)

Proof:
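The proof figure is missing here, so the following is a hedged sketch of the usual argument. It assumes that K(π’) denotes J(π’) − J(π), that the policy is parameterized by θ, and that 𝓛 has the importance-sampled form shown in the first line. None of these definitions is confirmed by the surviving text.

```latex
% Assumed definitions:
%   K(\pi_{\theta'}) = J(\pi_{\theta'}) - J(\pi_\theta)
%   \mathcal{L}_{\pi_\theta}(\pi_{\theta'}) = \frac{1}{1-\gamma}
%     E_{s \sim d^{\pi_\theta}, a \sim \pi_\theta}
%     [ \frac{\pi_{\theta'}(a|s)}{\pi_\theta(a|s)} A^{\pi_\theta}(s,a) ]
\begin{align*}
\mathcal{L}_{\pi_\theta}(\pi_\theta)
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta}
   \big[A^{\pi_\theta}(s,a)\big] = 0 = K(\pi_\theta), \\
\nabla_{\theta'}\mathcal{L}_{\pi_\theta}(\pi_{\theta'})\Big|_{\theta'=\theta}
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta}
   \Big[\frac{\nabla_{\theta}\pi_{\theta}(a|s)}{\pi_\theta(a|s)}A^{\pi_\theta}(s,a)\Big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta}
   \big[\nabla_{\theta}\log\pi_{\theta}(a|s)\,A^{\pi_\theta}(s,a)\big] \\
&= \nabla_{\theta}J(\pi_\theta)
 = \nabla_{\theta'}K(\pi_{\theta'})\Big|_{\theta'=\theta}.
\end{align*}
% First line: the advantage has zero mean under the policy that defines it.
% Last line: the policy gradient theorem.
```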

Approximate the expected rewards as a quadratic equation

For the objective

We can use a Taylor series to expand both terms above up to second order. The second-order term of 𝓛 is much smaller than the KL-divergence term and will be ignored.

After taking out the zero values:

where g is the policy gradient and H measures the sensitivity (curvature) of the policy with respect to the model parameters θ.
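Since the figures are missing, here is a hedged reconstruction of the expansions these statements usually refer to. The symbols θ_k (current parameters) and the mean KL-divergence \bar{D}_{KL} are assumptions about the original notation.

```latex
\begin{align*}
\mathcal{L}_{\theta_k}(\theta)
&\approx \underbrace{\mathcal{L}_{\theta_k}(\theta_k)}_{=\,0}
 + g^{\top}(\theta-\theta_k),
 & g &= \nabla_{\theta}\mathcal{L}_{\theta_k}(\theta)\Big|_{\theta_k}, \\
\bar{D}_{KL}(\theta_k\,\|\,\theta)
&\approx \underbrace{\bar{D}_{KL}(\theta_k\,\|\,\theta_k)}_{=\,0}
 + \underbrace{\nabla_{\theta}\bar{D}_{KL}\Big|_{\theta_k}^{\top}}_{=\,0}(\theta-\theta_k)
 + \tfrac{1}{2}(\theta-\theta_k)^{\top}H(\theta-\theta_k),
 & H &= \nabla^{2}_{\theta}\bar{D}_{KL}(\theta_k\,\|\,\theta)\Big|_{\theta_k}.
\end{align*}
% The KL-divergence and its gradient both vanish at \theta = \theta_k because
% the divergence is minimized (with value zero) when the two policies coincide.
```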

Our objective can therefore be approximated as:
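To make the quadratic approximation concrete, the sketch below assumes the familiar trust-region form: maximize gᵀΔ subject to ½·ΔᵀHΔ ≤ δ, with Δ = θ − θ_k. Under that assumption the maximizer is the natural gradient step Δ = sqrt(2δ / (gᵀH⁻¹g)) · H⁻¹g. The values of g, H and δ are made up, and the helper names are hypothetical.

```python
import math


def solve_2x2(h, g):
    """Solve h @ x = g for a 2x2 matrix h using Cramer's rule."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(g[0] * h[1][1] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def quadratic_kl(step, h):
    """Second-order KL estimate: 0.5 * step^T H step."""
    h_step = [dot(h[0], step), dot(h[1], step)]
    return 0.5 * dot(step, h_step)


def natural_gradient_step(g, h, delta):
    """Maximizer of g^T step subject to 0.5 * step^T H step <= delta."""
    direction = solve_2x2(h, g)                       # H^{-1} g
    scale = math.sqrt(2.0 * delta / dot(g, direction))
    return [scale * direction[0], scale * direction[1]]


def plain_gradient_step(g, h, delta):
    """Step along g, scaled to sit on the same KL boundary (for comparison)."""
    scale = math.sqrt(delta / quadratic_kl(g, h))
    return [scale * g[0], scale * g[1]]


# Illustrative numbers only.
g = [1.0, 0.5]
H = [[2.0, 0.3], [0.3, 1.0]]
delta = 0.01

nat = natural_gradient_step(g, H, delta)
plain = plain_gradient_step(g, H, delta)
print("natural step:", nat, "gain:", dot(g, nat))
print("plain step:  ", plain, "gain:", dot(g, plain))
```

Both steps use up the same KL budget δ, but the natural gradient step achieves the larger linearized improvement gᵀΔ, because it accounts for the curvature H instead of treating all parameter directions equally.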

𝓛 & M function

We want to prove

During the proof, we will also show that M approximates the following terms locally (a requirement for the MM method).

1. The difference in the discounted rewards between two different policies can be computed as (proof):

2. 𝓛 can be approximated as (proof)

3. When π’ = π, the L.H.S. above is zero and we can show (proof)

The claim in (3) is particularly important for us. Since the KL-divergence is zero when both policies are the same, the R.H.S. below approximates our objective function locally at π’ = π. This is one requirement for the MM algorithm.
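Why the local match matters for MM can be spelled out in a few lines. The sketch below uses generic symbols that are assumptions for illustration: f is the true objective, M_k is the surrogate built at the current parameters θ_k, and θ_{k+1} is the maximizer of M_k.

```latex
% Assumptions on the surrogate:
%   (i)  M_k(\theta) \le f(\theta) for all \theta      (lower bound)
%   (ii) M_k(\theta_k) = f(\theta_k)                   (matches f at \theta_k)
\begin{align*}
\theta_{k+1} &= \arg\max_{\theta} M_k(\theta), \\
f(\theta_{k+1}) &\ge M_k(\theta_{k+1})   && \text{by (i)} \\
                &\ge M_k(\theta_k)       && \text{since } \theta_{k+1} \text{ maximizes } M_k \\
                &=   f(\theta_k)         && \text{by (ii)}.
\end{align*}
% Each MM iteration therefore never decreases the true objective f.
```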

Proof:

becomes

So M approximates our objective locally.