Proximal Policy Optimization (PPO): Exploring the Algorithm Behind ChatGPT’s Powerful Reinforcement Learning Capabilities
Discover the Versatile Deep Reinforcement Learning Algorithm Used in ChatGPT’s RL Capabilities — Proximal Policy Optimization (PPO)
ChatGPT is currently the most popular Large Language Model, significantly impacting natural language processing and disrupting the world. It is trained on large and diverse data sources, such as news articles, books, websites, and social media posts, and is fine-tuned with Reinforcement Learning from Human Feedback (RLHF) using PPO.
If you are new to Reinforcement Learning, the following concepts are good to know:
- Essential Elements of Reinforcement Learning
- Reinforcement Learning: Temporal Difference Learning
- Reinforcement Learning: Q-Learning
- Deep Q Learning: A Deep Reinforcement Learning Algorithm
- An Intuitive Explanation of Policy Gradient
- Unlocking the Secrets of Actor-Critic Reinforcement Learning: A Beginner’s Guide
- A Basic Understanding of the ChatGPT Model
Before exploring the Proximal Policy Optimization (PPO) algorithm, let’s look at other RL algorithms that handle continuous state and action spaces.
The goal is to efficiently train a reinforcement learning agent to handle continuous states and actions, reduce variance, and do so in a sample-efficient and stable way while keeping the training process easy to implement.
A2C-Advantage Actor-Critic
The A2C (Advantage Actor-Critic) algorithm combines value-based and policy-based methods to improve performance. It consists of two main components: the Actor and the Critic.
The Actor determines the actions the agent takes using the Policy gradient, while the Critic evaluates the effectiveness of the actions based on a value function. The Actor is updated using the advantage function as a guide.
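As a rough sketch of how these pieces fit together, the snippet below shows an advantage actor-critic update in PyTorch; the tensors, network outputs, and loss coefficients are assumptions for illustration rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def a2c_loss(policy_logits, values, actions, returns,
             value_coef=0.5, entropy_coef=0.01):
    """Illustrative A2C loss: actor term weighted by the advantage,
    critic term as a value regression, plus an entropy bonus."""
    # Advantage = observed return minus the critic's value estimate
    advantages = returns - values.detach()

    # Actor (policy gradient) loss: -log pi(a|s) * advantage
    log_probs = F.log_softmax(policy_logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(chosen_log_probs * advantages).mean()

    # Critic loss: fit the value function to the observed returns
    critic_loss = F.mse_loss(values, returns)

    # Entropy bonus encourages exploration
    entropy = -(log_probs * log_probs.exp()).sum(dim=-1).mean()

    return actor_loss + value_coef * critic_loss - entropy_coef * entropy
```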
However, A2C does have some limitations.
- It can suffer from slow convergence and become stuck in suboptimal policies.
- A2C is sensitive to changes in hyperparameters, which can result in significant variations in performance.
- It is not very sample-efficient, requiring a lot of interaction with the environment to learn effectively.
One advantage of A2C is that it can be run in parallel environments, significantly speeding up the training process. By running multiple environments simultaneously, A2C can collect more experience and learn from diverse experiences in less time.
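For example, with the Gymnasium library (assuming it is installed and the CartPole-v1 environment is available), several environment copies can be stepped in lockstep roughly like this:

```python
import gymnasium as gym

# Illustrative setup: 8 copies of CartPole stepped in parallel
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(8)]
)

obs, info = envs.reset(seed=0)
for _ in range(100):
    # A real agent would pick actions from its policy; here we sample randomly
    actions = envs.action_space.sample()
    obs, rewards, terminated, truncated, info = envs.step(actions)
envs.close()
```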
TRPO-Trust Region Policy Optimization
TRPO (Trust Region Policy Optimization) is an algorithm that combines the actor-critic approach with a trust region to restrict the policy update. The policy update is evaluated using the KL divergence between the old and updated policies, which is used as a metric to determine the size of the trust region at each iteration.
TRPO is known for its good sample efficiency, which means it can learn quickly with fewer samples. Using the trust region to limit policy updates, TRPO ensures that the policy changes are not too large, making it relatively stable.
However, TRPO can be computationally expensive and more difficult to optimize compared to other algorithms due to the complexity of the trust region constraint. Therefore, it may require more tuning and optimization to achieve optimal performance.
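Concretely, TRPO maximizes a surrogate objective subject to a KL-divergence constraint between the old and new policies, with δ controlling the trust-region size:

```latex
\max_{\theta} \; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t \right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta
```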
Proximal Policy Optimization (PPO) is a policy gradient reinforcement learning algorithm that handles continuous state and action spaces, takes less time to train, has lower variance, generally gives better and more stable results than the algorithms above, and is simple to implement.
So, what exactly is PPO, and how does it work?
PPO-Proximal Policy Optimization
PPO belongs to the policy gradient methods for reinforcement learning. It alternates between
- Sampling data through interaction with the environment and
- Optimizing a surrogate objective function using stochastic gradient ascent.
The stochastic gradient ascent step used in policy gradient RL aims to increase the expected reward by adjusting the parameters of the policy network that determines the action to be taken in a given state.
By taking a gradient ascent step on this objective with respect to the network parameters, the agent is incentivized to take actions that lead to higher rewards.
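In its simplest form, the objective being ascended is the expected log-probability of the chosen actions weighted by an advantage estimate Âₜ:

```latex
L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right],
\qquad
\nabla_\theta L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```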
For stable agent training, the algorithm should avoid network parameter updates that change the policy too much at one step, which might wreck the policy.
However, too small an update of the network parameter will slow down the training process.
Consider the challenge of reaching the top of a steep mountain cliff covered with dense forest. If your step size is too large, you run the risk of falling off the cliff. On the other hand, if your step size is too small, it could take an eternity to reach the summit.
So how much should the policy network parameters be updated to lead to higher rewards and faster but more stable training?
If you want to reach the top of a steep mountain cliff covered with dense forest, one way is to rely on the experience of previous explorers and follow their path, taking steps that stay relatively close to their experience.
The step size can be guided by the probability ratio between your current actions on the path and the previous actions taken by others on the same path.
The probability ratio rₜ(θ) is the probability of taking action aₜ in state sₜ under the current policy divided by the probability of the same action under the previous policy.
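Written out, with πθ the current policy and πθ_old the policy that collected the data:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```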
The probability ratio is an easy way to estimate how much the current policy diverges from the old one.
- If rₜ(θ) > 1, the action is more likely under your current policy than under your old policy.
- If rₜ(θ) is between 0 and 1, the action is less likely under your current policy than under your old policy.
PPO uses a clipped surrogate objective, which takes the minimum of the unclipped objective and a clipped version of it.
PPO clips rₜ(θ) to stay within a small interval around 1, [1 − ε, 1 + ε], where ε is a hyperparameter usually set to 0.2.
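Putting these together gives the clipped surrogate objective, with Âₜ the advantage estimate:

```latex
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\right) \hat{A}_t \right) \right]
```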
The clipping term modifies the surrogate objective by restricting the probability ratio, removing the incentive for moving rₜ(θ) outside the interval [1 − ε, 1 + ε].
The clipped surrogate objective takes the minimum of the clipped and unclipped objectives, which results in a lower bound on the unclipped objective. This prevents the optimization algorithm from making overly optimistic updates and keeps the new policy from deviating too far from the old one.
When hiking on a less explored hill, taking the more conservative of the clipped and unclipped estimates helps ensure you do not deviate too much from the previous explorers’ experience, so that a single overconfident step does not prevent you from reaching the summit. This approach establishes constraints that keep you on a path consistent with the previous explorers’ knowledge while allowing for some flexibility.
When the advantage function (Aₜ) for a state-action pair is positive, the action taken under the new policy had an estimated positive effect on the outcome. In this case the min in the clipped objective constrains how much the objective can increase, ensuring that the new policy does not benefit by moving far away from the old policy.
When the advantage function (Aₜ) for the state-action pair is negative, the action taken under the new policy had an estimated negative effect on the outcome. In this case the clipped objective behaves like max(rₜ(θ), 1 − ε) Aₜ, which again caps how much the objective can improve, ensuring that the new policy does not benefit by moving far away from the old policy.
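A minimal PyTorch sketch of this loss, assuming the per-action log-probabilities and advantage estimates are already computed (a full implementation would also add value-function and entropy terms):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, returned as a loss to minimize."""
    # Probability ratio r_t(theta), computed in log space for numerical stability
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Element-wise minimum, then negate (gradient ascent becomes descent on a loss)
    return -torch.min(unclipped, clipped).mean()
```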
So how do ChatGPT and InstructGPT use PPO for training?
ChatGPT is trained on large and diverse data sources, such as news articles, books, websites, and social media posts, to learn the patterns and structures of language.
The human labelers are presented with a prompt from the prompt dataset and asked to provide an answer that best fits the prompt. This human-labeled dataset is then used to fine-tune the GPT Language Model using supervised learning techniques.
Once the GPT model is fine-tuned, a prompt is fed into the model, and a labeler ranks several outputs generated by the model from best to worst. This ranking data is used to train the reward model, which predicts a scalar reward reflecting the labelers’ preferences.
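In the InstructGPT setup, the reward model rϕ is trained on pairwise comparisons drawn from these rankings; roughly, for a prompt x with a preferred response y_w and a less preferred response y_l, the loss is:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]
```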
The components of the PPO RL setup used for ChatGPT and InstructGPT are:
- Observation space: All possible sequences of input tokens that the Large Language Model can process.
- Action space: The tokens the model can generate at each step, from which it composes coherent conversational responses.
- Value function: The scalar value, derived from the reward model, that helps estimate the expected return of the generated response for the given input tokens.
- Policy: The Large Language Model itself, which generates responses based on the current state, i.e., the input prompt.
The language model, which is the policy, is fine-tuned using the PPO algorithm for ChatGPT and InstructGPT.
The outputs of the initial language model form the old policy. The new policy, used for calculating the probability ratio, comes from the language model being fine-tuned with the reward signal. To ensure stability, the clipped surrogate objective constrains the new policy to stay within a certain distance of the old policy.
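A heavily simplified sketch of one such PPO update on generated responses is shown below; the model interface, per-token log-probabilities, and the way rewards are turned into advantages are hypothetical placeholders, not the actual ChatGPT training loop.

```python
import torch

def rlhf_ppo_step(policy_model, old_log_probs, input_ids, response_ids,
                  rewards, optimizer, eps=0.2):
    """One illustrative PPO update on sampled responses.

    old_log_probs: per-token log-probs of the responses under the policy
    that generated them (the old policy). rewards: per-sequence scalar
    rewards from the reward model, broadcast here as per-token advantages
    for simplicity (real systems use per-token advantages and a value head).
    """
    # Log-probs of the same response tokens under the current (new) policy
    # (log_probs is a hypothetical helper on the policy model)
    new_log_probs = policy_model.log_probs(input_ids, response_ids)

    # Probability ratio between new and old policies, per token
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Crude advantage: reuse the sequence-level reward for every token
    advantages = rewards.unsqueeze(-1).expand_as(new_log_probs)

    # Clipped surrogate objective keeps the new policy near the old one
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```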
Conclusion:
The PPO algorithm is straightforward to implement, and its clipped surrogate objective stabilizes the training process. The clipped objective is built around the probability ratio between the new and old policies, which prevents the optimization algorithm from making overly optimistic updates. This keeps the objective close to the desired behavior and stops the policy from deviating too much.