# Summary

The web content discusses the use of OpenAI's Gym and Stable Baselines3 libraries to solve the classic control problem of Mountain Car Continuous using reinforcement learning, specifically the Proximal Policy Optimization (PPO) algorithm.

# Abstract

The article provides a comprehensive guide to applying reinforcement learning to the Mountain Car Continuous problem from OpenAI's Gym using the Stable Baselines3 library. It begins with a background on deep reinforcement learning (RL), the Markov Decision Process (MDP), and the importance of the OpenAI Gym environment for RL research. The focus then shifts to the Mountain Car Continuous challenge, a standard test for RL algorithms where an agent must navigate a car to the top of a hill. The article highlights the use of Stable Baselines3, particularly the PPO algorithm, to efficiently solve this problem. The author includes a Colab notebook link for practical implementation and details the steps for setting up the environment, configuring the PPO model with hyperparameters, training the model, and evaluating its performance. The article concludes with a discussion on the significance of selecting appropriate hyperparameters for the PPO algorithm and provides additional resources for understanding them.

# Opinions

- The author emphasizes the effectiveness of the Proximal Policy Optimization (PPO) algorithm in solving the Mountain Car Continuous problem.
- The article suggests that the Stable Baselines3 library simplifies the implementation of RL algorithms, making it accessible for researchers and practitioners.
- There is an opinion that creating a standardized environment with OpenAI Gym is crucial for benchmarking RL algorithms and focusing on algorithm development.
- The author values the use of Tensorboard for visualizing learning progress and model performance, as evidenced by the instructions to save logs for Tensorboard visualization.
- The author indicates a preference for specific hyperparameters, which they claim are tuned to solve the Mountain Car Continuous problem effectively.
- There is a mention of a practical challenge in rendering the environment within a Jupyter Notebook, hinting at a potential area for improvement or further exploration in the tooling ecosystem.

Use Stable Baselines3 to Solve Mountain Car Continuous in Gym

OpenAI's Gym and the Stable Baselines3 library make reinforcement learning easy to use. I'd like to recap how to use them with one of the classic control problems: Mountain Car Continuous. The Colab notebook is here: https://colab.research.google.com/drive/1m5Ppsrv6B5maUJ-vMgbZtMeSxqFfUVSP?usp=sharing

1. Background

(1) Deep Reinforcement Learning (RL)

RL was used to train AlphaGo, the agent that defeated a world champion at Go in 2016. Since then, RL has attracted increasing attention. Unlike supervised learning, which tries to learn the boundaries between distributions, or unsupervised learning, which learns the distribution directly, RL is built on the Markov Decision Process (MDP). An MDP is a mathematical formulation of a sequential decision-making process with an objective. In a given environment, the learning agent takes actions based on its observations. The environment updates its state in response to those actions and returns feedback, the reward, to the agent. The process repeats until the agent reaches the goal or meets a termination condition. The learning agent tries to find the optimal policy (which action to take under a given observation or state), meaning the one that maximizes the reward accumulated over one such process.
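
To make the agent-environment loop concrete, here is a minimal sketch in plain Python. The `CorridorEnv` class, the two policies, and the reward values are all made up for illustration (they are not part of Gym or Stable Baselines3); the point is only to show the observe, act, reward cycle and how the accumulated (discounted) reward of one episode is computed.

```python
import random


class CorridorEnv:
    """Toy MDP (hypothetical): start at cell 0 and reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the current cell

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.position, reward, done


def run_episode(env, policy, gamma=0.99, max_steps=100):
    """Roll out one episode and return the discounted sum of rewards."""
    obs = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(obs)                  # agent acts on its observation
        obs, reward, done = env.step(action)  # environment updates and responds
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return


def always_right(obs):
    return 1


_rng = random.Random(0)


def random_policy(obs):
    return _rng.choice([-1, 1])


print("always right:", run_episode(CorridorEnv(), always_right))
print("random      :", run_episode(CorridorEnv(), random_policy))
```

In this toy problem walking straight to the goal is the best possible policy, so its return is an upper bound for any other policy. An RL algorithm has to discover such a policy from the rewards alone.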

Currently, Proximal Policy Optimization (PPO) is one of the most widely used algorithms for solving this kind of MDP problem.
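
For intuition about what PPO optimizes, the sketch below evaluates the clipped surrogate objective from the PPO paper for a single sample. The function name `clipped_surrogate` is hypothetical and this is not Stable Baselines3's code; the `clip_range` argument plays the same role as the `clip_range` hyperparameter passed to the model later in this post.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.1):
    """PPO clipped objective for one sample (a quantity to maximize).

    ratio     = pi_new(a|s) / pi_old(a|s), how much the policy has changed
    advantage = estimate of how much better the action was than average
    """
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - clip_range, 1 + clip_range].
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage): the gain is capped once the ratio
# exceeds 1 + clip_range.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A bad action (negative advantage): lowering its probability below
# 1 - clip_range earns no further credit.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The clipping keeps each policy update close to the policy that collected the data, which is where the "proximal" in the name comes from.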

(2) OpenAI Gym

In RL, the environment is crucial since it provides the reward that the agent's learning is based on. It also needs to update its state at each step. To make research easier to benchmark across environments and to let people focus on algorithm development, OpenAI created the Gym library, which provides several standard environments. https://www.gymlibrary.ml/

(3) Stable Baselines3

Stable Baselines3 provides reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines and replaces the previous release. Official docs: https://stable-baselines3.readthedocs.io/en/master/index.html

2. Mountain Car Continuous

“The goal of the MDP is to strategically accelerate the car to reach the goal state on top of the right hill. There are two versions of the mountain car domain in gym: one with discrete actions and one with continuous. This version is the one with continuous actions.” https://www.gymlibrary.ml/environments/classic_control/mountain_car_continuous/
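
To see why the car has to act "strategically", here is a small stand-alone simulation. It is a simplified re-implementation written for this post, and the constants (engine power, position bounds, speed limit, goal position, reward shape) are my understanding of the Gym environment rather than values copied from the source, so treat them as assumptions. The two policies are hand-written for illustration: one always pushes right, the other pushes in the direction the car is already moving.

```python
import math
import random

# Assumed constants, modelled on Gym's continuous mountain car.
POWER = 0.0015
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.45
MAX_STEPS = 999


def step(position, velocity, action):
    """Advance the car by one time step and return the new state."""
    force = max(-1.0, min(1.0, action))
    velocity += force * POWER - 0.0025 * math.cos(3 * position)  # engine vs. gravity
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POSITION, min(MAX_POSITION, position + velocity))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    done = position >= GOAL_POSITION and velocity >= 0.0
    reward = (100.0 if done else 0.0) - 0.1 * action ** 2  # goal bonus minus effort
    return position, velocity, reward, done


def run(policy, seed=0):
    """Run one episode; return (reached_goal, steps, total_reward)."""
    rng = random.Random(seed)
    position, velocity = rng.uniform(-0.6, -0.4), 0.0
    total_reward = 0.0
    for t in range(1, MAX_STEPS + 1):
        action = policy(position, velocity)
        position, velocity, reward, done = step(position, velocity, action)
        total_reward += reward
        if done:
            return True, t, total_reward
    return False, MAX_STEPS, total_reward


def always_right(position, velocity):
    return 1.0


def follow_momentum(position, velocity):
    return 1.0 if velocity >= 0 else -1.0


print("always right   :", run(always_right))
print("follow momentum:", run(follow_momentum))
```

The engine is too weak to climb the right hill directly, so full throttle to the right never reaches the flag. Swinging back and forth to build momentum does, and that is the behaviour a trained agent has to discover.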

3. Implementation using Stable Baselines3 (SB3)

The code is very simple with the library. Below is the tuned PPO model that solves the problem. The notebook: https://colab.research.google.com/drive/1m5Ppsrv6B5maUJ-vMgbZtMeSxqFfUVSP?usp=sharing

This is the result I got: the car reaches the flag on the right after several attempts to climb to the top.

(1) Install and import packages

# Run in a notebook cell (drop the leading "!" in a terminal)
!pip install "stable-baselines3[extra]"

import os
import time

import gym
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_util import make_vec_env

(2) Create folders to save models and logs

# Save models and logs (the logs can be visualised in TensorBoard)
run_id = time.time()  # one timestamp shared by both folders
models_dir = f"models/Mountain-{run_id}"
logdir = f"logs/Mountain-{run_id}"
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

(3) Create an Environment

# Vectorized environment; n_envs=1 runs a single copy (raise it to run several in parallel)
env = make_vec_env("MountainCarContinuous-v0", n_envs=1)

(4) Set up the model with SB3

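# The learning agent and its hyperparameters
model = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0,
    batch_size=256,       # larger than the 8-step rollout (n_steps * n_envs), so each minibatch is the whole rollout
    ent_coef=0.00429,     # entropy bonus weight
    learning_rate=7.77e-05,
    n_epochs=10,          # optimization passes over each rollout
    n_steps=8,            # steps collected per environment before each update
    gae_lambda=0.9,
    gamma=0.9999,         # discount factor
    clip_range=0.1,
    max_grad_norm=5,
    vf_coef=0.19,         # value loss weight
    use_sde=True,         # generalized State Dependent Exploration
    policy_kwargs=dict(log_std_init=-3.29, ortho_init=False),
    verbose=1,
    tensorboard_log=logdir,
)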
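
Two of these hyperparameters, `gamma` and `gae_lambda`, control how rewards are turned into the advantage estimates that PPO learns from. The sketch below is a plain-Python version of Generalized Advantage Estimation (GAE) for a rollout with no episode boundary in it. The function name is hypothetical and this is not SB3's internal code; it only illustrates the recursion.

```python
def gae_advantages(rewards, values, last_value, gamma=0.9999, gae_lambda=0.9):
    """GAE for one rollout that does not cross an episode boundary.

    rewards[t]  reward received after the action at step t
    values[t]   value estimate V(s_t) from the critic
    last_value  value estimate for the state following the final step
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better things went than the critic expected
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD errors
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# gae_lambda=1 sums all future TD errors; gae_lambda=0 keeps only the current one.
print(gae_advantages([1, 1, 1], [0, 0, 0], 0.0, gamma=1.0, gae_lambda=1.0))
print(gae_advantages([1, 1, 1], [0, 0, 0], 0.0, gamma=1.0, gae_lambda=0.0))
```

As a rough rule of thumb, a discount factor looks about 1 / (1 - gamma) steps ahead. With `gamma=0.9999` that is about 10,000 steps, which is longer than a whole episode, so the reward for reaching the flag is barely discounted even when it arrives late.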

(5) Training

# Train in chunks and save a checkpoint after each one
TIMESTEPS = 20000
for i in range(1, 11):
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    # Name each checkpoint after the total number of timesteps trained so far
    model.save(f"{models_dir}/{TIMESTEPS * i}")

(6) Load the best model to check the result

# Check model performance
# Load the best checkpoint you observed in TensorBoard: the one that reaches the goal / obtains the highest return
models_dir = "models/Mountain-1653282767.3143597"  # replace with your own run folder
model_path = f"{models_dir}/80000"
best_model = PPO.load(model_path, env=env)

obs = env.reset()
for _ in range(1000):  # one episode lasts at most 999 steps
    action, _states = best_model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    # env.render()  # works in a Python IDE; I haven't figured out how to render in a notebook
    if dones[0]:  # stop once the episode ends (the vectorized env resets automatically)
        break
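
The run folder above is hard-coded with the timestamp of my own run. As a convenience, the sketch below looks up the newest run folder and lists its checkpoints in training order. Both helper names are hypothetical, and the sketch assumes the folder layout used in this post (`models/Mountain-<timestamp>/<timesteps>.zip`; SB3 appends `.zip` when saving).

```python
from pathlib import Path


def latest_run_dir(root="models", prefix="Mountain-"):
    """Return the newest run folder, judged by the timestamp in its name."""
    runs = [p for p in Path(root).glob(f"{prefix}*") if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no '{prefix}*' folders under {root}")
    return max(runs, key=lambda p: float(p.name[len(prefix):]))


def list_checkpoints(run_dir):
    """Return checkpoint files sorted by the timestep count in their names."""
    # Sort numerically: as plain strings, "100000" would come before "20000".
    return sorted(Path(run_dir).glob("*.zip"), key=lambda p: int(p.stem))
```

For example, `list_checkpoints(latest_run_dir())[-1]` gives the path of the last checkpoint from the most recent run, which can be passed to `PPO.load`. Which checkpoint is actually the best still has to be judged from TensorBoard.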

A good post about hyperparameters for PPO:

PPO Hyperparameters and Ranges https://readmedium.com/ppo-hyperparameters-and-ranges-6fc2d29bccbe