Setting up the Cliff Walking Environment for Reinforcement Learning (RL)

OpenAI Gym’s Cliff Walking environment is a classic reinforcement learning task in which an agent must navigate a grid world to reach a goal state while avoiding falling off a cliff. Cliff Walking is another of Gym’s toy text environments.

The agent starts at the bottom-left corner of the grid and must reach the bottom-right corner. The grid is composed of safe cells, which the agent can move through freely, and cliff cells, which the agent must avoid.
The agent can move in four directions: up, down, left, and right. If the agent steps into a cliff cell, it is returned to the starting position and receives a penalty of -100; every other step incurs a reward of -1 until the episode ends. The state, or observation, is the agent’s current position in the grid world. The goal is to find the optimal policy that maximizes the total reward, which in this environment amounts to finding the shortest safe path to the goal.
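For reference, CliffWalking-v0 flattens the 4 x 12 grid into a single integer state (state = row * 12 + col) and encodes the four moves as integer actions (0 = up, 1 = right, 2 = down, 3 = left). The short sketch below illustrates this encoding; the helper functions are our own and are not part of Gym:
# Illustration of CliffWalking-v0's state and action encoding.
# The 4 x 12 grid is flattened row by row, so state = row * 12 + col.
# These helpers are for illustration only; they are not part of the Gym API.
N_ROWS, N_COLS = 4, 12

def state_to_position(state):
    """Convert an integer state (0-47) into (row, col) coordinates."""
    return divmod(state, N_COLS)

def position_to_state(row, col):
    """Convert (row, col) coordinates into the integer state."""
    return row * N_COLS + col

# The agent starts at the bottom-left cell; the goal is at the bottom-right.
print(position_to_state(3, 0))   # 36 -- start state
print(position_to_state(3, 11))  # 47 -- goal state

# Actions are integers: 0 = up, 1 = right, 2 = down, 3 = left.
ACTION_NAMES = {0: 'up', 1: 'right', 2: 'down', 3: 'left'}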
Here is how to set up the Cliff Walking environment using Python and the OpenAI Gym library:
import gym
# Create the Cliff Walking environment
env = gym.make('CliffWalking-v0')
# Reset the environment to its initial state
observation = env.reset()
# Set the number of steps to take
num_steps = 10
# Take the given number of steps
for i in range(num_steps):
    # Render the environment to the screen
    env.render()
    # Choose a random action
    action = env.action_space.sample()
    # Take the action and get the next observation, reward, and done flag
    observation, reward, done, info = env.step(action)
    # Print some environmental values
    print(f'Step {i}: observation={observation}, '
          f'reward={reward}, done={done}, info={info}')
    # If the episode is over, reset the environment
    if done:
        observation = env.reset()
# Close the environment
env.close()
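Note that the listing above uses the classic Gym API. In newer Gym releases (0.26+) and in the successor package Gymnasium, reset() returns an (observation, info) tuple, step() returns five values with done split into terminated and truncated, and a render_mode must be passed at creation time. A minimal adaptation of the loop, assuming Gymnasium is installed, looks like this:
import gymnasium as gym

# Newer API: the render mode is specified when the environment is created
env = gym.make('CliffWalking-v0', render_mode='ansi')

# reset() now returns both an observation and an info dict
observation, info = env.reset()

for i in range(10):
    print(env.render())  # 'ansi' mode returns the rendered grid as a string
    action = env.action_space.sample()
    # step() now returns five values; 'done' is split into terminated/truncated
    observation, reward, terminated, truncated, info = env.step(action)
    print(f'Step {i}: observation={observation}, reward={reward}')
    if terminated or truncated:
        observation, info = env.reset()

env.close()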
Sample Output:

The x represents the agent’s current location during the episode. The o’s represent safe cells, and the C’s represent the dangerous cliff cells. The RL agent’s task is to learn the optimal path through this environment, which is intuitively obvious upon inspection. However, the agent must learn it from the reward structure alone, i.e., that it shouldn’t fall off the cliff or take a needlessly long path through the environment before stumbling into the goal state.
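To make that last point concrete, here is a minimal tabular Q-learning sketch for this environment, using the same classic Gym API as the listing above. The hyperparameter values (learning rate, discount factor, exploration rate, episode count) are illustrative choices rather than prescriptions; with settings in this range the greedy policy derived from the learned Q-table typically recovers the short path along the edge of the cliff.
import numpy as np
import gym

env = gym.make('CliffWalking-v0')

# Q-table: one row per state, one column per action
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha = 0.1       # learning rate (illustrative value)
gamma = 0.99      # discount factor
epsilon = 0.1     # exploration rate
num_episodes = 500

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        next_state, reward, done, info = env.step(action)
        # Q-learning update toward the one-step bootstrapped target
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

# Greedy policy with respect to the learned Q-table: one action per state
policy = np.argmax(q_table, axis=1)
env.close()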