AI Reinforcement Learning with OpenAI’s Gym

OpenAI’s Gym library for developing and comparing algorithms

What is Reinforcement Learning?

Reinforcement learning is like teaching a system to figure things out by trial and error. Imagine you’re teaching a dog new tricks. You give the dog a treat when it does something right and ignore it when it does something wrong. Over time, the dog learns which actions lead to rewards and which don’t.

In reinforcement learning, the model learns by trying different actions and seeing which ones lead to the best outcomes. It’s about training the model to make smart choices to reach a particular goal in different situations.

It operates on the principle of learning by interaction, where the agent receives feedback from its actions and adjusts its behaviour accordingly to maximise cumulative rewards.

This is quite different from the other machine learning and artificial intelligence models I’ve used and written about, where you train a model on preprocessed historical data only.

In the realm of finance, particularly trading, reinforcement learning holds significant promise. Here’s how it may be utilised:

Algorithmic Trading: Reinforcement learning algorithms can be employed to develop trading strategies that adapt to changing market conditions. By learning from historical data and real-time market signals, these algorithms can make informed decisions about when to buy, sell, or hold assets.
Risk Management: Reinforcement learning can assist in devising risk management strategies by optimising portfolio allocation and managing exposure to various assets. By considering factors such as volatility, correlation, and market trends, reinforcement learning algorithms can help traders mitigate risk and enhance returns.
Market Prediction: Reinforcement learning techniques can be used to forecast market trends and predict price movements. By analysing vast amounts of financial data and identifying patterns, these algorithms can generate insights into future market behaviour, aiding traders in making informed investment decisions.
High-Frequency Trading: In high-frequency trading, where speed is crucial, reinforcement learning algorithms can help optimise trading strategies to exploit fleeting opportunities in the market. By reacting quickly to changing market conditions, these algorithms can capitalise on small price discrepancies and generate profits.

Reinforcement learning offers a powerful framework for developing adaptive and intelligent trading systems in the finance domain, enabling traders to navigate complex markets more effectively and achieve their investment objectives.

I have to admit this was my first attempt at learning how reinforcement learning works in practice. It took a few weeks to get my head around the mechanics and how to develop something useful in Python. I found a really impressive Python library by OpenAI called Gym that simplified the process and improved my understanding a lot. If I have made any mistakes, or could have done anything in a better way, feel free to let me know in the comments.

What is OpenAI’s Gym library?

OpenAI’s Gym is an open-source library designed to provide a toolkit for developing and comparing reinforcement learning algorithms. It offers a wide range of environments for testing and benchmarking them, from simple grid worlds (basic environments represented as grids) to complex physics-based simulations.

Gym provides a common interface for interacting with these environments, making it easier for researchers and developers to experiment with various reinforcement learning algorithms and compare their performance. The library includes environments with discrete and continuous action spaces, as well as support for episodic and continuous tasks.

OpenAI’s Gym library comes with a variety of pre-built environments across different categories.

Just some examples…

Classic Control: Includes simple control tasks such as CartPole, where the goal is to balance a pole on a cart by moving left or right.
Atari: Offers classic Atari video game environments such as Pong, Breakout, and Space Invaders, where the agent learns to play the games by observing pixel inputs.
Box2D: Provides physics-based simulations using the Box2D physics engine, including tasks like LunarLander, where the agent must safely land a spacecraft on the moon.
MuJoCo: Utilises the MuJoCo physics engine for more complex control tasks, like Humanoid, where the agent controls a simulated humanoid robot.
Toy Text: Offers simple text-based environments like FrozenLake, where the agent navigates a grid-world to reach a goal while avoiding holes.

The pre-built environments did not include one for the reinforcement learning experiment I wanted to run, so I had to create my own. I will show you how to do this later. If you are going to try this out yourself, I highly recommend creating your own environment. It makes it much clearer what is happening, and it’s also very flexible.

In my case I wanted to create an environment for algorithmic trading, because the data for my experiment was easily available from EODHD APIs. I also think a trading example is a concept most people can understand and visualise. I like working with the EODHD APIs because their endpoints are easy to use, they return a lot of data per request, which is great for training models, and they have a vast amount of market data. I’ve had a subscription with them for years and highly recommend them.

Introducing Gym

You can install the Python “gym” library using pip. I recommend installing it within a virtual environment. It’s not mandatory, but it is good practice and avoids unexpected issues.

rl % python3 -m venv venv

rl % source venv/bin/activate

(venv) rl % python3 -m pip install --upgrade pip
Requirement already satisfied: pip in ./venv/lib/python3.11/site-packages (23.0.1)
Collecting pip
  Using cached pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.0.1
    Uninstalling pip-23.0.1:
      Successfully uninstalled pip-23.0.1
Successfully installed pip-24.0

(venv) rl % python3 -m pip install gym
Collecting gym
  Using cached gym-0.26.2-py3-none-any.whl
Collecting numpy>=1.18.0 (from gym)
  Downloading numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 2.0 MB/s eta 0:00:00
Collecting cloudpickle>=1.2.0 (from gym)
  Downloading cloudpickle-3.0.0-py3-none-any.whl.metadata (7.0 kB)
Collecting gym-notices>=0.0.4 (from gym)
  Downloading gym_notices-0.0.8-py3-none-any.whl.metadata (1.0 kB)
Downloading cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Downloading gym_notices-0.0.8-py3-none-any.whl (3.0 kB)
Downloading numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl (20.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.6/20.6 MB 7.2 MB/s eta 0:00:00
Installing collected packages: gym-notices, numpy, cloudpickle, gym
Successfully installed cloudpickle-3.0.0 gym-0.26.2 gym-notices-0.0.8 numpy-1.26.4

A “Classic Control” environment example may look like this:

import gym  # Import the Gym library

# Create the environment, in this case CartPole-v1.
# To visualise it, pass render_mode="human" here (gym 0.26+ selects rendering
# at creation time; this needs the extra "gym[classic_control]" dependencies).
env = gym.make("CartPole-v1")

# Reset the environment to its initial state.
# From gym 0.26, reset() returns the initial observation and an info dict.
observation, info = env.reset()

# Run the simulation for a certain number of steps
for t in range(100):
    # Take a random action from the action space
    action = env.action_space.sample()  # Randomly select an action
    # Apply the action. From gym 0.26, step() returns five values:
    # "terminated" (an end state was reached) and "truncated" (a time limit
    # cut the episode short) replace the single "done" flag.
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    # Print information about the current step
    print("Step:", t)
    print("Action:", action)
    print("Observation:", observation)
    print("Reward:", reward)
    print("Done:", done)
    print("Info:", info)

    # Check if the episode is done
    if done:
        print("Episode finished after {} timesteps".format(t+1))
        break

# Close the environment
env.close()

This is what is happening here…

“Classic Control” environment “CartPole-v1” created.
Reset the initial state, the first observation
Loop through 100 steps in this episode
Optionally render something during the process
In the example above the action being passed in is random, which is not helpful at all, but just gives you an example. The action will be numeric. Taking a trading example, it may be 0 for hold, 1 for buy, and 2 for sell.

A potential trading example…

obs = env.reset()
done = False
while not done:
    # Current prices and moving averages
    current_price = env.df.loc[env.current_step, "adjusted_close"]
    short_ma = env.df.loc[env.current_step, "short_ma"]
    long_ma = env.df.loc[env.current_step, "long_ma"]

    # Decide action based on moving average crossover strategy
    if short_ma > long_ma and env.position == 0:  # Golden cross - Buy signal
        action = 1
    elif short_ma < long_ma and env.position == 1:  # Death cross - Sell signal
        action = 2
    else:
        action = 0  # Hold

    obs, reward, done, info = env.step(action)
    env.render()

You may have spotted a flaw with this already: you are controlling the action in each step. Sure, this will work, but it is just basic technical analysis. When the fast moving average crosses above the slow moving average you buy, and you sell on the reverse. There is no learning happening here. What is missing is the reward for each trade (e.g. the profit or loss) being fed back into the loop so the model improves. It took me a little while to get this working, and I’ll explain how later. As this is quite an involved topic, I’m trying to build it up slowly so you can see what is happening.

Pass the action into the episode step, and the result of the action will be returned as the observation, reward (positive or negative), done (True if completed or False for still processing), and info which is optional info used for debugging.
Process the steps and then close.

You may notice that the environment has some methods like “reset”, “render”, “close”, etc. If we want to create our own environment we’ll need to implement these.

Like this…

class MyEnv:
    def __init__(self, input1, input2):
        self.input1 = input1
        self.input2 = input2

    def reset(self):
        return None

    def step(self, action):
        reward = 1
        done = True
        return "current_state", reward, done, {}

    def render(self):
        print("something useful")

    def close(self):
        pass
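
To see why those methods are enough, here is a minimal sketch of a toy environment driven by the same kind of loop as before. CountdownEnv and its reward of 1 per step are made up for illustration; they are not part of Gym.

```python
class CountdownEnv:
    """Hypothetical toy environment: the state counts down from `start` to 0."""

    def __init__(self, start=5):
        self.start = start
        self.remaining = start

    def reset(self):
        # Return the initial observation, like reset() in the skeleton above
        self.remaining = self.start
        return self.remaining

    def step(self, action):
        # The action is ignored here; a real environment would react to it
        self.remaining -= 1
        reward = 1
        done = self.remaining == 0
        return self.remaining, reward, done, {}

    def render(self):
        print(f"Remaining steps: {self.remaining}")

    def close(self):
        pass


env = CountdownEnv(start=5)
state = env.reset()
done = False
total_reward = 0
steps = 0
while not done:
    state, reward, done, info = env.step(action=0)
    total_reward += reward
    steps += 1
    env.render()
env.close()
print(f"Episode finished after {steps} steps, total reward {total_reward}")
```

The loop never needs to know what the environment does internally. It only relies on reset() returning a state and step() returning the state, reward, done flag and info.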

Creating your classes

As I mentioned above, when an action is completed and a reward is issued, that reward needs to be fed back into the process so the agent improves. You will need to create a class for this. Mine looks like this.

components/QLearningAgent.py

This is a “simple” reinforcement learning agent using Q-learning. Think of the agent like a decision-maker navigating a maze. It doesn’t initially know the best path to the reward, so it needs to explore and learn from its experiences. The QLearningAgent class stores a table (called the Q-table) where each row represents a state (a position in the maze), and each column represents an action (like moving up, down, left, or right).

The agent follows a strategy called “epsilon-greedy” when choosing actions. Sometimes, it picks a random action to explore new paths, and other times it uses past experience to choose the best-known action. It updates its Q-table after every move, refining its understanding of which actions lead to better rewards. Over time, the agent gradually reduces how much it explores randomly, instead relying more on the knowledge it’s gathered to consistently choose the most rewarding actions. Essentially, the agent learns the optimal way through trial and error, progressively finding better strategies to achieve the maximum reward.
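
The Q-table update is easier to follow with numbers. The sketch below works a single update by hand using the default learning_rate (0.01) and discount_factor (0.99) from the class that follows. The old value, reward and best next value are invented purely for the arithmetic.

```python
learning_rate = 0.01
discount_factor = 0.99

old_value = 0.0   # current Q-value for (state, action); illustrative
reward = 10.0     # reward received for the action; illustrative
next_max = 5.0    # best Q-value available in the next state; illustrative
done = False      # the episode has not ended

# Target = immediate reward + discounted best future value (zero once done)
target = reward + discount_factor * next_max * (not done)

# Blend the old estimate with the target, weighted by the learning rate
new_value = (1 - learning_rate) * old_value + learning_rate * target
print(f"target={target:.2f}, new_value={new_value:.4f}")
```

With these numbers the target is 14.95 and the Q-value moves from 0 to about 0.1495. A small learning rate means each experience nudges the table only slightly, so many repetitions are needed before the values settle.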

import numpy as np

class QLearningAgent:
    def __init__(self, n_actions, state_dim, learning_rate=0.01, discount_factor=0.99, exploration_rate=1.0, max_exploration_rate=1.0, min_exploration_rate=0.01, exploration_decay_rate=0.001):
        self.n_actions = n_actions
        self.state_dim = state_dim
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_rate = exploration_rate
        self.max_exploration_rate = max_exploration_rate
        self.min_exploration_rate = min_exploration_rate
        self.exploration_decay_rate = exploration_decay_rate
        self.q_table = np.zeros((state_dim, n_actions))

    def choose_action(self, state):
        if np.random.rand() < self.exploration_rate:
            action = np.random.randint(self.n_actions)
        else:
            action = np.argmax(self.q_table[state])
        print(f"Choosing action: {action} for state: {state}")  # Debug statement
        return action

    def update_policy(self, state, action, reward, next_state, done):
        old_value = self.q_table[state, action]
        next_max = np.max(self.q_table[next_state])

        new_value = (1 - self.learning_rate) * old_value + self.learning_rate * (reward + self.discount_factor * next_max * (not done))
        self.q_table[state, action] = new_value

        if self.exploration_rate > self.min_exploration_rate:
            self.exploration_rate -= self.exploration_decay_rate
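
One consequence of the defaults is worth checking: exploration_rate drops by exploration_decay_rate on every call to update_policy, not once per episode. The sketch below simply replays that subtraction with the default values and counts how many updates it takes to reach the minimum.

```python
exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

updates = 0
# Same condition and subtraction as in update_policy
while exploration_rate > min_exploration_rate:
    exploration_rate -= exploration_decay_rate
    updates += 1

print(f"Minimum exploration reached after {updates} updates")
```

That comes to roughly 990 updates, so if an episode runs for more than a thousand steps the agent is already almost fully greedy before the first episode ends. A smaller decay rate, or decaying once per episode, would keep it exploring for longer.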

components/TradingEnv.py

I created this environment class for my trading example. It should be flexible enough to adjust for other use cases as well. The point is to demonstrate the concept and give you a foundation to explore further. I like the trading example because I think it’s something most people will be able to understand and visualise.

There are a few important points I want to highlight…

self.action_space = spaces.Discrete(3)  # 0=hold, 1=buy, 2=sell

The action space defines the possible actions. They will be assigned a numerical value. For example, in my case to hold is 0, to buy is 1, and 2 is sell.
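
If you prefer names to bare numbers, a small IntEnum keeps the 0/1/2 encoding while making comparisons readable. This is an optional sketch; the Action class is hypothetical and is not used in the code in this article.

```python
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical names for the three discrete actions."""
    HOLD = 0
    BUY = 1
    SELL = 2


# IntEnum members compare equal to plain ints, so existing checks keep working
action = 1
if action == Action.BUY:
    print(f"{Action(action).name} has value {int(Action.BUY)}")
```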

self.observation_space = spaces.Box(low=0, high=1, shape=(len(df.columns),), dtype=np.float32)

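This line I’ve made dynamic so you shouldn’t need to change it, but I just want to point out that the observation space needs to match the dimensions of your data. I used “shape=(len(df.columns),)” to do this. Note that the Q-learning agent in this article does not consume that vector; the environment’s reset and step methods return one of three discrete states from get_discrete_state() instead.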

self.scaler = MinMaxScaler()
self.df_scaled = self.scaler.fit_transform(self.df[['open', 'high', 'low', 'close', 'adjusted_close', 'volume']])

Like most data science problems, you will almost always want to apply a scaler to the data. I used a MinMaxScaler in this case as I wanted all data to be scaled between 0 and 1. Be careful using something like a StandardScaler: it centres each column on zero with unit variance, so below-average values come out negative, which could look strange when dealing with trading data that is always positive. I’m not sure if it would affect the result, but debugging will be tricky if you see a negative price, for example.
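
To make the difference between the two kinds of scaling concrete, here is a plain-Python sketch of the underlying formulas on five made-up prices. It does not call scikit-learn; min_max_scale and standardise are hypothetical helpers that mirror the min-max and zero-mean, unit-variance calculations.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardise(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


prices = [100.0, 110.0, 120.0, 130.0, 140.0]  # illustrative prices
scaled = min_max_scale(prices)
standardised = standardise(prices)

print("min-max:     ", [round(x, 3) for x in scaled])
print("standardised:", [round(x, 3) for x in standardised])
```

The min-max values stay between 0 and 1, while the standardised values are centred on zero, so the below-average prices come out negative.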

The rest of the code should be self-explanatory. I’ve added comments to explain certain parts. If you have any questions, just ask in the comments and I’ll try to answer them for you.

import gym
from gym import spaces
import numpy as np
from sklearn.preprocessing import MinMaxScaler


class TradingEnv(gym.Env):
    metadata = {"render_modes": ["human"]}  # gym 0.26 uses the "render_modes" key

    def __init__(self, df, initial_balance=10000):
        super(TradingEnv, self).__init__()
        self.df = df
        self.initial_balance = initial_balance
        self.action_space = spaces.Discrete(3)  # 0=hold, 1=buy, 2=sell
        self.observation_space = spaces.Box(
            low=0, high=1, shape=(len(df.columns),), dtype=np.float32
        )
        self.scaler = MinMaxScaler()
        self.df_scaled = self.scaler.fit_transform(
            self.df[["open", "high", "low", "close", "adjusted_close", "volume"]]
        )
        self.df["short_ma"] = self.df["adjusted_close"].rolling(window=50).mean()
        self.df["long_ma"] = self.df["adjusted_close"].rolling(window=200).mean()
        self.reset()

    def reset(self):
        self.balance = self.initial_balance
        self.position = 0
        self.open_position_price = 0
        self.current_step = 0
        self.trade_open = False
        self.trade_summary = {}
        self.info = {"trade": "hold"}  # Default so render() works before the first step()
        return self.get_discrete_state()

    def _next_observation(self):
        return self.df_scaled[self.current_step]

    def get_discrete_state(self):
        current_price = self.df.loc[self.current_step, "adjusted_close"]
        short_ma = self.df.loc[self.current_step, "short_ma"]
        long_ma = self.df.loc[self.current_step, "long_ma"]

        if current_price > short_ma > long_ma:
            return 0  # Bullish signal
        elif current_price < short_ma < long_ma:
            return 1  # Bearish signal
        else:
            return 2  # Neutral

    def step(self, action):
        done = False
        self.current_step += 1
        if self.current_step >= len(self.df) - 1:
            done = True

        current_price = self.df.loc[self.current_step, "adjusted_close"]
        reward = 0
        trade_info = "hold"

        if action == 1 and self.position == 0:  # Buy
            self.position = 1
            self.open_position_price = current_price
            trade_info = "buy"
            self.trade_summary = {
                "open_price": current_price,
                "open_step": self.current_step,
            }
        elif action == 2 and self.position == 1:  # Sell
            profit = current_price - self.open_position_price
            reward = profit - abs(profit) * 0.01
            self.balance += profit
            self.position = 0
            trade_info = "sell"
            self.trade_summary.update(
                {
                    "close_price": current_price,
                    "close_step": self.current_step,
                    "profit": reward,  # Note we now use reward here, which includes the fee
                }
            )

        unrealized_profit = (
            current_price - self.open_position_price if self.position else 0
        )
        self.info = {
            "trade": trade_info,
            "open_position_price": self.open_position_price if self.position else None,
            "current_price": current_price,
            "unrealised_profit": unrealized_profit,
        }

        next_state = self.get_discrete_state()
        return next_state, reward, done, self.info

    def render(self, mode="human", close=False):
        trade_status = "open" if self.position else "closed"
        current_price = self.df.loc[self.current_step, "adjusted_close"]
        if self.position:
            unrealized_profit = current_price - self.open_position_price
        else:
            unrealized_profit = 0

        # General information about the current step
        print(
            f"Step: {self.current_step}, Balance: {self.balance:.2f}, "
            f'Open Trade: {trade_status}, Action: {self.info["trade"]}, '
            f"Current Price: {current_price:.2f}, "
            f"Unrealised Profit: {unrealized_profit:.2f}"
        )

        # Detailed trade summary when a position is closed
        if "profit" in self.trade_summary and not self.position:
            print(
                f'Trade Summary - Open Price: {self.trade_summary["open_price"]:.2f}, '
                f'Close Price: {self.trade_summary["close_price"]:.2f}, '
                f'Profit: {self.trade_summary["profit"]:.2f}, Steps Held: '
                f'{self.trade_summary["close_step"] - self.trade_summary["open_step"]}'
            )
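
The three states produced by get_discrete_state are easy to check in isolation. The sketch below copies its comparisons into a standalone function (classify_state is a hypothetical name) and feeds it a few hand-picked values, including the NaN moving averages that a pandas rolling mean produces before a full window is available.

```python
import math


def classify_state(price, short_ma, long_ma):
    """Mirror of the comparisons in get_discrete_state."""
    if price > short_ma > long_ma:
        return 0  # Bullish signal
    elif price < short_ma < long_ma:
        return 1  # Bearish signal
    else:
        return 2  # Neutral


bullish = classify_state(110.0, 105.0, 100.0)        # price above both averages
bearish = classify_state(90.0, 95.0, 100.0)          # price below both averages
mixed = classify_state(102.0, 105.0, 100.0)          # price between the averages
warm_up = classify_state(100.0, math.nan, math.nan)  # averages not yet available

print(bullish, bearish, mixed, warm_up)
```

Any comparison with NaN is False, so every row before the 200-day average exists (the first 199 rows) lands in the neutral state.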

The last part is the training code…

train.py

import sys
import warnings
import pandas as pd
import numpy as np
from eodhd import APIClient
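from components.TradingEnv import TradingEnv  # class lives in components/TradingEnv.py
from components.QLearningAgent import QLearningAgent  # class lives in components/QLearningAgent.py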
import config as cfg

api = APIClient(cfg.API_KEY)


def get_ohlc_data():
    df = api.get_historical_data("GSPC.INDX", "d", results=1825)  # 5 years of trading days

    # Remove features we don't need
    df.drop(columns=["symbol", "interval"], inplace=True)

    # Reset index
    df.reset_index(drop=True, inplace=True)

    return df


if __name__ == "__main__":
    df = get_ohlc_data()
    # df.to_csv("data/ohlc_data.csv", index=True)
    # df = pd.read_csv("data/ohlc_data.csv", index_col=0)

    env = TradingEnv(df)
    state_dim = 3  # Three discrete states (bullish, bearish, neutral)
    n_actions = env.action_space.n
    agent = QLearningAgent(n_actions, state_dim)

    n_episodes = 100  # Run for a finite number of episodes
    max_steps_per_episode = len(df)  # Limit the number of steps per episode if necessary

    for episode in range(n_episodes):
        state = env.reset()
        done = False
        total_reward = 0
        steps = 0

        while not done and steps < max_steps_per_episode:
            action = agent.choose_action(state)
            next_state, reward, done, info = env.step(action)
            agent.update_policy(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            steps += 1
            env.render()  # This needs to be called here

        print(f"Episode: {episode}, Total reward: {total_reward:.2f}, Final balance: {env.balance:.2f}, Exploration rate: {agent.exploration_rate:.4f}, Steps: {steps}")

Hopefully most of this is self-explanatory, and I’ve added comments where necessary.

The episodes (n_episodes) and max steps per episode (max_steps_per_episode) need some explaining…

Episodes:

An episode is a single trial or attempt where the agent interacts with the environment from a starting point to some end goal.
During an episode, the agent takes a sequence of actions, aiming to maximise its reward or achieve a specific goal.
After reaching the goal or exhausting a certain number of steps (called “max steps per episode”), the episode ends, and a new one starts.

Max Steps per Episode:

This is the maximum number of actions the agent can take before the episode ends.
If the agent reaches the goal sooner, it ends early. Otherwise, it stops after reaching the limit set by “max steps per episode.”
In my case I’ve set the max to the number of days/rows in the trading data.
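
The two ways an episode can end, the environment reporting done or the step cap being hit, can be sketched with a toy environment. FixedLengthEnv and run_episode are hypothetical names used only for this illustration.

```python
class FixedLengthEnv:
    """Hypothetical environment that reports done after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 0.0, done, {}


def run_episode(env, max_steps_per_episode):
    """Same loop condition as train.py: stop on done or on the step cap."""
    env.reset()
    done = False
    steps = 0
    while not done and steps < max_steps_per_episode:
        _, _, done, _ = env.step(action=0)
        steps += 1
    return steps, done


ended_by_env = run_episode(FixedLengthEnv(length=10), max_steps_per_episode=50)
ended_by_cap = run_episode(FixedLengthEnv(length=10), max_steps_per_episode=5)
print(ended_by_env, ended_by_cap)
```

In train.py the cap equals len(df), and TradingEnv sets done once it reaches the last row, so in practice the environment ends the episode before the cap is reached.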

Conclusion

I added some print statements to help show what is happening during the training process. It will show when buys are executed, the unrealised profit of open trades, and the sell.

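It will finish with something that looks like this (the episode counter reaches 999 here, so this line comes from a run with more episodes than the 100 set in train.py above):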

Episode: 999, Total reward: 1763.29, Final balance: 11808.21, Exploration rate: 0.0100, Steps: 1257

At a high level, my trial started with £10,000 and ended with £11,808.21. That’s promising at least.

I also included the total reward calculated. Successful trades result in a positive reward, and unsuccessful trades result in a negative reward. The total reward in this example is 1763.29.
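
The balance gain (1,808.21) and the total reward (1,763.29) differ because of how step() books a sale: the balance receives the raw profit, while the reward is the profit less a 1% charge on its absolute size. Here is a short sketch of that arithmetic for one winning and one losing trade, with made-up prices (book_sale is a hypothetical helper):

```python
def book_sale(open_price, close_price, fee_rate=0.01):
    """Mirror of the sell branch in TradingEnv.step()."""
    profit = close_price - open_price
    reward = profit - abs(profit) * fee_rate  # reward is net of the charge
    return profit, reward


balance = 10000.0
total_reward = 0.0
for open_price, close_price in [(100.0, 110.0), (100.0, 90.0)]:  # illustrative trades
    profit, reward = book_sale(open_price, close_price)
    balance += profit        # the balance is updated with the raw profit
    total_reward += reward   # the agent learns from the net figure
    print(f"profit={profit:+.2f}, reward={reward:+.2f}")

print(f"balance={balance:.2f}, total_reward={total_reward:.2f}")
```

Two trades that cancel out leave the balance unchanged but the total reward slightly negative, which is why the total reward ends up a little lower than the change in balance.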

Hopefully this gives you some idea of how this could be used for different use cases.
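
If you want to see what the agent actually learned, the Q-table can be read directly: for each state, the greedy action is the column with the largest value. The sketch below shows the idea on a hand-written table. The numbers are invented for illustration and are not results from this experiment; with the real agent you would iterate over agent.q_table instead.

```python
STATE_NAMES = ["bullish", "bearish", "neutral"]  # order used by get_discrete_state
ACTION_NAMES = ["hold", "buy", "sell"]           # order used by the action space

# Illustrative Q-table: one row per state, one column per action
q_table = [
    [0.2, 1.5, -0.3],
    [0.1, -0.8, 0.9],
    [0.4, 0.0, 0.0],
]

policy = {}
for state_name, row in zip(STATE_NAMES, q_table):
    best_action = row.index(max(row))  # position of the largest Q-value
    policy[state_name] = ACTION_NAMES[best_action]

for state_name, action_name in policy.items():
    print(f"{state_name}: {action_name}")
```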

I hope you found this article interesting and useful. If you would like to be kept informed, please don’t forget to follow me and sign up to my email notifications.

If you liked this article, I recommend checking out EODHD APIs on Medium. They have some interesting articles.

Michael Whittle
