<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-built_in">print</span>(<span class="hljs-string">'numpy: %s'</span> % np.__version__) <span class="hljs-comment"># print version</span>
<span class="hljs-comment"># Note: this needs 'pip install gym' and 'pip install gym[toy_text]',</span>
<span class="hljs-comment"># or 'pip install "gym[toy_text]"' (with quotes) if zsh does not recognize the first command</span>
<span class="hljs-keyword">import</span> gym <span class="hljs-comment"># for simulated environments</span>
<span class="hljs-built_in">print</span>(<span class="hljs-string">'gym: %s'</span> % gym.__version__) <span class="hljs-comment"># print version</span>
<span class="hljs-keyword">import</span> matplotlib
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt <span class="hljs-comment"># for displaying environment states</span>
<span class="hljs-built_in">print</span>(<span class="hljs-string">'matplotlib: %s'</span> % matplotlib.__version__) <span class="hljs-comment"># print version</span>
<span class="hljs-keyword">from</span> IPython <span class="hljs-keyword">import</span> display <span class="hljs-comment"># for displaying environment states</span>
<span class="hljs-keyword">import</span> time <span class="hljs-comment"># for slowing down rendering of states by adding small time delays</span></pre></div><p id="a397">The above code prints package versions used in this example:</p><div id="1d37"><pre><span class="hljs-attribute">numpy</span>: <span class="hljs-number">1</span>.<span class="hljs-number">23</span>.<span class="hljs-number">3</span>
<span class="hljs-attribute">gym</span>: <span class="hljs-number">0</span>.<span class="hljs-number">26</span>.<span class="hljs-number">0</span>
<span class="hljs-attribute">matplotlib</span>: <span class="hljs-number">3</span>.<span class="hljs-number">6</span>.<span class="hljs-number">0</span></pre></div><p id="9250">Next, we set up a Frozen-Lake environment:</p><div id="faad"><pre><span class="hljs-comment"># Setup environment</span>
env = gym.make(<span class="hljs-built_in">id</span>=<span class="hljs-string">'FrozenLake-v1'</span>, <span class="hljs-comment"># Choose one of the existing environments</span>
    desc=<span class="hljs-literal">None</span>, <span class="hljs-comment"># Used to specify a custom map for frozen lake. E.g., desc=["SFFF", "FHFH", "FFFH", "HFFG"].</span>
    map_name=<span class="hljs-string">'4x4'</span>, <span class="hljs-comment"># ID to use any of the preloaded maps. E.g., '4x4', '8x8'</span>
    is_slippery=<span class="hljs-literal">False</span>, <span class="hljs-comment"># True/False. If True, the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3.</span>
    max_episode_steps=<span class="hljs-literal">None</span>, <span class="hljs-comment"># default=None, Maximum length of an episode (TimeLimit wrapper).</span>
    autoreset=<span class="hljs-literal">False</span>, <span class="hljs-comment"># default=False, Whether to automatically reset the environment after each episode (AutoResetWrapper).</span>
    disable_env_checker=<span class="hljs-literal">None</span>, <span class="hljs-comment"># default=None, Whether to run the env checker</span>
    render_mode=<span class="hljs-string">'rgb_array'</span> <span class="hljs-comment"># The set of supported modes varies per environment. (And some third-party environments may not support rendering at all.)</span>
)</pre></div><p id="541d">Note that we use a non-slippery version of the game in this example. Meanwhile, the rendering options are as follows:</p><ul><li><b>None (default):</b> no render is computed.</li><li><b>human:</b> render() returns None. The environment is continuously rendered in the current display or terminal, usually for human consumption.</li><li><b>rgb_array:</b> returns frames representing the state of the environment. A frame is a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image.</li><li><b>ansi:</b> returns a list of strings (str) or StringIO.StringIO containing a terminal-style text representation for each time step. The text can include newlines and ANSI escape sequences (e.g. for colours).</li></ul><p id="4f8f">Let’s check the environment description, state space and action space of the environment that we have set up above:</p><div id="83f0"><pre><span class="hljs-comment"># Show environment description (map) as an array</span>
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Environment Array: "</span>)
<span class="hljs-built_in">print</span>(env.desc)
<span class="hljs-comment"># Observation and action space </span>
state_obs_space = env.observation_space <span class="hljs-comment"># Returns state (observation) space of the environment.</span>
action_space = env.action_space <span class="hljs-comment"># Returns action space of the environment.</span>
<span class="hljs-built_in">print</span>(<span class="hljs-string">"State(Observation) space:"</span>, state_obs_space)
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Action space:"</span>, action_space)</pre></div><div id="195f"><pre><span class="hljs-symbol">Environment</span> <span class="hljs-symbol">Array</span>:
[[b<span class="hljs-string">'S'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'F'</span>]
[b<span class="hljs-string">'F'</span> b<span class="hljs-string">'H'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'H'</span>]
[b<span class="hljs-string">'F'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'H'</span>]
[b<span class="hljs-string">'H'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'F'</span> b<span class="hljs-string">'G'</span>]]</pre></div><div id="2f5b"><pre><span class="hljs-function"><span class="hljs-title">State</span>(<span class="hljs-variable">Observation</span>) <span class="hljs-variable">space</span>: <span class="hljs-title">Discrete</span>(<span class="hljs-number">16</span>)</span>
<span class="hljs-variable">Action</span> <span class="hljs-variable">space</span>: <span class="hljs-function"><span class="hljs-title">Discrete</span>(<span class="hljs-number">4</span>)</span></pre></div><p id="3c7c">You can see that the environment matches the layout shown in the previous section. However, you can always specify your own map using the <b>desc</b> option of gym.make(). Reminder: S = Start, F = Frozen, H = Hole, G = Goal.</p><p id="8c04">Also, as expected, the state space contains 16 discrete states (4x4), and the action space has 4 discrete actions (0: LEFT, 1: DOWN, 2: RIGHT, 3: UP).</p><p id="0138">As a last check, before we do any training, we will let our agent loose in its environment (taking random actions at each step) and render it in the Jupyter Notebook.</p><div id="89ee"><pre><span class="hljs-comment"># Reset environment to initial state</span>
state, info = env.reset()
<span class="hljs-comment"># Cycle through 20 random steps, rendering and displaying the agent inside the environment each time</span>
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">20</span>):
    <span class="hljs-comment"># Render and display current state of the environment</span>
    plt.imshow(env.render()) <span class="hljs-comment"># render current state and pass to pyplot</span>
    plt.axis(<span class="hljs-string">'off'</span>)
    display.display(plt.gcf()) <span class="hljs-comment"># get current figure and display</span>
    display.clear_output(wait=<span class="hljs-literal">True</span>) <span class="hljs-comment"># clear output before showing the next frame</span>
    <span class="hljs-comment"># Sample a random action from the entire action space</span>
    random_action = env.action_space.sample()
    <span class="hljs-comment"># Pass the random action into the step function</span>
    state, reward, done, _, info = env.step(random_action)
    <span class="hljs-comment"># Wait a little bit before the next frame</span>
    time.sleep(<span class="hljs-number">0.2</span>)
    <span class="hljs-comment"># Reset environment when done=True, i.e., when the agent falls into a Hole (H) or reaches the Goal (G)</span>
    <span class="hljs-keyword">if</span> done:
        <span class="hljs-comment"># Render and display current state of the environment</span>
        plt.imshow(env.render()) <span class="hljs-comment"># render current state and pass to pyplot</span>
        plt.axis(<span class="hljs-string">'off'</span>)
        display.display(plt.gcf()) <span class="hljs-comment"># get current figure and display</span>
        display.clear_output(wait=<span class="hljs-literal">True</span>) <span class="hljs-comment"># clear output before showing the next frame</span>
        <span class="hljs-comment"># Reset environment</span>
        state, info = env.reset()
<span class="hljs-comment"># Close environment </span>
env.close()</pre></div><p id="34d9">Here is the output from the above code:</p><figure id="ff81"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mzY3hQpmWJJHFHA1vWzQ3A.gif"><figcaption>Gif image created by the <a href="https://solclover.com/">author</a> using the components from the <a href="https://www.gymlibrary.dev/environments/toy_text/frozen_lake/?highlight=frozen+lake">Frozen-Lake game</a>.</figcaption></figure><h2 id="8459">Training the Q-function to find the best policy</h2><p id="90aa">With the setup complete, let’s use <b>Q-Learning</b> to find the best <b>policy(𝜋)</b> for our agent to follow in this game.</p><p id="3ecc">We start by initialising a few parameters:</p><div id="0228"><pre><span class="hljs-comment"># Q-function parameters</span>
alpha = <span class="hljs-number">0.7</span> <span class="hljs-comment"># learning rate</span>
gamma = <span class="hljs-number">0.95</span> <span class="hljs-comment"># discount factor</span>
<span class="hljs-comment"># Training parameters</span>
n_episodes = <span class="hljs-number">10000</span> <span class="hljs-comment"># number of episodes to use for training</span>
n_max_steps = <span class="hljs-number">100</span> <span class="hljs-comment"># maximum number of steps per episode</span>
<span class="hljs-comment"># Exploration / Exploitation parameters</span>
start_epsilon = <span class="hljs-number">1.0</span> <span class="hljs-comment"># start training by selecting purely random actions</span>
min_epsilon = <span class="hljs-number">0.05</span> <span class="hljs-comment"># the lowest epsilon allowed to decay to</span>
decay_rate = <span class="hljs-number">0.001</span> <span class="hljs-comment"># epsilon will gradually decay so we do less exploring and more exploiting as Q-function improves</span></pre></div><p id="21d3">Note that we will vary epsilon throughout training. We will start with epsilon=1, meaning that our agent’s actions will be all random at the beginning. However, we will decay epsilon with every episode, so our agent gradually moves from pure exploration to exploitation.</p><p id="ecca">Next, let’s initialise the Q-table. As we’ve seen in the previous section, it will be a 16x4 table where 16 rows represent 16 states, and 4 columns represent 4 possible actions. We initialise the Q-table with all 0’s since we do not know how valuable each state is before we start the training.</p><div id="c68a"><pre><span class="hljs-comment"># Initial Q-table</span>
<span class="hljs-comment"># Our Q-table is a matrix of state(observation) space x action space, i.e., 16 x 4</span>
Qtable = np.zeros((env.observation_space.n, env.action_space.n))
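# --- Illustrative sketch (not part of the original notebook) ---
# Each Q-table row corresponds to one grid cell. FrozenLake numbers its cells
# row by row (state = row * n_cols + col), so a state index can be turned back
# into grid coordinates with divmod. The helper name state_to_cell and its
# n_cols default are assumptions made for the 4x4 map used in this example.
def state_to_cell(state, n_cols=4):
    return divmod(state, n_cols)  # (row, col)

print(state_to_cell(0))   # the Start cell (S) in the top-left corner
print(state_to_cell(15))  # the Goal cell (G) in the bottom-right corner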
<span class="hljs-comment"># Show</span>
Qtable</pre></div><p id="ea34">The above code displays the initialised Q-table:</p><div id="3f0b"><pre>array(<span class="hljs-comment">[<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>,
<span class="hljs-comment">[0., 0., 0., 0.]</span>]</span>)</pre></div><p id="e189">Recall that Q-Learning is an <b>off-policy</b> algorithm. Hence, we will define one function for <b>acting</b> (epsilon_greedy) and another for <b>updating</b> the Q-table (update_Q). The updating policy uses a greedy approach, i.e. no exploration.</p><p id="ca0d">You should be able to spot that the update_Q function contains the Q-Learning algorithm equation analysed in the previous section.</p><div id="4efa"><pre><span class="hljs-comment"># This is our acting policy (epsilon-greedy), for the agent to do exploration and exploitation during training</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">epsilon_greedy</span>(<span class="hljs-params">Qtable, state, epsilon</span>):
    <span class="hljs-comment"># Generate a random number and compare it to epsilon: if lower, explore; otherwise, exploit</span>
    randnum = np.random.uniform(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>)
    <span class="hljs-keyword">if</span> randnum < epsilon:
        action = env.action_space.sample() <span class="hljs-comment"># explore</span>
    <span class="hljs-keyword">else</span>:
        action = np.argmax(Qtable[state, :]) <span class="hljs-comment"># exploit</span>
    <span class="hljs-keyword">return</span> action
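# --- Illustrative sketch (not part of the original notebook) ---
# How the epsilon passed to epsilon_greedy shrinks over episodes, assuming the
# decay formula used in the training function and the parameter values from
# this article. The helper name epsilon_for_episode is hypothetical.
import math

def epsilon_for_episode(episode, start_epsilon=1.0, min_epsilon=0.05, decay_rate=0.001):
    return max(min_epsilon, (start_epsilon - min_epsilon) * math.exp(-decay_rate * episode))

for ep in (0, 1000, 3000):
    print(ep, round(epsilon_for_episode(ep), 4))
# With this formula the schedule starts at 0.95 rather than 1.0 (min_epsilon is
# subtracted but never added back) and has reached the 0.05 floor by episode 3000.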
<span class="hljs-comment"># This is our updating policy (greedy)</span>
<span class="hljs-comment"># i.e., always select the action with the highest value for that state: np.max(Qtable[next_state])</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">update_Q</span>(<span class="hljs-params">Qtable, state, action, reward, next_state</span>):
    <span class="hljs-comment"># Q(S_t,A_t) = Q(S_t,A_t) + alpha [R_t+1 + gamma * max Q(S_t+1,a) - Q(S_t,A_t)]</span>
    Qtable[state][action] = Qtable[state][action] + alpha * (reward + gamma * np.<span class="hljs-built_in">max</span>(Qtable[next_state]) - Qtable[state][action])
    <span class="hljs-keyword">return</span> Qtable
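# --- Illustrative sketch (not part of the original notebook) ---
# A minimal worked example of the update rule above on a hypothetical
# 2-state, 2-action table. All demo_* names and values are assumptions chosen
# only to make the arithmetic easy to follow by hand.
import numpy as np

demo_alpha, demo_gamma = 0.7, 0.95
demo_Q = np.zeros((2, 2))
demo_Q[1] = [0.0, 1.0]  # pretend the best action in the next state is worth 1.0
# Take action 0 in state 0, receive reward 0 and land in state 1:
td_target = 0.0 + demo_gamma * np.max(demo_Q[1])      # 0 + 0.95 * 1.0 = 0.95
td_error = td_target - demo_Q[0, 0]                    # 0.95 - 0.0 = 0.95
demo_Q[0, 0] = demo_Q[0, 0] + demo_alpha * td_error    # 0.0 + 0.7 * 0.95, roughly 0.665
print(demo_Q[0, 0])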
<span class="hljs-comment"># This function (also greedy) will return the action from Qtable when we do evaluation</span>
<span class="hljs-keyword">def</span> <span class="hljs-title function_">eval_greedy</span>(<span class="hljs-params">Qtable, state</span>):
    action = np.argmax(Qtable[state, :])
    <span class="hljs-keyword">return</span> action</pre></div><p id="229f">Finally, let’s define our training function:</p><div id="e7a7"><pre><span class="hljs-keyword">def</span> <span class="hljs-title function_">train</span>(<span class="hljs-params">n_episodes, n_max_steps, start_epsilon, min_epsilon, decay_rate, Qtable</span>):
    <span class="hljs-keyword">for</span> episode <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(n_episodes):
        <span class="hljs-comment"># Reset the environment at the start of each episode</span>
        state, info = env.reset()
        t = <span class="hljs-number">0</span>
        done = <span class="hljs-literal">False</span>
        <span class="hljs-comment"># Calculate epsilon value based on decay rate</span>
        epsilon = <span class="hljs-built_in">max</span>(min_epsilon, (start_epsilon - min_epsilon)*np.exp(-decay_rate*episode))
        <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(n_max_steps):
            <span class="hljs-comment"># Choose an action using previously defined epsilon greedy policy</span>
            action = epsilon_greedy(Qtable, state, epsilon)
            <span class="hljs-comment"># Perform the action in the environment, get reward and next state</span>
            next_state, reward, done, _, info = env.step(action)
            <span class="hljs-comment"># Update Q-table</span>
            Qtable = update_Q(Qtable, state, action, reward, next_state)
            <span class="hljs-comment"># Update current state</span>
            state = next_state
            <span class="hljs-comment"># Finish the episode when done=True, i.e., reached the goal or fallen into a hole</span>
            <span class="hljs-keyword">if</span> done:
                <span class="hljs-keyword">break</span>
    <span class="hljs-comment"># Return final Q-table</span>
    <span class="hljs-keyword">return</span> Qtable</pre></div><p id="ad78">Now let’s call the training function and see the results:</p><div id="1c72"><pre><span class="hljs-comment"># Train</span>
Qtable = train(n_episodes, n_max_steps, start_epsilon, min_epsilon, decay_rate, Qtable)
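# --- Illustrative sketch (not part of the original notebook) ---
# One way to read a policy out of a Q-table: take the highest-valued action in
# every row. The helper name greedy_policy and the 2-row demo table are
# assumptions; the action names follow the mapping given earlier in the article.
import numpy as np

ACTION_NAMES = ["LEFT", "DOWN", "RIGHT", "UP"]

def greedy_policy(q_table):
    return [ACTION_NAMES[a] for a in np.argmax(q_table, axis=1)]

demo_table = np.array([[0.1, 0.9, 0.2, 0.0],
                       [0.0, 0.0, 1.0, 0.5]])
print(greedy_policy(demo_table))  # greedy_policy(Qtable) would do the same for the trained table
# Rows that are all zeros (holes and the goal) map to action 0 (LEFT), because
# np.argmax returns the first maximum; those entries carry no real preference.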
<span class="hljs-comment"># Show Q-table</span>
Qtable</pre></div><p id="05f6">Following the training, we get the optimised Q-table, which matches the results we showed in the previous section. Now the agent can use it to always reach the Goal without falling into a Hole.</p><div id="b199"><pre>array([[<span class="hljs-number">0.73509189</span>, <span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.73509189</span>],
[<span class="hljs-number">0.73509189</span>, <span class="hljs-number">0</span>. , <span class="hljs-number">0.81450625</span>, <span class="hljs-number">0.77378094</span>],
[<span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.857375</span> , <span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.81450625</span>],
[<span class="hljs-number">0.81450625</span>, <span class="hljs-number">0</span>. , <span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.77378094</span>],
[<span class="hljs-number">0.77378094</span>, <span class="hljs-number">0.81450625</span>, <span class="hljs-number">0</span>. , <span class="hljs-number">0.73509189</span>],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. ],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>.<span class="hljs-number">9025</span> , <span class="hljs-number">0</span>. , <span class="hljs-number">0.81450625</span>],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. ],
[<span class="hljs-number">0.81450625</span>, <span class="hljs-number">0</span>. , <span class="hljs-number">0.857375</span> , <span class="hljs-number">0.77378094</span>],
[<span class="hljs-number">0.81450625</span>, <span class="hljs-number">0</span>.<span class="hljs-number">9025</span> , <span class="hljs-number">0</span>.<span class="hljs-number">9025</span> , <span class="hljs-number">0</span>. ],
[<span class="hljs-number">0.857375</span> , <span class="hljs-number">0</span>.<span class="hljs-number">95</span> , <span class="hljs-number">0</span>. , <span class="hljs-number">0.857375</span> ],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. ],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. ],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>.<span class="hljs-number">9025</span> , <span class="hljs-number">0</span>.<span class="hljs-number">95</span> , <span class="hljs-number">0.857375</span> ],
[<span class="hljs-number">0</span>.<span class="hljs-number">9025</span> , <span class="hljs-number">0</span>.<span class="hljs-number">95</span> , <span class="hljs-number">1</span>. , <span class="hljs-number">0</span>.<span class="hljs-number">9025</span> ],
[<span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. , <span class="hljs-number">0</span>. ]])</pre></div><h2 id="644a">Evaluation</h2><p id="3092">Let’s evaluate this policy by running a few simulations and checking if the agent always manages to get the maximum reward.</p><div id="6b9d"><pre><span class="hljs-keyword">def</span> <span class="hljs-title function_">evaluate_agent</span>(<span class="hljs-params">n_max_steps, n_eval_episodes, Qtable</span>):
    <span class="hljs-comment"># Initialize an empty list to store rewards for each episode</span>
    episode_rewards=[]
    <span class="hljs-comment"># Evaluate for each episode</span>
    <span class="hljs-keyword">for</span> episode <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(n_eval_episodes):
        <span class="hljs-comment"># Reset the environment at the start of each episode</span>
        state, info = env.reset()
        t = <span class="hljs-number">0</span>
        done = <span class="hljs-literal">False</span>
        tot_episode_reward = <span class="hljs-number">0</span>
        <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(n_max_steps):
            <span class="hljs-comment"># Use greedy policy to evaluate</span>
            action = eval_greedy(Qtable, state)
            <span class="hljs-comment"># Pass action into step function</span>
            next_state, reward, done, _, info = env.step(action)
            <span class="hljs-comment"># Sum episode rewards</span>
            tot_episode_reward += reward
            <span class="hljs-comment"># Update current state</span>
            state = next_state
            <span class="hljs-comment"># Finish the episode when done=True, i.e., reached the goal or fallen into a hole</span>
            <span class="hljs-keyword">if</span> done:
                <span class="hljs-keyword">break</span>
        episode_rewards.append(tot_episode_reward)
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    <span class="hljs-keyword">return</span> mean_reward, std_reward
<span class="hljs-comment"># Call the above evaluation function and display the results:</span>
n_eval_episodes=<span class="hljs-number">100</span> <span class="hljs-comment"># evaluate over 100 episodes</span>
mean_reward, std_reward = evaluate_agent(n_max_steps, n_eval_episodes, Qtable)
<span class="hljs-built_in">print</span>(<span class="hljs-string">f"Mean Reward = <span class="hljs-subst">{mean_reward:<span class="hljs-number">.2</span>f}</span> +/- <span class="hljs-subst">{std_reward:<span class="hljs-number">.2</span>f}</span>"</span>)</pre></div><p id="2118">The above code prints the following results:</p><div id="36fc"><pre><span class="hljs-attribute">Mean</span> Reward = <span class="hljs-number">1</span>.<span class="hljs-number">00</span> +/- <span class="hljs-number">0</span>.<span class="hljs-number">00</span></pre></div><p id="2860">As you can see, in each of the 100 episodes tested, the agent managed to get the maximum reward (1.00).</p><p id="91a5">Let’s also evaluate it visually by making the agent follow the policy and rendering it on the screen:</p><div id="c0b5"><pre><span class="hljs-comment"># Cycle through 19 steps, rendering and displaying the environment state each time</span>
state, info = env.reset()
<span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">19</span>):
    <span class="hljs-comment"># Render and display current state of the environment</span>
    plt.imshow(env.render()) <span class="hljs-comment"># render current state and pass to pyplot</span>
    plt.axis(<span class="hljs-string">'off'</span>)
    display.display(plt.gcf()) <span class="hljs-comment"># get current figure and display</span>
    display.clear_output(wait=<span class="hljs-literal">True</span>) <span class="hljs-comment"># clear output before showing the next frame</span>
    <span class="hljs-comment"># Use greedy policy to evaluate</span>
    action = eval_greedy(Qtable, state)
    <span class="hljs-comment"># Pass action into step function</span>
    state, reward, done, _, info = env.step(action)
    <span class="hljs-comment"># Wait a little bit before the next frame</span>
    time.sleep(<span class="hljs-number">0.2</span>)
    <span class="hljs-comment"># Reset environment when done=True, i.e. when the agent falls into a Hole (H) or reaches the Goal (G)</span>
    <span class="hljs-keyword">if</span> done:
        <span class="hljs-comment"># Render and display final state of the environment</span>
        plt.imshow(env.render()) <span class="hljs-comment"># render current state and pass to pyplot</span>
        plt.axis(<span class="hljs-string">'off'</span>)
        display.display(plt.gcf()) <span class="hljs-comment"># get current figure and display</span>
        display.clear_output(wait=<span class="hljs-literal">True</span>) <span class="hljs-comment"># clear output before showing the next frame</span>
        state, info = env.reset()
env.close()</pre></div><p id="2e45">The results are as expected and match what we saw in the previous section:</p><figure id="3943"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-JdsniM1sZ18YpFHcDUJcw.gif"><figcaption>Gif image created by the <a href="https://solclover.com/">author</a> using the components from the <a href="https://www.gymlibrary.dev/environments/toy_text/frozen_lake/?highlight=frozen+lake">Frozen-Lake game</a>.</figcaption></figure><h1 id="f77e">Final remarks</h1><p id="b3d0">We have successfully employed Q-Learning to find the best policy for the agent to use in the Frozen-Lake game. I hope the game’s simple nature made it easy to understand how Q-Learning works.</p><p id="7ac2">Please don’t forget to <a href="https://solclover.com/subscribe">subscribe</a>, so you get to <b>learn from my upcoming articles</b> about other Reinforcement Learning algorithms and see how to apply them to different environments using <b>Python!</b></p><p id="e75f">You can find the complete Python code used in this article as a Jupyter Notebook in my <a href="https://github.com/SolClover/Art056_RL_Q_Learning"><b>GitHub repository</b></a>.</p><p id="1c07">Cheers! 🤓
<b>Saul Dobilas</b></p></article></body>