nt does during training, we can also use it for testing and debugging our fully trained agents. For example, after training our agent to solve the simple 3x3 gridworld described above, we can provide it with some special test scenarios it had never encountered during the training process to evaluate whether it really is representing experience as we would expect it to.</p><p id="e874">Below is an example of the agent performing a modified version of the task with only green squares. As you can see, as the agent gets closer to the green squares the value estimate increases just as we would expect. It also has high estimates of the advantage for taking actions that get it closer to the green goals.</p>
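<p>To make the relationship between the value and advantage estimates a little more concrete, here is a minimal Python sketch. It assumes the standard dueling aggregation, in which the Q-value of each action is the state value plus that action's mean-centered advantage. The function name <code>dueling_q_values</code>, the action labels, and all of the numbers are hypothetical and chosen only for illustration; none of them come from the trained agent or from the Control Center code.</p>

```python
def dueling_q_values(value, advantages):
    """Combine a state value V(s) and per-action advantages A(s, a) into Q(s, a).

    Assumes the common dueling aggregation: Q = V + (A - mean(A)).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


# Hypothetical estimates for one state: a green square sits just to the right.
actions = ["up", "down", "left", "right"]
value = 0.8                            # made-up state value estimate
advantages = [-0.1, -0.1, -0.3, 0.5]   # made-up per-action advantage estimates

q_values = dueling_q_values(value, advantages)

# The greedy action is the one with the largest Q-value.
best_action = actions[q_values.index(max(q_values))]
print(best_action, q_values)
```

<p>With these made-up numbers, the action that moves toward the green square has the largest advantage and therefore the largest Q-value, which is the pattern we hope to see from a well-trained agent in the test scenarios.</p>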
<figure id="90c4">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2FWideUnrulyAfricanbushviper&url=https%3A%2F%2Fgfycat.com%2FWideUnrulyAfricanbushviper&image=https%3A%2F%2Fthumbs.gfycat.com%2FWideUnrulyAfricanbushviper-size_restricted.gif&key=d04bfffea46d4aeda930ec88cc64b87c&type=text%2Fhtml&schema=gfycat" allowfullscreen="" frameborder="0" height="1128" width="1236">
</iframe>
</div>
</div>
</figure><p id="fc77">For the next test we can invert the situation, giving the agent a world in which there were only two red squares. It didn’t like this very much. As you can see below, the agent attempts to stay away from either square, resulting in behavior where it goes back and forth for a long period of time. Notice how the value estimate decreases as the agent approaches the red squares.</p>
<figure id="76bd">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2FWiltedBaggyBaiji&url=https%3A%2F%2Fgfycat.com%2FWiltedBaggyBaiji&image=https%3A%2F%2Fthumbs.gfycat.com%2FWiltedBaggyBaiji-size_restricted.gif&key=d04bfffea46d4aeda930ec88cc64b87c&type=text%2Fhtml&schema=gfycat" allowfullscreen="" frameborder="0" height="1100" width="1192">
</iframe>
</div>
</div>
</figure><p id="7511">Finally, I provided the agent with a bit of an existential challenge. Instead of augmenting the kind of goals present, I removed them all. In this scenario, the blue square is by itself in the environment, with no other objects. Without a goal to move towards, the agent moves around seemingly at random, and the value estimates are likewise seemingly meaningless. What would <a href="https://en.wikipedia.org/wiki/Albert_Camus">Camus</a> say?</p>
<figure id="fffd">
<div>
<div>
<img class="ratio" src="http://placehold.it/16x9">
<iframe class="" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2FMinorOfficialLarva&url=https%3A%2F%2Fgfycat.com%2FMinorOfficialLarva&image=https%3A%2F%2Fthumbs.gfycat.com%2FMinorOfficialLarva-size_restricted.gif&key=d04bfffea46d4aeda930ec88cc64b87c&type=text%2Fhtml&schema=gfycat" allowfullscreen="" frameborder="0" height="1112" width="1216">
</iframe>
</div>
</div>
</figure><p id="eda3">Taken together, these three experiments provide us with evidence that our agent is indeed responding to the environment as we would intuitively expect. These kinds of checks are essential when designing any reinforcement learning agent. If we aren’t careful about the expectations we build into the agent itself and the reward structure of the environment, we can easily end up with situations where the agent doesn’t properly learn the task, or at least doesn’t learn it as we’d expect. In the gridworld, for example, taking a step results in a -0.1 reward. This wasn’t always the case, though. Originally there was no such penalty, and the agent would learn to move to the green square, but only after an average of about 50 steps! It had no “reason” to hurry to the goal, so it didn’t. By penalizing each
step even a small amount, the agent is able to quickly learn the intuitive behavior of moving directly to the green goal. This is a reminder of just how subconscious our own reward structures as humans often are. While we may explicitly think only of the green as being rewarding and the red as being punishing, we are subconsciously constraining our actions by a desire to finish quickly. When designing RL agents, we need to ensure that we are making their reward structures as rich as ours.</p><h2 id="75f3">Using the Control Center</h2><p id="97b4">If you want to play with a working version of the Control Center without training an agent yourself, just <a href="http://awjuliani.github.io/Center/">follow this link</a> (currently requires Google Chrome). The agent you will see was pretrained on the gridworld task for 40,000 episodes. You can click the timeline on the left to look at an example episode from any point in training. The earlier episodes clearly show the agent failing to properly interpret the task, but by the end of training the agent almost always goes straight to the goal.</p><p id="9fd4">The Control Center is a piece of software I plan to continue to develop as I work more with various Reinforcement Learning algorithms. It is currently hard-coded to certain specifics of the gridworld and the DD-DQN described in <a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df#.i2zpbmre8">Part 4</a>, but if you are interested in using the interface for your own projects, feel free to <a href="https://github.com/awjuliani/RL-CC">fork it on GitHub</a> and adapt it to your particular needs as you see fit. 
Hopefully it can provide new insights into the internal life of your learning algorithms too!</p><p id="2dbb">If this post has been valuable to you, please consider <a href="https://www.paypal.com/cgi-bin/webscr?cmd=_donations&business=V2R22DV4XSR5Y&lc=US&item_name=Arthur%20Juliani%27s%20Deep%20Learning%20Tutorials&currency_code=USD&bn=PP%2dDonationsBF%3abtn_donateCC_LG%2egif%3aNonHosted"><i>donating</i></a> to help support future tutorials, articles, and implementations. Any contribution is greatly appreciated!</p><p id="c7fa">If you’d like to follow my work on Deep Learning, AI, and Cognitive Science, follow me on Medium @<a href="undefined">Arthur Juliani</a>, or on Twitter <a href="https://twitter.com/awjuliani">@awjuliani</a>.</p><p id="de6e"><b><i>More from my Simple Reinforcement Learning with Tensorflow series:</i></b></p><ol><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0"><i>Part 0 — Q-Learning Agents</i></a></li><li><a href="https://readmedium.com/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149"><i>Part 1 — Two-Armed Bandit</i></a></li><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c"><i>Part 1.5 — Contextual Bandits</i></a></li><li><a href="https://readmedium.com/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724"><i>Part 2 — Policy-Based Agents</i></a></li><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-3-model-based-rl-9a6fe0cce99"><i>Part 3 — Model-Based RL</i></a></li><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df#.i2zpbmre8"><i>Part 4 — Deep Q-Networks and Beyond</i></a></li><li><b>Part 5 — Visualizing an Agent’s Thoughts and Actions</b></li><li><a 
href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-6-partial-observability-and-deep-recurrent-q-68463e9aeefc#.gi4xdq8pk"><i>Part 6 — Partial Observability and Deep Recurrent Q-Networks</i></a></li><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf"><i>Part 7 — Action-Selection Strategies for Exploration</i></a></li><li><a href="https://readmedium.com/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.hg13tn9zw"><i>Part 8 — Asynchronous Actor-Critic Agents (A3C)</i></a></li></ol></article></body>