
Leroy Dunn

Introduction

In this post we show how the application of curriculum learning can affect the performance of a simple reinforcement learning agent on some target task. We do this by handcrafting source tasks using knowledge of our domain and agent. Our findings show that a curriculum can positively affect the performance of the agent on some target task, and serves as a proof of concept for future, more complex work.

Curriculum Learning

[Figure: "Example of a mathematics curriculum. Lessons progress from simpler topics to more complex ones, with each building on the last." [1]]

Curriculum learning is an area of machine learning in which the goal is to design a sequence of source tasks (a curriculum) for an agent to train on first, such that the agent's final performance or learning speed on some target task is improved. It is motivated by the desire to apply autonomous agents to increasingly difficult tasks, and aims to make such tasks easier to solve.

Domain

We conduct our experiment on a simple grid world domain. The description and visuals below are taken directly from [2].

[Figure 1: the grid world domain, reproduced from [2].]

The world consists of a room, which can contain 4 types of objects. Keys are items the agent can pick up by moving to them and executing a pickup action. These are used to unlock locks. Each lock in a room is dependent on a set of keys. If the agent is holding the right keys, then moving to a lock and executing an unlock action opens the lock. Pits are obstacles placed throughout the domain. If the agent moves into a pit, the episode is terminated. Finally, beacons are landmarks that are placed on the corners of pits.

The goal of the learning agent is to traverse the world and unlock all the locks. At each time step, the learning agent can move in one of the four cardinal directions, execute a pickup action, or an unlock action. Moving into a wall causes no motion. Successfully picking up a key gives a reward of +500, and successfully unlocking a lock gives a reward of +1000. Falling into a pit terminates the episode with a reward of -200. All other actions receive a constant step penalty of -10.
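To make the reward structure concrete, here is a minimal sketch of how the per-step reward and termination logic might be encoded. The constants follow the description above; the event flags (`picked_up_key`, `unlocked_lock`, `fell_in_pit`) are placeholders for whatever the actual environment implementation reports, and goal-completion handling is omitted.

```python
# Reward constants, as described above.
KEY_REWARD = 500      # successfully picking up a key
LOCK_REWARD = 1000    # successfully unlocking a lock
PIT_PENALTY = -200    # falling into a pit (terminates the episode)
STEP_PENALTY = -10    # every other action

def step_reward(picked_up_key: bool, unlocked_lock: bool, fell_in_pit: bool) -> tuple[int, bool]:
    """Return (reward, episode_terminated) for a single transition."""
    if fell_in_pit:
        return PIT_PENALTY, True
    if picked_up_key:
        return KEY_REWARD, False
    if unlocked_lock:
        return LOCK_REWARD, False
    return STEP_PENALTY, False
```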

Agent

For our experiments we use a simple tabular Q-learning agent with an epsilon-greedy policy (ε = 0.1) and a learning rate of 0.01.
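For reference, a minimal sketch of the tabular Q-learning update with an epsilon-greedy policy, using the hyperparameters above (ε = 0.1, α = 0.01). The discount factor γ is not stated in the post, so the value below is an assumption.

```python
import random
from collections import defaultdict

EPSILON = 0.1   # exploration rate (epsilon-greedy)
ALPHA = 0.01    # learning rate
GAMMA = 0.99    # discount factor (assumed; not specified in the post)

# Actions: four cardinal moves, plus pickup and unlock.
ACTIONS = ["up", "down", "left", "right", "pickup", "unlock"]

# Tabular Q-function: maps (state, action) to an estimated return.
Q = defaultdict(float)

def choose_action(state):
    """Epsilon-greedy action selection over the tabular Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, done):
    """One-step Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward if done else reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```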

The agent's observation/state space is implemented as per [2] and described below.

[Figure: the agent's state space, as defined in [2].]

Using localized observations in this manner (measurements relative to the agent), as opposed to absolute measurements such as (x, y) coordinates, allows the agent to transfer knowledge to similar tasks.
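As a simplified illustration (not the exact encoding from [2]), a localized observation might record offsets from the agent to nearby objects rather than absolute coordinates, so that the same state can recur across grids of different sizes or layouts:

```python
def localized_observation(agent_pos, key_positions, lock_positions, holding_keys):
    """Toy localized state: relative offsets to the nearest key and lock, plus held keys.

    This is a simplified stand-in for the observation space of [2]; the point is that
    everything is measured relative to the agent, not in absolute (x, y) terms.
    """
    def nearest_offset(targets):
        if not targets:
            return (0, 0)
        tx, ty = min(targets, key=lambda p: abs(p[0] - agent_pos[0]) + abs(p[1] - agent_pos[1]))
        return (tx - agent_pos[0], ty - agent_pos[1])

    return (nearest_offset(key_positions), nearest_offset(lock_positions), frozenset(holding_keys))
```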

Experiment

In our experiments we compare the performance of our agent on a target task with and without pretraining on a curriculum. The target task is shown in Figure 1 (a) (in the Domain section above). We increase the difficulty of the target task by capping each episode at a maximum of 75 steps. Results are averaged over 50 episodes.

Curriculum

[Figure: the handcrafted curriculum of two source tasks.]

We handcrafted a curriculum containing two source tasks, both subsets of the target task. In the first source task, the agent starts in a state in which it already holds the key and has to navigate to the lock and unlock it. In the second source task, the agent does not have the key, but is initialized at a starting position closer to it. Additionally, in both source tasks the dimensionality of the state space has been reduced, and each can be interpreted as a 'slice' of the target task grid. The methods used to create these source tasks, 'Promising Initialization' and 'Task Dimension Simplification' (among others), were adapted from the literature that inspired this work [3].
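Building on the agent sketch above, the pretraining procedure might be wired together as follows, assuming the source and target tasks expose a common `reset`/`step` interface (with `step` returning `(next_state, reward, done)`) and share the localized state encoding, so the Q-table carries over directly. The task constructors and step budgets below are hypothetical.

```python
def train(env, num_steps):
    """Run tabular Q-learning on one task for a fixed number of steps, mutating the shared Q-table."""
    state, t = env.reset(), 0
    while t < num_steps:
        action = choose_action(state)
        next_state, reward, done = env.step(action)
        update(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
        t += 1

# Curriculum: each source task is a 'slice' of the target task (hypothetical constructors).
curriculum = [
    make_source_task_1(),  # agent already holds the key; navigate to the lock
    make_source_task_2(),  # agent starts closer to the key
]

for source_task in curriculum:            # ~50,000 pretraining steps in total
    train(source_task, num_steps=25_000)
train(make_target_task(), num_steps=100_000)  # then learn the target task with the warm-started Q-table
```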

Results

[Figure: learning curves on the target task for the curriculum-trained agent and the agent with no prior training.]

The plot above shows the performance of the agent on the target task when trained with a curriculum versus with no prior training. The curve for the curriculum-trained agent is offset by the number of steps required to train it on the curriculum (~50,000 steps). We observe that despite this offset, the curriculum-trained agent still reaches optimal performance on the target task roughly 10,000 steps (about 30%) sooner than the agent with no prior training, demonstrating the utility of a good curriculum.
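The offset in the plot can be reproduced by shifting the curriculum curve along the step axis by the total pretraining cost. A small sketch, assuming `steps_*` and `returns_*` are arrays logged during target-task training:

```python
import matplotlib.pyplot as plt

PRETRAINING_STEPS = 50_000  # approximate steps spent on the two source tasks

plt.plot(steps_baseline, returns_baseline, label="no curriculum")
plt.plot([s + PRETRAINING_STEPS for s in steps_curriculum], returns_curriculum,
         label="curriculum (offset by pretraining cost)")
plt.xlabel("environment steps")
plt.ylabel("average return (50-episode window)")
plt.legend()
plt.show()
```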

Future Work

In this experiment we used a simple tabular Q-learning agent. More complex RL agents employing function approximation over the state space have been shown to generalize knowledge to unseen states. We hypothesize that this property should allow such agents to train on, and apply knowledge from, a wider and possibly more valuable set of source tasks. In later work we will explore using these agents.

Additionally, in this experiment we handcrafted the training curriculum. In later work we aim to explore automated methods of constructing source tasks and sequencing them into a curriculum.

References

[1] Curriculum Learning, Unity Blog
[2] Narvekar et al., Autonomous Task Sequencing for Customized Curriculum Design in Reinforcement Learning, 2017
[3] Narvekar et al., Source Task Creation for Curriculum Learning, 2016