## A Boolean Task Algebra For Reinforcement Learning

Paper at NeurIPS 2020

# Abstract

In this work, we investigate how to compose learned skills to solve their conjunction, disjunction, and negation in a manner that is both principled and optimal. We begin by introducing a goal-oriented value function that amortises the learning effort over future tasks. We then prove that by composing these value functions in specific ways, we immediately recover the optimal policies for the conjunction, disjunction, and negation of learned tasks. Finally, we formalise the logical composition of tasks in a Boolean algebra structure. This gives us a notion of base tasks which when learned, can be composed to solve any other task in a domain.

## Introduction

In this post, we are interested in answering the following question: given a set of existing skills, can we compose them to solve any task that is a logical combination of learned ones? To illustrate want we want, we will use the 2d video game domain of Van Niekerk [2019] where we train an agent to collect blue objects and separately train it to collect squares. We then see if we can obtain the skill to collect blue squares by averaging the learned value functions (since averaging is the best we can do from previous works [Haarnoja 2018, Van Niekerk 2019]). The respective value functions and trajectories are illustrated below.

Attempting to collect blue squares by composing learned skills - collecting blue objects and collecting squares - results in an agent that gets stuck optimally.

We can see that...

# Introduction

Heuristic search algorithms rely on a heuristic function, $h$, to guide search for planning. The aim of such a heuristic function is to produce a quick-to-compute estimate of the true cost-to-goal, $h^*$, for any given state $s$.

A well-known property of heuristic search based algorithms like A* or IDA* is that if the heuristic never overestimates the true cost-to-goal - that is, $h(s) \leq h^*(s)$ - then the plans produced by these algorithms is guaranteed to be optimal. Such a heuristic is called an admissible heuristic.

Unfortunately, crafting strong admissible heuristics is difficult, often requiring expert domain-specific knowledge and high memory resources.

# Learning Heuristics

An alternative approach is to learn heuristics from data using machine learning algorithms. For example, consider the popular 15-puzzle. The aim of the 15-puzzle is to reach a goal state from some start state by sliding a blank tile in a direction and swapping that blank tile with the adjacent number in that direction.

Now, suppose we have a set of optimal plans for a set of 15-puzzle tasks, where each plan is for a 15-puzzle task with a different start state. Then it is possible to use these plans as training data for a supervised learning algorithm, such as a neural network, to learn a heuristic. Since supervised learning algorithms generalise to unseen data, such heuristics can then be applied to new, previously unseen tasks i.e. 15-puzzle tasks with different start states...

## Best Poster at DL Indaba 2019

Congratulations to Kale-ab Tessera whose work, Learning compact, general purpose neural network architectures, received the "Best Poster" award at the 2019 Deep Learning Indaba.

He wins a trip to Vancouver, Canada in December, where he will attend NeurIPS 2019.

# Abstract

In this work, we investigate ways an agent can combine existing skills to create novel ones in a manner that is both principled and optimal. We find that by constraining the reward function and transition dynamics, skills composition can be achieved in both entropy-regularised and standard RL. Our approach allows an agent to generate new skills without further learning, and can be applied to high-dimensional environments and deep RL methods.

## Introduction

In this post, we look to answer the following question: given a set of existing skills, can we compose them to produce novel behaviours? For example, imagine we have learned skills like running and jumping. Can we build more complex skills by simply combining them in interesting ways? Of course, it is likely that there are many ad-hoc ways to do this, but it would be really nice if we could do it in a way that is both provably correct and that requires no further learning.

To illustrate what we're after, we use pretrained skills generated by DeepMimic [Peng, 2018]. As we can see, in the first two animations below our agent has learned to backflip and spin. But if we try to combine these skills together (in this case, by taking the maximum of the policy network output), we get complete bupkis! The question is: is there some way we can make this work?

Attempting to compose two skills - backflips and spins - results in an agent that can injure itself optimally.

Unsurprisingly, the answer is "yes". More...