## A Boolean Task Algebra For Reinforcement Learning

Paper at NeurIPS 2020

# Abstract

In this work, we investigate how to compose learned skills to solve their conjunction, disjunction, and negation in a manner that is both principled and optimal. We begin by introducing a goal-oriented value function that amortises the learning effort over future tasks. We then prove that by composing these value functions in specific ways, we immediately recover the optimal policies for the conjunction, disjunction, and negation of learned tasks. Finally, we formalise the logical composition of tasks in a Boolean algebra structure. This gives us a notion of base tasks which when learned, can be composed to solve any other task in a domain.

## Introduction

In this post, we are interested in answering the following question: given a set of existing skills, can we compose them to solve any task that is a logical combination of learned ones? To illustrate want we want, we will use the 2d video game domain of Van Niekerk [2019] where we train an agent to collect blue objects and separately train it to collect squares. We then see if we can obtain the skill to collect blue squares by averaging the learned value functions (since averaging is the best we can do from previous works [Haarnoja 2018, Van Niekerk 2019]). The respective value functions and trajectories are illustrated below.

Attempting to collect blue squares by composing learned skills - collecting blue objects and collecting squares - results in an agent that gets stuck optimally.

We can see that...

# Introduction

Heuristic search algorithms rely on a heuristic function, $h$, to guide search for planning. The aim of such a heuristic function is to produce a quick-to-compute estimate of the true cost-to-goal, $h^*$, for any given state $s$.

A well-known property of heuristic search based algorithms like A* or IDA* is that if the heuristic never overestimates the true cost-to-goal - that is, $h(s) \leq h^*(s)$ - then the plans produced by these algorithms is guaranteed to be optimal. Such a heuristic is called an admissible heuristic.

Unfortunately, crafting strong admissible heuristics is difficult, often requiring expert domain-specific knowledge and high memory resources.

# Learning Heuristics

An alternative approach is to learn heuristics from data using machine learning algorithms. For example, consider the popular 15-puzzle. The aim of the 15-puzzle is to reach a goal state from some start state by sliding a blank tile in a direction and swapping that blank tile with the adjacent number in that direction.

Now, suppose we have a set of optimal plans for a set of 15-puzzle tasks, where each plan is for a 15-puzzle task with a different start state. Then it is possible to use these plans as training data for a supervised learning algorithm, such as a neural network, to learn a heuristic. Since supervised learning algorithms generalise to unseen data, such heuristics can then be applied to new, previously unseen tasks i.e. 15-puzzle tasks with different start states...

# Abstract

In this work, we investigate ways an agent can combine existing skills to create novel ones in a manner that is both principled and optimal. We find that by constraining the reward function and transition dynamics, skills composition can be achieved in both entropy-regularised and standard RL. Our approach allows an agent to generate new skills without further learning, and can be applied to high-dimensional environments and deep RL methods.

## Introduction

In this post, we look to answer the following question: given a set of existing skills, can we compose them to produce novel behaviours? For example, imagine we have learned skills like running and jumping. Can we build more complex skills by simply combining them in interesting ways? Of course, it is likely that there are many ad-hoc ways to do this, but it would be really nice if we could do it in a way that is both provably correct and that requires no further learning.

To illustrate what we're after, we use pretrained skills generated by DeepMimic [Peng, 2018]. As we can see, in the first two animations below our agent has learned to backflip and spin. But if we try to combine these skills together (in this case, by taking the maximum of the policy network output), we get complete bupkis! The question is: is there some way we can make this work?

Attempting to compose two skills - backflips and spins - results in an agent that can injure itself optimally.

Unsurprisingly, the answer is "yes". More...

# Through Social Cobots: Robots with Human-Like Collaboration Skills

This study is a part of a collaboration with DAI-Labor of TU Berlin, Germany. We envision the future of collaborative robots (cobots) in industry through their fully autonomous human-like collaboration with human partners.

Our research aims to address the question: "How do we build cobots with human-like natural collaboration skills?". Existing intention-aware planning approaches often make the assumptions that a human collaborator's actions are always relevant to the collaborated task and that the human always accepts the cobot's assistance when offered.

We believe that these assumptions are a significant barrier against having social cobots in the real world. In reality, a human's dynamic desires and emotional states could result in stochastic human intentions, especially in repeated tasks. A cobot with these assumptions may misinterpret the human actions, which may result in intrusive and unreasonable robot behaviors (e.g. a human gazing at an object might be interpreted as she needs it, yet behind this gaze, she could be evaluating to take it herself or thinking of something irrelevant to the task like staring into space).

Our goal is to offer a new model design as an alternative to the conventional intention-aware models by removing these assumptions. The result is our novel robot decision-making model, a partially observable Markov decision process (POMDP), that is capable of handling these dynamic...

# Introduction

An important area of study in the field of Reinforcement Learning (RL) is Transfer Learning where the aim is to leverage previous experiences to accelerate learning in a new unseen tasks. While it is clear that living organisms apply transfer learning throughout their lives, it is often unclear how this transfer mechanism exhibited by living organisms can be incorporated into autonomous agents

As a concrete example, consider a simple Sokoban task as below.

Once a human completes this task they have learned core concepts about the underlying structure of the Sokoban domain. For example, they would learn that the warehouse-keeper:

1. cannot walk through walls,
2. cannot push a box that is adjacent to another box
3. cannot push a box that is adjacent to a wall

Now suppose the human would be given this new more complex task to solve.

Clearly, the human would re-use the rules they had previously learned in the simple task to gain an advantage in this new task. Unfortunately, most state-of-the art RL algorithms would not leverage such knowledge and would instead re-learn everything from scratch on the more complex task. Such wastefulness of prior experience is clearly inefficient!

# Object-Oriented Representation

One idea that has shown promise in transfer learning is the notion of object-oriented representation. With this approach we view a task as being instances of objects classes. For example, any Sokoban task can be thought of as made objects that are instances...

# Introduction

DRM-connect is an algorithm for motion planning and replanning, and is a combination of dynamic reachability maps (DRM) with lazy collision checking and a fallback strategy based on the RRT-connect algorithm, which is used to repair the roadmap through further exploration.

Trajectory planning and replanning in complex environments often reuses very little information from previous solutions. This is particularly evident when the motion is repeated multiple times with only a limited amount of variation between each run. Graph-based planning offers fast replanning at the cost of significant pre-computation, while probabilistic planning requires no pre-computation at the cost of slow replanning.

We attempt to offer the best of both by proposing the DRM-connect algorithm.

# Algorithm

Offline, an approximate Reeb graph is constructed from the trajectories of prior tasks in the same or similar environments.

For a new planning or replanning query, DRM-connect searches this Reeb graph for a trajectory to complete the task (checking collisions lazily). If no path is found, DRM-connect iterates between attempting to repair the disconnected subgraphs through a process similar to RRT-connect (operating on multiple graphs, rather than trees) and searching for paths through the graph. Since DRM-connect is probabilistically complete, the likelihood of a successful trajectory being returned approaches one as time tends to infinity.

Further work will incorporate online updates...