Applications are open for a full-time postdoctoral research position in reinforcement learning for robotics, at the RAIL laboratory at the University of the Witwatersrand, Johannesburg, South Africa (https://www.raillab.org/).

This particular post will focus on behaviour composition and symbolic abstraction in reinforcement learning, specifically in extending our existing work to apply to flexible manipulation tasks on a two-armed manipulator robot.

The post is available from January 2022, for a period of 2-5 years. The candidate must have completed a PhD in the last 5 years, and have research experience in either reinforcement learning or robotics, preferably both.

Please submit your application using the online form: https://forms.gle/oXdTHpg3QHni3M2W7
This will require a detailed CV, a motivation letter, two examples of your publications/written work, and two academic references.

The closing date for applications is 31 August 2021.

Enquiries can be sent to Prof. Benjamin Rosman (benjamin.rosman1@wits.ac.za).

**Apply for a DeepMind Scholarship for an MSc at the University of the Witwatersrand!**

Applications are now open to anyone from Africa who wishes to pursue a research- or coursework-based Master's degree at the University of the Witwatersrand in South Africa. The degrees are full-time, starting in January 2022, with a research focus on machine learning.

We are looking for candidates with strong backgrounds including experience with multivariate calculus, linear algebra, statistics, optimisation, and programming. Preference will be given to applicants from groups across the continent that are underrepresented in the global machine learning landscape.

Come and join our exciting multidisciplinary research team pushing the frontiers of fundamental machine learning research!

Apply here now: https://forms.gle/uC2rRJM12k5WsmWb7

Closing date for applications: **10 August 2021**

Short-listed and unsuccessful applicants will be notified by **31 August 2021**.

Enquiries: Prof Benjamin Rosman (Benjamin.Rosman1@wits.ac.za)

Congratulations to Geraud Nangue Tasse, who was recently awarded an IBM PhD Fellowship - the only recipient from an African university!

# Abstract

In this work, we investigate how to compose learned skills to solve their conjunction, disjunction, and negation in a manner that is both principled and optimal. We begin by introducing a goal-oriented value function that amortises the learning effort over future tasks. We then prove that by composing these value functions in specific ways, we immediately recover the optimal policies for the conjunction, disjunction, and negation of learned tasks. Finally, we formalise the logical composition of tasks in a Boolean algebra structure. This gives us a notion of base tasks which, once learned, can be composed to solve any other task in a domain.
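As a rough illustration of what composing value functions could look like, the sketch below works on toy tables that map each goal to a value. The abstract does not state the composition operators, so the ones used here are assumptions for illustration: elementwise `max` for disjunction, elementwise `min` for conjunction, and `(q_max + q_min) - q` for negation, where `q_max` and `q_min` are hypothetical value tables for the tasks in which every goal, or no goal, is desirable. All names and numbers are made up.

```python
# Toy goal-oriented value tables: one value per goal, per task.
# Operators, names, and numbers are illustrative assumptions only.

def q_or(q1, q2):
    """Disjunction: take the larger value for each goal."""
    return {g: max(q1[g], q2[g]) for g in q1}

def q_and(q1, q2):
    """Conjunction: take the smaller value for each goal."""
    return {g: min(q1[g], q2[g]) for g in q1}

def q_not(q, q_max, q_min):
    """Negation: reflect each value between the upper and lower bounds."""
    return {g: (q_max[g] + q_min[g]) - q[g] for g in q}

goals = ["blue_square", "blue_circle", "purple_square", "purple_circle"]
q_max = {g: 1.0 for g in goals}   # hypothetical task: every goal is good
q_min = {g: 0.0 for g in goals}   # hypothetical task: no goal is good

# Two "learned" tasks: collect blue objects, collect squares.
q_blue = {"blue_square": 1.0, "blue_circle": 1.0,
          "purple_square": 0.0, "purple_circle": 0.0}
q_square = {"blue_square": 1.0, "blue_circle": 0.0,
            "purple_square": 1.0, "purple_circle": 0.0}

blue_and_square = q_and(q_blue, q_square)
best_goal = max(blue_and_square, key=blue_and_square.get)
not_blue = q_not(q_blue, q_max, q_min)
```

In this toy setting the conjunction singles out `blue_square`, and the negation of "blue" assigns high value only to the purple objects. Real value functions would also depend on state and action; the tables here drop that for brevity.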

## Introduction

In this post, we are interested in answering the following question: *given a set of existing skills, can we compose them to solve any task that is a logical combination of learned ones?* To illustrate what we want, we will use the 2D video game domain of Van Niekerk [2019], where we train an agent to collect blue objects and separately train it to collect squares. We then see if we can obtain the skill to collect blue squares by averaging the learned value functions (since averaging is the best we can do from previous works [Haarnoja 2018, Van Niekerk 2019]). The respective value functions and trajectories are illustrated below.

Attempting to collect blue squares by composing learned skills - collecting blue objects and collecting squares - results in an agent that gets stuck optimally.

We can see that...

# Introduction

Heuristic search algorithms rely on a heuristic function, $h$, to guide search for planning. The aim of such a heuristic function is to produce a quick-to-compute estimate of the true cost-to-goal, $h^*$, for any given state $s$.

A well-known property of heuristic-search-based algorithms like A* or IDA* is that if the heuristic never overestimates the true cost-to-goal - that is, $h(s) \leq h^*(s)$ for every state $s$ - then the plans produced by these algorithms are guaranteed to be optimal. Such a heuristic is called an admissible heuristic.
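To make the admissibility property concrete, here is a minimal A* sketch on a small grid. The grid, the wall layout, and the function names are assumptions chosen for illustration. The Manhattan distance is used as the heuristic because, with unit-cost moves in four directions, it never overestimates the true cost-to-goal.

```python
import heapq

def astar(start, goal, neighbours, h):
    """Return the cost of a cheapest path from start to goal, or None."""
    frontier = [(h(start), 0, start)]      # entries are (f = g + h, g, state)
    best_g = {start: 0}                    # cheapest known cost to each state
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        if g > best_g.get(state, float("inf")):
            continue                       # stale entry, a cheaper one exists
        for nxt, cost in neighbours(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt))
    return None

# Hypothetical 4x4 grid with a wall that forces a detour.
SIZE = 4
WALLS = {(1, 0), (1, 1), (1, 2)}
GOAL = (3, 0)

def neighbours(cell):
    """Yield (next_cell, step_cost) for each free 4-connected neighbour."""
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE and nxt not in WALLS:
            yield nxt, 1

def manhattan(cell):
    """Admissible here: walls can only make the true cost larger."""
    return abs(cell[0] - GOAL[0]) + abs(cell[1] - GOAL[1])

optimal_cost = astar((0, 0), GOAL, neighbours, manhattan)
```

The heuristic estimates 3 steps from the start, while the detour around the wall makes the true cost 9. The estimate is a lower bound, so the search still returns the optimal cost.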

Unfortunately, crafting strong admissible heuristics is difficult, often requiring expert domain-specific knowledge and high memory resources.

# Learning Heuristics

An alternative approach is to learn heuristics from data using machine learning algorithms. For example, consider the popular 15-puzzle. The aim of the 15-puzzle is to reach a goal state from some start state by sliding a blank tile in a direction and swapping that blank tile with the adjacent number in that direction.
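The sliding move described above can be sketched as a successor function. The state encoding is an assumption made for this example: a tuple of 16 integers in row-major order, with `0` standing for the blank tile.

```python
def successors(state, width=4):
    """Return every state reachable by sliding the blank one step."""
    blank = state.index(0)                  # position of the blank tile
    row, col = divmod(blank, width)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < width and 0 <= c < width:
            target = r * width + c
            tiles = list(state)
            # Swap the blank with the adjacent numbered tile.
            tiles[blank], tiles[target] = tiles[target], tiles[blank]
            result.append(tuple(tiles))
    return result

# One common goal convention (assumed here): tiles 1..15 in order, blank last.
goal_state = tuple(range(1, 16)) + (0,)
next_states = successors(goal_state)
```

With the blank in a corner only two moves are legal, so `next_states` holds two states. A blank in the middle of the board would give four.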

Now, suppose we have a set of optimal plans for a set of 15-puzzle tasks, where each plan is for a 15-puzzle task with a different start state. Then it is possible to use these plans as training data for a supervised learning algorithm, such as a neural network, to learn a heuristic. Since supervised learning algorithms can generalise to unseen data, such heuristics can then be applied to new, previously unseen tasks, i.e., 15-puzzle tasks with different start states...
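One simple way to turn optimal plans into supervised training data is sketched below, as an assumption rather than the method used in the post. Each state along an optimal plan is labelled with the number of moves remaining to the goal. Because every suffix of an optimal plan is itself optimal, these labels equal the true cost-to-goal $h^*$ when every move costs 1. The state names are placeholders.

```python
def training_pairs(plan):
    """Label each state on an optimal plan with its remaining cost-to-goal.

    `plan` is the list of states visited from start to goal, inclusive,
    assuming every move has unit cost.
    """
    last = len(plan) - 1
    return [(state, last - i) for i, state in enumerate(plan)]

# Hypothetical optimal plans; real states would be 15-puzzle configurations.
plans = [
    ["s0", "s1", "s2", "goal"],
    ["t0", "t1", "goal"],
]

# Pool the labelled states from every plan into one dataset.
dataset = [pair for plan in plans for pair in training_pairs(plan)]
```

The resulting `(state, cost)` pairs can then be passed to any regression model. A learned heuristic of this kind is not guaranteed to be admissible, since the model may overestimate on states it has not seen.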

Congratulations to Kale-ab Tessera, whose work, *Learning compact, general purpose neural network architectures*, received the "Best Poster" award at the 2019 Deep Learning Indaba.

He wins a trip to Vancouver, Canada in December, where he will attend NeurIPS 2019.