Harnessing the Wisdom of an Unreliable Crowd for Autonomous Decision Making

Abstract

In Reinforcement Learning there is often a need for greater sample efficiency when learning an optimal policy, whether due to the complexity of the problem or the difficulty of obtaining data. One approach to this problem is to provide the agent with external information in the form of domain expert advice. Indeed, it has been shown that giving an agent advice in the form of state-action pairs during learning can greatly improve the rate at which it converges to an optimal policy. These approaches typically assume a single, infallible expert. However, it may be desirable to collect advice from multiple experts to further improve sample efficiency, which introduces the problem of experts offering conflicting advice. Moreover, experts (especially humans) can give incorrect advice. Incorporating advice from multiple, potentially unreliable experts is considered an open problem in the field of Assisted Reinforcement Learning. Contextual bandits are an important class of problems with a broad range of applications in medicine, finance and recommendation systems. To address the problem of learning from multiple, unreliable experts, we present CLUE (Cautiously Learning with Unreliable Experts), a framework that allows any contextual bandit algorithm to benefit from incorporating expert advice into its decision making. It does so by modelling the reliability of each expert, and using this model to pool advice together to determine the probability of each action being optimal. We perform a number of experiments with simulated experts over randomly generated environments. Our results show that CLUE achieves improved sample efficiency when advised by reliable experts, is robust to the presence of unreliable experts, and is able to benefit from multiple experts simultaneously. This research provides an approach to incorporating the advice of humans with varying levels of expertise into the learning process.
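As a rough illustration of the advice-pooling idea described above, the sketch below models each expert's reliability as the mean of a Beta posterior and combines one round of advice into a distribution over which action is optimal. The Bernoulli-style advice model, the function names and the reliability update rule are our assumptions for illustration, not the exact formulation used by CLUE.

```python
import numpy as np

def pool_advice(advice, alpha, beta, n_actions):
    """Pool one round of advice from multiple experts.

    advice:       dict mapping expert id -> advised action index
    alpha, beta:  dicts of per-expert Beta posterior parameters over
                  that expert's reliability (assumed model, see above)
    Returns a probability for each action being the optimal one.
    """
    log_w = np.zeros(n_actions)
    for e, advised in advice.items():
        rho = alpha[e] / (alpha[e] + beta[e])  # posterior mean reliability
        for a in range(n_actions):
            # A rho-reliable expert advises the optimal action with
            # probability rho, and any other action uniformly at random.
            p = rho if a == advised else (1.0 - rho) / (n_actions - 1)
            log_w[a] += np.log(p + 1e-12)
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    return w / w.sum()

def update_reliability(advice, best_action, alpha, beta):
    """After acting, credit experts whose advice matched the agent's
    current estimate of the best action (a simplifying assumption)."""
    for e, advised in advice.items():
        if advised == best_action:
            alpha[e] += 1.0
        else:
            beta[e] += 1.0

# Example: two experts, one estimated reliable (0) and one not (1).
alpha = {0: 8.0, 1: 1.0}
beta = {0: 1.0, 1: 8.0}
probs = pool_advice({0: 2, 1: 0}, alpha, beta, n_actions=3)
print(probs)  # mass concentrates on action 2, the reliable expert's advice
```

In the full framework, a cautious agent would weigh such a pooled distribution against its own value estimates rather than follow it blindly; the Beta update above is one simple way to let reliability estimates improve as the agent's own policy does.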

Publication
The 5th Multi-disciplinary Conference on Reinforcement Learning and Decision Making
Tamlin Love
PhD Student

I am a PhD student at the Institut de Robòtica i Informàtica Industrial (IRI, a joint centre of CSIC and UPC) in Barcelona, working within the TRAIL Marie Skłodowska-Curie Doctoral Network under the supervision of Guillem Alenyà. I was previously an MSc student and lecturer at the University of the Witwatersrand, under the supervision of Benjamin Rosman and Ritesh Ajoodha, and a member of the RAIL Lab.

Benjamin Rosman
Lab Director

I am a Professor in the School of Computer Science and Applied Mathematics at the University of the Witwatersrand in Johannesburg. I work in robotics, artificial intelligence, decision theory and machine learning.