IMPOSSIBLY GOOD EXPERTS AND HOW TO FOLLOW THEM

Abstract

We consider the sequential decision making problem of learning from an expert that has access to more information than the learner. For many problems, this extra information will enable the expert to achieve greater long-term reward than any policy without this privileged information access. We call these experts "Impossibly Good" because no learning algorithm will be able to reproduce their behavior. However, in these settings it is reasonable to attempt to recover the best policy possible given the agent's restricted access to information. We provide a set of necessary criteria on the expert that will allow a learner to recover the optimal policy in the reduced information space from the expert's advice alone. We also provide a new approach called ELF Distillation (Explorer Learning from Follower) that can be used in cases where these criteria are not met and environmental rewards must be taken into account. We show that this algorithm performs better than a variety of strong baselines on a challenging suite of Minigrid and Vizdoom environments.

1. INTRODUCTION

Sequential decision making is one of the most important problems in modern machine learning theory and practice. Reinforcement learning from an environmental reward signal is a powerful but unwieldy tool for attacking these problems. In contrast, imitation learning can be much more sample efficient and empirically easier to train than reinforcement learning, but requires a powerful expert that can provide either an offline dataset of demonstrations or online supervision. In many practical settings, these experts have access to more information than the learning agent. This can occur when using human demonstrators to train robots that have inferior sensors, or in simulated environments where a synthetic expert uses hidden simulator information to train an agent. In these settings, it is possible that the expert's additional information makes it more powerful than any learning agent that does not have access to the hidden information. We call these experts "Impossibly Good" and show that learning from them using techniques that do not incorporate environmental rewards can cause the agent to drastically underperform the optimal policy in the reduced information space.

For example, consider a simulated robot tasked with retrieving a cell phone in an unknown apartment consisting of multiple rooms. The robot observes the world using a camera and is not given the location of the phone in advance, so it must explore each room in order to find it. Because this is a simulated environment, we can construct an expert that knows not only the location of the phone, but also the exact layout of each room, and can compute the shortest path from the robot to the phone. We can then use this expert to construct a large corpus of training data across any number of apartments and phone locations.
While these demonstrations may be optimal according to the expert that knows the phone's location, they crucially do not provide any demonstrations of the exploratory behavior that is necessary for the robot, which must rely on its more limited sensors. At test time, the robot may need to explore many empty rooms before finding the one that contains the phone, but the expert has always walked directly to the goal, so it has never shown the robot what to do when encountering an empty room. In this case the expert is impossibly good because, on average, it can reach the phone much faster than any agent that lacks the map and must instead explore the rooms one by one. While we may be able to learn some important skills from this expert, we are crucially missing demonstrations of other necessary behavior, so learning from this expert's advice alone may cause the robot to fail.

Our goal in these settings is to find an algorithm that retains the efficiency of imitation learning while incorporating just enough reward feedback from the environment to achieve success. To address this, we introduce a new technique called ELF Distillation (Explorer Learning from Follower). The key insight of this approach is to train one follower policy using the advice of the impossibly good expert alone, and then use the estimated long-term value of this policy to drive exploration of a second explorer policy via reward shaping. These two policies are trained jointly so that the explorer policy can be used to inform the distribution of states from which the follower must learn. In order to study these problems, we have constructed a suite of Minigrid (Chevalier-Boisvert et al., 2018) and Vizdoom (Wydmuch et al., 2019) environments that clearly demonstrate the challenges of learning from impossibly good experts.
While these are toy problems, they are quite challenging for many strong baselines and related approaches, and allow us to clearly demonstrate the necessary concepts in a setting that avoids confounding implementation details. Code for these experiments can be found at https://github.com/aaronwalsman/impossibly-good/.
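The interaction between the two policies can be made concrete with a small sketch. This is a minimal illustration under our own assumptions, not the exact training objective: we assume the follower's value estimate V_f is used as a potential in standard potential-based reward shaping of the explorer, and the function name `shaped_reward` is hypothetical.

```python
def shaped_reward(env_reward, v_follower, v_follower_next, gamma=0.99, done=False):
    """Hypothetical sketch of ELF-style shaping: use the follower's value
    estimate as a potential, r' = r + gamma * V_f(s') - V_f(s).
    The explorer is thereby drawn toward states where following the
    impossibly good expert's advice leads to long-term success."""
    # At episode termination there is no successor state to bootstrap from.
    bootstrap = 0.0 if done else gamma * v_follower_next
    return env_reward + bootstrap - v_follower
```

Shaping of this potential-based form leaves the set of optimal policies unchanged, so the explorer remains free to acquire the exploratory behavior that the expert's demonstrations never contain.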

2. RELATED WORK

Recently, many authors have identified the problem of learning from experts that have access to more information than the learning agent. This can occur in the self-driving domain, where a human expert may be able to see more than the car's perception system (de Haan et al., 2019; Bansal et al., 2018; Chen et al., 2019). It can also occur in robot exploration, where the expert may already know a map of the world (Choudhury et al., 2017; Jain et al., 2021), or in heuristic search, where the expert may know which graph edges are valid (Bhardwaj et al., 2017). In some cases it is possible to overcome these issues by considering a history of recent observations instead of a single step, but unfortunately this exacerbates the "latching" behavior identified by several practitioners in both self-driving (Muller et al., 2006; Kuefler et al., 2017; Bansal et al., 2018; Codevilla et al., 2019) and natural language processing (Ortega et al., 2021), in which an agent becomes overly fixated on repeating recent actions. In these partial information settings, Choudhury et al. (2018) showed that interactive imitation learning converges to the QMDP approximation of the expert's policy (Littman et al., 1995). This likely explains the empirical success (Chen et al., 2019; Lee et al., 2020) of such techniques in settings where information naturally reveals itself over time. These interactive imitation learning techniques were originally designed to address covariate shift, a condition in which compounding single-step errors drive the learning agent into states not seen during training. This effect was first noted by Pomerleau (1989) and has long been recognized as one of the most important challenges in imitation learning (Bagnell, 2015). Several approaches have been proposed to address it, such as SEARN (Daumé et al., 2009), DAgger (Ross et al., 2011), and AggreVaTe (Ross & Bagnell, 2014; Sun et al., 2017).
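The QMDP approximation referenced above selects actions by weighting each state's fully observable Q-values by the agent's current belief. The sketch below is illustrative; the dictionary-based representation and the name `qmdp_action` are our own choices rather than anything from the cited papers.

```python
def qmdp_action(belief, q_mdp):
    """QMDP action selection (Littman et al., 1995): score each action by the
    belief-weighted average of the fully observable Q-values, then act greedily.
    belief: {state: probability}; q_mdp: {(state, action): value}."""
    actions = {a for (_, a) in q_mdp}
    scores = {a: sum(p * q_mdp[(s, a)] for s, p in belief.items())
              for a in actions}
    return max(scores, key=scores.get)
```

Because QMDP implicitly assumes all state uncertainty resolves after a single step, an agent following it never acts purely to gather information; this is precisely the exploratory behavior that is missing from an impossibly good expert's demonstrations.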
Recent work (Spencer et al., 2021) has shown that covariate shift can be broken into realizable settings, where off-policy methods such as behavior cloning work well with increasing data, and non-realizable settings, where on-policy methods with an interactive expert (Ross et al., 2010) or an interactive simulator (Ziebart et al., 2008; Swamy et al., 2021) are necessary. Some authors (Zhang et al., 2020; Kumor et al., 2021; Ortega et al., 2021) have proposed to address learning in partial information settings using the causal reasoning framework of Pearl et al. (2016), while Swamy et al. (2022) have recently shown that on-policy imitation learning methods can be more effective than off-policy approaches at recovering the expert's behavior in these situations, and have provided conditions under which an agent can asymptotically recover the expert's behavior. However, in settings where information does not reveal itself, the learner has to actively gather information (Lee et al., 2021). Tennenholtz et al. (2021) examine similar settings but give the learner access to the confounder at test time. Warrington et al. (2021) propose an asymmetric DAgger algorithm for this setting, but it requires a differentiable model of the expert, which is frequently unavailable. Nguyen et al. (2022) replace the entropy term in Soft Actor Critic (Haarnoja et al., 2018) with a divergence between the agent and expert policies at each visited state. Weihs et al. (2021) interpolate between the policy gradient and an imitation learning signal using an estimate of how well the agent is able to follow the expert in each state. Our work builds on these ideas by encouraging the agent to visit states where following the expert leads to long-term success rather than short-term ability to mimic the expert. This technique uses the tools of pol-

