FAST ADAPTATION VIA HUMAN DIAGNOSIS OF TASK DISTRIBUTION SHIFT

Abstract

When agents fail in the world, it is important to understand why. These failures may stem from distribution shifts in the goals desired by the end user or in the environment layouts that affect the policy's actions. For multi-task policies conditioned on goals, this problem manifests as difficulty disambiguating goal failures from policy failures: is the agent failing because it cannot correctly infer the desired goal, or because it does not know how to take actions toward achieving it? We hypothesize that successfully disentangling these two failure modes holds important implications for selecting a finetuning strategy. In this paper, we explore the feasibility of leveraging human feedback to diagnose what vs. how failures for efficient adaptation. We develop an end-to-end policy training framework that uses attention to produce a human-interpretable representation, a visual masked state, that communicates the agent's intermediate task representation. In experiments with human users in both discrete and continuous control domains, we show that our visual attention mask policy helps participants infer the agent's failure mode significantly better than actions alone. Leveraging this feedback, we show subsequent empirical performance gains during finetuning and discuss implications of using humans to diagnose parameter-level failures of distribution shift.

1. INTRODUCTION

Humans are remarkably adept at asking for information relevant to learning a task (Ho & Griffiths, 2022). This is in large part due to their ability to communicate feature-level failures of their internal state via communicative acts to a teacher, e.g. expressing confusion, attention, or understanding (Argyle et al., 1973). Such failures range from not understanding what the task is, e.g. being asked to go to Walgreens without knowing what Walgreens is, to not knowing how to accomplish the task, e.g. being asked to go to Walgreens and not knowing which direction to walk. In both cases, a human learner would clarify why they are unable to complete the task so that they can solicit the feedback most useful for their downstream learning. This synergistic, tightly coupled interaction loop enables a teacher to better estimate the learner's knowledge base and give feedback tailored to filling their knowledge gap (Rafferty et al., 2016). Our sequential decision-making agents face the same challenge when trying to adapt to new scenarios. When agents fail in the world due to distribution shifts between their training and test environments (Levine et al., 2020), it would be helpful to understand why they fail so that we can provide the right data to adapt the policy. The difficulty today with systems trained end-to-end is that they are inherently incapable of expressing the cause of failure and may exhibit arbitrarily bad behaviours, leaving a human user in the dark as to what type of feedback would be most useful for finetuning. Consequently, active learning strategies focus on generating state or action queries that would be maximally informative for the human to label (Akrour et al., 2012; Bobu et al., 2022; Reddy et al., 2020; Bıyık et al., 2019), but such methods require an unscalable amount of human supervision to cover a large task distribution (MacGlashan et al., 2017).
To address the challenge above, we propose a human-in-the-loop framework for training an agent end-to-end that can explicitly communicate information useful for a human to infer the underlying cause of failure and provide targeted feedback for finetuning. In the training phase, we leverage attention to train a policy that produces an intermediate task representation: a masked state containing only the visual information relevant to solving the task. Our key insight is that while visual attention has been studied as a means of visualizing the features behind a deep learning model's black-box predictions, an incorrect visual mask can also help a human infer the underlying parameters of distribution shift when a policy fails. This occurs in the feedback phase, when we use the masked state to help a human infer whether the agent is attending to the right features but acting incorrectly (a how error) or attending to the wrong features (a what error). To close the loop, we leverage the identified failure mode in the adaptation phase to perform more efficient finetuning via targeted data augmentation of the shifted parameter. We formalize the problem setting and describe the underlying assumptions. Next, we present our interactive learning framework for diagnosing and fixing parameter-level shifts using human feedback. Through human experiments, we verify our hypothesis that visual attention is a more informative way for humans to understand agent failures than behaviour alone. Finally, we show that this feedback can be empirically leveraged to improve policy adaptation via targeted data augmentation. We call the full interactive training protocol the visual attention mask policy (VAMP).
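The training-phase idea above, a goal-conditioned policy whose actions are computed only from an attention-masked state, can be sketched as follows. The architecture details here (module sizes, the `GoalConditionedMaskPolicy` name, and the spatial-softmax mask) are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch: a goal-conditioned policy that emits a visual attention
# mask as an intermediate, human-inspectable representation. The policy
# head only sees the masked state, so the mask is the sole channel through
# which task-relevant pixels reach the action computation.
import torch
import torch.nn as nn


class GoalConditionedMaskPolicy(nn.Module):
    def __init__(self, n_actions: int, goal_dim: int, feat_dim: int = 16):
        super().__init__()
        # Convolutional state encoder: image -> spatial feature map.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # Goal embedding used to score each spatial location.
        self.goal_proj = nn.Linear(goal_dim, feat_dim)
        # Policy head over the *masked* state.
        self.policy = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, image, goal):
        feats = self.encoder(image)                    # (B, F, H, W)
        g = self.goal_proj(goal)[:, :, None, None]     # (B, F, 1, 1)
        scores = (feats * g).sum(dim=1, keepdim=True)  # (B, 1, H, W)
        # Spatial softmax -> attention mask, shown to the human as-is.
        mask = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        masked_state = image * mask                    # visual masked state
        return self.policy(masked_state), mask


policy = GoalConditionedMaskPolicy(n_actions=4, goal_dim=8)
img, goal = torch.rand(1, 3, 8, 8), torch.rand(1, 8)
logits, mask = policy(img, goal)  # logits: (1, 4); mask: (1, 1, 8, 8)
```

Because the mask is both a bottleneck for the policy and a rendering for the human, a misplaced mask is directly informative about what the agent believes is task-relevant.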

2. RELATED WORK

Goal-Conditioned Imitation Learning. The learning technique used in our paper is goal-conditioned imitation learning (IL), which seeks to learn a multi-task policy end-to-end by supervised learning, or "cloning", from expert trajectories (Abbeel & Ng, 2004; Ng et al., 2000; Ding et al., 2019). The learning-from-demonstrations framework means that we can optimize a policy without the need for a reward function (Pomerleau, 1988), although we cannot generate new behaviours without feedback. Moreover, unlike standard IL or IRL methods, goal-conditioned approaches can learn a single policy that performs many tasks. Unfortunately, generating enough expert demonstrations to cover a large test distribution is difficult (Ziebart et al., 2008; Finn et al., 2016).

Human-in-the-loop RL. Interactively querying humans for data to aid downstream task learning belongs to a class of problems referred to as human-in-the-loop RL (Abel et al., 2017; Zhang et al., 2019). Existing frameworks like TAMER (Knox & Stone, 2008) and COACH (MacGlashan et al., 2017) use human feedback to train policies, but are restricted to binary or scalar labeled rewards. A different line of work seeks to learn tasks using human preferences, often asking humans to compare or rank trajectory snippets (Christiano et al., 2017; Brown et al., 2020). Yet another direction focuses on how to perform active learning from human teachers, where the emphasis is on gener-



Figure 1: A human user trying to diagnose the agent's failure mode. (A) When the human only sees agent behaviour, it is ambiguous why it is failing. (B) If the human also has access to the agent's intermediate task representation of what it perceives to be relevant to the task, they can infer the type of error and thus the parameters of the distribution shift. For example, if the agent is not attending to the target object, it is likely unfamiliar with the user's stated goal, i.e. a what error. (C) Alternatively, if the agent is attending to the target object but generates the wrong behaviour, it is likely unfamiliar with how to navigate to the object's location, i.e. a how error.
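The decision rule illustrated in Figure 1 can be sketched as a small routing function. The label names, the `diagnose` function, and the augmentation strings below are hypothetical, chosen only to illustrate how a human's diagnosis could select a finetuning strategy:

```python
# Illustrative routing of a human's mask diagnosis to a failure mode,
# and from there to a targeted data-augmentation strategy.
def diagnose(mask_on_target: bool, behaviour_correct: bool) -> str:
    if not mask_on_target:
        return "what"   # wrong features attended: goal inference failed
    if not behaviour_correct:
        return "how"    # right features, wrong actions: control failed
    return "success"


# Targeted augmentation: resample only the parameter inferred to have shifted.
AUGMENTATION = {
    "what": "augment the goal distribution (e.g. unseen target objects)",
    "how": "augment environment layouts (e.g. unseen object locations)",
}

mode = diagnose(mask_on_target=False, behaviour_correct=False)  # -> "what"
```

The point of the sketch is that a single binary judgment about the mask already disambiguates the two failure modes that behaviour alone leaves entangled.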

