SEEING DIFFERENTLY, ACTING SIMILARLY: HETEROGENEOUSLY OBSERVABLE IMITATION LEARNING

Abstract

In many real-world imitation learning tasks, the demonstrator and the learner have to act under different observation spaces. This situation poses significant obstacles to existing imitation learning approaches, since most of them learn policies under homogeneous observation spaces. On the other hand, previous studies under different observation spaces make the strong assumption that the two observation spaces coexist during the entire learning process. In reality, however, this coexistence is limited by the high cost of acquiring expert observations. In this work, we study this challenging problem of heterogeneous observations with limited observation coexistence: Heterogeneously Observable Imitation Learning (HOIL). We identify two underlying issues in HOIL, the dynamics mismatch and the support mismatch, and propose the Importance Weighting with REjection (IWRE) algorithm, based on importance weighting and learning with rejection, to solve HOIL problems. Experimental results show that IWRE can solve various HOIL tasks, including the challenging tasks of transforming vision-based demonstrations into random access memory (RAM)-based policies in the Atari domain, even with limited visual observations.

1. INTRODUCTION

Imitation Learning (IL), which studies how to learn a good policy by imitating given demonstrations (Xu et al., 2020; Chen et al., 2022), has made significant progress in real-world applications such as autonomous driving (Chen et al., 2019), health care (Iyer et al., 2021), and continuous control (Wang et al., 2023). Traditionally, the expert and the learner are assumed to use the same observation space. However, many real-world IL tasks now require removing this assumption (Chen et al., 2019; Warrington et al., 2021), such as in autonomous driving (Chen et al., 2019), recommendation systems (Wu et al., 2019), and medical decision making (Wang et al., 2021a). Take AI for medical diagnosis as an example (Figure 1): a medical AI learns to make medical decisions based on demonstrations from expert doctors. To ensure demonstration quality, the expert may use high-cost observations such as CT, MRI, and B-ultrasound. In contrast, the AI learner should ideally use only low-cost observations from cheaper devices, which could be newly designed ones that the expert has not used before. Meanwhile, to ensure reliability, it is also reasonable to allow the learner to access the high-cost observations during training under a limited budget (Yu et al., 2019).

The above examples share three characteristics: (i) Even though a pair of expert and learner observations can be different, they are under the same state of the environment, leading to similar policies; (ii) The learner's new observations are not available to the expert when generating demonstrations; (iii) During training, the learner can only access expert observations, especially the high-cost ones, under a limited budget, since it is also important to minimize the usage of expert observations during training to avoid unnecessary costs. We name such IL tasks Heterogeneously Observable Imitation Learning (HOIL). Among them, we focus on the most challenging HOIL setting, in which the expert and learner observation spaces have no overlap.

There are two lines of research studying related problems, summarized in Table 1 and Figure 2. The first relates to Domain-shifted IL (DSIL): the observation spaces of experts and learners are homogeneous, while some typical distribution mismatches could exist: morphological, viewpoint, and dynamics mismatch (Stadie et al., 2017; Raychaudhuri et al., 2021). However, the approaches for DSIL are invalid when the observation spaces are heterogeneous, as in HOIL. The second line studies IL under different observations, similar to HOIL. Representative works include Partially Observable IL (POIL) (Gangwani et al., 2019; Warrington et al., 2021) and Learning by Cheating (LBC) (Chen et al., 2019). Both POIL and LBC assume that the learner can easily access the expert observations without any budget limit. However, in practice, unlike the learner observations, access to expert observations might be costly, invasive, or even unavailable (Yu et al., 2019), which hinders the wide application of these methods.

[Table 1 row criteria: (1) O_E ≠ O_L; (2) the demonstrations do not include O_L; (3) the learner does not require O_E + O_L all the time; (4) O_E is not more privileged than O_L.]

In this paper, we initiate the study of the HOIL problem. We propose a learning process across the observation spaces of experts and learners to solve this problem. We further analyze the underlying issues of HOIL, i.e., the dynamics mismatch and the support mismatch. To tackle both issues, we use the techniques of importance weighting (Shimodaira, 2000; Fang et al., 2020) and learning with rejection (Cortes et al., 2016; Geifman & El-Yaniv, 2019) for active querying to propose the Importance Weighting with REjection (IWRE) approach.
We evaluate the effectiveness of the IWRE algorithm in continuous control tasks of MuJoCo (Todorov et al., 2012), and the challenging tasks of learning random access memory (RAM)-based policies from vision-based expert demonstrations in Atari (Bellemare et al., 2013) games. The results demonstrate that IWRE can significantly outperform existing IL algorithms in HOIL tasks, with limited access to expert observations.
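
The two generic ingredients named above, importance weighting and learning with rejection, can be illustrated with a toy sketch. This is not the IWRE algorithm, whose details are not given in this section. It only shows the textbook ideas under simplifying assumptions: the source and target distributions are 1-D Gaussians with known parameters, and the classifier is binary with a fixed confidence threshold. All function names and parameter values below are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(samples, source, target):
    """Density ratios w(x) = p_target(x) / p_source(x).

    `source` and `target` are hypothetical (mean, std) pairs. In practice the
    ratio is unknown and has to be estimated from data.
    """
    return [gaussian_pdf(x, *target) / gaussian_pdf(x, *source) for x in samples]


def weighted_mean(values, weights):
    """Self-normalized importance-weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def accept_mask(probabilities, threshold=0.8):
    """Toy learning with rejection for a binary classifier.

    A prediction is accepted when its confidence max(p, 1 - p) reaches the
    threshold; otherwise the learner abstains (for example, to query an
    expert instead).
    """
    return [max(p, 1.0 - p) >= threshold for p in probabilities]


def demo(n=20000, seed=0):
    """Estimate the target mean from source samples, without and with weights."""
    rng = random.Random(seed)
    source, target = (0.0, 1.0), (1.0, 1.0)  # assumed (mean, std) pairs
    xs = [rng.gauss(*source) for _ in range(n)]
    plain = sum(xs) / n  # estimates the source mean, not the target mean
    weighted = weighted_mean(xs, importance_weights(xs, source, target))
    return plain, weighted
```

Calling `demo()` returns an unweighted average close to the source mean and a weighted average close to the target mean, which is the correction importance weighting provides under distribution shift. `accept_mask` marks the low-confidence predictions on which a learner would abstain.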

2. RELATED WORK

FESL. Recently, to deal with the significant challenges in open environment learning (Zhou, 2022), there have been emerging studies of feature evolvable stream learning (FESL) (Hou et al., 2017; Hou & Zhou, 2018; Zhang et al., 2020), which are among the major inspirations of our work. FESL focuses on supervised learning with heterogeneous feature spaces. There are, however, significant differences between existing FESL approaches and HOIL. Hou et al. (2017) and Hou & Zhou (2018) assume that data features are generated from heterogeneous but fixed and static data distributions, while in HOIL the data distributions change dynamically. FESL with dynamically changing distributions is studied by Zhang et al. (2020), but they assume that the changes are passive to the learner. This is different from HOIL, in which the data distribution changes are actively decided by the learner due to the nature of decision-making learning.

DSIL. For the standard IL process, where the learner and the expert share the same observation space, current state-of-the-art methods tend to learn the policy in an adversarial style (Wang et al., 2021b; c), like Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016). When considering the domain mismatch problem, i.e., DSIL, the research aims at addressing the static distributional shift of the optimal policies resulting from environmental differences, but still under homogeneous observation spaces. Stadie et al. (2017), Sermanet et al. (2018), and Liu et al. (2018)



Figure 1: Medical decision making: an example of the HOIL problem. Figures 1, 2, and 3 include some illustrations and pictures from the Internet (source: https://www.flaticon.com/).

Table 1: Comparisons between different IL processes. O_E and O_L denote the observation spaces for experts and learners, respectively. N/A denotes not applicable.

