SEEING DIFFERENTLY, ACTING SIMILARLY: HETEROGENEOUSLY OBSERVABLE IMITATION LEARNING

Abstract

In many real-world imitation learning tasks, the demonstrator and the learner have to act under different observation spaces. This situation brings significant obstacles to existing imitation learning approaches, since most of them learn policies under homogeneous observation spaces. Meanwhile, previous studies that do consider different observation spaces rely on the strong assumption that the two observation spaces coexist during the entire learning process. In reality, however, observation coexistence is limited due to the high cost of acquiring expert observations. In this work, we study this challenging problem of limited observation coexistence under heterogeneous observations: Heterogeneously Observable Imitation Learning (HOIL). We identify two underlying issues in HOIL, the dynamics mismatch and the support mismatch, and propose the Importance Weighting with REjection (IWRE) algorithm, based on importance weighting and learning with rejection, to solve HOIL problems. Experimental results show that IWRE can solve various HOIL tasks, including the challenging task of transforming vision-based demonstrations into random-access-memory (RAM)-based policies in the Atari domain, even with limited visual observations.

1. INTRODUCTION

Imitation Learning (IL), which studies how to learn a good policy by imitating given demonstrations (Xu et al., 2020; Chen et al., 2022), has made significant progress in real-world applications such as autonomous driving (Chen et al., 2019), health care (Iyer et al., 2021), and continuous control (Wang et al., 2023). Traditionally, the expert and the learner are assumed to use the same observation space. However, many real-world IL tasks now demand removing this assumption (Chen et al., 2019; Warrington et al., 2021), for example in autonomous driving (Chen et al., 2019), recommendation systems (Wu et al., 2019), and medical decision making (Wang et al., 2021a). Take AI for medical diagnosis as an example (Figure 1): a medical AI is learning to make medical decisions from expert doctor demonstrations. To ensure demonstration quality, the expert may use high-cost observations such as CT, MRI, and B-ultrasound. In contrast, the AI learner would ideally use only low-cost observations from cheaper devices, which could be newly designed ones that the expert has never used. Meanwhile, to ensure reliability, it is also reasonable to allow the learner to access the high-cost observations during training under a limited budget (Yu et al., 2019). The above examples share three characteristics: (i) even though a pair of expert and learner observations can differ, they are produced under the same state of the environment, leading to similar policies; (ii) the learner's new observations are not available to the expert when generating demonstrations; (iii) during training, the learner can access expert observations only under a limited budget, especially the high-cost ones, since it is also important to minimize the usage of the


expert observations during training to avoid unnecessary costs. We name such IL tasks Heterogeneously Observable Imitation Learning (HOIL). Among them, we focus on the most challenging HOIL setting, in which the expert and learner observation spaces have no overlap. There are two lines of research studying related problems, summarized in Table 1 and Figure 2. The first relates to Domain-shifted IL (DSIL): the observation spaces of experts and learners are homogeneous, while some typical distribution mismatches can exist: morphological, viewpoint, and dynamics mismatch (Stadie et al., 2017; Raychaudhuri et al., 2021). However, approaches for DSIL are invalid when the observation spaces are heterogeneous, as in HOIL. The second line studies IL under different observations, similar to HOIL. Representative works include Partially Observable IL (POIL) (Gangwani et al., 2019; Warrington et al., 2021) and Learning by Cheating (LBC) (Chen et al., 2019). Both POIL and LBC assume that the learner can easily access the expert observations without any budget limit. In practice, however, unlike the learner observations, access to expert observations might be costly, invasive, or even unavailable (Yu et al., 2019), which hinders the wide application of these methods. In this paper, we initiate the study of the HOIL problem. We propose a learning process across the observation spaces of experts and learners to solve this problem. We further analyze the underlying issues of HOIL, i.e., the dynamics mismatch and the support mismatch. To tackle both issues, we use the techniques of importance weighting (Shimodaira, 2000; Fang et al., 2020) and learning with rejection (Cortes et al., 2016; Geifman & El-Yaniv, 2019) for active querying, and propose the Importance Weighting with REjection (IWRE) approach.
We evaluate the effectiveness of the IWRE algorithm in continuous control tasks of MuJoCo (Todorov et al., 2012) , and the challenging tasks of learning random access memory (RAM)-based policies from vision-based expert demonstrations in Atari (Bellemare et al., 2013) games. The results demonstrate that IWRE can significantly outperform existing IL algorithms in HOIL tasks, with limited access to expert observations.

2. RELATED WORK

FESL. Recently, to deal with the significant challenges of open-environment learning (Zhou, 2022), there have been emerging studies of feature evolvable stream learning (FESL) (Hou et al., 2017; Hou & Zhou, 2018; Zhang et al., 2020), which is among the major inspirations of our work. FESL focuses on supervised learning with heterogeneous feature spaces. There are also significant differences between existing FESL approaches and HOIL. Hou et al. (2017) and Hou & Zhou (2018) assume that data features are generated from heterogeneous but fixed and static data distributions, while in HOIL, the data distributions change dynamically. FESL with dynamically changing distributions is studied by Zhang et al. (2020), but they assume that the changes are passive to the learner. This differs from HOIL, in which the data distribution changes are actively decided by the learner due to the nature of decision-making learning.

DSIL.

For the standard IL process, where the learner and the expert share the same observation space, current state-of-the-art methods tend to learn the policy in an adversarial style (Wang et al., 2021b;c), like Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016). When considering the domain mismatch problem, i.e., DSIL, the research aims at addressing the static distributional shift of the optimal policies resulting from environmental differences, but still under homogeneous observation spaces. Stadie et al. (2017), Sermanet et al. (2018), and Liu et al. (2018) studied the situation where the demonstrations are in the view of a third person. Kim et al. (2020) and Kim et al. (2019) addressed the IL problem with morphological mismatch between the expert's and learner's environments. Tirinzoni et al. (2018), Jiang et al. (2020), and Desai et al. (2020) focused on calibrating the mismatch between simulators and the real world through transfer learning. There are two major differences between HOIL and DSIL: one is that HOIL considers heterogeneous observation spaces instead of homogeneous ones; another is that, without observation heterogeneity, DSIL can directly align two fixed domains, which may not be realistic for solving HOIL when the two observation spaces are totally different. Thus HOIL is a significantly more challenging problem than DSIL. Besides, Chen et al. (2019) learned a vision-based agent from a privileged expert, but it can obtain expert observations throughout the whole learning process, so it cannot handle the problem of support mismatch under HOIL.

[Figure 2: Comparison sketches of IL with respect to observation spaces. The target of (a) is to learn a policy based on the same observations as the expert, while that of (b), (c), and (d) is to learn a policy based on the second observation space only. The detailed differences can be found in Table 1.]

POMDP.
The problem of Partially Observable Markov Decision Processes (POMDPs), in which only partial observations are available to the agent(s), has been studied in the context of multi-agent (Omidshafiei et al., 2017; Warrington et al., 2021) and IL (Gangwani et al., 2019; Warrington et al., 2021) problems. Distinct from HOIL, in a POMDP the learner only has partial observations and shares the same underlying observation space with the expert, which becomes an obstacle to making decisions correctly. For example, Warrington et al. (2021) assumed that the learner's observation is a partial view of the expert's. In HOIL, instead, the expert's and learner's observations are totally different from each other, and the learner's observations do not belong to the expert's observation space. For HOIL, the main challenge is to deal with the mismatches between the observation spaces, especially when access to expert observations is strictly limited.

3. THE HOIL PROBLEM

In this section, we first give a formal definition of the HOIL setting, and then introduce the learning process for solving the HOIL problem.

3.1. SETTING DEFINITION

A HOIL problem is defined within a Markov decision process with multiple observation spaces, i.e., ⟨S, {O}, A, P, γ⟩, where S denotes the state space, {O} denotes a set of observation spaces, A denotes the action space, P : S × A × S → R denotes the transition probability distribution over states and actions, and γ ∈ (0, 1] denotes the discount factor. Furthermore, a policy π over an observation space O is defined as a function mapping from O to A, and we denote by Π_O the set of all policies over O. In HOIL, the expert and the learner each have their own observation space, denoted O_E and O_L respectively. Both O_E and O_L are assumed to be produced by two bijective mappings f_E : S → O_E and f_L : S → O_L, which are unknown functions mapping the underlying true states to observations. Under this assumption, any policy over O_E has a unique correspondence over O_L. This makes HOIL possible, since the target of HOIL is to find the policy over O_L corresponding to the expert policy. A state-action pair (s, a), denoted by x, is called an instance. A trajectory T = {x_i}, i ∈ {1, . . . , m}, is a set of m instances. For each observation space, x̃ ∈ T̃ ⊆ O_E × A and x ∈ T ⊆ O_L × A, where O_E = f_E(S) and O_L = f_L(S). Furthermore, we define the occupancy measure of a policy π under the state space S as ρ_π : S × A → R such that ρ_π(x) = π(a|o) Pr(o|s) Σ_{t=0}^∞ γ^t Pr(s_t = s | π). Under HOIL, the learner accesses the expert demonstrations T̃_πE, a set of instances sampled from ρ_πE. The goal of HOIL is to learn a policy π over O_L as the corresponding policy of π_E. If O_E = O_L, HOIL degenerates to standard IL.
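As a toy illustration of the bijectivity assumption (all functions and the integer state space here are hypothetical stand-ins, not from the paper), the following sketch shows why any policy over O_E has a unique corresponding policy over O_L: recover the state through the inverse of one mapping, then re-observe through the other.

```python
# Toy sketch of heterogeneous observation spaces over a shared state space.
# f_E and f_L are hypothetical bijections from states to expert/learner views.

def f_E(s):
    return 2 * s          # expert observation map (toy: scaled states)

def f_L(s):
    return s + 100        # learner observation map (toy: shifted states)

def expert_policy(o_E):
    return o_E % 3        # a policy defined over O_E, toy action in {0, 1, 2}

def corresponding_learner_policy(o_L):
    # The unique correspondence over O_L: invert f_L to recover s,
    # then re-observe the same state via f_E and apply the expert policy.
    s = o_L - 100         # f_L^{-1}
    return expert_policy(f_E(s))

# Both policies act identically because they see the same underlying state.
for s in range(10):
    assert corresponding_learner_policy(f_L(s)) == expert_policy(f_E(s))
```

The bijectivity of f_E and f_L is what makes this correspondence unique; with non-invertible (e.g., partial) observations, as in POMDPs, this construction would fail.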
GAIL (Ho & Ermon, 2016) is one of the state-of-the-art IL approaches in this situation, which tries to minimize the divergence between the learner's and the expert's occupancy measures:

min_π max_w E_{x̃∼ρ_πE}[log D_w(x̃)] + E_{x̃∼ρ_π}[log(1 − D_w(x̃))] − H(π),   (1)

where H(π) is the causal entropy serving as a regularization term, and D_w : O_E × A → [0, 1] is the discriminator between π_E and π. GAIL solves Equation (1) by alternately taking a gradient ascent step to train the discriminator D_w, and a minimization step to learn the policy π with an off-the-shelf reinforcement learning (RL) algorithm using the pseudo reward −log D_w(x̃).
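The adversarial reward idea can be illustrated with a minimal numpy sketch (toy Gaussian data and a linear-logistic discriminator are our assumptions; the paper uses neural networks). Following the common GAIL convention, consistent with the pseudo reward above, the discriminator is trained to score the learner's own samples high, so −log D(x) is large exactly where samples look expert-like.

```python
# Minimal sketch of an adversarial pseudo reward: a logistic discriminator D
# is trained to distinguish learner samples (label 1) from expert samples
# (label 0); the pseudo reward -log D(x) then rewards expert-like behavior.
import numpy as np

rng = np.random.default_rng(0)
expert = rng.normal(0.0, 1.0, size=(500, 2))    # toy samples from rho_piE
learner = rng.normal(3.0, 1.0, size=(500, 2))   # toy samples from rho_pi

X = np.vstack([learner, expert])
y = np.hstack([np.ones(500), np.zeros(500)])    # 1 = learner, 0 = expert

w, b = np.zeros(2), 0.0
for _ in range(2000):                           # plain gradient ascent on the log-likelihood
    D = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = y - D
    w += 0.1 * X.T @ grad / len(X)
    b += 0.1 * grad.mean()

def pseudo_reward(x):
    D = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -np.log(D + 1e-8)

# Expert-like instances receive a higher pseudo reward than learner-like ones,
# so an RL algorithm maximizing this reward pushes the learner toward the expert.
assert pseudo_reward(np.array([0.0, 0.0])) > pseudo_reward(np.array([3.0, 3.0]))
```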

3.2. THE LEARNING PROCESS FOR SOLVING HOIL

In HOIL, we need to cope with the absence of the learner's observations in the demonstrations and the high cost of collecting expert observations while learning. We therefore introduce a learning process with pretraining across the two different observation spaces, as abstracted in Figure 3.

Pretraining. As in LBC (Chen et al., 2019), we assume that we can obtain an auxiliary policy π_1 based on O_E at the beginning. π_1 can be provided directly by any source, or trained by GAIL or behavior cloning (Michie et al., 1990) from an online data stream (Cai et al., 2019) or offline demonstrations (Sasaki & Yamashina, 2021), as done in LBC. We then use π_1 to sample some data T_π1, which contain paired observations under both O_E (i.e., T̃_π1) and O_L (i.e., T_π1), in order to connect the two observation spaces. We name T_π1 = {T̃_π1, T_π1} the initial data.

Training. Here we learn a policy π_2 from the initial data T_π1 and the collected data T_π2, under O_L only. In addition, the learner is allowed a limited number of observation coexistence (OC) operations: at some steps of learning, besides the observations under O_L, the learner can request T̃_π2, the corresponding observations under O_E (e.g., from human-understandable sensors). The final objective of HOIL is to learn a good policy π_2 under O_L.

In practical applications, the auxiliary policy π_1 can also come from simulation training or direct imitation. But since π_1 is additionally provided, it is more practical to consider π_1 a non-optimal policy. During training, OC is an essential operation for solving HOIL: it helps the learner address the dynamics mismatch and the support mismatch (especially the latter). Also, in reality, we do not need an oracle for actions, which would still require OC to obtain expert observations first, as in much active querying research (Brantley et al., 2020; Chen et al., 2019), so the cost of OC is relatively low.
The related work LBC (Chen et al., 2019) also required an initial policy π_1 to solve its problem: π_1 acts as a teacher under the privileged O_E in pretraining, and a vision-based student is then learned from the teacher's guidance under both O_L and O_E. Their setting can be viewed as a variant of HOIL with an optimal π_1, unlimited O_E, and unlimited OC operations, so HOIL is a more practical learning framework.

4. IMITATION LEARNING WITH IMPORTANCE WEIGHTING AND REJECTION

In HOIL, the access frequency to O_E is strictly limited, so it is unrealistic to learn π_2 in a Dataset Aggregation (DAgger) style (Ross et al., 2011) as in LBC. Therefore, we resort to learning π_2 with a reward function learned by inverse RL (Abbeel & Ng, 2010) in an adversarial style (Ho & Ermon, 2016; Fu et al., 2018).

4.1. DYNAMICS MISMATCH AND IMPORTANCE WEIGHTING

To analyze the learning process, let ρ_πE, ρ_π1, and ρ_π2 be the occupancy measure distributions of the expert demonstrations, the initial data, and the data collected during training, respectively. Since we need to consider the sub-optimality of π_1, ρ_π1 should be a mixture of the expert ρ_πE and a non-expert ρ_πNE, i.e., there exists some δ ∈ (0, 1) such that ρ_π1 = δρ_πE + (1 − δ)ρ_πNE, as depicted in Figure 4a. During training, the original objective of π_2 is to imitate π_E through the demonstrations. To this end, the original objective of the reward function D_w2 for π_2 is to optimize

max_w2 E_{x∼ρ_π2}[log D_w2(x)] + E_{x∼ρ_πE}[log(1 − D_w2(x))].

But the expert demonstrations are only available under O_E; during training, we can only utilize the initial data T_π1 ∼ ρ_π1 to learn π_2 and D_w2. Moreover, as π_1 is sub-optimal, directly imitating T_π1 would reduce the performance of π_2 to that of π_1. So we use importance weighting to calibrate this dynamics mismatch:

max_w2 L(D_w2) = E_{x∼ρ_π2}[log D_w2(x)] + E_{x∼ρ_π1}[α(x) log(1 − D_w2(x))],

where α(x) ≜ ρ_πE(x)/ρ_π1(x) is an importance weighting factor (Fang et al., 2020). The current issue thus lies in how to estimate ρ_πE/ρ_π1 under O_E. To achieve this, we need to bridge the expert demonstrations and the initial data. Therefore, we use these two data sets to train an adversarial model D_w1, in the same way as D_w2, in the pretraining:

max_w1 L(D_w1) ≜ E_{x̃∼ρ_π1}[log D_w1(x̃)] + E_{x̃∼ρ_πE}[log(1 − D_w1(x̃))].   (5)

Writing the training criterion (5) as an integral, max_w1 L(D_w1) = ∫_x [ρ_π1 log D_w1 + ρ_πE log(1 − D_w1)] dx, and setting the derivative of the objective to 0 (∂L/∂D_w1 = 0), we obtain the optimum

D*_w1 = ρ_π1 / (ρ_π1 + ρ_πE),

in which the order of differentiation and integration is exchanged by the Leibniz rule. Moreover, we can train D_w1 sufficiently using the initial data T̃_π1 and the expert demonstrations T̃_πE.
Then D_w1 will be good enough to estimate the importance weighting factor, i.e.,

α(x) ≜ ρ_πE/ρ_π1 = (1 − D*_w1(x̃)) / D*_w1(x̃) ≈ (1 − D_w1(x̃)) / D_w1(x̃).

In this way, we can use D_w1, which connects the demonstrations and the initial data, to calibrate the learning of D_w2. The final optimization objective for D_w2 is

max_w2 L(D_w2) = E_{x∼ρ_π2}[log D_w2(x)] + E_{x∼ρ_π1}[(1 − D_w1(x̃))/D_w1(x̃) · log(1 − D_w2(x))].   (9)

In this way, D_w2 can effectively dig out the expert part of ρ_π1 and produce effective rewards for π_2.
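The estimation of α(x) from D_w1 can be sketched in numpy (the one-dimensional toy data and the linear-logistic D_w1 are our assumptions): D_w1 is trained with the initial data as the positive class and the demonstrations as the negative class, and α = (1 − D_w1)/D_w1 then up-weights the expert-like part of ρ_π1 and down-weights the rest.

```python
# Sketch of estimating alpha(x) = (1 - D_w1(x)) / D_w1(x) ~ rho_piE / rho_pi1.
# rho_pi1 is a toy mixture (delta = 0.3) of expert and non-expert behavior.
import numpy as np

rng = np.random.default_rng(1)
expert = rng.normal(0.0, 1.0, size=(1000, 1))                        # rho_piE
pi1 = np.vstack([rng.normal(0.0, 1.0, size=(300, 1)),                # expert part of rho_pi1
                 rng.normal(4.0, 1.0, size=(700, 1))])               # non-expert part

X = np.vstack([pi1, expert])
y = np.hstack([np.ones(len(pi1)), np.zeros(len(expert))])            # 1 = pi1, 0 = expert

w, b = np.zeros(1), 0.0
for _ in range(3000):                    # logistic regression by gradient ascent
    D = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.1 * X.T @ (y - D) / len(X)
    b += 0.1 * (y - D).mean()

def alpha(x):
    D1 = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (1.0 - D1) / (D1 + 1e-8)

# Expert-like instances are up-weighted, non-expert-like ones down-weighted,
# so D_w2 is pushed toward imitating only the expert part of the initial data.
assert alpha(np.array([0.0])) > alpha(np.array([4.0]))
```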

4.2. SUPPORT MISMATCH

So far, the challenges have been similar to homogeneously observable IL. However, our preliminary experiments showed that importance weighting alone is not enough to fix the problems caused by the absence of interactions under O_E, so there must exist other issues between the expert demonstrations and the initial data. To find the underlying issues, we plotted t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008) visualizations of the two empirical distributions under O_E on Hopper and Walker2d in Figure 5, with twenty trajectories collected for both the expert demonstrations and the initial data. We can observe that there exist some high-density regions of the demonstrations that the initial data do not cover; that is, there exist regions of the demonstrations that π_1 did not explore. Wang et al. (2019) found a similar phenomenon in the standard IL setting. On the other hand, the importance weighting factor α cannot calibrate this situation, where ρ_πE/ρ_π1 = ∞. To formalize this problem, we introduce the support set of the occupancy measure:

Definition 1 (Support Set). The support set of an occupancy measure ρ is the subset of the domain containing the elements that are not mapped to zero: supp(ρ) := {x ∈ S × A | ρ(x) ≠ 0}.

Due to the sub-optimality of π_1, supp(ρ_πE) \ supp(ρ_π1) ≠ ∅ (see Figure 4b). We call this part the latent demonstration:

Definition 2 (Latent Demonstration). The latent demonstration H is the part of the domain that belongs to the relative complement of supp(ρ_π1) in supp(ρ_πE): H := {x ∈ S × A | x ∈ supp(ρ_πE) \ supp(ρ_π1)}.

The other part of the demonstration is named the observed demonstration:

Definition 3 (Observed Demonstration). The observed demonstration O is the part of the domain that belongs to the complement of H in supp(ρ_πE): O := {x ∈ S × A | x ∈ supp(ρ_πE) ∩ supp(ρ_π1)}.
Besides, the data outside of the demonstrations should be non-expert data:

Definition 4 (Non-Expert Data). The non-expert data N is the part of the domain outside supp(ρ_πE): N := {x ∈ S × A | ρ_πE(x) = 0}.

In other words, the sub-optimality of π_1 causes not only the dynamics mismatch but also the appearance of the latent demonstration H. We call the latter the problem of support mismatch. Intuitively, when π_2 → π_E, we have H → ∅ monotonically. So in order to fix the support mismatch between ρ_πE and ρ_π1, guiding π_2 to find H is the key. In addition, the support mismatch problem can be viewed as an inverse of the out-of-distribution (OOD) problem that frequently occurs in the offline RL setting (Levine et al., 2020), where one tries to avoid supp(ρ_π1) \ supp(ρ_πE) instead.
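Definitions 1-4 can be sketched in a few lines (the interval-shaped supports below are hypothetical stand-ins for occupancy measures): an instance lies in H, O, or N depending only on whether the expert and initial-data occupancies are zero at that instance.

```python
# Toy sketch of the H / O / N partition from Definitions 1-4.
# The densities are hypothetical: the expert covers [0, 10), pi_1 covers [5, 15).

def rho_piE(x):
    return 0.1 if 0 <= x < 10 else 0.0   # toy expert occupancy measure

def rho_pi1(x):
    return 0.1 if 5 <= x < 15 else 0.0   # toy initial-data occupancy measure

def region(x):
    if rho_piE(x) == 0.0:
        return "N"   # outside supp(rho_piE): non-expert data
    if rho_pi1(x) == 0.0:
        return "H"   # in supp(rho_piE) \ supp(rho_pi1): latent demonstration
    return "O"       # in supp(rho_piE) ∩ supp(rho_pi1): observed demonstration

assert region(2) == "H"    # the expert visited it, pi_1 never did
assert region(7) == "O"    # both visited it
assert region(12) == "N"   # the expert never visited it
```

Note that importance weighting is well defined on O but blows up on H (division by a zero density), which is exactly why rejection is introduced next.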

4.3. IMITATION LEARNING WITH REJECTION

We can observe that H ∪ O ∪ N = S × A. These three regions can be characterized as

H = {x ∈ S × A | I[D*_w1(x̃)]g*_1(x̃) = I[D*_w2(x)]g*_2(x) = +1},
O = {x ∈ S × A | I[D*_w1(x̃)]g*_1(x̃) = I[D*_w2(x)]g*_2(x) = 0},
N = {x ∈ S × A | I[D*_w1(x̃)]g*_1(x̃) = I[D*_w2(x)]g*_2(x) = −1},

where I[·] takes +1 if · > 0.5, and −1 otherwise. I[D*_w(x)]g*(x) is depicted in Figure 4c. To this end, both g_1 and g_2 should be able to cover O, while g_2 must also adapt to the continuous change of ρ_π2 caused by the updates of π_2. We therefore learn g_1 and g_2 in a rejection form, to reject O from O ∪ H (where I(D_w) = +1). Concretely, the rejection setting is the same as in Cortes et al. (2016). Inspired by Geifman & El-Yaniv (2019), the optimization objective of D_w and g is

L(D_w, g) ≜ l(D_w, g) + λ max(0, c − φ(g))²,   (17)

where c > 0 denotes the target coverage and λ denotes the factor controlling the relative importance of rejection. The empirical coverage φ(g) is defined as φ(g|X) ≜ (1/m) Σ_{i=1}^m g(x_i), with a batch of data X = {x_i}, i ∈ [m]. The empirical rejection risk l(D_w, g) is the ratio between the covered risk of the discriminator and the empirical coverage:

l(D_w, g) ≜ (1/m) Σ_{i=1}^m ⟨L(D_w(x_i)), g(x_i)⟩ / φ(g).

Meanwhile, both D_w1 and g_1 can access ρ_πE under O_E directly. So given x ∼ T_π2 under O_L, once ⟨I(D_w2(x)), g_2(x)⟩ = +1, we query the corresponding observation x̃ of x through the OC operation and use ⟨I(D_w1(x̃)), g_1(x̃)⟩ to calibrate the outputs of g_2 and D_w2. In this way, g_2 and D_w2 are entangled together and adaptively guide π_2 to find the latent demonstrations H under O_L.
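The rejection objective (17) can be sketched with a short numpy function (the toy per-sample losses and soft accept scores below are our assumptions; the paper applies this to discriminator outputs). Rejecting hard samples lowers the covered risk but pays a quadratic penalty whenever the coverage drops below the target c.

```python
# Sketch of the selective/rejection loss L(D, g) = l(D, g) + lam * max(0, c - phi(g))^2,
# in the style of Geifman & El-Yaniv (2019): phi(g) is the empirical coverage,
# and l(D, g) is the covered risk divided by the coverage.
import numpy as np

def rejection_loss(per_sample_loss, g, c=0.8, lam=1.0):
    """per_sample_loss: losses L(D_w(x_i)); g: soft accept scores in [0, 1]."""
    phi = g.mean()                                    # empirical coverage phi(g)
    covered_risk = (per_sample_loss * g).mean() / (phi + 1e-8)
    coverage_penalty = lam * max(0.0, c - phi) ** 2   # penalize covering less than c
    return covered_risk + coverage_penalty

losses = np.array([0.1, 0.1, 2.0, 2.0])         # two easy, two hard samples
g_all = np.ones(4)                              # accept everything
g_reject_hard = np.array([1.0, 1.0, 0.1, 0.1])  # mostly reject the hard samples

# Rejecting the hard samples lowers the total loss despite the coverage penalty.
assert rejection_loss(losses, g_reject_hard) < rejection_loss(losses, g_all)
```

The hyper-parameter values c = 0.8 and λ = 1.0 match the ones reported in the experimental setup of the appendix.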

4.4. IWRE

Here we combine importance weighting and rejection into a unified procedure, resulting in a novel algorithm named Importance Weighting with REjection (IWRE). Concretely, in a HOIL process:

Pretraining. We train a discriminator D_w1 by Equation (5) and its corresponding rejection model g_1 by Equation (17), using the initial data and the expert demonstrations.

Training. We train a discriminator D_w2 by the combination of Equations (9) and (17), as well as its corresponding rejection model g_2 by Equation (17), using the initial data, the data collected by π_2, and the outputs of D_w1 with g_1 through OC operations. Meanwhile, π_2 is updated against D_w2 and g_2 alternately, as in GAIL. The pseudo-code of our algorithm is provided in the appendix.

5. EXPERIMENT

5.1. SETUP

We compared IWRE with several contenders. IW performs importance weighting only, learning D_w2 by Equation (9); TPIL learns from third-person demonstrations by adding a cross-entropy loss to the update of the feature extractor; GAMA learns a mapping function ψ via adversarial training to align observations of the target domain to the source domain, and can thereby utilize the source-domain policy for zero-shot imitation. For fairness, we allowed interaction between the policy and the environment for GAMA under HOIL. LBC uses π_1, learned from privileged states, as a teacher to train π_2 in a DAgger (Ross et al., 2011) style, so here we allowed LBC to access O_E during the whole IL process. In Atari, to investigate whether our method could achieve good performance for RAM-based control, we further included a contender PPO-RAM, which uses proximal policy optimization (PPO) (Schulman et al., 2017) to perform RL directly with environmental true rewards under the RAM-based observations. More detailed setup, including the query strategies for TPIL and GAMA, network architectures, and hyper-parameters, is reported in the appendix.

5.2. RESULTS

Experimental results are reported in Figure 6. Since the mapping function is hard to learn when the input is RAM and the output is raw images, we omit the results of GAMA in Atari. We can observe that while IW is better than GAIL in most environments, both GAIL and IW can hardly outperform π_1, because they just imitated the performance of π_1 instead of π_E, even with importance weighting for calibration. For TPIL, the learning process was extremely unstable on Hopper, Swimmer, and Walker2d due to the continuous distribution shift. Furthermore, the performance of GAMA was not satisfactory on Hopper and Walker2d because its mapping function is hard to learn well when the support mismatch appears. The results of TPIL and GAMA demonstrate that DSIL methods are invalid under heterogeneous observations, as in HOIL tasks. On Atari environments, O_E contains more privileged information than O_L, so LBC can achieve good performance. But when O_E is not more privileged than O_L, as in most MuJoCo environments, its performance decreases due to the support mismatch, which can make it even worse than BC. Finally, IWRE obtained the best performance on 6/8 environments, and comparable performance with LBC on Reacher, which shows the effectiveness of our method even with limited access to O_E (LBC can access O_E all the time). Besides, the performance differences between GAIL/IW and IWRE/TPIL/GAMA/LBC are huge (especially on Reacher) because of the absence of queries, which demonstrates that the query operation is indeed necessary for HOIL problems. Moreover, even learned with true rewards, PPO-RAM surprisingly failed to achieve performance comparable to IWRE, which shows that IWRE can possibly learn more effective rewards than the true environmental rewards in RAM-input tasks.
These results verify that IWRE provides a powerful approach to HOIL problems, even when the demonstrations are gathered from such a different observation space and O_E is strictly limited during training. We further visualized the distributions under O_E, which is hidden from π_2, with all setups the same as in Section 4.2. From the results shown in Figure 7, we can see that, even under O_E, almost all high-density regions of the demonstrations were covered by the collected data, and the latent demonstration H was nearly all dug out. These results demonstrate that IWRE essentially solves the problem of support mismatch and thereby performs well in these environments. Besides, some of the data collected by the π_2 of IWRE were outside the distribution of the demonstrations, which means π_2 slightly over-explored the environment. Since O_E is hidden from π_2, the reward function will encourage π_2 to explore more areas to fix the support mismatch problem. Meanwhile, the out-of-distribution problem in HOIL is not as severe as in offline RL settings (Levine et al., 2020), so this over-exploration phenomenon makes sense.

6. CONCLUSION

In this paper, we proposed a new learning framework named Heterogeneously Observable Imitation Learning (HOIL) to formulate situations where the observation space of the demonstrations differs from that of the imitator while learning. We formally modeled the learning process of HOIL, in which access to expert observations is limited due to their high cost. Furthermore, we analyzed the underlying challenges of HOIL, the dynamics mismatch and the support mismatch, on the occupancy distributions between the demonstrations and the policy. To tackle these challenges, we proposed a new algorithm named Importance Weighting with REjection (IWRE), using importance weighting and learning with rejection. Experimental results showed that direct imitation and domain-adaptive methods cannot solve this problem, while our approach obtained promising results. In the future, we hope to give theoretical guarantees for IWRE and to investigate how many O_E observations we need to query to learn a promising π_2. Furthermore, we hope to use the HOIL framework and IWRE to tackle more learning scenarios with demonstrations in different spaces.

A NOTATIONS

The notations of the main paper are gathered in Table 2.

B PSEUDO-CODE

Algorithm 1: IWRE

2: Sample the evolving data {T̃_π1, T_π1} ∼ ρ_π1 by π_1.
3: Train D_w1 and g_1 by Equations (5) and (17) respectively, using T̃_πE and T̃_π1.
4: Initialize π_2, D_w2, and g_2. Sample T_π2 ∼ ρ_π2 by π_2.
5: for each mini-batch {x_π2} and {x_π1} from T_π2 and T_π1 do
6:   Update π_2 by an RL algorithm (such as PPO (Schulman et al., 2017)) using instances {x_π2} and pseudo rewards {−log D_w2(x_π2)}.
7:   Update D_w2 by Equation (9) using negative instances {x_π2} and positive ones {x_π1}.
8:   if ⟨I(D_w2(x_π2)), g_2(x_π2)⟩ = +1 then
9:     Query the O_E observation of x_π2, i.e., x̃_π2, through the OC operation.
10:    Update D_w2 and g_2 by Equation (17) using the instance x̃_π2 and the corresponding label ⟨I(D_w1(x̃_π2)), g_1(x̃_π2)⟩.
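The control flow of this training loop can be sketched schematically (every component below is stubbed out with illustrative stand-ins; none of these names come from the paper's code). The sketch highlights how the OC query budget gates the rejection-model updates: expert observations are requested only when the learner-side detector fires, and never beyond the budget.

```python
# Schematic sketch of the IWRE training loop: policy and discriminator updates
# are elided; only the budgeted OC query logic is made concrete.
queries = 0

def accept(x):                 # stand-in for <I(D_w2(x)), g_2(x)> = +1
    return x % 3 == 0          # toy rule: every third instance triggers a query

def oc_query(x):               # stand-in for the OC operation (consumes budget)
    global queries
    queries += 1
    return ("O_E view of", x)  # the corresponding expert-space observation

budget = 5
for x in range(20):            # mini-batches of learner instances under O_L
    # ... update pi_2 with pseudo reward -log D_w2(x) (omitted) ...
    # ... update D_w2 with the importance-weighted loss (omitted) ...
    if accept(x) and queries < budget:
        x_tilde = oc_query(x)  # calibrate g_2 / D_w2 with <I(D_w1), g_1> labels
        # ... update D_w2 and g_2 using x_tilde (omitted) ...

assert queries <= budget       # OC access stays within the limited budget
```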

C DEFINITIONS

The core challenges of HOIL, i.e., the dynamics mismatch and the support mismatch, are defined below.

Definition 5 (Dynamics Mismatch). The dynamics mismatch between the demonstrations and the initial data denotes the situation that

ρ_πE / ρ_π1 = [π_E(a|o) Σ_{t=0}^∞ γ^t Pr(s_t = s|π_E)] / [π_1(a|o) Σ_{t=0}^∞ γ^t Pr(s_t = s|π_1)] ≠ 1.

Definition 6 (Support Mismatch). The support mismatch between the demonstrations and the initial data denotes the situation that

supp(ρ_πE) \ supp(ρ_π1) = {x ∈ S × A | ρ_πE(x) ≠ 0} \ {x ∈ S × A | ρ_π1(x) ≠ 0} ≠ ∅.   (21)

D DETAILED SETUP FOR THE EXPERIMENTS

Environment and Contenders. For the environments used in the main body of the paper: 1. Pixel-memory Atari games. O_E: 84 × 84 × 4 raw pixels; O_L: 128-byte random access memories (RAM). Expert: converged DQN-based agents (Mnih et al., 2013). Atari games contain two totally isolated views, raw pixels and RAM, under the same state. Through these environments, we want to investigate whether the agent can learn an effective policy from demonstrations under completely different observation spaces. Moreover, IL with visual observations only is already very difficult (Cai et al., 2021), and learning a RAM-based policy can be even more challenging (Bellemare et al., 2013; Sygnowski & Michalewski, 2016). The details of the environments are reported in Table 3. The detailed comparisons between the contenders (both in the main paper and the appendix) and IWRE are gathered in Table 4.

Learning process. To simulate the situation that O_E is costly, the number of steps for training π_1 was set to 1/4 of that for training π_2, using GAIL (Ho & Ermon, 2016) / HashReward (Cai et al., 2021) under the O_E space for the MuJoCo/Atari environments. The learning steps were 10^7 for MuJoCo and 5 × 10^6 for Atari environments.
In the pretraining, we sampled 20 trajectories from π_1, where the data from each trajectory had both O_E and O_L observations. In the training, each method learned for 4 × 10^7 steps on MuJoCo and 2 × 10^7 steps on Atari under the O_L space to obtain π_2.

Query strategy. For TPIL and GAMA, if the output of the domain-invariant discriminator is larger than 0.5, which means the encoder fails to generate proper features to confuse its discriminator, we query the O_E observation of this data to update the encoder. For IWRE, the threshold of the rejection model g and the discriminator D_w2 was also 0.5: if g_2(x) > 0.5 and D_w2(x) > 0.5, the O_E observation of this data is queried. D_w2, π_2, and the encoder (for TPIL/GAMA) were pretrained for 100 epochs for all methods using the evolving data during pretraining. The basic RL algorithm is PPO, and the reward signals of all methods were normalized into [0, 1] to enhance the performance of RL (Dhariwal et al., 2017). The buffer size for TPIL and IWRE was set to 5000. Each time the buffer is full, the encoder and the rejection model are updated for 4 epochs; LBC also updates π_2 for 100 epochs with batch size 256, using the cross-entropy loss for Atari and the mean-square loss for MuJoCo. We set all hyper-parameters, update frequencies, and network architectures of the policy part the same as Dhariwal et al. (2017). The hyper-parameters of the discriminator were the same for all methods: the rejection model and the discriminator were updated using Adam with a decayed learning rate of 3 × 10^{-4}; the batch size was 256; the ratio of update frequency between the learner and the discriminator was 3:1. The target coverage c in Equation (17) was set to 0.8, and λ in Equation (17) was 1.0.
$$\text{Accuracy}_H = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 1 \,\wedge\, I[D_{w_2}(\tilde{x}_i)]g_2(\tilde{x}_i) = 1\big\}}{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 1\big\}},$$

$$\text{Accuracy}_O = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 0 \,\wedge\, I[D_{w_2}(\tilde{x}_i)]g_2(\tilde{x}_i) = 0\big\}}{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 0\big\}},$$

$$\text{Accuracy}_N = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = -1 \,\wedge\, I[D_{w_2}(\tilde{x}_i)]g_2(\tilde{x}_i) = -1\big\}}{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = -1\big\}},$$

$$\text{Ratio}_H = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 1\big\}}{m}, \quad \text{Ratio}_O = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = 0\big\}}{m}, \quad \text{Ratio}_N = \frac{\sum_{i=1}^{m} \mathbb{1}\big\{I[D_{w_1}(\bar{x}_i)]g_1(\bar{x}_i) = -1\big\}}{m},$$

in which $\{\bar{x}_i, \tilde{x}_i\} \sim \rho_{\pi_2}$ denotes a batch of data sampled by π_2. The results are shown in Figure 9. They depict not only the accuracies of I[D_{w_2}]g_2, but also how the three areas change during policy learning. The accuracies in each area and the ratio of O decrease at first, while the ratio of H increases: the successful detection of H lowers the estimated ratio of O and temporarily reduces the accuracy of I[D_{w_2}]g_2. With the help of the query operations, the accuracy of I[D_{w_2}]g_2 then gradually increases. Moreover, as the policy π_2 improves, more and more of H is recognized as O and less data falls into N, which is why the ratios of H and N decrease in the later period while that of O increases. These results verify that IWRE can indeed detect H, O, and N successfully as the policy learns.
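The accuracies and ratios above can be computed straightforwardly from the two models' per-sample labels. The sketch below assumes the O_E-side predictions I[D_{w_1}]g_1 serve as the reference labels (as in the formulas), with labels +1, 0, and −1 for H, O, and N; the function name is illustrative:

```python
import numpy as np

def area_metrics(pred_e, pred_l):
    """Compute Accuracy_{H,O,N} and Ratio_{H,O,N}.

    pred_e: per-sample outputs of I[D_w1]g_1 under O_E (+1, 0, or -1),
            used as the reference labels.
    pred_l: per-sample outputs of I[D_w2]g_2 under O_L (same label set).
    """
    pred_e = np.asarray(pred_e)
    pred_l = np.asarray(pred_l)
    m = len(pred_e)
    out = {}
    for name, label in (("H", 1), ("O", 0), ("N", -1)):
        mask = pred_e == label
        out[f"Ratio_{name}"] = mask.sum() / m
        # accuracy over the samples the reference model assigns to this area
        out[f"Accuracy_{name}"] = (
            (pred_l[mask] == label).mean() if mask.any() else float("nan")
        )
    return out
```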

G INVESTIGATIONS ON ChopperCommand PERFORMANCE

We observed some interesting phenomena in the ChopperCommand performance. Here we investigate the behavior of a random policy, the IWRE policy, and the expert policy, shown in Figure 10. In ChopperCommand, the agent needs to imitate the expert in manipulating a helicopter (marked by the black circle) to avoid attacks from enemy aircraft while shooting them down. The policy in this environment can thus be split into two parts: 1. the movement of the helicopter; 2. the shooting of the helicopter. Capturing image-level semantic information with only RAM input is quite difficult. Even so, IWRE still successfully learned the expert's movement of the helicopter (see the rows "IWRE 2e7 steps" and "Expert Policy" in Figure 10). However, it is much harder to capture the shot bullet in the image, not to mention in the RAM, so the IWRE policy did not shoot enemy aircraft successfully. Meanwhile, the environmental reward of ChopperCommand is only related to the number of enemy aircraft shot down, regardless of the distance the helicopter has flown. Therefore, in Figure 6a, the IWRE policy obtains higher rewards than the random policy at the beginning of training (1e5 steps), after which the environmental reward fluctuates within a specific range. On the other hand, IWRE indeed learned part of the expert's policy of manipulating the helicopter, in view of the trajectory shown in Figure 10. We therefore believe that considering only the reward value as an evaluation is insufficient to reflect the degree of imitation on ChopperCommand. This phenomenon also reveals that the reward function learned from IL can be very different from the environmental reward function in RL tasks, since the IL reward can capture expert behaviors that are not reflected by the environmental reward signals. Figure 10 can thus also be used as a metric. We also provide the



Figure 1: Medical decision making: an example of the HOIL problem. Figures 1, 2, and 3 include some illustrations and pictures from the Internet (source: https://www.flaticon.com/).

Comparisons between different IL processes. O_E and O_L denote the observation spaces for experts and learners respectively. N/A denotes not applicable.

Setting | DSIL | POIL | LBC | HOIL (ours)
O_E ≠ O_L | ✗ | ✓ | ✓ | ✓
The learner does not require O_E + O_L all the time | N/A | | |

Figure 3: Illustration of a learning process across two different observation spaces for solving HOIL. π_1 is an auxiliary policy that is additionally provided.

GAIL minimizes the discrepancy between the learner's and the expert's occupancy measures, d(ρ_π, ρ_{π_E}). The objective of GAIL is

$$\min_{\pi} \max_{D} \; \mathbb{E}_{\rho_{\pi}}[\log D(s, a)] + \mathbb{E}_{\rho_{\pi_E}}[\log(1 - D(s, a))].$$

Figure 4: The comparisons among the distributions of expert demonstrations ρ_{π_E}, initial data ρ_{π_1}, and non-expert data ρ_{π_NE}. The red and blue regions denote the expert and non-expert parts of ρ_{π_1} respectively. H, O, and N denote the latent demonstrations, the observed demonstrations, and the non-expert data respectively. (a) The ideal situation, where supp(ρ_{π_E}) \ supp(ρ_{π_1}) = ∅; (b) the real situation, where H := supp(ρ_{π_E}) \ supp(ρ_{π_1}) ≠ ∅ in ρ_{π_E}; (c) the target output of the combined model I[D_{w}^{*}]g^{*}. The output +1, 0, and -1 regions correspond to H, O, and N respectively. In addition, both O_E and O_L are assumed to share the same latent state space S as introduced in Section 3.1, so the following analysis is based on S, while the algorithm handles the problem based on O_E and O_L specifically.
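The target labeling of Figure 4(c) can be sketched as a small decision rule. This is one plausible reading of how I[D]g combines the two models (the threshold and the exact composition are assumptions, not the paper's code): the discriminator separates expert-like data (O ∪ H) from N, and the rejection model then separates H from O.

```python
def combined_label(d_out: float, g_out: int, threshold: float = 0.5) -> int:
    """Combine discriminator D and rejection model g into one label:
    +1 for latent demonstrations H, 0 for observed demonstrations O,
    -1 for non-expert data N.

    D > threshold marks expert-like data (O or H); g then separates the
    rejected part H (g = 1) from the accepted part O (g = 0).
    """
    if d_out > threshold:
        return 1 if g_out == 1 else 0
    return -1
```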

Figure 5: t-SNE visualizations of expert demonstrations and collected data of π 1 under O E .

Figure 6: The learning curves of each method. Shaded regions indicate the standard deviation.

Input: Expert demonstrations T_{π_E}; evolving data T_{π_1}; discriminator D_{w_1}; rejection model g_1.
Output: Target policy π_2.
1: function IWRE.TRAINING(T_{π_E}, T_{π_1}, D_{w_1}, g_1)
2:

Figure 8: The performance of RL methods under the division of O_E and O_L in MuJoCo. The agent obtains comparable performance under O_E and O_L, which ensures the fairness of the HOIL experiments in the main paper.

Figure 9: The accuracies and ratios of H, O, and N calculated from I[D_{w_1}(x̄_i)]g_1(x̄_i) and I[D_{w_2}(x̃_i)]g_2(x̃_i) during policy learning.

Figure 11: The final rewards of each method on MuJoCo with different budget ratios, where the shaded regions indicate the standard deviation. The red horizontal dotted line represents the averaged performance of the expert.

Figure 12: The learning curves of each method in MuJoCo environments with different numbers of expert trajectories, where the shaded region indicates the standard deviation.

So it is desirable to filter out H from O and N. Meanwhile, D_{w_1} and D_{w_2} can only separate O ∪ H from N, under O_E and O_L respectively. Therefore, we design two rejection models g_1: O_E × A → {0, 1} and g_2: O_L × A → {0, 1} (output 0: x ∈ O; output 1: otherwise), so that given x ∼ T (with corresponding x̄ ∼ T̄ and x̃ ∼ T̃), they satisfy

The notations of the main paper:

T̄_{π_E}: trajectory sampled by π_E under O_E (demonstrations)
T̄_{π_1}: trajectory sampled by π_1 under O_E
T̄_{π_2}: trajectory sampled by π_2 under O_E
T̃_{π_1}: trajectory sampled by π_1 under O_L
T̃_{π_2}: trajectory sampled by π_2 under O_L
x: an instance of state-action pair
x̄: an instance of observation-action pair under O_E
x̃: an instance of observation-action pair under O_L
ρ_{π_E}: occupancy measure of the expert policy π_E
ρ_{π_1}: occupancy measure of the auxiliary policy π_1
ρ_{π_2}: occupancy measure of the target policy π_2
D_{w_1}: adversarial model on T̄_{π_E} and T̄_{π_1}
D_{w_2}: adversarial model on T̃_{π_1} and T̃_{π_2}
α

Input: Auxiliary policy π_1; expert demonstrations T̄_{π_E}.
Output: Evolving data {T̄_{π_1}, T̃_{π_1}}; discriminator D_{w_1}; rejection model g_1.

Table 3: Environmental summary of the tasks.

Table 4: Comparisons between all contenders and IWRE in HOIL.

Consequently, few IL studies have reported desirable results on this task.

2. Continuous-control MuJoCo tasks. O_E: half of the original observation features; O_L: the other half of the original observation features. Expert: converged DDPG-based agents (Lillicrap et al., 2016). The MuJoCo observation features contain uniform low-level information such as the direction, position, and velocity of an object. Here we investigate whether the agent can learn from demonstrations with complementary signals under observations with missing information. Meanwhile, we make sure RL algorithms can obtain comparable performance under O_E and O_L.
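The random split of the MuJoCo observation features into O_E and O_L can be sketched as follows; the helper name and seeding are hypothetical, introduced only to illustrate the division described above:

```python
import numpy as np

def split_observation_space(obs_dim: int, seed: int = 0):
    """Randomly split observation indexes into two disjoint halves,
    one half for O_E and the other half for O_L."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(obs_dim)
    half = obs_dim // 2
    return np.sort(perm[:half]), np.sort(perm[half:])
```

For example, an 11-dimensional Hopper-like observation would be split into a 5-index O_E set and a 6-index O_L set that together cover every feature exactly once.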

E RL PERFORMANCE UNDER THE DIVISIONS OF MUJOCO

Here we report the performance under the division of O_E and O_L in MuJoCo. The details of the division are reported in Table 5. We use DDPG-based (Lillicrap et al., 2016) agents with 10^7 training steps and repeat 10 times with different random seeds. The results are shown in Figure 8. The agent obtains comparable performance under O_E and O_L, so for the MuJoCo environments the fairness of the division in HOIL can be guaranteed: O_E is neither more nor less privileged than O_L.

Table 5: The observation division into O_E and O_L in MuJoCo. The numbers denote the randomly selected observation indexes in the corresponding MuJoCo environment on the OpenAI Gym (Brockman et al., 2016) platform.

F ESTIMATION OF H, O, AND N BY I[D_{w_2}]g_2

To investigate the ability of IWRE to distinguish the areas of latent demonstrations H, observed demonstrations O, and non-expert data N during policy learning, we recorded the accuracy and estimated ratio of each part on Hopper and Walker2d. Each curve is calculated as follows:

ACKNOWLEDGMENT

This research was supported by: National Key Research and Development Program of China (2020AAA0109401); National Science Foundation of China (62176117, 61921006, 62206245); the Institute for AI and Beyond, UTokyo; JST SPRING, Grant Number JPMJSP2108. The authors would like to thank Guoqing Liu, Yoshihiro Nagano, and the anonymous reviewers for their insightful comments and suggestions.


full videos of these four policies on ChopperCommand at this anonymous link: https://www.dropbox.com/sh/fz7xpbt4y8t3umz/AACw5cMq5eG8sCl04x9_OOnha?dl=0.

Figure 10: The sequence comparisons between a random policy, the IWRE policy with 1e5/2e7 steps, and the expert policy. The agent needs to manipulate a helicopter (marked by the black circle) to avoid attacks from enemy aircraft while shooting them down. The timestamp for each column of images and the seed for each environment are the same.

H QUERY EFFICIENCY

We also investigate whether our query strategy is efficient. To this end, we allocate a query budget, i.e., we limit the query ratio for each method. TPIL preferentially queries the data with low D_{w_φ} output; our method IWRE preferentially queries the data with high ⟨D_{w_2}, g_2⟩ output. Besides, since GAIL and IW cannot directly perform queries, we design a random-selection strategy for them, denoted GAIL-Rand and IW-Rand: for each batch of data, we randomly select samples and input their O_E observations to D_{w_1}. If D_{w_1}(x̄) > 0.5, which means D_{w_1} regards this sample as belonging to the expert demonstrations, we label it as expert data to update D_{w_2}. The results are depicted in Figure 11.

We can observe that the random strategy does not always improve the performance of GAIL and IW. For GAIL-Rand, without importance weighting to calibrate the learning of the reward function, the performance becomes even worse on Hopper, Swimmer, and Walker2d, because the queried information enhances the discrimination ability of the reward function, making it even harder for the agent to obtain effective feedback. IW-Rand performs better than GAIL-Rand on most environments and improves on Hopper, Reacher, and Walker2d, which further demonstrates that the query operation is indeed necessary for the HOIL problem, but it still falls short of our method. TPIL is comparable with IW-Rand; however, its performance improvement is very limited as the budget increases, and on Swimmer and Walker2d there are even performance degradations, which suggests that its query strategy is very unstable. GAMA has a good starting point, but its performance gain is very limited as the budget increases. Without queries, the performance of our method is almost the same as that of IW-Rand on most environments.
Once it is allowed to query O_E observations, our method outperforms the other methods by a large margin, which shows that its query strategy is indeed more efficient.
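Budget-limited preferential querying, as used in this comparison, can be sketched as ranking a batch by a priority score and keeping only the top fraction; the score here stands in for D_{w_2}·g_2 (IWRE) or 1 − D output (TPIL), and the function name is illustrative:

```python
import numpy as np

def select_queries(scores, budget_ratio: float):
    """Select which samples of a batch to query under a budget.

    scores: per-sample query priority (higher means query first).
    budget_ratio: fraction of the batch that may be queried.
    Returns the indexes of the highest-priority samples.
    """
    scores = np.asarray(scores, dtype=float)
    k = int(len(scores) * budget_ratio)
    if k == 0:
        return np.array([], dtype=int)
    # indexes of the k highest-priority samples
    return np.argsort(scores)[::-1][:k]
```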

I IMITATION WITH DIFFERENT NUMBER OF EXPERT TRAJECTORIES

The performance of all contenders with different numbers of expert trajectories is reported in Figure 12. Each experiment is conducted for 5 trials with different random seeds. We can observe that even with a very limited number of trajectories, our algorithm achieves better performance than the other algorithms in most environments.

