GOAL-DRIVEN IMITATION LEARNING FROM OBSERVATION BY INFERRING GOAL PROXIMITY

Anonymous

Abstract

Humans can effectively learn to estimate how close they are to completing a desired task simply by watching others fulfill the task. To solve the task, they can then take actions towards states with higher estimated proximity to the goal. From this intuition, we propose a simple yet effective method for imitation learning that learns a goal proximity function from expert demonstrations and online agent experience, and then uses the learned proximity to provide a dense reward signal for training a policy to solve the task. By predicting task progress as the temporal distance to the goal, the goal proximity function improves generalization to unseen states over methods that aim to directly imitate expert behaviors. We demonstrate that our proposed method efficiently learns a set of goal-driven tasks from state-only demonstrations in navigation, robotic arm manipulation, and locomotion tasks.

1. INTRODUCTION

Humans are capable of effectively leveraging demonstrations from experts to solve a variety of tasks. Specifically, by watching others perform a task, we can learn to infer how close we are to completing the task, and then take actions towards states closer to the goal of the task. For example, after watching a few tutorial videos for chair assembly, we learn to infer how close an intermediate configuration of a chair is to completion. With the guidance of such a task progress estimate, we can efficiently learn to assemble the chair, progressively getting closer to, and eventually reaching, the fully assembled chair.

Can machines likewise first learn an estimate of progress towards a goal from demonstrations and then use this estimate as guidance to move closer to and eventually reach the goal? Typical learning from demonstration (LfD) approaches (Pomerleau, 1989; Pathak et al., 2018; Finn et al., 2016) greedily imitate the expert policy and therefore suffer from accumulated errors that cause a drift away from states seen in the demonstrations. On the other hand, adversarial imitation learning approaches (Ho & Ermon, 2016; Fu et al., 2018) encourage the agent to imitate expert trajectories with a learned reward that distinguishes agent and expert behaviors. However, such adversarially learned reward functions often overfit to the expert demonstrations and do not generalize to states not covered in the demonstrations (Zolna et al., 2019), leading to unsuccessful policy learning.

Inspired by how humans leverage demonstrations to measure progress and complete tasks, we devise an imitation learning from observation (LfO) method that learns a task progress estimator and uses the task progress estimate as a dense reward signal for training a policy, as illustrated in Figure 1. To measure the progress of a goal-driven task, we define goal proximity as an estimate of temporal distance to the goal (i.e., the number of actions required to reach the goal).
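The definition of goal proximity as a temporal distance can be made concrete with a small labeling routine. The following is a minimal sketch, not the exact formulation of the method: it assumes a linear form in which the goal state has proximity 1.0 and each remaining action subtracts a fixed amount. The function name `proximity_labels` and the parameter `step` are hypothetical and introduced only for illustration; the value 0.1 is chosen to mirror the 1.0, 0.9, 0.8, ... labels shown in Figure 1.

```python
from typing import List


def proximity_labels(num_states: int, step: float = 0.1) -> List[float]:
    """Label each state of one expert trajectory with a goal proximity.

    The final state (the goal) gets 1.0. Each earlier state loses `step`
    per remaining action, floored at 0.0. The linear form and the value
    of `step` are illustrative assumptions, not taken from the paper.
    """
    if num_states < 1:
        raise ValueError("a trajectory needs at least one state")
    last = num_states - 1  # index of the goal state
    return [max(0.0, 1.0 - step * (last - t)) for t in range(num_states)]


# A five-state demonstration: labels rise towards 1.0 at the goal.
labels = proximity_labels(5)
print(labels)
```

Labels of this kind would serve as regression targets for the proximity function on expert states; how agent states are labeled is outside the scope of this sketch.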
In contrast to prior adversarial imitation learning algorithms, by having additional supervision of task progress and learning to predict it, the goal proximity function can acquire more structured task-relevant information, and hence generalize better to unseen states and provide better reward signals. However, the goal proximity function can still output inaccurate predictions on states not in demonstrations, which results in unstable policy training. To improve the accuracy of the goal proximity function, we continually update the proximity function with trajectories from both the expert and the agent. In addition, we penalize trajectories with the uncertainty of the goal proximity prediction, which prevents the policy from exploiting high proximity estimates with high uncertainty. As a result, by leveraging the agent experience and predicting the proximity function uncertainty, our method can achieve more efficient and stable policy learning.

[Figure 1 diagram. Panel "Learning Proximity Function": expert demonstrations (Demo 1, Demo 2, ..., Demo N) whose observations are labeled with proximity to the goal, e.g., 0.6, 0.7, 0.8, 0.9, 1.0 (Goal). Panel "Learning Policy".]

Figure 1: In goal-driven tasks, states on an expert trajectory have gradually increasing proximity toward the goal as the expert proceeds and fulfills a task. Inspired by this intuition, we propose to learn a proximity function f_φ from expert demonstrations and agent experience, which provides an estimate of temporal distance to the goal of a task. Then, using this learned proximity function, we train a policy π_θ to progressively move to states with higher proximity and eventually reach the goal to solve the task. We alternate these two learning phases to improve both the proximity function and the policy, leading to not only better learning efficiency but also superior performance.

The main contributions of this paper include (1) an algorithm for imitation from observation that uses estimated goal proximity to inform an agent of the task progress; (2) modeling uncertainty of goal proximity estimation to prevent exploiting uncertain predictions; and (3) a joint training algorithm of the goal proximity function and policy. We show that the policy learned with our proposed goal proximity function is more effective and generalizes better than the state-of-the-art LfO algorithms on various domains, such as navigation, robot manipulation, and locomotion. Moreover, our method achieves results comparable to GAIL (Ho & Ermon, 2016), which learns from expert actions.
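To illustrate how a proximity estimate and its uncertainty could combine into a dense reward, the sketch below computes the proximity difference between consecutive states (the "Proximity Reward" label in Figure 1) and subtracts an uncertainty penalty. Several parts are assumptions made for illustration only: uncertainty is measured as the disagreement (standard deviation) of a small ensemble of proximity functions, the coefficient `penalty` is hypothetical, and the toy models stand in for learned networks. This section does not specify how the uncertainty is actually estimated.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
ProximityFn = Callable[[State], float]


def ensemble_stats(models: List[ProximityFn], state: State) -> Tuple[float, float]:
    """Return the mean prediction and the disagreement (std) of the ensemble."""
    preds = [model(state) for model in models]
    return mean(preds), pstdev(preds)


def proximity_reward(models: List[ProximityFn], s_t: State, s_next: State,
                     penalty: float = 1.0) -> float:
    """Reward progress towards the goal, minus an uncertainty penalty.

    reward = f(s_next) - f(s_t) - penalty * uncertainty(s_next)
    Ensemble disagreement as the uncertainty measure is an assumption.
    """
    f_t, _ = ensemble_stats(models, s_t)
    f_next, u_next = ensemble_stats(models, s_next)
    return (f_next - f_t) - penalty * u_next


# Toy stand-ins for learned proximity functions on a one-dimensional state.
agreeing = [lambda s: s[0], lambda s: s[0]]
disagreeing = [lambda s: s[0], lambda s: 2.0 * s[0]]

r_confident = proximity_reward(agreeing, [0.25], [0.5])
r_uncertain = proximity_reward(disagreeing, [0.25], [0.5])
print(r_confident, r_uncertain)
```

When the ensemble members agree, the reward reduces to the plain proximity difference. When they disagree, the penalty lowers the reward, which discourages the policy from moving towards states where a high proximity estimate is unreliable.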

2. RELATED WORK

Imitation learning (Schaal, 1997) aims to leverage expert demonstrations to acquire skills. Although behavioral cloning (Pomerleau, 1989) is simple and effective when a large number of demonstrations is available, it suffers from compounding errors caused by distributional drift (Ross et al., 2011). On the other hand, inverse reinforcement learning (Ng & Russell, 2000; Abbeel & Ng, 2004; Ziebart et al., 2008) estimates the underlying reward from demonstrations and learns a policy through reinforcement learning with this reward, which handles compounding errors better. Specifically, generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016) and its variants (Fu et al., 2018; Kostrikov et al., 2020) show improved demonstration efficiency by training a discriminator to distinguish expert and agent transitions and using the discriminator output as a reward for policy training.

While most imitation learning algorithms require expert actions, imitation learning from observation (LfO) approaches learn from state-only demonstrations. This enables LfO methods to learn from diverse sources of demonstrations, such as human videos, demonstrations with different controllers, and other robots. To imitate demonstrations without expert actions, inverse dynamics models (Niekum et al., 2015; Torabi et al., 2018a; Pathak et al., 2018) or learned reward functions (Edwards et al., 2016; Sermanet et al., 2017; 2018; Liu et al., 2018; Lee et al., 2019a) can be used to train the policy. However, these methods require large amounts of data to train inverse dynamics models or representations. On the other hand, state-only adversarial imitation learning (Torabi et al., 2018b; Yang et al., 2019) can imitate an expert with few demonstrations, similar to GAIL. In addition to discriminating expert and agent trajectories, our method also estimates the proximity to the goal, which can provide more informed reward signals and generalize better.

Closely related to our approach are reinforcement learning algorithms that learn a value function or proximity estimator from successful trajectories and use it as an auxiliary reward (Mataric, 1994; Edwards & Isbell, 2019; Lee et al., 2019b). Although these value functions and proximity estimators are similar to our proposed goal proximity function, these works require environment reward signals and do not incorporate adversarial online training or uncertainty estimates. Moreover, demonstrating the value of learning a proximity estimate for imitation learning, Angelov et al. (2020) utilize the learned proximity to choose a proper sub-policy but do not train a policy



[Figure 1 diagram, remaining labels: "Alternate Training"; "Proximity Reward = f_φ(s_{t+1}) - f_φ(s_t)"; "Agent Experience under Policy π_θ"; action a.]

