DON'T THROW YOUR OLD POLICIES AWAY: KNOWLEDGE-BASED POLICY RECYCLING PROTECTS AGAINST ADVERSARIAL ATTACKS

Abstract

Recent work has shown that Deep Reinforcement Learning (DRL) is vulnerable to adversarial attacks, in which minor perturbations of input signals cause agents to behave inappropriately and unexpectedly. Humans, on the other hand, appear robust to these particular sorts of input variations. We posit that part of this robustness stems from accumulated knowledge about the world. In this work, we propose to leverage prior knowledge to defend against adversarial attacks in RL settings using a framework we call Knowledge-based Policy Recycling (KPR). Different from previous defense methods such as adversarial training and robust learning, KPR incorporates domain knowledge over a set of auxiliary task policies and learns relations among them from interactions with the environment via a Graph Neural Network (GNN). KPR can use any relevant policy as an auxiliary policy and, importantly, does not assume access to, or information about, the adversarial attack. Empirically, KPR results in policies that are more robust to various adversarial attacks in Atari games and a simulated Robot Foodcourt environment.

1. INTRODUCTION

Despite significant performance breakthroughs in recent years (e.g., Mnih et al., 2015; Silver et al., 2016; Berner et al., 2019), Deep Reinforcement Learning (DRL) policies can be brittle. Specifically, recent works have shown that DRL policies are vulnerable to adversarial attacks: adversarially manipulated inputs (e.g., images) of small magnitude can cause RL agents to take incorrect actions (Ilahi et al., 2022; Chen et al., 2019; Behzadan & Munir, 2017; Oikarinen et al., 2021; Lee et al., 2021; Chan et al., 2020; Bai et al., 2018). To counter such attacks, recent work has proposed a range of defense strategies including adversarial training (Oikarinen et al., 2021; Behzadan & Munir, 2018; Han et al., 2018), robust learning (Mandlekar et al., 2017; Smirnova et al., 2019; Pan et al., 2019), defensive distillation (Rusu et al., 2016), and adversarial detection (Gallego et al., 2019a; Havens et al., 2018). While these defense methods can be effective, each has its limitations: adversarial training and adversarial detection require specific knowledge about the attacker; robust learning adds noise during agent training, which can degrade performance (Tsipras et al., 2019; Yang et al., 2020); and defensive distillation is typically unable to protect against diverse adversarial attacks (Carlini & Wagner, 2016; Soll et al., 2019). In this work, we explore an alternative defense strategy that exploits existing knowledge encoded in auxiliary task policies and known relationships between those policies. The key intuition underlying our approach is that existing task policies encode learnt low-level knowledge about the environment (e.g., possible observations, dynamics), whilst high-level specifications can provide guidance for transfer or generalization.
Our approach is to leverage known and learnt relations between different policies as structural priors for an ensemble of policies; our hypothesis is that while a single task policy can be attacked, perturbing inputs such that multiple policies are negatively affected in a consistent manner is more difficult. Our framework, which we call Knowledge-based Policy Recycling (KPR), is partially inspired by the use of domain knowledge to address vulnerabilities to adversarial attacks in supervised learning (Melacci et al., 2021; Gürel et al., 2021; Zhang et al., 2022). In these works, domain knowledge is encoded as logical formulae over predicted labels and a set of features. A soft satisfiability score between the predictions and given logic formulae is added to the objective to encourage the predictions to comply with the logical formulae. Even if part of the sample is corrupted by the adversary, the final output can be corrected by enforcing the domain knowledge rules. KPR extends this line of research to RL settings. Note that this extension is nontrivial; the auxiliary feature detectors used in supervised learning (Gürel et al., 2021; Zhang et al., 2022) do not capture temporal features, and, more importantly, the consistency between the predictions and actions is not directly computable since the optimal actions for each state are unknown.

We address these issues by using auxiliary task policies and their relationships. In practice, these policies could be the intermediate by-products of curriculum/hierarchical learning or obtained via direct training with sub-goals. By combining these policies with specified and learnt relations, we construct an ensemble of policies on the target task. KPR uses graph neural networks (GNNs) as a backbone, which enables natural incorporation of graph-based domain knowledge while retaining the flexibility to learn from interaction data. The ensemble is then fused in a simple parameter-less manner to obtain a new robust task policy (Figure 1). From a practical perspective, KPR has a key advantage: it is both policy and attack agnostic. Specifically, KPR can utilize any type of policy representation (e.g., neural networks, rule-based policies) as either the main task policy or an auxiliary task policy. In addition, KPR does not require knowledge of the specific attack or access to the adversarial environment. Our empirical results show that KPR results in more robust policies across multiple attacks compared to baselines in a representative selection of Atari games (Bellemare et al., 2013) and a high-dimensional Robot Food Court environment (RoFoCo).
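To illustrate the intuition behind fusing an ensemble of policies in a parameter-less manner, the following minimal sketch (our own simplification, not the paper's implementation; KPR fuses GNN-refined outputs, and the function name here is hypothetical) averages the action distributions proposed by the main and auxiliary policies:

```python
import numpy as np

def fuse_policies(action_probs, weights=None):
    """Parameter-less fusion: average the action distributions
    proposed by the main and auxiliary policies, then renormalize."""
    probs = np.asarray(action_probs, dtype=float)  # shape: (n_policies, n_actions)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    fused = weights @ probs
    return fused / fused.sum()

# Three policies vote over four actions; an attack that flips one
# policy's preference leaves the fused choice intact.
main  = [0.05, 0.05, 0.05, 0.85]   # attacked: now prefers a bad action
aux_1 = [0.70, 0.10, 0.10, 0.10]
aux_2 = [0.65, 0.15, 0.10, 0.10]
fused = fuse_policies([main, aux_1, aux_2])
print(fused.argmax())  # the ensemble still selects action 0
```

This captures the hypothesis stated above: perturbing inputs so that several policies fail consistently is harder than fooling a single policy.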
To summarize, this paper contributes Knowledge-based Policy Recycling (KPR), which leverages domain knowledge to defend against adversarial attacks in RL. Different from prior defense methods in reinforcement learning, such as adversarial training and robust learning, KPR incorporates domain knowledge as a structural prior and then learns flexible relations from interaction data. To the best of our knowledge, this is the first work to demonstrate that domain knowledge in the form of policies can be used to defend against adversarial attacks.

2. BACKGROUND AND RELATED WORKS

Similar to supervised learning, recent work has shown that DRL is also susceptible to adversarial attacks (Ilahi et al., 2022; Chen et al., 2019; Behzadan & Munir, 2017; Oikarinen et al., 2021). This area has drawn significant attention of late, and the following provides a brief overview; we refer readers desiring more detail to (Ilahi et al., 2022). Broadly speaking, there are two basic attack types depending on the assumed adversary: (i) White-box attacks (Goodfellow et al., 2015; Carlini & Wagner, 2017; Madry et al., 2018; Schwinn et al., 2021), where the adversary has perfect knowledge of the target model (the victim policy in the RL context), and (ii) Black-box attacks (Andriushchenko et al., 2020; Pomponi et al., 2022), where the adversary knows neither the model nor any of its attributes. Similar to previous works (Huang et al., 2017; Behzadan & Munir, 2017; Pattanaik et al., 2018; Zhang et al., 2020; Sun et al., 2022), we assume that the attacker cannot change the environment directly but perturbs the state observations returned by the environment before the agent observes them. Existing defense methods can be categorized into: (i) Adversarial training (Oikarinen et al., 2021; Behzadan & Munir, 2018; Han et al., 2018), where the RL agent is exposed to the adversarial environment during training; (ii) Robust learning (Mandlekar et al., 2017; Smirnova et al., 2019; Pan et al., 2019), a training mechanism that ensures robustness against training-time adversarial attacks.
A common approach is to add noise to the parameters during training; (iii) Adversarial detection (Gallego et al., 2019a; Havens et al., 2018) trains a separate model to detect adversarial inputs, and contaminated inputs (e.g., image frames) are replaced with predicted/generated versions; (iv) Policy distillation (Czarnecki et al., 2019) focuses on transferring knowledge from one or multiple policies to the target policy in a student-teacher framework. KPR is different from the above strategies. Unlike adversarial training and variants of adversarial detection (Gallego et al., 2019b; Lin et al., 2017), we do not assume knowledge of the attack. In contrast to robust learning, KPR does not inject noise during training. Although both policy distillation and KPR can fuse multiple policies, the methodology and application are different. Policy distillation does not leverage the relations between input policies and thus does not explicitly encourage structural consistency, which limits its defense performance, as found in (Carlini & Wagner, 2016; Soll et al., 2019) and in our experiments (Section 4). KPR can be seen as a policy ensemble (Wiering & van Hasselt, 2008), albeit one that leverages prior knowledge in its construction. Prior works have suggested that ensemble methods offer some protection against adversarial attacks in supervised settings (Wiering & van Hasselt, 2008). However, to our knowledge, policy ensembles have yet to be used as a defense strategy in RL.

3. KNOWLEDGE-BASED POLICY RECYCLING (KPR)

As in standard RL, we consider a discounted discrete-time Markov Decision Process (MDP) (S, A, R, T, γ, d_0), where S is a set of states, A is a set of discrete actions, R : S × A → ℝ is the reward function, T : S × A → P(S) is the transition function, γ ∈ [0, 1] is the discount factor, and d_0 ∈ P(S) is the distribution over the initial state. At time step t, the agent takes action a_t in state s_t, receives reward R(s_t, a_t), and transitions to s_{t+1} ∼ T(s_t, a_t). The agent's objective is to learn a policy π : S → P(A) that maximizes the expected cumulative reward,

max_θ J(π_θ) = E_{s_0 ∼ d_0, s_{t+1} ∼ T(s_t, π_θ(s_t))} [ Σ_{t=0}^∞ γ^t R(s_t, a_t) ]. (1)

We consider a setting where the policy is trained in a benign environment and then deployed to a test environment that may be adversarial. Our goal is to retain expected cumulative rewards in the presence of a white-box attacker that aims to alter our agent's actions by perturbing observations s′_t = s_t + δ_t, s.t. δ_t ∈ Δ, where Δ is a perturbation set, e.g., an ℓ_2 ball of radius ϵ around s_t, i.e., ℓ_2(δ_t) ≤ ϵ, ∀t ≥ 0. We assume no access to the attacker's perturbation set nor to how it optimizes. In this information-impoverished setting, our defensive options are relatively limited. One approach is to sample possible attackers (with various perturbation abilities) and train a policy to be robust against these attacks. However, this approach is computationally expensive. We posit that the susceptibility of a policy to an attacker is due principally to overfitting on a specific task. To alleviate this issue, we propose to leverage prior knowledge, comprising auxiliary tasks and relations between them, that can enable better robustness to noisy or perturbed observations. Let us define our main task as T_M (which is modeled by an MDP as above) and an associated main task policy π_{T_M} that maximizes returns. We assume we possess a main task policy trained in the benign environment.
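To make the objective and the threat model concrete, the discounted return and the ℓ_2 perturbation-budget check can be computed as in the following generic sketch (ours, for illustration; not code from the paper):

```python
# Illustrative sketch (ours): the discounted return sum_t gamma^t * r_t
# that the policy maximizes, and the l2 budget check defining the
# attacker's feasible perturbation set Delta.
import math

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def within_l2_budget(delta, eps):
    """True if the perturbation delta lies in an l2 ball of radius eps."""
    return math.sqrt(sum(d * d for d in delta)) <= eps

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```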
We also have access to n auxiliary task policies Π_aux = {π_{T_1}, π_{T_2}, ..., π_{T_n}} for tasks Υ_aux = {T_1, T_2, ..., T_n}. Each task is modeled by an MDP tuple with the same elements as T_M but with a different reward function. We define the relation set I = {K, f(·)}, comprising explicitly specified relations K between the task policies (e.g., from a domain expert) and learnt knowledge represented implicitly by the function f(·). While prior knowledge usually enters training as a regularization term in the loss, we incorporate this knowledge directly into the policy. We seek to obtain an augmented policy π_{T_M} that maximizes returns and is conditioned upon Π_aux and I. The challenge is to ensure that this augmented policy has sufficient capacity to perform well on the task yet be robust to attacks. Our approach is an ensemble method: we use I and Π_aux to construct a new set of policies Π_{T_M} for the main task T_M (Fig. 1). We then combine these policies using a simple parameter-less fusion mechanism,

π_{T_M}(s) = p(a | s, Π_aux, I) = p(a | Π_{T_M}, s). (2)

In the following subsections, we provide details on the above: we first discuss how prior domain knowledge regarding existing tasks can be specified. Next, we detail a Graph Neural Network (GNN) pooling network that processes knowledge and state information to learn relations. Finally, we discuss the simple voting mechanism used to derive the final policy action distribution.

3.1. SPECIFYING RELATIONS VIA LOGICAL GRAPHS

In our setting, domain knowledge comprises (i) the given set of auxiliary task policies Π_aux, which could be any type of policy (including non-differentiable or rule-based policies), and (ii) the logical relations among them. We focus on propositional logic, which is well-defined and unambiguous compared to natural language, yet relatively easy for humans to derive and interpret. A proposition p is a statement which is either True or False. A formula F is a compound of propositions connected by logical connectives, e.g., ¬, ∧, ∨, ⟹. A logical formula (and corresponding truth assignments) can be represented as an undirected graph G = (V, E) with nodes v_i ∈ V and edges (v_i, v_j) ∈ E. Individual nodes are either propositions (leaf nodes) or logical operators (¬, ∧, ∨), where subjects and objects are connected to their respective operators. In our work, leaf nodes are auxiliary tasks, which are True if the task is successful and False otherwise; success of a task could be defined as whether the accumulated return attains a certain threshold. For example, the logical relation formula T_M ⟹ T_1 ∧ T_2 expresses that "if the main task is successful, then both auxiliary tasks T_1 and T_2 are successful". The features of the leaf nodes are the predicted action distributions conditioned on the state s_t; the features of the logical operators are fixed to randomly generated vectors. To incorporate state information, we add a state feature node to every logical relation graph and connect it to every other node. As state features, we use the latent vector z_s obtained from a simple VAE (Kingma & Welling, 2014) trained to reconstruct the state s. The set of logical relations forms the set K = {k_i(Υ_aux)}_{i=1}^m defined over the auxiliary tasks Υ_aux, where each k_i is represented as a graph G_{k_i}.
Note that the logical relations need not be complete nor error-free; KPR can tolerate a degree of misspecification, as relations are also learnt from interaction data.
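As an illustration, the relation T_M ⟹ T_1 ∧ T_2 can be turned into an undirected graph with task leaf nodes, operator nodes, and a fully connected state node. The sketch below is ours (node names are illustrative), not the paper's implementation:

```python
# Illustrative sketch (ours): the undirected graph for the relation
# T_M => (T_1 AND T_2), plus a state node connected to every other node,
# as described in Section 3.1.

def build_relation_graph():
    # Leaf nodes are tasks; internal nodes are logical operators.
    nodes = ["T_M", "T_1", "T_2", "AND", "IMPLIES", "STATE"]
    edges = [
        ("T_1", "AND"), ("T_2", "AND"),          # operands feed their operator
        ("T_M", "IMPLIES"), ("AND", "IMPLIES"),  # antecedent and consequent
    ]
    # The state feature node connects to every other node.
    edges += [("STATE", v) for v in nodes if v != "STATE"]
    # Make the edge set symmetric (undirected graph).
    edges += [(v, u) for (u, v) in edges]
    return nodes, edges

nodes, edges = build_relation_graph()
```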

3.2. LEARNING RELATIONS FROM INTERACTION DATA

Graph Neural Network Pooling Function. To utilize the structural information contained in the logical relation graphs and enable learning from interaction data, we adopt Graph Neural Networks (GNNs) (Fey & Lenssen, 2019). GNNs pass and aggregate messages from neighbors to encode the nodes in the graph with learnable weights. To obtain a single action distribution from each logical relation graph, we add a graph pooling layer that maps the entire graph into a compact representation. We adopt Graph Multiset Pooling (Baek et al., 2021), which satisfies permutation invariance, as our GNN pooling function f(·). The accumulated knowledge I consists of both the logical relation graphs and the learnt weights of the GNN pooling function f(·). The Graph Multiset Pooling network f(·) consists of a two-layer message-passing module g(·), a graph multi-head attention pooling module h_k(·) that condenses all nodes to k representative nodes, a multi-head self-attention module q_s(·), and a second multi-head attention pooling module h_1(·) that condenses the entire graph to one vector. Put together, f(G_{k_i}) = h_1(q_s(h_k(g(G_{k_i})))), where G_{k_i} is the input logical graph constructed from the logical relation formula k_i. The message-passing function updates node representations by aggregating messages from neighbors. In particular, the message-passing function we use is

x_v^(l) = Σ_{u ∈ N(v) ∪ {v}} (1 / √(D(v) D(u))) W^T x_u^(l-1),

where x_v^(l) is the representation of node v at the l-th step, N(v) denotes the set of neighboring nodes of v, W is the weight matrix, and D(v), D(u) denote the degrees of nodes v and u respectively. An attention function q maps a query and a set of key-value pairs to an output, where the query Q, keys K, values V, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
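The normalized aggregation above is the standard GCN propagation rule; a minimal dense-matrix sketch (ours, in numpy rather than the PyTorch Geometric stack the paper builds on) is:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One message-passing step: x_v = sum over u in N(v) U {v} of
    W^T x_u / sqrt(D(v) D(u)).

    adj:    (n, n) symmetric 0/1 adjacency matrix (no self-loops).
    feats:  (n, d_in) node features x^(l-1).
    weight: (d_in, d_out) weight matrix W.
    """
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops: N(v) U {v}
    deg = a_hat.sum(axis=1)                   # degrees D(v)
    norm = 1.0 / np.sqrt(np.outer(deg, deg))  # 1 / sqrt(D(v) D(u))
    return (a_hat * norm) @ feats @ weight

# Two connected nodes with identity features and weights:
out = gcn_layer(np.array([[0., 1.], [1., 0.]]), np.eye(2), np.eye(2))
```

Each output row mixes a node's own features with its neighbors', weighted by the degree normalization in the equation above.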
q(Q, K, V) = Softmax(QK^T / √d_k) V.

The multi-head attention function q_multi simply applies multiple attention functions to linearly projected query, key, and value matrices. To explicitly use the graph structure in multi-head attention, the keys and values are generated using one layer of the message-passing function instead of a conventional linear projection:

h_k(g(G_{k_i})) = I(q_multi(Z, g_K(g(G_{k_i})), g_V(g(G_{k_i})))),

where I(·) denotes the residual connection function, q_multi(·) is the multi-head attention function, Z is a parameterized seed matrix optimized end-to-end that acts as the query matrix, and g_K, g_V denote the message-passing functions for keys and values respectively.

Model Training. Each f(G_{k_i}) can be interpreted as a policy in an ensemble; f takes a logical formula graph as input and outputs an action distribution, i.e., p_{k_i}(a|s) = f(G_{k_i}). The node feature for an auxiliary task node is the respective predicted action distribution together with the auxiliary task identifier. The node features for logical operators are fixed randomly generated vectors that share the same dimension as the auxiliary task features. To learn the relationships between tasks, we train each ensemble policy to predict the action with the highest probability under the main task policy. More precisely, we train the graph neural network pooling function f with the following loss:

L_f = E_{τ ∼ p(τ | π_{T_M})} [ Σ_{t=0}^T Σ_{i=0}^m CrossEntropy( π_{T_M}(s_t), f(k_i, Π_aux(s_t), z_{s_t}) ) ],

where τ = {(s_t, a_t, R(s_t, a_t))} is a trajectory sampled using the main task policy π_{T_M} and z_{s_t} are the state features. Note that KPR does not need access to the perturbed environment during training.
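The scaled dot-product attention used above can be sketched as follows (a generic numpy illustration of q(Q, K, V), not the paper's multi-head implementation):

```python
import numpy as np

def attention(Q, K, V):
    """q(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# A zero query attends uniformly, so the output is the mean of the values:
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = attention(np.zeros((1, 2)), K, V)  # -> [[2.0, 3.0]]
```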

3.3. POLICY VIA FUSION OF ACTION DISTRIBUTIONS

At this stage, we have m action distributions p_{k_i}(a|s), one for each logical relation k_i. To obtain a final action distribution, we need to combine or fuse these action distributions. We adopt a very simple voting-like mechanism that requires no training. The fusion is performed via two main steps: (i) task policy filtering and (ii) action counting. We first select the top-3 action distributions by choosing the most "confident" models as measured by negative entropy. Intuitively, this filters away irrelevant action distributions given the current state. Next, we form a new action distribution by vote counting. Each of the three remaining policies casts a positive vote for its top-scoring action (largest p(a_j|s)) and a negative vote for its lowest-scoring action (smallest p(a_j|s)). For each action a_j, we tally the number of positive votes o_j^+ and negative votes o_j^-, and construct a new action distribution for the main task T_M:

π_{T_M}(a_j | s) = p(a_j) = (o_j^+ - o_j^-) / |A|.
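The two-step fusion (entropy-based filtering, then ±1 vote counting) can be sketched as below; this is our illustrative reading of the mechanism, with the final action taken as the argmax of the (possibly negative, unnormalized) per-action scores:

```python
import numpy as np

def fuse_action_distributions(dists, top_k=3):
    """Fuse ensemble action distributions following the two steps in the text.

    dists: list of 1-D arrays, each a policy's action distribution p_{k_i}(a|s).
    Returns the per-action scores (o+_j - o-_j) / |A| and the fused action.
    """
    dists = [np.asarray(p, dtype=float) for p in dists]

    # (i) filtering: keep the top_k most confident models (highest negative entropy)
    def neg_entropy(p):
        return float(np.sum(p * np.log(p + 1e-12)))
    kept = sorted(dists, key=neg_entropy, reverse=True)[:top_k]

    # (ii) vote counting: +1 for each policy's best action, -1 for its worst
    n_actions = len(dists[0])
    votes = np.zeros(n_actions)
    for p in kept:
        votes[int(np.argmax(p))] += 1.0
        votes[int(np.argmin(p))] -= 1.0
    scores = votes / n_actions
    return scores, int(np.argmax(scores))

dists = [[0.8, 0.15, 0.05], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.34, 0.33, 0.33]]
scores, action = fuse_action_distributions(dists)  # near-uniform model filtered out
```

Here the near-uniform fourth distribution has the lowest negative entropy and is filtered out; the three confident policies all vote +1 for action 0 and -1 for action 2.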

4. EXPERIMENTS

In this section, we focus on validating that incorporating knowledge offers protection against adversarial attacks in deep RL. Specifically, we conduct experiments on a selection of Atari games and a high-dimensional Robot Food Court Environment (RoFoCo). Compared Methods. We compared our method with policy ensemble (Wiering & van Hasselt, 2008), policy distillation (Rusu et al., 2016), and adversarial training (Goodfellow et al., 2015) as strong baselines. For Policy Ensemble, we use five main task policies (trained with different random seeds) with majority voting. Since we focus on attack-agnostic defense methods, we conduct adversarial training over a union of commonly-known attacks, including the attacks that we evaluate on; this approximates "attack-agnostic" adversarial training. Recall that the domain knowledge I comprises the relations among the auxiliary tasks and the main task. To investigate the role I plays, we include a variant of our method, MLP Fusion, which replaces the GNN with an MLP whose inputs are the action distributions of the auxiliary policies. Attacks. In Atari games, we used common white-box attacks, FGSM (Goodfellow et al., 2015; Huang et al., 2017), PGD (Madry et al., 2018), and Jitter (Schwinn et al., 2021), and a black-box attack, Square (Andriushchenko et al., 2020). In the Robot Food Court environment (RoFoCo), we selected FGSM and PGD as white-box attacks (since they caused the most significant performance degradation in the Atari experiments) and Square as the black-box attack. The attacks were implemented using the torchattacks package (Kim, 2020).

4.1. ATARI GAMES

Environment. We evaluated KPR on three Atari games: Road Runner, River Raid, and Space Invaders, which are representative tasks that can be naturally decomposed into auxiliary tasks, e.g., collect targets, shoot enemies, and avoid collisions. For each game, the state is a stack of four consecutive frames, where each frame is pre-processed to size 84 × 84. Detailed environment settings can be found in Appendix B.2. Main Task Policy. We adopt PPO (Schulman et al., 2017; Huang et al., 2022) as the learning algorithm. Each policy is trained with 10 million frames. Alternative algorithms and policies such as DQN (Mnih et al., 2015) and Rainbow DQN (Hessel et al., 2018) can be used without changing the overall proposed framework. Auxiliary Tasks and Domain Knowledge. Due to space restrictions, complete auxiliary task information and domain knowledge relations are in Appendix B.2. Taking Road Runner as an example, the auxiliary tasks are: "T_1: Collect the bird seeds on the road." and "T_2: Avoid the cars.". Denote the main task, "Collect the bird seeds while avoiding colliding with cars on the road", as T_M. The logical relation is T_M ⟹ T_1 ∧ T_2. Results and Discussion. Our experimental results are summarized in Table 1, which shows test-time episode returns averaged over 70 episodes with standard errors in brackets. In each game, the main policy solved the task but performance degraded when attacked, as expected. In general, KPR improves robustness to all the attacks across the three environments. There is some performance degradation, but it is less severe compared to the unprotected main task policy. This is most apparent for the strongest attacks, i.e., FGSM and PGD. Interestingly, we see that adversarial training led to poorer policies in the benign environment and was also generally ineffective against FGSM and PGD.
We believe this is due to the attack-agnostic training scheme; in typical use, adversarial training requires knowledge of the attacker, but in our experiments different attackers were sampled, and this led to significant noise that hampered training. Policy Ensemble, MLP Fusion, and KPR are all ensemble methods, but differ in their construction. Although not specifically designed for defense, the simple policy ensemble is surprisingly effective in the Road Runner domain. Policy Ensemble and MLP Fusion achieve comparable performance, but are poorer compared to KPR. These results suggest that using existing prior knowledge in the form of policies and logical relations does result in more robust policies.

(Figure: (1) Collect food from the correct food stall; (2) Deliver food to the correct table; (3) Collect the used tray after customers finish eating.)

4.2. ROBOT FOOD COURT ENVIRONMENT (ROFOCO)

Environment. This environment simulates a service robot in a food-court setting and was developed using Unity (Juliani et al., 2018). The agent is a food-serving robot. A food stall number and a table number are given at the beginning of every episode. The main task is to collect food from the instructed food stall and deliver it to the instructed table. After the customer finishes eating, the robot should pick up the used tray and deposit it at the tray collection point. The task is divided into four stages, as illustrated in Figure 2. The agent receives a +10 reward for completing each intermediate stage and an additional +20 reward for completing the whole task. The agent incurs a -0.1 penalty at every time step and an additional -5 penalty if it tries to incorrectly perform pick-up or put-down actions on objects, such as trying to pick up food while the customer is eating. The maximum number of steps is 1,000. The available actions are: 1) Move forward; 2) Move backward; 3) Turn left 90 degrees; 4) Turn right 90 degrees; 5) Pick up; 6) Put down; 7) Do nothing. The observation at each time step is a 128 × 128 RGB image. Main Task Policy. As the observation space is significantly larger than in the Atari games, exploring from scratch is computationally expensive. To obtain the main task policy, we first initialize the agent with a policy obtained via imitation learning (Hussein et al., 2017) on a set of expert demonstrations. DQN (Mnih et al., 2015) is then used to further refine the policy via interaction. Auxiliary Tasks and Domain Knowledge. Due to space constraints, we give a few examples of auxiliary tasks and domain knowledge relations here and provide the complete list in Appendix B.3. Three example auxiliary tasks are 1) T_1: Navigate to the food stall; 2) T_2: Pick up food from customer tables; and 3) T_3: Pick up food from the food stall. Denote the main task described above as T_M. One example relation is T_M ⟹ T_1 ∧ (¬T_2) ∧ T_3.
Results and Discussion. Test-time performance comparisons are summarized in the results table. As a qualitative comparison, Figure 3 shows an example where the adversary successfully misled the robot into picking up food from other customers' tables. By leveraging auxiliary policies, such as navigating to the food stall and picking up food from the food stall, together with the relations between the tasks, KPR is able to correct the final action distribution prediction.

5. CONCLUSION AND FUTURE WORK

This paper proposes KPR, a novel approach that leverages domain knowledge to defend against adversarial attacks in reinforcement learning settings. KPR incorporates domain knowledge from auxiliary policies and specified logical relations between tasks, then learns flexible relations from interaction data via graph neural networks. The main advantage of KPR is that it is both policy and attack agnostic; any type of policy can be utilized, and neither access to nor information about the attack is required. We demonstrated its efficacy empirically in both Atari games and the complex Robot Food Court environment (RoFoCo). A number of promising avenues exist for future research. In this work, we mainly experimented with neural network policies; future work can look into other auxiliary and main policies (e.g., interpretable rule-based policies). Next, KPR worked well in RoFoCo, which is a highly complex environment, but it remains necessary to test KPR (and other existing defense methods) in real-world environments. Finally, we believe that KPR can also defend against alternative threat models, including observed adversaries that comply with environmental constraints (Gleave et al., 2020; Cao et al., 2022); these experiments would make for interesting next steps.

C ROBOT FOOD COURT (ROFOCO) ENVIRONMENT

In order to evaluate the performance of KPR in a complex, real-world-like environment, we developed a high-fidelity simulated Robot Food Court environment (RoFoCo) via Unity (Juliani et al., 2018), as shown in Figure 2. The agent is a food-serving robot. A food stall number and a table number are given at the beginning of every trajectory. The main task is to collect food from the correct food stall and deliver it to the correct table; then, after the customer finishes eating, pick up the used tray and send it to the tray collection point. The agent gets a +10 reward for completing each intermediate stage and an additional +20 reward for completing the whole task. The agent incurs a -0.1 penalty every time step and an additional -5 penalty if it tries to perform pick-up or put-down actions on objects that do not afford them, such as trying to pick up something on a human. The maximum number of steps is 1,000. The available actions are: 1) Move forward; 2) Move backward; 3) Turn left 90 degrees; 4) Turn right 90 degrees; 5) Pick up; 6) Put down; 7) Do nothing. The observation is a 128 × 128 RGB image.

D.1 FAST GRADIENT SIGN METHOD (FGSM)

FGSM (Goodfellow et al., 2015) is a method to efficiently calculate the gradient of the cost function with respect to the input of the neural network. The adversarial examples are generated using the following equation:

x′ = x + ϵ · sign(∇_x J(θ, x, y)), (9)

where θ denotes the parameters of a model, x is the input to the model, y is the target associated with x, J(θ, x, y) is the cost used to train the neural network, and sign is the component-wise signum operator. The adversarial examples generated by FGSM exploit the "linearity" of deep network models in high-dimensional spaces, whereas such models were commonly thought to be highly non-linear at the time. The authors hypothesized that the designs of deep neural networks that encourage linear behavior for computational gains also make them susceptible to cheap analytical perturbations, which is often referred to as the "linearity hypothesis".
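A minimal FGSM sketch on a model with an analytic gradient (binary logistic regression, so no autodiff library is needed) is given below; it illustrates the equation above and is not the torchattacks implementation used in the paper:

```python
import numpy as np

def fgsm_logistic(x, y, w, eps):
    """x' = x + eps * sign(grad_x J), where J is the binary cross-entropy of
    a logistic model p = sigmoid(w . x). Analytically, grad_x J = (p - y) * w."""
    p = 1.0 / (1.0 + np.exp(-w @ x))
    grad = (p - y) * w
    return x + eps * np.sign(grad)

w = np.array([1.0, -1.0])
x = np.array([0.0, 0.0])
x_adv = fgsm_logistic(x, y=0.0, w=w, eps=0.1)  # moves each coordinate by +/- eps
```

A single signed-gradient step of size ϵ strictly increases this model's loss, which is the attack's goal.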

D.2 PROJECTED GRADIENT DESCENT (PGD)

PGD (Madry et al., 2018) improves the performance of FGSM by running a finer iterative optimizer for multiple iterations. PGD performs FGSM with a smaller step size and projects the updated adversarial sample back into the ϵ-L∞ neighborhood of the benign sample and a valid range, so the adversarial perturbation size never exceeds ϵ. The update procedure follows:

x′_{t+1} = Proj{ x′_t + α · sign[∇_x J(θ, x′_t, y)] }.
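The iterative projected update can be sketched with a toy logistic model (an analytic stand-in for the victim network, chosen by us so the gradient is closed-form), with the projection implemented as an elementwise clip onto the ϵ L∞ ball:

```python
import numpy as np

def pgd_logistic(x, y, w, eps, alpha, steps):
    """Iterated signed-gradient steps of size alpha, each followed by a
    projection onto the L-infinity ball of radius eps around the benign x.
    Toy model: logistic regression p = sigmoid(w . x)."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w @ x_adv))
        grad = (p - y) * w                        # grad_x of binary cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # FGSM-style step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # Proj onto the eps-L_inf ball
    return x_adv

x_adv = pgd_logistic(np.zeros(2), y=0.0, w=np.array([1.0, -1.0]),
                     eps=0.1, alpha=0.04, steps=10)
```

After enough steps the iterate saturates the budget, sitting on the boundary of the ϵ ball.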

D.3 JITTER ATTACK

In order to make adversarial attacks more effective, Jitter (Schwinn et al., 2021) proposes a novel loss function to encourage logit scale invariance, diverse attack targets, and perturbation norm minimization. The final loss function can be described as follows:

L_Jitter = ‖ẑ − y + N(0, σ)‖_2 / ‖δ‖_p if x′ is misclassified, and ‖ẑ − y + N(0, σ)‖_2 if x′ is not misclassified yet,

with ẑ = softmax(α · z / ‖z‖_∞),

where y is the ground truth, z is the output logits after perturbation, δ is the perturbation, and α controls the lowest and largest possible output values of the softmax function.
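Under our reading of the piecewise loss above (our sketch, with the Gaussian noise term disabled via σ = 0 for determinism), the computation looks like this:

```python
import numpy as np

def jitter_loss(z, y_onehot, delta, alpha=1.0, sigma=0.0, p=np.inf,
                misclassified=False, rng=None):
    """Sketch (ours) of the Jitter loss: scale-invariant softmax of the
    logits, distance to the one-hot target plus optional Gaussian noise,
    divided by the perturbation norm once the input is misclassified."""
    z_hat = np.exp(alpha * z / np.max(np.abs(z)))   # softmax(alpha * z / ||z||_inf)
    z_hat /= z_hat.sum()
    noise = rng.normal(0.0, sigma, z.shape) if (rng is not None and sigma > 0) else 0.0
    base = np.linalg.norm(z_hat - y_onehot + noise)
    if misclassified:
        return base / np.linalg.norm(np.ravel(delta), ord=p)
    return base

z = np.array([2.0, 0.0])
y = np.array([1.0, 0.0])
delta = np.array([0.5, -0.25])
```

Dividing by ‖δ‖_p once the example is already misclassified pushes the optimizer to shrink the perturbation rather than increase confidence.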

D.4 SQUARE ATTACK

Square attack (Andriushchenko et al., 2020) is based on a randomized search scheme that selects localized square-shaped updates at random positions, so that at each iteration the perturbation is situated approximately at the boundary of the feasible set. The objective is to solve the constrained optimization problem: min_{x′ ∈ S} L(f(x′), y), with L(f(x′), y) = f_y(x′) − max_{k≠y} f_k(x′), s.t. ∥x′ − x∥_p ≤ ϵ, where f is the target network, x is the clean input, y is the ground truth, and S is the domain of valid inputs.
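The random square update at the heart of this scheme can be sketched as follows. This is a simplified single proposal step only: the full attack also evaluates the margin loss L and keeps the update only if it decreases. Names, sizes, and the window-sampling details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def square_update(x, x_adv, eps, h):
    # Propose a new iterate by overwriting a random h x h window with a
    # +/- eps corner value per channel, so that the perturbation inside the
    # window sits on the boundary of the L-inf ball around the clean input x.
    H, W, C = x.shape
    r = rng.integers(0, H - h + 1)
    c = rng.integers(0, W - h + 1)
    delta = rng.choice([-eps, eps], size=(1, 1, C))
    x_new = x_adv.copy()
    x_new[r:r + h, c:c + h, :] = np.clip(x[r:r + h, c:c + h, :] + delta, 0.0, 1.0)
    return x_new

x = np.full((8, 8, 3), 0.5)              # toy 8x8 RGB "image"
x_adv = square_update(x, x.copy(), eps=0.1, h=3)
```

Because each window is set to a ±ϵ corner value, no gradient information is needed, which is what makes Square attack a black-box method.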

E MODEL ARCHITECTURE

Policy Network. We adopt PPO (Schulman et al., 2017; Huang et al., 2022) as the learning algorithm. We use a three-layer CNN with {32, 64, 64} hidden channels, followed by two linear layers with 512 neurons each, for both the actor and critic networks. VAE. The encoder is a five-layer CNN with hidden dimensions {32, 64, 128, 256, 512}, followed by a linear layer. The decoder is symmetric to the encoder, with an additional 2D transposed convolution layer (Zeiler et al., 2010) of hidden dimension 512.
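For reference, the flattened feature size feeding the linear layers can be computed with the usual convolution output-size arithmetic. The kernel sizes and strides below are the standard Nature-CNN values (8/4, 4/2, 3/1), which we assume here purely for illustration; the text above only specifies the channel counts {32, 64, 64}:

```python
def conv_out(size, kernel, stride):
    # spatial output size of a convolution with no padding
    return (size - kernel) // stride + 1

def flat_features(size, layers):
    # layers: list of (out_channels, kernel, stride) tuples
    for _, k, s in layers:
        size = conv_out(size, k, s)
    return size * size * layers[-1][0]

cfg = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]   # assumed Nature-CNN geometry
print(flat_features(84, cfg))                # 3136 for a standard 84x84 Atari frame
```

Under the same assumed geometry, a 128 × 128 RoFoCo observation would yield a larger flattened vector, which the 512-unit linear layers then compress.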



We also illustrate KPR using a goal-finding grid-world environment as a proof of concept; due to space constraints, please refer to Appendix A for details. We also attempted to compare against other methods, namely State-Adversarial DQN (SA-DQN) (Zhang et al., 2020; 2021), Policy Adversarial Actor Director (PA-AD) (Sun et al., 2022), and Adversary Agnostic Policy Distillation (A2PD) (Qu et al., 2021). However, these methods do not perform sufficiently well in our environments, and thus, we exclude them from our comparison.




Figure 1: Knowledge-based Policy Recycling (KPR) Overview. Prior domain knowledge comprises a set of auxiliary tasks and a set of logical formulae defined over the auxiliary tasks. KPR combines these auxiliary task policies to obtain a new robust policy for the main task. At every time step, we obtain a state-conditioned action distribution for each auxiliary task; these are the realizations of the variables in the logical formulae. We represent logical formulae as graphs whose nodes are either auxiliary tasks or logical operators, e.g., ¬, ∧, ∨, =⇒. The instantiated logical formula graphs with node features are processed by a GNN pooling function (Section 3.2) to yield m different action distributions (one for each graph); in other words, each logical graph results in a different policy. Finally, we combine/fuse these distributions (Section 3.3) to obtain a final action distribution. In summary, KPR uses the specified logical formula graphs as prior structure and then learns flexible relations over auxiliary tasks from interaction data.

Figure 2: Robot Food Court Environment (RoFoCo) and a breakdown of the main task into four steps.


Figure 3: Adversarial Attacks Illustration. The adversary successfully misled the agent into picking up food from other customers' tables. KPR corrects the final action distribution prediction by leveraging auxiliary policies and domain knowledge.

Test-time episode accumulated returns on three Atari games. Returns are averaged over 70 episodes; standard errors are reported in brackets. The best scores are in bold.

Episode returns averaged over 50 test episodes show that KPR outperforms the other baselines across attacks, which further supports the notion that prior policy knowledge encourages robustness in policies. As before, KPR policies experienced less severe degradation against the various attacks (which are unknown at training time). As in the Atari games, we observed that the ensemble methods provide protection comparable to policy distillation and adversarial training.

Test-time episode accumulated returns in the Robot Food Court environment. Returns are averaged over 50 episodes; standard errors are reported in brackets. The best scores are in bold.

ETHICS STATEMENT

We propose a novel framework that defends against adversarial attacks in the RL setting by leveraging prior knowledge. This paper does not raise any ethical concerns: our study does not involve human subjects, and the Robot Food Court simulation environment we developed does not contain any sensitive information.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our experimental results, we include detailed network architecture and hyper-parameters in the appendix and provide source code in the supplementary material.

A TOY EXAMPLE: GOAL FINDING

To better illustrate how KPR works, we demonstrate it on a goal-finding grid-world environment as a proof-of-concept evaluation.

Environment. The agent's objective is to find a target goal while avoiding the obstacle in a 7 × 7 grid world (Figure 4A). The agent receives a reward of +10 for reaching the target, a -10 penalty for colliding with the obstacle, and a -1 penalty at each time step. The maximum episode length is 30. To mimic the redundant information in real-world scenarios, each object has a specific color, shape, and letter.

Auxiliary Tasks and Domain Knowledge. We design auxiliary tasks that focus on different aspects of the objects, such as "T1: Find the orange color (target object color)" and "T2: Avoid the diamond shape (obstacle object shape)". We denote the main task as TM. A simple domain knowledge relation among them could be TM =⇒ T1 ∧ T2. A complete list of auxiliary tasks and the domain knowledge relations among them can be found in Appendix B.1.

Compared Methods. We compare with policy ensemble (Wiering & van Hasselt, 2008) and MLP Fusion, a variant of KPR. Instead of leveraging a GNN-based pooling function to incorporate logical relations among auxiliary tasks, MLP Fusion replaces the GNN pooling function with an MLP; comparing against this variant lets us investigate the role of the logical-relation domain knowledge component. We use simple Deep Q-Learning (DQN) (Mnih et al., 2015) to train the main task and auxiliary task policies. Alternative algorithms and policies can be used without changing the overall proposed framework, e.g., improved DQN variants (Hasselt et al., 2016; Hessel et al., 2018) and alternative algorithms (Schaul et al., 2015; Bellemare et al., 2017; Christodoulou, 2019).

Results and Discussion. We adopt the FGSM attack (Goodfellow et al., 2015) to perturb the observation. Since the observations are grid-world states, the attacks here are generally stronger than pixel-level perturbations.
Performance is evaluated in terms of the episode return averaged over 1000 episodes; the results are summarized in Figure 4B and C. While performance is similar in the benign environment, KPR outperforms the others by a large margin in the perturbed environment. Note that KPR is not trained in the perturbed environment; this demonstrates KPR's ability to correct the inconsistency caused by adversarial attacks by leveraging domain knowledge over auxiliary policies. Furthermore, to investigate the qualitative correlation between the amount of domain knowledge and episode return, we study how different knowledge levels affect test performance, as shown in Figure 4D and E. The "Low" knowledge level consists of 4 auxiliary policies and 2 formulae, the "Moderate" level of 11 auxiliary tasks and 5 formulae, and the "High" level of 17 auxiliary tasks and 12 formulae. As Figure 4D and E show, robustness to adversarial attacks improves as the knowledge level increases.

A.1 AUXILIARY TASKS AND DOMAIN KNOWLEDGE

The objective for the agent is to find the goal while avoiding the obstacle in a 7 × 7 grid world (Figure 4A). The agent gets a +10 reward for reaching the target, a -10 penalty for touching the obstacle, and a -1 penalty every time step. The maximum number of steps is 30. To mimic the redundant information in real-world scenarios, we use three features to represent a single object: its color, its shape, and the letter on it. The auxiliary tasks are:

GNN Fusion Network. Our GNN pooling network consists of 3 GCN convolution layers with hidden dimensions {64, 32, 64}, followed by the graph multiset transformer described in Section 3, and 2 linear layers of 64 hidden neurons.
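To make the graph processing concrete, a single GCN-style propagation step over a toy formula graph for TM =⇒ T1 ∧ T2 can be sketched in NumPy. This is an illustrative message-passing layer only, not the paper's exact GCN plus graph multiset transformer pipeline; the node ordering, features, and weights are hypothetical:

```python
import numpy as np

# Nodes: [T1, T2, AND, IMPLIES]; edges connect operators to their operands.
A = np.array([[1, 0, 1, 0],      # adjacency with self-loops
              [0, 1, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = d_inv_sqrt @ A @ d_inv_sqrt          # symmetric normalization

def gcn_layer(H, W):
    # One propagation step: ReLU(A_hat @ H @ W).
    # H holds node features, e.g., auxiliary-policy action distributions.
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
H = rng.random((4, 7))    # 4 nodes, 7-dim features (7 actions in RoFoCo)
W = rng.random((7, 64))   # project to a 64-dim hidden layer, as in the text
out = gcn_layer(H, W)     # shape (4, 64)
```

Stacking three such layers with hidden dimensions {64, 32, 64} and then pooling the node features gives the overall shape of the fusion network described above.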

MLP Fusion Network

The MLP Fusion Network consists of 2 linear layers of 64 hidden neurons.

