DISTRIBUTIONALLY ROBUST LEARNING FOR UNSUPERVISED DOMAIN ADAPTATION

Abstract

We propose a distributionally robust learning (DRL) method for unsupervised domain adaptation (UDA) that scales to modern computer-vision benchmarks. DRL can be naturally formulated as a competitive two-player game between a predictor and an adversary that is allowed to corrupt the labels, subject to certain constraints, and it reduces (under the standard log loss) to incorporating a density ratio between the source and target domains. This formulation motivates jointly training two neural networks: a discriminative network between the source and target domains for density-ratio estimation, in addition to the standard classification network. The use of a density ratio in DRL prevents the model from being overconfident on target inputs far away from the source domain. Thus, DRL provides conservative confidence estimation in the target domain, even when the target labels are not available. This conservatism motivates the use of DRL for sample selection in self-training, and we term the approach distributionally robust self-training (DRST). In our experiments, DRST generates more calibrated probabilities and achieves state-of-the-art self-training accuracy on benchmark datasets. We demonstrate that DRST captures shape features more effectively and reduces the extent of distributional shift during self-training.

1. INTRODUCTION

In many real-world applications, the target domain for deploying a machine-learning (ML) model can differ significantly from the source training domain. Furthermore, labels in the target domain are often more expensive to obtain than in the source domain. An example is synthetic training, where the source domain has complete supervision while the target domain of real images may not be labeled. Unsupervised domain adaptation (UDA) aims to maximize performance on the target domain by utilizing both the labeled source data and the unlabeled target data. A popular framework for UDA obtains proxy labels in the target domain through self-training (Zou et al., 2019). Self-training starts with a classifier trained on the labeled source data, and then iteratively obtains pseudo-labels in the target domain using predictions from the current ML model. However, this process is brittle, since wrong pseudo-labels in the target domain can lead to catastrophic failure in early iterations (Kumar et al., 2020). To avoid this, self-training needs to be conservative and select only pseudo-labels with sufficiently high confidence levels, which requires accurate knowledge of those confidence levels. Accurate confidence estimation is a challenge for current deep-learning models, which tend to produce over-confident and misleading probabilities even when predicting on the same distribution (Guo et al., 2017a; Gal & Ghahramani, 2016). Attempts to remedy this issue include temperature scaling (Platt et al., 1999), Monte-Carlo sampling (Gal & Ghahramani, 2016), and Bayesian inference (Blundell et al., 2015; Riquelme et al., 2018). However, Snoek et al. (2019) have shown that the uncertainty estimates from these models cannot be trusted under domain shifts.
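The confidence-gated self-training loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual training code: the nearest-centroid classifier and the Gaussian data are hypothetical stand-ins chosen only to make the loop runnable.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in classifier so the sketch runs; not the paper's model."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax over negative distances to each class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        z = np.exp(-d)
        return z / z.sum(axis=1, keepdims=True)

def self_train(model, Xs, ys, Xt, rounds=3, threshold=0.6):
    """Confidence-gated self-training: keep only confident pseudo-labels."""
    model.fit(Xs, ys)                          # start from the source classifier
    for _ in range(rounds):
        probs = model.predict_proba(Xt)
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= threshold               # conservative sample selection
        model.fit(np.concatenate([Xs, Xt[keep]]),
                  np.concatenate([ys, pseudo[keep]]))
    return model

rng = np.random.default_rng(0)
Xs = np.concatenate([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
Xt = Xs + 1.0                                  # target inputs under a covariate shift
model = self_train(NearestCentroid(), Xs, ys, Xt)
```

The threshold is exactly where calibrated confidence matters: an over-confident model passes wrong pseudo-labels through the gate.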
In this paper, we instead consider the distributionally robust learning (DRL) framework (Liu & Ziebart, 2014; 2017), which provides a principled approach for uncertainty quantification under domain shifts. DRL can be formulated as a two-player adversarial risk minimization game, as depicted in Figure 1, in which the adversary is allowed to perturb the labels, subject to certain feature-matching constraints that ensure data compatibility. Formally, the minimax game for DRL is:

min_{P(Y|X)} max_{Q(Y|X)} loss_{P_t(X)}(P(Y|X), Q(Y|X)), (1)

where the adversary Q(Y|X) is constrained to match the evaluation of a set of features Φ(x, y) to that of the source distribution (see Section 2 for details).

Figure 1: (a) Intuition of DRL under domain shift, where X_s and Y_s represent the labeled source data and X_t represents the unlabeled target data; y_pred is the predictor's probabilistic labels and y_fake is the adversary's proposed probabilistic labels. (b) Architecture for end-to-end training of the DRL framework without class regularization; it is an instantiation of (a) using neural networks. The expected target loss cannot be evaluated due to the lack of target labels in the UDA setting. Instead, we compute its gradients directly to train the networks. We present the details in Sec. 2.2.
Note that the loss in (1) is evaluated under the target input distribution P_t(X), and the predictor does not have direct access to the source data {X_s, Y_s}. Instead, the predictor optimizes the target loss by playing a game with an adversary constrained by the source data. A special case of UDA is the covariate-shift setting, where the label-generating distribution P(Y|X) is assumed to be the same in both the source and target domains. Under this assumption, with log loss and a linear predictor parameterized by θ with features Φ(x, y), (1) reduces to:

P(y|x) ∝ exp( (P_s(x)/P_t(x)) θ·Φ(x, y) ). (2)

Intuitively, the density ratio P_s(x)/P_t(x) prevents the model from being overconfident on target inputs far away from the source domain. Thus, the DRL framework is a principled approach for conservative confidence estimation. Previous works have shown that DRL is highly effective in safety-critical applications such as safe exploration in control systems (Liu et al., 2020) and safe trajectory planning (Nakka et al., 2020). However, these works only consider estimating the density ratio in low dimensions (e.g., control inputs) using a standard kernel density estimator (KDE), and extending it to high-dimensional inputs such as images remains an open challenge. Moreover, it is not clear whether the covariate-shift assumption holds in common high-dimensional settings such as images, which we investigate in this paper. In this paper, we propose a novel deep-learning method based on the DRL framework that produces accurate uncertainties and scales to modern domain-adaptation tasks in computer vision.
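As a quick illustration of this conservatism, the following sketch evaluates the parametric form in (2) with a toy density ratio and toy logits standing in for θ·Φ(x, y); all numbers are hypothetical, chosen only to show the effect.

```python
import numpy as np

def drl_predict(ratio, logits):
    # Sketch of the parametric form (2): P(y|x) ∝ exp(ratio * θ·Φ(x, y)),
    # with `logits` standing in for θ·Φ(x, y) (hypothetical numbers).
    z = ratio * np.asarray(logits, dtype=float)
    z -= z.max()                     # for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 0.5, -1.0]            # the same model scores in both cases
p_near = drl_predict(ratio=1.0, logits=logits)  # input well covered by the source
p_far = drl_predict(ratio=0.1, logits=logits)   # input far from the source support
assert p_far.max() < p_near.max()    # far-from-source prediction is more uniform
```

A small ratio shrinks every logit toward zero, so the prediction approaches the uniform distribution rather than committing to a class.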

Summary of Contributions:

1. We develop differentiable density-ratio estimation as part of the DRL framework to enable efficient end-to-end training. See Figure 1(b).
2. We employ DRL's confidence estimation in the self-training framework for domain adaptation and term the result distributionally robust self-training (DRST). See Figure 2.
3. We further combine it with the automated synthetic-to-real generalization (ASG) framework of Chen et al. (2020b) to improve generalization in the real target domain when the source domain consists of synthetic images.
4. We demonstrate that DRST generates more calibrated probabilities. DRST-ASG achieves competitive accuracy on the VisDA2017 dataset (Peng et al., 2017), with a 1% improvement over the baseline class-regularized self-training (CRST) using the standard softmax confidence measure.
5. We analyze the reason for the effectiveness of DRST through a careful ablation study.

One challenge in training DRL is that the training loss cannot be directly evaluated under the UDA framework. However, we show that the gradients of the target loss can indeed be evaluated. By deriving gradients for both neural networks and proposing a joint training algorithm (Alg. 1), we show that the networks can be trained efficiently. We also directly incorporate class regularization in the minimax game under our DRL framework. This is a principled approach, in contrast to standard label smoothing applied on top of a given learning method. In our ablation studies, we observe that the covariate-shift assumption holds to a progressively greater extent as the self-training iterations proceed. This is also correlated with a greater ability to capture shape features through self-training, as seen in the Grad-CAM visualizations (Selvaraju et al., 2017).

2. PROPOSED FORMULATION AND ALGORITHMS

In this section, we first introduce the class-regularized DRL framework (Sec. 2.1) and then propose differentiable density-ratio estimation to enable end-to-end learning of DRL using neural networks. We provide training details, in particular the gradient computation for training both networks (Sec. 2.2); this is unique to our setting, since the actual training loss on the target cannot be evaluated due to the lack of target labels. Finally, we propose our self-training algorithm, DRST, in Sec. 2.3.

2.1. DISTRIBUTIONALLY ROBUST LEARNING WITH CLASS REGULARIZATION

We are interested in robustly minimizing the classification loss on the target domain while regularizing the confidence of the adversary's prediction. As the regularization, we use a weighted logloss term to penalize high confidence in the adversary's label prediction. We make the same covariate-shift assumption as in (Liu & Ziebart, 2014): only the marginal input distribution changes while P(y|x) is shared between source and target, i.e., P_s(x) ≠ P_t(x) and P_s(y|x) = P_t(y|x). We aim to solve the following:

min_{P(Y|X)} max_{Q(Y|X)∈Ξ} logloss_{P_t(X)}(P(Y|X), Q(Y|X)) - r E_{P_t(x)Q(y|x)}[Y log P(Y|X)], (3)

where Y is a one-hot encoding of the class and r ∈ [0, 1] is a hyper-parameter that controls the level of regularization. In this formulation, the estimator player P(Y|X) first chooses a conditional label distribution to minimize the regularized logloss on the target domain, and then an adversarial player Q(Y|X) chooses a conditional label distribution from the set Ξ to maximize the regularized logloss. The constraint set Ξ defines how much flexibility we want to give to the adversary. Usually, we design feature functions on both X and Y and restrict the adversary to match the expectations of these features. We have the following lemma:

Lemma 1. If we choose the feature map Φ(X, Y) as the statistics to constrain Q(Y|X), then equation (3) reduces to a regularized maximum-entropy problem with the estimator constrained:

max_{P(Y|X)} E_{P_t(x)P(y|x)}[-log P(Y|X)] - r E_{P_t(x)P(y|x)}[Y log P(Y|X)] (4)
such that: P(Y|X) ∈ ∆ and |E_{P_s(x)P(y|x)}[Φ(X, Y)] - E_{P_s(x,y)}[Φ(X, Y)]| ≤ λ,

where ∆ is the conditional probability simplex within which P(y|x) must reside, Φ is a vector-valued feature function evaluated on input x, E_{P_s(x,y)}[Φ(X, Y)] is the vector of expected feature values corresponding to the feature function, and λ is the slack term of the constraints.
The proof of this lemma uses strong duality of the convex-concave function, so that the min and max players can switch order; we refer to the appendix for details. The following theorem states the solution of the problem:

Theorem 1. The parametric solution of (4) for P(y|x) takes the form:

P(y|x) ∝ exp( ( (P_s(x)/P_t(x)) θ·Φ(x, y) + ry ) / (ry + 1) ), (5)

where the parameter θ can be optimized by maximizing the log-likelihood on the target distribution. The gradients take the form:

∇_θ E_{P_t(x)P(y|x)}[-log P_θ(Y|X)] = E_{P_s(x)P_θ(y|x)}[Φ(X, Y)] - c, (6)

where c ≜ E_{P_s(x)P(y|x)}[Φ(X, Y)] is the empirical evaluation of the feature expectations; here P_s(x) and P(y|x) denote the empirical distributions.

In principle, P(y|x) is the ground-truth conditional label distribution shared between the source and target domains. We call E_{P_t(x)P(y|x)}[-log P_θ(Y|X)] the expected target loss in this paper. Even though it is not available in practice, we can approximate its gradients (6) using the source data in training. The norm of the difference between the approximated gradient and the true gradient converges at a rate of O(1/m), where m is the amount of source data. The proof applies Lagrange multipliers, sets the derivative with respect to each P(y|x) to 0, and uses the KKT conditions; we refer to the appendix for details.

[Algorithm 1 (joint training; see Sec. 2.2) updates the last-layer parameters as w ← w - γ_2·∇_w L_c and b ← b - γ_2·∇_b L_c, uses optimizer Opt2(γ_2) to update α, increments the epoch counter, and outputs the trained α, β, w, b.]

We use this parametric form to illustrate the representation-level conservativeness and the class-level regularization of our formulation. Representation-level conservativeness: the prediction has higher certainty for inputs closer to the source domain, where the magnitude of P_s(x)/P_t(x) is large. On the contrary, if the inputs are farther away from the source, so that P_s(x)/P_t(x) is small, the prediction is uncertain. Class-level regularization: the hyper-parameter r adjusts the smoothness of Q(Y|X)'s label prediction in (3).
It translates into the ry terms in the parametric form. In training, we compute the gradients using source labels, where y is the one-hot encoding of the class. In testing, we can set y to be the all-one vector to obtain smoothed confidence. In machine-learning methods that use density ratios, such as transfer learning (Pan & Yang, 2009) or off-policy evaluation (Dudík et al., 2011), a plug-in estimator for the density ratio P_s(x)/P_t(x) is used. However, density ratio estimation (Sugiyama et al., 2012), especially with high-dimensional data, is a rather difficult problem in its own right. It is also not the case that more accurate density ratio estimation necessarily leads to better downstream task performance; we show a synthetic example in Appendix E. To scale the method up to modern domain adaptation tasks, we ask: can we train the density ratio estimation and the learning tasks that use the ratios together, so that they share the common goal of target-domain predictive performance?
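To make the two effects concrete, the following sketch evaluates the parametric form (5) with toy logits standing in for θ·Φ(x, y) and the all-one vector for y, as at test time. All numbers are hypothetical.

```python
import numpy as np

def drl_reg_predict(logits, r, ratio=1.0, y=None):
    # Sketch of the class-regularized form (5):
    # P(y|x) ∝ exp((ratio * θ·Φ(x, y) + r*y) / (r*y + 1)),
    # with `logits` standing in for θ·Φ(x, y) (hypothetical numbers).
    logits = np.asarray(logits, dtype=float)
    if y is None:
        y = np.ones_like(logits)     # all-one vector at test time
    z = (ratio * logits + r * y) / (r * y + 1.0)
    z -= z.max()                     # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [3.0, 0.0, -1.0]
p_r0 = drl_reg_predict(logits, r=0.0)
p_r1 = drl_reg_predict(logits, r=1.0)
assert p_r1.max() < p_r0.max()       # larger r yields a smoother prediction
```

With y the all-one vector, the exponent becomes (ratio·logit + r)/(r + 1), which compresses the logit range as r grows, producing the class-level smoothing; the ratio argument separately implements the representation-level conservativeness.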

2.2. DIFFERENTIABLE DENSITY RATIO ESTIMATION AND END-TO-END TRAINING FOR DRL

We propose an end-to-end training procedure for DRL in which density ratio estimation is trained together with the target classification. We use two neural networks, one for classification and one for density ratio estimation; see Figure 1(b). Differentiable density ratio estimation: We observe that the density ratio in the parametric form (5) can be treated as a trainable weight for each example, which can receive gradients from the objective function. On the other hand, the density ratio can be estimated via binary classification between unlabeled source and target data. Therefore, we propose to train a discriminative neural network to differentiate the two domains for density ratio estimation, which receives training signals from both the target classification loss and the binary classification loss. Expected target loss as a training objective: According to Theorem 1, even though the expected target loss cannot be evaluated using data, we can approximate its gradient (6) using source samples. We can directly apply these gradients to the last layer of the classification network and back-propagate to the other layers. We next present the overall training objective and derive the gradients. The overall training objective: Let φ(x, α) denote the classification network's feature extractor with parameters α, and let w and b be the parameters of its linear last layer; its input is the source data. Let τ(x, β) be the discriminative network with parameters β, whose input is both unlabeled source and target data. The last layer of the classification network is: w·φ(x, α) + b.
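The discriminative estimate of the density ratio can be sketched in one dimension as follows; the Gaussian source/target data and the plain logistic regression are illustrative assumptions standing in for the discriminative network τ(x, β).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D source/target inputs with a known shift (assumed distributions,
# chosen only to make the discriminator idea concrete).
xs = rng.normal(0.0, 1.0, 2000)      # source: N(0, 1)
xt = rng.normal(1.0, 1.0, 2000)      # target: N(1, 1)

# Domain discriminator: logistic regression, label 1 = target domain.
X = np.concatenate([xs, xt])
c = np.concatenate([np.zeros(xs.size), np.ones(xt.size)])
w, b = 0.0, 0.0
for _ in range(2000):                # plain gradient descent on the logloss
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))
    g = p - c
    w -= 0.1 * np.mean(g * X)
    b -= 0.1 * np.mean(g)

def ratio_s_over_t(x):
    # P_s(x)/P_t(x) = (P(s|x)/P(t|x)) * (P(t)/P(s)); the priors are equal here.
    p_t = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return (1.0 - p_t) / p_t

# The ratio is large where the source is denser than the target:
assert ratio_s_over_t(-1.0) > 1.0 > ratio_s_over_t(2.0)
```

The same discriminator outputs then serve as the per-example weights (d_s, d_t) in the parametric form, which is what makes the ratio differentiable end-to-end.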
We use the following training objective, which accounts for the interaction between training the density ratios and classification performance on the target domain:

L = L_c + L_d = E_{P_t(x)P(y|x)}[-log P(Y|X)] + E_{P(x)P(c|x)}[-log P(C|X)],

where L_c is the expected target loss E_{P_t(x)P(y|x)}[-log P(Y|X)], which cannot be evaluated but whose gradients are available for training from the DRL framework, and P(Y|X) takes the following form with neural networks as feature functions:

P(y|x) ∝ exp( ( (P_s(x)/P_t(x)) (w·yφ(x, α) + b) + ry ) / (ry + 1) ).

We use L_d to denote the cross-entropy loss of the domain discriminator that produces P(C|X), with C ∈ {s, t} indexing the source and target domain classes. Here we use P(x) and P(c|x) to represent the overall unlabeled data distribution and the empirical domain class in the data. The predictions of the discriminator are P(c|x) ∝ exp{cτ(x, β)}. Based on Bayes' rule, the density ratio P_t(x)/P_s(x) can be computed as:

P_t(x)/P_s(x) = ( P(t|x)/P(t) ) / ( P(s|x)/P(s) ) = ( P(t|x)/P(s|x) ) · ( P(s)/P(t) ).

Therefore, we can use a discriminator to estimate P(t|x)/P(s|x) from unlabeled source and target data (Bickel et al., 2007). Gradients for the classification network: With the density ratios given by the discriminative network, we compute the gradients of the classification network following (6) with respect to w, φ(x, α), and b:

∇_φ E_{P_t(x)P(y|x)}[-log P(Y|X)] = (P̂ - y)w;
∇_w E_{P_t(x)P(y|x)}[-log P(Y|X)] = (P̂ - y)φ;
∇_b E_{P_t(x)P(y|x)}[-log P(Y|X)] = (P̂ - y);

where P̂ and y are, respectively, the vector of predicted conditional label probabilities under the current parameters and the one-hot encoding of the true labels in the source data. Note that ∇_w E_{P_t(x)P(y|x)}[-log P(Y|X)] corresponds to the gradient for learning θ in Eq. (6), with Φ(x, y) = w·yφ(x, α) + b. Gradients of the discriminative network from the expected target loss: We denote by d_s = P(s|x) and d_t = P(t|x) the two weight scalars for each input x, so that P(y|x) ∝ exp{ (d_s/d_t) (w·yφ(x, α) + b) }.
Since (d_s, d_t) is exactly the output of the density estimator, we treat (d_s, d_t) as trainable variables and derive gradients from the expected target loss L_c. In this way, the parameter β of the density-ratio-estimation network receives learning signals from both losses. We derive the gradients as:

∇_{d_s} L_c = (1/d_t) E_{P_t(x)P̂(y|x)}[w·ŷφ(x, α) + b],
∇_{d_t} L_c = -(d_s/d_t²) E_{P_t(x)P̂(y|x)}[w·ŷφ(x, α) + b],

which depend on the unlabeled target inputs P_t(x) but do not rely on the target labels. We refer to the appendix for the detailed derivation. We then concatenate these into a gradient vector and back-propagate through the discriminative network. We summarize the procedure in Algorithm 1.

As in (Zou et al., 2018), the data proportion p does not have a significant impact on the results; after validating this in practice, we set p = 0.5 for DRST for simplicity. In the ablation, we also remove the representation-level conservativeness ("R = 1"), which means we mute the differentiable density ratio estimation in our method. We can see that DRST achieves the best performance when both components are present.

Figure 4: Model attention visualized using Grad-CAM (Selvaraju et al., 2017). We also show the labels predicted by the different methods. Our method captures the shape features of the image better.

Covariate shift: In Figure 3(c), we show the ratio difference P_s(x)/P_t(x) - P_s(x, y)/P_t(x, y), where x is the last-layer representation from each method. We estimate this ratio using discriminative density-ratio estimators (per class). When P_s(y|x) = P_t(y|x), the ratio difference is close to 0. We can see that the P(y|x) gap between source and target decreases over the course of self-training. This is due to the shape features captured by the method. We can also observe this in Figure 4, where we visualize the last-layer model attention of our model and the baselines.
Therefore, even though the covariate-shift assumption of DRL may not be satisfied at the beginning of self-training, self-training helps align P(y|x) by promoting the learning of shape representations, so that P_s(y|x) and P_t(y|x) converge over the course of training. Density ratio estimation: We show a pair of examples from the target domain with relatively high and low density ratios. In Figure 5(a), a more typical "train" image obtains a higher density ratio than a train on a busy platform, where the shape of the train is not obvious. DRL is able to give higher confidence to images that are better represented in the synthetic training domain. More examples are given in the appendix. DRL's conservativeness in confidence: Figure 7 demonstrates DRL's more calibrated uncertainty measure. We adopt the same definitions of confidence and accuracy as (Guo et al., 2017a) and the same diagram-plotting protocol as (Maddox et al., 2019). The closer the curve is to the dashed line, the better calibrated the uncertainty is. DRL tends to be slightly underconfident but also stays closer to the calibration line (dashed line).

The gradient of the loss with respect to the source density d_s is:

∂L(θ)/∂d_s = ∂/∂d_s ( -θ E_{P_s(x)P(y|x)}[Φ(X, Y)] + E_{P_t(x)}[log Z_θ(X)] ) = (1/d_t) E_{P_t(x)P̂(y|x)}[θΦ(X, Y)] = (1/d_t) ∂L(θ)/∂R,

and the gradient of the loss with respect to the target density d_t is:

∂L(θ)/∂d_t = ∂/∂d_t E_{P_t(x)}[log Z_θ(X)] = -(d_s/d_t²) E_{P_t(x)P̂(y|x)}[θΦ(X, Y)] = -(d_s/d_t²) ∂L(θ)/∂R,

where:

∂L(θ)/∂R = E_{P_t(x)}[ Σ_{y∈Y} ( exp(R θ·Φ) / Z_θ(X) ) θΦ(X, Y) ] = E_{P_t(x)P̂(y|x)}[θΦ(X, Y)].
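The key identity in this derivation can be checked numerically. The sketch below verifies, for a single input x with hypothetical per-class feature values, that the derivative of the log partition function with respect to R = d_s/d_t equals the model-expected value of θ·Φ.

```python
import numpy as np

# Sketch verifying the identity d(log Z_θ)/dR = E_{P̂(y|x)}[θ·Φ(X, Y)]
# for one input x, where R = d_s/d_t scales the logits.
# The feature values below are hypothetical, chosen only for the check.
theta_phi = np.array([1.5, -0.5, 0.2])            # θ·Φ(x, y) per class y

def log_Z(R):
    return np.log(np.sum(np.exp(R * theta_phi)))  # log partition function

R = 0.7
p_hat = np.exp(R * theta_phi) / np.sum(np.exp(R * theta_phi))
analytic = np.sum(p_hat * theta_phi)              # E_{P̂}[θ·Φ]
eps = 1e-6
numeric = (log_Z(R + eps) - log_Z(R - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
# The chain rule with R = d_s/d_t then gives the two gradients above:
# ∂L/∂d_s = analytic / d_t and ∂L/∂d_t = -(d_s / d_t**2) * analytic.
```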

D ADDITIONAL DRST EXPERIMENTAL RESULTS

We demonstrate the t-SNE plots of the learned decision boundaries for CBST, CRST and DRST in figure 8. Figure 9 shows the misclassification-entropy comparison between the source model and the DRL model. Misclassification entropy is calculated as S = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} p_ij log p_ij, where n is the number of samples, m is the number of categories in the dataset, and p_ij is the predicted probability of the i-th sample on the j-th category. The larger the misclassification entropy, the more uncertain the model is on its wrong predictions, which means the model fails more gently. Figure 10 demonstrates additional model-attention visualizations using Grad-CAM (Selvaraju et al., 2017). Figure 11 shows additional target examples with high and low density ratios. Our model is able to find noisy and less-represented images in the target data by estimating the density ratios; DRL is more uncertain on those data.
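The misclassification-entropy formula above can be computed as follows (a minimal sketch; the function name is ours, and `probs` is assumed to already be restricted to the misclassified samples):

```python
import numpy as np

def misclassification_entropy(probs):
    """Average prediction entropy S = -(1/n) * sum_i sum_j p_ij log p_ij.

    `probs` is an (n, m) array of predicted probabilities over m classes.
    Higher values mean the model is less certain on its wrong
    predictions, i.e. it fails more gently.
    """
    probs = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
    return -np.mean(np.sum(probs * np.log(probs), axis=1))
```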

E SIMULATION ON PLUG-IN ESTIMATOR

DRL for domain shift requires P_s(x)/P_t(x) to adjust the representation-level conservativeness. Like many other machine-learning methods that use importance weights, as in transfer learning (Pan & Yang, 2009) or off-policy evaluation (Dudík et al., 2011), we can use a plug-in estimator for the density ratio P_s(x)/P_t(x). However, density ratio estimation (Sugiyama et al., 2012), especially on high-dimensional data, is rather difficult. Here, we ask the question: does more accurate density ratio estimation lead to better predictive performance on the downstream tasks? We show a two-dimensional binary classification example in figure 12 to demonstrate the relation between the performance of the density (ratio) estimation and the performance of the ultimate target learning task. We use the RBA method (Liu & Ziebart, 2014) as an example. We conduct kernel density estimation (KDE) and evaluate the average log-likelihood on the source and target domains. We take the ratio of the densities from KDE and plug it into the RBA method. We can see that the case with higher log-likelihood actually fails to give informative predictions. One of the reasons is that the density (ratio) estimation task, as an independent learning task, does not share information with the downstream prediction tasks that use the ratios.

Figure 11: Additional examples of density ratio estimation for different categories. We can observe that data less well-represented in the source data has a much lower density ratio. This shows that our learned density ratio is a good measure of how well data is represented in the source and target domains.
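As an illustration of the plug-in estimator discussed above, the following sketch builds one-dimensional Gaussian KDEs for the two domains and takes their ratio. All names are ours, and this is only a toy: the paper's simulation is two-dimensional and feeds the ratio into RBA, which is omitted here.

```python
import numpy as np

def gaussian_kde(samples, bandwidth=0.5):
    """Return a 1-D Gaussian kernel density estimator for `samples`."""
    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        diffs = (x[:, None] - samples[None, :]) / bandwidth
        kernels = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2 * np.pi))
        return kernels.mean(axis=1)  # average kernel mass per query point
    return density

def plug_in_ratio(source, target, x, bandwidth=0.5):
    """Plug-in estimate of P_s(x) / P_t(x) from per-domain KDEs."""
    p_s = gaussian_kde(source, bandwidth)(x)
    p_t = gaussian_kde(target, bandwidth)(x)
    return p_s / np.maximum(p_t, 1e-12)
```

For shifted Gaussians the ratio is, as expected, above 1 in regions dense under the source and below 1 in regions dense under the target; the point of the simulation, however, is that a better KDE fit does not guarantee a better downstream predictor.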

F ADDITIONAL RESULTS FOR DRL ON OFFICE-31

We include additional results for distributionally robust learning on Office-31 in Figure 13 and Figure 14.

G CODE REPOSITORY

Code can be found at: https://anonymous.4open.science/r/ 2ed8a9ce-f404-4489-9ef7-a3a83e02a44c/ 



CONCLUSION

In this paper, we develop a learning method under the distributionally robust learning framework for modern domain adaptation. We propose differentiable density ratio estimation and class regularization within the framework, and develop end-to-end training techniques for the proposed method. Using DRL's model confidence, our self-training algorithm achieves state-of-the-art predictive performance while staying calibrated. We also demonstrate that self-training helps reduce the distribution gap between the source and target domains, which in turn makes DRL more effective.



(a) Recall that the standard framework of empirical risk minimization (ERM) directly learns a predictor P(Y|X) from training data. In contrast, DRL also includes an adversary Q(Y|X)


DISTRIBUTIONALLY ROBUST SELF-TRAINING

Algorithm: We propose Algorithm 2 to combine the DRL model with self-training. The idea is to regard each training epoch as a new distribution-shift problem for DRL. After each training epoch, we make predictions on the target domain and select the most confident portion, merging it with its proxy labels into the source training data. Both the pseudo-labels and the model confidence come from the DRL model (5). The labeled source data and the newly pseudo-labeled target portion then become the new source set for the next DRL learning epoch, as shown in Figure 2.
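The confidence-based selection step described above can be sketched as follows (a hypothetical helper, not the paper's code; confidence is taken as the maximum DRL predictive probability):

```python
import numpy as np

def select_confident(probs, p=0.5):
    """Select the top-p most confident target samples for pseudo-labeling.

    `probs` is an (n, m) array of DRL predictive probabilities on the
    target domain. Returns (indices, pseudo_labels) for the selected
    portion, to be merged into the source set for the next epoch.
    """
    confidence = probs.max(axis=1)          # per-sample max probability
    pseudo_labels = probs.argmax(axis=1)    # proxy labels
    k = max(1, int(np.ceil(p * len(probs))))
    idx = np.argsort(-confidence)[:k]       # highest confidence first
    return idx, pseudo_labels[idx]
```

A per-class variant (selecting the top-p portion within each predicted class, as in class-balanced self-training) would follow the same pattern with the sort applied class by class.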

Figure 3: (a)-(b) Accuracy and Brier score of self-training on VisDA2017 with 5 random seeds. We show 95% standard-error bars to demonstrate that DRST outperforms the baselines significantly. (c) Distribution gap in P(y|x), measured as P_s(x)/P_t(x) - P_s(x, y)/P_t(x, y), between the source and target domains. Compared with DRL without self-training, DRST helps reduce the gap.

(a) Left: P_s(x)/P_t(x) = 1.004. Right: P_s(x)/P_t(x) = 2.132.

Figure 5: (a) Density ratios and example target images for the category "Train" in VisDA. A larger density ratio (P_s(x)/P_t(x)) indicates a more certain prediction. DRL gives a more certain prediction on the "train" that is better represented in the source domain, shown on the right-hand side. (b) Accuracy and Brier score using DRL on VisDA. DRL significantly improves over the source model along the training.

Figure 6: Accuracy and Brier score of DRL on Office-31 and Office-Home. DRL improves over source-only by a large margin.

3.2 PREDICTIVE PERFORMANCE AND UNCERTAINTY QUANTIFICATION OF DRL

We compare DRL with source models on the VisDA2017, Office-31, and Office-Home datasets to show the DRL model's accuracy and calibration. Note that DRL can be regarded as a lightweight model-generalization method, since the unlabeled target data is only used for the density ratio estimation. DRL's improvement on source training: Figure 5(b) and Figure 6 demonstrate the significant improvement over the source synthetic-training model on all the datasets. Note that the only difference between DRL and the source model is DRL's predictive form and the differentiable density ratio estimation. In these plots, all the models are trained with the same number of total training epochs. An accuracy or Brier score reported at epoch 5 means we incorporate DRL to improve the source training model starting from epoch 5. Therefore, we observe DRL's consistent and significant improvement over the source training model even when it starts late in the learning process. The lower Brier score provides evidence of the more conservative model confidence that benefits self-training.
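For reference, the Brier score used throughout this section is the mean squared error between the predicted probability vectors and the one-hot true labels (lower is better). A minimal sketch, with names of our choosing:

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared error between predicted
    probability vectors and one-hot true labels (lower is better)."""
    n, m = probs.shape
    one_hot = np.eye(m)[labels]  # (n, m) one-hot encoding of labels
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))
```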

Figure 7: Reliability diagrams for source-only, temperature scaling (TS) and DRL. The data is separated into 20 bins with different interval lengths. The closer the curve is to the dashed line, the more calibrated the uncertainty is. DRL achieves more calibrated, usually conservative, confidence, while the other methods consistently tend to generate over-confident predictions.
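The binning behind such reliability diagrams can be sketched as follows. For simplicity this sketch uses equal-width bins, whereas the figure uses 20 bins with different interval lengths; all names are ours.

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=20):
    """Per-bin (mean confidence, accuracy) pairs for a reliability diagram.

    Samples are bucketed by predicted confidence; a calibrated model has
    per-bin accuracy equal to per-bin confidence (the dashed diagonal).
    Empty bins are skipped.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            stats.append((confidences[mask].mean(), correct[mask].mean()))
    return stats
```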

Comparison of misclassification entropy between DRL and DRST on different datasets.

Figure 10: Model attention visualized using Grad-CAM (Selvaraju et al., 2017). Our method also captures the domain knowledge well. For example, the input image in the first row contains a giraffe and a car; however, "giraffe" is not an existing label in VisDA2017. While CBST and CRST capture the wrong information, DRST is able to correctly capture the domain information.

Figure 12: (a) Source and target data points are drawn from two Gaussian distributions (solid line: source; dashed line: target). The underlying true decision boundary for the binary classes is the same between the two domains. (b)-(c) Predictions with density ratios from low and high density-estimation likelihoods. With more accurate density estimation, the RBA predictor gives overly conservative predictions on the target domain. The colormap shows the confidence P("1"|x). (d) With larger likelihood in density ratio estimation, the target log loss becomes worse.

Figure 13: Additional Office-31 results for DRL compared with source-only, complementing Figure 6.

Figure 14: Additional Office-31 reliability plots, complementing Figure 7. DRL is compared with source-only and temperature scaling.

Algorithm 1 End-to-end Training for DRL. Input: source data, target data, classification DNN ŷ_φ (with parameters α), density-ratio DNN τ (with parameters β), SGD optimizers Opt1 and Opt2, learning rates γ1 and γ2, number of epochs T.
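Algorithm 1's alternating structure, with one optimizer for the classification network and one for the domain discriminator whose output yields a density ratio, can be sketched on toy one-dimensional data. Everything in this sketch is illustrative: the models are single-parameter logistic regressors, and for brevity the ratio is plugged in as a simple importance weight rather than through DRL's robust predictive form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_drl_sketch(epochs=50, lr1=0.1, lr2=0.1, seed=0):
    """Toy alternating training: domain discriminator (Opt2 step) and
    ratio-weighted classifier (Opt1 step) on shifted 1-D data."""
    rng = np.random.default_rng(seed)
    Xs = rng.normal(0.0, 1.0, 200)       # labeled source inputs
    ys = (Xs > 0).astype(float)          # source labels
    Xt = rng.normal(0.7, 1.0, 200)       # unlabeled, shifted target inputs

    w_c = b_c = 0.0                      # classifier parameters (alpha)
    w_d = b_d = 0.0                      # discriminator parameters (beta)

    for _ in range(epochs):
        # Opt2 step: discriminate source (label 1) from target (label 0).
        for X, dom in ((Xs, 1.0), (Xt, 0.0)):
            g = sigmoid(w_d * X + b_d) - dom   # logistic-loss gradient
            w_d -= lr2 * np.mean(g * X)
            b_d -= lr2 * np.mean(g)
        # Opt1 step: classifier update on source, importance-weighted by
        # the discriminator-derived ratio P_t(x)/P_s(x) ~ (1 - p) / p.
        p_dom = sigmoid(w_d * Xs + b_d)
        ratio = np.clip((1.0 - p_dom) / np.maximum(p_dom, 1e-6), 0.1, 10.0)
        g = ratio * (sigmoid(w_c * Xs + b_c) - ys)
        w_c -= lr1 * np.mean(g * Xs)
        b_c -= lr1 * np.mean(g)

    target_acc = np.mean((sigmoid(w_c * Xt + b_c) > 0.5) == (Xt > 0))
    return w_c, b_c, w_d, target_acc
```

The sketch shows only the alternating two-optimizer pattern; the paper's Algorithm 1 additionally back-propagates the gradients of (10) through the discriminative network.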

Figure 2: Overview of the DRST loop: pseudo labels on the target data pass through confidence-based selection (Algorithm 2), are merged with the source data, and are used for retraining.

3. EXPERIMENTS

In this section, we evaluate the performance of our method on benchmark domain adaptation datasets. We evaluate DRST as an effective unsupervised domain adaptation method (Sec. 3.1) and DRL as a domain-generalization method providing more calibrated uncertainties (Sec. 3.2). We adopt three datasets in our experiments: Office-31 (Saenko et al., 2010), Office-Home (Venkateswara et al., 2017) and VisDA2017 (Peng et al., 2017). In particular, we use the largest dataset, VisDA2017, for the evaluation of DRST, for which we compare with (1) traditional UDA baselines: MMD (Long et al., 2015), MCD (Saito et al., 2018b) and ADR (Saito et al., 2018a); (2) recent self-training UDA baselines: CBST (Zou et al., 2020) and CRST (Zou et al., 2020); (3) other uncertainty quantification or UDA methods combined with self-training: AVH (Chen et al., 2020a) + CBST and DeepCORAL (Sun & Saenko, 2016) + CBST. In addition, we use Office-31 and Office-Home for evaluating DRL's performance. We compare DRL with source-only training and temperature scaling to demonstrate the calibration of the uncertainties used in DRST. Apart from accuracy, we also use the Brier score (Brier, 1950) and the reliability plot (Guo et al., 2017a) to evaluate the performance of our proposed method and the baselines.

Ablation study: Figure 3(a)(b) also includes two ablation methods. In the first ablation, we set r to 0 so that there is no class regularization in DRL ("r = 0"); the prediction then follows the form in Eq. (2). In the second ablation, we set the density ratio to 1 so that there is no representation-level conservativeness ("R = 1").

A PROOF OF LEMMA 1

Proof. By strong Lagrangian duality, we can switch the order of the two players in the two-player game in (3) and obtain an equivalent problem. Solving the minimization problem first, assuming that we know Q(Y|X), we obtain the optimal P(Y|X); plugging it into the maximization problem, the whole problem reduces to (4).

B PROOF OF THEOREM 1

Proof. Writing the generalized constrained optimization problem in (4) with its constraints made explicit, the derivation of the gradient resembles the derivation of Theorem 1 in (Liu & Ziebart, 2014).

