TRANSFER LEARNING WITH PRE-TRAINED CONDITIONAL GENERATIVE MODELS

Abstract

Transfer learning is crucial for training deep neural networks on new target tasks. Current transfer learning methods always assume at least one of the following: (i) the source and target task label spaces overlap, (ii) source datasets are available, and (iii) target network architectures are consistent with source ones. However, these assumptions are difficult to satisfy in practical settings because the target task rarely shares labels with the source task, access to the source dataset is restricted by storage costs and privacy, and the target architecture is often specialized to each task. To transfer source knowledge without these assumptions, we propose a transfer learning method that uses deep generative models and is composed of the following two stages: pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). PP trains a target architecture on an artificial dataset synthesized by conditional source generative models. P-SSL applies SSL algorithms to labeled target data and unlabeled pseudo samples, which are generated by cascading the source classifier and generative models so as to condition them on target samples. Our experimental results indicate that our method can outperform the baselines of scratch training and knowledge distillation.

1. INTRODUCTION

For training deep neural networks on new tasks, transfer learning is essential: it leverages the knowledge of related (source) tasks for new (target) tasks via joint training or pre-training of source models. There are many transfer learning methods for deep models under various conditions (Pan & Yang, 2010; Wang & Deng, 2018). For instance, domain adaptation leverages source knowledge for the target task by minimizing the domain gaps (Ganin et al., 2016), and fine-tuning uses the weights pre-trained on source tasks as the initial weights of the target models (Yosinski et al., 2014). These existing powerful transfer learning methods always assume at least one of (i) the source and target label spaces overlap, e.g., a target task composed of the same class categories as a source task, (ii) source datasets are available, and (iii) the neural network architectures are consistent, i.e., the architecture in the target task must be the same as that in the source task. However, these assumptions are seldom satisfied in real-world settings (Chang et al., 2019; Kenthapadi et al., 2019; Tan et al., 2019). For instance, suppose a case of developing an image classifier on a totally new task for an embedded device at an automobile company. The developers have found an optimal neural architecture for the target dataset and the device by neural architecture search, but they cannot directly access the source dataset because customer information must be protected. In such a situation, the existing transfer learning methods requiring the above assumptions are unavailable, and the developers cannot obtain the best model. To promote the practical application of deep models, we argue that we should reconsider the three assumptions on which the existing transfer learning methods depend. For assumption (i), new target tasks do not necessarily have label spaces overlapping with source ones because target labels are often designed on the basis of their own requisites.
In the above example, if we train models on StanfordCars (Krause et al., 2013), which is a fine-grained car dataset, there is no overlap with ImageNet (Russakovsky et al., 2015) even though ImageNet has 1000 classes. For (ii), the accessibility of source datasets is often limited due to storage costs and privacy (Liang et al., 2020; Kundu et al., 2020; Wang et al., 2021a), e.g., ImageNet consumes over 100GB and contains person faces co-occurring with objects, which potentially raises privacy concerns (Yang et al., 2022). For (iii), the consistency of the source and target architectures is broken if the new architecture is specialized for the new task, as in the above example. Deep models are often specialized for tasks or computational resources by neural architecture search (Zoph & Le, 2017; Lee et al., 2021), in particular when deploying on edge devices; thus, their architectures can differ for each task and runtime environment. Since existing transfer learning methods require one of the three assumptions, practitioners must design target tasks and architectures to fit those assumptions by sacrificing performance. To maximize the potential performance of deep models, a new transfer learning paradigm is required.
In this paper, we shed light on an important but less studied problem setting of transfer learning, where (i) source and target task label spaces do not overlap, (ii) source datasets are not available, and (iii) target network architectures are not consistent with source ones (Tab. 1). To transfer source knowledge while satisfying the above three conditions, our main idea is to leverage source pre-trained discriminative and generative models whose architectures differ from that of the target task. We focus on applying the generated samples from source class-conditional generative models to target training.
Deep conditional generative models precisely replicate complex data distributions such as ImageNet (Brock et al., 2018; Karras et al., 2020; Dhariwal & Nichol, 2021), and the pre-trained models are widely used for downstream tasks (Wang et al., 2018; Zhao et al., 2020a; Patashnik et al., 2021; Ramesh et al., 2022). Furthermore, deep generative models have the potential to resolve the problem of source dataset access because they can compress the information of large datasets into much smaller pre-trained weights (e.g., about 100MB in the case of a BigGAN generator), and they can safely generate informative samples without re-generating training samples via differentially private training techniques (Torkzadehmahani et al., 2019; Augenstein et al., 2020; Liew et al., 2022). Using conditional generative models, we propose a two-stage transfer learning method composed of pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL). Figure 1 illustrates an overview of our method. PP pre-trains the target architecture on an artificial dataset built from source conditional generated samples and the labels used to generate them. This simple pre-processing provides effective initial weights without source dataset access or architecture consistency. To address the non-overlap of the label spaces without accessing source datasets, P-SSL trains a target model with SSL (Chapelle et al., 2006; Van Engelen & Hoos, 2020) by treating pseudo samples drawn from the conditional generative models as the unlabeled dataset. Since SSL assumes that the labeled and unlabeled datasets are drawn from the same distribution, the pseudo samples should be target-related samples whose distribution is similar enough to the target distribution. To generate target-related samples, we cascade a classifier and a conditional generative model of the source domain.
Specifically, we (a) obtain pseudo source soft labels by applying the source classifier to target data, and (b) generate conditional samples given a pseudo source soft label. By using the target-related samples, P-SSL trains target models with off-the-shelf SSL algorithms.
In the experiments, we first confirm the effectiveness of our method through a motivating example scenario where the source and target labels do not overlap, the source dataset is unavailable, and the architectures are specialized by manual neural architecture search (Sec. 4.2). Then, we show that our method can stably improve on the baselines in transfer learning without the three assumptions under various conditions, e.g., multiple target architectures (Sec. 4.3) and multiple target datasets (Sec. 4.4). These results indicate that our method succeeds in freeing the architecture and task designs from the three assumptions. Further, we confirm that our method can achieve practical performance without the three assumptions: its performance was comparable to that of methods requiring one of the three assumptions (Sec. 4.5 and 4.7). We also provide an extensive analysis revealing the conditions for the success of our method. For instance, we found that the target performance highly depends on the similarity of generated samples to the target data (Sec. 3.2.4 and 4.4), and that a general source dataset (ImageNet) is more suitable than a specific source dataset (CompCars) when the target dataset is StanfordCars (Sec. 4.6).

2. PROBLEM SETTING

We consider a transfer learning problem where we train a neural network model $f_\theta^{A_t}$ on a labeled target dataset $D_t = \{(x_t^i \in \mathcal{X}_t, y_t^i \in \mathcal{Y}_t)\}_{i=1}^{N_t}$ given a source classifier $C_s^{A_s}$ and a source conditional generative model $G_s$; $C_s^{A_s}$ and $G_s$ are off-the-shelf, i.e., we do not access source datasets to pre-train them. $f_\theta^{A_t}$ is parameterized by $\theta$ of a target neural architecture $A_t$. $C_s^{A_s}$ outputs class probabilities through a softmax function, and $G_s$ is pre-trained on a labeled source dataset $D_s$. We mainly consider classification problems and denote $f_\theta^{A_t}$ as the target classifier $C_t^{A_t}$ below. In this setting, we assume the following conditions:

(i) no label overlap: $\mathcal{Y}_s \cap \mathcal{Y}_t = \emptyset$
(ii) no source dataset access: $D_s$ is not available when training $C_t^{A_t}$
(iii) architecture inconsistency: $A_s \neq A_t$

Existing methods are not available when the three conditions are satisfied simultaneously. Unsupervised domain adaptation (Ganin et al., 2016) requires $\mathcal{Y}_s \cap \mathcal{Y}_t \neq \emptyset$, access to $D_s$, and $A_s = A_t$. Source-free domain adaptation methods (Liang et al., 2020; Kundu et al., 2020; Wang et al., 2021a) can adapt models without accessing $D_s$, but still depend on $\mathcal{Y}_s \cap \mathcal{Y}_t \neq \emptyset$ and $A_s = A_t$. Fine-tuning (Yosinski et al., 2014) can be applied to the problem under conditions (i) and (ii) if and only if $A_s = A_t$. However, since recent progress in neural architecture search (Zoph & Le, 2017; Wang et al., 2021b; Lee et al., 2021) enables finding a specialized $A_t$ for each task, situations with $A_s \neq A_t$ are common when developing models for environments requiring both accuracy and small model size, e.g., embedded devices. As a result, the specialized $C_t^{A_t}$ currently sacrifices accuracy so that the size requirement can be satisfied. Therefore, tackling this problem setting has the potential to enlarge the applicability of deep models.
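The three conditions are easy to state programmatically. Below is a toy sketch in Python; the task specifications (label sets and architecture names) are hypothetical examples of ours, not values from the paper:

```python
# Hypothetical source/target task specs illustrating conditions (i) and (iii).
source = {"labels": {"sedan", "truck", "bird"}, "arch": "ResNet-50"}
target = {"labels": {"Hummer-H2", "Audi-A4"}, "arch": "nas-found-tiny"}

# (i) no label overlap: Y_s and Y_t are disjoint.
no_label_overlap = source["labels"].isdisjoint(target["labels"])

# (iii) architecture inconsistency: A_s != A_t.
arch_inconsistent = source["arch"] != target["arch"]

# (ii) no source dataset access is a constraint on training rather than on the
# specs: only the pre-trained C_s^{A_s} and G_s may be used, never D_s itself.
print(no_label_overlap, arch_inconsistent)
```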

3. PROPOSED METHOD

In this section, we describe our proposed method. An overview of our method is illustrated in Figure 1. PP yields initial weights for a target architecture by training it on the source task with samples synthesized by a conditional source generative model. P-SSL treats the pseudo samples drawn from generative models as an unlabeled dataset in an SSL setting. In P-SSL, we generate target-related samples by pseudo conditional sampling (PCS, Figure 2), which cascades a source classifier and a source conditional generative model.

3.1. PSEUDO PRE-TRAINING

Without source dataset access or architecture consistency, we cannot directly use existing pre-trained weights for fine-tuning $C_t^{A_t}$. To build useful representations under these conditions, we pre-train the weights of $A_t$ with samples synthesized by a conditional source generative model. Every training iteration of PP is composed of two simple steps: (step 1) synthesizing a batch of source conditional samples $\{x_s^i \sim G_s(y_s^i)\}_{i=1}^{B_s}$ from uniformly sampled source labels $y_s^i \in \mathcal{Y}_s$, and (step 2) optimizing $\theta$ on the source classification task with the labeled batch $\{(x_s^i, y_s^i)\}_{i=1}^{B_s}$ by minimizing

$\frac{1}{B_s} \sum_{i=1}^{B_s} \mathrm{CE}(C_s^{A_t}(x_s^i; \theta), y_s^i),$

where $B_s$ is the batch size for PP and CE is the cross-entropy loss. Since PP alternately performs sample synthesis and training in an online manner, it efficiently yields pre-trained weights without consuming massive storage. Further, we found that this online strategy achieves better accuracy than the offline strategy, i.e., synthesizing a fixed set of samples in advance of training (Appendix C.4). We use the pre-trained weights from PP as the initial weights of $C_t^{A_t}$ by replacing the final layer for the source task with that for the target task.
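The two-step PP loop can be sketched as follows. This is a minimal numpy toy, not the paper's implementation: a frozen class-conditional Gaussian stands in for $G_s$, a linear softmax model stands in for the target architecture, and all dimensions and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SRC_CLASSES = 5   # |Y_s|, toy value
DIM = 8               # sample dimensionality, toy value
BATCH = 32            # B_s
STEPS = 300

# Stand-in for the frozen G_s: maps a source label to a noisy class-specific
# vector (a real G_s would be a conditional GAN or diffusion model).
CLASS_MEANS = rng.normal(size=(NUM_SRC_CLASSES, DIM))

def g_s(y):
    return CLASS_MEANS[y] + 0.1 * rng.normal(size=(y.shape[0], DIM))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# "Target architecture" with a source classification head: linear + softmax.
W = np.zeros((DIM, NUM_SRC_CLASSES))
lr = 0.5
for _ in range(STEPS):
    # (step 1) draw uniform source labels and synthesize a batch online
    y = rng.integers(0, NUM_SRC_CLASSES, size=BATCH)
    x = g_s(y)
    # (step 2) one SGD step on the cross-entropy loss over the batch
    p = softmax(x @ W)
    grad = x.T @ (p - np.eye(NUM_SRC_CLASSES)[y]) / BATCH
    W -= lr * grad

# After PP, the source head would be swapped for a target head; the learned
# weights serve as the initialization of C_t^{A_t}.
acc = (softmax(g_s(np.arange(NUM_SRC_CLASSES)) @ W).argmax(axis=1)
       == np.arange(NUM_SRC_CLASSES)).mean()
```

Because samples are synthesized per iteration, nothing larger than one batch ever needs to be stored, which is the point of the online strategy described above.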

3.2. PSEUDO SEMI-SUPERVISED LEARNING

3.2.1. SEMI-SUPERVISED LEARNING

Given a labeled dataset $D_l = \{(x^i, y^i)\}_{i=1}^{N_l}$ and an unlabeled dataset $D_u = \{x^i\}_{i=1}^{N_u}$, SSL optimizes the parameter $\theta$ of a deep neural network by solving the following problem:

$\min_\theta \frac{1}{N_l} \sum_{(x,y) \in D_l} L_{\mathrm{sup}}(x, y, \theta) + \lambda \frac{1}{N_u} \sum_{x \in D_u} L_{\mathrm{unsup}}(x, \theta),$ (2)

where $L_{\mathrm{sup}}$ is a supervised loss for a labeled sample $(x_l, y_l)$ (e.g., cross-entropy loss), $L_{\mathrm{unsup}}$ is an unsupervised loss for an unlabeled sample $x_u$, and $\lambda$ is a hyperparameter balancing $L_{\mathrm{sup}}$ and $L_{\mathrm{unsup}}$. In SSL, it is generally assumed that $D_l$ and $D_u$ share the same generative distribution $p(x)$. If there is a large gap between the labeled and unlabeled data distributions, the performance of SSL algorithms degrades (Oliver et al., 2018). However, Xie et al. (2020) revealed that unlabeled samples from a dataset different from the target dataset can improve the performance of SSL algorithms when target-related samples are carefully selected from source datasets. This implies that SSL algorithms can achieve high performance as long as the unlabeled samples are related to the target dataset, even when they belong to different datasets. On the basis of this implication, our P-SSL exploits pseudo samples drawn from source generative models as unlabeled data for SSL.
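As a concrete instance of the objective, the sketch below evaluates Eq. (2) for precomputed model outputs (numpy; the helper names are ours for illustration, not from the paper):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # CE(p, q) = -sum_y p_y log q_y, averaged over the batch.
    return -(p * np.log(q + eps)).sum(axis=1).mean()

def ssl_objective(sup_probs, sup_onehot, unsup_loss_per_sample, lam):
    """Eq. (2): (1/N_l) * sum L_sup + lambda * (1/N_u) * sum L_unsup.

    sup_probs: model probabilities on labeled data, shape (N_l, C)
    sup_onehot: one-hot labels, shape (N_l, C)
    unsup_loss_per_sample: precomputed L_unsup values, shape (N_u,)
    lam: the balancing hyperparameter lambda
    """
    l_sup = cross_entropy(sup_onehot, sup_probs)
    l_unsup = unsup_loss_per_sample.mean()
    return l_sup + lam * l_unsup
```

In P-SSL, the unlabeled sums run over the pseudo dataset rather than a real unlabeled dataset, but the objective shape is unchanged.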

3.2.2. PSEUDO CONDITIONAL SAMPLING

To generate informative target-related samples, our method uses PCS, which cascades $C^{A_s}_s$ and $G_s$. With PCS, we first obtain a pseudo source label $y_{s\leftarrow t}$ from the source classifier $C^{A_s}_s$ for a uniformly sampled $x_t$ from $D_t$:
$$y_{s\leftarrow t} = C^{A_s}_s(x_t). \quad (3)$$
Intuitively, $y_{s\leftarrow t}$ represents the relation between the source class categories and $x_t$ in the form of probabilities. Although $G_s$ is trained with discrete (one-hot) class labels, it can generate class-wise interpolated samples from continuously mixed labels of multiple class categories (Miyato & Koyama, 2018; Brock et al., 2019). Leveraging this characteristic, we generate target-related samples $x_{s\leftarrow t}$ with $y_{s\leftarrow t}$, an interpolation of source classes, as the conditional label:
$$x_{s\leftarrow t} \sim G_s(y_{s\leftarrow t}) = G_s(C^{A_s}_s(x_t)).$$
For the training of the target task, we compose a pseudo dataset $D_{s\leftarrow t}$ by applying Algorithm 1. In line 2, we swap the final layer of $C^{A_s}_s$ with an output label function $g$, which is the softmax function by default. In Sec. C.5.3, we empirically evaluate the effect of the choice of $g$.
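The cascade can be sketched as follows; the toy classifier weights and the way the stand-in $G_s$ consumes a mixed label (a weighted sum of class embeddings) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SOURCE_CLASSES = 4
DIM = 6

# Toy stand-ins: W plays the role of C_s^{A_s} up to its logits, and
# CLASS_EMBED the class embeddings that condition the generator G_s.
W = rng.normal(size=(DIM, NUM_SOURCE_CLASSES))
CLASS_EMBED = rng.normal(size=(NUM_SOURCE_CLASSES, DIM))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pseudo_source_label(x_t):
    """Eq. (3): y_{s<-t} = g(C_s^{A_s}(x_t)) with g = softmax (a soft, mixed label)."""
    return softmax(x_t @ W)

def generator(y_soft):
    """Stand-in for G_s: conditions on an interpolation of class embeddings."""
    cond = y_soft @ CLASS_EMBED
    return cond + 0.01 * rng.normal(size=cond.shape)

def pcs(x_t):
    """Pseudo Conditional Sampling: x_{s<-t} ~ G_s(C_s^{A_s}(x_t))."""
    return generator(pseudo_source_label(x_t))

x_t = rng.normal(size=DIM)          # a (toy) target sample
y_soft = pseudo_source_label(x_t)   # probabilities over source classes
x_st = pcs(x_t)                     # target-related pseudo sample
```

Note that `y_soft` is never one-hot: it is exactly the continuous mixture of source classes that the interpolation property of $G_s$ exploits.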

3.2.3. TRAINING

By applying PCS, we obtain a target-related sample $x_{s\leftarrow t}$ from $x_t$. In training $C^{A_t}_t$, one could assign the label $y_t$ of $x_t$ to $x_{s\leftarrow t}$, since $x_{s\leftarrow t}$ is generated from $x_t$. However, directly using $(x_{s\leftarrow t}, y_t)$ in supervised learning is difficult because $x_{s\leftarrow t}$ can harm the performance of $C^{A_t}_t$ due to the gap between the label spaces; we empirically confirm that this naïve approach fails to boost the target performance in Sec. C.5.1. To extract informative knowledge from $x_{s\leftarrow t}$, we apply SSL algorithms that train a target classifier $C^{A_t}_t$ using the labeled target dataset $D_t$ and the unlabeled dataset $D_{s\leftarrow t}$ generated by PCS. On the basis of the implication discussed in Sec. 3.2.1, we can expect the SSL training to improve $C^{A_t}_t$ if $D_{s\leftarrow t}$ contains target-related samples. We compute the unsupervised loss for a pseudo sample $x_{s\leftarrow t}$ as $L_{\mathrm{unsup}}(x_{s\leftarrow t}, \theta)$, and arbitrary SSL algorithms can be adopted for calculating $L_{\mathrm{unsup}}$. For instance, UDA (Xie et al., 2020) defines
$$L_{\mathrm{unsup}}(x, \theta) = \mathbb{1}\left[\max_{y'} \hat{C}^{A_t}_t(x, \tau; \theta)_{y'} > \beta\right] \mathrm{CE}\left(\hat{C}^{A_t}_t(x, \tau; \theta),\, C^{A_t}_t(T(x); \theta)\right),$$
where $\mathbb{1}$ is an indicator function, CE is the cross-entropy function, $\hat{C}^{A_t}_t(x, \tau)$ is the target classifier with its final layer replaced by the temperature softmax function with a temperature hyperparameter $\tau$, $\beta$ is a confidence threshold, and $T(\cdot)$ is an input transformation function such as RandAugment (Cubuk et al., 2020). In the experiments, we used UDA because it achieves the best result; we compare and discuss other SSL algorithms in Sec. C.5.1. Eventually, we optimize the parameter $\theta$ by the following objective function based on Eq. (2):
$$\min_\theta \frac{1}{N_t}\sum_{(x_t, y_t)\in D_t} L_{\mathrm{sup}}(x_t, y_t, \theta) + \lambda \frac{1}{N_{s\leftarrow t}}\sum_{x_{s\leftarrow t}\in D_{s\leftarrow t}} L_{\mathrm{unsup}}(x_{s\leftarrow t}, \theta).$$
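A minimal single-sample sketch of this UDA-style unsupervised loss follows ($\tau = 0.4$ and $\beta = 0.5$ match the experimental settings reported later; the simplified function operating on raw logit vectors is our stand-in, not the paper's implementation):

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = np.asarray(logits, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def uda_unsup_loss(logits_clean, logits_aug, tau=0.4, beta=0.5):
    """L_unsup = 1[max sharpened prob > beta] * CE(sharpened clean pred, aug pred)."""
    target = softmax(logits_clean, tau)   # temperature-sharpened pseudo target
    if target.max() <= beta:              # confidence threshold (the indicator term)
        return 0.0
    pred = softmax(logits_aug)            # prediction on the transformed input T(x)
    return float(-(target * np.log(pred + 1e-12)).sum())
```

The indicator masks out low-confidence pseudo samples, so unreliable predictions contribute nothing to the gradient.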

3.2.4. QUALITY OF PSEUDO SAMPLES

Since we treat $x_{s\leftarrow t} \sim G_s(C^{A_s}_s(x_t))$ as an unlabeled sample of the target task, the following assumption is required:
Assumption 3.1. $p_{D_t}(x) \approx p_{D_{s\leftarrow t}}(x) = \frac{1}{N_t}\sum_{x_t} p_{G_s(C^{A_s}_s(x_t))}(x)$,
where $p_D(x)$ is the data distribution of a dataset $D$. That is, if the pseudo samples satisfy Assumption 3.1, P-SSL should boost the target task performance. We further discuss the relationships between pseudo sample quality and target performance in Sec. 4.4.

4. EXPERIMENTS

We evaluate our method with multiple target architectures and datasets, and compare it with baselines, including scratch training and knowledge distillation, that can be applied to our problem setting with a simple modification. We further conduct detailed analyses of our pseudo pre-training (PP) and pseudo semi-supervised learning (P-SSL) in terms of (a) the practicality of our method compared with transfer learning methods that assume source dataset access and architecture consistency, (b) the effect of source dataset choices, and (c) the applicability to a target task other than classification (object detection). We also provide more detailed experiments, including the effect of varying the target dataset size (Appendix C.2), the performance difference when varying the source generative model (Appendix C.3), detailed analyses of PP (Appendix C.4) and P-SSL (Appendix C.5), and qualitative evaluations of pseudo samples (Appendix D). We provide the detailed training settings in Appendix B.3.

4.1. SETTING

Baselines Basically, there are no existing transfer learning methods available in the problem setting defined in Sec. 2. Thus, we evaluated our method by comparing it with scratch training (Scratch), which trains a model with only a target dataset, and naïve applications of knowledge distillation methods: Logit Matching (Ba & Caruana, 2014) and Soft Target (Hinton et al., 2015). Logit Matching and Soft Target can be used for transfer learning under architecture inconsistency since their loss functions use only the final logit outputs of models regardless of the intermediate layers. Datasets We used ImageNet (Russakovsky et al., 2015) as the default source dataset. In Sec. 4.6, we report the case of applying CompCars (Yang et al., 2015) as the source dataset. As the target dataset, we mainly used StanfordCars (Krause et al., 2013), which is for fine-grained classification of car types. In Sec. 4.4, we used the nine classification datasets listed in Table 3 as the target datasets; we chose these datasets with the intent to include various granularities and domains. We constructed Caltech-256-60 by randomly sampling 60 images per class from the original dataset in accordance with the procedure of Cui et al. (2018). Note that StanfordDogs is a subset of ImageNet, and thus its label space overlaps with ImageNet's, but we added this dataset to confirm the performance when such overlap exists. Network Architecture As the source architecture A_s, we used the ResNet-50 architecture (He et al., 2016) with the pre-trained weights distributed by the official torchvision repository.foot_0 For the target architecture A_t, we used five architectures publicly available in torchvision: WRN-50-2 (Zagoruyko & Komodakis, 2016), MNASNet1.0 (Tan et al., 2019), MobileNetV3-L (Howard et al., 2019), and EfficientNet-B0/B5 (Tan & Le, 2019).
Note that, to ensure reproducibility, we assume them as entirely new architectures for the target task in our problem setting, while they are actually existing architectures. For a source conditional generative model G s , we used BigGAN (Brock et al., 2018) generating 256 × 256 resolution images as the default architecture. We also tested the other generative models such as ADM-G (Dhariwal & Nichol, 2021) in Sec. C.3. We implemented BigGAN on the basis of open source repositories including pytorch-pretrained-BigGANfoot_1 ; we used the pre-trained weights distributed by the repositories.

4.2. MOTIVATING EXAMPLE: MANUAL ARCHITECTURE SEARCH

First, we confirm the effectiveness of our setting and method through the practical scenario described in Sec. 1. Here, we consider a case of manually optimizing the number of layers of ResNet-18 (RN18) for the target task to improve performance while keeping the model size. We evaluated our method in this scenario with StanfordCars as the target dataset and ImageNet as the source dataset. We searched for the custom architecture by grid search over the layers in the four blocks of RN18, from (2, 2, 2, 2) to (2, 2, 2, 10), varying the layers in {2, 4, 6, 8, 10} for each block while keeping the sum of layers less than or equal to 18 to maintain the architecture size. We found that the best architecture is the one with (2, 10, 2, 2). The test accuracies on StanfordCars are shown in Table 4. We can see that finding an optimal custom RN18 architecture for the target task brings test accuracy improvements, and that our method contributes to further improvements in this difficult situation. This result indicates that our method can widen the applicability of neural architecture search techniques, which have so far been difficult to combine with transfer learning in practice.
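The search space above can be enumerated directly; a small sketch (the helper name is ours):

```python
from itertools import product

def candidate_configs(depths=(2, 4, 6, 8, 10), num_blocks=4, max_layers=18):
    """All per-block layer counts searched in Sec. 4.2: each of the four ResNet
    blocks takes a depth from `depths`, with the total capped at `max_layers`
    to keep the model size comparable to RN18."""
    return [cfg for cfg in product(depths, repeat=num_blocks)
            if sum(cfg) <= max_layers]

configs = candidate_configs()
```

Both the original RN18 configuration (2, 2, 2, 2) and the best configuration found, (2, 10, 2, 2), lie inside this space.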

4.3. TARGET ARCHITECTURES

We discuss performance evaluations varying the neural network architectures of the target classifiers to evaluate our method under architecture inconsistency. Table 5 lists the results on StanfordCars with multiple target architectures; note that we fixed the source architecture to ResNet-50. Our method outperformed the baselines on all target architectures without architecture consistency. Remarkably, our method performs stably for arbitrary size relationships between target and source architectures, including from a smaller architecture (ResNet-50) to larger ones (WRN-50-2 and EfficientNet-B5), and from a larger one (ResNet-50) to smaller ones (MNASNet1.0, MobileNetV3, and EfficientNet-B0). This flexibility underlies the effectiveness in the neural architecture search scenario of Sec. 4.2.

4.4. TARGET DATASETS

We show the efficacy of our method on multiple target datasets other than StanfordCars. We used WRN-50-2 as the architecture of the target classifiers. Table 6 lists the top-1 accuracy of each classification task. Our method stably improved over the baselines across datasets. We also report the ablation results of our method (Ours w/o PP and Ours w/o P-UDA) for assessing the dataset dependencies of PP and P-SSL. PP outperformed the baselines on all target datasets. This suggests that building basic representations by pre-training is effective for various target tasks even if the source dataset is drawn from generative models. For P-SSL, we observed that it boosted the scratch models on all target datasets except for DTD and OxfordFlower. We attribute the degradation on DTD and OxfordFlower to pseudo samples that do not satisfy Assumption 3.1, as discussed in Sec. 3.2.4. In fact, we observe that FID($D_{s\leftarrow t}$, $D_t$) is correlated with the accuracy gain over the Scratch models (correlation coefficient of -0.80, Spearman rank correlation of -0.97), as shown in Fig. 3. These experimental results suggest that our method is effective in the setting without label overlap as long as $D_{s\leftarrow t}$ approximates $D_t$ well.

4.5. COMPARISON TO METHODS REQUIRING PARTIAL CONDITIONS

To confirm the practicality of our method, we compared it with methods requiring A_s = A_t or access to D_s. As references, we tested Fine-tuning (FT), which uses a source pre-trained model, and R-SSL, which uses a real source dataset as the unlabeled data for SSL. For FT and R-SSL, we used ImageNet and a subset of ImageNet (Filtered ImageNet), which was collected by confidence-based filtering similar to Xie et al. (2020) (see Appendix B.4); this manual filtering process corresponds to PCS in P-SSL. In Table 7, PP and P-SSL outperformed FT and R-SSL, respectively. This suggests that the samples from G_s not only preserve the essential information of D_s but are also more useful than D_s itself, i.e., accessing D_s may not be necessary for transfer learning. In summary, PP and P-SSL are practical compared with existing methods that require these assumptions.

4.6. SOURCE DATASETS

We investigate the preferable characteristics of source datasets for PP and P-SSL by testing another source dataset, CompCars (Yang et al., 2015), a fine-grained vehicle dataset containing 136K images (see Appendix B.5 for details). This dataset is more similar to the target dataset (StanfordCars) than ImageNet. All training settings were the same as in Sec. 4.1. Table 8 lists the scores for each model of our methods. The models using ImageNet were superior to those using CompCars. To seek the cause of the difference, we measured FID($D_{s\leftarrow t}$, $D_t$) when using CompCars with the same protocol as in Sec. 4.5; the score was 22.12, which is inferior to the 19.92 obtained with ImageNet. This suggests that the fidelity of the pseudo samples toward the target samples is important for boosting the target performance and is not simply determined by the similarity between the source and target datasets. We consider that the diversity of source classes underpins the usefulness of the pseudo samples; thus, ImageNet is superior to CompCars as the source dataset.

4.7. APPLICABILITY TOWARD ANOTHER TARGET TASK

Although we have mainly discussed the case where the target task is classification throughout the paper, our method can be applied to any task for which an SSL method exists. Here, we evaluate the applicability of PP+P-SSL to a target task other than classification. To this end, we applied our method to the object detection task on PASCAL-VOC 2007 (Everingham et al., 2015) using FPN (Lin et al., 2017) models with a ResNet-50 backbone. As the SSL method, we used Unbiased Teacher (UT, Liu et al. (2021)), a method for object detection based on self-distillation and pseudo labeling, and implemented PP+P-SSL on the code base provided by Liu et al. (2021). We generated samples for PP and P-SSL in the same way as in the classification experiments, with ImageNet as the source dataset. Table 9 shows the average precision scores calculated following detectron2 (Wu et al., 2019).
Note that the baseline in Table 9 is FT instead of Scratch because the Scratch setting of the object detection task is hard to train and too slow to converge. We confirm that PP achieved results competitive with FT and that P-SSL significantly boosted the PP model. This may be explained by the high similarity between $D_{s\leftarrow t}$ and $D_t$ (19.0 in FID). These results indicate that our method is flexible enough to be applied to tasks other than classification and can be expected to improve baselines.

5. RELATED WORK

We briefly discuss related work, focusing on training techniques that apply generative models; we provide discussions of existing transfer learning and semi-supervised learning in Appendix A. Similar to our study, several studies have applied the expressive power of conditional generative models to boost the performance of discriminative models: Zhu et al. (2018) and Yamaguchi et al. (2020) exploited generated images from conditional GANs for data augmentation in classification, and Sankaranarayanan et al. (2018) introduced conditional GANs into a domain adaptation system for learning joint feature spaces of source and target domains. Moreover, Li et al. (2020b) implemented an unsupervised domain adaptation technique with conditional GANs in a setting without access to source datasets. These studies require label overlap between the source and target tasks or training of generative models on the target datasets, which causes problems of overfitting and mode collapse when the target datasets are small (Zhang et al., 2020; Zhao et al., 2020b; Karras et al., 2020). Our method, in contrast, requires no additional training of generative models because it simply extracts samples from fixed pre-trained conditional generative models.

6. CONCLUSION AND LIMITATION

We explored a new transfer learning setting where (i) source and target task label spaces do not overlap, (ii) source datasets are not available, and (iii) target network architectures are not consistent with source ones. In this setting, we cannot use existing transfer learning methods such as domain adaptation and fine-tuning. To transfer knowledge, we proposed a simple method leveraging pre-trained conditional source generative models, composed of PP and P-SSL. PP yields effective initial weights for a target architecture from generated samples, and P-SSL applies an SSL algorithm by treating the pseudo samples from the generative models as an unlabeled dataset for the target task. Our experiments showed that our method can practically transfer knowledge from source to target tasks without the assumptions of existing transfer learning settings. One limitation of our method is the difficulty of improving target models when the gap between source and target is too large. A future step is to modify the pseudo sampling process by optimizing generative models toward the target dataset, which we avoided in this work to keep the method simple and stable.

B.2 KNOWLEDGE DISTILLATION BASELINES

As stated in the main paper, we evaluated our method by comparing it with the naïve knowledge distillation methods Logit Matching (Ba & Caruana, 2014) and Soft Target (Hinton et al., 2015). We exploited Logit Matching and Soft Target for transfer learning under architecture inconsistency since their loss functions use only the final logit outputs of models regardless of the intermediate layers. To transfer the knowledge in $C^{A_s}_s$ to $C^{A_t}_t$, we first fine-tune the parameters of $C^{A_s}_s$ on the target task and then train $C^{A_t}_t$ with knowledge distillation penalties by treating the trained $C^{A_s}_s$ as the teacher model. We optimized the knowledge distillation models with the following objective function:
$$\min_\theta \sum_{(x_t, y_t)\in D_t} L_{\mathrm{sup}}(x_t, y_t, \theta) + \lambda_d L_{\mathrm{KD}}(l_\theta(x_t), l_\phi(x_t)),$$
where $\lambda_d$ is a hyperparameter, $L_{\mathrm{KD}}$ is the loss function of a knowledge distillation method, $\phi$ is the parameter of the teacher model, and $l_\theta(\cdot)$ and $l_\phi(\cdot)$ are the logit outputs of the models with parameters $\theta$ and $\phi$. In Logit Matching, $L_{\mathrm{KD}}$ is a simple mean squared loss between $l_\theta$ and $l_\phi$. Soft Target computes the Kullback-Leibler divergence between the softmax output of $l_\theta$ and the temperature softmax output of $l_\phi$ as $L_{\mathrm{KD}}$. We set the temperature parameter T to 4 by searching in {2, 4, 6}.
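A minimal sketch of the two $L_{\mathrm{KD}}$ choices follows (the KL direction for Soft Target, teacher toward student, follows our reading of the description above; T = 4 matches the setting reported):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def logit_matching(student_logits, teacher_logits):
    """Logit Matching: mean squared error between raw logits."""
    d = np.asarray(student_logits, float) - np.asarray(teacher_logits, float)
    return float((d ** 2).mean())

def soft_target(student_logits, teacher_logits, T=4.0):
    """Soft Target: KL divergence between the temperature-softened teacher
    output and the student's (plain) softmax output."""
    q = softmax(student_logits)        # student, plain softmax
    p = softmax(teacher_logits, T)     # teacher, temperature softmax
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum())
```

Both losses vanish when student and teacher logits coincide (for Soft Target, at matching temperatures) and grow as the predictions diverge.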

B.3 TRAINING SETTINGS

We selected the training configurations on the basis of previous works (Li et al., 2020a; Xie et al., 2020). In PP, we trained a source classifier by Nesterov momentum SGD for 1M iterations with a mini-batch size of 128, weight decay of 0.0001, momentum of 0.9, and an initial learning rate of 0.1; we decayed the learning rate by 0.1 at 30, 60, and 90 epochs. We trained a target classifier $C^{A_t}_t$ by Nesterov momentum SGD for 300 epochs with a mini-batch size of 16, weight decay of 0.0001, and momentum of 0.9. We set the initial learning rate to 0.05 for the scratch models and 0.005 for the models with PP, and dropped the learning rate by 0.1 at 150 and 250 epochs. For each target dataset, we split the training set 9 : 1 and used the former for training and the latter for validation. The input samples were resized to 224×224 resolution. For the SSL algorithms, we set the mini-batch size for $L_{\mathrm{unsup}}$ to 112 and fixed λ in Eq. (2) to 1.0. We fixed the hyperparameters of UDA to the confidence threshold β = 0.5 and the temperature parameter τ = 0.4, following Xie et al. (2020). For P-SSL, we generated 50,000 samples by PCS. We ran the target training three times with different seeds and selected the best model in terms of validation accuracy across epochs. We report the average top-1 test accuracies and standard deviations.
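The step schedule for the target classifier described above can be written as follows (a sketch; the helper function is ours):

```python
def target_lr(epoch, base_lr=0.05, milestones=(150, 250), gamma=0.1):
    """Step schedule for the 300-epoch target training: start at `base_lr`
    (0.05 for scratch models, 0.005 for models initialized with PP) and
    multiply by `gamma` at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For the PP-initialized models, calling `target_lr(epoch, base_lr=0.005)` gives the corresponding schedule.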

B.4 FILTERING OF REAL DATASET BY RELATION TO TARGET

We provide the details of the dataset filtering protocol discussed in Sec. 4.5 of the main paper and list the correspondences between the target classes and the selected source classes. For filtering, we first calculated the confidence (the maximum probability of a class in the prediction) of the target samples using the pre-trained source classifiers, then averaged the confidence scores for each target class, and finally selected the source classes with a confidence higher than 0.001 as the unlabeled dataset, similar to Xie et al. (2020). We list the filtered ImageNet classes for each target dataset in Tables 24 and 25.

C.1 COMPARISON TO FINE-TUNING METHODS

For assessing the practicality of our method, we additionally compare it with fine-tuning methods that require architecture consistency: Fine-tuning: naïvely training target classifiers using the source pre-trained weights as the initial weights. L2-SP (Li et al., 2018): fine-tuning with an L2 penalty term between the current training weights and the pre-trained source weights. DELTA (Li et al., 2019): fine-tuning with a penalty minimizing the gaps of channel-wise feature-map outputs between source and target models. BSS (Chen et al., 2019): fine-tuning with a penalty term enlarging the eigenvalues of training features to avoid negative transfer. Co-Tuning (You et al., 2020): fine-tuning on the source and target tasks simultaneously by translating the target labels to the source labels. We implemented these methods on the basis of the open source repositories provided by the authors. All hyperparameters for L2-SP, DELTA, BSS, and Co-Tuning follow the respective papers: β for L2-SP and DELTA was 0.01, η for BSS was 0.001, and λ for Co-Tuning was 2.3.
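The confidence-based filtering of Appendix B.4 can be sketched as follows, under our reading of the protocol (average the source-class probabilities over the samples of each target class, then keep source classes whose average exceeds the threshold for at least one target class; the function name and array layout are ours):

```python
import numpy as np

def filter_source_classes(source_probs, target_labels, threshold=0.001):
    """Select source classes related to the target dataset.

    source_probs: (N, num_source_classes) source-classifier probabilities
                  for the N target samples.
    target_labels: (N,) target class index of each sample.
    Returns sorted indices of source classes whose per-target-class average
    confidence exceeds `threshold` for at least one target class.
    """
    source_probs = np.asarray(source_probs, dtype=float)
    target_labels = np.asarray(target_labels)
    keep = set()
    for t in np.unique(target_labels):
        mean_conf = source_probs[target_labels == t].mean(axis=0)
        keep.update(np.flatnonzero(mean_conf > threshold).tolist())
    return sorted(keep)
```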
We also tested models combining our methods with the fine-tuning methods. Tables 12 and 13 list the extended results of the experiments on multiple architectures and target datasets in Secs. 4.3 and 4.4, respectively. Table 10 summarizes the results using the fine-tuning variants; the +P-SSL column indicates the results of combining a fine-tuning method with P-SSL. We confirm that our method (pseudo pre-training + P-SSL) achieved results competitive with or superior to naïve fine-tuning. This means that our method can improve target models as well as fine-tuning can, without architecture consistency. We also observed that P-SSL can outperform the fine-tuning baselines when combined with them. These results indicate that our P-SSL can be applied even when the source and target architectures are consistent.

C.1.1 COMPARISON TO FINE-TUNING DISCRIMINATOR

We compare PP with a transfer learning method that applies an encoder of a generative model as the pre-trained encoder for the target task. Here, we use the discriminator of the ImageNet pre-trained BigGAN as the pre-trained encoder and fine-tune it on the target classification task of StanfordCars. Table 11 shows the results. The BigGAN discriminator easily overfitted the target dataset and catastrophically degraded the test performance. It seems difficult to achieve high accuracy by naïvely applying the discriminator as a simple pre-trained encoder. This suggests that the representation of the generative model is quite different from that needed for the classification task and that our method is well suited to absorbing such a difference between tasks for knowledge transfer.

C.1.2 COMPARISON OF RN18 AND CUSTOM ARCHITECTURE

We provide a further comparison of the RN18 architecture and the custom architecture found by the architecture search in Sec. 4.2. Here, we tested RN18 with our methods (PP, P-SSL, and PP+P-SSL) in the same setting as Sec. 4.2 to confirm the superiority of the custom architecture. We summarize the results in Table 16. They show that the custom architecture is superior to the original RN18 (2, 2, 2, 2) in all settings, which also indicates that the comparison among methods is fair.

C.2 TARGET DATASET SIZE

We evaluate the performance of our method on a limited target data setting. We randomly sampled the subsets of the training dataset (StanfordCars) for each sampling rate of 25, 50, and 75%, and used them to train target models with our method. Note that the pseudo datasets were created by using the reduced datasets. Table 14 lists the results. We observed that our method can outperform the baselines in all sampling rate settings.

C.3 SOURCE GENERATIVE MODELS

Table 17 lists the results with multiple generative model architectures. We tested our method using pseudo samples generated from generative models for multiple resolutions, including SNGAN (Miyato et al., 2018), SAGAN (Zhang et al., 2019), and ADM-G (Dhariwal & Nichol, 2021). We implemented these generative models on the basis of open source repositories including pytorch-pretrained-BigGANfoot_2, PyTorch-StudioGANfoot_3 by Minguk et al. (2021), and guided-diffusionfoot_4 by Dhariwal & Nichol (2021); we used the pre-trained weights distributed by the repositories. We measured the top-1 test accuracy on the target task (StanfordCars) and the Fréchet Inception Distance (FID, Heusel et al. (2017)). In Table 17, generative models with better FID scores tended to achieve higher top-1 accuracy with PP and P-SSL.

C.5.4 TEMPERATURE PARAMETER OF UDA

Here, we investigate the effect of the hyperparameter τ for UDA, which is mainly used in the main paper. By the definition of the temperature softmax function ($\exp(y_i/\tau)/\sum_j \exp(y_j/\tau)$), the temperature parameter τ controls the sharpness of the predicted class-conditional distribution; a lower temperature outputs a sharper distribution. Accordingly, in our P-SSL, $\hat{C}_t(x_{s\leftarrow t}, \tau; \theta)$, which determines how the pseudo sample $x_{s\leftarrow t}$ is represented with soft target class labels, changes with τ. Thus, the choice of τ can be an important factor for training in P-SSL. To evaluate the effect, we tested P-SSL by varying τ on StanfordCars, as shown in Table 21. The results show that a moderately higher τ achieved better target performance. This suggests that representing $x_{s\leftarrow t}$ by softer target class labels can bring a positive effect to target classifiers, which is consistent with the discussions in Secs. 3.2.1 and 3.2.3.

C.5.5 P-SSL VS. SSL USING TARGET LABELED DATA

We provide further analysis to confirm the usefulness of the unlabeled pseudo samples $x_{s\leftarrow t}$ by comparing with an SSL method that uses the real target samples in the unsupervised loss. We call this method T-SSL; T-SSL is another simple baseline for P-SSL because it simply discards the pseudo sample synthesis of P-SSL. The results are shown in Table 22. We can see that applying the supervised loss and the unsupervised loss to the same samples has a negative effect: the use of pseudo-labels on target labeled data might promote overfitting. This result indicates that $x_{s\leftarrow t}$ in P-SSL serves as more useful unlabeled data in UDA than the target data itself.

D QUALITATIVE EVALUATION OF PSEUDO SAMPLES

We discuss qualitative studies of the pseudo samples generated by PCS. To confirm the correspondence between the target and pseudo samples, we used StanfordCars as the target dataset and generated samples from BigGAN with the same settings as in Sec. 4. Figure 5 visualizes the source dataset (ImageNet), the target dataset (StanfordCars), and the pseudo samples generated by PCS; the samples were randomly selected from each dataset. We can see that PCS succeeded in generating target-related samples from the target samples. To assess the validity of using pseudo source soft labels in PCS, we analyzed the pseudo samples corresponding to each target label. Figure 6 shows the pseudo samples generated from target samples of the Hummer and Aston Martin V8 Convertible classes in StanfordCars. We confirm that the pseudo samples by PCS capture the features of the target classes. This is also supported by the ranking of the confidence scores for source classes listed in Table 23; the pseudo source soft labels seem to represent the target samples as interpolations of source classes.
[Tables 24 and 25: lists of the filtered ImageNet classes selected as target-related for each target dataset, e.g., for StanfordCars and DTD.]



https://github.com/pytorch/vision
https://github.com/huggingface/pytorch-pretrained-BigGAN
https://github.com/POSTECH-CVLab/PyTorch-StudioGAN
https://github.com/openai/guided-diffusion



Figure 1: Proposed transfer learning methods leveraging a conditional source generative model G_s. Red represents given source models, light blue represents target models and datasets, and dark blue represents the outputs of the proposed methods. (a) We produce initial weights of a target architecture A_t by training a source classifier C_s^{A_t} with pairs of a conditional sample x_s ~ G_s(y_s) and a uniformly sampled source label y_s. (b) We penalize a target classifier C_t^{A_t} with an unsupervised loss derived from an SSL method by applying a pseudo sample x_{s←t} while training supervisedly on the target dataset D_t. x_{s←t} is sampled from G_s conditioned on the pseudo source label y_{s←t} = C_s^{A_s}(x_t).

Figure 2: Pseudo conditional sampling. We obtain a pseudo soft label y_{s←t} by applying a target sample x_t to a source classifier C_s^{A_s}, and then generate a target-related sample x_{s←t} from a source generative model G_s conditioned on y_{s←t}. In this example, C_s^{A_s} outputs y_{s←t} from the input car image x_t of the Hummer class by interpreting x_t as a mixture of source car classes (Jeep, Limousine, MovingVan, etc.), and then G_s generates a target-related car image from y_{s←t}.

Figure 3: Correlation between FID and accuracy gain

B.5 SOURCE DATASET: COMPCARS

CompCars (Yang et al., 2015) is a fine-grained vehicle image dataset for classifying vehicle manufacturers or models. It contains 163 manufacturer classes, 1,716 model classes, and 136,726 images of entire vehicles collected from the web. As the source dataset for PP and P-SSL, we used the 163 manufacturer classes for training classifiers and conditional generative models, since the manufacturer classes do not overlap with the classes of the target dataset, i.e., StanfordCars.

Figure 5: Samples of source, target, and pseudo datasets (random picking).

Figure 6: Pseudo samples generated by PCS (random picking)

Comparison of transfer learning settings

Algorithm 1 Pseudo conditional sampling
Require: target dataset D_t, source classifier C_s^{A_s}, source generator G_s, number of pseudo samples N_{s←t}, output label function g
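The procedure named in Algorithm 1 can be sketched as a simple loop: draw a target sample, map it to a pseudo source soft label with the source classifier, and condition the source generator on that label. The names `source_classifier`, `source_generator`, and `label_fn`, and their signatures, are illustrative stand-ins and not taken from the paper's code.

```python
import numpy as np

def pseudo_conditional_sampling(target_data, source_classifier, source_generator,
                                n_samples, label_fn=None, seed=0):
    """Sketch of pseudo conditional sampling (PCS).

    `source_classifier` is assumed to map an image to softmax probabilities
    over source classes, and `source_generator` to map a (soft) source label
    to a synthetic image. `label_fn` corresponds to the output label
    function g (default: use the softmax output as-is).
    """
    if label_fn is None:
        label_fn = lambda p: p
    rng = np.random.default_rng(seed)
    pseudo_samples, pseudo_labels = [], []
    for _ in range(n_samples):
        x_t = target_data[rng.integers(len(target_data))]   # draw a target image
        y_st = label_fn(source_classifier(x_t))             # pseudo source label y_{s<-t}
        x_st = source_generator(y_st)                       # target-related sample x_{s<-t}
        pseudo_samples.append(x_st)
        pseudo_labels.append(y_st)
    return pseudo_samples, pseudo_labels
```

The returned pairs form the pseudo dataset D_{s←t} used as unlabeled data in P-SSL.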

Table 2: Distribution gaps (FID) of pseudo samples on multiple target datasets: FID(D_s, D_t), FID(F(D_s), D_t), and FID(D_{s←t}, D_t)

List of target datasets

Evaluations on Motivating Example. To confirm that pseudo samples can satisfy Assumption 3.1, we assess the difference between the target distribution and the pseudo distribution. Since it is difficult to directly compute the likelihood of the pseudo samples, we leverage the Fréchet Inception Distance (FID, Heusel et al. (2017)), which measures the distribution gap between two datasets by the 2-Wasserstein distance in closed form (lower FID means higher similarity). We evaluate the quality of D_{s←t} by comparing FID(D_{s←t}, D_t) to FID(D_s, D_t) and FID(F(D_s), D_t), where F(D_s) is a subset of D_s constructed by confidence-based filtering similar to a previous study (Xie et al., 2020) (see Appendix B.4 for the detailed protocol). That is, if D_{s←t} achieves a lower FID than D_s or F(D_s), then D_{s←t} approximates D_t well. Table 2 shows the FID scores when D_s is ImageNet. The experimental settings are shared with Sec. 4.4. Except for DTD and StanfordDogs, D_{s←t} outperformed D_s and F(D_s) in terms of similarity to D_t. This indicates that PCS can produce more target-related samples than the natural source dataset. On the other hand, in the case of DTD (texture classes), D_{s←t} has relatively low similarity to D_t. This implies that PCS does not approximate p_{D_t}(x) well when the source and target datasets are not closely related. Note that StanfordDogs is a subset of ImageNet, and thus F(D_s) can contain near-duplicates of the target samples, which naturally yields a low FID(F(D_s), D_t).
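For reference, the closed form used by FID is the 2-Wasserstein distance between two Gaussians fitted to Inception features: ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). Below is a minimal NumPy sketch of this formula (not the evaluation code used in the paper); it exploits Tr((Σ1Σ2)^{1/2}) = Tr((Σ2^{1/2} Σ1 Σ2^{1/2})^{1/2}) so that only symmetric matrix square roots are needed.

```python
import numpy as np

def gaussian_stats(features):
    """Fit the Gaussian (mean, covariance) to a matrix of feature vectors."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def _psd_sqrt(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    s2_half = _psd_sqrt(sigma2)
    covmean = _psd_sqrt(s2_half @ sigma1 @ s2_half)
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))
```

In practice the features would come from a pretrained Inception network applied to the two datasets being compared.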


Performance comparison of multiple target architectures on StanfordCars (Top-1 Acc.(%))

Comparison to fine-tuning (FT) and semi-supervised learning (SSL) on StanfordCars

Comparison of source datasets on StanfordCars

Results on PASCAL-VOC 2007

Comparison of our methods with finetuning methods


Comparison of our methods with finetuning discriminator


Top-1 accuracy in various target dataset sizes
±0.88   45.27 ±0.92   69.21 ±0.88
Ours w/o P-SSL: 59.48 ±1.23   82.46 ±0.36   88.31 ±0.10
Ours: 61.90 ±0.60   82.65 ±0.30   89.33 ±0.19

Comparison of output label functions in PCS

Effect of τ in UDA

P-SSL vs. T-SSL

Ranking of averaged confidence scores of ImageNet classes corresponding to target classes

Corresponding ImageNet classes to target datasets (1)
ray, ostrich, great grey owl, tree frog, tailed frog, loggerhead, leatherback turtle, common iguana, triceratops, trilobite, scorpion, barn spider, tick, centipede, hummingbird, drake, goose, tusker, wallaby, jellyfish, nematode, conch, snail, rock crab, fiddler crab, American lobster, isopod, black stork, American egret, king penguin, killer whale, whippet, Saluki, leopard, jaguar, cheetah, brown bear, American black bear, fly, grasshopper, cockroach, mantis, monarch, starfish, sea urchin, porcupine, sorrel, zebra, ibex, hartebeest, Arabian camel, llama, skunk, gorilla, chimpanzee, Indian elephant, African elephant, acoustic guitar, airliner, airship, altar, analog clock, assault rifle, backpack, balloon, ballpoint, Band Aid, barbell, barrow, bathtub, beacon, beaker, bearskin, beer glass, bell cote, bicycle-built-for-two, binder, binoculars, bolo tie, bottlecap, bow, brass, breastplate, buckle, candle, cannon, canoe, can opener, carpenter's kit, car wheel, cassette, cassette player,

bridge, swing, switch, syringe, table lamp, tape player, teapot, teddy, tennis ball, thimble, thresher, toaster, tobacco shop, toilet seat, torch, tow truck, tray, tricycle, tripod, tub, typewriter keyboard, umbrella, unicycle, upright, vase, waffle iron, wall clock, wallet, warplane, washer, water jug, whiskey jug, whistle, Windsor tie, wine bottle, wool, worm fence, yawl, web site, comic book, street sign, traffic light, book jacket, plate, cheeseburger, hotdog, spaghetti squash, fig, carbonara, red wine, cup, eggnog, cliff, geyser, lakeside, promontory, seashore, valley, volcano, daisy, hip, earthstar, hen-of-the-woods

CUB-200-2011:

box, Petri dish, picket fence, pillow, pinwheel, plastic bag, poncho, pot, prayer rug, prison, purse, quill, quilt, radiator, radio, rubber eraser, rule, safety pin, saltshaker, sarong, screw, shield, shoji, shopping basket, shovel, shower cap, shower curtain, sleeping bag, solar dish, space heater, spider web, stole, stone wall, strainer, swab, sweatshirt, swimming trunks, switch, syringe, tennis ball, thatch, theater curtain, thimble, tile roof, tray, trench coat, umbrella, vase, vault, velvet, waffle iron, wall clock, wallet, wardrobe, water bottle, wig, window screen, window shade, Windsor tie, wooden spoon, wool, web site, crossword puzzle, book jacket, trifle, ice cream, ice lolly, French loaf, pretzel, head cabbage, broccoli, cauliflower, strawberry, lemon, fig, jackfruit, custard apple, pomegranate, hay, chocolate sauce, dough, meat loaf, potpie, cup, eggnog, bubble, cliff, coral reef, geyser, sandbar, valley, volcano, corn, buckeye, coral fungus, hen-of-the-woods, ear, toilet tissue

FGVC-Aircraft: aircraft carrier, airliner, airship, missile, projectile, space shuttle, speedboat, trimaran, warplane, wing

Indoor67: academic gown, altar, bakery, balance beam, bannister, barbell, barber chair, barbershop, barrel, bathing cap,

APPENDIX

The following manuscript provides the supplementary materials of the main paper: Transfer Learning with Pre-trained Conditional Generative Models. We describe (A) additional related work on domain adaptation and fine-tuning, (B) details of the experimental settings used in the main paper, (C) additional experiments, including a comparison of our method with fine-tuning and detailed analyses of PP and P-SSL, and (D) qualitative evaluations of pseudo samples by PCS.

A EXTENDED RELATED WORK

A.1 DOMAIN ADAPTATION

Domain adaptation leverages source knowledge to the target task by minimizing domain gaps between the source and target domains through joint training (Ganin et al., 2016). It is generally assumed that the source and target task label spaces overlap (Pan & Yang, 2010; Wang & Deng, 2018) and that labeled source datasets are available when training target models. Several studies have attempted to solve the transfer learning problem called source-free adaptation (Chidlovskii et al., 2016; Liang et al., 2020; Kundu et al., 2020; Wang et al., 2021a), where the model must adapt to the target domain without target labels or the source dataset. However, these methods still require architecture consistency and overlaps between the source and target tasks, so they are not applicable to our problem setting.

A.2 FINETUNING

Our method is categorized as an inductive transfer learning approach (Pan & Yang, 2010), where the labeled target datasets are available and the source and target task label spaces do not overlap. In deep learning, fine-tuning (Yosinski et al., 2014; Agrawal et al., 2014; Girshick et al., 2014), which leverages source pre-trained weights as initial parameters of the target models, is one of the most common approaches to inductive transfer learning because of its simplicity. Previous studies have attempted to improve fine-tuning by penalizing the gaps between source and target models, such as adding an L2 penalty term (Li et al., 2018) or a penalty using the channel-wise importance of feature maps (Li et al., 2019). You et al. (2020) have introduced category relationships between source and target tasks into target-task training and penalized the target models to predict pseudo source labels, i.e., the outputs of the source models applied to target data. Shu et al. (2021) have presented an approach leveraging multiple source models pre-trained on different datasets and tasks by mixing their outputs via adaptive aggregation modules. Although these methods outperform the naïve fine-tuning baselines, they require architecture consistency between source and target tasks. In contrast to the fine-tuning methods, our method can be used without architecture consistency and source dataset access since it transfers source knowledge via pseudo samples drawn from source pre-trained generative models.

A.3 SEMI-SUPERVISED LEARNING

SSL is a paradigm that trains a supervised model with labeled and unlabeled samples by minimizing a supervised and an unsupervised loss simultaneously. Historically, various SSL algorithms have been used or proposed for deep learning, such as entropy minimization (Grandvalet & Bengio, 2005), pseudo-label (Lee et al., 2013), virtual adversarial training (Miyato et al., 2017), and consistency regularization (Bachman et al., 2014; Sajjadi et al., 2016; Laine & Aila, 2016). UDA (Xie et al., 2020) and FixMatch (Sohn et al., 2020), which combine the ideas of pseudo-label and consistency regularization, have achieved remarkable performance. An assumption in these SSL algorithms is that the unlabeled data are sampled from the same distribution as the labeled data; if there is a large gap between the labeled and unlabeled data distributions, the performance of SSL algorithms degrades (Oliver et al., 2018). However, Xie et al. (2020) have revealed that unlabeled samples from a dataset other than the target dataset can improve the performance of SSL algorithms when target-related samples are carefully selected from source datasets. This indicates that SSL algorithms can achieve high performance as long as the unlabeled samples are related to the target dataset, even when they belong to different datasets. On the basis of this implication, our P-SSL exploits pseudo samples drawn from source generative models as unlabeled data for SSL. Regarding the resolution of the generative models, those of 256×256, whose generated samples are nearest to the input size of C_t^{A_t} (224×224), were the best. From these results, we recommend using generative models that synthesize high-fidelity samples at a resolution close to the input of the target models when applying our method.

C.4 ANALYSIS OF PSEUDO PRE-TRAINING

We analyze the characteristics of PP by varying the synthesized samples.

C.4.1 SYNTHESIZING STRATEGY

We compare four strategies of using the source generative models for PP. We tested Uniform: synthesizing samples for all source classes (default); Filtered: synthesizing samples for target-related source classes identified by the same protocol as in Sec. 4.5; PCS: synthesizing samples by PCS; and Offline: synthesizing a fixed set of samples in advance of training and training C_s^{A_t} with them instead of sampling from G_s. In PCS, we optimized the pre-training models with the pseudo source soft labels generated in the process of PCS. Table 18 shows the target task performances. We found that the Uniform model achieved the best performance. We can infer that, in PP, pre-training with diverse classes and samples is more important than pre-training with only target-related classes or fixed samples.
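A single batch under the Uniform strategy can be sketched as follows; `pp_batch` and `source_generator` are hypothetical names, and the real pipeline would feed such batches into standard supervised pre-training of C_s^{A_t}.

```python
import numpy as np

def pp_batch(source_generator, num_source_classes, batch_size, rng):
    """One pseudo pre-training batch under the 'Uniform' strategy.

    Draws source labels uniformly at random, synthesizes one image per label
    with the conditional generator, and returns (images, labels) for a
    supervised training step. `source_generator` is an assumed callable
    mapping a one-hot label vector to an image.
    """
    labels = rng.integers(num_source_classes, size=batch_size)
    onehot = np.eye(num_source_classes)[labels]            # condition vectors
    images = np.stack([source_generator(y) for y in onehot])
    return images, labels
```

Because samples are drawn online from G_s, each epoch sees fresh synthetic data, which is consistent with Uniform outperforming the Offline (fixed-sample) variant.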

C.5 ANALYSIS OF PSEUDO SEMI-SUPERVISED LEARNING C.5.1 SEMI-SUPERVISED LEARNING ALGORITHM

We compare SSL algorithms in P-SSL. We used six SSL algorithms: EntMin (Grandvalet & Bengio, 2005), Pseudo Label (Lee et al., 2013), Soft Pseudo Label, Consistency Regularization, FixMatch (Sohn et al., 2020), and UDA (Xie et al., 2020). Soft Pseudo Label is a variant of Pseudo Label that uses sharpened soft pseudo labels instead of the one-hot (hard) pseudo labels. Consistency Regularization computes the unsupervised loss of UDA without the confidence thresholding. Table 19 shows the results on StanfordCars, where Pseudo Supervised is a model using pairs of (x_{s←t}, y_t) from pseudo conditional sampling for supervised training. UDA achieved the best result:

EntMin (Grandvalet & Bengio, 2005): 72.56 ±3.33
Pseudo Label (Lee et al., 2013): 74.49 ±2.26
Soft Pseudo Label: 78.44 ±2.41
Consistency Regularization: 79.17 ±1.79
FixMatch (Sohn et al., 2020): 74.31 ±3.27
UDA (Xie et al., 2020): 80.01

More importantly, the methods using hard labels (Pseudo Supervised, Pseudo Label, and FixMatch) failed to outperform the scratch models, whereas the soft-label-based methods improved the performance. This indicates that treating the label of a pseudo sample as an interpolation of the target labels can improve performance, as mentioned in Sec. 3.2.3.

C.5.2 SAMPLE SIZE

We evaluate the effect of the size of the pseudo dataset for P-SSL on the target test accuracy. We varied the pseudo dataset size in {10K, 50K, 100K, 500K} and tested the target performance of P-SSL on the StanfordCars dataset, as shown in Figure 4 (right). We found that the middle range of dataset sizes (50K and 100K images) achieved better results. This suggests that P-SSL does not require generating extremely large pseudo datasets to boost the target models.

C.5.3 OUTPUT LABEL FUNCTION

We discuss the performance comparison of output label functions in PCS. The output label function is crucial for synthesizing target-related samples from source generative models since it directly determines the attributes of the pseudo samples. We tested six labeling strategies: Random Label: attaching uniformly sampled source labels; Softmax: using the softmax outputs of C_s^{A_s} (default); Temperature Softmax: applying temperature scaling to the output logits of C_s^{A_s} and using the softmax output; Argmax: using one-hot labels generated by selecting the class with the maximum probability in the softmax output of C_s^{A_s}; Sparsemax (Martins & Astudillo, 2016): computing the Euclidean projections of the logits of C_s^{A_s} to obtain sparse distributions in the source label space; and Classwise Mean: computing the mean of the softmax outputs of C_s^{A_s} for each target class and using it as the representative pseudo source label of that class to generate pseudo samples. Table 20 shows the comparison of the labeling strategies. Among them, Softmax is the best choice for PCS in terms of both target performance (top-1 accuracy) and relatedness to the target datasets (FID). This means that the pseudo source label y_{s←t} produced by Softmax succeeds in representing the characteristics of a target sample x_t, and that its soft-label form is important for extracting target-related information via a source generative model G_s.
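Three of the six strategies (Softmax, Temperature Softmax, and Argmax) reduce to simple transformations of the classifier logits; a minimal sketch is below (Sparsemax and Classwise Mean are omitted, and the temperature value is an illustrative assumption).

```python
import numpy as np

def softmax(logits):
    """Default output label function: softmax over the source-class logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def temperature_softmax(logits, T=2.0):
    """Temperature-scaled softmax (T > 1 flattens the distribution)."""
    return softmax(logits / T)

def argmax_onehot(logits):
    """Hard one-hot label from the most probable source class."""
    y = np.zeros_like(logits)
    y[np.argmax(logits)] = 1.0
    return y
```

Each function maps the logits of the source classifier to a pseudo source label y_{s←t}, which is then fed to the conditional generator; the table above indicates that the plain softmax form worked best.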

