BOWTIE NETWORKS: GENERATIVE MODELING FOR JOINT FEW-SHOT RECOGNITION AND NOVEL-VIEW SYNTHESIS

Abstract

We propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. While existing work copes with two or more tasks mainly by multi-task learning of shareable feature representations, we take a different perspective. We focus on the interaction and cooperation between a generative model and a discriminative model, in a way that facilitates knowledge to flow across tasks in complementary directions. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with a feedback loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and significantly improve recognition performance when used as data augmentation, especially in the low-data regime. Code and pre-trained models are released at https://github.com/zpbao/bowtie_networks.

1. INTRODUCTION

Given a never-before-seen object (e.g., a gadwall in Figure 1), humans are able to generalize even from a single image of this object in different ways, including recognizing new object instances and imagining what the object would look like from different viewpoints. Achieving similar levels of generalization for machines is a fundamental problem in computer vision, and has been actively explored in areas such as few-shot object recognition (Fei-Fei et al., 2006; Vinyals et al., 2016; Wang & Hebert, 2016; Finn et al., 2017; Snell et al., 2017) and novel-view synthesis (Park et al., 2017; Nguyen-Phuoc et al., 2018; Sitzmann et al., 2019). However, such exploration has largely been confined to separate areas with specialized algorithms, rather than being pursued jointly. We argue that synthesizing images and recognizing them are inherently interconnected. Being able to simultaneously address both tasks with a single model is a crucial step toward human-level generalization. This requires learning a richer, shareable internal representation for more comprehensive object understanding than either task could learn in isolation. Such "cross-task" knowledge becomes particularly critical in the low-data regime, where identifying 3D geometric structures of input images facilitates recognizing their semantic categories, and vice versa. Inspired by this insight, here we propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. This joint task is challenging because of its (i) weak supervision, where we do not have access to any 3D supervision, and (ii) few-shot setting, where we need to effectively learn both 3D geometric and semantic representations from minimal data.
While existing work copes with two or more tasks mainly by multi-task learning or meta-learning of a shared feature representation (Yu et al., 2020; Zamir et al., 2018; Lake et al., 2015) , we take a different perspective in this paper. Motivated by the nature of our problem, we focus on the interaction and cooperation between a generative model (for view synthesis) and a discriminative model (for recognition), in a way that facilitates knowledge to flow across tasks in complementary directions, thus making the tasks help each other. For example, the synthesized images produced by the generative model provide viewpoint variations and could be used as additional training data to build a better recognition model; meanwhile, the recognition model ensures the preservation of the desired category information and deals with partial occlusions during the synthesis. To this end, we propose a feedback-based bowtie network (FBNet), as illustrated in Figure 1 . The network consists of a view synthesis module and a recognition module, which are linked through feedback connections in a bowtie fashion. This is a general architecture that can be used on top of any view synthesis model and any recognition model. The view synthesis module explicitly learns a 3D geometric representation from 2D images, which is transformed to target viewpoints, projected to 2D features, and rendered to generate images. The recognition module then leverages these synthesized images from different views together with the original real images to learn a semantic feature representation and produce corresponding classifiers, leading to the feedback from the output of the view synthesis module to the input of the recognition module. The semantic features of real images extracted from the recognition module are further fed into the view synthesis module as conditional inputs, leading to the feedback from the output of the recognition module to the input of the view synthesis module. 
One potential difficulty, when combining the view synthesis and the recognition modules, lies in the mismatch in their image resolutions. Deep recognition models can benefit from high-resolution images, and recognition performance greatly improves with increased resolution (Wang et al., 2016; Cai et al., 2019; He et al., 2016). By contrast, it is still challenging for modern generative models to synthesize very high-resolution images (Regmi & Borji, 2018; Nguyen-Phuoc et al., 2019). To address this challenge, while operating on a resolution consistent with state-of-the-art view synthesis models (Nguyen-Phuoc et al., 2019), we further introduce resolution distillation to leverage additional knowledge in a recognition model that is learned from higher-resolution images. Our contributions are threefold. (1) We introduce a new problem of simultaneous few-shot recognition and novel-view synthesis, and address it from a novel perspective of cooperating generative and discriminative modeling. (2) We propose feedback-based bowtie networks that jointly learn 3D geometric and semantic representations with feedback in the loop. We further address the mismatch between the modules by leveraging resolution distillation. (3) Our approach significantly improves both view synthesis and recognition performance, especially in the low-data regime, by enabling direct manipulation of view, shape, appearance, and semantics in generative image modeling.

2. RELATED WORK

Few-Shot Recognition is a classic problem in computer vision (Thrun, 1996; Fei-Fei et al., 2006), and generative models have been leveraged for it, e.g., by hallucinating additional training examples (Wang et al., 2018). MetaGAN improves few-shot recognition by producing fake images as a new category (Zhang et al., 2018). However, these methods either do not synthesize images directly or use a pre-trained generative model that is not optimized for the downstream task. By contrast, our approach performs joint training of recognition and view synthesis, and enables the two tasks to cooperate through feedback connections. In addition, while there has been work considering both classification and exemplar generation in the few-shot regime, such investigation focuses on simple domains like handwritten characters (Lake et al., 2015), whereas we address more realistic scenarios with natural images. Note that our effort is largely orthogonal to designing the best few-shot recognition or novel-view synthesis method; instead, we show that the joint model outperforms the original methods addressing each task in isolation. Novel-View Synthesis aims to generate a target image with an arbitrary camera pose from one given source image (Tucker & Snavely, 2020); it is also known as "multiview synthesis." For this task, some approaches are able to synthesize lifelike images (e.g., Park et al., 2017; Wortsman et al., 2020). However, they heavily rely on pose supervision or 3D annotation, which is not applicable in our case. An alternative is to learn a view synthesis model in an unsupervised manner. Pix2Shape learns an implicit 3D scene representation by generating a 2.5D surfel-based reconstruction (Rajeswar et al., 2020). HoloGAN proposes an unsupervised approach to learn 3D feature representations and render 2D images accordingly (Nguyen-Phuoc et al., 2019). Nguyen-Phuoc et al. (2020) learn scene representations from 2D unlabeled images through foreground-background decomposition.
Different from them, not only can our view synthesis module learn from weakly labeled images, but it also enables conditional synthesis to facilitate recognition. Feedback-Based Architectures route the full or partial output of a system back into the input as part of an iterative cause-and-effect process (Ford, 1999). Xiong et al. (2020) use multiview images to tackle fine-grained recognition tasks. However, their method needs strong pose supervision to train the view synthesis model, while we do not. Also, these approaches do not treat the two tasks as of equal importance, i.e., one task serves as an auxiliary task to facilitate the other. On the contrary, our approach targets the joint learning of the two tasks and improves the performance of both. Importantly, we focus on learning a shared generative model, rather than a shared feature representation as is normally the case in multi-task learning. Mustikovela et al. (2020) leverage a generative model to boost viewpoint estimation. The main difference is that we focus on the joint task of synthesis and recognition and achieve bi-directional feedback, while existing work only considers optimizing the target discriminative task using adversarial training or with a feedforward network.

3.1. JOINT TASK OF FEW-SHOT RECOGNITION AND NOVEL-VIEW SYNTHESIS

Problem Formulation: Given a dataset D = {(x_i, y_i)}, where x_i ∈ X is an image of an object and y_i ∈ C is the corresponding category label (X and C are the image space and label space, respectively), we address the following two tasks simultaneously. (i) Object recognition: learning a discriminative model R : X → C that takes as input an image x_i and predicts its category label. (ii) Novel-view synthesis: learning a generative model G : X × Θ → X that, given an image x_i of category y_i and an arbitrary 3D viewpoint θ_j ∈ Θ, synthesizes an image in category y_i viewed from θ_j. Notice that we are more interested in category-level consistency, for which G is able to generate images of not only the instance x_i but also other objects of the category y_i from different viewpoints. This joint-task scenario requires us to improve the performance of both 2D and 3D tasks under weak supervision without any ground-truth 3D annotations. Hence, we need to exploit the cooperation between them.

3.2. FEEDBACK-BASED BOWTIE NETWORKS

To address the joint task, we are interested in learning a generative model that can synthesize realistic images of different viewpoints, which are also useful for building a strong recognition model. We propose a feedback-based bowtie network (FBNet) for this purpose. This model consists of a view synthesis module and a recognition module, trained in a joint, end-to-end fashion. Our key insight is to explicitly introduce feedback connections between the two modules, so that they cooperate with each other, thus enabling the entire model to simultaneously learn 3D geometric and semantic representations. This general architecture can be used on top of any view synthesis model and any recognition model. Here we focus on a state-of-the-art view synthesis model, HoloGAN (Nguyen-Phuoc et al., 2019), and a widely adopted few-shot recognition model, the prototypical network (Snell et al., 2017), as shown in Figure 2.

3.2.1. VIEW SYNTHESIS MODULE

The view synthesis module V is shown in the blue shaded region in Figure 2. It is adapted from HoloGAN (Nguyen-Phuoc et al., 2019), a state-of-the-art model for unsupervised view synthesis. This module consists of a generator G, which first generates a 3D feature representation from a latent constant tensor (initial cube) through 3D convolutions. The feature representation is then transformed to a certain pose and projected to 2D with a projector. The final color image is then computed through 2D convolutions. This module takes two inputs: a latent vector z and a view input θ. z characterizes the style of the generated image through adaptive instance normalization (AdaIN) units (Huang & Belongie, 2017). θ = [θ_x, θ_y, θ_z] guides the transformation of the 3D feature representation. This module also contains a discriminator D to detect whether an image is real or fake (not shown in Figure 2). We use the standard GAN loss from DC-GAN (Radford et al., 2016), L_GAN(G, D). We make the following important modifications to make the architecture applicable to our joint task. Latent Vector Formulation: To allow the synthesis module to receive feedback from the recognition module (details in Section 3.2.3), we first change HoloGAN from unconditional to conditional. To this end, we model the latent input z as z_i = f_i ⊕ n_i, where f_i is the conditional feature input derived from image x_i, n_i is a noise vector sampled from a Gaussian distribution, and ⊕ is a combination strategy (e.g., concatenation). By doing so, the synthesis module leverages additional semantic information, and thus maintains category-level consistency with a target image and improves the diversity of the generated images. Identity Regularizer: Inspired by Chen et al. (2016), we introduce an identity regularizer to ensure that the synthesis module simultaneously satisfies two critical properties: (i) the identity of the generated image is preserved when we only change the view input θ; (ii) the orientation of the generated image is preserved when we only change the latent input z, and this orientation should be consistent with the view input θ. Specifically, we leverage an encoding network H to predict the reconstructed latent vector z' and view input θ': H(G(z, θ)) = [z', θ'], where G(z, θ) is the generated image. We then minimize the difference between the real and the reconstructed inputs: L_identity(G, H) = E_z ||z' - z||^2 + E_θ ||θ' - θ||^2. Here H shares the majority of the convolution layers of the discriminator D, but uses an additional fully-connected layer. Section A explains the detailed architecture of the view synthesis module.
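As a concrete illustration of the conditional latent input and the identity regularizer, the following NumPy sketch builds z_i = f_i ⊕ n_i by concatenation and evaluates L_identity for a single sample. The function names and dimensions are ours for illustration; in the actual model, f_i and the reconstructions come from network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_latent(f, noise_dim=8):
    """z_i = f_i (+) n_i: concatenate the semantic feature from the
    recognition module with a Gaussian noise vector (one possible
    combination strategy)."""
    n = rng.standard_normal(noise_dim)
    return np.concatenate([f, n])

def identity_loss(z, theta, z_rec, theta_rec):
    """Single-sample L_identity = ||z' - z||^2 + ||theta' - theta||^2,
    where (z', theta') are the reconstructions predicted by the
    encoding network H from the generated image."""
    return float(np.sum((z_rec - z) ** 2) + np.sum((theta_rec - theta) ** 2))
```

If H reconstructed the inputs perfectly, the loss would be zero; any drift in identity or orientation is penalized quadratically.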

3.2.2. RECOGNITION MODULE

The recognition module R (green shaded region in Figure 2) consists of a feature extraction network and a prototypical classification network. Resolution Distillation: We first pre-train F_highR following the standard practice with a cross-entropy softmax classifier (Liu et al., 2016). We then train our feature extraction network F_lowR (the one used in the recognition module) under the guidance of F_highR by matching their features: L_feature(F_lowR) = E_x ||F_highR(x) - F_lowR(x)||^2, where x is a training image. With the help of resolution distillation, the feature extraction network recaptures information present in high-resolution images but potentially missed in low-resolution images. Prototypical Classification Network: We use the prototypical network P (Snell et al., 2017) as our classifier. The network assigns class probabilities p based on the distance of the input feature vector from class centers μ, where μ is calculated using support images in the latent feature space: p_c(x) = exp(-d(P(F_lowR(x)), μ_c)) / Σ_j exp(-d(P(F_lowR(x)), μ_j)), μ_c = Σ_{(x_i,y_i)∈S} P(F_lowR(x_i)) I[y_i = c] / Σ_{(x_i,y_i)∈S} I[y_i = c], where x is a real query image, p_c is the probability of category c, d is a distance metric (e.g., Euclidean distance), and S is the support set. P operates on top of the feature extraction network F_lowR, and consists of 3 fully-connected layers as an additional feature embedding (the classifier itself is non-parametric). Another benefit of using the prototypical network is that it enables the recognition module to explicitly leverage the generated images as data augmentation, i.e., S contains both real and generated images when computing the class means. Notice, though, that the module parameters are updated based on the loss calculated on the real query images, which is a cross-entropy loss L_rec(R) between their predictions p and ground-truth labels.
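The distillation loss and the prototypical classifier above can be sketched as follows. This is a minimal NumPy illustration (the function names and array shapes are ours, not the paper's), with feature vectors standing in for the outputs of F_highR, F_lowR, and P:

```python
import numpy as np

def feature_matching_loss(f_high, f_low):
    """Resolution distillation: L_feature = ||F_highR(x) - F_lowR(x)||^2,
    pulling the low-resolution student features toward the
    high-resolution teacher features."""
    return float(np.sum((f_high - f_low) ** 2))

def class_prototypes(feats, labels):
    """mu_c: mean embedded support feature per class; the support set may
    mix real and synthesized images."""
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def proto_probs(query, protos):
    """p_c(x): softmax over negative squared Euclidean distances from the
    query embedding to each class prototype mu_c."""
    classes = sorted(protos)
    d = np.array([np.sum((query - protos[c]) ** 2) for c in classes])
    s = -d - (-d).max()            # numerically stable softmax
    e = np.exp(s)
    return dict(zip(classes, e / e.sum()))
```

The classifier itself stays non-parametric: only the embedding networks carry learnable weights, which is what makes augmenting S with synthesized images straightforward.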

3.2.3. FEEDBACK-BASED BOWTIE MODEL

As shown in Figure 2, we leverage a bowtie architecture for our full model, where the output of each module is fed into the other module as one of its inputs. Through joint training, such connections work as explicit feedback to facilitate the communication and cooperation between the modules. Feedback Connections: We introduce two complementary feedback connections between the view synthesis module and the recognition module: (1) recognition output → synthesis input (green arrow in Figure 2), where the features of the real images extracted by the recognition module are fed into the synthesis module as conditional inputs to generate images from different views; (2) synthesis output → recognition input (blue arrow in Figure 2), where the generated images are used to produce an augmented set to train the recognition module. Categorical Loss for Feedback: The view synthesis module needs to capture the categorical semantics in order to further encourage the generated images to benefit recognition. Therefore, we introduce a categorical loss that updates the synthesis module with the prediction results on the generated images: L_cat(G) = E[-log R_{y_i}(G(z_i, θ_i))], where y_i is the category label of the generated image G(z_i, θ_i) and R_{y_i}(·) denotes the predicted probability of class y_i. This loss also implicitly increases the diversity and quality of the generated images. Final Loss Function: The final loss function is L_Total = L_GAN + L_rec + L_feature + λ_id L_identity + λ_cat L_cat, where λ_id and λ_cat are trade-off hyper-parameters. Training Procedure: We first pre-train F_highR on the high-resolution dataset and save the computed features. These features are used to help train the feature extraction network F_lowR through L_feature. Then the entire model is first trained on C_base and then fine-tuned on C_novel; training on the two sets is similar. During each iteration, we randomly sample a few images per class as a support set and one image per class as a query set.
The images in the support set, together with their features computed via the recognition module, are fed into the view synthesis module to generate multiple images from different viewpoints. These synthesized images are used to augment the original support set when computing the prototypes. Then, the query images are used to update the parameters of the recognition module through L_rec; the view synthesis module is updated through L_GAN, L_identity, and L_cat. The entire model is trained in an end-to-end fashion. More details are in Section B. While comparing against the strongest specialized single-task models is interesting, it is not the main focus of this paper. We aim to validate that the feedback-based bowtie architecture outperforms the single-task models upon which it builds, rather than to design the best few-shot recognition or novel-view synthesis method. In Section F, we show that our framework is general and can be used on top of other single-task models to improve their performance. All models are trained following the same few-shot setting described in Section 3.1.
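The loss terms that close the feedback loop can be sketched minimally as below, assuming the recognition module outputs a probability vector over classes (the helper names and scalar interface are ours):

```python
import numpy as np

def categorical_loss(probs, y):
    """L_cat = -log p_y for one generated image, where probs is the
    recognition module's predicted class distribution and y the
    intended category of the synthesized image."""
    return float(-np.log(probs[y]))

def total_loss(l_gan, l_rec, l_feature, l_identity, l_cat,
               lam_id=10.0, lam_cat=1.0):
    """L_Total = L_GAN + L_rec + L_feature + lam_id * L_identity
    + lam_cat * L_cat; the paper sets lam_id = 10 and lam_cat = 1
    via cross-validation."""
    return l_gan + l_rec + l_feature + lam_id * l_identity + lam_cat * l_cat
```

In practice each term is averaged over a batch; L_cat is the only term backpropagated from the recognition module into the generator, which is what makes the feedback bi-directional.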

4. EXPERIMENTAL EVALUATION

View Synthesis Facilitates Recognition: Table 1 presents the top-1 recognition accuracy for the base classes and the novel classes, respectively. We focus on the challenging 1- and 5-shot settings, where the number of training examples per novel class K is 1 or 5. For the novel classes, we run five trials for each setting of K, and report the average accuracy and standard deviation for all approaches. Table 1 shows that our FBNet consistently achieves the best few-shot recognition performance on the two datasets. Moreover, the significant improvement of FBNet over FBNet-aug (where the recognition model uses additional data from the conditional view synthesis model, but the two are trained separately) indicates that feedback-based joint training is key to improving recognition performance. Recognition Facilitates View Synthesis: We investigate the novel-view synthesis results under two standard metrics. The FID score computes the Fréchet distance between two Gaussians fitted to feature representations of the source (real) images and the target (synthesized) images (Dowson & Landau, 1982). The Inception Score (IS) uses an Inception network pre-trained on ImageNet (Deng et al., 2009) to predict the labels of the generated images and calculates an entropy-based score over these predictions; IS seeks to capture both the quality and diversity of a collection of generated images (Salimans et al., 2016). A higher IS or a lower FID value indicates better realism of the generated images, and a larger variance of IS indicates more diversity. We generate images of random views in one-to-one correspondence with the training examples for all models, and compute the IS and FID values on these images. The results are reported in Table 2. As a reference, we also show the results of real images under the two metrics, which are the best results we could expect from synthesized images. Our FBNet consistently achieves the best performance under both metrics.
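For intuition, the Fréchet distance underlying FID can be sketched as follows. The full metric requires a matrix square root of the product of the two covariance matrices; this simplified sketch (ours, not the paper's evaluation code) assumes diagonal covariances, under which the trace term collapses to a per-dimension sum:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians, simplified to diagonal
    covariances: ||mu1 - mu2||^2 + sum((sqrt(var1) - sqrt(var2))^2).
    The full FID instead uses the matrix square root of Sigma1 @ Sigma2."""
    mean_term = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return float(mean_term + cov_term)
```

Identical feature statistics give a distance of zero; either a shift in the mean or a mismatch in spread increases the score, which is why lower FID indicates synthesized images statistically closer to the real ones.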
Compared with HoloGAN, our method brings up to 18% improvement under FID and 19% under IS. Again, the significant performance gap between FBNet and FBNet-view shows that feedback-based joint training substantially improves synthesis performance. IS and FID cannot effectively evaluate whether the generated images maintain category-level identity and capture different viewpoints. Therefore, Figure 3 shows qualitative synthesis results. One should not expect the quality of the synthesized images to match those generated from large amounts of training data, e.g., Brock et al. (2019); this demonstrates the general difficulty of image generation in the few-shot setting, which is worth further exploration in the community. Notably, even in this challenging setting, our synthesized images are of significantly higher visual quality than the state-of-the-art baselines. Specifically, (1) our FBNet is able to perform controllable conditional generation, while HoloGAN cannot. Such conditional generation enables FBNet to better capture the shape information of different car models on CompCars, which is crucial to the recognition task. On CUB, FBNet captures both shape and attributes well even in the extremely low-data regime (1-shot), thus generating images of higher quality and more diversity. (2) Our FBNet also better maintains the identity of the objects across viewpoints. For both HoloGAN and FBNet-view, it is hard to tell whether identity is kept, but FBNet synthesizes images well from all viewpoints while maintaining the main color and shape. (3) In addition, we notice only a minor improvement in the visual quality of the synthesis results from HoloGAN to FBNet-view, indicating that simply changing the view synthesis model from unconditional to conditional does not improve performance. However, through our feedback-based joint training with recognition, the quality and diversity of the generated images improve significantly.
Shared Generative Model vs. Shared Feature Representation: We further compare with a standard multi-task baseline (Ruder, 2017), which learns a shared feature representation across the joint tasks, denoted as 'Multitask-Feat' in Table 4. We treat the feature extraction network as a shared component between the recognition module and the view synthesis module, and update its parameters using both tasks without feedback connections. Table 4 shows that, through the feedback connections, our shared generative model captures the underlying image generation mechanism for more comprehensive object understanding, outperforming the direct task-level shared feature representation. Ablation - Different Recognition Networks: While we use ResNet-18 as the default feature extraction network, our approach is applicable to different recognition models. Table 3 shows that recognition performance consistently improves with different feature extraction networks. Interestingly, ResNet-10/18 outperform the deeper models, indicating that deeper models might suffer from over-fitting in few-shot regimes, consistent with the observation in Chen et al. (2019a). Ablation - Categorical Loss: In addition to the feedback connections, our synthesis and recognition modules are linked by the categorical loss. To analyze its effect, we vary λ_cat among 0 (without the categorical loss), 0.1, 1, and 5. Figure 4 shows the quantitative and qualitative results on CUB. As λ_cat increases, the recognition performance improves gradually. Meanwhile, an overly large λ_cat reduces the visual quality of the generated images: checkerboard noise appears. While these images are not visually appealing, they still benefit the recognition task. This shows that the categorical loss trades off the performance between the two tasks, and there is a "sweet spot" between them.
Ablation - Resolution Distillation and Prototypical Classification: Our proposed resolution distillation reconciles the resolution inconsistency between the synthesis and recognition modules, and further benefits from a recognition model trained on high-resolution images. The prototypical network leverages the synthesized images, which constitutes one of the feedback connections. We evaluate their effect by building two variants of our model without these techniques: 'FBNet w/o Dist' trains the feature extraction network directly from low-resolution images; 'FBNet w/o Proto' uses a regular classification network instead of the prototypical network. Table 4 shows that full FBNet significantly outperforms these variants, verifying the importance of our techniques. Qualitative Results on the CelebA-HQ Dataset: We further show that the visual quality of our synthesized images improves significantly on datasets with better aligned poses. For this purpose, we conduct experiments on CelebA-HQ (Lee et al., 2020), which contains 30,000 aligned human face images annotated with 40 attributes in total. We randomly select 35 attributes as training attributes and 5 as few-shot test attributes. While CelebA-HQ does not provide pose annotation, the aligned faces mitigate the pose issue to some extent. Figure 5 shows that both the visual quality and diversity of our synthesized images substantially improve, while consistently outperforming HoloGAN. Discussion and Future Work: Our experimental evaluation has focused on fine-grained categories, mainly because state-of-the-art novel-view synthesis models still cannot address image generation for a wide spectrum of general images (Liu et al., 2019). Meanwhile, our feedback-based bowtie architecture is general.
With the advance in novel-view synthesis, such as the recent work of BlockGAN (Nguyen-Phuoc et al., 2020) and RGBD-GAN (Noguchi & Harada, 2020), our framework could potentially be extended to deal with broader types of images. Further investigation includes exploring more architectural choices and dealing with images containing more than one object.

5. CONCLUSION

This paper has proposed a feedback-based bowtie network for the joint task of few-shot recognition and novel-view synthesis. Our model consistently improves performance on both tasks, especially with extremely limited data. The proposed framework could potentially be extended to address more tasks, leading to a generative model useful and shareable across a wide range of tasks. Additional Implementation Details: In the main paper, we set λ_id = 10 and λ_cat = 1 via cross-validation, and found that the performance is relatively stable with respect to these trade-off hyper-parameters. We sample 5 images per class for C_base and 1 image for C_novel. We use the Adam optimizer for all networks. The learning rate is set to 5e-5. The final dimension of the feature extraction network is 1,000. The hidden size of all three fully-connected layers is 128, and the



http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/



Figure 1: Left: Given a single image of a novel visual concept (e.g., a gadwall), a person can generalize in various ways, including imagining what this gadwall would look like from different viewpoints (top) and recognizing new gadwall instances (bottom). Right: Inspired by this, we introduce a general feedback-based bowtie network that facilitates the interaction and cooperation between a generative module and a discriminative module, thus simultaneously addressing few-shot recognition and novel-view synthesis in the low-data regime.

Some recent approaches leverage generative networks to improve other visual tasks (Peng et al., 2018; Hu et al., 2019; Luo et al., 2020; Zhang et al., 2020). A generative network and a discriminative pose estimation network are trained jointly through an adversarial loss in Peng et al. (2018), where the generative network performs data augmentation to facilitate the downstream pose estimation task. Luo et al. (2020) design a controllable data augmentation method for robust text recognition, achieved by tracking and refining the moving state of the control points. Zhang et al. (2020) study and exploit the relationship among facial expression recognition, face alignment, and face synthesis to improve training.

Figure 2: Architecture of our feedback-based bowtie network. The whole network consists of a view synthesis module and a recognition module, which are linked through feedback connections in a bowtie fashion. Few-Shot Setting: The few-shot dataset consists of one or only a few images per category, which makes our problem even more challenging. To this end, following recent work on knowledge transfer and few-shot learning (Hariharan & Girshick, 2017; Chen et al., 2019a), we leverage a set of "base" classes C_base with a large-sample dataset D_base = {(x_i, y_i), y_i ∈ C_base} to train our initial model. We then fine-tune the pre-trained model on our target "novel" classes C_novel (C_base ∩ C_novel = ∅) with its small-sample dataset D_novel = {(x_i, y_i), y_i ∈ C_novel} (e.g., a K-shot setting corresponds to K images per class).
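The K-shot episode construction described above (K support images and one query image sampled per class at each iteration) can be sketched as follows; the dictionary-based dataset format is a hypothetical stand-in for the actual data loader:

```python
import random

def sample_episode(dataset, k_shot, n_query=1, seed=0):
    """Build one K-shot episode: for each class, draw k_shot support
    images and n_query query images without replacement.
    `dataset` maps class label -> list of image ids."""
    rng = random.Random(seed)
    support, query = [], []
    for c, items in dataset.items():
        picks = rng.sample(items, k_shot + n_query)
        support += [(x, c) for x in picks[:k_shot]]
        query += [(x, c) for x in picks[k_shot:]]
    return support, query
```

The support set (later augmented with synthesized views) is used to compute class prototypes, while the held-out query images drive the recognition loss.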

Figure 3: Synthesized images from multiple viewpoints. Images in the same row/column are from the same viewpoint/object. Our approach captures the shape and attributes well even in the extremely low-data regime.

Figure 4: Ablation on λ_cat. The categorical loss trades off the performance between view synthesis and recognition.

Figure A: Architecture of ResBlock used in the view synthesis module. The default kernel size is 3 and the stride is 1.

Many algorithms have been proposed to address few-shot recognition (Vinyals et al., 2016; Wang & Hebert, 2016; Finn et al., 2017; Snell et al., 2017), including recent efforts on leveraging generative models (Li et al., 2015; Wang et al., 2018; Schwartz et al., 2018; Zhang et al., 2018; Tsutsui et al., 2019; Chen et al., 2019b; Li et al., 2019; Zhang et al., 2019; Sun et al., 2019). A hallucinator is introduced to generate additional examples in a pre-trained feature space as data augmentation to help with low-shot classification (Wang et al., 2018).

Feedback mechanisms have recently been introduced into neural networks (Belagiannis & Zisserman, 2017; Zamir et al., 2017; Yang et al., 2018). Compared with prior work, our FBNet contains two complete sub-networks, and the output of each module is fed into the other as one of its inputs. FBNet is thus essentially a bi-directional feedback-based framework that optimizes the two sub-networks jointly. Multi-task learning addresses related tasks with shared representations (Zamir et al., 2018; Standley et al., 2020). Some recent work investigates the connection between recognition and view synthesis, and makes attempts to combine them (Sun et al., 2018; Wang et al., 2018; Xian et al., 2019; Santurkar et al., 2019; Xiong et al., 2020; Michalkiewicz et al., 2020).

View Synthesis Module V

We randomly split the entire dataset into 75% for training and 25% for testing. For CUB, 150 classes are selected as base classes and 50 as novel classes. For CompCars, 240 classes are selected as base classes and 120 as novel classes. Note that we focus on simultaneous recognition and synthesis over all base or novel classes, which is significantly more challenging than the typical 5-way classification over sampled classes in most few-shot classification work (Snell et al., 2017; Chen et al., 2019a). We also include evaluation on additional datasets in Section D. We set λ_id = 10 and λ_cat = 1 via cross-validation. We use ResNet-18 (He et al., 2016) as the feature extraction network, unless otherwise specified. To match the resolution of our data, we change the kernel size of the first convolution layer of ResNet from 7 to 5. The training process requires hundreds of examples at each iteration, which may not fit in the memory of our device. Hence, inspired by Wang et al. (2018), we make a trade-off: we first train the feature extraction network through resolution distillation, then freeze its parameters and train the other parts of our model. Section C includes more implementation details.

Top-1 (%) recognition accuracy on the CUB and CompCars datasets. For base classes: 150-way classification on CUB and 240-way classification on CompCars; for K-shot novel classes: 50-way classification on CUB and 120-way classification on CompCars. Our FBNet consistently achieves the best performance for both base and novel classes, and joint training significantly outperforms training each module individually.

Novel-view synthesis results under the FID and IS metrics. ↑ indicates that higher is better, and ↓ indicates that lower is better. As a reference, FID and IS of Real Images represent the best results we could expect. FBNet consistently outperforms the baselines, achieving 18% improvements for FID and 19% for IS.
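For readers unfamiliar with the FID metric reported above, the following sketch shows its form: the Fréchet distance between Gaussian fits of real and synthesized feature distributions. To stay dependency-free, this simplified version assumes diagonal covariances; the standard metric uses full covariance matrices and Inception features, and this is not the paper's evaluation code.

```python
import numpy as np

def fid_diagonal(feats_real, feats_fake):
    """Frechet Inception Distance between two feature sets, under a
    diagonal-covariance simplification:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    var1, var2 = feats_real.var(0), feats_fake.var(0)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term
```

Lower is better: identical real and fake distributions give a score near zero, which is why Real Images serve as the best-case reference in the table.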

Few-shot recognition accuracy consistently improves with different feature extraction networks.

Ablation studies on CUB regarding (i) learning a shared feature representation through standard multi-task learning, (ii) FBNet without resolution distillation, and (iii) FBNet using a regular classification network without prototypical classification. Our full model achieves the best performance.


Acknowledgement: This work was supported in part by ONR MURI N000014-16-1-2007 and by AFRL Grant FA23861714660. We also thank NVIDIA for donating GPUs and AWS Cloud Credits for Research program.


The final feature dimension of the prototypical classification network is also 128. The batch size is 64 for the view synthesis module. We train 1,400 iterations for C_base and 100 iterations for C_novel. We randomly select 350 classes as base classes and 255 classes as novel classes. For the DOG dataset, there are 20,580 images belonging to 120 classes; we randomly select 80 classes as base and 40 classes as novel. For each class of the two datasets, we randomly select 75% as training data and 25% as test data. We pre-process these two datasets in a similar way as the CUB dataset.

Slightly different from the evaluation in the main paper, we only compare our FBNet with HoloGAN (Nguyen-Phuoc et al., 2019) for the view synthesis task. For the recognition task, we compare our method with FBNet-rec and FBNet-aug. All other experimental settings remain the same as those on the CUB and CompCars datasets in the main paper.

Quantitative Results: Table A shows the recognition performance of the three competing models. Again, our method achieves the best performance for both base and novel classes. Table B shows the view synthesis results, where our method also achieves the best performance.

Qualitative Results: Figure D shows the images synthesized by HoloGAN and our FBNet on the two datasets. First, we note that the overall quality of the synthesized images for both methods is substantially worse than one would expect with large amounts of training images. This demonstrates the general difficulty of the task due to weak supervision and lack of data, indicating the need for the community to focus on such problems. Second, our FBNet significantly outperforms the state-of-the-art HoloGAN, especially in the diversity of the synthesized images. Additionally, even though the data is scarce, FBNet still captures the shape and some detailed attributes of the images well. We see that the proposed method works effectively at 128-resolution.

We also note that higher resolution requires higher-quality training data. The limited size and quality of the CUB dataset introduce additional challenges for synthesis, such as missing details, inconsistent identity across different viewpoints, and noisy backgrounds. However, such problems could be further addressed by improving the quantity and quality of the training data.
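The prototypical classification used by the recognition module follows the standard recipe of Snell et al. (2017): each class is summarized by the mean of its support features, and queries are classified by distance to these prototypes. The sketch below illustrates that recipe; the function name and NumPy formulation are our own, not the paper's implementation.

```python
import numpy as np

def prototypical_logits(support_feats, support_labels, query_feats, n_classes):
    """Score queries against per-class prototypes (mean support features)
    using negative squared Euclidean distance, so argmax = nearest class."""
    protos = np.stack([support_feats[support_labels == c].mean(0)
                       for c in range(n_classes)])                 # (C, D)
    d2 = ((query_feats[:, None, :] - protos[None]) ** 2).sum(-1)   # (Q, C)
    return -d2
```

With 128-dimensional features as described above, D = 128 and the prototypes form a (C, 128) matrix.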

H FROM JOINT-TASK TO TRIPLE-TASK: ATTRIBUTE TRANSFER

The proposed bowtie framework could also be extended to address more than two tasks, by either introducing additional feedback connections or changing the architectures of the two modules. Here, as an example, we introduce an additional "attribute transfer" task combined with the novel-view synthesis task. That is, instead of seeing one image at a time, the view synthesis module sees one source image ('im1') and one target image ('im2') at the same time; it then generates images with the object of the source image and the attributes of the target image. We achieve this by feeding the source latent code to the 3D AdaIN units and the target latent code to the 2D AdaIN units in the view synthesis module. Figure F shows the result of this additional task: images in the same row keep the same identity as im1; images in the same column have the same attributes as im2. Note that, for better visualization, we use the predicted views from the source images as the view input for all the generated images in Figure F, but the model could still synthesize images from different viewpoints.
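The AdaIN units mentioned above implement adaptive instance normalization: features are normalized per channel and then re-scaled and re-shifted by parameters derived from a latent code. The sketch below shows the core operation only; in the actual module the scale and shift would come from learned affine maps of the source latent (3D units) or target latent (2D units), which we abstract away here as given arrays.

```python
import numpy as np

def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization on a (C, H, W) feature map:
    normalize each channel to zero mean / unit std, then apply a
    latent-driven per-channel scale and shift."""
    mu = content.mean(axis=(-2, -1), keepdims=True)
    std = content.std(axis=(-2, -1), keepdims=True)
    return scale * (content - mu) / (std + eps) + shift
```

Because the normalization erases the content's own channel statistics, whichever latent supplies the scale and shift fully controls those statistics, which is what lets the source and target images control identity and attributes separately.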

