BOWTIE NETWORKS: GENERATIVE MODELING FOR JOINT FEW-SHOT RECOGNITION AND NOVEL-VIEW SYNTHESIS

Abstract

We propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. While existing work copes with two or more tasks mainly by multi-task learning of shareable feature representations, we take a different perspective. We focus on the interaction and cooperation between a generative model and a discriminative model, in a way that facilitates the flow of knowledge across tasks in complementary directions. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with a feedback loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and, used as a form of data augmentation, significantly improve recognition performance, especially in the low-data regime. Code and pre-trained models are released at https://github.com/zpbao/bowtie_networks.

1. INTRODUCTION

Given a never-before-seen object (e.g., a gadwall in Figure 1), humans are able to generalize even from a single image of this object in different ways, including recognizing new object instances and imagining what the object would look like from different viewpoints. Achieving similar levels of generalization for machines is a fundamental problem in computer vision, and has been actively explored in areas such as few-shot object recognition (Fei-Fei et al., 2006; Vinyals et al., 2016; Wang & Hebert, 2016; Finn et al., 2017; Snell et al., 2017) and novel-view synthesis (Park et al., 2017; Nguyen-Phuoc et al., 2018; Sitzmann et al., 2019). However, such exploration has largely been confined to each area in isolation, with specialized algorithms, rather than pursued jointly. We argue that synthesizing images and recognizing them are inherently interconnected. Being able to simultaneously address both tasks with a single model is a crucial step toward human-level generalization. This requires learning a richer, shareable internal representation for more comprehensive object understanding than either task could induce on its own. Such "cross-task" knowledge becomes particularly critical in the low-data regime, where identifying 3D geometric structures of input images facilitates recognizing their semantic categories, and vice versa. Inspired by this insight, here we propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. This joint task is challenging because of (i) its weak supervision, where we do not have access to any 3D supervision, and (ii) its few-shot setting, where we need to effectively learn both 3D geometric and semantic representations from minimal data.
While existing work copes with two or more tasks mainly by multi-task learning or meta-learning of a shared feature representation (Yu et al., 2020; Zamir et al., 2018; Lake et al., 2015), we take a different perspective in this paper. Motivated by the nature of our problem, we focus on the interaction and cooperation between a generative model (for view synthesis) and a discriminative model (for recognition), in a way that facilitates the flow of knowledge across tasks in complementary directions, thus making the tasks help each other. For example, the synthesized images produced by the generative model provide viewpoint variations and can be used as additional training data to build a better recognition model; meanwhile, the recognition model ensures the preservation of the desired category information and deals with partial occlusions during synthesis. To this end, we propose a feedback-based bowtie network (FBNet), as illustrated in Figure 1. The network consists of a view synthesis module and a recognition module, which are linked through feedback connections in a bowtie fashion. This is a general architecture that can be used on top of any view synthesis model and any recognition model. The view synthesis module explicitly learns a 3D geometric representation from 2D images, which is transformed to target viewpoints, projected to 2D features, and rendered to generate images. The recognition module then leverages these synthesized images from different views together with the original real images to learn a semantic feature representation and produce corresponding classifiers, creating a feedback path from the output of the view synthesis module to the input of the recognition module. The semantic features of real images extracted from the recognition module are in turn fed into the view synthesis module as conditional inputs, creating a feedback path from the output of the recognition module to the input of the view synthesis module.
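The two feedback paths can be sketched with stub linear modules in place of the real networks. Everything here is an illustrative assumption (dimensions, names, and the linear stand-ins for the recognition and view synthesis modules are hypothetical, not the paper's implementation); the sketch only shows how each module's output feeds the other's input:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
N_REAL, IMG_DIM, FEAT_DIM, N_VIEWS = 4, 64, 16, 3

def recognition_features(images, W_feat):
    """Discriminative module (stub): map images to semantic features."""
    return np.tanh(images @ W_feat)

def synthesize_views(semantic_feats, W_gen):
    """Generative module (stub): render each object from several viewpoints,
    conditioned on the semantic features fed back from the recognizer.
    A per-view linear map stands in for the 3D transform-and-render step."""
    views = [np.tanh(semantic_feats @ W_gen[v]) for v in range(len(W_gen))]
    return np.concatenate(views, axis=0)

# Random stub parameters (a real model would learn these jointly).
W_feat = rng.normal(size=(IMG_DIM, FEAT_DIM))
W_gen = rng.normal(size=(N_VIEWS, FEAT_DIM, IMG_DIM))

real_images = rng.normal(size=(N_REAL, IMG_DIM))

# Feedback path 1: recognizer output (semantic features) conditions the generator.
feats = recognition_features(real_images, W_feat)
synth_images = synthesize_views(feats, W_gen)

# Feedback path 2: generator output augments the recognizer's training set.
augmented = np.concatenate([real_images, synth_images], axis=0)
assert augmented.shape == (N_REAL * (1 + N_VIEWS), IMG_DIM)
```

In training, both stubs would be differentiable networks optimized in a loop, so gradients from the recognition loss also shape the synthesized views.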
One potential difficulty, when combining the view synthesis and the recognition modules, lies in the mismatch in image resolution. Deep recognition models benefit from high-resolution images, and recognition performance greatly improves with increased resolution (Wang et al., 2016; Cai et al., 2019; He et al., 2016). By contrast, it is still challenging for modern generative models to synthesize very high-resolution images (Regmi & Borji, 2018; Nguyen-Phuoc et al., 2019). To address this challenge, while operating at a resolution consistent with state-of-the-art view synthesis models (Nguyen-Phuoc et al., 2019), we further introduce resolution distillation to leverage additional knowledge from a recognition model that is learned on higher-resolution images. Our contributions are threefold. (1) We introduce a new problem of simultaneous few-shot recognition and novel-view synthesis, and address it from a novel perspective of cooperating generative and discriminative modeling. (2) We propose feedback-based bowtie networks that jointly learn 3D geometric and semantic representations with feedback in the loop. We further address the resolution mismatch between the two modules by leveraging resolution distillation. (3) Our approach significantly improves both view synthesis and recognition performance, especially in the low-data regime, by enabling direct manipulation of view, shape, appearance, and semantics in generative image modeling.
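The resolution-distillation idea can be sketched as matching a low-resolution student's features to those of a frozen teacher trained on high-resolution inputs. This is a minimal sketch under stated assumptions: the linear feature extractors, the subsampling stand-in for downscaling, the feature-matching (MSE) loss, and all sizes are hypothetical choices for illustration, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: the teacher sees 128-dim inputs, the student sees
# 32-dim inputs (standing in for high- vs low-resolution images).
BATCH, HI_DIM, LO_DIM, FEAT_DIM = 8, 128, 32, 16

W_teacher = rng.normal(size=(HI_DIM, FEAT_DIM)) * 0.1  # frozen, high-res
W_student = rng.normal(size=(LO_DIM, FEAT_DIM)) * 0.1  # trained at low res

hi_res = rng.normal(size=(BATCH, HI_DIM))
lo_res = hi_res[:, ::HI_DIM // LO_DIM]  # crude subsample as a downscale stand-in

teacher_feats = hi_res @ W_teacher
student_feats = lo_res @ W_student

# Distillation loss: the student mimics the teacher's feature responses.
distill_loss = np.mean((student_feats - teacher_feats) ** 2)

# One gradient-descent step on the student only (the teacher stays frozen).
grad = 2 * lo_res.T @ (student_feats - teacher_feats) / (BATCH * FEAT_DIM)
W_student -= 0.1 * grad
new_loss = np.mean((lo_res @ W_student - teacher_feats) ** 2)
assert new_loss < distill_loss  # the step reduces the distillation loss
```

The key design point this illustrates is the asymmetry: only the low-resolution branch is updated, so high-resolution knowledge flows one way into the module that must operate at the generator's resolution.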

2. RELATED WORK

Few-Shot Recognition is a classic problem in computer vision (Thrun, 1996; Fei-Fei et al., 2006). Many algorithms have been proposed to address this problem (Vinyals et al., 2016; Wang & Hebert, 2016; Finn et al., 2017; Snell et al., 2017), including recent efforts on leveraging generative models (Li et al., 2015; Wang et al., 2018; Schwartz et al., 2018; Zhang et al., 2018; Tsutsui et al., 2019; Chen et al., 2019b; Li et al., 2019; Zhang et al., 2019; Sun et al., 2019). A hallucinator is introduced to generate additional examples in a pre-trained feature space as data augmentation to help with low-shot classification (Wang et al., 2018). MetaGAN improves few-shot recognition by producing fake images as a new category (Zhang et al., 2018). However, these methods either do not synthesize images directly or use a pre-trained generative model that is not optimized towards the downstream task. By contrast, our approach performs joint training of recognition and view synthesis, and enables the two tasks to cooperate through feedback connections. In addition, while there has been work considering both classification and exemplar generation in the few-shot regime, such investigation focuses on simple domains like handwritten characters (Lake et al., 2015), whereas we address more realistic scenarios with natural images. Note that our effort is largely orthogonal to



Figure 1: Left: Given a single image of a novel visual concept (e.g., a gadwall), a person can generalize in various ways, including imagining what this gadwall would look like from different viewpoints (top) and recognizing new gadwall instances (bottom). Right: Inspired by this, we introduce a general feedback-based bowtie network that facilitates the interaction and cooperation between a generative module and a discriminative module, thus simultaneously addressing few-shot recognition and novel-view synthesis in the low-data regime.

