BOWTIE NETWORKS: GENERATIVE MODELING FOR JOINT FEW-SHOT RECOGNITION AND NOVEL-VIEW SYNTHESIS

Abstract

We propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. While existing work copes with two or more tasks mainly by multi-task learning of shareable feature representations, we take a different perspective. We focus on the interaction and cooperation between a generative model and a discriminative model, in a way that facilitates the flow of knowledge across tasks in complementary directions. To this end, we propose bowtie networks that jointly learn 3D geometric and semantic representations with a feedback loop. Experimental evaluation on challenging fine-grained recognition datasets demonstrates that our synthesized images are realistic from multiple viewpoints and significantly improve recognition performance when used for data augmentation, especially in the low-data regime. Code and pre-trained models are released at https://github.com/zpbao/bowtie_networks.

1. INTRODUCTION

Given a never-before-seen object (e.g., a gadwall in Figure 1), humans are able to generalize even from a single image of this object in different ways, including recognizing new object instances and imagining what the object would look like from different viewpoints. Achieving similar levels of generalization for machines is a fundamental problem in computer vision, and has been actively explored in areas such as few-shot object recognition (Fei-Fei et al., 2006; Vinyals et al., 2016; Wang & Hebert, 2016; Finn et al., 2017; Snell et al., 2017) and novel-view synthesis (Park et al., 2017; Nguyen-Phuoc et al., 2018; Sitzmann et al., 2019). However, such exploration has largely been confined to separate areas with specialized algorithms, rather than pursued jointly. We argue that synthesizing images and recognizing them are inherently interconnected. Being able to simultaneously address both tasks with a single model is a crucial step toward human-level generalization. This requires learning a richer, shareable internal representation that supports more comprehensive object understanding than either task could achieve alone. Such "cross-task" knowledge becomes particularly critical in the low-data regime, where identifying the 3D geometric structure of input images facilitates recognizing their semantic categories, and vice versa. Inspired by this insight, here we propose a novel task of joint few-shot recognition and novel-view synthesis: given only one or few images of a novel object from arbitrary views with only category annotation, we aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints. This joint task is challenging because of its (i) weak supervision, where we do not have access to any 3D supervision, and (ii) few-shot setting, where we need to effectively learn both 3D geometric and semantic representations from minimal data.
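To make the joint task concrete, a common way to couple a discriminative and a generative branch is a weighted sum of a classification loss and a view-synthesis loss. The sketch below is a minimal, hypothetical illustration of such an objective using NumPy; the weighting `lam` and the L1 synthesis term are illustrative assumptions, not the exact losses used by bowtie networks.

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for the recognition branch.
    z = logits - logits.max()            # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def l1_loss(synth, target):
    # Pixel-wise L1 between a synthesized view and a reference image
    # (a stand-in for the view-synthesis objective).
    return np.abs(synth - target).mean()

def joint_loss(logits, label, synth, target, lam=0.1):
    # Hypothetical joint objective: recognition loss plus a
    # synthesis term weighted by lam. In a joint model, gradients
    # from both terms flow into the shared representation.
    return cross_entropy(logits, label) + lam * l1_loss(synth, target)
```

Under this formulation, improving either branch reshapes the shared representation that the other branch consumes, which is the cooperative effect the joint task is designed to exploit.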
While existing work copes with two or more tasks mainly by multi-task learning or meta-learning of a shared feature representation (Yu et al., 2020; Zamir et al., 2018; Lake et al., 2015), we take a different perspective in this paper. Motivated by the nature of our problem, we focus on the interaction and cooperation between a generative model (for view synthesis) and a discriminative model (for recognition), in a way that facilitates the flow of knowledge across tasks in complementary directions, so that the tasks help each other. For example, the synthesized images produced by

