ADAPTIVE AND GENERATIVE ZERO-SHOT LEARNING

Abstract

We address the problem of generalized zero-shot learning (GZSL), where the task is to predict the class label of a target image whether that label belongs to the seen or unseen category. As in ZSL, the learning setting assumes that class-level semantic features are given for all classes, while only images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a seamless fusion of adaptive and generative learning. To this end, we extend the semantic features of each class with image-adaptive attention so that the learned S2V embedding can account for not only inter-class but also intra-class variations. In addition, to break the limit of training with images only from seen classes, we design a generative scheme that simultaneously generates virtual class labels and their visual features by sampling and interpolating over seen counterparts. At inference time, a test image gives rise to two different S2V embeddings, seen and virtual. The former decides whether the underlying label is of the unseen category or otherwise a specific seen class; the latter predicts an unseen class label. To demonstrate the effectiveness of our method, we report state-of-the-art results on four standard GZSL datasets, including an ablation study of the proposed modules.

1. INTRODUCTION

Different from conventional learning tasks, zero-shot learning (ZSL) (Lampert et al., 2009; Palatucci et al., 2009; Akata et al., 2013) explores the extreme case of performing inference only over samples of unseen classes. To make the scenario more realistic, generalized zero-shot learning (GZSL) (Chao et al., 2016; Xian et al., 2017) was subsequently proposed so that inference can concern samples of both seen and unseen classes. Nevertheless, the learning setting in ZSL/GZSL is essentially the same: sample classes are divided into two categories, seen and unseen, but only samples of seen classes are accessible during training. In addition, each class under consideration is characterized by semantic features such as attributes (Xian et al., 2018b) or text descriptions (Zhu et al., 2018) that specify and relate seen and unseen classes. The lack of training samples from unseen classes has prompted generative approaches (Chen et al., 2018; Felix et al., 2018; Kumar Verma et al., 2018; Mishra et al., 2018) to create synthetic data from the semantic features of unseen classes. This strategy can implicitly enable learning semantic-visual alignment on unseen classes, and thus improve the ability to classify them. However, such generative models are in fact trained on seen samples, so the quality of synthesized unseen samples is predominantly influenced by the seen classes. If the number of training samples per seen class is small, it is hard for generative models to adequately synthesize samples of unseen classes, leading to unsatisfactory zero-shot learning. To better address this issue, we propose to synthesize visual and semantic features of virtual classes rather than those of the unseen classes. An interesting analogy is that childhood experience and relevant study (Greene, 1995) suggest that using human imagination to produce new object concepts can assist our cognitive capability.
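To make the idea of virtual classes concrete, the interpolation step can be sketched as below. This is a minimal illustration, not the paper's exact procedure: the function name, the choice of a Beta-distributed mixing weight, and the use of a single weight shared between the visual and semantic features are all assumptions in the spirit of mixup (Zhang et al., 2018).

```python
import numpy as np

def make_virtual_class(x_a, x_b, s_a, s_b, alpha=0.2, rng=None):
    """Interpolate two seen-class samples into one virtual-class sample.

    x_a, x_b: visual feature vectors of two seen-class images.
    s_a, s_b: class-level semantic feature vectors of their classes.
    A single mixing weight lam ~ Beta(alpha, alpha) is applied to both
    modalities, so the virtual visual and semantic features stay aligned.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_virtual = lam * x_a + (1.0 - lam) * x_b
    s_virtual = lam * s_a + (1.0 - lam) * s_b
    return x_virtual, s_virtual
```

Sampling many such pairs from different seen classes yields a pool of virtual classes whose features lie between those of seen classes, which is the sense in which past "experience" is recombined into new concepts.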
To mimic how people use imagination to explore new knowledge, we create virtual classes by integrating past "experience" (seen classes). Specifically, we extend the mixup technique by Zhang et al. (2018) to

