ADAPTIVE AND GENERATIVE ZERO-SHOT LEARNING

Abstract

We address the problem of generalized zero-shot learning (GZSL), where the task is to predict the class label of a target image regardless of whether that label belongs to the seen or unseen category. As in ZSL, the learning setting assumes that all class-level semantic features are given, while only images of seen classes are available for training. By exploring the correlation between image features and the corresponding semantic features, the main idea of the proposed approach is to enrich the semantic-to-visual (S2V) embeddings via a seamless fusion of adaptive and generative learning. To this end, we extend the semantic features of each class with image-adaptive attention so that the learned S2V embedding can account for not only inter-class but also intra-class variations. In addition, to move beyond training with images only from seen classes, we design a generative scheme that simultaneously generates virtual class labels and their visual features by sampling and interpolating over seen counterparts. At inference, a testing image gives rise to two different S2V embeddings, seen and virtual. The former decides whether the underlying label is of the unseen category or otherwise a specific seen class; the latter predicts an unseen class label. To demonstrate the effectiveness of our method, we report state-of-the-art results on four standard GZSL datasets, including an ablation study of the proposed modules.

1. INTRODUCTION

Different from conventional learning tasks, zero-shot learning (ZSL) (Lampert et al., 2009; Palatucci et al., 2009; Akata et al., 2013) explores the extreme case of performing inference only over samples of unseen classes. To make the scenario more realistic, generalized zero-shot learning (GZSL) (Chao et al., 2016; Xian et al., 2017) was subsequently proposed so that inference can concern samples of both seen and unseen classes. Nevertheless, the learning setting in ZSL/GZSL is essentially the same: sample classes are divided into two categories, seen and unseen, but only samples of seen classes are accessible during training. In addition, every class under consideration is characterized by semantic features such as attributes (Xian et al., 2018b) or text descriptions (Zhu et al., 2018) to specify and relate seen and unseen classes. The lack of training samples from unseen classes has prompted generative approaches (Chen et al., 2018; Felix et al., 2018; Kumar Verma et al., 2018; Mishra et al., 2018) to create synthetic data from the semantic features of unseen classes. This strategy can implicitly enable learning semantic-visual alignment on unseen classes, and thus improve the ability to classify them. However, such generative models are in fact trained on seen samples, and the quality of synthesized unseen samples is predominantly influenced by the seen classes. If the number of training samples per seen class is small, it is hard for generative models to adequately synthesize samples of unseen classes, leading to unsatisfactory zero-shot learning. To better address this issue, we propose to synthesize visual and semantic features of virtual classes rather than those of the unseen classes. An interesting analogy comes from childhood experience and related study (Greene, 1995), which suggest that using human imagination to produce new object concepts can assist our cognitive capability.
To mimic how people use imagination to explore new knowledge, we create virtual classes by integrating past "experience" (seen classes). In detail, we extend the mixup technique of Zhang et al. (2018) to generate virtual classes, with the subtle difference that mixing is conducted on the semantic features (in addition to the visual ones), instead of the class label vectors. In ZSL/GZSL, each seen or unseen class is typically described by a single semantic feature vector. This practice is useful for differentiating classes in a principled way, but may not be sufficient to reflect the inter-class and intra-class visual discrepancies, not to mention the ambiguities caused by different backgrounds, view orientations, or occlusions in images. The concern of insufficient class-level representation can also be observed from how the semantic feature vectors are constructed. Take, for example, the Attribute Pascal and Yahoo (aPY) dataset (Farhadi et al., 2009), where each instance is annotated with 64 attributes. The semantic features of each class in aPY are obtained by averaging the attribute vectors of all its instances. We are thus motivated to introduce an image-adaptive class representation, integrating the original semantic features for inter-class discrimination with an image-specific attention vector for intra-class variations. With the addition of virtual training data and the image-adaptive class representation, our method learns two classification experts: one for seen classes and the other for unseen classes. Both experts project the image-adaptive semantic feature vectors to the visual space and use cosine similarity to find the class label most similar to the given visual feature vector. The seen expert is trained with the provided training (seen) data, while its class prediction is over all possible classes, both seen and unseen. In inference, if the predicted class is not within the seen category, the testing sample is deemed to be from the unseen category, and its label is then decided by the unseen expert. The unseen expert is trained with the virtual data only, and the process indeed resembles meta-learning. However, the effectiveness of meta-learning is boosted by the design of the image-adaptive mechanism in that no fine-tuning is needed when performing zero-shot classification over unseen classes. We characterize the main contributions of this work as follows.
• Instead of generating synthetic data of unseen classes, we propose to yield virtual classes and data by mixup interpolations. The virtual classes with their synthetic data can then be seamlessly coupled with meta-learning to improve inference on unseen testing samples.
• We introduce the concept of representing each class with image-adaptive semantic features that can vary among intra-class samples. While the adaptive mechanism improves classification of the seen classes, it also boosts the effect of meta-learning over virtual data for zero-shot inference on unseen classes.
• We demonstrate state-of-the-art zero-shot learning results on four popular benchmark datasets and justify the design of our method with a thorough ablation study.
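To make the two core operations concrete, the following is a minimal NumPy sketch of (i) mixup-style virtual class generation, interpolating visual and semantic features of seen-class pairs rather than one-hot labels, and (ii) the cosine-similarity S2V classification used by each expert. All function names, the Beta-distributed mixing coefficient, and the single linear projection W are illustrative assumptions; the paper's image-adaptive attention and the actual network architecture are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_virtual_classes(visual, semantic, alpha=0.2, n_virtual=4, rng=rng):
    """Create virtual classes by interpolating pairs of seen classes.

    visual:   (C, d_v) one prototype visual feature per seen class
    semantic: (C, d_s) one semantic (attribute) vector per seen class
    Unlike standard mixup, the interpolation is also applied to the
    semantic vectors, not to one-hot label vectors (an assumption on
    how the paper's scheme could be realized).
    """
    C = visual.shape[0]
    vis_out, sem_out = [], []
    for _ in range(n_virtual):
        i, j = rng.choice(C, size=2, replace=False)  # pick two seen classes
        lam = rng.beta(alpha, alpha)                 # mixing coefficient
        vis_out.append(lam * visual[i] + (1.0 - lam) * visual[j])
        sem_out.append(lam * semantic[i] + (1.0 - lam) * semantic[j])
    return np.stack(vis_out), np.stack(sem_out)

def s2v_cosine_predict(x, semantic, W):
    """Project each class's semantic vector into visual space with W and
    return the index of the class whose embedding is most cosine-similar
    to the visual feature x (the decision rule of each expert)."""
    proto = semantic @ W                                   # (C, d_v)
    proto = proto / np.linalg.norm(proto, axis=1, keepdims=True)
    x = x / np.linalg.norm(x)
    return int(np.argmax(proto @ x))
```

In this sketch the seen expert would apply `s2v_cosine_predict` over all class semantics, while the unseen expert is trained only on the output of `mixup_virtual_classes`, mimicking the meta-learning episodes described above.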

2. RELATED WORK

We review relevant literature in this section. First, we describe generative approaches for ZSL/GZSL that synthesize unseen-class images for training; to improve GZSL performance, we instead propose to couple virtual class generation with meta-learning to mimic the inference scenario. Next, we discuss attention approaches that extract discriminative features from images to help classification.

2.1. GENERATIVE APPROACHES FOR ZSL/GZSL

Arguably one of the most important problems in ZSL/GZSL is to prevent models from being biased toward seen classes. Generative approaches (Chen et al., 2018; Felix et al., 2018; Kumar Verma et al., 2018; Mishra et al., 2018; Schonfeld et al., 2019; Paul et al., 2019; Xian et al., 2019) mitigate this bias by synthesizing samples of unseen classes from their semantic features. One key issue behind generative approaches comes from the insufficient amount of data to learn a good generative model. As a consequence, some semantic features that seem important during training may cause overfitting, while others that seem less important may be completely dropped. Therefore, several prior techniques propose new constraints or losses to preserve semantic features and regularize the generative model. For instance, Chen et al. (2018) avoid the dropping of semantic information by disentangling the semantic space into two subspaces, one for classification and the

