ISOMETRIC PROPAGATION NETWORK FOR GENERALIZED ZERO-SHOT LEARNING

Abstract

Zero-shot learning (ZSL) aims to classify images of an unseen class based only on a few attributes describing that class, without access to any training samples. A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images using the seen classes and their data, so that an unseen-class image can ideally be mapped to its corresponding class attributes. The key challenge is how to align the representations in the two spaces. In most ZSL settings, the attributes of each seen/unseen class are represented by only a single vector, while the seen-class data provide much richer information. This imbalance in supervision between the semantic and visual spaces can cause the learned mapping to easily overfit to the seen classes. To resolve this problem, we propose the Isometric Propagation Network (IPN), which learns to strengthen the relations between classes within each space and to align the class dependencies across the two spaces. Specifically, IPN learns to propagate the class representations on an auto-generated graph within each space. Instead of aligning only the resulting static representations, we regularize the two dynamic propagation procedures to be isometric in terms of the two graphs' per-step edge weights by minimizing a consistency loss between them. IPN achieves state-of-the-art performance on three popular ZSL benchmarks. To evaluate the generalization capability of IPN, we further build two larger benchmarks with more diverse unseen classes and demonstrate the advantages of IPN on them.

1. INTRODUCTION

One primary challenge on the path from artificial intelligence to human-level intelligence is to improve the generalization capability of machine learning models to unseen problems. While most supervised learning methods focus on generalization to unseen data from the training tasks/classes, zero-shot learning (ZSL) (Larochelle et al., 2008; Lampert et al., 2014; Xian et al., 2019a) has a more ambitious goal: generalization to new tasks and unseen classes. In the context of image classification, given images from some training classes, ZSL aims to classify new images from unseen classes with zero training images available. Without training data, it is impossible to directly learn a mapping from the input space to the unseen classes. Hence, recent works learn to map between the semantic space and the visual space (Zhu et al., 2019a; Li et al., 2017; Jiang et al., 2018) so that the query image representation and the class representation can be projected into a shared space for comparison. However, learning to align the representations usually leads to overfitting on seen classes (Liu et al., 2018). One reason is that in most zero-shot learning settings, each class has hundreds of images while its semantic information is limited to a single vector. The mapping function thus easily overfits to the seen classes when trained on this small set of attributes, and the learned model typically has imbalanced performance: strong when predicting samples from the seen classes but weak on samples from the unseen classes. In this paper, we propose the "Isometric Propagation Network (IPN)", which learns to dynamically interact between the visual space and the semantic space. Within each space, IPN uses an attention module to generate a category graph and then performs multiple steps of propagation on the graph.
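As a rough illustration of this idea, one propagation step within a single space can be sketched as follows. This is a minimal sketch, not the paper's exact formulation: the function name `propagate_step`, the use of cosine similarity, and the temperature parameter are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def propagate_step(prototypes, temperature=1.0):
    """One propagation step over an attention-generated category graph.

    prototypes: (C, d) tensor of class prototypes in one space
                (visual or semantic).
    Returns the updated prototypes and the (C, C) attention matrix,
    whose entries serve as the edge weights of the category graph.
    """
    # Pairwise similarity between class prototypes; cosine similarity
    # is one plausible choice, the paper's attention module may differ.
    sim = F.cosine_similarity(prototypes.unsqueeze(1),
                              prototypes.unsqueeze(0), dim=-1)
    # Normalize each row into an attention distribution (edge weights).
    attn = F.softmax(sim / temperature, dim=-1)
    # Aggregate neighbor information along the weighted edges.
    updated = attn @ prototypes
    return updated, attn
```

Running several such steps in each space yields the dynamic propagation procedures whose per-step edge weights are later aligned across the two spaces.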
In every step, the distributions of the attention scores generated in the two spaces are regularized to be isometric, which implicitly aligns the relationships between classes across the spaces. Because the generated graphs exhibit diverse motifs, the regularizer enforcing isometry between the two distributions receives sufficient training supervision and can potentially generalize better to unseen classes. To obtain more diverse graph motifs, we also apply episodic training, which samples a different subset of classes as the graph nodes in every training episode rather than always using the whole set of seen classes. To evaluate IPN, we compare it with several state-of-the-art ZSL methods on three popular ZSL benchmarks: AWA2 (Xian et al., 2019a), CUB (Welinder et al., 2010), and aPY (Farhadi et al., 2009). To test generalization to more diverse unseen classes, rather than classes similar to the seen ones (e.g., all animals in AWA2 and all birds in CUB), we also evaluate IPN on two new large-scale datasets extracted from tieredImageNet (Ren et al., 2018): tieredImageNet-Mixed and tieredImageNet-Segregated. IPN consistently achieves state-of-the-art performance in terms of the harmonic mean of the accuracies on unseen and seen classes. We present an ablation study of each component of our model and visualize the final learned representations in each space in Figure 1. The class representations learned by IPN in each space are evenly distributed and transition smoothly between the seen and unseen classes, and the two spaces reflect each other.
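A consistency loss between the two per-step attention distributions could be sketched as below. The symmetric KL divergence used here is an assumption for illustration; the paper's actual consistency loss may take a different form.

```python
import torch

def isometry_loss(attn_visual, attn_semantic, eps=1e-8):
    """Consistency loss encouraging the per-step edge-weight
    distributions of the visual and semantic category graphs to match.

    attn_visual, attn_semantic: (C, C) row-stochastic attention matrices
    from the same propagation step in the two spaces.
    Uses a symmetric KL divergence (one plausible choice, assumed here).
    """
    p = attn_visual.clamp_min(eps)
    q = attn_semantic.clamp_min(eps)
    kl_pq = (p * (p / q).log()).sum(dim=-1)  # KL(p || q) per row
    kl_qp = (q * (q / p).log()).sum(dim=-1)  # KL(q || p) per row
    return 0.5 * (kl_pq + kl_qp).mean()
```

Minimizing this term at every propagation step, rather than only matching the final static representations, is what ties the two dynamic procedures together.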

2. RELATED WORKS

Most ZSL methods can be categorized as either discriminative or generative. Discriminative methods generate classifiers for unseen classes from their attributes, either by learning a mapping from an input image's visual features to the associated class's attributes on training data of seen classes (Frome et al., 2013; Akata et al., 2015a;b; Romera-Paredes & Torr, 2015; Kodirov et al., 2017; Socher et al., 2013; Cacheux et al., 2019; Li et al., 2019b), or by relating similar classes through their shared properties, e.g., semantic features (Norouzi et al., 2013; Chi et al., 2021)



Figure 1: t-SNE (Maaten & Hinton, 2008) of the prototypes produced by IPN on AWA2. Blue/red points represent unseen/seen classes.

or phantom classes (Changpinyo et al., 2016), etc. In contrast, generative methods employ recently popular image generation models (Goodfellow et al., 2014) to synthesize images of unseen classes, which reduces ZSL to supervised learning (Huang et al., 2019) but usually increases the training cost and model complexity. In addition, unsupervised learning (Fan et al., 2021; Liu et al., 2021), meta learning (Santoro et al., 2016; Liu et al., 2019b;a; Ding et al., 2020; 2021), and architecture design (Dong et al., 2021; Dong & Yang, 2019) can also be used to improve the representation ability of visual or semantic features. Li et al. (2017) introduce how to optimize the semantic space so that the two spaces can be better aligned. Jiang et al. (2018); Liu et al. (2020) try to map the class representations from the two spaces into another common space. In comparison, our IPN aligns the two spaces not by feature transformation but by regularizing the isometry of the intermediate outputs of the two propagation models.

Graph in ZSL. Graph structures have already been explored for their benefits to some discriminative ZSL models, e.g., Wang et al. (2018); Kampffmeyer et al. (2019) learn to generate the weights of a fully-connected classifier for a given class based on its semantic features computed on a pre-

