REFINING VISUAL REPRESENTATION FOR GENERALIZED ZERO-SHOT RECOGNITION THROUGH IMPLICIT-SEMANTICS-GUIDED METRIC LEARNING

Abstract

Deep metric learning (DML) is effective at addressing the large intra-class and small inter-class variation problem in visual recognition; however, when applied to generalized zero-shot learning (GZSL), in which the label of a target image may belong to an unseen category, this technique can easily be biased towards seen classes. In GZSL, some form of semantic space is available; it plays an important role in relating seen and unseen classes and is widely used to guide the learning of visual representations. To take advantage of DML while avoiding overfitting to seen classes, we propose a novel representation learning framework, Metric Learning with Implicit Semantics (MLIS), to refine discriminative and generalizable visual features for GZSL. Specifically, we disentangle the effects of semantics on the feature extractor and the classifier, so that semantics participate only in feature learning, and classification uses only the refined visual features. We further relax the visual-semantic alignment requirement, avoiding pair-wise comparisons between the image and the class embeddings. Experimental results demonstrate that the proposed MLIS framework bridges DML and GZSL. It achieves state-of-the-art performance, and is robust and flexible in its integration with several metric-learning-based loss functions.

1. INTRODUCTION

Real-world recognition problems may not have all classes defined during training. Generalized zero-shot learning (GZSL) therefore aims to leverage third-party data (e.g., attributes, semantic descriptors) to recognize samples from both seen and unseen classes Socher et al. (2013); Chao et al. (2016); Pourpanah et al. (2022). A typical dataset for studying this problem is divided into two class sets, seen and unseen, with no intersection between them Xian et al. (2018a). Only samples of the seen classes are available for training the image recognition model; however, samples of both seen and unseen classes may appear during inference. The third-party data, involving class-level semantic descriptors such as attributes, are important in GZSL for relating seen and unseen classes. Because the visual data of unseen classes are absent in the training stage, the knowledge learned from the seen classes must be generalized to recognize an unseen class through semantic information. Whether or not a zero-shot setting is applied, an image recognition task can be categorized as fine-grained or coarse-grained, based on the amount of inter-class variation in visual appearance. Fine-grained recognition is considered more difficult than coarse-grained recognition due to the subtle differences between classes. Nevertheless, the large intra-class variation in fine-grained recognition, often neglected in current studies, poses additional challenges. Figure 1 displays a few samples from the CUB benchmark Wah et al. (2011). Some samples look quite different from other samples in the same class. Such large intra-class variation in appearance is inevitable because factors such as migration and molt affect how birds change their colors.

Deep metric learning (DML) offers a natural solution to the large intra-class and small inter-class variance problem Hoffer & Ailon (2015); Wang & Chen (2017); Wang et al. (2019); Sun et al. (2020). It provides a flexible similarity measurement between data points, and each sample can incur a different penalty when updating the model. By optimizing a contrastive loss over positive pairs (intra-class) and negative pairs (inter-class), a model can leverage class-wise supervision to learn embeddings and penalize hard samples more heavily. Furthermore, this technique can be applied to large-scale, dynamic, and open-ended image datasets, and can be extended to new classes with limited effort.

The semantic information in GZSL has been used to generate visual features for unseen classes Xian et al. (2018b; 2019), as well as to guide the learning of discriminative visual features Ji et al. (2018); Zhu et al. (2019); Li & Yeh (2021). However, when DML is applied, a model can easily overfit to the seen classes despite the merits mentioned above Bucher et al. (2016). Furthermore, a broad family of GZSL methods learns a joint embedding space, in which classification is performed by directly comparing the embedded data points with the class prototypes Xian et al. (2018a). Learning such embedding functions can be difficult, because image features are extracted by a visual model pre-trained on ImageNet, whereas class prototypes are human-annotated attributes or word embeddings learned from text corpora. The visual and semantic feature vectors may therefore reflect inconsistent inter-class and intra-class discrepancies.

The difficulty is exacerbated by the integration of generative methods that create visual features for unseen classes, because the distribution of synthesized features is less predictable. An image classification framework is composed of a feature extractor and a classifier Krizhevsky et al. (2012); He et al. (2016). Previous works typically use semantics in both feature extraction and classification Hu et al. (2020); Liu et al. (2021); Chen et al. (2021a); Chandhok & Balasubramanian (2021). As a result, the model is forced to align the visual and semantic spaces, which may be difficult because of the modality gap mentioned above. Furthermore, semantics are used both in synthesizing visual features and in learning embedding functions, which may introduce serious bias towards seen classes. To better leverage the semantic information, a viable solution is to refine visual embeddings with semantics, while performing classification based only on visual features.

Therefore, we present a novel representation learning framework, named Metric Learning with Implicit Semantics (MLIS), for GZSL. It takes advantage of metric learning to refine discriminative visual features from the original image features, while avoiding overfitting by making good (but not excessive) use of semantics. MLIS decouples the effect of semantics on the feature extractor and the classifier, so that semantics participate only in feature learning, and classification uses only the refined visual features. This decoupling facilitates the training of both tasks. In feature learning we further relax the visual-semantic alignment requirement, avoiding pair-wise comparisons between the image and the class embeddings. To summarize, MLIS has the following characteristics that distinguish it from existing methods:

• Semantic descriptors are given and fixed; they are not trained or fine-tuned. A GZSL model relies on semantics to relate the seen and unseen classes; fixing the semantics therefore reduces model complexity and thereby also reduces the chance of overfitting.

• Semantic descriptors are involved only in training the encoder; they are not used for downstream tasks (e.g., classification, segmentation). The downstream model utilizes only the visual features to perform the task. In this work, semantic information is agnostic to the classification task.

• The entire framework learns only to refine visual features, and semantic descriptors implicitly affect the learning of visual features. We pair an input visual feature vector only with its own semantic descriptor to compute the loss, not with all semantic descriptors.

• Visual-semantic alignment is not strictly enforced, as we rely only on visual features to perform classification. The MLIS model is optimized to refine visual features so that when they are concatenated with the semantic descriptor of the target class, the metric-learning-based loss is minimized.

We learn semantically meaningful visual features via metric learning from the semantics, without aligning the visual and semantic spaces. We conduct extensive experiments on five benchmark datasets, including CUB Wah et al. (2011), AWA2 Xian et al. (2018a), SUN Patterson & Hays (2012), FLO Nilsback & Zisserman (2008), and aPY Farhadi et al. (2009). We demonstrate the superiority of the proposed method with performance on par with the state of the art. The code will be available upon paper acceptance.
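As context for the metric-learning component discussed above, the following is a minimal sketch of a standard pair-based contrastive loss (the classic formulation over positive and negative pairs, not this paper's exact objective; the function name and margin value are illustrative):

```python
import numpy as np

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Pair-based contrastive loss: positive pairs (same class) are pulled
    together; negative pairs (different classes) are pushed apart until
    their distance reaches the margin."""
    d = np.linalg.norm(z1 - z2)  # Euclidean distance between embeddings
    if same_class:               # intra-class (positive) pair
        return 0.5 * d ** 2
    # inter-class (negative) pair: zero loss once d >= margin
    return 0.5 * max(0.0, margin - d) ** 2

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(contrastive_loss(a, a, same_class=True))   # identical positives: 0.0
print(contrastive_loss(a, b, same_class=False))  # d = sqrt(2) > margin: 0.0
```

Hard samples (close negatives, distant positives) yield larger losses and thus larger gradients, which is how such losses give more penalty to hard samples.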
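The characteristics listed above can be illustrated with a toy sketch; this is only one hypothetical reading of the described pipeline (the linear encoder, the descriptor values, and all names are assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed class-level semantic descriptors (e.g., attributes); never trained.
SEMANTICS = {0: np.array([1.0, 0.0, 0.0]),
             1: np.array([0.0, 1.0, 0.0])}

W = rng.normal(size=(4, 4))  # toy trainable "encoder" refining visual features

def refine(x):
    # Refined visual feature: the ONLY input the downstream classifier sees.
    return np.tanh(W @ x)

def mlis_embedding(x, y):
    # Each sample is concatenated only with its OWN class descriptor
    # (implicit semantics); no comparison against all class embeddings.
    return np.concatenate([refine(x), SEMANTICS[y]])

def metric_loss(x1, y1, x2, y2, margin=1.0):
    # Any metric-learning loss can act on the concatenated embeddings;
    # here, a standard contrastive form for illustration.
    d = np.linalg.norm(mlis_embedding(x1, y1) - mlis_embedding(x2, y2))
    return 0.5 * d ** 2 if y1 == y2 else 0.5 * max(0.0, margin - d) ** 2

x_a, x_b = rng.normal(size=4), rng.normal(size=4)
print(metric_loss(x_a, 0, x_b, 1))  # optimization would update W only; SEMANTICS stay fixed
```

The sketch reflects the decoupling described above: semantics enter only the loss that trains the encoder, while classification would operate on `refine(x)` alone.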

