REFINING VISUAL REPRESENTATION FOR GENERALIZED ZERO-SHOT RECOGNITION THROUGH IMPLICIT-SEMANTICS-GUIDED METRIC LEARNING

Abstract

Deep metric learning (DML) is effective at addressing the large intra-class and small inter-class variation problem in visual recognition; however, when applied to generalized zero-shot learning (GZSL), in which the label of a target image may belong to an unseen category, this technique is easily biased towards seen classes. In GZSL, on the other hand, some form of semantic space is available, which plays an important role in relating seen and unseen classes and is widely used to guide the learning of visual representations. To take advantage of DML while avoiding overfitting to seen classes, we propose a novel representation learning framework, Metric Learning with Implicit Semantics (MLIS), that refines discriminative and generalizable visual features for GZSL. Specifically, we disentangle the effects of semantics on the feature extractor and the classifier of the model, so that semantics participate only in feature learning, while classification uses only the refined visual features. We further relax the visual-semantic alignment requirement, avoiding pair-wise comparisons between the image and the class embeddings. Experimental results demonstrate that the proposed MLIS framework bridges DML and GZSL. It achieves state-of-the-art performance, and is robust and flexible in its integration with several metric-learning-based loss functions.

1. INTRODUCTION

Real-world recognition problems may not have all classes defined during training. Generalized zero-shot learning (GZSL) therefore aims to leverage third-party data (e.g., attributes, semantic descriptors) to recognize samples from both seen and unseen classes Socher et al. (2013); Chao et al. (2016); Pourpanah et al. (2022). Accordingly, a typical dataset for studying this problem is divided into two disjoint class sets: seen and unseen Xian et al. (2018a). Only samples of the seen classes are available for training the image recognition model; however, samples of both seen and unseen classes may appear during inference. Third-party data involving class-level semantic descriptors such as attributes are important in GZSL to relate seen and unseen classes. Because the visual data of unseen classes are absent in the training stage, the knowledge learned from the seen classes must be generalized to unseen classes through semantic information. Whether or not a zero-shot setting is applied, an image recognition task can be categorized as fine-grained or coarse-grained, based on the amount of inter-class variation in visual appearance. Fine-grained recognition is considered more difficult than coarse-grained recognition due to the subtle differences between classes. Moreover, the large intra-class variation in fine-grained recognition, often neglected in current studies, poses additional challenges to the task. Figure 1 displays a few samples from the CUB benchmark Wah et al. (2011). Some samples look quite different from other samples in the same class. Such large intra-class variation in appearance is inevitable because factors such as migration and molt may affect how birds change their colors.

Deep metric learning (DML) offers a natural solution to the large intra-class and small inter-class variance problem Hoffer & Ailon (2015); Wang & Chen (2017); Wang et al. (2019); Sun et al. (2020). It provides a flexible similarity measurement of data points. Each sample can have
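As a minimal illustration of the kind of similarity-based objective DML provides (a generic sketch of the triplet loss of Hoffer & Ailon (2015), not the MLIS objective proposed in this paper), the loss below pulls an anchor embedding toward a same-class positive and pushes it away from a different-class negative by at least a margin; all embeddings and the margin value here are hypothetical toy values:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: encourages the anchor-positive squared
    distance to be smaller than the anchor-negative squared distance
    by at least `margin`."""
    d_pos = float(np.sum((anchor - positive) ** 2))  # same-class distance
    d_neg = float(np.sum((anchor - negative) ** 2))  # cross-class distance
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings: the positive lies near the anchor, the negative far away.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])

easy = triplet_loss(a, p, n)  # margin already satisfied -> loss 0
hard = triplet_loss(a, n, p)  # roles swapped -> large positive loss
```

Because the loss only constrains relative distances between samples, it accommodates large intra-class variation: samples of one class need not collapse to a single point, only stay closer to each other than to samples of other classes.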

