ZERO-SHOT RECOGNITION THROUGH IMAGE-GUIDED SEMANTIC CLASSIFICATION

Abstract

We present a new visual-semantic embedding method for generalized zero-shot learning. Existing embedding-based methods aim to learn the correspondence between an image classifier (visual representation) and its class prototype (semantic representation) for each class. Inspired by the binary relevance method for multilabel classification, we instead learn the mapping between an image and its semantic classifier. Given an input image, the proposed Image-Guided Semantic Classification (IGSC) method generates a label classifier that is applied to all label embeddings to determine whether each label belongs to the input image. The semantic classifier is therefore image-conditioned and is generated during inference. We also show that IGSC is a unifying framework for two state-of-the-art deep embedding methods. We validate our approach on four standard benchmark datasets.

1. INTRODUCTION

As a feasible solution for addressing the limitations of supervised classification methods, zero-shot learning (ZSL) aims to recognize objects whose instances have not been seen during training (Larochelle et al., 2008; Palatucci et al., 2009). Unseen classes are recognized by associating seen and unseen classes through some form of semantic space; the knowledge learned from seen classes is thereby transferred to unseen classes. In the semantic space, each class has a corresponding vector representation called a class prototype. Class prototypes can be obtained from human-annotated attributes that describe visual properties of objects (Farhadi et al., 2009; Lampert et al., 2014) or from word embeddings learned in an unsupervised manner from text corpora (Mikolov et al., 2013; Pennington et al., 2014; Devlin et al., 2018). A majority of ZSL methods can be viewed through the visual-semantic embedding framework, as displayed in Figure 1(a). Images are mapped from the visual space to the semantic space in which all classes reside, or images and labels are projected to a latent space (Yang & Hospedales, 2015; Liu et al., 2018). Inference is then performed in this common space (Akata et al., 2013; Frome et al., 2013; Socher et al., 2013), typically using cosine similarity or Euclidean distance. Another perspective on embedding-based methods is to construct an image classifier for each unseen class by learning the correspondence between a binary one-versus-rest image classifier (i.e., the visual representation of a class) and its class prototype in the semantic space (i.e., the semantic representation of a class) (Wang et al., 2019). Once this correspondence function is learned, a binary one-versus-rest image classifier can be constructed for an unseen class from its prototype (Wang et al., 2019).
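The embedding paradigm above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the linear projection `W`, the toy prototypes, and the class names are all illustrative assumptions; real systems use learned deep features and trained mappings.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def zsl_predict(image_feature, W, prototypes):
    """Map an image into the semantic space with a (learned) matrix W,
    then return the class whose prototype is most similar."""
    z = W @ image_feature  # projected semantic representation of the image
    scores = {c: cosine(z, p) for c, p in prototypes.items()}
    return max(scores, key=scores.get)

# Toy example: 4-d visual features, 3-d semantic space, two unseen classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
prototypes = {"zebra": np.array([1.0, 0.0, 0.0]),
              "horse": np.array([0.0, 1.0, 0.0])}
x = rng.normal(size=4)
pred = zsl_predict(x, W, prototypes)
```

Note that the prototypes of unseen classes never appear during training; only the mapping `W` is learned, which is what enables the transfer to unseen classes.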
For example, a commonly used choice for such correspondence is the bilinear function (Frome et al., 2013; Akata et al., 2013; 2015; Romera-Paredes & Torr, 2015; Li et al., 2018). Considerable efforts have been made to extend the linear function to nonlinear ones (Xian et al., 2016; Wang et al., 2017; Elhoseiny et al., 2017; Qiao et al., 2016). Figure 1(b) illustrates this perspective. Learning the correspondence between an image classifier and a class prototype has the following drawbacks. First, the assumption of a single image classifier per class is restrictive because the way classes are separated in the visual and semantic spaces need not be unique. We argue that semantic classification should be conducted dynamically, conditioned on the input image. For example, the visual attribute wheel may be useful for classifying most car images; nevertheless, cars with missing wheels should also be correctly recognized using other visual attributes. Instance-specific semantic classifiers are therefore preferable to category-specific ones, because the classifier weights can be adaptively determined based on image content. Second, the scale of training data for learning the correspondence is constrained by the number of class labels. In other words, a training set with C labels provides only C visual-semantic classifier pairs for building the correspondence. This may hinder the robustness of deep models, which usually require large-scale training data. Finally, although the class embedding space has rich semantic meaning, each class is represented by only a single class prototype, which inevitably determines where images of that class collapse (MarcoBaroni, 2016; Fu et al., 2015). The semantic representations mapped from images may collapse to hubs, points that are close to many other points in the semantic space, rather than being similar to the true class label (MarcoBaroni, 2016).
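The bilinear correspondence and the "classifier from prototype" view in Figure 1(b) can be sketched as follows. This is a hedged illustration, not the cited methods' code: the function names and the random toy dimensions are assumptions, and F(x, y) = xᵀWy is the standard bilinear compatibility form used by works such as DeViSE and ALE.

```python
import numpy as np

def compatibility(x, y, W):
    """Bilinear compatibility F(x, y) = x^T W y between a visual
    feature x and a class prototype y."""
    return float(x @ W @ y)

def classifier_for_class(y, W):
    """The induced one-versus-rest image classifier for a class with
    prototype y is the fixed weight vector w_c = W y: a new unseen
    class needs only its prototype to obtain a classifier."""
    return W @ y

# Toy shapes: 4-d visual features, 3-d semantic prototypes.
rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=3)
W = rng.normal(size=(4, 3))
```

The two views coincide: scoring an image with the induced classifier `W @ y` gives exactly the bilinear compatibility, which is why learning W amounts to learning a classifier for every class, seen or unseen.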
In this paper, we present a new method, named Image-Guided Semantic Classification (IGSC), to address these problems. IGSC aims to learn the correspondence between an image and its corresponding label classifier, as illustrated in Figure 1(c). In contrast to existing methods focusing on the learning of visual (or semantic) representations (Zhang et al., 2016; Frome et al., 2013; Socher et al., 2013), IGSC analyzes the input image and seeks combinations of variables in the semantic space (e.g., combinations of attributes) that distinguish the class of the input from other classes. The proposed IGSC method has the following characteristics:

• IGSC learns the correspondence between an image in the visual space and a classifier in the semantic space. The correspondence can be learned from training pairs on the scale of the number of training images rather than the number of classes.

• IGSC performs learning to learn in an end-to-end manner. Label classification is conducted by a semantic classifier whose weights are generated on the fly. This model is simple yet powerful because of its adaptive nature.

• IGSC unifies visual attribute detection and label classification. This is achieved via the design of a conditional network (the proposed classifier learning method), in which label classification is the main task of interest and the conditional input image provides additional information about the specific situation.

• IGSC alleviates the hubness problem. The correspondence between an image and a semantic classifier learned from seen classes can be transferred to recognize unseen concepts.

We evaluated IGSC with experiments conducted on four public benchmark datasets: SUN (Patterson & Hays, 2012), CUB (Patterson & Hays, 2012), AWA2 (Lampert et al., 2014), and aPY (Farhadi et al., 2009). Experimental results demonstrate that the proposed method achieves promising performance compared with current state-of-the-art methods.

The remainder of the paper is organized as follows: Section 2 briefly reviews related work. Section 3 presents the proposed framework. Experimental results and conclusions are provided in Sections 4 and 5, respectively.
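The core idea, generating an image-conditioned semantic classifier at inference time, can be sketched in a few lines of numpy. This is a minimal sketch under assumptions, not the paper's architecture: here the classifier generator is a single hypothetical linear layer (`U`, `b`), whereas the actual method may use a deeper conditional network; the label names are illustrative.

```python
import numpy as np

def generate_classifier(image_feature, U, b):
    """Hypothetical generator g: maps an image feature to the weight
    vector of a semantic (label) classifier. The weights thus change
    per image, unlike a fixed per-class classifier."""
    return U @ image_feature + b

def score_labels(image_feature, label_embeddings, U, b):
    """Apply the image-conditioned classifier to every label embedding
    to decide which labels belong to the input image."""
    w = generate_classifier(image_feature, U, b)
    return {c: float(w @ e) for c, e in label_embeddings.items()}

# Toy shapes: 4-d visual features, 3-d label embeddings.
rng = np.random.default_rng(2)
U, b = rng.normal(size=(3, 4)), rng.normal(size=3)
img = rng.normal(size=4)
labels = {"cat": np.array([1.0, 0.0, 0.0]),
          "dog": np.array([0.0, 1.0, 0.0])}
scores = score_labels(img, labels, U, b)
```

Because every training image, not every class, yields a (image, classifier) pair, the correspondence is learned from a dataset whose size scales with the number of images, matching the first characteristic above.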

2. RELATED WORK

Zero-shot learning has evolved rapidly during the last decade, and documenting the extensive literature within limited pages is hardly possible. In this section, we review a few representative zero-shot learning methods and refer readers to (Xian et al., 2019a; Wang et al., 2019) for a comprehensive survey. One pioneering stream of ZSL uses attributes to infer the label of an image belonging to one of the unseen classes (Lampert et al., 2014; Al-Halah et al., 2016; Norouzi et al.,



Figure 1: Zero-shot learning paradigms. (a) Conventional visual-to-semantic mapping trained on classification loss. (b) Another interpretation of visual-to-semantic mapping between visual and semantic representations. (c) The proposed IGSC, aiming to learn the correspondence between an image and a semantic classifier.


