ATTRIBUTE ALIGNMENT AND ENHANCEMENT FOR GENERALIZED ZERO-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes, which challenges the generalization ability of a model. In this paper, we propose a novel approach that fully exploits attribute information, referred to as the attribute alignment and enhancement (A3E) network. It contains two modules. First, the attribute localization (AL) module uses the supervision of class attribute vectors to guide the visual localization of attributes through the implicit localization capability of the feature extractor, yielding the visual features corresponding to the attributes (attribute-visual features). Second, the enhanced attribute scoring (EAS) module employs the supervision of attribute word vectors (attribute semantics) to project the input attribute-visual features into the attribute semantic space using a Graph Attention Network (GAT). Based on the constructed attribute relation graph (ARG), the EAS module generates an enhanced representation of the attributes. Experiments on standard datasets demonstrate that the enhanced attribute representation greatly improves classification performance, helping A3E achieve state-of-the-art results in both the ZSL and GZSL tasks.

1. INTRODUCTION

Zero-shot learning (ZSL) aims to recognize unseen classes that have not appeared during the training phase. A common solution resorts to auxiliary information to bridge the gap between the seen and unseen domains and thus transfer knowledge from the seen classes to the unseen ones. Semantics are the most frequently used auxiliary information for ZSL, in the form of class descriptions, word vectors (Mikolov et al., 2013), or attributes (Farhadi et al., 2009). A general paradigm (Xie et al., 2019; Zhu et al., 2019; Huynh & Elhamifar, 2020b; Min et al., 2020; Xie et al., 2020; Ge et al., 2021; Liu et al., 2021b; Chen et al., 2021b; 2022) is to learn a mapping that projects visual features of seen samples into an embedding space so that they align with semantic attributes. Under the assumption that the seen and unseen domains share the same attribute space, knowledge learned from seen classes transfers readily to the unseen ones. The subsequent classification is then accomplished by measuring compatibility scores between the projected features and the attribute prototypes. Recent embedding works turn to local features of image parts, i.e., part-based embedding methods (Elhoseiny et al., 2017), to learn discriminative features suitable for classification. By comparison, generative methods (Xian et al., 2019b; Huynh & Elhamifar, 2020a; Ma & Hu, 2020; Han et al., 2021; Chen et al., 2021a; c; Chou et al., 2021) use the semantic information of unseen classes to synthesize unseen visual features with a generative model, such as a generative adversarial network (GAN) (Goodfellow et al., 2020) or a variational autoencoder (VAE) (Kingma & Welling, 2013), thereby converting zero-shot classification into conventional supervised learning trainable on the generated samples.
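The embedding paradigm described above (project visual features into attribute space, then score compatibility against class attribute prototypes) can be sketched as follows. This is a minimal illustration: the dimensions, the random projection `W`, and the dot-product compatibility function are assumptions for the example, not the formulation of any specific cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, C = 512, 85, 10                 # visual dim, number of attributes, number of classes
W = rng.normal(size=(K, D)) * 0.01    # learned visual-to-attribute projection (random here)
A = rng.random((C, K))                # class attribute prototypes, one row per class

def classify(visual_feature: np.ndarray) -> int:
    """Project a visual feature into attribute space and return the class
    whose attribute prototype yields the highest compatibility score."""
    projected = W @ visual_feature    # (K,) predicted attribute scores
    scores = A @ projected            # dot-product compatibility with each class prototype
    return int(np.argmax(scores))

pred = classify(rng.normal(size=D))
assert 0 <= pred < C
```

Because seen and unseen classes are assumed to share the attribute space, the same `W` can score an unseen class simply by adding its prototype as a new row of `A`.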
However, the features inferred from semantic information are mostly high-level visual representations, which are often non-discriminative for class recognition (Huynh & Elhamifar, 2020b; Xian et al., 2019b; Huynh & Elhamifar, 2020a). Recently, generalized zero-shot learning (GZSL) has received increasing attention for its rigorous and realistic setting, in which seen and unseen classes together constitute the testing space. Embedding methods are inherently disadvantaged in GZSL since model training relies solely on samples of seen classes and thus inevitably biases predictions towards the seen ones. Moreover, the visual-semantic alignment in embedding models operates only in the seen domain, and the visual divergence between seen and unseen domains may strengthen this bias, a phenomenon known as domain shift (Fu et al., 2014). Different methods have been explored to improve model performance in GZSL. Some studies try to mitigate the bias by introducing loss constraints that calibrate the output prediction probabilities, which usually require unseen semantics as side information (Huynh & Elhamifar, 2020b; Xie et al., 2020). Parts Relation Reasoning is used in RGEN (Xie et al., 2020) to capture appearance relationships among image parts, which is believed to be a complementary cue for improving performance. GCNZ (Veličković et al., 2018) utilizes class relationships to infer classifier parameters directly from a knowledge graph. Relation learning is no novelty in ZSL; however, the semantic relationships between attributes have rarely been explored in previous works. Huynh & Elhamifar (2020b), by introducing word vectors of attributes, showed that there is a wealth of semantic information in attributes beyond the commonly used class attribute vectors. There are also rich semantic relationships between attributes, which can be transferred to the visual domain to help mitigate the visual-semantic gap.
Once the relations between attributes are modeled, it is possible to enhance the final class prediction through the interplay of attributes. Existing methods have tried to capture the semantic relations among attributes, for example by entangling a CNN and a GCN over a knowledge graph of attributes (Hu et al., 2022). However, although the graph nodes in those methods are explicitly defined as attributes, they lack a mechanism to accurately align the nodes to their corresponding attributes. To the best of our knowledge, the fusion of relation learning and the attention mechanism has not been studied in ZSL. Accordingly, we propose an attribute alignment and enhancement (A3E) network for GZSL, which incorporates an attribute alignment (AA) pipeline and an attribute enhancement (AE) module. The AA pipeline consists of an attribute localization (AL) module and an attribute scoring (AS) module; this novel approach to attribute alignment allows the model to precisely capture the visual features corresponding to attributes (attribute-visual features, AVFs) by fully utilizing attribute knowledge (both class attribute vectors and attribute word vectors). Compared to previous part-based methods that require complex accessories such as attention modules and part detectors, A3E reduces its AA pipeline to a single convolutional layer with a single linear transformation, yet still delivers competitive results. Most importantly, the resulting AVFs serve as the carriers for the attributes and support the subsequent attribute enhancement process. To model the relations among attributes, the AE module first constructs an attribute-relation graph (ARG), in which attribute relationships are quantified as graph edges; facilitated by graph neural networks, it then embeds the input AVFs into the attribute semantic space. The enhanced attribute features are obtained from the outputs of the graph nodes. Figure 1 illustrates the basic process of the AE module.
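To make the AE step concrete, the following single-head, NumPy-only graph-attention layer sketches how enhanced attribute representations could be aggregated from AVFs over an ARG. The random adjacency matrix, the dimensions, and the single attention head are assumptions for the example; they do not reproduce the paper's actual ARG construction or GAT configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D_v, D_s = 5, 64, 32                         # attributes, visual dim, semantic dim

avf = rng.normal(size=(K, D_v))                 # attribute-visual features, one row per attribute
adj = (rng.random((K, K)) > 0.5).astype(float)  # hypothetical ARG adjacency (1 = related)
np.fill_diagonal(adj, 1.0)                      # self-loops so every node attends to itself

W = rng.normal(size=(D_v, D_s)) * 0.1           # shared projection into semantic space
a = rng.normal(size=2 * D_s) * 0.1              # attention vector, split into source/target halves

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj):
    """One graph-attention head: each attribute node aggregates its ARG
    neighbours, weighted by attention coefficients (random weights here)."""
    z = h @ W                                    # (K, D_s) node embeddings
    left, right = z @ a[:D_s], z @ a[D_s:]       # per-node attention terms
    e = leaky_relu(left[:, None] + right[None, :])  # pairwise logits e_ij
    e = np.where(adj > 0, e, -np.inf)            # keep only edges present in the ARG
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # row-wise softmax over neighbours
    return alpha @ z                             # enhanced attribute representations

enhanced = gat_layer(avf, adj)
assert enhanced.shape == (K, D_s)
```

In a trained model `W`, `a`, and the ARG edge weights would be learned or derived from attribute word vectors, and several such layers (or heads) would typically be stacked.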
Experiments on three standard ZSL datasets show that A3E reaches state-of-the-art results in both ZSL and GZSL without extra information from unseen classes or auxiliary constraints on the output probabilities, verifying the advantages of our proposed method. Our contributions can be summarized as follows:

• A novel attribute enhancement (AE) module is created to explicitly model the relationships between attributes, generating an enhanced attribute representation with attribute relations modeled inside.

• To align graph nodes with attributes, an efficient attribute alignment (AA) pipeline is designed to generate the visual features corresponding to attributes, namely attribute-visual features (AVFs).

• We propose an attribute alignment and enhancement (A3E) network built on the AA pipeline and the AE module, an innovative combination of the attention mechanism and semantic-relation learning.



Figure 1: Attribute Localization.

