ATTRIBUTE ALIGNMENT AND ENHANCEMENT FOR GENERALIZED ZERO-SHOT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes, which challenges the generalization ability of a model. In this paper, we propose a novel approach to fully utilize attribute information, referred to as the attribute alignment and enhancement (A3E) network. It consists of two modules. First, the attribute localization (AL) module utilizes the supervision of class attribute vectors to guide visual localization for attributes through the implicit localization capability within the feature extractor, yielding the visual features corresponding to the attributes (attribute-visual features). Second, the enhanced attribute scoring (EAS) module employs the supervision of attribute word vectors (attribute semantics) to project the input attribute-visual features into the attribute semantic space using a Graph Attention Network (GAT). Based on the constructed attribute relation graph (ARG), the EAS module generates enhanced attribute representations. Experiments on standard datasets demonstrate that the enhanced attribute representation greatly improves classification performance, which helps A3E achieve state-of-the-art performance in both ZSL and GZSL tasks.

1. INTRODUCTION

Zero-shot learning (ZSL) aims to recognize unseen classes that have not appeared during the training phase. A common solution resorts to auxiliary information to bridge the gap between the seen and unseen domains and thereby transfer knowledge from the seen classes to the unseen ones. Semantics are the most frequently used auxiliary information for ZSL, in the form of class descriptions, word vectors (Mikolov et al., 2013), or attributes (Farhadi et al., 2009). A general paradigm (Xie et al., 2019; Zhu et al., 2019; Huynh & Elhamifar, 2020b; Min et al., 2020; Xie et al., 2020; Ge et al., 2021; Liu et al., 2021b; Chen et al., 2021b; 2022) is to learn a mapping that projects visual features of seen samples into an embedding space to align with semantic attributes. Under the assumption that the seen and unseen domains share the same attribute space, the knowledge learned from seen classes is easily transferred to the unseen ones. Classification is then accomplished by measuring compatibility scores between the projected features and the attribute prototypes. Recent embedding works turn to local features of image parts, i.e., part-based embedding methods (Elhoseiny et al., 2017), to learn discriminative features that ease classification. In comparison, generative methods (Xian et al., 2019b; Huynh & Elhamifar, 2020a; Ma & Hu, 2020; Han et al., 2021; Chen et al., 2021a; c; Chou et al., 2021) utilize the semantic information of unseen classes to synthesize unseen visual features with a generative model, such as a generative adversarial network (GAN) (Goodfellow et al., 2020) or a variational autoencoder (VAE) (Kingma & Welling, 2013), thereby converting zero-shot classification into a traditional supervised learning problem that can be trained on the generated samples.
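The embedding paradigm described above can be sketched in a few lines: project a visual feature into the attribute space and score it against each class's attribute prototype. The sketch below uses hypothetical dimensions and random data as stand-ins; in practice the projection matrix W is learned on seen-class samples, and the prototypes are the datasets' class attribute vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: e.g. 2048-d CNN features, 85 attributes, 10 classes.
d_visual, d_attr, n_classes = 2048, 85, 10

# Class attribute prototypes (one attribute vector per class); random stand-in.
prototypes = rng.normal(size=(n_classes, d_attr))

# Visual-to-semantic projection; normally learned on seen classes, random here.
W = rng.normal(size=(d_visual, d_attr))

def classify(visual_feature: np.ndarray) -> int:
    """Project a visual feature into attribute space and return the class
    whose attribute prototype yields the highest compatibility score
    (here a simple dot product)."""
    embedded = visual_feature @ W      # shape (d_attr,)
    scores = prototypes @ embedded     # shape (n_classes,)
    return int(np.argmax(scores))

x = rng.normal(size=(d_visual,))
pred = classify(x)
```

Because seen and unseen classes are assumed to share the same attribute space, the same scoring rule applies unchanged to unseen-class prototypes at test time; this is exactly what makes the transfer possible, and also what makes the learned W the critical component.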
However, the features inferred from semantic information are mostly high-level visual representations, which are often not discriminative enough for class recognition (Huynh & Elhamifar, 2020b; Xian et al., 2019b; Huynh & Elhamifar, 2020a). Recently, generalized zero-shot learning (GZSL) has received increasing attention in this field for its rigorous and realistic setting, in which both seen and unseen classes constitute the testing space. Embedding methods are inherently disadvantaged in GZSL since model training relies solely on samples of seen classes, and predictions thus inevitably bias towards the seen ones. Moreover, the visual-semantic alignment in embedding models is performed only in the seen domain, and the visual divergence between the seen and unseen domains may strengthen this bias, a phenomenon known as domain shift (Fu et al., 2014). Different methods have been explored to improve model performance in GZSL. Some studies try to

