DSP: DYNAMIC SEMANTIC PROTOTYPE FOR GENERATIVE ZERO-SHOT LEARNING

Abstract

Conditional generative models (e.g., generative adversarial networks (GANs)) have advanced zero-shot learning (ZSL). Generative ZSL methods typically synthesize class-specific visual features of unseen classes, conditioned on predefined class semantic prototypes, to mitigate the lack of unseen samples. Because these empirically predefined prototypes cannot faithfully represent the actual semantic prototypes of visual features (i.e., visual prototypes), existing methods are limited in their ability to synthesize visual features that accurately reflect the real features and prototypes. We formulate this phenomenon as a visual-semantic domain shift problem, which prevents conditional generative models from further improving ZSL performance. In this paper, we propose a dynamic semantic prototype learning (DSP) method that aligns the empirical and actual semantic prototypes for accurate visual feature synthesis. The alignment jointly refines semantic prototypes and visual features so that the conditional generator synthesizes visual features close to the real ones. Specifically, we use a visual→semantic mapping network (V2SM) to map both synthesized and real features into the class semantic space; V2SM encourages the generator to synthesize visual representations with rich semantics. The visual features in turn supervise our visual-oriented semantic prototype evolving network (VOPE), which iteratively evolves the predefined class semantic prototypes into dynamic semantic prototypes. These prototypes are fed back to the generative network as conditional supervision. Finally, we enhance visual features by fusing the evolved semantic prototypes into their corresponding visual features.
Extensive experiments on three benchmark datasets show that DSP improves existing generative ZSL methods: the average improvements of the harmonic mean over four baselines (CLSWGAN, f-VAEGAN, TF-VAEGAN and FREE) are 8.5%, 8.0% and 9.7% on CUB, SUN and AWA2, respectively.

1. INTRODUCTION

Zero-shot learning (ZSL) recognizes unseen classes by transferring semantic knowledge from seen classes to unseen ones. Recently, conditional generative models, e.g., generative adversarial networks (GANs) Goodfellow et al. (2014) and variational autoencoders (VAEs) Kingma & Welling (2014), have advanced ZSL by synthesizing class-specific samples of unseen classes conditioned on semantic prototypes. However, the empirically predefined semantic prototypes cannot faithfully represent the actual semantic prototypes of visual features (i.e., visual prototypes). As shown in Fig. 1: i) the predefined class semantic prototypes are annotated by humans, inevitably yielding some inaccurate annotations (e.g., the attribute "bill color orange" of Sample-2 in Fig. 1(a)); and ii) visual images are captured from multiple views, so some annotated attributes do not appear in the visual representations (e.g., the attribute "bill color orange" in Sample-3 and the attribute "leg color black" in Sample-4 in Fig. 1(a)). That is, some visual representations of one class are incorrectly mapped to their common predefined semantic prototype, which heavily limits classification performance. We formulate this phenomenon as the visual-semantic domain shift problem. It is therefore essential to refine the empirically defined semantic prototypes; an accurate prototype provides better supervision for the generator and improves generative ZSL performance. In light of these observations, we argue that the predefined semantic prototypes can be refined, according to visual information, to become consistent with the actual semantic prototypes (i.e., visual prototypes). The refined semantic prototypes can then capture the visual samples exactly and act as an accurate conditional signal for the generator, enabling generative ZSL methods to learn a desirable semantic→visual mapping (i.e., the generator).
Targeting this goal, we propose a dynamic semantic prototype learning (DSP) method that aligns the empirical and actual semantic prototypes for reliable visual feature synthesis. The alignment jointly refines semantic prototypes and visual features so that the conditional generator synthesizes class-specific visual features close to the real ones, as shown in Fig. 1(c). Accordingly, DSP tackles the visual-semantic domain shift problem and advances generative ZSL methods. Specifically, DSP consists of a visual→semantic mapping network (V2SM) and a visual-oriented semantic prototype evolving network (VOPE). Cooperating with the generator, V2SM maps visual features into the class semantic space, enabling the conditional generator to synthesize class-specific visual samples with rich semantic information, and conveys visual information to VOPE. Under the supervision of this visual information, VOPE iteratively evolves and refines the predefined class semantic prototypes into dynamic semantic prototypes, which are fed back to the generative network as conditional supervision to encourage reliable visual feature synthesis. Notably, DSP is flexible and can be integrated into any generative ZSL method.
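The interplay of the three components above (generator, V2SM, VOPE) can be sketched as a single forward iteration. This is a minimal illustrative sketch only: the dimensions, the random linear maps standing in for the paper's learned networks, the residual prototype update, and the squared-error alignment loss are all assumptions, not the actual DSP architecture or objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEM, D_VIS = 85, 2048   # hypothetical attribute / visual-feature dims (AWA2-like)

# Random linear maps as placeholders for the learned networks.
W_gen  = rng.standard_normal((D_SEM, D_VIS)) * 0.01   # generator: semantic -> visual
W_v2sm = rng.standard_normal((D_VIS, D_SEM)) * 0.01   # V2SM: visual -> semantic
W_vope = rng.standard_normal((D_SEM, D_SEM)) * 0.01   # VOPE: prototype refinement

def dsp_step(proto, real_feat):
    """One conceptual DSP iteration (forward passes only, no training)."""
    # 1) VOPE evolves the predefined prototype into a dynamic prototype
    #    (here: a simple residual update as a stand-in).
    dyn_proto = proto + proto @ W_vope
    # 2) The generator is conditioned on the dynamic prototype.
    fake_feat = dyn_proto @ W_gen
    # 3) V2SM maps both real and synthesized features back to the semantic
    #    space, so a loss can align them with the dynamic prototype.
    sem_real, sem_fake = real_feat @ W_v2sm, fake_feat @ W_v2sm
    loss = np.mean((sem_real - dyn_proto) ** 2) + np.mean((sem_fake - dyn_proto) ** 2)
    return dyn_proto, fake_feat, loss

proto = rng.standard_normal(D_SEM)        # predefined class semantic prototype
real_feat = rng.standard_normal(D_VIS)    # a real visual feature of that class
dyn_proto, fake_feat, loss = dsp_step(proto, real_feat)
```

In the actual method the three mappings are trained networks and the dynamic prototypes are fed back to the generator over many iterations; the sketch only shows how the components pass information among one another.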



Conditional generative models, including GANs Goodfellow et al. (2014) and variational autoencoders (VAEs) Kingma & Welling (2014), have been successfully applied to ZSL and achieve promising performance. They synthesize class-specific images or visual features of unseen classes, conditioned on the predefined class semantic prototypes, to mitigate the lack of unseen samples Arora et al. (2018); Xian et al. (2018; 2019b); Chen et al. (2021a;b). Owing to the high-quality synthesis of GANs, many studies generate unseen samples in the CNN feature space to benefit ZSL Yan et al. (2021); Xian et al. (2019b).
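To make the conditioning concrete, the sketch below draws synthetic unseen-class features from a semantic prototype plus noise, in the spirit of feature-generating methods. The single random linear-plus-ReLU layer, the noise dimension, and the weight scale are illustrative placeholders, not a trained generator.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEM, D_Z, D_VIS = 312, 64, 2048   # CUB-style attribute dim; noise/feature dims assumed

# One random linear+ReLU layer stands in for a trained conditional generator.
W = rng.standard_normal((D_SEM + D_Z, D_VIS)) * 0.02

def generate_unseen(prototype, n_samples):
    """Synthesize n_samples visual features conditioned on one class prototype."""
    z = rng.standard_normal((n_samples, D_Z))                    # per-sample noise
    cond = np.concatenate([np.tile(prototype, (n_samples, 1)), z], axis=1)
    return np.maximum(cond @ W, 0.0)                             # ReLU keeps features non-negative

proto = rng.standard_normal(D_SEM)     # predefined semantic prototype of an unseen class
feats = generate_unseen(proto, 5)      # five class-specific synthetic features
```

Because every sample of a class shares the same prototype, any inaccuracy in that prototype propagates into all synthesized features of the class, which is exactly the failure mode the paper targets.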

Figure 1: Motivation illustration. (a) Some visual representations of one class are incorrectly mapped to their common predefined semantic prototype. (b) Existing generative ZSL methods generate visual features guided only by the predefined class semantic vectors, which fail to accurately represent the semantic prototypes of visual representations (also denoted visual prototypes); the synthesized visual features thus lie far from the corresponding real visual features and real semantic prototypes. (c) Our DSP dynamically evolves the predefined semantic prototypes into dynamic semantic prototypes that are closer to their visual prototypes, enabling the generator to synthesize reliable visual features and to enrich visual features with semantic information from the dynamic prototypes. (Best viewed in color.)

Finally, DSP concatenates the evolved semantic prototypes into the corresponding visual features for enhancement, bringing the visual features closer to their semantic prototypes and alleviating the cross-dataset bias problem Chen et al. (2021a). Extensive experimental results demonstrate consistent performance gains over state-of-the-art generative ZSL methods on three challenging benchmark datasets, i.e., CUB Welinder et al. (2010), SUN Patterson & Hays (2012) and AWA2 Xian et al. (2019a). For example, the average improvements of the harmonic mean over four baselines (CLSWGAN Xian et al. (2018), f-VAEGAN Xian et al. (2019b), TF-VAEGAN Narayan et al. (2020) and FREE Chen et al. (2021a)) are 8.5%, 8.0% and 9.7% on CUB, SUN and AWA2, respectively.
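The enhancement step is concatenation, as stated above; the sketch below shows the resulting fused representation. The dimensions are assumptions chosen for illustration.

```python
import numpy as np

D_VIS, D_SEM = 2048, 85   # assumed visual-feature and attribute dimensions

def enhance(visual_feat, dyn_proto):
    """Fuse a dynamic semantic prototype into its visual feature by concatenation."""
    # The downstream classifier then operates on (D_VIS + D_SEM)-dim inputs,
    # so each feature stays explicitly tied to its evolved prototype.
    return np.concatenate([visual_feat, dyn_proto])

fused = enhance(np.ones(D_VIS), np.zeros(D_SEM))   # 2133-dim enhanced feature
```

Applying the same fusion to real and synthesized features keeps the two domains comparable when training the final ZSL classifier.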

