CLASS NORMALIZATION FOR (CONTINUAL)? GENERALIZED ZERO-SHOT LEARNING

Abstract

Normalization techniques have proved to be a crucial ingredient of successful training in the traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in the ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation of two popular tricks used in zero-shot learning, normalize+scale and attributes normalization, and show that they help training by preserving variance during a forward pass. Next, we demonstrate that they are insufficient to normalize a deep ZSL model and propose Class Normalization (CN): a normalization scheme which alleviates this issue both provably and in practice. Third, we show that ZSL models typically have a more irregular loss surface than traditional classifiers and that the proposed method partially remedies this problem. Then, we test our approach on 4 standard ZSL datasets and outperform sophisticated modern SotA with a simple MLP optimized without any bells and whistles, trained ≈50 times faster. Finally, we generalize ZSL to a broader problem, continual ZSL, and introduce principled metrics and rigorous baselines for this new setup. The source code is available at https://github.com/universome/class-norm.

1. INTRODUCTION

Zero-shot learning (ZSL) aims to understand new concepts based on their semantic descriptions instead of numerous input-output learning pairs. It is a key element of human intelligence, and our best machines still struggle to master it (Ferrari & Zisserman, 2008; Lampert et al., 2009; Xian et al., 2018a). Normalization techniques like batch/layer/group normalization (Ioffe & Szegedy, 2015; Ba et al., 2016; Wu & He, 2018) are now a common and important practice of modern deep learning. But despite their popularity in traditional supervised training, not much is explored in the realm of zero-shot learning, which motivated us to study and investigate normalization in ZSL models.

We start by analyzing two ubiquitous tricks employed by ZSL and representation learning practitioners: normalize+scale (NS) and attributes normalization (AN) (Bell et al., 2016; Zhang et al., 2019; Guo et al., 2020; Chaudhry et al., 2019). Their dramatic influence on performance can be observed from Table 1. When these two tricks are employed, a vanilla MLP model, described in Sec 3.1, can outperform some recent sophisticated ZSL methods.

Normalize+scale (NS) changes the logits computation from the usual dot product to a scaled cosine similarity:

ŷ_c = z⊤ p_c  ⟹  ŷ_c = (γ z / ‖z‖₂)⊤ (γ p_c / ‖p_c‖₂)    (1)

where z is an image feature, p_c is the c-th class prototype, and γ is a hyperparameter, usually picked from the [5, 10] interval (Li et al., 2019; Zhang et al., 2019). Scaling by γ multiplies the cosine similarity by γ², which is equivalent to setting the softmax temperature to 1/γ². In Sec. 3.2, we theoretically justify the need for this trick and explain why the value of γ must be so high.

These two tricks work well and normalize the variance to a unit value when the underlying ZSL model is linear (see Figure 1), but they fail when we use a multi-layer architecture. To remedy this issue, we introduce Class Normalization (CN): a novel normalization scheme, which is based on a different initialization and a class-wise standardization transform.
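As a concrete illustration, the normalize+scale logits of Eq. (1) can be computed as follows. This is a minimal NumPy sketch; the function name and the toy inputs are ours, not from the paper:

```python
import numpy as np

def ns_logits(z, P, gamma=5.0):
    """Normalize+scale (NS): logits as scaled cosine similarity.

    z: (d,) image feature; P: (C, d) matrix of class prototypes.
    Scaling both unit vectors by gamma yields gamma^2 * cos(z, p_c).
    """
    z_hat = gamma * z / np.linalg.norm(z)
    P_hat = gamma * P / np.linalg.norm(P, axis=1, keepdims=True)
    return P_hat @ z_hat  # (C,) vector of logits

z = np.array([3.0, 4.0])
P = np.array([[3.0, 4.0],    # aligned with z: cos = 1
              [4.0, -3.0]])  # orthogonal to z: cos = 0
logits = ns_logits(z, P, gamma=5.0)
# -> [25., 0.], i.e. gamma^2 * [cos(z, p_0), cos(z, p_1)]
```

Note how γ = 5 turns a cosine in [-1, 1] into a logit in [-25, 25], which is why softmax saturation concerns dictate the choice of γ.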
Modern ZSL methods either utilize sophisticated architectural designs, such as training generative models (Narayan et al., 2020; Felix et al., 2018), or use heavy optimization schemes, such as episode-based training (Yu et al., 2020; Li et al., 2019). In contrast, we show that simply adding Class Normalization on top of a vanilla MLP is enough to set new state-of-the-art results on several standard ZSL datasets (see Table 2). Moreover, since it is optimized with plain gradient descent without any bells and whistles, training takes 50-100 times less time and runs in about 1 minute. We also demonstrate that many ZSL models tend to have a more irregular loss surface than traditional supervised classifiers and apply the results of Santurkar et al. (2018) to show that our CN partially remedies the issue. We discuss and empirically validate this in Sec 3.5 and Appx F. Apart from the theoretical exposition and a new normalization scheme, we also propose a broader ZSL setup: continual zero-shot learning (CZSL).
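The class-wise standardization transform at the core of CN can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: we assume the transform standardizes each class prototype across its feature dimension and rescales by 1/√d so that dot products with unit-variance features stay O(1); the exact scaling factor is our assumption.

```python
import numpy as np

def class_norm(P, eps=1e-8):
    """Class-wise standardization sketch (hypothetical form of CN).

    Each class prototype (row of P, shape (C, d)) is shifted to zero
    mean and scaled to unit variance across its d coordinates, then
    rescaled by 1/sqrt(d) to keep the resulting logits' variance O(1).
    """
    d = P.shape[1]
    mu = P.mean(axis=1, keepdims=True)
    sigma = P.std(axis=1, keepdims=True)
    return (P - mu) / (sigma + eps) / np.sqrt(d)

rng = np.random.default_rng(0)
P = rng.normal(loc=1.5, scale=3.0, size=(10, 64))  # 10 toy prototypes
P_cn = class_norm(P)
# each row now has zero mean and (up to eps) unit L2 norm
```

Under this sketch, every normalized prototype has unit L2 norm regardless of the scale of the original prototype, which is exactly the variance-preservation property the tricks above provide only for linear models.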



Table 1: Effectiveness of Normalize+Scale, Attributes Normalization and Class Normalization. When NS and AN are integrated into a basic ZSL model, its performance is boosted to the level of some sophisticated SotA methods, and additionally using CN allows it to outperform them. ±NS and ±AN denote whether normalize+scale or attributes normalization is used. Bold/normal blue font denotes best/second-best results. Extended results are in Tables 2, 5 and 8.

The Attributes Normalization (AN) technique simply divides class attributes by their L2 norms:

a_c → a_c / ‖a_c‖₂    (2)

While this may look inconsequential, it is surprising to see it being preferred in practice (Li et al., 2019; Narayan et al., 2020; Chaudhry et al., 2019) over the traditional zero-mean, unit-variance data standardization (Glorot & Bengio, 2010). In Sec 3, we show that it helps to normalize the signal's variance, and we ablate its importance in Table 1 and Appx D.
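The AN transform of Eq. (2) is a one-liner; the sketch below (our own helper with a toy attribute matrix, not code from the paper) makes the row-wise L2 normalization explicit:

```python
import numpy as np

def normalize_attributes(A):
    """Attributes Normalization (Eq. 2): divide each class attribute
    vector (a row of A) by its L2 norm, placing every class on the
    unit sphere."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)

A = np.array([[1.0, 2.0, 2.0],    # ||a_0||_2 = 3
              [0.0, 3.0, 4.0]])   # ||a_1||_2 = 5
A_n = normalize_attributes(A)
# rows of A_n: [1/3, 2/3, 2/3] and [0, 3/5, 4/5], each with unit norm
```

Unlike zero-mean, unit-variance standardization, AN rescales each class independently and does not center the attributes, so the per-dimension statistics across classes are left untouched.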

Continual learning (CL) is the ability to acquire new knowledge without forgetting (e.g. (Kirkpatrick et al., 2017)), and it is scarcely investigated in ZSL. We develop the ideas of lifelong learning with class attributes, originally proposed by Chaudhry et al. (2019) and extended by Wei et al. (2020a), propose several principled metrics for it, and test several classical CL methods in this new setup.

