CLASS NORMALIZATION FOR (CONTINUAL)? GENERALIZED ZERO-SHOT LEARNING

Abstract

Normalization techniques have proved to be a crucial ingredient of successful training in the traditional supervised learning regime. However, in the zero-shot learning (ZSL) world, these ideas have received only marginal attention. This work studies normalization in the ZSL scenario from both theoretical and practical perspectives. First, we give a theoretical explanation of two popular tricks used in zero-shot learning, normalize+scale and attributes normalization, and show that they help training by preserving variance during a forward pass. Next, we demonstrate that they are insufficient to normalize a deep ZSL model and propose Class Normalization (CN): a normalization scheme which alleviates this issue both provably and in practice. Third, we show that ZSL models typically have a more irregular loss surface than traditional classifiers and that the proposed method partially remedies this problem. Then, we test our approach on 4 standard ZSL datasets and outperform sophisticated modern SotA methods with a simple MLP optimized without any bells and whistles and with ≈50 times faster training. Finally, we generalize ZSL to a broader problem, continual ZSL, and introduce principled metrics and rigorous baselines for this new setup. The source code is available at https://github.com/universome/class-norm.

While attributes normalization may look inconsiderable, it is surprising to see it being preferred in practice (Li et al., 2019; Narayan et al., 2020; Chaudhry et al., 2019) over the traditional zero-mean, unit-variance data standardization (Glorot & Bengio, 2010). In Sec. 3, we show that it helps in normalizing the signal's variance, and we ablate its importance in Table 1 and Appx. D. These two tricks work well and normalize the variance to a unit value when the underlying ZSL model is linear (see Figure 1), but they fail when we use a multi-layer architecture.
To remedy this issue, we introduce Class Normalization (CN): a novel normalization scheme based on a different initialization and a class-wise standardization transform. Modern ZSL methods either utilize sophisticated architectural designs like training generative models (Narayan et al., 2020; Felix et al., 2018) or use heavy optimization schemes like episode-based training (Yu et al., 2020; Li et al., 2019). In contrast, we show that simply adding Class Normalization on top of a vanilla MLP is enough to set new state-of-the-art results on several standard ZSL datasets (see Table 2). Moreover, since it is optimized with plain gradient descent without any bells and whistles, training takes 50-100 times less time and runs in about 1 minute. We also demonstrate that many ZSL models tend to have a more irregular loss surface than traditional supervised classifiers and apply the results of Santurkar et al. (2018) to show that our CN partially remedies the issue. We discuss and empirically validate this in Sec. 3.5 and Appx. F.

Apart from the theoretical exposition and a new normalization scheme, we also propose a broader ZSL setup: continual zero-shot learning (CZSL). Continual learning (CL) is the ability to acquire new knowledge without forgetting (e.g., Kirkpatrick et al. (2017)), and it is scarcely investigated in ZSL. We develop the ideas of lifelong learning with class attributes, originally proposed by Chaudhry et al. (2019) and extended by Wei et al. (2020a), propose several principled metrics for it, and test several classical CL methods in this new setup.

Zero-shot learning. Zero-shot learning (ZSL) aims at understanding examples of unseen classes from their language or semantic descriptions. Earlier ZSL methods directly predict attribute confidences from images to facilitate zero-shot recognition (e.g., Lampert et al. (2009); Farhadi et al. (2009); Lampert et al. (2013b)).
Recent ZSL methods for image classification can be categorized into two groups: generative-based and embedding-based. The main goal of generative-based approaches is to build a conditional generative model (e.g.,

1. INTRODUCTION

Zero-shot learning (ZSL) aims to understand new concepts based on their semantic descriptions instead of numerous input-output learning pairs. It is a key element of human intelligence, and our best machines still struggle to master it (Ferrari & Zisserman, 2008; Lampert et al., 2009; Xian et al., 2018a). Normalization techniques like batch/layer/group normalization (Ioffe & Szegedy, 2015; Ba et al., 2016; Wu & He, 2018) are now a common and important practice of modern deep learning. But despite their popularity in traditional supervised training, little has been explored in the realm of zero-shot learning, which motivated us to study normalization in ZSL models. We start by analyzing two ubiquitous tricks employed by ZSL and representation learning practitioners: normalize+scale (NS) and attributes normalization (AN) (Bell et al., 2016; Zhang et al., 2019; Guo et al., 2020; Chaudhry et al., 2019). Their dramatic influence on performance can be observed in Table 1. When these two tricks are employed, a vanilla MLP model, described in Sec. 3.1, can outperform some recent sophisticated ZSL methods. Normalize+scale (NS) changes the logits computation from the usual dot product to a scaled cosine similarity:

$$\hat{y}_c = z^\top p_c \;\Longrightarrow\; \hat{y}_c = \gamma \frac{z}{\|z\|_2} \cdot \gamma \frac{p_c}{\|p_c\|_2} \quad (1)$$

where $z$ is an image feature, $p_c$ is the $c$-th class prototype, and $\gamma$ is a hyperparameter, usually picked from the $[5, 10]$ interval (Li et al., 2019; Zhang et al., 2019). Scaling by $\gamma$ is equivalent to scaling the softmax logits by $\gamma^2$, i.e. using a softmax temperature of $1/\gamma^2$. In Sec. 3.2, we theoretically justify the need for this trick and explain why the value of $\gamma$ must be so high. The attributes normalization (AN) technique simply divides class attributes by their $L_2$ norms:

$$a_c \longrightarrow a_c / \|a_c\|_2 \quad (2)$$

The second group, embedding-based methods, learns a projection such that a distance between a class projection and the corresponding images is minimized (e.g., Romera-Paredes & Torr (2015); Frome et al. (2013); Lei Ba et al. (2015); Akata et al. (2016a); Zhang et al. (2017); Akata et al. (2015; 2016b)).
One question that arises is which space to project the attributes or images to. Previous works projected images to the semantic space (Elhoseiny et al., 2013; Frome et al., 2013; Lampert et al., 2013a) or to some common space (Zhang & Saligrama, 2015; Akata et al., 2015), but our approach follows the idea of Zhang et al. (2016); Li et al. (2019), who show that projecting attributes to the image space reduces the bias towards seen data.

Normalize+scale and attributes normalization. It was observed both in the ZSL (e.g., Li et al. (2019); Zhang et al. (2019); Bell et al. (2016)) and representation learning (e.g., Sohn (2016); Guo et al. (2020); Ye et al. (2020)) fields that normalize+scale (i.e., Eq. (1)) and attributes normalization (i.e., Eq. (2)) tend to significantly improve the performance of a learning system. In the literature, these two techniques lack rigorous motivation and are usually introduced as practical heuristics that aid training (Changpinyo et al., 2017; Zhang et al., 2019; 2021). One of the earliest works to employ attributes normalization was Norouzi et al. (2013), and Changpinyo et al. (2016a) also ablate its importance. The main consumers of the normalize+scale trick have been similarity learning algorithms, which employ it to refine the distance metric between representations (Bellet et al., 2013; Guo et al., 2020; Shi et al., 2020). Luo et al. (2018) proposed to use cosine similarity in the final output projection matrix as a normalization procedure, but did not analyze how it affects the variance. They also did not use the scaling, which our experiments in Table 5 show to be crucial. Gidaris & Komodakis (2018) demonstrated greatly superior performance of an NS-enriched model compared to a dot-product based one in their setup, where the classifying matrix is constructed dynamically. Li et al.
(2019) motivated their usage of NS by variance reduction, but did not elaborate on this in their subsequent analysis. Chen et al. (2020) related the use of the normalized temperature-scaled cross-entropy loss (NT-Xent) to different weighting of negative examples in the contrastive learning framework. Overall, to the best of our knowledge, there is no precise understanding of the influence of these two tricks on the optimization process and the benefits they provide.

Initialization schemes. In the seminal work on Xavier initialization, Glorot & Bengio (2010) showed how to preserve the variance during a forward pass. He et al. (2015) applied a similar analysis while taking ReLU nonlinearities into account. There is also a growing interest in two-step (Jia et al., 2014), data-dependent (Krähenbühl et al., 2015), and orthogonal (Hu et al., 2020) initialization schemes. However, the importance of a good initialization for multi-modal embedding functions like attribute embedders is less studied and not well understood. We propose a proper initialization scheme based on a different initialization variance and a dynamic standardization layer. Our variance analysis is similar in nature to Chang et al. (2020), since an attribute embedder may be seen as a hypernetwork (Ha et al., 2016) that outputs a linear classifier. But the exact embedding transformation differs from a hypernetwork since it has matrix-wise input, and in our derivations we have to use looser assumptions about the attributes distribution (see Sec. 3 and Appx. H).

Normalization techniques. A closely related branch of research is the development of normalization layers for deep neural networks (Ioffe & Szegedy, 2015), since they also influence a signal's variance. BatchNorm, the most popular one, normalizes the location and scale of activations. It is applied in a batch-wise fashion, which is why its performance is highly dependent on the batch size (Singh & Krishnan, 2020).
For this reason, several normalization techniques have been proposed to eliminate the batch-size dependency (Wu & He, 2018; Ba et al., 2016; Singh & Krishnan, 2020). The proposed class normalization is very similar to the standardization procedure which underlies BatchNorm, but it is applied class-wise in the attribute embedder. This also makes it independent of the batch size.

Continual zero-shot learning. We introduce continual zero-shot learning: a new benchmark for ZSL agents that is inspired by the continual learning literature (e.g., Kirkpatrick et al. (2017)). It develops the scenario proposed by Chaudhry et al. (2019), but the authors there focused on ZSL performance only a single task ahead, while we consider the performance on all seen data (previous tasks) and all unseen data (future tasks). This also contrasts our work with the very recent work by Wei et al. (2020b), where a model is trained on a sequence of seen-class splits of existing ZSL benchmarks and the zero-shot performance is reported for every task individually at test time. In contrast, in our setup the label space is not restricted and covers the spectrum of all previous tasks (seen so far) and future tasks (unseen so far). Due to this difference, we need to introduce a set of new metrics and benchmarks to measure this continual generalized ZSL skill over time. From the lifelong learning perspective, the idea of considering all the processed data to evaluate the model is not new and was previously explored by Elhoseiny et al. (2018); van de Ven & Tolias (2019). It contrasts with the common practice of providing the task identity at test time, which limits the prediction space for a model, making the problem easier (Kirkpatrick et al., 2017; Aljundi et al., 2017). Isele et al. (2016); Lopez-Paz & Ranzato (2017) motivate the use of task descriptors for zero-shot knowledge transfer, but in our work we consider class descriptors instead.
We defined CZSL as a continual version of generalized ZSL, which allows us to naturally extend all the existing ZSL metrics (Xian et al., 2018a; Chao et al., 2016) to our new continual setup.

3. NORMALIZATION IN ZERO-SHOT LEARNING

The goal of a good normalization scheme is to protect a signal inside a model from severe fluctuations and to keep it in regions that are appropriate for the subsequent transformations. For example, for ReLU activations, we want the input activations to be zero-centered and not scaled too much: otherwise, we risk finding ourselves in all-zero or all-linear activation regimes, disrupting the model's performance. For logits, we want a close-to-unit variance, since too small a variance leads to poor gradients of the subsequent cross-entropy loss, and too large a variance indicates poor scaling of the preceding weight matrix. For linear layers, we want their inputs to be zero-centered: otherwise, they would produce overly biased outputs, which is undesirable. In traditional supervised learning, we have different normalization and initialization techniques to control the signal flow. In zero-shot learning (ZSL), however, the set of tools is extremely limited. In this section, we justify the popularity of the Normalize+Scale (NS) and Attributes Normalization (AN) techniques by demonstrating that they simply preserve the signal's variance. We then show that they are not enough to normalize a deep ZSL model and propose class normalization to regulate the signal inside a deep ZSL model. We empirically evaluate our study in Sec. 5 and Appendices A, B, D and F.

3.1. NOTATION

A ZSL setup assumes access to datasets of seen and unseen images with the corresponding labels, $\mathcal{D}^s = \{x_i^s, y_i^s\}_{i=1}^{N_s}$ and $\mathcal{D}^u = \{x_i^u, y_i^u\}_{i=1}^{N_u}$ respectively. Each class $c$ is described by its class attribute vector $a_c \in \mathbb{R}^{d_a}$. All attribute vectors are likewise partitioned into non-overlapping seen and unseen sets: $\mathcal{A}^s = \{a_i\}_{i=1}^{K_s}$ and $\mathcal{A}^u = \{a_i\}_{i=1}^{K_u}$. Here $N_s, N_u, K_s, K_u$ are the numbers of seen images, unseen images, seen classes, and unseen classes respectively. In modern ZSL, all images are usually transformed via some standard feature extractor $E: x \mapsto z \in \mathbb{R}^{d_z}$ (Xian et al., 2018a). Then, a typical ZSL method trains an attribute embedder $P_\theta: a_c \mapsto p_c \in \mathbb{R}^{d_z}$, which projects class attributes $a_c^s$ onto the feature space $\mathbb{R}^{d_z}$ in such a way that $p_c$ lies close to the exemplar features $z^s$ of its class $c$. This is done by solving a classification task, where logits are computed using formula (1). In this way, at test time we are able to classify unseen images by projecting unseen attribute vectors $a_c^u$ into the feature space and computing their similarity with the provided features $z^u$. The attribute embedder $P_\theta$ is usually a very simple neural network (Li et al., 2019), in many cases even linear (Romera-Paredes & Torr, 2015; Elhoseiny et al., 2013), so it is the training procedure and different regularization schemes that carry the load. We denote the final projection matrix and the body of $P_\theta$ as $V$ and $H_\phi$ respectively, i.e. $P_\theta(a_c) = V H_\phi(a_c)$. During training, the embedder receives the matrix of class attributes $A = [a_1, ..., a_{K_s}]$ of size $K_s \times d_a$ and outputs the matrix $W = P_\theta(A)$ of size $K_s \times d_z$. Then $W$ is used to compute class logits for a batch of image feature vectors $z_1, ..., z_{N_s}$.
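As an illustration of this notation, consider a minimal numpy sketch with toy dimensions; a plain linear embedder stands in for $P_\theta$, and all sizes and variable names are our illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
K_s, d_a, d_z, N = 6, 85, 64, 32   # toy sizes: seen classes, attr dim, feature dim, batch

A = rng.normal(size=(K_s, d_a))                  # class-attribute matrix with rows a_c
Z = rng.normal(size=(N, d_z))                    # image features z = E(x) from a frozen extractor
V = rng.normal(size=(d_a, d_z)) / np.sqrt(d_a)   # linear attribute embedder

W = A @ V                 # prototype matrix W = P_theta(A), shape (K_s, d_z)
logits = Z @ W.T          # plain dot-product logits y_c = z^T p_c, shape (N, K_s)
preds = logits.argmax(axis=1)   # predicted class index per image
print(logits.shape, preds.shape)
```

At test time the same code would be run with the unseen attribute matrix in place of `A`, which is the whole point of the setup: the classifier for new classes is generated from their descriptions.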

3.2. UNDERSTANDING NORMALIZE + SCALE TRICK

One of the most popular tricks in ZSL and deep learning is to use the scaled cosine similarity instead of a simple dot product in the logits computation (Li et al., 2019; Zhang et al., 2019; Ye et al., 2020):

$$\hat{y}_c = z^\top p_c \;\Longrightarrow\; \hat{y}_c = \gamma^2 \frac{z^\top p_c}{\|z\| \cdot \|p_c\|} \quad (3)$$

where the hyperparameter $\gamma$ is usually picked from the $[5, 10]$ interval. Both using the cosine similarity and scaling it afterwards by a large value are critical to obtaining good performance; see Appendix D. To our knowledge, it has not been clear why exactly this has such a big influence and why the value of $\gamma$ must be so large. The following statement provides an answer to these questions.

Statement 1 (informal). The normalize+scale trick forces the variance of $\hat{y}_c$ to be approximately

$$\mathrm{Var}[\hat{y}_c] \approx \frac{\gamma^4 d_z}{(d_z - 2)^2}, \quad (4)$$

where $d_z$ is the dimensionality of the feature space. See Appendix A for the assumptions, derivation and the empirical study. Formula (4) demonstrates two things:

1. When we use cosine similarity, the variance of $\hat{y}_c$ becomes independent of the variance of $W = P_\theta(A)$, leading to better stability.

2. If one uses Eq. (3) without scaling (i.e. $\gamma = 1$), then $\mathrm{Var}[\hat{y}_c]$ will be extremely low (especially for large $d_z$), our model will always output a near-uniform distribution, and training will stall. That is why we need very large values of $\gamma$.

Usually, the optimal value of $\gamma$ is found via a hyperparameter search (Li et al., 2019), but our formula suggests another strategy: one can obtain any desired variance $\nu = \mathrm{Var}[\hat{y}_c]$ by setting $\gamma$ to

$$\gamma = \left( \frac{\nu \cdot (d_z - 2)^2}{d_z} \right)^{1/4} \quad (5)$$

For example, for $\mathrm{Var}[\hat{y}_c] = \nu = 1$ and $d_z = 2048$ we obtain $\gamma \approx 6.72$, which falls right in the middle of $[5, 10]$, the usual search region for $\gamma$ used by ZSL and representation learning practitioners (Li et al., 2019; Zhang et al., 2019; Guo et al., 2020).
The above consideration not only gives a theoretical understanding of the trick, which we believe is important in its own right, but also allows one to speed up the search by either picking the predicted "optimal" value for $\gamma$ or by searching in its vicinity.
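Equation (5) is straightforward to evaluate directly; the sketch below computes the predicted "optimal" $\gamma$ for the ResNet101 feature dimensionality $d_z = 2048$ used later in the paper, and checks that plugging it back into Eq. (4) recovers the requested variance:

```python
def gamma_for_variance(nu, d_z):
    """Invert Eq. (4): pick gamma so that Var[y_c] ~ nu under the Statement 1 approximation."""
    return (nu * (d_z - 2) ** 2 / d_z) ** 0.25

g = gamma_for_variance(1.0, 2048)
print(round(g, 2))   # ~6.72, inside the usual [5, 10] search interval

# plugging gamma back into Eq. (4) recovers the requested unit variance
var = g ** 4 * 2048 / (2048 - 2) ** 2
assert abs(var - 1.0) < 1e-9
```

This is exactly the "pick the predicted value, then search in its vicinity" strategy suggested above.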

3.3. UNDERSTANDING ATTRIBUTES NORMALIZATION TRICK

We showed in the previous subsection that the normalize+scale trick makes the variance of $\hat{y}_c$ independent of the variance of the weights, features and attributes. This may create the impression that it does not matter how we initialize the weights, since normalization would undo any fluctuations. However, this is not true, because it is still important how the signal flows under the hood, i.e. for the unnormalized and unscaled logit value $\tilde{y}_c = z^\top p_c$. Another common trick in ZSL is the normalization of attribute vectors to a unit norm, $a_c \rightarrow a_c / \|a_c\|_2$. We provide some theoretical underpinnings of its importance. Let us first consider a linear case for $P_\theta$, i.e. $H_\phi$ is the identity, so $\tilde{y}_c = z^\top p_c = z^\top V a_c$. Then the way we initialize $V$ is crucial, since $\mathrm{Var}[\tilde{y}_c]$ depends on it. To derive an initialization scheme, one usually makes three strong assumptions about the inputs (Glorot & Bengio, 2010; He et al., 2015; Chang et al., 2020): 1) they are zero-centered; 2) they are independent of each other; and 3) they have a covariance matrix of the form $\sigma^2 I$. But in the ZSL setting we have two sources of inputs: image features $z$ and class attributes $a_c$. These assumptions are safe to make only for $z$, not for $a_c$, because they do not hold for the standard datasets (see Appendix H). To account for this, we derive the variance $\mathrm{Var}[\tilde{y}_c]$ without relying on these assumptions for $a_c$ (see Appendix B):

$$\mathrm{Var}[\tilde{y}_c] = d_z \cdot \mathrm{Var}[z_i] \cdot \mathrm{Var}[V_{ij}] \cdot \mathbb{E}_a\left[\|a\|_2^2\right] \quad (6)$$

From equation (6) one can see that, after giving up the invalid assumptions for $a_c$, the pre-logit variance $\mathrm{Var}[\tilde{y}_c]$ becomes dependent on $\|a_c\|_2$, which is not captured by the traditional Glorot & Bengio (2010) and He et al. (2015) initialization schemes and thus leads to poor variance control. The attributes normalization trick rectifies this limitation, which is summarized in the following statement.

Statement 2 (informal). The attributes normalization trick leads to the same pre-logit variance as Xavier fan-out initialization.
(See Appendix B for the formal statement and the derivation.) Xavier fan-out initialization selects a scale for a linear layer such that the variance of the backward-pass representations is preserved across the model (in the absence of non-linearities). The fact that attributes normalization results in a scaling of $P_\theta$ equivalent to Xavier fan-out scaling, and not some other one, is a coincidence and hints at the underlying meaning of this procedure.
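Equation (6) can be probed empirically: with a fixed linear embedder, doubling the attribute norms should quadruple the pre-logit variance. A minimal simulation (the sizes and the Gaussian attribute model are illustrative assumptions, not the real datasets):

```python
import numpy as np

rng = np.random.default_rng(1)
d_a, d_z, n = 85, 256, 20000

V = rng.normal(size=(d_z, d_a)) / np.sqrt(d_a)   # fixed linear embedder, Var[V_ij] = 1/d_a

def prelogit_var(scale):
    """Empirical Var[y_c] for y_c = z^T V a_c with attributes scaled by `scale`."""
    a = rng.normal(size=(n, d_a)) * scale        # attributes with E||a||^2 proportional to scale^2
    z = rng.normal(size=(n, d_z))                # zero-centered image features, Var[z_i] = 1
    y = ((z @ V) * a).sum(axis=1)                # one pre-logit per sample
    return y.var()

v1, v2 = prelogit_var(1.0), prelogit_var(2.0)
print(v2 / v1)   # ~4: variance scales linearly with E||a||^2, as Eq. (6) predicts
```

Normalizing each attribute vector to unit norm makes the last factor in Eq. (6) equal to one, which removes this dependence entirely.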

3.4. CLASS NORMALIZATION

What happens when $P_\theta$ is not linear? Let $h_c = H_\phi(a_c)$ be the output of $H_\phi$. The analysis of this case is equivalent to the previous one, but with $h_c$ plugged in everywhere instead of $a_c$. This leads to:

$$\mathrm{Var}[\tilde{y}_c] = d_z \cdot \mathrm{Var}[z_i] \cdot \mathrm{Var}[V_{ij}] \cdot \mathbb{E}_h\left[\|h\|_2^2\right] \quad (7)$$

As a result, to obtain the property $\mathrm{Var}[\tilde{y}_c] = \mathrm{Var}[z_i]$, we need to initialize $V$ the following way:

$$\mathrm{Var}[V_{ij}] = \left( d_z \cdot \mathbb{E}_{h_c}\left[\|h_c\|_2^2\right] \right)^{-1} \quad (8)$$

This makes the initialization dependent on the magnitude of $h_c$ instead of $a_c$, so normalizing attributes to a unit norm is not sufficient to preserve the variance. To initialize the weights of $V$ using this formula, a two-step data-dependent initialization would be required: first initializing $H_\phi$, then computing the average $\|h_c\|_2^2$, and only then initializing $V$. However, this is not reliable, since $\|h_c\|_2^2$ changes on each iteration, so we propose a more elegant solution: standardize $h_c$,

$$S(h_c) = (h_c - \hat{\mu}) / \hat{\sigma} \quad (9)$$

As one can note, this is similar to the BatchNorm standardization without the subsequent affine transform, but we apply it class-wise on top of the attribute embeddings $h_c$. We plug it in right before $V$, i.e. $P_\theta(a_c) = V S(H_\phi(a_c))$. This does not add any parameters and has imperceptible computational overhead. At test time, we use statistics accumulated during training, similar to batch norm. Standardization (9) makes the inputs to $V$ have a constant norm, which makes it trivial to pick a proper value to initialize $V$:

$$\mathbb{E}_{h_c}\left[\|S(h_c)\|_2^2\right] = d_h \;\Longrightarrow\; \mathrm{Var}[V_{ij}] = \frac{1}{d_z d_h}. \quad (10)$$

We coin the simultaneous use of (9) and (10) class normalization and highlight its influence in the following statement. See Fig. 3 for the model diagram, Fig. 1 for an empirical study of its impact, and Appendix C for the assumptions, proof and additional details.

Statement 3 (informal). The standardization procedure (9), together with the proper variance formula (10), preserves the variance between $z$ and $\tilde{y}$ for a multi-layer attribute embedder $P_\theta$.
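To make the scheme concrete, here is a minimal numpy sketch of the class-normalized embedder $P_\theta(a_c) = V S(H_\phi(a_c))$. The one-hidden-layer ReLU body, the toy sizes, and the choice to standardize over the batch of classes (per feature coordinate, i.e. the statistics one would accumulate for test time) are our illustrative assumptions, not the exact architecture from the experiments:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d_a, d_h, d_z = 50, 85, 512, 2048   # toy sizes: classes, attr/hidden/feature dims

A = rng.normal(size=(K, d_a))                                  # class attributes
W1 = rng.normal(size=(d_a, d_h)) * np.sqrt(2.0 / d_a)          # He init for the ReLU body H_phi
V = rng.normal(size=(d_h, d_z)) * np.sqrt(1.0 / (d_z * d_h))   # Var[V_ij] = 1/(d_z*d_h), Eq. (10)

H = np.maximum(A @ W1, 0.0)                           # h_c = H_phi(a_c), shape (K, d_h)
mu, sigma = H.mean(axis=0), H.std(axis=0) + 1e-8      # statistics over the batch of classes
S = (H - mu) / sigma                                  # standardization S(h_c), Eq. (9)
P = S @ V                                             # class prototypes p_c, shape (K, d_z)

# Eq. (10): the average squared norm of S(h_c) equals d_h, so Var over P is ~1/d_z
print((S ** 2).sum() / K, P.var() * d_z)
```

Because the standardization has no learnable parameters, adding it costs only the mean/std computation per forward pass, which matches the "imperceptible overhead" claim above.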

3.5. IMPROVED SMOOTHNESS

We also analyze the loss surface smoothness for $P_\theta$. There are many ways to measure this notion (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016; Dinh et al., 2017; Skorokhodov & Burtsev, 2019), but following Santurkar et al. (2018), we define it in a per-layer fashion via Lipschitzness:

$$g_\ell = \max_{\|X_{\ell-1}\|_2 \le \lambda} \|\nabla_{W_\ell} \mathcal{L}\|_2^2,$$

where $\ell$ is the layer index and $X_{\ell-1}$ is its input data matrix. This definition is intuitive: larger gradient magnitudes indicate that the loss surface is prone to abrupt changes. We demonstrate two things:

1. For each example in a batch, the parameters of a ZSL attribute embedder receive $K$ times more updates than those of a typical non-ZSL classifier, where $K$ is the number of classes. This suggests the hypothesis that it has a larger overall gradient magnitude and hence a more irregular loss surface.

2. Our standardization procedure (9) makes the surface smoother. We demonstrate this by applying Theorem 4.4 from Santurkar et al. (2018). Due to space constraints, we defer the exposition to Appendix F.

4. CONTINUAL ZERO-SHOT LEARNING 4.1 PROBLEM FORMULATION

In continual learning (CL), a model is trained on a sequence of tasks that arrive one by one. Each task is defined by a dataset $\mathcal{D}_t = \{x_i^t, y_i^t\}_{i=1}^{N_t}$ of size $N_t$. The goal of the model is to learn all the tasks sequentially in such a way that at each task $t$ it performs well both on the current task and on all the previously observed ones. In this section, we develop the ideas of Chaudhry et al. (2019) and formulate the Continual Zero-Shot Learning (CZSL) problem. Like CL, CZSL also assumes a sequence of tasks, but now each task is a generalized zero-shot learning problem. This means that apart from $\mathcal{D}_t$ we also receive a set of corresponding class descriptions $\mathcal{A}_t$ for each task $t$. In this view, traditional zero-shot learning can be seen as a special case of CZSL with just two tasks. In Chaudhry et al. (2019), the authors evaluate their zero-shot models on each task individually, without considering the classification space across tasks, looking only one step ahead, which gives a limited picture of the model's quality. Instead, we borrow ideas from Generalized ZSL (Chao et al., 2016; Xian et al., 2018a) and propose to measure the performance on all the seen and all the unseen data for each task. More formally, at timestep $t$ we have the datasets

$$\mathcal{D}_{\le t} = \bigcup_{r=1}^{t} \mathcal{D}_r \qquad \mathcal{D}_{> t} = \bigcup_{r=t+1}^{T} \mathcal{D}_r \qquad \mathcal{A}_{\le t} = \bigcup_{r=1}^{t} \mathcal{A}_r \qquad \mathcal{A}_{> t} = \bigcup_{r=t+1}^{T} \mathcal{A}_r \quad (12)$$

which are the datasets of all seen data (learned tasks), all unseen data (future tasks), seen class attributes, and unseen class attributes respectively. In our proposed CZSL, the model at timestep $t$ has access only to the data $\mathcal{D}_t$ and attributes $\mathcal{A}_t$, but its goal is to perform well on all seen data $\mathcal{D}_{\le t}$ and all unseen data $\mathcal{D}_{> t}$ with the corresponding attribute sets $\mathcal{A}_{\le t}$ and $\mathcal{A}_{> t}$. For $T = 2$, this is equivalent to traditional generalized zero-shot learning. But for $T > 2$, it is a novel and much more challenging problem.
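The pools in Eq. (12) are plain running unions over the task sequence. A toy sketch (lists of class ids stand in for the datasets $\mathcal{D}_t$; the task split is invented for illustration):

```python
# Toy CZSL timeline: T = 4 tasks, two classes each. After training on tasks 1..t,
# everything up to t counts as seen and everything after t as unseen.
tasks = [[0, 1], [2, 3], [4, 5], [6, 7]]

def czsl_pools(tasks, t):
    """Return (D_{<=t}, D_{>t}) as class-id pools; t is 1-indexed."""
    seen = [c for task in tasks[:t] for c in task]
    unseen = [c for task in tasks[t:] for c in task]
    return seen, unseen

print(czsl_pools(tasks, 2))   # ([0, 1, 2, 3], [4, 5, 6, 7])
print(czsl_pools(tasks, 1))   # ([0, 1], [2, 3, 4, 5, 6, 7])
```

Note that the prediction space at step $t$ always spans both pools, which is what distinguishes this setup from per-task evaluation.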

4.2. PROPOSED EVALUATION METRICS

Our metrics for CZSL use GZSL metrics under the hood and are based on the generalized accuracy (GA) (Chao et al., 2016; Xian et al., 2018a). "Traditional" seen (unseen) accuracy computation discards unseen (seen) classes from the prediction space, thus making the problem easier, since the model has fewer classes to be distracted by. For generalized accuracy, we always consider the joint space of both seen and unseen classes, and this is how GZSL-S and GZSL-U are constructed. We use this notion to construct the mean seen (mS), mean unseen (mU) and mean harmonic (mH) accuracies. We do so simply by measuring GZSL-S/GZSL-U/GZSL-H at each timestep, considering all the past data as seen and all the future data as unseen. Another set of CZSL metrics are the mean joint accuracy (mJA), which measures the performance across all the classes, and the mean area under the seen/unseen curve (mAUC), which is an adaptation of the AUSUC measure by Xian et al. (2018a). A more rigorous formulation of these metrics is presented in Appendix G.2. Apart from these, we also employ the popular forgetting measure (Lopez-Paz & Ranzato, 2017).
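For concreteness, the mH metric reduces to averaging per-timestep GZSL-H scores; the accuracy values in the sketch below are invented purely for illustration:

```python
def gzsl_h(acc_s, acc_u):
    """Harmonic mean of generalized seen/unseen accuracies (GZSL-H)."""
    return 0.0 if acc_s + acc_u == 0 else 2 * acc_s * acc_u / (acc_s + acc_u)

# mean harmonic accuracy (mH): average GZSL-H over timesteps, where at step t
# all past classes count as seen and all future ones as unseen
per_step = [(0.70, 0.30), (0.65, 0.28), (0.60, 0.25)]   # (GZSL-S, GZSL-U), made up
mH = sum(gzsl_h(s, u) for s, u in per_step) / len(per_step)
print(round(mH, 3))   # 0.388
```

The harmonic mean deliberately punishes the common failure mode where seen accuracy is high but unseen accuracy collapses, which is why it serves as the headline metric.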

5.1. ZSL EXPERIMENTS

Experiment details. We use 4 standard datasets: SUN (Patterson et al., 2014), CUB (Welinder et al., 2010), AwA1 and AwA2, with the seen/unseen splits from Xian et al. (2018a). They have 645/72, 150/50, 40/10 and 40/10 seen/unseen classes respectively, with $d_a$ equal to 102, 312, 85 and 85 respectively. Following standard practice, we use ResNet101 image features (with $d_z = 2048$) from Xian et al. (2018a). Our attribute embedder $P_\theta$ is a vanilla 3-layer MLP augmented with the standardization procedure (9) and the corrected output matrix initialization (10). For all the datasets, we train the model with the Adam optimizer for 50 epochs and evaluate it at the end of training. We also employ the NS and AN techniques, with $\gamma = 5$ for NS. Additional hyperparameters are reported in Appx. D. To perform cross-validation, we first allocate 10% of the seen classes as validation unseen data (for AwA1 and AwA2 we allocate 15%, since there are only 40 seen classes). Then we allocate 10% out of the remaining 85% of the data as validation seen data. This means that in total we allocate ≈30% of all the seen data for validation. It is known (Xian et al., 2018a; Min et al., 2020) that the GZSL-H score can be improved slightly by reducing the weight of seen class logits during accuracy computation, since this partially relieves the bias towards seen classes. We also employ this trick by multiplying seen class logits by a value $s$ during evaluation, and we find its optimal value via cross-validation together with the other hyperparameters. In Figure 4 in Appendix D.4, we provide validation/test accuracy curves showing how it influences the performance.

Evaluation and discussion. We evaluate the model on the corresponding test sets using the 3 metrics proposed by Xian et al. (2018a): generalized unseen accuracy (GZSL-U), generalized seen accuracy (GZSL-S) and their harmonic mean (GZSL-H), which is considered to be the main metric for ZSL.
Table 2 shows that our model achieves the state of the art on 3 out of 4 datasets.

Training speed results. We conducted a survey and reran several recent SotA methods from their official implementations to check their training speed; the details are reported in Appx. D. Table 2 shows the average training time for each of the methods. Since our model is just a vanilla MLP and does not use any sophisticated training scheme, it trains 30 to 500 times faster than the other methods, while outperforming them in final performance.

5.2. CZSL EXPERIMENTS

Datasets. We test our approach in the CZSL scenario on two datasets: CUB (Welinder et al., 2010) and SUN (Patterson et al., 2014). CUB contains 200 classes and is randomly split into 10 tasks with 20

6. CONCLUSION

We investigated and developed normalization techniques for zero-shot learning. We provided theoretical groundings for two popular tricks, normalize+scale and attributes normalization, and showed both provably and in practice that they aid training by controlling a signal's variance during the forward pass. Next, we demonstrated that they are not enough to constrain a signal from fluctuations in a deep ZSL model. This motivated us to develop class normalization: a new normalization scheme that fixes the problem and obtains SotA results on 4 standard ZSL datasets in terms of both quantitative performance and training speed. Next, we showed that ZSL attribute embedders tend to have a more irregular loss landscape than traditional classifiers and that class normalization partially remedies this issue. Finally, we generalized ZSL to the broader setting of continual zero-shot learning and proposed a set of principled metrics and baselines for it. We believe that our work will spur the development of stronger zero-shot systems and motivate their deployment in real-world applications.

A "NORMALIZE + SCALE" TRICK

As discussed in Section 3.2, the "normalize+scale" trick changes the logits computation from a usual dot product to the scaled cosine similarity:

$$\hat{y}_c = \langle z, p_c \rangle \;\Longrightarrow\; \hat{y}_c = \left\langle \gamma \frac{z}{\|z\|}, \gamma \frac{p_c}{\|p_c\|} \right\rangle, \quad (13)$$

where $\hat{y}_c$ is the logit value for class $c$; $z$ is an image feature vector; $p_c$ is the attribute embedding for class $c$:

$$p_c = P_\theta(a_c) = V H_\phi(a_c) \quad (14)$$

and $\gamma$ is the scaling hyperparameter. Let us denote the penultimate hidden representation of $P_\theta$ as $h_c = H_\phi(a_c)$. We note that in the case of a linear $P_\theta$ we have $h_c = a_c$. Let us also denote the dimensionalities of $z$ and $h_c$ by $d_z$ and $d_h$.

A.1 ASSUMPTIONS

To derive the approximate variance formula for $\hat{y}_c$ we will use the following assumptions and approximate identities:

(i) All weights in matrix $V$:
- are independent of each other and of $z_k$ and $h_{c,i}$ (for all $k, i$);
- $\mathbb{E}[V_{ij}] = 0$ for all $i, j$;
- $\mathrm{Var}[V_{ij}] = s_v$ for all $i, j$.

(ii) There exists $\epsilon > 0$ such that the $(2+\epsilon)$-th central moment exists for each of $h_{c,1}, ..., h_{c,d_h}$. We require this technical condition to be able to apply the central limit theorem for variables with non-equal variances.

(iii) All $h_{c,i}, h_{c,j}$ are independent of each other for $i \ne j$. This is the least realistic assumption on the list, because in the case of a linear $P_\theta$ it would be equivalent to independence of the coordinates of the attribute vector $a_c$. We are not going to use it in other statements, and as we show in Appendix A.3 it works well in practice.

(iv) All $p_{c,i}, p_{c,j}$ are independent of each other. This is also a nasty assumption, but it is safer to make in practice (for example, it is easy to demonstrate that $\mathrm{Cov}[p_{c,i}, p_{c,j}] = 0$ for $i \ne j$). We are going to use it only in the derivation of the approximate variance formula for normalize+scale.

(v) $z \sim \mathcal{N}(0, s_z I)$. This property is safe to assume since $z$ is usually a hidden representation of a deep neural network and each coordinate is computed as a vector-vector product between independent vectors, which results in an approximately normal distribution (see the proof below, where we show that $p_c$ is approximately $\mathcal{N}(0, s_p I)$).
(vi) For $\xi \in \{z, p_c\}$ we will use the approximations:

$$\mathbb{E}\left[\xi_i \cdot \frac{1}{\|\xi\|_2}\right] \approx \mathbb{E}[\xi_i] \cdot \mathbb{E}\left[\frac{1}{\|\xi\|_2}\right] \quad \text{and} \quad \mathbb{E}\left[\xi_i \xi_j \cdot \frac{1}{\|\xi\|_2^2}\right] \approx \mathbb{E}[\xi_i \xi_j] \cdot \mathbb{E}\left[\frac{1}{\|\xi\|_2^2}\right] \quad (15)$$

This approximation is safe to use if the dimensionality of $\xi$ is large enough (for neural networks it definitely is), because the contribution of each individual $\xi_i$ to the norm $\|\xi\|_2$ becomes negligible. Assumptions (i)-(v) are typical for this kind of analysis and can also be found in (Glorot & Bengio, 2010; He et al., 2015; Chang et al., 2020). Assumption (vi), as noted, holds only for large-dimensional inputs, but this is exactly our case, and we validate in Figure 2 that using it leads to a decent approximation.

A.2 FORMAL STATEMENT AND THE PROOF

Statement 1 (Normalize+scale trick). If conditions (i)-(vi) hold, then:

Var[ŷ_c] = Var[⟨γ z/‖z‖, γ p_c/‖p_c‖⟩] ≈ γ⁴ d_z / (d_z − 2)².   (16)

Proof. First of all, we need to show that p_{c,i} ∼ N(0, s_p) for some constant s_p. Since p_{c,i} = Σ_{j=1}^{d_h} V_{i,j} h_{c,j}, from assumption (i) we can easily compute its mean:

E[p_{c,i}] = Σ_{j=1}^{d_h} E[V_{i,j} h_{c,j}] = Σ_{j=1}^{d_h} E[V_{i,j}] · E[h_{c,j}] = 0,

and its variance:

Var[p_{c,i}] = E[p²_{c,i}] − (E[p_{c,i}])² = E[(Σ_{j=1}^{d_h} V_{i,j} h_{c,j})²] = E[Σ_{j,k=1}^{d_h} V_{i,j} V_{i,k} h_{c,j} h_{c,k}].

Using E[V_{i,j} V_{i,k}] = 0 for k ≠ j, we have:

Var[p_{c,i}] = Σ_{j=1}^{d_h} E[V²_{i,j}] E[h²_{c,j}].

Since s_v = Var[V_{i,j}] = E[V²_{i,j}] − E[V_{i,j}]² = E[V²_{i,j}], we obtain:

Var[p_{c,i}] = s_v Σ_{j=1}^{d_h} E[h²_{c,j}] = s_v E[‖h_c‖²_2] =: s_p.

Now, from assumptions (ii) and (iii) we can apply Lyapunov's central limit theorem to p_{c,i}, which gives us (1/√s_p) p_{c,i} → N(0, 1); for finite d_h this allows us to say that approximately p_{c,i} ∼ N(0, s_p).

Next, from (vi) we have:

E[ŷ_c] = E[⟨γ z/‖z‖, γ p_c/‖p_c‖⟩] ≈ γ² E[1/‖z‖] · E[1/‖p_c‖] · ⟨E[z], E[p_c]⟩ = 0.

Since ξ ∼ N(0, s_ξ I) for ξ ∈ {z, p_c}, the quantity d_ξ/‖ξ‖²_2 follows a scaled inverse chi-squared distribution with inverse scale τ = 1/s_ξ, which has a known expression for its expectation:

E[d_ξ/‖ξ‖²_2] = τ d_ξ / (d_ξ − 2) = d_ξ / (s_ξ (d_ξ − 2)).

Now we are left with using approximation (vi) and plugging the above expression into the variance formula:

Var[ŷ_c] = E[⟨γ z/‖z‖, γ p_c/‖p_c‖⟩²] − E[⟨γ z/‖z‖, γ p_c/‖p_c‖⟩]²
≈ γ⁴ E[(zᵀp_c)² / (‖z‖² ‖p_c‖²)] − 0
≈ γ⁴ E[(zᵀp_c)²] · E[1/‖z‖²] · E[1/‖p_c‖²]
= γ⁴ E[p_cᵀ E[zzᵀ] p_c] · 1/(s_z (d_z − 2)) · 1/(s_p (d_z − 2))
= γ⁴ s_z E[‖p_c‖²_2] · 1/(s_z s_p (d_z − 2)²)
= γ⁴ d_z s_p / (s_p (d_z − 2)²)
= γ⁴ d_z / (d_z − 2)². ∎

A.3 EMPIRICAL VALIDATION

In this subsection, we validate the derived approximation empirically (for the extended variance analysis, see Appendix E). We perform two experiments:

• Synthetic data. We sample x ∼ N(0, I_d), y ∼ N(0, I_d) for dimensionalities d = 32, 64, 128, ..., 8192 and compute the scaled cosine similarity z = ⟨γ x/‖x‖, γ y/‖y‖⟩. After that, we compute Var[z], averaged across samples. The result is presented in Figure 2a.

• Real data. We take ImageNet-pretrained ResNet101 features and real (unnormalized) class attributes for the SUN, CUB, AwA1, AwA2 and aPY datasets. We then initialize a random 2-layer MLP with 512 hidden units and generate real logits (without scaling). Finally, we compute the mean empirical variance and the corresponding standard deviation over batches of size 4096. The resulting boxplots are presented in Figure 2b.

In both experiments, we computed the logits with γ = 1. As one can see, even despite our demanding assumptions, the predicted variance formula is accurate for both synthetic and real-world data.
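The synthetic experiment above is easy to reproduce. The following minimal sketch (with dimensionalities and sample counts reduced for speed relative to the paper's setup) compares the empirical variance of the scaled cosine similarity against the predicted γ⁴ d_z / (d_z − 2)²:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_samples = 1.0, 20_000
results = {}

for d in (64, 256, 1024):
    x = rng.normal(size=(n_samples, d))
    y = rng.normal(size=(n_samples, d))
    cos = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    empirical = np.var(gamma ** 2 * cos)        # empirical Var[y_c]
    predicted = gamma ** 4 * d / (d - 2) ** 2   # Statement 1 approximation
    results[d] = (empirical, predicted)
```

The two values agree to within a few percent already at d = 256, and the agreement improves as d grows, consistent with the high-dimensionality requirement of assumption (vi).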

B ATTRIBUTES NORMALIZATION

We will use the same notation as in Appendix A. The attributes normalization trick normalizes attributes to unit L2-norm:

a_c → a_c / ‖a_c‖_2.

We will show that it helps to preserve the variance of the pre-logit computation when the attribute embedder P_θ is linear:

ỹ_c = zᵀ P_θ(a_c) = zᵀ V a_c.

For a non-linear attribute embedder this no longer holds, which is why we need the proposed initialization scheme.

B.1 ASSUMPTIONS

We will need the following assumptions:

(i) The feature vector z has the properties:
  • E[z] = 0;
  • Var[z_i] = s_z for all i = 1, ..., d_z;
  • all z_i are independent from each other and from all p_{c,j}.

(ii) The weight matrix V is initialized in Xavier fan-out mode, i.e. Var[V_{ij}] = 1/d_z, with entries independent from each other.

Note that we make no assumptions about a_c. This is the core difference from Chang et al. (2020) and an essential condition for ZSL (see Appendix H).
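These conditions are easy to check numerically. The following sketch (the dimensions below are arbitrary illustrative choices) draws a Xavier fan-out matrix V and a unit-norm attribute vector, and verifies that the pre-logit variance equals s_z, as Statement 2 below asserts:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_a, s_z = 512, 85, 1.0

a = rng.normal(size=d_a)
a /= np.linalg.norm(a)                 # attributes normalization: ||a_c||_2 = 1
# Xavier fan-out initialization: Var[V_ij] = 1/d_z
V = rng.normal(scale=np.sqrt(1.0 / d_z), size=(d_z, d_a))
z = rng.normal(scale=np.sqrt(s_z), size=(100_000, d_z))  # z ~ N(0, s_z I)

y = z @ (V @ a)                        # pre-logits over many feature samples
print(np.var(y))                       # close to s_z = 1.0
```

For a single draw of V the empirical variance fluctuates around s_z by roughly √(2/d_z), so the agreement tightens as d_z grows.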

B.2 FORMAL STATEMENT AND THE PROOF

Statement 2 (Attributes normalization for a linear embedder). If assumptions (i)-(ii) are satisfied and ‖a_c‖_2 = 1, then:

Var[ỹ_c] = Var[z_i] = s_z.   (56)

Proof. First, note that:

E[ỹ_c] = E[zᵀ V a_c] = E_{V,a_c}[E_z[zᵀ] V a_c] = E[0ᵀ V a_c] = 0.

Then the variance of ỹ_c takes the form:

Var[ỹ_c] = E[ỹ²_c] − E[ỹ_c]²
= E[(zᵀ V a_c)²]
= E[a_cᵀ Vᵀ z zᵀ V a_c]
= E_{a_c}[a_cᵀ E_V[Vᵀ E_z[z zᵀ] V] a_c]
= s_z E_{a_c}[a_cᵀ E_V[Vᵀ V] a_c]
= s_z s_v d_z E_{a_c}[a_cᵀ a_c].

Since s_v = 1/d_z, this equals s_z E[‖a_c‖²_2], and since the attributes are normalized, i.e. ‖a_c‖_2 = 1, we obtain Var[ỹ_c] = s_z. ∎

C NORMALIZATION FOR A DEEP ATTRIBUTE EMBEDDER

C.1 FORMAL STATEMENT AND THE PROOF

Using the same derivation as in Appendix B, one can show that for a deep attribute embedder

P_θ(a_c) = V · H_ϕ(a_c),

normalizing the attributes is not enough to preserve the variance of ỹ_c, because

Var[ỹ_c] = s_z E[‖h_c‖²_2],

and h_c = H_ϕ(a_c) is not normalized to unit norm. To fix the issue, we use two mechanisms:

1. A different initialization scheme:

Var[V_{ij}] = 1 / (d_z d_h).   (70)

2. A standardization layer before the final projection matrix:

S(x) = (x − μ_x) ⊘ σ_x,

where μ_x, σ_x are the sample mean and standard deviation and ⊘ denotes element-wise division.
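The two mechanisms above can be sketched together in numpy. This is a hedged illustration rather than the paper's implementation: the layer sizes, the ReLU hidden layer and the helper name `class_standardize` are our choices. The class-wise standardization (9) plus the corrected initialization (10) keep the logits variance close to s_z even for a deep embedder:

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_h, d_z, K = 85, 256, 512, 50
s_z = 1.0

def class_standardize(H, eps=1e-8):
    """Eq. (9): standardize each coordinate across the K class embeddings."""
    return (H - H.mean(axis=0)) / (H.std(axis=0) + eps)

A = rng.normal(size=(K, d_a))                                      # class attribute vectors a_c
W = rng.normal(scale=np.sqrt(2.0 / d_a), size=(d_a, d_h))          # hidden layer (He init)
V = rng.normal(scale=np.sqrt(1.0 / (d_z * d_h)), size=(d_h, d_z))  # Eq. (10): Var = 1/(d_z * d_h)

H = np.maximum(A @ W, 0.0)      # hidden class representations h_c = H_phi(a_c)
H = class_standardize(H)        # class-wise standardization S
P = H @ V                       # class embeddings p_c, shape (K, d_z)

Z = rng.normal(scale=np.sqrt(s_z), size=(50_000, d_z))             # image features z ~ N(0, s_z I)
logits = Z @ P.T
print(np.var(logits))           # stays close to s_z = 1.0
```

Dropping either the standardization or the 1/(d_z d_h) factor in this sketch makes the logits variance drift away from s_z, which mirrors the ablations in Appendix D.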

C.2 ASSUMPTIONS

We will need the following assumption:

(i) The feature vector z has the properties:
  • E[z] = 0;
  • Var[z_i] = s_z for all i = 1, ..., d_z;
  • all z_i are independent from each other and from all p_{c,j}.

Statement 3 (Normalization for a deep attribute embedder). Under assumption (i), with the initialization (10) and the standardization layer (9), we have:

Var[ỹ_c] ≈ Var[z_i] = s_z.   (72)

Proof. With some abuse of notation, let h_c = S(h_c) (in practice, S receives a batch of h_c's instead of a single vector). This gives:

E[h_c] = 0 and Var[h_{c,i}] ≈ 1.

Using the same reasoning as in Appendix B, one can show that:

Var[ỹ_c] = d_z · s_z · Var[V_{ij}] · E[‖h_c‖²_2] = (s_z / d_h) · E[‖h_c‖²_2].

So we are left to demonstrate that E[‖h_c‖²_2] = d_h:

E[‖h_c‖²_2] = Σ_{i=1}^{d_h} E[h²_{c,i}] = Σ_{i=1}^{d_h} Var[h_{c,i}] ≈ d_h. ∎

D ZSL EXPERIMENT DETAILS

D.1 MODEL AND OPTIMIZATION DETAILS

We depict the architecture of our model in Figure 3. As stated, P_θ is just a simple multi-layer MLP with the standardization procedure (9) and the adjusted output-layer initialization (10). Besides, for some datasets we also found it useful to use an entropy regularizer (the same one that is often used in policy gradient methods to encourage exploration):

L_ent(p̄) = −H(p̄) = Σ_{c=1}^K p̄_c log p̄_c.

We train the model with the Adam optimizer with default β1 and β2 hyperparameters. The list of hyperparameters is presented in Table 4. In those ablation experiments where we do not use attributes normalization (2), we apply simple standardization to convert the attributes to zero mean and unit variance.

D.2 ADDITIONAL EXPERIMENTS AND ABLATIONS

In this section, we present additional experiments and ablation studies for our approach (results are presented in Table 5):

• Replacing the standardization procedure (9) with the normalization layer

DN(h) = h / E[‖h‖²_2]   (77)

inserted between V and H_ϕ, i.e. P_θ(a_c) = V · DN(H_ϕ(a_c)). The expectation E[‖h‖²_2] is computed over a batch on each iteration. A downside of such an approach is that if the dimensionality is large, then many dimensions get suppressed, leading to poor signal propagation. Besides, one has to compute running statistics to use at test time, which is cumbersome.

• Traditional initializations + the standardization procedure (9). These experiments ablate the necessity of the corrected variance formula (10).

• Performance of NS for different scaling values γ and different numbers of layers.

D.3 MEASURING TRAINING SPEED

We conducted a survey and searched for open-source implementations of classification ZSL papers recently published at top conferences. This was done by 1) checking the papers for code URLs; 2) checking their supplementary material; 3) searching for implementations on github.com; and 4) searching for the authors by name on github.com and checking their repository lists. As a result, we found 8 open-source implementations of recent methods, but one of them was discarded since the corresponding data was not provided. We reran all of them with the official hyperparameters on the corresponding datasets and report their training time in Table 6 in Appx D.

Table 6: Training time for the recent ZSL methods that made their official implementations publicly available. We reran them on the corresponding datasets with the official hyperparameters and training setups. All comparisons are done on the same machine and hardware: NVidia GeForce RTX 2080 Ti GPU, Intel Xeon Gold 6142 CPU and 64 GB RAM. N/C stands for "no code", meaning that the authors did not release code for a particular dataset.


All runs use the official hyperparameters and training setups and the same hardware: an NVidia GeForce RTX 2080 Ti GPU, a ×16 Intel Xeon Gold 6142 CPU and 128 GB RAM. The results are reported in Table 6. As one can see, the proposed method trains 50-100 times faster than the recent SotA. This is due to not using any sophisticated architectures employing generative models (Xian et al., 2018b; Narayan et al., 2020) or optimization schemes like episode-based training (Li et al., 2019; Yu et al., 2020).

D.4 CHOOSING SCALE s FOR SEEN CLASSES

As mentioned in Section 5, we reweigh seen-class logits by multiplying them by a scale value s. This is similar to the strategy considered by Xian et al. (2018a); Min et al. (2020), but we found that multiplying by a value is more intuitive than adding one. We find the optimal scale value s by cross-validation, together with all other hyperparameters, on the grid [1.0, 0.95, 0.9, 0.85, 0.8]. In Figure 4, we depict how s influences GZSL-U/GZSL-S/GZSL-H for each dataset.
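A minimal sketch of this calibration step (the helper name is hypothetical and the logit values are made up for illustration): multiplying the seen-class logits by s < 1 can flip a borderline prediction from a seen class to an unseen one.

```python
import numpy as np

def scale_seen_logits(logits, seen_mask, s=0.95):
    """Reweigh seen-class logits by a factor s at evaluation time."""
    out = np.array(logits, dtype=float)
    out[seen_mask] *= s
    return out

logits = np.array([3.0, 1.0, 2.9])      # classes 0, 1 are seen; class 2 is unseen
seen = np.array([True, True, False])

before = int(np.argmax(logits))                                  # seen class 0 wins
after = int(np.argmax(scale_seen_logits(logits, seen, s=0.9)))   # unseen class 2 wins
```

Since s is applied only at evaluation, this is a one-parameter calibration that trades GZSL-S for GZSL-U without retraining.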

D.5 INCORPORATING CN FOR OTHER ATTRIBUTE EMBEDDERS

In this section, we employ our proposed class normalization in two other methods: RelationNet (Sung et al., 2018) and CVC-ZSL (Li et al., 2019) (see the footnotes for their official code). We build upon the officially provided codebases and use the official hyperparameters for all setups. For RelationNet, the authors provided the running commands. For CVC-ZSL, we used the hyperparameters specified in the paper for each dataset, which included using weight decays of 1e-4, 1e-3, 1e-3 and 1e-5 for AwA1, AwA2, CUB and SUN respectively, as stated in Section 4.2 of the paper (Li et al., 2019). We incorporated our class normalization procedure into these attribute embedders and launched them on the corresponding datasets. The results are reported in Table 7. We could not reproduce the official results for either of these methods, so we additionally report our reproduced numbers. As one can see from the presented results, our method gives +2.0 and +1.8 GZSL-H improvement on average for the two methods respectively, which once again emphasizes its favorable influence on the learned representations.

D.6 ADDITIONAL ABLATION ON AN AND NS TRICKS

In Table 8 we provide additional ablations on the attributes normalization and normalize+scale tricks. As one can see, they greatly influence the performance of ZSL attribute embedders.

E ADDITIONAL VARIANCE ANALYZIS

In this section, we provide an extended variance analysis for different setups and datasets. The following models are used:

1. A linear ZSL model with/without normalize+scale (NS) and/or attributes normalization (AN).
2. A 3-layer ZSL model with/without NS and/or AN.
3. A 3-layer ZSL model with class normalization, with/without NS and/or AN.

These models are trained on the 4 standard ZSL datasets SUN, CUB, AwA1 and AwA2, and their logits variance is computed and reported on each iteration. The same batch size, learning rate, number of epochs and hidden dimensionalities were used. The results are presented in Figures 5, 6, 7 and 8, which illustrate the same trend:

• A traditional linear model without NS and AN has poor variance.
• Adding NS with a proper scaling of γ = 5 and AN improves it and keeps it bounded close to 1.
• After introducing new layers, NS and AN stop "working" and the variance vanishes below unity.
• Incorporating class normalization pushes it back to 1.

F LOSS LANDSCAPE SMOOTHNESS ANALYSIS F.1 OVERVIEW

As stated, we demonstrate two things:

1. For each example in a batch, the parameters of a ZSL attribute embedder receive K times more updates than those of a typical non-ZSL classifier, where K is the number of classes. This suggests the hypothesis that it has a larger overall gradient magnitude and hence a more irregular loss surface.

2. Our standardization procedure (9) makes the surface smoother. We demonstrate this by directly applying Theorem 4.4 from (Santurkar et al., 2018).

To see the first point, one just needs to compute the derivative with respect to a weight W_ij for the n-th data sample, for the loss L^SL_n of a traditional model and the loss L^ZSL_n of a ZSL embedder:

∂L^SL_n/∂W_ij = (∂L^SL_n/∂y^(n)_i) x^(n)_j,    ∂L^ZSL_n/∂W_ij = Σ_{c=1}^K (∂L^ZSL_n/∂y_i) x_j(a_c).

Since the gradient has K times more terms and these updates are not independent from each other (the final representations are combined into a single logits vector after a dot product with z_n), this may lead to an increased overall gradient magnitude. We verify this empirically by computing the gradient magnitudes for our model and its non-ZSL "equivalent": a model with the same number of layers and hidden dimensionalities, but trained to classify objects in a non-ZSL fashion.

To show that our class standardization procedure (9) smoothes the landscape, we apply Theorem 4.4 from Santurkar et al. (2018), which demonstrates that a model augmented with batch normalization (BN) has a smaller Lipschitz constant. This is easily done after noticing that (9) is equivalent to BN without scaling/shifting, applied class-wise instead of batch-wise. We empirically validate the above observations in Figures 1 and 9.

F.2 FORMAL REASONING

ZSL embedders are prone to a more irregular loss surface. We demonstrate that the loss surface of the attribute embedder P_θ is more irregular than that of a traditional neural network.
The reason is that its output vectors p_c = P_θ(a_c) are not used independently, but are instead combined into a single matrix W = [p_1, ..., p_{K_s}] to compute the logits vector y = Wᵀz. Because of this, the gradient update for θ_i receives K signals instead of just 1 as for a traditional model, where K is the number of classes. Consider a classification neural network F_ψ(x) optimized with loss L, and some intermediate transformation y = Wx within it. The gradient of L_n on the n-th training example with respect to W_ij is computed as:

y = Wx =⇒ ∂L_n/∂W_ij = (∂L_n/∂y^(n)_i)(∂y^(n)_i/∂W_ij) = (∂L_n/∂y^(n)_i) x^(n)_j.

For the attribute embedder P_θ(a_c), we have K times more terms in the above sum, since we perform K forward passes, one for each individual class attribute vector a_c. The gradient on the n-th training example for its inner transformation y = W x(a_c) is computed as:

y = W x(a_c) =⇒ ∂L_n/∂W_ij = Σ_{c=1}^K (∂L_n/∂y_i) x_j(a_c).

From this, we can see that the gradient for P_θ has K times more terms, which may lead to an increased overall gradient magnitude and hence a more irregular loss surface, as defined in Section 3.5.

CN smoothes the loss landscape. In contrast to the previous point, we can prove this rigorously by applying Theorem 4.4 of Santurkar et al. (2018), who showed that performing standardization across hidden representations smoothes the loss surface of a neural network. Namely, Santurkar et al. (2018) proved the following:

Theorem 4.4 from (Santurkar et al., 2018). For a network with BatchNorm with loss L̂ and a network without BatchNorm with loss L, if:

g = max_{‖X‖≤λ} ‖∇_W L‖²,    ĝ = max_{‖X‖≤λ} ‖∇_W L̂‖²,

then:

ĝ ≤ (γ²/σ²) (g² − m μ²_g − λ² ⟨∇_{y_ℓ} L̂, ŷ_ℓ⟩²),

where y_ℓ, ŷ_ℓ are the hidden representations at the ℓ-th layer, m is their dimensionality, σ is their standard deviation, μ_g = (1/m)⟨1, ∂L̂/∂z⟩ for z = γŷ_ℓ + β is the average gradient, γ is the BN scaling parameter, and X is the input data matrix at layer ℓ.
Now, it is easy to see that our class standardization (9) is "equivalent" to BN (and thus the above theorem can be applied to our model):

G.2 ADDITIONAL CZSL METRICS

In this subsection, we describe our proposed CZSL metrics, which are used to assess a model's performance. Subscripts "tr"/"ts" denote train/test data.

Mean Seen Accuracy (mSA). We compute GZSL-S after each task t = 1, ..., T and take the average:

mSA(F) = (1/T) Σ_{t=1}^T GZSL-S(F, D^{≤t}_{ts}, A^{≤t}).

Mean Unseen Accuracy (mUA). We compute GZSL-U after tasks t = 1, ..., T−1 (we do not compute it after task T since D^{>T} = ∅) and take the average:

mUA(F) = (1/(T−1)) Σ_{t=1}^{T−1} GZSL-U(F, D^{>t}_{ts}, A^{>t}).

Mean Harmonic Seen/Unseen Accuracy (mH). We compute GZSL-H after tasks t = 1, ..., T−1 and take the average:

mH(F) = (1/(T−1)) Σ_{t=1}^{T−1} GZSL-H(F, D^{≤t}_{ts}, D^{>t}_{ts}, A).

Mean Area Under Seen/Unseen Curve (mAUC). We compute AUSUC (Chao et al., 2016) after tasks t = 1, ..., T−1 and take the average:

mAUC(F) = (1/(T−1)) Σ_{t=1}^{T−1} AUSUC(F, D^{≤t}_{ts}, D^{>t}_{ts}, A).

AUSUC is a performance metric that detects a model's bias towards seen or unseen data; in our case it measures this in a continual fashion.

Mean Joint Accuracy (mJA). After each task t we compute the generalized accuracy on all the test data of the entire problem:

mJA(F) = (1/T) Σ_{t=1}^T ACC(F, D_{ts}, A).

This evaluation measure shows how far behind a model is from traditional supervised classifiers. A perfect model would generalize to all the unseen classes from the very first task and maintain performance on par with normal classifiers.
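For concreteness, the first three metrics can be computed from per-task accuracy curves as follows. This is a hedged sketch: the helper names and the toy accuracy values are ours; seen_acc[t] and unseen_acc[t] stand for GZSL-S and GZSL-U measured after task t+1.

```python
def gzsl_h(s, u):
    """Harmonic mean of seen and unseen accuracy for a single task."""
    return 2 * s * u / (s + u) if s + u > 0 else 0.0

def czsl_metrics(seen_acc, unseen_acc):
    """seen_acc: GZSL-S after tasks 1..T; unseen_acc: GZSL-U after tasks 1..T-1."""
    T = len(seen_acc)
    mSA = sum(seen_acc) / T
    mUA = sum(unseen_acc) / (T - 1)
    mH = sum(gzsl_h(s, u) for s, u in zip(seen_acc, unseen_acc)) / (T - 1)
    return mSA, mUA, mH

# Toy 3-task example: seen accuracy grows; unseen accuracy is measured after tasks 1 and 2.
mSA, mUA, mH = czsl_metrics([0.6, 0.7, 0.8], [0.2, 0.3])
```

Note that mH averages the per-task harmonic means rather than taking the harmonic mean of mSA and mUA, so a model that is badly unbalanced on even one task is penalized.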

H WHY CANNOT WE HAVE INDEPENDENCE, ZERO-MEAN AND SAME-VARIANCE ASSUMPTIONS FOR ATTRIBUTES IN ZSL?

Usually, when deriving an initialization scheme, one assumes that the input random vectors have zero mean and the same coordinate variance, and that the coordinates are independent from each other. In the paper, we stated that these are unrealistic assumptions for class attributes in ZSL, and in this section we elaborate on this.

Attribute values for the common datasets would need to be standardized to satisfy the zero-mean and unit-variance (or any other same-variance) assumption. But this is not a sensible thing to do if the data does not follow a normal distribution, because one is then likely to encounter a skewed long-tail distribution like the one illustrated in Figure 14. In reality, this does not break our theoretical derivations, but it creates an additional optimization issue which hampers training and which we illustrate in Table 8. This observation is also confirmed by Changpinyo et al. (2016b).

If we drop the unit-norm assumption and instead standardize attributes to zero mean and unit variance (enforcing this during training), then formula (6) transforms into:

Var[ỹ_c] = d_z · Var[z_i] · Var[V_{ij}] · E[aᵀa] = d_z · Var[z_i] · Var[V_{ij}] · d_a.   (88)

This means that we need to adjust the initialization by a factor of 1/d_a to preserve the variance, i.e. initialize the first projection matrix with the variance:

Var[V_{ij}] = (1/d_z) · (1/d_a).

In Table 10, we show what happens when we do and do not account for this factor. As one can see, just standardizing the attributes without accounting for the 1/d_a factor leads to worse performance.

To show more rigorously that attributes do not follow a normal distribution and are not independent from each other, we report two statistical results:

• Results of a normality test based on D'Agostino and Pearson's tests, which comes with the scipy python stats library. We run it for each attribute dimension of each dataset and report the distribution of the resulting χ²-statistics with the corresponding p-values in Figure 12.
• The distribution of absolute values of correlation coefficients between attribute dimensions. The results are presented in Figure 13, which demonstrates that attribute dimensions are not independent from each other in practice, and thus we cannot use the common independence assumption when deriving an initialization scheme for a ZSL embedder.

We note, however, that the attributes distribution is uni-modal and, in theory, it is possible to transform it into a normal one (via log/inverse/sqrt/etc. transforms), but such an approach is far from scalable: transforming a non-normal distribution into a normal one is tricky and is done either manually, by finding a proper transformation, or by solving an optimization task. This is tedious to do for each dataset and thus scales poorly.
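The 1/d_a correction from (88) can be checked numerically. In the sketch below (the dataset size and the skewed gamma-distributed raw attributes are invented for illustration), standardized attributes combined with Var[V_ij] = 1/(d_z d_a) give unit pre-logit variance:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_a, s_z = 512, 85, 1.0

A = rng.gamma(shape=2.0, size=(200, d_a))        # skewed, long-tailed raw attributes
A = (A - A.mean(axis=0)) / A.std(axis=0)         # standardize: zero mean, unit variance
# Corrected initialization: Var[V_ij] = 1/(d_z * d_a), since now E||a_c||^2 = d_a
V = rng.normal(scale=np.sqrt(1.0 / (d_z * d_a)), size=(d_z, d_a))
z = rng.normal(scale=np.sqrt(s_z), size=(50_000, d_z))

Y = z @ (V @ A.T)                                # pre-logits for all 200 classes
print(np.var(Y))                                 # close to s_z = 1.0
```

Dropping the 1/d_a factor in this sketch inflates the pre-logit variance by a factor of d_a, matching the degradation reported in Table 10.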



RelationNet: https://github.com/lzrobots/LearningToCompare_ZSL
CVC-ZSL: https://github.com/kailigo/cvcZSL



Figure 2: Empirical validation of the derived approximation for variance (4)

Figure 3: Our architecture: a plain MLP with the standardization procedure (9) inserted before the final projection, and the output matrix V initialized using (10).

In this appendix, we cover hyperparameter and training details for our ZSL experiments and also provide extended training-speed comparisons with other methods.


Figure 4: Optimal value of the seen-logits scale s for different datasets. Multiplying seen logits by some scale s < 1 during evaluation sacrifices GZSL-S for an increase in GZSL-U, which results in an increased GZSL-H value (Xian et al. (2018a); Min et al. (2020)). The high gap between validation/test accuracy is caused by the different numbers of classes in these two sets. The lower test GZSL-H than reported in Table 2 is caused by splitting the train set into train/validation sets for the presented run, i.e. using less data for training: we allocated 50, 30, 5 and 5 seen classes as validation unseen ones to construct these plots for SUN, CUB, AwA1 and AwA2 respectively, and an equal amount of data was devoted to validation seen data, i.e. we "lost" ≈ 25% of the train data in total. As one can see from these plots, the trick works for those datasets where the gap between GZSL-S and GZSL-U is large and gives no benefit for CUB, where seen and unseen logits are already well-balanced.

Figure 5: Logits variance plots for different models on the SUN dataset. See Appendix E for the experimental details.

Figure 10: Additional CZSL results for CUB dataset

Figure 12: Results of the normality test for class attributes of the real-world datasets. Higher values mean that the distribution is further away from a normal one. For a dataset of truly normal random variables, these values usually lie in the range [0, 5]. As one can see from Figure 12a, the real-world distribution of attributes does not follow a normal one, thus requires extra handling and cannot be easily converted to it.

Figure 13: Distribution of mean absolute correlation values between different attribute dimensions. This figure shows that the attributes are not independent, which is why it would be unreasonable to use such an assumption. If attributes were independent from each other, that would mean, for example, that "having black stripes" is independent from "being orange", which tigers would argue is not a natural assumption to make.

Figure 14: Histogram of standardized attribute values for SUN and AwA2. These figures demonstrate that the distribution is typically long-tailed and skewed, so it is far from being normal.

Effectiveness of Normalize+Scale, Attributes Normalization and Class Normalization. When NS and AN are integrated into a basic ZSL model, its performance is boosted to the level of some sophisticated SotA methods, and additionally using CN allows it to outperform them. ±NS and ±AN denote whether normalize+scale or attributes normalization is used. Bold/normal blue font denotes best/second-best results. Extended results are in Tables 2, 5 and 8.

Generalized Zero-Shot Learning results. S, U denote generalized seen/unseen accuracy and H is their harmonic mean. Bold/normal blue font denotes the best/second-best result.

Continual Zero-Shot Learning results with and without CN. Best scores are in bold blue. SUN contains 717 classes, which we randomly split into 15 tasks; the first 3 tasks have 47 classes and the rest have 48 classes each (717 classes are difficult to separate evenly). We use the official train/test splits for training and testing the model. Model and optimization. We follow the cross-validation procedure proposed by Chaudhry et al. (2019). Namely, for each run we allocate the first 3 tasks for hyperparameter search, validating on the test data. After that, we reinitialize the model from scratch, discard the first 3 tasks and train it on the rest of the data. This reduces the effective number of tasks by 3, but provides a fairer way to perform cross-validation (Chaudhry et al., 2019). We use an ImageNet-pretrained ResNet-18 model as an image encoder E(x), which is optimized jointly with P_θ. For the CZSL experiments, P_θ is a 2-layer MLP and we test the proposed CN procedure. All the details can be found in Appendix G. Results for the proposed metrics mU, mS, mH, mAUC, mJA and the forgetting measure from Lopez-Paz & Ranzato (2017) are reported in Table 3 and Appendix G. As one can observe, class normalization boosts the performance of classical regularization-based and replay-based continual learning methods by up to 100% and leads to less forgetting. However, we are still far behind traditional supervised classifiers, as one can infer from the mJA metric. For example, some state-of-the-art approaches on CUB surpass 90% accuracy (Ge et al., 2019), which is drastically higher than what the considered approaches achieve.

Hyperparameters for ZSL experiments

Additional GZSL ablation studies. From this table one can observe the sensitivity of normalize+scale to the γ value. We also highlight the importance of γ in Section 3.

Incorporating Class Normalization into RelationNet (Sung et al., 2018) and CVC-ZSL (Li et al., 2019), based on the official source code and running hyperparameters. Our reproduced results differ considerably from the reported ones on AwA2 for RelationNet and on SUN for CVC-ZSL. Adding CN provides an improvement in all the setups.

Ablating other methods for AN and NS importance. For CVC-ZSL, we used the officially provided code with the official hyperparameters. When we do not employ AN, we standardize the attributes to zero mean and unit variance: otherwise training diverges due to overly large attribute magnitudes.

Hyperparameters range for CZSL experiments

Checking how a model performs when we replace AN with the standardization procedure, and with the standardization procedure accounting for the 1/d_a factor from (88). In the latter case, the performance is noticeably improved.

2-layer MLP − AN + 1/d_a:  37.1 38.4 37.7 | 50.8 33.3 40.2 | 60.1 66.3 63.1 | 50.5 71.8 59.3
3-layer MLP:               31.4 40.4 35.3 | 45.2 48.4 46.7 | 55.6 73.0 63.1 | 54.5 72.2 62.1
3-layer MLP − AN:          34.7 38.5 36.5 | 46.9 42.8 44.9 | 57.0 69.9 62.8 | 49.7 76.4 60.2
3-layer MLP − AN + 1/d_a:  42.0 33.4 37.2 | 50.4 30.6 38.1 | 57.1 64.7 60.6 | 55.2 69.0 61.4

F.3 EMPIRICAL VALIDATION

To validate the above claim empirically, we approximate the quantity (11), but computed for all the parameters of the model instead of a single layer on each iteration. We do this by taking 10 random batches of size 256 from the dataset, adding N(0, I) noise to each batch, computing the gradient of the loss with respect to the parameters, and then computing its norm scaled by a 1/n factor to account for the small difference (≈ 0.9) in the number of parameters between a ZSL model and a non-ZSL one. This approximates the quantity (11), but instead of approximating it around 0, we approximate it around real data points, since this is more practically relevant. We run the described experiment for three models:

1. A vanilla MLP classifier, i.e. without any class attributes. For each dataset, it receives the feature vector z and produces logits.
2. A vanilla MLP zero-shot classifier, as described in Section 3.
3. An MLP zero-shot classifier with class normalization.

All three models were trained with the cross-entropy loss and the same optimization hyperparameters: a learning rate of 0.0001, a batch size of 256 and 2500 iterations. They had the same number of layers, equal to 3. The results are illustrated in Figures 1 (left) and 9. As one can see, traditional MLP models indeed have a flatter loss surface, as evidenced by a smaller gradient norm, but class normalization helps to reduce the gap.

G CONTINUAL ZERO-SHOT LEARNING DETAILS G.1 CZSL EXPERIMENT DETAILS

As stated, we use the validation-sequence approach from Chaudhry et al. (2019) to find the best hyperparameters for each method. We allocate the first 3 tasks to perform a grid search over a fixed range. After the best hyperparameters have been found, we train the model from scratch on the rest of the tasks. The hyperparameter ranges for the CZSL experiments are presented in Table 9 (we use the same range for all the experiments). We train the model for 5 epochs on each task with the SGD optimizer. We also found it beneficial to decrease the learning rate after each task by a factor of 0.9; this is equivalent to using a step-wise learning rate schedule with the step size equal to the number of epochs per task. As stated, for the CZSL experiments we use an ImageNet-pretrained ResNet-18 model as our image encoder; in contrast with ZSL, we do not keep it fixed during training. The results for our mS, mU, mH, mJA, mAUC metrics, as well as the forgetting measure (Lopez-Paz & Ranzato, 2017), are presented in Figures 10 and 11.

