ON LINEAR IDENTIFIABILITY OF LEARNED REPRESENTATIONS

Abstract

Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to some downstream task. When parameterized as deep neural networks, such representation functions lack identifiability in parameter space, because they are overparameterized by design. In this paper, building on recent advances in nonlinear Independent Components Analysis, we aim to rehabilitate identifiability by showing that a large family of discriminative models are in fact identifiable in function space, up to a linear indeterminacy. Many models for representation learning in a wide variety of domains, including text, images, and audio, are identifiable in this sense; among them are several models that were state-of-the-art at the time of publication. We derive sufficient conditions for linear identifiability and provide empirical support for the result on both simulated and real-world data.

1. INTRODUCTION

An increasingly common methodology in machine learning is to improve performance on a primary downstream task by first learning a high-dimensional representation of the data on a related, proxy task. In this paradigm, training a model reduces to fine-tuning the learned representations for optimal performance on a particular sub-task (Erhan et al., 2010). Deep neural networks (DNNs), as flexible function approximators, have been surprisingly successful in discovering effective high-dimensional representations for use in downstream tasks such as image classification (Sharif Razavian et al., 2014), text generation (Radford et al., 2018; Devlin et al., 2018), and sequential decision making (Oord et al., 2018).

When learning representations for downstream tasks, it would be useful if the representations were reproducible, in the sense that every time a network relearns the representation function on the same data distribution, it arrives at approximately the same function, regardless of small deviations in the initialization of the parameters or the optimization procedure. In some applications, such as learning real-world causal relationships from data, reproducible learned representations are crucial for accurate and robust inference (Johansson et al., 2016; Louizos et al., 2017). A rigorous way to achieve reproducibility is to choose a model whose representation function is identifiable in function space. Informally speaking, identifiability in function space is achieved when, in the limit of infinite data, there exists a single, global optimum in function space. Interestingly, Figure 1 exhibits learned representation functions that appear to be the same up to a linear transformation, even on finite data and optimized without convergence guarantees (see Appendix A.1 for training details). In this paper, we account for Figure 1 by making precise the relationship it exemplifies.
We prove that a large class of discriminative and autoregressive models are identifiable in function space, up to a linear transformation. Our results extend recent advances in the theory of nonlinear Independent Components Analysis (ICA), which have recently provided strong identifiability results for generative models of data (Hyvärinen et al., 2018; Khemakhem et al., 2019; 2020; Sorrenson et al., 2020). Our key contribution is to bridge the gap between these results and discriminative models commonly used for representation learning (e.g., Hénaff et al., 2019; Brown et al., 2020).

Figure 1: Left and Middle: Two learned DNN representation functions f_θ1(B), f_θ2(B) visualized on held-out data B. The DNNs are word embedding models (Mnih and Teh, 2012) trained on the Billion Word Dataset (Chelba et al., 2013) (see Appendix A.1 for code release and training details). Right: A f_θ1(B) and f_θ2(B), where A is a linear transformation learned after training. The overlap exhibits linear identifiability (see Section 3): different representation functions, learned on the same data distribution, live within linear transformations of each other in function space.

The rest of the paper is organized as follows. In Section 2, we describe a general discriminative model family, defined by its canonical mathematical form, which generalizes many supervised, self-supervised, and contrastive learning frameworks. In Section 3, we prove that learned representations in this family have an asymptotic property desirable for representation learning: equality up to a linear transformation. In Section 4, we show that this family includes a number of highly performant models, state-of-the-art at publication for their problem domains, including CPC (Oord et al., 2018), BERT (Devlin et al., 2018), and GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020).
Section 5 investigates the practically realizable regime of finite data and partial optimization, showing that representations learned by members of the identifiable model family approach equality up to a linear transformation as a function of dataset size, neural network capacity, and optimization progress.

2. MODEL FAMILY AND DATA DISTRIBUTION

The learned embeddings of a DNN are a function not only of the parameters, but also of the network architecture and the size of the dataset (viewed as a sample from the underlying data distribution). This renders any analysis in full generality challenging. To make such an analysis tractable, we begin in this section by specifying a set of assumptions about the underlying data distribution and model family that must hold for the learned representations to be similar up to a linear transformation. These assumptions are, in fact, satisfied by a number of already published, highly performant models. We establish definitions in this section, and discuss these existing approaches in depth in Section 4.

Data Distribution

We assume the existence of a generalized dataset in the form of an empirical distribution p_D(x, y, S) over random variables x, y and S with the following properties:

- The random variable x is an input variable, typically high-dimensional, such as text or an image.
- The random variable y is a target variable whose value the model predicts. In the case of object classification, this would be some semantically meaningful class label. However, in our model family, y may also be a high-dimensional context variable, such as text, an image, or a sentence fragment.
- S is a set containing the possible values of y given x, so p_D(y|x, S) > 0 ⟺ y ∈ S.

Note that the set of labels S is not fixed, but a random variable. This allows supervised, contrastive, and self-supervised learning frameworks to be analyzed together: the meaning of S encodes the task. For supervised classification, S is deterministic and contains class labels. For self-supervised pretraining, S contains randomly-sampled high-dimensional variables such as image embeddings. For deep metric learning (Hoffer and Ailon, 2015; Sohn, 2016), the set S contains one positive and k negative samples of the class to which x belongs.

Canonical Discriminative Form

Given a data distribution as above, a generalized discriminative model family may be defined by its parameterization of the probability of a target variable y conditioned on an observed variable x and a set S that contains not only the true target label y, but also a collection of distractors y′:

p_θ(y|x, S) = exp(f_θ(x)^⊤ g_θ(y)) / Σ_{y′∈S} exp(f_θ(x)^⊤ g_θ(y′)).  (1)

The codomain of the functions f_θ(x) and g_θ(y) is R^M, and the domains vary according to the modelling task. For notational convenience both are parameterized by θ ∈ Θ, but f and g may use disjoint parts of θ, meaning that they do not necessarily share parameters. We denote the function spaces of f_θ and g_θ by F and G respectively.
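As a concrete illustration, the canonical form of Equation (1) can be sketched in a few lines of numpy. The maps f_theta and g_theta below are hypothetical stand-ins (random tanh features), not the trained networks discussed later; any pair of functions into R^M fits the form.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4  # shared representation dimension

# Hypothetical stand-ins for f_theta and g_theta: any maps into R^M work.
W_f = rng.normal(size=(M, 8))
W_g = rng.normal(size=(M, 8))

def f_theta(x):
    return np.tanh(W_f @ x)

def g_theta(y):
    return np.tanh(W_g @ y)

def p_theta(x, S):
    """Vector of p(y | x, S) for each y in S, per the canonical form:
    exp(f(x)^T g(y)) normalized over the candidate set S."""
    logits = np.array([f_theta(x) @ g_theta(y) for y in S])
    logits -= logits.max()  # numerical stability
    e = np.exp(logits)
    return e / e.sum()

x = rng.normal(size=8)
S = [rng.normal(size=8) for _ in range(5)]  # candidate targets, |S| = 5
probs = p_theta(x, S)
print(round(probs.sum(), 6))  # a valid distribution over S
```

The softmax structure is the only requirement for family membership; the particular parameterizations of f and g are free choices.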
Our primary domain of interest is when f_θ and g_θ are highly flexible function approximators, such as DNNs. This brings certain analytical challenges. In neural networks, different choices of parameters θ can result in the same functions f_θ and g_θ, hence the map Θ → F × G is many-to-one. In the context of representation learning, the function f_θ is typically viewed as a nonlinear feature extractor, i.e., the learned representation of the input data. While other choices meet the membership conditions for the family defined by the canonical form of Equation (1), in the remainder we will focus on DNNs. We next present a definition of identifiability suitable for DNNs, and prove that members of the above family satisfy it under additional assumptions.

3. MODEL IDENTIFIABILITY

In this section, we derive identifiability conditions for models in the family defined in Section 2.

3.1. IDENTIFIABILITY IN PARAMETER SPACE

Identifiability analysis answers the question of whether it is theoretically possible to learn the parameters of a statistical model exactly. Specifically, given some estimator θ′ for model parameters θ*, identifiability is the property that, for any {θ′, θ*} ⊂ Θ,

p_θ′ = p_θ* ⟹ θ′ = θ*.  (2)

Models that do not have this property are said to be non-identifiable. This happens when different values {θ′, θ*} ⊂ Θ can give rise to the same model distribution p_θ′(y|x, S) = p_θ*(y|x, S). In such a case, observing an empirical distribution p_θ*(y|x, S) and fitting a model p_θ′(y|x, S) to it perfectly does not guarantee that θ′ = θ*. Neural networks exhibit various symmetries in parameter space such that there is almost always a many-to-one correspondence between a choice of θ and the resulting probability function p_θ. A simple example in neural networks is that one can swap the (incoming and outgoing) connections of two neurons in a hidden layer. This changes the value of the parameters, but does not change the network's function. Thus, when the representation functions f_θ or g_θ are parameterized as DNNs, Equation (2) is not satisfiable.
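A minimal sketch of the permutation symmetry mentioned above: swapping two hidden units (rows of the first weight matrix and the matching columns of the second) changes θ but not the network's function. The tiny one-hidden-layer network here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))   # input -> hidden weights
b1 = rng.normal(size=5)        # hidden biases
W2 = rng.normal(size=(2, 5))   # hidden -> output weights

def net(x, W1, b1, W2):
    return W2 @ np.tanh(W1 @ x + b1)

# Swap hidden units 0 and 1: permute rows of W1/b1 and columns of W2.
perm = np.array([1, 0, 2, 3, 4])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(net(x, W1, b1, W2), net(x, W1p, b1p, W2p)))  # True
```

The two parameter vectors differ, yet the functions coincide on every input, so Equation (2) cannot hold in parameter space.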

3.2. IDENTIFIABILITY IN FUNCTION SPACE

For reliable and efficient representation learning, we want learned representations f_θ from two identifiable models to be sufficiently similar for interchangeable use in downstream tasks. The most general property we wish to preserve among learned representations is their ability to discriminate among statistical patterns corresponding to categorical groupings. In the model family defined in Section 2, the data and context functions f_θ and g_θ parameterize p_θ(y|x, S), the probability of label assignment, through a normalized inner product. This induces a hyperplane boundary, for discrimination, in a joint space of learned representations for data x and context y. Therefore, in the following, we derive identifiability conditions up to a linear transformation, using a notion of similarity in parameter space inspired by Hyvärinen et al. (2018).

Definition 1. Let ∼L be a pairwise relation on Θ defined as:

θ′ ∼L θ* ⟺ f_θ′(x) = A f_θ*(x) and g_θ′(y) = B g_θ*(y),

where A and B are invertible M × M matrices. See Appendix B for a proof that ∼L is an equivalence relation. In the remainder, we refer to identifiability up to the equivalence relation ∼L as ∼L-identifiable or linearly identifiable.

3.3. LINEAR IDENTIFIABILITY OF LEARNED REPRESENTATIONS

We next present a simple derivation of the ∼L-identifiability of members of the generalized discriminative family defined in Section 2. This result reveals sufficient conditions under which a discriminative probabilistic model p_θ(y|x, S) has a useful property: the learned representations of the input x and target random variables y for any pair of parameters (θ′, θ*) are related as θ′ ∼L θ*, that is, f_θ′(x) = A f_θ*(x) and g_θ′(y) = B g_θ*(y). We first review the notation for the proof, which is introduced in detail in Section 2. We then highlight an important requirement on the diversity of the data distribution, which must be satisfied for the proof statement to hold. We prove the result immediately after.

Notation. The target random variables y, associated with input random variables x, may be class labels (as in supervised classification), or they could be stochastically generated from datapoints x as, e.g., perturbed image patches (as in self-supervised learning). We account for this additional stochasticity with a set-valued random variable S, containing all possible values of y conditioned on some x. For brevity, we will use shorthands that drop the parameters θ: p′ := p_θ′, p* := p_θ*, f′ := f_θ′, f* := f_θ*, g′ := g_θ′, g* := g_θ*.

Diversity condition. We assume that for any (θ′, θ*) for which p′ = p*, and for any given x, by repeated sampling S ∼ p_D(S|x) and picking y_A, y_B ∈ S, we can construct a set of M distinct tuples {(y_A^(i), y_B^(i))}, i = 1, ..., M, such that the matrices L′ and L* are invertible, where L′ consists of columns (g′(y_A^(i)) − g′(y_B^(i))) and L* consists of columns (g*(y_A^(i)) − g*(y_B^(i))), i ∈ {1, ..., M}. See Section 3.4 for detailed discussion.

Theorem 1. Under the diversity condition, models in the family defined by Equation (1) are linearly identifiable. That is, for any θ′, θ* ∈ Θ, with f′, f*, g′, g*, p′, p* defined as in Section 2,

p′ = p* ⟹ θ′ ∼L θ*.
To establish the result, we proceed by directly constructing an invertible linear transformation that satisfies Definition 1. Consider y_A, y_B ∈ S. Since p′ = p*, the likelihood ratios for these points are equal:

p′(y_A|x, S) / p′(y_B|x, S) = p*(y_A|x, S) / p*(y_B|x, S).

Substituting our model definition from Equation (1), we find:

exp(f′(x)^⊤ g′(y_A)) / exp(f′(x)^⊤ g′(y_B)) = exp(f*(x)^⊤ g*(y_A)) / exp(f*(x)^⊤ g*(y_B)),

where the normalizing constants cancel on the left- and right-hand sides. Taking the logarithm, this simplifies to:

(g′(y_A) − g′(y_B))^⊤ f′(x) = (g*(y_A) − g*(y_B))^⊤ f*(x).

Note that this equation holds for any triple (x, y_A, y_B) for which p_D(x, y_A, y_B) > 0. We next collect M distinct tuples (y_A^(i), y_B^(i)) so that, by repeating the equation above M times and invoking the diversity condition, the resulting difference vectors are linearly independent. We collect these vectors together as the columns of the (M × M)-dimensional matrices L′ and L*, forming the following system of M linear equations:

L′^⊤ f′(x) = L*^⊤ f*(x).

Since L′ and L* are invertible, we rearrange:

f′(x) = (L* L′^{-1})^⊤ f*(x).

Hence, f′(x) = A f*(x) where A = (L* L′^{-1})^⊤. This completes the first half of the proof. See Appendix C for the second half, which is similar and handles the function g.
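The proof shows why identifiability can hold at most up to a linear map. A numeric check (a sketch, with random vectors standing in for f*(x) and the columns of g*(y) over S) confirms the converse direction: transforming the data representation by any invertible A, paired with A^{-T} on the context side, leaves the model distribution of Equation (1) unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 3

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

f_star = rng.normal(size=M)           # f*(x) at some fixed x
G_star = rng.normal(size=(M, 6))      # columns: g*(y) for 6 candidates in S

A = rng.normal(size=(M, M))           # a generic invertible matrix
f_prime = A @ f_star                  # f' = A f*
G_prime = np.linalg.inv(A).T @ G_star # g' = A^{-T} g* preserves inner products

p_star = softmax(f_star @ G_star)
p_prime = softmax(f_prime @ G_prime)
print(np.allclose(p_star, p_prime))   # True: distinct representations, same model
```

Because (A f*)^⊤ (A^{-⊤} g*) = f*^⊤ g*, the two parameterizations induce identical conditional probabilities, which is exactly the ∼L indeterminacy of Theorem 1.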

3.4. DISCUSSION: WHEN DOES THE DIVERSITY CONDITION HOLD?

Theorem 1 is a constructive proof of existence that exhibits invertible (M × M) matrices L′ and L*. We require the diversity condition to hold in order to guarantee invertibility. Such a requirement is similar to the conditions in earlier work on nonlinear ICA such as Hyvärinen et al. (2018), as discussed in Section 6. Informally, it means that there needs to be a sufficient number of possible values y ∈ S. In the case of supervised classification with K classes, S is fixed and of size K. Then, we need K ≥ M + 1 in order to generate M difference vectors g_θ(y^(1)) − g_θ(y^(j)), j = 2, ..., M + 1. In the case of self-supervised or deep metric learning, where S and y may be algorithmically generated from x, this requirement is easy to satisfy, as there will typically be a diversity of values of y. The same holds for language models with large vocabularies. However, for supervised classification with a small number of classes, this requirement on the size of S may be restrictive, as we discuss further in Section 4.

Note that by placing the diversity requirement on the number of classes K, we implicitly assumed that the context representation function g_θ has the following property: the M difference vectors span the range of g_θ. This is a mild assumption in the context of DNNs: for random initialization and iterative weight updates, this property follows from the stochasticity of the distribution used to initialize the network. Briefly, a set of M + 1 unique points y^(j) such that the M vectors g_θ(y^(1)) − g_θ(y^(j)), j = 2, ..., M + 1, are not linearly independent has measure zero. For other choices of g_θ, care must be taken to ensure this condition is satisfied.

What can be said when L′ and L* are ill-conditioned, that is, when the ratio between the maximum and minimum singular values, σ_max(L)/σ_min(L) (dropping superscripts when a statement applies to both), is large?
In the context of a data representation matrix such as L, this implies that there exists at least one column ℓ_j of L and constants λ_k for k ≠ j such that ‖ℓ_j − Σ_{k≠j} λ_k ℓ_k‖_2 < ε for small ε. In other words, some column is nearly a linear combination of the others. This implies, in turn, that there exists some tuple (y_A^(j), y_B^(j)) such that the resulting difference vector ℓ_j = g_θ(y_A^(j)) − g_θ(y_B^(j)) can nearly (in the sense above) be written as a linear combination of the other columns. Such near-singularity is in this case a function of the choice of samples y that yield the difference vectors. The issue could be handled by resampling different data points until the condition number of the matrices is satisfactory; this amounts to strengthening the diversity condition. We leave more detailed analysis to future work, as the result will depend on the choice of architectures for f and g.
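The resampling remedy sketched above can be made concrete with a condition-number check. Here g_theta is a hypothetical random tanh map standing in for a trained context network, and the threshold and retry budget are illustrative, not tuned.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4
W = rng.normal(size=(M, M))

def g_theta(y):
    # Hypothetical context map into R^M, standing in for a trained network.
    return np.tanh(W @ y)

def sample_well_conditioned_L(max_cond=100.0, tries=50):
    """Resample tuples (y_A, y_B) until L = [g(y_A^(i)) - g(y_B^(i))]_i
    is invertible with an acceptable condition number."""
    for _ in range(tries):
        cols = [g_theta(rng.normal(size=M)) - g_theta(rng.normal(size=M))
                for _ in range(M)]
        L = np.stack(cols, axis=1)  # difference vectors as columns
        if np.linalg.cond(L) < max_cond:
            return L
    raise RuntimeError("could not satisfy the (strengthened) diversity condition")

L = sample_well_conditioned_L()
print(np.linalg.cond(L) < 100.0)  # True: a well-conditioned L was found
```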

4. EXAMPLES OF LINEARLY IDENTIFIABLE MODELS

The form of Equation (1) is already used as a general approach for a variety of machine learning problems. We present a non-exhaustive sample of such publications, chosen to exhibit the range of applications. Many of these approaches were state-of-the-art at the time of their release: Contrastive Predictive Coding (Hénaff et al., 2019), BERT (Devlin et al., 2018), GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020), XLNet (Yang et al., 2019), and the triplet loss for deep metric learning (Sohn, 2016). In this section, we discuss how to interpret the functional components of these frameworks with respect to the generalized data distribution of Section 2 and the canonical parameterization of Equation (1). See Appendix D for reductions to the canonical form of Equation (1).

Supervised Classification. Although the scope of this paper is identifiable representation learning, under certain conditions, standard supervised classifiers can learn identifiable representations as well. In this case, the number of classes must be strictly greater than the feature dimension, as noted in Section 3.4. We simulate such a model in Section 5.1 to show evidence of its linear identifiability. We stress that representation learning as pretraining for classification is a way to ensure that the conditions on label diversity are met, rather than relying on the supervised classifier itself to generate identifiable representations. This paradigm is discussed in the next subsection. Representations learned during supervised classification can be linearly identifiable under the following model specification. The input random variables x represent some data domain to be classified, such as images or word embeddings. The target variables y represent label assignments for x, typically semantically meaningful. These are often encoded as the standard basis vectors e_y, a "one-hot encoding." The set S contains all K possible values of y.
In this case, notice that S is not stochastic: the empirical distribution p_D(S|x) is modelled as a Dirac measure with all probability mass on the set S = {0, ..., K − 1} (using integers, here, to represent distinct labels). The representation function f_θ(x) of a classifier is often implemented as a DNN that maps from the input layer to the layer just prior to the model logits. The context map g_θ(y) is given by the weights in the final, linear projection layer, which outputs unnormalized logits. Concretely, g_θ(y) = W e_y, where W ∈ R^{M×K} is a learnable weight matrix. In order to satisfy the diversity condition, the number of classes K must be strictly greater than the dimension M of the learned representation, that is, |S| = K ≥ M + 1. Finally, the output of the final, linear projection layer is normalized through a softmax function, yielding the parameterization of Equation (1).

Self-Supervised Pretraining for Image Classification. Self-supervised learning is a framework that first pretrains a DNN before deploying it on some other, related task. The pretraining task often takes the form of Equation (1) and meets the sufficient conditions to be linearly identifiable. A paradigmatic example is Contrastive Predictive Coding (CPC) (Oord et al., 2018). CPC is a general pretraining framework, but for the sake of clarity we focus here on its use in image models. CPC as applied to images involves: (1) preprocessing an image into augmented patches, (2) assigning labels according to which image the patch came from, and then (3) predicting the representations of patches at one position in the image (e.g., below a certain level) from patches at other positions (Oord et al., 2018). The context function of CPC, g_θ(y), encodes a particular position in the sequence of patches, and the representation function, f_θ(x), is an autoregressive function of the previous k patches, according to some predefined patch ordering.
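Returning to the supervised-classification reduction above, the decomposition into f_θ (penultimate-layer features) and g_θ (columns of the final projection layer) can be sketched as follows; the tiny network and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 2, 18  # representation dimension and number of classes; K >= M + 1

W_hidden = rng.normal(size=(M, 2))
W_out = rng.normal(size=(M, K))  # final projection layer; column y is g(y) = W e_y

def f_theta(x):
    """Penultimate-layer features of the classifier."""
    return np.tanh(W_hidden @ x)

def g_theta(y):
    """Context map: the output-layer weight column for class y."""
    return W_out[:, y]

def class_probs(x):
    # Softmax over logits f(x)^T g(y), matching Equation (1) with fixed S.
    logits = np.array([f_theta(x) @ g_theta(y) for y in range(K)])
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = class_probs(rng.normal(size=2))
print(p.shape, round(p.sum(), 6))  # a distribution over the K classes
```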
Given some x, the collection of all patches from the sequence, drawn from a given minibatch of images, forms the set S ∼ p_D(S|x), where the randomness enters via the patch preprocessing algorithm. Since the preprocessing phase is part of the algorithm design, it is straightforward to make it sufficiently diverse (enough transformations of enough patches) to meet the requirements for the model to be linearly identifiable.

Multi-task Pretraining for Natural Language Generation. Autoregressive language models, such as (Mikolov et al., 2010; Dai and Le, 2015) and more recently GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020), are typically also instances of the model family of Equation (1). Data points x are the past tokens, f_θ(x) is a nonlinear representation of the past estimated by either an LSTM (Hochreiter and Schmidhuber, 1997) or an autoregressive Transformer model (Vaswani et al., 2017), y is the next token, and w_i = g_θ(y = i) is a learned representation of the next token, often implemented as a simple look-up table, as in supervised classification.

BERT (Devlin et al., 2018) is also a member of the linearly identifiable family. This model pretrains word embeddings through a denoising autoencoder-like (Vincent et al., 2008) architecture. For a given sequence of tokenized text, some fixed percentage of the symbols are extracted and set aside, and their original values are set to a special null symbol, "corrupting" the original sequence. The pretraining task in BERT is to learn a continuous representation of the extracted symbols conditioned on the remainder of the text. A Transformer (Vaswani et al., 2017) function approximator is used to map from the corrupted sequence into a continuous space. The Transformer network is the f_θ(x) function of Equation (1). The context map g_θ(y) is a lookup into the learned basis vector for each token.
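The contrastive objectives above, where S is formed from a minibatch of candidates, share a common shape. The sketch below is an InfoNCE-style loss written in plain numpy; it is an illustration of the family, not any particular published implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
M, batch = 8, 16

def info_nce_loss(F, G):
    """Contrastive loss over a batch: row i of F should match row i of G.
    F: (batch, M) representations f(x_i); G: (batch, M) candidates g(y_j).
    The batch of candidates plays the role of the set S."""
    logits = F @ G.T  # (batch, batch) inner products, as in Equation (1)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # NLL of the true (diagonal) pairs

F = rng.normal(size=(batch, M))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit-norm rows
loss_random = info_nce_loss(F, rng.normal(size=(batch, M)))
loss_aligned = info_nce_loss(F, 10.0 * F)      # matched pairs score highest
print(loss_aligned < loss_random)  # True: aligned representations get lower loss
```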

5. EXPERIMENTS

The derivation in Section 3 shows that, for models in the general discriminative family defined in Section 2, the functions f_θ and g_θ are identifiable up to a linear transformation given unbounded data and assuming model convergence. The question remains as to how closely a model trained on finite data and without convergence guarantees will approach this limit. One subtle issue is that poor architecture choices (such as too few hidden units, or inadequate inductive priors) or insufficient data samples during training can interfere with model estimation, and thereby with the linear identifiability of the learned representations, due to underfitting. In this section, we study this issue over a range of models, from low-dimensional language embedding and supervised classification (Figures 1 and 2, respectively) to GPT-2 (Radford et al., 2019), an approximately 1.5 × 10^9-parameter generative model of natural language (Figure 4). See Appendix A and the code release for details needed to reproduce our experiments.

Through these experiments, we show that (1) in the small-dimensional, large-data regime, linearly identifiable models yield learned representations that lie approximately within a linear transformation of each other (Figures 1 and 2), as predicted by Theorem 1; and (2) in the high-dimensional, large-data regime, linearly identifiable models yield learned representations that exhibit a strong trend towards linear identifiability. The learned representations approach a linear transformation of each other monotonically, as a function of dataset sample size, neural network capacity (number of hidden units), and optimization progress. In the case of GPT-2, which has benefited from substantial tuning by engineers to improve model estimation, we find strong evidence of linear identifiability.

Measuring linear similarity between learned representations. How can we measure whether pairs of learned representations live within a linear transformation of each other in function space?
We adapt Canonical Correlation Analysis (CCA) (Hotelling, 1936) for this purpose, which finds the optimal linear transformations that maximize correlation between two random vectors. On a randomly selected held-out subset B ⊂ D of the training data, we compute f_θ1(B) and f_θ2(B) for two models with parameters θ_1 and θ_2, respectively. Assume without loss of generality that f_θ1(B) and f_θ2(B) are centered. CCA finds the optimal linear transformations C and D such that the pairwise correlations ρ_i between the i-th columns of C^⊤ f_θ1(B) and D^⊤ f_θ2(B) are maximized. We collect these correlations together in a vector ρ. If, after linear transformation, the two matrices are aligned, the mean of ρ will be 1; if they are instead uncorrelated, the mean of ρ will be 0. We use the mean of ρ as a proxy for the existence of a linear transformation between f_θ1(B) and f_θ2(B).

For DNNs, it is a well-known phenomenon that most of the variability in a learned representation tends to concentrate in a low-dimensional subspace, leaving many noisy, random dimensions (Morcos et al., 2018). Such random noise can result in spuriously high correlations in CCA. A solution to this problem is to apply Principal Components Analysis (PCA) (Pearson, 1901) to each of the two matrices f_θ1(B) and f_θ2(B), projecting onto their top-k principal components, before applying CCA. This technique is known as SVCCA (Raghu et al., 2017).

We report first on a simulation study of linearly identifiable K-way classification, where all assumptions and sufficient conditions of Theorem 1 are guaranteed to be met. We generated a synthetic data distribution with the properties required by Section 2, and chose DNNs with sufficient capacity to learn a specified nonlinear relationship between inputs x and targets y. In short, the data distribution p_D(x, y, S) consists of inputs x sampled from a 2-D Gaussian with σ = 3.
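The SVCCA measurement described above (center, optionally project onto the top-k principal components, then CCA) can be sketched in plain numpy; this is an illustrative reimplementation, not the released experimental code. The canonical correlations are computed as the singular values of the product of the whitened representations.

```python
import numpy as np

def mean_cca(X, Y, k=None):
    """Mean canonical correlation between representations X, Y of shape (n, d).
    If k is given, first project each matrix onto its top-k principal
    components (the SVCCA variant)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    if k is not None:
        X = X @ np.linalg.svd(X, full_matrices=False)[2][:k].T
        Y = Y @ np.linalg.svd(Y, full_matrices=False)[2][:k].T

    def whiten(Z):
        # Z = U s V^T; the orthonormal columns of U span Z's column space.
        return np.linalg.svd(Z, full_matrices=False)[0]

    # Singular values of U_X^T U_Y are the canonical correlations.
    rho = np.linalg.svd(whiten(X).T @ whiten(Y), compute_uv=False)
    return rho.mean()

rng = np.random.default_rng(6)
F1 = rng.normal(size=(500, 4))   # stand-in for f_theta1(B)
A = rng.normal(size=(4, 4))
F2 = F1 @ A                      # a linear transformation of the same features
print(round(mean_cca(F1, F2), 4))  # ~1.0: perfectly linearly related
```

When the two representations are exact linear transformations of each other, the mean correlation is 1; for unrelated random features it falls toward 0.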
The targets y were assigned among K = 18 classes according to their radial position (the angle swept out by a ray fixed at the origin). The number of classes K was chosen to ensure K ≥ dim[f_θ(x)] + 1, the diversity condition. See Appendix D.1 for more details.
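A sketch of the synthetic distribution described above, with inputs drawn from a 2-D Gaussian (σ = 3) and K = 18 angular class labels; the sample size here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_radial_data(n=1000, K=18, sigma=3.0):
    """2-D Gaussian inputs; labels assigned by angular sector (K classes)."""
    x = rng.normal(scale=sigma, size=(n, 2))
    angles = np.arctan2(x[:, 1], x[:, 0])              # in (-pi, pi]
    y = ((angles + np.pi) / (2 * np.pi) * K).astype(int) % K
    return x, y

x, y = make_radial_data()
print(x.shape, len(set(y.tolist())))  # (1000, 2) 18
```

Because K = 18 exceeds the 2-D representation dimension plus one, the diversity condition of Theorem 1 is satisfiable by construction.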

5.1. SIMULATION STUDY: CLASSIFICATION BY DNNS

To evaluate linear similarity, we trained two randomly initialized models of p_D(y|x, S). The plots show f_θ(x), the data representation function, on random x. Figure 2b shows that the mean CCA correlation increases to its maximum value over training, demonstrating that the feature spaces converge to the same solution up to a linear transformation, modulo model estimation noise. Similarly, Figure 2c shows that the learned representations exhibit a strongly linear relationship.

Figure 4: Text embeddings by GPT-2. Representations of the last hidden layer (which is identifiable), in addition to three earlier layers (not necessarily identifiable), for four GPT-2 models. For each representation layer, SVCCA is computed over all pairs of models, and the correlation coefficients are averaged over the pairs. SVCCA was applied with 16, 64, 256 and 768 principal components. The learned representations in the last, identifiable layer are more correlated than representations learned in preceding layers.

5.2. SELF-SUPERVISED LEARNING ON CIFAR-10

We next investigate high-dimensional, self-supervised representation learning on CIFAR-10 (Krizhevsky et al., 2009) using CPC (Oord et al., 2018; Hénaff et al., 2019). For a given input image, this model predicts the identity of a bottom image patch representation given a top patch representation (Figure 3a). Here, S comprises the true patch together with a set of distractor patches from across the current minibatch. For each model we define both f_θ and g_θ as a 3-layer MLP with 256 units per layer (except where noted otherwise) and fix the output dimensionality at 64. In Figure 3b, CCA coefficients are plotted over the course of training. As training progresses, alignment between the learned representations increases. In Figure 3c, we artificially limited the size of the dataset, and plot the mean correlation after training to convergence. This shows that increasing availability of data correlates with closer alignment.
In Figure 3d , we fix dataset size and artificially limit the model capacity (number hidden units) to investigate the effect of model size on the learned representations, varying the number of hidden units from 64 to 8192. This show that increasing model capacity correlates with increase in alignment of learned representations.

5.3. GPT-2

Finally, we report on a study of GPT-2 (Radford et al., 2019), a massive-scale language model. The identifiable representation is the set of features just before the last linear layer of the model. We use pretrained models from HuggingFace (Wolf et al., 2019). HuggingFace provides four different versions of GPT-2: gpt2, gpt2-medium, gpt2-large and gpt2-xl, which differ mainly in the hyperparameters that determine the width and depth of the neural network layers. For approximately 2000 input sentences, per timestep, for each model, we extracted representations at the last layer (which is identifiable) in addition to the representations per timestep given by three earlier layers in the model. Then, we performed SVCCA on each possible pair of models, on each of the four representations. SVCCA was performed with 16, 64, 256 and 768 principal components, computed by applying SVD separately to each representation of each model. We chose 768 as the largest number of principal components, since that is the representation size of the smallest model in the repository (gpt2). We then averaged the CCA correlation coefficients across the pairs of models. Figure 4 shows the results. The results align well with our theory, namely that the representations at the last layer are more linearly related than the representations at other layers of the model.

5.4. INTERPRETATION AND SUMMARY

Theorem 1 establishes linear identifiability as an asymptotic property of a model that holds in the limit of infinite data and exact estimation. The experiments of this section have shown that for linearly identifiable models, when the dimensionality is small relative to the dataset size (Figures 1 and 2), the learned embeddings are closely linearly related, up to noise. Problems of model estimation and sufficient dataset size are more pronounced in high dimensions. Nevertheless, in GPT-2, representations among different trained models do in fact approach a mean correlation coefficient of 1.0 after training (Figure 4, blue line), providing strong evidence of linear identifiability.

6. RELATED WORKS

Prior to Hyvärinen and Morioka (2016), identifiability analysis was uncommon in deep learning. We build on advances in the theory of nonlinear ICA (Hyvärinen and Morioka, 2016; Hyvärinen et al., 2018; Khemakhem et al., 2019). In this section, we carefully distinguish our results from prior and concurrent work. Our diversity assumption is similar to diversity assumptions in these earlier works, while differing on certain conditions. The main difference is that their results apply to related but distinct families of models compared to the general discriminative family outlined in this paper. Arguably most related is Theorem 3 of Hyvärinen et al. (2018) and its proof, which shows that a class of contrastive discriminative models will estimate, up to an affine transformation, the true latent variables of a nonlinear ICA model. The main difference with our result is that they additionally assume: (1) that the mapping between observed variables and latent representations is invertible; and (2) that the discriminative model is binary logistic regression with universal approximation capability (Hornik et al., 1989), estimated with a contrastive objective. In addition, Hyvärinen et al. (2018) do not present conditions for affine identifiability for their version of the context representation function g. It should be noted that Theorem 1 in Hyvärinen et al. (2018) provides a potential avenue for further generalization of our Theorem 1 to discriminative models with nonlinear interaction between f and g. Concurrent work (Khemakhem et al., 2020) has expanded the theory of identifiable nonlinear ICA to a class of conditional energy-based models (EBMs) with universal density approximation capability, thereby imposing milder assumptions than previous nonlinear ICA results. Their version of affine identifiability is similar to our result of linear identifiability in Section 3.2. The main difference is that Khemakhem et al. (2020) focus in both theory and experiment on EBMs.
This allows for alternative versions of the diversity condition, assuming that the Jacobians of their versions of f or g are full rank. This is only possible if x or y are assumed to be continuous-valued; note that we do not make such an assumption. Khemakhem et al. (2020) also present an architecture for which the conditions provably hold, in addition to sufficient conditions for identifiability up to element-wise scaling, which we did not explore in this work. While we build on these earlier results, we are, to the best of our knowledge, the first to apply identifiability analysis to state-of-the-art discriminative and autoregressive generative models.

7. CONCLUSION

We have shown that representations learned by a large family of discriminative models are identifiable up to a linear transformation, providing a novel perspective on representation learning using DNNs. Since identifiability is a property of a model class, and identification is realized only in the asymptotic limit of data and compute, we perform experiments in the more realistic setting of finite datasets and finite compute. Our empirical results show that as the representational capacity of the model and the dataset size increase, learned representations indeed tend towards solutions that are equal up to a linear transformation.

A REPRODUCING EXPERIMENTS AND FIGURES

In this section, we present training and optimization details needed to reproduce our empirical validation of Theorem 1. We also published notebooks and check-pointed weights for two crucial experiments that investigate the result in the small- and massive-scale regimes, for Figure 1 and GPT-2 (ANONYMIZED).

A.1 FIGURE 1

We provide a Jupyter notebook and model checkpoints for reproducing Figure 1; please refer to these for hyperparameter settings. In short, we implemented a model (Mnih and Teh, 2012) in the family of Section 2 and trained it on the Billion Word dataset (Chelba et al., 2013). This is illustrative of the property of Theorem 1 because the relatively modest size of the parameter space (see notebook) and the massive dataset minimize restrictions due to model convergence and data availability, i.e., the experiment approaches the asymptotic regime. The word embedding space is 2-D for ease of visualization. We randomly selected a subset of words, mapped them to their learned embeddings, and visualized them as points in the left and middle panes. We then regressed pane one onto pane two in order to learn the best linear transformation between them. Note that if the two are linear transformations of each other, regression will recover that transformation exactly.
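The regression step can be sketched in a few lines of numpy. The embedding tables and the ground-truth map below are synthetic stand-ins (not the trained model); they only illustrate that least squares recovers an exact linear relationship between two embedding spaces when one exists:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two learned 2-D embedding tables (in the
# experiment these come from two independently trained models).
n_words, dim = 500, 2
emb_a = rng.normal(size=(n_words, dim))           # embeddings from model 1
true_map = np.array([[0.5, -1.2], [2.0, 0.3]])    # some invertible matrix
emb_b = emb_a @ true_map.T                        # model 2 = linear map of model 1

# Regress one embedding table onto the other, as done for Figure 1.
est_map, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)

# If the two tables really are related linearly, regression recovers the map.
residual = np.abs(emb_a @ est_map - emb_b).max()
print(residual)  # ~0, up to floating point error
```

With representations from real trained models, the residual is nonzero but small, which is exactly the quantity visualized in the right pane of Figure 1.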

A.2 SIMULATION STUDY: CLASSIFICATION BY DNNS

For this experiment, we want to ensure that the chosen model can fit the data distribution exactly. Controlling this removes one possible factor that could prevent linear identifiability of learned representations despite the model formally having that property. We do this by making sure that the process that generates the dataset matches the model chosen to learn the relationships between inputs and labels, via the following procedure. We first randomly assign initialization labels based on angular position, then fit two neural networks f_θ and g_θ to predict the final labels, using the discriminative model of Equation (1) and Appendix D.1. Both f_θ and g_θ are 4-hidden-layer MLPs with two 64-unit layers and one 2-D bottleneck layer. After training these representation functions to convergence, we generated a new batch of points x and used the trained networks to predict the ground-truth labels y. Finally, to conduct experiments, we chose f_θ′ and g_θ′ to be the same architectures as f_θ and g_θ. This ensures that the supervised classifier learned using the function approximators f_θ′ and g_θ′ is able to capture the true data-generating process, i.e., it cannot fail due to too few hidden units or too complex a relationship between targets and inputs. Remaining training details are as follows. We optimize weights using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3 · 10⁻⁴ for 5 · 10⁴ iterations. To make the classification problem more challenging, we additionally add 20 input dimensions of random noise to the data.
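A minimal numpy sketch of the discriminative family of Equation (1) as used here: two small MLPs f_θ and g_θ are scored against each other and normalized over the label set S. The networks below are smaller than the 4-hidden-layer MLPs described above and their weights are random stand-ins rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(params, x):
    # Tiny MLP: tanh hidden layers, linear output layer.
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def init(sizes, rng):
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

# Hypothetical dimensions: 22-D inputs (2 informative + 20 noise dims),
# 8 classes, and a 2-D representation bottleneck.
f_params = init([22, 64, 64, 2], rng)   # data representation f_theta
g_params = init([8, 64, 64, 2], rng)    # context representation g_theta

x = rng.normal(size=(5, 22))            # batch of inputs
labels = np.eye(8)                      # one-hot class "contexts" y
logits = mlp(f_params, x) @ mlp(g_params, labels).T   # f(x)^T g(y)

# Model of Equation (1): softmax normalization over the label set S.
log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(np.exp(log_p).sum(axis=1))  # each row sums to 1
```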

A.3 SELF-SUPERVISED LEARNING FOR IMAGE CLASSIFICATION

To compute linear similarity between representations, we train two independent models in parallel. For each model we define both f_θ and g_θ as a 3-layer fully connected neural network with 2⁸ units per layer and a fixed output dimensionality of 2⁶. We define our model following Equation (1), where S is the set of the other image patches from the current minibatch, and optimize the objective of Hénaff et al. (2019). We augment both sampled patches independently with randomized brightness, saturation, hue, and contrast adjustments, following the recipe of Hénaff et al. (2019). We train on the CIFAR10 dataset (Krizhevsky et al., 2009) with batch size 2⁸, using the Adam optimizer with a learning rate of 10⁻⁴ and the JAX (Bradbury et al., 2018) software package. For each model, we early stop when the validation loss fails to improve further. Additional details about the experiments that generated Figure 3 are listed below, alongside the descriptions of its panels.

A.4 GPT-2

We include all details in a notebook in the code release. Pretrained GPT-2 weights as specified in the main text are publicly available from HuggingFace (Wolf et al., 2019).
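As a concrete reference for the contrastive objective used in the A.3 experiments, the following minimal numpy sketch scores each top-patch encoding against all bottom-patch encodings in the minibatch; the encoder outputs are random stand-ins, not the JAX training code:

```python
import numpy as np

rng = np.random.default_rng(2)

def info_nce(f_top, g_bottom):
    """Contrastive loss: for each top-patch encoding, the positive is the
    bottom-patch encoding at the same batch index; the other bottom patches
    in the minibatch serve as the set S of negatives."""
    logits = f_top @ g_bottom.T                      # (B, B) scores f(x)^T g(y)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # NLL of the positives

# Hypothetical encoder outputs: batch of 2^8 patches, 2^6-D representations.
B, D = 256, 64
f_top = rng.normal(scale=0.1, size=(B, D))
g_bottom = rng.normal(scale=0.1, size=(B, D))

loss = info_nce(f_top, g_bottom)
print(loss)  # near log(256) ~= 5.55 for near-random encodings
```

For untrained (near-random) encoders the loss sits near log of the batch size; training drives it down by aligning f and g on matching patches.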

A.5 REMARK ON EFFECT OF INITIALIZATION AND HYPERPARAMETERS OF MODELS

One question that may be of interest is whether initialization affects whether learned representations will be within a linear transformation of each other. This depends on whether the optimization routines (like Adam, AdaGrad, etc.) are robust to wider initialization within a certain range. If so, model convergence will be unaffected. However, this cannot make up for poor initialization or poor optimization: just as in any deep neural network, a poor initialization and an inadequate optimizer will interfere with learning the model parameters. In the case of a linearly identifiable model, this means that the learned representations would not lie within a linear transformation of each other (up to noise from model fitting), since the models have failed to converge to a reasonable solution for the task at hand. When the hyperparameters of a DNN are changed, this changes the class of functions that the network can represent (e.g., the size and stride of convolution filters determine which input pixels can be correlated in deeper layers). Typically, hyperparameters are carefully tuned using cross-validation based on held-out data; we did so in our experiments also. We expect that such a tuning procedure yields hyperparameters that are as good as possible for the model to be optimized, allowing sufficient optimization so that the linear identifiability of the learned representations is realized. If the hyperparameters are sufficiently bad and optimization suffers, this will interfere with model fitting, and with the linear identifiability of the learned representations also.

B PROOF THAT LINEAR SIMILARITY IS AN EQUIVALENCE RELATION

We claim that ∼_L is an equivalence relation. It suffices to show that it is reflexive, symmetric, and transitive.

Proof. Consider some function g_θ and parameters θ, θ′, θ† ∈ Θ. Suppose θ ∼_L θ′. Then there exists an invertible matrix B such that g_θ(x) = B g_θ′(x). Since g_θ′(x) = B⁻¹ g_θ(x), the relation ∼_L is symmetric. Reflexivity follows from setting θ′ to θ and B to the identity matrix. To show transitivity, suppose also that θ′ ∼_L θ†. Then there exists an invertible C such that g_θ′(x) = C g_θ†(x). Since θ ∼_L θ′, B⁻¹ g_θ(x) = C g_θ†(x). Rearranging terms, g_θ(x) = BC g_θ†(x), so that θ ∼_L θ† as required.

C SECTION 3.2 CONTINUED: CASE OF CONTEXT REPRESENTATION FUNCTION g

Our derivation of identifiability of g_θ is similar to the derivation for f_θ. The primary difference is that the normalizing constants in Equation (6) do not cancel out. First, note that we can rewrite Equation (1) as:

p_θ(y|x, S) = exp( f̃_θ(x, S)ᵀ g̃_θ(y) )  (9)

where:

f̃_θ(x, S) = [−Z(x, S); f_θ(x)]  (10)
g̃_θ(y) = [1; g_θ(y)]  (11)
Z(x, S) = log Σ_{y′∈S} exp( f_θ(x)ᵀ g_θ(y′) ).

Below, we show that for the model family defined in Section 2, p_θ′ = p_θ* ⟹ g_θ′(y) = B g_θ*(y), where B is an invertible (M×M)-dimensional matrix, concluding the proof of the linear identifiability of models in the family defined by Equation (1). We adopt the same shorthands as in the main text.
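The augmented form above (Equations 10 and 11) can be checked numerically: appending −Z(x, S) to f_θ and a constant 1 to g_θ makes exp(f̃ᵀg̃) equal to the normalized model without any explicit division. A small numpy sketch with arbitrary stand-in values:

```python
import numpy as np

rng = np.random.default_rng(3)

dim, n_candidates = 4, 6
f_x = rng.normal(size=dim)                    # f_theta(x)
G = rng.normal(size=(n_candidates, dim))      # g_theta(y') for each y' in S

# Normalized model of Equation (1).
scores = G @ f_x
p = np.exp(scores) / np.exp(scores).sum()

# Augmented form: absorb the normalizer into f-tilde and pad g with a 1.
Z = np.log(np.exp(scores).sum())                       # Z(x, S)
f_tilde = np.concatenate([[-Z], f_x])                  # [-Z(x,S); f_theta(x)]
G_tilde = np.hstack([np.ones((n_candidates, 1)), G])   # [1; g_theta(y)]

p_aug = np.exp(G_tilde @ f_tilde)
print(np.allclose(p, p_aug))  # True: exp(f-tilde^T g-tilde) is already normalized
```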

C.1 DIVERSITY CONDITION

We assume that for any θ′, θ* ∈ Θ for which it holds that p′ = p*, and for any given y, there exist M+1 tuples {(x^(i), S^(i))}_{i=0}^{M} such that p_D(x^(i), y, S^(i)) > 0, and such that the ((M+1)×(M+1)) matrices M′ and M* are invertible, where M′ consists of columns f̃′(x^(i), S^(i)), and M* consists of columns f̃*(x^(i), S^(i)). This is similar to the diversity condition of Section 3.2 but milder, since a typical dataset will have multiple x for each y.

C.2 PROOF

With the data distribution p_D(x, y, S), for a given y there exists a conditional distribution p_D(x, S|y). Let (x, S) be a sample from this distribution. From Equation (1) and the statement to prove, it follows that:

p′(y|x, S) = p*(y|x, S).

Substituting in the definition of our model from Equation (9), we find:

exp( f̃′(x, S)ᵀ g̃′(y) ) = exp( f̃*(x, S)ᵀ g̃*(y) ),

which, taking logarithms, becomes

f̃′(x, S)ᵀ g̃′(y) = f̃*(x, S)ᵀ g̃*(y),  (16)

which is true for any triple (x, y, S) where p_D(y|x, S) > 0. From M′ and M* (Section C.1) and Equation (16) we form a linear system of equations, collecting the M+1 relationships together:

M′ᵀ g̃′(y) = M*ᵀ g̃*(y)  (17)
g̃′(y) = A g̃*(y),  (18)

where A = (M* M′⁻¹)ᵀ, an invertible ((M+1)×(M+1)) matrix. It remains to show the existence of an invertible M×M matrix B such that g′(y) = B g*(y). We proceed by constructing B from A. Since A is invertible, there exist j elementary matrices {E_1, ..., E_j} such that their action R = E_j E_{j−1} ⋯ E_1 converts A to a (non-unique) row echelon form. Without loss of generality, we build R such that the a_{1,1} entry of A is the first pivot, leading to the particular row echelon form:

RA = [ a_{1,1}  ã_{1,2}  ⋯  ã_{1,M+1}
       0        ã_{2,2}  ⋯  ã_{2,M+1}
       ⋮                 ⋱  ⋮
       0        0        ⋯  ã_{M+1,M+1} ]  (20)

(The proof continues in the annex below.)

D.1 SUPERVISED CLASSIFICATION

Supervised classifiers commonly employ a neural network feature extractor followed by a linear projection of the output of this network into a space of unnormalized logits. All the layers prior to the logits form the representation function f_θ, and the final projection layer is the context map g_θ(y = i) = w_i, where w_i is the i-th column of a weight matrix W. The set S in this case contains human-chosen labels and has no stochasticity. The loss function is the negative log-likelihood of the data under a categorical distribution with a softmax parameterization:

log p_θ(y = i|x, S) = f_θ(x)ᵀ w_i − log Σ_{j=1}^{|S|} exp( f_θ(x)ᵀ w_j ).

Supervised classification is thus a member of the family defined in Section 2.
It exhibits the simplest functional form for the g function while allowing f to be arbitrarily complicated.

D.2 CPC

Consider a sequence of points x_t. We wish to learn the parameters φ to maximize the k-step-ahead predictive distribution p(x_{t+k}|x_t, φ). In the image patch example, each patch center (i, j) is indexed by t. Substituting in the definition of l_k makes Equation (24) identical to the model family (Equation 1).

D.3 AUTOREGRESSIVE LANGUAGE MODELS (E.G. GPT-2)

Let U = {u_1, ..., u_n} be a corpus of tokens. Autoregressive language models maximize the log-likelihood

L(U) = Σ_{i=1}^{n} log P(u_i | u_{i−k}, ..., u_{i−1}; Θ).

Concretely, the conditional density is modelled as

log P(u_i | u_{i−k:i−1}; Θ) = W_{i:} h_i − log Σ_j exp( W_{j:} h_i ),

where h_i is the m×1 output of a function approximator (e.g., a Transformer decoder (Liu et al., 2018)), and W_{i:} is the i'th row of the |U|×m token embedding matrix.
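In code, this conditional is an inner product between a context feature vector and rows of the token embedding matrix, followed by a log-partition term; a small numpy sketch with random stand-in values (not GPT-2 weights):

```python
import numpy as np

rng = np.random.default_rng(4)

vocab, m = 50, 16
W = rng.normal(scale=0.1, size=(vocab, m))   # token embedding matrix
h = rng.normal(size=m)                       # hypothetical Transformer output h_i

# log P(u | context) = W[u] @ h - logsumexp(W @ h): the form of Equation (1),
# with the context representation playing the role of f and W rows the role of g.
scores = W @ h
log_Z = np.log(np.exp(scores).sum())
log_probs = scores - log_Z

print(np.exp(log_probs).sum())  # 1.0: a valid distribution over the vocabulary
```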

D.4 BERT

Consider a sequence of text x = [x_1, ..., x_T]. Some proportion of the symbols in x are extracted into a vector x̄ and then set in x to a special null symbol, "corrupting" the original sequence. This operation generates the corrupted sequence x̂. The representation-learning task is to predict x̄ conditioned on x̂, that is, to maximize w.r.t. θ:

log p_θ(x̄|x̂) ≈ Σ_{t=1}^{T} m_t log p_θ(x_t|x̂) = Σ_{t=1}^{T} m_t [ H_θ(x̂)_tᵀ e(x_t) − log Σ_{x′} exp( H_θ(x̂)_tᵀ e(x′) ) ],

where H is a Transformer, e is a lookup table, and m_t = 1 if symbol x_t is masked (and 0 otherwise). That is, corrupted symbols are "reconstructed" by the model, meaning that their index is predicted. As noted in Yang et al. (2019), BERT models the joint conditional probability p(x̄|x̂) as factorized so that each masked token is separately reconstructed. This means that the log-likelihood is approximate rather than exact.

D.5 QUICKTHOUGHT VECTORS

Let f and g be functions that take a sentence as input and encode it into a fixed-length vector. Let s be a given sentence, and let S_ctxt be the set of sentences appearing in the context of s for a fixed context size. Let S_cand be the set of candidate sentences considered for a given context sentence s_ctxt ∈ S_ctxt. Then S_cand contains a valid context sentence s_ctxt as well as many other, non-context sentences; S_cand is used for the classification objective. For any given sentence position in the context of s (for example, the preceding sentence), the probability that a candidate sentence s_cand ∈ S_cand is the correct sentence for that position is given by

log p(s_cand | s, S_cand) = f_θ(s)ᵀ g_θ(s_cand) − log Σ_{s′∈S_cand} exp( f_θ(s)ᵀ g_θ(s′) ).

D.6 DEEP METRIC LEARNING

The multi-class N-pair loss in Sohn (2016) is proportional to

log N − (1/N) Σ_{i=1}^{N} log( 1 + Σ_{j≠i} exp{ f_θ(x_i)ᵀ f_θ(y_j) − f_θ(x_i)ᵀ f_θ(y_i) } ),

which, since exp(0) = 1 (with K = N candidates), can be simplified as

−(1/N) Σ_{i=1}^{N} log[ (1/K) Σ_{j=1}^{K} exp{ f_θ(x_i)ᵀ f_θ(y_j) − f_θ(x_i)ᵀ f_θ(y_i) } ]
= (1/N) Σ_{i=1}^{N} log [ 1 / ( (1/K) Σ_{j=1}^{K} exp{ f_θ(x_i)ᵀ f_θ(y_j) − f_θ(x_i)ᵀ f_θ(y_i) } ) ]
= (1/N) Σ_{i=1}^{N} log [ exp{ f_θ(x_i)ᵀ f_θ(y_i) } / ( (1/K) Σ_{j=1}^{K} exp{ f_θ(x_i)ᵀ f_θ(y_j) } ) ].

Setting N to 1 and evaluating the log gives

f_θ(x_1)ᵀ f_θ(y_1) − log (1/K) Σ_{j=1}^{K} exp( f_θ(x_1)ᵀ f_θ(y_j) ),

which is Equation 1 where f_θ = g_θ.
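The algebraic simplification above can be verified numerically for a single anchor (N = 1): both forms reduce to logsumexp(a) − a_i, where a_j = f_θ(x_i)ᵀ f_θ(y_j). A quick numpy check with random scores:

```python
import numpy as np

rng = np.random.default_rng(5)

# Scores a_j = f(x_i)^T f(y_j) for one anchor i and K candidates.
K, i = 8, 0
a = rng.normal(size=K)

# Original N-pair form (single term): log(1 + sum_{j != i} exp(a_j - a_i)).
lhs = np.log1p(np.exp(np.delete(a, i) - a[i]).sum())

# Softmax form from the simplification, shifted by the constant log K:
# log K - (a_i - log((1/K) sum_j exp(a_j))).
rhs = np.log(K) - (a[i] - np.log(np.exp(a).mean()))

print(np.isclose(lhs, rhs))  # True: the two forms agree up to the log K shift
```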

D.7 NEURAL PROBABILISTIC LANGUAGE MODELS (NPLMS)

Figure 1 shows results from a neural probabilistic language model as proposed in Mnih and Teh (2012). Mnih and Teh (2012) propose using a log-bilinear model (Mnih and Hinton, 2009) which, given some context h, learns context word vectors r_w and target word vectors q_w. In other words, two different embedding matrices are maintained: one to capture the embedding of the word and the other the context. The representation for the context, q̂, is then computed as a linear combination of the context words with context weight matrices C_i, so that q̂ = Σ_{i=1}^{n−1} C_i r_{w_i}. The score for the match between the context and the next word is computed as a dot product, i.e., s_θ(w, h) = q̂ᵀ q̂_w (see the footnote below), and substituting into the definition of P_θ^h(w), we see that

log P_θ^h(w) = q̂ᵀ q̂_w − log Σ_{w′} exp( q̂ᵀ q̂_{w′} ).



Footnote: We have absorbed the per-token baseline offset b_w into the q_w defined in Mnih and Teh (2012), forming the vector q̂_w whose i'th entry is (q̂_w)_i = (q_w)_i + b_w/(q̂)_i.
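A minimal numpy sketch of the log-bilinear scoring scheme described above, with random stand-in embeddings and hypothetical context indices (the per-token offset b_w is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)

vocab, dim, n_ctx = 30, 8, 3
R = rng.normal(scale=0.1, size=(vocab, dim))       # context word vectors r_w
Q = rng.normal(scale=0.1, size=(vocab, dim))       # target word vectors q_w
C = rng.normal(scale=0.1, size=(n_ctx, dim, dim))  # per-position matrices C_i

context = [3, 17, 5]   # hypothetical indices of the n-1 context words

# q-hat = sum_i C_i r_{w_i}: linear combination of the context-word embeddings.
q_hat = sum(C[i] @ R[w] for i, w in enumerate(context))

# Score s(w, h) = q-hat^T q_w, normalized with a softmax over the vocabulary.
scores = Q @ q_hat
log_p = scores - np.log(np.exp(scores).sum())

print(np.exp(log_p).sum())  # 1.0: a valid next-word distribution
```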



Figure 2: Deep Supervised Classification. (a) Data distribution for a linearly identifiable K-way classification problem. (b) Mean (centered) CCA between the learned representations over the course of training. After approx. 4000 iterations, CCA finds a linear transformation that rotates the learned representations into alignment, up to optimization error. (c) Learned representations after transformation via the optimal linear transformation. The first dimension of the first model's feature space is plotted against the first dimension of the second. The learned representations have a nearly linear relationship, modulo estimation noise.

Figure 3: Self-Supervised Representation Learning. Error bars are computed over 5 pairs of models. (a) Input data. Two patches are taken (one from top half, and one from the bottom half) of an image at random. Using a contrastive loss, we predict the identity of the bottom patch encoding from the top. (b) Linear similarity of learned representations at checkpoints (see legend). As models converge, linear similarity increases. (c) Linear similarity as we increase the amount of data for f θ and g θ . Error bars are computed over 5 pairs of models. (d) As we increase model size, linear similarity after convergence increases for both f θ and g θ .

Figure 3 a. Patches are sampled randomly from training images.

Figure 3 b. For each model, we train for at most 3 · 10⁴ iterations, early stopping when necessary based on validation loss.

Figure 3 c. For each model, we train for at most 3 · 10⁴ iterations, early stopping when necessary based on validation loss.

Figure 3 d. Error bars show standard error computed over 5 pairs of models after 1.5 · 10⁴ training iterations.

Each x_t is mapped to a sequence of feature vectors z_t = f_θ(x_t). An autoregressive model, already updated with the previous latent representations z_{≤t−1}, transforms z_t into a "context" latent representation c_t = g_AR(z_{≤t}). Instead of predicting future observations x_{t+k}, k steps ahead, directly through a generative model p_k(x_{t+k}|c_t), Oord et al. (2018) model a density ratio in order to preserve the mutual information between x_{t+k} and c_t.

Objective. Let X = {x_1, ..., x_N} be a set of N random samples containing one positive sample from p(x_{t+k}|c_t) and N−1 samples from the proposal distribution p(x_{t+k}). Oord et al. (2018) define the following link function:

l_k(x_{t+k}, c_t) = exp( z_{t+k}ᵀ W_k c_t ).

Then, CPC optimizes

−E_X [ log ( l_k(x_{t+k}, c_t) / Σ_{x_j∈X} l_k(x_j, c_t) ) ].  (24)

C.2 PROOF (CONTINUED)

where ã_{i,j} indicates that the corresponding entry of RA may differ from the entry a_{i,j} of A due to the action of R. Applying R to the relation g̃′(y) = A g̃*(y), we have

R g̃′(y) = RA g̃*(y).  (21)

We now show that removing the first row and column of RA and R generates matrices of rank M. Let Ā and R̄ denote the (M×M) submatrices formed by removing the first row and column of RA and R, respectively. Equation (20) shows that Ā has a pivot in each column, and thus has rank M. To show that R̄ is invertible, we must show that removing the first row and column reduces the rank of R = E_j E_{j−1} ⋯ E_1 by exactly 1. Clearly, each E_k is invertible, and their composition is invertible. We must show the same for the composition of the corresponding submatrices Ē_k.

There are three cases to consider, corresponding to the three unique types of elementary matrices. Each elementary matrix acts on A by either (1) swapping rows i and j, (2) replacing row j by a multiple m of itself, or (3) adding a multiple m of row i to row j. We denote elementary matrix types by superscripts.

In Case (1), E^1_k is an identity matrix with row i and row j swapped. In Case (2), E^2_l is an identity matrix with the (j, j)'th entry replaced by some m. For each E^1_k and E^2_l in R, where 1 ≤ k, l ≤ j, we know that the indices i, j ≥ 2, because we chose the first entry of the first row of A to be the pivot, and hence do not swap the first row, or replace the first row by itself multiplied by a constant. This implies that removing the first row and column of E^1_k and E^2_l removes a pivot entry 1 in the (1,1) position, and removes zeros elsewhere. Hence, the (M×M) submatrices Ē^1_k and Ē^2_l are elementary matrices with rank M.

For Case (3), E^3_k has some value m ∈ ℝ in the (j, i)'th entry, and 1s along the diagonal. In this case, we may find a non-zero entry in some E^3_k, so that, e.g., the second row has a pivot at position (2, 2). Without loss of generality, suppose i = 1, j = 2 and let m be some nonzero constant. Removing the first row and column of E^3_1 removes this m also. Nevertheless, Ē^3_1 = I_M, the rank-M identity matrix. For any other E^3_k, 1 < i ≤ M+1 and j ≥ 2, because we chose a_{1,1} as the first pivot, and hence never modify the first row. In both cases, removing the first row and first column creates an Ē^3_k that is a rank-M elementary matrix.

We have shown by the above that R̄ is a composition of rank-M matrices. We now construct the matrix B by removing the first entries of g̃′(y) and g̃*(y), and removing the first row and first column of R and RA in Equation (21). Then, we have

R̄ g′(y) = Ā g*(y).

Choosing B = R̄⁻¹ Ā proves the result.

D REDUCTIONS TO CANONICAL FORM OF EQUATION (1)

In the following, we show membership in the model family of Equation (1) using the mathematical notation of the papers under discussion in Section 4. Note that each subsection changes notation to match the paper under discussion, which varies quite widely. We employ the following colour-coding scheme to aid clarity: f_θ(x) is generalized to a data representation function, g_θ(y) is generalized to a context representation function, and Σ_{y′∈S} exp( f_θ(x)ᵀ g_θ(y′) ) is some constant. The derivation in Section D.7 shows that Mnih and Teh (2012) is a member of the model family. Interestingly, a touchstone work in the area of NPLMs, Word2Vec (Mikolov et al., 2013), does not fall under the model family, due to an additional nonlinearity applied to the score of Mnih and Teh (2012).

