NEURAL NETWORKS AS PATHS THROUGH THE SPACE OF REPRESENTATIONS

Abstract

Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. We consider a simple hypothesis for interpreting the layer-by-layer construction of useful representations: perhaps the role of each layer is to reformat information to reduce the "distance" to the desired outputs. With this framework, the layer-wise computation implemented by a deep neural network can be viewed as a path through a high-dimensional representation space. We formalize this intuitive idea of a "path" by leveraging recent advances in metric representational similarity. We extend existing representational distance methods by computing geodesics, angles, and projections of representations, going beyond mere layer distances. We then demonstrate these tools by visualizing and comparing the paths taken by ResNet and VGG architectures on CIFAR-10. We conclude by sketching additional ways that this kind of representational geometry can be used to understand and interpret network training, and to describe novel kinds of similarities between different models.

1. INTRODUCTION

A core design principle of modern neural networks is that they process information serially, progressively transforming inputs until the information is in a format that is immediately usable for some task (Rumelhart et al., 1988; LeCun et al., 2015). This idea of composing sets of simple units to construct more complicated functions is central to both artificial neural networks and to how neuroscientists conceptualize various functions in the brain (Kriegeskorte, 2015; Richards et al., 2019; Barrett et al., 2019). Our work is motivated by a spatial analogy for information-processing: we imagine that outputs are "far" from inputs if the mapping between them is complex, or "close" if it is simple. In this spatial analogy, any one layer of a neural network contributes a single step, and the composition of many steps transports representations along a path towards the desired target representation. Formalizing this intuition requires a way to quantify whether any two representations are "close" (similar) or "far" (dissimilar) (Kriegeskorte, 2009; Kornblith et al., 2019). We build on recent work introducing metrics for quantifying representational dissimilarity (Williams et al., 2021; Shahbazi et al., 2021). Representational dissimilarity is quantified using a function d(X, Y) : X × X → R+ that takes in two matrices of neural data and outputs a nonnegative value for their dissimilarity. Here, X = ∪_{n=1,2,3,...} R^{m×n} is the space of all m × n matrices over all n. The matrices X and Y could be, for instance, the values of two hidden layers in a network with n_x and n_y units, respectively, in response to m inputs. What are desirable properties of such a representational dissimilarity function?
Previous work has argued that any sensible dissimilarity function should be nonnegative, so d(X, Y) ≥ 0, and should return zero between any equivalent representations, so d(X, Y) = 0 ⇔ X ∼ Y, where X ∼ Y means that X and Y are in the same equivalence class. For example, we may wish to design the function d so that d(X, Y) = 0 if Y is a shifted copy of X, or if it is a non-degenerate scaling, rotation, or affine transformation of X (Kornblith et al., 2019; Williams et al., 2021; Shahbazi et al., 2021). A second desirable property is that d is symmetric, so d(X, Y) = d(Y, X). A third is that d satisfies the triangle inequality, so d(X, Y) ≤ d(X, Z) + d(Z, Y). As argued by Williams et al. (2021), a representational dissimilarity function that fails to satisfy the triangle inequality can lead to errant results when, for instance, clustering or embedding representations based on their pairwise dissimilarity.

Figure 1 (caption, panels D-J): The resulting points lie on a spherical manifold, where the arc distance between points provides a measure of the distance between representations (D). E) Pairwise distances between layers computed using Angular CKA. F) 2D embedding of the network's path using multi-dimensional scaling (MDS) down to 15D followed by PCA, including points for the input pixels (black square), target class labels (purple star), and points calculated from the geodesic between inputs and labels (black dashed line). G) Three types of distances plotted in (H): distance of each layer from the inputs (orange ▲), distance to the targets (green ▼), and projected distance along the geodesic from inputs to targets (blue). I) Two types of angles plotted in (J): the "internal angle" between adjacent path segments (red), and the "target angle" between each segment and the geodesic from that segment to the targets (green). Note that we treat each residual block as a single step or segment of the path.
A dissimilarity function that satisfies all of the above properties - equivalence, symmetry, and the triangle inequality - qualifies as a metric on X (Burago et al., 2001). Examples of metrics between neural representations were recently developed independently by Williams et al. (2021) and Shahbazi et al. (2021). We are interested in using metrics between neural representations to explore how representations evolve spatially as they are transformed through the hidden layers of deep networks. Not all metrics are sufficient for this kind of spatial reasoning - that is, not all metrics can be interpreted as distances along paths. For example, consider the trivial metric d(X, Y) = 0 if X ∼ Y, and d(X, Y) = 1 otherwise. This is a valid metric according to the equivalence, symmetry, and triangle inequality criteria, but it is useless as a tool for characterizing distances. To be interpretable as a measure of distance, d(X, Y) must satisfy an intuitive fourth condition called rectifiability: the distance between any two points must be realizable as the (infimum of the) sum of distances of segments along a path between them (Burago et al., 2001). While not all metrics are rectifiable (such as the trivial metric above), it is perhaps unsurprising that this condition is met by many sensible metrics, including those already developed by Williams et al. (2021) and Shahbazi et al. (2021). In fact, all metrics considered in this paper are Riemannian metrics, which not only implies that they are rectifiable, but further requires that points live on a smooth manifold (Burago et al., 2001). Each restriction on the type of metric comes with additional structure that we can use to inspect and visualize neural representations: rectifiability allows us to smoothly interpolate neural representations along a geodesic as well as compute projections and angles, and Riemannian structure allows us to meaningfully compare the direction of steps taken by different layers.
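To make the axioms concrete, the following sketch checks symmetry and the triangle inequality for the trivial metric over a small set of points (with equivalence simplified to plain equality, an illustrative assumption; `trivial_metric` is our own name). The axioms hold, even though this metric carries no usable notion of distance along a path.

```python
from itertools import product

def trivial_metric(x, y):
    """The trivial metric: 0 for equivalent points, 1 otherwise.
    For illustration, equivalence is taken to be plain equality."""
    return 0.0 if x == y else 1.0

points = ["A", "B", "C", "A"]  # includes a duplicated (equivalent) point

# Nonnegativity and symmetry hold for every pair
assert all(trivial_metric(x, y) == trivial_metric(y, x) >= 0.0
           for x, y in product(points, repeat=2))

# Triangle inequality holds for every triple: d(x,y) <= d(x,z) + d(z,y)
assert all(trivial_metric(x, y) <= trivial_metric(x, z) + trivial_metric(z, y)
           for x, y, z in product(points, repeat=3))
```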
The main contributions of this paper are:
• We put forward the spatial "path" analogy for deep neural networks, and quantitatively evaluate it using recently developed methods for representational distance.
• We develop a toolbox for analyzing geometric properties of sequences of neural network layers, and show how existing representational distance measures imply a rich set of geometric concepts beyond mere pairwise distance.
• We create novel visualizations of how representations are transformed through the layers of deep networks.
• We apply these techniques to compare paths taken by wide and deep residual networks, as well as four VGG architectures, all trained on CIFAR-10, finding differences in both the magnitude and direction of steps taken by layers in wide versus deep models (Nguyen et al., 2021).

2.1. RELATED WORK

One motivation for thinking of neural networks as paths is that it provides a compelling analogy for the way that complex functions (deep networks) can be composed out of simple parts (layers). Indeed, it is well known that both deeper (Poole et al., 2016; Raghu et al., 2017; Rolnick and Tegmark, 2017) and wider (Hornik et al., 1989) neural network architectures can express a larger class of functions than their shallower or narrower counterparts. However, much less is known about how implementing a particular complex function constrains the role of individual layers and the intermediate representations between input and output. Our work is in line with other recent efforts to characterize the features learned in hidden layers as varying smoothly between inputs and outputs (Chan et al., 2020; Yang et al., 2022; He and Su, 2022). In our path framework, relatively narrow layers take shorter steps, and chaining them together increases the total achievable length, while wide layers are capable of taking a single large step; all such steps contribute by moving "closer" to the targets. Our work is a first step, so to speak, towards formalizing this notion of composing simple functions in geometric terms. There is a rich literature applying geometric concepts like distance between representations to formalize notions of "similarity" in neuroscience and psychology (Edelman, 1998; Jäkel et al., 2008; Rodriguez and Granger, 2017; Hénaff et al., 2019; Kriegeskorte and Wei, 2021). However, there is a crucial difference between measuring similarity or distance between points in a given space, and measuring distances between representational spaces themselves. The former includes questions like, "how far apart are the activation vectors for two inputs in a given layer?" The latter, which is the subject of this paper, asks instead, "how far apart are two layers' representations, considering all inputs?"
Our work is most closely related to, and draws much inspiration from, recent advances in representational similarity analysis (RSA). In particular, Kornblith et al. (2019) showed that a kernel method for testing statistical dependence, known as CKA (Gretton et al., 2005; Cortes et al., 2012), is closely related to classic RSA (Kriegeskorte, 2009). In follow-up work, Nguyen et al. (2021) used CKA to make layer-by-layer comparisons between wide and deep networks - which we elaborate on in section 3.2 below. Independently, both Williams et al. (2021) and Shahbazi et al. (2021) developed methods to compute metrics between neural representations. Shahbazi et al. (2021) proposed using the so-called Affine-Invariant Riemannian Metric on the space of symmetric positive-definite matrices (Pennec, 2006; 2019). Williams et al. (2021) derived a metric variation of CKA, which we call Angular CKA, as well as a family of Generalized Shape Metrics. We extend this prior work by computing not just pairwise distances, but also by introducing a suite of tools for analyzing the geometry of representation space induced by each of these distance functions. Finally, whereas Williams et al. (2021) and Shahbazi et al. (2021) compare representations across models, we compare representations within a single model to study the transformation of information from inputs to outputs through hidden layers.

2.2. DISTANCE METRICS BETWEEN NEURAL REPRESENTATIONS

In our notation, a deep neural network processes an input x^0 through a sequence of hidden layers, {x^1, . . . , x^{L-1}}, and produces outputs x^L, trained to match some target outputs (LeCun et al., 2015). In the common example of image classification, x^0 is the input image, the targets are the class labels, and x^L is a set of (log) class probabilities. We will use subscripts to index inputs, so x^l_i is the n_l-dimensional response of the l-th hidden layer to the i-th input, x^0_i, with i ∈ {1, . . . , m}. We use capital letters like X^l, X^k, or Y to refer to m × n matrices of neural representations in response to m inputs, and we treat target labels as one-hot vectors (i.e. if Y is the target representation, then it is an m × 10 matrix for the CIFAR-10 dataset (Krizhevsky, 2009)). We adopt the convention that l < k if x^l is a direct ancestor of x^k through the layers of a network (layers will in general be partially, but not totally, ordered). We are interested in functions d(X, Y) that quantify the "dissimilarity" between two neural representations; when d(X, Y) satisfies the four conditions of a length metric - equivalence, symmetry, triangle inequality, and rectifiability - we say that it measures their "representational distance." Note that X and Y may be different layers of the same model or layers from different models, as long as they are evaluated on the same inputs. We can unify and organize existing representational distance metrics by recognizing that they use a two-stage approach to defining d(X, Y): first, X and Y are mapped to a common space M through an embedding function f : X → M, then distance is computed using a distance metric defined on M. More precisely,

d(X, Y) ≡ d_M(X̃, Ỹ),    (1)

where X̃ = f(X) is the result of mapping X from X to M. In all cases we consider here, M is a Riemannian manifold with metric d_M.
Equivalence relations on X can be built into this two-stage approach at either stage: d(X, Y) can be made invariant to changes in the scale of X either by imposing a canonical scale in f, or by preserving scale in f but using a scale-invariant metric d_M. We implemented three existing families of representational distance, each of which can be understood as a different choice of f, M, and d_M, and extended them to further compute various geometric quantities including geodesics, projections, tangent vectors, and angles. The representational distance metrics we investigate here are Angular CKA (with a linear kernel), Shape Metrics (with angular distance), and the Affine Invariant Riemannian Metric (with a squared exponential kernel and ridge regularization) (Williams et al., 2021; Shahbazi et al., 2021). We have found that results using Angular CKA are the most interpretable among the metrics we have tested, so our main results are presented using Angular CKA; results using other metrics are shown in Figure C.1. In the following section, we give mathematical details for Angular CKA; Table A.1 and Appendix A provide mathematical details on all metrics in one place, including a discussion of their invariances.

2.3. THE GEOMETRY OF ANGULAR CKA

Angular CKA was introduced by Williams et al. (2021) (eq. (60) in their supplement). It is defined as the arccosine of CKA, which is itself derived from the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; Cortes et al., 2012; Kornblith et al., 2019). Because HSIC and CKA measure statistical dependence, distance measured by Angular CKA is small when the rows of X and Y are strongly statistically dependent, and large when they are independent. While Angular CKA was originally introduced simply as a method for computing a metric between neural representations, here we exploit the fact that Angular CKA is an arc length on a hypersphere to compute additional geometric properties of the space. Angular CKA is equivalent to arc-length distance on the spherical manifold consisting of centered and normalized m × m Gram matrices. Let G_X denote the Gram matrix of X, i.e. G_X,ij is given by the inner-product between the ith and jth rows of X. Optionally, this inner-product may be computed using a kernel; following previous work (Kornblith et al., 2019; Nguyen et al., 2021), results in this paper use Linear CKA, i.e. G_X = XX^⊤. The normalized and centered Gram matrix is G̃_X = HG_XH / ||HG_XH||_F, where H = I_m − (1/m)11^⊤ is the m × m centering matrix and ||·||_F is the Frobenius norm. Comparing to equation (1), we can see that the embedding function for Angular CKA is f(X) = G̃_X. Distance according to Angular CKA is

d(X, Y) = d_M(f(X), f(Y)) = d_M(G̃_X, G̃_Y) = arccos⟨G̃_X, G̃_Y⟩_F    (2)

(Figure 1B-D). Because Angular CKA is an arc length on a hypersphere, we can easily compute its geodesics: the shortest path connecting G̃_X to G̃_Y is

geodesic(G̃_X, G̃_Y, t) = (sin((1 − t)Ω) / sin(Ω)) G̃_X + (sin(tΩ) / sin(Ω)) G̃_Y,    (3)

where t ∈ [0, 1] is the fraction of distance along the geodesic from X to Y, and Ω = d_M(G̃_X, G̃_Y). This is a direct application of the SLERP formula for hyperspheres, applied to G-space. For all metrics, we use numerical optimization methods to project a point in M onto the geodesic spanning two other points.
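As a concrete illustration, equations (2) and (3) for the linear-kernel case can be sketched in a few lines of numpy (function names here are our own, not from any released package):

```python
import numpy as np

def embed(X):
    """f(X): centered, Frobenius-normalized linear-kernel Gram matrix."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    G = H @ (X @ X.T) @ H
    return G / np.linalg.norm(G)          # now <G, G>_F = 1

def angular_cka(X, Y):
    """Eq. (2): arc length between embedded Gram matrices."""
    ip = np.clip(np.sum(embed(X) * embed(Y)), -1.0, 1.0)
    return float(np.arccos(ip))

def geodesic(Gx, Gy, t):
    """Eq. (3): SLERP between two embedded Gram matrices."""
    omega = np.arccos(np.clip(np.sum(Gx * Gy), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * Gx + np.sin(t * omega) * Gy) / np.sin(omega)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 30))
Y = rng.standard_normal((50, 20))

assert angular_cka(X, X) < 1e-6                          # zero self-distance
assert np.isclose(angular_cka(X, Y), angular_cka(Y, X))  # symmetry
mid = geodesic(embed(X), embed(Y), 0.5)                  # geodesic midpoint...
half = np.arccos(np.clip(np.sum(embed(X) * mid), -1, 1))
assert np.isclose(half, angular_cka(X, Y) / 2)           # ...bisects the arc
```

The final assertion checks the defining property of the geodesic: its midpoint is equidistant from both endpoints, at exactly half the total arc length.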
The projection of X̃ onto the geodesic spanning Ỹ and Z̃ can be found by minimizing the distance from X̃ to geodesic(Ỹ, Z̃, t) with respect to t; in practice, we solve this numerically. Figure 1G demonstrates one way that this idea of projecting neural representations can be used: we project each hidden layer onto the geodesic connecting inputs (raw pixels) to targets (one-hot labels). (Note that we keep only the representations labeled (i) through (vii) in Figure 1A so that the resulting path consists only of steps that perform comparable operations.) This provides a more gradual picture of the "progress" being made by each layer towards the targets than mere distance between layers and inputs or targets (Figure 1H). For all metrics, we can compute angles between any three neural representations using the tangent space defined by the metric. The logarithmic map is a function that takes in two points in M and returns a tangent vector that "points" from one to the other (Burago et al., 2001). The logarithmic map for Angular CKA from G̃_X to G̃_Y is

log_{G̃_X}(G̃_Y) = W arccos⟨G̃_X, G̃_Y⟩_F,    (4)

where W is the unit tangent vector at G̃_X pointing towards G̃_Y, given by W = (G̃_Y − G̃_X⟨G̃_X, G̃_Y⟩_F) / ||G̃_Y − G̃_X⟨G̃_X, G̃_Y⟩_F||_F. We then compute angles between triplets of points by computing the inner-product of their tangent vectors. In the case of Angular CKA in particular, let W_XY denote the tangent vector pointing from X to Y, i.e. the result of log_{G̃_X}(G̃_Y). Then,

θ(G̃_X, G̃_Y, G̃_Z) = arccos( ⟨W_YX, W_YZ⟩_F / (||W_YX||_F ||W_YZ||_F) )    (5)

is the angle of the XYZ triangle at Y. Equations (3)-(5) are simply applications of the geometry of a hypersphere to the space of m × m centered and normalized Gram matrices. Figure 1I demonstrates two types of angles that are of interest: "internal angles," which quantify how straight a path is, and "target angles," which quantify to what extent each step points in the direction of the targets.
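The projection and angle computations can be sketched in the same spirit. Since equations (4)-(5) use only Frobenius inner products, the same formulas apply to any unit-norm points on a sphere; for a simple deterministic check we use unit vectors in R^3. The grid search below is an illustrative stand-in for the numerical optimizer described above.

```python
import numpy as np

def arc_dist(a, b):
    return np.arccos(np.clip(np.sum(a * b), -1.0, 1.0))

def geodesic(a, b, t):
    om = arc_dist(a, b)
    return (np.sin((1 - t) * om) * a + np.sin(t * om) * b) / np.sin(om)

def project_to_geodesic(p, a, b, num=2001):
    """Return t in [0, 1] minimizing arc distance from p to the a->b geodesic."""
    ts = np.linspace(0.0, 1.0, num)
    ds = [arc_dist(p, geodesic(a, b, t)) for t in ts]
    return ts[int(np.argmin(ds))]

def log_map(a, b):
    """Eq. (4): tangent vector at a pointing toward b."""
    ip = np.clip(np.sum(a * b), -1.0, 1.0)
    w = b - a * ip
    return (w / np.linalg.norm(w)) * np.arccos(ip)

def angle_at(a, b, c):
    """Eq. (5): angle of the triangle abc at vertex b."""
    w1, w2 = log_map(b, a), log_map(b, c)
    cos_th = np.sum(w1 * w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return np.arccos(np.clip(cos_th, -1.0, 1.0))

e1, e2, e3 = np.eye(3)
assert np.isclose(angle_at(e2, e1, e3), np.pi / 2)        # orthogonal directions at e1
assert np.isclose(project_to_geodesic(e2, e1, e2), 1.0)   # e2 projects to t = 1
```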
Interestingly, we find that nearly all internal angles are orthogonal, and only later layers begin to take steps in the direction of the targets (Figure 1J ).

2.4. COMPARISONS WITH OTHER METRICS

The main results in this paper are presented using Angular CKA for ease of exposition; mathematical details for other metrics can be found in Table A.1 and Appendix A. At initialization, the distance from inputs increases from layer to layer (Figure 2B), but the distance to targets is constant (Figure 2A). This implies that the network's representations are changing in some direction that is orthogonal to making "progress" towards the targets. Indeed, we find that plotting projected progress is a useful way to visualize training (Figure 2C). At initialization, projected progress is zero for all layers, and it quickly begins to ramp upwards after a few training steps. We hypothesized that training would cause the network's path to become straighter, but surprisingly we found that this is not the case. In fact, internal angles begin slightly above π/2 at initialization, and concentrate towards π/2 as training progresses (Figure 2D). This means that the function implemented by each residual block applies an orthogonal operation (in representation space) compared to the blocks before or after it. To our knowledge, this kind of orthogonality between layers has not been previously observed. Figure 2E shows how "target angles" evolve over training. Consistent with past work, this visualization shows that layers pointing in the direction of the targets do not emerge until later in training. Inspired by the results of Nguyen et al. (2021), we compare paths taken by "wide" versus "deep" models. In Figure 3, we show the same quantities we computed in Figure 2, but comparing different Resnet widths and depths. Our results confirm the main findings of Nguyen et al. (2021), namely that wide and deep Resnets compute similar functions - in other words, wide-model and deep-model paths all follow roughly the same trajectory in representational space.
Note that the x-axis for these plots is scaled by the depth of each model; the fact that all Resnets are superimposed on each other means that all of them distribute the function of each layer relatively evenly over their depth. This is not true of all models - we also compare with VGG, and see salient differences between the VGG and Resnet architectures: the progress made in VGG (which lacks skip connections) depends much more on absolute rather than relative depth.

3.3. DECOMPOSING EACH STEP INTO "PROGRESS" AND "DEVIATION"

We next asked to what extent each step in the path, i.e. each residual block, moves representations towards the targets ("progress") or in orthogonal directions ("deviation"; Figure 4A). Note that unlike the "projected progress" above, which was measured relative to the input-target geodesic, here we compare each step from X^l to X^{l+1} to the geodesic connecting the current position X^l to the targets. A model with D blocks provides D − 1 measures of both progress and deviation. Figure 4B summarizes the net contribution of progress and deviation for all layers in models with varying architectures. Three effects are salient. First, there is a strong overall upward trend: for every block that makes 0.1 radians of progress towards the targets (recall that Angular CKA is an arc length), it incurs a deviation of about 0.3 radians in an orthogonal direction, and this trend is remarkably uniform across both Resnet and VGG architectures. Second, deeper models sit lower and to the left: they parcel out both progress and deviation among more layers. Third, there is an effect of width at constant depth: wider models incur less deviation than their narrower counterparts.
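The decomposition above can be sketched on the hypersphere as follows (an illustrative helper of our own, not the paper's exact code): the step from X^l to X^{l+1} is a tangent vector at X^l; "progress" is its component along the tangent direction toward the targets, and "deviation" is the orthogonal remainder.

```python
import numpy as np

def log_map(a, b):
    """Tangent vector at a pointing toward b (sphere, Frobenius metric)."""
    ip = np.clip(np.sum(a * b), -1.0, 1.0)
    w = b - a * ip
    return (w / np.linalg.norm(w)) * np.arccos(ip)

def progress_deviation(x_l, x_next, target):
    """Decompose the step x_l -> x_next relative to the x_l -> target geodesic."""
    step = log_map(x_l, x_next)
    toward = log_map(x_l, target)
    unit = toward / np.linalg.norm(unit_v := toward)           # unit direction to target
    progress = float(np.sum(step * unit))                      # signed component along it
    deviation = float(np.linalg.norm(step - progress * unit))  # orthogonal remainder
    return progress, deviation

e1, e2, e3 = np.eye(3)
mid = (e1 + e2) / np.sqrt(2)            # halfway along the e1 -> e2 geodesic
p, d = progress_deviation(e1, mid, e2)
assert np.isclose(p, np.pi / 4) and np.isclose(d, 0.0)   # pure progress
p, d = progress_deviation(e1, e3, e2)
assert np.isclose(p, 0.0) and np.isclose(d, np.pi / 2)   # pure deviation
```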

4. DISCUSSION

In this work, we interpret neural networks spatially, as paths from inputs to outputs through a space of intermediate representations. These paths have rich geometric structure, inherited from the way we compute distances between representations, and we can use that structure to build intuitions about network structure, training, and comparisons between models. We investigated three families of representational distance metrics - Angular CKA, Shape Metrics, and the Affine Invariant metric on symmetric positive definite matrices (Williams et al., 2021; Shahbazi et al., 2021). Surprisingly, we found that trained networks take circuitous paths according to all of these metrics, deviating far from the shortest paths from inputs to targets defined geometrically by each representational distance metric. There are three potential explanations for this. First, networks may be taking short paths according to some metric other than those we investigated here, implying that our metrics may not reflect the distance between representations in terms of ease of computation. Second, neural networks may simply fail to take efficient paths. The distance metrics we consider are all differentiable, so an interesting question for future work is whether networks can be regularized to take shorter paths, and whether such regularization improves or harms their performance and generalization ability. Third, it could be that networks take the shortest and most direct path that is possible under their architectural constraints, which may prevent the hidden layers of the network from moving directly along the geodesic. This explanation must be at least partially true, since the dimensionality of the representation space M generally exceeds the number of parameters in each layer/block of the network. We were also surprised to discover that, according to all metrics we investigated, network paths tend to be jagged, consisting predominantly of 90° turns.
Although it is well known that random directions in high-dimensional spaces such as the representation space M tend to be nearly orthogonal, this does not explain why path angles are straighter at initialization and become more orthogonal during training - the training objective ostensibly should encourage all layers to point in the same direction, towards the targets. This is an especially surprising result in light of recent work by Chan et al. (2020) suggesting that a sequence of residual blocks can be interpreted as a sequence of small gradient steps optimizing a rate-reduction objective; in that interpretation, all layers ought to be moving in the same general direction. However, this is not what we find in our models trained by backpropagation, whether angles are quantified using Angular CKA or other metrics (Figure C.1). Ultimately, ours is an empirical finding which suggests that future theoretical work is needed to interpret the direction of steps taken in representation space in the context of a given representational distance metric, and to understand which directions are realizable by a given network architecture.

Figure 4: A) For each model, we decompose every path segment into "progress" and "deviation" components, where "progress" is the component along the geodesic from the segment start to the targets, and "deviation" is the component orthogonal to that geodesic. B) Average (circle) ± standard error (horizontal and vertical lines) of "progress" and "deviation" over all layers for a variety of model architectures. Note that the names of Resnets indicate depth followed by width, and the names of VGG models indicate depth.
As described in the introduction, we are motivated to develop a general spatial analogy for information-processing, where complex transformations of representations cover more "distance" than simple ones. In this sense, a measure of representational distance ought to reflect the complexity of the function transforming X into Y. We chose to extend and compare existing representational distance metrics in order to build directly on previous work, but the metrics we evaluated here may not be interpretable as measures of function complexity. An exciting avenue for future work is thus to derive a measure of representational distance directly from a measure of function complexity. Such a measure will likely violate the axioms of a distance metric. For example, the complexity of a function and its inverse are in general not equal, so it may be desirable to have d(X, Y) < d(Y, X) if the transformation from X to Y can be implemented by a simpler function than its inverse. Notably, many of the geometric properties studied here (shortest paths, angles, projections, etc.) can be extended to the asymmetric case using the theory of Finsler rather than Riemannian manifolds (Burago et al., 2001). The analogy of neural networks as paths in a representation space brings together ideas about representational similarity and the expressivity of deep networks, marrying these techniques with intuitive and mathematically rigorous geometric concepts. Our work takes a first step in exploring the possibilities of this geometric framework, and we anticipate that it will spark new insights about model design, model training, and model comparison. As summarized in equation (1), all distance metrics between neural representations operate in two stages: first, a layer's activity X ∈ X is transformed into a point in some canonical space M through an embedding function f, and second, distances are measured in that shared space.
In the following subsections we give details for each metric in Table A.1.

5. REPRODUCIBILITY STATEMENT

Table A.1:
Angular CKA† | M = centered, normalized m × m Gram matrices | d_M = arccos( ⟨X̃, Ỹ⟩_F / sqrt(⟨X̃, X̃⟩_F ⟨Ỹ, Ỹ⟩_F) )
Angular Shape† | M = R^{m×p} | d_M = min_R arccos( ⟨X̃, ỸR⟩_F / sqrt(⟨X̃, X̃⟩_F ⟨Ỹ, Ỹ⟩_F) )
Euclidean Shape† | M = R^{m×p} | d_M = min_R sqrt( (1/m) Σ_{i=1}^m ||X̃_i − Ỹ_i R||² )
Affine Invariant Riemannian | M = Sym+_m or Sym+_p | d_M = sqrt( Σ_i log(d_i)² ), where d_i is the ith eigenvalue of X̃^{−1/2} Ỹ X̃^{−1/2}

We say a metric is scale-invariant if d(X, αX) = 0 for all scalars α ≠ 0. A metric is shift-invariant if d(X, X + 1b^⊤) = 0 for any n-dimensional vector b. A metric is rotation-invariant if d(X, XR) = 0 for any n × n orthonormal matrix R. A metric is affine-invariant if d(X, XA) = 0 for any full-rank n × n matrix A.

For all geometric calculations, we drew a subset of m = 1000 items randomly from the test set, and used the same subset throughout. We verified that m = 1000 was sufficient to reliably estimate the geometric quantities of interest in both the Angular CKA and Angular Shape metrics, by randomly resampling m = 1000 points from the test set multiple times and inspecting the variance of the computed quantities (Figure C.3). Targets (class labels) are converted to one-hot vectors before embedding. Our Python implementation of various quantities from Riemannian geometry draws much inspiration from the geomstats Python package (Miolane et al., 2020). Our analyses in the main paper were done using:
• Angular CKA with m = 1000 and the linear kernel k(x_i, x_j) = x_i^⊤ x_j.
• The Angular Shape metric with m = 1000, p = 100, α = 0.
• The Gram-matrix version of the Affine-Invariant Riemannian metric with m = 1000, ϵ = 0.1, and a squared exponential kernel with length scale set to the median Euclidean distance of X.

We begin with an introduction to Angular CKA, where we also review some key terms from Riemannian geometry.

A.1 ANGULAR CKA

Angular CKA was introduced by Williams et al. (2021) (eq. (60) in their supplement).
It is defined as the arccosine of centered kernel alignment (CKA), which is itself derived from the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005; Cortes et al., 2012; Kornblith et al., 2019). Because HSIC and CKA measure statistical dependence, distance measured by Angular CKA is high when the rows of X and Y are statistically independent, and low when they are highly correlated. Angular CKA is equivalent to arc-length distance on the spherical manifold consisting of centered and normalized m × m Gram matrices. Let G_X denote the Gram matrix of X, i.e. G_X,ij is given by the inner-product between the ith and jth rows of X. Optionally, this inner-product may be computed using a kernel; following previous work (Kornblith et al., 2019; Nguyen et al., 2021), results in this paper use Linear CKA, i.e. we use G_X = XX^⊤, but our python package supports the use of other kernels. The normalized and centered Gram matrix is given by

G̃_X = HG_XH / ||HG_XH||_F,

where H = I_m − (1/m)11^⊤ is the m × m centering matrix, and ||·||_F is the Frobenius norm of a matrix. Note that both G_X and G̃_X are symmetric and positive semi-definite. The Riemannian manifold M for Angular CKA consists of all such centered, normalized, symmetric positive semi-definite matrices; it is a sphere in the sense that ⟨G̃_X, G̃_X⟩_F = 1, where ⟨A, B⟩_F = Tr(A^⊤B) is the Frobenius inner-product. The embedding function for Angular CKA is f(X) = G̃_X. Distance according to Angular CKA is equal to arc length on the sphere consisting of centered and normalized Gram matrices:

d(X, Y) = d_M(f(X), f(Y)) = d_M(G̃_X, G̃_Y) = arccos⟨G̃_X, G̃_Y⟩_F.    (A.1)

Because Angular CKA is an arc length, its geodesics lie along great circles in G-space. We can therefore compute points along the geodesic in closed form using the SLERP formula:

geodesic(G̃_X, G̃_Y, t) = (sin((1 − t)Ω) / sin(Ω)) G̃_X + (sin(tΩ) / sin(Ω)) G̃_Y,    ((3) restated)

where t ∈ [0, 1] is the fraction of distance along the geodesic from X to Y, and Ω = d_M(G̃_X, G̃_Y).
The tangent space for Angular CKA is the space of all symmetric m × m matrices, and the inner-product in the tangent space is simply the Frobenius inner-product. The logarithmic map computes tangent vectors at a base point that point towards another point. In the case of Angular CKA, the logarithmic map from G̃_X to G̃_Y is a tangent vector (symmetric matrix) at G̃_X given by

log_{G̃_X}(G̃_Y) = W arccos⟨G̃_X, G̃_Y⟩_F,    ((4) restated)

where W is the unit tangent vector at G̃_X pointing towards G̃_Y, given by W = (G̃_Y − G̃_X⟨G̃_X, G̃_Y⟩_F) / ||G̃_Y − G̃_X⟨G̃_X, G̃_Y⟩_F||_F. The exponential map is the inverse of the logarithmic map - it is a function that "extrapolates" a tangent vector W from a point to give another point on the manifold. In the case of Angular CKA, the exponential map is given by

exp_{G̃_X}(W) = cos(||W||_F) G̃_X + sinc(||W||_F) W,    (A.2)

where sinc(x) = sin(x)/x. For all metrics, we compute angles between triplets of points by computing the inner-product of their tangent vectors. In the case of Angular CKA in particular, let W_XY denote the tangent vector pointing from X to Y, i.e. the result of log_{G̃_X}(G̃_Y). Then,

θ(G̃_X, G̃_Y, G̃_Z) = arccos( ⟨W_YX, W_YZ⟩_F / (||W_YX||_F ||W_YZ||_F) )    ((5) restated)

is the angle of the XYZ triangle at Y.
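A quick sanity check that the exponential map (A.2) inverts the logarithmic map (4): the sketch below tests the formulas on unit vectors, since they involve only Frobenius inner products and apply unchanged to embedded Gram matrices.

```python
import numpy as np

def log_map(a, b):
    """Eq. (4): tangent vector at base point a pointing toward b."""
    ip = np.clip(np.sum(a * b), -1.0, 1.0)
    w = b - a * ip
    return (w / np.linalg.norm(w)) * np.arccos(ip)

def exp_map(a, w):
    """Eq. (A.2): extrapolate tangent vector w from base point a."""
    n = np.linalg.norm(w)
    # np.sinc(x) = sin(pi x)/(pi x), so np.sinc(n/pi) = sin(n)/n
    return np.cos(n) * a + np.sinc(n / np.pi) * w

rng = np.random.default_rng(1)
a = rng.standard_normal(5); a /= np.linalg.norm(a)
b = rng.standard_normal(5); b /= np.linalg.norm(b)

b_back = exp_map(a, log_map(a, b))
assert np.allclose(b_back, b)                    # exp inverts log
assert np.isclose(np.linalg.norm(b_back), 1.0)   # result stays on the sphere
```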

A.1.1 INVARIANCES OF ANGULAR CKA

The invariances of Angular CKA depend on the kernel used to compute the Gram matrix. In the simplest case of Linear CKA, the Gram matrix is simply G_X = XX^⊤. The resulting metric is

• shift-invariant, due to centering the Gram matrix;
• scale-invariant, due to normalizing the Gram matrix;
• rotation-invariant, since (XR)(XR)^⊤ = XRR^⊤X^⊤ = XX^⊤ for any orthonormal R.

However, Angular CKA is not invariant to arbitrary affine transformations, a property it inherits from CKA that has been argued to be an important feature of CKA (Kornblith et al., 2019). Note that when using a nonlinear kernel to compute the Gram matrix, the resulting metric may lose these invariances. However, Angular CKA with a nonlinear kernel may still be shift-, scale-, and rotation-invariant if the kernel itself has those invariances. For example, the squared exponential kernel

    G_ij = k(x_i, x_j) = exp(-||x_i - x_j||²₂ / τ²)    (A.3)

is naturally shift- and rotation-invariant, and it can be made scale-invariant as well by setting the length scale τ automatically based on the scale of the data.
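These invariances are easy to verify numerically. The sketch below checks shift, scale, and rotation invariance of Linear Angular CKA on random data, and confirms that a generic (non-orthonormal) linear map does change the distance. Function and variable names are illustrative, not from our package.

```python
import numpy as np

def linear_angular_cka(X, Y):
    """Angular CKA with a linear kernel: arccos of Linear CKA."""
    def embed(Z):
        m = Z.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m
        G = H @ (Z @ Z.T) @ H                  # centered Gram matrix
        return G / np.linalg.norm(G)           # unit Frobenius norm
    return np.arccos(np.clip(np.sum(embed(X) * embed(Y)), -1.0, 1.0))

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((30, 6)), rng.standard_normal((30, 4))
R, _ = np.linalg.qr(rng.standard_normal((6, 6)))    # random orthonormal matrix
d = linear_angular_cka(X, Y)
assert abs(linear_angular_cka(X + 5.0, Y) - d) < 1e-8   # shift-invariant
assert abs(linear_angular_cka(3.0 * X, Y) - d) < 1e-8   # scale-invariant
assert abs(linear_angular_cka(X @ R, Y) - d) < 1e-8     # rotation-invariant
A = rng.standard_normal((6, 6))                          # generic linear map
assert not np.isclose(linear_angular_cka(X @ A, Y), d)   # not affine-invariant
```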

A.2 SHAPE METRICS

Williams et al. (2021) proposed using a generalization of Procrustes distance and Kendall's shape space to measure metric distance between neural representations. Shape space and Procrustes distance are a well-studied case of a Riemannian manifold between point clouds (Nava-Yazdani et al., 2020). Williams et al. (2021) consider two different shape metrics: one angular shape metric and one Euclidean shape metric. The key idea behind both of these metrics is as follows: m × n matrices of neural data are first transformed into a common m × p space, and interpreted as point clouds consisting of p-dimensional points. Then, any two point clouds are scaled and rotated so that they maximally align with each other. The final distance is computed as some measure of discrepancy between these maximally aligned point clouds. The behavior of these shape metrics is tuned using two hyperparameters: the dimensionality p, and a partial whitening parameter α.

The role of the embedding function for shape metrics is to convert n-dimensional neural data into a canonical zero-mean p-dimensional space (i.e. M = R^{m×p} is the space of all m × p matrices whose column means are all zero). Williams et al. (2021) also include a partial whitening stage as part of the embedding. This space of m × p zero-mean (and sometimes scaled) matrices is called the pre-shape space (Nava-Yazdani et al., 2020). In the case where n < p, the conversion from n to p dimensions is done by simply padding X with p - n columns of all zeros. In the case where p < n, we reduce the dimensionality of X by keeping only the top p principal components. Formally, let X̄ = X - (1/m) Σ_{i=1}^m X_i be the matrix of neural data with its mean subtracted. Then

    X̃ = f(X) = whiten([X̄, 0], α)    if n ≤ p
                whiten(X̄ U_{:p}, α)   if n > p    (A.4)

where U_{:p} stands for the first p principal components of X̄, as unit column vectors, and 0 is an m × (p - n) matrix of all zeros.
The partial whitening function begins by computing the eigendecomposition of its input, (1/m) X̃^⊤X̃ = VΣV^⊤ (here, V is a p × p orthonormal matrix containing the top principal components of X̃, and Σ is a diagonal matrix of variances). The partial whitening stage is then

    whiten(X̃, α) = X̃ V (αI_p + (1 - α)Σ^{-1/2}) V^⊤ .

Note that when α = 0 this is equivalent to ZCA whitening, and when α = 1 it leaves X̃ unchanged. All shape metric results we report use p = 100 and α = 0. We use α = 0 because it entails further invariances, making metrics more interpretable across disparate layer shapes and types (see section A.2.1 below for details on shape metric invariances).

Both the angular and Euclidean shape metrics require aligning the embedded points by a rotation, minimizing ||X̃ - ỸR||_F where R is a p × p orthonormal matrix. This is known as the orthogonal Procrustes problem, and its solution is given by R = VU^⊤, where UΣV^⊤ = X̃^⊤Ỹ is a singular value decomposition of X̃^⊤Ỹ. The generalized shape metrics introduced by Williams et al. (2021) include further restrictions on R, such as considering rotations across channel but not spatial dimensions of convolutional layers, but we omit these restrictions in our work.

In the case of angular shape metrics, distance is defined as

    d_M(X̃, Ỹ) = arccos[ ⟨X̃, ỸR⟩_F / (||X̃||_F ||Ỹ||_F) ] .    (A.5)

In the case of Euclidean shape metrics, distance is defined as

    d_M(X̃, Ỹ) = (1/m) Σ_{i=1}^m ||X̃_i - Ỹ_i R|| .    (A.6)

We compute geodesics in shape space after finding the R that aligns Ỹ to X̃. Then, the geodesic from X̃ to ỸR in the angular case is given by the SLERP formula as in (3):

    geodesic(X̃, Ỹ, t) = [sin((1-t)Ω)/sin(Ω)] X̃ + [sin(tΩ)/sin(Ω)] ỸR ,    (A.7)

where Ω = d_M(X̃, Ỹ) is the angular shape distance. Note that this means geodesic(X̃, Ỹ, 1) results in a point that is equivalent, but not identical, to Ỹ. Tangent vectors for Euclidean shape metrics can be any m × p matrix whose column-wise mean is zero.
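A toy numpy version of the embedding (A.4) with α = 0, the orthogonal Procrustes solution, and the angular shape distance (A.5) might look as follows. The pseudo-inverse guard for the exactly-zero variance directions created by zero-padding is our own assumption, not a detail specified above.

```python
import numpy as np

def embed(X, p=100):
    """Pre-shape embedding of eq. (A.4) with alpha = 0 (full ZCA whitening)."""
    m, n = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)        # subtract column means
    if n <= p:
        Xc = np.hstack([Xc, np.zeros((m, p - n))])
    else:
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        Xc = Xc @ Vt[:p].T                        # keep top-p principal components
    cov = Xc.T @ Xc / m
    vals, V = np.linalg.eigh(cov)
    # ZCA whitening; pseudo-inverse on rank-deficient (zero-padded) directions
    inv_sqrt = np.where(vals > 1e-12, 1.0 / np.sqrt(np.maximum(vals, 1e-12)), 0.0)
    return Xc @ V @ np.diag(inv_sqrt) @ V.T

def procrustes_rotation(Xt, Yt):
    """R = V U^T maximizing <Xt, Yt R>_F, where U S V^T = Xt^T Yt."""
    U, _, Vt = np.linalg.svd(Xt.T @ Yt)
    return Vt.T @ U.T

def angular_shape_distance(X, Y, p=100):
    """Eq. (A.5): arccos of the normalized inner product after alignment."""
    Xt, Yt = embed(X, p), embed(Y, p)
    R = procrustes_rotation(Xt, Yt)
    c = np.sum(Xt * (Yt @ R)) / (np.linalg.norm(Xt) * np.linalg.norm(Yt))
    return np.arccos(np.clip(c, -1.0, 1.0))
```

With α = 0, the distance between a representation and any shifted, scaled, rotated copy of itself should be numerically zero, which is a convenient unit test for the whole pipeline.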
In the case of angular shape metrics, the tangent space is further restricted to the tangent space of the hypersphere of unit-Frobenius-norm matrices (i.e. a tangent vector W at X̃ must satisfy ⟨X̃, W⟩_F = 0 in the angular case). The tangent space is further divided into so-called horizontal and vertical subspaces, where the vertical subspace captures changes to X̃ that leave distance invariant (i.e. rotations that are removed by alignment with R), and the horizontal subspace captures changes that affect the metric (Nava-Yazdani et al., 2020). The vertical component of a tangent vector W at point X̃ is given by vert_X̃(W) = X̃A, where A ∈ R^{p×p} is the solution to the Sylvester equation

    X̃^⊤X̃ A + A X̃^⊤X̃ = X̃^⊤W - W^⊤X̃ .

Following the example of Miolane et al. (2020), we use the solve_sylvester function from Scipy to compute this (Virtanen et al., 2020). The horizontal component of a tangent vector is obtained by subtracting off the vertical part of W:

    horz_X̃(W) = W - vert_X̃(W) ⟨vert_X̃(W), W⟩_F / ||vert_X̃(W)||²_F .

To compute the angle between any triplet of representations, we use the inner product of tangent vectors, as in (5), but using only the horizontal part of each tangent vector. As in Angular CKA, we compute horizontal tangent vectors from X̃ to Ỹ using the logarithmic map, which in the case of shape metrics is given by

    horizontal log_X̃(Ỹ) = ỸR - X̃    (A.8)

in the Euclidean case, or

    horizontal log_X̃(Ỹ) = W arccos[ ⟨X̃, ỸR⟩_F / (||X̃||_F ||Ỹ||_F) ]    (A.9)

in the angular case, where W is the unit tangent vector at X̃ pointing towards Ỹ (Nava-Yazdani et al., 2020), given by

    W = (ỸR - X̃⟨X̃, ỸR⟩_F) / ||ỸR - X̃⟨X̃, ỸR⟩_F||_F .

As in (A.5) and (A.6), R is the rotation matrix that optimally aligns Ỹ to X̃.
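The vertical/horizontal decomposition can be sketched with scipy's `solve_sylvester` (which solves AX + XB = Q). This is a simplified illustration, not our package's implementation; note that the horizontal step here only removes the component of W along its own vertical part, matching the formula above.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def vertical_component(Xt, W):
    """Vertical part of tangent vector W at Xt: vert = Xt A, where A solves
    the Sylvester equation (Xt^T Xt) A + A (Xt^T Xt) = Xt^T W - W^T Xt."""
    S = Xt.T @ Xt
    A = solve_sylvester(S, S, Xt.T @ W - W.T @ Xt)
    return Xt @ A

def horizontal_component(Xt, W):
    """Subtract the component of W along the vertical direction."""
    v = vertical_component(Xt, W)
    vv = np.sum(v * v)
    if vv < 1e-15:                       # W is already horizontal
        return W
    return W - v * (np.sum(v * W) / vv)
```

As a sanity check, a purely vertical vector X̃B (with B skew-symmetric) is its own vertical component, and the horizontal component of any W is orthogonal to its vertical component.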

A.2.1 INVARIANCES OF SHAPE METRICS

The invariances of the shape metrics depend on a variety of hyperparameter settings.

• All shape metrics are shift-invariant, because the embedding function X̃ = f(X) subtracts the mean.
• All shape metrics are rotation-invariant, because of the Procrustes alignment procedure, and because rotation affects neither the principal component projection nor the zero-padding step of (A.4).
• The angular shape metric is scale-invariant, because (A.5) divides by the norms of X̃ and Ỹ.
• The Euclidean shape metric is not scale-invariant in general, but it is in the special case of α = 0, since scale is removed by whitening.
• Neither the angular nor the Euclidean shape metric is affine-invariant in general, but both become affine-invariant in the special case of α = 0, since full-rank affine transforms are removed by whitening as long as n ≤ p. In the n > p case, however, an affine transformation may amplify or suppress the principal components of the data, and as a result it can affect the embedding stage (A.4).

A.3 AFFINE INVARIANT RIEMANNIAN METRIC

The Affine Invariant Riemannian (AIR) metric is a metric between symmetric positive definite (SPD) matrices, originally derived for use in image processing (Pennec, 2006; 2019); recently, Shahbazi et al. (2021) proposed using it as a metric between neural representations by first converting neural data into an SPD matrix. The embedding function can therefore be any function that maps m × n matrices in X into M = Sym⁺_k for some k. Shahbazi et al. (2021) considered two possibilities for the embedding stage: either using the m × m Gram matrix f(X) = G_X, or using the n × n data covariance matrix f(X) = cov(X) = (1/(m-1)) X^⊤HX. These correspond to complementary perspectives on the nature of neural representation, analogous to the difference between representational similarity analysis and pattern component analysis (Diedrichsen and Kriegeskorte, 2017).

The challenge when using the m × m Gram matrix approach is that, without further regularization, an m × m Gram matrix has rank n when n < m, which implies that it cannot be SPD (and the metric considers all rank-deficient matrices to be infinitely far away). To address this, our toolbox implements the AIR metric between neural representations with additional regularization options. In the Gram matrix case, we regularize in two ways. First, we compute the Gram matrix using a kernel that implicitly has an infinite-dimensional feature space (so that m is much less than the number of features). This alleviates the rank-deficiency problem in cases where n < m but rows are unique. However, when rows of X contain duplicates (notably, this is true for the target labels), the kernel Gram matrix is still rank-deficient. To address this, we include a second regularization stage where we add a small diagonal ridge with magnitude ϵ. The full embedding function in the Gram matrix case is given by

    f(X) = G_X + ϵI_m ,    (A.10)

where the ijth element of G_X is given by k(X_i, X_j).
For our results in the paper, we use ϵ = 0.05 and a squared exponential kernel for k as in (A.3), setting the length scale τ automatically to the median pairwise Euclidean distance between rows of X.

The main challenge when using the covariance matrix approach is that it cannot be directly applied to compare layers with different numbers of neurons n. To address this, we first convert from m × n matrices of neural data into a common m × p size, using the same method as for the shape metrics in (A.4), but without the whitening stage. We can then embed all layers into a common space of p × p covariance matrices. As in the Gram matrix case, we again run into rank-deficiency issues when n < p (e.g. for the one-hot embedding of targets, for which n = 10), and so we again regularize by adding a diagonal ridge to the resulting covariance matrices. The full embedding function in the covariance matrix case is given by

    f(X) = cov([X, 0]) + ϵI_p     if n ≤ p
           cov(X U_{:p}) + ϵI_p    if n > p    (A.11)

• The AIR metric is scale- and rotation-invariant due to the eponymous "affine-invariance" of the metric itself (Pennec, 2006; 2019).
• As in the case of the shape metrics discussed above, the AIR metric may or may not be invariant to arbitrary affine transformations of X, due to the restriction to the top p principal components in the embedding stage. Only when n > p and A is a matrix such that XA changes the subspace spanned by the top p principal components does the resulting metric fail to be invariant to A.
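A minimal sketch of the two regularized embeddings (A.10) and (A.11), with a median-heuristic squared-exponential kernel, is given below. Helper names are illustrative, and the details (e.g. using the upper triangle for the median) are our assumptions.

```python
import numpy as np

def rbf_gram(X, eps=0.05):
    """Gram-matrix embedding, eq. (A.10): squared-exponential kernel with a
    median-heuristic length scale, plus a small diagonal ridge."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    # tau = median pairwise Euclidean distance between rows of X
    tau = np.median(np.sqrt(d2)[np.triu_indices_from(d2, k=1)])
    return np.exp(-d2 / tau**2) + eps * np.eye(X.shape[0])

def cov_embed(X, p=100, eps=0.05):
    """Covariance embedding, eq. (A.11): pad or project to p dimensions, then
    covariance plus a small diagonal ridge."""
    m, n = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)
    if n <= p:
        Xc = np.hstack([Xc, np.zeros((m, p - n))])
    else:
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        Xc = Xc @ Vt[:p].T                # keep top-p principal components
    return Xc.T @ Xc / (m - 1) + eps * np.eye(p)
```

Both embeddings return strictly positive definite matrices, which is exactly what the ridge is there to guarantee.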

B MODELS AND TRAINING DETAILS

We trained a collection of convolutional networks, including both residual networks (He et al., 2016) and VGG (Simonyan and Zisserman, 2014), on CIFAR-10 (Krizhevsky, 2009) using PyTorch (Paszke et al., 2019), managing compute jobs using GNU Parallel (Tange, 2011). We used the OpenLTH framework for training and checkpointing models, using the default hyperparameters for each model.

Following Nguyen et al. (2021), we trained residual networks of varying widths and depths. The "width" refers to the number of feature channels per convolutional layer, and took on values of {16, 32, 64, 128, 160} (corresponding to the base size of 16 multiplied by {1×, 2×, 4×, 8×, 10×}). The "depth" controls the number of residual blocks according to the formula #blocks = (depth - 2)/2, since each block contains two convolutional layers, and there are two additional preprocessing/projection layers before and after the blocks. We trained models of depths {14, 20, 26, 32, 38, 44, 56, 110}. The VGG architecture supports "depths" of {11, 13, 16, 18}, all at the same width. The test accuracy of all models is shown in Table B.2. We analyzed the representational distances and geometry of a subset of these, focusing on depths 14 and 38 (for all widths) and widths 16 and 64 (for all depths).

All models were trained using the default training hyperparameters of OpenLTH. Specifically, all models were trained by stochastic gradient descent for 160 epochs with a batch size of 128 (390.6 batches per epoch of 50k training items), an initial learning rate of 0.1 reducing to 0.01 and 0.001 after 80 and 120 epochs respectively, momentum of 0.9, and weight decay of 0.0001. During training, images were augmented by random horizontal flips and random ±4 pixel left/right or up/down shifts (padding with zeros).
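As a quick sanity check on the depth bookkeeping above, #blocks = (depth - 2)/2 gives, for example, 6 blocks at depth 14 and 54 blocks at depth 110:

```python
def num_blocks(depth: int) -> int:
    """Residual blocks for a given depth: two conv layers per block, plus two
    preprocessing/projection layers before/after the blocks."""
    assert (depth - 2) % 2 == 0, "depth must equal 2 * blocks + 2"
    return (depth - 2) // 2

print([num_blocks(d) for d in (14, 20, 26, 32, 38, 44, 56, 110)])
# -> [6, 9, 12, 15, 18, 21, 27, 54]
```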
For the training analysis in Figure 2, we trained a second Resnet-14 with default width (16), and analyzed the representational distance and geometry once every 10 batches during the first epoch, since the vast majority of the changes to the network's performance and geometry occur in the first epoch.



Footnotes: It would be more precise to say it is a "metric" on the quotient space of the equivalence class X ∼ Y, or a "pseudometric" on X, but we suppress this distinction throughout to avoid excess verbiage. For a more thorough treatment of equivalence classes, see (Williams et al., 2021).
We assume convolutional layers are flattened, in which case n is the product of height, width, and feature dimensions.
https://en.wikipedia.org/wiki/Slerp
https://github.com/facebookresearch/open_lth



Figure 1: Representation paths of a Resnet-14 model trained on CIFAR-10, evaluated using the Angular CKA metric with a linear kernel. A) Schematic of the Resnet-14 architecture with color-coded layers. Gray boxes correspond to residual blocks. B, C, D) Matrices of neural data are converted into points in the metric space of Angular CKA. The outputs of each block form an m × n matrix for m examples and n channels (B); data from layers (ii), (v), and (vii) in panel A are shown. Outputs are transformed into centered and normalized Gram matrices by the embedding function (C). The resulting points lie on a spherical manifold where the arc distance between points provides a measure of the distance between representations (D). E) Pairwise distance between layers computed using Angular CKA. F) 2D embedding of the network's path using multi-dimensional scaling (MDS) down to 15D followed by PCA. Includes points for the input pixels (black square), target class labels (purple star), and points calculated from the geodesic between inputs and labels (black dashed line). G) Three types of distances plotted in (H): distance of each layer from the inputs (orange ▲), distance to the targets (green ▼), and projected distance along the geodesic from inputs to targets (blue). I) Two types of angles plotted in (J): the "internal angle" between adjacent path segments (red), and the "target angle" between each segment and the geodesic from that segment to the targets (green). Note that we treat each residual block as a single step or segment of the path.


Figure C.1 gives results like those in Figure 1 using the metrics proposed by Williams et al. (2021) (Angular Shape and Euclidean Shape) and Shahbazi et al. (2021) (Affine-Invariant Riemannian).

3 EXPERIMENTS

3.1 HOW PATHS CHANGE OVER TRAINING

Whereas Figure 1 visualizes properties of the path taken by a well-trained Resnet-14 model, here we ask how properties of its path emerge over training. For this, we trained a new model of the same architecture, and logged model parameters every 10 batches. Since the vast majority of changes occur in the first epoch, Figure 2 shows how distances, projected distances, and angles change every 10 steps in the first epoch, plus a single additional point after 20 epochs to show where values converge after longer training.

Figure 2: Evolution of paths over training for a Resnet-14 model trained on CIFAR-10. We plot lines every 10 batches for the first epoch (blue to yellow), and an additional line at epoch 20 (magenta). The first row plots the same distances as in Figure 1H. A) Distance to targets (one-hot labels) from hidden layers as a function of layer depth. B) Distance from inputs (pixels) to hidden layers. C) Projected progress, or the distance from inputs to the projection of hidden layers onto the geodesic connecting inputs to targets. D) Internal angles, or angles between adjacent path segments, versus increasing layer depth. E) Target angles, or angles between each segment and the geodesic from its starting point to the targets.

Figure 3: Comparison between paths of different model architectures. We compare residual models by width (brightness) and by depth (hue), as well as VGG networks (yellows). Plots follow the layout of Figure 2: A) distance to targets, B) distance from inputs, C) projected distance to targets, D) angle between path segments, and E) angle between segment and target.

Figure C.1: Comparison of paths taken by the same Resnet-14 model as in Figure 1, but computed with the Angular Shape Metric and Euclidean Shape Metric of Williams et al. (2021), and the Affine-Invariant Riemannian Metric of Shahbazi et al. (2021). A) Pairwise distance between layers according to the Angular Shape Metric, computed with p = 100 and α = 0.0, equivalent to Figure 1E. B) 2D embedding of the network's path using MDS, as in Figure 1F. C) Internal angles and target angles, as in Figure 1J. D-F) Same as A-C, but for the Euclidean Shape Metric with p = 100 and α = 0.0. G-I) Same as A-C, but for the Affine-Invariant Riemannian Metric using a squared exponential kernel and ϵ = 0.1. Notably, all metrics agree with the finding that "internal angles" are close to orthogonal throughout the network, and that "target angles" are close to orthogonal for early layers, with the later layers being the only ones that appreciably point in the direction of the targets. Both the Angular Shape Metric and the Affine-Invariant Riemannian metric measure large distances in the last few layers, as the last convolutional layer is projected to logits and then passed through a softmax function. For both of these metrics, the first principal component describing the network path primarily describes just these last few layers (Figure C.1A, B, G, H). This suggests that, while these metrics give sensible results for distances between hidden layers, they may be ill-suited for measuring distances between internal representations and targets.

Alex H Williams, Erin Kunz, Simon Kornblith, and Scott Linderman. Generalized shape metrics on neural representations. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 4738-4750. Curran Associates, Inc., 2021.

Table A.1: Summary of representational distance metrics. Gram m is the set of m × m centered Gram matrices. Sym + m is the set of m × m symmetric positive definite matrices. Dist m is the set of m × m pairwise-distance matrices. p is a hyperparameter used to set the dimensionality of points in M. ||X|| F denotes the Frobenius norm, and ⟨X, Y⟩ F denotes the Frobenius inner-product. R is a p×p orthonormal matrix, found by solving the orthogonal Procrustes problem to maximally align X and Ỹ. Metrics marked with " †" are due to Williams et al. (2021). The Affine Invariant Riemannian metric is due to Pennec (2006; 2019); Shahbazi et al. (2021). For further details on each metric and its implied geometry, see Appendix A.


(compare with (A.4)). Let P = f(X) and Q = f(Y) be SPD matrices (we use P and Q instead of X̃ and Ỹ for consistency with the notation of Pennec (2019)). The AIR metric distance is defined as

    d_M(P, Q) = sqrt( Σ_i (log d_i)² ) ,

where d_i is the ith eigenvalue of P^{-1/2} Q P^{-1/2} (Pennec, 2006; 2019). Since P is SPD, its eigendecomposition can be written P = VΣV^⊤, where Σ is a diagonal matrix and V is orthonormal. Following Pennec (2019), we use element-wise square root, exp, and log operations on the eigenvalues to define the matrix power, matrix exponential, and matrix logarithm:

    P^k = V pow(Σ, k) V^⊤ ,    exp(P) = V exp(Σ) V^⊤ ,    log(P) = V log(Σ) V^⊤ ,

where the operations on the left-hand sides are matrix power, exponential, and log, and the pow, exp, and log operations on the right-hand sides are performed element-wise on the diagonal of Σ. The geodesic from P to Q is given by

    geodesic(P, Q, t) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}

(combining equations (3.12) and (3.13) in Pennec (2019)). Tangent vectors in this space are symmetric matrices, and the logarithmic map is given by

    log_P(Q) = P^{1/2} log(P^{-1/2} Q P^{-1/2}) P^{1/2} .

As before, we compute angles between triplets of representations X, Y, Z by computing the inner product of the log_Y(X) and log_Y(Z) tangent vectors, but unlike the previous metrics, the inner product for the AIR metric is not simply the Frobenius inner product. For the AIR metric, the inner product of tangent vectors W and V at P is defined as

    ⟨W, V⟩_P = Tr(P^{-1} W P^{-1} V) .

We will treat the Gram matrix (A.10) and the covariance matrix (A.11) cases separately. In the Gram matrix case,

• The AIR metric is shift-invariant, scale-invariant, and/or rotation-invariant if and only if the kernel used to compute G_X has the corresponding invariance. Because we use a squared-exponential kernel with a length scale that adapts to the data scale, we have all three invariances.
• The AIR metric is, despite its name, not affine-invariant in the sense we are interested in, since affine transformations of X will in general affect G_X through the nonlinear kernel (e.g. the squared exponential kernel with an isotropic length scale is sensitive to non-isotropic scaling of X).

In the covariance matrix case,

• The AIR metric is shift-invariant because the covariance subtracts the mean.
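These formulas are straightforward to implement via the eigendecomposition. The sketch below computes the AIR distance and geodesic for small SPD matrices (an illustration in our own notation, not our package's implementation). Its key properties, that the geodesic midpoint is equidistant from both endpoints and that the distance is unchanged by congruence transforms P ↦ APA^⊤, are easy to verify numerically.

```python
import numpy as np

def spd_power(P, k):
    """Matrix power via eigendecomposition (P symmetric positive definite)."""
    vals, V = np.linalg.eigh(P)
    return (V * vals**k) @ V.T

def air_distance(P, Q):
    """d(P, Q) = sqrt(sum_i log^2 d_i), with d_i the eigenvalues of
    P^{-1/2} Q P^{-1/2}."""
    Pih = spd_power(P, -0.5)
    M = Pih @ Q @ Pih
    return np.sqrt(np.sum(np.log(np.linalg.eigvalsh(M)) ** 2))

def air_geodesic(P, Q, t):
    """geodesic(P, Q, t) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}."""
    Ph, Pih = spd_power(P, 0.5), spd_power(P, -0.5)
    return Ph @ spd_power(Pih @ Q @ Pih, t) @ Ph
```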

