REPRESENTATIONAL DISSIMILARITY METRIC SPACES FOR STOCHASTIC NEURAL NETWORKS

Abstract

Quantifying similarity between neural representations (e.g., hidden layer activation vectors) is a perennial problem in deep learning and neuroscience research. Existing methods compare deterministic responses (e.g., artificial networks that lack stochastic layers) or averaged responses (e.g., trial-averaged firing rates in biological data). However, these measures of deterministic representational similarity ignore the scale and geometric structure of noise, both of which play important roles in neural computation. To rectify this, we generalize previously proposed shape metrics (Williams et al., 2021) to quantify differences in stochastic representations. These new distances satisfy the triangle inequality, and can therefore serve as a rigorous basis for many supervised and unsupervised analyses. Leveraging this framework, we find that the stochastic geometries of neurobiological representations of oriented visual gratings and naturalistic scenes respectively resemble untrained and trained deep network representations. Further, we are able to more accurately predict certain attributes of a network (e.g., training hyperparameters) from its position in stochastic (versus deterministic) shape space.

1. INTRODUCTION

Comparing high-dimensional neural responses (neurobiological firing rates or hidden layer activations in artificial networks) is a fundamental problem in neuroscience and machine learning (Dwivedi & Roig, 2019; Chung & Abbott, 2021). There are now many methods for quantifying representational dissimilarity, including canonical correlation analysis (CCA; Raghu et al., 2017), centered kernel alignment (CKA; Kornblith et al., 2019), representational similarity analysis (RSA; Kriegeskorte et al., 2008a), shape metrics (Williams et al., 2021), and Riemannian distance (Shahbazi et al., 2021). Intuitively, these measures quantify similarity in the geometry of neural responses while removing expected forms of invariance, such as permutations over arbitrary neuron labels. However, these methods only compare deterministic representations, i.e., networks that can be represented as a function f : Z → R^n, where n denotes the number of neurons and Z denotes the space of network inputs. For example, each z ∈ Z could correspond to an image, and f(z) is the response evoked by this image across a population of n neurons (Fig. 1A). Biological networks are essentially never deterministic in this fashion. In fact, the variance of a stimulus-evoked neural response is often larger than its mean (Goris et al., 2014). Stochastic responses also arise in many deep learning contexts, such as deep generative modeling (Kingma & Welling, 2019), Bayesian neural networks (Wilson, 2020), and regularization (Srivastava et al., 2014). Stochastic networks may be conceptualized as functions mapping each z ∈ Z to a probability distribution, F(· | z), over R^n (Fig. 1B; Kriegeskorte & Wei, 2021). Although it is easier to study the representational geometry of the average response, it is well understood that this provides an incomplete and potentially misleading picture (Kriegeskorte & Douglas, 2019).
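The distinction between the two kinds of networks can be made concrete with a minimal NumPy sketch. The linear "network", Gaussian noise model, and all parameter values below are illustrative assumptions, not constructs from the paper; real stochastic layers (dropout, VAE encoders, biological neurons) induce other distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5                         # number of neurons
W = rng.normal(size=(n, 3))   # toy linear "network" weights (assumed for illustration)

def f(z):
    """Deterministic network: each input z maps to a single point in R^n."""
    return W @ z

def F_sample(z, trials=1000, noise_scale=0.5):
    """Stochastic network: each input z maps to a distribution F(. | z) over R^n.
    Here we simply add isotropic Gaussian noise around f(z) and draw repeated
    'trials' from that distribution."""
    return f(z) + noise_scale * rng.normal(size=(trials, n))

z = np.ones(3)
samples = F_sample(z)                  # (1000, 5): repeated stochastic responses
print(samples.mean(axis=0))            # trial average, close to f(z)
print(np.cov(samples.T).shape)         # (5, 5) noise covariance across neurons
```

Averaging over trials recovers something like the deterministic picture, but the (5, 5) noise covariance it discards is exactly the structure the stochastic metrics are designed to compare.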
For instance, the ability to discriminate between two inputs z, z′ ∈ Z depends on the overlap of F(· | z) and F(· | z′), and not simply the separation of their means (Fig. 1C-D). A rich literature in neuroscience has built on this insight (Shadlen et al., 1996; Abbott & Dayan, 1999; Rumyantsev et al., 2020). However, to our knowledge, no studies have compared noise correlation structure across animal subjects or species, as has been done with trial-averaged responses. In machine learning, many studies have characterized the effects of noise on model predictions (Sietsma & Dow, 1991; An, 1996), but only a handful have begun to characterize the geometry of stochastic hidden layers (Dapello et al., 2021). To address these gaps, we formulate a novel class of metric spaces over stochastic neural representations. That is, given two stochastic networks F_i and F_j, we construct distance functions d(F_i, F_j) that are symmetric, satisfy the triangle inequality, and are equal to zero if and only if F_i and F_j are equivalent according to a pre-defined criterion. In the deterministic limit, i.e., as F_i and F_j collapse onto Dirac delta functions, our approach converges to well-studied metrics over shape spaces (Dryden & Mardia, 1993; Srivastava & Klassen, 2016), which were proposed by Williams et al. (2021) to measure distances between deterministic networks. The triangle inequality is required to derive theoretical guarantees for many downstream analyses (e.g., nonparametric regression, Cover & Hart, 1967, and clustering, Dasgupta & Long, 2005). Thus, we lay an important foundation for analyzing stochastic representations, akin to results shown by Williams et al. (2021) in the deterministic case.
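The point that discriminability depends on noise geometry rather than mean separation alone can be illustrated numerically. The sketch below, a standard linear-discriminability (Fisher-style) calculation rather than anything specific to this paper, compares two Gaussian noise models with identical mean separation and identical per-neuron variance but opposite correlation structure; the covariance values are assumed for illustration.

```python
import numpy as np

# Two stimuli whose mean responses are separated along the (1, 1) axis.
mu_a = np.array([0.0, 0.0])
mu_b = np.array([1.0, 1.0])
dmu = mu_b - mu_a

# Same per-neuron variance, opposite noise correlation (cf. Fig. 1C-D).
cov_aligned = np.array([[1.0,  0.9], [ 0.9, 1.0]])  # noise elongated along dmu
cov_orthog  = np.array([[1.0, -0.9], [-0.9, 1.0]])  # noise orthogonal to dmu

def dprime_sq(dmu, cov):
    """Squared linear discriminability: dmu^T Sigma^{-1} dmu."""
    return float(dmu @ np.linalg.solve(cov, dmu))

print(dprime_sq(dmu, cov_aligned))  # ~1.05: noise overlaps the signal axis
print(dprime_sq(dmu, cov_orthog))   # ~20.0: same means, far more separable
```

Any comparison based only on the trial-averaged responses would judge these two populations identical, even though their discriminability differs by roughly twentyfold.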

2.1. DETERMINISTIC SHAPE METRICS

We begin by reviewing how shape metrics quantify representational dissimilarity in the deterministic case. In the Discussion (sec. 4), we review other related prior work. Let {f_1, ..., f_K} denote K deterministic neural networks, each given by a function f_k : Z → R^{n_k}. Representational similarity between networks is typically defined with respect to a set of M inputs, {z_1, ..., z_M} ∈ Z^M. We can collect the representations of each network into a matrix whose rows are the evoked responses:

X_k = [ f_k(z_1)^⊤ ; ... ; f_k(z_M)^⊤ ] ∈ R^{M × n_k}.

A naïve dissimilarity measure would be the Euclidean distance, ∥X_i - X_j∥_F. This is nearly always useless. Since neurons are typically labelled in arbitrary order, our notion of distance should, at the very least, be invariant to permutations. Intuitively, we desire a notion of distance such that d(X_i, X_j) = d(X_i, X_j Π) for any permutation matrix, Π ∈ R^{n×n}. Linear CKA and RSA achieve this by computing the dissimilarity between X_i X_i^⊤ and X_j X_j^⊤ instead of the raw representations.

Generalized shape metrics are an alternative approach to quantifying representational dissimilarity. The idea is to compute the distance after minimizing over nuisance transformations (e.g., permutations or rotations in R^n). Let ϕ_k : R^{M × n_k} → R^{M × n} be a fixed "preprocessing function" for each network and let G be a set of nuisance transformations on R^n. Williams et al. (2021) showed that

d(X_i, X_j) = min_{T ∈ G} ∥ϕ_i(X_i) - ϕ_j(X_j) T∥_F     (2)
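A minimal sketch of equation (2) can be written for the special case where G is the group of rotations and reflections O(n), for which the inner minimization is the classical orthogonal Procrustes problem with a closed-form SVD solution. Taking ϕ to be mean-centering is an assumption made here for simplicity; Williams et al. (2021) consider other choices (e.g., partial whitening), which yield different metrics.

```python
import numpy as np

rng = np.random.default_rng(0)

def shape_distance(Xi, Xj):
    """min_{T in O(n)} ||phi(Xi) - phi(Xj) T||_F, with phi = mean-centering.
    The optimal orthogonal T solves the orthogonal Procrustes problem:
    for Xj^T Xi = U S V^T, the minimizer is T = U V^T."""
    Xi = Xi - Xi.mean(axis=0)
    Xj = Xj - Xj.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xj.T @ Xi)
    T = U @ Vt                       # optimal orthogonal alignment
    return np.linalg.norm(Xi - Xj @ T)

# Two "networks" whose representations differ only by a random rotation.
M, n = 100, 10
X = rng.normal(size=(M, n))
Q = np.linalg.qr(rng.normal(size=(n, n)))[0]       # random orthogonal matrix
print(shape_distance(X, X @ Q))                    # ~0: equivalent geometry
print(shape_distance(X, rng.normal(size=(M, n))))  # > 0: genuinely different
```

Because permutation matrices are a subset of O(n), this distance is in particular invariant to relabeling neurons, and (as a Procrustes distance) it satisfies the triangle inequality that the paper relies on for downstream analyses.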



Figure 1: (A) Illustration of a deterministic network mapping inputs (color-coded images) into points in R^n. (B) Illustration of a stochastic network, where each input, z ∈ Z, maps onto a distribution, F(· | z), over R^n. (C) Example where noise correlations impair discriminability between two image classes. (D) Example where noise correlations improve discriminability (see Abbott & Dayan, 1999). (E) Illustration of two stochastic networks with equivalent representational geometry.

