BRAIN-LIKE REPRESENTATIONAL STRAIGHTENING OF NATURAL MOVIES IN ROBUST FEEDFORWARD NEURAL NETWORKS

Abstract

Representational straightening refers to a decrease in curvature of visual feature representations of a sequence of frames taken from natural movies. Prior work established straightening in neural representations of the primate primary visual cortex (V1) and perceptual straightening in human behavior as a hallmark of biological vision, in contrast to artificial feedforward neural networks, which did not demonstrate this phenomenon as they were not explicitly optimized to produce temporally predictable movie representations. Here, we show that robustness to noise in the input image can produce representational straightening in feedforward neural networks. Both adversarial training (AT) and base classifiers for Random Smoothing (RS) induced remarkably straightened feature codes. Demonstrating their utility within the domain of natural movies, these codes could be inverted to generate intervening movie frames by linear interpolation in the feature space even though they were not trained on these trajectories. Demonstrating their biological utility, we found that AT and RS training improved predictions of neural data in primate V1 over baseline models, providing a parsimonious, bio-plausible mechanism (noise in the sensory input stages) for generating representations in early visual cortex. Finally, we compared the geometric properties of frame representations in these networks to better understand how they produced representations that mimicked the straightening phenomenon from biology. Overall, this work, which elucidates emergent properties of robust neural networks, demonstrates that it is not necessary to utilize predictive objectives or train directly on natural movie statistics to achieve models that support straightened movie representations similar to human perception and that also predict V1 neural responses.

1. INTRODUCTION

In understanding the principles underlying biological vision, a longstanding debate in computational neuroscience is whether the brain is wired to predict the incoming sensory stimulus, most notably formalized in predictive coding (Rao & Ballard, 1999; Friston, 2009; Millidge et al., 2021), or whether neural circuitry is wired to recognize or discriminate among patterns formed on the sensory epithelium, popularly exemplified by discriminatively trained feedforward neural networks (DiCarlo et al., 2012; Tacchetti et al., 2018; Kubilius et al., 2018). Arguing for a role of prediction in vision, recent work found perceptual straightening of natural movie sequences in human visual perception (Hénaff et al., 2019). Such straightening is diagnostic of a system whose representation could be linearly read out to perform prediction over time, and the idea of representational straightening resonates with machine learning efforts to create new types of models that achieve equivariant, linear codes for natural movie sequences. Discriminatively trained networks, however, lack any prediction over time in their supervision. It may not be surprising then that large-scale ANNs trained for classification produce representations that have almost no improvement in straightening relative to the input pixel space, while human observers clearly demonstrated perceptual straightening of natural movie sequences (subsequently also found in neurons of primary visual cortex, V1 (Hénaff et al., 2019; 2021)). This deficiency in standard feedforward ANNs might suggest a need for new models trained on predictive loss functions rather than pure classification to emulate biological vision.

Here, we provide evidence for an alternative viewpoint: biologically plausible straightening can be achieved in ANNs trained for robust discrimination, without resorting to a prediction objective or natural movies in training. Drawing on insights from emergent properties of adversarially trained neural networks in producing linearly invertible latent representations, we highlight the link between perceptual straightening of natural movies and invertible latent representations learned from static images (Figure 1). We examine straightening in these robust feedforward ANNs and find that their properties relate to those in the biological vision framework. The contributions of this work are as follows:

1. We show that robust neural networks give rise to straightened feature representations for natural movies in their feature space, comparable to the straightening measured in the primate brain and human behavior, and completely absent from standard feedforward networks.
2. We show that linearly interpolating between the start and end frames of a movie in the output feature space of robust ANNs produces synthetic frames similar to those of the original natural movie sequence in image space. Such invertible linear interpolation is precisely the definition of a temporally predictive feature representation.
3. Compared to prior models of early visual cortex, robustness to input noise (corruption or adversarial robustness) is significantly better at explaining neural variance measured from V1 neurons than non-robustly trained baseline models, suggesting a new, hitherto unconsidered mechanism for learning the representations in early cortical areas that achieves natural movie straightening.

Figure 1: Perceptual straightening of movie frames can be viewed as invertibility of latent representations for static images. Left: straightening of representations refers to a decrease in the curvature of the trajectory in representation space, such as a neural population in the brain or human perceptual space, but standard ANNs do not show straightening (Hénaff et al., 2019; 2021). Right: invertibility of latent representation refers to interpolation between the representations of two images (e.g., an image of a dog and an image of a cat), where the invertible interpolations show the main features of a dog morph into the main features of a cat. Invertible representations emerge in robust ANNs (Engstrom et al., 2019b), obviating the need to directly train for temporal straightening.

2.1. MECHANISMS FOR PRODUCING BRAIN-LIKE REPRESENTATIONS

Feedforward ANNs as models of biological vision. Standard feedforward ANNs, although lacking a number of bio-plausible features such as feedback connections or a local learning rule (Whittington & Bogacz, 2019), can still explain the neural variance (Schrimpf et al., 2018) recorded from rodent (Bakhtiari et al., 2021), monkey (Yamins et al., 2014; Bashivan et al., 2019), and human visual cortex (Khaligh-Razavi & Kriegeskorte, 2014; Cichy et al., 2016) better than alternatives that are considered more bio-plausible because they use a prediction objective function (e.g., PredNet and CPC (Zhuang et al., 2021; Schrimpf et al., 2020)). Thus, to learn the representations in the brain, regardless of bio-plausibility of mechanisms, feedforward ANNs provide a parsimonious, more tractable class of leading models for object recognition in the visual cortex.

Models of primary visual cortex. In neuroscience, rather than rely solely on a top-down training objective as standard ANNs do, there has been a tradition of explaining early visual representations using more fundamental principles such as sparse coding and predictive coding, as well as invoking unsupervised training (Olshausen & Field, 1996; Rao & Ballard, 1999). For example, unsupervised slow feature analysis extracts the slowly varying features from fast-varying signals in movies based on the intuition that most external salient events (such as objects) are persistent in time, and this idea can be used to explain the emergence of complex cells in V1 (Berkes & Wiskott, 2005). Recent work in machine learning has attempted to blend more bottom-up principles with top-down training by swapping out ANN early layers with V1-like models whose filters are inspired by neuroscience studies (Dapello et al., 2020). This blended model turns out to have benefits for classification robustness in the outputs. However, it remains unclear whether there is a form of top-down training that can produce V1-like models. Such a mechanism would provide a fundamentally different alternative to prior proposals of creating a V1 through sparse coding or future prediction (Hénaff et al., 2019; 2021).

2.2. TEMPORAL PREDICTION AND INVERTIBILITY IN NEURAL NETWORKS

Learning to predict over time. Changes in architecture, training diet (movies), and objective (predicting future frames) have all been explored as mechanisms to produce more explicit equivariant representations of natural movies (Lotter et al., 2016; van den Oord et al., 2018). Directly related to the idea of straightening, penalizing the curvature of representations of frames was used in Learning to linearize (Goroshin et al., 2015) to learn straightened representations from unlabeled videos. This class of models does not need supervision, which makes them more bio-plausible in nature; however, as mentioned in the previous section, they lag behind supervised feedforward ANNs both in learning effective representations for object recognition and in producing feature representations that predict neural data.

Learning invertible latents. In deep learning applications, invertibility is mostly discussed in generative neural networks as a constraint to learn a prior for applications in signals and systems such as image de-noising, signal compression, and image reconstruction from few and noisy measurements, or to be able to reconstruct or modify real images. Usually invertibility is implemented by carefully designing dedicated architectures (Jacobsen et al., 2018b; Chen et al., 2019). However, it has recently been shown that it can be implemented in standard feedforward ANNs when they undergo training for adversarial robustness (Engstrom et al., 2019b; c). These works showed empirically that adversarially robust training encourages invertibility, as linear interpolation between classes (e.g., cat to dog) results in semantically smooth image-to-image translation (Engstrom et al., 2019b) as opposed to the blurry image sequences produced by standard ANNs.

We reasoned that robust networks which encourage invertibility may also lead to straightening, as this is a property that would be related to improved invertibility of a network, so we sought to extend prior work and study the behavior of robustly trained networks specifically in the domain of natural movies. We report on how these networks straighten natural movies in their feature spaces and can invertibly reproduce movie frames in a natural sequence.

3. METHODS

3.1. BASELINE MODELS

We consider the class of feedforward convolutional neural networks, typically restricting to the ResNet-50 (He et al., 2015) architecture trained on ImageNet for the main analyses. Baseline networks (not trained for robustness) include supervised ResNet-50, ResNet-101, and ResNet-152 networks, and a self-supervised network (Barlow Twins (Zbontar et al., 2021)). We trained ResNet-50 for ImageNet classification without augmentations and with extensive augmentations (Chen et al., 2020), labeled as Sup-NoAugm and SupMocoAugm, respectively. We also consider VOneResNet (biological V1 front-end (Dapello et al., 2020)) and ResNet-50 trained as a base network for action recognition (Chen et al., 2021), but include these as separate examples in the Appendix since they use a modified architecture.

Table 1: Clean accuracy and robust accuracy (attack: $L_2$, $\epsilon = 0.1$) for the models used. Except for the custom models, all models were obtained from the repositories of the references. Note that RS here refers to the base classifier in random smoothing without probabilistic inference.

3.2. MODELS TRAINED FOR ROBUSTNESS

We consider two forms of models trained to minimize a classification loss $L_{ce}$ in the face of input perturbations $\delta \in \mathbb{R}^{h \times w \times c}$, subject to constraints on the overall magnitude of perturbations in the input space, where $x$, $y$, $\theta$ are the network input, output, and classifier parameters, respectively:

$$L_{ce}(\theta, x + \delta, y)$$

In adversarially trained networks, projected gradient descent from the output space finds maximal directions of perturbation in the input space limited to length $\epsilon$, and training entails minimizing the effect of these perturbation directions on the network's output (Madry et al., 2018). In random smoothing (Lecuyer et al., 2018; Cohen et al., 2019), a supervised network is trained in the face of Gaussian noise added to the input space as the base classifier, before performing a probabilistic inference. In this work, we only use the representations as learned in base classifiers, without the probabilistic inference. The perturbations $\delta$ in the base classifiers thus can follow:

$$\delta_{rand} \sim \mathcal{N}(0, \sigma^2 I), \qquad \delta_{adv} := \arg\max_{\|\delta\|_p \le \epsilon} L_{ce}(\theta, x + \delta, y)$$

These defenses to input noise have different motivations. Adversarial robustness provides defense against white-box attacks, whereas random smoothing protects against general image corruptions. However, prior work has suggested a connection between corruption robustness and adversarial robustness (Ford et al., 2019). Theoretically, random smoothing leads to certified robustness (Cohen et al., 2019) and trains a condition of invertible networks (Jacobsen et al., 2018a), while adversarial robustness has been shown empirically to lead to invertible latent representations in networks (Engstrom et al., 2019b).
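The two perturbation types defined above can be illustrated with a minimal NumPy sketch. This is not the training code used in this work: the linear softmax classifier `W`, the helper names, and the step-size settings are all hypothetical stand-ins chosen so that the input gradient has a closed form. Note that `sigma` below is a standard deviation, whereas the text quotes noise levels as $\sigma^2$.

```python
import numpy as np

def gaussian_perturbation(x, sigma, rng):
    """Sample delta_rand ~ N(0, sigma^2 I) with the same shape as x."""
    return rng.normal(0.0, sigma, size=x.shape)

def project_l2(delta, eps):
    """Project delta onto the L2 ball of radius eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def softmax_ce_and_grad(W, x, y):
    """Cross-entropy of a toy linear classifier (logits = W @ x)
    and its gradient with respect to the input x."""
    logits = W @ x
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(p[y])
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)             # chain rule through the logits
    return loss, grad_x

def pgd_l2(W, x, y, eps, step, n_steps):
    """Approximate delta_adv = argmax_{||delta||_2 <= eps} L_ce
    by projected gradient ascent with normalized steps."""
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        _, g = softmax_ce_and_grad(W, x + delta, y)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        delta = project_l2(delta + step * g / g_norm, eps)
    return delta
```

In a training loop, the loss would then be evaluated at `x + delta` for either perturbation type; for a deep network the input gradient would come from automatic differentiation rather than the closed form used here.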

3.3. REPRESENTATIONAL METRICS

Representational straightening estimates the local curvature $c$ in a given representation $r$ of a sequence of images (natural or artificial) of length $N$, $C_{seq}: \{x_{t_1}, x_{t_2}, ..., x_{t_N}\}$, as the angle between vectors connecting nearby frames, and these local estimates are averaged over the entire movie sequence to obtain the overall straightening in that representational trajectory (same as (Hénaff et al., 2019)):

$$c_t = \arccos\left(\frac{r_t - r_{t-1}}{\|r_t - r_{t-1}\|} \cdot \frac{r_{t+1} - r_t}{\|r_{t+1} - r_t\|}\right), \qquad C_{seq} = \frac{1}{N} \sum_{t=1}^{N-1} c_t$$

Lower curvature (angle between neighboring vectors) indicates a straighter trajectory, and in the results we generally reference curvature values to the curvature in the input space (i.e., straightening relative to pixel space). This metric has been utilized in neuroscience, showing that humans tend to represent nearby movie frames in a straightened manner relative to pixels (Hénaff et al., 2019). This curvature metric is also closely related to objectives used in efforts to train models with equivariance by linearizing natural transformations in the world, as an alternative to standard networks trained for invariant object classification (Goroshin et al., 2015; Sabour et al., 2017).

Expansion. We define the radius of a sequence of images from a movie clip as the radial size of the minimum covering hyper-sphere circumscribing all points representing the frames in $r$ (Gärtner, 1999). We use this measure to supplement the geometrical characterization of a movie sequence in pixel space and in a model's representational spaces. Like representational straightening values, expansion values for models in the main text are referenced to the radius measured in pixel space or to the radius measured for the same layer in a baseline network, by simply dividing by those references. We used mini-ball, a publicly available Python package based on (Gärtner, 1999), to measure the radius of the covering hyper-sphere.
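As an illustration of the curvature metric, the following NumPy sketch computes the angle between successive difference vectors of a trajectory and averages it. The function names are hypothetical, and two choices are assumptions rather than details taken from this paper: the average is taken over the angles that are actually defined for an N-frame sequence (which may differ from the normalization in the formula as printed), and "curvature change re: pixels" is taken to be a simple difference in degrees. Consecutive frames are assumed to be distinct so that the difference vectors have non-zero length.

```python
import numpy as np

def local_curvatures(reps):
    """Angles (radians) between successive difference vectors.

    reps: array-like of shape (N, ...), one entry per frame representation;
    each frame is flattened to a vector.
    """
    reps = np.asarray(reps, dtype=float)
    reps = reps.reshape(reps.shape[0], -1)
    diffs = np.diff(reps, axis=0)                       # r_{t+1} - r_t
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    cosines = np.sum(units[:-1] * units[1:], axis=1)    # consecutive pairs
    return np.arccos(np.clip(cosines, -1.0, 1.0))       # clip guards rounding

def sequence_curvature_deg(reps):
    """Average local curvature of a trajectory, in degrees."""
    return float(np.degrees(local_curvatures(reps).mean()))

def curvature_change_deg(feature_reps, pixel_reps):
    """Curvature in feature space minus curvature in pixel space.
    Negative values correspond to straightening."""
    return sequence_curvature_deg(feature_reps) - sequence_curvature_deg(pixel_reps)
```

For example, `sequence_curvature_deg` is zero for frames lying on a straight line, which is why a linear interpolation between two frames (an artificial sequence) has zero curvature in the space where the interpolation is performed.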

4.1. ROBUST ANNS EXHIBIT REPRESENTATIONAL STRAIGHTENING

With insights from connections to invertibility (see Figure 1), we hypothesized that representational straightening of movie trajectories could be present in robustly trained neural networks. We took the same publicly available movie stimuli (Hénaff et al., 2019) (A.4.1, Figure 12) and the same metrics, and we tested the same architecture, ResNet50 (He et al., 2015), trained under different loss functions (Table 1) to perform controlled head-to-head comparisons. Figure 2 shows representational straightening of natural movies measured in layers of ResNet50 trained under AT (Engstrom et al., 2019a) and RS (Cohen et al., 2019) at different adversarial attack or noise levels, respectively. Robust neural networks, in contrast to other ANNs, decreased the curvature of natural movies. Straightening for artificial sequences as measured in (Hénaff et al., 2019) (A.1, Figure 7) and for other models (A.2, Figures 8 and 9) is provided in the Appendix. Importantly, although most models, whether a standard ResNet-50 or one with a V1-like front-end, may display an initial dip in curvature for natural movies in the very earliest layers, this is not sustained in feature representations of later layers except for robustly trained networks (A.2, Figure 9 vs. A.1, Figure 7) and those trained on action recognition from temporally instructed training, which we include here as a proxy for movie-like training, though its feedforward architecture deviates from a ResNet50 by additional temporal processing components (A.2, Figure 8).

Perceptual Straightening measured as invertibility of latent representations. Next, we sought to empirically test how well robust networks can invert natural movies, given that they contain linearized feature representations of movie frames in their high-level feature spaces and given the general conceptual benefit of linearity for invertibility (Figure 1). We measured invertibility of each model on the same movie sequences used for measuring straightening as follows. We linearly interpolated between the latent representations of the first and last frame of each movie and used the same procedure as that used previously in (Engstrom et al., 2019b; a) to obtain the pixel-space correspondence of those interpolated representations. Whereas those generated pseudo-frames can be assessed by their pixel-by-pixel distance to the actual movie frame, we chose a metric, the Structural Similarity Index Measure (SSIM (Wang et al., 2004)), that utilizes intermediate-level statistics motivated by biological vision and is putatively more related to some aspects of human perception than simple pixel-space correspondence. Figure 3 shows an example of such inverted frames for standard ResNet50, RS ($L_2$: $\sigma^2 = 0.5$) and AT ($L_2$: $\sigma^2 = 3$), and a summary of average measured invertibility using the SSIM metric on pseudo-frames from each model. As expected, in line with the findings of previous work (Engstrom et al., 2019b), AT models scored relatively higher on invertibility of frames than a baseline discriminative model. However, what had not been previously shown is that RS models, using merely the benefits of their robustness to noisy augmentation (base classifier on top of learned representation; no probabilistic inference), also exhibit higher invertibility scores compared to standard trained models. Invertibility scores were consistently improved in RS and AT models across a variety of movies tested, including those with relatively stationary textures and not just dynamic objects (see A.4.4, Figure 13 for further examples and A.4.3, Table 3 for scores across all 11 movies). Thus, RS models along with AT models exhibit invertibility of representations for movie frames, which further demonstrates their ability to support perceptual straightening of natural movies in their highest layers that may be functionally similar to the perceptual straightening previously measured from human subjects (Hénaff et al., 2019).
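The pseudo-frame procedure described in this section can be sketched in two steps: linearly interpolate between the latents of the first and last frames, then search for an image whose latent matches each interpolated target. The NumPy sketch below is only an illustration under stated assumptions. The names `interpolate_latents` and `invert_latent` are hypothetical, the encoder is passed in as a pair of callables (a forward map and a function returning the Jacobian-transpose product), and plain gradient descent on a squared error stands in for the inversion procedure of Engstrom et al., whose exact settings are not reproduced here. Scoring the resulting pseudo-frames with SSIM is omitted.

```python
import numpy as np

def interpolate_latents(r_first, r_last, n_frames):
    """Return n_frames latents equally spaced on the straight line
    from r_first to r_last (both endpoints included)."""
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - alphas) * r_first[None, :] + alphas * r_last[None, :]

def invert_latent(encode, encode_vjp, r_target, x_init, lr=0.1, n_steps=500):
    """Gradient descent on the image x to minimize
    0.5 * ||encode(x) - r_target||^2.

    encode(x)               -> latent vector for image x
    encode_vjp(x, residual) -> J(x)^T @ residual, the gradient of the
                               objective with respect to x
    """
    x = np.array(x_init, dtype=float)        # work on a float copy
    for _ in range(n_steps):
        residual = encode(x) - r_target
        x -= lr * encode_vjp(x, residual)
    return x
```

With a deep network, `encode_vjp` would be supplied by automatic differentiation, and the optimization would typically start from a seed image such as the first frame; both are choices of this sketch rather than details stated in the text.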

4.2. RANDOM SMOOTHING AND ADVERSARIAL TRAINING IN EXPLAINING NEURAL REPRESENTATIONS IN THE PRIMATE VISUAL SYSTEM

Robustness to noise as a bio-plausible mechanism underlying straightening in primary visual cortex. As shown above, straightening, which is a constraint for brain-like representations in visual cortex, manifests in robust neural networks. Both RS and AT training for robustness under the $L_2$ norm generate straightened representations of movie sequences. However, to distinguish among models of object recognition, we can measure how well they explain variance in patterns of neural activity elicited in different visual cortical areas. Here, for all neural comparisons in our analyses, we measured the Brain-Score (Schrimpf et al., 2018) using the publicly available online resource to assess the similarity of each model to biological vision; Brain-Score is a battery of tests comparing models against previously collected data from the primate visual system (see Brain-Score.org). We found that RS and AT models provided a better model of V1 (in terms of explained variance) compared to non-robust models (Figure 4). On other benchmarks, as we go up the ventral stream hierarchy from V1 to IT, again keeping the layer assignment fixed across models for proper comparison, we observed a decrease in explainability of robust models (A.3, Figure 11), in part presumably because robust models have lower object classification performance, which is known to drive fits in higher brain areas like V4 and IT supporting object recognition (Yamins et al., 2014). Previous work (Dapello et al., 2020; Kong et al., 2022) linked adversarial robustness in models to their higher Brain-Score for V1, but we found that this may not be specifically driven by adversarial robustness per se; rather, ($L_2$) noise robustness is also sufficient (as in the base classifiers of RS tested here). More broadly, looking at neural fits across all models and their layers, we find that straightening in a particular model layer correlates with improved explanatory power of variance in cortical area V1 (Figure 4, middle panel, each dot is a layer from a model), being even more strongly predictive than robustness of the overall model (A.3, Figure 10). The level of straightening reached by the best fitting layers of RS and AT models was comparable to the 10 degree straightening estimated in macaque V1 neural populations (black dashed reference line in Figure 4). This complements the fact that robust models peak near the 30 degree straightening measured in perception (Figure 2), suggesting that robust models can achieve a brain-like level of straightening relative to both V1 and perception.

Figure 5: Can straightening for a movie sequence be explained by the size of the hyper-sphere bounding the frames (i.e., radius in pixel space)? While RS exhibits a small but positive correlation, the rest of the models, including AT, show negative or no correlations. Positive correlation means the smaller the size of the bounding hyper-sphere in pixel space, the more straightened the representation over the layers of the model.

Does the geometry of movie frame representations in pixel space dictate straightening in downstream representations? The connection between two properties of the same representation manifold, robustness to independently sampled noise and straightened trajectories of smooth input temporal sequences, is not immediately clear. Because robustness is achieved by adding noise bounded by a norm ($L_2$ or $L_\infty$) in pixel space, a natural question is whether the radius of the bounding hyper-sphere of the frames of the tested movies in pixel space (see Expansion in Methods) was correlated with the measured straightening in feature space in each layer of the robustly trained models (Figure 5; also see A.5, Figure 14). We found, however, that there seemed to be different mechanisms at play for RS versus AT in terms of achieving straightening. RS models showed (small but) positive correlations, which means the smaller the ball containing all the frames of the movie in input space, the larger the straightening effect for the representations of frames of that movie in the model. In AT models, by contrast, we see the opposite (negative) or no correlation. These divergent patterns underscore differences between these models and suggest that geometric size in pixel space does not strongly constrain the degree to which a movie can be straightened.

Geometry of movie frame representations in feature space is relevant for capturing neural representations in V1. Among the RS models tested at different input noise levels, RS $L_2$: $\sigma^2 = 0.5$ stands out, as it gives a better model of V1 than those using smaller or larger magnitude input noise (Figure 4). For this model, we found that in addition to its intermediate level of straightening, the expansion score of movie frames, which is the radial size in its representation normalized to the size in the same layer of a baseline ResNet50, was highest compared to the other RS models (Figure 6, middle panel; measures are referenced to layers in a standard ResNet50 to highlight the relative effect of robustness training rather than effects driven by hierarchical layer). This demonstrates a potential trade-off between improving straightening in a representation and avoiding too much added contraction of movies by robust training relative to standard training. This balance seems to be best achieved for $\sigma^2 = 0.5$, where we also see significantly higher predictivity of V1 cortical data (Figure 6, right panel). The best AT model also shows little contraction of movies coupled with high straightening (A.5, Figure 15).
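The geometric quantities used in this analysis can be sketched as follows. This is not the mini-ball computation used in this work: the radius below comes from a simple iterative approximation (repeatedly stepping the center toward the current farthest point) that is swapped in to keep the example dependency-free, and it returns an upper bound that approaches the minimum enclosing radius as the number of iterations grows. The function names are hypothetical, and the correlation is assumed to be a Pearson correlation computed across movies between pixel-space radius and curvature change.

```python
import numpy as np

def approx_bounding_radius(points, n_iter=1000):
    """Approximate radius of the minimum enclosing hyper-sphere.

    points: array-like of shape (N, ...), one entry per frame; each frame
    is flattened. Iterative approximation, not the exact mini-ball solver.
    """
    pts = np.asarray(points, dtype=float)
    pts = pts.reshape(pts.shape[0], -1)
    center = pts[0].copy()
    for k in range(1, n_iter + 1):
        dists = np.linalg.norm(pts - center, axis=1)
        farthest = pts[np.argmax(dists)]
        center += (farthest - center) / (k + 1)   # shrinking step size
    return float(np.linalg.norm(pts - center, axis=1).max())

def expansion(radius, reference_radius):
    """Radius referenced to pixel space or to the same layer of a
    baseline network, by simple division."""
    return radius / reference_radius

def pearson_r(a, b):
    """Pearson correlation between two equal-length 1-D sequences,
    e.g. per-movie pixel radius and per-movie curvature change."""
    return float(np.corrcoef(a, b)[0, 1])
```

An expansion value below 1 relative to a baseline layer would indicate that robust training contracted the movie's representation in that layer, and a value above 1 would indicate expansion.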

5. DISCUSSION

We have demonstrated novel properties of robust neural networks in how they represent natural movies. Conceptually, this work establishes a seemingly surprising connection between disparate ideas: robust discriminative networks trained on static images on one hand, and work on learning to linearize by training on natural movies on the other. These modeling paths could both result in linearized, or straightened, natural movie representations (Figure 1). From a machine learning perspective, the invertibility and concomitant representational straightening of robust networks suggest that they learn explainable representations of natural movie statistics. Biologically, the emergence of straightening in these networks, as well as their ability to better explain V1 data than baselines relatively lacking in straightening (Figure 4), provides new insights into potential neural mechanisms for previously difficult to explain brain phenomena. Biological constraints could lend parsimony to selecting among models, each with a different engineering goal. On its face, RS, by virtue of utilizing Gaussian noise instead of engineered noise, gains traction over adversarial training as a simpler and more powerful way of achieving robustness in ANNs, which is in line with a long history of probabilistic inference in the visual cortex of humans (Pouget et al., 2013). Indeed, looking across the range of robust models tested, the best fitting model of V1 was not necessarily the most robust but tended toward more straightened representations that also showed the least contracted representations, consistent with a known dimensionality expansion from the sensory periphery to V1 in the brain (Field, 1994). Future work exploring a wider variety of robustness training in conjunction with more bio-plausible architectures, objectives, and training diets may yet elucidate the balance of factors contributing to biological vision.
At the same time, our work does not directly address how straightened representations in the visual system may or may not be utilized to influence downstream visual perception and behavior, and this connection is an important topic for future work. On the one hand, for supporting dynamical scene perception, behaviors that predict (extrapolate) or postdict (interpolate) scene properties over time (e.g., object position) may be supported by straightened natural movie representations. Indeed, both explanations, prediction and postdiction, have been invoked to account for psychophysical phenomena like the flash-lag illusion, which presents an interesting test case of how the brain processes complex stimuli over time (Eagleman & Sejnowski, 2000). However, even for relatively stationary scenes such as those containing textures, we observed benefits for straightening and invertibility in robustly trained networks (see A.4, Tables 2 and 3).

Figure 8: Action recognition models (Chen et al., 2021). These models were trained on video clips for action recognition. Although they were not trained for straightening or predicting the next frame, they exhibit small but measurable straightening for natural movies. However, the curvature for artificial sequences did not increase as much as it did in robust neural networks (Figure 7).

Figure 9: Lack of straightening for natural movies in ResNet50 trained with a biologically-inspired model of V1 in the front-end (Dapello et al., 2020). Vone-ResNet50 exhibits robustness to adversarial attack, but the fact that it does not exhibit straightening (except for the front-end) provides further evidence that adversarial robustness does not always accompany straightening. (Panel: Vone_Resnet-50; curves: natural movies and artificial movies; y-axis: curvature change, degrees re: pixels.)



Figure 3: Invertibility as measured by the SSIM (Wang et al., 2004) between the actual in-between frames (labeled as Natural Movie) and the pixel-space projected linear interpolations between the first and the last frame, labeled Pixels, standard ResNet50, RS (ResNet50, $L_2$: $\sigma^2 = 0.5$) and AT (ResNet50, $L_2$: $\sigma^2 = 3$). Interpolating the representations of the first and last frames in an invertible representation space generates a sequence of frames that is more similar to the ground-truth in-between frames, but for a non-invertible representation the generated frames are blurry and more similar to the interpolation in pixel space (a.k.a. artificial sequence). Interpolating the first and last frames in pixel space (second row) gives exactly what was called an artificial sequence in studies of straightening (Hénaff et al., 2019; 2021), as opposed to a natural sequence, which consists of the actual in-between frames.

Figure 4: Left: RS and AT are more predictive of V1 neural responses than other non-robust models of the same architecture (ResNet50). Right: each dot represents a layer in ResNet50 trained under a different loss function (color codes same as left). Higher representational straightening (negative curvature change) associates with higher V1 predictivity. Intriguingly, the highest V1 predictivity corresponds to layers that exhibit comparable straightening to that measured from V1 neurons (-10° on average) (Hénaff et al., 2021). Explained variance is noise-corrected and computed as in (Schrimpf et al., 2018).

Figure 6: Geometric characteristics, straightening and curvature, of RS models related to V1 explainability. ∆ means quantity is referenced to the same measure in a standard ResNet50.

Figure 7: ANNs show straightening of representations when robustness to noise constraints (noise augmentation or robustness to adversarial attack) is added to their training. Counterclockwise from top left, measurements for straightening of movie sequences (from (Hénaff et al., 2019); natural sequence: green, artificial sequence: black) in each layer of the ResNet50 architecture under different training regimes: supervised training (standard), supervised training with adversarial training ($L_2$, $\sigma^2 = 3$) (Engstrom et al., 2019a), and supervised training with noise augmentation ($L_2$, $\sigma^2 = 0.5$) (Cohen et al., 2019). Top right shows straightening for artificial (open circles) and natural (closed circles) sequences using the ResNet architecture with no training (random parameters), self-supervised training (Chen et al., 2020), or additional layers.

Figure 10: Clean accuracy (left) and robust accuracy (right) vs. V1 predictivity (same color convention as used in main text).

Figure 14: For each layer in each model, the expansions (re: pixels) and curvature change (re: pixels) were plotted. First row: all movies; second row: average over movies.

TABLE FOR AVERAGE STRAIGHTENING

Table 2: Average curvature change (re: pixels) for each movie. RN stands for ResNet. The architecture of all robust models used was ResNet50.

Table 3: Average SSIMs for each movie. RN stands for ResNet. The architecture of all robust models used was ResNet50.

Columns (movies): water, carn., walk., dogv., egomo., chiron., bees, leaves, smile, chirono., prair.

ACKNOWLEDGMENTS

This work was supported by a Klingenstein-Simons fellowship, a Sloan Foundation fellowship, and a Grossman-Kavli Scholar Award, as well as an NVIDIA GPU grant, and was performed using the Columbia Zuckerman Axon GPU cluster. We thank all three reviewers for their constructive feedback that led to an improved final version of the paper.

A.4.4 ADDITIONAL MOVIE INTERPOLATION EXAMPLES

Figure 13: Three more examples of interpolations for the movies chironomous, dogville, and egomotion, respectively. The gray dot in the middle of all frames is the fixation spot, on which subjects (humans or monkeys) are instructed to keep their gaze during the experiment.

A.6 REPRODUCIBILITY INFORMATION

Almost all data (models, movies, and metrics) used in this work are publicly available, and we provide references to them in the text (for instance, see Table 1). We will release the code to reproduce the main results in this work at https://github.com/toosi/BrainLike_Straightening, and we provide pointers to the publicly available resources used in this work as listed below.

Movies and images. We used the same movies used in the original studies on human perception and monkey primary visual cortex (Hénaff et al., 2019; 2021), which are available from the first author's GitHub as referenced in their papers. Images used to measure the clean accuracy and robust accuracy were taken from the ImageNet validation set.

Models. All the models used in this study were from the ResNet family, and checkpoints for the main robust models are publicly available as referenced in the main text (Table 1). The checkpoints for the only two custom trained models (supervised with no augmentations and supervised with Moco augmentation) will be made publicly available along with the code.

Neural predictivity metric. We used Brain-Score, which is a publicly available benchmark to evaluate how well a model predicts variance in neural data (Schrimpf et al., 2018).

