ALIGNING MODEL AND MACAQUE INFERIOR TEMPORAL CORTEX REPRESENTATIONS IMPROVES MODEL-TO-HUMAN BEHAVIORAL ALIGNMENT AND ADVERSARIAL ROBUSTNESS

Abstract

While some state-of-the-art artificial neural network systems in computer vision are strikingly accurate models of the corresponding primate visual processing, many discrepancies remain between these models and the behavior of primates on object recognition tasks. Many current models suffer from extreme sensitivity to adversarial attacks and often do not align well with the image-by-image behavioral error patterns observed in humans. Previous research has provided strong evidence that primate object recognition behavior can be very accurately predicted by neural population activity in the inferior temporal (IT) cortex, a brain area in the late stages of the visual processing hierarchy. Here, we therefore directly test whether making the late-stage representations of models more similar to those of macaque IT produces new models that exhibit more robust, primate-like behavior. We collected a dataset of chronic, large-scale multi-electrode recordings across the IT cortex in six non-human primates (rhesus macaques). We then used these data to fine-tune (end-to-end) the model "IT" representations such that they are more aligned with the biological IT representations, while preserving accuracy on object recognition tasks. We generated a cohort of models with a range of IT similarity scores, validated on held-out animals across two image sets with distinct statistics. Across a battery of optimization conditions, we observed a strong correlation between the models' IT-likeness and their alignment with human behavior, as well as an increase in their adversarial robustness. We further assessed the limitations of this approach and found that the improvements in behavioral alignment and adversarial robustness generalize across different image statistics, but not to object categories outside of those covered in our IT training set.
Taken together, our results demonstrate that building models that are more aligned with the primate brain leads to more robust and human-like behavior, and they call for larger neural datasets to further augment these gains. Code, models, and data are available at https://github.com/dapello/braintree.

1. INTRODUCTION AND RELATED WORK

Object recognition models have made incredible strides in the last ten years (Krizhevsky et al., 2012; Szegedy et al., 2014; Simonyan and Zisserman, 2014; He et al., 2015b; Dosovitskiy et al., 2020; Liu et al., 2022), even surpassing human performance on some benchmarks (He et al., 2015a). While some of these models bear remarkable resemblance to the primate visual system (Yamins et al., 2013; Khaligh-Razavi and Kriegeskorte, 2014; Schrimpf et al., 2018; 2020), a number of important discrepancies remain. In particular, the output behavior of current models, while coarsely aligned with primate object confusion patterns, does not fully match primate error patterns on individual images (Rajalingham et al., 2018; Geirhos et al., 2021). In addition, these same models can be easily fooled by adversarial attacks: targeted pixel-level perturbations intentionally designed to cause the model to produce the wrong output (Szegedy et al., 2013; Carlini and Wagner, 2016; Chen et al., 2017; Rony et al., 2018; Brendel et al., 2019), whereas primate behavior is thought to be more robust to these kinds of attacks. This is an important unsolved problem in engineering artificial intelligence systems; the divergence between model and human behavior has been studied extensively in the machine learning community, often from the perspective of safety in real-world deployment of computer vision systems (Das et al., 2017; Liu et al., 2017; Xu et al., 2017; Madry et al., 2017; Song et al., 2017; Dhillon et al., 2018; Buckman et al., 2018; Guo et al., 2018; Michaelis et al., 2019). From a neuroscience perspective, behavioral differences like these point to different underlying mechanisms
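For concreteness, the kind of attack referred to above can be sketched as a minimal L∞ projected gradient descent (PGD) routine. This is an illustration only, not the implementation used in this work: the `grad_fn` interface, the step-size heuristic, and the assumption of pixel values in [0, 1] are all choices made for this sketch.

```python
import numpy as np

def pgd_linf(x, y, grad_fn, eps, steps=10, alpha=None):
    """Minimal L-infinity PGD attack sketch.
    x: clean input (pixels in [0, 1]); y: true label.
    grad_fn(x_adv, y): returns the gradient of the loss w.r.t. the input
    (hypothetical interface -- in practice supplied by autodiff)."""
    if alpha is None:
        alpha = 2.5 * eps / steps            # common step-size heuristic
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)   # take a signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)     # keep a valid pixel range
    return x_adv
```

The projection step is what enforces the L∞ budget: after every gradient step, the perturbed image is clipped back so that no pixel differs from the original by more than ε.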



Figure 1: Aligning model IT representations with primate IT representations improves behavioral alignment and adversarial robustness. A) A set of naturalistic images, each containing one of eight different object classes, is shown to a CNN and also to three different primate subjects with implanted multi-electrode arrays recording from the inferior temporal (IT) cortex. (1) A base model (ImageNet pre-trained CORnet-S) is fine-tuned using stochastic gradient descent to (2) minimize the classification loss with respect to the ground-truth object in each image while also minimizing a representational similarity loss (CKA) that encourages the model's IT representation to be more like those measured in the (pooled) primate subjects. (3) The resulting IT-aligned models are then frozen and each tested in three ways. First, model IT representations are evaluated for similarity to biological IT representations (CKA metric) using neural data obtained from new primate subjects; we refer to the split-trial-reliability-ceiled average across all held-out macaques and both image sets as "validated IT neural similarity". Second, model output behavioral error patterns are assessed for alignment with human behavioral error patterns at the resolution of individual images (i2n; see Methods). Third, model behavioral output is evaluated for its robustness to white-box adversarial attacks using an L∞-norm projected gradient descent attack. All three tests are carried out with (i) new images within the IT-alignment training domain (held-out HVM images; see Methods) and (ii) new images with novel image statistics (natural COCO images; see Methods), and those empirical results are tracked separately. B) We find that this IT-alignment procedure produced gains in validated IT neural similarity relative to the base models on both datasets, and that these gains led to improvements in human behavioral alignment. n=30 models are shown, resulting from training at six different relative weightings of the IT neural-similarity loss, each applied to five base models derived from five random seeds. C) We also find that these same IT-alignment gains resulted in increased adversarial accuracy (PGD L∞, ϵ = 1/1020) on the same model set as in B. Base models trained only for ImageNet and HVM image classification are circled in grey.
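The two-term objective described in the caption, a classification loss plus a CKA-based neural-similarity penalty, can be sketched as follows. This is a minimal illustration rather than the authors' code: the linear CKA below follows the standard formulation (Kornblith et al., 2019), while the `combined_loss` helper and its weighting interface are hypothetical, and the paper's exact CKA variant, trial pooling, and preprocessing may differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representations.
    X: (n_images, n_model_units); Y: (n_images, n_IT_sites).
    Returns a similarity in [0, 1], with 1 meaning identical up to
    an orthogonal transform and isotropic scaling."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, ord='fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord='fro')
                   * np.linalg.norm(Y.T @ Y, ord='fro'))

def combined_loss(class_loss, model_it, primate_it, neural_weight):
    """Total fine-tuning objective: the task loss plus a penalty that
    grows as model/IT similarity falls. neural_weight plays the role of
    the relative weighting swept across the six training conditions."""
    return class_loss + neural_weight * (1.0 - linear_cka(model_it, primate_it))
```

Because CKA is bounded in [0, 1], the similarity term vanishes when model and primate IT representations are perfectly aligned, and the relative weighting controls how strongly alignment is traded off against classification accuracy.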

