EXPLORING PERCEPTUAL STRAIGHTNESS IN LEARNED VISUAL REPRESENTATIONS

Abstract

Humans have been shown to use a "straightened" encoding to represent the natural visual world as it evolves in time (Hénaff et al., 2019). In the context of discrete video sequences, "straightened" means that changes between frames follow a more linear path in representation space at progressively deeper levels of processing. While deep convolutional networks are often proposed as models of human visual processing, many do not straighten natural videos. In this paper, we explore the relationship between network architecture, differing types of robustness, biologically-inspired filtering mechanisms, and representational straightness in response to time-varying input; we identify strengths and limitations of straightness as a way of evaluating neural network representations. We find that (1) adversarial training leads to straighter representations in both CNN and transformer-based architectures, but (2) this effect is task-dependent: it does not generalize to tasks such as segmentation and frame prediction, where straight representations are not favorable for predictions, nor to other types of robustness. In addition, (3) straighter representations impart temporal stability to class predictions, even for out-of-distribution data. Finally, (4) biologically-inspired elements increase straightness in the early stages of a network but do not guarantee increased straightness in downstream layers of CNNs. We show that straightness is an easily computed measure of representational robustness and stability, as well as a hallmark of human representations with benefits for computer vision models.

1. INTRODUCTION

Visual input from the natural world evolves over time, and this change can be thought of as a trajectory in some representation space. For humans, this trajectory has a different representation at each stage of processing, from input at the retina to brain regions such as V1 and finally to perception (Fig. 1). We can ask whether there are advantages to representing the natural evolution of the visual input over time with a straighter, less curved, trajectory; if so, one might expect human vision to do this. Hénaff et al. (2019) demonstrated that trajectories are straighter in human perceptual space than in pixel space, and suggested that a straighter representation may be useful for visual tasks that require extrapolation, such as predicting the future visual state of the world.

Learning a useful visual representation is one of the major goals of computer vision. Properties such as temporal stability, robustness to transformations, and generalization, all of which characterize human vision, are often desirable in computer vision representations. Yet many existing computer vision models still fail to capture aspects of human vision, despite achieving high accuracy on visual tasks like recognition (Feather et al., 2019; Hénaff et al., 2019). Hénaff et al. (2019) found that, while biologically-inspired V1-like transformations yield straighter representations compared to the input domain, popular computer vision models such as the original ImageNet-trained AlexNet (Krizhevsky et al., 2017) do not.

In an effort to endow computer vision models with these favorable properties of human vision, much work has been dedicated to incorporating various aspects of human vision into deep neural networks. These include modifying network architectures to mimic aspects of the human visual system (Huang & Rao, 2011), for example by incorporating filter banks similar to the receptive field properties of visual

