VISUAL EXPERTISE AND THE LOG-POLAR TRANSFORM EXPLAIN IMAGE INVERSION EFFECTS

Abstract

Visual expertise can be defined as the ability to discriminate among subordinate-level objects within a homogeneous class, such as identities of faces within the class "face". Despite being able to discriminate many faces, subjects perform poorly at recognizing even familiar faces once they are inverted. This face-inversion effect contrasts with subjects' performance on inverted objects known only at the basic level, which show less impairment. Experimental results suggest that when identifying mono-oriented objects, such as cars, car novices' performance falls between that for faces and that for other objects. We build an anatomically inspired neurocomputational model to explore this effect. Our model includes a foveated retina and the log-polar mapping from the visual field to V1. This transformation causes changes in scale to appear as horizontal translations, leading to scale equivariance; rotation is similarly equivariant, leading to vertical translations. Fed into a standard convolutional network, this representation provides rotation and scale invariance. It may be surprising that a rotation-invariant network shows any inversion effect at all. The reason is a crucial topological difference between scale and rotation: the rotational dimension is discontinuous, with V1 ranging from 90° (vertically up) to 270° (vertically down). Hence when a face is inverted, the configural information in the face is disrupted while feature information is relatively unaffected. We show that the inversion effect arises as a result of visual expertise: configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which in turn are more disrupted than non-mono-oriented objects, when the objects are familiar only at the basic level.

1. INTRODUCTION

Since 1969, researchers have been studying the effects of inverting images (Yin, 1969). Some researchers have focused on defining the bounds of inversion effects: what the measurable effect is for which types of images (Farah et al., 1995; Yin, 1969; Jacques et al., 2007; Rezlescu et al., 2017). Others have looked to explain how inversion effects arise: what part of the brain is active during inversion tasks, or what level of experience a participant has with the stimuli in the experiment (Gauthier et al., 2000; Gauthier & Bukach, 2007; Gauthier et al., 2014; Kanwisher et al., 1997; 1998; Richler et al., 2011; Wang et al., 2014). In Yin (1969), participants studied a set of images during a training phase; during testing, they were shown pairs of images and asked to select the image that was in the study set. Trials with upright images and trials with inverted images were compared to determine the inversion effect. Images of faces produced a strong and significant inversion effect: performance was much worse for inverted faces. Images of houses, a mono-oriented category, showed a lesser but still significant effect, and images of airplanes showed an insignificant inversion effect. We draw two conclusions from this work. First, the effects of inversion on performance are greatest when the stimuli are faces and insignificant for certain objects (e.g., planes, which are less mono-oriented than houses) (Yin, 1969). Second, not all objects produce the same inversion effects: mono-oriented objects, which are typically seen in only one orientation, such as the houses in Yin's work, do show an inversion effect, though it is smaller than that for faces (Yin, 1969). Since Yin's 1969 paper, a great deal of research has focused on explaining inversion effects: why do different stimuli (faces, objects, mono-oriented objects) produce different inversion effects?
One explanation of inversion effects, supported by brain imaging and behavioral studies, is that visual expertise changes the way we process visual stimuli. Faces are processed holistically, meaning that not just the features but the configuration of the features matters. When such stimuli are inverted, the configuration is disrupted, and we are left with featural processing (Gauthier et al., 2000; 1999; 2003). Similar inversion effects have been observed in experts of other domains, such as dog-show judges or bird watchers (Diamond & Carey, 1986; Gauthier et al., 2000). Visual expertise is defined with respect to Rosch's basic-level categories. In a category hierarchy, the basic level is the level at which objects are most commonly labeled, such as "chair", "tree", or "car". Basic-level categories define broad categories of objects that share properties such as general appearance, function, and common parts (Rosch et al., 1976). For example, cars can look very different from each other, but they all have wheels, an enclosed space for passengers, and are used for ground transportation. Visual expertise is proficiency in differentiating subordinate-level sub-classes of a basic-level category. For example, subordinates of the basic-level category "tree" include "sugar maple", "American elm", and "northern red oak". Most people are face experts in this sense: it has been estimated that we are able to identify on the order of 5,000 different people (Jenkins et al., 2018). Identity is a subordinate-level judgment because faces share the same features (eyes, nose, mouth, ears, etc.) in the same general configuration. We also process faces holistically (Gauthier & Bukach, 2007): instead of using only the features of a face to recognize a person, we use configural information, such as the distance between the eyes or the distance from the nose to the mouth.
Hence, expertise is fine-grained discrimination within homogeneous categories. The research on expertise suggests that experts in other domains, such as cars or birds, also use configural information when viewing basic-level categories in which they are experts (Gauthier et al., 2000). We conduct experiments to ask whether inversion effects for different stimuli can be characterized in terms of levels of expertise. In doing so, we explore the changes in visual processing that occur between the novice and expert levels. To do this, we build an anatomically inspired network that incorporates foveation (high-resolution central vision and low-resolution peripheral vision) and the log-polar mapping between the visual field and primary visual cortex (Polimeni et al., 2006). The log-polar mapping causes changes in image scale to appear as horizontal translations. When presented as input to a convolutional neural network (CNN), which is translation invariant, the log-polar mapping makes the network relatively scale invariant. Image rotation is similarly equivariant in the log-polar representation, because it leads to vertical translations. However, the two differ topologically: pixels that shift vertically can "fall off" the edge of the image and wrap around to the opposite edge. This rearranges features in the image, disrupting configural information. Using this model, we test the inversion effects of different types of stimuli across increasing levels of expertise in order to gain insight into how and why visual processing changes with the visual stimulus. Our model is consistent with the view that expertise plays a significant role in the way we process visual inputs, and it reproduces the inversion effects seen in previous work.
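The geometry behind this argument can be illustrated with a minimal log-polar sampler. This is a sketch in NumPy with nearest-neighbor sampling, not the paper's actual foveated implementation; the function name `log_polar` and the grid sizes are our own choices. Rows index angle, so image rotation becomes a vertical, wrap-around shift; columns index log-radius, so rescaling becomes a horizontal shift.

```python
import numpy as np

def log_polar(img, n_theta=64, n_r=32):
    """Resample a square image onto a log-polar grid centered on the image.

    Rows index angle (rotation -> vertical circular shift); columns index
    log-radius (rescaling -> horizontal shift). Nearest-neighbor sampling
    keeps the sketch short.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cy, cx)
    radii = np.exp(np.linspace(0.0, np.log(r_max), n_r))       # log-spaced radii
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    tt, rr = np.meshgrid(thetas, radii, indexing="ij")         # (n_theta, n_r)
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return img[ys, xs]

# Rotating the input by 90 degrees circularly shifts the angular (vertical)
# axis by a quarter of its length: features wrap around the edge, which is
# the topological disruption of configural information described above.
rng = np.random.default_rng(0)
img = rng.random((65, 65))
lp = log_polar(img)
lp_rot = log_polar(np.rot90(img))
shifted = np.roll(lp, -16, axis=0)   # 16 = 64 angular bins / 4
print((lp_rot == shifted).mean())    # close to 1.0, up to rounding ties
```

Note that `np.roll` wraps the shifted rows around to the opposite edge, whereas an ordinary (scale-change) translation along the radial axis has no such wrap: this is the discontinuity the model exploits.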

2.1. MODEL

We use ResNet-50 (He et al., 2016) for all experiments, trained from scratch on the foveated, log-polar representation. We call this network LPnet. We compare our results to a "vanilla" ResNet-50 trained on standard images. Unless otherwise noted, all experiments use the Adam optimizer, an initial learning rate of 1e-4, and a minibatch size of 48.

2.2. DATA

To test the effects of expertise in visual processing, we use four different datasets. To model experts, the first three datasets are images of faces, cars, and dogs, generally mono-oriented objects, with targets at the subordinate level. To model novices, who mainly know basic-level labels, the fourth

