CONVOLUTIONAL NEURAL NETWORKS ARE NOT INVARIANT TO TRANSLATION, BUT THEY CAN LEARN TO BE

Abstract

When seeing a new object, humans can immediately recognise it across different retinal locations: we say that the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several works have found that these networks systematically fail to recognise new objects at untrained locations. In this work we show that, even though CNNs are not 'architecturally invariant' to translation, they can indeed 'learn' to be invariant to translation. We verified that this can be achieved by pretraining on ImageNet, and we found that it is also possible with much simpler datasets in which the items are fully translated across the input canvas. Significantly, simply training everywhere on the canvas was not enough. We investigated how this pretraining affected the internal network representations, finding that the invariance was almost always acquired, even though it was sometimes disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules that dramatically improve subsequent generalization.

1. INTRODUCTION

The recognition of an object as equivalent across different viewpoints is considered a fundamental capacity of human visual recognition (Hummel, 2002). This capacity is mediated by the inferior temporal cortex, which appears to provide the basis for scale, translation, and rotation invariance (Tanaka, 1996; O'Reilly & Munakata, 2019). Taking inspiration from biological models, Artificial Neural Networks have been endowed with convolution and pooling operations (LeCun et al., 1998; 1990). It is often claimed that Convolutional Neural Networks (CNNs) are less susceptible to irrelevant sources of variation such as image translation, scaling, and other small deformations (Gens & Domingos, 2014; Xu et al., 2014; LeCun & Bengio, 1995; Fukushima, 1980). While it is difficult to overstate the importance of convolution and pooling operations in deep learning, their ability to make a network invariant to image transformations has been overestimated: for example, Gong et al. (2014) showed that CNNs achieve neither rotation nor scale invariance. Similarly, multiple studies have reported highly limited translation invariance (Kauderer-Abrams, 2017; Gong et al., 2014; Azulay & Weiss, 2019; Chen et al., 2017; Blything et al., 2020). It is important to understand the reason for the misconception regarding the ability of CNNs to be invariant to translation. We believe it stems from two misunderstandings. Firstly, it is commonly assumed that CNNs are 'architecturally' invariant to translation (that is, the invariance is built into the architecture through pooling and/or convolution).
For example: "[CNNs] have an architecture hard-wired for some translation-invariance while they rely heavily on learning through extensive data or data augmentation for invariance to other transformations" (Han et al., 2020), and "Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network such that they innately capture a property known as translational invariance. This is essentially the idea that an object can slide around an image while maintaining its identity; a circle in the top left can be presumed, even absent direct experience, to be
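The distinction at the heart of this misconception — equivariance versus invariance — can be illustrated with a minimal NumPy sketch (the 1-D convolution, fixed kernel, and readout weights below are purely illustrative, not taken from any of the cited models). Convolution merely shifts the feature map along with the input, so a dense readout on the flattened map responds differently when the object appears at an untrained location; only pooling over the entire canvas yields a truly translation-invariant output.

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D 'valid' cross-correlation: the core operation of a conv layer."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

# The same small pattern placed at two positions on a blank canvas.
canvas = 32
pattern = np.array([1.0, 2.0, 3.0])
x_left = np.zeros(canvas);  x_left[2:5] = pattern
x_right = np.zeros(canvas); x_right[12:15] = pattern   # translated by 10

kernel = np.array([0.5, -1.0, 2.0])   # illustrative fixed weights
f_left = conv1d_valid(x_left, kernel)
f_right = conv1d_valid(x_right, kernel)

# Convolution is EQUIVARIANT: the feature map shifts with the input ...
assert np.allclose(f_left[0:5], f_right[10:15])
# ... but it is NOT invariant: the two maps differ as vectors.
assert not np.allclose(f_left, f_right)

# A dense readout on the flattened map (a typical CNN head) therefore
# responds differently when the object appears at an untrained location.
w = np.arange(len(f_left), dtype=float)  # illustrative readout weights
assert not np.isclose(w @ f_left, w @ f_right)

# Pooling over the WHOLE canvas collapses position and restores invariance;
# the small local pooling windows used in standard CNNs do not.
assert np.isclose(f_left.max(), f_right.max())
```

The sketch makes the terminology concrete: convolution and local pooling give equivariance plus some tolerance to small shifts, but full invariance to translation must either be imposed by global pooling or, as argued in this work, learned from the training data.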

