CONVOLUTIONAL NEURAL NETWORKS ARE NOT INVARIANT TO TRANSLATION, BUT THEY CAN LEARN TO BE

Abstract

When seeing a new object, humans can immediately recognize it across different retinal locations: we say that the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several works have found that these networks systematically fail to recognize new objects at untrained locations. In this work we show how, even though CNNs are not 'architecturally invariant' to translation, they can indeed 'learn' to be invariant to translation. We verified that this can be achieved by pretraining on ImageNet, and we found that it is also possible with much simpler datasets in which the items are fully translated across the input canvas. Significantly, simply training everywhere on the canvas was not enough. We investigated how this pretraining affected the internal network representations, finding that the invariance was almost always acquired, even though it was sometimes disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules that dramatically improve subsequent generalization.

1. INTRODUCTION

The perceived equivalence of an object across different viewpoints is considered a fundamental capacity of human visual recognition (Hummel, 2002). This is mediated by the inferior temporal cortex, which appears to provide the basis for scale, translation, and rotation invariance (Tanaka, 1996; O'Reilly & Munakata, 2019). Taking inspiration from biological models (LeCun et al., 1998), Artificial Neural Networks have been endowed with convolution and pooling operations (LeCun et al., 1998; 1990). It is often claimed that Convolutional Neural Networks (CNNs) are less susceptible to irrelevant sources of variation such as image translation, scaling, and other small deformations (Gens & Domingos, 2014; Xu et al., 2014; LeCun & Bengio, 1995; Fukushima, 1980). While it is difficult to overstate the importance of convolution and pooling operations in deep learning, their ability to make a network invariant to image transformations has been overestimated: for example, Gong et al. (2014) showed that CNNs achieve neither rotation nor scale invariance. Similarly, multiple studies have reported highly limited translation invariance (Kauderer-Abrams, 2017; Gong et al., 2014; Azulay & Weiss, 2019; Chen et al., 2017; Blything et al., 2020). It is important to understand the reason for the misconception regarding the ability of CNNs to be invariant to translation. We believe this is due to two misunderstandings. Firstly, it is commonly assumed that CNNs are 'architecturally' invariant to translation (that is, the invariance is built into the architecture through pooling and/or convolution).
For example: "[CNNs] have an architecture hard-wired for some translation-invariance while they rely heavily on learning through extensive data or data augmentation for invariance to other transformations" (Han et al., 2020), and "Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network such that they innately capture a property known as translational invariance. This is essentially the idea that an object can slide around an image while maintaining its identity; a circle in the top left can be presumed (even absent direct experience) to be the same as a circle in the bottom right." (Marcus, 2018); see also LeCun & Bengio (1995). In fact, the convolution operation is translationally equivariant, not invariant, meaning that a transformation applied to the input is transferred to the output (Lenc & Vedaldi, 2019). Even when this point is acknowledged, such as in LeCun & Bengio (1995), it is still assumed that equivariance is enough to support an important degree of translation invariance. For example, LeCun & Bengio (1995) write: "Once a feature has been detected its exact location becomes less important as long as its approximate position relative to other features is preserved". As a matter of fact, equivariance and invariance are mutually exclusive properties (a representation cannot support both), and accordingly, any invariance supported by a network must be coded in the fully connected part rather than in the equivariant convolutional layers. Moreover, perfect equivariance can be lost in the convolutional layers (Azulay & Weiss, 2019; Zhang, 2019) through subsequent sub-sampling (implemented with the pooling and striding operations commonly used in almost any CNN). Therefore, overall, most modern CNNs are neither architecturally invariant nor perfectly equivariant to translation.
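The equivariance/invariance distinction can be made concrete with a minimal NumPy sketch (ours, not code from any of the cited works): a plain convolution shifts its output together with its input, which is equivariance, not invariance, and stride-2 subsampling destroys even that equivariance for a 1-pixel shift.

```python
import numpy as np

def conv2d(image, kernel):
    """Plain 2D cross-correlation with 'valid' padding, as in a CNN layer."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

# An 8x8 "object" on a 16x16 canvas, then the same canvas shifted right by 1 pixel.
canvas = np.zeros((16, 16))
canvas[2:10, 2:10] = rng.normal(size=(8, 8))
shifted = np.roll(canvas, shift=1, axis=1)

f1 = conv2d(canvas, kernel)
f2 = conv2d(shifted, kernel)

# Equivariance: the feature map of the shifted input is the shifted feature map.
equivariant = np.allclose(f2[:, 1:], f1[:, :-1])
print("equivariant:", equivariant)            # True

# Not invariance: the two feature maps themselves are different.
print("invariant:", np.allclose(f1, f2))      # False

# Stride-2 subsampling breaks equivariance: a 1-pixel input shift now falls
# "between" the sampling grid, so no integer shift relates the two outputs.
s1, s2 = f1[::2, ::2], f2[::2, ::2]
print("equivariant after stride 2:", np.allclose(s2[:, 1:], s1[:, :-1]))  # False
```

The same loss of equivariance occurs with strided pooling; this is the mechanism behind the Azulay & Weiss (2019) and Zhang (2019) observations cited above.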
The other reason for overestimating the extent to which CNNs are invariant to translation resides in the failure to distinguish between trained and online translation invariance (see Bowers et al., 2016). Trained invariance refers to the ability to correctly classify unseen instances of a class at trained locations. For instance, a network trained across the whole visual field to identify instances of dogs will be able to identify a new image of a dog across multiple locations. This capacity is obtained through data augmentation: jittering the training samples so that the network is trained on items across different locations (Kauderer-Abrams, 2017; Furukawa, 2017). However, this should not be considered a form of translation invariance, as it is simply a case of identifying a test image (a novel image of a dog) at a trained location. More interesting is the concept of 'online' translation invariance: learning to identify an object at one location immediately affords the capacity to identify that object at multiple other locations¹. Online translation invariance is generally measured by training a network on images placed at a certain location (generally the center of a canvas), and then testing with the same images placed at untrained locations. In many reports, CNNs performed at chance level on untrained locations (Kauderer-Abrams, 2017; Gong et al., 2014; Azulay & Weiss, 2019; Chen et al., 2017; Blything et al., 2020). This problem has been tackled with several architectural changes: Sundaramoorthi & Wang (2019) suggested a solution based on a Gaussian-Hermite basis; Bruna & Mallat (2012) used a wavelet scattering network model; Jaderberg et al. (2015) added a new module that can account for any affine transformation; Blything et al. (2020) used Global Average Pooling; see also Gens & Domingos (2014); Xu et al. (2014); Marcos et al. (2016). Without having to resort to new architectures, two works have recently contrasted with the previous findings, obtaining a high degree of online translation invariance: Han et al. (2020) found that a CNN exhibited almost perfect online translation invariance on a Korean character recognition task, but did not compare these results with the previous literature. Blything et al. (2020) also found perfect online translation invariance when a VGG16 network was pretrained on ImageNet, but found almost no invariance when the same (vanilla) network was untrained. This latter finding explains Han et al.'s results, as they also used a pretrained network when assessing online translation invariance. Together, these results hint at the fact that translationally invariant representations do not need to be built into the network architecture, but can be learned. In the current work, we further explore this idea.

2. CURRENT WORK

In this work we focus on 'online' translation invariance in a classic CNN, using VGG16 (Simonyan & Zisserman, 2014) as a typical convolutional network. We show how, even though classic CNNs are not 'architecturally' invariant, they can 'learn' to be invariant to translation by extracting latent features of their visual environment (the dataset). 'Learning' is used in the sense that the invariance is coded within the network weights, optimized through backpropagation, and not hard-wired into the network architecture, as in Sundaramoorthi & Wang (2019) or Bruna & Mallat (2012).

¹ 'Trained translation invariance', in which images can be identified across the canvas after training on exemplar images at many locations, is not to be confused with our approach of 'training' translation invariance, in which a network is trained so that it exhibits 'online' translation invariance.
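The online-invariance test, and why one of the architectural fixes listed above works, can be sketched in a few lines of NumPy (our illustration, not the paper's code): the same object is placed at a "trained" and an "untrained" location, and the resulting feature maps are compared either flattened (as fed to a fully connected layer) or after Global Average Pooling.

```python
import numpy as np

def conv2d(image, kernel):
    """2D cross-correlation with 'valid' padding (a single CNN feature map)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(1)
kernel = rng.normal(size=(3, 3))
obj = rng.normal(size=(6, 6))

# The same object placed at a "trained" and at an "untrained" location.
trained, untrained = np.zeros((16, 16)), np.zeros((16, 16))
trained[2:8, 2:8] = obj
untrained[2:8, 8:14] = obj

f_trained = conv2d(trained, kernel)
f_untrained = conv2d(untrained, kernel)

# Flattened feature maps differ completely: each fully connected unit sees the
# object's features at different positions, so nothing learned at the trained
# location transfers to the untrained one.
print(np.allclose(f_trained.ravel(), f_untrained.ravel()))   # False

# Global Average Pooling discards position before the fully connected layer, so
# the pooled descriptor is identical at both locations (as long as the object
# stays fully inside the canvas): architectural, rather than learned, invariance.
print(np.isclose(f_trained.mean(), f_untrained.mean()))      # True
```

This is the sense in which the invariance studied here differs from such architectural solutions: in our experiments the flattened, position-dependent representation itself must come to support invariance through learned weights.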

