MIND THE PAD – CNNS CAN DEVELOP BLIND SPOTS

Abstract

We show how feature maps in convolutional networks are susceptible to spatial bias. Due to a combination of architectural choices, the activation at certain locations is systematically elevated or weakened. The major source of this bias is the padding mechanism. Depending on several aspects of convolution arithmetic, this mechanism can apply the padding unevenly, leading to asymmetries in the learned weights. We demonstrate how such bias can be detrimental to certain tasks such as small object detection: the activation is suppressed if the stimulus lies in the impacted area, leading to blind spots and misdetection. We propose solutions to mitigate spatial bias and demonstrate how they can improve model accuracy.

1. MOTIVATION

Convolutional neural networks (CNNs) serve as feature extractors for a wide variety of machine-learning tasks. Little attention has been paid to the spatial distribution of activation in the feature maps a CNN computes. Our interest in analyzing this distribution was triggered by mysterious failure cases of a traffic-light detector: the detector successfully detects a small but visible traffic light in a road scene, yet it fails completely to detect the same traffic light in the next frame captured by the ego-vehicle. The major difference between the two frames is a limited shift along the vertical dimension as the vehicle moves forward. The drastic difference in detection performance is therefore surprising, given that CNNs are often assumed to have a high degree of translation invariance [8; 17].

The spatial distribution of activation in feature maps varies with the input. Nevertheless, by closely examining this distribution for a large number of samples, we found consistent patterns among them, often in the form of artifacts that do not resemble any input features. This work analyzes the root cause of such artifacts and their impact on CNNs. We show that these artifacts are responsible for the mysterious failure cases mentioned earlier, as they can induce 'blind spots' for the object-detection head. Our contributions are:

• Demonstrating how the padding mechanism can induce spatial bias in CNNs (Section 2).
• Demonstrating how spatial bias can impair downstream tasks (Section 3).
• Identifying uneven application of 0-padding as a resolvable source of bias (Section 5).
• Relating the padding mechanism to the foveation behavior of CNNs (Section 6).
• Providing recommendations to mitigate spatial bias and demonstrating how this can prevent blind spots and boost model accuracy.

2. THE EMERGENCE OF SPATIAL BIAS IN CNNS

Our aim is to determine to which extent the activation magnitude in CNN feature maps is influenced by location. We demonstrate our analysis on a publicly available traffic-light detection model [36]. This model implements the SSD architecture [26] in TensorFlow [1], with a convolutional backbone as feature extractor. The model is trained on the BSTLD dataset [4], which annotates traffic lights in road scenes.

Figure 1 shows two example scenes from the dataset. For each scene, we show two feature maps computed by two filters in the 11th convolutional layer. This layer contains 512 filters whose feature maps are used directly by the first box predictor in the SSD to detect small objects. The bottom row in Figure 1 shows the average response of each of the two aforementioned filters, computed over the test set of BSTLD. The first filter seems to respond mainly to features in the top half of the input, while the second filter responds mainly to street areas. There are visible lines in the two average maps that do not resemble any scene features and are consistently present in the individual feature maps.

We analyzed the prevalence of these line artifacts in the feature maps of all 512 filters. The right column in Figure 1 shows the average of these maps per scene, as well as over the entire test set (see supplemental material for all 512 maps). The artifacts are clearly visible in the average maps, with variations per scene depending on which individual maps are dominant.

A useful way to make the artifacts stand out is to neutralize scene features by computing the feature maps for a zero-valued input. Figure 2 depicts the resulting average map for each convolutional layer after applying the ReLU units. The first average map is constant, as we expect with a 0-valued input. The second map is also constant except for a 1-pixel boundary, where the value is lower at the left border and higher at the other three borders. We magnify the corners to make these deviations visible.
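The zero-input probe can be reproduced with a toy sketch (this is our illustration, not the paper's code; the 3×3 all-ones kernels and bias values are hypothetical). Layer 1 outputs a constant map, equal to its bias, for an all-zero input; at layer 2, the zero padding mixes that constant with padded zeros near the image edge, so border activations deviate from the interior — the seed of the boundary artifacts described above:

```python
def conv3x3_same(x, k, bias):
    """3x3 convolution with 1-pixel zero padding (stride 1), followed by ReLU."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = bias
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # outside -> padded zero
                        s += k[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = max(s, 0.0)  # ReLU
    return out

zero_input = [[0.0] * 6 for _ in range(6)]
ones_kernel = [[1.0] * 3 for _ in range(3)]  # hypothetical weights

y1 = conv3x3_same(zero_input, ones_kernel, bias=1.0)  # constant map of 1.0
y2 = conv3x3_same(y1, ones_kernel, bias=0.0)          # border deviations appear

print(y2[2][2], y2[0][2], y2[0][0])  # interior 9.0, edge 6.0, corner 4.0
```

With these toy weights the deviation is at the border only; as in Figure 2, stacking further layers lets the deviations propagate inwards and grow in thickness.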
The border deviations increase in thickness and in variance at subsequent layers, creating multiple line artifacts at each border. These artifacts become quite pronounced at ReLU 8, where they start to propagate inwards, resembling the ones in Figure 1.

Figure 2: Activation maps for a 0 input, averaged over each layer's filters (title format: H×W×C).
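One source of the asymmetry between borders is how 'SAME' padding is computed. The following sketch follows TensorFlow's documented SAME-padding arithmetic: when the total padding needed is odd, the extra pixel is applied to the right/bottom side only, so with certain combinations of input size, kernel size, and stride the padding is applied unevenly:

```python
import math

def same_padding(in_size, kernel, stride):
    """Per-side padding under TensorFlow-style SAME padding (1D view)."""
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_before = pad_total // 2             # left / top
    pad_after = pad_total - pad_before      # right / bottom gets the extra pixel
    return pad_before, pad_after

print(same_padding(10, 3, 1))  # (1, 1): even total, symmetric padding
print(same_padding(10, 3, 2))  # (0, 1): odd total, right/bottom side only
```

In the strided case, one side of the input never sees padded zeros while the opposite side does; the learned weights can then specialize asymmetrically, which is consistent with the second map in Figure 2 deviating differently at the left border than at the other three.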



Figure 1: Averaging feature maps per input (column marginal) and per filter (row marginal) in the last convolutional layer of a traffic-light detector. Color indicates activation strength (the brighter, the higher), revealing line artifacts in the maps. These artifacts are the manifestation of spatial bias.

