

Abstract

In this work we tackle the problem of out-of-distribution generalization through conditional computation. Real-world applications often exhibit a larger distributional shift between training and test data than most datasets used in research. On the other hand, training data in such applications often comes with additional annotation. We propose a method for leveraging this extra information by using an auxiliary network that modulates activations of the main network. We show that this approach improves performance over a strong baseline on the Inria Aerial Image Labeling dataset and the Tumor Infiltrating Lymphocytes (TIL) dataset, which by design evaluate out-of-distribution generalization in semantic segmentation and image classification, respectively.

1. INTRODUCTION

Deep learning has achieved great success in many core artificial intelligence (AI) tasks (Hinton et al., 2012; Krizhevsky et al., 2012; Brown et al., 2020) over the past decade. This success is often attributed to better computational resources (Brock et al., 2018) and large-scale datasets (Deng et al., 2009). Collecting and annotating datasets that represent a sufficient diversity of real-world test scenarios for every task or domain is extremely expensive and time-consuming, so sufficient training data may not always be available. Owing to many factors of variation (e.g., weather, season, time of day, illumination, view angle, sensor, and image quality), there is often a distributional change, or domain shift, that can degrade performance in real-world applications (Shimodaira, 2000; Wang & Schneider, 2014; Chung et al., 2018). Applications in remote sensing, medical imaging, and Earth observation commonly suffer from distributional shifts resulting from atmospheric changes, seasonality, weather, the use of different scanning sensors, differences in calibration, and other variations that translate to unexpected behavior at test time (Zhu et al., 2017; Robinson et al., 2019; Ortiz et al., 2018).

In this work, we present a novel neural network architecture that increases robustness to distributional changes (see Figure 1). Our framework combines conditional computation (Dumoulin et al., 2016; 2018; De Vries et al., 2017; Perez et al., 2018) with a task-specific neural architecture for better generalization under domain shift. One key feature of this architecture is its ability to exploit extra information, often available but seldom used by current models, through a conditioning network. This results in models with better generalization, better performance in both independent and identically distributed (i.i.d.) and non-i.i.d. settings, and, in some cases, faster convergence.
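The core idea, an auxiliary conditioning network that maps metadata to feature-wise scale and shift parameters for the main network, can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the paper's implementation: the weights `w1`, `w2` stand in for learned parameters, and the function names are hypothetical.

```python
import numpy as np

def conditioning_network(metadata, w1, w2):
    """Auxiliary MLP: maps per-sample metadata to per-channel
    scale (gamma) and shift (beta) for the main network.
    w1, w2 are stand-ins for learned weights."""
    h = np.maximum(metadata @ w1, 0.0)           # ReLU hidden layer
    gamma_beta = h @ w2                          # shape (N, 2*C)
    C = gamma_beta.shape[1] // 2
    return gamma_beta[:, :C], gamma_beta[:, C:]  # gamma, beta

def modulate(features, gamma, beta):
    """Feature-wise affine modulation of a (N, C, H, W) feature map."""
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

rng = np.random.default_rng(0)
N, C, H, W, D = 4, 16, 8, 8, 8            # batch, channels, spatial, metadata dim
features = rng.standard_normal((N, C, H, W))
metadata = rng.standard_normal((N, D))    # e.g., location, season, sensor
w1 = rng.standard_normal((D, 32)) * 0.1
w2 = rng.standard_normal((32, 2 * C)) * 0.1
gamma, beta = conditioning_network(metadata, w1, w2)
out = modulate(features, gamma, beta)
print(out.shape)  # (4, 16, 8, 8)
```

Because the modulation is a per-sample affine transform of each channel, the main network's behavior can change with the metadata while its convolutional weights stay shared across all inputs.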
We demonstrate these methodological innovations on an aerial building segmentation task, where test images come from different geographic areas than those seen during training (Maggiori et al., 2017), and on the task of Tumor Infiltrating Lymphocytes (TIL) classification (Saltz et al., 2018). We summarize our main contributions as follows:

• We propose a novel architecture to effectively incorporate conditioning information, such as metadata.
• We show empirically that our conditional network improves performance on semantic segmentation and image classification tasks.
• We study how conditional networks improve generalization in the presence of distributional shift.

2. BACKGROUND AND RELATED WORK

Self-supervised learning. Self-supervised learning extracts and uses available relevant context and embedded metadata as supervisory signals. It is a representation learning approach that exploits a variety of labels that come with the data for free. To leverage large amounts of unlabeled data, the learning objectives can be set such that supervision is generated from the data itself. The self-supervised task, also known as a pretext task, provides a supervised loss function (Gidaris et al., 2018; Oord et al., 2018; He et al., 2019; Chen et al., 2020). However, in self-supervised learning we usually do not emphasize performance on this auxiliary task. Rather, we focus on the learned intermediate representation, with the expectation that this representation carries good semantic or structural meaning and can benefit a variety of practical downstream tasks. Conditional networks can be seen as a self-supervision approach in which the pretext task is learned jointly with the downstream task. Our proposed modulation of a network architecture based on an auxiliary network's intermediate representation can also be seen as an instance of knowledge transfer (Hinton et al., 2015; Urban et al., 2016; Buciluǎ et al., 2006). Because the auxiliary network has an additional task signal (metadata prediction), information about this task can be transferred to the main-task network.

Conditional Computation. Ioffe and Szegedy designed Batch Normalization (BN) as a technique to accelerate the training of deep neural networks (Ioffe & Szegedy, 2015). BN normalizes a given mini-batch B = {F_{n,·,·,·}}_{n=1}^{N} of N feature maps as described by the following equation:

BN(F_{n,c,h,w} | γ_c, β_c) = γ_c · (F_{n,c,h,w} − E_B[F_{·,c,·,·}]) / √(Var_B[F_{·,c,·,·}] + ε) + β_c,   (1)

where c, h, and w index the channel, height, and width axes, respectively; γ_c and β_c are trainable scale and shift parameters, introduced to retain the representational power of the original network; and ε is a constant added for numerical stability. For convolutional layers, the mean and variance are computed over both the batch and spatial dimensions, implying that each location in the feature map is normalized in the same way. De Vries et al. (2017) and Perez et al. (2018) introduced Conditional Batch Normalization (CBN) as a method for language-vision tasks. Instead of setting γ_c and β_c in Equation 1 directly, CBN defines them as learned functions β_{n,c} = β_c(q_n) and γ_{n,c} = γ_c(q_n) of a conditioning input q_n. Note that this results in a different scale and shift for each sample in a mini-batch. The scale (γ_{n,c}) and shift (β_{n,c}) parameters generated for each convolutional feature are applied via an affine transformation. Feature-wise transformations frequently have enough capacity to model complex
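Following Equation 1 and the CBN definitions above, the mechanism can be sketched in NumPy. This is an illustrative sketch only: the linear maps `W_gamma` and `W_beta` are hypothetical stand-ins for the learned functions γ_c(q_n) and β_c(q_n).

```python
import numpy as np

def conditional_batch_norm(F, gamma, beta, eps=1e-5):
    """CBN for a (N, C, H, W) mini-batch: normalize over batch and
    spatial dims as in Equation 1, then apply a per-sample, per-channel
    affine transform. Unlike BN's shared (C,) parameters, gamma and
    beta here have shape (N, C): one scale/shift per sample."""
    mean = F.mean(axis=(0, 2, 3), keepdims=True)   # E_B[F_{.,c,.,.}]
    var = F.var(axis=(0, 2, 3), keepdims=True)     # Var_B[F_{.,c,.,.}]
    F_hat = (F - mean) / np.sqrt(var + eps)
    return gamma[:, :, None, None] * F_hat + beta[:, :, None, None]

rng = np.random.default_rng(0)
N, C, H, W, Q = 4, 8, 16, 16, 10
F = rng.standard_normal((N, C, H, W))
q = rng.standard_normal((N, Q))                    # conditioning input q_n
W_gamma = rng.standard_normal((Q, C)) * 0.1        # stand-ins for the learned
W_beta = rng.standard_normal((Q, C)) * 0.1         # functions gamma_c(q), beta_c(q)
gamma = 1.0 + q @ W_gamma                          # gamma_{n,c}
beta = q @ W_beta                                  # beta_{n,c}
out = conditional_batch_norm(F, gamma, beta)
print(out.shape)  # (4, 8, 16, 16)
```

With gamma fixed to ones and beta to zeros, this reduces to plain batch normalization without the learned affine terms; the conditioning enters only through the per-sample scale and shift.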



Figure 1: Conditional Networks

