CONTEXTUAL DROPOUT: AN EFFICIENT SAMPLE-DEPENDENT DROPOUT MODULE

Abstract

Dropout has been demonstrated to be a simple and effective module that not only regularizes the training of deep neural networks but also provides uncertainty estimates for their predictions. However, the quality of uncertainty estimation depends heavily on the dropout probabilities. Most current models use the same dropout distribution across all data samples for simplicity. Sample-dependent dropout, on the other hand, despite its potential gains in the flexibility of modeling uncertainty, is less explored, as it often encounters scalability issues or involves non-trivial model changes. In this paper, we propose contextual dropout with an efficient structural design as a simple and scalable sample-dependent dropout module, which can be applied to a wide range of models at the expense of only slightly increased memory and computational cost. We learn the dropout probabilities with a variational objective, compatible with both Bernoulli dropout and Gaussian dropout. We apply the contextual dropout module to various models with applications to image classification and visual question answering, and demonstrate the scalability of the method on large-scale datasets such as ImageNet and VQA 2.0. Our experimental results show that the proposed method outperforms baseline methods in terms of both accuracy and quality of uncertainty estimation.

1. INTRODUCTION

Deep neural networks (NNs) have become ubiquitous and achieved state-of-the-art results in a wide variety of research problems (LeCun et al., 2015). To prevent over-parameterized NNs from overfitting, we often need to appropriately regularize their training. One way to do so is to use Bayesian NNs that treat the NN weights as random variables and regularize them with appropriate prior distributions (MacKay, 1992; Neal, 2012). More importantly, we can obtain the model's confidence in its predictions by evaluating the consistency between predictions conditioned on different posterior samples of the NN weights. However, despite significant recent efforts in developing various types of approximate inference for Bayesian NNs (Graves, 2011; Welling & Teh, 2011; Li et al., 2016; Blundell et al., 2015; Louizos & Welling, 2017; Shi et al., 2018), the large number of NN weights makes it difficult to scale to real-world applications. Dropout has been demonstrated as another effective regularization strategy, which can be viewed as imposing a distribution over the NN weights (Gal & Ghahramani, 2016). Relating dropout to Bayesian inference provides a much simpler and more efficient way than vanilla Bayesian NNs to provide uncertainty estimation (Gal & Ghahramani, 2016), as there is no longer a need to explicitly instantiate multiple sets of NN weights. For example, Bernoulli dropout randomly shuts down neurons during training (Hinton et al., 2012; Srivastava et al., 2014). Gaussian dropout multiplies the neurons with independent and identically distributed (iid) Gaussian random variables drawn from N(1, α), where the variance α is a tuning parameter (Srivastava et al., 2014). Variational dropout generalizes Gaussian dropout by reformulating it under a Bayesian setting and allowing α to be learned under a variational objective (Kingma et al., 2015; Molchanov et al., 2017).
However, the quality of uncertainty estimation depends heavily on the dropout probabilities (Gal et al., 2017). To avoid grid search over the dropout probabilities, Gal et al. (2017) and Boluki et al. (2020) propose to automatically learn the dropout probabilities, which not only leads to a faster experiment cycle but also enables the model to have different dropout probabilities for each layer, bringing greater flexibility into uncertainty modeling. However, these methods still impose the restrictive assumption that dropout probabilities are global parameters shared across all data samples. By contrast, we consider parameterizing dropout probabilities as a function of the input covariates, treating them as data-dependent local variables. Applying covariate-dependent dropout allows different data to have different distributions over the NN weights. This generalization has the potential to greatly enhance the expressiveness of a Bayesian NN. However, learning covariate-dependent dropout rates is challenging. Ba & Frey (2013) propose standout, in which a binary belief network is laid over the original network, and develop a heuristic approximation to optimize the free energy. But, as pointed out by Gal et al. (2017), it is not scalable due to its need to significantly increase the model size. In this paper, we propose a simple and scalable contextual dropout module, whose dropout rates depend on the covariates x, as a new approximate Bayesian inference method for NNs. With a novel design that reuses the main network to define how the covariate-dependent dropout rates are produced, it boosts performance while only slightly increasing the memory and computational cost. Our method greatly enhances the flexibility of modeling, maintains the inherent advantages of dropout over conventional Bayesian NNs, and is generally simple to implement and scalable to large-scale applications.
We plug the contextual dropout module into various types of NN layers, including fully connected, convolutional, and attention layers. On a variety of supervised learning tasks, contextual dropout achieves good performance in terms of accuracy and quality of uncertainty estimation.

2. CONTEXTUAL DROPOUT

We introduce an efficient solution for data-dependent dropout: (1) treat the dropout probabilities as sample-dependent local random variables, (2) propose an efficient parameterization of dropout probabilities by sharing parameters between the encoder and decoder, and (3) learn the dropout distribution with a variational objective.

2.1. BACKGROUND ON DROPOUT MODULES

Consider a supervised learning problem with training data D := {x_i, y_i}_{i=1}^N, where we model the conditional probability p_θ(y_i | x_i) using a NN parameterized by θ. Applying dropout to a NN often means element-wise reweighting of each layer with a data-specific Bernoulli/Gaussian distributed random mask z_i, drawn iid from a prior p_η(z) parameterized by η (Hinton et al., 2012; Srivastava et al., 2014). This implies dropout training can be viewed as approximate Bayesian inference (Gal & Ghahramani, 2016). More specifically, one may view the learning objective of a supervised learning model with dropout as a log-marginal-likelihood: log ∏_{i=1}^N ∫ p(y_i | x_i, z) p_η(z) dz. To maximize this often intractable log-marginal, it is common to resort to variational inference (Hoffman et al., 2013; Blei et al., 2017), which introduces a variational distribution q(z) on the random mask z and optimizes an evidence lower bound (ELBO):

L(D) = E_{q(z)}[log (∏_{i=1}^N p_θ(y_i | x_i, z)) p_η(z) / q(z)] = Σ_{i=1}^N E_{z_i ∼ q(z)}[log p_θ(y_i | x_i, z_i)] − KL(q(z) || p_η(z)),   (1)

where KL(q(z) || p_η(z)) = E_{q(z)}[log q(z) − log p_η(z)] is a Kullback-Leibler (KL) divergence based regularization term. Whether the KL term is explicitly imposed is a key distinction between regular dropout (Hinton et al., 2012; Srivastava et al., 2014) and its Bayesian generalizations (Gal & Ghahramani, 2016; Gal et al., 2017; Kingma et al., 2015; Molchanov et al., 2017; Boluki et al., 2020).

2.2. COVARIATE-DEPENDENT WEIGHT UNCERTAINTY

In regular dropout, as shown in (1), while we make the dropout masks data specific during optimization, we keep their distributions the same. This implies that while the NN weights can vary from data to data, their distribution is kept data invariant. In this paper, we propose contextual dropout, in which the distribution of the dropout masks z_i depends on the covariates x_i for each sample (x_i, y_i). Specifically, we define the variational distribution as q_φ(z_i | x_i), where φ denotes its NN parameters. In the framework of amortized variational Bayes (Kingma & Welling, 2013; Rezende et al., 2014), we can view q_φ as an inference network (encoder) trying to approximate the posterior p(z_i | y_i, x_i) ∝ p(y_i | x_i, z_i)p(z_i). Note that, as we have no access to y_i during testing, we parameterize our encoder so that it depends on x_i but not y_i. From the optimization point of view, what we propose corresponds to the ELBO of log ∏_{i=1}^N ∫ p(y_i | x_i, z_i)p(z_i)dz_i given q_φ(z_i | x_i) as the encoder, which can be expressed as

L(D) = Σ_{i=1}^N L(x_i, y_i),  L(x_i, y_i) = E_{z_i ∼ q_φ(· | x_i)}[log p_θ(y_i | x_i, z_i)] − KL(q_φ(z_i | x_i) || p_η(z_i)).   (2)

This ELBO differs from that of regular dropout in (1) in that the dropout distribution for z_i is now parameterized by x_i, and the single KL regularization term is replaced with an aggregation of N data-dependent KL terms. Unlike in conventional Bayesian NNs, as z_i is now a local random variable, the impact of the KL terms will not diminish as N increases; from the viewpoint of uncertainty quantification, contextual dropout relies only on aleatoric uncertainty to model its uncertainty on y_i given x_i. Like in conventional BNNs, we may add epistemic uncertainty by imposing a prior distribution on θ and/or φ and inferring their posterior given D. As contextual dropout with a point estimate of both θ and φ already achieves state-of-the-art performance, we leave that extension for future research.
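To make the objective concrete, the following minimal numpy sketch (illustrative only; `bernoulli_kl` and `elbo` are hypothetical helper names, and the per-sample log-likelihood terms are taken as given) computes the ELBO of (2) for Bernoulli masks, with one KL term per data point:

```python
import numpy as np

def bernoulli_kl(q_p, p_p, eps=1e-8):
    """KL(Ber(q_p) || Ber(p_p)), computed elementwise and summed over the mask."""
    q_p = np.clip(q_p, eps, 1 - eps)
    p_p = np.clip(p_p, eps, 1 - eps)
    return np.sum(q_p * np.log(q_p / p_p)
                  + (1 - q_p) * np.log((1 - q_p) / (1 - p_p)))

def elbo(log_liks, q_probs, prior_prob):
    """Eq. (2): sum over samples of the likelihood term minus a per-sample KL.

    log_liks[i] approximates E_{z_i ~ q}[log p(y_i | x_i, z_i)];
    q_probs[i] are the sample-dependent dropout probabilities for sample i."""
    kls = np.array([bernoulli_kl(q, prior_prob) for q in q_probs])
    return np.sum(np.asarray(log_liks) - kls)
```

Note that, unlike in (1), each data point contributes its own KL term, so the regularization does not vanish relative to the likelihood as N grows.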
In what follows, we omit the data index i for simplicity and formally define the model structure. Cross-layer dependence: For a NN with L layers, we denote z = {z^1, . . . , z^L}, with z^l representing the dropout masks at layer l. As we expect z^l to depend on the dropout masks of previous layers {z^j}_{j<l}, we introduce an autoregressive distribution q_φ(z | x) = ∏_{l=1}^L q_φ(z^l | x^{l−1}), where x^{l−1}, the output of layer l − 1, is a function of {z^1, . . . , z^{l−1}, x}. Parameter sharing between encoder and decoder: We aim to build an encoder by modeling q_φ(z^l | x^{l−1}), where x may come from complex and highly structured data such as images and natural language. Thus, extracting useful features from x to learn the encoder distribution q_φ itself becomes a problem as challenging as the original one, i.e., extracting discriminative features from x to predict y. As the intermediate layers of the decoder network p_θ are already learning useful features from the input, we choose to reuse them in the encoder instead of extracting features from scratch. If we denote layer l of the decoder network by g^l_θ, then the output of layer l, given its input x^{l−1}, would be U^l = g^l_θ(x^{l−1}). Considering this a learned feature of x, as illustrated in Figure 1, we build the encoder on this output as α^l = h^l_ϕ(U^l), draw z^l conditioned on α^l, and element-wise multiply z^l with U^l (with broadcast if needed) to produce the output of layer l as x^l. In this way, we use {θ, ϕ} to parameterize the encoder, which reuses the parameters θ of the decoder. To produce the dropout rates of the encoder, we only need the extra parameters ϕ, whose added memory and computational cost are often insignificant in comparison to those of the decoder.

2.3. EFFICIENT PARAMETERIZATION OF CONTEXTUAL DROPOUT MODULE

Denote the output of layer l by a multidimensional array (tensor) U^l = g^l_θ(x^{l−1}) ∈ R^{C^l_1 × ... × C^l_{D_l}}, where D_l denotes the number of dimensions of U^l and C^l_d denotes the number of elements along dimension d ∈ {1, . . . , D_l}. For efficiency, the output shape of h^l_ϕ is not matched to the shape of U^l. Instead, we make it smaller and broadcast the contextual dropout masks z^l across the dimensions of U^l (Tompson et al., 2015). Specifically, we parameterize the dropout logits α^l of the variational distribution to have C^l_d elements, where d ∈ {1, . . . , D_l} is a specified dimension of U^l. We sample z^l from the encoder and broadcast it across all but dimension d of U^l. We sample z^l ∼ Ber(σ(α^l)) under contextual Bernoulli dropout, and follow Srivastava et al. (2014) to use z^l ∼ N(1, σ(α^l)/(1 − σ(α^l))) for contextual Gaussian dropout. To obtain α^l ∈ R^{C^l_d}, we first take the average pooling of U^l across all but dimension d, with the output denoted as F_avepool,d(U^l), and then apply two fully-connected layers Φ^l_1 and Φ^l_2 connected by F_NL, a (Leaky) ReLU based nonlinear activation function, as

α^l = h^l_ϕ(U^l) = Φ^l_2(F_NL(Φ^l_1(F_avepool,d(U^l)))),

where Φ^l_1 is a linear transformation mapping from R^{C^l_d} to R^{C^l_d/γ}, while Φ^l_2 maps from R^{C^l_d/γ} back to R^{C^l_d}, with γ being a reduction ratio controlling the complexity of h^l_ϕ. Below we describe how to apply contextual dropout to three representative types of NN layers. Contextual dropout module for fully-connected layers: If layer l is a fully-connected layer and U^l ∈ R^{C^l_1 × ... × C^l_{D_l}}, we set α^l ∈ R^{C^l_{D_l}}, where D_l is the dimension that the linear transformation is applied to. Note that if U^l ∈ R^{C^l_1}, then α^l ∈ R^{C^l_1} and F_avepool,1 is an identity map, so α^l = Φ^l_2(F_NL(Φ^l_1(U^l))).
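The parameterization of h^l_ϕ above can be sketched in a few lines of numpy (a hedged illustration, not the paper's released implementation; `contextual_logits` and the toy shapes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contextual_logits(U, d, W1, b1, W2, b2):
    """alpha = Phi2(LeakyReLU(Phi1(avgpool_d(U)))): one logit per element of dim d."""
    axes = tuple(i for i in range(U.ndim) if i != d)
    pooled = U.mean(axis=axes)              # F_avepool,d: shape (C_d,)
    hidden = leaky_relu(pooled @ W1 + b1)   # Phi1 + F_NL: shape (C_d // gamma,)
    return hidden @ W2 + b2                 # Phi2: back to shape (C_d,)

# toy conv output: H x W x C with C = 16 channels, reduction ratio gamma = 8
H, W_, C, gamma = 4, 4, 16, 8
U = rng.standard_normal((H, W_, C))
W1 = rng.standard_normal((C, C // gamma)); b1 = np.zeros(C // gamma)
W2 = rng.standard_normal((C // gamma, C)); b2 = np.zeros(C)

alpha = contextual_logits(U, d=2, W1=W1, b1=b1, W2=W2, b2=b2)
probs = sigmoid(alpha)   # one dropout probability per channel
```

The bottleneck of width C_d/γ is what keeps the added parameter count at O((C_d)^2/γ), small relative to the main layer.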
Contextual dropout module for convolutional layers: Assume layer l is a convolutional layer with C^l_3 convolutional channels and U^l ∈ R^{C^l_1 × C^l_2 × C^l_3}. Similar to Spatial Dropout (Tompson et al., 2015), we set α^l ∈ R^{C^l_3} and broadcast its corresponding z^l spatially, as illustrated in Figure 2. This parameterization is similar to the squeeze-and-excitation unit for convolutional layers, which has been shown to be effective in image classification tasks (Hu et al., 2018). However, in squeeze-and-excitation, σ(α^l) is used as channel-wise soft attention weights instead of dropout probabilities; it therefore serves as a deterministic mapping in the model rather than a stochastic unit of the inference network. Contextual dropout module for attention layers: Dropout has been widely used in attention layers (Xu et al., 2015b; Vaswani et al., 2017; Yu et al., 2019). For example, it can be applied to multi-head attention weights after the softmax operation (see illustrations in Figure 2). The weights are of dimension [H, N_K, N_Q], where H is the number of heads, N_K the number of keys, and N_Q the number of queries. In this case, we find that setting α^l ∈ R^H gives good performance. Intuitively, this coincides with the choice of the channel dimension for convolutional layers, as heads in attention can be viewed as analogous to channels in convolution.
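For the convolutional case, sampling one Bernoulli mask per channel and broadcasting it spatially can be sketched as follows (a toy numpy illustration; `apply_channel_mask` and the shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def apply_channel_mask(U, probs, rng):
    """Sample one Bernoulli keep-mask per channel, then broadcast over H and W,
    as in Spatial Dropout: an entire feature map is kept or dropped together."""
    z = (rng.random(probs.shape) < probs).astype(U.dtype)   # shape (C,)
    return U * z[None, None, :], z                          # broadcast to (H, W, C)

U = rng.standard_normal((4, 4, 8))       # toy feature map: H x W x C
probs = np.full(8, 0.8)                  # per-channel keep probabilities
out, z = apply_channel_mask(U, probs, rng)
```

The same broadcast pattern applies to attention layers with α^l ∈ R^H: one mask value per head, broadcast across keys and queries.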

2.4. VARIATIONAL INFERENCE FOR CONTEXTUAL DROPOUT

In contextual dropout, we choose L(D) = Σ_{(x,y)∈D} L(x, y), shown in (2), as the optimization objective. Note that in our design, the encoder q_φ reuses the decoder parameters θ to define its own parameters. Therefore, we copy the values of θ into φ and stop the gradient of θ when optimizing q_φ. This is theoretically sound (Ba & Frey, 2013). Intuitively, the gradients to θ from p_θ are less noisy than those from q_φ, as the training of p_θ(y | x, z) is supervised while that of q_φ(z) is unsupervised. As expected, allowing gradients from q_φ to backpropagate to θ is found to adversely affect the training of p_θ in our experiments. We use a simple prior p_η, making the prior distributions of the dropout masks the same within each layer. The gradients with respect to η and θ can be expressed as

∇_η L(x, y) = E_{z ∼ q_φ(· | x)}[∇_η log p_η(z)],  ∇_θ L(x, y) = E_{z ∼ q_φ(· | x)}[∇_θ log p_θ(y | x, z)],   (4)

which are both estimated via Monte Carlo integration, using a single z ∼ q_φ(z | x) for each x. Now, we consider the gradient of L with respect to ϕ, the components of φ = {θ, ϕ} not copied from the decoder. For Gaussian contextual dropout, we estimate the gradients via the reparameterization trick (Kingma & Welling, 2013). For z^l ∼ N(1, σ(α^l)/(1 − σ(α^l))), we rewrite it as z^l = 1 + sqrt(σ(α^l)/(1 − σ(α^l))) ⊙ ε^l, where ε^l ∼ N(0, I). Similarly, sampling a sequence z = {z^l}_{l=1}^L from q_φ(z | x) can be rewritten as f_φ(ε, x), where f_φ is a deterministic differentiable mapping and ε are iid standard Gaussian. The gradient ∇_ϕ L(x, y) can now be expressed as (see the pseudo code of Algorithm 3 in Appendix)

∇_ϕ L(x, y) = E_{ε ∼ N(0,I)}[∇_ϕ(log p_θ(y | x, f_φ(ε, x)) − log [q_φ(f_φ(ε, x) | x) / p_η(f_φ(ε, x))])].   (5)

For Bernoulli contextual dropout, backpropagating the gradient efficiently is not straightforward, as the Bernoulli distribution is not reparameterizable, precluding the use of the reparameterization trick.
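The reparameterized sampling step for contextual Gaussian dropout can be sketched as follows (an illustrative numpy fragment; `sample_gaussian_mask` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_gaussian_mask(alpha, rng):
    """Reparameterized draw: z = 1 + sqrt(sigma(alpha) / (1 - sigma(alpha))) * eps,
    so z is a deterministic, differentiable function of alpha given the noise eps."""
    s = sigmoid(alpha)
    std = np.sqrt(s / (1.0 - s))
    eps = rng.standard_normal(np.shape(alpha))
    return 1.0 + std * eps
```

Because the noise ε is sampled independently of α, gradients with respect to the logits α flow through the mean and standard deviation of z, which is exactly what (5) exploits.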
In this case, a commonly used gradient estimator is the REINFORCE estimator (Williams, 1992) (see details in Appendix A). This estimator, however, is known to have high Monte Carlo estimation variance. To address this, we estimate ∇_ϕ L with the augment-REINFORCE-merge (ARM) estimator (Yin & Zhou, 2018), which provides unbiased and low-variance gradients for the parameters of Bernoulli distributions. We defer the details of this estimator to Appendix A. We note there exist improved ARM estimators (Yin et al., 2020; Dong et al., 2020), applying which could further improve the performance.
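For reference, a minimal sketch of the ARM estimator for a single vector of Bernoulli logits is given below (an assumption-laden illustration: `f` stands for any reward, such as the ELBO evaluated at a mask, and the sigmoid scaling factor t is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arm_gradient(f, logits, rng):
    """Single-sample ARM estimate of d/dlogits E_{z ~ Ber(sigmoid(logits))}[f(z)].

    One shared uniform draw u yields the 'true' and 'pseudo' masks; the reward
    difference times (u - 1/2) gives an unbiased, low-variance gradient."""
    u = rng.random(logits.shape)
    z_true = (u < sigmoid(logits)).astype(float)
    z_pseudo = (u > sigmoid(-logits)).astype(float)
    return (f(z_pseudo) - f(z_true)) * (u - 0.5)
```

Only two evaluations of f are needed per gradient estimate, regardless of the mask dimension, which keeps the estimator scalable.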

2.5. TESTING AND COMPLEXITY ANALYSIS

Testing stage: To obtain a point estimate, we follow the common practice in dropout (Srivastava et al., 2014) of multiplying the neurons by the expected values of the random dropout masks, which means that we predict y with p_θ(y | x, z̄), where z̄ = E_{q_φ(z | x)}[z] under the proposed contextual dropout. When uncertainty estimation is needed, we draw K random dropout masks to approximate the posterior predictive distribution of y given x using p(y | x) ≈ (1/K) Σ_{k=1}^K p_θ(y | x, z^(k)), where z^(1), . . . , z^(K) iid∼ q_φ(z | x). Complexity analysis: The added computation and memory of contextual dropout are insignificant due to the parameter sharing between the encoder and decoder. The extra memory and computational cost mainly come from h^l_ϕ, where both the parameter size and the number of operations are of order O((C^l_d)^2/γ), with γ typically from 8 to 16. This is insignificant compared to the memory and computational cost of the main network, which are of order larger than O((C^l_d)^2). We verify this point by providing memory and runtime comparisons between contextual dropout and other dropouts on ResNet in Table 3 (see more model size comparisons in Table 5 in Appendix).
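The test-time averaging over K sampled masks can be sketched as follows (a toy numpy illustration; `predict_fn` and `sample_mask_fn` are hypothetical stand-ins for the decoder and the encoder's mask sampler):

```python
import numpy as np

rng = np.random.default_rng(4)

def predictive_distribution(predict_fn, sample_mask_fn, x, K, rng):
    """Approximate p(y | x) by (1/K) sum_k p_theta(y | x, z_k), z_k ~ q_phi(z | x)."""
    probs = [predict_fn(x, sample_mask_fn(x, rng)) for _ in range(K)]
    return np.mean(probs, axis=0)

# toy 3-class predictor whose logits depend on the sampled mask (hypothetical)
def predict_fn(x, z):
    logits = np.array([x.sum(), z.mean(), 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sample_mask_fn(x, rng):
    return (rng.random(5) < 0.8).astype(float)   # toy Bernoulli masks

p = predictive_distribution(predict_fn, sample_mask_fn, np.ones(2), K=10, rng=rng)
```

The spread of the K per-sample class probabilities, not just their mean, is what the hypothesis-testing procedure of Section 3 operates on.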

2.6. RELATED WORK

Data-dependent variational distribution: Deng et al. (2018) model attention as latent-alignment variables and optimize a tighter lower bound (compared to hard attention) using a learned inference network. To balance exploration and exploitation in contextual bandit problems, Wang & Zhou (2019) introduce local variable uncertainty under the Thompson sampling framework. However, their inference networks are both independent of the decoder, which may considerably increase memory and computational cost for the considered applications. Fan et al. (2020) propose Bayesian attention modules with efficient parameter sharing between the encoder and decoder networks. Its scope is limited to attention units, as in Deng et al. (2018), while we demonstrate the general applicability of contextual dropout to fully connected, convolutional, and attention layers in supervised learning models. Conditional computation: Conditional computation (Bengio et al., 2015; 2013; Shazeer et al., 2017; Teja Mullapudi et al., 2018) tries to increase model capacity without a proportional increase in computation, where an independent gating network decides which parts of a network to activate and which to deactivate for each example. In contextual dropout, the encoder works much like a gating network, choosing the distribution of sub-networks for each sample. But the potential gain in model capacity is even larger; e.g., there are potentially ∼O((2^d)^L) combinations of nodes for L fully-connected layers, where d is the order of the number of nodes in one layer. Generalization of dropout: DropConnect (Wan et al., 2013) generalizes dropout by randomly dropping the weights rather than the activations. The dropout distributions of the weights, however, are still the same across different samples. Contextual dropout utilizes sample-dependent dropout probabilities, allowing different samples to have different dropout probabilities.

3. EXPERIMENTS

Our method can be straightforwardly deployed wherever regular dropout is utilized. To test its general applicability and scalability, we apply the proposed method to three representative types of NN layers: fully connected, convolutional, and attention layers, with applications on MNIST (LeCun et al., 2010), CIFAR (Krizhevsky et al., 2009), ImageNet (Deng et al., 2009), and VQA-v2 (Goyal et al., 2017). To investigate the model's robustness to noise, we also construct noisy versions of the datasets by adding Gaussian noise to the image inputs (Larochelle et al., 2007). For evaluation, we consider both the accuracy and the uncertainty of predicting y given x. Many metrics have been proposed to evaluate the quality of uncertainty estimation. On one hand, researchers generate calibrated probability estimates to measure model confidence (Guo et al., 2017; Naeini et al., 2015; Kuleshov et al., 2018). While expected calibration error and maximum calibration error have been proposed to quantitatively measure calibration, such metrics do not reflect how robust the probabilities are when noise is injected into the network input, and cannot capture epistemic or model uncertainty (Gal & Ghahramani, 2016). On the other hand, the entropy of the predictive distribution, as well as the mutual information between the predictive distribution and the posterior over network weights, are used as metrics that capture both epistemic and aleatoric uncertainty (Mukhoti & Gal, 2018). However, it is often unclear how large the entropy or mutual information needs to be for a prediction to be classified as uncertain, so such metrics only provide a relative uncertainty measure. Hypothesis testing based uncertainty estimation: Unlike previous information theoretic metrics, we use a statistical test based method to estimate uncertainty, which works for both single-label and multi-label classification models.
One advantage of using hypothesis testing over information theoretic metrics is that the p-value of the test is more interpretable, making it easier to deploy in practice to obtain a binary uncertainty decision. To quantify how confident our model is about a prediction, we evaluate whether the difference between the empirical distributions of the two most probable classes, estimated from multiple posterior samples, is statistically significant. Please see Appendix D for a detailed explanation of the test procedure. Uncertainty evaluation via PAvPU: With the p-value of the test result and a given p-value threshold, we can determine whether the model is certain or uncertain about a prediction. To evaluate the uncertainty estimates, we use Patch Accuracy vs Patch Uncertainty (PAvPU) (Mukhoti & Gal, 2018), which is defined as PAvPU = (n_ac + n_iu)/(n_ac + n_au + n_ic + n_iu), where n_ac, n_au, n_ic, n_iu are the numbers of accurate-and-certain, accurate-and-uncertain, inaccurate-and-certain, and inaccurate-and-uncertain samples, respectively. This PAvPU evaluation metric is higher if the model tends to generate accurate predictions with high certainty and inaccurate predictions with high uncertainty.
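Given binary accuracy and certainty decisions (the latter obtained by thresholding the p-values), PAvPU reduces to four counts; a minimal sketch with a hypothetical `pavpu` helper:

```python
import numpy as np

def pavpu(accurate, certain):
    """PAvPU = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu).

    accurate: boolean array, prediction correct per sample.
    certain: boolean array, e.g. p-value of the test below a chosen threshold."""
    accurate = np.asarray(accurate, dtype=bool)
    certain = np.asarray(certain, dtype=bool)
    n_ac = np.sum(accurate & certain)
    n_au = np.sum(accurate & ~certain)
    n_ic = np.sum(~accurate & certain)
    n_iu = np.sum(~accurate & ~certain)
    return (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)
```

A model that is confidently right and hesitantly wrong scores close to 1; a model that is confidently wrong is penalized through n_ic.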

3.1. CONTEXTUAL DROPOUT ON FULLY CONNECTED LAYERS

We consider an MLP with two hidden layers of size 300 and 100, respectively, with ReLU activations. Dropout is applied to the input layer and the outputs of the first two fully connected layers. We use MNIST as the benchmark. We compare contextual dropout with MC dropout (Gal & Ghahramani, 2016), concrete dropout (Gal et al., 2017), Gaussian dropout (Srivastava et al., 2014), and Bayes by Backprop (Blundell et al., 2015). Please see the detailed experimental setting in Appendix C.1. The results for a p-value threshold of 0.05 are in general consistent with the results for other p-value thresholds (see more in Table 6 in Appendix). We observe that contextual dropout outperforms the other methods on all metrics. Moreover, compared to Bayes by Backprop, contextual dropout is more memory and computationally efficient. As shown in Table 5 in Appendix, contextual dropout only introduces 16% additional parameters, whereas Bayes by Backprop doubles the memory and increases the computation significantly, as we need multiple draws of NN weights for uncertainty. For this reason, we do not include it in the following large-model evaluations. We note that using the output of the gating network to directly scale activations (contextual gating) underperforms contextual dropout, which shows that the sampling process is important for preventing overfitting and improving robustness to noise. Adding a regular dropout layer on the gating activations (contextual gating + dropout) improves the results a little, but still underperforms contextual dropout, demonstrating that how we use the gating activations matters. In Figure 3, we observe that Bernoulli contextual dropout learns different dropout probabilities for different samples, adapting to sample-level uncertainty, which further verifies our motivation and supports the empirical improvements.
For sample-dependent dropout, the dropout probabilities do not vanish to zero even though the prior for regularization is also learned, because the optimal dropout probabilities for each sample are not necessarily zero. Enabling different samples to have different network connections can greatly enhance the model's capacity. The prior distribution also plays a different role here: instead of preventing the dropout probabilities from going to zero, the prior tries to impose some similarity between the dropout probabilities of different samples.

3.2. CONTEXTUAL DROPOUT ON CONVOLUTIONAL LAYERS

We apply dropout to the convolutional layers in WRN (Zagoruyko & Komodakis, 2016). In Figure 6 in Appendix, we show the architecture of WRN, where dropout is applied to the first convolutional layer in each network block; in total, dropout is applied to 12 convolutional layers. We evaluate on CIFAR-10 and CIFAR-100. The detailed setting is provided in Appendix C.1. As shown in Figure 4, on an image where all three methods have high uncertainty, contextual dropout places a higher probability on the correct answer than the other two. These observations verify that contextual dropout provides better calibrated uncertainty. Large-scale experiments with ImageNet: Contextual dropout is also applied to the convolutional layers in ResNet-18, where we plug contextual dropout into a pretrained model and fine-tune it on ImageNet. In Table 3, we show that it is even possible to finetune a pretrained model with the contextual dropout module: without much additional memory or run-time cost, it achieves better performance than both the original model and the one with regular Gaussian dropout. Training the model with contextual dropout from scratch further improves the performance. See the detailed experimental setting in Appendix C.1.

3.3. CONTEXTUAL DROPOUT ON ATTENTION LAYERS

We further apply contextual dropout to the attention layers of VQA models, whose goal is to provide an answer to a question relevant to the content of a given image. We conduct experiments on the commonly used benchmark VQA-v2 (Goyal et al., 2017), containing human-annotated question-answer (QA) pairs. There are three types of questions: Yes/No, Number, and Other. In Figure 5, we show one example of each question type. For each question, there are 10 answers provided by 10 different human annotators (see the explanation of evaluation metrics in Appendix C.2). As shown in the examples, VQA is generally so challenging that there are often several different human annotations for a given image. Therefore, good uncertainty estimation becomes even more necessary.

Model and training specifications:

We use MCAN (Yu et al., 2019), a state-of-the-art Transformer-like model for VQA. Self-attention layers for question features and visual features, as well as question-guided attention layers for visual features, are stacked one over another to build a deep model. Dropout is applied in every attention layer (after the softmax and before the residual layer (Vaswani et al., 2017)) and every fully-connected layer to prevent overfitting (Yu et al., 2019), resulting in 62 dropout layers. Results and analysis: We compare different dropouts on both the original VQA dataset and a noisy version, where Gaussian noise with standard deviation 5 is added to the visual features. In Table 4, we show the overall accuracy and uncertainty estimation. The results show that on the original data, contextual dropout achieves better accuracy and uncertainty estimation than the others. Moreover, on noisy data, where prediction becomes more challenging and requires more model flexibility and robustness, contextual dropouts outperform their regular dropout counterparts by a large margin in terms of accuracy, with consistent improvement across all three question types. Visualization: In Figures 12-15 in Appendix F.3, we visualize some image-question pairs along with the human annotations, and compare the predictions and uncertainty estimates of different dropouts. We show three of them in Figure 5. As shown in the plots, overall contextual dropout is more conservative on its wrong predictions and more certain on its correct predictions than the other methods (see more detailed explanations in Appendix F.3).

4. CONCLUSION

We introduce contextual dropout as a simple and scalable data-dependent dropout module that achieves strong performance in both accuracy and uncertainty estimation on a variety of tasks, including large-scale applications. With an efficient parameterization of the covariate-dependent variational distribution, contextual dropout boosts the flexibility of Bayesian neural networks with only slightly increased memory and computational cost. We demonstrate the general applicability of contextual dropout on fully connected, convolutional, and attention layers, and also show that contextual dropout masks are compatible with both Bernoulli and Gaussian distributions. On both image classification and visual question answering tasks, contextual dropout consistently outperforms the corresponding baselines. Notably, on ImageNet, we find it is possible to improve the performance of a pretrained model by adding the contextual dropout module during a finetuning stage. Based on these results, we believe contextual dropout can serve as an efficient alternative to data-independent dropouts in the versatile toolbox of dropout modules.
Algorithm 2: Bernoulli contextual dropout with independent ARM
Input: data D, reward r, decoder layers {g^l_θ}_{l=1}^L, encoder layers {h^l_ϕ}_{l=1}^L, step size s
Output: updated θ, ϕ, η
repeat
  G_ϕ = 0; sample (x, y) from D; x^0_true = x
  for l = 1 to L do
    U^l = g^l_θ(x^{l−1}_true); α^l = h^l_ϕ(U^l)
    sample π^l from Uniform(0, 1); z^l_true := 1[π^l < σ_t(α^l)]; x^l_true = U^l ⊙ z^l_true
  end for
  r_true = r(x^L_true, y)
  x^0_pseudo = x
  for l = 1 to L do
    U^l_pseudo = g^l_θ(x^{l−1}_pseudo); α^l_pseudo = h^l_ϕ(U^l_pseudo)
    z^l_pseudo := 1[π^l > σ_t(−α^l_pseudo)]; x^l_pseudo = U^l_pseudo ⊙ z^l_pseudo
  end for
  r_pseudo = r(x^L_pseudo, y)
  for l = 1 to L do
    G_ϕ = G_ϕ + t (r_true − r_pseudo)(1/2 − π^l) ∇_ϕ α^l
  end for
  ϕ = ϕ + s G_ϕ
  θ = θ + s ∂ log p_θ(y | x, z_{1:L,true}) / ∂θ
  η = η + s ∂ log p_η(z_{1:L,true}) / ∂η
until convergence

Algorithm 3: Gaussian contextual dropout with the reparameterization trick
Input: data D, reward r, decoder layers {g^l_θ}_{l=1}^L, encoder layers {h^l_ϕ}_{l=1}^L, step size s
Output: updated θ, ϕ, η
repeat
  sample (x, y) from D; x^0 = x
  for l = 1 to L do
    U^l = g^l_θ(x^{l−1}); α^l = h^l_ϕ(U^l)
    sample ε^l from N(0, 1); τ^l = sqrt(σ_t(α^l)/(1 − σ_t(α^l))); z^l := 1 + τ^l ⊙ ε^l; x^l = U^l ⊙ z^l
  end for
  ϕ = ϕ + s ∇_ϕ(log p_θ(y | x, z_{1:L}) − log [q_φ(z_{1:L} | x) / p_η(z_{1:L})])
  θ = θ + s ∂ log p_θ(y | x, z_{1:L}) / ∂θ
  η = η + s ∂ log p_η(z_{1:L}) / ∂η
until convergence

C. DETAILS OF EXPERIMENTS

All experiments are conducted using a single Nvidia Tesla V100 GPU. Choice of hyper-parameters in contextual dropout: Contextual dropout introduces two additional hyperparameters compared to regular dropout. One is the reduction ratio γ for the encoder network. In our experiments, the results are not sensitive to the value of γ: any number from 8 to 16 gives similar results, as also observed in (Hu et al., 2018). The other is the sigmoid scaling factor t that controls the learning rate of the encoder.
We find that the performance is not very sensitive to its value, and it is often beneficial to make it smaller than the learning rate of the decoder. In all experiments considered in the paper, which cover various noise levels and model sizes, we simply fix t = 0.01.
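As a concrete illustration of the two procedures above, the following is a minimal NumPy sketch of a single contextual-dropout layer in both its Bernoulli and Gaussian flavors. The one-layer linear encoder standing in for h_ϕ^l is a hypothetical simplification (the paper's encoder is a small network with reduction ratio γ), and the scaled sigmoid uses the default t = 0.01 discussed above.

```python
import numpy as np

def scaled_sigmoid(alpha, t=0.01):
    # sigma_t(alpha) = 1 / (1 + exp(-t * alpha)); t controls the encoder's effective learning rate
    return 1.0 / (1.0 + np.exp(-t * alpha))

def contextual_dropout_layer(U, W_enc, b_enc, mode="bernoulli", t=0.01, rng=None):
    """Apply a sample-dependent dropout mask to activations U of shape (batch, features).

    W_enc and b_enc parameterize a toy one-layer encoder producing per-sample
    dropout logits alpha from the activations themselves, so different samples
    receive different dropout probabilities.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = U @ W_enc + b_enc            # per-sample logits, shape (batch, features)
    keep = scaled_sigmoid(alpha, t)      # keep probability sigma_t(alpha)
    if mode == "bernoulli":
        pi = rng.uniform(size=U.shape)
        z = (pi < keep).astype(U.dtype)  # z_true = 1[pi < sigma_t(alpha)]
    else:
        eps = rng.standard_normal(U.shape)
        tau = (1.0 - keep) / keep        # noise scale derived from the keep probability
        z = 1.0 + tau * eps              # reparameterized multiplicative Gaussian noise
    return U * z, z
```

The Bernoulli branch matches the sampling step of Algorithm 2; the Gaussian branch matches the reparameterized noise of Algorithm 3.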

C.1 IMAGE CLASSIFICATION

MLP:

We consider an MLP with two hidden layers of size 300 and 100, respectively, and ReLU activations. Dropout is applied to all three fully connected layers. We use MNIST as the benchmark. All models are trained for 200 epochs with batch size 128 and the Adam optimizer (Kingma & Ba, 2014) (β1 = 0.9, β2 = 0.999); the learning rate is 0.001. We compare contextual dropout with MC dropout (Gal & Ghahramani, 2016) and concrete dropout (Gal et al., 2017). For MC dropout, we use a hand-tuned dropout rate of 0.2. For concrete dropout, we initialize the dropout rate at 0.2 for Bernoulli dropout and the standard deviation parameter at 0.5 for Gaussian dropout, and set the Concrete temperature at 0.1 (Gal et al., 2017). We initialize the weights in contextual dropout with He initialization, preserving the magnitude of the variance of the weights in the forward pass (He et al., 2015). We initialize the biases such that the dropout rate is 0.2 when the weights of contextual dropout are zero, and we also initialize the prior dropout rate at 0.2. For hyperparameter tuning, we hold out 10,000 samples randomly selected from the training set for validation; we then use the chosen hyperparameters to train on the full training set (60,000 samples) and evaluate on the test set (10,000 samples). We use Leaky ReLU (Xu et al., 2015a) with slope 0.1 as the non-linear operator in contextual dropout. The reduction ratio γ is set to 10 and the sigmoid scaling factor t to 0.01. For Bayes by Backprop, we use -log σ1 = 0, -log σ2 = 6, π = 0.2 (following the notation of the original paper). For evaluation, we set M = 20 (the number of posterior samples).
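The bias initialization described above (a dropout rate of 0.2 when the encoder weights are zero) can be solved for in closed form. Below is a small sketch under the scaled-sigmoid parameterization with t = 0.01; the exact parameterization of the encoder output is our assumption here.

```python
import numpy as np

def bias_for_dropout_rate(p_drop, t=0.01):
    # Solve sigma_t(b) = 1 - p_drop for the encoder bias b, so that with zero
    # encoder weights (alpha = b) the layer starts at the target dropout rate.
    keep = 1.0 - p_drop
    return np.log(keep / (1.0 - keep)) / t

b = bias_for_dropout_rate(0.2)                     # target dropout rate 0.2
keep_at_init = 1.0 / (1.0 + np.exp(-0.01 * b))     # recovers keep probability 0.8
```

The same recipe gives the 0.1 initial dropout rate used for the VQA experiments by calling `bias_for_dropout_rate(0.1)`.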

WRN:

We consider WRN (Zagoruyko & Komodakis, 2016), which contains 25 convolutional layers. In Figure 6, we show the architecture of WRN, where dropout is applied to the first convolutional layer in each network block; in total, dropout is applied to 12 convolutional layers. We use CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009) as benchmarks. All models are trained for 200 epochs with the Nesterov momentum optimizer (Nesterov, 1983), whose base learning rate is set to 0.1, with a decay factor of 1/5 at epochs 60 and 120. All other hyperparameters are the same as for the MLP, except for Gaussian dropout, where we use a standard deviation of 0.8 for CIFAR-100 with no noise and 1 for all other cases.

ResNet: We use ResNet-18 as the baseline model. We use momentum SGD with learning rate 0.1 and momentum weight 0.9. Weight decay is used with weight 1e-4. For models trained from scratch, we train for 90 epochs. For finetuned models, we start from pretrained baseline ResNet models and finetune for 1 epoch.

C.2 VQA

Dataset: The dataset is split into training (80k images and 444k QA pairs), validation (40k images and 214k QA pairs), and testing (80k images and 448k QA pairs) sets. We evaluate on the validation set, as the true labels for the test set are not publicly available (Deng et al., 2018).

Evaluation metric: The evaluation for VQA differs from image classification: the accuracy for a single answer can be a number between 0 and 1 (Goyal et al., 2017): Acc(ans) = min{(#humans that said ans)/3, 1}. We generalize the uncertainty evaluation accordingly: n_ac = Σ_i Acc_i Cer_i, n_iu = Σ_i (1 - Acc_i)(1 - Cer_i), n_au = Σ_i Acc_i (1 - Cer_i), n_ic = Σ_i (1 - Acc_i) Cer_i, where for the i-th prediction Acc_i is the accuracy and Cer_i ∈ {0, 1} is the certainty indicator.

Experimental setting: We follow the setting of Yu et al. (2019), where bottom-up features extracted from images by Faster R-CNN (Ren et al., 2015) are used as visual features, and pretrained word embeddings (Pennington et al., 2014) and an LSTM (Hochreiter & Schmidhuber, 1997) are used to extract question features. We adopt the encoder-decoder structure in MCAN with six co-attention layers. We use the same model hyperparameters and training settings as Yu et al. (2019): the dimensionalities of the input image features, input question features, and fused multi-modal features are set to 2048, 512, and 1024, respectively. The latent dimensionality in the multi-head attention is 512, the number of heads is 8, and the latent dimensionality of each head is 64. The size of the answer vocabulary is set to N = 3129 using the strategy in Teney et al. (2018). To train the MCAN model, we use the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9 and β2 = 0.98. The base learning rate is set to min(2.5t × 1e-5, 1e-4), where t is the current epoch number starting from 1. After 10 epochs, the learning rate is decayed by 1/5 every 2 epochs.
All models are trained for up to 13 epochs with the same batch size of 64. We conduct training only on the training set (no data augmentation with the Visual Genome dataset) and evaluate on the validation set. For MC dropout, we use a dropout rate of 0.1 for Bernoulli dropout, as in Yu et al. (2019), and a standard deviation parameter of 1/3 for Gaussian dropout. For concrete dropout, we initialize the dropout rate at 0.1 and set the Concrete temperature at 0.1 (Gal et al., 2017). For hyperparameter tuning, we randomly hold out 20% of the training set for validation; after tuning, we train on the whole training set and evaluate on the validation set. We initialize the weights with He initialization, preserving the magnitude of the variance of the weights in the forward pass (He et al., 2015). We initialize the biases such that the dropout rate is 0.1 when the weights of contextual dropout are zero, and we also initialize the prior dropout rate at 0.1. We use ReLU as the non-linear operator in contextual dropout. We use γ = 8 for layers with C_l^d > 8, and γ = 1 otherwise. We set α ∈ R^{d_V} for residual layers.
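The VQA accuracy rule and the four uncertainty counts defined in the evaluation-metric paragraph are straightforward to compute. The sketch below also includes a PAvPU-style summary ratio (the fraction of "desirable" outcomes: accurate-and-certain plus inaccurate-and-uncertain); that this exact ratio matches the PAvPU numbers reported in the paper's tables is our assumption.

```python
import numpy as np

def vqa_accuracy(num_humans_said_ans):
    # Acc(ans) = min{(#humans that said ans) / 3, 1}
    return min(num_humans_said_ans / 3.0, 1.0)

def uncertainty_counts(acc, cer):
    """acc: per-sample accuracies in [0, 1]; cer: certainty indicators in {0, 1}."""
    acc = np.asarray(acc, dtype=float)
    cer = np.asarray(cer, dtype=float)
    n_ac = np.sum(acc * cer)              # accurate and certain
    n_iu = np.sum((1 - acc) * (1 - cer))  # inaccurate and uncertain
    n_au = np.sum(acc * (1 - cer))        # accurate and uncertain
    n_ic = np.sum((1 - acc) * cer)        # inaccurate and certain
    return n_ac, n_iu, n_au, n_ic

def pavpu(acc, cer):
    # Assumed summary: desirable outcomes over all outcomes.
    n_ac, n_iu, n_au, n_ic = uncertainty_counts(acc, cer)
    return (n_ac + n_iu) / (n_ac + n_iu + n_au + n_ic)
```

For a prediction that 2 of 10 annotators gave, `vqa_accuracy(2)` returns 2/3, so the counts are weighted rather than binary, unlike in single-label image classification.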

D STATISTICAL TEST FOR UNCERTAINTY ESTIMATION

Consider M posterior samples of predictive probabilities {p_m}_{m=1}^M, where p_m is a vector with the same dimension as the number of classes. For single-label classification models, p_m is produced by a softmax layer and sums to one, while for multi-label classification models, p_m is produced by a sigmoid layer and each element is between 0 and 1. The former output is used in most image classification models, while the latter is often used in VQA, where multiple answers could be true for a single input. In both cases, to quantify how confident our model is about a prediction, we evaluate whether the difference between the probabilities of the first and second highest classes is statistically significant using a statistical test. We conduct a normality test on the output probabilities for both the image classification and VQA models and find that most of the output probabilities are approximately normal (we randomly pick some Q-Q plots (Ghasemi & Zahediasl, 2012) and show them in Figures 7 and 8). This motivates us to use the two-sample t-test. In the following, we briefly summarize the two-sample t-tests we use. Two-sample hypothesis testing is an inferential statistical test that determines whether there is a statistically significant difference between the means of two groups. The null hypothesis for the t-test is that the population means of the two groups are equal, µ1 = µ2, and the alternative hypothesis is µ1 ≠ µ2. Depending on whether each sample in one group can be paired with a sample in the other group, we have either the paired t-test or the independent t-test. In our experiments, we use both types of two-sample t-test. For a single-label model, the probabilities of two classes are dependent due to the softmax layer, so we use the paired two-sample t-test; for a multi-label model, the probabilities are independent given the logits of the output layer, so we use the independent two-sample t-test.
For the paired two-sample t-test, we calculate the differences between the paired observations and compute the t-statistic T = D̄ / (s/√N), where D̄ is the mean of the paired differences, s is the standard deviation of the differences, and N is the number of observations. Under the null hypothesis, this statistic follows a t-distribution with N - 1 degrees of freedom if the differences are normally distributed. We then use this t-statistic and t-distribution to calculate the corresponding p-value. For the independent two-sample t-test, we compute T = (Ȳ1 - Ȳ2) / √(s_p²/N1 + s_p²/N2), with pooled variance s_p² = (Σ(y1 - Ȳ1)² + Σ(y2 - Ȳ2)²) / (N1 + N2 - 2), where N1 and N2 are the sample sizes and Ȳ1 and Ȳ2 are the sample means. Under the null hypothesis, this statistic follows a t-distribution with N1 + N2 - 2 degrees of freedom if both y1 and y2 are normally distributed. We calculate the p-value accordingly.
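The two statistics can be written directly from the formulas above. A pure-NumPy sketch follows; converting a statistic to a p-value would additionally require a t-distribution CDF (e.g., from scipy.stats), which we omit here.

```python
import numpy as np

def paired_t_statistic(y1, y2):
    # T = dbar / (s / sqrt(N)), where dbar and s are the mean and standard
    # deviation of the paired differences and N is the number of pairs.
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))  # df = n - 1

def independent_t_statistic(y1, y2):
    # T = (ybar1 - ybar2) / sqrt(sp2/N1 + sp2/N2), with pooled variance
    # sp2 = (sum((y1 - ybar1)^2) + sum((y2 - ybar2)^2)) / (N1 + N2 - 2).
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = y1.size, y2.size
    sp2 = (np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)) / (n1 + n2 - 2)
    return (y1.mean() - y2.mean()) / np.sqrt(sp2 / n1 + sp2 / n2)  # df = n1 + n2 - 2
```

In our setting, y1 and y2 would be the M posterior samples of the highest and second highest class probabilities for a given input.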

F.2 BOXPLOT FOR CIFAR-10

In this section, we visualize the 5 most uncertain images for each dropout (we only include MC dropout, concrete dropout, and contextual Bernoulli dropout for simplicity), leading to 15 images in total. The true images with their labels are on the left side, and boxplots of the probability distributions under different dropouts are on the right side. All models are trained on the original CIFAR-10 dataset. Among these 15 images, we observe that contextual dropout predicts the right answer when it is certain, and it is certain and predicts the right answer on many images that MC dropout or concrete dropout is uncertain about (e.g., many images in Figures 9-10). However, MC dropout or concrete dropout is uncertain about some easy examples (images in Figures 9-10) or certain on some wrong predictions (images in Figure 11). Moreover, on images where all three methods have high uncertainty, concrete dropout often places a higher probability on the correct answer than the other two methods (images in Figure 11). For the VQA visualizations (Figures 12-15), for each sample we manually rank the different methods by the general rule that accurate and certain is the most preferred, followed by accurate and uncertain, inaccurate and uncertain, and then inaccurate and certain. For each image-question pair, we rank the three dropouts based on their answers and p-values, and highlight the best performing one, the second best, and the worst with green, yellow, and red, respectively (tied ranks are allowed). As shown in the plots, contextual dropout is overall more conservative on its wrong predictions and more certain on its correct predictions than the other methods, for both randomly selected images and challenging images.



Note that fully connected layers can be applied to multi-dimensional tensors as long as we specify the dimension along which the summation operation is conducted (Abadi et al., 2015).
Results on the original data are deferred to Table 6 in the Appendix.
Note that we also tried a nonparametric test, the Wilcoxon rank-sum test, and obtained similar results.



Figure 1: A contextual dropout module.

Figure 2: Left: Contextual dropout in convolution layers. Right: Contextual dropout in attention layers.

Figure 3: Visualization of dropout probabilities of Bernoulli contextual dropout on the MNIST dataset: the learned dropout probabilities tend to increase in higher-level layers, as also observed in Gal et al. (2017). With contextual dropout, different samples have different dropout probabilities; inaccurate predictions often have higher dropout probabilities, corresponding to higher uncertainties. In the rightmost figure, we compare the dropout distributions across 3 representative digits. The dropout probabilities are overall higher for digit 8 than for digit 1, suggesting that 1 is easier to classify. The distribution for digit 5 has longer tails than the others, showing that there is more variation in the uncertainty for digit 5.

Figure 4: The performance of combining different dropouts with deep ensemble on noisy MNIST data.

Figure 5: VQA visualization: for each question type, we present an image-question pair along with human annotations. We manually classify each prediction by different methods based on their answers and p-values. For questions that have a clear answer, we define the good as certain & accurate, the average as uncertain & accurate or uncertain & inaccurate, and the bad as certain & inaccurate. Otherwise, we define the good as uncertain & accurate, the average as certain & accurate or uncertain & inaccurate, and the bad as certain & inaccurate.

Figure 6: Architecture of the Wide Residual Network.

Figure 9: Visualization of probability outputs of different dropouts on CIFAR-10. The 5 plots for which Concrete Dropout is the most uncertain are presented. Number-to-class map: {0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck}.

Figure 10: Visualization of probability outputs of different dropouts on CIFAR-10. The 5 plots for which MC Dropout is the most uncertain are presented. Number-to-class map: {0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck}.

Figure 12: VQA visualization: the 6 plots for which Contextual Dropout is the most uncertain are presented.

Figure 13: VQA visualization: the 6 plots for which Concrete Dropout is the most uncertain are presented.

Figure 14: VQA visualization: the 6 plots for which MC Dropout is the most uncertain are presented.

Figure 15: VQA visualization: 12 randomly selected plots are presented.

Results on noisy MNIST with MLP.

Results on CIFAR-100 with WRN. We show the results for CIFAR-100 in Table 2 (see the CIFAR-10 results in Tables 8-9 in the Appendix). Accuracies, PAvPUs, and test predictive log-likelihoods are reported for both the original and noisy data. We consistently observe that contextual dropout outperforms the other models in accuracy, uncertainty estimation, and log-likelihood.

Results on ImageNet with ResNet-18.

Accuracy and PAvPU on visual question answering.

Model size comparison among different methods.

ACKNOWLEDGEMENTS

The authors acknowledge the support of Grants IIS-1812699, IIS-1812641, ECCS-1952193, CCF-1553281, and CCF-1934904 from the U.S. National Science Foundation, and the Texas Advanced Computing Center for providing HPC resources that have contributed to the research results reported within this paper. M. Zhou acknowledges the support of a gift fund from ByteDance Inc.

Appendix

A DETAILS OF ARM GRADIENT ESTIMATOR FOR BERNOULLI CONTEXTUAL DROPOUT

In this section, we explain the implementation details of ARM for Bernoulli contextual dropout. To compute the gradients with respect to the parameters of the variational distribution, a commonly used gradient estimator is the REINFORCE estimator (Williams, 1992). This estimator is, however, known to have high variance (Yin & Zhou, 2018). To mitigate this issue, we use ARM to compute the gradient with respect to the Bernoulli random variables.

ARM gradient estimator: In general, denoting σ(α) = 1/(1 + e^{-α}) the sigmoid function, ARM expresses the gradient of E_{z∼Bernoulli(σ(α))}[f(z)] with respect to α as

    ∇_α E_{z∼Bernoulli(σ(α))}[f(z)] = E_{π∼Uniform(0,1)}[(f(z_true) - f(z_pseudo))(1/2 - π)],

where z_true := 1[π < σ(α)] and z_pseudo := 1[π > σ(-α)] are referred to as the true and pseudo actions, respectively.

Sequential ARM: Note that the above equation is not directly applicable to our model due to the cross-layer dependence. However, the dropout masks within each layer are independent of each other conditioned on those of the previous layers, so we can break the expectation into a sequence and apply ARM sequentially. When computing ∇_ϕ L, we can ignore the ϕ in r, as the expectation of ∇_ϕ log q_ϕ(z | x) is zero. Using the chain rule, with r(x, z_{1:l}, y) := E_{z_{l+1:L}∼q_ϕ(· | x, z_{1:l})}[r(x, z, y)], we have

    ∇_ϕ L = Σ_{l=1}^{L} E_{z_{1:l-1}∼q_ϕ(· | x), π^l∼Uniform(0,1)}[(r(x, z_{1:l-1}, z^l_true, y) - r(x, z_{1:l-1}, z^l_pseudo, y))(1/2 - π^l) ∇_ϕ α^l],

where z^l_true := 1[π^l < σ(α^l)] and z^l_pseudo := 1[π^l > σ(-α^l)]. We estimate the gradients via Monte Carlo integration and provide the pseudo code in Algorithm 1.

Implementation details: The computational complexity of sequential ARM is O(L) times that of the decoder computation. Although it is embarrassingly parallelizable, in practice, with limited computational resources available, it may be challenging to use sequential ARM when L is fairly large. In such cases, the original non-sequential ARM can be viewed as an approximation that strikes a good balance between efficiency and accuracy (see the pseudo code in Algorithm 2 in the Appendix).
In our cases, for the image classification models, L is small enough (3 for MLP, 12 for WRN) for us to use sequential ARM. For VQA, L is as large as 62, and hence we choose the non-sequential ARM. To control the learning rate of the encoder, we use a scaled sigmoid function, σ_t(α) = 1/(1 + e^{-tα}), where a larger t corresponds to a larger learning rate for the encoder. This function is also used in Li & Ji (2019) to facilitate the transition of the probability between 0 and 1 for the purpose of pruning NN weights.
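The single-variable ARM identity above can be sanity-checked numerically: for a Bernoulli variable, the expectation E_{z∼Ber(σ(α))}[f(z)] = σ(α)f(1) + (1 - σ(α))f(0) has the closed-form gradient σ(α)(1 - σ(α))(f(1) - f(0)), which the Monte Carlo ARM estimate should recover. A self-contained sketch with an arbitrary toy f:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def arm_gradient(f, alpha, n_samples=100_000, seed=0):
    # ARM: grad_alpha E_{z ~ Ber(sigmoid(alpha))}[f(z)]
    #    = E_{pi ~ Uniform(0,1)}[(f(z_true) - f(z_pseudo)) * (1/2 - pi)]
    pi = np.random.default_rng(seed).uniform(size=n_samples)
    z_true = (pi < sigmoid(alpha)).astype(float)     # true action 1[pi < sigma(alpha)]
    z_pseudo = (pi > sigmoid(-alpha)).astype(float)  # pseudo action 1[pi > sigma(-alpha)]
    return np.mean((f(z_true) - f(z_pseudo)) * (0.5 - pi))

f = lambda z: (z - 0.3) ** 2   # toy reward
alpha = 0.5
s = sigmoid(alpha)
exact = s * (1 - s) * (f(1.0) - f(0.0))  # closed-form gradient
approx = arm_gradient(f, alpha)          # Monte Carlo ARM estimate
```

With 100,000 samples the estimate agrees with the closed form to roughly three decimal places, which illustrates why ARM is preferred over the higher-variance REINFORCE estimator here.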

B ALGORITHMS

Below, we present the training algorithms for both Bernoulli and Gaussian contextual dropout: Algorithm 1 (Bernoulli contextual dropout with sequential ARM), Algorithm 2 (its independent-ARM variant), and Algorithm 3 (Gaussian contextual dropout).

Normality checks: We include Q-Q plots of the output probabilities as normality checks for the assumptions of the two-sample t-test. In Figure 7, we test the normality of the differences between the highest and second highest probabilities for the WRN model with contextual dropout trained on the original CIFAR-10 dataset. In Figure 8, we test the normality of the highest and second highest probabilities (separately) for the VQA model with contextual dropout trained on the original VQA-v2 dataset. We use 20 data points for the plots.

