ON DROPOUT, OVERFITTING, AND INTERACTION EFFECTS IN DEEP NEURAL NETWORKS

Abstract

We examine Dropout through the perspective of interactions. Given N variables, there are O(N^2) possible pairwise interactions and O(N^3) possible 3-way interactions; in general, there are O(N^k) possible interactions of k variables. Conversely, the probability of an interaction of k variables surviving Dropout at rate p is O((1-p)^k). In this paper, we show that these rates cancel, and as a result, Dropout selectively regularizes against learning higher-order interactions. We prove this new perspective analytically for Input Dropout and demonstrate it empirically for Activation Dropout. This perspective on Dropout has several practical implications: (1) higher Dropout rates should be used when we need stronger regularization against spurious high-order interactions, (2) caution must be used when interpreting Dropout-based feature saliency measures, and (3) networks trained with Input Dropout are biased estimators, even with infinite data. We also compare Dropout to regularization via weight decay and early stopping and find that it is difficult to obtain the same selective regularization against high-order interactions with these methods.

1. INTRODUCTION

We examine Dropout through the perspective of interactions: learned effects that require multiple input variables. Given N variables, there are O(N^2) possible pairwise interactions, O(N^3) possible 3-way interactions, and so on. We show that Dropout contributes a regularization effect which helps neural networks (NNs) explore simpler functions of lower-order interactions before considering functions of higher-order interactions. Dropout imposes this regularization by reducing the effective learning rate of an interaction effect according to the number of variables the interaction involves. As a result, Dropout encourages models to learn simpler functions of lower-order additive components. This understanding of Dropout has implications for choosing Dropout rates: higher Dropout rates should be used when we need stronger regularization against spurious high-order interactions. This perspective also issues caution against using Dropout to measure term saliency, because Dropout regularizes against high-order interaction terms. Finally, this view of Dropout as a regularizer of interaction effects provides insight into the varying effectiveness of Dropout for different architectures and data sets. We also compare Dropout to regularization via weight decay and early stopping and find that it is difficult to obtain the same regularization effect for high-order interactions with these methods.

Why Interaction Effects? When it was introduced, Dropout was motivated as a way to prevent "complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors" (Hinton et al., 2012; Srivastava et al., 2014). Because most "complex co-adaptations" are interaction effects, we examine Dropout under the lens of interactions.
This perspective is valuable because (1) modern NNs have so many weights that understanding networks by inspecting their weights is infeasible, whereas interactions are far more tractable because interaction effects live in function space, not weight space, (2) the decomposition that we use to calculate interaction effects has convenient properties such as identifiability, and (3) this perspective has practical implications for choosing Dropout rates for NN systems. To preview the experimental results: when NNs are trained on data that has no interactions, the optimal Dropout rate is high, but when NNs are trained on datasets which have important 2nd- and 3rd-order interactions, the optimal Dropout rate is 0.
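The counting argument above can be checked with a short simulation: the number of possible order-k interactions among N variables grows as C(N, k) = O(N^k), while the probability that any particular k-way interaction survives a Dropout mask shrinks as (1-p)^k. The sketch below is our own illustration, not code from the paper; the values of N, k, p, and the trial count are arbitrary choices for demonstration.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, k, p, n_trials = 20, 3, 0.5, 100_000

# Number of possible order-k interactions among N variables: C(N, k) = O(N^k).
n_possible = comb(N, k)

# A k-way interaction term survives a Dropout mask only if all k of its
# variables are kept, each independently with probability (1 - p).
masks = rng.random((n_trials, k)) > p
survival = masks.all(axis=1).mean()

print(n_possible)  # 1140 possible 3-way interactions among 20 variables
print(survival)    # close to (1 - p) ** k = 0.125
```

As N grows, the combinatorial explosion in candidate interactions is met by an exponential decay in survival probability, which is the cancellation the paper analyzes.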

2. RELATED WORK

Although Hinton et al. (2012) proposed Dropout to prevent spurious co-adaptation (i.e., spurious interactions), many questions remain. For example: Is the expectation of the output of a NN trained with Dropout the same as for a NN trained without Dropout? Does Dropout change the trajectory of learning during optimization even in the asymptotic limit of infinite training data? Should Dropout be used at run-time when querying a NN to see what it has learned? These questions are important because Dropout has been used as a method for Bayesian uncertainty (Gal & Ghahramani, 2016; Gal et al., 2017; Chang et al., 2017b;a), a usage which implicitly assumes that Dropout does not bias the model's output. The use of Dropout as a tool for uncertainty quantification has been questioned due to its failure to separate aleatoric and epistemic sources of uncertainty (Osband, 2016) (i.e., the uncertainty does not decrease even as more data is gathered). In this paper we ask a separate yet related question: Does Dropout treat all parts of function space equivalently? Significant work has focused on the effect of Dropout as a weight regularizer (Baldi & Sadowski, 2013; Warde-Farley et al., 2013; Cavazza et al., 2018; Mianjy et al., 2018; Zunino et al., 2018), including its properties of structured shrinkage (Nalisnick et al., 2018) or adaptive regularization (Wager et al., 2013). However, weight regularization is of limited utility for modern-scale NNs and can produce counter-intuitive results such as negative regularization (Helmbold & Long, 2017). Instead of focusing on the influence of Dropout on parameters, we take a nonparametric view of NNs as function approximators. Thus, our work is similar in spirit to Wan et al. (2013), which showed a linear relationship between keep probability and the Rademacher complexity of the model class.
Our investigation finds that Dropout preferentially targets high-order interaction effects, yielding models that generalize better because such effects are typically spurious or difficult to learn correctly from limited training data.

3. PRELIMINARIES

Multiplicative terms like $X_1 X_2$ are often used to encode "interaction effects". They are, however, only pure interaction effects if $X_1$ and $X_2$ are uncorrelated and have mean zero. When the two variables are correlated, some portion of the variance in the outcome $X_1 X_2$ can be explained by main effects of each individual variable. Note that correlation between two input variables does not imply an interaction effect on the outcome, and an interaction effect of two input variables on the outcome does not imply correlation between the variables.

In this paper, we use the concept of pure interaction effects from Lengerich et al. (2020): a pure interaction effect is variance explained by a group of variables $u$ that cannot be explained by any subset of $u$. This definition is equivalent to the fANOVA decomposition of the overall function $F$. Given a density $w(X)$ and $\mathcal{F}_u \subset L^2(\mathbb{R}^u)$ the family of allowable functions for variable set $u$, the weighted fANOVA (Hooker, 2004; 2007; Cuevas et al., 2004) decomposition of $F(X)$ is

$$\{f_u(X_u) \mid u \subseteq [d]\} = \operatorname*{arg\,min}_{\{g_u \in \mathcal{F}_u\}_{u \subseteq [d]}} \int \Big( \sum_{u \subseteq [d]} g_u(X_u) - F(X) \Big)^2 w(X)\, dX,$$

where $[d] = \{1, \ldots, d\}$ indexes the $d$ features, subject to the orthogonality constraints

$$\forall\, v \subsetneq u: \int f_u(X_u)\, g_v(X_v)\, w(X)\, dX = 0 \quad \forall\, g_v \in \mathcal{F}_v,$$

i.e., each member $f_u$ is orthogonal to every member which operates on a proper subset of $u$. An interaction effect $f_u$ is of order $k$ if $|u| = k$. Given $N$ variables in $X$, there are $O(N)$ possible effects of individual variables, $O(N^2)$ possible pairwise interactions, $O(N^3)$ possible 3-way interactions, and in general $O(N^k)$ possible interactions of order $k$.

The fANOVA decomposition is unique for a given data distribution; thus, pure interaction effects can only be defined by simultaneously defining a data distribution. An example of this interplay between the data distribution and the interaction definition is shown in Figure B.2. As Lengerich et al. (2020) describe, the correct distribution to use is the data-generating distribution $p(x)$.
In studies on real data, estimating p(x) is one of the central challenges of machine learning; for this paper, we use simulation data for which we know p(x).
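When the inputs are independent, the weighted fANOVA components reduce to nested conditional expectations, which makes the decomposition easy to verify on a toy example. The sketch below is our own illustration (not code from the paper): it decomposes F(X1, X2) = 2·X1 + 3·X1·X2 over two independent features, each uniform on {-1, +1}, and recovers the intercept, the main effects, and the pure pairwise interaction.

```python
import itertools
import numpy as np

# Two independent features, uniform over {-1, +1}: w(x) = 1/4 on each cell.
vals = [-1.0, 1.0]
grid = np.array(list(itertools.product(vals, vals)))  # shape (4, 2)

# Example function: one main effect plus one pure pairwise interaction.
y = 2.0 * grid[:, 0] + 3.0 * grid[:, 0] * grid[:, 1]

# With independent inputs, the fANOVA components are nested conditional means,
# each orthogonal to the components of all proper subsets of its variables.
f0 = y.mean()                                              # intercept
f1 = {v: y[grid[:, 0] == v].mean() - f0 for v in vals}     # main effect of X1
f2 = {v: y[grid[:, 1] == v].mean() - f0 for v in vals}     # main effect of X2
f12 = {(a, b): y[(grid[:, 0] == a) & (grid[:, 1] == b)][0] - f0 - f1[a] - f2[b]
       for a in vals for b in vals}                        # pure interaction

# Recovers f0 = 0, f1(x1) = 2*x1, f2(x2) = 0, and f12(x1, x2) = 3*x1*x2.
```

Note that this shortcut is exact only because the inputs are independent; under a correlated density w(X), the components must instead be found by the weighted least-squares projection above, and part of a multiplicative term is absorbed into the main effects.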

