INTERPRETING AND BOOSTING DROPOUT FROM A GAME-THEORETIC VIEW

Abstract

This paper aims to understand and improve the utility of the dropout operation from the perspective of game-theoretic interactions. We prove that dropout can suppress the strength of interactions between input variables of deep neural networks (DNNs). The theoretical proof is also verified by various experiments. Furthermore, we find that such interactions are strongly related to the over-fitting problem in deep learning. Thus, the utility of dropout can be regarded as decreasing interactions to alleviate over-fitting. Based on this understanding, we propose an interaction loss to further improve the utility of dropout. Experimental results show that the interaction loss can effectively improve the utility of dropout and boost the performance of DNNs.

1. INTRODUCTION

Deep neural networks (DNNs) have exhibited significant success in various tasks, but the over-fitting problem is still a considerable challenge for deep learning. Dropout is usually considered an effective operation to alleviate the over-fitting problem of DNNs. Hinton et al. (2012); Srivastava et al. (2014) thought that dropout could encourage each unit in an intermediate-layer feature to model useful information without much dependence on other units. Konda et al. (2016) considered dropout as a specific method of data augmentation. Gal & Ghahramani (2016) proved that dropout was equivalent to a Bayesian approximation in a Gaussian process.

Our research group led by Dr. Quanshi Zhang has proposed game-theoretic interactions, including interactions of different orders (Zhang et al., 2020) and multivariate interactions (Zhang et al., 2021b). As a basic metric, the interaction can be used to explain signal-processing behaviors in trained DNNs from different perspectives. For example, we have built up a tree structure to explain hierarchical interactions between words encoded in NLP models (Zhang et al., 2021a). We have also proven a close relationship between the interaction and adversarial robustness (Ren et al., 2021) and transferability (Wang et al., 2020). Many previous methods of boosting adversarial transferability can be explained as the reduction of interactions, and the interaction can also explain the utility of adversarial training (Ren et al., 2021).

As an extension of the system of game-theoretic interactions, in this paper, we aim to explain, model, and improve the utility of dropout from the following perspectives. First, we prove that the dropout operation suppresses interactions between input units encoded by DNNs. This is also verified by various experiments. To this end, the interaction is defined in game theory, as follows. Let x denote the input, and let f(x) denote the output of the DNN.
For the i-th input variable, we can compute its importance value φ(i), which measures the numerical contribution of the i-th variable to the output f(x). We notice that the importance value of the i-th variable would be different when the j-th variable is masked from the case when the j-th variable is not masked. Thus, the interaction between input variables i and j is measured as the difference I(i, j) = φ_{w/ j}(i) − φ_{w/o j}(i), where φ_{w/ j}(i) denotes the importance of the i-th variable computed when the j-th variable is present, and φ_{w/o j}(i) the importance computed when the j-th variable is always masked.

Second, we also discover a strong correlation between interactions of input variables and the over-fitting problem of the DNN. Specifically, over-fitted samples usually exhibit much stronger interactions than ordinary samples. Therefore, we consider that the utility of dropout is to alleviate over-fitting by decreasing the strength of interactions encoded by the DNN.

Based on this understanding, we propose an interaction loss to further improve the utility of dropout. The interaction loss directly penalizes the interaction strength, in order to improve the performance of DNNs. The interaction loss exhibits the following two distinct advantages over the dropout operation. (1) The interaction loss explicitly controls the penalty on the interaction strength, which enables people to trade off between over-fitting and under-fitting. (2) Unlike dropout, which is incompatible with the batch normalization operation (Li et al., 2019), the interaction loss can work in harmony with batch normalization. Various experimental results show that the interaction loss can boost the performance of DNNs.

Furthermore, we analyze interactions encoded by DNNs from the following three perspectives. (1) First, we discover the consistency between the sampling process in dropout (when the dropout rate p = 0.5) and the sampling in the computation of the Banzhaf value. The Banzhaf value (Banzhaf III, 1964) is another metric in game theory to measure the importance of each input variable.
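The pairwise interaction φ_{w/ j}(i) − φ_{w/o j}(i) defined above can be made concrete with a small numerical sketch. The following toy example is our own illustration, not the paper's implementation: for a three-variable model f(x) = x₀x₁ + x₂, the Shapley values can be enumerated exactly, and the interaction between the two multiplied variables recovered as the difference of the two importance values. The zero baseline for masked variables and all function names are assumptions of this sketch.

```python
import itertools
import math
import numpy as np

# Toy model f(x) = x0 * x1 + x2: variables 0 and 1 interact, variable 2 does not.
# Masked-out variables are set to a baseline value of 0 (an assumption of this sketch).
def f(mask, x):
    xm = np.where(mask, x, 0.0)
    return xm[0] * xm[1] + xm[2]

def shapley(i, x, present):
    # Exact Shapley value of variable i in the game restricted to `present`,
    # averaging i's marginal contribution over all subsets S of the other players.
    others = [v for v in present if v != i]
    n = len(present)
    total = 0.0
    for k in range(len(others) + 1):
        for S in itertools.combinations(others, k):
            mask = np.zeros(len(x), dtype=bool)
            mask[list(S)] = True
            without_i = f(mask, x)
            mask[i] = True
            with_i = f(mask, x)
            weight = math.factorial(k) * math.factorial(n - 1 - k) / math.factorial(n)
            total += weight * (with_i - without_i)
    return total

x = np.array([2.0, 3.0, 5.0])
phi_w_j = shapley(0, x, present=[0, 1, 2])   # variable j = 1 participates
phi_wo_j = shapley(0, x, present=[0, 2])     # variable j = 1 always masked
interaction = phi_w_j - phi_wo_j             # = 3.0, contributed by the x0*x1 term
```

Here φ_{w/o j}(0) = 0, because with x₁ masked the product term vanishes, so the entire importance of variable 0 in the full game is due to its interaction with variable 1.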
Unlike the Shapley value, the Banzhaf value is computed under the assumption that each input variable participates in the game independently with probability 0.5. We find that the frequent inference patterns in Banzhaf interactions (Grabisch & Roubens, 1999) are also prone to be frequently sampled by dropout, thereby being stably learned. This enables the DNN to encode smooth Banzhaf interactions. We also prove that the Banzhaf interaction is close to the aforementioned interaction, which further relates the dropout operation to the interaction used in this paper. (2) Besides, we find that the interaction loss is better applied to low layers than to high layers. (3) Furthermore, we decompose the overall interaction into interaction components of different orders. We visualize the strongly interacting regions within each input sample. We find that interaction components of low orders take the main part of interactions and are suppressed by both the dropout operation and the interaction loss.

Contributions of this paper can be summarized as follows. (1) We mathematically represent the dependence of feature variables using game-theoretic interactions, and prove that dropout can suppress the strength of interactions encoded by a DNN. In comparison, previous studies (Hinton et al., 2012; Krizhevsky et al., 2012; Srivastava et al., 2014) did not mathematically model the dependence of feature variables or theoretically prove its relationship with dropout. (2) We find that over-fitted samples usually contain stronger interactions than other samples. (3) Based on this, we consider that the utility of dropout is to alleviate over-fitting by decreasing the interaction. We design a novel loss function to penalize the strength of interactions, which improves the performance of DNNs. (4) We analyze the properties of interactions encoded by DNNs, and conduct comparative studies to obtain new insights into interactions encoded by DNNs.
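To make the idea of penalizing interaction strength concrete, below is a minimal Monte Carlo sketch under our own simplifying assumptions: the model is a plain function of a masked input with a zero baseline, contexts S are sampled by keeping each remaining variable independently with probability 0.5 (mirroring the Banzhaf-style sampling of dropout with p = 0.5), and the penalized quantity is |f(S∪{i,j}) − f(S∪{i}) − f(S∪{j}) + f(S)|. The paper's actual loss is applied to intermediate-layer features and backpropagated through the network; this sketch only illustrates the estimator, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def interaction_loss(f, x, n_samples=64):
    # Monte Carlo estimate of E_{i,j,S} |f(S∪{i,j}) − f(S∪{i}) − f(S∪{j}) + f(S)|.
    # f(mask, x) evaluates the model with masked-out variables at baseline 0.
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        i, j = rng.choice(d, size=2, replace=False)
        # Each remaining variable joins the context S independently with
        # probability 0.5, mirroring Banzhaf sampling / dropout with p = 0.5.
        S = rng.random(d) < 0.5
        S[[i, j]] = False

        def ev(extra):
            m = S.copy()
            m[list(extra)] = True
            return f(m, x)

        total += abs(ev([i, j]) - ev([i]) - ev([j]) + ev([]))
    return total / n_samples

def masked(m, x):
    return np.where(m, x, 0.0)

f_add = lambda m, x: masked(m, x).sum()                  # purely additive model
f_mul = lambda m, x: masked(m, x)[0] * masked(m, x)[1]   # variables 0 and 1 interact

x = np.array([2.0, 3.0, 1.0])
loss_add = interaction_loss(f_add, x)   # ≈ 0: an additive model has no interactions
loss_mul = interaction_loss(f_mul, x)   # > 0: the product term is penalized
```

An additive model yields zero loss while a multiplicative one does not, which is the sense in which such a penalty discourages co-adaptation between variables.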

2. RELATED WORK

The dropout operation. Dropout is an effective operation to alleviate the over-fitting problem and improve the performance of DNNs (Hinton et al., 2012). Several studies have been proposed to explain the inherent mechanism of dropout. According to (Hinton et al., 2012; Krizhevsky et al., 2012; Srivastava et al., 2014), dropout could prevent complex co-adaptation between units in intermediate layers, and could encourage each unit to encode useful representations itself. However, these studies only qualitatively analyzed the utility of dropout, instead of providing quantitative results. Wager et al. (2013) showed that dropout performed as an adaptive regularization, and established a connection to the AdaGrad algorithm. Konda et al. (2016) interpreted dropout as a kind of data augmentation in the input space, and Gal & Ghahramani (2016) proved that dropout was equivalent to a Bayesian approximation in a Gaussian process.

