REVISITING STRUCTURED DROPOUT

Abstract

Large neural networks are often overparameterised and prone to overfitting. Dropout is a widely used regularization technique to combat overfitting and improve model generalization. However, unstructured Dropout is not always effective for specific network architectures, and this has led to the development of multiple structured Dropout approaches that improve model performance and, in some cases, reduce the computational resources required for inference. In this work, we revisit structured Dropout, comparing different Dropout approaches on natural language processing and computer vision tasks for multiple state-of-the-art networks. Additionally, we devise an approach to structured Dropout we call ProbDropBlock, which drops contiguous blocks from feature maps with a probability given by the normalized feature salience values. We find that, with a simple scheduling strategy, the proposed approach to structured Dropout consistently improves model performance compared with baselines and other Dropout approaches on a diverse range of tasks and models. In particular, we show that ProbDropBlock improves RoBERTa finetuning on MNLI by 0.22% and training of ResNet50 on ImageNet by 0.28%.

1. INTRODUCTION

In modern society, Deep Neural Networks have become increasingly ubiquitous, having achieved significant success in many tasks including visual recognition and natural language processing Heaton (2020); Jumper et al. (2021); Schrittwieser et al. (2020). These networks now play a larger role in our lives and our devices; however, despite their successes, they still have notable weaknesses. Deep Neural Networks are often found to be highly overparameterized and, as a result, require excessive memory and significant computational resources. Additionally, due to overparameterization, these networks are prone to overfit their training data. There are several approaches to mitigate overfitting, including reducing model size or complexity, early stopping Caruana et al. (2000), data augmentation (DeVries & Taylor, 2017), and regularisation (Loshchilov & Hutter, 2017). In this paper, we focus on Dropout, a widely used form of regularisation proposed by Srivastava et al. (2014b). Standard unstructured Dropout involves randomly deactivating a subset of neurons in the network for each training iteration and training this subnetwork; at inference time, the full model can then be treated as an approximation of an ensemble of these subnetworks. Unstructured Dropout was efficient and effective, which led to it being widely adopted; however, when applied to Convolutional Neural Networks (CNNs), unstructured Dropout struggled to achieve notable improvements He et al. (2016); Huang et al. (2017), and this led to the development of several structured Dropout approaches Ghiasi et al. (2018); Dai et al. (2019); Cai et al. (2019), including DropBlock and DropChannel. DropBlock considers the spatial correlations between nearby entries in a feature map of a CNN and attempts to stop that information flow by deactivating larger contiguous areas/blocks, while DropChannel considers the correlation of information within a particular channel and performs Dropout at the channel level.
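To make the distinction concrete, the following NumPy sketch contrasts unstructured Dropout with a simplified block-wise variant on a single 2D feature map. The function names and the mask renormalisation are illustrative assumptions on our part; in particular, the original DropBlock formulation of Ghiasi et al. (2018) derives the block-seed probability γ from the target drop rate and block size, which this simplified sketch omits.

```python
import numpy as np

def dropout_unstructured(x, p, rng):
    """Standard unstructured Dropout: zero each entry independently
    with probability p, scaling survivors by 1/(1-p)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def drop_block(x, p, block_size, rng):
    """Simplified block-wise Dropout sketch for one 2D feature map:
    sample block centres with probability p, then zero the
    block_size x block_size square around each centre.
    (Illustrative only; not the exact DropBlock algorithm.)"""
    h, w = x.shape
    centres = rng.random((h, w)) < p
    mask = np.ones((h, w), dtype=bool)
    half = block_size // 2
    for i, j in zip(*np.nonzero(centres)):
        mask[max(0, i - half): i + half + 1,
             max(0, j - half): j + half + 1] = False
    # renormalise by the fraction of surviving entries
    return x * mask / max(mask.mean(), 1e-8)

rng = np.random.default_rng(0)
fmap = rng.random((8, 8))
out = drop_block(fmap, p=0.1, block_size=3, rng=rng)
```

Note the qualitative difference: unstructured Dropout removes isolated entries whose information neighbouring activations can largely reconstruct, whereas the block variant removes a contiguous region and so interrupts spatially correlated information.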
However, since the development of these structured approaches, there have been further strides in network architecture design, with rising interest in Transformer-based models. Given the success achieved by block-wise structured Dropout on CNNs, it is natural to ask: do these approaches also apply to Transformer-based models? Existing structured Dropout approaches for Transformers have largely focused on reducing model size and inference time. In this paper, we revisit the idea of structured Dropout for current state-of-the-art models on language and vision tasks. Additionally, we devise our own form of adaptive structured Dropout, ProbDropBlock, and compare it to preexisting approaches to structured and unstructured Dropout.

In Figure 1 we illustrate the effects of selected structured and unstructured Dropout approaches on an image of a cat. As can be seen in Figure 1a, the original image consists of three channels (RGB) which are aggregated to form the image; different approaches to Dropout may treat channels differently. In Figure 1b we illustrate the effect of unstructured Dropout on this image: the many small black squares represent deactivated/dropped weights at the pixel level, and different pixels have been deactivated in each channel. In Figure 1c we see fewer but larger black squares, and the locations of dropped pixels are consistent between channels; however, this is not the case in Figure 1d and Figure 1e. In this work, we say that BatchDropBlock is channel consistent, i.e. channels do not deactivate blocks independently; rather, the deactivated blocks are consistent between channels. In Figure 1b, Figure 1c, and Figure 1d, for a single channel there is a uniform probability of any pixel or block (depending on the approach) being dropped, so deactivated pixels may not contain any of the key information required to identify this image as a cat (i.e. the probability of deactivating a pixel/block belonging to the cat is the same as that of one belonging to the background). This is not the case for Figure 1e: in our adaptive DropBlock approach, the probability of a block being dropped depends on the value of the centre pixel of the block. It can be seen that this approach is not channel consistent and that deactivated pixels are concentrated on the cat. Figure 1 is intended to give an intuitive understanding of these techniques; in practice, they are applied to feature maps, which are the output activations of a preceding layer of the network.

The contributions of this paper include:

• The testing of preexisting unstructured and structured Dropout approaches on current state-of-the-art models, including Transformer-based models, on natural language inference and vision tasks. We reveal that structured Dropouts are generally better than unstructured ones on both vision and language tasks.
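The salience-dependent behaviour described above can be sketched as follows. This is a hypothetical NumPy illustration of the idea, not the authors' implementation: the text states only that block-drop probability follows the normalised salience of the block's centre value, so the choice of absolute activation as the salience measure and the exact normalisation below are our assumptions.

```python
import numpy as np

def prob_drop_block(x, drop_rate, block_size, rng):
    """Hypothetical ProbDropBlock sketch for one 2D feature map:
    a position seeds a dropped block with probability proportional
    to its normalised salience, so high-salience regions are
    dropped more often than the background."""
    h, w = x.shape
    salience = np.abs(x)                      # assumed salience measure
    salience = salience / max(salience.sum(), 1e-8)
    # scale so the expected number of seeds is drop_rate * h * w
    seed_prob = np.minimum(salience * drop_rate * h * w, 1.0)
    centres = rng.random((h, w)) < seed_prob
    mask = np.ones((h, w), dtype=bool)
    half = block_size // 2
    for i, j in zip(*np.nonzero(centres)):
        mask[max(0, i - half): i + half + 1,
             max(0, j - half): j + half + 1] = False
    # renormalise by the fraction of surviving entries
    return x * mask / max(mask.mean(), 1e-8)

rng = np.random.default_rng(0)
fmap = rng.random((8, 8))
out = prob_drop_block(fmap, drop_rate=0.1, block_size=3, rng=rng)
```

Under this sketch, a uniform feature map reduces to ordinary block-wise Dropout, while a map with a strong foreground (such as the cat in Figure 1) concentrates dropped blocks on the foreground, matching the behaviour shown in Figure 1e.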




Figure 1: An illustration of applying different Dropouts to an image.

