REVISITING STRUCTURED DROPOUT

Abstract

Large neural networks are often overparameterized and prone to overfitting. Dropout is a widely used regularization technique for combating overfitting and improving model generalization. However, unstructured Dropout is not always effective for specific network architectures, and this has led to the development of multiple structured Dropout approaches that improve model performance and, in some cases, reduce the computational resources required at inference. In this work, we revisit structured Dropout, comparing different Dropout approaches on natural language processing and computer vision tasks for multiple state-of-the-art networks. Additionally, we devise an approach to structured Dropout we call ProbDropBlock, which drops contiguous blocks from feature maps with a probability given by the normalized feature salience values. We find that, with a simple scheduling strategy, the proposed approach to structured Dropout consistently improves model performance over baselines and other Dropout approaches on a diverse range of tasks and models. In particular, we show that ProbDropBlock improves RoBERTa finetuning on MNLI by 0.22% and ResNet50 training on ImageNet by 0.28%.

1. INTRODUCTION

Deep Neural Networks have become increasingly ubiquitous in modern society, achieving significant success in many tasks including visual recognition and natural language processing (Heaton, 2020; Jumper et al., 2021; Schrittwieser et al., 2020). These networks now play a larger role in our lives and our devices; however, despite their successes, they still have notable weaknesses. Deep Neural Networks are often highly overparameterized and, as a result, require excessive memory and significant computational resources. Additionally, due to overparameterization, these networks are prone to overfit their training data. There are several approaches to mitigate overfitting, including reducing model size or complexity, early stopping (Caruana et al., 2000), data augmentation (DeVries & Taylor, 2017), and regularization (Loshchilov & Hutter, 2017). In this paper, we focus on Dropout, a widely used form of regularization proposed by Srivastava et al. (2014b). Standard unstructured Dropout randomly deactivates a subset of neurons in the network at each training iteration and trains the resulting subnetwork; at inference time, the full model can then be treated as an approximation of an ensemble of these subnetworks. Unstructured Dropout is efficient and effective, which led to its wide adoption. However, when applied to Convolutional Neural Networks (CNNs), unstructured Dropout struggles to achieve notable improvements (He et al., 2016; Huang et al., 2017), and this led to the development of several structured Dropout approaches (Ghiasi et al., 2018; Dai et al., 2019; Cai et al., 2019), including DropBlock and DropChannel.
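To make the mechanism concrete, the following is a minimal NumPy sketch of standard unstructured ("inverted") Dropout as described above; the function name and signature are illustrative, not from any specific library. Each activation is zeroed independently with probability p during training, and the survivors are rescaled by 1/(1-p) so the expected activation is unchanged and no adjustment is needed at inference.

```python
import numpy as np

def unstructured_dropout(x, p, rng, training=True):
    """Inverted unstructured Dropout (illustrative sketch).

    During training, each entry of x is zeroed independently with
    probability p, and survivors are scaled by 1/(1-p) so that the
    expected value of the output matches x. At inference time the
    input is returned unchanged.
    """
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = unstructured_dropout(x, p=0.5, rng=rng)
# Dropped entries become 0.0; surviving entries are scaled to 2.0
```

Structured variants such as DropBlock differ only in how the mask is drawn: instead of independent per-entry Bernoulli samples, entire contiguous regions of the mask are zeroed together.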
DropBlock exploits the spatial correlations between nearby entries in a CNN feature map and attempts to interrupt that information flow by deactivating larger contiguous areas (blocks), while DropChannel considers the correlation of information within a particular channel and performs Dropout at the channel level. However, since the development of these structured approaches, there have been further strides in network architecture design, with growing adoption of and interest in Transformer-based models. Given the success achieved by block-wise structured Dropout on CNNs, it is only natural to ask: do these approaches also apply to Transformer-based models? Structured Dropout approaches for transformers seem to focus on reducing model size and inference time; these works place

