COMFORT ZONE: A VICINAL DISTRIBUTION FOR REGRESSION PROBLEMS

Abstract

Domain-dependent data augmentation methods generate artificial samples using transformations suited to the underlying data domain, such as rotations for images and time warping for time series. In contrast, domain-independent approaches, e.g., mixup, are applicable to various data modalities, and as such they are general and versatile. While mixup-based techniques are used extensively in classification problems, their effect on regression tasks is less explored. To bridge this gap, we study the problem of domain-independent augmentation for regression, and we introduce comfort-zone: a new data-driven, domain-independent data augmentation method. Essentially, our approach samples new examples from the tangent planes of the train distribution. Augmenting data in this way aligns with the network's tendency to capture the dominant features of its input signals. Evaluating comfort-zone on regression and time series forecasting benchmarks, we show that it improves the generalization of several neural architectures. We also find that mixup and noise injection are less effective than comfort-zone.

1. INTRODUCTION

Classification and regression problems primarily differ in their output domain. In classification, we have a finite set of labels, whereas in regression, the range is an infinite set of quantities, either discrete or continuous (Goodfellow et al., 2016). In classical work (Devroye et al., 2013), classification is argued to be "easier" than regression, and more generally, it is widely agreed that classification and regression problems should be treated differently (Muthukumar et al., 2021). In particular, the differences between classification and regression are actively explored in the context of regularization.

Regularizing neural networks to improve their performance on new samples has received a lot of attention in the past few years. One of the main reasons for this increased interest is that most of the recent successful neural models are overparameterized; namely, the number of learnable parameters is significantly larger than the number of available training samples (Allen-Zhu et al., 2019a;b), and thus regularization is often necessary to alleviate overfitting issues. Recent studies on overparameterized linear models identify conditions under which overfitting is "benign" in regression (Bartlett et al., 2020), and uncover the relationship between the choice of loss functions in classification and regression tasks (Muthukumar et al., 2021). Still, the regularization of deep neural regression networks is not well understood.

In this work, we focus on a common regularization approach known as Data Augmentation (DA), in which data samples are artificially generated and used during training. In general, DA techniques can be categorized into domain-dependent (DD) methods and domain-independent (DI) approaches. The former are specific to a certain data modality, such as images, whereas the latter typically do not depend on the data modality.
Numerous DD- and DI-DA approaches are available for classification tasks (Shorten & Khoshgoftaar, 2019; Shorten et al., 2021), and many of them consistently improve over non-augmented models. Unfortunately, DI-DA for regression problems is a significantly less explored topic. Recent works on linear models study the connection between the DA policy and optimization (Hanin & Sun, 2021), as well as the generalization effects of linear DA transformations (Wu et al., 2020). We contribute to this line of work by proposing and analyzing a new domain-independent data augmentation method for nonlinear deep regression, and by extensively evaluating our approach against existing baselines.

Many strong data augmentation methods have been proposed in the past few years. Particularly relevant to our study is the family of mixup-based techniques that are commonly used in classification applications. The original method, mixup (Zhang et al., 2017), produces convex combinations of training samples, promoting linear behavior for in-between samples. The method is domain-independent and data-agnostic, and it was shown to solve the Vicinal Risk Minimization (VRM) problem instead of the usual Empirical Risk Minimization (ERM) problem. In comparison, our approach is domain-independent and data-driven, and it can also be viewed as solving a VRM problem. Through extensive evaluations, we will show that mixup and noise injection are less effective for regression.

Contribution. Challenged by the differences between classification and regression, and motivated by the success of domain-independent methods such as mixup, we propose a simple, domain-independent and data-driven DA routine, termed comfort-zone (Sec. 3). Let X, Y be the input and output mini-batch tensors, respectively, and let Z_l = g_l(X) be the hidden representation at layer l (Verma et al., 2019).
Essentially, our method produces new training samples Z_l(λ), Y(λ) from the given ones by scaling their small singular values by a random λ ∈ [0, 1]. At its core, comfort-zone incorporates into training the assumption that data sharing the dominant components of the train set should be treated as true samples. We offer two simple implementations of comfort-zone: a non-differentiable approach that can be used for input-level application, and a fully differentiable pipeline that is applicable to any layer (App. A). We analyze comfort-zone using perturbation theory, and we introduce its associated vicinal risk minimization problem (Sec. 4). Our experimental evaluation focuses on benchmark regression tasks (Sec. 5.1) and on time series forecasting tasks with small and large datasets (Sec. 5.2). The results show that comfort-zone improves the test error of several neural architectures across datasets, in comparison to other DA baselines. We offer a potential explanation for the success of our method (Sec. 3, App. B). Finally, an ablation study justifies our design choices (App. C).
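The input-level, non-differentiable variant can be sketched in a few lines of NumPy. This is a minimal illustration rather than the full method: the cutoff k separating the dominant singular values from the small ones is left as a free parameter here, and the helper names are ours (see Sec. 3 and App. A for the complete specification):

```python
import numpy as np

def scale_small_singular_values(M, k, lam):
    """Shrink all but the k dominant singular values of a mini-batch matrix M by lam."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_aug = s.copy()
    s_aug[k:] *= lam                # keep dominant components, damp the rest
    return (U * s_aug) @ Vt         # reconstruct the augmented batch

def comfort_zone_batch(X, Y, k, rng=None):
    """Draw one lambda in [0, 1] and apply it to both the input and output batches."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.uniform(0.0, 1.0)     # random scaling shared by X and Y
    return (scale_small_singular_values(X, k, lam),
            scale_small_singular_values(Y, k, lam))
```

With λ = 1 the batch is reproduced exactly, while with λ = 0 the batch is projected onto its rank-k dominant subspace; intermediate values interpolate between the two.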

2. RELATED WORK

Deep neural network regularization is an established research topic (Goodfellow et al., 2016). Common regularization approaches include weight decay, dropout (Srivastava et al., 2014), batch normalization (Ioffe & Szegedy, 2015), and data augmentation (DA). In what follows, we categorize DA techniques as either domain-dependent or domain-independent. Domain-dependent DA was shown to be effective for, e.g., image data (LeCun et al., 1998) and audio signals (Park et al., 2019), among other domains. However, adapting these methods to new data domains is typically challenging and often infeasible. In the past few years, increased interest has been drawn to domain-independent DA methods, which allow regularizing neural networks when only basic data assumptions are made. We focus in what follows on domain-independent techniques that were proposed in the context of classification and regression problems.

DA for classification. Zhang et al. (2017) proposed to convexly mix input samples as well as one-hot output labels during training. The resulting training procedure, named mixup, solves the Vicinal Risk Minimization (VRM) problem instead of the typical Empirical Risk Minimization (ERM) problem. Many extensions of mixup were proposed, including mixing latent features (Verma et al., 2019) and same-class mixing (DeVries & Taylor, 2017), among others (Guo et al., 2019; Hendrycks et al., 2019; Yun et al., 2019; Berthelot et al., 2019; Greenewald et al., 2021; Lim et al., 2021).

DA for regression. Significantly less attention has been drawn to designing domain-independent data augmentation for regression tasks. A recent survey (Wen et al., 2020) on DA for time series data lists a few basic augmentation methods, including noise injection. Injecting noise into the data can be used for regression tasks, and it can also be combined with other DA methods such as ours. Dubost et al. (2019) propose to recombine samples for regression tasks with countable outputs, and thus their method cannot be directly extended to the uncountable regime. Recently, MixRL (Hwang & Whang, 2021) developed a meta-learning framework based on reinforcement learning for mixing samples in their neighborhood.
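For reference, the mixup baseline discussed above admits a short self-contained sketch. The interpolation weight is drawn from a Beta(α, α) distribution as in Zhang et al. (2017); the function name and the default α value here are illustrative:

```python
import numpy as np

def mixup_batch(X, Y, alpha=0.2, rng=None):
    """Convexly combine each sample (and its target) with a random partner."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)      # interpolation weight in [0, 1]
    idx = rng.permutation(len(X))     # random partner for each sample
    return lam * X + (1 - lam) * X[idx], lam * Y + (1 - lam) * Y[idx]
```

Unlike comfort-zone, which is data-driven through the singular values of the batch, mixup is data-agnostic: the mixing weight is drawn independently of the batch content.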

