COMFORT ZONE: A VICINAL DISTRIBUTION FOR REGRESSION PROBLEMS

Abstract

Domain-dependent data augmentation methods generate artificial samples using transformations suited for the underlying data domain, such as rotations on images and time warping on time series data. In contrast, domain-independent approaches, e.g. mixup, are applicable to various data modalities, and as such they are general and versatile. While mixup-based techniques are used extensively in classification problems, their effect on regression tasks is less explored. To bridge this gap, we study the problem of domain-independent augmentation for regression, and we introduce comfort-zone: a new data-driven domain-independent data augmentation method. Essentially, our approach samples new examples from the tangent planes of the train distribution. Augmenting data in this way aligns with the network's tendency towards capturing the dominant features of its input signals. Evaluating comfort-zone on regression and time series forecasting benchmarks, we show that it improves the generalization of several neural architectures. We also find that mixup and noise injection are less effective in comparison to comfort-zone.

1. INTRODUCTION

Classification and regression problems primarily differ in their output's domain. In classification, we have a finite set of labels, whereas in regression, the range is an infinite set of quantities, either discrete or continuous (Goodfellow et al., 2016). In classical work (Devroye et al., 2013), classification is argued to be "easier" than regression, but more generally, it is widely agreed that classification and regression problems should be treated differently (Muthukumar et al., 2021). Particularly, the differences between classification and regression are actively explored in the context of regularization. Regularizing neural networks to improve their performance on new samples has received a lot of attention in the past few years. One of the main reasons for this increased interest is that most of the recent successful neural models are overparameterized. Namely, the number of learnable parameters is significantly larger than the number of available training samples (Allen-Zhu et al., 2019a; b), and thus regularization is often necessary to alleviate overfitting issues. Recent studies on overparameterized linear models identify conditions under which overfitting is "benign" in regression (Bartlett et al., 2020), and uncover the relationship between the choice of loss functions in classification and regression tasks (Muthukumar et al., 2021). Still, the regularization of deep neural regression networks is not well understood.

In this work, we focus on a common regularization approach known as Data Augmentation (DA) in which data samples are artificially generated and used during training. In general, DA techniques can be categorized into domain-dependent (DD) methods and domain-independent (DI) approaches. The former techniques are specific to a certain data modality such as images, whereas the latter methods typically do not depend on the data modality.
Numerous DD- and DI-DA approaches are available for classification tasks (Shorten & Khoshgoftaar, 2019; Shorten et al., 2021), and many of them consistently improve over non-augmented models. Unfortunately, DI-DA for regression problems is a significantly less explored topic. Recent works on linear models study the connection between the DA policy and optimization (Hanin & Sun, 2021), as well as the generalization effects of linear DA transformations (Wu et al., 2020). We contribute to this line of work by proposing and analyzing a new domain-independent data augmentation method for nonlinear deep regression, and by extensively evaluating our approach in comparison to existing baselines.

Many strong data augmentation methods were proposed in the past few years. Particularly relevant to our study is the family of mixup-based techniques that are commonly used in classification applications. The original method, mixup (Zhang et al., 2017), produces convex combinations of training samples, promoting linear behavior for in-between samples. The method is domain-independent and data-agnostic, and it was shown to solve the Vicinal Risk Minimization (VRM) problem instead of the usual Empirical Risk Minimization (ERM) problem. In comparison, our approach is domain-independent and data-driven, and it can also be viewed as solving a VRM problem. Through extensive evaluations, we will show that mixup and noise injection are less effective for regression.

Contribution. Challenged by the differences between classification and regression and motivated by the success of domain-independent methods such as mixup, we propose a simple, domain-independent and data-driven DA routine, termed comfort-zone (Sec. 3). Let $X, Y$ be the input and output mini-batch tensors, respectively, and let $Z_l = g_l(X)$ be the hidden representation at layer $l$ (Verma et al., 2019). Essentially, our method produces new training samples $Z_l(\lambda), Y(\lambda)$ from the given ones by scaling their small singular values by a random $\lambda \in [0, 1]$. At its core, comfort-zone incorporates into training the assumption that data sharing the dominant components of the train set should be treated as true samples. We offer two simple implementations of comfort-zone: a non-differentiable approach that can be used for input-level application, and a fully differentiable pipeline which is applicable to any layer (App. A). We analyze comfort-zone using perturbation theory, and we introduce its associated vicinal risk minimization (Sec. 4). Our experimental evaluation focuses on benchmark regression tasks (Sec. 5.1), and on time series forecasting tasks with small and large datasets (Sec. 5.2). The results show that comfort-zone improves the test error on several neural architectures and datasets, and in comparison to other DA baselines. We offer a potential explanation for the success of our method (Sec. 3, App. B). Finally, an ablation study is performed, justifying our design choices (App. C).

2. RELATED WORK

Regularization of deep neural networks is an established research topic (Goodfellow et al., 2016). Common regularization approaches include weight decay, dropout (Srivastava et al., 2014), batch normalization (Ioffe & Szegedy, 2015), and data augmentation (DA). In what follows, we categorize DA techniques to be either domain-dependent or domain-independent. Domain-dependent DA was shown to be effective for, e.g., image data (LeCun et al., 1998) and audio signals (Park et al., 2019), among other domains. However, adapting these methods to new data domains is typically challenging and often infeasible. In the past few years, increased interest has been drawn to domain-independent DA methods, which allow regularizing neural networks when only basic data assumptions can be made. We focus in what follows on domain-independent techniques that were proposed in the context of classification and regression problems.

DA for classification. Recently, Zhang et al. (2017) proposed convex mixing of input samples as well as one-hot output labels during training. The new training procedure, named mixup, solves the Vicinal Risk Minimization (VRM) problem instead of the typical Empirical Risk Minimization (ERM). Many extensions of mixup were proposed, including mixing latent features (Verma et al., 2019), same-class mixing (DeVries & Taylor, 2017), among others (Guo et al., 2019; Hendrycks et al., 2019; Yun et al., 2019; Berthelot et al., 2019; Greenewald et al., 2021; Lim et al., 2021).

DA for regression. Significantly less attention has been drawn to designing domain-independent data augmentation for regression tasks. A recent survey (Wen et al., 2020) on DA for time series data lists a few basic augmentation methods including noise injection. Incorporating noise in the data can be used for regression tasks, and it can also be incorporated into other DA methods such as ours. Dubost et al. (2019) propose to recombine samples for regression tasks with countable outputs, and thus their method cannot be directly extended to the uncountable regime. Recently, Hwang & Whang (2021) developed mixRL, a meta learning framework based on reinforcement learning for mixing samples in their neighborhood.

3. METHOD

To generate new samples, we consider the singular value decomposition (SVD) of a matrix $A \in \mathbb{R}^{q \times r}$, $q \ge r$, which is given by $A = U S V^T$. The matrices $U, V$ are orthogonal, and $S$ is a diagonal matrix whose main diagonal consists of the singular values ordered by $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0$. SVD is intimately related to principal component analysis (PCA), which in turn is heavily studied in manifold learning and dimensionality reduction (Ma & Fu, 2012). It is well-known that the best rank-$k$ approximation of $A$ is given by omitting the last $(r - k)$ singular values, i.e., $A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T$ (Eckart & Young, 1936). The matrix $A_k$ preserves the $k$ dominant components in $A$, and it discards the rest. The key insight in our approach is that scaling down the small singular values should yield training samples that are in close proximity to the true distribution of the data $\mathcal{P}$.

Let the input and output mini-batch tensors be $X \in \mathbb{R}^{b \times m}$ and $Y \in \mathbb{R}^{b \times m}$, respectively, where $b$ is the batch size and, w.l.o.g., $b \ge 2m$. We denote the network by $f(X) = f_l(g_l(X))$, $Z_l := g_l(X)$, where $g_l$ maps inputs to latent representations $Z_l$ at layer $l \in [0, L]$, and $f_l$ maps latent vectors to outputs (Verma et al., 2019). Let $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ for $\alpha \in (0, \infty)$, and let $k \in [1, 2m]$ be the index of the singular value after which we scale down. Then, the new artificial samples $Z_l(\lambda, k), Y(\lambda, k)$ are defined via

$$A := [Z_l, Y] = U S V^T \in \mathbb{R}^{b \times 2m}, \qquad A(\lambda, k) := [Z_l(\lambda, k), Y(\lambda, k)] = U S(\lambda, k) V^T,$$

where $[\cdot, \cdot]$ concatenates along columns, and $S(\lambda, k)$ is the diagonal matrix of scaled down singular values. Namely, we compute

$$S(\lambda, k) = \mathrm{diag}(\sigma_1, \dots, \sigma_k \mid \lambda\sigma_{k+1}, \dots, \lambda\sigma_{2m}).$$
The value $k$ depends on the hyper-parameter $\rho \in [0, 1]$ that represents the "amount" of signal to keep unchanged, i.e.,

$$k = \arg\max_k \Big\{ \sum_{j=1}^{k} \sigma_j \Big/ \sum_{j} \sigma_j \le \rho \Big\}.$$

Similar to mixup (Zhang et al., 2017), our method recovers the original dataset $\mathcal{D}$ as $\alpha \to 0$, $\forall \rho$. The loss function associated with comfort-zone is

$$\mathcal{L}(f) = \mathbb{E}_{(X,Y)}\, \mathbb{E}_{\lambda}\, \mathbb{E}_{l}\; c\big((f_l, 1) \circ \chi(g_l(X), Y; \lambda, k)\big), \quad \text{s.t. } (X, Y) \sim \mathcal{D},\ \lambda \sim \mathrm{Beta}(\alpha, \alpha),\ l \sim [0, L],$$

where $c : \mathbb{R}^{b \times m} \times \mathbb{R}^{b \times m} \to [0, \infty)$ is a cost function, typically mean squared error (MSE). The transformation $\chi$ takes a pair of tensors $g_l(X), Y$, and it scales down the last $(2m - k)$ singular values of their concatenation by $\lambda$. A key attribute of comfort-zone is that it is fully differentiable since the singular value decomposition can be backpropagated (Ionescu et al., 2015). Importantly, when comfort-zone is applied only at the input level ($l = 0$), a straightforward non-differentiable implementation is sufficient. We provide an example PyTorch pseudocode for input-level comfort-zone in Fig. 1 (left), and we discuss in App. A a potential differentiable implementation.

The computational complexity of comfort-zone is governed by the SVD calculation, which has a complexity of $O(\min(qr^2, rq^2))$ for a $q \times r$ matrix. In comparison, mixup samples a scalar from a random distribution, and it linearly blends two samples, whereas additive noise samples points from a random distribution. Thus, mixup and additive noise have a complexity of $O(qr)$.

Design choices. For certain $\lambda$ values, the new sample $Z_l(\lambda, k), Y(\lambda, k)$ may be too far from $\mathcal{P}$. With this in mind, we explored the option of scaling down the loss function $c(\cdot, \cdot)$ by a parameter $\mu(\lambda)$ in addition to modifying the singular values. However, we tested various profiles $\mu(\lambda)$ and discovered that the best models are obtained when no scaling of the loss occurs, see App. C. Importantly, this means that our approach adopts a different ansatz in comparison to mixup.
While mixup incorporates uncertainty into the model training using "in-between" samples and labels, our method uses the new data as if it was sampled from the true distribution. An alternative option which would be conceptually closer to mixup is to scale down the large singular values as well as the loss term. We show in App. C that this choice is also inferior to comfort-zone.

The effect of comfort-zone on data and learning. We generated a 2D point cloud whose intrinsic dimension is one (shown in blue, Fig. 1), and we applied different DA methods on this data. The three panels in the figure show in orange the augmented data when using additive noise, mixup, and comfort-zone with α = 1.0 over the original point cloud colored in light blue. Injecting noise alters each point in its neighborhood, whereas mixup draws the points towards the center of their convex hull. In contrast, comfort-zone aligns the new samples along the dominant component of the original data. Notably, our approach may increase the span of training data, and thus it can improve estimation in regression as was recently shown in (Wu et al., 2020).

We argue that training on samples created with our method encourages the inherent tendency of the network to model the dominant parts of the data better. To demonstrate this phenomenon, we trained an N-BEATS (Oreshkin et al., 2019) architecture with and without comfort-zone on the Air Passengers dataset provided in DARTS (Herzen et al., 2022). The trained models are evaluated on the dataset modified using 100 different values of λ ∈ [0, 1], see Fig. 2 (left). Namely, we modify the singular values of every batch in the dataset using different λ, and feed the resulting data for inference. Surprisingly, the non-augmented model (blue curve) performs better on the unseen modified samples than on the unmodified data (λ = 1), yielding the minimum at λ ≈ 0.5.
In comparison, the regularized network attains a qualitatively similar plot, but the MSE is lower for all λ and the minimum is obtained for a lower value at λ ≈ 0.3. This behavior was found to be consistent across several architectures and datasets, see App. B.

Inspecting an example of the data and its modifications reveals the differences between the samples when select λ values are used. The original sample (black) shown in Fig. 2 (right) exhibits primary extrema S_p at times t = 4, 7, 15 and secondary extrema S_q at times t = 9, 13 marked with black dots. The blue curve (λ = 0.5), for which the non-augmented model attained the minimum loss, maintains the primary extrema while significantly "flattening" the secondary extrema. In comparison, the orange curve (λ = 0.3), for which the augmented model achieves the minimum loss, flattens S_q completely. Finally, we observe that the S_q data points are qualitatively different when λ = 0.0.

From the analysis above, we conclude the following. First, the network prefers data with fewer small-scale features; this finding is consistent with similar results on, e.g., autoencoder models (Jain et al., 2021). Second, our regularization encourages this tendency by providing the model with such data, leading to improved MSE profiles. To the best of our knowledge, the above analysis is novel for deep regression models. Notably, while it may be argued that the behavior in Fig. 2 (left) is natural and intuitive as the model "simply" performs better on denoised signals, we argue differently. In particular, this plot somewhat contradicts our understanding of overfitting, which occurs with high probability when a network with many weights such as N-BEATS is trained on a tiny dataset such as Air Passengers (a single time series with 144 entries). Specifically, since the data is highly likely to be overfit by the network, we expect the MSE value to be lowest for λ = 1, and equal or higher for any λ < 1. Thus, we advocate that the above analysis may reveal a characteristic feature of regression neural networks. Our analysis is reinforced further as other datasets and architectures follow a similar pattern (App. B). Importantly, we are unaware of a similar experiment in the literature of deep regression neural networks.
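The input-level procedure described in this section can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions, not the authors' PyTorch pseudocode from Fig. 1: the function names `choose_k` and `comfort_zone` are hypothetical, the inputs are assumed to be 2D arrays with a nonzero concatenation, and k is allowed to be 0 when ρ is smaller than the first cumulative ratio (in which case all singular values are scaled).

```python
import numpy as np


def choose_k(singular_values, rho):
    """Number of leading singular values whose cumulative share is at most rho."""
    ratios = np.cumsum(singular_values) / np.sum(singular_values)
    # ratios is non-decreasing, so this counts the entries that are <= rho.
    return int(np.searchsorted(ratios, rho, side="right"))


def comfort_zone(x, y, alpha=1.0, rho=0.9, rng=None):
    """Scale the trailing singular values of [x, y] by lambda ~ Beta(alpha, alpha).

    x and y are 2D arrays with the same number of rows (the batch dimension).
    Returns the augmented inputs and outputs together with lambda and k.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.concatenate([x, y], axis=1)                 # A = [X, Y]
    u, s, vt = np.linalg.svd(a, full_matrices=False)   # A = U S V^T
    k = choose_k(s, rho)
    lam = rng.beta(alpha, alpha)
    s_scaled = s.copy()
    s_scaled[k:] *= lam                                # keep the first k values, shrink the rest
    a_new = (u * s_scaled) @ vt                        # U S(lambda, k) V^T
    m = x.shape[1]
    return a_new[:, :m], a_new[:, m:], lam, k
```

A training loop would call `comfort_zone` on each mini-batch and feed the returned pair to the unchanged loss, in line with the finding that no loss scaling is needed.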

4. ANALYSIS

Relation to additive noise. In what follows, we would like to answer the following question: Is applying comfort-zone merely a variant of injecting additive noise? To this end, we analyze comfort-zone from a perturbation theory viewpoint. Specifically, we would like to understand how a random data perturbation affects the singular values of the data matrix $A \in \mathbb{R}^{q \times r}$, $q \ge r$. We denote by $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$ the singular values of $A$. The perturbed matrix and its set of singular values are denoted by $\tilde{A} = A + E$ and $\{\tilde{\sigma}_j\}_{j=1}^{r}$, respectively. We write $\inf_2(A)$ and $\|A\|_2$ to denote the smallest and largest singular values of any matrix $A$. The following classical result provides a bound for the perturbed singular values (Stewart, 1979; 1998).

Theorem 1. Let $P$ be the orthogonal projection onto the column space of $A$. Let $P_\perp = I - P$. Then

$$\tilde{\sigma}_j^2 = (\sigma_j + \gamma_j)^2 + \eta_j^2, \quad j = 1, \dots, r,$$

where $|\gamma_j| \le \|P E\|_2$ and $\inf_2(P_\perp E) \le \eta_j \le \|P_\perp E\|_2$.

Following Stewart (1979), we make two observations with respect to Thm. 1. First, if $\sigma_j \gg \|E\|_2$ then it dominates the bound and we have $\tilde{\sigma}_j \cong \sigma_j + \gamma_j$. Second and more important to our setting, when $\sigma_j$ is of order $\|E\|_2$, the term $\eta_j$ will tend to dominate. Indeed, in these cases the term $\eta_j$ increases the singular value $\sigma_j$. We conclude that random perturbations to $A$ tend to increase its small singular values. In contrast, comfort-zone typically decreases the small singular values, while leaving the large $\sigma_j$ unchanged. Thus, comfort-zone is in effect a complementary approach to injecting additive noise, allowing finer control over the resulting new samples. Finally, we note that for a certain choice of hyper-parameters, our approach can be viewed as injecting noise per the above analysis. For example, taking $\rho = 0.0$ and $\lambda \sim \mathrm{Uniform}(1.0, \alpha)$ for $\alpha > 1.0$ will increase all the singular values of $A$ by a factor of $\lambda \in [1.0, \alpha]$, where Uniform is the random uniform distribution.
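The contrast between additive noise and comfort-zone can be checked numerically. The sketch below is illustrative only: the matrix sizes, the singular values, the noise scale of 0.1 and the fixed λ = 0.5 are all assumptions chosen for the demonstration, not settings from the experiments. It builds a matrix with two large and two tiny singular values, perturbs it with Gaussian noise, and compares the resulting spectrum with the one obtained by scaling the tiny singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
q, r = 50, 4

# Build A = U diag(sigma) V^T with two dominant and two tiny singular values.
u, _ = np.linalg.qr(rng.normal(size=(q, r)))   # orthonormal columns, shape (q, r)
v, _ = np.linalg.qr(rng.normal(size=(r, r)))   # orthogonal matrix, shape (r, r)
sigma = np.array([10.0, 5.0, 1e-3, 1e-3])
a = (u * sigma) @ v.T

# Additive noise: singular values of the perturbed matrix A + E.
e = 0.1 * rng.normal(size=(q, r))
sigma_noisy = np.linalg.svd(a + e, compute_uv=False)

# Part of E orthogonal to the column space of A; by Thm. 1 its smallest
# singular value is a lower bound on every perturbed singular value.
p_perp_e = e - u @ (u.T @ e)
eta_lower = np.linalg.svd(p_perp_e, compute_uv=False)[-1]

# comfort-zone style scaling: shrink only the two tiny singular values.
lam = 0.5
sigma_cz = sigma.copy()
sigma_cz[2:] *= lam

print("original      :", sigma)
print("additive noise:", sigma_noisy)
print("comfort-zone  :", sigma_cz)
```

In this setup the two tiny singular values grow by orders of magnitude under noise, whereas the scaling halves them and leaves the dominant pair untouched.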
comfort-zone as a Vicinal Risk Minimization (VRM). Given a cost function $c : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, the learning problem aims at minimizing the expectation of the loss $c(f(x), y)$ over the distribution $\mathcal{P}(x, y)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$. A fundamental challenge, shared by most real-world scenarios, is that the true distribution of the data is unknown. The alternative is to minimize over the empirical distribution of a train set $\{(x_i, y_i)\}_{i=1}^{n}$ given by

$$\mathrm{d}\mathcal{P}_{\mathrm{emp}}(x, y) = \frac{1}{n} \sum_{i} \delta_{x_i}(x)\, \delta_{y_i}(y).$$

The resulting scheme is the common training procedure of modern neural networks, formally known as the Empirical Risk Minimization (ERM) (Vapnik, 1991). While $\mathcal{P}_{\mathrm{emp}}$ provides a basic approximation of the true $\mathcal{P}$, it was suggested (Chapelle et al., 2001) that other density estimates $\mathrm{d}\mathcal{P}_{\mathrm{est}}$ that take into account the vicinity of $(x_i, y_i)$ should be considered. The recent mixup approach (Guo et al., 2019) exploits this idea by proposing a Vicinal Risk Minimization (VRM) procedure that is based on the vicinal distribution estimate

$$\frac{1}{n} \sum_{i,j} \delta_{\tilde{x}_{ij}(\lambda)}(x)\, \delta_{\tilde{y}_{ij}(\lambda)}(y),$$

defined using convex combinations $\tilde{z}_{ij}(\lambda) = \lambda z_i + (1 - \lambda) z_j$ for $z \in \{x, y\}$ and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. In this context, the main difference between comfort-zone and mixup is in the definition of vicinity as we describe below.

We denote by $T(x, y)$ the tangent plane of the data manifold $\mathcal{M}$ at the point $(x, y) \in \mathcal{M} \subset \mathcal{X} \times \mathcal{Y}$. Namely, $T(x, y)$ is the linear approximation of $\mathcal{M}$ at $(x, y)$. For every pair $(x, y)$, we define a new density distribution $\mathcal{P}_{\mathrm{tan}}$ which considers all pairs $(a, b)$ in the tangent plane of $(u, v) \in \mathcal{M}$. Formally,

$$\mathrm{d}\mathcal{P}_{\mathrm{tan}}(x, y) = \int_{\mathcal{M}} \int_{T(u,v)} \delta_a(x)\, \delta_b(y)\, \mathrm{d}ab\, \mathrm{d}uv.$$

Then, comfort-zone approximates the latter expression by generating an estimate of the tangent plane $T_{\mathrm{est}}$ via SVD, yielding the following vicinal estimate

$$\mathrm{d}\mathcal{P}_{\mathrm{est}}(x, y) = \frac{1}{n} \sum_{i} \frac{1}{k_i} \sum_{j} \delta_{x_j}(x)\, \delta_{y_j}(y), \quad (x_j, y_j) \in T_{\mathrm{est}}(x_i, y_i), \quad k_i = |T_{\mathrm{est}}(x_i, y_i)|.$$

5.1. REGRESSION BENCHMARK DATASETS

While there is extensive work on deep regression in the vision community for, e.g., object detection (Szegedy et al., 2013) and human pose estimation (Li & Chan, 2014), we aimed for an evaluation setting where data modalities different from images and text are being considered. To this end, we evaluate comfort-zone on regression benchmark datasets that frequently appear in the literature, see e.g., Hernández-Lobato & Adams (2015). The datasets include Diabetes, listing 442 patients with 10 feature variables (Efron et al., 2004); Concrete, which describes 1030 instances of the actual concrete strength using 8 features (Yeh, 1998); Energy, which details the energy efficiency of 768 building shapes using 8 variables (Tsanas & Xifara, 2012); and Wine, which consists of 1599 red wine instances with 11 features (Cortez et al., 2009). The output of Diabetes, Concrete and Wine has one feature, whereas Energy has two features. We apply min-max normalization to all datasets, and we remove it during model testing.

The baseline architecture we consider is a residual network (ResNet) (He et al., 2016). ResNet models are typically overparameterized, and thus they serve as a good baseline to explore DA effects. We use fully connected layers in the residual block instead of convolutions, employing 18 (ResNet18) and 34 (ResNet34) residual layers followed by a linear layer. During training, data is split into 70%, 10% and 20% for the train, validation and test sets, respectively. Each model is trained 20 times with random splits, and it is trained for 40 epochs using a hidden size of 100 and a batch size of 16. We employ an Adam optimizer with 0.0001 weight decay and an initial learning rate of 0.001, and we reduce the learning rate by half with a patience of 3 based on the validation loss. The loss is MSE, and we infer over the models which yield the best loss on the validation set when averaged over 20 runs.

We compare the baseline (ERM) to mixup (Guo et al., 2019), additive uniform noise (UN), and comfort-zone (CZ). The results are detailed in Tab. 1 using the metrics: root MSE (RMSE), mean absolute percentage error (MAPE), and R2 (Makridakis & Hibon, 2000). In mixup, we follow the authors' guidelines and use α ∈ [0.1, 0.4], and with the additive noise we use a scale of 0.1. We apply comfort-zone at the input level, and we perform a grid search over α ∈ {0.5, 1.0, 2.0}, and ρ ∈ {0.97, 0.98}. Our results show that more depth yields inferior results, which may be related to the network size w.r. 34 layers by a small RMSE margin. Further, our method achieves the best results on all datasets in comparison to all other DA baselines. We also find that comfort-zone reduces the standard deviation for almost all datasets and metrics.
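For reference, the three reported metrics can be computed as in the following sketch. It uses the common textbook definitions, which are an assumption here since the exact formulas are not spelled out in the text: MAPE is expressed in percent and assumes nonzero targets, and R2 is one minus the ratio of residual to total sum of squares, so it can be negative when a model does worse than predicting the mean. The function name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAPE in percent, R2) for two arrays of equal shape."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    # MAPE assumes no target is exactly zero.
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mape, r2


rmse, mape, r2 = regression_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
print(rmse, mape, r2)
```

Since min-max normalization is removed before testing, the metrics would be evaluated on targets and predictions mapped back to the original scale.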

5.2. TIME SERIES FORECASTING (TSF)

Small-scale TSF. Forecasting time series data is one of the fundamental regression tasks in machine learning. We test comfort-zone using the DARTS (Herzen et al., 2022) time series forecasting framework, which supports several TSF methods and datasets. Specifically, we consider the DARTS implementations of RNN, TCN (Bai et al., 2018), TRANSFORMER (Vaswani et al., 2017), and N-BEATS (Oreshkin et al., 2019). The datasets are mostly univariate, i.e., the time series samples are one-dimensional. In our experiments, we apply min-max normalization to the data, and we convert it to a single-precision floating point representation. Importantly, DARTS datasets are small, ranging from ≈ 100 samples to ≈ 3000 samples in total. This regime of small training sets is expected to benefit the most from DA techniques such as ours. The data is split into approximately 80% for training and 20% for testing. Unless otherwise noted, covariates such as hour-of-day are not used (Salinas et al., 2020). We train for 300 epochs, using a batch size of 32 and an Adam optimizer with a learning rate of 0.001 and no scheduling. In all cases, the training loss is mean squared error (MSE). The specific input and output tensor sizes depend on the dataset, and we provide this information in App. D. For reproducibility and to reduce variability, we train each model on the same hundred seeds {0, . . . , 99}. During inference, we evaluate the trained models using the root mean square error (RMSE), mean absolute percentage error (MAPE), and R2 measures, see e.g., (Makridakis & Hibon, 2000). We report the average measures and their standard deviation over the seed set. In our experiments, we compare the baseline model (ERM) to the baseline augmented with the DA approaches mixup, additive noise (UN), and comfort-zone (CZ).
Following the evaluation protocol proposed in the original mixup paper (Zhang et al., 2017), we evaluate the dependence of different DA methods on the choice of hyper-parameters. To this end, we fix the hyper-parameters for all DA baselines. We used α = 0.4 for mixup, a scale of 0.1 for UN, and α = 0.2 and ρ = 0.9 for CZ in all cases. For comfort-zone, we take the best result out of the original data and noise-injected data. The hyper-parameters were chosen using a basic grid test, taking the parameters which yield the best average error across DARTS datasets.

3.56 ± 0.9    5.43 ± 1.4    0.79 ± 0.1    2.96 ± 0.6    4.51 ± 0.9    0.92 ± 0.0

         US Gasoline                                    Sunspots
         RMSE ↓ ×10^-2   MAPE ↓        R2 ↑            RMSE ↓ ×10^-2   MAPE ↓         R2 ↑
ERM      6.33 ± 0.5      6.64 ± 0.5    -0.11 ± 0.2     6.17 ± 0.9      27.45 ± 3.2    -0.68 ± 0.5
mixup    6.40 ± 0.5      6.73 ± 0.5    -0.13 ± 0.2     6.05 ± 0.8      26.17 ± 2.7    -0.61 ± 0.4
UN       6.49 ± 0.6      6.68 ± 0.6    -0.16 ± 0.2     6.02 ± 0.7      27.59 ± 3.5    -0.59 ± 0.4
CZ       6.25 ± 0.6      6.59 ± 0.6    -0.08 ± 0.2     6.09 ± 0.9      27.20 ± 3.4    -0.64 ± 0.5

Tab. 2 shows the statistics and results for the univariate datasets Air Passengers, Australia Beer, US Gasoline and Sunspots provided in DARTS. These datasets are trained on the baseline N-BEATS architecture, whose time series forecasting capabilities are considered state-of-the-art (Oreshkin et al., 2019), and then trained again on the baseline with DA. We observe a consistent behavior where comfort-zone improves the generalization error compared to ERM and usually reduces the standard deviation for all datasets. Further, comfort-zone beats all other methods, except on Sunspots where the best results for MAPE are attained by mixup, and for RMSE and R2 by additive noise. In addition to Tab. 2, we show in Tab. 9 an extended evaluation, showing the results on DARTS datasets on the baseline architectures RNN, TCN and TRANSFORMER, and on their DA augmented versions.
In this extended setting, we observe that TCN and TRANSFORMER benefit from our DA for all datasets, whereas RNN yields mixed results with comfort-zone. Furthermore, the standard deviation typically becomes smaller with our DA in comparison to the baseline. The effect of additive noise and mixup depends on the architecture and dataset: in some cases the generalization improves and in others it deteriorates. Notably, the best overall results (marked in blue) for each dataset were almost always obtained with comfort-zone. The only exception was for the RMSE and R2 metrics on US Gasoline, where comfort-zone yields the best MAPE results, and the second best RMSE and R2 (marked in red). Finally, we note that while our CZ on Sunspots with N-BEATS in Tab. 2 did not yield the best estimates in comparison to other DA techniques, the setting of TCN with CZ attains the best overall results for Sunspots, see Tab. 9.

Large-scale TSF. To further evaluate our approach in the context of forecasting, we consider larger-scale benchmark datasets. Electricity contains the hourly electricity consumption of 370 customers for a total of ≈ 9.7M samples, and Traffic includes the hourly occupancy rate of 963 car lanes of San Francisco bay area freeways for a total of ≈ 10.1M samples. Both datasets appear frequently in the forecasting literature, e.g., (Salinas et al., 2020; Oreshkin et al., 2019). We incorporate comfort-zone into the N-BEATS framework, which includes ensembling during inference. Following Oreshkin et al. (2019), we train the generic and interpretable models with and without DA, and we use a total of 180 models for evaluation. These models arise from using different metrics, different horizon lengths, and different initialization. Finally, different data splits are considered. Using the original code repository of the authors, we approximately recover their results. We refer to (Oreshkin et al., 2019) for the full details regarding the evaluation protocol and testing setup.
We perform a grid search over α ∈ {0.1, . . . , 1.0} and ρ ∈ {0.8, 0.85, 0.9}, yielding a total of 30 models per dataset, architecture and split. Many hyper-parameter combinations lead to an improvement in the normalized deviation (ND) test error. Tab. 3 shows the ensemble median results of the generic (N-BEATS-G) and interpretable (N-BEATS-I) baselines, as well as our results. While comfort-zone improves the results on both datasets, we observe that the generic net benefits relatively more from our DA in comparison to its interpretable version.

We further extend our evaluation on large-scale TSF tasks where we consider the datasets ETTm2, Exchange, Weather and ILI in the challenging setting of long horizon forecasting benchmark including 2021) where the datasets are described as well as the benchmark setting. For a baseline, we consider the generic version of N-BEATS, trained in ERM and augmented with CZ. We performed a grid search for comfort-zone using ρ ∈ {0.85, 0.90, 0.95}, and α ∈ {0.1, . . . , 1.0}. Many hyper-parameter combinations lead to improved results over the baseline, and we report the best results of our approach per horizon. Tab. 4 reports the MSE and mean absolute error (MAE) metrics of this experiment. In all cases except for ETTm2 with horizon 96, CZ improves generalization and yields better error metrics.
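The grid used for Electricity and Traffic, and the normalized deviation it is scored with, can be sketched as follows. Two assumptions are made: the α values are taken to be spaced by 0.1, which is what the stated total of 30 models implies, and ND is taken in its usual forecasting form, the sum of absolute errors divided by the sum of absolute targets. The name `normalized_deviation` is hypothetical.

```python
import itertools

import numpy as np


def normalized_deviation(y_true, y_pred):
    """ND = sum |y - y_hat| / sum |y| (assumed definition, lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))


# Hyper-parameter grid: 10 alpha values times 3 rho values.
alphas = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
rhos = [0.8, 0.85, 0.9]
grid = list(itertools.product(alphas, rhos))

print(len(grid))                                      # number of (alpha, rho) pairs
print(normalized_deviation([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```

Each `(alpha, rho)` pair would correspond to one trained model per dataset, architecture and split, with the best configuration selected by its ND.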

6. DISCUSSION

We have proposed comfort-zone, a data-driven method for data augmentation of regression tasks. We showed that comfort-zone supports the network's tendency to represent the dominant components of its input signals by creating virtual examples sampled from the tangent planes of the original train set. Implementing comfort-zone is straightforward, and it admits a fully differentiable version as well as a simpler non-differentiable one. Through an extensive evaluation, we have shown that comfort-zone improves the generalization error of neural models on time series forecasting datasets and regression benchmarks. In addition, comfort-zone obtains better results when compared to a few data augmentation baselines, while reducing the standard deviation of the model ensemble.

When inspecting the effect of the hyper-parameters α and ρ, we observe that for small datasets the results improve as α increases, and for medium datasets the results are stable or deteriorate for increasing α. Further, larger neural models (ResNet34) were less affected by changes in α in comparison to smaller models (ResNet18). We find that the choice of ρ depends on the intrinsic features of the dataset. In general, higher ρ is preferable when the intrinsic dimension of the data is higher. However, our understanding of the interplay between the hyper-parameters and model behavior is still limited. The time complexity of comfort-zone is governed by the SVD calculation, which may be restrictive for large train batches.

There are several exciting avenues for future exploration. First, is there a fundamental link between the vicinal distribution employed and the learned representation? While several existing works suggest that linearity yields better models, the model dependency on the specific definition of vicinity is still not well understood. Second, can similar methods be useful in classification tasks? The adaptation of comfort-zone to classification is straightforward; however, several design choices which were tuned for regression may require change in a classification setting.

C ABLATION STUDY

To motivate the specific design choices in comfort-zone, we run an ablation study over different design settings. The first hyper-parameter we consider is µ(λ), which is used to scale the cost function during training. Our experiments show that µ(λ) = 1, i.e., no scaling, leads to the best results, and we report results for the profiles µ(λ) ∈ {1, λ, λ²}. The second hyper-parameter marks whether to scale down the small or the large singular values. In comfort-zone we always scale the small singular values. The third hyper-parameter deals with modifying the samples at the input level or in the latent space. The results are given in Tab. 5. The ablation study is performed on the Concrete regression dataset using the ResNet18 architecture, and on the Australia Beer dataset using the N-BEATS architecture. For both datasets, scaling down the small singular values at the input level with no scaling of the loss function leads to the best test measures. Further, the latent version of comfort-zone yields the second best results. Finally, scaling down the large singular values and the loss function was beneficial for Australia Beer, but resulted in poor measures on Concrete.

E EFFECTS OF ADDITIVE NOISE IN COMBINATION WITH COMFORT-ZONE

In our experiments, we tested DARTS time series datasets using comfort-zone (CZ) and additive uniform noise (UN), together and separately. We reported the results achieving the best metrics, with or without UN. For the sake of completeness, we add in Tab. 7 a comparison of the results for each of our DARTS datasets when training with CZ with and without UN. Typically, there is some improvement when using UN alongside CZ, but this is not always the case, e.g., for Australia Beer and N-BEATS, where CZ alone had better results. For each dataset and architecture we also add the baseline without any DA, which shows that CZ generally improves on the baseline with or without UN.
We believe additive noise is helpful in this test scenario since the DARTS datasets are extremely small, and thus our data augmentation does not necessarily span a wide enough regime. 
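
The intuition above can be illustrated with a small sketch. It assumes that UN adds independent noise drawn uniformly from [-eps, eps] to every entry of a batch. The name `add_uniform_noise`, the parameter `eps`, and the toy low-rank batch are hypothetical and are not taken from the paper.

```python
import numpy as np

def add_uniform_noise(batch, eps, rng):
    """Add i.i.d. noise from U(-eps, eps) to every entry (eps is a hypothetical parameter)."""
    return batch + rng.uniform(-eps, eps, size=batch.shape)

rng = np.random.default_rng(0)
# Toy batch that lies exactly on a 2-dimensional subspace, standing in for
# augmented samples that only span the dominant directions of a tiny dataset.
low_rank = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 8))
noisy = add_uniform_noise(low_rank, eps=0.05, rng=rng)

# Compare the numerical rank before and after adding noise.
print(np.linalg.matrix_rank(low_rank), np.linalg.matrix_rank(noisy))
```

In this toy setting the noise moves samples off the low-dimensional subspace, which is one way to read the claim that UN widens the regime covered by the augmented data.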

F TIME CONSUMPTION OF APPLYING COMFORT-ZONE

We report in Tab. 8 the timings of applying CZ on a batch of each of the time series datasets. We measured these timings by applying CZ on the batch 1000 times, measuring the total elapsed time, and dividing it by 1000 to obtain the average time of a single application.
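
The measurement protocol can be sketched as follows. This is a minimal illustration, not the benchmarking code behind Tab. 8: the batch shape, the function names, and the `n_keep` rule for choosing which singular values to keep are assumptions made for the example.

```python
import time
import numpy as np

def scale_small_singular_values(batch, lam, n_keep):
    """Stand-in for one CZ application: scale all but the top n_keep singular values by lam."""
    u, s, vt = np.linalg.svd(batch, full_matrices=False)
    s = np.where(np.arange(s.shape[0]) < n_keep, s, lam * s)
    return (u * s) @ vt

def average_runtime(fn, n_repeats=1000):
    """Run fn n_repeats times and return the mean wall-clock time per call in seconds."""
    start = time.perf_counter()
    for _ in range(n_repeats):
        fn()
    return (time.perf_counter() - start) / n_repeats

rng = np.random.default_rng(0)
batch = rng.normal(size=(64, 32))  # hypothetical batch shape
avg = average_runtime(lambda: scale_small_singular_values(batch, 0.5, 4))
print(f"average time per application: {avg * 1e3:.3f} ms")
```

Timing the whole loop and dividing by the number of repetitions amortizes timer overhead, which matters when a single application takes well under a millisecond.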

G ADDITIONAL TIME SERIES RESULTS

In addition to the results in Tab. 2, we evaluate our method on three architectures (RNN, TCN and TRANSFORMER), using ERM, mixup, UN and CZ as DA approaches. We report the results in Tab. 9, where we highlight in blue the best method and in red the second-best. Overall, CZ achieves the best results across the RMSE, MAPE and R2 metrics in all cases, except for US Gasoline, where TRANSFORMER yields better RMSE and R2 estimates.

In Tab. 10 we show the results of applying CZ to image datasets, compared with mixup and a baseline with no DA. The results were produced using the aforementioned DA methods in their manifold setup on a preact-resnet18 model and the CIFAR datasets; each configuration was run with three different seeds and the results were averaged. For the manifold-mixup and baseline code we used the repository in (Lim et al., 2021), and we added the manifold CZ version mentioned in Sec. 3 on top of it. When using CZ, we applied it on the data alone, and did not incorporate the labels into the augmentation. That is, after applying CZ to a sample, its target stayed the same. The results are unfavorable towards CZ, but it is worth pointing out that mixup was designed with classification in mind, and it augments the data using the targets as well as the input data. In contrast, CZ was originally designed for regression tasks, and even though it incorporates the targets into the DA in those setups, it is less obvious how to do so with the type of targets used in classification. As mentioned, the naive approach we tried did not use the targets as part of the augmentation, and we think this is the main reason for the deficit in the results. We leave further exploration of this research direction to future work.



Figure 2: Evaluating a non-augmented model and a model trained with comfort-zone on train data whose small singular values are scaled down for different values of λ (left). We show on the right panel an example of a time series sample (black), and its modifications using λ = 0.5 (blue), λ = 0.3 (orange), and λ = 0.0 (gray).

We show the pseudocode for comfort-zone at the input level, l = 0 (left). We demonstrate the effect of a few DA methods on 2D data whose intrinsic dimension is one (right).

denoting the input and output domains by $\mathcal{X}$ and $\mathcal{Y}$, respectively. A regression problem is such that the output domain is (un)countable, e.g., $\mathcal{Y} \subset \mathbb{N}^m$ or $\mathcal{Y} \subset \mathbb{R}^m$. For simplicity, we consider the setting $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^m$, but our method is applicable to other cases. During training, the learning model is provided with a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, sampled from $(x_i, y_i) \sim P$. Our method extends the training distribution by producing a new training set as we describe below.
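
As a rough illustration of an input-level (l = 0) augmentation step, the following NumPy sketch scales the small singular values of a batch by λ. It is not the pseudocode from the figure. Two points are assumptions made for this example: inputs and targets are stacked column-wise before the SVD, and the hypothetical parameter `n_keep` stands in for the rule, governed by ρ, that selects the dominant components.

```python
import numpy as np

def comfort_zone_batch(x, y, lam, n_keep):
    """Scale the small singular values of the stacked batch [x | y] by lam.

    x: (batch, d_in), y: (batch, d_out). n_keep is a hypothetical stand-in
    for the selection rule of the dominant singular values.
    """
    z = np.concatenate([x, y], axis=1)               # (batch, d_in + d_out)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    idx = np.arange(s.shape[0])                      # singular values come sorted in descending order
    s_new = np.where(idx < n_keep, s, lam * s)       # keep the top n_keep, scale the rest
    z_new = (u * s_new) @ vt                         # equals u @ diag(s_new) @ vt
    d_in = x.shape[1]
    return z_new[:, :d_in], z_new[:, d_in:]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
y = rng.normal(size=(8, 2))
x_aug, y_aug = comfort_zone_batch(x, y, lam=0.5, n_keep=2)
print(x_aug.shape, y_aug.shape)
```

With λ = 1 the batch is returned unchanged, and with λ = 0 the result is the best rank-`n_keep` approximation of the stacked batch, so intermediate values interpolate between the two.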

Test errors on regression benchmarks using ResNet18 and ResNet34 architectures.

Test errors of small-scale time series forecasting datasets from DARTS. Each dataset is trained on generic N-BEATS, and it is augmented using comfort-zone and other DA approaches.

Test errors for N-BEATS architectures generic (G) and interpretable (I) trained with and without comfort-zone on Electricity and Traffic datasets for different train-test splits. We refer the reader to Wu et al. (

Long horizon time series forecasting results.

Ablation study of comfort-zone over different loss scaling profiles µ(λ), scaling down the small or large singular values, and modifying data at the input or latent levels.

The split column specifies the point in time from which we split the data to train and test sets. Then #in and #out represent the series length for the input and output, respectively. Finally, #pred is the length of series predicted during model evaluation.

Test errors on several sequential neural architectures on the small-scale time series forecasting datasets from DARTS. Each architecture is trained with comfort-zone and either with (CZ+UN) or without additive uniform noise (CZ).

Running times of a single application of our method for several datasets.

Test errors of several sequential neural models on the time series forecasting datasets from DARTS. Each architecture is also trained with: mixup, additive noise, and comfort-zone.

                                       ± 3.63    14.225 ± 6.32    -0.469 ± 2.00
TRANSFORMER              5.234 ± 1.02     7.245 ± 1.75     0.568 ± 0.17
TRANSFORMER + mixup      6.321 ± 1.64     8.876 ± 2.70     0.352 ± 0.34
TRANSFORMER + UN         4.507 ± 1.10     6.456 ± 1.72     0.673 ± 0.17
TRANSFORMER + CZ         4.352 ± 0.99     6.197 ± 1.52     0.697 ± 0.15

Australia Beer 176
RNN                      4.392 ± 1.46     6.169 ± 2.08     0.822 ± 0.16
RNN + mixup              5.682 ± 2.38     8.096 ± 3.49     0.684 ± 0.28
RNN + UN                 6.416 ± 1.77     9.109 ± 2.67     0.631 ± 0.23
RNN + CZ                 4.522 ± 1.71     6.385 ± 2.48     0.806 ± 0.20
TCN                      4.058 ± 1.85     5.939 ± 2.51     0.834 ± 0.21
TCN + mixup              4.040 ± 2.05     5.775 ± 2.74     0.829 ± 0.25
TCN + UN                 4.798 ± 2.01     7.220 ± 3.05     0.775 ± 0.22
TCN + CZ                 3.757 ± 1.73     5.522 ± 2.34     0.858 ± 0.19
TRANSFORMER              4.939 ± 0.89     6.507 ± 1.44     0.790 ± 0.08
TRANSFORMER + mixup      4.113 ± 0.99     5.776 ± 1.53     0.851 ± 0.07
TRANSFORMER + UN         5.914 ± 1.47     7.963 ± 2.25     0.691 ± 0.15
TRANSFORMER + CZ         4.431 ± 0.88     5.730 ± 1.26     0.830 ± 0.07

                                       ± 0.34     7.233 ± 0.37    -0.199 ± 0.12
TCN + CZ                 6.737 ± 0.37     7.399 ± 0.43    -0.248 ± 0.14
TRANSFORMER              6.066 ± 0.38     6.478 ± 0.36    -0.012 ± 0.13
TRANSFORMER + mixup      6.254 ± 0.46     6.517 ± 0.42    -0.078 ± 0.17
TRANSFORMER + UN         6.187 ± 0.43     6.461 ± 0.40    -0.054 ± 0.15
TRANSFORMER + CZ         6.153 ± 0.44     6.443 ± 0.38    -0.043 ± 0.15

Sunspots 2820
RNN                      5.773 ± 0.24    28.259 ± 1.45    -0.443 ± 0.12
RNN + mixup              5.797 ± 0.25    28.842 ± 1.44    -0.455 ± 0.13
RNN + UN                 5.953 ± 0.33    30.169 ± 1.94    -0.537 ± 0.17
RNN + CZ                 5.759 ± 0.26    28.262 ± 1.58    -0.436 ± 0.13
TCN                      5.901 ± 0.66    25.455 ± 2.06    -0.524 ± 0.35
TCN + UN                 6.277 ± 0.66    28.336 ± 2.83    -0.722 ± 0.36
TCN + mixup              5.936 ± 0.52    26.062 ± 1.88    -0.535 ± 0.28
TCN + CZ                 5.785 ± 0.63    25.287 ± 2.35    -0.464 ± 0.32
TRANSFORMER              6.399 ± 0.94    27.161 ± 2.57    -0.808 ± 0.55
TRANSFORMER + mixup      6.230 ± 0.83    26.653 ± 2.54    -0.707 ± 0.48
TRANSFORMER + UN         5.899 ± 0.61    26.121 ± 2.31    -0.520 ± 0.32
TRANSFORMER + CZ         5.902 ± 0.47    26.890 ± 2.44    -0.515 ± 0.25

H RESULTS OF CZ ON CIFAR DATASETS

Accuracy results on image (CIFAR) datasets.

A A FULLY DIFFERENTIABLE COMFORT-ZONE

In Sec. 3 and Fig. 1 we discuss a potential implementation of our method at the input level, i.e., for l = 0. However, this approach is not suitable for the latent version. Indeed, identifying the indices of the singular values that should be scaled by λ, as was proposed in Sec. 3, is not a differentiable operation. Specifically, the use of numpy.where() does not allow for end-to-end learning, and it should be replaced. Fortunately, PyTorch allows for differentiable index selection from a tensor; using this feature, we can separate the singular values we wish to scale from those we wish to keep as is. We split the singular values vector s into two vectors, scale the desired singular values, and then concatenate the two vectors (a differentiable operation in itself) to recreate s. For completeness, we provide the pseudocode for the fully differentiable scale_down function in Fig. 3.
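
The split, scale, and concatenate pattern can be sketched as below. This is not the scale_down function of Fig. 3: it is written in NumPy only to show the data flow, so it does not demonstrate automatic differentiation. In PyTorch, the analogous calls would be `torch.index_select` and `torch.cat`. The function name and the `n_keep` parameter are hypothetical.

```python
import numpy as np

def scale_down_split(s, lam, n_keep):
    """Split s, scale the trailing (small) singular values by lam, and concatenate again."""
    keep_idx = np.arange(n_keep)                  # indices of values left untouched
    scale_idx = np.arange(n_keep, s.shape[0])     # indices of values to scale
    s_keep = np.take(s, keep_idx)
    s_scaled = lam * np.take(s, scale_idx)
    return np.concatenate([s_keep, s_scaled])

s = np.array([5.0, 3.0, 0.5, 0.1])
print(scale_down_split(s, lam=0.5, n_keep=2))     # values: 5, 3, 0.25, 0.05
```

Every output entry depends linearly on exactly one input entry, so the derivative with respect to s is 1 for the kept entries and λ for the scaled ones. Because the indices are fixed and only selection, scaling, and concatenation are used, an automatic differentiation framework can propagate gradients through the operation.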

BETTER

Following the discussion in Sec. 3, we verify empirically that neural networks model the dominant parts of their data better. We repeat the experiment in Fig. 2 in the main text using several datasets and architectures. Every pair of dataset and architecture is evaluated on the dataset whose singular values are modified using varying values of λ. The results are presented in Fig. 4, where solid lines represent the non-regularized version, and dashed lines are associated with models trained with our DA. In all cases we observe a similar qualitative behavior to the one we reported in Sec. 3. In particular, the highest MSE values are obtained for both the baseline and regularized models for λ = 1, i.e., when the data is unchanged. Further, the models attain improved error measures as λ decreases, where the error profile is similar for the baseline and regularized models. Based on these results, we deduce that sequential models prefer to represent and compute the dominant components of data.

Figure 4: We reproduce Fig. 2 for several architectures and datasets. In all cases the models achieve better error measures for the modified data, whether it appeared during training or not.
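
The evaluation protocol behind such a sweep can be sketched on toy data. This example makes several assumptions: a least-squares linear model stands in for the neural networks, inputs and targets are stacked before the SVD, and `n_keep` is a hypothetical rule for picking the dominant components. It does not reproduce the reported curves, and no particular trend is claimed for this toy setup.

```python
import numpy as np

def scale_small(z, lam, n_keep):
    """Scale all but the top n_keep singular values of z by lam."""
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    s = np.where(np.arange(s.shape[0]) < n_keep, s, lam * s)
    return (u * s) @ vt

def mse_under_scaling(x, y, weights, lambdas, n_keep):
    """Evaluate a fixed linear model on copies of (x, y) modified with each lam."""
    d_in = x.shape[1]
    z = np.concatenate([x, y], axis=1)
    errors = {}
    for lam in lambdas:
        z_mod = scale_small(z, lam, n_keep)
        x_mod, y_mod = z_mod[:, :d_in], z_mod[:, d_in:]
        errors[lam] = float(np.mean((x_mod @ weights - y_mod) ** 2))
    return errors

rng = np.random.default_rng(0)
# Toy inputs: a 2-dimensional signal embedded in 10 dimensions plus small noise.
x = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))
y = x @ rng.normal(size=(10, 1)) + 0.1 * rng.normal(size=(200, 1))
weights, *_ = np.linalg.lstsq(x, y, rcond=None)   # model fitted on the unmodified data
errors = mse_under_scaling(x, y, weights, [1.0, 0.5, 0.3, 0.0], n_keep=2)
print(errors)
```

The entry for λ = 1 is the error on the unmodified data, so it serves as the reference point against which the other values of λ are compared.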

