REGULARIZATION COCKTAILS FOR TABULAR DATASETS

Abstract

The regularization of prediction models is arguably the most crucial ingredient that allows Machine Learning solutions to generalize well on unseen data. Several types of regularization are popular in the Deep Learning community (e.g., weight decay, drop-out, early stopping, etc.), but so far these are selected on an ad-hoc basis, and there is no systematic study as to how different regularizers should be combined into the best "cocktail". In this paper, we fill this gap by considering cocktails of 13 different regularization methods and framing the question of how to best combine them as a standard hyperparameter optimization problem. We perform a large-scale empirical study on 40 tabular datasets, concluding that, firstly, regularization cocktails substantially outperform individual regularization methods, even if the hyperparameters of the latter are carefully tuned; secondly, the optimal regularization cocktail depends on the dataset; and thirdly, regularization cocktails set the state of the art for classifying tabular datasets, outperforming Gradient-Boosted Decision Trees.

1. INTRODUCTION

In most supervised learning application domains, the available data for training predictive models is both limited and noisy with respect to the target variable. Therefore, it is paramount to regularize machine learning models so that their predictive performance generalizes to future unseen data. The concept of regularization is well-studied and constitutes one of the pillars of machine learning. Throughout this work we use the term "regularization" for all methods that explicitly or implicitly take measures to reduce overfitting; we categorize these non-exhaustively into weight decay, data augmentation, model averaging, structure and linearization, and implicit regularization families (detailed in Section 2). In this paper, we propose a new principled strategy that highlights the need for automatically learning the optimal combination of regularizers, denoted as regularization cocktails, via a hyperparameter optimization procedure. Combining regularization methods is, of course, far from a novel practice per se: most modern deep learning models use combinations of several regularizers. For instance, EfficientNet (Tan & Le, 2019) mixes components of structural regularization and linearization via ResNet-style skip connections (He et al., 2016), learning rate scheduling, Drop-Out ensembling (Srivastava et al., 2014), and AutoAugment data augmentation (Cubuk et al., 2019). However, even though each of those regularizers is motivated in isolation, the reasoning behind a specific combination of regularizers is largely based on accuracy-driven manual trial-and-error iterations, mostly on image classification benchmarks such as CIFAR (Krizhevsky et al., 2009) and ImageNet (Deng et al., 2009).
Unfortunately, the manual search for combinations of regularizers is sub-optimal and unsustainable; in essence, it is an example of manual hyperparameter tuning, which in turn is easily outperformed by automated algorithms (Snoek et al., 2012; Thornton et al., 2013; Feurer et al., 2015; Olson & Moore, 2016; Jin et al., 2019; Erickson et al., 2020; Zimmer et al., 2020). Following the spirit of AutoML (Hutter et al., 2018), we therefore propose a strategy for learning the optimal dataset-specific regularization cocktail by means of a modern hyperparameter optimization (HPO) method. To the best of our knowledge, there exists no study providing empirical evidence that a mixture of numerous regularizers outperforms individual regularizers; this paper fills that gap. More precisely, the research hypothesis of this paper is that a properly mixed regularization cocktail outperforms every individual regularizer in it, in terms of accuracy under the same run-time budget, and that the best cocktail to use depends on the dataset. To validate this hypothesis, we executed a large-scale experimental study employing 40 diverse tabular datasets and 13 prominent regularizers, with thorough hyperparameter tuning for all regularizers. We focus on tabular datasets because, in contrast to large image datasets, a thorough hyperparameter search procedure is feasible. Moreover, neural networks are high-variance models on tabular datasets, so improved regularization schemes can provide a relatively larger generalization gain on tabular data than on other data types. Thereby, we make the following contributions:

1. We demonstrate the empirical accuracy gains of regularization cocktails in a systematic manner via a large-scale experimental study on tabular datasets;
2. We challenge the status-quo practice of designing universal dataset-agnostic regularizers, by showing that the optimal regularization cocktail is highly dataset-dependent;
3. We demonstrate that regularization cocktails achieve state-of-the-art classification accuracy on tabular datasets and outperform Gradient-Boosted Decision Trees (GBDT) with a statistically-significant margin;
4. As an overarching contribution, this paper provides previously-lacking in-depth empirical evidence to better understand the importance of combining different mechanisms for regularization, one of the most fundamental concepts in machine learning.

2. RELATED WORK

Weight decay: The classical approaches to regularization focus on minimizing the norms of the parameter values, concretely either the L1 norm (Tibshirani, 1996), the L2 norm (Tikhonov, 1943), or a combination of the two known as the Elastic Net (Zou & Hastie, 2005). A recent work fixes the common mistake of adding the decay penalty term before the momentum-based adaptive learning-rate step (e.g., in common implementations of Adam (Kingma & Ba, 2015)) by decoupling the regularization from the loss and applying it after the learning rate computation (Loshchilov & Hutter, 2019). Data Augmentation: A different treatment of overfitting relies on enriching the training dataset via instance augmentation. The literature on data augmentation is vast, especially for image data, ranging from basic image manipulations (e.g., geometric transformations, or mixing images) up to parametric augmentation strategies such as adversarial and controller-based methods (Shorten & Khoshgoftaar, 2019). For example, Cut-Out (DeVries & Taylor, 2017) masks a subset of input features (e.g., pixel patches for images) to ensure that predictions remain invariant to distortions in the input space. Along similar lines, Mix-Up (Zhang et al., 2018) generates new instances as linear combinations of pairs of training examples, while Cut-Mix (Yun et al., 2019) suggests super-positions of instance pairs with mutually-exclusive pixel masks. A recent technique, called Aug-Mix (Hendrycks et al., 2020), generates instances by sampling chains of augmentation operations. On the other hand, the direction of reinforcement learning (RL) for augmentation policies was elaborated by Auto-Augment (Cubuk et al., 2019), followed by a technique that speeds up the training of the RL policy (Lim et al., 2019).
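To make the Mix-Up idea concrete for the tabular setting studied in this paper, here is a minimal sketch (function and variable names are ours, not from any particular library): each batch is mixed with a permuted copy of itself using a single Beta-distributed coefficient, and the one-hot targets are mixed with the same coefficient.

```python
import numpy as np

def mixup_batch(X, y_onehot, alpha=0.2, rng=None):
    """Mix-Up for a tabular batch: convex combinations of instance pairs.

    X: (n, d) feature matrix, y_onehot: (n, c) one-hot targets.
    Returns mixed features and correspondingly mixed (soft) targets.
    """
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    lam = rng.beta(alpha, alpha)   # one mixing coefficient for the batch
    perm = rng.permutation(n)      # partner instance for each row
    X_mixed = lam * X + (1.0 - lam) * X[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return X_mixed, y_mixed
```

Because the targets are mixed with the same coefficient as the features, the soft labels of each mixed instance still sum to one.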
Last but not least, adversarial attack strategies (e.g., FGSM (Goodfellow et al., 2015)) generate synthetic examples with minimal perturbations, which are employed in training robust models (Madry et al., 2018). Model Averaging: Ensembles of machine learning models have been shown to reduce variance and act as regularizers (Polikar, 2012). A popular ensemble of neural networks with shared weights among its base models is Drop-Out (Srivastava et al., 2014), which was extended to a variational version with a Gaussian posterior over the model parameters (Kingma et al., 2015). A follow-up work known as Mix-Out (Lee et al., 2020) extends Drop-Out by statistically fusing the parameters of two base models. Furthermore, ensembles can be created from the models at the local optima discovered along a single convergence procedure (Huang et al., 2016). Structural and Linearization: One strategy for regularizing deep learning models is to discover dedicated neural structures that generalize on particular tasks, such as image classification or Natural Language Processing (NLP). In that context, ResNet adds skip connections across layers (He et al., 2016), while the Inception model computes latent representations by aggregating convolutional filters of diverse sizes (Szegedy et al., 2017). The attention mechanism gave rise to the popular Transformer architecture in the realm of NLP (Vaswani et al., 2017). More recently, EfficientNet scales deep convolutional neural networks by controlling only a few hyperparameters (Tan & Le, 2019). Besides the aforementioned manually-designed architectures, the stream of Neural Architecture Search (Elsken et al., 2019) focuses on exploring neural connectivity graphs to find optimal architectures via reinforcement learning (Zoph & Le, 2017), black-box search (Real et al., 2019), or differentiable solvers (Liu et al., 2019).
A recent trend adds a dosage of linearization to deep models, where skip connections transfer embeddings from earlier, less non-linear layers (He et al., 2016; Huang et al., 2017). Along similar lines, the Shake-Shake regularization deploys skip connections in parallel convolutional blocks and aggregates the parallel representations through affine combinations (Gastaldi, 2017), while Shake-Drop extends this mechanism to a larger number of CNN architectures (Yamada et al., 2018). Implicit: The last family of regularizers broadly encapsulates methods that do not directly propose novel regularization techniques but have an implicit regularization effect as a virtue of their 'modus operandi' (Arora et al., 2019). For instance, Batch Normalization improves generalization by reducing internal covariate shift (Ioffe & Szegedy, 2015), while early stopping of the optimization procedure yields a similar generalization effect (Yao et al., 2007). On the other hand, stabilizing the convergence of the training routine is another form of implicit regularization, for instance by introducing learning rate scheduling schemes (Loshchilov & Hutter, 2017). The recent strategy of stochastic weight averaging relies on averaging parameter values from the local optima encountered along the sequence of optimization steps (Izmailov et al., 2018), while another approach conducts updates in the direction of a few 'lookahead' steps (Zhang et al., 2019). Positioning in the realm of AutoML: In contrast to the prior literature, we do not propose a new individual regularization method, but empirically identify the superiority of learned regularization cocktails drawn from a set of existing regularizers across the aforementioned categories. We train dataset-specific cocktails as a hyperparameter optimization (HPO) task (Feurer & Hutter, 2019).
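The core operation behind the stochastic weight averaging mentioned above is simple enough to sketch directly (names are illustrative): the final model's parameters are the element-wise mean of parameter snapshots collected along the training trajectory.

```python
import numpy as np

def swa_average(weight_snapshots):
    """Stochastic weight averaging, sketched on flat parameter vectors:
    the averaged model uses the element-wise mean of the parameters
    collected at several points (e.g., near local optima) during training."""
    return np.mean(np.stack(weight_snapshots, axis=0), axis=0)
```

In practice each snapshot would be a full set of network weights rather than a flat vector, but the averaging rule is the same per parameter tensor.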
In that regard, our work is positioned in the realm of AutoML and is a special case of combined algorithm selection and hyperparameter optimization (Thornton et al., 2013). We learn the regularization cocktails and optimize the joint hyperparameter configuration space by means of BOHB (Falkner et al., 2018a), a variation of Hyperband (Li et al., 2017) with model-based surrogates and one of the current state-of-the-art approaches for efficient HPO. While the search for optimal hyperparameters λ is an active field of research in the realm of AutoML (Hutter et al., 2018), the choice of the regularizer Ω mostly remains an ad-hoc practice, where practitioners select a few combinations among popular regularizers (Drop-Out, L2, Batch Normalization, etc.). In contrast to prior studies, we hypothesize that the optimal regularizer is a cocktail mixture of a large set of regularization methods, all being simultaneously applied with different strengths (i.e., dataset-specific hyperparameters). Given a set of K regularizers {Ω^(k)(·; λ^(k))}_{k=1}^{K} := {Ω^(1)(·; λ^(1)), ..., Ω^(K)(·; λ^(K))}, each with its own hyperparameters λ^(k) ∈ Λ^(k), ∀k ∈ {1, ..., K}, the problem of finding the optimal cocktail of regularizers is:

λ* ∈ arg min_{λ^(1) ∈ Λ^(1), ..., λ^(K) ∈ Λ^(K)} L(y^(Val), f(X^(Val); {Ω^(k)(θ*; λ^(k))}_{k=1}^{K}))    (4)

s.t.  θ* ∈ arg min_θ L(y^(Train), f(X^(Train); {Ω^(k)(θ; λ^(k))}_{k=1}^{K}))    (5)

The intuitive interpretation of Equations 4-5 is searching for the optimal hyperparameters λ (i.e., strengths) of the cocktail's regularizers using the validation set (Equation 4), given that the optimal prediction-model parameters θ are trained under the regime of all the regularizers being applied jointly (Equation 5). We stress that the hyperparameters λ^(k) include a conditional hyperparameter controlling whether the k-th regularizer is applied at all, or skipped. Therefore, the best cocktail might consist of only a subset of the regularizers.
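The bilevel structure of Equations 4-5 can be sketched in a few lines of Python, with random search standing in for BOHB and purely illustrative names (`train_and_eval`, the regularizer list, and the 0.5 enable probability are our assumptions, not the paper's actual setup): the outer loop searches the regularization hyperparameters λ on validation performance, while each inner call trains θ under the sampled cocktail.

```python
import random

# Illustrative subset of cocktail ingredients; the paper uses 13 regularizers.
REGULARIZERS = ["weight_decay", "dropout", "mixup", "shake_drop"]

def sample_cocktail(rng):
    """Sample lambda: each regularizer gets an on/off flag (the conditional
    hyperparameter) plus a strength that only matters when it is enabled."""
    return {
        name: {"enabled": rng.random() < 0.5, "strength": rng.uniform(0.0, 1.0)}
        for name in REGULARIZERS
    }

def search(train_and_eval, n_trials=50, seed=0):
    """Outer problem (Eq. 4): keep the lambda whose inner solution (Eq. 5)
    minimizes validation loss. `train_and_eval(lam)` is assumed to train
    theta on the training split under cocktail `lam` and return the
    validation loss of the resulting model."""
    rng = random.Random(seed)
    best_lam, best_loss = None, float("inf")
    for _ in range(n_trials):
        lam = sample_cocktail(rng)
        loss = train_and_eval(lam)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss
```

Because the on/off flags are part of λ, the search naturally considers cocktails that use only a subset of the available regularizers.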

3.2. REGULARIZATION INGREDIENTS AND THE SEARCH SPACE

To build the regularization cocktails we combine the 13 methods shown in Table 1, which are selected from the categories of regularizers covered in Section 2, each having its own hyperparameter search space. We set the remaining hyperparameters regarding the architecture and the optimizer as detailed in Table 2 in Appendix B.1. The regularization cocktails introduce 9 non-conditional hyperparameters in the search space, which, in turn, can add up to 9 conditional hyperparameters. In total, our regularization cocktails can add up to 18 hyperparameters to the search space. Some combinations in the defined search space are not technically feasible; therefore, we introduce the following constraints: (i) Shake-Shake and Shake-Drop are not simultaneously active, since the latter builds on the former. (ii) Only one data augmentation technique out of Mix-Up, Cut-Mix, Cut-Out, and FGSM adversarial learning can be active at a time, due to a technical limitation of the base library (Zimmer et al., 2020). As an optimizer, we decided to use BOHB (Falkner et al., 2018a), since it achieves a strong anytime performance by combining Hyperband (Li et al., 2017) and Bayesian Optimization (Shahriari et al., 2016), and still has the convergence guarantees of Hyperband. Furthermore, BOHB can deal with the categorical hyperparameters for enabling or disabling regularization techniques and the corresponding conditional structures. In Appendix A we provide a brief description of how BOHB works.
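To make the conditional structure concrete, here is a minimal sketch of sampling one cocktail configuration that honors both constraints above (the configuration keys are illustrative, not the base library's actual names): the two shake variants are drawn as a single three-way choice, and augmentation is a single categorical with a "none" option.

```python
import random

AUGMENTATIONS = ["mixup", "cutmix", "cutout", "fgsm", None]  # at most one active

def sample_configuration(rng):
    """Sample one cocktail configuration honoring the two search-space
    constraints: (i) Shake-Shake and Shake-Drop are mutually exclusive,
    (ii) at most one data-augmentation method is active."""
    cfg = {
        "weight_decay": rng.random() < 0.5,
        "dropout": rng.random() < 0.5,
        "batch_norm": rng.random() < 0.5,
        "shake_shake": False,
        "shake_drop": False,
        # (ii) one categorical choice covers all augmentations, incl. "off"
        "augmentation": rng.choice(AUGMENTATIONS),
    }
    # (i) a single three-way draw makes the shake variants mutually exclusive
    shake = rng.choice(["none", "shake_shake", "shake_drop"])
    if shake != "none":
        cfg[shake] = True
    return cfg
```

Encoding each constraint as one categorical draw guarantees that no sampled configuration ever violates it, which is simpler than sampling freely and rejecting.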

4. EXPERIMENTAL PROTOCOL 4.1 EXPERIMENTAL SETUP

We use a collection of 40 tabular datasets (listed in Table 4 of Appendix D) and extend the base library (Mendoza et al., 2018; Zimmer et al., 2020) with our implementations of the regularizers, as shown in Table 1. To optimally utilize resources, we ran BOHB with 10 workers in parallel, where each worker had access to 2 CPU cores and 12GB of memory, executing one configuration at a time. In view of limited computational resources, and taking into account the dimensionality D of the considered configuration spaces, we ran BOHB for at most 4 days, or at most 40 × D hyperparameter configurations, whichever came first. During the training phase, each configuration was run for 105 epochs. For the sake of studying the effect on more datasets, we only evaluated a single train-val-test split. After the search phase is completed, we retrain the best-found hyperparameter configuration on the joint train and validation set for 105 epochs and report its performance on the test set.

4.2. FIXED ARCHITECTURE AND OPTIMIZATION HYPERPARAMETERS

In order to focus exclusively on investigating the effect of the individual regularization methods, we fix the hyperparameters related to the model architecture and the general training procedure, as specified in Table 2 of Appendix B.1. These hyperparameter values are tuned to maximize the performance of an unregularized neural network on our dataset collection (see Table 6 in Appendix D). Moreover, we set a low learning rate of 10^-3 after performing a grid search for the value performing best across all datasets. We use the AdamW implementation (Loshchilov & Hutter, 2019), which implements decoupled weight decay, and cosine annealing with restarts (Loshchilov & Hutter, 2017) as a learning rate scheduler. Using a learning rate scheduler with restarts helps in our case because we keep a fixed initial learning rate. For the restarts, we use an initial budget of 15 epochs, with a budget multiplier of 3, following published practices (Zimmer et al., 2020). Additionally, since our benchmark includes imbalanced datasets, we use a weighted version of categorical cross-entropy as the loss and balanced accuracy (Brodersen et al., 2010) as the evaluation metric.
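A small helper, written by us under the stated settings (initial budget 15 epochs, multiplier 3), shows at which epochs the cosine-annealing schedule restarts within a training run:

```python
def restart_epochs(total_epochs, initial_budget=15, multiplier=3):
    """Epochs at which a cosine-annealing-with-restarts schedule resets,
    given geometrically growing cycle lengths. Only restarts that fall
    strictly inside the training run are returned."""
    restarts, cycle, t = [], initial_budget, 0
    while t + cycle < total_epochs:
        t += cycle                 # a full cycle of `cycle` epochs completes
        restarts.append(t)         # the learning rate resets here
        cycle *= multiplier        # next cycle is `multiplier` times longer
    return restarts
```

For the 105 training epochs used in our protocol, this yields restarts after epochs 15 and 60 (cycles of 15 and 45 epochs), with the third cycle covering the remaining 45 epochs without finishing.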

4.3. HYPOTHESES AND ASSOCIATED EXPERIMENTS

Hypothesis 1: Regularization cocktails achieve better generalization performance than the individual regularization methods across datasets.

Experiment 1: We regularize the plain neural network (Section 4.2) with each method from Table 1, one at a time. For every regularizer, we tune its hyperparameters on each dataset, then measure the regularized network's performance on the test set after retraining the best hyperparameter configuration on the joint train and validation set. For each dataset, we compare these results against a cocktail optimized in the same way as each of the individual ingredient regularizers.

Hypothesis 2: The optimal regularization cocktail is dataset-dependent.

Experiment 2: We study the best-found regularization cocktail of every dataset and the frequencies with which BOHB chose to activate each regularizer, to demonstrate that no combination of regularizers dominates. Furthermore, we regularize the plain network with the most frequent regularizers and compare it against our proposed method of Section 3.

Hypothesis 3: Regularization cocktails achieve state-of-the-art classification accuracy on tabular datasets.

Experiment 3: We compare against GBDT, the state-of-the-art classifier for tabular data. For a fair comparison, we optimized the hyperparameters of GBDT on every dataset using the popular Auto-Sklearn library, following the exact same hyperparameter search protocol (same train, validation, and test splits) and providing GBDT with the same HPO budget as our proposed method. The search space for the hyperparameters of GBDT is further detailed in Appendix B.2.

5. RESULTS

Regularization cocktail performance (Experiment 1): Figure 1 presents the critical difference diagram of the ranks, which demonstrates that the cocktail outperforms each individual regularizer.
The critical difference diagram is generated by performing a post-hoc analysis based on the Wilcoxon-Holm method (Wilcoxon, 1992; Holm, 1979), with a p-value of 0.05 as the threshold for statistical significance. Observing the results, the regularization cocktail manages to outperform all individual cocktail ingredients by a statistically significant margin. This confirms our hypothesis that well-tuned regularization cocktails outperform well-tuned individual regularization techniques across a diverse suite of tabular datasets. In addition, Figure 2 provides additional information on the rank distributions of the compared methods, while Figure 6 of Appendix C offers detailed descriptive statistics for each one-on-one comparison against the baselines.
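The Holm part of the Wilcoxon-Holm analysis is a multiple-testing correction applied to the per-pair Wilcoxon p-values; a minimal sketch of the step-down procedure (our own implementation, assuming the p-values have already been computed, e.g., with `scipy.stats.wilcoxon`) looks as follows:

```python
def holm_correction(p_values, alpha=0.05):
    """Holm step-down procedure: sort p-values ascending and compare the
    i-th smallest against alpha / (m - i); reject hypotheses until the
    first comparison fails, then stop. Returns a reject flag per input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail
    return reject
```

Compared to a plain Bonferroni correction, Holm's procedure is uniformly more powerful while still controlling the family-wise error rate, which is why it is the standard choice for critical difference diagrams.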


Dataset-dependent optimal cocktails (Experiment 2): Figure 3 shows that the optimal regularization cocktail depends on the dataset at hand, since no combination of regularizers was active on the majority of the datasets. The plot depicts all singular regularizers and pairs of regularizers occurring in at least 30% of the datasets, based on how often they were part of the per-dataset cocktails. The most frequent pair of regularizers (BN and SE) is selected on only 50% of the datasets, which highlights the fact that regularization cocktails are dataset-specific and there is no frequent universal combination. Moreover, the results presented in Figure 3 provide insights into frequent cocktail ingredients with regard to the regularization types. For instance, although Snapshot Ensembling as an individual method ranks 7-th among the 13 regularizers in Figure 1, it is nevertheless present in the regularization cocktails of 72.5% of the datasets. This finding hints that optimal cocktails are composed of weaker regularizers whose dataset-dependent combination enhances the regularization effect. For a more in-depth summary of the frequencies of the individual regularization methods, we refer to Figure 7 of Appendix C. Lastly, to further validate that well-performing regularization cocktails are dataset-dependent, we conducted another experiment by creating 2 baselines consisting of the following top-5 cocktails:

1. The top-5 most frequent regularizers of Experiment 2 (Snapshot Ensembling, Batch Normalization, Dropout, Weight Decay, and dataset-specific augmentation);
2. The top-5 regularizers with the highest ranks from Experiment 1 (Dropout, Shake-Drop, Batch Normalization, Snapshot Ensembling, and dataset-specific augmentation).
In both top-5 cocktails, "dataset-specific augmentation" signifies having data augmentation activated; however, the choice between CutMix, CutOut, and Mixup is dataset-specific and is tuned during the HPO process. This design decision was taken to make the baselines even more competitive. The regularizers in the top-5 baselines are always applied jointly (i.e., no subset of those methods is selected on a per-dataset basis); however, we tune the hyperparameters of all regularizers in each top-5 baseline jointly for each dataset. We observe that both top-5 baselines underperform against our proposed dataset-specific cocktail, as indicated in Figure 4. Additionally, we measured the statistical significance between the top-5 baselines and our method using the Wilcoxon signed-rank test at a 10% significance level. For the top-5 highest-ranks variant, the result confirms that the difference is significant, with a p-value of 0.00004. Similarly, the results show significance against the top-5 most frequent variant, with a p-value of 0.08143. For a detailed summary of all the results for every method, we refer to Appendix D, Table 5.

Regularization cocktails achieve state-of-the-art classification accuracy on tabular datasets (Experiment 3): To investigate whether the regularization cocktails achieve state-of-the-art classification accuracy, we compare our method against GBDT, the de-facto state of the art for tabular datasets. The results, as presented in Figure 5, show the superiority of the regularization cocktails in terms of predictive performance under the same time and resource constraints as the GBDT algorithm. We used the GBDT implementation of Auto-Sklearn, a popular automated tool in the realm of AutoML. To ensure a fair comparison, we ran GBDT with the same setup (same training, validation, and testing splits) and the same hyperparameter search time.
In addition, we ran the experiments on the same computational hardware as for the cocktail. More details on the experimental setup are presented in Appendix B.2. The regularization cocktails achieve higher accuracies than GBDT on 28 out of 40 datasets (70% win ratio) and the difference is statistically significant. Figure 5 further illustrates that the performance gain of the regularization cocktails is invariant to the dataset size: in the left subplot of Figure 5 we observe that our method outperforms GBDT on both small and large datasets. Furthermore, the right subplot of Figure 5 shows that the gain is not marginal; on certain datasets we achieve up to a 30% increase in test accuracy. The full per-dataset accuracies of GBDT are found in Appendix D, Table 7. Lastly, we computed the statistical significance between the cocktail and GBDT using the Wilcoxon signed-rank test, which resulted in a p-value of 0.0003. Based on these empirical results, we conclude that regularization cocktails yield state-of-the-art prediction models for classifying tabular datasets.

6. CONCLUSIONS AND FUTURE WORK

Even though combining regularizers is a relatively frequent practice among researchers, to date there exists no prior work that systematically studies the effect of optimally combining regularization methods. This paper presented a first step in empirically studying regularization cocktails, by posing the problem as a standard hyperparameter optimization task. We conducted a large-scale experiment involving 13 regularization methods and 40 datasets, with a thorough hyperparameter optimization procedure for each technique. The findings of this study can be summarized as three simple take-home messages for practitioners:

1. Instead of applying a single regularization technique, exploit the complementary effects of regularization cocktails.
2. To make neural networks achieve state-of-the-art accuracy in classifying tabular datasets, apply regularization cocktails.
3. To obtain a well-performing, dataset-specific regularization cocktail, use state-of-the-art hyperparameter optimization techniques.

As future work, we would like to combine regularization cocktails for neural networks with automated data preprocessing pipelines and architecture search, in order to further advance the performance of deep learning on small tabular data.

Furthermore, the estimator for Auto-Sklearn is restricted to only include GBDT, for the sake of fully comparing against the algorithm as a baseline. We do not activate any preprocessing, since our regularization cocktails do not make use of preprocessing algorithms in the pipeline either. The time budget given to Auto-Sklearn matches the time BOHB needed to find the hyperparameter configuration with the best validation accuracy, measured from the start of the hyperparameter optimization phase. The ensemble size is kept at 1, since our method features only one classifier and not multiple ones.
The seed is set to 11, as in the experiments with the regularization cocktail, so that we have the same data splits. To keep the comparison fair, there is no warm start for the initial configurations with meta-learning, since our method does not make use of meta-learning either. Lastly, the number of parallel workers is set to 10, matching the parallel resources given to the experiments with the regularization cocktails.

D TABLES

In this section, Table 4 provides information about the datasets considered in our experiments. Concretely, we provide descriptive statistics and the identifiers for every dataset. The identifier (the task id) can be used to download the datasets from OpenML. Moreover, Table 5 shows the results of the comparison between the regularization cocktail and the top-5 cocktail variants described in Experiment 2. The results are calculated on the test set for all datasets, after retraining on the best dataset-specific hyperparameter configuration. In Table 6 we provide the results of all our experiments for the baseline, the individual regularization methods, and the regularization cocktail. All results are calculated on the test set after retraining on the best-found hyperparameter configurations. The evaluation metric is balanced accuracy.



Footnotes:
- For simplicity, we only discuss the hold-out validation scheme here, but in principle any other validation scheme, such as cross-validation or bootstrap sampling, would be possible.
- We use a 9-layer feed-forward neural network with a fixed number of units in each layer, a choice based on a previous related work (Orhan & Pitkow, 2017). We emphasize that the network has a sufficiently large capacity, to ensure that the effect of the regularization methods is noticeable.
- https://automl.github.io/auto-sklearn



Figure 1: Critical difference diagram generated with the Wilcoxon-Holm post-hoc analysis on 40 datasets. The diagram shows the ranks and the statistical significance of the results for every individual regularization technique and our regularization cocktails.

Figure 2: Rank distribution for the individual regularization methods and the regularization cocktail. The rank distribution for each method is calculated on the test set over all datasets.

Figure 3: Left: Individual and pairwise cocktail ingredients occurring in at least 30% of the datasets. Right: Clustered histogram of cocktail ingredients. Data Augmentation: {CutMix, Cutout, Mixup, Adversarial Training}, Structural: {Skip connection, Shake-Shake, Shake-Drop}, Weight aggregation: {Lookahead Optimizer, Stochastic Weight Averaging, Snapshot Ensembling}.

Figure 4: Comparison of our proposed dataset-specific cocktail against the cocktail of the top-5 most frequent regularizers (top row), and the cocktail of the top-5 regularizers with the highest performance ranks (bottom row). Each point represents a dataset, and the gain is defined as the test accuracy of our method divided by the test accuracy of each baseline. We illustrate the gain against three quantities: the number of samples (left), the number of features (middle), and the test accuracy (right).

Figure 5: Left: Cocktail gain over the GBDT algorithm, calculated on the test set for every dataset (the gain is calculated by dividing the cocktail accuracy by the GBDT accuracy). Right: The distribution of the cocktail gain.

Figure 6: Pairwise statistical significance and comparison. For every entry, the first row showcases the wins, draws, and losses of the horizontal method against the vertical method on all datasets, calculated on the test set; the second row presents the p-value of the statistical significance test.



The configuration space for the regularization cocktail regarding the explicit regularization hyperparameters of the methods and the conditional constraints enabling or disabling them.

B.2 AUTO-SKLEARN: GRADIENT-BOOSTED DECISION TREE SEARCH SPACE

For Experiment 3, we set up the search space of Auto-Sklearn as follows:

The search space of the training and model hyperparameters for the gradient boosting estimator of the Auto-Sklearn tool.

Datasets. The collection of datasets used in our experiments, combined with detailed information for each dataset.

Task Id | Cockt. | Top-5 F | Top-5 R

Top-5 baselines. The test set performance of the Regularization Cocktail against the Top-5 Most Frequent (Top-5 F) and the Top-5 Highest Ranks (Top-5 R) baselines.

A BOHB

BOHB (Falkner et al., 2018a) is a hyperparameter optimization algorithm that extends Hyperband (Li et al., 2017) by sampling from a model instead of sampling randomly from the hyperparameter search space. Initially, BOHB performs random search and favors exploration. As it iterates and collects more observations, it builds models over different fidelities and trades off exploration with exploitation to avoid converging to bad regions of the search space. BOHB samples from the model of the highest fidelity with probability p, and randomly with probability 1 − p. A model is built for a fidelity only when enough observations exist; by default, the criterion is that the number of observations equals S + 1, where S is the dimensionality of the search space. Table 2 presents the fixed search space used in all our experiments, which is shared between all the individual regularizers and the regularization cocktail.
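The sampling rule described above can be sketched as a small decision function (our own illustrative code, not BOHB's actual implementation): random sampling until S + 1 observations exist for a fidelity, and a p-vs-(1 − p) coin flip between model and random sampling afterwards.

```python
import random

def choose_sampler(n_observations, search_dim, p_model=0.8, rng=None):
    """Decide whether the next configuration comes from the fidelity's model
    or from random sampling. A model is only available once at least S + 1
    observations exist (S = search-space dimensionality); even then it is
    used only with probability p_model."""
    rng = rng or random.Random(0)
    if n_observations < search_dim + 1:
        return "random"  # not enough data to fit a model for this fidelity
    return "model" if rng.random() < p_model else "random"
```

Keeping a residual random-sampling probability even after the model is available preserves exploration and, with it, Hyperband's convergence guarantees.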


In Figure 6, we present the results of each pairwise comparison. The presented results are calculated on the test set after the refit phase is completed on the best hyperparameter configuration. The p-value is generated by performing the Wilcoxon signed-rank test. As can be seen from the results, the regularization cocktail is the only method that has statistically significant results compared to all the other methods.

C.2 EXPERIMENT 2: DATASET-DEPENDENT OPTIMAL COCKTAILS

In Figure 7, we present the occurrences of every regularization method over all datasets. The occurrences are calculated by analyzing the best-found hyperparameter configuration for each dataset and counting the number of times each regularization method was chosen to be activated by BOHB.

