TANGOS: REGULARIZING TABULAR NEURAL NETWORKS THROUGH GRADIENT ORTHOGONALIZATION AND SPECIALIZATION

Abstract

Despite their success with unstructured data, deep neural networks are not yet a panacea for structured tabular data. In the tabular domain, their efficiency crucially relies on various forms of regularization to prevent overfitting and provide strong generalization performance. Existing regularization techniques include broad modelling decisions such as choice of architecture, loss functions, and optimization methods. In this work, we introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS), a novel framework for regularization in the tabular setting built on latent unit attributions. The gradient attribution of an activation with respect to a given input feature suggests how the neuron attends to that feature, and is often employed to interpret the predictions of deep networks. In TANGOS, we take a different approach and incorporate neuron attributions directly into training to encourage orthogonalization and specialization of latent attributions in a fully-connected network. Our regularizer encourages neurons to focus on sparse, non-overlapping input features and results in a set of diverse and specialized latent units. In the tabular domain, we demonstrate that our approach can lead to improved out-of-sample generalization performance, outperforming other popular regularization methods. We provide insight into why our regularizer is effective and demonstrate that TANGOS can be applied jointly with existing methods to achieve even greater generalization performance.

1. INTRODUCTION

Despite its relative under-representation in deep learning research, tabular data is ubiquitous in many salient application areas including medicine, finance, climate science, and economics. Beyond raw performance gains, deep learning provides a number of promising advantages over non-neural methods, including multi-modal learning, meta-learning, and certain interpretability methods, which we expand upon in depth in Appendix C. Additionally, it is a domain in which general-purpose regularizers are of particular importance. Unlike areas such as computer vision or natural language processing, architectures for tabular data generally do not exploit the inherent structure in the input features (i.e., locality in images and sequential text, respectively) and lack the resulting inductive biases in their design. Consequently, improvement over non-neural ensemble methods has been less pervasive. Regularization methods that implicitly or explicitly encode inductive biases thus play a more significant role. Furthermore, adapting successful strategies from the ensemble literature to neural networks may provide a path to success in the tabular domain (e.g. Wen et al., 2020). Recent work in Kadra et al. (2021) has demonstrated that suitable regularization is essential to outperforming such methods and, furthermore, that a balanced cocktail of regularizers results in neural network superiority. Regularization methods employed in practice can be categorized into those that prevent overfitting through data augmentation (Krizhevsky et al., 2012; Zhang et al., 2018), network architecture choices (Hinton et al., 2012; Ioffe & Szegedy, 2015), and penalty terms that explicitly influence parameter learning (Hoerl & Kennard, 1970; Tibshirani, 1996; Jin et al., 2020), to name just a few. While all such methods are unified in attempting to improve out-of-sample generalization, this is often achieved in vastly different ways.
For example, L1 and L2 penalties favor sparsity and shrinkage, respectively, on model weights, thus choosing more parsimonious solutions. Data perturbation techniques, on the other hand, encourage smoothness in the system, assuming that small perturbations in the input should not result in large changes in the output. Which method works best for a given task is generally not known a priori, and considering different classes of regularizer is recommended in practice. Furthermore, combining multiple forms of regularization simultaneously is often effective, especially in lower data regimes (see e.g. Brigato & Iocchi, 2021 and Hu et al., 2017). Neuroscience research has suggested that neurons are both selective (Johnston & Dark, 1986) and limited in capacity (Cowan et al., 2005) when reacting to specific physiological stimuli. Specifically, neurons selectively choose to focus on a few chunks of information in the input stimulus. In deep learning, a similar concept, commonly described as a receptive field, is employed in convolutional layers (Luo et al., 2016). Here, each convolutional unit has multiple filters, and each filter is only sensitive to specialized features in a local region; the output of the filter activates more strongly if the feature is present. This stands in contrast to fully-connected networks, where the all-to-all relationships between neurons mean that each unit depends on the entire input to the network. We leverage this insight to propose a regularization method that encourages artificial neurons to be more specialized and orthogonal to each other.

Contributions. (1) Novel regularization method for deep tabular models. In this work, we propose TANGOS, a novel method based on regularizing neuron attributions. A visual depiction is given in Figure 1. Specifically, each neuron is more specialized, attending to sparse input features, while its attributions are more orthogonal to those of other neurons.
In effect, different neurons pay attention to non-overlapping subsets of input features, resulting in better generalization performance. We demonstrate that this novel regularization method results in excellent generalization performance on tabular data when compared to other popular regularizers. (2) Distinct regularization objective. We explore how TANGOS results in distinct emergent characteristics in the model weights. We further show that its improved performance is linked to increased diversity among weak learners in an ensemble of latent units, which is generally in contrast to existing regularizers. (3) Combination with other regularizers. Based upon these insights, we demonstrate that deploying TANGOS in tandem with other regularizers can further improve the generalization of neural networks in the tabular setting beyond that of any individual regularizer.

2. RELATED WORK

Gradient Attribution Regularization. A number of methods incorporate a regularization term that penalizes the network gradients in some way. Penalizing gradient attributions is a natural approach for achieving various desirable properties in a neural network. Such methods have been in use at least since Drucker & Le Cun (1992), where the authors improve robustness by encouraging invariance to small perturbations in the input space. More recently, gradient attribution regularization has been successfully applied across a broad range of application areas. Some notable examples include encouraging the learning of robust features in auto-encoders (Rifai et al., 2011), improving stability in the training of generative adversarial networks (Gulrajani et al., 2017), and providing robustness to adversarial perturbations (Moosavi-Dezfooli et al., 2019). While many works have applied a shrinkage penalty (L2) to input gradients, Ross et al. (2017a) explore the effects of encouraging sparsity by considering an L1 penalty term. Gradient penalties may also be leveraged to compel a network to attend to particular human-annotated input features (Ross et al., 2017b). A related line of work considers the use of gradient aggregation methods such as Integrated Gradients (Sundararajan et al., 2017) and typically penalizes their deviation from a given target value (see e.g. Liu & Avci (2019) and Chen et al. (2019)). In contrast to these works, we do not require manually annotated regions on which the network is constrained to attend. Similarly, Erion et al. (2021) provide methods for encoding domain knowledge such as smoothness between adjacent pixels in an image. We note that while these works have investigated penalizing a predictive model's output attributions, we are the first to regularize attributions of latent neuron activations. We provide an extended discussion of related works on neural network regularization more generally in Appendix A.

3.1. PROBLEM FORMULATION

We operate in the standard supervised learning setting, with $d_X$-dimensional input variables $X \in \mathcal{X} \subseteq \mathbb{R}^{d_X}$ and target output variable $Y \in \mathcal{Y} \subseteq \mathbb{R}$. Let $P_{XY}$ denote the joint distribution between input and target variables. The goal of the supervised learning algorithm is to find a predictive model $f_\theta : \mathcal{X} \to \mathcal{Y}$ with learnable parameters $\theta \in \Theta$. The predictive model belongs to a hypothesis space, $f_\theta \in \mathcal{H}$, of functions that map from the input space to the output space. The predictive function is usually learned by optimizing a loss function $\mathcal{L} : \Theta \to \mathbb{R}$ using empirical risk minimization (ERM). The true risk cannot be minimized directly since the data distribution $P_{XY}$ is not known. Instead, we use a finite number of i.i.d. samples $(x, y) \sim P_{XY}$, which we refer to as the training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. Once the predictive model is trained on $\mathcal{D}$, it should ideally predict well on out-of-sample data generated from the same distribution. However, overfitting can occur if the hypothesis space $\mathcal{H}$ is too complex and the sampling of training data does not fully represent the underlying distribution $P_{XY}$. Regularization is an approach that reduces the complexity of the hypothesis space so that more generalized functions are learned to explain the data. This leads to the following regularized ERM objective:

$$\theta^* = \arg\min_{\theta \in \Theta} \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \mathcal{L}(f_\theta(x), y) + \mathcal{R}(\theta, x, y),$$

which includes an additional regularization term $\mathcal{R}$ that, in general, is a function of the input $x$, the label $y$, and the model parameters $\theta$, and reflects prior assumptions about the model. For example, L1 regularization reflects the belief that sparse solutions in parameter space are more desirable.
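As a toy illustration of this objective (ours, not the paper's code; a linear model with an L1 penalty stands in for $f_\theta$ and $\mathcal{R}$), a minimal numpy sketch:

```python
import numpy as np

def regularized_risk(theta, X, y, lam=0.1):
    """Empirical risk (here MSE) plus a regularization term R(theta).

    An L1 penalty is used as the example R, reflecting a preference
    for sparse solutions in parameter space."""
    empirical_risk = np.mean((X @ theta - y) ** 2)
    penalty = lam * np.sum(np.abs(theta))
    return empirical_risk + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_sparse = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ theta_sparse  # noiseless labels generated by the sparse model

# The sparse generating parameters achieve zero empirical risk, so the
# total objective reduces to the penalty alone: 0.1 * (1 + 2) = 0.3.
loss = regularized_risk(theta_sparse, X, y)
```

The same template applies to any penalty: only the `penalty` term changes when swapping L1 for L2 or, later, for the TANGOS terms.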

3.2. NEURON ATTRIBUTIONS

Formally, attribution methods aim to uncover the importance of each input feature of a given sample to the prediction of the neural network. Recent works have demonstrated that feature attribution methods can be incorporated into the training process (Lundberg & Lee, 2017; Erion et al., 2021). These attribution priors optimize attributions to have desirable characteristics, including interpretability as well as smoothness and sparsity in predictions. However, these methods have exclusively investigated output attributions, i.e., contributions of input features to the output of a model. To the best of our knowledge, we are the first work to investigate regularization of latent attributions. We rewrite our predictive function $f$ using function composition $f = l \circ g$. Here $g : \mathcal{X} \to \mathcal{H}$ maps the input to a representation $h = g(x) \in \mathcal{H}$, where $\mathcal{H} \subseteq \mathbb{R}^{d_H}$ is a $d_H$-dimensional latent space. Additionally, $l : \mathcal{H} \to \mathcal{Y}$ maps the latent representation to the label space, $y = l(h) \in \mathcal{Y}$. We let $h^i = g^i(x)$, for $i \in [d_H]$, denote the $i$th neuron in the hidden layer of interest. Additionally, we use $a^i_j(x) \in \mathbb{R}$ to denote the attribution of the $i$th neuron w.r.t. the feature $x_j$. With this notation, upper indices correspond to latent units and lower indices to features. In some cases, it will be convenient to stack all the feature attributions together in the attribution vector $a^i(x) = [a^i_j(x)]_{j=1}^{d_X} \in \mathbb{R}^{d_X}$.

[Figure: input features feed latent units $h^1, h^2, h^3$; the input gradients $\partial h^i / \partial x_j$ yield per-neuron attributions, penalized via $\mathcal{L}_{\text{spec}}$ ($\ell_1$ norms) and $\mathcal{L}_{\text{orth}}$ (pairwise cosine similarities).]

Attribution methods work by using gradient signals to evaluate the contributions of the input features. In the most simplistic setting, $a^i_j(x) \equiv \frac{\partial h^i(x)}{\partial x_j}$. This admits a simple interpretation through a first-order Taylor expansion: if the input feature $x_j$ were to increase by some small number $\epsilon \in \mathbb{R}^+$, the neuron activation would change by $\epsilon \cdot a^i_j(x) + O(\epsilon^2)$.
The larger the absolute value of the gradient, the stronger the effect of a change in that input feature. We emphasize that our method is agnostic to the gradient attribution method, as different methods may be more appropriate for different tasks. For a comprehensive review of different methods, assumptions, and trade-offs, see Ancona et al. (2017). For completeness, we also note another category of attribution methods built around perturbations: this class of methods evaluates the contributions of individual features through repeated perturbations of the input. Generally speaking, such methods are less computationally efficient due to the multiple forward passes through the neural network, and they are difficult to include directly in the training objective.
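To make the gradient attribution concrete, the following sketch (our illustration, assuming a single ReLU-activated layer rather than any particular architecture from this work) computes $a^i_j(x) = \partial h^i / \partial x_j$ analytically and checks it against the first-order Taylor interpretation:

```python
import numpy as np

def hidden(x, W, b):
    """One ReLU-activated hidden layer: h = max(Wx + b, 0)."""
    return np.maximum(W @ x + b, 0.0)

def attributions(x, W, b):
    """Gradient attributions a[i, j] = dh_i / dx_j.

    For a ReLU layer this is W[i, j] masked by whether unit i is
    active at x (inactive units have zero gradient)."""
    active = ((W @ x + b) > 0).astype(float)
    return active[:, None] * W

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.normal(size=4)
A = attributions(x, W, b)  # shape (d_H, d_X) = (3, 4)

# Taylor check: h(x + eps * e_0) - h(x) ~= eps * a[:, 0]
eps = 1e-6
e0 = np.zeros(4)
e0[0] = eps
numeric = (hidden(x + e0, W, b) - hidden(x, W, b)) / eps
```

In practice, the same quantities come out of an auto-grad framework rather than a hand-derived formula, which is what makes them cheap to include in a training objective.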

3.3. REWARDING ORTHOGONALIZATION AND SPECIALIZATION

The main contribution of this work is proposing regularization on neuron attributions. In the most general sense, any function of any neuron attribution method could be used as a regularization term, thus encoding prior knowledge about the properties a model should have. Specifically, the regularization term is a function of the network parameters $\theta$ and $x$, i.e., $\mathcal{R}(\theta, x)$, and encodes prior assumptions on the desired behavior of the learned function. Biological sensory neurons are highly specialized. For example, certain visual neurons respond to a specific set of visual features, including edges and orientations, within a single receptive field. They are thus highly selective with limited capacity to react to specific physiological stimuli (Johnston & Dark, 1986; Cowan et al., 2005). Similarly, we hypothesize that neurons that are more specialized and pay attention to sparser signals should exhibit better generalization performance. We propose the following desiderata and corresponding regularization terms:

• Specialization. The contribution of input features to the activation of a particular neuron should be sparse, i.e., $\|a^i(x)\|_1$ is small for all $i \in [d_H]$ and $x \in \mathcal{X}$. Intuitively, in higher-dimensional settings, a few features should account for a large percentage of total attributions while others are near zero, resulting in more specialized neurons. We write this as a regularization term for mini-batch training:

$$\mathcal{L}_{\text{spec}}(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{d_H} \sum_{i=1}^{d_H} \|a^i(x_b)\|_1,$$

where $b \in [B]$ is the batch index of $x_b \in \mathcal{X}$ and $\|\cdot\|_1$ denotes the $\ell_1$ norm.

• Orthogonalization. Different neurons should attend to non-overlapping subsets of input features given a particular input sample. To encourage this, we penalize the correlation between neuron attributions $\rho[a^i(x), a^j(x)]$ for all $i \neq j$ and $x \in \mathcal{X}$. In other words, for each particular input, we want to discipline the latent units to attend to different aspects of the input.
Expressing this as a regularization term for mini-batch training, we obtain:

$$\mathcal{L}_{\text{orth}}(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{C} \sum_{i=2}^{d_H} \sum_{j=1}^{i-1} \rho\left[a^i(x_b), a^j(x_b)\right].$$

Here, $C = \frac{d_H (d_H - 1)}{2}$ is the number of pairwise correlations, and $\rho[a^i(x_b), a^j(x_b)] \in [0, 1]$ is calculated using the absolute cosine similarity $\frac{|a^i(x_b)^\top a^j(x_b)|}{\|a^i(x_b)\|_2 \, \|a^j(x_b)\|_2}$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm. These terms can be combined into a single regularization term and incorporated into the training objective. The resulting TANGOS regularizer can be expressed as:

$$\mathcal{R}_{\text{TANGOS}}(x) = \lambda_1 \mathcal{L}_{\text{spec}}(x) + \lambda_2 \mathcal{L}_{\text{orth}}(x),$$

where $\lambda_1, \lambda_2 \in \mathbb{R}$ act as weighting terms. As this expression is computed using gradient signals, it can be efficiently implemented and minimized in any auto-grad framework.

4. HOW AND WHY DOES TANGOS WORK?

To the best of our knowledge, TANGOS is the only work to explicitly regularize latent neuron attributions. A natural question to ask is (1) How is TANGOS different from other regularization? While it intuitively makes sense to enforce specialization of each unit and orthogonalization between units, we empirically investigate whether other regularizers can achieve similar effects, revealing that our method regularizes a unique objective. Having established that the TANGOS objective is unique, the next question is (2) Why does it work? To investigate this question, we frame the set of neurons as an ensemble and demonstrate that our regularization improves diversity among weak learners, resulting in improved out-of-sample generalization.
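For concreteness, the two penalty terms of §3.3 can be sketched in numpy for a single sample's attribution matrix (an illustrative re-implementation under our own naming, not the released code; in practice both terms are computed batched inside an auto-grad framework):

```python
import numpy as np

def tangos_penalty(A, lam1=1.0, lam2=0.1):
    """TANGOS regularizer for one sample.

    A is the attribution matrix of shape (d_H, d_X): row i holds
    a^i(x), the attributions of latent unit i over input features.
    Returns lam1 * L_spec + lam2 * L_orth."""
    d_H = A.shape[0]
    # Specialization: mean l1 norm of each neuron's attributions.
    l_spec = np.mean(np.sum(np.abs(A), axis=1))
    # Orthogonalization: mean absolute cosine similarity over the
    # C = d_H * (d_H - 1) / 2 distinct neuron pairs.
    norms = np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
    unit = A / norms
    cos = np.abs(unit @ unit.T)
    C = d_H * (d_H - 1) / 2
    l_orth = np.sum(np.triu(cos, k=1)) / C
    return lam1 * l_spec + lam2 * l_orth

# Two neurons attending to disjoint features: zero orthogonality penalty.
disjoint = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
# Two neurons attending to the same feature: maximal orthogonality penalty.
overlap = np.array([[1.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0]])
```

Both example matrices have identical specialization cost; only the overlap between neurons changes the orthogonalization term, which is exactly the behavior the regularizer rewards.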

4.1. TANGOS REGULARIZES A UNIQUE OBJECTIVE

TANGOS encourages generalization by explicitly decorrelating and sparsifying the attributions of latent units. A reasonable question to ask is whether this objective is unique, or whether other regularizers might achieve it implicitly. Two alternative regularizers that one might consider are L2 weight regularization and Dropout. Like TANGOS, weight regularization methods implicitly and partially penalize the gradients by shrinking the weights in the neural network. Additionally, Dropout trains an ensemble of learners by forcing each neuron to be more independent. In Figure 3, we provide these results on the UCI temperature forecast dataset (Cho et al., 2020), in which data from 25 weather stations in South Korea is used to predict next-day peak temperature. We train a fully connected neural network for each regularization method. Specifically, we plot $\mathcal{L}_{\text{spec}}$ and $\mathcal{L}_{\text{orth}}$ for neurons in the penultimate layers and the corresponding generalization performance. We supply an extended selection of these results on additional datasets and regularizers in Appendix J. First, we observe that TANGOS significantly decreases correlation between different neuron attributions while other regularization terms, in fact, increase it. For L2 weight regularization, this suggests that as the neural network weights are made smaller, the neurons increasingly attend to the same input features. A similar effect is observed for Dropout, which has a logical explanation: Dropout creates redundancy by forcing each latent unit to be independent of the others, which naturally encourages individual neurons to attend to overlapping features. In contrast, TANGOS aims to achieve specialization, such that neurons pay attention to sparse, non-overlapping features. Additionally, we note that no alternative regularizer achieves greater attribution sparsity.
This does not come as a surprise for Dropout, where inducing redundancy in each neuron naturally encourages individual neurons to attend to more features. While L2 does achieve a similar level of sparsity, this is paired with a high $\mathcal{L}_{\text{orth}}$ term, indicating that although the latent units do attend to sparse features, they appear to collapse to a solution in which they all attend to the same weighted subset of the input features. This, as we will see in §4.2, is unlikely to be optimal for out-of-sample generalization. We therefore conclude that the pairing of the specialization and orthogonality objectives in TANGOS regularizes a unique objective.

4.2. TANGOS GENERALIZES BY INCREASING DIVERSITY AMONG LATENT UNITS

Having established how TANGOS differs from existing regularization objectives, we now turn to why it works. In this section, we provide an alternative perspective on the effect of TANGOS regularization in the context of ensemble learning. A predictive model $f(x)$ may be considered an ensemble model if it can be written in the form $f(x) = \sum_{T_k \in \mathcal{T}} \alpha_k T_k(x)$, where $\mathcal{T}$ represents a set of basis functions, sometimes referred to as weak learners, and the $\alpha_k$'s represent their respective scalar weights. It is therefore clear that each output of a typical neural network may be considered an ensemble predictor, with every latent unit in its penultimate layer acting as a weak learner in its contribution to the model's output. More formally, in this setting $T_k(x)$ is the activation of latent unit $k$ with respect to an input $x$ and $\alpha_k$ is the subsequent connection to the output activation. With this in mind, we present the following definition.

Definition 4.1. Consider an ensemble regressor $f(x) = \sum_{T_k \in \mathcal{T}} \alpha_k T_k(x)$ trained on $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where each $(x, y)$ is drawn randomly from $P_{XY}$. Additionally, the weights are constrained such that $\sum_k \alpha_k = 1$. Then, for a given input-label pair $(x, y)$, we define:
(a) the overall ensemble error: $\text{Err} = (f(x) - y)^2$;
(b) the weighted errors of the weak learners: $\overline{\text{Err}} = \sum_k \alpha_k (T_k(x) - y)^2$;
(c) the ensemble diversity: $\text{Div} = \sum_k \alpha_k (T_k(x) - f(x))^2$.

Intuitively, $\overline{\text{Err}}$ provides a measure of the strength of the ensemble members while $\text{Div}$ measures the diversity of their outputs. To understand the relationship between these two terms and the overall ensemble performance, we consider Proposition 1.

Proposition 1 (Krogh & Vedelsby, 1994). The overall ensemble error for an input-label pair $(x, y)$ can be decomposed into the weighted errors of the weak learners and the ensemble diversity such that: $\text{Err} = \overline{\text{Err}} - \text{Div}$.
This decomposition provides a fundamental insight into the success of ensemble methods: an ensemble's overall error is reduced by decreasing the average error of the individual weak learners and increasing the diversity of their outputs. Successful ensemble methods explicitly increase ensemble diversity when training weak learners by, for example, sub-sampling input features (random forest; Breiman, 2001), sub-sampling from the training data (bagging; Breiman, 1996), or error-weighted input importance (boosting; Bühlmann, 2012). Returning to the specific case of neural networks, it is clear that TANGOS provides a similar mechanism of increasing diversity among the latent units that act as weak learners in the penultimate layer. By forcing the latent units to attend to sparse, uncorrelated selections of features, the learned ensemble is encouraged to produce diverse learners whilst maintaining coverage of the entire input space in aggregate. In Figure 4, we demonstrate this phenomenon in practice by returning to the UCI temperature forecast regression task; notably, while all methods achieve low overall error, TANGOS is the only method that does so by increasing the diversity among the latent units. We provide extended results in Appendix J. We train a fully connected neural network with two hidden layers, with the output layer weights constrained such that they sum to 1. We observe that regularizing with TANGOS increases the diversity of the latent activations, resulting in improved out-of-sample generalization. This is in contrast to other typical regularization approaches, which also improve model performance, but exclusively by attempting to reduce the error of the individual ensemble members. This provides additional motivation for applying TANGOS in the tabular domain, an area where traditional ensemble methods have performed particularly well.
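The decomposition of Proposition 1 is easy to verify numerically. The sketch below (illustrative, with randomly generated weak-learner outputs) checks that it holds exactly whenever the weights sum to 1:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 8                                   # number of weak learners (latent units)
alpha = rng.random(K)
alpha /= alpha.sum()                    # weights constrained to sum to 1
T = rng.normal(size=K)                  # weak-learner outputs T_k(x)
y = 0.5                                 # target label
f = alpha @ T                           # ensemble prediction f(x)

err = (f - y) ** 2                      # overall ensemble error
err_bar = alpha @ (T - y) ** 2          # weighted weak-learner errors
div = alpha @ (T - f) ** 2              # ensemble diversity
# Proposition 1: err == err_bar - div
```

Since the diversity term is always non-negative, the identity makes the design implication explicit: for fixed weak-learner error, any increase in diversity directly lowers the ensemble error.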

5. EXPERIMENTS

In this section, we empirically evaluate TANGOS as a regularization method for improving generalization performance. We present our benchmark methods and training architecture, followed by extensive results on real-world datasets. There are a few main aspects that deserve empirical investigation, which we address in turn: ▶ Stand-alone performance (§5.1). We compare the performance of TANGOS, applied as a stand-alone regularizer, to a variety of benchmarks on a suite of real-world datasets. ▶ In-tandem performance (§5.2). Motivated by our unique regularization objective and our analysis in §4, we demonstrate that applying TANGOS in conjunction with other regularizers can lead to even greater gains in generalization performance. ▶ Modern architectures (§5.3). We evaluate performance on a state-of-the-art tabular architecture and compare to boosting. All experiments were run on NVIDIA RTX A4000 GPUs. Code is provided on GitHub.

TANGOS. We train TANGOS-regularized models as described in Algorithm 1 in Appendix F. For the specialization parameter we search over $\lambda_1 \in \{1, 10, 100\}$ and for the orthogonalization parameter over $\lambda_2 \in \{0.1, 1\}$. For computational efficiency, we apply a sub-sampling scheme in which 50 neuron pairs are randomly sampled for each input (for further details see Appendix F).

Benchmarks. We evaluate TANGOS against a selection of popular regularizer benchmarks. First, we consider the weight decay methods L1 and L2 regularization, which sparsify and shrink the learnable parameters, respectively. For the regularization coefficients, we search over $\lambda \in \{0.1, 0.01, 0.001\}$, with regularization applied to all layers. Next, we consider Dropout (DO), with drop rate $p \in \{10\%, 25\%, 50\%\}$, applied after every dense layer during training. We also consider the implicit regularization of batch normalization (BN).
Lastly, we evaluate the data augmentation techniques Input Noise (IN), using additive Gaussian noise with mean 0 and standard deviation $\sigma \in \{0.1, 0.05, 0.01\}$, and MixUp (MU). Furthermore, each training run applies early stopping with a patience of 30 epochs. In all experiments, we use 5-fold cross-validation to train and validate each benchmark. We select the model which achieves the lowest validation error and provide a final evaluation on a held-out test set.
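The neuron-pair sub-sampling scheme used for computational efficiency can be sketched as follows (our illustration with hypothetical names, not the released code; it estimates the orthogonalization term from a random subset of pairs rather than all of them):

```python
import numpy as np

def subsampled_l_orth(A, n_pairs=50, rng=None):
    """Monte Carlo estimate of L_orth for one sample.

    Instead of averaging the absolute cosine similarity over all
    d_H * (d_H - 1) / 2 neuron pairs, draw n_pairs random pairs."""
    rng = np.random.default_rng() if rng is None else rng
    d_H = A.shape[0]
    norms = np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
    unit = A / norms
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.choice(d_H, size=2, replace=False)
        total += abs(unit[i] @ unit[j])
    return total / n_pairs

# Perfectly orthogonal attributions give a zero estimate for any subset.
A = np.eye(8)
estimate = subsampled_l_orth(A, n_pairs=50, rng=np.random.default_rng(0))
```

The estimate is unbiased for the full pairwise average, and its cost is constant in $d_H$, which is what makes the penalty practical for wide layers.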

5.1. GENERALIZATION: STAND-ALONE REGULARIZATION

For the first set of experiments, we are interested in isolating the individual regularization effect of TANGOS. To ensure a fair comparison, we evaluate generalization performance on held-out test sets across a variety of datasets. Datasets. We employ 20 real-world tabular datasets from the UCI machine learning repository. Each dataset is split into 80% for cross-validation and the remaining 20% for testing. Features are standardized using statistics computed on the training data only, such that they have mean 0 and standard deviation 1, and categorical variables are one-hot encoded. See Appendix L for further details on the 20 datasets used. Training and Evaluation. To ensure a fair comparison, all regularizers are applied to an MLP with two ReLU-activated hidden layers, where each hidden layer has $d_H + 1$ neurons. The models are trained using the Adam optimizer with a dataset-dependent learning rate from {0.01, 0.001, 0.0001} for a maximum of 200 epochs. For regression tasks, we report the average mean squared error (MSE); for classification tasks, we report the average negative log-likelihood (NLL).
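The standardization protocol, with statistics fit on the training split only and reused for the held-out data, can be sketched as:

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature statistics on the training split only,
    avoiding leakage from the held-out data."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_train, X_test = X[:80], X[80:]

mean, std = fit_standardizer(X_train)
Z_train = standardize(X_train, mean, std)
Z_test = standardize(X_test, mean, std)  # test uses training statistics
```

The test split will not have exactly zero mean or unit variance under this scheme; that is intentional, since the test set must be treated as unseen.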

Results. Table 1 provides the benchmarking results for individual regularizers. We observe that TANGOS achieves the best performance on 10/20 of the datasets, and on 6 of the remaining datasets it ranks second. This is also illustrated by the ranking plot in Appendix H, where we also provide a table displaying standard errors. As several results have overlapping error intervals, we assess the magnitude of improvement by performing a non-parametric Wilcoxon signed-rank test (Wilcoxon, 1992), paired at the dataset level. We compare TANGOS to the best-performing baseline method (L2) as a one-tailed test for both the regression and classification results, obtaining p-values of 0.006 and 0.026, respectively. This can be interpreted as strong evidence that the difference is statistically significant in both cases. Note that a single regularizer is seldom used by itself. Beyond its stand-alone performance, it remains to be shown that TANGOS brings value when used with other regularization methods; this is explored in the next section.

5.2. GENERALIZATION: IN-TANDEM REGULARIZATION

Motivated by the insights described in §4, a natural next question is whether TANGOS can be applied in conjunction with existing regularization to unlock even greater generalization performance. In this set of experiments, we investigate this question. Setup. The setting for this experiment is identical to §5.1, except that we now consider the six baseline regularizers in tandem with TANGOS. We examine whether pairing our proposed regularizer with existing methods results in even greater generalization performance. We again run 5-fold cross-validation, searching over the same hyperparameters, with the final models evaluated on a held-out test set. Results. We summarize the aggregated results over the datasets for each of the six baseline regularizers in combination with TANGOS in Figure 5. Consistently across all regularizers, in both the regression and the classification settings, we observe that adding TANGOS regularization improves test performance. We provide the full table of results in the supplementary material. We also note an apparent interaction effect for certain regularizers (namely, input noise for regression and dropout for classification), where methods that were not particularly effective as stand-alone regularizers become the best-performing method when evaluated in tandem. The relationship between such regularizers provides an interesting direction for future work.

5.3. CLOSING THE GAP ON BOOSTING

In this experiment, we apply TANGOS regularization to a state-of-the-art deep learning architecture for tabular data (Gorishniy et al., 2021) and evaluate its contribution towards producing competitive performance against leading boosting methods. We provide an extended description of this experiment in Appendix B and results in Table 2. We find that TANGOS provides moderate gains in this setting, narrowing the performance gap to state-of-the-art boosting methods. Although boosting approaches still match or outperform deep learning in this setting, in Appendix C we argue that deep learning may be worth pursuing in the tabular modality for its other distinct advantages.

A EXTENDED RELATED WORKS

Neural Network Regularization. Regularization methods seek to penalize complexity and impose a form of smoothness on a model. This may be cast as expressing a prior belief over the hypothesis space of a neural network in an attempt to aid generalization. ▶ Categories. A vast array of regularization methods have been proposed throughout the literature (for a comprehensive taxonomy see e.g. Kukačka et al., 2017). Modern nomenclature typically includes broad modeling decisions such as choice of architecture, loss function, and optimization method under the umbrella of regularization. Additionally, many regularization techniques augment the training data using methods such as input noise (Krizhevsky et al., 2012) or MixUp (Zhang et al., 2018). Dropout (Hinton et al., 2012) and related approaches that augment a hidden representation of the input may also be included in this category. Perhaps the most conventional category of regularization is that which adds explicit penalty terms to the loss function. These terms might penalize the network weights directly to shrink or sparsify their values, as in L2 (Hoerl & Kennard, 1970) and L1 (Tibshirani, 1996) regularization, respectively. Alternatively, network outputs may be penalized to, for example, reduce overconfidence (Pereyra et al., 2017). ▶ Weight Orthogonalization. A number of works have studied the orthogonalization of network weights via various weight penalization methods (Bansal et al., 2018). More recent work in Liu et al. (2021) proposed to learn an orthogonal transformation of the randomly initialized incoming weights to a given neuron. In contrast, this work seeks to ensure that the gradients of different latent neurons with respect to a given input vector are orthogonal. ▶ Combination. Compositions of multiple regularization methods are extensively applied in practice.
An early example in the regression setting is the elastic net penalty (Zou & Hastie, 2005), which combines sparsity with shrinkage in the coefficients. More recent work has demonstrated the effectiveness of combining several regularization terms on tabular data (Kadra et al., 2021), a domain in which neural networks' superiority had previously been less convincing.

B TABULAR ARCHITECTURES AND BOOSTING

While non-neural methods such as XGBoost (Chen & Guestrin, 2016) and CatBoost (Prokhorenkova et al., 2018) are still considered state of the art for tabular data (Grinsztajn et al., 2022), much progress has been made in recent years to close the gap. Furthermore, differing learning paradigms have various strengths and weaknesses beyond maximum generalization performance, which is often a consideration in practical applications. While boosting methods boast excellent computational efficiency and strong out-of-the-box performance, neural networks have unique utility in, for example, multi-modal learning (Ramachandram & Taylor, 2017), meta-learning (Hospedales et al., 2021), and certain interpretability methods (Zhang et al., 2021). In this section, we provide additional experiments applying TANGOS to a state-of-the-art transformer architecture for tabular data proposed in Gorishniy et al. (2021). Specifically, this architecture combines a Feature Tokenizer, which transforms features into embeddings, with a multi-layer Transformer (Vaswani et al., 2017). We compare this FT-Transformer architecture to boosting methods in the default setting, where we evaluate out-of-the-box performance, and the tuned setting, where we jointly optimize the Transformer along with its baseline regularizers. We describe these two settings in more detail next. Default Setting. In this setting, we use a 3-layer Transformer with a 32-dimensional feature embedding size and 4 attention heads. Following the original paper, we use ReGLU activations, a hidden layer size of 43 (corresponding to a ratio of 4/3 with the embedding size), Kaiming initialization (He et al., 2015), and the AdamW optimizer (Loshchilov & Hutter, 2017). Finally, we apply a learning rate of 0.001. We compare this architecture without and with TANGOS regularization applied, which we refer to as "Baseline" and "+ TANGOS" respectively.
We set λ1 = 1 and λ2 = 0.01, which were found to be reasonable default values for specialization and orthogonalization in our experiments in Section 5.

Tuned Setting. Here we apply ten iterations of random search over the same hyperparameters as in the original work, with the configuration achieving the best validation performance selected. We then evaluate this configuration by training over three seeds and perform the final evaluations on a held-out test set. We search using the same distributions as in the original work and consider the following ranges: L2 regularization ∈ [1e-06, 1e-03], residual dropout ∈ [0.0, 0.2], hidden layer dropout ∈ [0.0, 0.5], attention dropout ∈ [0.0, 0.5], hidden layer to feature embedding dimension ratio ∈ [1.0, 3.0], embedding dimension ∈ [16, 48], number of layers ∈ [1, 3], learning rate ∈ [1e-04, 1e-03]. In the "+ TANGOS" setting we also include λ1 ∈ [0.001, 10] and λ2 ∈ [0.0001, 1] with a log-uniform distribution. All remaining architecture choices are consistent with the default setting and the original work.

We ran our experiments on the Jannis (Guyon et al., 2019) and Higgs (Baldi et al., 2014) datasets. These are both classification datasets, consisting of 83,733 and 98,050 examples respectively. These datasets were selected as they contain a significant number of input examples along with a middling number of input features (54 and 28 respectively) relative to the other tabular datasets explored in this work. We follow the experimental protocol of the boosting comparison in Grinsztajn et al. (2022), using the same training, validation, and test splits and reporting mean test accuracy over three runs. We therefore obtain the same results for boosting as reported in that work.
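The random-search procedure for the tuned setting could be sketched as follows. This is a minimal illustration, not the original codebase: helper names are ours, and the log-uniform choice for the L2 weight and learning rate is an assumption carried over from common practice (the text only states log-uniform sampling for λ1 and λ2).

```python
import math
import random

def log_uniform(rng, lo, hi):
    # Sample uniformly in log-space between lo and hi.
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_config(rng, with_tangos=True):
    """Draw one random-search configuration over the ranges in the text."""
    cfg = {
        "l2": log_uniform(rng, 1e-6, 1e-3),        # L2 regularization
        "residual_dropout": rng.uniform(0.0, 0.2),
        "hidden_dropout": rng.uniform(0.0, 0.5),
        "attention_dropout": rng.uniform(0.0, 0.5),
        "ffn_ratio": rng.uniform(1.0, 3.0),        # hidden / embedding dim
        "embed_dim": rng.randint(16, 48),
        "n_layers": rng.randint(1, 3),
        "lr": log_uniform(rng, 1e-4, 1e-3),
    }
    if with_tangos:                                # "+ TANGOS" setting
        cfg["lambda_1"] = log_uniform(rng, 1e-3, 10.0)
        cfg["lambda_2"] = log_uniform(rng, 1e-4, 1.0)
    return cfg

rng = random.Random(0)
search = [sample_config(rng) for _ in range(10)]   # ten search iterations
```

Each sampled configuration would then be trained once, with the best validation performer retrained over three seeds for the final test evaluation.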
The results of this experiment are reported in Table 2 where we find that TANGOS does indeed have a positive effect on the FT-Transformer performance although, consistent with the original work, we found that regularization only provides modest gains at best with this architecture. While we do not claim that TANGOS regularization results in neural networks that outperform Boosting methods, these results indicate that TANGOS regularization can contribute to closing the gap and may play a key role when combined with other methods as highlighted in Kadra et al. (2021) . We believe this to be an important area for future research and, in particular, expect that architecture-specific developments of the ideas presented in this work may provide further improvements on the results obtained in this section.

C MOTIVATION FOR DEEP LEARNING ON TABULAR DATA

Several works have argued that boosting methods generally achieve superior performance to even state-of-the-art deep learning architectures for tabular data (Grinsztajn et al., 2022; Shwartz-Ziv & Armon, 2022). However, this is in contrast to recent findings for transformer-style architectures in Gorishniy et al. (2021), especially with appropriate feature embeddings (Gorishniy et al., 2022) and sufficient pretraining (Rubachev et al., 2022). We set this debate aside and instead highlight a selection of reasons to consider deep learning methods for tabular data beyond straightforward improvements in predictive performance. In particular, we include a number of deep learning paradigms that are difficult to analogize for non-neural models and have been successfully applied to tabular data.

Multi-modal learning refers to the task of modeling data inputs that consist of multiple data modalities (e.g. image, text, tabular). As one might intuit, jointly modeling these multiple modalities can result in better performance than independently predicting from each of them (Ramachandram & Taylor, 2017; Guo et al., 2019). Deep learning provides a uniquely natural method of combining modalities with the advantages of (1) modality-specific encoders, (2) fusion into a joint downstream representation trained end-to-end with backpropagation, and (3) superior modeling performance in many modalities such as images and natural language. Healthcare is a domain in which multi-modal learning is particularly salient (Acosta et al., 2022). Recent work in Wu et al. (2022) showed that jointly modeling tabular clinical records using an MLP together with medical images using a CNN outperforms non-multi-modal baselines. Elsewhere, in Tang et al. (2020), a multi-modal approach is taken to combining input modalities based on the preprocessing of functional magnetic resonance imaging and region-of-interest time series data for the diagnosis of autism spectrum disorder.
A ResNet-18 encodes one modality while an MLP encodes the other, resulting in superior performance when analyzed in an ablation study. In this setting, progress in modeling each of the individual modalities is likely to result in better performance of the system as a whole. Interestingly, Ramachandram & Taylor (2017) identified regularization techniques for improved cross-modality learning as an important research direction. We believe that further development of the ideas presented in this work could provide a powerful tool for balancing how models attend to multiple input modalities.

Meta-learning aims to distill the experience of multiple learning episodes across a distribution of related tasks to improve learning performance on future tasks (Hospedales et al., 2021). Deep learning-based approaches have seen great success as a solution to this problem in a variety of fields. In the tabular domain, with careful consideration of the shared information between tasks, recent works have also shown promising results in this direction by developing methods for transferring deep tabular models across tables (Wang & Sun, 2022; Levin et al., 2022). In particular, Levin et al. (2022) noted that "representation learning with deep tabular models provides significant gains over strong GBDT baselines", also finding that "the gains are especially pronounced in low data regimes".

Interpretability is an important area of deep learning research aiming to provide users with the ability to understand and reason about model outputs. Certain classes of interpretability methods have recently been developed that provide distinct forms of interpretability relying on the hidden representations of neural networks. In such models, probing the representation space of a deep model permits a new type of interpretation. For instance, Kim et al. (2018) study how human concepts are represented by deep classifiers.
This makes it possible to analyze how the classes predicted by the model relate to human-understandable concepts. For example, one can verify whether the stripe concept is relevant for a CNN classifier to identify a zebra, as demonstrated in the paper. Another example is Crabbé et al. (2021), which proposes to explain a given example with reference to a freely selected set of other examples (potentially from the same dataset). A user study carried out in that work concluded that, among non-technical users, this method of explanation does affect their confidence in the model's predictions. These powerful methods crucially rely on the model's representation space, which effectively assumes that the model is a deep neural network.

Representation learning more generally provides access to several other methods from deep learning in the tabular domain. A number of works have used deep learning approaches to map inputs to embeddings which can be useful for downstream applications. SuperTML (Sun et al., 2019) and Zhu et al. (2021) map tabular inputs to image-like embeddings that can therefore be passed to image architectures such as CNNs. Other self-supervised methods include VIME (Yoon et al., 2020), which applies input reconstruction, SubTab (Ucar et al., 2021), which suggests a multi-view reconstruction task, and SCARF (Bahri et al., 2021), which takes a contrastive approach. Representation learning approaches such as these have proven successful on downstream tabular data tasks such as uncertainty quantification (Seedat et al., 2023), federated learning (He et al., 2022), anomaly detection (Liang et al., 2022), and feature selection (Lee et al., 2022).

D TANGOS BEHAVIOR ANALYSIS

In this section, we apply TANGOS to a simple image classification task using a convolutional neural network (CNN) and provide a qualitative analysis of the behavior of the learned network. This analysis is conducted on the MNIST dataset (LeCun et al., 1998) using the recommended split, resulting in 60,000 training and 10,000 validation examples. In this experiment, we train a standard CNN architecture (as described in Table 3) with a penultimate hidden layer of 10 neurons for 10 epochs with the Adam optimizer and a learning rate of 0.001. We also apply L2 regularization with weight 0.001. After each epoch, the model is evaluated on the validation set, and the epoch achieving the best validation performance is stored for further analysis. Two models are trained under this protocol: one that applies TANGOS to the penultimate hidden layer with λ1 = 100, λ2 = 0.1, and |M| = 25, and a baseline model that does not apply TANGOS.

We examine the gradients of each of the 10 neurons in the penultimate hidden layer with respect to each of the input dimensions of a given image. TANGOS is designed to reward orthogonalization and specialization of these gradient attributions, which can be evaluated qualitatively by inspection. In all plots that follow, we apply a min-max scaling across all hidden units for a fair comparison. Both strong positive and strong negative values for attributions may be interpreted as a latent unit attending to a given input dimension. In Figure 6 we provide results for the baseline model applied to a test image where, in line with similar analyses in previous works such as Crabbé & van der Schaar (2022), we note that the way in which hidden units attend to the input is highly entangled. In contrast, in Figure 7, we include the same plot for the TANGOS-trained model on the same image. In this case, each hidden unit does indeed produce relatively sparse and orthogonal attributions, as desired.
These results were consistent across the test images. We can glean further insight into the TANGOS-trained model by examining the role of individual neurons across multiple test images. In Figure 8, we provide the gradient attributions for hidden neuron 5 (H5) from our previous discussion across twelve test images. This neuron appears to discriminate between an open and a closed loop at the lower left of the digit. Indeed, this is a key aspect of distinction between the set of digits {2, 6, 8, 0} (first row) and {9, 5, 3} (second row). We also include digits where this visual feature is less useful, as they contain no lower-left loop, either open or closed (third row). This hypothesis can be further examined by analyzing the values of the activations themselves. We note that the first two rows typically have higher-magnitude activations with opposite signs, while the third row has lower-magnitude activations. In Table 4 we summarize the effect of these activation scores on class probabilities by accounting for the weights connecting to each of the ten classes. As one might expect, the weight connections between the hidden neuron and the classes on the first row and the second row have opposite signs, indicating that neuron 5 does indeed discriminate between these classes.
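The per-neuron gradient attributions analyzed above can be computed directly for a toy ReLU layer. The following is our own minimal sketch (not the paper's code), including the joint min-max scaling across hidden units used for the plots; the network here is a plain fully-connected layer rather than the CNN from Table 3.

```python
import numpy as np

def attributions(W, b, x):
    """Jacobian of hidden activations a(x) = relu(Wx + b) w.r.t. the input.

    Returns shape (d_H, d_X): row i holds the attribution of neuron i to
    each input dimension, i.e. d a_i / d x_j = 1[pre_i > 0] * W[i, j].
    """
    active = ((W @ x + b) > 0).astype(float)   # ReLU gate per neuron
    return active[:, None] * W

def min_max_scale(A):
    """Scale attributions to [0, 1] jointly across all hidden units,
    matching the fair-comparison scaling described in the text."""
    lo, hi = A.min(), A.max()
    return (A - lo) / (hi - lo + 1e-12)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # 10 penultimate neurons, 28x28 inputs
b = rng.normal(size=10)
x = rng.normal(size=784)         # stand-in for a flattened test image
A = attributions(W, b, x)
A_scaled = min_max_scale(A)
```

Each row of `A_scaled` corresponds to one of the per-neuron attribution heatmaps shown in Figures 6 and 7.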

E PERFORMANCE WITH INCREASING DATA SIZE

In this section, we evaluate TANGOS performance with an increasing number of input examples. To do this we use the Dionis dataset, which was the largest benchmark dataset proposed in Kadra et al. (2021) with 416,188 examples. As in that work, we set aside 20% for testing, with the remaining data further split into 80% training and 20% validation. The data was standardized to have zero mean and unit variance with statistics calculated on the training data. We then consider using various proportions (10%, 50%, 100%) of the training data to train an MLP with and without TANGOS regularization. We also evaluate the best-performing regularization method from our experiments in Section 5, L2. For both regularization methods, we train three hyperparameter settings at each proportion and evaluate the best-performing of the three on the test set. For TANGOS we consider {(λ1 = 1, λ2 = 0.01), (λ1 = 1, λ2 = 0.1), (λ1 = 10, λ2 = 0.1)} and for L2 we consider λ ∈ {0.01, 0.001, 0.0001}.

The size of the subsampled pair set M (see Appendix F) can be chosen to balance computational burden with more faithful estimation:

L′_orth(x) = (1/B) ∑_{b=1}^{B} (1/|M|) ∑_{(i,j)∈M} ρ[a_i(x_b), a_j(x_b)]

This reduces the complexity of calculating L_orth from O(d_H²) to O(|M|). For our experimental results described in Tables 1, 6 and 7, we use |M| = 50. The overall training procedure is described in Algorithm 1:

Algorithm 1 TANGOS regularization
  Input: λ1, λ2, training data D, learning rate η
  Result: learned parameters θ
  Initialise θ;
  while not converged do
    Sample D_mini from D;
    L(f_θ(x), y) = E_{(x,y)∼D_mini}[L(f_θ(x), y)];
    R(x) = λ1 E_{x∼D_mini}[L_spec(x)] + λ2 E_{x∼D_mini}[L′_orth(x)];
    θ ← θ − η ∇_θ (L(f_θ(x), y) + R(x));
  end while

Additionally, we provide an empirical analysis of TANGOS designed to evaluate the effectiveness of our proposed subsampling approximation with respect to generalization performance and computational efficiency as the number of sampled neuron pairs |M| grows. Furthermore, we analyze the computational efficiency of our method as the number of latent units grows, evaluating the method's capacity to scale to large models. All experiments are run on the BC dataset, which we split into 80% training and 20% validation. We fix λ1 = 100 and λ2 = 0.1 throughout these experiments. We run each experiment over 10 random seeds and report the mean and standard deviation. All remaining experimental details are consistent with our experiments in Section 5. We note that our implementation of TANGOS is not optimized to the same extent as the PyTorch (Paszke et al., 2019) implementation of L2 to which we compare, and therefore our relative computational performance may be considered a loose upper bound on a truly optimized version.

In Figure 10 (left), we report the relative increase in compute time per epoch as we increase the number of sampled pairs. As theory would suggest, this growth is linear. A natural follow-up question is the extent to which model performance is affected by decreasing the number of sampled pairs. In Figure 10 (right), we observe that even very low sampling rates still result in excellent performance. Based on these results, our recommendation for practitioners is that while increasing the sampling rate can lead to marginal improvements in performance, relatively low sampling rates appear to be generally sufficient and do not require prohibitive computational overhead.

Given the results in Figure 10, we next evaluate whether the proposed sampling scheme enables TANGOS to scale to much bigger models. To do so, we vary the number of neurons in the relevant hidden layer while maintaining a fixed sampling rate of 50 pairs (consistent with our experiments in Section 5). Other experimental parameters are consistent with the previous experiment. The results are provided in Figure 11, where we observe a relatively slow increase in runtime as the model grows. These results demonstrate that TANGOS can be applied efficiently to much larger models using our proposed sampling scheme.
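As a toy illustration of the training loop in Algorithm 1 (our construction, not the paper's implementation), the following sketch trains a one-hidden-layer regression network. Finite-difference gradients stand in for backprop so that no second-order autodiff machinery is needed, ρ is assumed to be absolute cosine similarity, and L_spec is assumed to be a mean-L1 attribution penalty; all three are simplifying assumptions.

```python
import numpy as np

def unpack(theta, d_x, d_h):
    W1 = theta[: d_h * d_x].reshape(d_h, d_x)
    b1 = theta[d_h * d_x : d_h * d_x + d_h]
    w2 = theta[d_h * d_x + d_h :]
    return W1, b1, w2

def attributions(W1, b1, x):
    # Jacobian of the ReLU hidden layer w.r.t. the input, shape (d_h, d_x).
    return ((W1 @ x + b1) > 0)[:, None] * W1

def objective(theta, X, y, d_x, d_h, lam1, lam2, pairs):
    W1, b1, w2 = unpack(theta, d_x, d_h)
    H = np.maximum(X @ W1.T + b1, 0.0)
    loss = np.mean((H @ w2 - y) ** 2)             # task loss L
    spec, orth = 0.0, 0.0
    for x in X:
        A = attributions(W1, b1, x)
        spec += np.abs(A).mean()                  # specialization penalty
        for i, j in pairs:                        # subsampled pairs M
            ni, nj = np.linalg.norm(A[i]), np.linalg.norm(A[j])
            if ni > 0 and nj > 0:
                orth += abs(A[i] @ A[j]) / (ni * nj)
    n = len(X)
    return loss + lam1 * spec / n + lam2 * orth / (n * len(pairs))

def num_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
X = rng.normal(size=(16, d_x))
y = X[:, 0] - 2.0 * X[:, 1]                       # simple synthetic target
theta = rng.normal(scale=0.5, size=d_h * d_x + 2 * d_h)
pairs = [(0, 1), (0, 2), (1, 2)]                  # all pairs (d_h is tiny)
f = lambda t: objective(t, X, y, d_x, d_h, 0.01, 0.01, pairs)
f0 = f(theta)
for _ in range(150):
    theta = theta - 0.02 * num_grad(f, theta)     # theta <- theta - eta*grad
```

In practice the penalty gradients are obtained by differentiating through the attribution Jacobian with autodiff; the finite-difference loop here only serves to make the structure of the regularized objective explicit.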

G ABLATION STUDY

TANGOS is designed with the joint application of both the specialization and orthogonalization regularizers in mind. Having empirically demonstrated strong overall results, an immediate question concerns the dynamics of the two regularizers and how they interact to affect performance. Specifically, we consider the performance gain due to joint regularization over applying each regularizer separately. This includes three settings: 1) the specialization regularizer applied independently (SpecOnly), where we set λ2 = 0 and search over λ1 ∈ {1, 10, 100}; 2) the orthogonalization regularizer applied independently (OrthOnly), where we set λ1 = 0 and search over λ2 ∈ {0.1, 1}; and 3) both applied jointly (TANGOS), i.e. searching over λ1 ∈ {1, 10, 100} and λ2 ∈ {0.1, 1}. We report the results of the ablation study in Table 5. We empirically observe that the joint effect of both regularizers (i.e. TANGOS) is crucial to achieving consistently good performance. Combining these results with what we observed in Figure 3, we hypothesize that applying only specialization regularization, with no regard for diversity, can inadvertently force the neurons to attend to overlapping regions in the input space. Correspondingly, simply enforcing orthogonalization, with no regard for sparsity, will likely result in neurons attending to non-overlapping yet spurious regions in the input. Thus, we conclude that the two regularizers have distinct but complementary effects that work together to achieve the desired regularization effect.
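The three ablation settings can be written out explicitly as (λ1, λ2) hyperparameter grids, with values taken directly from the text:

```python
from itertools import product

lam1_grid = [1, 10, 100]   # specialization weights
lam2_grid = [0.1, 1]       # orthogonalization weights

settings = {
    "SpecOnly": [(l1, 0.0) for l1 in lam1_grid],      # lambda_2 = 0
    "OrthOnly": [(0.0, l2) for l2 in lam2_grid],      # lambda_1 = 0
    "TANGOS": list(product(lam1_grid, lam2_grid)),    # joint search
}
```

Each setting is then evaluated by training one model per (λ1, λ2) combination and reporting the best.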

H RANKING PLOT

In Table 1, we reported the generalization performance of TANGOS compared to other regularizers in a stand-alone setting. To gain a better understanding of relative performance, we visually depict the relative ranking of regularizers across all 20 datasets. Figure 12 demonstrates that TANGOS consistently ranks as one of the better-performing regularizers, while the performance of benchmark methods tends to fluctuate depending on the dataset.


Figure 12: Ranking of stand-alone regularizers. Ranking of regularizer performance across the 20 datasets, as reported in Table 1. TANGOS consistently ranks among the best-performing regularizers.

I STAND-ALONE UNCERTAINTY

In Table 6, we report the standard errors on the generalization performance reported in Table 1, computed using 10 seeded runs.

K IN TANDEM RESULTS

In Table 7 , we provide a detailed breakdown of Figure 5 , specifically by reporting in tandem performance when benchmarks are paired with TANGOS across all datasets. 

L DATASET AND REGULARIZER DETAILS

We perform our experiments on 20 real-world publicly available datasets obtained from (Dua et al., 2017). They are summarized in Table 8. Further information and the source files used for each of the respective datasets can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/<UCI Source>/, where <UCI Source> denotes the dataset's unique identifier as listed in Table 8. Standard preprocessing was applied, including standardization of features, one-hot encoding of categorical variables, median imputation of missing values, and log transformations of highly skewed feature distributions. Furthermore, for computational feasibility, datasets with over 1000 samples were reduced in size; in these cases, the first 1000 samples from the original UCI source file were used. In Table 9 we summarize the regularizers considered in this work.
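The numeric preprocessing steps described above could be sketched as follows. This is our own implementation, not the original pipeline: one-hot encoding of categoricals is omitted for brevity, and the skewness threshold for applying the log transform is an assumption.

```python
import numpy as np

def preprocess(X, skew_threshold=2.0):
    """Median-impute, log-transform highly skewed columns, standardize."""
    X = np.asarray(X, dtype=float).copy()
    # Median imputation of missing values, column-wise.
    for j in range(X.shape[1]):
        col = X[:, j]
        col[np.isnan(col)] = np.nanmedian(col)
    # log1p transform for highly right-skewed non-negative columns.
    for j in range(X.shape[1]):
        col = X[:, j]
        s = col.std()
        if s > 0:
            skew = np.mean(((col - col.mean()) / s) ** 3)
            if skew > skew_threshold and col.min() >= 0:
                X[:, j] = np.log1p(col)
    # Standardize to zero mean, unit variance (constant columns left at 0).
    mu, sd = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)
```

In the actual experiments the standardization statistics are computed on the training split only and then applied to the held-out splits.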



Code available at https://github.com/alanjeffares/TANGOS and https://github.com/vanderschaarlab/TANGOS



Figure 1: TANGOS encourages specialization and orthogonalization. TANGOS penalizes neuron attributions during training. In the figure, one color indicates strong positive attribution and another indicates strong negative attribution, while interpolating colors reflect weaker attributions. Neurons are regularized to be specialized (attend to sparser features) and orthogonal (attend to non-overlapping features).

Figure 2: Method illustration. TANGOS regularizes the gradients with respect to each of the latent units.

Figure 3: Comparison of regularization objectives. (Top) Generalization performance of key regularization techniques, (Bottom) corresponding neuron attributions evaluated on the test set. L2 and DO can reduce overfitting, but neuron attributions are in fact becoming more correlated. TANGOS achieves the best generalization performance by penalizing a different objective.

Figure 4: Neuron Diversity. Overall ensemble error and decomposition in terms of diversity and average error of the weak learners. Note that while all methods achieve low overall error, TANGOS is the only method that does so by increasing the diversity among the latent units.

Figure 5: In Tandem Regularization. Aggregated errors across the 10 regression datasets (left) and the 10 classification datasets (right). In all cases, the addition of TANGOS provides superior performance over the standalone regularizer.

Figure 6: Without TANGOS Training. Gradient attributions with respect to each of the 10 hidden neurons. These results suggest significant overlap among the gradient attributions.

Figure 9: Performance Gains With Increasing Data Size. Training with various proportions of training data from the 416,188 examples of the Dionis dataset, we find the relative boost in performance from TANGOS to be consistent.

For our experimental results described in Tables 1, 6 and 7, we use |M| = 50. We empirically demonstrate that this approximation still leads to strong results in real-world experiments. The overall training procedure is described in Algorithm 1: initialise θ; then, while not converged, sample a mini-batch D_mini from D, compute the task loss L(f_θ(x), y) and the penalty R(x) = λ1 E_{x∼D_mini}[L_spec(x)] + λ2 E_{x∼D_mini}[L′_orth(x)], and update θ ← θ − η∇_θ(L(f_θ(x), y) + R(x)).

Figure 10: Sampling efficiency. Runtime increases linearly with the number of sampled pairs (left), while better generalization performance is maintained even for low sampling rates (right). The benefits of TANGOS can be realized using our proposed sampling approximation with comparable runtime to even the most efficient existing regularization approaches.

Figure 11: Scaling to large models. With a subsampling rate fixed at M = 50, TANGOS incurs only a small percentage increase in runtime as the number of neurons in the penultimate hidden layer increases dramatically.




Maciej Zięba, Jakub M. Tomczak, Marek Lubicz, and Jerzy Świątek. Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Applied Soft Computing, 2013.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 2005.

Table 3: MNIST Convolutional Neural Network Architecture.

Table 5: Ablation study. Generalization performance on different ablation settings.

Table 6: Standard error on generalization performance. Standard errors with respect to the random seed after retraining models from Table 1 experiments 10 times.

Table 7: In tandem performance. Mean ± standard deviation of generalization performance when each regularizer is employed in tandem with TANGOS.

ACKNOWLEDGMENTS

We thank the anonymous ICLR reviewers as well as members of the van der Schaar lab for many insightful comments and suggestions. Alan Jeffares is funded by the Cystic Fibrosis Trust. Tennison Liu would like to thank AstraZeneca for their sponsorship and support. Fergus Imrie and Mihaela van der Schaar are supported by the National Science Foundation (NSF, grant number 1722516). Mihaela van der Schaar is additionally supported by the Office of Naval Research (ONR).

REPRODUCIBILITY STATEMENT

We have attempted to make our experimental results easily reproducible through both a detailed description of our experimental procedure and the code used to produce our results (https://github.com/alanjeffares/TANGOS). Experiments are described in Section 5 with further details in Appendices F and L. All datasets used in this work can be freely downloaded from the UCI repository (Dua et al., 2017), with specific details provided in Appendix L.


For L2 we consider λ ∈ {0.01, 0.001, 0.0001}. We repeat this procedure for 6 runs and report the mean test accuracy. The MLP contained three ReLU-activated hidden layers of 400, 100, and 10 hidden units, respectively. We include the results of this experiment in Figure 9. Consistent with our experiments in Section 5, we find that TANGOS outperforms both the baseline model and the strongest baseline regularization method across all proportions of the data. These results indicate that TANGOS remains similarly effective across both small and large datasets in the tabular domain.

F APPROXIMATION AND ALGORITHM

Calculating the attribution of the latent units with respect to the input involves computing the Jacobian matrix, which can be computed in O(1) time and has memory complexity O(d_H d_X). The computational complexity of calculating L_orth is O(d_H²) (i.e. all pairwise computations between latent units). While the calculation can be efficiently parallelized, this still becomes impractically expensive for higher-dimensional layers. To address this, we introduce a relaxation by randomly subsampling pairs of neurons for which to calculate attribution similarity. We denote by I the set of all possible pairs of neuron indices, I = {(i, j) : i, j ∈ [d_H], i ≠ j}. Further, we let M ⊆ I denote a randomly sampled subset of I. We devise an approximation to the regularization term, denoted L′_orth, by estimating the penalty on the subset M, where the size of M can be chosen to balance computational burden with more faithful estimation.
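The subsampled penalty can be sketched as follows. This is our own implementation; in particular, ρ is assumed here to be the absolute cosine similarity between attribution vectors, matching its role as a similarity measure in the text.

```python
import numpy as np

def sample_pairs(d_h, m, rng):
    """Draw (up to) m distinct neuron pairs from all d_h*(d_h-1)/2 candidates."""
    all_pairs = [(i, j) for i in range(d_h) for j in range(i + 1, d_h)]
    idx = rng.choice(len(all_pairs), size=min(m, len(all_pairs)), replace=False)
    return [all_pairs[k] for k in idx]

def orth_penalty(A_batch, pairs):
    """Approximate L'_orth over a batch.

    A_batch: array of shape (B, d_h, d_x), where A_batch[b, i] is the
    attribution a_i(x_b); pairs: the subsampled index set M.
    """
    total = 0.0
    for A in A_batch:
        for i, j in pairs:
            ni, nj = np.linalg.norm(A[i]), np.linalg.norm(A[j])
            total += abs(A[i] @ A[j]) / (ni * nj + 1e-12)
    return total / (len(A_batch) * len(pairs))
```

The cost per example is O(|M|) pairwise similarities instead of O(d_H²), which is the source of the linear runtime growth reported in Figure 10 (left).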

J INSIGHTS -EXTENDED RESULTS

In this section, we present extended results of the decomposition of overall model error into diversity and weighted error among an ensemble of latent units from Section 4.2. We include all eight regularizers as described in Table 9 and three datasets (WE, ST, and BC) as described in Table 8. The results are included in Figure 13.

[1] This dataset has now been removed due to ethical issues. For more information, see https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8

Table 9: Overview of regularizers. Description of benchmarks considered in this work and their implementations.
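For reference, the classic ambiguity decomposition underlying this kind of diversity analysis can be checked numerically. The sketch below uses the standard Krogh-Vedelsby form for a uniform ensemble under squared error; the paper's decomposition over latent units is analogous but not necessarily identical.

```python
import numpy as np

# Ambiguity decomposition for a uniform ensemble under squared error:
#   ensemble_err = avg_err - diversity, where
#   ensemble_err = mean over points of (f_bar - y)^2,
#   avg_err      = mean over members and points of (f_i - y)^2,
#   diversity    = mean over members and points of (f_i - f_bar)^2.

rng = np.random.default_rng(0)
n_members, n_points = 8, 100
preds = rng.normal(size=(n_members, n_points))   # member predictions f_i
y = rng.normal(size=n_points)                    # targets

f_bar = preds.mean(axis=0)                       # ensemble prediction
ensemble_err = np.mean((f_bar - y) ** 2)
avg_err = np.mean((preds - y) ** 2)
diversity = np.mean((preds - f_bar) ** 2)
```

The identity holds exactly for every set of predictions, so lowering overall error requires either lowering average member error or raising diversity, which is the trade-off visualized in Figures 4 and 13.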

