NET-DNF: EFFECTIVE DEEP MODELING OF TABULAR DATA

Abstract

A challenging open question in deep learning is how to handle tabular data. Unlike domains such as image and natural language processing, where deep architectures prevail, there is still no widely accepted neural architecture that dominates tabular data. As a step toward bridging this gap, we present Net-DNF, a novel generic architecture whose inductive bias elicits models whose structure corresponds to logical Boolean formulas in disjunctive normal form (DNF) over affine soft-threshold decision terms. Net-DNFs also promote localized decisions that are taken over small subsets of the features. We present extensive experiments showing that Net-DNFs significantly and consistently outperform fully connected networks over tabular data. With relatively few hyperparameters, Net-DNFs open the door to practical end-to-end handling of tabular data using neural networks. We present ablation studies, which justify the design choices of Net-DNF including the inductive bias elements, namely, Boolean formulation, locality, and feature selection.

1. INTRODUCTION

A key point in successfully applying deep neural models is the construction of architecture families that contain inductive bias relevant to the application domain. Architectures such as CNNs and RNNs have become the preeminent favorites for modeling images and sequential data, respectively. For example, the inductive bias of CNNs favors locality, as well as translation and scale invariances. With these properties, CNNs work extremely well on image data, and are capable of generating problem-dependent representations that almost completely overcome the need for expert knowledge. Similarly, the inductive bias promoted by RNNs and LSTMs (and more recent models such as transformers) favors both locality and temporal stationarity. When considering tabular data, however, neural networks are not the hypothesis class of choice. Most often, the winning class in learning problems involving tabular data is decision forests. In Kaggle competitions, for example, gradient-boosted decision trees (GBDTs) (Chen & Guestrin, 2016; Friedman, 2001; Prokhorenkova et al., 2018; Ke et al., 2017) are generally the superior models. While it is quite practical to use GBDTs for medium-size datasets, it is extremely hard to scale these methods to very large datasets. Several papers have addressed scaling up gradient boosting models (Ye et al., 2009; Tyree et al., 2011; Fu et al., 2019; Vasiloudis et al., 2019). The most significant computational disadvantage of GBDTs is the need to store (almost) the entire dataset in memory. Moreover, handling multi-modal data, which involves both tabular and spatial data (e.g., medical records and images), is problematic. Thus, since GBDTs and neural networks cannot be organically optimized together, such multi-modal tasks are left with sub-optimal solutions. The creation of a purely neural model for tabular data, which can be trained with SGD end-to-end, is therefore a prime open objective.
A few works have aimed at constructing neural models for tabular data (see Section 5). Currently, however, there is still no widely accepted end-to-end neural architecture that can handle tabular data and consistently replace fully-connected architectures, or better yet, replace GBDTs. Here we present Net-DNFs, a family of neural network architectures whose primary inductive bias is an ensemble comprising disjunctive normal form (DNF) formulas over linear separators. This family also promotes (input) feature selection and spatial localization of ensemble members. These inductive biases have been included by design to promote conceptually similar elements that are inherent in GBDTs and random forests. Appealingly, the Net-DNF architecture can be trained end-to-end using standard gradient-based optimization. Importantly, it consistently and significantly outperforms FCNs on tabular data, and can sometimes even outperform GBDTs. The choice of appropriate inductive bias for specialized hypothesis classes for tabular data is challenging since, clearly, there are many different kinds of such data. Nevertheless, the "universality" of forest methods in handling a wide variety of tabular data suggests that it might be beneficial to emulate, using neural networks, the important elements that are part of the tree ensemble representation and algorithms. Concretely, every decision tree is equivalent to some DNF formula over axis-aligned linear separators (see details in Section 3). This makes DNFs an essential element in any such construction. Secondly, all contemporary forest ensemble methods rely heavily on feature selection. This feature selection is manifested both during the induction of each individual tree, where features are sequentially and greedily selected using information gain or other related heuristics, and by uniformly sampling features for each ensemble member.
Finally, forest methods include an important localization element: GBDTs with their sequential construction within a boosting approach, where each tree re-weights the instance domain differently, and random forests with their reliance on bootstrap sampling. Net-DNFs are designed to include precisely these three elements. After introducing Net-DNF, we include a Vapnik-Chervonenkis (VC) comparative analysis of DNFs and trees showing that DNFs potentially have an advantage over trees when the input dimension is large, and vice versa. We then present an extensive empirical study. We begin with an ablation study over three real-life tabular data prediction tasks that convincingly demonstrates the importance of all three elements included in the Net-DNF design. Second, we analyze our novel feature selection component over controlled synthetic experiments, which indicate that this component is of independent interest. Finally, we compare Net-DNFs to FCNs and GBDTs over several large classification tasks, including two past Kaggle competitions. Our results indicate that Net-DNFs consistently outperform FCNs, and can sometimes even outperform GBDTs.

2. DISJUNCTIVE NORMAL FORM NETWORKS (NET-DNFS)

In this section we introduce the Net-DNF architecture, which consists of three elements. The main component is a block of layers emulating a DNF formula. This block will be referred to as a Disjunctive Normal Neural Form (DNNF). The second and third components, respectively, are a feature selection module and a localization module. In the remainder of this section we describe each component in detail. Throughout our description we denote by $x \in \mathbb{R}^d$ a column input feature vector, by $x_i$ its $i$th entry, and by $\sigma(\cdot)$ the sigmoid function.

2.1. A DISJUNCTIVE NORMAL NEURAL FORM (DNNF) BLOCK

A disjunctive normal neural form (DNNF) block is assembled using a two-hidden-layer network. The first layer creates affine "literals" (features) and is trainable. The second layer implements a number of soft conjunctions over the literals, and the third output layer is a neural OR gate. Importantly, only the first layer is trainable, while the other two are binary and fixed. We begin by describing the neural AND and OR gates. For an input vector $x$, we define soft, differentiable versions of such gates as

$$\mathrm{OR}(x) \triangleq \tanh\left(\sum_{i=1}^{d} x_i + d - 1.5\right), \qquad \mathrm{AND}(x) \triangleq \tanh\left(\sum_{i=1}^{d} x_i - d + 1.5\right).$$

These definitions are straightforwardly motivated by the precise neural implementation of the corresponding binary gates. Notice that by replacing tanh by a binary activation and changing the bias constant from 1.5 to 1, we obtain an exact implementation of the corresponding logical gates for binary input vectors (Anthony, 2005; Shalev-Shwartz & Ben-David, 2014); see a proof of this statement in Appendix A. Notably, these units do not have any trainable parameters. We now define the AND gate in a vector form to project the logical operation over a subset of variables. The projection is controlled by an indicator column vector (a mask) $u \in \{0, 1\}^d$. With respect to such a projection vector $u$, we define the corresponding projected gate as

$$\mathrm{AND}_u(x) \triangleq \tanh\left(u^T x - \|u\|_1 + 1.5\right).$$

Equipped with these definitions, a $\mathrm{DNNF}(x) : \mathbb{R}^d \to \mathbb{R}$ with $k$ conjunctions over $m$ literals is,

$$L(x) \triangleq \tanh\left(x^T W + b\right) \in \mathbb{R}^m \qquad (1)$$

$$\mathrm{DNNF}(x) \triangleq \mathrm{OR}\left(\left[\mathrm{AND}_{c^1}(L(x)), \mathrm{AND}_{c^2}(L(x)), \ldots, \mathrm{AND}_{c^k}(L(x))\right]\right). \qquad (2)$$

Equation (1) defines $L(x)$, which generates $m$ "neural literals", each of which is the result of a tanh activation of a (trainable) affine transformation. The (trainable) matrix $W \in \mathbb{R}^{d \times m}$, as well as the row vector bias term $b \in \mathbb{R}^m$, determine the affine transformations, such that each column of $W$ corresponds to one literal. Equation (2) defines a DNNF.
In this equation, the vectors $c^i \in \{0, 1\}^m$, $1 \le i \le k$, are binary indicators such that $c^i_j = 1$ iff the $j$th literal belongs to the $i$th conjunction. In our design, each literal belongs to a single conjunction. These indicator vectors are defined and fixed according to the number and length of the conjunctions (see Appendix D.2).
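The forward pass of a single DNNF block can be illustrated with a minimal sketch in plain Python. It follows Equations (1) and (2) directly; the function names (`literals`, `soft_and`, `soft_or`, `dnnf`) and the list-based representation of $W$, $b$ and the conjunction masks are illustrative assumptions, not the authors' implementation.

```python
import math

def literals(x, W, b):
    """L(x) = tanh(x^T W + b); W is given as d rows of m columns (assumed layout)."""
    d, m = len(x), len(b)
    return [math.tanh(sum(x[i] * W[i][j] for i in range(d)) + b[j])
            for j in range(m)]

def soft_and(values, mask):
    """Projected soft AND: tanh(u^T x - ||u||_1 + 1.5) for a 0/1 mask u."""
    projected = sum(u * v for u, v in zip(mask, values))
    return math.tanh(projected - sum(mask) + 1.5)

def soft_or(values):
    """Soft OR: tanh(sum_i x_i + d - 1.5)."""
    return math.tanh(sum(values) + len(values) - 1.5)

def dnnf(x, W, b, conj_masks):
    """DNNF(x) = OR([AND_{c^1}(L(x)), ..., AND_{c^k}(L(x))])."""
    lits = literals(x, W, b)
    return soft_or([soft_and(lits, c) for c in conj_masks])

# Hypothetical toy block: d = 1 input, m = 4 literals, k = 2 conjunctions,
# each literal assigned to exactly one conjunction.
W = [[10.0, 10.0, 10.0, 10.0]]
b = [0.0, 0.0, 0.0, 0.0]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(dnnf([1.0], W, b, masks))   # all literals close to +1, output close to +1
print(dnnf([-1.0], W, b, masks))  # all literals close to -1, output close to -1
```

Only `W` and `b` would be trainable in this sketch; the masks and both gates are fixed, matching the description above.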

2.2. NET-DNFS

The embedding layer of a Net-DNF with $n$ DNNF blocks is a simple concatenation

$$E(x) \triangleq \left[\mathrm{DNNF}_1(x), \mathrm{DNNF}_2(x), \ldots, \mathrm{DNNF}_n(x)\right]. \qquad (3)$$

Depending on the application, the final Net-DNF is a composition of an output layer over $E(x)$. For example, for binary classification (logistic output layer), $\text{Net-DNF}(x) : \mathbb{R}^d \to (0, 1)$ is,

$$\text{Net-DNF}(x) \triangleq \sigma\left(\sum_{i=1}^{n} w_i \, \mathrm{DNNF}_i(x) + b_i\right).$$

To summarize, a Net-DNF is always a four-layer network (including the output layer), and only the first and last layers are learned. Each DNNF block has two parameters: the number of conjunctions $k$ and the length $m$ of these conjunctions, allowing for a variety of Net-DNF architectures. In all our experiments we considered a single Net-DNF architecture that has a fixed diversity of DNNF blocks, which includes a number of different DNNF groups with different $k$, each of which has a number of conjunction sizes $m$ (see details in Appendix D.2). The number $n$ of DNNFs was treated as a hyperparameter, and selected based on a validation set as described in Appendix D.1.
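As a small illustration of the logistic output layer, the following sketch applies it to precomputed DNNF outputs. The function name `net_dnf_binary` is hypothetical, and the reading of the formula in which each term contributes its own bias $b_i$ inside the sum is an assumption about the notation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def net_dnf_binary(dnnf_outputs, w, b):
    """sigma(sum_i (w_i * DNNF_i(x) + b_i)) over the embedding E(x).

    Assumption: the per-block biases b_i are summed together with the
    weighted DNNF outputs; a single shared bias would behave identically
    up to reparameterization.
    """
    z = sum(wi * ei + bi for wi, ei, bi in zip(w, dnnf_outputs, b))
    return sigmoid(z)

# Embedding of three (hypothetical) DNNF block outputs in (-1, 1).
embedding = [0.9, -0.8, 0.7]
print(net_dnf_binary(embedding, w=[1.0, 1.0, 1.0], b=[0.0, 0.0, 0.0]))
```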

2.3. FEATURE SELECTION

One key strategy in decision tree training is greedy feature selection, which is performed hierarchically at every split and allows decision trees to exclude irrelevant features. Additionally, decision tree ensemble algorithms apply random sampling to select a subset of the features, which is used to promote diversity and to prevent different trees from focusing on the same set of dominant features in their greedy selection. In line with these strategies, we include in our Net-DNFs conceptually similar feature selection elements: (1) a subset of features uniformly and randomly sampled for each DNNF; (2) a trainable mechanism for feature selection, applied on the resulting random subset. These two elements are combined and implemented in the affine literal generation layer described in Equation (1), and applied independently for each DNNF. We now describe these techniques in detail. Recalling that $d$ is the input dimension, the random selection is made by generating a stochastic binary mask, $m_s \in \{0, 1\}^d$ (each block has its own mask), such that the probability of any entry being 1 is $p$ (see Appendix D.2 for details on setting this parameter). For a given mask $m_s$, this selection can be applied over affine literals using a simple product $\mathrm{diag}(m_s)W$, where $W$ is the matrix of Equation (1). We then construct a trainable mask $m_t \in \mathbb{R}^d$, which will be applied on the features that are kept by $m_s$. We introduce a novel trainable feature selection component that combines binary quantization of the mask together with modified elastic-net regularization. To train a binarized vector we resort to the straight-through estimator (Hinton, 2012; Hubara et al., 2017), which can be used effectively to train non-differentiable step functions such as a threshold or sign. The trick is to compute the step function exactly in the forward pass, and utilize a differentiable proxy in the backward pass.
We use a version of the straight-through estimator for the sign function (Bengio et al., 2013),

$$\Phi(x) \triangleq \begin{cases} \mathrm{sign}(x), & \text{forward pass;} \\ \tanh(x), & \text{backward pass.} \end{cases}$$

Using the estimator $\Phi(x)$, we define a differentiable binary threshold function $T(x) = \frac{1}{2}\Phi(|x| - \epsilon) + \frac{1}{2}$, where $\epsilon \in \mathbb{R}$ defines an epsilon neighborhood around zero for which the output of $T(x)$ is zero, and one outside of this neighborhood (in all our experiments, we set $\epsilon = 1$ and initialize the entries of $m_t$ above this threshold). We then apply this selection by $\mathrm{diag}(T(m_t))W$. Given a fixed stochastic selection $m_s$, to train the binarized selection $m_t$ we employ regularization. Specifically, we consider a modified version of the elastic net regularization, $R(m_t, m_s)$, which is tailored to our task. The modifications are reflected in two parts. First, the balancing between the $L_1$ and $L_2$ regularization is controlled by a trainable parameter $\alpha \in \mathbb{R}$. Second, the expressions of the $L_1$ and $L_2$ regularization are replaced by $R_1(m_t, m_s)$ and $R_2(m_t, m_s)$, respectively (defined below). Moreover, since we want to take into account only features that were selected by the random component, the regularization is applied on the vector $m_{ts} = m_t \odot m_s$, where $\odot$ is element-wise multiplication. The functional form of the modified elastic net regularization is as follows,

$$R_2(m_t, m_s) \triangleq \left| \frac{\|m_{ts}\|_2^2}{\|m_s\|_1} - \beta\epsilon^2 \right|, \qquad R_1(m_t, m_s) \triangleq \left| \frac{\|m_{ts}\|_1}{\|m_s\|_1} - \beta\epsilon \right|,$$

$$R(m_t, m_s) \triangleq \frac{1 - \sigma(\alpha)}{2} R_2(m_t, m_s) + \sigma(\alpha) R_1(m_t, m_s).$$

The above formulation of $R_2(\cdot)$ and $R_1(\cdot)$ is motivated as follows. First, we normalize both norms by dividing by the effective input dimension, $\|m_s\|_1$, which is done to be invariant to the (effective) input size. Second, we define $R_2$ and $R_1$ as absolute errors, which encourages each entry to be, on average, approximately equal to the threshold $\epsilon$. The reason is that the vector $m_t$ passes through a binary threshold, and thus the exact values of its entries are irrelevant.
What is relevant is whether these values are within the epsilon neighborhood of zero or not. Thus, when the values are roughly equal to the threshold, training is more likely to converge to a balanced point where the regularization term is low and the relevant features are selected. The threshold term is controlled by $\beta$ (a hyperparameter), which controls the cardinality of $m_t$, where smaller values of $\beta$ lead to sparser $m_t$. To summarize, feature selection is manifested by both architecture and loss. The architecture relies on the masks $m_t, m_s$, while the loss function uses $R(m_t, m_s)$. Finally, the functional form of a DNNF block with the feature selection component is obtained by plugging the masks into Equation (2),

$$L(x) \triangleq \tanh\left(x^T \mathrm{diag}(T(m_t)) \, \mathrm{diag}(m_s) W + b\right) \in \mathbb{R}^m.$$

Additionally, the mean over $R(m_t, m_s)$ in all DNNFs is added to the loss function as a regularizer.
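The forward computation of the feature selection component can be sketched in plain Python as follows. Only the forward pass of the straight-through estimator is shown (the tanh proxy matters only for gradients); the function names and the convention sign(0) = +1 are assumptions made for illustration.

```python
import math

EPS = 1.0  # threshold epsilon; the text sets it to 1 in all experiments

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_threshold(v, eps=EPS):
    """Forward pass of T(x) = 0.5 * sign(|x| - eps) + 0.5.

    Assumption: sign(0) is taken to be +1, so a value exactly on the
    threshold counts as selected.
    """
    sign = 1.0 if abs(v) - eps >= 0 else -1.0
    return 0.5 * sign + 0.5

def effective_mask(m_t, m_s):
    """Entries of diag(T(m_t)) diag(m_s): 1 keeps a feature, 0 drops it."""
    return [binary_threshold(t) * s for t, s in zip(m_t, m_s)]

def elastic_net_reg(m_t, m_s, alpha, beta, eps=EPS):
    """Modified elastic net R(m_t, m_s) with trainable balance alpha."""
    m_ts = [t * s for t, s in zip(m_t, m_s)]          # m_t ⊙ m_s
    n_eff = sum(m_s)                                   # ||m_s||_1
    r2 = abs(sum(v * v for v in m_ts) / n_eff - beta * eps ** 2)
    r1 = abs(sum(abs(v) for v in m_ts) / n_eff - beta * eps)
    a = sigmoid(alpha)
    return (1.0 - a) / 2.0 * r2 + a * r1

m_s = [1, 1, 1, 0]                 # random mask: last feature not sampled
m_t = [1.5, 0.2, -2.0, 3.0]        # trainable mask values (hypothetical)
print(effective_mask(m_t, m_s))    # [1.0, 0.0, 1.0, 0.0]
print(elastic_net_reg(m_t, m_s, alpha=0.0, beta=1.0))
```

With $\beta = 1$ the regularizer vanishes when the sampled entries of $m_t$ have magnitude $\epsilon$ on average, which is the balanced point described above.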

2.4. SPATIAL LOCALIZATION

The last element we incorporate in the Net-DNF construction is spatial localization. This element encourages each DNNF unit in a Net-DNF ensemble to specialize in some focused proximity of the input domain. Localization is a well-known technique in classical machine learning, with various implementations and applications (Jacobs et al., 1991; Meir et al., 2000). On the one hand, localization allows construction of low-bias experts. On the other hand, it helps promote diversity, and reduction of the correlation between experts, which can improve the performance of an ensemble (Jacobs, 1997; Derbeko et al., 2002). We incorporate spatial localization by associating a Gaussian kernel $\mathrm{loc}(x \mid \mu, \Sigma)_i$ with a trainable mean vector $\mu_i$ and a trainable diagonal covariance matrix $\Sigma_i$ for the $i$th DNNF. Given a Net-DNF with $n$ DNNF blocks, the functional form of its embedding layer (Equation 3), with the spatial localization, is

$$\mathrm{loc}(x \mid \mu, \Sigma) \triangleq \left[e^{-\|\Sigma_1(x-\mu_1)\|_2}, e^{-\|\Sigma_2(x-\mu_2)\|_2}, \ldots, e^{-\|\Sigma_n(x-\mu_n)\|_2}\right] \in \mathbb{R}^n$$

$$\text{sm-loc}(x \mid \mu, \Sigma) \triangleq \mathrm{Softmax}\left\{\mathrm{loc}(x \mid \mu, \Sigma) \cdot \sigma(\tau)\right\} \in (0, 1)^n$$

$$E(x) \triangleq \left[\text{sm-loc}(x \mid \mu, \Sigma)_1 \cdot \mathrm{DNNF}_1(x), \ldots, \text{sm-loc}(x \mid \mu, \Sigma)_n \cdot \mathrm{DNNF}_n(x)\right],$$

where $\tau \in \mathbb{R}$ is a trainable parameter such that $\sigma(\tau)$ serves as the trainable temperature in the softmax. The inclusion of an adaptive temperature in this localization mechanism facilitates a data-dependent degree of exclusivity: at high temperatures, only a few DNNFs will handle an input instance whereas at low temperatures, more DNNFs will effectively participate in the ensemble. Observe that our localization mechanism is fully trainable and does not add any hyperparameters.
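A minimal sketch of the localized embedding, in plain Python, may help make the three formulas concrete. Each diagonal $\Sigma_i$ is stored as a vector of its diagonal entries, and the exponent is read as the (unsquared) Euclidean norm; both choices, like the function names, are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loc_kernel(x, mu, sigma_diag):
    """exp(-||Sigma_i (x - mu_i)||_2) with a diagonal Sigma_i."""
    sq = sum((s * (xi - mi)) ** 2 for s, xi, mi in zip(sigma_diag, x, mu))
    return math.exp(-math.sqrt(sq))

def softmax(z):
    top = max(z)                        # subtract max for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def localized_embedding(x, mus, sigma_diags, tau, dnnf_outputs):
    """E(x)_i = sm-loc(x)_i * DNNF_i(x), with sigma(tau) scaling the kernel."""
    scale = sigmoid(tau)
    kernels = [loc_kernel(x, mu, sd) * scale
               for mu, sd in zip(mus, sigma_diags)]
    weights = softmax(kernels)
    return [w * e for w, e in zip(weights, dnnf_outputs)]

# Two hypothetical DNNF blocks; the input sits exactly on the first center.
x = [0.0, 0.0]
mus = [[0.0, 0.0], [3.0, 4.0]]
sigma_diags = [[1.0, 1.0], [1.0, 1.0]]
print(localized_embedding(x, mus, sigma_diags, tau=0.0, dnnf_outputs=[1.0, 1.0]))
```

In this toy call the first block receives the larger weight because the input coincides with its center $\mu_1$.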

3. DNFS AND TREES - A VC ANALYSIS

The basic unit in our construction is a (soft) DNF formula instead of a tree. Here we provide a theoretical perspective on this design choice. Specifically, we analyze the VC-dimension of Boolean DNF formulas and compare it to that of decision trees. With this analysis we gain some insight into the generalization ability of formulas and trees, and argue numerically that the generalization of a DNF can be superior to a tree when the input dimension is not small (and vice versa). Throughout this discussion, we consider binary classification problems whose instances are Boolean vectors in $\{0, 1\}^n$. The first simple observation is that every decision tree has an equivalent DNF formula. Simply, each tree path from the root to a positively labeled leaf can be expressed by a conjunction of the conditions over the features appearing along the path to the leaf, and the whole tree can be represented by a disjunction of the resulting conjunctions. However, DNFs and decision trees are not equivalent, and we demonstrate this through the lens of VC-dimension. Simon (1990) presented an exact expression for the VC-dimension of decision trees as a function of the tree rank.

Definition 1 (Rank). Consider a binary tree $T$. If $T$ consists of a single node, its rank is defined as 0. If $T$ consists of a root, a left subtree $T_0$ of rank $r_0$, and a right subtree $T_1$ of rank $r_1$, then

$$\mathrm{rank}(T) = \begin{cases} 1 + r_0 & \text{if } r_0 = r_1, \\ \max\{r_0, r_1\} & \text{otherwise.} \end{cases}$$

It is evident that in the case of DNF formulas the upper bound on the VC-dimension grows linearly with the input dimension, whereas in the case of decision trees, if the rank is greater than 1, the VC-dimension grows polynomially (with degree at least 2) with the input dimension. In the worst case, this growth is exponential. A direct comparison of these dimensions is not trivial because there is a complex dependency between the rank $r$ of a decision tree and the number $k$ of conjunctions of an equivalent DNF formula.
Even if we compare large-$k$ DNF formulas to small-rank trees, it is clear that the VC-dimension of the trees can be significantly larger. For example, in Figure 1, we plot the upper bounds on the VC-dimension of large formulas (solid curves), and the exact VC-dimensions of small-rank trees (dashed curves). With the exception of rank-2 trees, the VC-dimension of decision trees dominates the dimension of DNFs when the input dimension exceeds 100. Trees, however, may have an advantage over DNF formulas for low-dimensional inputs. Since the VC-dimension is a qualitative proxy of the sample complexity of a hypothesis class, the above analysis provides theoretical motivation for expressing trees using DNF formulas when the input dimension is not small. Having said that, the disclaimer is that in the present discussion we have only considered binary problems. Moreover, the final hypothesis classes of both Net-DNFs and GBDTs are more complex in structure.
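The tree-to-DNF observation and the rank definition can both be checked on small examples with the following sketch. The nested-tuple encoding of a tree, `(feature, subtree_if_0, subtree_if_1)` with integer leaf labels, is a hypothetical representation chosen only for this illustration.

```python
from itertools import product

# A tree is either a leaf label (0 or 1) or a tuple
# (feature_index, subtree_if_feature_is_0, subtree_if_feature_is_1).

def rank(tree):
    """Rank of a binary tree as in Definition 1."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    r0, r1 = rank(left), rank(right)
    return 1 + r0 if r0 == r1 else max(r0, r1)

def tree_to_dnf(tree, path=()):
    """One conjunction per root-to-positive-leaf path.

    Each conjunction is a tuple of (feature_index, required_value) pairs.
    """
    if not isinstance(tree, tuple):
        return [path] if tree == 1 else []
    feat, left, right = tree
    return (tree_to_dnf(left, path + ((feat, 0),))
            + tree_to_dnf(right, path + ((feat, 1),)))

def eval_tree(tree, x):
    while isinstance(tree, tuple):
        feat, left, right = tree
        tree = right if x[feat] == 1 else left
    return tree

def eval_dnf(dnf, x):
    return int(any(all(x[f] == v for f, v in conj) for conj in dnf))

# This tree computes x0 OR x1 and has rank 1.
tree = (0, (1, 0, 1), 1)
dnf = tree_to_dnf(tree)
print(rank(tree), dnf)
print(all(eval_tree(tree, x) == eval_dnf(dnf, x)
          for x in product((0, 1), repeat=2)))
```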

4. EMPIRICAL STUDY

In this section, we present an empirical study that substantiates the design of Net-DNFs and convincingly shows its significant advantage over FCN architectures. The datasets used in this study are from Kaggle competitions and OpenML (Vanschoren et al., 2014). A summary of these datasets appears in Appendix C. All results presented in this work were obtained using a massive grid search for optimizing each model's hyperparameters. A detailed description of the grid search process with additional details can be found in Appendices D.1 and D.2. We report the score for each dataset according to the score function defined in the Kaggle competitions we used: log-loss for multiclass datasets and area under the ROC curve (ROC AUC) for binary datasets. All results are the mean of the test scores over five different partitions, and the standard error of the mean is reported. In addition, we also conducted a preliminary study of TabNet (Arik & Pfister, 2019) (see Section 5) over our datasets using its PyTorch implementation, but failed to produce competitive results.

The merit of the different Net-DNF components. We start with two different ablation studies, where we evaluate the contributions of the three Net-DNF components. In the first study, we start with a vanilla three-hidden-layer FCN and gradually add each component separately. In the second study, we start each experiment with the complete Net-DNF and leave one component out each time. In each study, we present the results on three real-world datasets, where all results are test log-loss scores (lower is better). Out-of-memory (OOM) entries mean that the network was too large to execute on our machine (see Appendix D.2). More technical details can be found in Appendix D.4. Consider Table 1. In Exp 1 we start with a vanilla three-hidden-layer FCN with a tanh activation. To make a fair comparison, we defined the widths of the layers according to the widths in the Net-DNF with the corresponding formulas.
In Exp 2, we added the DNF structure to the networks from Exp 1 (see Section 2.1). In Exp 3 we added the feature selection component (Section 2.3). It is evident that performance is monotonically improving, where the best results are clearly obtained on the complete Net-DNF (Exp 4). A subtle but important observation is that in all of the first three experiments, for all datasets, the trend is that the lower the number of formulas, the better the score. This trend is reversed in Exp 4, where the localization component (Section 2.4) is added, highlighting the importance of using all components of the Net-DNF representation in concert. Now consider Table 2. In Exp 5 we took the complete Net-DNF (Exp 4) and removed the feature selection component. When considering the Gesture Phase dataset, an interesting phenomenon is observed. In Exp 3 (128 formulas), we can see that the contribution of the feature selection component is negligible, but in Exp 5 (2048 formulas) we see the significant contribution of this component. We believe that the reason for this difference lies in the relationship of the feature selection component with the localization component, where this connection intensifies the contribution of the feature selection component. In Exp 6 we took the complete Net-DNF (Exp 4) and removed the localization component (identical to Exp 3). We did the same in Exp 7 where we removed the DNF structure. In general, it can be seen that removing each component results in a decrease in performance.

An analysis of the feature selection component. Having studied the contribution of the three components to Net-DNF, we now focus on the learnable part of the feature selection component (Section 2.3) alone, and examine its effectiveness using a series of synthetic tasks with a varying percentage of irrelevant features.
Recall that when considering a single DNNF block, the feature
We compare the performance of a vanilla FCN in three different cases: (1) oracle (ideal) feature selection, (2) our (learned) feature selection mask, and (3) no feature selection (see details in Appendix D.5). Consider the graphs in Figure 2, which demonstrate several interesting insights. In all tasks the performance of the vanilla FCN is sensitive to irrelevant features, probably due to the representation power of the FCN, which is prone to overfitting. On the other hand, by adding the feature selection component, we obtain near-oracle performance on the first three tasks, and a significant improvement on the three others. Moreover, these results support our observation from the ablation studies that the application of localization together with feature selection increases the latter's contribution. We can see that in Syn1-3, where there is a single interaction, the results are better than in Syn4-6, where the input space is divided into two 'local' sub-spaces with different interactions. These experiments emphasize the importance of the learnable feature selection in itself.

Table 3: Mean test results on tabular datasets and standard error of the mean. We present the ROC AUC (higher is better) as a percentage, and the log-loss (lower is better) with an x100 factor.

Comparative Evaluation. Finally, we compare the performance of Net-DNF vs. the baselines. Consider Table 3, where we examine the performance of Net-DNFs on six real-life tabular datasets (we add three larger datasets to those we used in the ablation studies). We compare our performance to XGBoost (Chen & Guestrin, 2016), the widely used implementation of GBDTs, and to FCNs. For each model, we optimized its critical hyperparameters. This optimization process required many computational resources: thousands of configurations have been tested for FCNs, hundreds of configurations for XGBoost, and only a few dozen for Net-DNF. A detailed description of the grid search we used for each model can be found in Appendix D.3. In Table 3, we see that Net-DNF consistently and significantly outperforms FCN over all six datasets. Net-DNF obtains results that are better than or indistinguishable from those of XGBoost on two datasets; on the other datasets, Net-DNF is slightly inferior but in the same ballpark as XGBoost.

5. RELATED WORK

There have been a few attempts to construct neural networks with improved performance on tabular data. A recurring idea in some of these works is the explicit use of conventional decision tree induction algorithms, such as ID3 (Quinlan, 1979), or conventional forest methods, such as GBDT (Friedman, 2001), that are trained over the data at hand. The parameters of the resulting decision trees are then explicitly or implicitly "imported" into a neural network using teacher-student distillation (Ke et al., 2018), explicit embedding of tree paths in a specialized network architecture with some kind of DNF structure (Seyedhosseini & Tasdizen, 2015), or explicit utilization of forests as the main building block of layers (Feng et al., 2018). This reliance on conventional decision tree or forest methods as an integral part of the proposed solution prevents end-to-end neural optimization, as we propose here. This deficiency is not only a theoretical nuisance but also makes it hard to use such models on very large datasets and in combination with other neural modules. A few other recent techniques aimed to cope with tabular data using pure neural optimization as we propose here. Yang et al. (2018) considered a method to approximate a single node of a decision tree using a soft binning function that transforms continuous features into one-hot features. While this method obtained results comparable to a single decision tree and an FCN (with two hidden layers), it is limited to settings where the number of features is small. Popov et al. (2019) proposed a network that combines elements of oblivious decision forests with dense residual networks. While this method achieved better results than GBDTs on several datasets, FCNs also achieved results that were better than or indistinguishable from those of GBDTs in most of these cases.
Arik & Pfister (2019) presented TabNet, a neural architecture for tabular data that implements instance-wise feature selection via sequential attention. It is reported that TabNet achieved results that are comparable or superior to GBDTs. Both TabNet and Net-DNF rely on sparsity induction and feature selection, which are implemented in different ways. While TabNet uses an attention mechanism to achieve feature selection, Net-DNF uses DNF formulas and elastic net regularization. Focusing on microbiome data, a recent study (Shavitt & Segal, 2018) presented an elegant regularization technique, which produces extremely sparse networks that are suitable for microbiome tabular datasets. Finally, soft masks for feature selection have been considered before, and the advantage of using elastic net regularization in a variable selection task was presented by Zou & Hastie (2005) and Li et al. (2016).

6. CONCLUSIONS

We introduced Net-DNF, a novel neural architecture whose inductive bias revolves around a disjunctive normal neural form, localization and feature selection. The importance of each of these elements has been demonstrated over real tabular data. The results of the empirical study convincingly indicate that Net-DNFs consistently outperform FCNs over tabular data. While Net-DNFs do not consistently beat XGBoost, our results indicate that their performance score is not far behind GBDTs. Thus, Net-DNF offers a meaningful step toward the effective processing of tabular data with neural networks.

We have left a number of potential incremental improvements and bigger challenges to future work. First, in our work we only considered classification problems. We expect Net-DNFs to also be effective in regression problems, and it would also be interesting to consider applications in reinforcement learning over finite discrete spaces. It would be very interesting to consider deeper Net-DNF architectures. For example, instead of a single DNNF block, one can construct a stack of such blocks to allow for more involved feature generation. Another interesting direction would be to consider training Net-DNFs using a gradient boosting procedure similar to that used in XGBoost. Finally, a most interesting challenge that remains open is what would constitute the ultimate inductive bias for tabular prediction tasks, which can elicit the best architectural designs for these data. Our successful application of DNNFs indicates that soft DNF formulas are quite effective, and are significantly superior to fully connected networks, but we anticipate that further effective biases will be identified, at least for some families of tabular tasks.

A OR AND AND GATES

The (soft) neural OR and AND gates were defined as

$$\mathrm{OR}(x) \triangleq \tanh\left(\sum_{i=1}^{d} x_i + d - 1.5\right), \qquad \mathrm{AND}(x) \triangleq \tanh\left(\sum_{i=1}^{d} x_i - d + 1.5\right).$$

By replacing the tanh activation with a sign activation, and setting the bias term to 1 (instead of 1.5), we obtain exact binary gates,

$$\mathrm{OR}(x) \triangleq \mathrm{sign}\left(\sum_{i=1}^{d} x_i + d - 1\right), \qquad \mathrm{AND}(x) \triangleq \mathrm{sign}\left(\sum_{i=1}^{d} x_i - d + 1\right).$$

Consider a binary vector $x \in \{\pm 1\}^d$. We prove that $\mathrm{AND}(x) \equiv \bigwedge_{i=1}^{d} x_i$, where, in the definition of the logical "and", $-1$ is equivalent to 0. If $x_i = 1$ for all $1 \le i \le d$, then $\bigwedge_{i=1}^{d} x_i = 1$. In this case, the argument of the activation is $\sum_{i=1}^{d} x_i - d + 1 = d - d + 1 = 1$, and the application of the sign activation yields 1. In the case of the soft neural AND gate, we get $\tanh(1) \approx 0.76$; therefore, we set the bias term to 1.5 to get an output closer to 1 ($\tanh(1.5) \approx 0.9$). Otherwise, there exists at least one index $1 \le j \le d$ such that $x_j = -1$, and $\bigwedge_{i=1}^{d} x_i = -1$. In this case, the argument of the activation is

$$\sum_{i=1}^{d} x_i - d + 1 = x_j + \sum_{i \ne j} x_i - d + 1 \le -1 + (d - 1) - d + 1 = -1,$$

and by applying the sign activation we obtain $-1$. This proves that the $\mathrm{AND}(x)$ neuron is equivalent to a logical "AND" gate in the binary case. A very similar proof shows that $\mathrm{OR}(x) \equiv \bigvee_{i=1}^{d} x_i$.
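The claim can also be checked exhaustively for small $d$. The following sketch enumerates all vectors in $\{\pm 1\}^d$ and compares the sign-activated gates (bias 1) with the logical operations; the helper names are illustrative only.

```python
from itertools import product

def sign(z):
    # The gate arguments are always odd integers here, so z is never 0.
    return 1 if z > 0 else -1

def exact_and(x):
    return sign(sum(x) - len(x) + 1)

def exact_or(x):
    return sign(sum(x) + len(x) - 1)

def gates_match_logic(d):
    """Compare both gates with logical AND/OR on every x in {-1, +1}^d."""
    for x in product((-1, 1), repeat=d):
        logical_and = 1 if all(v == 1 for v in x) else -1
        logical_or = 1 if any(v == 1 for v in x) else -1
        if exact_and(x) != logical_and or exact_or(x) != logical_or:
            return False
    return True

print(all(gates_match_logic(d) for d in range(1, 9)))  # True
```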

B PROOF OF THEOREM 2

We bound the VC-dimension of a DNF formula in two steps. First, we derive an upper bound on the VC-dimension of a single conjunction, and then extend it to a disjunction of $k$ conjunctions. We use the following simple lemma.

Lemma 1. For every two hypothesis classes $H' \subseteq H$, it holds that $\mathrm{VCDim}(H') \le \mathrm{VCDim}(H)$.

Proof. Let $d = \mathrm{VCDim}(H')$. By definition, there exist $d$ points that can be shattered by $H'$. Therefore, there exist $2^d$ hypotheses $\{h_i\}_{i=1}^{2^d}$ in $H'$, which shatter these points. By assumption, $\{h_i\}_{i=1}^{2^d} \subseteq H$, so $\mathrm{VCDim}(H) \ge d$.

For any conjunction on $n$ Boolean variables (regardless of the number of literals), it is possible to construct an equivalent decision tree of rank 1. The construction is straightforward. If $\bigwedge_{i=1}^{\ell} x_i$ is the conjunction, the decision tree consists of a single main branch of internal decision nodes connected sequentially. Each left child in this tree corresponds to decision "1", and each right child corresponds to decision "0". The root is indexed 1 and contains the literal $x_1$. For $1 \le i < \ell$, internal node $i$ contains the decision literal $x_i$ and its left child is node $i + 1$ (whose decision literal is $x_{i+1}$). See the example in Figure 3. It follows that the hypothesis class of conjunctions is contained in the class of rank-1 decision trees. Therefore, by Lemma 1 and Theorem 1, the VC-dimension of conjunctions is bounded above by $n + 1$. We

C TABULAR DATASET DESCRIPTION

We use datasets (see Table 4) that differ in several aspects, such as the number of features (from 16 up to 200), the number of classes (from 2 up to 9), and the number of samples (from 10k up to 200k). To keep things simple, we selected datasets with no missing values that do not require preprocessing. All models were trained on the raw data without any feature or data engineering and without any kind of data balancing or weighting. Only feature-wise standardization was applied.

All experiments in our work, using both synthetic and real datasets, were done through a grid search process. Each dataset was first randomly divided into five folds in a way that preserved the original distribution. Then, based on these five folds, we created five partitions of the dataset as follows. Each fold is used as the test set in one of the partitions, while the other folds are used as the training and validation sets. This way, each partition was 20% test, 10% validation, and 70% training. This division was done once (with seed 1), and the same partitions were used for all models. Based on these partitions, the grid search process of Algorithm 1 was repeated three times with three different seeds (1, 2, and 3), always with the exact same five partitions described above.

The final mean and standard error of the mean (SEM, computed as in scipy.stats.sem) that we present in all experiments are averages across the three seeds. Additionally, as can be seen from Algorithm 1, the model that was trained on the training set (70%) is the one that is used to evaluate performance on the test set (20%). This was done to keep things simple. The weight-loading command is relevant only for the neural network models; for XGBoost, the framework selects the optimal number of estimators at prediction time (according to early stopping at training time).
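The partitioning scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `make_partitions` is hypothetical, the folds are stratified by dealing each class round-robin, and the validation set is drawn uniformly at random from the four non-test folds (the text does not say how the validation split is chosen).

```python
import random
from collections import defaultdict


def make_partitions(labels, n_folds=5, val_fraction=0.10, seed=1):
    """Return n_folds dicts holding 'train', 'val' and 'test' index lists."""
    rng = random.Random(seed)

    # Group sample indices by class so every fold keeps the class proportions.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Deal the shuffled members of each class round-robin into the folds.
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % n_folds].append(idx)
        offset += len(members)

    # Each fold serves as the test set once; the rest is split into val/train.
    n_val = round(val_fraction * len(labels))
    partitions = []
    for t in range(n_folds):
        rest = [idx for f in range(n_folds) if f != t for idx in folds[f]]
        rng.shuffle(rest)  # assumption: validation samples are picked at random
        partitions.append({
            "test": sorted(folds[t]),
            "val": sorted(rest[:n_val]),
            "train": sorted(rest[n_val:]),
        })
    return partitions
```

With five folds and `val_fraction=0.10`, each partition holds roughly 20% test, 10% validation, and 70% training samples. The partitions would be computed once and reused for every model and every grid-search seed.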

D.2 TRAINING PROTOCOL

The Net-DNF and the FCN were implemented using TensorFlow. To make a fair comparison, for both models, we used the same batch size of 2048 (except for a few large Net-DNF configurations, where memory issues required a batch size of 1024), and the same learning rate scheduler (reduce on plateau) that monitors the training loss. We set a maximum of 1000 epochs and used the same early stopping protocol (30 epochs) that monitors the validation score. Moreover, for both of them, we used the same loss function (softmax-cross-entropy for multi-class datasets and sigmoid-cross-entropy for binary datasets) and the same optimizer (Adam with default parameters).



This disadvantage is shared among popular GBDT implementations: XGBoost, LightGBM, and CatBoost.

Our code is available at https://github.com/amramabutbul/DisjunctiveNormalFormNet.

https://github.com/dreamquark-ai/tabnet

For example, for the Gas Concentration dataset (see below), TabNet results were slightly inferior to the results we obtained for XGBoost (4.89 log-loss for TabNet vs. 2.22 log-loss for XGBoost).

We used seed number 1.

We used seed numbers 1, 2, 3.

For details, see: docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html

For Net-DNF, when using 3072 formulas, we set the batch size to 1024 on the Santander Transaction and Gas datasets, and when using 2048 formulas, we set the batch size to 1024 on the Santander Transaction dataset. This was done due to memory issues.

We noticed that in this scenario, a large learning rate or a large batch size leads to a decline in the performance of the 'FCN with feature selection', while the performance of the simple FCN and the 'FCN with oracle mask' remains approximately the same.



Figure 1: $VCDim(DT_n^r)$ and the upper bound on $VCDim(DNF_n^k)$ (log scale) as a function of the input dimension

Figure 2: The results on the six synthetic experiments. For each experiment we present the test accuracy (with an error bar of the standard error of the mean) as a function of the input dimension d.

now derive the upper bound on the VC-dimension of a disjunction of k conjunctions. Let $C$ be the class of conjunctions, and let $D_k(C)$ be the class of disjunctions of k conjunctions. Clearly, $D_k(C)$ is a k-fold union of the class $C$, namely, $D_k(C) = \big\{ \bigcup_{i=1}^{k} c_i \mid c_i \in C \big\}$. By Lemma 3.2.3 in (Blumer et al., 1989), if $d = VCDim(C)$, then for all $k \ge 1$, $VCDim(D_k(C)) \le 2dk \log(3k)$. Therefore, for the class $DNF_n^k$ of DNF formulas with k conjunctions on n Boolean variables, we have $VCDim(DNF_n^k) \le 2(n + 1)k \log(3k)$.
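To get a feel for how the bound $2(n + 1)k \log(3k)$ scales, it can be evaluated numerically. The helper below is only a sketch with a hypothetical name; the base of the logarithm is not stated in the text above, so base 2 is used here as an assumption and exposed as a parameter.

```python
import math


def dnf_vc_upper_bound(n, k, log_base=2.0):
    """Evaluate 2 (n + 1) k log(3k) for n Boolean variables and k conjunctions.

    The logarithm base is an assumption (base 2 by default).
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive integers")
    return 2 * (n + 1) * k * math.log(3 * k, log_base)
```

For a fixed k the bound grows linearly in n, and for a fixed n it grows as k log k.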

Figure 3: An example of a decision tree with rank 1, which is equivalent to the conjunction $x_0 \wedge x_1 \wedge x_2 \wedge x_3 \wedge x_4$.

Gradual study (test log-loss scores)

Leave one out study (test log-loss scores)

learnable binary mask that multiplies the input element-wise. Here we examine the effect of this mask on a vanilla FCN network (see technical details in Appendix D.5). The synthetic tasks we use were introduced by Yoon et al. (2019); Chen et al. (2018), where they were used as synthetic experiments to test feature selection. There are six different dataset settings; exact specifications appear in Appendix D.5. For each dataset, we generated seven different instances that differ in their input size. While increasing the input dimension d, the same logit is used for prediction, so the new features are irrelevant, and as d gets larger, the percentage of relevant features becomes smaller.

A description of the tabular datasets

D EXPERIMENTAL PROTOCOL

D.1 DATA PARTITION AND GRID SEARCH PROCEDURE

ACKNOWLEDGMENTS

This research was partially supported by the Israel Science Foundation, grant No. 710/18.


Published as a conference paper at ICLR 2021

For Net-DNF we used an initial learning rate of 0.05. For FCN, we added the initial learning rate to the grid search with values of {0.05, 0.005, 0.0005}. For XGBoost we set the maximal number of estimators to 2500, and used early stopping of 50 estimators that monitors the validation score. All models were trained on Titan Xp GPUs (12GB RAM).

Additionally, in the case of Net-DNF, we took a symmetry-breaking approach between the different DNNFs. The DNNF group was divided equally into four subgroups where, for each subgroup, the number of conjunctions is equal to one of the values [6, 9, 12, 15], and the group of conjunctions of each DNNF was divided equally into three subgroups where, for each subgroup, the conjunction length is equal to one of the values [2, 4, 6]. The same approach was used for the parameter p of the random mask: the DNNF group was divided equally into five subgroups where, for each subgroup, p is equal to one of the values [0.1, 0.3, 0.5, 0.7, 0.9]. In all experiments we used the same values.

To summarize, we performed a crude but broad selection (among 42 hyper-parameter configurations) for our Net-DNF. Results were quite strong, so we avoided further fine-tuning. To ensure extra fairness w.r.t. the baselines, we provided them with significantly more hyper-parameter tuning resources (864 configurations for XGBoost, and 3300 configurations for FCNs).
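The symmetry-breaking assignment can be sketched as below. This is a hedged illustration rather than the released implementation: the helper names (`equal_subgroups`, `dnnf_configs`) and dictionary keys are hypothetical, and the sketch assumes that subgroups are contiguous blocks of DNNF indices and that "divided equally" means as evenly as possible when the group size is not a multiple of the number of values.

```python
def equal_subgroups(n_items, values):
    """Assign one value to each of n_items so that consecutive blocks of
    (nearly) equal size share the same value."""
    return [values[i * len(values) // n_items] for i in range(n_items)]


def dnnf_configs(n_formulas,
                 n_conj_values=(6, 9, 12, 15),
                 conj_len_values=(2, 4, 6),
                 p_values=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Build one configuration per DNNF: its conjunction lengths and mask p."""
    n_conj = equal_subgroups(n_formulas, n_conj_values)
    mask_p = equal_subgroups(n_formulas, p_values)
    configs = []
    for c, p in zip(n_conj, mask_p):
        configs.append({
            "n_conjunctions": c,
            # One length per conjunction, split into three equal subgroups.
            "conjunction_lengths": equal_subgroups(c, conj_len_values),
            "mask_p": p,
        })
    return configs
```

For example, `dnnf_configs(20)` gives five DNNFs for each conjunction count and four DNNFs for each value of p. How the two groupings are aligned with each other is not specified in the text, so the contiguous layout here is only one possible choice.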

D.3.3 FULLY CONNECTED NETWORKS

The FCN networks are constructed using Dense-ReLU-Dropout blocks with $L_2$ regularization. The network's blocks are defined in the following way. Given depth and width parameters, we examine two different configurations: (1) the same width is used for the entire network (e.g., if the width is 512 and the depth is four, then the network blocks are [512, 512, 512, 512]), and (2) the width parameter defines the width of the first block, and each subsequent block is reduced by a factor of 2 (e.g., if the width is 512 and the depth is four, then the network blocks are [512, 256, 128, 64]). On top of the last block we add a simple linear layer that reduces the dimension to the output dimension. The dropout and $L_2$ values are the same for all blocks.
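The two width configurations can be written as a small helper. This is a minimal sketch with a hypothetical function name and a hypothetical `halve` flag; it only computes the block widths and does not build the network itself.

```python
def fcn_block_widths(width, depth, halve=False):
    """Return the list of block widths for an FCN.

    halve=False: every block has the same width (configuration 1).
    halve=True:  each block is half as wide as the previous one (configuration 2).
    """
    if width < 1 or depth < 1:
        raise ValueError("width and depth must be positive")
    if halve:
        # max(1, ...) keeps very deep, narrow settings from reaching width 0.
        return [max(1, width // 2 ** i) for i in range(depth)]
    return [width] * depth
```

With a width of 512 and a depth of four, the two modes reproduce the examples given in the text: `[512, 512, 512, 512]` and `[512, 256, 128, 64]`.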

D.4 ABLATION STUDY

All ablation experiments were conducted using the grid search process described in D.1. In all experiments, we used the same training details as described in D.2 for Net-DNF; the only difference between the experiments is the addition or removal of individual components. The single hyperparameter that was fine-tuned using the grid search is the 'feature selection beta' over the range {1.6, 1.3, 1., 0.7, 0.4, 0.1}, in experiments in which the feature selection component is involved. In the other cases, only one configuration was tested in the grid search process for a specific number of formulas.

D.5 FEATURE SELECTION ANALYSIS

The input features $x \in \mathbb{R}^d$ of all six datasets were generated from a d-dimensional Gaussian distribution with no correlation across the features, $x \sim \mathcal{N}(0, I)$. The label y is sampled as a Bernoulli random variable with $P(y = 1 \mid x) = \frac{1}{1 + \mathrm{logit}(x)}$, where $\mathrm{logit}(x)$ is varied to create the different synthetic datasets ($x_i$ refers to the ith entry):

4. Syn4: if $x_{11} < 0$, logit follows Syn1, else, logit follows Syn2
5. Syn5: if $x_{11} < 0$, logit follows Syn1, else, logit follows Syn3
6. Syn6: if $x_{11} < 0$, logit follows Syn2, else, logit follows Syn3

We compare the performance of a basic FCN in three different cases: (1) oracle (ideal) feature selection, where the input feature vector is multiplied element-wise with an input oracle mask, whose ith entry equals 1 iff the ith feature is relevant (e.g., on Syn1, features 1 and 2 are relevant, and on Syn4, features 1-6 and 11 are relevant), (2) our (learned) feature selection mask, where the input feature vector is multiplied element-wise with the mask $m_t$, i.e., the entries of the mask $m_s$ (see Section 2.3) are all fixed to 1, and (3) no feature selection.

From each dataset, we generated seven different instances that differ in their input size, $d \in \{11, 50, 100, 150, 200, 250, 300\}$. As the input dimension d increases, the same logit function is used. Each instance contains 10k samples that were partitioned as described in Section D.1. We treated each instance as an independent dataset, and the grid search process described in Section D.1 was carried out for each one.

The FCN that we used has two dense hidden layers [64, 32] with a ReLU activation. To keep things simple, we did not use dropout or any other kind of regularization. The same training protocol was used for all three models. We used the same learning rate scheduler, early stopping protocol, loss function, and optimizer as in Section D.2. We used a batch size of 256 and an initial learning rate of 0.001.
The only hyperparameter that was fine-tuned is the 'feature selection beta' in the case of 'FCN with feature selection' on the range {1.3, 1., 0.7, 0.4}. For the two other models, only a single configuration was tested in the grid search process.
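The data-generating process and the oracle mask of this section can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' generator: all function names are hypothetical, and `toy_logit_a` and `toy_logit_b` are placeholder non-negative functions used only to exercise the code; they are not the Syn1-Syn3 definitions.

```python
import numpy as np


def sample_synthetic(logit_fn, n_samples, d, seed=0):
    """Draw x ~ N(0, I) and y ~ Bernoulli(1 / (1 + logit_fn(x)))."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))
    # logit_fn must be non-negative for this to be a valid probability.
    p = 1.0 / (1.0 + logit_fn(x))
    y = (rng.random(n_samples) < p).astype(np.int64)
    return x, y


def switch_logit(logit_a, logit_b, switch_feature=11):
    """Syn4-Syn6 style switching: use logit_a where x_11 < 0, else logit_b.

    switch_feature is a 1-based feature index, as in the text.
    """
    def logit(x):
        return np.where(x[:, switch_feature - 1] < 0, logit_a(x), logit_b(x))
    return logit


def oracle_mask(relevant_features, d):
    """Binary mask whose ith entry is 1 iff feature i (1-based) is relevant."""
    mask = np.zeros(d)
    mask[np.asarray(relevant_features) - 1] = 1.0
    return mask


# Placeholder logits (assumptions for illustration only, not Syn1-Syn3).
def toy_logit_a(x):
    return np.exp(x[:, 0] * x[:, 1])


def toy_logit_b(x):
    return np.exp(x[:, 2] + x[:, 3])
```

Increasing `d` while keeping `logit_fn` fixed adds only irrelevant features, which is the mechanism used to build the seven instances per dataset. The oracle baseline multiplies each input row by `oracle_mask(...)`, for example `oracle_mask([1, 2], d)` for a dataset whose relevant features are 1 and 2.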

