ON LEARNING READ-ONCE DNFS WITH NEURAL NETWORKS

Abstract

Learning functions over Boolean variables is a fundamental problem in machine learning, but little is known about learning such functions with neural networks. Because learning these functions in the distribution-free setting is NP-hard, they are unlikely to be efficiently learnable by networks in this case. However, when the inputs are sampled from the uniform distribution, an important class of functions that is known to be efficiently learnable is read-once DNFs. Here we focus on this setting, where the functions are learned by a convex neural network with gradient descent. We first observe empirically that the learned neurons align with the terms of the DNF, despite the fact that there are many zero-error networks that do not have this property. Thus, the learning process has a clear inductive bias towards such logical formulas. To gain a better theoretical understanding of this phenomenon, we focus on minimizing the population risk. We show that this risk can be minimized by multiple networks: from ones that memorize data to ones that compactly represent the DNF. We then set out to understand why gradient descent "chooses" the compact representation. We give a computer-assisted proof of this inductive bias for relatively small DNFs, and use it to design a procedure for reconstructing the DNF from the learned network. We proceed to provide theoretical insights on the learning process and the optimization to better understand the resulting inductive bias. For example, we show that the network that minimizes the ℓ2 norm of the weights subject to margin constraints is also aligned with the DNF terms. Finally, we show empirically that our conclusions extend to high-dimensional DNFs, more general network architectures, and tabular datasets.

1. INTRODUCTION

The training objective of overparameterized neural networks is non-convex and contains multiple global minima with different generalization properties. Therefore, merely minimizing the training objective does not guarantee good generalization performance. Nonetheless, neural networks trained in practice with gradient-based methods show good test performance across numerous tasks (Krizhevsky et al., 2012; Silver et al., 2016), suggesting an inductive bias towards desirable solutions. Understanding this inductive bias and how it depends on the algorithm, architecture and data is one of the major open problems in machine learning (Zhang et al., 2017; Neyshabur et al., 2018). In recent years, there have been major efforts to tackle this challenge. One line of work considers the Neural Tangent Kernel (NTK) approximation of neural networks, which reduces to a convex optimization problem (Jacot et al., 2018). However, it has been shown that the NTK approximation is limited and does not accurately model neural networks as they are used in practice (Yehudai & Shamir, 2019; Daniely & Malach, 2020). Other works tackle the non-convexity directly for specific cases. However, current results are either for very simplified settings (e.g., diagonal linear networks; Woodworth et al., 2019), for specific cases such as regression with 2-layer models and Gaussian distributions (Li et al., 2020), or for impractical settings with infinitely wide two-layer networks (Chizat & Bach, 2020). Presumably, the reason for this relatively limited progress is the lack of general mathematical tools to analyze the non-convexity directly, except for a few simplified cases. One approach to make progress on this front is to use empirical tools in addition to theory, when the theoretical analysis is not tractable. In this work, we use this approach and study the inductive bias in a challenging and novel setting which is not addressed in previous theoretical works.
Concretely, we consider learning read-once DNFs under the uniform distribution with a one-hidden-layer, non-homogeneous convex network with ReLU activations and gradient descent (GD).¹ In computational learning theory, the problem of learning DNFs has a long history. Learning DNFs is hard (Pitt & Valiant, 1988) and the best known algorithms for learning DNFs under the uniform distribution run in quasi-polynomial time (Verbeurgt, 1990). On the other hand, for learning read-once DNFs under the uniform distribution there exist efficient learning algorithms (Mansour & Schain, 2001).² Therefore, it is interesting to understand whether neural networks can learn read-once DNFs under the uniform distribution, and this motivates the study of the inductive bias in this case. To better understand the inductive bias, we focus on the population setting. We show that even in this setting, where the training set consists of all possible binary vectors, there exist global minima of the training objective with significantly different properties. For example, a global minimum which memorizes the training points in its neurons, and another minimum whose neurons align exactly with the terms of the DNF, which we call a DNF-recovery solution. Figure 1a-b shows an example of these global minima. Therefore, the key question is what is the inductive bias of GD in this case. Namely, to which global minimum does it converge? To address this question, we provide a computer-assisted proof for the convergence of GD in low-dimensional DNFs. We circumvent the difficulty of floating-point errors in the computer-assisted proof by utilizing a unique feature of our setting that allows us to perform calculations in integers. We prove that under a symmetric initialization, the global minimum that GD converges to is similar to a DNF-recovery solution. Figure 1c shows an example of the global minimum GD converges to, which indeed looks similar to the DNF-recovery solution in Figure 1b.
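To make the learning setting concrete, the following sketch (illustrative only; the sample size and term indices are our own choices, matching the example DNF of Figure 1 rather than any code from the paper) samples uniform Boolean inputs and labels them with a read-once DNF:

```python
import numpy as np

# The read-once DNF of Figure 1: (x1 ∧ x2 ∧ x3) ∨ (x4 ∧ x5 ∧ x6) ∨ (x7 ∧ x8 ∧ x9).
# Read-once: each literal appears in at most one term.
TERMS = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
d = 9

def dnf_label(x):
    """Evaluate the DNF on a Boolean vector x in {0,1}^d; return +1/-1."""
    return 1.0 if any(all(x[i] == 1 for i in t) for t in TERMS) else -1.0

# Inputs drawn from the uniform distribution over {0,1}^d, as in the paper's setting.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4096, d)).astype(np.float64)
y = np.array([dnf_label(x) for x in X])

# Each size-3 term is satisfied with probability 1/8, so the positive-class
# fraction is 1 - (7/8)^3 ≈ 0.33.
print((y == 1).mean())
```

Note that the classes are reasonably balanced here, which is one reason small read-once DNFs make a convenient testbed for studying what GD learns.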
We prove, using the computer-assisted proof, that after a simple procedure of pruning and rounding of the weights, we can obtain the exact DNF-recovery solution from the model that GD converges to. Consequently, the terms of the DNF can be reconstructed from the network weights. We provide additional theoretical results for the population setting. We show that for a symmetric initialization, gradient descent has the following unique stability property: if at some iteration a neuron is aligned with a term of the DNF, it will remain aligned with that term for all subsequent iterations. This gives further evidence that GD is biased towards neurons that are aligned with terms. We also study minimum ℓ2 norm solutions of our problem, inspired by recent works that show connections between norm minimization and GD for homogeneous models (Lyu & Li, 2020; Chizat & Bach, 2020). We prove that the minimum ℓ2 norm solutions are all DNF-recovery solutions. We corroborate our findings with empirical results which show that our conclusions hold more broadly. Specifically, we perform experiments on DNFs of higher dimension, standard one-hidden-layer neural networks and Gaussian initialization. Taken together, our results demonstrate that gradient descent exhibits a clear inductive bias towards DNF-recovery solutions.
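The prune-and-round reconstruction described above can be sketched as follows. This is an illustrative sketch, not the paper's exact procedure: the pruning tolerance and the synthetic noisy weights (perturbed around a DNF-recovery solution, standing in for weights learned by GD) are our own assumptions.

```python
import numpy as np

def reconstruct_terms(W, prune_tol=0.25):
    """Sketch of prune-and-round: zero out small entries, round the rest,
    and read each neuron's support off as a candidate DNF term.
    prune_tol is a hypothetical threshold, not a value from the paper."""
    W = np.where(np.abs(W) < prune_tol, 0.0, W)  # prune near-zero weights
    W = np.round(W)                              # round the survivors
    # Duplicate neurons aligned with the same term collapse into one set entry.
    terms = {tuple(int(i) for i in np.nonzero(row)[0]) for row in W if row.any()}
    return sorted(terms)

# Synthetic weights near a DNF-recovery solution (several neurons per term,
# as in Figure 1b), plus small Gaussian perturbations.
rng = np.random.default_rng(1)
W_true = np.zeros((6, 9))
for j, t in enumerate([(0, 1, 2), (3, 4, 5), (6, 7, 8)] * 2):
    W_true[j, list(t)] = 1.0
W_noisy = W_true + 0.05 * rng.standard_normal(W_true.shape)

print(reconstruct_terms(W_noisy))
```

When the learned weights are close to a DNF-recovery solution, the printed term list matches the three terms of the underlying DNF exactly.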



¹ Non-homogeneity here is a result of a bias in the second layer.
² In a read-once DNF each literal appears at most once. See Section for a formal definition.



Figure 1: Examples of global minima for learning the read-once DNF (x₁ ∧ x₂ ∧ x₃) ∨ (x₄ ∧ x₅ ∧ x₆) ∨ (x₇ ∧ x₈ ∧ x₉) with a convex network. (a) Global minimum that memorizes the training points. (b) Global minimum that recovers the DNF. (c) Global minimum that GD converges to.
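A DNF-recovery solution like the one in panel (b) can be written down by hand: one ReLU neuron per term, with first-layer weights equal to the 0/1 indicator of the term's variables, a first-layer bias of -(|term| - 1), and a negative output bias. The sketch below constructs one such network (the exact weight values are our own choice of one zero-error solution, not the weights GD finds):

```python
import numpy as np
from itertools import product

TERMS = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
d = 9

def dnf_label(x):
    return 1.0 if any(all(x[i] == 1 for i in t) for t in TERMS) else -1.0

# One neuron per term: ReLU(w_j · x - (|t_j| - 1)) equals 1 iff term j is satisfied,
# since on {0,1} inputs, w_j · x counts how many of the term's variables are set.
W = np.zeros((len(TERMS), d))
b1 = np.array([-(len(t) - 1) for t in TERMS], dtype=float)
for j, t in enumerate(TERMS):
    W[j, list(t)] = 1.0

def f(x):
    # Second-layer weights fixed to +1, with an output bias of -0.5
    # (the bias in the second layer that makes the network non-homogeneous).
    return np.maximum(W @ x + b1, 0.0).sum() - 0.5

# sign(f) reproduces the DNF on all 2^9 points of the Boolean cube.
assert all(np.sign(f(np.array(x, dtype=float))) == dnf_label(x)
           for x in product([0, 1], repeat=d))
print("DNF-recovery network classifies all 512 inputs correctly")
```

This makes concrete why multiple global minima exist: the compact network above has only three neurons, while a memorizing solution can reach zero error with one neuron per positive training point.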

