ON LEARNING READ-ONCE DNFS WITH NEURAL NETWORKS

Abstract

Learning functions over Boolean variables is a fundamental problem in machine learning, yet little is known about learning such functions with neural networks. Because learning these functions in the distribution-free setting is NP-hard, they are unlikely to be efficiently learnable by networks in this case. However, when the inputs are sampled from the uniform distribution, an important class of functions that is known to be efficiently learnable is read-once DNFs. Here we focus on this setting, where the functions are learned by a convex neural network and gradient descent. We first observe empirically that the learned neurons are aligned with the terms of the DNF, despite the fact that there are many zero-error networks that do not have this property. Thus, the learning process has a clear inductive bias towards such logical formulas. To gain a better theoretical understanding of this phenomenon we focus on minimizing the population risk. We show that this risk can be minimized by multiple networks: from ones that memorize data to ones that compactly represent the DNF. We then set out to understand why gradient descent "chooses" the compact representation. We use a computer-assisted proof to establish the inductive bias for relatively small DNFs, and use it to design a process for reconstructing the DNF from the learned network. We proceed to provide theoretical insights on the learning process and the optimization to better understand the resulting inductive bias. For example, we show that the network that minimizes the ℓ2 norm of the weights subject to margin constraints is also aligned with the DNF terms. Finally, we show empirically that our results extend to high-dimensional DNFs, more general network architectures, and tabular datasets.
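To make the setting concrete, the following sketch (a hypothetical example, not taken from the paper) shows a read-once DNF: a disjunction of conjunctive terms in which each variable appears at most once across the whole formula. Under the uniform distribution this structure makes the terms statistically independent, which is what makes the class tractable. The formula and term indices below are illustrative assumptions.

```python
import itertools

# A read-once DNF: each variable appears in at most one term.
# Hypothetical example formula: (x1 AND x2) OR (x3 AND x4 AND x5).
TERMS = [(0, 1), (2, 3, 4)]  # variable indices of each term; disjoint by read-once-ness

def read_once_dnf(x, terms=TERMS):
    """Evaluate the DNF on a Boolean assignment x (sequence of 0/1)."""
    return int(any(all(x[i] for i in term) for term in terms))

# Under the uniform distribution over {0,1}^n, disjoint terms are independent,
# so P[f(x)=1] = 1 - prod_j (1 - 2^{-|term_j|}).
n = 5
positives = sum(read_once_dnf(x) for x in itertools.product([0, 1], repeat=n))
assert positives / 2**n == 1 - (1 - 2**-2) * (1 - 2**-3)  # 11/32
```

In the paper's setting, each neuron of the learned network empirically aligns with one such term, so a routine like `read_once_dnf` can be reconstructed from the trained weights.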

1. INTRODUCTION

The training objective of overparameterized neural networks is non-convex and contains multiple global minima with different generalization properties. Therefore, merely minimizing the training objective does not guarantee good generalization performance. Nonetheless, neural networks trained in practice with gradient-based methods show good test performance across numerous tasks (Krizhevsky et al., 2012; Silver et al., 2016), suggesting an inductive bias towards desirable solutions. Understanding this inductive bias and how it depends on the algorithm, architecture and data is one of the major open problems in machine learning (Zhang et al., 2017; Neyshabur et al., 2018). In recent years, there have been major efforts to tackle this challenge. One line of work considers the Neural Tangent Kernel (NTK) approximation of neural networks, which reduces to a convex optimization problem (Jacot et al., 2018). However, it has been shown that the NTK approximation is limited and does not accurately model neural networks as they are used in practice (Yehudai & Shamir, 2019; Daniely & Malach, 2020). Other works tackle the non-convexity directly for specific cases. However, current results are either for very simplified settings (e.g., diagonal linear networks; Woodworth et al., 2019), for specific cases such as regression with 2-layer models and Gaussian distributions (Li et al., 2020), or for impractical settings with infinitely wide two-layer networks (Chizat & Bach, 2020). Presumably, the reason for this relatively limited progress is the lack of general mathematical tools to analyze the non-convexity directly, except for a few simplified cases. One approach to make progress on this front is to use empirical tools in addition to theory, when the theoretical analysis is not tractable. In this work, we use this approach and study the inductive bias

