ON LEARNING READ-ONCE DNFS WITH NEURAL NETWORKS

Abstract

Learning functions over Boolean variables is a fundamental problem in machine learning, but not much is known about learning such functions with neural networks. Because learning these functions in the distribution-free setting is NP-hard, they are unlikely to be efficiently learnable by networks in this case. However, assuming the inputs are sampled from the uniform distribution, an important subset of functions that are known to be efficiently learnable is read-once DNFs. Here we focus on this setting, where the functions are learned by a convex neural network and gradient descent. We first observe empirically that the learned neurons are aligned with the terms of the DNF, despite the fact that there are many zero-error networks that do not have this property. Thus, the learning process has a clear inductive bias towards such logical formulas. To gain a better theoretical understanding of this phenomenon we focus on minimizing the population risk. We show that this risk can be minimized by multiple networks: from ones that memorize the data to ones that compactly represent the DNF. We then set out to understand why gradient descent "chooses" the compact representation. We use a computer assisted proof to establish the inductive bias for relatively small DNFs, and use it to design a procedure for reconstructing the DNF from the learned network. We proceed to provide theoretical insights on the learning process and the optimization to better understand the resulting inductive bias. For example, we show that the network that minimizes the l2 norm of the weights subject to margin constraints is also aligned with the DNF terms. Finally, we show empirically that our conclusions extend to high dimensional DNFs, more general network architectures and tabular datasets.

1. INTRODUCTION

The training objective of overparameterized neural networks is non-convex and contains multiple global minima with different generalization properties. Therefore, just minimizing the training objective does not guarantee good generalization performance. Nonetheless, neural networks trained in practice with gradient-based methods show good test performance across numerous tasks (Krizhevsky et al., 2012; Silver et al., 2016), suggesting an inductive bias towards desirable solutions. Understanding this inductive bias and how it depends on the algorithm, architecture and data is one of the major open problems in machine learning (Zhang et al., 2017; Neyshabur et al., 2018). In recent years, there have been major efforts to tackle this challenge. One line of work considers the Neural Tangent Kernel (NTK) approximation of neural networks, which reduces training to a convex optimization problem (Jacot et al., 2018). However, it has been shown that the NTK approximation is limited and does not accurately model neural networks as they are used in practice (Yehudai & Shamir, 2019; Daniely & Malach, 2020). Other works tackle the non-convexity directly for specific cases. However, current results are either for very simplified settings (e.g., diagonal linear networks; Woodworth et al., 2019), for specific cases such as regression with 2-layer models and Gaussian distributions (Li et al., 2020), or for impractical settings with infinitely wide two-layer networks (Chizat & Bach, 2020). Presumably, the reason for this relatively limited progress is the lack of general mathematical tools to analyze the non-convexity directly, except for a few simplified cases. One approach to make progress on this front is to use empirical tools in addition to theory, when the theoretical analysis is not tractable. In this work, we use this approach and study the inductive bias in a challenging and novel setting which is not addressed in previous theoretical works.
Concretely, we consider learning read-once DNFs under the uniform distribution with a one-hidden layer, nonhomogeneous convex network with ReLU activations and gradient descent (GD). In computational learning theory, the problem of learning DNFs has a long history. Learning DNFs is hard (Pitt & Valiant, 1988) and the best known algorithms for learning DNFs under the uniform distribution run in quasi-polynomial time (Verbeurgt, 1990). On the other hand, for learning read-once DNFs under the uniform distribution there exist efficient learning algorithms (Mansour & Schain, 2001). Therefore, it is interesting to understand whether neural networks can learn read-once DNFs under the uniform distribution, and this motivates the study of the inductive bias in this case.

To better understand the inductive bias, we focus on the population setting. We show that even in this setting, where the training set consists of all possible binary vectors, there exist global minima of the training objective with significantly different properties: for example, a global minimum which memorizes the training points in its neurons, and another minimum whose neurons align exactly with the terms of the DNF, which we call a DNF-recovery solution. Figure 1a-b shows an example of these global minima. Therefore, the key question is what is the inductive bias of GD in this case, namely, to which global minimum does it converge? To address this question, we provide a computer-assisted proof for the convergence of GD in low dimensional DNFs. We circumvent the difficulty of floating point errors in the computer-assisted proof by utilizing a unique feature of our setting that allows us to perform the calculations in integers. We prove that under a symmetric initialization, the global minimum that GD converges to is similar to a DNF-recovery solution. Figure 1c shows an example of the global minimum GD converges to, which indeed looks similar to the DNF-recovery solution in Figure 1b.
We prove, using the computer assisted proof, that after a simple procedure of pruning and rounding of the weights, we can obtain the exact DNF-recovery solution from the model that GD converges to. Consequently, the terms of the DNF can be reconstructed from the network weights. We provide additional theoretical results for the population setting. We show that for a symmetric initialization, gradient descent has the following stability property: if at some iteration a neuron is aligned with a term of the DNF, it will remain aligned with that term for all subsequent iterations. This gives further evidence that GD is biased towards neurons that are aligned with terms. We also study minimum l2 norm solutions of our problem, inspired by recent works that show connections between norm minimization and GD for homogeneous models (Lyu & Li, 2020; Chizat & Bach, 2020). We prove that the minimum l2 norm solutions are all DNF-recovery solutions. We corroborate our findings with empirical results which show that our conclusions hold more broadly. Specifically, we perform experiments on DNFs of higher dimension, standard one-hidden layer neural networks and Gaussian initialization. Taken together, our results demonstrate that gradient descent can recover simple descriptions of Boolean functions from data, a fact that has important implications for interpretability.

2. RELATED WORK

Recently, several works studied the inductive bias of two-layer homogeneous networks and showed connections between gradient methods and margin maximization (Chizat & Bach, 2020; Lyu & Li, 2020; Ji & Telgarsky, 2020). Preliminary results for non-homogeneous networks were provided in Nacson et al. (2019). We note that these results do not provide convergence guarantees and hold under assumptions on the dynamics of gradient methods (e.g., that they reach a certain loss value). Other works study fully connected neural networks under certain assumptions on the data, such as linearly separable data (Brutzkus et al., 2018) or Gaussian data (Safran & Shamir, 2018). Malach & Shalev-Shwartz (2019) show that certain structured Boolean circuits can be learned with a network architecture that is specialized for their data structure. Fully connected networks were also analyzed via the NTK approximation (Du et al., 2019; 2018; Arora et al., 2019; Fiat et al., 2019). Another line of work (Saad & Solla, 1996; Goldt et al., 2019; Tian, 2019) studies neural networks in a student-teacher setting and shows a "specialization" phenomenon, where a subset of student neurons aligns with teacher neurons. The main difference from our setting is that we consider classification on binary data, whereas they consider regression tasks on non-discrete data (e.g., Gaussian). Furthermore, we perform exact DNF recovery, which is unique to our classification setting. Bengio et al. (2006) and Bach (2017) study convex networks with infinitely many hidden units and devise convex relaxation algorithms. In this work, we consider convex networks with finitely many hidden neurons and study gradient descent on a non-convex objective. Rudin (2019) argues that methods for explaining large neural networks should be avoided because networks are too complex for humans to understand. However, we show, albeit in a restricted setting, that learned networks can be rather simple, and are easily mapped to the underlying DNF.

3. PROBLEM FORMULATION

Let X = {-1, 1}^D be the input space and let [m] = {1, ..., m}. A DNF f* over X is given by K terms with index sets A_1, ..., A_K ⊆ [D], where term n is represented by its indicator vector t*_n ∈ {0, 1}^D with support A_n. We define f*(x) = 1 if ∃n ∈ [K] s.t. x · t*_n = |A_n|, and otherwise f*(x) = -1. Notice that f* is monotone. We refer to t*_n as term n of f* and we say that a sample x ∈ X satisfies the term t*_n if x · t*_n = |A_n|. We refer to |A_n| as the size of the term t*_n. To compare our notation with the standard one, for example, the DNF (x1 ∧ x2) ∨ (x3 ∧ x4) with 4 inputs has terms t*_1 = (1, 1, 0, 0) and t*_2 = (0, 0, 1, 1). We will use the standard notation when convenient, e.g., as in Figure 1. In this work we focus on read-once DNFs, where for all i ≠ j ∈ [K], A_i ∩ A_j = ∅.

Learning Setup: Let D be the uniform distribution over X and let f* be a monotone read-once DNF. We consider learning f* given a training set S ⊆ X × Y with Y = {-1, 1}, where for each (x, y) ∈ S, x is sampled IID from D and y = f*(x). Denote S_x = {x | (x, y) ∈ S} and the positive samples by S_p = {x | (x, 1) ∈ S}.

Neural Architecture: We consider a convex one-hidden layer neural network (NN) with r hidden units and parameters (W, b) ∈ R^{r×D} × R^r, defined by:

N(x; W, b) = ∑_{i∈[r]} σ(w_i · x + b_i) - 1   (1)

where σ(x) = max{0, x} is the ReLU function, w_i is the i-th row of W and b_i is the i-th entry of b. Note that the network is not homogeneous and therefore recent results on homogeneous networks do not apply (see Section 2). The network is convex as a function of its input because it is a sum of convex ReLU functions.

Training Loss: To learn f* we aim to minimize the following hinge loss:

L(W, b) = (1/|S|) ∑_{(x,y)∈S} max{0, 1 - yN(x; W, b)}   (2)

We note that the above loss function is generally non-convex in the parameters (even though the network N is convex in x). For optimization we use Gradient Descent (GD) with a fixed learning rate η. We denote the initialization of GD by (W^(0), b^(0)) and the weights at iteration t by (W^(t), b^(t)). If the iteration index is clear from context we omit it and use (W, b). See Section A for the gradient update. In this work we will mainly focus on the population case.
This corresponds to optimizing the loss in Eq. (2) with S_x = X; this follows since for the uniform distribution we have E_{x∼D, y=f*(x)} [max{0, 1 - yN(x; W, b)}] = (1/|X|) ∑_{x∈X, y=f*(x)} max{0, 1 - yN(x; W, b)}. Many works have studied the population case as a proxy for understanding the performance in the empirical case (e.g., Brutzkus & Globerson, 2017; Daniely & Malach, 2020).

Remark 3.1. The convex network considered here is a good test-bed for understanding the inductive bias of learning read-once DNFs with one-hidden layer NNs because: (1) it has the same expressive power as standard NNs for implementing Boolean functions (Section 4); (2) it outperforms standard NNs on learning read-once DNFs in our setting (Section 8); (3) its analysis can be used to better understand the inductive bias of standard NNs (Section 8).
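To make the setup concrete, here is a minimal numpy sketch (our code, with function names of our choosing, not the paper's implementation) of the pieces defined above: the read-once DNF f*, the convex network of Eq. (1), and the population hinge loss of Eq. (2) with S_x = X. Terms are encoded as lists of coordinate indices, i.e., the supports A_n.

```python
import itertools
import numpy as np

def f_star(x, terms):
    # f*(x) = 1 iff some term is satisfied, i.e. all its coordinates equal +1
    return 1 if any(all(x[j] == 1 for j in t) for t in terms) else -1

def network(x, W, b):
    # Convex one-hidden-layer network of Eq. (1): a sum of ReLUs minus 1
    return np.sum(np.maximum(W @ x + b, 0.0)) - 1.0

def population_hinge_loss(W, b, terms, D):
    # Loss of Eq. (2) in the population case: average over all of X = {-1, +1}^D
    total = 0.0
    for bits in itertools.product([-1, 1], repeat=D):
        x = np.array(bits, dtype=float)
        total += max(0.0, 1.0 - f_star(x, terms) * network(x, W, b))
    return total / 2 ** D

# Example: f* = (x1 ^ x2) v (x3 ^ x4), encoded by the supports A_1, A_2
terms = [[0, 1], [2, 3]]
```

For instance, the all-zero network outputs N(x) = -1 everywhere, so it pays a hinge loss of 2 exactly on the positive points.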

4. EXPRESSIVE POWER

In this section we show that the network in Eq. (1) has the expressive power to implement any Boolean function over X. Therefore, in terms of expressive power, the network is suitable for learning Boolean functions and has the same expressive power for implementing Boolean functions as a standard one-hidden layer NN.

Theorem 4.1. For any Boolean function f : X → {±1} there exist r ∈ N and parameters (W, b) with r neurons such that for all x ∈ X, sign(N(x; W, b)) = sign(∑_{i∈[r]} σ(w_i · x + b_i) - 1) = f(x).

Proof. Let B+ = {x ∈ X | f(x) = 1} and denote B+ = {x_1, ..., x_{|B+|}}. Define r = |B+| and for each i ∈ [r] define w_i = x_i and b_i = -D + 2. Then for all x_i ∈ B+ it holds that σ(w_i · x_i + b_i) = 2 (since x_i · x_i = D), and for all x ≠ x_i it holds that σ(w_i · x + b_i) = 0 (since x · x_i ≤ D - 2). Therefore for all x ∈ B+ we have N(x; W, b) = 1 and for x ∉ B+ it holds that N(x; W, b) = -1, from which the claim follows.
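The construction in the proof is easy to check numerically. The sketch below (our code, following the proof of Theorem 4.1) builds the memorizing network for an arbitrary Boolean function f over {-1, 1}^D and verifies by exhaustive enumeration that its sign matches f on every input.

```python
import itertools
import numpy as np

def memorizing_network(f, D):
    # Theorem 4.1 construction: one neuron per positive point, w_i = x_i and
    # b_i = -D + 2. For x in {-1, +1}^D we have x_i . x_i = D and x . x_i <= D - 2
    # for x != x_i, so sigma(w_i . x + b_i) is 2 on x_i and 0 elsewhere.
    positives = [np.array(bits, dtype=float)
                 for bits in itertools.product([-1, 1], repeat=D)
                 if f(np.array(bits, dtype=float)) == 1]
    return np.array(positives), np.full(len(positives), -D + 2.0)

def realizes(f, D):
    # Check sign(N(x)) = f(x) for every x in X
    W, b = memorizing_network(f, D)
    for bits in itertools.product([-1, 1], repeat=D):
        x = np.array(bits, dtype=float)
        N = np.sum(np.maximum(W @ x + b, 0.0)) - 1.0
        if np.sign(N) != f(x):
            return False
    return True
```

Note that this works for any f, including functions like parity that are far from any read-once DNF; the point of the rest of the paper is that GD does not converge to this kind of memorizing solution.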

5. MULTIPLE GLOBAL MINIMA IN THE POPULATION SETTING

Assume that S_x = X. Then any global minimum of the loss in Eq. (2) implements the ground-truth function f*. However, as we will show next, there are global minima that implement f* with drastically different properties. Understanding which global minimum GD converges to is important, because the population case is an approximation of the empirical case with sufficiently many training points. Thus a good understanding of the population case can have direct implications on our understanding of the empirical setting. Indeed, we show in Section 8 that using our understanding of the population case leads to a procedure that accurately reconstructs the ground-truth DNF from a network trained on practical-size training sets.

One network that globally minimizes the loss is the one whose neurons simply "memorize" all positive points, as the following proposition states (the proof follows from the construction in Theorem 4.1).

Proposition 5.1. Assume that S_x = X. The network with r = |S_p| neurons defined by w_i = x_i for each x_i ∈ S_p and b_i = -D + 2 is a global minimum of the loss in Eq. (2).

We call the above minimum the memorizing solution. Intuitively, converging to this solution in the population case is undesirable, since this may imply memorization in the empirical setting, which can lead to wrong predictions on unobserved samples. Next, we show a different global minimum which recovers the DNF formula explicitly in its neurons. We first need the following definitions.

Definition 5.1. A neuron i ∈ [r] is a covering neuron with respect to a DNF f* if there exist n ∈ [K] and λ_i > 0 such that w_i = λ_i t*_n. We refer to λ_i as the covering coefficient of neuron i. We also say that neuron i covers the term n.

Definition 5.2. W covers f* if for all n ∈ [K] there exists i ∈ [r] such that neuron i covers term n.

We can now define a DNF-recovery solution:

Definition 5.3. W is a DNF-recovery solution if it covers f* and for every neuron i ∈ [r] that is not covering, it holds that w_i = 0.

For convenience, we will group all neurons that cover a specific term:

Definition 5.4. Assume that W covers the DNF f*.
For each n ∈ [K], we define the set C_n to be the set of all i ∈ [r] such that neuron i covers term n.

Next, we show that any DNF-recovery solution is a global minimum under certain conditions.

Proposition 5.2. Assume that S_x = X and W is a DNF-recovery solution. Then (W, b) is a global minimum of the loss in Eq. (2) if the following holds. (1) For every covering neuron i with covering coefficient λ_i which covers the term n, it holds that b_i = λ_i(2 - ‖t*_n‖_1). (2) For every neuron i that is not covering, b_i = 0. (3) For all n ∈ [K], ∑_{i∈C_n} λ_i ≥ 1.

We note that the third condition ensures that for all x ∈ S_p it holds that N(x; W, b) ≥ 1. We need this condition to ensure global optimality. The proof is provided in Section B. Intuitively, converging to this global minimum is desirable, because it learns good representations of the data: the terms of the ground-truth read-once DNF f*. Given this global minimum, the read-once DNF can be easily reconstructed from the network weights, which is useful for interpretability. Moreover, since a DNF-recovery solution is equivalent to a network with a small number of neurons, converging to it in the population case may suggest good sample complexity in the empirical case. Indeed, we show empirically in Section 8 that the convex network has good sample complexity for learning read-once DNFs in our setting. The remaining question is which global minimum GD converges to.
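As a sanity check on Proposition 5.2, the following sketch (ours) builds a DNF-recovery network with a single covering neuron per term and λ_i = 1, so that b_i = 2 - |A_n|, and evaluates the population hinge loss by brute force; under conditions (1)-(3) the loss is exactly zero.

```python
import itertools
import numpy as np

def dnf_recovery_network(terms, D, lam=1.0):
    # One covering neuron per term: w_i = lam * t*_n, b_i = lam * (2 - |A_n|).
    # With lam >= 1, condition (3) of Proposition 5.2 holds with |C_n| = 1.
    W = np.zeros((len(terms), D))
    b = np.zeros(len(terms))
    for i, t in enumerate(terms):
        W[i, t] = lam
        b[i] = lam * (2 - len(t))
    return W, b

def population_hinge_loss(W, b, terms, D):
    # Average of max{0, 1 - y N(x)} over all x in {-1, +1}^D
    total = 0.0
    for bits in itertools.product([-1, 1], repeat=D):
        x = np.array(bits, dtype=float)
        y = 1 if any(all(x[j] == 1 for j in t) for t in terms) else -1
        N = np.sum(np.maximum(W @ x + b, 0.0)) - 1.0
        total += max(0.0, 1.0 - y * N)
    return total / 2 ** D
```

On a point satisfying m ≥ 1 terms each covering neuron of a satisfied term outputs 2λ, so N(x) = 2λm - 1 ≥ 1, while on a negative point every neuron is inactive and N(x) = -1.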

6. CONVERGENCE ANALYSIS WITH A COMPUTER ASSISTED PROOF

In this section we characterize the global minimum that GD converges to in the population setting. Answering this question via theoretical analysis is extremely challenging due to the non-convexity and the complex dynamics of GD. We therefore opt for a computer assisted proof, which shows that for low dimensional DNFs, GD converges to a solution which is similar to the DNF-recovery solution in the following sense: after a simple pruning and reconstruction procedure, the DNF-recovery solution can be obtained from the global minimum of GD (see Theorem 6.1). We use a unique property of our setting (see Lemma 6.1) that allows us to perform calculations in integers and avoid floating point errors. In Section 7, we provide further theoretical results on the dynamics of GD that corroborate our findings.

6.1. SETUP

We next provide details on the setup of the computer assisted proof. See Section I for more details.

Network and Algorithm: We consider the network in Eq. (1) with r = 2^D and GD with initialization W^(0) whose rows are all possible vectors in {±ε}^D and b^(0) = 0. We execute the proof for D ≤ 15. We use small values of ε to converge to global minima that are similar to Figure 1c. For large initializations, GD has a different inductive bias (see Section 8). Details on the values of ε and the learning rate are provided in Section I.

DNFs: We define a balanced DNF to be a read-once DNF such that for all i ≠ j ∈ [K], |A_i| = |A_j|. For the proof, we consider all balanced read-once DNFs f* with input dimension 4 ≤ D ≤ 12 where each term is of size at least two. We denote the latter set of DNFs by F.

Pruning and Reconstruction: Figure 1c shows an example of the global minimum that GD converges to. Most of the neurons cover a term, but not all. Except for the non-covering neurons, the global minimum looks very similar to a DNF-recovery solution. Based on this observation, we devise a pruning and rounding procedure that, when applied to the global minimum of GD, provably returns a DNF-recovery solution for the cases we consider in the computer assisted proof. Section 8 shows that this procedure also works empirically in various other settings. We first define the pruning procedure.

Definition 6.1. For 0 ≤ γ ≤ 1 and a network N with parameters (W, b), define Q_γ(W) = {i ∈ [r] | ‖w_i‖_∞ > γM_∞(W)}, where M_∞(W) = max_{i∈[r]} ‖w_i‖_∞. The pruning procedure removes all neurons whose l∞ norm is at most γM_∞(W).

Next we define the reconstruction procedure.

Definition 6.2. For 0 ≤ β ≤ 1, a β-reconstruction of a weight matrix W is R_β(W) ∈ {0, 1}^{r×D} such that for all i ∈ [r] and j ∈ [D], R_β(W)_ij = 1 if w_ij > β‖w_i‖_∞, and R_β(W)_ij = 0 otherwise.

Figure 2 shows an example of the pruning and reconstruction procedures.
We use the values of 0.4 ≤ β ≤ 0.9 and 0.4 ≤ γ ≤ 0.9 in our proof. Further details are given in Section I.
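Definitions 6.1 and 6.2 translate directly into a few lines of numpy; the sketch below is ours, not the paper's code.

```python
import numpy as np

def prune(W, gamma):
    # Definition 6.1: keep the rows i with ||w_i||_inf > gamma * M_inf(W)
    norms = np.max(np.abs(W), axis=1)
    return W[norms > gamma * norms.max()]

def reconstruct(W, beta):
    # Definition 6.2: binarize each row, keeping entries with w_ij > beta * ||w_i||_inf
    norms = np.max(np.abs(W), axis=1, keepdims=True)
    return (W > beta * norms).astype(int)
```

For instance, a neuron [1.0, 0.9, 0.1, 0.05] with β = 0.5 is rounded to the term indicator [1, 1, 0, 0].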

6.2. MAIN RESULT

We can now state our main result: for the set of DNFs F, GD in the setup of Section 6.1, followed by pruning and reconstruction, converges to a DNF-recovery solution. See Figure 2 for an example of this result.

Theorem 6.1. For all f* ∈ F, parameters η, β, γ as described in Section I, and S_x = X, GD converges to a global minimum (W, b) such that R_β(Q_γ(W)) is a DNF-recovery solution.

We next provide a sketch of the proof. The main challenge is of course simulating the GD updates while avoiding floating point errors. We use a key observation that in our setting, we can perform equivalent calculations of the dynamics in integers. This follows since in our binary input setting, the network dynamics can be calculated with rational numbers. Then, by scaling several parameters, we can perform equivalent calculations in integers. The scaling procedure is defined below.

Definition 6.3. Given α > 0, initial weights W^(0) and learning rate η, the α-GD algorithm is defined as follows. Initialize U^(0) = αW^(0) and c^(0) = 0. Then run GD with constant learning rate η_α = ηα on the loss L_α(U, c) = (1/|S|) ∑_{(x,y)∈S} max{0, α - yN_α(x; U, c)}, where N_α(x; U, c) = ∑_{i∈[r]} σ(u_i · x + c_i) - α.

Since the inputs are integers, by taking the learning rate and ε to be rational numbers and α to be a sufficiently large integer, α-GD can perform all calculations in integers. Next, we show that α-GD performs the same calculations as GD up to a scaling factor. Therefore, we can run α-GD to simulate GD without floating point errors.

Lemma 6.1. Assume we run GD with initialization (W^(0), b^(0)) where b^(0) = 0 and constant learning rate η to optimize the loss in Eq. (2), and assume we run α-GD with parameters W^(0) and learning rate η. Then it holds that αW^(t) = U^(t), αb^(t) = c^(t) and αL(W^(t), b^(t)) = L_α(U^(t), c^(t)).

We prove this lemma by induction on t (see Section E).
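To illustrate Lemma 6.1, the sketch below (our toy instance, not the proof's actual parameters) runs plain GD and α-GD side by side on the population data of f* = x1 ∧ x2 with D = 2. The hyperparameters ε = 1/4, η = 1/4, α = 16 are illustrative choices that make αε, η_α = ηα and η_α/|S| integers, so the α-GD iterates stay integer-valued while remaining exactly α times the GD iterates.

```python
import itertools

D, ALPHA, EPS, ETA = 2, 16, 0.25, 0.25
X = [list(bits) for bits in itertools.product([-1, 1], repeat=D)]
data = [(x, 1 if all(v == 1 for v in x) else -1) for x in X]     # f* = x1 ^ x2
signs = list(itertools.product([-1, 1], repeat=D))

W = [[s * EPS for s in sg] for sg in signs]; b = [0.0] * len(W)          # GD state
U = [[s * ALPHA * EPS for s in sg] for sg in signs]; c = [0.0] * len(U)  # alpha-GD state

def net(Wm, bm, x, shift):
    # N(x) = sum_i relu(w_i . x + b_i) - shift, with shift = 1 or alpha
    return sum(max(sum(wj * xj for wj, xj in zip(w, x)) + bi, 0.0)
               for w, bi in zip(Wm, bm)) - shift

def gd_step(Wm, bm, margin, eta, shift):
    # One full-batch GD step on (1/|S|) sum max{0, margin - y N(x)}
    gw = [[0.0] * D for _ in Wm]; gb = [0.0] * len(Wm)
    for x, y in data:
        if margin - y * net(Wm, bm, x, shift) > 0:                # hinge is active
            for i, (w, bi) in enumerate(zip(Wm, bm)):
                if sum(wj * xj for wj, xj in zip(w, x)) + bi > 0:  # neuron is active
                    for j in range(D):
                        gw[i][j] += y * x[j]
                    gb[i] += y
    for i in range(len(Wm)):
        for j in range(D):
            Wm[i][j] += eta * gw[i][j] / len(data)
        bm[i] += eta * gb[i] / len(data)

for _ in range(20):
    gd_step(W, b, 1.0, ETA, 1.0)                             # GD on Eq. (2)
    gd_step(U, c, float(ALPHA), ETA * ALPHA, float(ALPHA))   # alpha-GD
```

After any number of steps the α-GD state equals α times the GD state, and every entry of U is an integer; this is what makes the exact, error-free computer-assisted simulation possible.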
For the computer assisted proof, we run α-GD for each f* ∈ F and data S_x = X labeled by f*, with suitable α and rational ε and η such that all calculations are performed in integers. We note that the pruning and reconstruction procedures can be performed with integers as well. Further details are given in Section I.

Figure 2: Example for the DNF (x1 ∧ x2 ∧ x3) ∨ (x4 ∧ x5 ∧ x6) ∨ (x7 ∧ x8 ∧ x9). (a) The global minimum that GD converges to. (b) The global minimum after pruning (Definition 6.1). (c) The global minimum after pruning followed by β-reconstruction (Definition 6.2).

7. THEORETICAL INSIGHTS

In this section we give two theoretical results shedding further light on the inductive bias of GD.

Preservation of Term Alignment: Assume the symmetric initialization of the computer assisted proof. In the next theorem we show that if at some iteration T of GD all weight entries of a neuron in a term are equal, then this continues to hold for all t > T. This provides further evidence that GD is biased towards term alignment.

Theorem 7.1. Assume GD is initialized with (W^(0), b^(0)) as described in Section 6 and S_x = X. Assume that there exist T ≥ 0, i ∈ [r] and n ∈ [K] such that for all j1, j2 ∈ A_n, w^(T)_ij1 = w^(T)_ij2. Then for all t > T and j1, j2 ∈ A_n it holds that w^(t)_ij1 = w^(t)_ij2.

The proof idea is to exploit the symmetry of the initialization and of the population setting to show that the GD update is constrained to preserve term alignment. The proof is given in Section F.

Norm Minimization implies DNF-Recovery: Recent works have highlighted interesting connections between gradient methods and norm minimization (Lyu & Li, 2020; Nacson et al., 2019). The optimization problem is to minimize the norm of the model weights subject to the constraint yN(x; W, b) ≥ 1 for all (x, y) in the training data. In Lyu & Li (2020) it was shown that under some conditions GD will converge to a KKT point of this problem. We next show a surprising result in our context: any global optimum of this min-norm problem is a DNF-recovery solution. This means that if GD converges to the optimal KKT point, it will find a DNF-recovery solution.

Theorem 7.2. Consider the margin-maximization optimization problem:

min ∑_{i∈[r]} ‖(w_i, b_i)‖_2^2  s.t.  yN(x; W, b) ≥ 1, ∀(x, y) ∈ S   (3)

Then, for any solution (W*, b*) of Eq. (3), W* is a DNF-recovery solution.

The proof is technical and given in Section G. The key idea is to show an upper bound on the bias of any global minimum and that the bound is tight for a minimum l2 norm solution.
Then, using this fact together with the optimality of (W*, b*), the theorem can be proved.
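Theorem 7.1 is easy to probe numerically. The sketch below (ours, with illustrative hyperparameters) runs full-batch GD from the symmetric initialization on the population data of (x1 ∧ x2) ∨ (x3 ∧ x4) and tracks the neurons whose first two coordinates start out equal; by the theorem they must remain equal at every subsequent iteration.

```python
import itertools

D, EPS, ETA, STEPS = 4, 0.25, 0.25, 30
terms = [[0, 1], [2, 3]]                     # f* = (x1 ^ x2) v (x3 ^ x4)
X = [list(bits) for bits in itertools.product([-1, 1], repeat=D)]
data = [(x, 1 if any(all(x[j] == 1 for j in t) for t in terms) else -1) for x in X]

# Symmetric initialization: one neuron per sign pattern, entries +-eps, biases 0
W = [[s * EPS for s in sg] for sg in itertools.product([-1, 1], repeat=D)]
b = [0.0] * len(W)
aligned = [i for i, w in enumerate(W) if w[0] == w[1]]   # equal on the first term

def net(x):
    return sum(max(sum(wj * xj for wj, xj in zip(w, x)) + bi, 0.0)
               for w, bi in zip(W, b)) - 1.0

for _ in range(STEPS):
    gw = [[0.0] * D for _ in W]; gb = [0.0] * len(W)
    for x, y in data:
        if 1.0 - y * net(x) > 0:                         # sample pays hinge loss
            for i, (w, bi) in enumerate(zip(W, b)):
                if sum(wj * xj for wj, xj in zip(w, x)) + bi > 0:
                    for j in range(D):
                        gw[i][j] += y * x[j]
                    gb[i] += y
    for i in range(len(W)):
        for j in range(D):
            W[i][j] += ETA * gw[i][j] / len(data)
        b[i] += ETA * gb[i] / len(data)
```

With ε and η powers of two, all iterates are exactly representable dyadic rationals, so the predicted equalities hold exactly in floating point as well.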

8. EMPIRICAL RESULTS

In this section we perform numerous experiments that support our analysis and show that our conclusions hold in different settings. For each experiment we show a sample of the empirical results due to space constraints. Further details and results are provided in the supplementary.

Comparing Convex and Standard Networks: Our analysis focused on a convex network. Here we compare it to a standard 2-layer network with a trainable output layer. Figure 3a reports the results, showing that the convex network outperforms the standard one for a DNF with D = 9. This shows that the convex network is a good model for studying inductive bias in our setting.

Reconstruction of DNFs in Other Settings: Theorem 6.1 holds under the assumptions of low dimensional DNFs, a convex network, the population case, symmetric initialization and balanced DNFs. Here we show that the reconstruction procedure in the theorem can be applied in broader settings. Specifically, we consider a setting with Gaussian initialization, finite data and unbalanced high dimensional DNFs for D = 27. Figure 3b shows the reconstruction success rate for both networks. For each training set size, the recovery is performed for different initializations and training sets. Note that for standard NNs we slightly modified the recovery procedure to accommodate these networks. In both cases, we see that for at least a moderate training set size, the recovery procedure has a high success rate (recall that this is perfect recovery). We note that for D > 27 the recovery for standard neural networks did not work well. However, for the convex network the reconstruction worked well even for D = 100 (see Section H). Figure 3c shows an example of a global minimum of GD for D = 100, where the DNF recovery was successful.

Large Initialization: Our results in Section 6 required a small initialization scale.
However, several works have shown that the scale of the initialization has a significant effect on the inductive bias (Woodworth et al., 2019; Chizat et al., 2019). What is the inductive bias of large initialization in our setting? Figure 3d shows the neurons of the global minimum that GD converges to for large initialization in the setting of Figure 3c. We see that the neurons are not aligned with terms, reflecting a very different inductive bias. Indeed, this model also overfits, with 67% test accuracy compared to 100% test accuracy for the small-scale initialization model in Figure 3c.

Experiments on Tabular Datasets: The fact that SGD recovers simple Boolean formulas is very attractive in the context of interpretability. We showed that we can reconstruct DNFs under certain idealized assumptions (e.g., uniform distribution, read-once formulas). However, our reconstruction method might produce meaningless reconstructions on datasets which are neither uniform nor labeled by a read-once DNF. We tested our reconstruction method on three tabular UCI datasets: kr-vs-kp, diabetes and Splice (Dua & Graff, 2017). Learning with our convex network resulted in test accuracies of 93%, 96% and 96% on these datasets, respectively. Our reconstruction method obtained a small DNF (3 terms of size less than 3) on kr-vs-kp with test accuracy 83%. For diabetes, the reconstruction method returned a large DNF (more than 10 terms) with test accuracy 93%. On Splice we got a 2-term DNF with term sizes 2 and 3 and 95% test accuracy. The latter is a very compact DNF with very small loss in accuracy, illustrating the potential of DNF recovery for interpretability.

9. CONCLUSIONS

Understanding the inductive bias of neural networks for learning DNFs is an important challenge. In this work we mainly focused on learning read-once DNFs under the uniform distribution with a convex network. We provided theoretical results, computer assisted proofs and experiments, all of which suggest that GD is biased towards unique global minima that recover the terms of the DNF. Using our analysis we derived a DNF reconstruction method and showed that it works in broader settings that include standard two-layer networks and tabular datasets. Our work opens up many interesting directions for future work. For example, it would be interesting to understand if DNF recovery is possible for other distributions and DNFs that are not read-once. Another interesting direction is to understand the sample complexity of neural networks for learning DNFs and how it relates to DNF-recovery. Finally, it will be interesting to understand how learning dynamics in neural nets are related to other algorithms for learning DNFs.

A GRADIENT UPDATE

The following definition will be used in order to simplify the gradient updates of GD:

Definition A.1. Given (W^(t), b^(t)), for each neuron i ∈ [r] define:

G^(t)_i = {(x, y) ∈ S | 1 - yN(x; W^(t), b^(t)) > 0 ∧ w^(t)_i · x + b^(t)_i > 0}

The set G^(t)_i consists of all samples that are included in the gradient update for neuron i at time t. Using this definition, the update rule of GD for neuron i at time t is given by:

w^(t)_i = w^(t-1)_i + (η/|S|) ∑_{(x,y)∈G^(t-1)_i} yx ;  b^(t)_i = b^(t-1)_i + (η/|S|) ∑_{(x,y)∈G^(t-1)_i} y

B PROOF OF PROPOSITION 5.2

To prove the result we show necessary properties of global minima when S_x = X that will also be useful in the proof of Theorem 7.2. The proof of Proposition 5.2 follows directly from Lemma B.2 below.

B.1 A SIMPLE PROPERTY OF GLOBAL MINIMA

Recall the definition S_p = {x | (x, 1) ∈ S}. We also define S_n = {x | (x, -1) ∈ S}. We first need the following definitions.

Definition B.1. We say that (W, b) satisfies the MIN+ property if for all x ∈ S_p there exists I ⊆ [r] such that ∑_{i∈I} (w_i · x + b_i) ≥ 2.

Definition B.2. We say that (W, b) satisfies the MIN- property if for all x ∈ S_n and all i ∈ [r], w_i · x + b_i ≤ 0.

The following property of a global minimum follows directly from the definition of the network in Eq. (1) and the loss function in Eq. (2).

Lemma B.1. If (W, b) is a global minimum of the loss in Eq. (2), then it satisfies the MIN+ and MIN- properties.

The proof is given in Section C. We note that one direct consequence of Lemma B.1 is that if a negative x ∈ S_n is activated by a neuron i ∈ [r], i.e., w_i · x + b_i > 0, then (x, y) ∈ G_i. We will use this fact in later sections.

B.2 THE BIAS THRESHOLD

In this section we show that when S_x = X, the bias of any neuron in a global minimum is upper bounded by a certain value which we call the bias threshold. This upper bound will be useful in the proof of Theorem 7.2.

Definition B.3. For each term n ∈ [K] of f* and i ∈ [r] define V_n(w_i) = max{min_{j∈A_n} w_ij, 0}.

Definition B.4. The bias threshold for a weight vector w is BT(w) = -‖w‖_1 + 2 ∑_{n∈[K]} V_n(w).

Lemma B.2. Assume that S_x = X. If (W, b) satisfies that for all i ∈ [r], b_i ≤ BT(w_i), and satisfies the MIN+ property, then it is a global minimum of the loss in Eq. (2).

The proof is given in Section D. The idea is to find for each i ∈ [r] a negative point x̂_i such that w_i · x̂_i = -BT(w_i) and such that for any other negative point x, w_i · x ≤ w_i · x̂_i. From this, we show that the MIN- property is satisfied if and only if for all i ∈ [r], b_i ≤ BT(w_i). Then, using Lemma B.1 we complete the proof.

C PROOF OF LEMMA B.1

Proof. If (W, b) is a global minimum, then for all (x, y) ∈ S, yN(x; W, b) ≥ 1. Therefore, if y = 1, then ∑_{i∈[r]} σ(w_i · x + b_i) ≥ 2, and thus there exists I ⊆ [r] such that ∑_{i∈I} (w_i · x + b_i) ≥ 2. If y = -1, then ∑_{i∈[r]} σ(w_i · x + b_i) ≤ 0, which implies that w_i · x + b_i ≤ 0 for all i ∈ [r].

D PROOF OF LEMMA B.2

Definition D.1. For a weight vector w, define A_{K+1} = [D] ∖ ⋃_{n∈[K]} A_n, and for each n ∈ [K] let J_n ∈ argmin_{j∈A_n} w_j.

Given a weight vector w, define the sample x̂ by

x̂_j = -sign(w_j) if ∃n ∈ [K] : V_n(w) > 0 ∧ j = J_n, and x̂_j = sign(w_j) otherwise.

For a term n ∈ [K], if V_n(w) > 0 then x̂_{J_n} = -sign(w_{J_n}) = -1; otherwise V_n(w) = 0 and there exists j ∈ A_n such that w_j < 0, i.e., x̂_j = sign(w_j) = -1. In any case, x̂ · t*_n < |A_n|. Therefore, the label of this sample is negative, and we denote it by ŷ = -1.
Now, we have the following:

w · x̂ = ∑_{j∈[D]} w_j x̂_j = ∑_{n∈[K+1]} [ ∑_{j∈A_n} w_j x̂_j ]
= ∑_{n∈[K], V_n(w)>0} [ ∑_{j∈A_n∖{J_n}} w_j sign(w_j) - w_{J_n} sign(w_{J_n}) ] + ∑_{n∈[K], V_n(w)=0} [ ∑_{j∈A_n} w_j sign(w_j) ] + ∑_{j∈A_{K+1}} w_j x̂_j   (7)
= ∑_{n∈[K], V_n(w)>0} [ ∑_{j∈A_n∖{J_n}} |w_j| - |w_{J_n}| ] + ∑_{n∈[K], V_n(w)=0} [ ∑_{j∈A_n} |w_j| ] + ∑_{j∈A_{K+1}} |w_j|
= ∑_{n∈[K], V_n(w)>0} [ ∑_{j∈A_n} |w_j| - 2V_n(w) ] + ∑_{n∈[K], V_n(w)=0} [ ∑_{j∈A_n} |w_j| - 2V_n(w) ] + ∑_{j∈A_{K+1}} |w_j|
= ∑_{n∈[K]} [ ∑_{j∈A_n} |w_j| - 2V_n(w) ] + ∑_{j∈A_{K+1}} |w_j|
= ∑_{j∈[D]} |w_j| - 2 ∑_{n∈[K]} V_n(w) = ‖w‖_1 - 2 ∑_{n∈[K]} V_n(w) = -BT(w)

For the first direction, if b > BT(w) then w · x̂ + b = -BT(w) + b > 0, as desired. For the second direction, assume that there is a negative sample x ∈ S_n such that w · x + b > 0. We will show that x · w ≤ x̂ · w. Every term n ∈ [K] satisfies:

∀j ∈ A_n∖{J_n} :  x_j w_j ≤ |w_j| = sign(w_j) w_j = x̂_j w_j   (8)

If V_n(w) = 0, then the index J_n also satisfies Eq. (8), by the definition of x̂. Otherwise V_n(w) > 0 and we know that there exists j′ ∈ A_n such that x_{j′} = -1 (otherwise the sample's label is not negative), and w_{J_n}, w_{j′} > 0. If J_n = j′, then x_{J_n} w_{J_n} = x̂_{J_n} w_{J_n}. Otherwise, the following holds:

x_{j′} w_{j′} + x_{J_n} w_{J_n} ≤ -w_{j′} + w_{J_n} ≤ 0 ≤ w_{j′} - w_{J_n} ≤ x̂_{j′} w_{j′} + x̂_{J_n} w_{J_n}   (9)

Note that every index in A_{K+1} satisfies Eq. (8) as well. Therefore, x · w ≤ x̂ · w. We can conclude that:

0 < x · w + b ≤ x̂ · w + b = -BT(w) + b   (10)

which implies that b > BT(w), as desired.

E PROOF OF LEMMA 6.1

Proof. We wish to prove that the α-GD optimization procedure is equivalent to the original GD, up to scaling. We begin with some definitions.

Definition E.1.
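The bias-threshold identity at the heart of this argument, namely that x̂ maximizes w · x over negatively labeled samples and attains w · x̂ = -BT(w), can be checked by brute force. A small sketch (ours):

```python
import itertools

def bias_threshold(w, terms):
    # Definition B.4: BT(w) = -||w||_1 + 2 sum_n V_n(w),
    # with V_n(w) = max{min_{j in A_n} w_j, 0} (Definition B.3)
    v = sum(max(min(w[j] for j in t), 0.0) for t in terms)
    return -sum(abs(wj) for wj in w) + 2.0 * v

def max_negative_score(w, terms, D):
    # Brute-force max of w . x over negatively labeled x in {-1, +1}^D
    best = float("-inf")
    for x in itertools.product([-1, 1], repeat=D):
        if not any(all(x[j] == 1 for j in t) for t in terms):
            best = max(best, sum(wj * xj for wj, xj in zip(w, x)))
    return best
```

Consequently, the MIN- condition w · x + b ≤ 0 for all negative x holds if and only if b ≤ BT(w), exactly as Lemma B.2 uses.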
Given (U^(t), c^(t)), for a neuron i ∈ [r] define:

H_i^(t) = {(x, y) ∈ S : α - yN_α(x; U^(t), c^(t)) > 0 ∧ u_i^(t) ⋅ x + c_i^(t) > 0}    (11)

The update rule of α-GD for (U^(t), c^(t)) can then be written as:

u_i^(t) = u_i^(t-1) + (αη/|S|) ∑_{(x,y)∈H_i^(t-1)} yx ;  c_i^(t) = c_i^(t-1) + (αη/|S|) ∑_{(x,y)∈H_i^(t-1)} y    (12)

We prove the claim by induction on t. For t = 0, by the definition of the initialization we have:

U^(0) = [±α]^{r×D} = αW^(0) ;  c^(0) = [0]^r = αb^(0)    (13)

The network satisfies:

N_α(x; U^(0), c^(0)) = ∑_{i∈[r]} σ(u_i^(0) ⋅ x + c_i^(0)) - α = ∑_{i∈[r]} σ(αw_i^(0) ⋅ x + αb_i^(0)) - α = ∑_{i∈[r]} ασ(w_i^(0) ⋅ x + b_i^(0)) - α = α( ∑_{i∈[r]} σ(w_i^(0) ⋅ x + b_i^(0)) - 1 ) = αN(x; W^(0), b^(0))    (14)

and we can conclude that:

L_α(U^(0), c^(0)) = (1/|S|) ∑_{(x,y)∈S} max{0, α - yN_α(x; U^(0), c^(0))} = (1/|S|) ∑_{(x,y)∈S} max{0, α - yαN(x; W^(0), b^(0))} = α ⋅ (1/|S|) ∑_{(x,y)∈S} max{0, 1 - yN(x; W^(0), b^(0))} = αL(W^(0), b^(0))    (15)

For the induction step, we assume correctness of the claim for t - 1 and prove it for t. First, by the induction assumption, for x ∈ S_x and i ∈ [r] we have that u_i^(t-1) ⋅ x + c_i^(t-1) > 0 if and only if w_i^(t-1) ⋅ x + b_i^(t-1) > 0. In addition, α - yN_α(x; U^(t-1), c^(t-1)) > 0 if and only if 1 - yN(x; W^(t-1), b^(t-1)) > 0. By Definition A.1, Definition E.1 and the induction hypothesis, for every i ∈ [r] it holds that G_i^(t-1) = H_i^(t-1). Then, by the induction hypothesis, the following holds:

u_i^(t) = u_i^(t-1) + (αη/|S|) ∑_{(x,y)∈H_i^(t-1)} yx = αw_i^(t-1) + (αη/|S|) ∑_{(x,y)∈G_i^(t-1)} yx = αw_i^(t)
c_i^(t) = c_i^(t-1) + (αη/|S|) ∑_{(x,y)∈H_i^(t-1)} y = αb_i^(t-1) + (αη/|S|) ∑_{(x,y)∈G_i^(t-1)} y = αb_i^(t)    (16)

as required for the first condition. To prove that N_α(x; U^(t), c^(t)) = αN(x; W^(t), b^(t)) and L_α(U^(t), c^(t)) = αL(W^(t), b^(t)), we can repeat the derivations of Eq. (14) and Eq. (15) with iteration t instead of 0.
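The scaling argument can be observed directly in simulation. Below is a small self-contained sketch, not the paper's code (the data, sizes and the 0.25 initialization scale are arbitrary choices for illustration), that runs standard GD and α-GD side by side on the hinge loss of the convex network and checks that U^(t) = αW^(t) and c^(t) = αb^(t) at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
D, r, m = 6, 4, 8
X = rng.choice([-1.0, 1.0], size=(m, D))
y = np.where(X[:, 0] + X[:, 1] > 1, 1.0, -1.0)  # toy labels for the term x_1 AND x_2

def gd_step(W, b, out_bias, margin, eta):
    """One (sub)gradient step on (1/|S|) sum max{0, margin - y N(x)} where
    N(x) = sum_i sigma(w_i . x + b_i) - out_bias."""
    pre = X @ W.T + b                        # (m, r) pre-activations
    N = np.maximum(pre, 0).sum(axis=1) - out_bias
    active = margin - y * N > 0              # samples with positive hinge loss
    gate = (pre > 0) & active[:, None]       # membership in G_i (resp. H_i)
    W = W + (eta / m) * (gate * y[:, None]).T @ X
    b = b + (eta / m) * (gate * y[:, None]).sum(axis=0)
    return W, b

alpha, eta = 2.0, 0.5                        # powers of two keep the scaling exact
W = 0.25 * rng.choice([-1.0, 1.0], size=(r, D)); b = np.zeros(r)
U, c = alpha * W, alpha * b
for _ in range(25):
    W, b = gd_step(W, b, 1.0, 1.0, eta)               # standard GD
    U, c = gd_step(U, c, alpha, alpha, alpha * eta)   # alpha-GD
    assert np.allclose(U, alpha * W) and np.allclose(c, alpha * b)
```

With α a power of two the two trajectories even match bit-for-bit in floating point; for a general α they match up to rounding error.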

F PROOFS OF THEOREM 7.1 (PRESERVATION OF TERM ALIGNMENT)

We use Definition D.1, where A_{K+1} = [D] ∖ ∪_{n∈[K]} A_n. We first prove a symmetry lemma in Section F.1 and then prove the theorem in Section F.2.

F.1 SYMMETRY LEMMA

Definition F.1. We define a reordering of A_1, ..., A_{K+1} as a set R = {π_1, ..., π_{K+1}}, where each π_n is a permutation of the elements of A_n, for all 1 ≤ n ≤ K+1.

Definition F.2. Given a reordering R, we define for every sample (x, y) ∈ S a pair (x^π, y^π) such that:

∀n ∈ [K+1] ∀j ∈ A_n:  x^π_{π_n[j]} = x_j    (17)

and y^π = y. Since the label of a sample is invariant to a reordering, it holds that (x^π, y^π) ∈ S.

Lemma F.1. Given (W^(0), b^(0)), a reordering R and a step t ≥ 0, every x ∈ S_x satisfies:

N(x; W^(t), b^(t)) = N(x^π; W^(t), b^(t))    (18)

Proof. Given (W^(0), b^(0)) and a reordering R, we define the function P : [r] → [r] as follows:

P(i_1) = i_2 ⟺ ∀n ∈ [K+1] ∀j ∈ A_n: w^(0)_{i_1 j} = w^(0)_{i_2 π_n[j]} and b^(0)_{i_1} = b^(0)_{i_2}    (19)

By our assumption on the initialization, for every i_1 ∈ [r] there exists a unique i_2 ∈ [r] such that P(i_1) = i_2. Therefore, P is a well-defined bijection. We will prove the following claim by induction on t ≥ 0:

1. ∀i ∈ [r] ∀n ∈ [K+1] ∀j ∈ A_n: w^(t)_{ij} = w^(t)_{P(i)π_n[j]} and b^(t)_i = b^(t)_{P(i)}
2. ∀x ∈ S_x: N(x; W^(t), b^(t)) = N(x^π; W^(t), b^(t))

For t = 0, the first property is correct by the definition of P. By our initialization assumption, every i ∈ [r] satisfies w^(0)_i ⋅ x = w^(0)_{P(i)} ⋅ x^π and b^(0)_i = b^(0)_{P(i)} = 0. The second property can then be proven by:

N(x; W^(0), b^(0)) = ∑_{i∈[r]} σ(w^(0)_i ⋅ x + b^(0)_i) - 1 = ∑_{i∈[r]} σ(w^(0)_{P(i)} ⋅ x^π + b^(0)_{P(i)}) - 1 = N(x^π; W^(0), b^(0))    (20)

Assuming the correctness of the claim for t - 1, we prove the claim for t. According to Eq.
(5), to prove the first property it is enough to show, for every i ∈ [r], that the following holds:

∀n ∈ [K+1] ∀j ∈ A_n:
[ w_i^(t-1) + (η/|S|) ∑_{(x,y)∈G_i^(t-1)} yx ]_j = [ w_{P(i)}^(t-1) + (η/|S|) ∑_{(x,y)∈G_{P(i)}^(t-1)} yx ]_{π_n[j]}
b_i^(t-1) + (η/|S|) ∑_{(x,y)∈G_i^(t-1)} y = b_{P(i)}^(t-1) + (η/|S|) ∑_{(x,y)∈G_{P(i)}^(t-1)} y    (21)

By the induction assumption, w^(t-1)_{ij} = w^(t-1)_{P(i)π_n[j]} and b^(t-1)_i = b^(t-1)_{P(i)}. Therefore, every x ∈ S_x satisfies:

w_i^(t-1) ⋅ x + b_i^(t-1) = ∑_{j∈[D]} w^(t-1)_{ij} x_j + b_i^(t-1) = ∑_{n∈[K+1]} ∑_{j∈A_n} w^(t-1)_{ij} x_j + b_i^(t-1) = ∑_{n∈[K+1]} ∑_{j∈A_n} w^(t-1)_{P(i)π_n[j]} x^π_{π_n[j]} + b^(t-1)_{P(i)} = ∑_{j∈[D]} w^(t-1)_{P(i)j} x^π_j + b^(t-1)_{P(i)} = w^(t-1)_{P(i)} ⋅ x^π + b^(t-1)_{P(i)}    (22)

Recall that y = y^π. By the induction assumption and Definition A.1 the following holds:

(x, y) ∈ G_i^(t-1) ⟺ 1 - yN(x; W^(t-1), b^(t-1)) > 0 ∧ w_i^(t-1) ⋅ x + b_i^(t-1) > 0 ⟺ 1 - y^π N(x^π; W^(t-1), b^(t-1)) > 0 ∧ w^(t-1)_{P(i)} ⋅ x^π + b^(t-1)_{P(i)} > 0 ⟺ (x^π, y^π) ∈ G_{P(i)}^(t-1)    (23)

Therefore:

[ ∑_{(x,y)∈G_i^(t-1)} yx ]_j = [ ∑_{(x,y)∈G_{P(i)}^(t-1)} yx ]_{π_n[j]}  and  ∑_{(x,y)∈G_i^(t-1)} y = ∑_{(x,y)∈G_{P(i)}^(t-1)} y    (24)

We conclude that Eq. (21) is correct, and as a result the first claim of the induction is correct. Consequently, every pair x, x^π ∈ S_x satisfies w_i^(t) ⋅ x = w_{P(i)}^(t) ⋅ x^π, by the first claim and Definition F.2. Similarly, it holds that b_i^(t) = b_{P(i)}^(t). Therefore:

N(x; W^(t), b^(t)) = ∑_{i∈[r]} σ(w_i^(t) ⋅ x + b_i^(t)) - 1 = ∑_{i∈[r]} σ(w_{P(i)}^(t) ⋅ x^π + b_{P(i)}^(t)) - 1 = N(x^π; W^(t), b^(t))    (25)

which completes the proof.

F.2 PROOF OF THEOREM 7.1

We will prove the claim by induction on t. For t = T the claim is correct by the claim's assumption. Assuming the correctness of the claim for t - 1, we will prove it for t. By Eq.
(5):

w_i^(t) = w_i^(t-1) + (η/|S|) ∑_{(x,y)∈G_i^(t-1)} yx    (26)

Given j_1, j_2 ∈ A_n, by the induction assumption w^(t-1)_{ij_1} = w^(t-1)_{ij_2}. Then, w^(t)_{ij_1} = w^(t)_{ij_2} if and only if:

[ ∑_{(x,y)∈G_i^(t-1)} yx ]_{j_1} = [ ∑_{(x,y)∈G_i^(t-1)} yx ]_{j_2}    (27)

We define the following reordering R:
1. ∀n′ ∈ [K+1] ∖ {n} ∀j ∈ A_{n′}: π_{n′}[j] = j
2. ∀j ∈ A_n ∖ {j_1, j_2}: π_n[j] = j, and π_n[j_1] = j_2, π_n[j_2] = j_1

In other words, π_n is the permutation that swaps j_1 and j_2. Since w^(t-1)_{ij_1} = w^(t-1)_{ij_2}, we can determine that w_i^(t-1) ⋅ x = w_i^(t-1) ⋅ x^π. Recall that N(x; W^(t-1), b^(t-1)) = N(x^π; W^(t-1), b^(t-1)) by Lemma F.1. Thus, (x, y) ∈ G_i^(t-1) ⟺ (x^π, y^π) ∈ G_i^(t-1) by Definition A.1. Then, using the fact that every (x, y) ∈ G_i^(t-1) satisfies yx_{j_1} = y^π x^π_{π_n[j_1]} = yx^π_{j_2}, we can conclude that:

[ ∑_{(x,y)∈G_i^(t-1)} yx ]_{j_1} = [ ∑_{(x,y)∈G_i^(t-1)} yx ]_{j_2}

as required.

G PROOF OF THEOREM 7.2 (MINIMUM l_2-NORM IS A DNF-RECOVERY SOLUTION)

We set out to characterize the structure of the min-norm solution. In the proof we use Definition D.1, where A_{K+1} = [D] ∖ ∪_{n∈[K]} A_n. We define U to be the set of global minima of the loss in Eq. (2). We first prove several lemmas.

G.1 AUXILIARY LEMMAS

Definition G.1. For a term n ∈ [K], we define the special sample x^(n) ∈ S_x of this term by: ∀j ∈ A_n: x^(n)_j = 1 and ∀j ∈ [D] ∖ A_n: x^(n)_j = -1. We denote the set of all special samples by O = {x ∈ S_p : ∃n ∈ [K], x = x^(n)}.

Lemma G.1. Given (W, b), assume the following conditions are satisfied:
1. ∀i ∈ [r], ∀j ∈ [D]: w_ij ≥ 0.
2. Every x ∈ O satisfies the MIN+ property.
Then (W, b) satisfies the MIN+ property for all positive samples.

Proof. Let x ∈ S_p. Then ∃n ∈ [K] such that ∀j ∈ A_n: x_j = 1. According to the second assumption, ∃I ⊆ [r] such that ∑_{i∈I} w_i ⋅ x^(n) + b_i ≥ 2. For every i ∈ [r] the following holds:

w_i ⋅ x = ∑_{j∈[D]} w_ij x_j = ∑_{j∈A_n} w_ij + ∑_{j∈[D]∖A_n} x_j w_ij

From the first condition of the claim we can deduce that:

∑_{j∈A_n} w_ij + ∑_{j∈[D]∖A_n} x_j w_ij ≥ ∑_{j∈A_n} w_ij - ∑_{j∈[D]∖A_n} w_ij = ∑_{j∈[D]} w_ij x^(n)_j = w_i ⋅ x^(n)

Then:

∑_{i∈I} σ(w_i ⋅ x + b_i) ≥ ∑_{i∈I} σ(w_i ⋅ x^(n) + b_i) ≥ 2

and (W, b) satisfies the MIN+ property for x, as required.

The following definition will be very useful in our analysis.

Definition G.2. Given a min-norm solution (W*, b*), we say that the solution (Ŵ, b̂) is an i-modified solution if the following holds:

∀i′ ∈ [r] ∖ {i}: ŵ_{i′} = w*_{i′} and b̂_{i′} = b*_{i′}    (32)

Thus, given a min-norm solution, to define an i-modified solution we only need to define the neuron (ŵ_i, b̂_i).

Lemma G.2. Given a min-norm solution (W*, b*), every i ∈ [r] satisfies:
1. b*_i = BT(w*_i).
2. ∀j ∈ [D]: w*_ij ≥ 0.
3. ∃n ∈ [K] such that ∀j ∈ A_n: w*_ij ≥ 0 and ∀j ∈ [D] ∖ A_n: w*_ij = 0.

Proof. Given (W*, b*) and i ∈ [r], we will prove the properties one by one. For the first property, consider the i-modified solution given by ŵ_i = w*_i and b̂_i = BT(ŵ_i). By the assumption, b̂_i > b*_i. Then every x ∈ S_x satisfies the following:

x ⋅ w*_i + b*_i < x ⋅ ŵ_i + b̂_i

Because (W*, b*) satisfies the MIN+ property, (Ŵ, b̂) satisfies it as well. Recall that (W*, b*) also satisfies the MIN- property; thus, by Definition G.2, every x ∈ S_n satisfies:

∀i′ ∈ [r] ∖ {i}: 0 > x ⋅ w*_{i′} + b*_{i′} = x ⋅ ŵ_{i′} + b̂_{i′}

Since b*_i < b̂_i ≤ 0, it follows that |b̂_i| < |b*_i|.
We know that ŵ_i = w*_i, so ‖(ŵ_i, b̂_i)‖₂² < ‖(w*_i, b*_i)‖₂², which is in contradiction to the optimality of (W*, b*).

∀j ∈ [D] ∖ {j′}: ŵ_ij = w*_ij, ŵ_ij′ = 0, and b̂_i = BT(ŵ_i)    (36)

We want to show that:

∑_{n∈[K]} V_n(w*_i) = ∑_{n∈[K]} V_n(ŵ_i)    (37)

If ∃n′ ∈ [K] such that j′ ∈ A_{n′}, then it follows that V_{n′}(w*_i) = V_{n′}(ŵ_i) = 0 and Eq. (37) is satisfied. Otherwise, j′ ∈ A_{K+1}; by Definition D.1, the indices of A_{K+1} do not affect the value of the sums in Eq. (37), so the equation is satisfied in this case as well. We know that b*_i = BT(w*_i) from the first property, thus every x ∈ S_x satisfies the following:

x ⋅ w*_i + b*_i = ∑_{j∈[D]} x_j w*_ij + BT(w*_i) = ∑_{j∈[D]∖{j′}} x_j w*_ij + x_j′ w*_ij′ - |w*_ij′| - ∑_{j∈[D]∖{j′}} |w*_ij| + 2 ∑_{n∈[K]} V_n(w*_i)    (38)

We can see that x_j′ w*_ij′ - |w*_ij′| ≤ 0 for every x_j′. Then:

∑_{j∈[D]∖{j′}} x_j w*_ij + x_j′ w*_ij′ - |w*_ij′| - ∑_{j∈[D]∖{j′}} |w*_ij| + 2 ∑_{n∈[K]} V_n(w*_i) ≤ ∑_{j∈[D]∖{j′}} x_j w*_ij - ∑_{j∈[D]∖{j′}} |w*_ij| + 2 ∑_{n∈[K]} V_n(w*_i) = ∑_{j∈[D]} x_j ŵ_ij - ‖ŵ_i‖₁ + 2 ∑_{n∈[K]} V_n(ŵ_i) = x ⋅ ŵ_i + BT(ŵ_i) = x ⋅ ŵ_i + b̂_i    (39)

Using the fact that x ⋅ w*_i + b*_i ≤ x ⋅ ŵ_i + b̂_i together with the fact that (W*, b*) satisfies the MIN+ property, we can conclude that (Ŵ, b̂) satisfies this property too. According to the first property above and Definition G.2 we have:

∀i′ ∈ [r] ∖ {i}: BT(ŵ_{i′}) = BT(w*_{i′}) = b*_{i′} = b̂_{i′}

In addition, we know that b̂_i = BT(ŵ_i) by Eq. (36). According to Lemma D.1, the solution satisfies the MIN- property, and from Lemma B.1, (Ŵ, b̂) ∈ U. From Eq. (36), we know that w*_ij′ < 0 = ŵ_ij′, hence ‖w*_i‖₁ > ‖ŵ_i‖₁ and |w*_ij′| > |ŵ_ij′|. Combining this with Eq. (37) we can conclude that |BT(w*_i)| > |BT(ŵ_i)|, i.e., |b*_i| > |b̂_i|. Therefore, ‖(ŵ_i, b̂_i)‖₂² < ‖(w*_i, b*_i)‖₂², in contradiction to the optimality of (W*, b*).
Property 3 — Assume by contradiction that there exists i ∈ [r] such that:

∃n_1, n_2 ∈ [K+1], n_1 ≠ n_2, such that ∃j ∈ A_{n_1} with w*_ij > 0 and ∃j ∈ A_{n_2} with w*_ij > 0    (41)

Without loss of generality we assume:

∑_{j∈A_{n_1}} w*_ij ≥ ∑_{j∈A_{n_2}} w*_ij    (42)

Consider the following i-modified solution (Ŵ, b̂), defined by:

∀j ∈ A_{n_2}: ŵ_ij = 0, ∀j ∈ [D] ∖ A_{n_2}: ŵ_ij = w*_ij, and b̂_i = BT(ŵ_i)    (43)

First, we will show that b̂_i ≥ b*_i. If n_2 ≠ K+1, from Definition B.4 and the assumption that |A_{n_2}| > 1, the following holds:

b̂_i = BT(ŵ_i) = BT(w*_i) + ∑_{j∈A_{n_2}} w*_ij - 2V_{n_2}(w*_i) = b*_i + ∑_{j∈A_{n_2}} w*_ij - 2V_{n_2}(w*_i) ≥ b*_i    (44)

Otherwise, n_2 = K+1 and from Definition B.4 the following holds:

b̂_i = BT(ŵ_i) = BT(w*_i) + ∑_{j∈A_{n_2}} w*_ij = b*_i + ∑_{j∈A_{n_2}} w*_ij ≥ b*_i    (45)

In both cases b̂_i ≥ b*_i, as required. Given ñ ∈ [K], we know that (W*, b*) ∈ U, and thus it satisfies the MIN+ property for x^(ñ). Then, ∃I ⊆ [r] such that:

∑_{i′∈I} w*_{i′} ⋅ x^(ñ) + b*_{i′} ≥ 2    (46)

We will show that (Ŵ, b̂) satisfies this property for ñ as well. If ñ ≠ n_2, due to the second property of this lemma and the fact that b̂_i ≥ b*_i, the following holds:

w*_i ⋅ x^(ñ) + b*_i = ∑_{j∈[D]∖A_{n_2}} w*_ij x^(ñ)_j - ∑_{j∈A_{n_2}} w*_ij + b*_i < ∑_{j∈[D]∖A_{n_2}} w*_ij x^(ñ)_j + b*_i ≤ ∑_{j∈[D]} ŵ_ij x^(ñ)_j + b̂_i = ŵ_i ⋅ x^(ñ) + b̂_i    (47)

Then we can conclude that (Ŵ, b̂) satisfies the MIN+ property for x^(ñ) from the following inequalities:

∑_{i′∈I} ŵ_{i′} ⋅ x^(ñ) + b̂_{i′} ≥ ∑_{i′∈I} w*_{i′} ⋅ x^(ñ) + b*_{i′} ≥ 2    (48)

Otherwise, ñ = n_2. By the fact that b*_i = BT(w*_i) ≤ 0, the second property of this lemma and Eq. (42), the following holds:

w*_i ⋅ x^(ñ) + b*_i = ∑_{j∈A_{n_2}} w*_ij - ∑_{j∈[D]∖A_{n_2}} w*_ij + b*_i ≤ ∑_{j∈A_{n_2}} w*_ij - ∑_{j∈A_{n_1}} w*_ij + b*_i ≤ ∑_{j∈A_{n_2}} w*_ij - ∑_{j∈A_{n_1}} w*_ij ≤ 0    (49)

Therefore, using Definition G.2, (Ŵ, b̂) satisfies the following:

∑_{i′∈I∖{i}} ŵ_{i′} ⋅ x^(ñ) + b̂_{i′} = ∑_{i′∈I∖{i}} w*_{i′} ⋅ x^(ñ) + b*_{i′} ≥ ∑_{i′∈I} w*_{i′} ⋅ x^(ñ) + b*_{i′} ≥ 2    (50)

We can conclude that (Ŵ, b̂) satisfies the MIN+ property for x^(ñ).
Combining this with the nonnegativity of the weights (second property), we can see that (Ŵ, b̂) meets the conditions of Lemma G.1 and therefore satisfies the MIN+ property for all positive samples. According to the first property and Definition G.2:

∀i′ ∈ [r] ∖ {i}: BT(ŵ_{i′}) = BT(w*_{i′}) = b*_{i′} = b̂_{i′}

In addition, we know that b̂_i = BT(ŵ_i) by Eq. (43). According to Lemma D.1, the solution satisfies the MIN- property, and from Lemma B.1, (Ŵ, b̂) ∈ U. As we saw, 0 ≥ b̂_i ≥ b*_i implies |b̂_i| ≤ |b*_i|, and ∀j ∈ A_{n_2}: w*_ij ≥ 0 = ŵ_ij with strict inequality for some j. Therefore, ‖(w*_i, b*_i)‖₂² > ‖(ŵ_i, b̂_i)‖₂², in contradiction to the optimality of (W*, b*).

We next prove the following covering lemma.

Lemma G.3. Given a min-norm solution (W*, b*), every i ∈ [r] covers some term n ∈ [K] or w*_i = 0.

Proof. Given i ∈ [r], if w*_i = 0 the claim is true. Otherwise, by the third property of Lemma G.2, ∃n ∈ [K] such that:

∀j ∈ A_n: w*_ij > 0 and ∀j ∈ [D] ∖ A_n: w*_ij = 0    (52)

We will prove the claim for n by assuming by contradiction that:

∃j_1, j_2 ∈ A_n such that w*_ij_1 ≠ w*_ij_2    (53)

Without loss of generality, we assume w*_ij_1 > w*_ij_2 and w*_ij_2 = min_{j∈A_n} w*_ij. Define the following i-modified solution (Ŵ, b̂):

∀j ∈ A_n: ŵ_ij = w*_ij_2, ∀j ∈ [D] ∖ A_n: ŵ_ij = w*_ij, and b̂_i = BT(ŵ_i)    (54)

Note that V_n(w*_i) = V_n(ŵ_i) = w*_ij_2. Given x ∈ S_p, we know that (W*, b*) satisfies the MIN+ property for x.
Then, ∃I ⊆ [r] such that:

∑_{i′∈I} w*_{i′} ⋅ x + b*_{i′} ≥ 2    (55)

If x ⋅ t*_n = ‖t*_n‖₁, due to the third property of Lemma G.2 the following holds:

w*_i ⋅ x + b*_i = ∑_{j∈A_n} w*_ij + BT(w*_i) = ∑_{j∈A_n} w*_ij - ∑_{j∈A_n} w*_ij + 2V_n(w*_i) = 2w*_ij_2 = ∑_{j∈A_n} ŵ_ij - ‖ŵ_i‖₁ + 2V_n(ŵ_i) = ŵ_i ⋅ x + b̂_i    (56)

Then we can conclude that (Ŵ, b̂) satisfies this property for x from the following:

∑_{i′∈I} ŵ_{i′} ⋅ x + b̂_{i′} = ∑_{i′∈I} w*_{i′} ⋅ x + b*_{i′} ≥ 2    (57)

Otherwise, x ⋅ t*_n < ‖t*_n‖₁ and by Definition B.4:

w*_i ⋅ x + b*_i = ∑_{j∈A_n} w*_ij x_j + BT(w*_i) ≤ 0    (58)

Therefore, using Definition G.2, we have that (Ŵ, b̂) satisfies:

∑_{i′∈I∖{i}} ŵ_{i′} ⋅ x + b̂_{i′} = ∑_{i′∈I∖{i}} w*_{i′} ⋅ x + b*_{i′} ≥ ∑_{i′∈I} w*_{i′} ⋅ x + b*_{i′} ≥ 2    (59)

We can conclude that (Ŵ, b̂) satisfies the MIN+ property for x. According to the first property of Lemma G.2 and Definition G.2:

∀i′ ∈ [r] ∖ {i}: BT(ŵ_{i′}) = BT(w*_{i′}) = b*_{i′} = b̂_{i′}

In addition, we know that b̂_i = BT(ŵ_i) by Eq. (54). According to Lemma D.1, the solution satisfies the MIN- property, and from Lemma B.1, (Ŵ, b̂) ∈ U. Finally, we can see that ‖w*_i‖₁ > ‖ŵ_i‖₁ implies |BT(w*_i)| > |BT(ŵ_i)|, and ∀j ∈ [D]: w*_ij ≥ ŵ_ij. Thus, we have ‖(w*_i, b*_i)‖₂² > ‖(ŵ_i, b̂_i)‖₂². This is a contradiction to the optimality of (W*, b*), as desired.

We can now define λ_i = V_n(w*_i), and we know that the neuron i satisfies:

∀j ∈ A_n: w*_ij = λ_i and ∀j ∈ [D] ∖ A_n: w*_ij = 0

Therefore, w*_i = λ_i t*_n and we can say that the neuron i covers the term n.

We can now prove Theorem 7.2.

Proof. We know from Lemma G.3 that each neuron covers some term. We denote the term that is covered by neuron i as n_i. Given n ∈ [K], we assume by contradiction that it is not covered, namely ∀i ∈ [r]: n_i ≠ n. Consider the special positive sample x^(n), for which it holds that:

∀i ∈ [r]: x^(n) ⋅ w*_i = λ_i x^(n) ⋅ t*_{n_i} = -λ_i ‖t*_{n_i}‖₁ ≤ 0

H EXPERIMENT DETAILS AND ADDITIONAL RESULTS

In the experiments for Figure 3a and Figure 3b, we use a small Gaussian initialization W^(0) ∼ N(0, 10^-5) and b^(0) = 0. The learning rate for GD is η = 10^-2, and the number of hidden units is r = 700. We create the train set by sampling uniformly from {±1}^D. The test set has size 10^4, also sampled uniformly from {±1}^D. In the experiments of Figure 3a, for every train-set size we run the experiment five times and report the average accuracy. If GD converges to a local minimum during learning, we re-initialize and re-train; therefore, all our results are reported at zero train error. For the tabular datasets, we consider three UCI datasets: kr-vs-kp, Splice, and diabetes. For all of these, we convert the input into binary form by encoding categorical variables as one-hot vectors. We also restrict to binary classification: in kr-vs-kp the class 'won' is considered positive and 'notwon' negative, in Splice the classes 'EI' and 'IE' are considered positive and 'N' negative, and diabetes is binary by design. We train on 90% of the data and test on the remaining 10%.
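For concreteness, the uniform train and test sets can be generated as follows. This is a hypothetical helper mirroring the setup, not the paper's code; the term structure is the example DNF of Figure 1, not one tied to a specific experiment:

```python
import numpy as np

def sample_dnf_data(terms, D, n, rng):
    # x ~ Uniform({-1,+1}^D); label +1 iff some term has all its literals on
    X = rng.choice([-1, 1], size=(n, D))
    pos = np.zeros(n, dtype=bool)
    for A in terms:
        pos |= (X[:, A] == 1).all(axis=1)
    return X, np.where(pos, 1, -1)

rng = np.random.default_rng(0)
terms = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]   # the read-once DNF of Figure 1
X_train, y_train = sample_dnf_data(terms, D=9, n=10_000, rng=rng)
X_test, y_test = sample_dnf_data(terms, D=9, n=10_000, rng=rng)
```

For D = 9 and three disjoint size-3 terms, roughly a 1 - (7/8)^3 ≈ 0.33 fraction of the uniform samples is positive.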

H.1 ADDITIONAL EXPERIMENTS

Here we report additional experiments on various values of D and various DNFs. Figure 4 reports test accuracy for convex and standard networks, for D = 10, 27, 35. We note that for D > 40 the standard network converges very slowly and often does not converge to zero error, and thus we do not include comparisons in these cases.

I DETAILS FOR COMPUTER ASSISTED PROOF

The set F for which we proved the result consists of all balanced DNFs with input dimension 4 ≤ D ≤ 15 and terms of size at least two, including DNFs in which some variables are not part of any term (e.g., for D = 6 we also consider the DNF (x_1 ∧ x_2) ∨ (x_3 ∧ x_4)). We consider the following parameter ranges in the proof: γ ∈ {0.4, 0.5, 0.9} and β ∈ {0.4, 0.5, 0.9}. For 4 ≤ D ≤ 12 we used η = 10^-5 and ϵ ∈ {10^-6, 2⋅10^-6, 3⋅10^-6, 4⋅10^-6, 5⋅10^-6, 6⋅10^-6, 7⋅10^-6, 8⋅10^-6}. For 13 ≤ D ≤ 15 we used η = 10^-4 and ϵ = 2⋅10^-6 to avoid long run times. Given D, η, and a target function f*, we use the α-GD procedure of Lemma 6.1 to learn f*. This allows us to perform the simulation in integers, without floating-point inaccuracies; according to Lemma 6.1, this is equivalent to learning with standard GD in terms of convergence and recovery. We choose α = 2^D η ⋅ 10^6 so that the initialization is integer and, by Eq. (5), the update step is in integer steps. We use int64 to avoid integer overflow. The procedure begins by initializing W = [±α]^D and b = [0]^D. Then we apply Eq. (5), using matrix multiplication in TensorFlow on GPU, until we achieve 0 loss. After the network converges to a global minimum, we execute γ-pruning. To keep the simulation in integers we use the following trick: instead of checking whether ‖w_i‖∞ is larger than γM∞(W), we compare 10⋅‖w_i‖∞ against 10γ⋅M∞(W), because 10γ is an integer. Finally, we apply β-reconstruction to the pruned network to obtain the reconstructed network, using the same trick to keep the simulation in integers. To validate that the reconstructed network is equivalent to f*, we go over all the terms of f* and check that there is a neuron equal to each of them; in addition, we go over all the neurons and check that each has a corresponding term.
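The pruning, reconstruction and validation steps can be sketched as follows. This is an illustrative reading, not the paper's implementation: Definitions 6.1 and 6.2 are not restated in this appendix, so we assume here that γ-pruning keeps the neurons whose sup-norm is at least γ ⋅ M∞(W) = γ ⋅ max_i ‖w_i‖∞, and that β-reconstruction maps each surviving neuron to the term {j : w_ij ≥ β‖w_i‖∞}:

```python
import numpy as np

def gamma_prune(W, gamma):
    # assumed form of Definition 6.1: drop neurons with small sup-norm
    norms = np.abs(W).max(axis=1)
    return W[norms >= gamma * norms.max()]

def beta_reconstruct(W, beta):
    # assumed form of Definition 6.2: read a term off each surviving neuron
    return [frozenset(np.flatnonzero(w >= beta * np.abs(w).max())) for w in W]

def recovers(terms, W, gamma, beta):
    # validation as described above: true terms and reconstructed terms must match
    rec = set(beta_reconstruct(gamma_prune(W, gamma), beta))
    return rec == {frozenset(A) for A in terms}

# idealized DNF-recovery weights: one neuron per term plus a near-dead neuron
terms = [[0, 1, 2], [3, 4, 5]]
W = np.zeros((3, 7))
W[0, [0, 1, 2]] = 1.0
W[1, [3, 4, 5]] = 0.9
W[2] = 0.01                     # gamma-pruning should drop this neuron
assert recovers(terms, W, gamma=0.5, beta=0.5)
```

The actual computer-assisted proof performs the same comparisons over int64 tensors, as described above.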



Non-homogeneity here is a result of a bias in the second layer.

In a read-once DNF each literal appears at most once. See Section for a formal definition.

In the case of the uniform distribution, we can assume monotone DNFs WLOG. This follows since for the uniform distribution (which has IID Bernoulli(0.5) entries), by symmetry, any negated literal can be replaced with the original literal (without negation), and all our results still hold.

We chose a single symmetric initialization of all binary vectors with entries in {± } to serve as a representative of initializations used in practice. Indeed, we show in Section 8 that the conclusions of Theorem 6.1 empirically hold for Gaussian initializations.

We focus on balanced DNFs because, while the symmetric initialization W^(0) described above is a good test-bed for understanding the inductive bias for balanced DNFs, it is not so for unbalanced DNFs. Indeed, we observe empirically that GD converges in some instances to spurious local minima in this case. However, we will empirically show in Section 8 that this is not the case for Gaussian initialization, and that for this initialization we can successfully reconstruct DNFs using the method described in this section.

For the hinge loss, we do not expect GD to converge exactly to the min-norm solution. Indeed, Figure 2a from the computer-assisted proof shows that GD converges to a solution which is not a DNF-recovery solution.

In the case of D = 13 and the target DNF (x_1 ∧ x_2) ∨ (x_3 ∧ x_4) ∨ (x_5 ∧ x_6) ∨ (x_7 ∧ x_8) ∨ (x_9 ∧ x_10) ∨ (x_11 ∧ x_12), we used η = 10^-3 to reduce the run time even further.



Figure 1: Examples of global minima for learning the read-once DNF (x_1 ∧ x_2 ∧ x_3) ∨ (x_4 ∧ x_5 ∧ x_6) ∨ (x_7 ∧ x_8 ∧ x_9) with a convex network. (a) Global minimum that memorizes the training points. (b) Global minimum that recovers the DNF. (c) Global minimum that GD converges to.

Theorem 4.1. Let f : X → Y. Then there exist (W, b) and a network N as in Eq. (1) with r ≤ 2^D such that N(x; W, b) = f(x) for all x ∈ X.

Proposition 5.1. Assume S_x = X. Consider (W, b) with r = |S_p| such that for any x ∈ S_p there exists i ∈ [r] where w_i = x and b_i = -D + 2. Then (W, b) is a global minimum of the loss in Eq. (2).
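Proposition 5.1 can be verified exhaustively for a small D. The sketch below is illustrative (the 3-term DNF of Figure 1 is our choice of example); it builds the memorizing network and checks that every point of X = {±1}^D is classified with margin at least 1:

```python
import itertools
import numpy as np

D = 9
terms = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
X = np.array(list(itertools.product([-1, 1], repeat=D)))  # all of {-1,1}^D
pos = np.zeros(len(X), dtype=bool)
for A in terms:
    pos |= (X[:, A] == 1).all(axis=1)
y = np.where(pos, 1.0, -1.0)

# one neuron per positive sample: w_i = x_i, b_i = -D + 2 (as in Proposition 5.1)
W = X[pos].astype(float)
b = np.full(len(W), -D + 2.0)

N = np.maximum(X @ W.T + b, 0).sum(axis=1) - 1.0  # N(x; W, b) for all x at once
assert (y * N >= 1).all()   # every sample has margin at least 1
```

Each positive x activates exactly its own neuron (value 2, so N(x) = 1), while every neuron is inactive on negative samples (N(x) = -1), so the hinge loss is zero.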

Figure 2: Illustration of pruning and reconstruction for the read-once DNF (x_1 ∧ x_2 ∧ x_3) ∨ (x_4 ∧ x_5 ∧ x_6) ∨ (x_7 ∧ x_8 ∧ x_9). (a) The global minimum that GD converges to. (b) Global minimum after pruning (Definition 6.1). (c) Global minimum after pruning followed by β-reconstruction (Definition 6.2).

Figure 3: Experiments for learning read-once DNFs. (a) Test performance of convex and standard two-layer networks for D = 9. (b) Reconstruction success rate for convex and standard two-layer networks, for D = 27. (c) Model learned by a convex network for a DNF with D = 100 and small initialization. (d) Same setting as (c), with large initialization.

Lemma B.1. Assume that S x = X . Then (W , b) is a global minimum of Eq. (2) if and only if (W , b) satisfies MIN + and MIN -.

and therefore for all i ∈ [r], w_i ⋅ x + b_i ≤ 0. The other direction follows similarly.

D PROOF OF LEMMA B.2

Definition D.1. We define the set of indices which are not active in any term as the noisy indices, and denote them by A_{K+1} = [D] ∖ ∪_{n∈[K]} A_n.

The lemma follows directly from Lemma B.1 and the following lemma.

Lemma D.1. Given a neuron (w, b), there is a negative sample x ∈ S_n for which w ⋅ x + b > 0 if and only if b > BT(w).

Proof. Given a neuron (w, b), we define the minimum index of a term n ∈ [K] as J_n = arg min_{j∈A_n} w_j. Consider the sample x̄ ∈ S_x where for each j ∈ [D]:

Assume by contradiction that b*_i ≠ BT(w*_i). By Lemma B.2, b*_i has to be smaller than BT(w*_i), because otherwise (W*, b*) is not a global minimum of the loss in Eq. (2). Now consider the i-modified solution (Ŵ, b̂) defined by:

In addition, from Lemma D.1 we know that 0 > x ⋅ ŵ_i + b̂_i. Therefore, (Ŵ, b̂) satisfies the MIN- property. By Lemma B.1, (Ŵ, b̂) ∈ U. From Definition B.4 the bias threshold is non-positive, so b̂_i ≤ 0.

For the second property, assume by contradiction that ∃j′ ∈ [D] such that w*_ij′ < 0. Consider the following i-modified solution (Ŵ, b̂):

Figure 4: Test accuracy for the convex and standard networks. (a) D = 10 and the target DNF has 3 terms of size: (2,3,3) (b) D = 27 and the target DNF has 3 terms of size 3, 2 terms of size 2 and one term of size 6. (c) D = 35 and the target DNF has 3 terms of size 2, 4 terms of size 4 and 2 terms of size 2.

By the first property of Lemma G.2, ∀i ∈ [r], b*_i = BT(w*_i) ≤ 0. Therefore, ∀i ∈ [r]: x^(n) ⋅ w*_i + b*_i ≤ 0, so N(x^(n); W*, b*) = -1, in contradiction to the fact that (W*, b*) ∈ U. By Definition 5.2, W* covers the DNF f*. Together with Lemma G.3, this implies that W* is a DNF-recovery solution.
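The DNF-recovery solution itself can be checked exhaustively for a small D: with one neuron per term, w_i = λ_i t*_{n_i} and b_i = BT(w_i), the network attains zero hinge loss. A minimal sketch (illustrative; λ_i = 1 and the 3-term DNF of Figure 1 are our choices):

```python
import itertools
import numpy as np

D = 9
terms = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
X = np.array(list(itertools.product([-1, 1], repeat=D)))  # all of {-1,1}^D
pos = np.zeros(len(X), dtype=bool)
for A in terms:
    pos |= (X[:, A] == 1).all(axis=1)
y = np.where(pos, 1.0, -1.0)

# one neuron per term: w_i = t*_n (lambda_i = 1), b_i = BT(w_i) = -|A_n| + 2
W = np.zeros((len(terms), D))
for i, A in enumerate(terms):
    W[i, A] = 1.0
b = np.array([-len(A) + 2.0 for A in terms])

N = np.maximum(X @ W.T + b, 0).sum(axis=1) - 1.0
assert (y * N >= 1).all()   # zero hinge loss with only K = 3 neurons
```

Compare with the memorizing network of Proposition 5.1: here K neurons suffice instead of one per positive sample, which is exactly the compact representation the inductive-bias results single out.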

Figure 5 reports results on DNF recovery accuracy for D = 10, 30, 100. For D ≥ 30, the standard network typically fails to recover the DNF.

Figure 6 shows learning with small and large initialization scales for an unbalanced DNF. To create the figures of the global minima, we use hierarchical clustering to cluster the neurons.

Figure 5: Reconstruction success rate for the convex and standard networks. (a) D = 10 and the target DNF has 3 terms of sizes (2, 3, 3). (b) D = 30 and the target DNF has 5 terms of size 6. The standard network fails to reconstruct (success rate below 5%). (c) D = 100 and the target DNF has 15 terms of size 5. The standard network fails to reconstruct (success rate below 5%).

Figure 6: Model learned by a convex network for a DNF with D = 27, for different initialization scales. The ground-truth DNF has 3 terms of size 3, 2 terms of size 2 and one term of size 1. (a) Small initialization: can be seen to lead to good recovery. (b) Large initialization: does not lead to good recovery.

In this work, we consider DNFs on inputs with entries in {±1} and output in {±1}. Therefore, we will use the following notation for DNFs. Let t

Following the notation from Definition B.1, we say that a solution (W, b) satisfies the MIN+ property for a positive point x ∈ S_p if there exists I ⊆ [r] such that ∑_{i∈I} w_i ⋅ x + b_i ≥ 2.

