GLOBALLY OPTIMAL TRAINING OF NEURAL NETWORKS WITH THRESHOLD ACTIVATION FUNCTIONS

Abstract

Threshold activation functions are highly preferable in neural networks due to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations since the activation function has zero gradient except at a single non-differentiable point. To this end, we study weight decay regularized training problems of deep neural networks with threshold activations. We first show that regularized deep threshold network training problems can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method, provided that the last hidden layer width exceeds a certain threshold. We also derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network. We corroborate our theoretical results with various numerical experiments.

1. INTRODUCTION

In the past decade, deep neural networks have proven remarkably effective at solving challenging problems and have become popular in many applications. The choice of activation function plays a crucial role in their performance and practical implementation. In particular, even though neural networks with popular activation functions such as ReLU are successfully employed, they require advanced computational resources for training and evaluation, e.g., Graphical Processing Units (GPUs) (Coates et al., 2013). Consequently, training such deep networks is challenging, especially without sophisticated hardware. On the other hand, the threshold activation offers a multitude of advantages: (1) computational efficiency, (2) compression/quantization to a binary latent dimension, and (3) interpretability. Unfortunately, gradient based optimization methods fail on threshold activation networks because the gradient is zero almost everywhere. To close this gap, we analyze the training problem of deep neural networks with the threshold activation function defined as

σ_s(x) := s·1{x ≥ 0} = { s if x ≥ 0, 0 otherwise },   (1)

where s ∈ R is a trainable amplitude parameter for the neuron. Our main result is that globally optimal deep threshold networks can be trained by solving a convex optimization problem.
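As a concrete illustration (ours, not from the paper), the activation in (1) and its almost-everywhere-zero derivative can be checked numerically; this is exactly why backpropagation yields no useful training signal:

```python
import numpy as np

def sigma(x, s=1.0):
    """Threshold activation sigma_s(x) = s * 1{x >= 0}."""
    return s * (x >= 0).astype(float)

x = np.linspace(-2, 2, 9)
print(sigma(x, s=0.5))  # jumps from 0 to 0.5 at x = 0

# A finite-difference "gradient" is zero everywhere except across the jump at 0,
# so gradient descent receives no signal at almost every point.
eps = 1e-6
grad = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
```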

1.1. WHY SHOULD WE CARE ABOUT THRESHOLD NETWORKS?

Neural networks with threshold activations are highly desirable for the following reasons:
• Since the threshold activation (1) is restricted to take values in {0, s}, threshold neural network models are far more suitable for hardware implementations (Bartlett & Downs, 1992; Corwin et al., 1994). Specifically, these networks have a significantly lower memory footprint, lower computational complexity, and lower energy consumption (Helwegen et al., 2019).
• Modern neural networks have an extremely large number of full-precision trainable parameters, so several computational barriers emerge in hardware implementations. One approach to mitigate these issues is to reduce the network size by grouping the parameters via a hash function (Hubara et al., 2017; Chen et al., 2015). However, this still requires full-precision training before the hash function is applied and thus fails to remedy the computational issues. On the other hand, neural networks with threshold activations need a minimal number of bits.
• Another approach to reducing the complexity is to quantize the weights and activations of the network (Hubara et al., 2017), and the threshold activation is inherently in a two-level quantized form.
• The threshold activation is a valid model for simulating the behaviour of a biological neuron, as detailed in Jain et al. (1996). Therefore, progress in this research field could shed light on the connection between biological and artificial neural networks.

1.2. RELATED WORK

Although threshold networks are essential for several practical applications as detailed in the previous section, training their parameters is a difficult non-differentiable optimization problem due to the discrete nature of (1). For deep neural networks with popular activations, the common practice is to use first-order gradient based algorithms such as Gradient Descent (GD), since the well known backpropagation algorithm efficiently calculates the gradient with respect to the parameters. However, the threshold activation in (1) has zero gradient except at a single non-differentiable point (zero), and therefore one cannot directly use gradient based algorithms to train the parameters of the network. To remedy this issue, numerous heuristic algorithms have been proposed in the literature, as detailed below, but they still fail to globally optimize the training objective (see Figure 1). The Straight-Through Estimator (STE) is a widely used heuristic for training threshold networks (Bengio et al., 2013; Hinton, 2012). Since the gradient is zero almost everywhere, Bengio et al. (2013); Hinton (2012) proposed replacing the threshold activation with the identity function during only the backward pass. Later on, this approach was extended to employ various forms of the ReLU activation function, e.g., clipped ReLU, vanilla ReLU, Leaky ReLU (Yin et al., 2019b; Cai et al., 2017; Xiao et al.), during the backward pass. Additionally, clipped versions of the identity function were also used as an alternative to STE (Hubara et al., 2017; Courbariaux et al., 2016; Rastegari et al., 2016).

1.3. CONTRIBUTIONS

Our contributions are summarized below and in Table 1:
• We show that regularized deep threshold network training problems can be equivalently formulated as standard convex Lasso problems, provided that the last hidden layer width exceeds a certain threshold.
• In Section 3.1, we characterize the evolution of the set of hyperplane arrangements, and consequently of the hidden layer representation space, as a recursive process (see Figure 3) as the network gets deeper.
• We prove that when a certain layer width exceeds O(√n/L), the regularized L-layer threshold network training further simplifies to a problem that can be solved in O(n) time.

Table 1: Summary of our results.

Result | Depth | Complexity | Condition
Theorem 2.2 | 2 | O(n^{3r}) | m ≥ m* (convex opt)
Theorem 2.3 | 2 | O(n) | m ≥ n + 2 (convex opt)
Theorem 3.2 | L | O(n^{3r ∏_{l=1}^{L-2} m_l}) | m_{L-1} ≥ m* (convex opt)
Corollary 3.4 | L | O(n) | ∃l : m_l ≥ O(√n/L) (convex opt)

Notation and preliminaries: We use lowercase and uppercase bold letters to denote vectors and matrices, respectively. We use [n] to denote the set {1, 2, ..., n}. We denote the training data matrix consisting of n samples in d dimensions as X ∈ R^{n×d}, the label vector as y ∈ R^n, and the l-th layer weights of a neural network as W^{(l)} ∈ R^{m_{l-1}×m_l}, where m_0 = d and m_L = 1.

2. TWO-LAYER THRESHOLD NETWORKS

We first consider the following two-layer threshold network

f_{θ,2}(X) = σ_s(XW^{(1)}) w^{(2)} = Σ_{j=1}^m s_j 1{X w_j^{(1)} ≥ 0} w_j^{(2)},   (2)

where the set of trainable parameters is W^{(1)} ∈ R^{d×m}, s ∈ R^m, w^{(2)} ∈ R^m, and θ := {W^{(1)}, s, w^{(2)}} is a compact representation of the parameters. Note that we include bias terms by concatenating a vector of ones to X. Next, consider the weight decay regularized training objective

P_2^{noncvx} := min_{W^{(1)}, s, w^{(2)}} (1/2) ||f_{θ,2}(X) - y||_2^2 + (β/2) Σ_{j=1}^m ( ||w_j^{(1)}||_2^2 + |s_j|^2 + |w_j^{(2)}|^2 ).   (3)

Now, we apply a scaling between the variables s_j and w_j^{(2)} to reach an equivalent optimization problem.

Lemma 2.1 (Optimal scaling). The training problem in (3) can be equivalently stated as

P_2^{noncvx} = min_{θ ∈ Θ_s} (1/2) ||f_{θ,2}(X) - y||_2^2 + β ||w^{(2)}||_1,   (4)

where Θ_s := {θ : |s_j| = 1, ∀j ∈ [m]}.

We next define the set of hyperplane arrangement patterns of the data matrix X as

H(X) := {1{Xw ≥ 0} : w ∈ R^d} ⊆ {0, 1}^n.   (5)

We denote the distinct elements of the set H(X) by d_1, ..., d_P ∈ {0, 1}^n, where P := |H(X)| is the number of hyperplane arrangements. Using this fixed set of hyperplane arrangements {d_i}_{i=1}^P, we next prove that (4) is equivalent to the standard Lasso method (Tibshirani, 1996).

Theorem 2.2. Let m ≥ m*. Then the non-convex regularized training problem (4) is equivalent to

P_2^{cvx} = min_{w ∈ R^P} (1/2) ||Dw - y||_2^2 + β ||w||_1,   (6)

where D = [d_1 d_2 ... d_P] is a fixed n × P matrix. Here, m* is the cardinality of the optimal solution, which satisfies m* ≤ n + 1. Also, it holds that P_2^{noncvx} = P_2^{cvx}.

Theorem 2.2 proves that the original non-convex training problem in (3) can be equivalently solved as a standard convex Lasso problem using the hyperplane arrangement patterns as features. Surprisingly, the non-zero support of the convex optimizer of (6) matches that of the optimal weight-decay regularized threshold activation neurons in the non-convex problem (3).
This brings two major advantages over standard non-convex training:
• Since (6) is a standard convex problem, it can be globally optimized without resorting to non-convex optimization heuristics, e.g., initialization schemes or learning rate schedules.
• Since (6) is a convex Lasso problem, many efficient solvers exist (Efron et al., 2004).
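To make the pipeline of Theorem 2.2 concrete, the following sketch (our own illustration, not the paper's code) builds an approximate arrangement matrix D for a tiny dataset by sampling random hyperplanes — the same subsampling idea used for Convex-Lasso in Section 5 — and solves the Lasso problem (6) with a basic proximal-gradient (ISTA) loop; all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 2
X = np.hstack([rng.standard_normal((n, d)), np.ones((n, 1))])  # append bias column
y = rng.standard_normal(n)

# Approximate H(X): distinct arrangement patterns 1{Xw >= 0} from random hyperplanes.
patterns = {tuple((X @ w >= 0).astype(int)) for w in rng.standard_normal((5000, d + 1))}
D = np.array(sorted(patterns)).T.astype(float)  # n x P arrangement matrix

# Solve the Lasso: min_w 0.5*||Dw - y||^2 + beta*||w||_1 via ISTA.
beta = 0.1
P = D.shape[1]
w = np.zeros(P)
step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1/L with L the largest eigenvalue of D^T D
for _ in range(2000):
    z = w - step * (D.T @ (D @ w - y))                        # gradient step
    w = np.sign(z) * np.maximum(np.abs(z) - step * beta, 0.0)  # soft-threshold

obj = 0.5 * np.linalg.norm(D @ w - y) ** 2 + beta * np.abs(w).sum()
print(f"P = {P} arrangements, Lasso objective = {obj:.4f}, nonzeros = {(w != 0).sum()}")
```

The nonzero entries of w index the arrangement patterns, i.e. the hidden neurons, of an optimal threshold network.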

2.1. SIMPLIFIED CONVEX FORMULATION FOR COMPLETE ARRANGEMENTS

We now show that if the set of hyperplane arrangements of X is complete, i.e., H(X) = {0, 1}^n contains all Boolean sequences of length n, then the non-convex optimization problem in (3) can be simplified further. We call these instances complete arrangements. In the case of two-layer threshold networks, complete arrangements emerge when the width of the network exceeds a threshold, specifically m ≥ n + 2. We note that the m ≥ n regime, also known as the memorization regime, has been extensively studied in the recent literature (Bubeck et al., 2020; de Dios & Bruna, 2020; Pilanci & Ergen, 2020; Rosset et al., 2007). Particularly, these studies showed that as long as the width exceeds the number of samples, there exists a neural network model that can exactly fit an arbitrary dataset. Vershynin (2020); Bartlett et al. (2019) further improved the condition on the width by utilizing the expressive power of deeper networks and developed more sophisticated weight construction algorithms to fit the data.

Theorem 2.3. Assume that the set of hyperplane arrangements of X is complete, i.e., H(X) = {0, 1}^n, and suppose m ≥ n + 2. Then (4) is equivalent to

P_{v2}^{cvx} := min_{δ ∈ R^n} (1/2) ||δ - y||_2^2 + β ( ||(δ)_+||_∞ + ||(-δ)_+||_∞ ),   (7)

and it holds that P_2^{noncvx} = P_{v2}^{cvx}. Also, one can construct an optimal network with n + 2 neurons in O(n) time from the optimal solution to the convex problem (7).

Based on Theorem 2.3, when the data matrix can be shattered, i.e., all 2^n possible labelings y ∈ {0, 1}^n of the data points can be separated via a linear classifier, the set of hyperplane arrangements is complete. Consequently, the non-convex problem in (3) further simplifies to (7).
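Problem (7) separates over the positive and negative parts of δ, and each part is a proximal operator of β||·||_∞, which can be evaluated exactly via Moreau decomposition and a projection onto the ℓ1 ball. The sketch below is our own illustration (function names are ours, not the paper's):

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto {x : ||x||_1 <= radius} (sort-based method)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    tau = (css[k] - radius) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_linf(v, beta):
    """prox of beta*||.||_inf via Moreau decomposition: v - beta*P_{l1 ball}(v/beta)."""
    return v - beta * project_l1_ball(v / beta)

def solve_problem7(y, beta):
    """Solve min_delta 0.5*||delta - y||^2 + beta*(||(delta)_+||_inf + ||(-delta)_+||_inf).
    The objective separates over the coordinates where y is positive / negative,
    so one l_inf prox per part suffices."""
    return prox_linf(np.maximum(y, 0.0), beta) - prox_linf(np.maximum(-y, 0.0), beta)

def objective(d):
    pos = float(np.max(np.maximum(d, 0.0)))
    neg = float(np.max(np.maximum(-d, 0.0)))
    return 0.5 * np.sum((d - y) ** 2) + beta * (pos + neg)

y = np.array([3.0, -1.0, 0.5, -2.0, 1.5])
beta = 0.4
delta = solve_problem7(y, beta)
obj = objective(delta)
print(delta, obj)
```

Note how the prox only pulls down the largest entry of each part: for this y, only the coordinates holding 3.0 and -2.0 are shrunk by β.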

2.2. TRAINING COMPLEXITY

We first briefly summarize our complexity results for solving (6) and (7), and then provide the details of the derivations below. Our analysis reveals two interesting regimes:
• (incomplete arrangements) When n + 1 ≥ m ≥ m*, we can solve (6) with O(n^{3r}) complexity, where r := rank(X). Notice this is polynomial time whenever the rank r is fixed.
• (complete arrangements) When m ≥ n + 2, we can solve (7) in closed form, and the reconstruction of the non-convex parameters W^{(1)} and w^{(2)} requires only O(n) time, independent of d.

Computational complexity of (6): To solve the optimization problem in (6), we first enumerate all possible hyperplane arrangements {d_i}_{i=1}^P. It is well known that, given a rank-r data matrix, the number of hyperplane arrangements P is upper bounded by (Stanley et al., 2004; Cover, 1965)

P ≤ 2 Σ_{k=0}^{r-1} C(n-1, k) ≤ 2r ( e(n-1)/r )^r,

where r = rank(X) ≤ min(n, d). Furthermore, these arrangements can be enumerated in O(n^r) time (Edelsbrunner et al., 1986). Then, the complexity of solving (6) is O(P^3) ≈ O(n^{3r}) (Efron et al., 2004).

Computational complexity of (7): The problem in (7) is the proximal operator of a polyhedral norm. Since the problem is separable over the positive and negative parts of the parameter vector δ, the optimal solution can be obtained by applying two proximal steps (Parikh & Boyd, 2014). As noted in Theorem 2.3, the reconstruction of the non-convex parameters W^{(1)} and w^{(2)} requires O(n) time.
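As a quick empirical sanity check (our own illustration), one can count the distinct arrangement patterns of a random rank-2 dataset by sweeping hyperplane directions and verify that the count never exceeds Cover's bound:

```python
import numpy as np
from math import comb, e

rng = np.random.default_rng(1)
n, r = 12, 2
X = rng.standard_normal((n, r))  # rank-2 data (no bias column, for simplicity)

# Count distinct patterns 1{Xw >= 0} by densely sweeping directions w on the circle.
angles = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)
patterns = {tuple(row) for row in (X @ W.T >= 0).astype(int).T}
P_emp = len(patterns)

cover_bound = 2 * sum(comb(n - 1, k) for k in range(r))  # 2 * sum_{k<r} C(n-1, k)
loose_bound = 2 * r * (e * (n - 1) / r) ** r             # 2r (e(n-1)/r)^r
print(P_emp, cover_bound, loose_bound)
```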

2.3. A GEOMETRIC INTERPRETATION

To provide a geometric interpretation, we consider the weakly regularized case where β → 0. In this regime, Proposition 2.4 (proven in Appendix A.2) shows that the optimal value of the training problem is the smallest t ≥ 0 such that y ∈ t·Conv{±d_j, ∀j ∈ [P]}. In other words, the non-convex threshold network training problem in (3) implicitly represents the label vector y as a convex combination of the hyperplane arrangement patterns determined by the data matrix X. Therefore, we explicitly characterize the representation space of threshold networks. We also remark that this interpretation extends to arbitrary depth as shown in Theorem 3.2.

2.4. HOW TO OPTIMIZE THE HIDDEN LAYER WEIGHTS?

After training via the proposed convex programs in (6) and (7), we need to reconstruct the layer weights of the non-convex model in (2). We first construct the optimal hyperplane arrangements d_i for (7) as detailed in Appendix A.4, which yields the prediction model in (2). Notice that changing w_j^{(1)} to any other vector w̃_j with the same norm and such that 1{Xw_j^{(1)} ≥ 0} = 1{Xw̃_j ≥ 0} does not change the optimal training objective. Therefore, there are multiple global optima, and the one we choose might impact the generalization performance, as discussed in Section 5.

3. DEEP THRESHOLD NETWORKS

We now analyze the L-layer parallel deep threshold network model with m_{L-1} subnetworks defined as

f_{θ,L}(X) = Σ_{k=1}^{m_{L-1}} σ_{s^{(L-1)}}( X_k^{(L-2)} w_k^{(L-1)} ) w_k^{(L)},   (11)

where θ := {{W_k^{(l)}, s_k^{(l)}}_{l=1}^L}_{k=1}^{m_{L-1}}, θ ∈ Θ := {θ : W_k^{(l)} ∈ R^{m_{l-1}×m_l}, s_k^{(l)} ∈ R^{m_l}, ∀l, k}, X_k^{(0)} := X, X_k^{(l)} := σ_{s_k^{(l)}}( X_k^{(l-1)} W_k^{(l)} ), ∀l ∈ [L-1], and the subscript k indexes the subnetworks (see Ergen & Pilanci (2021a;b); Wang et al. (2023) for more details about parallel networks). We next show that the standard weight decay regularized training problem can be cast as an ℓ1-norm minimization problem as in Lemma 2.1.

Lemma 3.1. The following L-layer regularized threshold network training problem

P_L^{noncvx} = min_{θ∈Θ} (1/2) ||f_{θ,L}(X) - y||_2^2 + (β/2) Σ_{k=1}^{m_{L-1}} Σ_{l=1}^{L} ( ||W_k^{(l)}||_F^2 + ||s_k^{(l)}||_2^2 )   (12)

can be reformulated as

P_L^{noncvx} = min_{θ∈Θ_s} (1/2) ||f_{θ,L}(X) - y||_2^2 + β ||w^{(L)}||_1,

where Θ_s := {θ ∈ Θ : |s_k^{(L-1)}| = 1, ∀k}.
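To fix ideas, a forward pass of the parallel model (11) can be sketched in a few lines (our own illustration; the dictionary-based parameter layout and names are ours, not the paper's):

```python
import numpy as np

def threshold(Z, s):
    """sigma_s(Z): elementwise s_j * 1{Z_ij >= 0}, with s broadcast over columns."""
    return (Z >= 0).astype(float) * s

def parallel_threshold_net(X, subnets):
    """Forward pass of the L-layer parallel model: each subnetwork k applies a
    chain of threshold layers ending in a single neuron, then scales its output
    by the scalar w_k^{(L)}; the subnetwork outputs are summed."""
    out = np.zeros(X.shape[0])
    for sub in subnets:
        A = X
        for W, s in zip(sub["weights"], sub["amps"]):  # hidden layers 1..L-1
            A = threshold(A @ W, s)
        out += A[:, 0] * sub["w_out"]  # last hidden layer has width 1 per subnetwork
    return out

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
sub = {"weights": [rng.standard_normal((3, 4)), rng.standard_normal((4, 1))],
       "amps": [np.ones(4), np.ones(1)],  # |s| = 1, as in Lemma 3.1's Theta_s
       "w_out": 2.0}
pred = parallel_threshold_net(X, [sub])
print(pred)
```

Since the final threshold emits 0 or 1 and w_out = 2, every prediction here lies in {0, 2}, illustrating the quantized output of each subnetwork.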

3.1. CHARACTERIZING THE SET OF HYPERPLANE ARRANGEMENTS FOR DEEP NETWORKS

We first define hyperplane arrangements for L-layer networks with a single subnetwork (i.e., m_{L-1} = 1; thus we drop the index k). We denote the set of hyperplane arrangements as

H_L(X) := {1{X^{(L-2)} w^{(L-1)} ≥ 0} : θ ∈ Θ}.

We also denote the elements of the set H_L(X) by d_1, ..., d_{P_{L-1}} ∈ {0, 1}^n, where P_{L-1} := |H_L(X)| is the number of hyperplane arrangements in layer L-1. To construct the hyperplane arrangement matrix D^{(l)}, we define a matrix-valued operator as follows:

D^{(1)} := A(X) = [d_1 d_2 ... d_{P_1}],   D^{(l+1)} := ⋃_{|S|=m_l} A( D_S^{(l)} ), ∀l ∈ [L-2].   (13)

Here, the operator A(·) outputs a matrix whose columns contain all possible hyperplane arrangements corresponding to its input matrix, as in (5). In particular, D^{(1)} denotes the arrangements of the first layer given the input matrix X. The notation D_S^{(l)} ∈ {0, 1}^{n×m_l} denotes the submatrix of D^{(l)} ∈ {0, 1}^{n×P_l} indexed by the subset S of its columns, where S runs over all subsets of size m_l. Finally, ⋃ is an operator that takes the union of these column vectors and outputs a matrix of size n × P_{l+1} containing them as columns. Note that we may omit repeated columns and denote the total number of unique columns as P_{l+1}, since this does not change the value of our convex program. We next provide an analytical example describing the construction of the matrix D^{(l)}.

Example 3.1. We illustrate an example with the training data X = [-1 1; 0 1; 1 1] ∈ R^{3×2}. Inspecting the data samples (rows of X), we observe that all possible arrangement patterns are

D^{(1)} = A(X) =
[0 0 0 1 1 1]
[0 0 1 1 1 0]
[0 1 1 1 0 0]   ⟹ P_1 = 6.   (14)

For the second layer, we first specify the number of neurons in the first layer as m_1 = 2. Thus, we need to consider all possible column pairs in (14). We have

D^{(1)}_{{1,2}} = [0 0; 0 0; 0 1]   ⟹   A( D^{(1)}_{{1,2}} ) = [0 0 1 1; 0 0 1 1; 0 1 1 0],
D^{(1)}_{{1,3}} = [0 0; 0 1; 0 1]   ⟹   A( D^{(1)}_{{1,3}} ) = [0 0 1 1; 0 1 1 0; 0 1 1 0],
...

We then construct the hyperplane arrangement matrix as

D^{(2)} = ⋃_{|S|=2} A( D^{(1)}_S ) =
[0 0 0 0 1 1 1 1]
[0 0 1 1 0 0 1 1]
[0 1 0 1 0 1 0 1],

which shows that P_2 = 8. Consequently, we obtain the maximum possible number of arrangement patterns, i.e., all of {0, 1}^3, in the second layer even though some of these patterns are not achievable in the first layer in (14). We also provide a three dimensional visualization of this example in Figure 3.
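Example 3.1 can be reproduced numerically. The sketch below (our own illustration, not the paper's code) approximates the operator A(·) by sampling random hyperplanes, with a bias column appended at every layer (our reading of the example), and recovers P_1 = 6 and P_2 = 8:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def arrangements(A, n_samples=50000):
    """Approximate the arrangement operator A(.) by sampling random hyperplanes.
    A bias column is appended, matching the bias handling in Example 3.1."""
    Ab = np.hstack([A, np.ones((A.shape[0], 1))])
    W = rng.standard_normal((n_samples, Ab.shape[1]))
    return {tuple(p) for p in (Ab @ W.T >= 0).astype(int).T}

X = np.array([[-1.0], [0.0], [1.0]])  # bias column added inside arrangements()
D1 = sorted(arrangements(X))
print("P1 =", len(D1))

# Second layer with m1 = 2: union of arrangements over all column pairs of D(1).
D1_mat = np.array(D1, dtype=float).T  # 3 x P1 arrangement matrix
D2 = set()
for S in combinations(range(D1_mat.shape[1]), 2):
    D2 |= arrangements(D1_mat[:, S])
print("P2 =", len(D2))
```

Random sampling finds every pattern here because each achievable pattern corresponds to a cone of positive measure in weight space; exact enumeration (Edelsbrunner et al., 1986) would be used for a guarantee.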

3.2. POLYNOMIAL-TIME TRAINABLE CONVEX FORMULATION

Based on the procedure described in Section 3.1 to compute the arrangement matrix D^{(l)}, we now derive an exact convex formulation of the non-convex training problem in (12).

Theorem 3.2. Suppose that m_{L-1} ≥ m*. Then the non-convex training problem (12) is equivalent to

P_L^{cvx} = min_{w ∈ R^{P_{L-1}}} (1/2) ||D^{(L-1)} w - y||_2^2 + β ||w||_1,   (15)

where D^{(L-1)} ∈ {0, 1}^{n×P_{L-1}} is a fixed matrix constructed via (13) and m* denotes the cardinality of the optimal solution, which satisfies m* ≤ n + 1. Also, it holds that P_L^{noncvx} = P_L^{cvx}.

Theorem 3.2 shows that two-layer and deep networks simplify to very similar convex Lasso problems (i.e., (6) and (15)). However, the set of hyperplane arrangements is larger for deep networks, as analyzed in Section 3.1. Thus, the structure of the matrix D and the problem dimensionality differ significantly between the two problems.

3.3. SIMPLIFIED CONVEX FORMULATION

Here, we show that the data X can be shattered at a certain layer, i.e., H_l(X) = {0, 1}^n for some l ∈ [L], provided the number of hidden neurons in some layer satisfies m_l ≥ C√n/L. Then we can alternatively formulate (11) as a simpler convex problem. Therefore, compared to two-layer networks in Section 2.1, we substantially improve the condition on the layer width by benefiting from the depth L, which also confirms the benign impact of depth on optimization.

Lemma 3.3. If ∃l, C such that m_l ≥ C√n/L, then the set of hyperplane arrangements is complete, i.e., H_L(X) = {0, 1}^n.

We next use Lemma 3.3 to derive a simpler form of (11).

Corollary 3.4. As a direct consequence of Theorem 2.3 and Lemma 3.3, the non-convex deep threshold network training problem in (12) can be cast as the following convex program:

P_L^{noncvx} = P_{v2}^{cvx} = min_{δ ∈ R^n} (1/2) ||δ - y||_2^2 + β ( ||(δ)_+||_∞ + ||(-δ)_+||_∞ ).

Surprisingly, both two-layer and deep networks share the same convex formulation in this case. However, notice that two-layer networks require a condition on the data matrix in Theorem 2.3, whereas the result in Corollary 3.4 requires a milder condition on the layer widths.

3.4. TRAINING COMPLEXITY

Here, we first briefly summarize our complexity results for the convex training of deep networks. Based on the convex problems in (15) and Corollary 3.4, we have two regimes:
• When n + 1 ≥ m_{L-1} ≥ m*, we solve (15) with O(n^{3r ∏_{k=1}^{L-2} m_k}) complexity, where r := rank(X). Note that this is polynomial time when r and the number of neurons in each layer {m_l}_{l=1}^{L-2} are constants.
• When ∃l, C : m_l ≥ C√n/L, we solve (7) in closed form, and the reconstruction of the non-convex parameters requires O(n) time, as proven in Appendix A.9.

Computational complexity for (15): We first need an upper bound on the problem dimensionality P_{L-1}, which is stated in the next result.

Lemma 3.5. The cardinality of the hyperplane arrangement set of an L-layer network, H_L(X), can be bounded as

|H_L(X)| = P_{L-1} ≤ O( n^{r ∏_{k=1}^{L-2} m_k} ),

where r = rank(X) and m_l denotes the number of hidden neurons in the l-th layer.

Lemma 3.5 shows that the set of hyperplane arrangements grows significantly as the network gets deeper. However, its cardinality is still polynomial in n since r ≤ min{n, d} and {m_l}_{l=1}^{L-2} are fixed constants. To solve (15), we first enumerate all possible arrangements {d_i}_{i=1}^{P_{L-1}} to construct the matrix D^{(L-1)}. Then, we solve a standard convex Lasso problem, which requires O(P_{L-1}^3) complexity (Efron et al., 2004). Thus, based on Lemma 3.5, the overall complexity is O(P_{L-1}^3) ≈ O(n^{3r ∏_{k=1}^{L-2} m_k}).

Computational complexity for (7): Since Corollary 3.4 yields (7), the complexity is O(n).

4. EXTENSIONS TO ARBITRARY LOSS FUNCTIONS

In the previous sections, we considered the squared error loss to give a clear description of our approach. However, all derivations extend to arbitrary convex losses. Consider the regularized training problem with a convex loss function L(·, y), e.g., hinge loss or cross entropy:

min_{θ∈Θ} L( f_{θ,L}(X), y ) + (β/2) Σ_{k=1}^{m_{L-1}} Σ_{l=1}^{L} ( ||W_k^{(l)}||_F^2 + ||s_k^{(l)}||_2^2 ).   (16)

Then, we have the following generic loss results.

Corollary 4.1. Theorem 3.2 implies that when m_{L-1} ≥ m*, (16) can be equivalently stated as

P_L^{cvx} = min_{w ∈ R^{P_{L-1}}} L( D^{(L-1)} w, y ) + β ||w||_1.   (17)

Alternatively, when H_L(X) = {0, 1}^n, based on Corollary 3.4, the equivalent convex problem is

min_{δ ∈ R^n} L(δ, y) + β ( ||(δ)_+||_∞ + ||(-δ)_+||_∞ ).   (18)

Corollary 4.1 shows that (17) and (18) are equivalent to the non-convex training problem in (16). More importantly, they can be globally optimized via efficient convex optimization solvers.

5. EXPERIMENTS

In this section, we present numerical experiments verifying our theoretical results from the previous sections. As discussed in Section 2.4, after solving the proposed convex problems in (15) and (7), there exist multiple sets of weight matrices yielding the same optimal objective value. Therefore, to obtain good generalization performance on test data, we use heuristic methods for constructing the non-convex parameters {W^{(l)}}_{l=1}^L. Below, we provide details regarding the weight construction and review the baseline non-convex training methods.

Convex-Lasso: To solve problem (15), we approximate the arrangement patterns of the data matrix X by generating i.i.d. Gaussian weights G ∈ R^{d×P̃} and subsampling the arrangement patterns via 1{XG ≥ 0}. Then, we use G as the hidden layer weights to construct the network, and we repeat this process for every layer. Notice that we sample a fixed subset of arrangements instead of enumerating all P possible arrangements. Thus, this approach solves (15) approximately by subsampling its decision variables; nevertheless, it still performs significantly better than standard non-convex training.

Convex-PI: After solving (7), to recover the hidden layer weights of (2), we solve 1{Xw_j^{(1)} ≥ 0} = d_j as w_j^{(1)} = X†d_j, where X† denotes the pseudo-inverse of X. The resulting weights w_j^{(1)} enforce the preactivations Xw_j^{(1)} to be zero or one. Thus, if an entry is slightly above or below zero due to precision issues in the pseudo-inverse, it might give the wrong output after the threshold activation. To avoid such cases, we use a 0.5 threshold in the test phase, i.e., 1{X_test w_j^{(1)} ≥ 0.5}.

Convex-SVM: Another approach to solving 1{Xw_j^{(1)} ≥ 0} = d_j is to use Support Vector Machines (SVMs), which find the maximum margin vector. Particularly, we set the zero entries of d_i to -1 and then directly run the SVM to obtain the maximum margin hidden neurons corresponding to this arrangement. Since the labels are in the form {+1, -1} in this case, we do not need the additional thresholding of the previous approach.

Nonconvex-STE (Bengio et al., 2013): This is the standard non-convex training algorithm, where the threshold activation is replaced with the identity function during the backward pass.

STE Variants: We also benchmark against variants of STE. Specifically, we replace the threshold activation with ReLU (Nonconvex-ReLU (Yin et al., 2019a)), Leaky ReLU (Nonconvex-LReLU (Xiao et al.)), and clipped ReLU (Nonconvex-CReLU (Cai et al., 2017)) during the backward pass.

Synthetic Datasets: We compare the performance of Convex-PI and Convex-SVM trained via (7) with the non-convex heuristic methods mentioned above. We first run each non-convex heuristic with five different initializations and then plot the best performing one in Figure 1. This experiment clearly shows that the non-convex heuristics fail to achieve the globally optimal training performance provided by our convex approaches. For the same setup, we also compare the training and test accuracies for three different regimes, i.e., n > d, n = d, and n < d. As seen in Figure 2, our convex approaches not only globally optimize the training objective but also generalize well on the test data.

Real Datasets: In Table 2, we compare the test accuracies of two-layer threshold networks trained via our convex formulation in (15), i.e., Convex-Lasso, and the non-convex heuristics mentioned above. We repeat the simulations over multiple seeds and compare the mean/std of the test accuracies of the non-convex heuristics trained with SGD against our convex program in (6). Our convex approach achieves the highest test accuracy on 9 of 13 datasets, whereas the best non-convex heuristic achieves the highest test accuracy on only 4 datasets.
For this experiment, we use CIFAR-10 (Krizhevsky et al., 2014), MNIST (LeCun), and the datasets in the UCI repository (Dua & Graff, 2017), preprocessed as in Fernández-Delgado et al. (2014). Here, our convex training approach achieves the highest test accuracy for most of the datasets, while the non-convex heuristics perform well only on a few. This also validates the good generalization capabilities of the proposed convex training methods on real datasets.
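The Convex-PI reconstruction can be sketched as follows (our own minimal illustration with made-up names): in the overparameterized regime the pseudo-inverse solves Xw = d_j exactly, and the 0.5 test-time threshold guards against precision issues:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 8  # overparameterized: X has full row rank almost surely
X = np.hstack([rng.standard_normal((n, d - 1)), np.ones((n, 1))])

d_j = (rng.random(n) < 0.5).astype(float)  # a target arrangement pattern
w_j = np.linalg.pinv(X) @ d_j              # Convex-PI: least-squares solve of X w = d_j

pre = X @ w_j                               # preactivations are ~0 or ~1
recovered = (pre >= 0.5).astype(float)      # 0.5 threshold for numerical robustness
print(recovered, d_j)
```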

6. CONCLUSION

We proved that the training problem of regularized deep threshold networks can be equivalently formulated as a standard convex optimization problem with a fixed data matrix consisting of hyperplane arrangements determined by the data matrix and layer weights. Since the proposed formulation parallels the well-studied Lasso model, we have two major advantages over standard non-convex training methods: 1) we globally optimize the network without resorting to any optimization heuristic or extensive hyperparameter search (e.g., learning rate schedules and initialization schemes); 2) we efficiently solve the training problem using specialized solvers for Lasso. We also provided a computational complexity analysis and showed that the proposed convex program can be solved in polynomial time. Moreover, when a layer width exceeds a certain threshold, a simpler alternative convex formulation can be solved in O(n). Lastly, as a by-product of our analysis, we characterized the recursive process behind the set of hyperplane arrangements for deep networks. Even though this set grows rapidly as the network gets deeper, globally optimizing the resulting Lasso problem still requires polynomial-time complexity for fixed data rank. We also note that the convex analysis proposed in this work is generic in the sense that it can be applied to various architectures including batch normalization (Ergen et al., 2022b), vector output networks (Sahiner et al., 2020; 2021), polynomial activations (Bartan & Pilanci, 2021), GANs (Sahiner et al., 2022a), autoregressive models (Gupta et al., 2021), and Transformers (Ergen et al., 2022a; Sahiner et al., 2022b).

A.1 LEMMA 2.1

Proof of Lemma 2.1. Since the threshold activation is invariant to the scale of w_j^{(1)}, the regularization term on w_j^{(1)} can be made arbitrarily small without changing the network output; hence we can drop the penalty on w_j^{(1)} to simplify (19) as

P_2^{noncvx} = min_{W^{(1)}, s, w^{(2)}} (1/2) ||f_{θ,2}(X) - y||_2^2 + (β/2) Σ_{j=1}^m ( s_j^2 + |w_j^{(2)}|^2 ).

We now note that one can rescale the parameters as s̃_j = α_j s_j and w̃_j^{(2)} = w_j^{(2)}/α_j without changing the network's output, i.e., f_{θ̃,2}(X) = f_{θ,2}(X), where α_j > 0.
We next note that

(β/2) Σ_{j=1}^m ( s_j^2 + |w_j^{(2)}|^2 ) = (β/2) Σ_{j=1}^m ( α_j^2 s_j^2 + |w_j^{(2)}|^2 / α_j^2 ) ≥ β Σ_{j=1}^m |s_j| |w_j^{(2)}| = β Σ_{j=1}^m |s̃_j| |w̃_j^{(2)}|,

where the equality holds when α_j = √( |w_j^{(2)}| / |s_j| ). Thus, we obtain the following reformulation of the objective in (19), where the regularization term takes a multiplicative form:

P_2^{noncvx} = min_{W^{(1)}, s, w^{(2)}} (1/2) ||f_{θ,2}(X) - y||_2^2 + β Σ_{j=1}^m |s_j| |w_j^{(2)}|.   (20)

Next, we apply a change of variables in (20): s̃_j := s_j / |s_j| and w̃_j^{(2)} := w_j^{(2)} |s_j|. With this change of variables, (20) can be equivalently written as

P_2^{noncvx} = min_{W^{(1)}, w^{(2)}, s : |s_j|=1} (1/2) ||f_{θ,2}(X) - y||_2^2 + β Σ_{j=1}^m |w_j^{(2)}|.   (21)

This concludes the proof and yields the equivalent formulation of (21):

P_2^{noncvx} = min_{W^{(1)}, w^{(2)}, s : |s_j|=1} (1/2) ||f_{θ,2}(X) - y||_2^2 + β ||w^{(2)}||_1.

A.2 PROPOSITION 2.4

Proof of Proposition 2.4. Using the standard reparameterization of Lasso problems, we rewrite (9) as

min_{w_j^+, w_j^- ≥ 0} Σ_{j=1}^P (w_j^+ + w_j^-)   s.t.   Σ_{j=1}^P d_j (w_j^+ - w_j^-) = y.   (22)

We next introduce a slack variable t ≥ 0, so that (22) can be rewritten as

min_{t ≥ 0} min_{w_j^+, w_j^- ≥ 0} t   s.t.   Σ_{j=1}^P d_j (w_j^+ - w_j^-) = y,  Σ_{j=1}^P (w_j^+ + w_j^-) = t.   (23)

We now rescale w_j^+, w_j^- by 1/t to obtain the following equivalent of (23):

min_{t ≥ 0} min_{w_j^+, w_j^- ≥ 0} t   s.t.   Σ_{j=1}^P d_j (w_j^+ - w_j^-) = y/t,  Σ_{j=1}^P (w_j^+ + w_j^-) = 1.   (24)

The problem in (24) implies that

∃ w_j^+, w_j^- ≥ 0 s.t. Σ_j (w_j^+ + w_j^-) = 1 and Σ_j d_j (w_j^+ - w_j^-) = y/t   ⟺   y ∈ t·Conv{±d_j, ∀j ∈ [P]},

where Conv denotes the convex hull operation. Therefore, (9) can be equivalently formulated as

min_{t ≥ 0} t   s.t.   y ∈ t·Conv{±d_j, ∀j ∈ [P]}.

A.3 THEOREM 2.2

Proof of Theorem 2.2. We first remark that the activations can only be zero or one, i.e., σ_{s_j}(Xw_j^{(1)}) ∈ {0, 1}^n, since |s_j| = 1, ∀j ∈ [m]. Therefore, based on Lemma 2.1, we reformulate (4) as

P_2^{noncvx} = min_{d_{i_j} ∈ {d_1, ..., d_P}, w} (1/2) || [d_{i_1}, ..., d_{i_m}] w - y ||_2^2 + β ||w||_1,   (25)

where we denote the elements of the set H(X) by d_1, ..., d_P ∈ {0, 1}^n and P is the number of hyperplane arrangements. The above problem is similar to Lasso (Tibshirani, 1996), although it is non-convex due to the discrete variables d_{i_1}, ..., d_{i_m}. These m hyperplane arrangement patterns are discrete optimization variables along with the coefficient vector w. In fact, the above problem is equivalent to a cardinality-constrained Lasso problem. We have

P_2^{noncvx} = min_{r ∈ R^n, d_1,...,d_m ∈ H(X), w ∈ R^m : r = [d_1,...,d_m]w - y} (1/2) ||r||_2^2 + β ||w||_1
= min_{r, d_1,...,d_m, w} max_{z ∈ R^n} (1/2) ||r||_2^2 + β ||w||_1 + z^T (r + y) - z^T Σ_{i=1}^m d_i w_i
(i) ≥ min_{d_1,...,d_m ∈ H(X)} max_{z ∈ R^n} min_{r ∈ R^n, w ∈ R^m} z^T y + (1/2) ||r||_2^2 + z^T r + Σ_{i=1}^m ( β |w_i| - w_i z^T d_i )
= min_{d_1,...,d_m ∈ H(X)} max_{z : |z^T d_i| ≤ β, ∀i ∈ [m]} (1/2) ||y||_2^2 - (1/2) ||z - y||_2^2
(ii) ≥ max_{z : max_{d ∈ H(X)} |z^T d| ≤ β} (1/2) ||y||_2^2 - (1/2) ||z - y||_2^2
(iii) ≥ max_{z : |z^T d_i| ≤ β, ∀i ∈ [P]} (1/2) ||y||_2^2 - (1/2) ||z - y||_2^2 =: D_2^{cvx},

where inequality (i) holds by weak duality, inequality (ii) follows from augmenting the set of constraints |z^T d_i| ≤ β to hold for every sign pattern d ∈ H(X), and inequality (iii) follows from enumerating all possible sign patterns, i.e., H(X) = {d_1, d_2, ..., d_P}. We now prove that strong duality in fact holds, i.e., P_2^{noncvx} = D_2^{cvx}.

We first form the Lagrangian of the dual problem D_2^{cvx}:

D_2^{cvx} = max_{z ∈ R^n} min_{w_i, w_i' ≥ 0} (1/2) ||y||_2^2 - (1/2) ||z - y||_2^2 + Σ_{i=1}^P [ w_i (β - z^T d_i) + w_i' (β + z^T d_i) ]
(i) = min_{w_i, w_i' ≥ 0} max_{z ∈ R^n} (1/2) ||y||_2^2 - (1/2) ||z - y||_2^2 + Σ_{i=1}^P [ w_i (β - z^T d_i) + w_i' (β + z^T d_i) ]
= min_{w_i, w_i' ≥ 0} (1/2) || Σ_{i=1}^P d_i (w_i - w_i') - y ||_2^2 + β Σ_{i=1}^P (w_i + w_i')
= min_{w ∈ R^P} (1/2) ||Dw - y||_2^2 + β ||w||_1 = P_2^{cvx},   (26)

where equality (i) follows from the fact that D_2^{cvx} is a convex optimization problem satisfying Slater's condition, so that strong duality holds (Boyd & Vandenberghe, 2004), i.e., D_2^{cvx} = P_2^{cvx}, and the matrix D ∈ {0, 1}^{n×P} is defined as D := [d_1, d_2, ..., d_P].

A.4 THEOREM 2.3

Proof of Theorem 2.3. Following the proof of Theorem 2.2, we first note that strong duality holds for the problem in (25), i.e., P noncvx 2 = D cvx 2 . Under the assumption that H(X) = {0, 1} n , the dual constraint max d∈H(X) |z T d| ≤ β is equivalent to {max d∈[0,1] n z T d ≤ β} ∪ {max d ∈[0,1] n -z T d ≤ β}. Forming the Lagrangian based on this reformulation of the dual constraint, we have D cvx 2 = max z∈R n min t,t ≥0 1 2 y 2 2 - 1 2 z -y 2 2 + t(β -max d∈[0,1] n z T d) + t (β + max d ∈[0,1] n z T d ) = max z∈R n min t,t ≥0 d,d ∈[0,1] n 1 2 y 2 2 - 1 2 y -z 2 2 + z T (t d -td) + β(t + t) (i) = max z∈R n min t,t ≥0 d∈[0,t] n ,d ∈[0,t ] n 1 2 y 2 2 - 1 2 y -z 2 2 + z T (d -d) + β(t + t) (ii) = min t,t ≥0 d∈[0,t] n ,d ∈[0,t ] n max z∈R n 1 2 y 2 2 - 1 2 y -z 2 2 + z T (d -d) + β(t + t) = min t,t ≥0 d∈[0,t] n ,d ∈[0,t ] n 1 2 d -d -y 2 2 + β(t + t ) . where, in equality (i), we used the change of variables d ≡ td and d ≡ t d , and, in equality (ii), we used the fact that the objective function 1 2 y 2 2 -1 2 y -z 2 2 + z T (d -d) + β(t + t ) is strongly concave in z and convex in (d, d , t, t ) and the constraints are linear, so that strong duality holds and we can switch the order of minimization and maximization. Given a point (d, d , t, t )  , i.e., d ∈ [0, t] n and d ∈ [0, t ] n , we set δ = d -d. Note that (δ) + ∞ ≤ (d ) + ∞ = d ∞ ≤ t . Similarly, (-δ) + ∞ ≤ t. This implies that min t,t ≥0 d∈[0,t] n ,d ∈[0,t ] n 1 2 d -d -y 2 2 + β(t + t ) ≥ min δ∈R n 1 2 δ -y 2 2 + β( (δ) + ∞ + (-δ) + ∞ ) = P cvx v2 . Conversely, given δ ∈ R n , we set d = (δ) + , t = d ∞ , d = (-δ) + and t = d ∞ . It holds that (d, d , t, t ) is feasible with same objective value, and consequently, the above inequality is an equality, i.e., D cvx 2 = P cvx v2 . Optimal threshold network construction: We now show how to construct an optimal threshold network given an optimal solution to the convex problem (7). Let δ ∈ R n be an optimal solution. Set d = (δ) + and d = (-δ) + . 
We have $d \in [0, \|d\|_\infty]^n$ and $d' \in [0, \|d'\|_\infty]^n$, and since $d = (\delta)_+$ and $d' = (-\delta)_+$ have disjoint supports, for each index $i \in [n]$ at most one of $d_i$ and $d'_i$ is nonzero. Therefore, by Caratheodory's theorem, there exist $n_+, n_- \ge 1$ with $n_+ + n_- \le n$, binary vectors $d_1, \dots, d_{n_++1} \in \{0,1\}^n$ and weights $\gamma_1, \dots, \gamma_{n_++1} \ge 0$ with $\sum_{i=1}^{n_++1} \gamma_i = 1$ such that $d = \|d\|_\infty \sum_{i=1}^{n_++1} \gamma_i d_i$, and, similarly, binary vectors $d'_1, \dots, d'_{n_-+1} \in \{0,1\}^n$ and weights $\gamma'_1, \dots, \gamma'_{n_-+1} \ge 0$ with $\sum_{i=1}^{n_-+1} \gamma'_i = 1$ such that $d' = \|d'\|_\infty \sum_{i=1}^{n_-+1} \gamma'_i d'_i$. Then, we can pick $w^{(1)}_1, \dots, w^{(1)}_{n_++1}$, $w^{(2)}_1, \dots, w^{(2)}_{n_++1}$, $s_1, \dots, s_{n_++1}$ such that $1\{X w^{(1)}_i \ge 0\} = d_i$, $w^{(2)}_i = \|d\|_\infty \gamma_i$ and $s_i = 1$, and $w^{(1)}_{n_++2}, \dots, w^{(1)}_{n_++n_-+2}$, $w^{(2)}_{n_++2}, \dots, w^{(2)}_{n_++n_-+2}$, $s_{n_++2}, \dots, s_{n_++n_-+2}$ such that $1\{X w^{(1)}_i \ge 0\} = d'_{i-(n_++1)}$, $w^{(2)}_i = \|d'\|_\infty \gamma'_{i-(n_++1)}$ and $s_i = -1$, so that the resulting network outputs $d - d' = \delta$. Note that, given $\delta \in \mathbb{R}^n$, finding the corresponding $d_1, \dots, d_{n_++1}$, $\gamma_1, \dots, \gamma_{n_++1}$ and $d'_1, \dots, d'_{n_-+1}$, $\gamma'_1, \dots, \gamma'_{n_-+1}$ takes time $O(n_+ + n_-) = O(n)$.

A.5 LEMMA 3.1

Proof of Lemma 3.1. We first restate the standard weight decay regularized training problem below
$$
\min_{\theta \in \Theta} \frac{1}{2}\|f_{\theta,L}(X) - y\|_2^2 + \frac{\beta}{2} \sum_{k=1}^{m_{L-1}} \sum_{l=1}^{L} \big(\|W^{(l)}_k\|_F^2 + \|s^{(l)}_k\|_2^2\big). \tag{27}
$$
Then, we remark that the loss function in (27), i.e., $\frac{1}{2}\|f_{\theta,L}(X) - y\|_2^2$, is invariant to the norms of the hidden layer weights $\{\{W^{(l)}_k\}_{l=1}^{L-1}\}_{k=1}^{m_{L-1}}$ and amplitude parameters $\{\{s^{(l)}_k\}_{l=1}^{L-2}\}_{k=1}^{m_{L-1}}$. Therefore, (27) can be rewritten as
$$
\min_{\theta \in \Theta} \frac{1}{2}\|f_{\theta,L}(X) - y\|_2^2 + \frac{\beta}{2} \sum_{k=1}^{m_{L-1}} \Big( \big(w^{(L)}_k\big)^2 + \big(s^{(L-1)}_k\big)^2 \Big).
$$
Then, we can directly apply the scaling in the proof of Lemma 2.1 to obtain the claimed $\ell_1$-norm regularized optimization problem.

A.6 THEOREM 3.2

Proof of Theorem 3.2. We first remark that the last hidden layer activations can only be either zero or one, i.e., $X^{(L-1)}_k \in \{0,1\}^{n \times m_{L-1}}$, since $|s^{(L-1)}_k| = 1,\ \forall k \in [m_{L-1}]$.
Therefore, based on Lemma 3.1, we reformulate (12) as
$$
P^{\mathrm{noncvx}}_L = \min_{\substack{d_{i_j} \in \mathcal{H}_L(X) \\ w}} \frac{1}{2}\big\|[d_{i_1}, \dots, d_{i_{m_{L-1}}}]\, w - y\big\|_2^2 + \beta \|w\|_1. \tag{29}
$$
As in (25), the above problem is similar to the Lasso, although it is non-convex due to the discrete variables $d_{i_1}, \dots, d_{i_{m_{L-1}}}$. These $m_{L-1}$ hyperplane arrangement patterns are discrete optimization variables along with the coefficient vector $w$. We now note that (29) has the following dual problem
$$
D^{\mathrm{cvx}}_L = D^{\mathrm{cvx}}_2 = \max_{\max_{d \in \mathcal{H}_L(X)} |z^T d| \le \beta} -\frac{1}{2}\|z - y\|_2^2 + \frac{1}{2}\|y\|_2^2. \tag{30}
$$
Since this dual form is the same as the dual problem of two-layer networks, i.e., $D^{\mathrm{cvx}}_L = D^{\mathrm{cvx}}_2$, we directly follow the proof of Theorem 2.2 to obtain the following bidual formulation
$$
D^{\mathrm{cvx}}_L = P^{\mathrm{cvx}}_L = \min_{w \in \mathbb{R}^{P_{L-1}}} \frac{1}{2}\|D^{(L-1)} w - y\|_2^2 + \beta \|w\|_1,
$$
where $D^{(L-1)} = [d_1, d_2, \dots, d_{P_{L-1}}]$. Hence, based on the strong duality results in Ergen & Pilanci (2020); Pilanci & Ergen (2020); Ergen & Pilanci (2021c), there exists a threshold for the number of neurons, denoted as $m^*$, such that if $m_{L-1} \ge m^*$ then strong duality holds for the original non-convex training problem (25), i.e., $P^{\mathrm{noncvx}}_L = D^{\mathrm{cvx}}_L = P^{\mathrm{cvx}}_L$.

A.7 LEMMA 3.3

Proof of Lemma 3.3. Based on the construction in (Bartlett et al., 2019), a deep network with $L$ layers and $W$ parameters, where $W = \sum_{i=1}^{L} m_{i-1} m_i$ in our notation, can shatter a dataset of size $C_1 W L \log(W)$, where $C_1$ is a constant. Based on this complexity result, if we choose $m_l = m = C\sqrt{n}/L,\ \forall l \in [L]$, the number of samples that can be shattered by an $L$-layer threshold network is
$$
C_1 W L \log(W) = C_1 m^2 L^2 \log(m^2 L) = C_1 C^2 n \log\!\Big(\frac{C^2 n}{L}\Big) = n \log\!\Big(\frac{n}{C_1 L}\Big) \ \ge\ n,
$$
where we choose the constant $C = C_1^{-1/2}$, and the last inequality holds provided that $n \gg L$, which usually holds in practice. Therefore, as we make the architecture deeper, we can in fact improve on the assumption $m \ge n + 2$ from the two-layer case by benefiting from the depth.

A.8 PROPOSITION 3.5

Proof of Proposition 3.5.
Let us start with the $L = 2$ case, for which the hyperplane arrangement set can be described as detailed in (5):
$$
\mathcal{H}_2(X) = \{1\{X w^{(1)} \ge 0\} : \theta \in \Theta\} = \mathcal{H}(X) = \{d_1, \dots, d_{P_1}\} \subset \{0,1\}^n.
$$
Now, in order to construct $\mathcal{H}_3(X)$ (with $m_{L-1} = 1$, so we drop the index $k$), we need to choose $m_1$ arrangements from $\mathcal{H}_2(X)$ and then consider all hyperplane arrangements for each of these choices. Particularly, since $|\mathcal{H}_2(X)| = P_1$, where
$$
P_1 \le 2 \sum_{k=0}^{r-1} \binom{n-1}{k} \le 2r \Big(\frac{e(n-1)}{r}\Big)^r \approx O(n^r)
$$
from (8), we have $\binom{P_1}{m_1}$ choices, and each of them yields a different activation matrix denoted as $X^{(1)}_i \in \mathbb{R}^{n \times m_1}$. Based on the upper bound (8), each $X^{(1)}_i$ can generate $O(n^{m_1})$ patterns. Therefore, overall, the set of possible arrangements in the second layer is
$$
\mathcal{H}_3(X) = \bigcup_{i=1}^{\binom{P_1}{m_1}} \{1\{X^{(1)} w^{(2)} \ge 0\} : \theta \in \Theta\}\Big|_{X^{(1)} = X^{(1)}_i}.
$$

We consider datasets from the UCI Machine Learning Repository (Dua & Graff, 2017), MNIST (LeCun), and CIFAR-10 (Krizhevsky et al., 2014) such that 750 ≤ n ≤ 5000 (see Table 3 for the exact dataset sizes). We particularly consider a conventional binary classification framework with two-layer networks, perform simulations over multiple seeds, and compare the mean/std of the test accuracies of the non-convex heuristics trained with SGD, namely Nonconvex-STE (Bengio et al., 2013), Nonconvex-ReLU (Yin et al., 2019a), Nonconvex-LReLU (Xiao et al.) and Nonconvex-CReLU (Cai et al., 2017), with our convex program in (6), i.e., Convex-Lasso. For the non-convex training, we use the SGD optimizer. We use an 80%-20% split for the training and test sets of the UCI datasets. We tune the regularization coefficient β and the learning rate μ for the non-convex programs by performing a grid search over β_list = {1e-6, 1e-3, 1e-2, 1e-1, 0.5, 1, 5} and μ_list = {1e-3, 5e-3, 1e-2, 1e-1}, respectively. In all experiments, we also decay the selected learning rate systematically using PyTorch's (Paszke et al., 2019) scheduler ReduceLROnPlateau. Moreover, we choose the number of neurons m, the number of epochs ne, and the batch size bs as m = 1000, ne = 5000, bs = n, respectively.
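The non-convex baselines above all replace the zero-almost-everywhere gradient of the threshold activation with a surrogate gradient. As a reference point, here is a minimal numpy sketch of full-batch training with a straight-through estimator using the clipped-identity surrogate; the sizes, seed, and learning rate are our own illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr, steps = 40, 3, 8, 0.05, 300

# Toy data generated by a random teacher threshold network.
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, m))
y = (X @ W_true >= 0).astype(float) @ rng.normal(size=m)

W1 = rng.normal(size=(d, m))
w2 = np.zeros(m)
losses = []
for _ in range(steps):
    z = X @ W1
    a = (z >= 0).astype(float)        # threshold activation (zero gradient a.e.)
    resid = a @ w2 - y
    losses.append(0.5 * np.mean(resid ** 2))
    g_a = np.outer(resid, w2) / n     # exact gradient w.r.t. the activations
    g_z = g_a * (np.abs(z) <= 1)      # STE: clipped-identity surrogate gradient
    W1 -= lr * (X.T @ g_z)            # surrogate update for the hidden layer
    w2 -= lr * (a.T @ resid) / n      # exact update for the output layer
assert min(losses) < losses[0]
```

The surrogate makes the loss decrease on this toy problem, but, as the experiments in this section show, such heuristics carry no global optimality guarantee.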
Our convex approach achieves the highest test accuracy on 9 of the 13 datasets, whereas the best non-convex heuristic achieves the highest test accuracy on only 4 datasets. This experiment verifies that our convex training approach not only globally optimizes the training objective but also usually generalizes well on the test data. In addition, our convex training approach is significantly more time efficient than standard non-convex training.

Here, we compare the two-layer threshold network training performance of our convex program (6), which we call Convex-Lasso, with standard non-convex training heuristics, namely Nonconvex-STE (Bengio et al., 2013), Nonconvex-ReLU (Yin et al., 2019a), Nonconvex-LReLU (Xiao et al.) and Nonconvex-CReLU (Cai et al., 2017). For the non-convex heuristics, we train a standard two-layer threshold network with the SGD optimizer. In all experiments, learning rates are initialized to 0.01 and decayed systematically using PyTorch's (Paszke et al., 2019) scheduler ReduceLROnPlateau. To generate a dataset, we first sample an i.i.d. Gaussian data matrix X and then obtain the corresponding labels as y = sgn(tanh(XW^(1))w^(2)), where sgn and tanh are the sign and hyperbolic tangent functions, respectively. Here, we denote the ground truth parameters as W^(1) ∈ R^{d×m*} and w^(2) ∈ R^{m*}, where m* = 20 is the number of neurons in the ground truth model. Notice that we use tanh and sgn in the ground truth model to have a balanced label distribution y ∈ {+1, −1}^n. We also note that for all the experiments, we choose the regularization coefficient as β = 1e-3. We now emphasize that, to have a fair comparison with the non-convex heuristics, we first randomly sample a small subset of hyperplane arrangements and then solve (6) with this fixed small subset.
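This subsampling step is simple to reproduce: draw random hidden-layer weights, record the distinct activation patterns they induce, and use those columns in the Lasso problem (6). The sketch below is illustrative numpy with our own variable names; it also checks that the number of sampled patterns never exceeds the count bound (8).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d, m = 30, 4, 500
X = rng.normal(size=(n, d))

# Sample candidate neurons and keep the distinct patterns d_j = 1{X w_j >= 0}.
W = rng.normal(size=(d, m))
patterns = np.unique(((X @ W) >= 0).T.astype(int), axis=0)
D = patterns.T                        # n x (number of sampled arrangements)

# Sampling can never exceed the bound (8): P <= 2 * sum_{k < r} C(n-1, k).
r = np.linalg.matrix_rank(X)
bound = 2 * sum(comb(n - 1, k) for k in range(r))
assert D.shape[1] <= bound
```

The fixed matrix D then replaces the full arrangement matrix in (6), which becomes an ordinary Lasso problem in the coefficient vector w.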
Specifically, instead of enumerating all possible arrangements $\{d_i\}_{i=1}^{P}$, we randomly sample a subset $\{d_{i_j}\}_{j=1}^{m}$ to have a fair comparison with the non-convex neural network training with $m$ hidden neurons. So, Convex-Lasso is an approximate way to solve the convex program (6), yet it still performs extremely well in our experiments. We also note that the other convex approaches, Convex-PI and Convex-SVM, exactly solve the proposed convex programs. In Figure 6, we compare training performances. We solve the convex optimization problem (6) once; however, for the non-convex training heuristics, we try five different initializations and then select the trial with the lowest objective value to plot. In the figure, we plot the objective value defined in (3). As can be seen from the figures, Convex-Lasso achieves a much lower objective value than the non-convex heuristics in three different regimes, i.e., n < d, n = d, and n > d.

B.4 THREE-LAYER EXPERIMENTS WITH A REPRESENTATION MATRIX IN FIGURES 10, 11, 12, AND 13

We also compare our alternative convex formulation in (7) with non-convex approaches for three-layer network training.
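Problem (7) is cheap to solve without a general-purpose solver: for fixed $t = \|(\delta)_+\|_\infty$ and $t' = \|(-\delta)_+\|_\infty$, the optimal $\delta$ is the clipping of $y$ to $[-t', t]$, so the problem separates into two one-dimensional convex problems in $t$ and $t'$. The sketch below is our own reduction and naming, offered as an illustration; each one-dimensional piece is minimized by ternary search.

```python
import numpy as np

def solve_v2(y, beta, iters=100):
    """min_delta 0.5*||delta - y||^2 + beta*(||(delta)_+||_inf + ||(-delta)_+||_inf).
    For fixed (t, t') the minimizer is delta = clip(y, -t', t), so the problem
    splits into two 1-D convex problems, each solved here by ternary search."""
    def piece(v):  # min_{t >= 0} 0.5*sum((v - t)_+^2) + beta*t   (v = y or -y)
        f = lambda t: 0.5 * np.sum(np.maximum(v - t, 0) ** 2) + beta * t
        lo, hi = 0.0, max(v.max(), 0.0)
        for _ in range(iters):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if f(m1) <= f(m2):
                hi = m2
            else:
                lo = m1
        return (lo + hi) / 2
    t, tp = piece(y), piece(-y)
    return np.clip(y, -tp, t)

rng = np.random.default_rng(0)
y, beta = rng.normal(size=12), 0.3
delta = solve_v2(y, beta)
obj = lambda d: 0.5 * np.sum((d - y) ** 2) + beta * (
    np.maximum(d, 0).max() + np.maximum(-d, 0).max())
# The solution should beat the trivial candidates delta = 0 and delta = y.
assert obj(delta) <= obj(np.zeros_like(y)) + 1e-9
assert obj(delta) <= obj(y) + 1e-9
```

Once $\delta$ is obtained, the network itself is recovered via the construction in the proof of Theorem 2.3.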
For this comparison, we first generate a dataset via the ground truth model y = sgn(tanh(tanh(XW^(1))W^(2))w^(3)), where W^(1) ∈ R^{d×m₁*}, W^(2) ∈ R^{m₁*×m₂*} and w^(3) ∈ R^{m₂*}, and we choose m₁* = m₂* = 20 for all the experiments. As described in Theorem 2.3, we require the hyperplane arrangement set to be complete in this case. To ensure that, we transform the data matrix X using a random representation matrix. In particular, we first generate a random representation matrix H ∈ R^{d×M} and then multiply it with the data matrix, followed by a threshold function; effectively, we obtain a new data matrix X̃ = σ(XH). By choosing M large enough, M = 1000 in our experiments, we enforce X̃ to be full rank, so that the set of hyperplane arrangements H(X̃) is complete as assumed in Theorem 2.3. We apply the same steps for the non-convex training, i.e., we train the networks as if we perform two-layer network training on the data matrix X̃. Notice that we also provide a standard three-layer network training comparison, where all three layers are trainable, in Section B.5. More importantly, after solving the convex problem (7), we need to construct a neural network that gives the same objective value. As discussed in Section 2.4, there are numerous ways to reconstruct the non-convex network weights; we particularly use Convex-PI and Convex-SVM as detailed in Section 5, which appear to have good generalization performance. We also note that these approaches are exact in the sense that they globally optimize the objective. Additionally, we use n neurons in our reconstructed neural network to have a fair comparison with the non-convex training with n hidden neurons.
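The representation-matrix lift described above is straightforward to reproduce. A short sketch, with the sizes following the (n, d, M) = (100, 20, 1000) setting and a seed of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 100, 20, 1000

# Lift the data with a random representation matrix H:
# X_tilde = sigma(X H), with the threshold nonlinearity sigma(u) = 1{u >= 0}.
X = rng.normal(size=(n, d))
H = rng.normal(size=(d, M))
X_tilde = (X @ H >= 0).astype(float)

# With M large enough, X_tilde has full row rank n, which is what makes the
# arrangement set complete as required by Theorem 2.3.
assert np.linalg.matrix_rank(X_tilde) == n
```

The convex program (7) and the non-convex baselines are then both run on X_tilde in place of X.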


In Figure 10, we compare objective values and observe that Convex-PI and Convex-SVM, which solve the convex problem (7), obtain the globally optimal objective value, whereas all trials of the non-convex training heuristics get stuck at a local minimum. Figures 11, 12, and 13 show that our convex training approaches also yield better test performance in all cases, unlike the two-layer training in Section B.3. Again, for the testing phase, we generate 3000 samples via the ground truth model defined above.

In Figure 1, we compare the objective values and observe that our convex training approaches achieve a global optimum in all cases, unlike the non-convex training heuristics. We also provide the test and training accuracies for three-layer networks trained with different (n, d) pairs in Figure 2. In all cases, our convex approaches outperform the non-convex heuristics in terms of test accuracy.

In order to verify the effectiveness of the proposed convex formulations in training deeper networks, we consider a threshold network training problem for a 10-layer network, i.e., (10) with L = 10. We directly follow the same setup as Section B.5 for (n, d, β) = (100, 20, 1e-3). Since the network is significantly deeper than in the previous cases, all of the non-convex training heuristics failed to fit the training data, i.e., they could not achieve 100% training accuracy. Therefore, for the non-convex training, we also include Batch Normalization between the hidden layers to stabilize and improve the training performance. In Figure 14, we present the results for this experiment. Here, we observe that our convex training approaches provide globally optimal training performance and outperform the non-convex heuristics on the test metrics as well.



We provide additional experiments and details in Appendix B.



Figure 1: Training comparison of our convex program in (7) with the non-convex training heuristic STE. We also indicate the time taken to solve the convex programs with markers. For non-convex STE, we repeat the training with 5 different initializations. In each case, our convex training algorithms achieve lower objective than all the non-convex heuristics (see Appendix B.5 for details).

Figure 2: In this figure, we compare the classification performance of three-layer threshold networks trained with the setup described in Figure 1 for a single initialization trial. This experiment shows that our convex training approach not only provides the globally optimal training performance but also generalizes remarkably well on the test data (see Appendix B.5 for details).

Figure 4: Two and three dimensional visualizations of hidden layer representation space of the weight decay regularized threshold networks. Here, we visualize the convex hull in Proposition 2.4, which is the implicit representation space revealed by our analysis.

Now, based on the strong duality results in Ergen & Pilanci (2020); Pilanci & Ergen (2020), there exists a threshold for the number of neurons, denoted as m*, such that if m ≥ m*, then strong duality holds for the original non-convex training problem (25), i.e., the convex program is exactly equivalent to the original non-convex training problem (3).

Figure 6: Training performance comparison of our convex program for two-layer networks in (6) with four standard non-convex training heuristics (STE and its variants in Section 5). To generate the dataset, we randomly initialize a two-layer network and then sample a random i.i.d. Gaussian data matrix X ∈ R^{n×d}. Then, we set the labels as y = sgn(tanh(XW^(1))w^(2)), where sgn and tanh are the sign and hyperbolic tangent functions. We specifically choose the regularization coefficient β = 1e-3 and run the training algorithms with m = 1000 neurons for different (n, d) combinations. For each non-convex method, we repeat the training with 5 different initializations and then plot the best performing non-convex method for each initialization trial. In each case, our convex training algorithm achieves a significantly lower objective than all the non-convex heuristics. We also indicate the time taken to solve the convex program with a marker.

Figure 7: In this figure, we compare the classification performance for the simulation in Figure 6c for a single initialization trial, where n = 100 and d = 20. This experiment shows that our convex training approach not only provides the optimal training performance but also generalizes on the test data as demonstrated in (c).

Figure 8: The setup for these figures is completely the same as in Figure 7 except that we consider the (n, d) = (50, 50) case in Figure 6b. Unlike Figure 7, here even though our convex training approach provides the optimal training, the non-convex heuristic methods that are stuck at a local minimum yield a better test accuracy. This is due to the fact that we have fewer data samples with higher dimensionality compared to Figure 7.

Figure 9: The setup for these figures is completely the same as in Figure 7 except that we consider the (n, d) = (20, 100) case in Figure 6a. As in Figure 8, even though the non-convex heuristic training methods fail to globally optimize the objective, they yield higher test accuracies than the convex program due to the low data regime (n ≤ d).

Figure 10: Training comparison of our convex program in (7) with three-layer threshold networks trained with the non-convex heuristics. Here, we directly follow the setup in Figure 6 except for the following differences. This time we randomly initialize a three-layer network and then sample an i.i.d. Gaussian data matrix. We then set the labels as y = sgn(tanh(tanh(XW^(1))W^(2))w^(3)). To apply our convex formulation, we require H(X) to be complete. Thus, we first use a random representation matrix H ∈ R^{d×M}, where M = 1000, and then apply the transformation X̃ = σ(XH). We also apply this transformation to the non-convex methods as if we train a two-layer network on the modified data matrix X̃. We also indicate the time taken to solve the convex programs with markers. We again observe that our convex training approach achieves a lower objective value than all the non-convex heuristic training methods in all initialization trials.

Figure 11: In this figure, we compare the classification performance for the simulation in Figure 10c for a single initialization trial, where (n, d) = (100, 20). This experiment shows that our convex training approach not only provides the optimal training performance but also generalizes on the test data as demonstrated in (c).

Figure 12: The setup for these figures is completely the same as in Figure 11 except that we consider the (n, d) = (50, 50) case in Figure 10b. This experiment shows that when we have all possible hyperplane arrangements, our convex training approach generalizes better even in the low data regime (n ≤ d).

Figure 13: The setup for these figures is completely the same as in Figure 11 except that we consider the (n, d) = (20, 100) case in Figure 10a. This experiment also confirms the better generalization of our convex training approach, as in Figures 11 and 12.

Figure 14: Performance comparison of our convex training approaches trained via (7) and non-convex training heuristics for a 10-layer threshold network training problem. Here, we use the same setup as in Figure 1. As in the previous experiment, our convex training approaches outperform the non-convex heuristics in both training and test metrics.

We introduce polynomial-time trainable convex formulations of regularized deep threshold network training problems, provided that a layer width exceeds a threshold detailed in Table 1. • In Theorem 2.2, we prove that the original non-convex training problem for two-layer networks is equivalent to a standard convex optimization problem. • We show that deep threshold network training problems are equivalent to standard convex optimization problems.

Table 1: Summary of our results for the optimization of weight decay regularized threshold network training problems (n: # of data samples, d: feature dimension, m_l: # of hidden neurons in layer l, r: rank of the training data matrix, m*: critical width, i.e., the # of neurons, which obeys 0 ≤ m* ≤ n + 1).

where Conv(A) denotes the convex hull of a set A. This corresponds to the gauge function (see Rockafellar (2015)) of the hyperplane arrangement patterns and their negatives. We provide visualizations of the convex set Conv{±d_j, ∀j ∈ [P]} in Figures 3 and 4 (see Appendix) for Example 3.1.

Table 2: Test performance comparison on CIFAR-10 (Krizhevsky et al., 2014), MNIST (LeCun), and UCI datasets.

Table 3: Dataset sizes for the experiments in Table 2.

7. ACKNOWLEDGEMENTS

This work was partially supported by the National Science Foundation (NSF) under grants ECCS-2037304, DMS-2134248, NSF CAREER award CCF-2236829, the U.S. Army Research Office Early Career Award W911NF-21-1-0242, Stanford Precourt Institute, and the ACCESS -AI Chip Center for Emerging Smart Systems, sponsored by InnoHK funding, Hong Kong SAR.

Appendix Table of Contents

Here, we visualize the convex constraint in Proposition 2.4, which is the implicit representation revealed by convex analysis. We observe that the hidden representation space becomes richer with an additional nonlinear layer (from 2-layer to 3-layer), and the resulting symmetry in the convex hull enables the simpler convex formulation (7).

A PROOFS OF THE RESULTS IN THE MAIN PAPER

A.1 LEMMA 2.1

Proof of Lemma 2.1. We first restate the original training problem (3). As already noted in the main paper, the loss function is invariant to the norm of w_j^(1), and we may take ‖w_j^(1)‖ → 0 to reduce the regularization cost. Therefore, we can omit the corresponding regularization penalty, which implies the claimed reformulation.

This analysis can be recursively extended to the l-th layer, where the hyperplane arrangement set and the corresponding cardinality values can be computed analogously.

A.9 COROLLARY 3.4

Proof of Corollary 3.4. We first remark that the last hidden layer activations can only be either zero or one, i.e., X^(L−1) ∈ {0,1}^{n×m_{L−1}}. Therefore, based on Lemma 3.1, we reformulate (12) as a Lasso-like problem (31): as in (25), the problem is non-convex due to the discrete variables d_{i_1}, ..., d_{i_{m_{L−1}}}; these m_{L−1} hyperplane arrangement patterns are discrete optimization variables along with the coefficient vector w. We now note that (31) has the same dual problem as (30). Since H_L(X) = {0,1}^n by Lemma 3.3, all the steps in the proof of Theorem 2.3 directly follow.

Optimal deep threshold network construction: For the last two layers' weights, we exactly follow the weight construction procedure in the proof of Theorem 2.3. Let δ ∈ R^n be an optimal solution. Set d = (δ)₊ and d' = (−δ)₊, and decompose d and d' into convex combinations of binary arrangement patterns as in the proof of Theorem 2.3. Note that, given δ ∈ R^n, finding the corresponding d_1, ..., d_{n₊+1}, γ_1, ..., γ_{n₊+1} and d'_1, ..., d'_{n₋+1}, γ'_1, ..., γ'_{n₋+1} takes time O(n₊ + n₋) = O(n). Then, the rest of the layer weights can be reconstructed using the construction procedure detailed in (Bartlett et al., 2019).
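The decomposition of d = (δ)₊ into a convex combination of binary patterns used in this construction can be realized with a simple "layer-cake" scheme: peel off the distinct positive values of d one level at a time. The sketch below is one constructive choice of ours, not necessarily the minimal decomposition guaranteed by Caratheodory's theorem, but it uses at most n atoms and runs in near-linear time.

```python
import numpy as np

def binary_decomposition(d):
    """Layer-cake decomposition: d = ||d||_inf * sum_i gamma_i * b_i with
    b_i in {0,1}^n, gamma_i >= 0 and sum_i gamma_i = 1."""
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and d.max() > 0
    vals = np.unique(d[d > 0])          # 0 < v_1 < ... < v_k = ||d||_inf
    levels = np.concatenate(([0.0], vals))
    gammas = np.diff(levels) / d.max()  # gamma_i = (v_i - v_{i-1}) / ||d||_inf
    patterns = [(d >= v).astype(int) for v in vals]
    return gammas, patterns

d = np.array([0.0, 0.3, 0.7, 0.7, 1.2])
gammas, patterns = binary_decomposition(d)
recon = d.max() * sum(g * b for g, b in zip(gammas, patterns))
assert np.isclose(gammas.sum(), 1.0)
assert np.allclose(recon, d)
```

Each binary pattern then serves as an arrangement d_i realized by one hidden neuron, with γ_i ‖d‖_∞ as the corresponding output-layer weight.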

A.10 COROLLARY 4.1

Proof of Corollary 4.1. We first apply the scaling in Lemma 3.1 and then follow the same steps as in the proof of Theorem 2.3 to obtain the corresponding dual problem, where L* denotes the Fenchel conjugate of the loss L. Therefore, we obtain a generic version of the dual problem in (30), where we can arbitrarily choose the network's depth and the convex loss function. The rest of the proof directly follows from Theorem 3.2 and Corollary 3.4.

B ADDITIONAL EXPERIMENTS AND DETAILS

In this section, we present additional numerical experiments and further experimental details that are not presented in the main paper due to the page limit. We first note that all of the experiments in the paper are run on a single laptop with an Intel(R) Core(TM) i7-7700HQ CPU and 16GB of RAM.

B.1 EXPERIMENT IN FIGURE 5

Figure 5: Comparison of our convex training method in (6) with the standard non-convex training heuristic for threshold networks, known as the Straight-Through Estimator (STE) (Bengio et al., 2013). For the non-convex heuristic, we repeat the training process using 5 independent initializations; however, all trials fail to converge to the global minimum obtained by our convex method, and lack stability. We provide experimental details in Appendix B.

For the experiment in Figure 5, we consider a simple one-dimensional setting, where the data matrix is X = [−2, −1, 0, 1, 2]^T. Using this data matrix, we generate the corresponding labels y by forward propagating the data through a randomly initialized two-layer threshold network with m_1 = 2 neurons as described in (2). We then run our convex training method in (6) and the non-convex training heuristic STE (Bengio et al., 2013) on the objective with the regularization coefficient β = 1e-2. For a fair comparison, we use P_1 = 24 for the convex method and m_1 = 24 for STE. We also tune the learning rate of STE by performing a grid search over the set {5e-1, 1e-1, 5e-2, 1e-2, 5e-3, 1e-3}. As illustrated in Figure 5, the non-convex training heuristic STE fails to achieve the global minimum obtained by our convex training algorithm in all 5 initialization trials.
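For such a small one-dimensional dataset the distinct activation patterns can be enumerated exhaustively. The sketch below uses a simple bias-augmented parameterization 1{wx + b ≥ 0}, which is our own assumption (the P_1 = 24 above may count arrangements under a different parameterization); under it, the enumeration exactly meets a count bound of the form (8) with the lifted rank r = 2.

```python
import numpy as np
from math import comb

# The 1-D data from the experiment above.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n = len(x)

# Enumerate activation patterns 1{w*x + b >= 0} by sweeping the threshold -b/w.
patterns = set()
for w in (1.0, -1.0):
    for t in np.linspace(-3, 3, 1201):
        patterns.add(tuple((w * x - t >= 0).astype(int)))

# With the bias column appended, the lifted data [x, 1] has rank r = 2, so a
# bound of the form (8) gives 2 * sum_{k < 2} C(n-1, k) = 2 * (1 + 4) = 10.
r = 2
bound = 2 * sum(comb(n - 1, k) for k in range(r))
assert len(patterns) == bound == 10
```

Here every arrangement the bound allows is actually realized, so the convex program over the enumerated patterns is exact rather than approximate.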

B.2 EXPERIMENT IN TABLE 2

Here, we provide a test performance comparison on CIFAR-10 (Krizhevsky et al., 2014), MNIST (LeCun), and datasets taken from the UCI Machine Learning Repository (Dua & Graff, 2017), where our convex approach achieves higher test accuracies than all of the non-convex heuristics, which are further supported by Batch Normalization.

