GLOBALLY OPTIMAL TRAINING OF NEURAL NETWORKS WITH THRESHOLD ACTIVATION FUNCTIONS

Abstract

Threshold activation functions are highly desirable in neural networks due to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient-based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations, since the activation function has zero gradient except at a single non-differentiable point. To this end, we study weight decay regularized training problems of deep neural networks with threshold activations. We first show that regularized deep threshold network training problems can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method, provided that the last hidden layer width exceeds a certain threshold. We also derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network. We corroborate our theoretical results with various numerical experiments.

1. INTRODUCTION

In the past decade, deep neural networks have proven remarkably useful in solving challenging problems and have become popular in many applications. The choice of activation function plays a crucial role in their performance and practical implementation. In particular, even though neural networks with popular activation functions such as ReLU are successfully employed, they require advanced computational resources for training and evaluation, e.g., Graphics Processing Units (GPUs) (Coates et al., 2013). Consequently, training such deep networks is challenging, especially without sophisticated hardware. On the other hand, the threshold activation offers a multitude of advantages: (1) computational efficiency, (2) compression/quantization to a binary latent dimension, (3) interpretability. Unfortunately, gradient-based optimization methods fail on threshold activation networks because the gradient is zero almost everywhere. To close this gap, we analyze the training problem of deep neural networks with the threshold activation function defined as

σ_s(x) := s·1{x ≥ 0} = { s if x ≥ 0, 0 otherwise },  (1)

where s ∈ R is a trainable amplitude parameter for the neuron. Our main result is that globally optimal deep threshold networks can be trained by solving a convex optimization problem.
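As a concrete illustration, the threshold activation σ_s in (1) can be written in a few lines of NumPy (the function name and vectorized form below are illustrative choices, not from the paper):

```python
import numpy as np

def threshold_activation(x, s=1.0):
    """Threshold activation sigma_s(x) = s * 1{x >= 0}.

    Returns s where x >= 0 and 0 elsewhere. The output is piecewise
    constant in x, so its gradient is zero almost everywhere -- which
    is exactly why plain gradient descent fails on these networks.
    """
    return s * (np.asarray(x) >= 0).astype(float)

print(threshold_activation([-2.0, -0.5, 0.0, 3.0], s=2.0))  # -> [0. 0. 2. 2.]
```

Note that the amplitude s is the only continuous degree of freedom in the neuron's output, which is what the convex reformulations later exploit.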

1.1. WHY SHOULD WE CARE ABOUT THRESHOLD NETWORKS?

Neural networks with threshold activations are highly desirable for the following reasons:

• Since the threshold activation (1) is restricted to take values in {0, s}, threshold neural network models are far more suitable for hardware implementations (Bartlett & Downs, 1992; Corwin et al., 1994). Specifically, these networks have a significantly lower memory footprint, less computational complexity, and lower energy consumption (Helwegen et al., 2019).
• Modern neural networks have an extremely large number of full-precision trainable parameters, so several computational barriers emerge in hardware implementations. One approach to mitigate these issues is to reduce the network size by grouping the parameters via a hash function (Hubara et al., 2017; Chen et al., 2015). However, this still requires full-precision training before the application of the hash function and thus fails to remedy the computational issues. On the other hand, neural networks with threshold activations need a minimal number of bits.
• Another approach to reduce the complexity is to quantize the weights and activations of the network (Hubara et al., 2017), and the threshold activation is inherently in a two-level quantized form.
• The threshold activation is a valid model to simulate the behaviour of a biological neuron, as detailed in Jain et al. (1996). Therefore, progress in this research field could shed light on the connection between biological and artificial neural networks.
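As a rough illustration of the memory-footprint point above (the layer width of 1024 is an arbitrary example, not a figure from the paper), binary activations can be bit-packed at 1 bit each versus 32 bits for float32:

```python
import numpy as np

# Hypothetical layer with 1024 activations: float32 vs. bit-packed binary.
n = 1024
float_acts = np.random.randn(n).astype(np.float32)  # 32 bits per activation
binary_acts = float_acts >= 0                       # values in {0, 1}
packed = np.packbits(binary_acts)                   # 1 bit per activation

print(float_acts.nbytes, packed.nbytes)  # 4096 bytes vs. 128 bytes
```

The 32x reduction applies to the activations alone; quantizing weights as well compounds the savings.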

1.2. RELATED WORK

Although threshold networks are essential for several practical applications as detailed in the previous section, training their parameters is a difficult non-differentiable optimization problem due to the discrete nature of (1). For training deep neural networks with popular activations, the common practice is to use first-order gradient-based algorithms such as Gradient Descent (GD), since the well-known backpropagation algorithm efficiently calculates the gradient with respect to the parameters. However, the threshold activation in (1) has zero gradient everywhere except at a single non-differentiable point, zero, and therefore gradient-based algorithms cannot be directly used to train the parameters of the network. To remedy this issue, numerous heuristic algorithms have been proposed in the literature, as detailed below, but they still fail to globally optimize the training objective (see Figure 1). The Straight-Through Estimator (STE) is a widely used heuristic to train threshold networks (Bengio et al., 2013; Hinton, 2012). Since the gradient is zero almost everywhere, Bengio et al. (2013); Hinton (2012) proposed replacing the threshold activation with the identity function during the backward pass only. Later, this approach was extended to employ various forms of the ReLU activation function during the backward pass, e.g., clipped ReLU, vanilla ReLU, Leaky ReLU (Yin et al., 2019b; Cai et al., 2017; Xiao et al.). Additionally, clipped versions of the identity function have also been used as an alternative to STE (Hubara et al., 2017; Courbariaux et al., 2016; Rastegari et al., 2016).

Figure 1: Training comparison of our convex program in (7) with the non-convex training heuristic STE. We also indicate the time taken to solve the convex programs with markers. For the non-convex STE, we repeat the training with 5 different initializations. In each case, our convex training algorithms achieve a lower objective than all the non-convex heuristics (see Appendix B.5 for details).

1.3. CONTRIBUTIONS

• We introduce polynomial-time trainable convex formulations of regularized deep threshold network training problems, provided that a layer width exceeds a threshold detailed in Table 1.
• In Theorem 2.2, we prove that the original non-convex training problem for two-layer networks is equivalent to a standard convex optimization problem.
• We show that deep threshold network training problems are equivalent to standard convex optimization problems in Theorem 3.2. In stark contrast to two-layer networks, deep threshold networks can have a richer set of hyperplane arrangements due to multiple nonlinear layers (see Lemma 3.5).
• In Section 3.1, we characterize the evolution of the set of hyperplane arrangements, and consequently of the hidden-layer representation space, as a recursive process (see Figure 3) as the network gets deeper.
• We prove that when a certain layer width exceeds O(√n/L), the regularized L-layer threshold network training problem further simplifies to one that can be solved in O(n) time.
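To make the STE heuristic of Section 1.2 concrete, the following minimal sketch trains the bias of a single threshold neuron on toy 1-D data (the data, learning rate, and squared loss are illustrative choices, not the paper's experimental setup):

```python
import numpy as np

# Toy 1-D data: labels given by a step at x = 0.5.
x = np.linspace(-1.0, 1.0, 21)
y = (x >= 0.5).astype(float)

b = 0.0   # train only the bias of the neuron out = 1{x + b >= 0}
lr = 0.5
for _ in range(50):
    out = (x + b >= 0).astype(float)  # threshold forward pass
    err = out - y                     # residual of a squared loss
    # The true gradient d(out)/db is zero almost everywhere. The STE
    # pretends the threshold is the identity in the backward pass, so
    # the surrogate gradient of the loss w.r.t. b is simply mean(err).
    b -= lr * err.mean()

accuracy = ((x + b >= 0).astype(float) == y).mean()
```

On this one-parameter toy problem the surrogate gradient happens to find the correct decision boundary; as Figure 1 shows, on real deep threshold networks STE does not reach the global optimum.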

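The convex reformulations in our results have the flavor of a LASSO problem over a fixed dictionary of hyperplane-arrangement patterns. The following one-dimensional sketch conveys that flavor only; it is not the exact convex program (7), and the pattern enumeration and the plain ISTA solver are simplifying choices made for illustration:

```python
import numpy as np

# 1-D toy data; on a line, every arrangement pattern 1{x*w + b >= 0}
# is a threshold (or reversed threshold) pattern over the samples.
x = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Enumerate the distinct binary patterns realizable on this data.
cuts = np.concatenate(([x.min() - 1], (x[:-1] + x[1:]) / 2, [x.max() + 1]))
patterns = {tuple((x >= c).astype(float)) for c in cuts}
patterns |= {tuple((x < c).astype(float)) for c in cuts}
D = np.array(sorted(patterns)).T  # n x p dictionary of fixed patterns

# LASSO over the fixed dictionary: 0.5*||D z - y||^2 + beta*||z||_1,
# solved with plain ISTA (proximal gradient descent).
beta, z = 0.05, np.zeros(D.shape[1])
step = 1.0 / np.linalg.norm(D, 2) ** 2
for _ in range(2000):
    z = z - step * D.T @ (D @ z - y)                          # gradient step
    z = np.sign(z) * np.maximum(np.abs(z) - step * beta, 0.0)  # soft-threshold

fit = D @ z  # close to y, using a sparse set of active patterns
```

The key point is that once the arrangement patterns are fixed, the remaining problem in the coefficients z is convex; the weight-decay regularization of the original network appears as the l1 penalty.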
