GLOBALLY OPTIMAL TRAINING OF NEURAL NETWORKS WITH THRESHOLD ACTIVATION FUNCTIONS

Abstract

Threshold activation functions are highly desirable in neural networks owing to their efficiency in hardware implementations. Moreover, their mode of operation is more interpretable and resembles that of biological neurons. However, traditional gradient-based algorithms such as Gradient Descent cannot be used to train the parameters of neural networks with threshold activations, since the activation function has zero gradient except at a single non-differentiable point. To this end, we study weight-decay-regularized training problems of deep neural networks with threshold activations. We first show that the regularized deep threshold network training problem can be equivalently formulated as a standard convex optimization problem, which parallels the LASSO method, provided that the width of the last hidden layer exceeds a certain threshold. We also derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network. We corroborate our theoretical results with various numerical experiments.

1. INTRODUCTION

In the past decade, deep neural networks have proven remarkably successful at solving challenging problems and have become popular in many applications. The choice of activation function plays a crucial role in their performance and practical implementation. In particular, even though neural networks with popular activation functions such as ReLU are successfully employed, they require advanced computational resources for training and evaluation, e.g., Graphical Processing Units (GPUs) (Coates et al., 2013). Consequently, training such deep networks is challenging, especially without sophisticated hardware. On the other hand, the threshold activation offers a multitude of advantages: (1) computational efficiency, (2) compression/quantization to a binary latent dimension, and (3) interpretability. Unfortunately, gradient-based optimization methods fail to optimize threshold activation networks because the gradient is zero almost everywhere. To close this gap, we analyze the training problem of deep neural networks with the threshold activation function defined as
$$\sigma_s(x) := s\,\mathbb{1}\{x \ge 0\} = \begin{cases} s & \text{if } x \ge 0, \\ 0 & \text{otherwise}, \end{cases} \tag{1}$$
where $s \in \mathbb{R}$ is a trainable amplitude parameter for the neuron. Our main result is that globally optimal deep threshold networks can be trained by solving a convex optimization problem.
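As a concrete illustration of the definition in (1) and of why gradient-based training breaks down, below is a minimal NumPy sketch; the function names and toy input values are our own illustrative choices and are not part of the paper.

```python
import numpy as np

def threshold_activation(x, s):
    """Threshold activation sigma_s(x) = s * 1{x >= 0}, as in Eq. (1).

    x : array of pre-activations
    s : trainable amplitude parameter of the neuron
    """
    return s * (x >= 0).astype(float)

def threshold_grad_wrt_x(x, s):
    """Gradient of sigma_s with respect to x.

    The gradient is zero everywhere except at the single
    non-differentiable point x = 0, which is why plain gradient
    descent cannot propagate a learning signal through a
    threshold neuron.
    """
    return np.zeros_like(x, dtype=float)

# Toy example: the output depends only on the sign pattern of x.
x = np.array([-1.3, 0.0, 0.7, 2.5])
print(threshold_activation(x, s=0.5))   # [0.  0.5 0.5 0.5]
print(threshold_grad_wrt_x(x, s=0.5))   # [0. 0. 0. 0.]
```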

1.1. WHY SHOULD WE CARE ABOUT THRESHOLD NETWORKS?

Neural networks with threshold activations are highly desirable for the following reasons:

• Since the threshold activation (1) is restricted to take values in {0, s}, threshold neural network models are far more suitable for hardware implementations (Bartlett & Downs, 1992; Corwin et al., 1994). Specifically, these networks have a significantly lower memory footprint, lower computational complexity, and consume less energy (Helwegen et al., 2019).

• Modern neural networks have an extremely large number of full-precision trainable parameters, so several computational barriers emerge in hardware implementations. One approach to

