POLARITY IS ALL YOU NEED TO LEARN AND TRANSFER FASTER

Abstract

Natural intelligences (NIs) thrive in a dynamic world: they learn quickly, sometimes from only a few samples. In contrast, artificial intelligence (AI) has achieved super-human-level performance in certain settings, but typically at the cost of a prohibitive amount of training samples and computational power. What design-principle difference between NI and AI could contribute to such a discrepancy? Here, we propose an answer based on a simple observation about NIs: post-development, neuronal connections in the brain rarely switch polarity. We demonstrate with simulations that if weight polarities are adequately set a priori, then networks learn with less time and data. We extend these findings to image classification tasks and demonstrate that fixed polarity, not weight, is the more effective medium for knowledge transfer between networks. We also explicitly illustrate situations in which setting the weight polarities a priori is disadvantageous for networks. Our work illustrates the value of weight polarities from the perspective of statistical and computational efficiency during learning.

1. INTRODUCTION

Natural intelligences (NIs), including animals and humans, thrive in a dynamic world. Often, NIs learn quickly from just a few samples. Artificial intelligences (AIs), specifically deep neural networks (DNNs), can now compete with or even surpass humans in certain tasks, e.g., playing Go (Silver et al., 2017), object recognition (Russakovsky et al., 2015), protein folding analysis (Jumper et al., 2021), etc. However, DNNs achieve this only when a prohibitive amount of data and training resources are available. This gap in learning speed and data efficiency between NI and AI has baffled and motivated many AI researchers. A subfield of AI is dedicated to achieving few-shot learning using DNNs (Hoffer & Ailon, 2015; van der Spoel et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017), and many research teams have achieved impressive performance on benchmark datasets (Lazarou et al., 2022; Bendou et al., 2022). However, the products of these engineering efforts deviate greatly from the brain. What design-principle differences between NIs and AIs contribute to this learning-efficiency gap? In this paper, we propose one possibility: we could move AI one step closer to NI-level learning efficiency by applying just one simple design principle from NI. NIs are blessed with hundreds of millions of years of optimization through evolution. Through trial and error, the most survival-advantageous circuit configurations emerge, refine, and slowly come into a form that can thrive in an ever-changing world. These circuit configurations are embedded in our genetic code, establishing a blueprint to be carried out by development.
Among the many configurations, circuit rules, and principles that formed through evolution, one theme stands out, one that neuroscientists celebrate and yet the machine learning community overlooks: post-development, neuronal connections in the brain rarely switch polarity (Spitzer, 2017). After development, NIs learn and adapt through synaptic plasticity: a connection between a pair of neurons can change its strength but rarely its excitatory or inhibitory nature. In contrast, a connection (weight) between a pair of units in a DNN can freely change its sign (polarity). In fact, polarity change in the adult brain is hypothesized to be associated with depression, schizophrenia, and other illnesses (Spitzer, 2017). In the rare cases in which such changes have been observed, they never appeared in sensory and motor cortices (Spitzer, 2017), where visual, auditory, and motor processing take place. Fixing a network's connection polarity seems a rather rigid design choice, and we wonder why biological networks settled into such a learning strategy. Is it a mere outcome of an implementation-level constraint? It could simply be hard for synapses to switch polarities. Or could it be that NIs discovered that the polarity pattern is an effective medium for transferring knowledge across generations? This paper provides some thoughts and evidence relevant to these questions. We first investigate why brains pre-set neuronal connection polarities by assessing what we gain by setting weight polarity a priori (Sec 2-4). We discuss in theory the trade-off between representation capacity and learning speed when a network's weight polarity is fixed. We then propose an SGD-based polarity-fixed learning algorithm: Freeze-SGD. We show experimentally that if the weight polarities are adequately set a priori, then networks can learn with less time and data (a simulated task (Sec 2) and two image classification tasks (Sec 3)).
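The core idea of a polarity-fixed update can be sketched as follows. This is an illustrative guess at one natural implementation (a sign-preserving projection applied after each SGD step); the exact Freeze-SGD rule used in the experiments may differ.

```python
import numpy as np

def freeze_sgd_step(w, grad, polarity, lr=0.1):
    """One SGD step that preserves each weight's fixed polarity.

    w, grad, polarity are arrays of the same shape; polarity holds +/-1
    entries set a priori. After the gradient step, any weight that would
    cross zero is clipped to zero, so the sign pattern is never violated.
    """
    w_new = w - lr * grad
    # project back onto the fixed-polarity orthant
    return polarity * np.maximum(polarity * w_new, 0.0)

# usage: a weight with polarity +1 can shrink to zero but never turn negative
w = np.array([0.05, -0.3])
polarity = np.array([1.0, -1.0])
grad = np.array([1.0, -1.0])  # this gradient pushes both weights across zero
w = freeze_sgd_step(w, grad, polarity, lr=0.1)
# first weight is clipped to 0.0; second stays negative at -0.2
```

The projection keeps learning within the orthant selected by the polarity pattern, which is what shrinks the hypothesis space and, per the capacity-speed trade-off discussed in Sec 2, can speed up learning.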
We call a network with fixed polarity a Frozen-Net, and we discuss how the quality of the polarity configuration affects a DNN's learning efficiency (Sufficient-Polarity vs. RAND-Polarity). We further find that transferring and fixing polarities is superior even to transferring weights (Sec 4). Our results point to an unexplored direction for the machine learning community: polarity, not weight, may be the more effective and compressed medium for transferring knowledge between networks. To complete our discussion, we also examine what we lose when weight polarities are set a priori (Sec 5). Frozen-Nets have reduced representation capacity; therefore, a randomly configured network may not even have the capacity to represent a problem as simple as XOR (Def 2.2). We prove theoretically and show experimentally that if the polarities are set randomly, the probability that a single-hidden-layer network can learn XOR approaches one exponentially fast as a function of its size (i.e., number of hidden units). We also show experimentally that a sufficiently sized network, even when its polarities are randomly picked, can learn with less time and data than an equally sized network without fixed polarities. By discussing both the advantages and disadvantages of fixing weight polarity, we provide some insights on how to make AI more statistically and computationally efficient; we also show that the polarity pattern is an effective medium for transferring knowledge.
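To make the capacity point concrete, the following sketch exhibits one polarity pattern for a single-hidden-layer ReLU network that can represent XOR (first-layer signs all positive, second-layer signs (+, -)). This particular construction is our own illustration, not taken from the paper's proofs; note that a pattern with all-positive second-layer weights yields a monotone function of the hidden activations and thus cannot represent XOR.

```python
import itertools
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x, W1, b1, w2):
    """Single-hidden-layer ReLU network: P(x) = w2 . relu(W1 x + b1)."""
    return w2 @ relu(W1 @ x + b1)

# A polarity pattern that CAN represent XOR:
# XOR(x1, x2) = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # all-positive first-layer weights
b1 = np.array([0.0, -1.0])    # biases are left unconstrained
w2 = np.array([1.0, -2.0])    # mixed-sign second layer is essential

for x1, x2 in itertools.product([0, 1], repeat=2):
    y = net(np.array([x1, x2], dtype=float), W1, b1, w2)
    print((x1, x2), y)  # outputs 1 exactly when x1 != x2
```

With randomly drawn polarities, the chance that at least one such sufficient sub-pattern appears among the hidden units grows with width, which is the intuition behind the exponential convergence result above.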

2. WHAT DO WE GAIN BY SETTING WEIGHT POLARITY a priori?

Networks need both positive and negative weights to function (Wang et al., 2022): a DNN with all non-negative weights is not a universal approximator. Constraining a network's weight polarity pattern limits its representation capacity: when only half of the range is available to each connection, the reduction in the number of possible network patterns grows exponentially with the number of edges in the network. It seems counter-intuitive for any network to willingly give up such a vast portion of its representation capacity. Are they gaining elsewhere? Our thought is: maybe they learn faster. Below we provide theoretical discussion and experimental evidence.

Definition 2.1 (DNN). Let $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ be the input weights of layer $l$, $n_l$ the number of units in layer $l$, and $b^{(l)} \in \mathbb{R}^{n_l}$ the bias terms of layer $l$, for $l \in \{1, \dots, L\}$; $l = 0$ is the input layer, with $n_0 = \dim(x)$. Let layer $l$ compute $\sigma^{(l)}(x) = \sigma(W^{(l)} x + b^{(l)})$. Then $P(x) = \sigma^{(L)} \circ \sigma^{(L-1)} \circ \cdots \circ \sigma^{(1)}(x)$, with $P(x) \in \mathbb{R}^{n_L}$, and $F(x) = \mathrm{softmax}(P(x))$. The DNN is $F(x)$. For the non-linearity $\sigma$, we define it to be ReLU throughout the paper:
$$\sigma(x_j) = \begin{cases} 0 & x_j < 0 \\ x_j & x_j \geq 0 \end{cases}, \qquad \sigma(x) = (\sigma(x_1), \dots, \sigma(x_{n_l})).$$

Lemma 2.1 (capacity-speed trade-off). If the weight polarities are set a priori such that the function is still representable, then the network can learn faster.

We prove Lemma 2.1 for single-hidden-layer networks, under the following assumptions. Assumption 1: the weights take on discrete values; this essentially holds for all DNNs implemented on silicon chips, where all continuous variables are discretized. Assumption 2: exhaustive search is the learning algorithm. See the proof on page 20. The theory, albeit proved under constrained settings, argues for a trade-off between network representation capacity and learning speed. Next, we test with simulation to show that networks
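The forward pass of Definition 2.1 can be written out directly; this is a minimal sketch following the definition (ReLU applied at every layer, softmax on top of $P(x)$), with the example weights chosen arbitrarily for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """Compute F(x) = softmax(P(x)) for the DNN of Definition 2.1.

    weights[l-1] is W^{(l)} with shape (n_l, n_{l-1}); biases[l-1] is b^{(l)}.
    P(x) is the composition sigma^{(L)} o ... o sigma^{(1)}(x).
    """
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)  # sigma^{(l)}(x) = sigma(W^{(l)} x + b^{(l)})
    return softmax(a)

# usage with an arbitrary two-layer example (n_0 = n_1 = n_2 = 2)
weights = [np.array([[1.0, -1.0], [0.5, 0.5]]),
           np.array([[1.0, 0.0], [0.0, 1.0]])]
biases = [np.zeros(2), np.zeros(2)]
p = forward(np.array([1.0, 2.0]), weights, biases)
# p is a probability vector: non-negative entries summing to 1
```

Fixing the polarity pattern amounts to constraining the sign of each entry of every $W^{(l)}$ before training, while the biases $b^{(l)}$ remain unconstrained.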

