

Abstract

Neural networks have historically been built layer-wise from functions f : R^n → R^m, i.e. with activations and weights/parameters represented by real numbers, R. Our work considers a richer set of objects for activations and weights, and undertakes a comprehensive study of alternative algebras as number representations by studying their performance on two challenging problems: large-scale image classification using the ImageNet dataset, and language modelling using the enwik8 and WikiText-103 datasets. We denote this broader class of models AlgebraNets. Our findings indicate that the conclusions of prior work, which explored neural networks constructed from C (complex numbers) and H (quaternions) on smaller datasets, do not always transfer to these challenging settings. However, our results demonstrate that there are alternative algebras that deliver better parameter and computational efficiency than R. We consider C, H, M_2(R) (the set of 2 × 2 real-valued matrices), M_2(C), M_3(R), M_4(R), dual numbers, and the R^3 cross product. Additionally, we note that multiplication in these algebras has higher compute density than real multiplication, a useful property in situations with inherently limited parameter reuse, such as auto-regressive inference and sparse neural networks. We therefore investigate how to induce sparsity within AlgebraNets. We hope that our strong results on large-scale, practical benchmarks will spur further exploration of these unconventional architectures, which challenge the default choice of real numbers for neural network weights and activations.

1. Introduction

Nearly universally, the atomic building blocks of artificial neural networks are scalar real-valued weights and scalar real-valued neuron activations, which interact using the standard rules of multiplication and addition. We propose AlgebraNets, in which we replace the commonly used real-valued algebra with other associative algebras. Briefly, this amounts to replacing scalars with tuples and real multiplication with a tuple multiplication rule: for example, replacing each scalar weight and activation with a 2 × 2 matrix, and standard real addition / multiplication with matrix addition / multiplication. These alternative algebras provide three clear benefits for deep learning at scale:

Parameter efficiency. One sweeping benefit of AlgebraNets is that they are able to match baseline performance on a variety of tasks, spread over multiple domains, with fewer parameters than the competitive real-valued baselines. This means that equivalently capable models can be trained on smaller hardware, and that for a given amount of memory, a model with greater effective capacity can be trained. We find some variants of AlgebraNets that are more parameter efficient than the previously considered C and H algebras. Throughout the text, we count parameters as the total number of real values, e.g. a complex number counts as two parameters.

Computational efficiency. For scaling large models, parameter efficiency is not the only bottleneck: FLOP efficiency, i.e. reducing the relative number of floating-point operations needed to achieve an equivalent accuracy, is also important. We find instantiations of AlgebraNets that are more FLOP efficient than the previously considered C and H algebras, and as FLOP efficient as R. Additionally, all of the proposed algebras offer parameter reuse greater than 1 (see Table 1). That is, the ratio of multiplications performed to values consumed is greater than or equal to 1:1; by contrast, for multiplication in R it is only 1:2.
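The parameter-reuse ratios above can be checked with a short script. A minimal sketch in pure Python, counting real multiplications per real value consumed for the naive multiplication algorithm in each algebra; the helper names are ours, not the paper's:

```python
# Parameter reuse for a single multiply in each algebra: real
# multiplications performed per real value loaded from memory.

def reuse_ratio(mults, values):
    """Multiplications performed per real value consumed."""
    return mults / values

# R: x * y is 1 multiply consuming 2 values -> 1:2.
real = reuse_ratio(1, 2)

# C: (a+bi)(c+di) = (ac - bd) + (ad + bc)i,
# 4 real multiplies consuming 4 values -> 1:1.
complex_ = reuse_ratio(4, 4)

# H: the naive quaternion product uses 16 real multiplies
# on 2 * 4 = 8 values -> 2:1.
quaternion = reuse_ratio(16, 8)

# M_n(R): the naive n x n matrix product uses n^3 multiplies
# on 2 * n^2 values -> n:2.
def matrix_reuse(n):
    return reuse_ratio(n ** 3, 2 * n ** 2)

print(real, complex_, quaternion, matrix_reuse(2), matrix_reuse(4))
# -> 0.5 1.0 2.0 1.0 2.0
```

Note how reuse grows linearly in n for M_n(R), which is why larger matrix algebras are attractive for bandwidth-bound workloads.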
Modern hardware requires a high ratio of floating-point operations to bytes loaded (bandwidth) to become compute bound and saturate the arithmetic units. This is particularly problematic for auto-regressive inference (dominated by matrix-vector multiplies), sparse models, depthwise convolutions, and other operations with low arithmetic density.

Architectural exploration. The choice of real numbers for weights and activations is usually taken for granted (with some exceptions, e.g. those discussed in Sec. 3). With AlgebraNets, we challenge this established design choice and open up a vast new space for neural network architecture exploration by showing that real numbers can easily be replaced with a variety of algebraic structures. Leveraging these new building blocks, one can consider different algebraic interactions, different choices of non-linearity, and different network architecture choices. Importantly, as we demonstrate in this work, AlgebraNets are not only scalable to large models and complex tasks; they in fact offer improvements in model efficiency, which makes them a viable practical choice. We believe we have only begun to scratch the surface of what these alternative building blocks can enable, and we hope that their broader adoption will usher in further progress across the field.

In summary, our main contributions are as follows:

• We propose AlgebraNets, a novel class of neural networks that replaces the nearly ubiquitous real algebra with alternatives. We show that, in contrast to previous work (Trabelsi et al., 2018; Gaudet and Maida, 2018; Wu et al., 2020; Pan et al., 2019), algebra-specific initializations and the replacement of batch normalization by an expensive whitening procedure are not necessary, making AlgebraNets a near drop-in replacement for real-valued networks.
• We evaluate AlgebraNets based on a wide range of algebras on three challenging large-scale benchmarks: ImageNet image classification (Russakovsky et al., 2015), enwik8 (LLC, 2009), and WikiText-103 language modelling (Merity et al., 2016).
• We explore sparse AlgebraNets to take advantage of their higher compute density.
• We find that AlgebraNets offer improved parameter efficiency and FLOP parity compared to the real-valued baselines, which establishes them as a viable choice for efficient deep learning at scale.

2. AlgebraNets

2.1 What is an Algebra?

We consider algebras because they have the right properties to make them a drop-in replacement for real numbers in typical neural networks. This is not surprising, as the real numbers are an algebra over themselves. An algebra A over a field K (which we take to always be the field of real or complex numbers) satisfies the following properties [1] (Wikipedia contributors, 2020b;a):

1. It is a vector space over K:
   • It has an associative and commutative addition operator with an identity element (x + 0 = x) and inverse elements (x + (-x) = 0).
   • It is possible to multiply elements of the field K with vectors. [2]
2. There is a right- and left-distributive multiplication operator • over vectors, closed in A.
3. Scalar multiplication combines with • in a compatible way: (ax) • (by) = (ab)(x • y).

We do not claim that these properties are all required of neural network building blocks, merely that they are convenient. For example, one could imagine not having associative addition; this would require a careful implementation to get right, but is possible. One could eliminate the requirement that scalars from K multiply with vectors from A; this would make various normalizations (e.g. batch normalization) impossible, but they are not required. Most importantly, removing some of these requirements does not lead to an obviously useful class of mathematical objects to consider. In addition to the previously considered C and H algebras, we also consider the algebras of n × n matrices over R and C (i.e. M_n(R) or M_n(C)), as they have higher compute density than R and map well to the matrix multiplication units that are becoming common in processors (Oh, 2019). We note

[1] We use the term 'vector' in this definition as that is the generally accepted mathematical term; throughout the rest of the paper we use the term 'tuple' instead. Calling a matrix a vector is technically correct in this context, but rife with potential for confusion.
[2] a(bx) = (ab)x; 1x = x, where 1 is the multiplicative identity in K; a(x + y) = ax + ay; (a + b)x = ax + bx.
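As a concrete illustration of the "scalars become tuples" substitution, the following is a minimal pure-Python sketch of a dot product over M_2(R): each weight and activation is a 2 × 2 real matrix stored as a flat 4-tuple, with matrix multiplication replacing real multiplication and elementwise addition replacing real addition. The helper names are ours, for illustration only, and are not from the paper:

```python
def mat_mul(x, y):
    """Multiply two 2x2 matrices given as row-major flat 4-tuples."""
    a, b, c, d = x
    e, f, g, h = y
    return (a * e + b * g, a * f + b * h,
            c * e + d * g, c * f + d * h)

def mat_add(x, y):
    """Elementwise (matrix) addition of two flat 4-tuples."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def algebra_dot(weights, activations):
    """Dot product over M_2(R): sum_i w_i . a_i, the tuple analogue
    of the real-valued dot product inside a dense layer."""
    out = (0.0, 0.0, 0.0, 0.0)
    for w, a in zip(weights, activations):
        out = mat_add(out, mat_mul(w, a))
    return out

# The identity matrix plays the role of 1 in R.
I = (1.0, 0.0, 0.0, 1.0)
print(mat_mul(I, (1.0, 2.0, 3.0, 4.0)))  # -> (1.0, 2.0, 3.0, 4.0)
```

The distributivity and compatibility axioms above can be checked directly on such tuples, which is what makes the substitution into existing layer definitions mechanical.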

