

Abstract

Neural networks have historically been built layerwise from the set of functions f : R^n → R^m, i.e., with activations and weights/parameters represented by real numbers, R. Our work considers a richer set of objects for activations and weights, and undertakes a comprehensive study of alternative algebras as number representations by studying their performance on two challenging problems: large-scale image classification using the ImageNet dataset and language modeling using the enwik8 and WikiText-103 datasets. We denote this broader class of models as AlgebraNets. Our findings indicate that the conclusions of prior work, which explored neural networks constructed from C (complex numbers) and H (quaternions) on smaller datasets, do not always transfer to these challenging settings. However, our results demonstrate that there are alternative algebras which deliver better parameter and computational efficiency than R. We consider C, H, M_2(R) (the set of 2 × 2 real-valued matrices), M_2(C), M_3(R), M_4(R), dual numbers, and the R^3 cross product. Additionally, we note that multiplication in these algebras has higher compute density than real multiplication, a useful property in situations with inherently limited parameter reuse, such as auto-regressive inference and sparse neural networks. We therefore investigate how to induce sparsity within AlgebraNets. We hope that our strong results on large-scale, practical benchmarks will spur further exploration of these unconventional architectures, which challenge the default choice of real numbers for neural network weights and activations.

1. Introduction

Nearly universally, the atomic building blocks of artificial neural networks are scalar real-valued weights and scalar real-valued neuron activations that interact using the standard rules of multiplication and addition. We propose AlgebraNets, in which we replace the commonly used real-valued algebra with other associative algebras. Briefly, this amounts to replacing scalars by tuples and real multiplication by a tuple multiplication rule: for example, replacing each scalar weight and activation with a 2 × 2 matrix, and standard real addition and multiplication with matrix addition and multiplication. These alternative algebras provide three clear benefits for deep learning at scale:

Parameter efficiency. One sweeping benefit of AlgebraNets is that they are able to match baseline performance on a variety of tasks, spread over multiple domains, with fewer parameters than the competitive real-valued baselines. This means that equivalently capable models can be trained on smaller hardware, and that for a given amount of memory, a model with greater effective capacity can be trained. We find some variants of AlgebraNets that are more parameter efficient than the previously considered C and H algebras. Throughout the text, we count parameters as the total number of real values, e.g., a complex number counts as two parameters.

Computational efficiency. For scaling large models, parameter efficiency is not the only bottleneck: FLOP efficiency, i.e., reducing the relative number of floating-point operations needed to achieve an equivalent accuracy, is also important. We find instantiations of AlgebraNets that are more FLOP efficient than the previously considered C and H algebras and as FLOP efficient as R. Additionally, all of the proposed algebras offer greater parameter reuse than R (see Table 1): the ratio of multiplications performed to real values consumed is at least 1:1, whereas for multiplication in R it is only 1:2.
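To make the tuple-multiplication idea concrete, the following is a minimal NumPy sketch (the function name and shapes are our own illustration, not the paper's implementation) of a linear layer over M_2(R), where every "scalar" weight and activation is a 2 × 2 real matrix:

```python
import numpy as np

def m2_linear(x, W):
    """Linear layer over M_2(R): each weight/activation is a 2x2 matrix.

    x: activations, shape (n_in, 2, 2)
    W: weights, shape (n_out, n_in, 2, 2)
    Returns activations of shape (n_out, 2, 2), where
    output[o] = sum_i W[o, i] @ x[i] (matrix product replaces real multiply).
    """
    # Contract over n_in and the inner 2x2 dimension in one einsum.
    return np.einsum('oiab,ibc->oac', W, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2, 2))    # 4 matrix-valued inputs  = 16 real values
W = rng.normal(size=(3, 4, 2, 2)) # 3x4 matrix-valued weights = 48 real values
y = m2_linear(x, W)
print(y.shape)  # (3, 2, 2)
```

Note the compute density: each 2 × 2 matrix product performs 8 real multiplications while consuming 8 real values (a 1:1 ratio), whereas a real product performs 1 multiplication for 2 values loaded (1:2).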
Modern hardware requires a high ratio of floating-point operations to bytes loaded (bandwidth) to become compute bound and saturate the arithmetic units. This is

