

Abstract

Neural networks have historically been built layerwise from the set of functions f : R^n → R^m, i.e. with activations and weights/parameters represented by real numbers, R. Our work considers a richer set of objects for activations and weights, and undertakes a comprehensive study of alternative algebras as number representations by studying their performance on two challenging problems: large-scale image classification using the ImageNet dataset and language modeling using the enwik8 and WikiText-103 datasets. We denote this broader class of models as AlgebraNets. Our findings indicate that the conclusions of prior work, which explored neural networks constructed from C (complex numbers) and H (quaternions) on smaller datasets, do not always transfer to these challenging settings. However, our results demonstrate that there are alternative algebras which deliver better parameter and computational efficiency than R. We consider C, H, M2(R) (the set of 2 × 2 real-valued matrices), M2(C), M3(R), M4(R), dual numbers and the R^3 cross product. Additionally, we note that multiplication in these algebras has higher compute density than real multiplication, a useful property in situations with inherently limited parameter reuse such as auto-regressive inference and sparse neural networks. We therefore investigate how to induce sparsity within AlgebraNets. We hope that our strong results on large-scale, practical benchmarks will spur further exploration of these unconventional architectures, which challenge the default choice of using real numbers for neural network weights and activations.

1. Introduction

Nearly universally, the atomic building blocks of artificial neural networks are scalar real-valued weights and scalar real-valued neuron activations that interact using the standard rules of multiplication and addition. We propose AlgebraNets, in which we replace the commonly used real-valued algebra with other associative algebras.
Briefly, this amounts to replacing scalars by tuples and real multiplication by a tuple multiplication rule: for example, replacing each scalar weight and activation with a 2 × 2 matrix, and standard real addition/multiplication with matrix addition/multiplication. These alternative algebras provide three clear benefits for deep learning at scale:

Parameter efficiency. One sweeping benefit of AlgebraNets is that they are able to match baseline performance on a variety of tasks, spread over multiple domains, with fewer parameters than competitive real-valued baselines. This means that equivalently capable models can be trained on smaller hardware, and that for a given amount of memory, a model with greater effective capacity can be trained. We find variants of AlgebraNets that are more parameter efficient than the previously considered C and H algebras. Throughout the text, we count parameters as the total number of real values, e.g. a complex number counts as two parameters.

Computational efficiency. For scaling large models, parameter efficiency is not the only bottleneck: FLOP efficiency (reducing the relative number of floating-point operations needed to achieve an equivalent accuracy) is also important. We find instantiations of AlgebraNets that are more FLOP efficient than the previously considered C and H algebras and as FLOP efficient as R. Additionally, all of the proposed algebras offer parameter reuse greater than that of R (see Table 1): the ratio of multiplications performed to values loaded is greater than or equal to 1:1, whereas for multiplication in R it is only 1:2. Modern hardware requires a high ratio of floating-point operations to bytes loaded (bandwidth) to become compute bound and saturate its arithmetic units. This is



particularly problematic for auto-regressive inference (dominated by matrix-vector multiplies), sparse models, depthwise convolutions and other operations with low arithmetic density.

Architectural exploration. The choice of real numbers for weights and activations is usually taken for granted (with some exceptions, e.g. those discussed in Sec. 3). With AlgebraNets, we challenge this established design choice and open up a vast new space for neural network architecture exploration by showing that real numbers can easily be replaced with a variety of algebraic structures. Leveraging these new building blocks, one can consider different algebraic interactions, different choices of non-linearities, and different network architecture choices. Importantly, as we demonstrate in this work, AlgebraNets are not only scalable to large models and complex tasks; they in fact offer improvements in model efficiency, which makes them a viable practical choice. We believe we have only begun to scratch the surface of what these alternative building blocks can enable, and we hope that their broader adoption will usher in further progress across the field.

In summary, our main contributions are as follows:

• We propose AlgebraNets, a novel class of neural networks that replaces the nearly ubiquitous real algebra with alternatives. We show that, in contrast to previous work, algebra-specific initializations and the replacement of batch normalization by an expensive whitening procedure (Trabelsi et al., 2018; Gaudet and Maida, 2018; Wu et al., 2020; Pan et al., 2019) are not necessary, making AlgebraNets a near drop-in replacement for real-valued networks.

• We evaluate AlgebraNets based on a wide range of algebras on three challenging large-scale benchmarks: ImageNet image classification (Russakovsky et al., 2015), Enwik8 (LLC, 2009), and WikiText-103 language modelling (Merity et al., 2016).

• We explore sparse AlgebraNets to take advantage of their higher compute density.
• We find that AlgebraNets offer improved parameter efficiency and FLOP parity compared to the real-valued baselines, which establishes them as a viable choice for efficient deep learning at scale.

2. AlgebraNets

2.1. Why Algebras?

We consider algebras because they have the right properties to make them a drop-in replacement for real numbers in typical neural networks. This is not surprising, as the real numbers are an algebra over themselves. An algebra A over a field K (which we take to always be the field of real or complex numbers) satisfies the following properties¹ (Wikipedia contributors, 2020b;a):

1. It is a vector space over K:
• it has an associative and commutative addition operator with an identity element (x + 0 = x) and inverse elements (x + (-x) = 0);
• elements of the field K can multiply vectors.²

2. There is a multiplication operator • on vectors, closed in A, that is both left and right distributive over addition.

3. Scalar multiplication combines with • in a compatible way: (ax) • (by) = (ab)(x • y).

We do not claim that these properties are all required of neural network building blocks, merely that they are convenient. For example, one could imagine not having associative addition; this would require a careful implementation to get right, but it is possible. One could eliminate the requirement that scalars from K multiply with vectors from A; this would make various normalizations (e.g. batch normalization) impossible, but they are not required. Most importantly, removing some of these requirements does not lead to an obviously useful class of mathematical objects to consider.

In addition to the previously considered C and H algebras, we also consider the algebras of n × n matrices over R and C (i.e. Mn(R) or Mn(C)), as they have higher compute density than R and map well to the matrix multiplication units that are becoming common in processors (Oh, 2019). We note

¹We use the terminology 'vector' in the definition as that is the generally accepted mathematical term; throughout the rest of the paper, however, we use the term 'tuple' instead. This avoids calling a matrix a vector, which is technically correct in this context but rife with potential for confusion.
²a(bx) = (ab)x; 1x = x for 1, the multiplicative identity in K; a(x + y) = ax + ay; (a + b)x = ax + bx

that M2(R) is isomorphic to the split-quaternion (Cockle, 1849) algebra and M2(C) is isomorphic to the biquaternion (Hamilton, 1844) algebra, but the matrix algebras are more familiar, so we retain that terminology. Lastly, we consider the dual numbers and the cross product of length-3 tuples.

2.2. The Algebras

We provide tables describing the multiplicative interaction between tuples. The interaction between two tuples, (o_a, o_b, ...) = (t_a, t_b, ...) • (v_a, v_b, ...), is described by a matrix where the indices of t are on the left, the indices of v are on the top, and each entry indicates which component of o that interaction contributes to. A 0 means there is no interaction, and a negative index means the result is subtracted.

C; Complex Numbers. Each weight, w, is a length-2 tuple (t_a, t_b) representing the complex number t_a + t_b i. For two weight values we load 4 scalars and perform 4 multiplies.

Mn(R); n × n Matrices. Each weight is a length-n² tuple representing an n × n matrix. Multiplication and addition proceed with the standard rules for matrices. We consider up to M4(R). For two weight values we load 2n² scalars and perform n³ multiplies.

Mn(C); n × n Complex Matrices. Weights are length-2n² tuples representing n × n complex-valued matrices. We consider only n = 2. For two weight values we load 4n² scalars and perform 4n³ multiplies. The multiplication table is in Appendix A.

H; Quaternions. Each weight, w_i, is replaced by a length-4 tuple (t_a, t_b, t_c, t_d). Multiplication is not commutative; the product of two quaternions is given by the Hamilton product (Hamilton, 1843). For two weight values, we load 8 scalars and perform 16 multiplies. The multiplication table is:

      a   b   c   d
  a   a   b   c   d
  b   b  -a   d  -c
  c   c  -d  -a   b
  d   d   c  -b  -a
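As an illustration, the multiplication rules above can be sketched in a few lines of NumPy. This is a hypothetical sketch for clarity, not the paper's implementation:

```python
import numpy as np

def mul_complex(t, v):
    # C: (a, b) represents a + bi; 4 scalars loaded, 4 real multiplies.
    return np.array([t[0] * v[0] - t[1] * v[1],
                     t[0] * v[1] + t[1] * v[0]])

def mul_m2r(t, v):
    # M2(R): length-4 tuples reshaped to 2x2 matrices;
    # 8 scalars loaded, n^3 = 8 real multiplies.
    return (np.reshape(t, (2, 2)) @ np.reshape(v, (2, 2))).reshape(4)

def mul_quaternion(t, v):
    # H: Hamilton product of a + bi + cj + dk;
    # 8 scalars loaded, 16 real multiplies.
    a1, b1, c1, d1 = t
    a2, b2, c2, d2 = v
    return np.array([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])
```

The Hamilton product above reproduces the table row by row, e.g. i • j = k corresponds to `mul_quaternion([0,1,0,0], [0,0,1,0])` returning the k component.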

Diagonal Algebra

The high FLOP cost of the whitening operation required by (Trabelsi et al., 2018; Gaudet and Maida, 2018; Wu et al., 2020; Pan et al., 2019) makes networks that use it inefficient at training and inference in terms of FLOPs. We attempt to design an algebra for which whitening would in fact be competitive, by eliminating the interaction of terms through the algebra itself: in the 'diagonal' algebra D, tuple components multiply independently, so only when combining D with whitening are there interactions between the different tuple components.

Dual Numbers. Each weight is represented by a length-2 tuple (t_a, t_b) representing the dual number t_a + t_b ε (with ε² = 0). For a multiplication, we load 4 values and perform 3 multiplies. The multiplication table is:

      a   b
  a   a   b
  b   b   0

R³ Cross Product. Each weight is represented by a length-3 tuple. We use the cross product between two tuples as the multiplication rule, resulting in 6 multiplies for 6 values loaded. The multiplication table is:

      a   b   c
  a   0   c  -b
  b  -c   0   a
  c   b  -a   0
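A minimal sketch of these two rules, again hypothetical and for illustration only:

```python
import numpy as np

def mul_dual(t, v):
    # Dual numbers: (a, b) is a + b*eps with eps^2 = 0;
    # 4 values loaded, 3 real multiplies.
    return np.array([t[0] * v[0], t[0] * v[1] + t[1] * v[0]])

def mul_cross(t, v):
    # R^3 cross product rule: 6 values loaded, 6 real multiplies.
    return np.cross(t, v)
```

Note that, unlike the other algebras considered, the cross product anti-commutes (t × v = -(v × t)) and t × t = 0.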

2.3. Initialization, Normalization, Non-Linearities, and Pruning

Prior work (Trabelsi et al., 2018; Gaudet and Maida, 2018) has advocated algebra-specific initializations and expensive whitening procedures to replace batch normalization. We find that this is not necessary to achieve good performance, and we are able to use the same initialization, normalization, and non-linearities across all algebras, which facilitates exploring a wide variety of options.

To initialize all the components of the algebra tuple at the beginning of a network, we set the first tuple component to the typical input. For ResNet, MobileNet, and the RNN we initialize the other components of the tuple with a small one- or two-layer MLP, i.e. t_{b,c,...} = MLP(t_a). For the transformer, we take advantage of the fact that the embedding is already a learned representation and simply reshape the output embedding appropriately. We find that the specifics of the input initialization do not have a large effect on performance, though allowing a learnable transformation outperformed initializing the additional components to 0 or replicating the input. We use standard Glorot (Glorot and Bengio, 2010) weight initialization for each component independently. Comparisons with the algebra-specific initializations of (Trabelsi et al., 2018; Gaudet and Maida, 2018) can be found in Appendix B.

Existing activation functions can be applied component-wise, t = (f(t_a), ..., f(t_d)), and we found that ReLU and swish (Ramachandran et al., 2017) work well; tanh and sigmoid can also be applied component-wise as part of GRUs and LSTMs. Applying the activation function to the entire tuple has possible computational advantages if it is ReLU-like, as it would allow an entire tuple multiplication to be skipped. For example, consider t ← f(g(t)) t: if g(•) returns the mean of the tuple and f is H, the Heaviside step function, then entire tuples can be zeroed out and their multiplications removed. Appendix B examines different choices for doing this, but we do not consider it further in the main text.
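For concreteness, the tuple-wise gating t ← f(g(t)) t with g the tuple mean and f the Heaviside step can be sketched as follows (a hypothetical sketch; we treat H(0) as 0 here):

```python
import numpy as np

def heaviside_gate(t):
    # t: (..., tuple_len). Zero every tuple whose component mean is <= 0,
    # so the tuple multiplications it feeds can be skipped entirely.
    gate = (t.mean(axis=-1, keepdims=True) > 0).astype(t.dtype)
    return gate * t
```

Because the gate acts on whole tuples, a zeroed tuple removes an entire algebra multiplication rather than a single scalar multiply.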
The final logits of an AlgebraNet must be real-valued. We use an algebra-specific final linear layer and convert the final algebra tuple to a scalar with the tuple-wise L2 norm before applying the softmax. More details are in Appendix B.

To apply magnitude pruning (Zhu and Gupta, 2017; Gale et al., 2019) to prune tuples, we used the tuple L2 norm as the pruning criterion for all AlgebraNet variants. For the Mn(R) algebras we also experimented with criteria based on the eigenvalues, λ_i, and singular values, σ_i, of each n × n matrix: the Frobenius norm corresponds to (Σ_i σ_i²)^{1/2} and the determinant corresponds to Π_i λ_i. We found pruning based on the Frobenius norm to be the most effective, followed by pruning based on the largest eigenvalue. See Appendix C for a comparison between methods.

(Trabelsi et al., 2018), (Gaudet and Maida, 2018), and (Wu et al., 2020) use whitening in place of batch normalization. Whitening normalizes and de-correlates the different tuple elements from one another. However, this extension results in a substantial increase in both training- and test-time computational cost, as described in (Pan et al., 2019). The inclusion of the whitening cost in the FLOP count in Fig. 1 highlights its substantial inference cost: a Cholesky decomposition (Press et al., 2007) of the inverted covariance matrix is required during training, and at inference it is not possible to fold the whitening transformation into adjacent convolutions, since every algebra element contributes to each element of the whitened output. We find that batch normalization does not substantially decrease performance, trains 1.9× faster, and has no inference cost, so we use it for all experiments unless explicitly stated.

3. Related Work

(Trabelsi et al., 2018) applied complex-valued networks to convolutional neural networks trained on CIFAR-10, as well as to music transcription and speech spectrum prediction.
They find that complex-valued networks with the same number of parameters and more FLOPs perform slightly better than real-valued networks. (Gaudet and Maida, 2018) extend the procedure of (Trabelsi et al., 2018) to quaternion-valued weights, showing that they are able to reduce the parameter count by a factor of two relative to complex-valued networks and a factor of four relative to real-valued networks, while again slightly increasing top-1 accuracy on CIFAR-10. (Wu et al., 2020) further extend this approach to octonions (a non-associative algebra), demonstrating a further reduction in parameter count with increased accuracy on CIFAR-10. These papers establish the efficacy of some alternative algebras, though they focus purely on parameter efficiency rather than FLOP efficiency, which is equally important for image classification tasks. Additionally, the tested datasets are relatively small, and it is unclear how the results scale to larger datasets. Neither the quaternion nor the octonion network paper tests the models on language modeling tasks, where parameter efficiency is often of greater importance. (Parcollet et al., 2018) propose a quaternion recurrent neural network (QRNN) and quaternion LSTM (QLSTM), showing that quaternion-based methods are able to reduce the parameter count while offering better performance on the TIMIT and WSJ phoneme recognition tasks. Associative Long Short-Term Memory (Danihelka et al., 2016) leverages complex vectors to increase the memory capacity of LSTMs without a parameter increase. Recently, many methods for inducing sparsity in neural networks have shown that it is possible to train models with an overwhelming fraction of the weights being 0 (Molchanov et al., 2017; Gale et al., 2019; Frankle and Carbin, 2019; Louizos et al., 2018; Evci et al., 2019; Zhu and Gupta, 2017).
Many of these methods gradually decrease the number of weights in the network during training using some combination of each weight's gradient and magnitude. Fine-grained sparsity is hard to accelerate on modern hardware, although there have been recent results demonstrating that speedups are possible (Elsen et al., 2020). (Vecchi et al., 2019) considered inducing sparsity in quaternion networks. Primitives that increase the computational density of fundamental interactions would improve the performance of sparse methods, as demonstrated on the GPU by (Mueller-Roemer et al., 2019) in scientific computing. (Jayakumar et al., 2020) emphasize that multiplicative interaction layers provide a particularly useful inductive bias for fusing multiple information streams. Specific AlgebraNets may likewise provide strong, useful domain-specific inductive biases, as done by (Worrall et al., 2017), who leverage the rotational invariance of complex numbers in convolutional networks, and by (Hinton et al., 2018), who use a 4 × 4 pose matrix to represent orientations.

4. Experimental Results

4.1. ImageNet

We examine the performance of AlgebraNet versions of ResNet-50 (He et al., 2016) and MobileNet-v1 (Howard et al., 2017) on the ImageNet (Russakovsky et al., 2015) dataset. We use a width multiplier on the channels to adjust model capacity. For all experiments we use SGD with momentum of 0.9. We increase the number of training epochs by a factor of two to 180, which we also use for the pruning experiments. This did not affect the baseline, but it improves the pruning results; it also improved performance for H, so we used it throughout. For a batch size of 256, the initial learning rate for the ResNet experiments was set to 2.5 and multiplied by 0.1 at epochs 60, 100, 140, and 160. We find it useful to reduce the amount of L2 regularization used for AlgebraNets: the baseline value of 10⁻⁴ was scaled by 0.725 for ResNet-50 and 0.625 for MobileNet-v1. We use the swish activation function for all experiments shown in Figures 1 and 2, including the baselines; we found it improved performance across the board.

Figures 1 and 2 compare the trade-offs between accuracy, parameters, and FLOPs for different flavours of AlgebraNet. Notably, we do not find that the parameter reduction without accuracy loss reported by (Trabelsi et al., 2018) and (Gaudet and Maida, 2018) on CIFAR translates to ImageNet; we are unable to divide the number of parameters by a factor of two/four for C/H and match baseline performance. We hypothesize that this is in part due to the over-parameterization of many networks trained on CIFAR, and in part because AlgebraNets act as an additional regularizer through greater weight reuse: each tuple component is involved in multiple equations. We feel this highlights the need for testing methods on large-scale datasets. We find that M2(R) AlgebraNets provide the best parameter efficiency of all considered algebras while requiring no more FLOPs than the real baseline on both ResNet-50 and MobileNet-v1.
We also find, for both ResNet-50 and MobileNet-v1, that M2(C) AlgebraNets provide better FLOP efficiency than the previously studied H while having the same ratio of multiplies to values loaded. The diagonal algebra is extremely parameter efficient, and we do find the interaction between different components through whitening to be important, as hypothesized: a network trained with whitening achieves 6% higher top-1 accuracy than the same network trained with batch normalization. Unfortunately, adding whitening increases the total number of inference FLOPs by a factor of 3×. We are left to conclude that whitening is not currently competitive, and we recommend using batch normalization for all algebras. Exploring the role of the interaction induced by whitening, and alternatives that are more computationally efficient, is an interesting direction for future work. An additional benefit is that AlgebraNets applied to MobileNet-like architectures increase the computational density of the often bandwidth-bound depthwise convolutions, while reducing the number of FLOPs in the more costly 1 × 1 convolutions.

4.2. Pruning ResNet-50

We use magnitude-based pruning with the schedule proposed in (Zhu and Gupta, 2017). We always begin pruning at 20% and end pruning at 80% of the total training iterations, pruning every 100 steps. At each pruning step, we set the tuples with the lowest magnitude, given by (Σ_i t_i²)^{1/2}, to 0. We do not prune terms in the final linear layer, the tuple-initialization convolutions, or the initial convolution of the ResNet. To allow comparison with (Gale et al., 2019), we use ReLU activations in the pruning experiments, as opposed to the swish used in Fig. 1. Final top-1 accuracies of pruned networks are shown in Fig. 3. Despite pruning entire tuples, which allows an entire tuple multiplication to be skipped, we are still able to find sparse networks that are similarly FLOP efficient to those from (Gale et al., 2019), while having higher compute density due to the algebra structure. Pruning individual components, rather than setting entire tuples to 0, does improve performance, though it does not provide the same computational benefits. We provide further results in Appendix C.
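A single pruning step under these choices can be sketched as follows (a hypothetical sketch; the gradual schedule and the layer exclusions described above are omitted):

```python
import numpy as np

def prune_tuples(weights, sparsity):
    # weights: (num_tuples, tuple_len). Zero the `sparsity` fraction of
    # tuples with the smallest L2 norm, so whole tuple multiplications
    # can be skipped at inference.
    norms = np.linalg.norm(weights, axis=1)
    k = int(sparsity * len(norms))
    if k == 0:
        return weights
    threshold = np.partition(norms, k - 1)[k - 1]
    return weights * (norms > threshold)[:, None]
```

Pruning individual components instead would simply threshold `np.abs(weights)` elementwise, which keeps more capacity but no longer removes whole tuple multiplies.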

4.3. Transformer-XL on Enwik8 and an RNN on WikiText-103

We perform character-level language modeling on the Enwik8 dataset from the Hutter Prize (LLC, 2009) with the Transformer-XL (Dai et al., 2019) architecture. We tuned the baseline model by halving the embedding size, number of heads, and feedforward size, resulting in a more challenging 'efficient' baseline with only 25% as many parameters as the 24-layer network from (Dai et al., 2019). We train with Adam (Kingma and Ba, 2014) (learning rate 2 × 10⁻⁴) and dropout. The baseline has a parameter count of 69.4 million and achieves 0.99 bits per character (BPC), matching the results of (Dai et al., 2019) while requiring 75% fewer parameters. Using the M2(R) algebra in all linear layers with fixed activation size results in a further 75% reduction in parameter count. We use these parameter savings to increase the depth of the model from 24 to 42 layers, resulting in a model with 45% as many parameters as the 'efficient' baseline. The resulting M2(R) AlgebraNet also achieves 0.99 BPC, but with only 31.2 million parameters. The character-embedding layers are computationally unchanged; they associate each character with a d/4-sized M2(R) embedding, which can be thought of as a reshaped R^d embedding. Special consideration has to be paid to the R^{l×l} attention matrix, which is often regarded as a practical memory and compute bottleneck (Child et al., 2019; Roy et al., 2020; Kitaev et al., 2020). Using an l × l algebra-valued attention matrix would increase the memory and compute requirements (e.g. by a factor of 2 in the Complex-Transformer (Yang et al., 2020) or 4 for M2(R)). We therefore want a real-valued attention matrix computed from the algebra-valued key and query vectors (k, q). We achieve this by flattening keys and queries from M2(R) to R: formally, we redefine the attention's real-valued scalar product as ⟨k, q⟩_{M2(R)} := ⟨F(k), F(q)⟩_R, where F flattens the input into a real vector.
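The flattened scalar product can be sketched as follows (hypothetical shapes; a single head, without the usual 1/√d scaling):

```python
import numpy as np

def real_attention_logits(q, k):
    # q, k: (seq_len, d // 4, 4) M2(R)-valued queries and keys.
    # F flattens each position to R^d; the resulting l x l logits are
    # real-valued, so attention memory does not grow with the tuple
    # length of the algebra.
    qf = q.reshape(q.shape[0], -1)
    kf = k.reshape(k.shape[0], -1)
    return qf @ kf.T  # (seq_len, seq_len)
```

The softmax and the weighted sum over values then proceed exactly as in a real-valued transformer.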
Finally, we also consider a dataset approximately one order of magnitude larger by tokenizing WikiText-103 (Merity et al., 2016) into characters (instead of the more common words). On this dataset we consider a single-layer GRU architecture followed by 5 linear readout layers with the ReLU non-linearity, skip connections, and layer normalization after each layer. We train using a batch size of 16, Adam (learning rate of 10⁻⁴), and L2 regularization of 10⁻⁷ for 200,000 steps. Training takes two days for the largest baseline variant on a single V100 GPU. We train on length-512 sequences with back-propagation through time. We initialize the different components of each algebra with single linear layers from the input, and report results on the typical validation set. We replace the gated recurrent unit (GRU) (Cho et al., 2014) with its AlgebraNet equivalent, as well as replacing the readout layers with AlgebraNet variants. We consider M2(R), C, H, and M3(R); results are shown in Table 3. A C AlgebraNet with a hidden size of 1024 and 24.1 million parameters achieves a validation BPC of 1.26, comparable to a real-valued network with 1.45 times as many parameters. We find that M2(R) with a hidden size of 512 reaches a validation BPC of 1.30, comparable to a model with twice as many parameters. This again demonstrates the parameter efficiency of M2(R) and the usefulness of AlgebraNets for problems such as language modeling, where parameter efficiency is crucial.

5. Conclusion

Conventional neural networks are composed of real-valued weights and activations along with real-valued operators. In this work, we proposed AlgebraNets, a general paradigm that replaces real-valued weights and operators with weights and operators from other associative algebras. We showed these methods to be more parameter efficient than their real-valued counterparts while having higher compute density. We also find that the M2(R) algebra is more FLOP efficient than previously considered algebras; in fact, it is as FLOP efficient as the reals. The increased compute density of the proposed algebras should prove particularly useful for sparse neural networks and auto-regressive inference, since modern hardware favors a relatively high compute density. We hope that our work enables further development of these methods and promotes broader research into the fundamental design choices upon which modern neural networks are based.

A. M2(C) Multiplication

Each tuple (t_a, t_b, t_c, t_d, t_e, t_f, t_g, t_h) represents the 2 × 2 complex matrix:

  | t_a + t_b i    t_c + t_d i |
  | t_e + t_f i    t_g + t_h i |

B.1 Tuple-Wise Activation Functions

We consider equations of the form t ← f(g(t)) · t. We found that if g is the tuple mean and f is H, the Heaviside function, top-1 performance drops by 2.97% for an M2(R) ResNet-50 AlgebraNet. While this drop is significant, the resulting activation sparsity might make it a desirable tradeoff in some circumstances. Other methods, such as setting g to be the determinant, resulted in a greater than 10% drop in performance.

B.2 Initialization

For a ResNet-50 H-AlgebraNet with the standard number of channels divided by 4, we find a top-1 performance of 74.0 ± 0.14 using standard initialization and 74.1 ± 0.15 using initialization from Gaudet and Maida (2018) . These experiments are done using standard batch normalization instead of the more expensive whitening procedure.

B.3 Conversion to Reals

For all considered algebras, the norm of the tuple is given by (Σ_i t_i²)^{1/2}. It is possible that the optimal choice for converting to the reals would be different in models with very large final layers, such as word-based language modeling, which we do not consider.

C. Alternative Norms for Pruning

C.1 Alternatives for M2(R)

For M2(R), we consider a variety of alternative pruning criteria for removing entire tuples, based on the two eigenvalues, λ₁ and λ₂, and the two singular values, σ₁ and σ₂. Because our matrices are square but not necessarily symmetric, the Frobenius norm is defined in terms of the singular values, which are the square roots of the eigenvalues of AAᵀ for the matrix A in question:

• Frobenius norm: (σ₁² + σ₂²)^{1/2}
• Determinant: λ₁λ₂

In Table 4, we show the resulting drop in top-1 accuracy relative to the Frobenius norm at three different sparsities for three alternative pruning methods. In addition to always achieving the best performance, the Frobenius norm has the advantage that it is defined for all algebra variants we consider, rather than being Mn(R)-specific.
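Both criteria are cheap to compute per 2 × 2 matrix; a hypothetical sketch:

```python
import numpy as np

def frobenius_norm(w):
    # w: (..., 2, 2). (sigma_1^2 + sigma_2^2)^{1/2} coincides with the
    # entrywise L2 norm of the matrix, so no decomposition is needed.
    return np.sqrt((w ** 2).sum(axis=(-2, -1)))

def abs_determinant(w):
    # |lambda_1 * lambda_2|
    return np.abs(np.linalg.det(w))
```

The identity between the singular-value form and the entrywise L2 norm is what makes the Frobenius criterion the same as the tuple L2 norm used for the other algebras.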

C.2 Pruning Individual M2(R) and H Elements

For M2(R) and H, we also prune individual tuple elements based on element norms. This equally reduces the number of non-zero weights in the network, though it does not yield entire matrix multiplies that can be skipped. Depending on the size of the network, the difference between the methods varies slightly; the main point is that pruning elements rather than tuples increases performance, more so at higher sparsities. In Table 5, we show the resulting increase in top-1 accuracy from pruning individual tuple components rather than entire tuples. However, due to the structure of Mn(R) and H multiplication, setting individual values to 0 does not produce 0s in the output, so pruning entire tuples provides more useful computational advantages.

D. AlgebraNets Trained on CIFAR

We use a network structure based on that described in Gaudet and Maida (2018). We begin with the same ResNet structure, with 128, 256, and then 512 channels in each real block. For the C networks, all channel counts are divided by two. For the M2(R) and H networks, we assign the initial convolution, before the residual blocks, half the original number of channels; all other channel counts are divided by four. Thus, for H and M2(R) we have slightly more than 1/4 the parameters. We train with 24 × 24 random crops and evaluate on 32 × 32 images. We find we are able to divide the channels in the filters by two and maintain the same performance using complex-valued networks. When reducing the parameter count by a factor of ~four, we find we are able to again match baseline performance with quaternions and 2 × 2 matrices. Regularization has a non-trivial effect on performance, and more finely adjusting the L2 loss for the different algebras may yield higher top-1 accuracy. We note that the relative reduction in parameters on CIFAR-10 is not something we are able to replicate on ImageNet. The results from the main text also hold here: M2(R) is the only algebra able to maintain accuracy while using fewer FLOPs than the baseline real network. For these experiments, we used algebra-specific weight initializations, though we again verified that this does not have a substantial effect.

E. Explicit M2(R) Convolution

We write the update rule explicitly for readability. Note that it is possible to concatenate the relevant terms on the channel axis to reduce the number of convolutions needed. 
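Since the explicit equations did not survive extraction, the decomposition can be sketched with 1 × 1 'convolutions' written as channel matmuls (a hypothetical sketch with assumed shapes):

```python
import numpy as np

def m2r_linear(x, w):
    # x: (batch, in_ch, 4) activations, w: (out_ch, in_ch, 4) weights;
    # each length-4 tuple (a, b, c, d) is the matrix [[a, b], [c, d]].
    # Each output component is a sum of two real channel contractions;
    # these can be realised by concatenating components on the channel
    # axis to reduce the number of convolution calls.
    xa, xb, xc, xd = (x[..., i] for i in range(4))
    wa, wb, wc, wd = (w[..., i] for i in range(4))
    oa = xa @ wa.T + xc @ wb.T
    ob = xb @ wa.T + xd @ wb.T
    oc = xa @ wc.T + xc @ wd.T
    od = xb @ wc.T + xd @ wd.T
    return np.stack([oa, ob, oc, od], axis=-1)
```

For a spatial convolution, each `@` above becomes a real convolution over the corresponding component planes; the algebraic structure is unchanged.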

F. Increase in Activation Memory

Due to the activations, there will be a slight increase in memory footprint for AlgebraNets in some cases. For example, in an M2(R) AlgebraNet for ResNet-50 with channels/4, there will be C/4 convolutions performed. In a naive implementation this would result in twice the activation memory; with a properly written kernel, however, this is not the case. There is an additional factor: to reach comparable performance, a slightly larger network than channels/4 is needed. In practice, about a 1.3× increase in activation memory is incurred.






Figure 1: ResNet-50. We vary the width of a ResNet-50 trained on ImageNet for various flavors of AlgebraNet and the real-valued baseline. We separate based on algebras with up to 1:1 multiplies-to-values-loaded compute density (top), and greater than 1:1 density (bottom). In the left columns, we show parameters and ImageNet top-1 accuracy. We count parameters as the total number of real values, e.g. a complex number counts as two parameters, M2(R) and H both count as 4, etc. All runs use batch norm (Ioffe and Szegedy, 2015), unless explicitly stated. Right: for the same algebras, we compare the number of floating-point operations (multiply-adds) required at inference. Unlike all previously considered algebras, for M2(R) we find computational cost equivalent to real-valued networks at baseline performance. M2(C) improves performance compared to the previously considered H with the same compute density.

Figure 2: MobileNet-v1. Left: we vary the width of a MobileNet-v1 trained on ImageNet for a subset of the considered AlgebraNets and the real-valued baseline. Right: for the same algebras, we compare the number of FLOPs (multiply-adds) required at inference.

Figure 3: Pruning ResNet-50. Left: using magnitude pruning, we prune entire tuples of an M2(R)-ResNet-50 to between 50 and 90% sparsity. Different widths correspond to different curves; points along each curve are different sparsity levels. In green, we show baseline magnitude-pruning results from (Gale et al., 2019). Right: for the same pruned networks, we show the FLOP efficiency.

tuple (t_a, t_b, t_c, t_d) represents a 2 × 2 real matrix.

tuple (t_a, t_b) represents the dual number t_a + t_b ε. The R^3 cross product acts between length-3 tuples (t_a, t_b, t_c).

A concrete example of replacing a real linear layer with an M2(R) linear layer such that the activation memory is kept identical: intuitively, this can be thought of as reshaping the R^d input activations to have shape M2(R)^(d/4), which is processed by an f_M : M2(R)^(d/4) → M2(R)^(d/4) linear layer, resulting in output activations (when flattened) with shape R^d. Each such linear layer f_M requires 1/4 of the parameters and 1/2 of the FLOPs compared to a real R^d → R^d linear layer counterpart.
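The 1/4 parameter and 1/2 FLOP ratios follow directly from counting multiplies: a tuple-tuple product in M2(R) is a 2 × 2 matrix product, i.e. 8 real multiplies per 4-real-valued weight. A small worked example (d = 8 is illustrative; any d divisible by 4 works):

```python
d = 8        # input dimension, must be divisible by 4
t = d // 4   # number of M2(R) tuples after reshaping

real_params = d * d        # real d -> d linear layer
m2r_params = t * t * 4     # each tuple weight holds 4 real values

real_mults = d * d         # one real multiply per weight
m2r_mults = t * t * 8      # each 2x2 matrix product uses 8 multiplies

print(m2r_params / real_params)  # 0.25
print(m2r_mults / real_mults)    # 0.5
```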

'''Simplified example code for M_2(R).

x: Input with an additional algebra axis. In the case of a convolution,
   either (B, H, W, C, A) or (B, C, H, W, A).
w: Corresponding weight, with an additional algebra axis.
'''
# Multiplication table for 2x2 matrices [[a, b], [c, d]] stored as (a, b, c, d):
# entry [i][j] = (w_component, x_component).
mat_22_rule = [[(0, 0), (1, 2)], [(0, 1), (1, 3)],
               [(2, 0), (3, 2)], [(2, 1), (3, 3)]]

# Convolution: update each of the four algebra components.
x_new = [0, 0, 0, 0]
for i in range(4):
    for j in range(2):
        # w: weight with an extra algebra dimension.
        # x: input with shape [B, ..., A] where A is the algebra dimension.
        x_new[i] += Conv2D(x[..., mat_22_rule[i][j][1]], w[..., mat_22_rule[i][j][0]])
# Add bias if wanted. Add (4,) to shape.

# Linear layer: update each of the four algebra components.
x_new = [0, 0, 0, 0]
for i in range(4):
    for j in range(2):
        # w: weight with an extra algebra dimension.
        # x: input with shape [B, L, A] where A is the algebra dimension.
        x_new[i] += dot(x[..., mat_22_rule[i][j][1]], w[..., mat_22_rule[i][j][0]])
# Add bias if wanted. Add (4,) to shape.
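The listing above can be made concrete with NumPy. The sketch below implements the multiplication rule for a single tuple and checks it against an ordinary 2 × 2 matrix product; `mat_22_rule` is as in the listing, and the batch/channel handling is deliberately omitted.

```python
import numpy as np

# Entry [i][j] = (w_component, x_component): output component i is the
# sum over j of w[...[0]] * x[...[1]].
mat_22_rule = [[(0, 0), (1, 2)], [(0, 1), (1, 3)],
               [(2, 0), (3, 2)], [(2, 1), (3, 3)]]

def m2r_multiply(w, x):
    """Multiply two M2(R) tuples stored as length-4 arrays (row-major)."""
    out = np.zeros(4)
    for i in range(4):
        for j in range(2):
            wi, xi = mat_22_rule[i][j]
            out[i] += w[wi] * x[xi]
    return out

rng = np.random.default_rng(0)
w, x = rng.normal(size=4), rng.normal(size=4)

# The rule reproduces a direct 2x2 matrix product.
expected = (w.reshape(2, 2) @ x.reshape(2, 2)).reshape(4)
assert np.allclose(m2r_multiply(w, x), expected)
```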



25 (component-wise, not tuple-wise), and windows of 1536 at train and 4608 at test time. The M2(R)-Transformer uses a learning rate of 0.0005 and dropout 0.15. Our 'efficient' baseline 24-layer Transformer-XL model has an embedding size of 512, 4 heads of size 128 each, and a feed-forward hidden size of 1536 for a total Enwik

WikiText-103: We replace a real-valued GRU (and the corresponding linear layers) with AlgebraNet counterparts. We report the minimum validation loss over the last 5% of training. The hidden size is reported as the number of tuples, e.g. M2(R) with 512 tuples has 2048 scalars in total.

In all cases, we remove the tuples with the minimum magnitude under the chosen criterion. For 50%, 70%, and 90% sparsity, we show the performance of different magnitude-based tuple pruning criteria relative to the Frobenius norm.
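Magnitude-based tuple pruning can be sketched as follows. This is a minimal illustration, not the paper's training-time schedule: it scores each tuple by a per-tuple norm (Frobenius/L2 by default) and zeroes the lowest-scoring fraction. The function name and shapes are ours.

```python
import numpy as np

def prune_tuples(weights, sparsity, norm=np.linalg.norm):
    """Zero out the `sparsity` fraction of tuples with the smallest magnitude.

    weights: array of shape (n_tuples, tuple_size), e.g. tuple_size 4 for M2(R).
    norm: per-tuple magnitude criterion; default is the Frobenius/L2 norm.
    """
    scores = norm(weights, axis=1)      # one magnitude score per tuple
    k = int(sparsity * len(weights))    # number of tuples to remove
    drop = np.argsort(scores)[:k]       # indices of smallest-magnitude tuples
    pruned = weights.copy()
    pruned[drop] = 0.0                  # remove entire tuples, not components
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 4))
pruned = prune_tuples(w, 0.5)
print(np.all(pruned == 0, axis=1).sum())  # -> 5 tuples fully zeroed
```

Pruning whole tuples (rather than individual components) preserves the dense tuple-multiplication structure, which is what makes the FLOP savings in Figure 3 realizable.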

Performance difference from pruning individual components versus entire tuples for M2(R)- and H-AlgebraNets.

A comparison of different AlgebraNets on CIFAR-10. BN denotes Batch Normalization, W denotes the use of whitening.

