INTELLIGENT MATRIX EXPONENTIATION

Abstract

We present a novel machine learning architecture that uses a single high-dimensional nonlinearity consisting of the exponential of a single input-dependent matrix. The mathematical simplicity of this architecture allows a detailed analysis of its behaviour, providing robustness guarantees via Lipschitz bounds. Despite its simplicity, a single matrix exponential layer already provides universal approximation properties and can learn and extrapolate fundamental functions of the input, such as periodic structure or geometric invariants. This architecture outperforms other general-purpose architectures on benchmark problems, including CIFAR-10, using fewer parameters.

1. INTRODUCTION

Deep neural networks (DNNs) synthesize highly complex functions by composing a large number of neuronal units, each featuring a basic and usually 1-dimensional nonlinear activation function f : R → R. While highly successful in practice, this approach also has disadvantages. In a conventional DNN, any two activations only ever get combined through summation. As a result, such a network requires an increasing number of parameters to approximate even functions as simple as multiplication. Parameter-wise, this approach of composing simple functions does not scale efficiently.

An alternative to the composition of many 1-dimensional functions is a single higher-dimensional nonlinear function f : R^m → R^n. A single multidimensional nonlinearity may be desirable because it can express more complex relationships between input features with potentially fewer parameters and fewer mathematical operations.

The matrix exponential stands out as a promising but overlooked candidate for a higher-dimensional nonlinearity for machine learning models. It is a smooth function that appears in the solution of one of the simplest differential equations with desirable mathematical properties: dy(t)/dt = M y(t), whose solution is y(t) = exp(Mt) y(0), where M is a constant matrix. The matrix exponential also plays a prominent role in the theory of Lie groups, an algebraic structure used widely throughout many branches of mathematics and science.

A unique advantage of the matrix exponential is its natural ability to represent oscillations and exponential decay, which becomes apparent if we decompose the matrix to be exponentiated into a symmetric and an antisymmetric component. The exponential of an antisymmetric matrix, whose nonzero eigenvalues are always imaginary, generates a superposition of periodic oscillations, whereas the exponential of a symmetric matrix, which has real eigenvalues, expresses exponential growth or decay.
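The symmetric/antisymmetric dichotomy above can be verified numerically. The sketch below (ours, not from the paper's supplementary code) exponentiates a 2x2 antisymmetric matrix, which yields a rotation (pure oscillation), and a symmetric diagonal matrix, which yields elementwise exponential decay and growth:

```python
import numpy as np
from scipy.linalg import expm

# Antisymmetric matrix: imaginary eigenvalues, so exp(A t) is a
# rotation by angle t, i.e. a pure oscillation as t varies.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
t = np.pi / 2
R = expm(A * t)  # equals [[cos t, -sin t], [sin t, cos t]]
assert np.allclose(R, [[0.0, -1.0], [1.0, 0.0]])

# Symmetric matrix: real eigenvalues, so exp(S) expresses
# exponential decay (negative eigenvalue) or growth (positive).
S = np.diag([-1.0, 0.5])
D = expm(S)
assert np.allclose(np.diag(D), [np.exp(-1.0), np.exp(0.5)])
```

A general matrix mixes both behaviours, which is what lets a single exponential represent damped or growing oscillations.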
This fact, especially the ability to represent periodic functions, gives the M-layer the possibility to extrapolate beyond its training domain. Many real-world phenomena contain some degree of periodicity and can therefore benefit from this feature. In contrast, the activation functions typically used in conventional DNNs approximate the target function locally and are therefore unable to scale well outside the boundaries of the training data for such problems.

Based on these insights, we propose a novel architecture for supervised learning whose core element is a single layer (henceforth referred to as the "M-layer") that computes a single matrix exponential, where the matrix to be exponentiated is an affine function of the input features. We show that the M-layer has universal approximation properties and admits closed-form per-example robustness bounds. We demonstrate the ability of this architecture to learn multivariate polynomials, such as matrix determinants, and to generalize periodic functions beyond the domain of the training data without any feature engineering. Furthermore, the M-layer achieves results comparable to recently proposed non-specialized architectures on image recognition datasets. We provide TensorFlow code that implements the M-layer in the Supplementary Material.
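To make the core idea concrete, a minimal forward pass can be sketched as follows. All parameter names, shapes, and initializations here are our own illustrative assumptions (the paper's reference implementation is TensorFlow code in the supplementary material); the essential structure is that the exponentiated matrix is affine in the input features and the output is a linear readout of the exponential:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d_in, n = 3, 4  # input dimension and exponentiated-matrix size (hypothetical)

# Learnable parameters (random here, for illustration only):
B = 0.1 * rng.normal(size=(n, n))        # bias matrix of the affine map
T = 0.1 * rng.normal(size=(d_in, n, n))  # one n x n matrix per input feature
w = rng.normal(size=(n * n,))            # linear readout weights

def m_layer(x):
    # Matrix to be exponentiated is an affine function of the features:
    # M(x) = B + sum_i x_i T_i
    M = B + np.tensordot(x, T, axes=1)
    # Single high-dimensional nonlinearity, then a linear readout.
    return w @ expm(M).ravel()

y = m_layer(np.array([0.5, -1.0, 2.0]))
```

Note that with x = 0 the layer reduces to a fixed readout of exp(B), and each feature x_i steers the dynamics through its own matrix T_i.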

2. RELATED WORK

Neuronal units with more complex activation functions have been proposed before. One such example is the sigma-pi unit (Rumelhart et al., 1986), whose activation function is the weighted sum of products of its inputs. More recently, neural arithmetic logic units (Trask et al., 2018) have been introduced, which can combine inputs using multiple arithmetic operators and generalize outside the domain of the training data. In contrast with these architectures, the M-layer is not based on neuronal units with multiple inputs, but uses a single matrix exponential as its nonlinear mapping. Through the matrix exponential, the M-layer can easily learn mathematical operations more complex than addition, with a simpler architecture. In fact, as shown in Section 3.3, the M-layer can be regarded as a generalized sigma-pi network with built-in architecture search, in the sense that it learns by itself which arithmetic graph should be used for the computation.

Architectures with higher-dimensional nonlinearities are also already in use. The softmax function is a widely used example of such a nonlinear activation function that, like the M-layer, has extra mathematical structure: for example, a permutation of the softmax inputs produces a corresponding permutation of the outputs. Maxout networks also act on multiple units and have been successful in combination with dropout (Goodfellow et al., 2013). In radial basis networks (Park & Sandberg, 1991), each hidden unit computes a nonlinear function of the distance between its own learned centroid and a single point represented by a vector of input coordinates. Capsule networks (Sabour et al., 2017) are another recent example of multidimensional nonlinearities. Similarly, the M-layer uses the matrix exponential as a single high-dimensional nonlinearity. This mapping satisfies nontrivial mathematical identities, some of whose application potential in a machine learning context we elucidate here.
Feature crosses of second and higher order have also been explored on top of convolutional layers (Lin et al., 2015; Lin & Maji, 2017; Li et al., 2017; Engin et al., 2018; Koniusz et al., 2018). In this line of research, Li et al. (2017) introduce the use of a matrix operation (either the matrix logarithm or the matrix square root) on top of a convolutional layer with higher-order feature crosses. The M-layer differs from this kind of approach in two ways: first, it uses matrix exponentiation to produce feature crosses, not to improve the trainability of feature crosses produced by other layers; second, it computes a matrix operation of an arbitrary matrix (not necessarily symmetric positive semidefinite), which allows more expressiveness, as seen for example in Section 3.3. The main differences between the M-layer and other architectures that can utilize feature crosses are the M-layer's ability to also model non-polynomial dependencies (such as a cosine) and its built-in competition-for-total-complexity regularization, which we explain in Section 3.3 through a "circuit breadboard" analogy.

Matrix exponentiation has a natural alternative interpretation in terms of an ordinary differential equation (ODE). As such, the M-layer can be compared to other novel ODE-related architectures, such as neural ordinary differential equations (NODE) (Chen et al., 2018) and their augmented extensions (ANODE) (Dupont et al., 2019). We discuss this in Section 3.6.

Existing approaches to certifying the robustness of neural networks can be split into two categories. Some approaches (Peck et al., 2017) mathematically analyze a network layer by layer, providing bounds on the robustness of each layer that then get multiplied together. This kind of approach tends to give fairly loose bounds, due to the inherent loss of tightness from composing upper bounds. Other approaches (Singh et al., 2018; 2019) use abstract interpretation on the evaluation of the network to provide empirical robustness bounds.
In contrast, using the fact that the M-layer architecture consists of a single layer, in Section 3.7 we obtain a direct bound on the robustness of the whole network by analyzing the explicit formulation of the computation.
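The paper derives its own per-example bounds in Section 3.7. As a rough illustration of why a single exponential admits direct bounds (this is our example using the standard perturbation inequality ||exp(A) - exp(B)|| <= ||A - B|| exp(max(||A||, ||B||)) for submultiplicative norms, not the paper's bound), one can check numerically that a small perturbation of the exponentiated matrix moves the output by a controlled amount:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A = 0.3 * rng.normal(size=(4, 4))   # matrix produced by some input
E = 1e-3 * rng.normal(size=(4, 4))  # small perturbation of that matrix
B = A + E

# Spectral norms (ord=2 is the largest singular value).
lhs = np.linalg.norm(expm(A) - expm(B), 2)
bound = np.linalg.norm(E, 2) * np.exp(
    max(np.linalg.norm(A, 2), np.linalg.norm(B, 2)))
assert lhs <= bound  # the perturbation of exp(.) is explicitly bounded
```

Because the matrix is affine in the input, an input perturbation translates linearly into a perturbation of A, so a bound of this shape propagates directly to the network output without composing per-layer bounds.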

3. ARCHITECTURE

We start this section by reviewing the definition of the matrix exponential. We then define the proposed M-layer architecture and explain its ability to learn particular functions such as polynomials and periodic functions. Finally, we provide closed-form per-example robustness guarantees.
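For reference, the matrix exponential is defined by the power series exp(M) = sum_{k>=0} M^k / k!. A minimal sketch (ours, for illustration; library implementations such as SciPy's use more robust scaling-and-squaring algorithms) compares a truncated series against `scipy.linalg.expm`:

```python
import numpy as np
from scipy.linalg import expm

def expm_series(M, terms=30):
    # exp(M) = I + M + M^2/2! + ...; truncation is adequate for small ||M||.
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k  # M^k / k!, built incrementally
        out = out + term
    return out

M = np.array([[0.0, 1.0],
              [-0.5, 0.2]])
assert np.allclose(expm_series(M), expm(M), atol=1e-10)
```

In practice one would always use a library routine; the series only serves to fix the definition used in the rest of the section.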

