BEYOND FULLY-CONNECTED LAYERS WITH QUATERNIONS: PARAMETERIZATION OF HYPERCOMPLEX MULTIPLICATIONS WITH 1/n PARAMETERS

Abstract

Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with quaternions" (quaternions are 4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, enjoy parameter savings with only 1/4 of the learnable parameters while achieving comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility with arbitrarily 1/n of the learnable parameters of the fully-connected layer counterpart. Experiments applying our method to LSTM and transformer models on natural language inference, machine translation, text style transfer, and subject-verb agreement demonstrate the architectural flexibility and effectiveness of the proposed approach.

1. INTRODUCTION

A quaternion is a 4D hypercomplex number with one real component and three imaginary components. The Hamilton product is the hypercomplex multiplication of two quaternions. Recent works in quaternion space and Hamilton products have demonstrated reasonable success (Parcollet et al., 2018b; 2019; Tay et al., 2019). Notably, the Hamilton product enjoys a parameter saving, requiring only 1/4 of the learnable parameters of the corresponding real-valued matrix multiplication. It also enables effective representation learning by modeling interactions between real and imaginary components. One of the attractive properties of quaternion models is their high applicability and universal usefulness to one of the most ubiquitous layers in deep learning, i.e., the fully-connected (or feedforward) layer. Specifically, "fully-connected layers with quaternions" replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, enjoying parameter savings with only 1/4 of the learnable parameters and achieving comparable performance with their fully-connected layer counterparts (Parcollet et al., 2018b; 2019; Tay et al., 2019).

The fully-connected layer is one of the most dominant components in the existing deep learning literature (Goodfellow et al., 2016; Zhang et al., 2020). Its pervasiveness cannot be overstated, given its centrality to many core building blocks in neural network research. Given the widespread adoption of fully-connected layers, e.g., within LSTM networks (Hochreiter & Schmidhuber, 1997) and transformer models (Vaswani et al., 2017), having the flexibility to balance between parameter savings and effectiveness could be extremely useful to many real-world applications. Unfortunately, hypercomplex space only exists at 4D (quaternions), 8D (octonions), and 16D (sedenions), which generalize the 2D complex space (Rishiyur, 2006). Moreover, custom operators are required at each hypercomplex dimensionality.
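To make the 1/4 parameter saving concrete, the NumPy sketch below builds the block-structured weight of a quaternion fully-connected layer from four shared real matrices. The block layout follows one common sign convention for the Hamilton product (conventions differ across papers), and the dimensions are illustrative choices of ours:

```python
import numpy as np

# Input/output dimensions, both divisible by 4 (illustrative sizes)
d, k = 8, 8
rng = np.random.default_rng(0)

# Four shared real matrices, one per quaternion component
Wr, Wx, Wy, Wz = (rng.standard_normal((d // 4, k // 4)) for _ in range(4))

# Block-structured (d, k) weight implementing the Hamilton product
# (one common sign convention; layouts differ across papers)
W = np.block([
    [Wr, -Wx, -Wy, -Wz],
    [Wx,  Wr, -Wz,  Wy],
    [Wy,  Wz,  Wr, -Wx],
    [Wz, -Wy,  Wx,  Wr],
])

n_params = 4 * (d // 4) * (k // 4)  # 16 learnable entries shared across blocks
n_fc = d * k                        # 64 for an ordinary real-valued weight
```

Because the same four small matrices fill all sixteen blocks, the layer stores only `n_fc / 4` parameters while still producing a full `(d, k)` weight.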
For instance, the Hamilton product is the hypercomplex multiplication in 4D hypercomplex space. Thus, no operator in such predefined hypercomplex spaces is suitable for applications that prefer reducing parameters to 1/n, where n ∉ {4, 8, 16}. In view of the architectural limitation due to the very few choices of existing hypercomplex spaces, we propose parameterization of hypercomplex multiplications, i.e., learning the real and imaginary component interactions from data in a differentiable fashion. Essentially, our method can operate on an arbitrary nD hypercomplex space, aside from subsuming those predefined hypercomplex multiplication rules, allowing the use of arbitrarily 1/n of the learnable parameters while maintaining expressiveness. In practice, the hyperparameter n can be flexibly specified or tuned by users based on applications. Concretely, our prime contribution is a new module that parameterizes and generalizes the hypercomplex multiplication by learning the real and imaginary component interactions, i.e., multiplication rules, from data. Our method, which we call the parameterized hypercomplex multiplication layer, is characterized by a sum of Kronecker products, which generalize vector outer products to higher dimensions in real space. To demonstrate applicability, we equip two well-established models (the LSTM and transformer) with our proposed method. We conduct extensive experiments on different tasks, i.e., natural language inference for LSTM networks and machine translation for transformer models. Additionally, we perform further experiments on text style transfer and subject-verb agreement tasks. All in all, our method has demonstrated architectural flexibility through different experimental settings, where it generally can use a fraction of the learnable parameters with minimal degradation or slight improvement in performance.
The overall contributions of this work are summarized as follows:

• We propose a new parameterization of hypercomplex multiplications: the parameterized hypercomplex multiplication (PHM) layer. This layer has 1/n learnable parameters compared with the fully-connected layer counterpart, where n can be flexibly specified by users. The key idea behind PHM layers is to learn the interactions between real and imaginary components, i.e., multiplication rules, from data using a sum of Kronecker products.

• We demonstrate the applicability of the PHM layers by leveraging them in two dominant neural architectures: the LSTM and transformer models.

• We empirically show the architectural flexibility and effectiveness of PHM layers by conducting extensive experiments on five natural language inference tasks and seven machine translation datasets, together with text style transfer and subject-verb agreement tasks.
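To make the "sum of Kronecker products" idea concrete, here is a minimal NumPy sketch of how a PHM layer could assemble its weight matrix. The names are our own illustrative choices (A for the learned n × n rule matrices, S for the shared component matrices), not the paper's reference implementation:

```python
import numpy as np

def phm_weight(A, S):
    """Assemble a (d, k) weight as a sum of n Kronecker products.

    A: (n, n, n) learned "multiplication rule" matrices.
    S: (n, d // n, k // n) learned component matrices.
    """
    return sum(np.kron(Ai, Si) for Ai, Si in zip(A, S))

rng = np.random.default_rng(0)
n, d, k = 4, 8, 12
A = rng.standard_normal((n, n, n))
S = rng.standard_normal((n, d // n, k // n))

W = phm_weight(A, S)               # shape (d, k) = (8, 12)
y = W.T @ rng.standard_normal(d)   # used like an ordinary FC weight

# Learnable parameters: n^3 + d*k/n; for large d and k this approaches
# d*k/n, i.e. 1/n of the d*k parameters of a real fully-connected layer.
n_params = n ** 3 + d * k // n
```

Since the rule matrices A are learned rather than fixed to the quaternion multiplication table, the same construction works for any n that divides d and k, which is what gives the layer its flexibility.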

2. BACKGROUND ON QUATERNIONS AND HAMILTON PRODUCTS

We begin by introducing the background for the rest of the paper. Concretely, we describe quaternion algebra along with Hamilton products, which are at the heart of our proposed approach.

Quaternion. A quaternion Q ∈ H is a hypercomplex number with one real component and three imaginary components as follows:

Q = Q_r + Q_x i + Q_y j + Q_z k,    (2.1)

where i^2 = j^2 = k^2 = ijk = -1. In (2.1), the noncommutative multiplication rules hold: ij = k, jk = i, ki = j, ji = -k, kj = -i, ik = -j. Here, Q_r is the real component, and Q_x, Q_y, Q_z are real numbers that represent the imaginary components of the quaternion Q.

Addition. The addition of two quaternions is defined as

Q + P = (Q_r + P_r) + (Q_x + P_x)i + (Q_y + P_y)j + (Q_z + P_z)k,

where Q and P with subscripts denote the real and imaginary components of the quaternions Q and P.
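The multiplication rules above fully determine the Hamilton product of two quaternions. A small sketch in plain Python, representing a quaternion as an (r, x, y, z) tuple, makes both operations executable:

```python
def quaternion_add(q, p):
    """Componentwise addition of two quaternions (r, x, y, z)."""
    return tuple(a + b for a, b in zip(q, p))

def hamilton_product(q, p):
    """Hamilton product, expanded from ij = k, jk = i, ki = j and
    i^2 = j^2 = k^2 = -1 (note the noncommutativity: ji = -k, etc.)."""
    qr, qx, qy, qz = q
    pr, px, py, pz = p
    return (qr * pr - qx * px - qy * py - qz * pz,
            qr * px + qx * pr + qy * pz - qz * py,
            qr * py - qx * pz + qy * pr + qz * px,
            qr * pz + qx * py - qy * px + qz * pr)

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton_product(i, j))  # ij = k  -> (0, 0, 0, 1)
print(hamilton_product(j, i))  # ji = -k -> (0, 0, 0, -1)
```

Reversing the operands flips the sign of the result, which is exactly the noncommutativity that the multiplication rules encode.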

