BEYOND FULLY-CONNECTED LAYERS WITH QUATERNIONS: PARAMETERIZATION OF HYPERCOMPLEX MULTIPLICATIONS WITH 1/n PARAMETERS

Abstract

Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, "fully-connected layers with quaternions" (quaternions are 4D hypercomplex numbers) replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, enjoying parameter savings with only 1/4 learnable parameters while achieving comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility with an arbitrary 1/n of the learnable parameters of the fully-connected layer counterpart. Experiments applying our approach to LSTM and transformer models on natural language inference, machine translation, text style transfer, and subject-verb agreement demonstrate the architectural flexibility and effectiveness of the proposed approach.
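One way such a learned parameterization can be realized is to compose each weight matrix from a sum of Kronecker products between small learned "rule" matrices and component matrices. The sketch below illustrates this idea and its roughly 1/n parameter count; the helper name `phm_weight` and the specific shapes are our illustrative assumptions, not necessarily the exact formulation of the proposed method.

```python
import numpy as np

def phm_weight(A, S):
    """Compose a (k x d) weight matrix as sum_i kron(A_i, S_i).

    A: (n, n, n)      -- n learned "multiplication rule" matrices, each n x n
    S: (n, k/n, d/n)  -- n learned component matrices
    """
    return sum(np.kron(A_i, S_i) for A_i, S_i in zip(A, S))

n, k, d = 4, 64, 64
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n, n))
S = rng.standard_normal((n, k // n, d // n))
W = phm_weight(A, S)

assert W.shape == (k, d)
# Learnable parameters: n^3 + k*d/n = 64 + 1024 = 1088,
# versus k*d = 4096 for a dense layer -- roughly a 1/n reduction.
```

Because the rule matrices A are learned rather than fixed, n need not be restricted to the predefined hypercomplex dimensions (4, 8, 16).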

1. INTRODUCTION

A quaternion is a 4D hypercomplex number with one real component and three imaginary components. The Hamilton product is the hypercomplex multiplication of two quaternions. Recent works in quaternion space and Hamilton products have demonstrated reasonable success (Parcollet et al., 2018b; 2019; Tay et al., 2019). Notably, the Hamilton product enjoys a parameter saving with 1/4 learnable parameters as compared with the real-valued matrix multiplication. It also enables effective representation learning by modeling interactions between real and imaginary components. One of the attractive properties of quaternion models is their high applicability and universal usefulness to one of the most ubiquitous layers in deep learning, i.e., the fully-connected (or feedforward) layer. Specifically, "fully-connected layers with quaternions" replace real-valued matrix multiplications in fully-connected layers with Hamilton products of quaternions, enjoying parameter savings with only 1/4 learnable parameters and achieving comparable performance with their fully-connected layer counterparts (Parcollet et al., 2018b; 2019; Tay et al., 2019). The fully-connected layer is one of the most dominant components in existing deep learning literature (Goodfellow et al., 2016; Zhang et al., 2020). Its pervasiveness cannot be overstated, given its
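To make the 1/4 parameter saving concrete, the following sketch (illustrative, not the implementation from the cited works) applies the Hamilton product rule blockwise: the input is split into four quaternion components, and four shared weight matrices are reused across all output components, so the layer stores a quarter of the parameters of an equally sized dense layer.

```python
import numpy as np

def quaternion_linear(x, W_r, W_i, W_j, W_k):
    """Quaternion fully-connected layer via the Hamilton product.

    x has size 4m; each W_* is an (m x m) real matrix, shared across
    the four output components according to the Hamilton product rule.
    """
    x_r, x_i, x_j, x_k = np.split(x, 4)
    y_r = W_r @ x_r - W_i @ x_i - W_j @ x_j - W_k @ x_k
    y_i = W_r @ x_i + W_i @ x_r + W_j @ x_k - W_k @ x_j
    y_j = W_r @ x_j - W_i @ x_k + W_j @ x_r + W_k @ x_i
    y_k = W_r @ x_k + W_i @ x_j - W_j @ x_i + W_k @ x_r
    return np.concatenate([y_r, y_i, y_j, y_k])

m = 2  # component size; the full dimension is 4m = 8
rng = np.random.default_rng(0)
W = [rng.standard_normal((m, m)) for _ in range(4)]
x = rng.standard_normal(4 * m)
y = quaternion_linear(x, *W)

assert y.shape == (4 * m,)
# 4 * m^2 parameters versus (4m)^2 for a dense layer: exactly 1/4.
assert sum(Wc.size for Wc in W) == (4 * m) ** 2 // 4
```

The weight sharing across components is also what lets the layer model interactions between the real and imaginary parts, rather than treating them as independent channels.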

