TRAINING INVERTIBLE LINEAR LAYERS THROUGH RANK-ONE PERTURBATIONS

Abstract

Many types of neural network layers rely on matrix properties such as invertibility or orthogonality. Retaining such properties during optimization with gradient-based stochastic optimizers is a challenging task, which is usually addressed either by reparameterizing the affected parameters or by optimizing directly on the manifold. This work presents a novel approach for training invertible linear layers. In lieu of directly optimizing the network parameters, we train rank-one perturbations and add them to the actual weight matrices infrequently. This P4Inv update allows keeping track of inverses and determinants without ever explicitly computing them. We show how such invertible blocks improve the mixing and thus the mode separation of the resulting normalizing flows. Furthermore, we outline how the P4 concept can be utilized to retain properties other than invertibility.

1. INTRODUCTION

Figure 1: Training of deep neural networks (DNN). Standard DNNs transform inputs x into outputs y through activation functions and linear layers, which are tuned by an optimizer. In contrast, P4 training operates on perturbations to the parameters. These are defined to retain certain network properties (here: invertibility as well as tractable inversion and determinant computation). The perturbed parameters are merged at regular intervals.

Many deep learning applications depend critically on the neural network parameters having a certain mathematical structure. As an important example, reversible generative models rely on invertibility and, in the case of normalizing flows, on efficient inversion and computation of the Jacobian determinant (Papamakarios et al., 2019). Preserving parameter properties during training can be challenging, and many approaches are currently in use. The most basic way of incorporating constraints is by network design. Many examples could be listed: convolutional layers defined to obtain equivariances, network outputs constrained to certain intervals through bounded activation functions, Householder flows (Tomczak & Welling, 2016) that enforce layer-wise orthogonality, or coupling layers (Dinh et al., 2014; 2016) that enforce tractable inversion through their two-channel structure. A second approach concerns the optimizers used for training. Optimization routines have been tailored, for example, to maintain Lipschitz bounds (Yoshida & Miyato, 2017) or to efficiently optimize orthogonal linear layers (Choromanski et al., 2020). The present work introduces a novel algorithmic concept for training invertible linear layers that also facilitates tractable inversion and determinant computation; see Figure 1. In lieu of directly changing the network parameters, the optimizer operates on perturbations to these parameters.
The actual network parameters are frozen, while a parameterized perturbation (a rank-one update to the frozen parameters) serves as a proxy for optimization. Inputs are passed through the perturbed network during training. At regular intervals, the perturbed parameters are merged into the actual network and the perturbation is reset to the identity. This stepwise optimization approach will be referred to as property-preserving parameter perturbation, or P4 update. A similar concept was introduced recently by Lezcano-Casado (2019), who used dynamic trivializations for optimization on manifolds.

In this work, we use P4 training to optimize invertible linear layers while keeping track of their inverses and determinants using rank-one updates. Previous work (see Section 2) has mostly focused on optimizing orthogonal matrices, which can be trivially inverted and have unit determinant. Only very recently, Gresele et al. (2020) presented a first method to optimize general invertible matrices implicitly using relative gradients, thereby providing greater flexibility and expressivity. While their scheme implicitly tracks the determinants of the weight matrices, it does not facilitate cheap inversion. In contrast, the present P4Inv layers are inverted at the cost of roughly three matrix-vector multiplications. P4Inv layers can approximate arbitrary invertible matrices A ∈ GL(n). Interestingly, our stepwise perturbation even allows sign changes in the determinant and recovers the correct inverse after emerging from the ill-conditioned regime. Furthermore, it avoids any explicit computation of inverses or determinants: all operations occurring in optimization steps have a complexity of at most O(n^2). To our knowledge, the present method is the first to feature these desirable properties. We show how P4Inv blocks can be utilized in normalizing flows by combining them with nonlinear, bijective activation functions and with coupling layers.
The resulting neural networks are validated for density estimation and as deep generative models. Finally, we outline potential applications of P4 training to network properties other than invertibility.
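The P4 training cycle described above can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: in a real P4Inv layer the optimizer would update u and v by gradient descent between merges, whereas here random perturbations stand in for trained values. The function name `p4_merge` is hypothetical.

```python
import numpy as np

def p4_merge(A, A_inv, A_det, u, v):
    """Merge a trained rank-one perturbation u v^T into the frozen weight
    matrix A, updating the cached inverse (Sherman-Morrison) and determinant
    (matrix determinant lemma) in O(n^2), then reset the perturbation."""
    c = 1.0 + v @ A_inv @ u                                  # must stay nonzero
    A_new = A + np.outer(u, v)
    A_inv_new = A_inv - np.outer(A_inv @ u, v @ A_inv) / c   # Sherman-Morrison
    A_det_new = c * A_det                                    # determinant lemma
    zeros = np.zeros_like(u)                                 # reset: u = v = 0
    return A_new, A_inv_new, A_det_new, zeros, zeros.copy()

# Toy training cycle: random small perturbations stand in for optimized (u, v);
# the merge is applied at regular intervals, never recomputing inv/det from scratch.
rng = np.random.default_rng(0)
n = 4
A, A_inv, A_det = np.eye(n), np.eye(n), 1.0
for _ in range(5):
    u = 0.1 * rng.standard_normal(n)
    v = 0.1 * rng.standard_normal(n)
    A, A_inv, A_det, u, v = p4_merge(A, A_inv, A_det, u, v)

assert np.allclose(A @ A_inv, np.eye(n))
assert np.isclose(A_det, np.linalg.det(A))
```

Note that the cached inverse and determinant remain consistent with the merged weights across all cycles, even though neither `np.linalg.inv` nor `np.linalg.det` is ever called during the updates.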

2.1. RANK-ONE PERTURBATION

The P4Inv layers are based on rank-one updates, which are defined as transformations

    A → A + u v^T,   u, v ∈ R^n.

If A ∈ GL(n) and 1 + v^T A^{-1} u ≠ 0, the updated matrix is also invertible and its inverse can be computed via the Sherman-Morrison formula

    (A + u v^T)^{-1} = A^{-1} - (1 / (1 + v^T A^{-1} u)) A^{-1} u v^T A^{-1}.

Furthermore, the determinant is given by the matrix determinant lemma

    det(A + u v^T) = (1 + v^T A^{-1} u) det(A).

Both equations are widely used in numerical mathematics, since they sidestep the O(n^3) cost and poor parallelization of direct matrix inversion and determinant computation. The present work leverages these perturbation formulas to keep track of the inverses and determinants of weight matrices during the training of invertible neural networks.
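The two identities are easy to verify numerically. The following sketch (illustrative, using NumPy) computes the inverse and determinant of a rank-one-updated matrix in O(n^2) from cached values and checks them against direct O(n^3) recomputation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Invertible base matrix with cached inverse and determinant.
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A_inv = np.linalg.inv(A)          # computed once, O(n^3)
A_det = np.linalg.det(A)

# Rank-one perturbation A -> A + u v^T.
u = rng.standard_normal(n)
v = rng.standard_normal(n)
c = 1.0 + v @ A_inv @ u           # invertibility requires c != 0

# Sherman-Morrison: update the cached inverse in O(n^2).
A_inv_new = A_inv - np.outer(A_inv @ u, v @ A_inv) / c

# Matrix determinant lemma: update the cached determinant in O(1) given c.
A_det_new = c * A_det

# Cross-check against direct recomputation.
A_new = A + np.outer(u, v)
assert np.allclose(A_new @ A_inv_new, np.eye(n))
assert np.isclose(A_det_new, np.linalg.det(A_new))
```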

2.2. EXISTING APPROACHES FOR TRAINING INVERTIBLE LINEAR LAYERS

Maintaining invertibility of linear layers has been studied in the context of convolution operators (Kingma & Dhariwal, 2018; Karami et al., 2019; Hoogeboom et al., 2019; 2020) and using Sylvester's theorem (Van Den Berg et al., 2018). These approaches often involve decompositions that include triangular matrices (Papamakarios et al., 2019). While inverting a triangular matrix has only quadratic computational complexity, it is inherently sequential and thus fairly inefficient on parallel hardware (see Section 4.1). More closely related to our work, Gresele et al. (2020) introduced a relative gradient optimization scheme for invertible matrices. In contrast to their method, ours facilitates a cheap inverse pass and allows sign changes in the determinant. On the other hand, their method operates in a higher-dimensional search space, which might speed up optimization in tasks that do not involve inversion during training.
2.3. NORMALIZING FLOWS

Cheap inversion and determinant computation are particularly important in the context of normalizing flows; see Appendix C. They were introduced in Tabak et al. (2010); Tabak & Turner (2013)
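To illustrate why these two operations dominate the cost of a flow (an illustrative example, not taken from the paper): evaluating the density of a single linear flow layer z = A x via the change-of-variables formula requires log |det A| at every training step, and sampling requires solving with A:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # invertible linear layer
x = rng.standard_normal(n)

# Change of variables: log p(x) = log p_z(A x) + log |det A|,
# with a standard-normal base density p_z.
z = A @ x
log_pz = -0.5 * (z @ z) - 0.5 * n * np.log(2.0 * np.pi)
log_px = log_pz + np.log(abs(np.linalg.det(A)))

# Sampling inverts the layer: x = A^{-1} z.
x_rec = np.linalg.solve(A, z)
assert np.allclose(x_rec, x)
```

A P4Inv layer would replace the O(n^3) calls to `det` and `solve` here with its cached determinant and inverse, both maintained in O(n^2) per update.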

