MONOTONIC KRONECKER-FACTORED LATTICE

Abstract

Learning flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features is computationally challenging, and at a time when minimizing resource use is increasingly important, such models must also be efficient to train and evaluate. In this paper we show how to learn such functions effectively and efficiently using the Kronecker-Factored Lattice (KFL), an efficient reparameterization of flexible monotonic lattice regression via the Kronecker product. Both computational and storage costs scale linearly in the number of input features, a significant improvement over existing methods whose costs grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The KFL function class consists of products of piecewise-linear functions, and its size can be further increased through ensembling. We prove that the function class of an ensemble of M base KFL models strictly increases as M increases up to a certain threshold; beyond this threshold, every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that KFL trains faster with fewer parameters while achieving accuracy and evaluation speeds comparable to or better than baseline methods, and it preserves monotonicity guarantees on the learned model.

1. INTRODUCTION

Many machine learning problems have requirements beyond accuracy, such as efficiency, storage, and interpretability. For example, the ability to learn flexible monotonic functions at scale is useful in practice because machine learning practitioners often know a priori which input features relate positively or negatively to the output and can incorporate these hints as inductive bias in training to further regularize the model (Abu-Mostafa, 1993) and guarantee its expected behavior on unseen examples. It is, however, computationally challenging to learn such functions efficiently. In this paper, we extend the work on interpretable monotonic lattice regression (Gupta et al., 2016) to significantly reduce computational and storage costs without compromising accuracy. While a linear model with nonnegative coefficients can learn simple monotonic functions, its function class is restricted. Prior works proposed non-linear methods (Sill, 1997; Dugas et al., 2009; Daniels & Velikova, 2010; Qu & Hu, 2011) to learn more flexible monotonic functions, but these have been shown to work only in limited settings with small datasets and low-dimensional feature spaces. Monotonic lattice regression (Gupta et al., 2016), an extension of lattice regression (Garcia et al., 2012), learns an interpolated look-up table with linear inequality constraints that impose monotonicity. It has been demonstrated to work with millions of training examples and to achieve competitive accuracy against, for example, random forests; however, because the number of model parameters scales exponentially in the number of input features, such models are difficult to apply in high-dimensional feature spaces. To overcome this limitation, Canini et al. (2016) incorporated ensemble learning to combine many tiny lattice models, each capturing non-linear interactions among a small random subset of features.
This paper proposes the Kronecker-Factored Lattice (KFL), a novel reparameterization of monotonic lattice regression via the Kronecker product that achieves significant parameter efficiency while simultaneously providing guaranteed monotonicity of the learned model with respect to a user-prescribed set of input features, supporting user trust. Both inference and storage costs scale linearly in the number of features, potentially allowing more features to interact non-linearly. Kronecker factorization has been applied to a wide variety of problems including optimization (Martens & Grosse, 2015; George et al., 2018; Osawa et al., 2019), convolutional neural networks (Zhang et al., 2015), and recurrent neural networks (Jose et al., 2018), but, to the best of our knowledge, it has not yet been explored for learning flexible monotonic functions. The main contributions of this paper are: (1) a reparameterization of monotonic lattice regression that achieves linear evaluation time and storage for the resulting subclass of models, (2) proofs of necessary and sufficient conditions for a KFL model to be monotonic as well as convex and nonnegative, (3) a demonstration of how the conditions for monotonicity can be efficiently imposed during training, (4) a characterization of the values of M for which the capacity of the function class of an ensemble of M base KFL models strictly increases, together with a proof that with a large enough M an ensemble can express every multilinear interpolated lattice function, and (5) experimental results on both public and proprietary datasets demonstrating that KFL achieves accuracy and evaluation speeds comparable to or better than other lattice methods while requiring less training time and fewer parameters.
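To make the scaling claim concrete, a back-of-the-envelope comparison of parameter counts. These counting helpers are our own illustration, not code from the paper; the KFL count assumes one length-V factor per feature per base model, as the linear-scaling claim suggests:

```python
def lattice_param_count(D, V):
    """Full lattice regression: one parameter per vertex of a V^D grid,
    so storage is exponential in the number of features D."""
    return V ** D

def kfl_param_count(D, V, M=1):
    """KFL (illustrative assumption): D per-feature factors of length V
    for each of M base models, so storage is linear in D."""
    return M * D * V

# e.g. D = 20 features with V = 3 vertices per dimension:
# a full lattice needs 3**20 (about 3.5 billion) parameters,
# while a single Kronecker-factored model needs only 20 * 3 = 60.
```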

2. NOTATION AND OVERVIEW OF MONOTONIC LATTICE REGRESSION

For n ∈ N we denote by [n] the set {1, 2, . . . , n}. For x ∈ R^D and d ∈ [D], let x[d] denote the dth entry of x. We use e_{d,n} to denote the one-hot vector in {0, 1}^n where e_{d,n}[j] = 1 if j = d and e_{d,n}[j] = 0 otherwise. When n is clear from the context we write e_d. For two real vectors v, w of the same length we use v • w to denote their dot product. We write v ≤ w and v ≥ w if the respective inequality holds entrywise. We denote by 0 the zero vector in a real vector space whose dimension is clear from the context. For two sets S and T we use S × T to denote their Cartesian product and S \ T to denote the set of elements in S but not in T. We denote by |S| the cardinality of S. For w ∈ R^V, we use f_pwl(x; w) to denote the 1-dimensional continuous piecewise linear function f_pwl(·; w) : [0, V − 1] → R whose graph has V − 1 linear segments, where for i = 1, . . . , V − 1, the ith segment connects the point (i − 1, w[i]) with (i, w[i + 1]). See Figure 1 for an example. Observe that for any α, β ∈ R and vectors w, v ∈ R^V, we have f_pwl(·; αw + βv) = αf_pwl(·; w) + βf_pwl(·; v).
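A minimal sketch of f_pwl in Python. The function name and 0-indexed weight handling are our own illustration of the definition above, not code from the paper:

```python
import numpy as np

def f_pwl(x, w):
    """Evaluate the 1-D continuous piecewise-linear function f_pwl(x; w).

    With V = len(w), the graph has V - 1 linear segments on [0, V - 1];
    segment i (0-indexed here) connects (i, w[i]) to (i + 1, w[i + 1]),
    matching the paper's 1-indexed description.
    """
    V = len(w)
    x = float(np.clip(x, 0.0, V - 1))  # domain is [0, V - 1]
    i = min(int(np.floor(x)), V - 2)   # index of the segment containing x
    t = x - i                          # fractional position within the segment
    return (1.0 - t) * w[i] + t * w[i + 1]
```

Because interpolation is linear in the parameters, f_pwl(·; αw + βv) = α f_pwl(·; w) + β f_pwl(·; v), which is the linearity property noted above.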

Table of major notation:

  D            Number of features
  e_{d,n}      One-hot n-dimensional vector with 1 in the dth entry
  f_pwl        1-dimensional piecewise linear function
  f_mll        Multilinear lattice function
  f_sl         Simplex lattice function
  f_kfl        KFL function
  V            D-dimensional size of lattice
  M_V          Set of vertices of lattice of size V
  D_V          Domain of a lattice model
  Φ            Interpolation kernel
  θ            Vector of lattice parameters
  ⊗_i v_i      Outer / Kronecker product of vectors v_i
  KFL(V, M)    Class of ensembles of M KFL functions of size V
  r(T)         Rank of tensor T

Lattice models, proposed in Garcia et al. (2012), are interpolated look-up tables such that the function parameters are the function values sampled on a regular grid. They have been shown to admit efficient training procedures such that the resulting model is guaranteed to satisfy various types of shape constraints, including monotonicity and convexity (Gupta et al., 2016; 2018; Cotter et al., 2019), without severely restricting the model class. See Figure 2 for an example.
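The look-up-table view can be sketched as multilinear interpolation over a regular grid. This is our own illustrative implementation, not the paper's code; it makes explicit that the parameter tensor θ has one entry per lattice vertex, i.e. exponentially many in D:

```python
import numpy as np
from itertools import product

def f_mll(x, theta, V):
    """Multilinear interpolation of a lattice (look-up table).

    theta holds one parameter per lattice vertex (prod(V) entries in
    total), and a point x in [0, V[d] - 1] for each dimension d is
    interpolated from the 2^D vertices of the grid cell containing it.
    """
    theta = np.asarray(theta, dtype=float).reshape(V)
    D = len(V)
    lo = [min(int(np.floor(x[d])), V[d] - 2) for d in range(D)]  # cell corner
    t = [x[d] - lo[d] for d in range(D)]                         # offsets in cell
    val = 0.0
    for corner in product([0, 1], repeat=D):  # the 2^D vertices of the cell
        weight = np.prod([t[d] if corner[d] else 1.0 - t[d] for d in range(D)])
        idx = tuple(lo[d] + corner[d] for d in range(D))
        val += weight * theta[idx]
    return val
```

Both the storage (prod(V) parameters) and the 2^D-term interpolation above scale exponentially in D, which is exactly the cost that the Kronecker factorization in this paper avoids.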

