MONOTONIC KRONECKER-FACTORED LATTICE

Abstract

Learning flexible monotonic functions that guarantee model behavior and provide interpretability beyond a few input features is computationally challenging, and at a time when minimizing resource use is increasingly important, we must be able to learn such models efficiently. In this paper we show how to effectively and efficiently learn such functions using a Kronecker-Factored Lattice (KFL), an efficient reparameterization of flexible monotonic lattice regression via the Kronecker product. Both computational and storage costs scale linearly in the number of input features, a significant improvement over existing methods that grow exponentially. We also show that we can still properly enforce monotonicity and other shape constraints. The KFL function class consists of products of piecewise-linear functions, and the size of the function class can be further increased through ensembling. We prove that the function class of an ensemble of M base KFL models strictly increases as M increases up to a certain threshold, beyond which every multilinear interpolated lattice function can be expressed. Our experimental results demonstrate that KFL trains faster with fewer parameters while still achieving accuracy and evaluation speeds comparable to or better than the baseline methods, all while preserving monotonicity guarantees on the learned model.

1. INTRODUCTION

Many machine learning problems have requirements beyond accuracy, such as efficiency, storage, and interpretability. For example, the ability to learn flexible monotonic functions at scale is useful in practice because machine learning practitioners often know a priori which input features positively or negatively relate to the output, and can incorporate these hints as inductive bias in training to further regularize the model (Abu-Mostafa, 1993) and guarantee its expected behavior on unseen examples. It is, however, computationally challenging to learn such functions efficiently. In this paper, we extend the work on interpretable monotonic lattice regression (Gupta et al., 2016) to significantly reduce computational and storage costs without compromising accuracy. While a linear model with nonnegative coefficients is able to learn simple monotonic functions, its function class is restricted. Prior works proposed non-linear methods (Sill, 1997; Dugas et al., 2009; Daniels & Velikova, 2010; Qu & Hu, 2011) to learn more flexible monotonic functions, but these have been shown to work only in limited settings of small datasets and low-dimensional feature spaces. Monotonic lattice regression (Gupta et al., 2016), an extension of lattice regression (Garcia et al., 2012), learns an interpolated look-up table with linear inequality constraints that impose monotonicity. This approach has been demonstrated to work with millions of training examples and to achieve competitive accuracy against, for example, random forests; however, because the number of model parameters scales exponentially in the number of input features, it is difficult to apply such models in high-dimensional feature space settings. To overcome this limitation, Canini et al. (2016) incorporated ensemble learning to combine many tiny lattice models, each capturing non-linear interactions among a small random subset of features.
This paper proposes the Kronecker-Factored Lattice (KFL), a novel reparameterization of monotonic lattice regression via the Kronecker product that achieves significant parameter efficiency while simultaneously guaranteeing monotonicity of the learned model with respect to a user-prescribed set of input features, supporting user trust. Both inference and storage costs scale linearly in the number of features; hence, KFL potentially allows more features to interact non-linearly. Kronecker factorization has been applied to a wide variety of problems including optimization (Martens & Grosse, 2015; George et al., 2018; Osawa et al., 2019), convolutional neural networks (Zhang et al., 2015), and recurrent neural networks (Jose et al., 2018), but, to the best of our knowledge, has not yet been explored for learning flexible monotonic functions. The main contributions of this paper are: (1) a reparameterization of monotonic lattice regression that achieves linear evaluation time and storage for the resulting subclass of models, (2) proofs of sufficient and necessary conditions for a KFL model to be monotonic as well as convex and nonnegative, (3) a demonstration of how the conditions for monotonicity can be efficiently imposed during training, (4) a characterization of the values of M for which the capacity of the function class of an ensemble of M base KFL models strictly increases, showing that with a large enough M an ensemble can express every multilinear interpolated lattice function, and (5) experimental results on both public and proprietary datasets demonstrating that KFL has accuracy and evaluation speeds comparable to or better than other lattice methods while requiring less training time and fewer parameters.

2. NOTATION AND OVERVIEW OF MONOTONIC LATTICE REGRESSION

For n ∈ N we denote by [n] the set {1, 2, ..., n}. For x ∈ R^D and d ∈ [D], let x[d] denote the dth entry of x. We use e_{d,n} to denote the one-hot vector in {0, 1}^n where e_{d,n}[j] = 1 if j = d and e_{d,n}[j] = 0 otherwise. When n is clear from the context we write e_d. For two real vectors v, w of the same length we use v • w to denote their dot product. We write v ≤ w and v ≥ w if the respective inequality holds entrywise. We denote by 0 the zero vector in a real vector space whose dimension is clear from the context. For two sets S and T we use S × T to denote their Cartesian product and S \ T to denote the set of elements in S but not in T. We denote by |S| the cardinality of S. For w ∈ R^V, we use f_pwl(x; w) to denote the 1-dimensional continuous piecewise-linear function f_pwl(•; w) : [0, V−1] → R whose graph has V−1 linear segments, where for i = 1, ..., V−1, the ith segment connects the point (i−1, w[i]) with the point (i, w[i+1]). See Figure 1 for an example. Observe that for any α, β ∈ R and vectors w, v ∈ R^V, we have f_pwl(•; αw + βv) = α f_pwl(•; w) + β f_pwl(•; v).
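The function f_pwl can be sketched with NumPy's linear interpolation (a minimal illustration with function names of our own choosing; note the code uses 0-based indexing of w while the text indexes w from 1):

```python
import numpy as np

def f_pwl(x, w):
    """1-d continuous piecewise-linear function on [0, len(w)-1] whose
    ith segment connects (i-1, w[i-1]) to (i, w[i]) (0-based w)."""
    return np.interp(x, np.arange(len(w)), w)

w = np.array([0.0, 2.0, 1.0])   # V = 3 knot values
# At the knots the function recovers the parameters...
assert f_pwl(0.0, w) == 0.0 and f_pwl(1.0, w) == 2.0
# ...and is linear in between: the midpoint of the first segment is (0+2)/2.
assert f_pwl(0.5, w) == 1.0
# Linearity in w: f_pwl(x; a*w + b*v) = a*f_pwl(x; w) + b*f_pwl(x; v).
v = np.array([1.0, 0.0, 3.0])
x = 1.7
assert np.isclose(f_pwl(x, 2 * w + 3 * v), 2 * f_pwl(x, w) + 3 * f_pwl(x, v))
```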

D: Number of features
e_{d,n}: One-hot n-dimensional vector with 1 in the dth entry
f_pwl: 1-dimensional piecewise-linear function
f_mll: Multilinear lattice function
f_sl: Simplex lattice function
f_kfl: KFL function
V: D-dimensional size of lattice
M_V: Set of vertices of lattice of size V
D_V: Domain of a lattice model
Φ: Interpolation kernel
θ: Vector of lattice parameters
⊗_i v_i: Outer / Kronecker product of vectors v_i
KFL(V, M): Class of ensembles of M KFL functions of size V
r(T): Rank of tensor T

Table 1: Table of major notation

Lattice models, proposed in Garcia et al. (2012), are interpolated look-up tables such that the function parameters are the function values sampled on a regular grid. They have been shown to exhibit efficient training procedures such that the resulting model is guaranteed to satisfy various types of shape constraints, including monotonicity and convexity (Gupta et al., 2016; 2018; Cotter et al., 2019), without severely restricting the model class. See Figure 2 for an example. Given a lattice size V ∈ N^D, the set of lattice vertices is M_V = {0, 1, ..., V[1]−1} × ... × {0, 1, ..., V[D]−1}. Thus, V[d] is the number of vertices in the lattice in the dth dimension, and the grid has ∏_{d=1}^D V[d] vertices. We denote by D_V ⊆ R^D the domain of the lattice model, given by D_V = [0, V[1]−1] × ... × [0, V[D]−1]. Typically, each feature d is "calibrated" so that its value lies in [0, V[d]−1], using the piecewise-linear calibration technique of Gupta et al. (2016) (see also Section 3.5). The lattice parameters are a vector θ ∈ R^{M_V}, where R^{M_V} denotes the set of vectors of length |M_V| with entries indexed by M_V: for each vertex v ∈ M_V, there is a corresponding parameter we denote by θ_v. The lattice model or function is obtained by interpolating the parameters using one of many possible interpolation schemes specified by an interpolation kernel Φ : D_V → R^{M_V}. Given Φ, the resulting lattice model f : D_V → R is defined by f(x; θ) = θ • Φ(x) = ∑_{v∈M_V} θ_v (Φ(x))_v, where (Φ(x))_v denotes the entry of Φ(x) with index v. Typically, one requires that for all u, v ∈ M_V, (Φ(v))_u = 1 if u = v, and 0 otherwise, so that f(v; θ) = θ_v. In Gupta et al. (2016) the authors present two interpolation schemes termed 'multilinear' and 'simplex'. We denote the corresponding kernels by Φ_multilinear^V and Φ_simplex^V and the resulting lattice models by f_mll(•; θ) and f_sl(•; θ), respectively.
We give the definition of Φ_multilinear^V here since we make use of it later, but refer the reader to Gupta et al. (2016) for the definition of simplex interpolation:

(Φ_multilinear^V(x))_v = ∏_{j∈[D]} f_pwl(x[j]; e_{v[j]+1, V[j]}), for v ∈ M_V. (2)

For both multilinear and simplex interpolated lattice functions, the resulting function is increasing (resp. decreasing) in a given direction if and only if the function is increasing (resp. decreasing) on the grid's vertices in that direction (Gupta et al., 2016, Lemmas 1, 3). As a result, training a lattice model with these interpolation schemes to respect the monotonicity constraint can be done by solving a constrained optimization problem with linear inequality constraints (Gupta et al., 2016). Evaluating the lattice function at a single point can be done in O(2^D) time for multilinear interpolation and in O(D log D) time for simplex interpolation (Gupta et al., 2016). Both interpolation schemes require O(∏_d V[d]) space for storing the parameters (assuming a constant number of bits per parameter). In the next section we present a subclass of the set of multilinear-interpolation lattice functions, which we term KFL(V). Each function in this class can be evaluated in O(D) time and requires only O(∑_d V[d]) space for its parameters. Moreover, like the multilinear and simplex interpolated lattice function classes, there is a necessary and sufficient condition for a function in KFL(V) to be monotonic in a subset of its inputs, and this condition can be efficiently checked. As a result, one can train models in KFL(V) that are guaranteed to be monotonic in a prescribed set of features.
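The multilinear interpolation kernel in (2) can be sketched numerically as follows (a minimal NumPy illustration with function names of our own choosing; the one-hot vectors follow the definition of e_{d,n} above). It also checks two properties stated above: the weights form a convex combination, and the model recovers θ_v at each vertex v:

```python
import itertools
import numpy as np

def f_pwl(x, w):
    return np.interp(x, np.arange(len(w)), w)

def phi_multilinear(x, V):
    """Multilinear interpolation weights: one weight per lattice vertex v,
    equal to prod_j f_pwl(x[j]; e_{v[j]+1, V[j]})."""
    weights = {}
    for v in itertools.product(*[range(n) for n in V]):
        one_hots = [np.eye(n)[vj] for n, vj in zip(V, v)]
        weights[v] = np.prod([f_pwl(xj, e) for xj, e in zip(x, one_hots)])
    return weights

V = (2, 3)
theta = {v: float(i) for i, v in enumerate(itertools.product(range(2), range(3)))}
x = np.array([0.4, 1.7])
phi = phi_multilinear(x, V)
f_x = sum(theta[v] * phi[v] for v in phi)   # f(x; theta) = theta . Phi(x)
assert np.isclose(sum(phi.values()), 1.0)   # weights sum to 1 (convex combination)
# At a vertex the model recovers the parameter: f(v; theta) = theta_v.
phi_v = phi_multilinear(np.array([1.0, 2.0]), V)
assert np.isclose(sum(theta[v] * phi_v[v] for v in phi_v), theta[(1, 2)])
```

Note that the loop above touches every vertex; in practice only the 2^D vertices of the enclosing cell have nonzero weight, which gives the O(2^D) evaluation time quoted above.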

3. KRONECKER-FACTORED LATTICE

The Kronecker product of two matrices A = (a_{i,j}) ∈ R^{m×n} and B = (b_{i,j}) ∈ R^{p×q} is defined as the mp × nq real block matrix given by

A ⊗ B = [ a_{11}B ... a_{1n}B ; ... ; a_{m1}B ... a_{mn}B ].

The Kronecker product is associative and generally non-commutative; however, A ⊗ B and B ⊗ A are permutation equivalent. We extend the definition of the Kronecker product to vectors by regarding them as matrices with a single row or column. For s vectors v_i ∈ R^{n_i}, we use the notation ⊗_{i=1}^s v_i = v_1 ⊗ ... ⊗ v_s. Let I = {0, ..., n_1−1} × ... × {0, ..., n_s−1}. We index the entries of ⊗_{i=1}^s v_i by I, so that the entry with index (i_1, ..., i_s) ∈ I is ∏_{j=1}^s v_j[i_j + 1]. The Kronecker product satisfies the mixed-product property: for four matrices A ∈ R^{m×n}, B ∈ R^{p×q}, C ∈ R^{n×r}, D ∈ R^{q×s} it holds that (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), where for two matrices M, N we denote their conventional product by MN. Consequently, for vectors a, c ∈ R^n and b, d ∈ R^q it holds that

(a ⊗ b) • (c ⊗ d) = (a • c) ⊗ (b • d) = (a • c)(b • d), (4)

where the rightmost equality follows since the Kronecker product of scalars reduces to a simple product. See (Loan, 2000) for more details on Kronecker products. We are now ready to define the class KFL(V). Let V ∈ N^D and fix a lattice of size V. The class KFL(V) is the subclass of all multilinear interpolation lattice functions whose parameters are Kronecker products of D factors. Precisely, KFL(V) = {f_mll(•; ⊗_{d=1}^D w_d) : w_d ∈ R^{V[d]} for all d}. We reparameterize the functions in KFL(V) and use f_kfl(•; w_1, ..., w_D) = f_mll(•; ⊗_{d=1}^D w_d) to denote the function with parameters {w_d}_d. The following proposition shows that functions in KFL(V) are products of 1-dimensional piecewise-linear functions. It essentially follows from the fact that Φ_multilinear^V(x) can be expressed as a Kronecker product of D vectors. Proposition 1. Let V ∈ N^D. For all w_1 ∈ R^{V[1]}, . . .
, w_D ∈ R^{V[D]},

f_kfl(x; w_1, ..., w_D) = ∏_d f_pwl(x[d]; w_d) for all x ∈ D_V. (5)

Proof of Proposition 1. For v ∈ N, let Ψ_v : R → R^v be given by Ψ_v(x) = [f_pwl(x; e_{1,v}) f_pwl(x; e_{2,v}) ... f_pwl(x; e_{v,v})]. Let w_1, ..., w_D be vectors satisfying the conditions in the proposition. By the definition of KFL(V), we have f_kfl(•; w_1, ..., w_D) = f_mll(•; ⊗_{d=1}^D w_d). Fix Φ = Φ_multilinear^V. It follows from (2) that for all x ∈ D_V we have Φ(x) = ⊗_{d=1}^D Ψ_{V[d]}(x[d]). Therefore, using (4) we get, for all x ∈ D_V,

f_kfl(x; w_1, ..., w_D) = f_mll(x; ⊗_{d=1}^D w_d) = (⊗_{d=1}^D w_d) • (⊗_{d=1}^D Ψ_{V[d]}(x[d])) = ∏_{d=1}^D (w_d • Ψ_{V[d]}(x[d])). (6)

Now, for each d ∈ [D] we have

w_d • Ψ_{V[d]}(x[d]) = ∑_{i∈[V[d]]} w_d[i] f_pwl(x[d]; e_{i,V[d]}) = f_pwl(x[d]; ∑_i w_d[i] e_{i,V[d]}) = f_pwl(x[d]; w_d). (7)

Substituting (7) into (6), we obtain (5). It follows from Proposition 1 that evaluating a function in KFL(V) requires evaluating D piecewise-linear functions with uniformly spaced knots and computing their product. This can be done in O(D) time. The storage complexity is O(# parameters) = O(∑_i V[i]).
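Proposition 1 can be checked numerically. The sketch below (names and shapes are our own choices) builds a full multilinear lattice whose parameter tensor is the outer product of the factors and compares it against the O(D) product-of-piecewise-linear evaluation:

```python
import itertools
import numpy as np

def f_pwl(x, w):
    return np.interp(x, np.arange(len(w)), w)

def f_mll(x, theta, V):
    """Multilinear lattice: theta is a parameter tensor of shape V."""
    total = 0.0
    for v in itertools.product(*[range(n) for n in V]):
        weight = np.prod([f_pwl(x[j], np.eye(V[j])[v[j]]) for j in range(len(V))])
        total += theta[v] * weight
    return total

def f_kfl(x, ws):
    """KFL: product of 1-d piecewise-linear factors; O(D) evaluation."""
    return np.prod([f_pwl(xd, wd) for xd, wd in zip(x, ws)])

rng = np.random.default_rng(0)
V = (3, 2, 4)
ws = [rng.normal(size=n) for n in V]
theta = np.einsum('i,j,k->ijk', *ws)   # outer (Kronecker) product of the factors
x = np.array([1.3, 0.8, 2.6])
assert np.isclose(f_mll(x, theta, V), f_kfl(x, ws))
```

The lattice stores 3 · 2 · 4 = 24 parameters, while the factored form stores only 3 + 2 + 4 = 9, illustrating the ∏ versus ∑ storage gap.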

3.1. MONOTONICITY CRITERIA FOR KFL

Let i ∈ {1, ..., D}. We say that a function f : D → R, where D ⊆ R^D, is increasing (resp. decreasing) in direction i in D if for every x ∈ D and real δ ≥ 0 such that x + δe_i ∈ D, it holds that f(x) ≤ f(x + δe_i) (resp. f(x) ≥ f(x + δe_i)). Note that a function that does not depend on x[i] is thus "trivially" increasing in direction i. In this section we disregard such edge cases and require a function increasing in the ith direction to strictly increase on at least one pair of inputs. To simplify the exposition, we only deal with increasing functions in this section; all the results transfer naturally to the decreasing case as well. The next proposition shows a sufficient and necessary condition for a function in KFL(V) to be increasing in a subset of its features in D_V. In Section 3.5 we explain how to use this result to train models that are guaranteed to be monotonic.

Proposition 2. Fix V ∈ N^D, let f ∈ KFL(V) and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is increasing in directions i_1, ..., i_p iff f can be written as f(x) = σ ∏_d f_pwl(x[d]; w_d), where σ ∈ {+1, −1} and w_d ∈ R^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:

1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], σ w_{i_j}[1] ≤ σ w_{i_j}[2] ≤ ... ≤ σ w_{i_j}[V[i_j]].

See Appendix A for the proof.
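The conditions of Proposition 2 can be spot-checked numerically. Below is a minimal sketch (the factor values are hypothetical) that builds factors satisfying conditions 1 and 3 for two increasing directions with σ = +1 and verifies monotonicity on a grid over the lattice domain:

```python
import itertools
import numpy as np

def f_pwl(x, w):
    return np.interp(x, np.arange(len(w)), w)

def f_kfl(x, ws, sigma=1.0):
    # KFL model: sigma times a product of 1-d piecewise-linear factors.
    return sigma * np.prod([f_pwl(xd, wd) for xd, wd in zip(x, ws)])

# Factors chosen to satisfy Proposition 2 for increasing directions 1 and 2
# (p = 2): every factor is nonnegative, and the factors of the constrained
# directions have nondecreasing entries.
ws = [np.array([0.0, 0.5, 2.0]),   # direction 1: nonnegative, nondecreasing
      np.array([0.1, 0.1, 0.3]),   # direction 2: nonnegative, nondecreasing
      np.array([1.0, 0.2, 0.7])]   # direction 3: nonnegative, unconstrained

# Spot-check monotonicity along directions 1 and 2 on a grid of D_V.
grid = np.linspace(0.0, 2.0, 9)
for i in (0, 1):  # 0-based indices of the monotonic directions
    for pt in itertools.product(grid, repeat=3):
        x = np.array(pt)
        x_hi = x.copy()
        x_hi[i] = min(x[i] + 0.25, 2.0)
        assert f_kfl(x_hi, ws) >= f_kfl(x, ws) - 1e-12
```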

3.2. CRITERIA FOR OTHER SHAPE CONSTRAINTS

While the focus of this paper is on learning flexible monotonic functions, other shape constraints, namely convexity and nonnegativity, can easily be imposed on KFL. This section presents propositions giving sufficient and necessary conditions for these constraints; shape constraints beyond these are left for future work. We say that a function f : D → R is convex (resp. concave) in direction i in D if for every x ∈ D and real δ ≥ 0 such that x, x + δe_i ∈ D, we have f(tx + (1−t)(x + δe_i)) ≤ tf(x) + (1−t)f(x + δe_i) (resp. f(tx + (1−t)(x + δe_i)) ≥ tf(x) + (1−t)f(x + δe_i)) for all t ∈ [0, 1]. As with monotonicity, we require a function convex (resp. concave) in direction i to strictly satisfy the inequality for some x, δ and t (i.e., we disregard functions that are affine in direction i in D). The following proposition gives a sufficient and necessary criterion for a function to be convex in a subset of directions; an analogous criterion can be derived for concavity.

Proposition 3. Fix V ∈ N^D, let f ∈ KFL(V) and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is convex in directions i_1, ..., i_p iff f can be written as f(x) = σ ∏_d f_pwl(x[d]; w_d), where σ ∈ {+1, −1} and w_d ∈ R^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:

1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], either V[i_j] = 2, or V[i_j] > 2 and σ(w_{i_j}[k+2] − w_{i_j}[k+1]) ≥ σ(w_{i_j}[k+1] − w_{i_j}[k]) for all k ∈ [V[i_j] − 2].

See Appendix A for the proof.

Proposition 4. Fix V ∈ N^D and let f ∈ KFL(V). Then f is nonnegative (resp. strictly positive) iff f can be written as f(x) = ∏_d f_pwl(x[d]; w_d), where w_d ≥ 0 (resp. w_d > 0) for all d ∈ [D]. See Appendix A for the proof.
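Condition 3 of Proposition 3 says that, along a constrained direction, the knot values of the corresponding factor have nondecreasing successive differences (for σ = +1), which makes the 1-d factor convex. A small numeric sketch with hypothetical values:

```python
import numpy as np

def f_pwl(x, w):
    return np.interp(x, np.arange(len(w)), w)

# Condition 3 of Proposition 3 for sigma = +1: successive differences of the
# knot values are nondecreasing, i.e. all second differences are >= 0.
w_conv = np.array([4.0, 1.0, 0.5, 2.0])   # diffs: -3.0, -0.5, 1.5 (nondecreasing)
assert np.all(np.diff(w_conv, n=2) >= 0)

# The induced 1-d factor is then convex on [0, 3]: check the midpoint
# inequality f((a+b)/2) <= (f(a) + f(b)) / 2 on a grid of pairs.
xs = np.linspace(0.0, 3.0, 13)
for a in xs:
    for b in xs:
        mid = f_pwl((a + b) / 2, w_conv)
        assert mid <= (f_pwl(a, w_conv) + f_pwl(b, w_conv)) / 2 + 1e-12
```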

3.3. ENSEMBLES OF KFL MODELS

It follows from Proposition 2 that the set of monotonic functions in KFL(V) is fairly restricted: for example, a function in KFL(V) that is increasing in two or more directions cannot change sign in D_V. To make the model class larger, we use ensembles of sub-models in KFL(V) and take the average or the sum of their predictions as the composite model's prediction. We are thus motivated to define the class of functions that can be expressed as sums of M KFL models:

KFL(V, M) = {∑_{i=1}^M g_i : g_i ∈ KFL(V)}.

Since the zero function with domain D_V is in KFL(V), we have KFL(V, M) ⊆ KFL(V, M + 1). Let L(V) denote the set of all multilinear interpolation lattice functions on a grid of size V. By construction, L(V) is closed under addition, so we have KFL(V) = KFL(V, 1) ⊆ KFL(V, 2) ⊆ ... ⊆ L(V). Moreover, the following proposition shows that there exists a positive integer M such that for all m < M there are functions that can be expressed as a sum of m + 1 KFL(V) models but cannot be expressed as a sum of m KFL(V) models, and every function in L(V) can be expressed as a sum of M functions in KFL(V). To state the proposition, we require the following definitions on tensors, which are multidimensional generalizations of matrices. There are multiple ways to define a real tensor. Here we view a real tensor of size V as a D-dimensional V[1] × ... × V[D] array of real numbers; the set of such tensors is denoted by R^{V[1]} ⊗ ... ⊗ R^{V[D]}. We index the entries of tensors of size V by M_V. Addition of two tensors of the same size and multiplication of a tensor by a real scalar are defined entrywise. Observe that each element in R^{M_V} can naturally be regarded as a tensor of size V and vice versa. The outer product of D vectors w_i ∈ R^{V[i]}, i ∈ [D], is their Kronecker product ⊗_{i=1}^D w_i regarded as a tensor of size V, and we use the same notation for the outer product as for the Kronecker product.
A real tensor is called simple if it is the outer product of some D real vectors. Every tensor T can be expressed as a sum of simple tensors, and the rank of T, denoted r(T), is the minimum number of simple tensors that sum to T. The rank of a tensor of size V is at most |M_V| / max_i{V[i]}. See (Rabanser et al., 2017) for an introduction to tensors.

Proposition 5. For two sets A, B we use A ⊊ B to denote that A is a subset of B but A ≠ B. Let M = max{r(T) : T ∈ R^{V[1]} ⊗ ... ⊗ R^{V[D]}}. Then KFL(V, 1) ⊊ KFL(V, 2) ⊊ ... ⊊ KFL(V, M) = L(V). See Appendix A for the proof.
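For D = 2, parameter tensors are matrices, and Proposition 5 can be illustrated with the singular value decomposition: any parameter matrix is a sum of at most min(V[1], V[2]) simple (rank-1) tensors, each of which is the parameter tensor of a single KFL model, so an ensemble of that many KFL models expresses the lattice exactly. A sketch (shapes and seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
V1, V2 = 3, 4
theta = rng.normal(size=(V1, V2))   # arbitrary multilinear lattice parameters

# The SVD writes theta as a sum of rank-1 outer products (s_i * u_i) ⊗ v_i.
U, s, Vt = np.linalg.svd(theta)
M = np.linalg.matrix_rank(theta)    # generically M = min(V1, V2) = 3 here
factors = [(s[i] * U[:, i], Vt[i, :]) for i in range(M)]
reconstruction = sum(np.outer(u, v) for u, v in factors)
assert np.allclose(reconstruction, theta)
# So an ensemble of M KFL models (one per rank-1 factor) expresses this
# lattice exactly, matching the bound of Proposition 5 for D = 2.
```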

3.4. DEPENDENCE OF THE CAPACITY OF THE MODEL CLASS ON LATTICE SIZE

In the previous section, we showed that the capacity of KFL(V, M) increases as M increases up to a certain threshold. One may also ask how the capacity of KFL(V, M) depends on V. Unfortunately, the domain D_V of a member of KFL(V, M) is different for different vectors V, which introduces a technical difficulty when trying to compare the capacity across different lattice sizes. In practice, we follow the technique of Gupta et al. (2016) and compose each model with an input calibration function that maps the features into the lattice domain (see Section 3.5). Letting C(V) denote the set of such calibration functions mapping into D_V, we define the calibrated class CKFL(V, M) = {f(c(•)) : f ∈ KFL(V, M), c ∈ C(V)}. The following proposition shows that the capacity of CKFL(V, M) increases as V increases.

Proposition 6. Let V_1, V_2 ∈ N^D be two lattice sizes and M a positive integer. If V_1 ≤ V_2, then CKFL(V_1, M) ⊆ CKFL(V_2, M). See Appendix A for the proof.
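The calibration functions c ∈ C(V) can be sketched as monotonic piecewise-linear maps from the raw feature range into a lattice axis [0, V[d]−1]; the keypoint values below are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def calibrate(x_raw, in_keypoints, out_keypoints):
    """Monotonic piecewise-linear calibrator mapping a raw feature into the
    lattice domain; clamps to the endpoints outside the keypoint range."""
    return np.interp(x_raw, in_keypoints, out_keypoints)

# Example: map a raw feature observed in [10, 1000] onto a V = 4 lattice axis.
in_kp = np.array([10.0, 50.0, 200.0, 1000.0])   # learned input keypoints
out_kp = np.array([0.0, 1.0, 2.0, 3.0])         # nondecreasing => monotonic map
assert calibrate(10.0, in_kp, out_kp) == 0.0
assert calibrate(1000.0, in_kp, out_kp) == 3.0
assert 1.0 <= calibrate(100.0, in_kp, out_kp) <= 2.0
```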

3.5. MODEL TRAINING DETAILS

To ensure that the input to a lattice model lies in its domain, we follow the input calibration techniques explained in Gupta et al. (2016). For numeric inputs, we use a learned one-dimensional piecewise-linear calibration function per feature; calibrators can be efficiently trained to be monotonic. [Figure 3: KFL model architecture: calibrators c feed M KFL sub-models, whose outputs are scaled by γ_1, ..., γ_M, averaged, and shifted by β to produce f(x).] Our model architecture for KFL is depicted in Figure 3 and defined by the following equation:

f(x) = β + (1/M) ∑_{m∈[M]} γ_m f_kfl(c(x); w_1^{(m)}, ..., w_D^{(m)}),

where c is the input calibration layer described above (note that c has its own learned parameters, which we do not show here to simplify the notation) and γ_m, β ∈ R are additional scaling and bias parameters, respectively. To guarantee that the final model is monotonic in a given feature, it is sufficient to impose monotonicity on its calibrator and on each of the terms in the sum. We train our model by minimizing the empirical risk subject to the monotonicity constraints using projected mini-batch gradient descent. Following Proposition 2, after every gradient descent step, we project each w_d^{(m)} to be nonnegative where required. Then for each monotonic feature d we project w_d^{(m)} so that its components increase if σ = 1 and decrease if σ = −1. Computing this projection is an instance of the isotonic regression problem and can be solved efficiently using the "pool-adjacent-violators" algorithm (Barlow et al., 1972). In our experiments, we instead use a more efficient approximation algorithm that computes a vector that satisfies the constraints but may not be the closest (in L2-norm) to the original vector. See Appendix B for the exact algorithm we use.
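The projection step after each gradient update can be sketched as follows. Note that this is not the paper's exact Appendix B algorithm: as a simple stand-in we clip to enforce nonnegativity, then take a running maximum (or minimum) to enforce sorted entries. This yields a feasible point satisfying the conditions of Proposition 2, but not necessarily the closest one in L2-norm (the pool-adjacent-violators algorithm gives the exact isotonic projection):

```python
import numpy as np

def project_factor(w, monotonic, sigma=1.0):
    """Project a KFL factor onto the constraint set of Proposition 2.
    Clipping enforces nonnegativity; a running max/min enforces that
    sigma*w is nondecreasing along monotonic directions. Returns a
    feasible vector, not necessarily the L2-closest one."""
    w = np.maximum(w, 0.0)
    if monotonic:
        if sigma >= 0:
            w = np.maximum.accumulate(w)   # force nondecreasing entries
        else:
            w = np.minimum.accumulate(w)   # force nonincreasing entries
    return w

w = np.array([0.3, -0.1, 0.8, 0.5])
w_proj = project_factor(w, monotonic=True)
assert np.all(w_proj >= 0.0)           # condition 1: nonnegative factor
assert np.all(np.diff(w_proj) >= 0.0)  # condition 3: sorted entries
```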

4. EXPERIMENTS

We present experimental results on three datasets, summarized in Table 2. The first dataset is the public Adult Income dataset (Dheeru & Karra Taniskidou, 2017) with the same monotonicity constraint setup described in Canini et al. (2016). The other two datasets are provided by a large internet services company, with monotonic features specified by product groups as domain knowledge or policy requirements. We compare KFL against two previously proposed baseline lattice models: (1) a calibrated multilinear interpolation lattice, and (2) a calibrated simplex interpolation lattice, both of which use the input calibration layer described in Section 3.5 and Gupta et al. (2016). We note that we do not compare KFL to deep lattice networks (DLN) (You et al., 2017) in this paper: in a DLN, KFL would act as a replacement for the lattice layers, which comprise only a subset of the model's layers, so such a comparison would be dominated by the other layers and would not properly show how KFL differs from previously proposed lattice models. All our models were implemented using TensorFlow (Abadi et al., 2015) and TensorFlow Lattice (Google AI Blog, 2017). Open-source code for KFL has been pushed to the TensorFlow Lattice 2.0 library and can be downloaded at github.com/tensorflow/lattice. We used logistic loss for classification and mean squared error (MSE) loss for regression. For each experiment, we train for 100 epochs with a batch size of 256 using the Adam optimizer and validate the learning rate from {0.001, 0.01, 0.1, 1.0} with five-fold cross-validation. We use a lattice size with the same entry V in each direction and tune V from {2, 4, 8} and different settings of M from [1, 100] for KFL (we note that increasing V above 2 for our baselines was not feasible on our datasets). The results were averaged over 10 independent runs. The train and evaluation times were measured on a workstation with 6 Intel Xeon W-2135 CPUs.
We report the models with best cross-validation results, as well as some other settings for comparison.

4.1. UCI ADULT INCOME

For the Adult Income dataset (Dheeru & Karra Taniskidou, 2017), the goal is to predict whether income is greater than or equal to $50,000. We followed the same setup described in Canini et al. (2016), specifying capital gain, weekly hours of work, education level, and the gender wage gap as monotonically increasing features. The results are summarized in Table 3 and Figure 4.

4.2. QUERY RESULT MATCHING

This experiment is a regression problem of predicting the matching quality score of a result to a query. The results are summarized in Table 4 and Figure 5 . 

4.3. USER QUERY INTENT

For this real-world problem, the goal is to classify user intent into one of two classes. For the baseline multilinear and simplex models, a single lattice with all 24 features is infeasible. When the number of features is too large for a single lattice, we follow the technique described in Canini et al. (2016), which uses an ensemble of lattice sub-models, each taking a small subset of the calibrated input features; the output of the model is the average of the sub-models' outputs, and the calibrators are shared among all lattices. Here we use a random tiny lattice ensemble of L lattices, each seeing a random subset of 10 of the 24 features. We tune and set L = 100. The results are summarized in Table 5 and Figure 6.

5. DISCUSSION

For each of our experiments, we see very similar results: (1) the accuracy or MSE of KFL is comparable to or slightly better than the baselines; (2) the training time is significantly reduced; (3) the number of parameters is significantly reduced; (4) KFL can use larger lattice sizes V because the number of parameters scales linearly rather than as a power of D; (5) increasing M increases the capacity of the KFL model class, and the value of M that makes KFL expressive enough to perform well is relatively small (going above such an M no longer provides additional value); and (6) the evaluation speed of KFL compared to multilinear and simplex interpolation aligns with our theoretical runtime analysis: KFL is significantly faster than multilinear interpolation (particularly when the number of features is large, e.g., Experiment 4.3) and comparable to simplex interpolation. Furthermore, each experiment highlights different qualities of KFL. Experiment 4.1 shows that increasing V can sometimes be more important than increasing M, where such an increase can potentially lead to further boosts in performance with only a linear impact on the model's time and space efficiency; this observation is consistent with Proposition 6. In Figure 4, we see that KFL acts like a form of regularization: KFL's accuracy drops as M gets too high, indicating that we start to overfit and lose this regularization effect as the size of the model class increases. Experiment 4.2 shows that KFL can also perform well on regression tasks but will likely need larger M and V for more complex problems. It also shows a trade-off between the values of M and V where computation time, number of parameters, and MSE remain nearly fixed. Experiment 4.3 demonstrates KFL's ability to create higher-order feature interactions by keeping all features in a single lattice: unlike multilinear and simplex interpolated lattice functions, KFL can handle many more features in a single lattice.
It also demonstrates that KFL can achieve even faster evaluation than the ensembling method described by Canini et al. (2016). Experiments 4.2 and 4.3 also show a logarithmic-like improvement in performance with respect to M, suggesting that the M needed to achieve comparable accuracy could be impractically large for more complex problems; however, as Proposition 6 shows, there is a useful trade-off between M and V, where we can instead increase V to compensate and ultimately achieve comparable or better performance with a reasonable M.

6. CONCLUSION

We proposed KFL, which reparameterizes monotonic lattice regression using Kronecker factorization to achieve a significant improvement in parameter and computational efficiency. We proved sufficient and necessary conditions to efficiently impose multiple shape constraints during training on a model in KFL with respect to a subset of its features. We showed that through ensembling the KFL function class strictly increases as more base KFL models are added, becoming the class of all multilinear interpolated lattice functions when the number of base models is sufficiently large. Our experimental results demonstrated its practical advantage over the multilinear and simplex interpolated lattice models in terms of speed and storage space while maintaining accuracy. For future work, it would be interesting to (1) explore applications of KFL in more complex architectures such as deep lattice networks (You et al., 2017) and (2) understand whether we can impose other useful shape constraints (Cotter et al., 2019; Gupta et al., 2018).

SUPPLEMENTAL MATERIAL

A PROOFS

Proposition 2. Fix V ∈ N^D, let f ∈ KFL(V) and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is increasing in directions i_1, ..., i_p iff f can be written as f(x) = σ ∏_d f_pwl(x[d]; w_d), where σ ∈ {+1, −1} and w_d ∈ R^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:

1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], σ w_{i_j}[1] ≤ σ w_{i_j}[2] ≤ ... ≤ σ w_{i_j}[V[i_j]].

To prove the proposition we shall make use of the following lemma.

Lemma 1. Let f(x) = ∏_d f_pwl(x[d]; w_d) be a function in KFL(V). If for some i ∈ [D], f is increasing in direction i in D_V, then for all d ∈ [D] \ {i}, either w_d ≥ 0 or w_d ≤ 0.

Proof of Lemma 1. Let f be as above, and assume to the contrary that for some j ∈ [D] \ {i}, neither w_j ≥ 0 nor w_j ≤ 0 holds. Then there must exist s, t ∈ [V[j]] such that w_j[s] < 0 and w_j[t] > 0. For d ∈ [D], let ℓ_d(x) = f_pwl(x; w_d). Then ℓ_j(s−1) = w_j[s] < 0 and ℓ_j(t−1) = w_j[t] > 0. By our assumption there exist x ∈ D_V and δ > 0 such that f(x + δe_i) − f(x) > 0. Now, for any r ∈ [0, V[j]−1],

f(x + (r−x[j])e_j + δe_i) − f(x + (r−x[j])e_j)
= ℓ_i(x[i]+δ) ℓ_j(r) ∏_{d∈[D]\{i,j}} ℓ_d(x[d]) − ℓ_j(r) ∏_{d∈[D]\{j}} ℓ_d(x[d])
= [ℓ_i(x[i]+δ) ∏_{d∈[D]\{i}} ℓ_d(x[d]) − ∏_{d∈[D]} ℓ_d(x[d])] ℓ_j(r) / ℓ_j(x[j])
= (f(x + δe_i) − f(x)) ℓ_j(r) / ℓ_j(x[j]).

Since for some r ∈ {s−1, t−1} it holds that ℓ_j(r)/ℓ_j(x[j]) < 0, it follows from the last equation that for that r, f(x + (r−x[j])e_j + δe_i) < f(x + (r−x[j])e_j), which contradicts our assumption that f is increasing in direction i.

Proof of Proposition 2. We first show the "if" direction. Assume that f can be written as above and let j ∈ [p]. Then f is the product of two terms: σ f_pwl(x[i_j]; w_{i_j}) and ∏_{d≠i_j} f_pwl(x[d]; w_d). Condition 3 implies the first term is increasing in direction i_j. The second term does not depend on x[i_j], and conditions 1 and 2 imply that it is nonnegative over D_V. It follows that f, the product of the two terms, is increasing in direction i_j. We next show the "only if" direction. To simplify the notation, we only prove this for the case p ≥ 2; a similar argument shows the proposition holds for p = 1 as well. Let f ∈ KFL(V) be increasing in directions i_1, ..., i_p in D_V. By Proposition 1 there exist D vectors w_d ∈ R^{V[d]}, d ∈ [D], such that f(x) = ∏_d f_pwl(x[d]; w_d).
Let w̄_d be -w_d if w_d ≤ 0, and w_d otherwise. By Lemma 1 applied to directions i_1 and i_2, it follows that for every d ∈ [D], either w_d ≥ 0 or w_d ≤ 0, and hence w̄_d ≥ 0. Since for any scalar α ∈ R, f_pwl(x[d]; αw_d) = αf_pwl(x[d]; w_d), it follows that f(x) = σ ∏_d f_pwl(x[d]; w̄_d), where σ = (-1)^{|{d : w_d ≤ 0}|}. Therefore condition 1 holds. As for condition 3, let j ∈ [p] and assume to the contrary that σw̄_{i_j}[s] > σw̄_{i_j}[s+1] for some s ∈ [V[i_j]-1]. Since f is not identically zero in D_V (as we do not consider the zero function increasing in direction i_j), for each d ∈ [D] there must be x_d ∈ [0, V[d]-1] such that f_pwl(x_d; w̄_d) ≠ 0. Since w̄_d ≥ 0, it follows that f_pwl(x_d; w̄_d) > 0. Let x ∈ R^D be the vector with x[d] = x_d for d ≠ i_j and x[i_j] = s-1. Then we have

  f(x) = σf_pwl(s-1; w̄_{i_j}) ∏_{d≠i_j} f_pwl(x_d; w̄_d)
       = σw̄_{i_j}[s] ∏_{d≠i_j} f_pwl(x_d; w̄_d)
       > σw̄_{i_j}[s+1] ∏_{d≠i_j} f_pwl(x_d; w̄_d)
       = f(x + e_{i_j}),

contradicting our assumption that f is increasing in direction i_j. Therefore condition 3 holds as well.

Proposition 3. Fix V ∈ N^D, let f ∈ KFL(V), and let i_1, ..., i_p ∈ [D] be distinct direction indices for p ≤ D. Then f is convex in directions i_1, ..., i_p iff f can be written as f(x) = σ ∏_d f_pwl(x[d]; w_d), where σ ∈ {+1, -1} and w_d ∈ R^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:

1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.
3. For all j ∈ [p], either V[i_j] = 2, or V[i_j] > 2 and σ(w_{i_j}[k+2] - w_{i_j}[k+1]) ≥ σ(w_{i_j}[k+1] - w_{i_j}[k]) for all k ∈ [V[i_j]-2].

To prove the proposition we shall make use of the following lemma.

Lemma 2. Let f(x) = ∏_d f_pwl(x[d]; w_d) be a function in KFL(V). If for some i ∈ [D], f is convex in direction i in D_V, then for all d ∈ [D] \ {i}, either w_d ≥ 0 or w_d ≤ 0.

Proof of Lemma 2. Let f be as above, and assume to the contrary that for some j ∈ [D] \ {i}, neither w_j ≥ 0 nor w_j ≤ 0 holds.
Then there must exist s, q ∈ [V[j]] such that w_j[s] < 0 and w_j[q] > 0. For d ∈ [D], let ℓ_d(x[d]) = f_pwl(x[d]; w_d). Then ℓ_j(s-1) = w_j[s] < 0 and ℓ_j(q-1) = w_j[q] > 0. By our assumption there exist x ∈ D_V, δ > 0, and t ∈ (0, 1) such that tf(x) + (1-t)f(x + δe_i) - f(tx + (1-t)(x + δe_i)) > 0.² To further simplify notation, we set z_{≠i,j} = ∏_{d≠i,j} ℓ_d(x[d]) and z_{≠i} = ∏_{d≠i} ℓ_d(x[d]). Now, for any r ∈ [0, V[j]-1], we have:

  tf(x + (r - x[j])e_j) + (1-t)f(x + (r - x[j])e_j + δe_i) - f(t(x + (r - x[j])e_j) + (1-t)(x + (r - x[j])e_j + δe_i))
    = tℓ_i(x[i])ℓ_j(r)z_{≠i,j} + (1-t)ℓ_i(x[i] + δ)ℓ_j(r)z_{≠i,j} - ℓ_i(tx[i] + (1-t)(x[i] + δ))ℓ_j(r)z_{≠i,j}
    = ( tℓ_i(x[i])z_{≠i} + (1-t)ℓ_i(x[i] + δ)z_{≠i} - ℓ_i(tx[i] + (1-t)(x[i] + δ))z_{≠i} ) · ℓ_j(r) / ℓ_j(x[j])
    = ( tf(x) + (1-t)f(x + δe_i) - f(tx + (1-t)(x + δe_i)) ) · ℓ_j(r) / ℓ_j(x[j]).

Since for some r ∈ {s-1, q-1} it holds that ℓ_j(r)/ℓ_j(x[j]) < 0, it follows from the last equation that for that r,

  f(t(x + (r - x[j])e_j) + (1-t)(x + (r - x[j])e_j + δe_i)) > tf(x + (r - x[j])e_j) + (1-t)f(x + (r - x[j])e_j + δe_i),

which contradicts our assumption that f is convex in direction i (at the point x + (r - x[j])e_j).

Proof of Proposition 3. We first show the "if" direction. Assume that f can be written as above and let j ∈ [p]. Then f is the product of two terms: σf_pwl(x[i_j]; w_{i_j}) and ∏_{d≠i_j} f_pwl(x[d]; w_d). Condition 3 implies the first term is convex in direction i_j. The second term does not depend on x[i_j], and conditions 1 and 2 imply that it is nonnegative over D_V. It follows that f, the product of the two terms, is convex in direction i_j.

We next show the "only if" direction. To simplify the notation, we only prove this for the case p ≥ 2; a similar argument shows the proposition holds for p = 1 as well. Let f ∈ KFL(V) be convex in directions i_1, ..., i_p in D_V.
By Proposition 1 there exist D vectors, w_d ∈ R^{V[d]},

  σz f_pwl(tx_1[i_j] + (1-t)x_2[i_j]; w̄_{i_j}) ≤ σzt f_pwl(x_1[i_j]; w̄_{i_j}) + σz(1-t) f_pwl(x_2[i_j]; w̄_{i_j})
  σ f_pwl(0.5(k-1) + 0.5(k+1); w̄_{i_j}) ≤ 0.5( σ f_pwl(k-1; w̄_{i_j}) + σ f_pwl(k+1; w̄_{i_j}) )
  2σw̄_{i_j}[k+1] ≤ σw̄_{i_j}[k] + σw̄_{i_j}[k+2]
  σ(w̄_{i_j}[k+1] - w̄_{i_j}[k]) ≤ σ(w̄_{i_j}[k+2] - w̄_{i_j}[k+1]),

which contradicts our assumption.

Proof of Proposition 4 (nonnegative case). We first show the "if" direction. Assume that f can be written as above. Then for each d ∈ [D], since w_d ≥ 0, it follows that f_pwl(x[d]; w_d) ≥ 0. Therefore f, the product of D nonnegative piecewise-linear functions, is nonnegative.

We next show the "only if" direction. By Proposition 1 there exist D vectors, w_d ∈ R^{V[d]}, d ∈ [D], such that f(x) = ∏_d f_pwl(x[d]; w_d). If for some i ∈ [D] neither w_i ≥ 0 nor w_i ≤ 0 holds, then there must exist s, t ∈ [0, V[i]-1] such that f_pwl(s; w_i) < 0 and f_pwl(t; w_i) > 0.



¹ Otherwise f(x + δe_i) = f(x) for all x, δ, implying that f(x) does not depend on x[i], which we do not regard as increasing in direction i.

² Otherwise tf(x) + (1-t)f(x + δe_i) - f(tx + (1-t)(x + δe_i)) = 0 for all x, δ, t, implying that f(x) is affine in direction i, which we do not regard as convex in direction i.

³ If for some j ∈ [D] \ {i}, w_j = 0, then f is still nonnegative, but for this edge case f can be trivially written as f(x) = ∏_d f_pwl(x[d]; w̄_d) where w̄_d = 0.



Figure 1: f(x) = f_pwl(x; [0, 1, 0.5]).

Figure 2: A 2×2 multilinear interpolated lattice with four parameters defining a monotonically increasing function in both dimensions.
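As a concrete illustration of Figure 1, the one-dimensional building block f_pwl linearly interpolates the keypoint values w placed at the integer grid points 0, ..., V-1. A minimal numpy sketch (the name f_pwl mirrors the paper's notation; the implementation details are our own):

```python
import numpy as np

def f_pwl(x, w):
    # Linear interpolation of the keypoints (k, w[k]) for k = 0..len(w)-1;
    # a sketch of the paper's 1-D piecewise-linear building block.
    return np.interp(x, np.arange(len(w)), w)

# Figure 1's example: f(x) = f_pwl(x; [0, 1, 0.5])
w = np.array([0.0, 1.0, 0.5])
print(f_pwl(0.0, w))  # 0.0 (keypoint value)
print(f_pwl(0.5, w))  # 0.5 (halfway up the first segment)
print(f_pwl(1.5, w))  # 0.75 (halfway down the second segment)
```

Note that f_pwl(k; w) = w[k] at the grid points, which is the identity ℓ_j(s-1) = w_j[s] used throughout the proofs (with 1-indexed w).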

and use a "calibration" transformation c : R^D → D_V to map the "raw" input x into the lattice domain D_V. Here, c(x) = (c_1(x[1]), ..., c_D(x[D])), and each c_d : R → [0, V[d]-1] is a learned piecewise-linear function with a fixed, prespecified number of linear segments. See Gupta et al. (2016) for more details. Let C(V) denote the set of all calibration transformations whose image is in D_V. For a positive integer M, we define the calibrated Kronecker-Factored Lattice model class CKFL by: CKFL
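A per-feature calibrator of this form can be sketched as a piecewise-linear map whose output range is the lattice axis [0, V[d]-1]. The knot locations and values below are illustrative stand-ins for the learned parameters, not the paper's:

```python
import numpy as np

def calibrate(x, in_knots, out_vals):
    # One calibrator c_d: piecewise-linear map from the raw feature range
    # into the lattice domain [0, V[d]-1].  np.interp clamps inputs outside
    # the knot range, keeping the image inside the lattice domain.
    return np.interp(x, in_knots, out_vals)

# A hypothetical 3-segment calibrator into a lattice axis with V[d] = 3,
# i.e. output domain [0, 2]:
in_knots = np.array([-1.0, 0.0, 1.0, 2.0])
out_vals = np.array([0.0, 0.5, 1.5, 2.0])
print(calibrate(-1.0, in_knots, out_vals))  # 0.0
print(calibrate(0.5, in_knots, out_vals))   # 1.0
print(calibrate(5.0, in_knots, out_vals))   # 2.0 (clamped at the right edge)
```

In training, out_vals would be learned (and constrained monotone when the feature carries a monotonicity hint), while in_knots are typically fixed, e.g. at feature quantiles.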

Figure 3: KFL Model Architecture

Figure 4: UCI Adult Income

Figure 5: Query Result Matching

Figure 6: User Query Intent

Figure 7: Comparison of test results of KFL with different M and V values

where σ ∈ {+1, -1}, and w_d ∈ R^{V[d]}, d ∈ [D], are vectors satisfying the following 3 conditions:

1. If p ≥ 2 then w_d ≥ 0 for all d ∈ [D].
2. If p = 1 then w_d ≥ 0 for all d ∈ [D] \ {i_1}.

, d ∈ [D], such that f(x) = ∏_d f_pwl(x[d]; w_d). Let w̄_d be -w_d if w_d ≤ 0, and w_d otherwise. By Lemma 2 applied to directions i_1 and i_2, it follows that for every d ∈ [D], w̄_d ≥ 0. Since for any scalar α ∈ R, f_pwl(x[d]; αw_d) = αf_pwl(x[d]; w_d), it follows that f(x) = σ ∏_d f_pwl(x[d]; w̄_d), where σ = (-1)^{|{d : w_d ≤ 0}|}. Therefore condition 1 holds. As for condition 3, let j ∈ [p] and assume to the contrary that V[i_j] > 2 but σ(w̄_{i_j}[k+2] - w̄_{i_j}[k+1]) < σ(w̄_{i_j}[k+1] - w̄_{i_j}[k]) for some k ∈ [V[i_j]-2]. Since f is not identically zero in D_V (as we do not consider the zero function convex in direction i_j), for each d ∈ [D] there must be x_d ∈ [0, V[d]-1] such that f_pwl(x_d; w̄_d) ≠ 0. Since w̄_d ≥ 0, it follows that f_pwl(x_d; w̄_d) > 0. Now let t = 0.5 and consider ordered inputs x_1, x_2 ∈ R^D such that x_1[d] = x_2[d] = x_d for all d ∈ [D] \ {i_j}, x_1[i_j] = k-1, and x_2[i_j] = k+1. We first decompose f(x) = σ f_pwl(x[i_j]; w̄_{i_j}) ∏_{d≠i_j} f_pwl(x[d]; w̄_d). Next, by construction, we note that ∏_{d≠i_j} f_pwl(x_1[d]; w̄_d) = ∏_{d≠i_j} f_pwl(x_2[d]; w̄_d) = ∏_{d≠i_j} f_pwl(x_3[d]; w̄_d) = z > 0, where x_3 = tx_1 + (1-t)x_2, and since f is convex in direction i_j, we have:

Proposition 4. Fix V ∈ N^D and let f ∈ KFL(V). Then f is nonnegative (resp. strictly positive) iff f can be written as f(x) = ∏_d f_pwl(x[d]; w_d), where w_d ≥ 0 (resp. w_d > 0) for all d ∈ [D].
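The "if" direction of this nonnegativity characterization is easy to check numerically. The sketch below (weights illustrative; f_pwl and f_kfl are our own helper names for the paper's notation) verifies nonnegativity on a grid when every w_d ≥ 0, and shows that a single sign change can break it:

```python
import numpy as np

def f_pwl(x, w):
    # Linear interpolation of keypoints (k, w[k]), k = 0..len(w)-1.
    return np.interp(x, np.arange(len(w)), w)

def f_kfl(x, ws):
    # KFL function: product over dimensions of 1-D piecewise-linear factors.
    return np.prod([f_pwl(xd, wd) for xd, wd in zip(x, ws)])

# With every w_d >= 0, f is nonnegative everywhere on the lattice domain.
ws = [np.array([0.0, 2.0, 1.0]), np.array([0.5, 0.0, 3.0])]
grid = [(a, b) for a in np.linspace(0, 2, 9) for b in np.linspace(0, 2, 9)]
assert all(f_kfl(x, ws) >= 0 for x in grid)

# A sign change in one factor can make f negative somewhere:
ws_bad = [np.array([-1.0, 2.0, 1.0]), np.array([0.5, 0.0, 3.0])]
print(min(f_kfl(x, ws_bad) for x in grid))  # strictly negative
```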

Experiment Dataset Summary

UCI Adult Income Results

Query Result Matching Results

User Query Intent Results

APPENDIX

It follows that there exists x ∈ D_V such that either f_pwl(s; w_i) ∏_{d≠i} f_pwl(x[d]; w_d) < 0 or f_pwl(t; w_i) ∏_{d≠i} f_pwl(x[d]; w_d) < 0, thus making f not nonnegative.³ Hence, for all d ∈ [D], either w_d ≥ 0 or w_d ≤ 0. Now, let w̄_d be -w_d if w_d ≤ 0, and w_d otherwise. Since for any scalar α ∈ R, f_pwl(x[d]; αw_d) = αf_pwl(x[d]; w_d), it follows that f(x) = σ ∏_d f_pwl(x[d]; w̄_d), where σ = (-1)^{|{d : w_d ≤ 0}|}. Since ∏_d f_pwl(x[d]; w̄_d) ≥ 0, for f to be nonnegative, |{d : w_d ≤ 0}| must be even. This implies σ = 1. Then a nonnegative function f is written as f

Proof of Proposition 4 (strictly positive case). We first show the "if" direction. Assume that f can be written as above. Then for each d ∈ Therefore, f, the product of D strictly positive piecewise-linear functions, is strictly positive. We next show the "only if" direction. By Proposition 1 there exist D vectors,

Regard θ as a tensor of size V and set r = r(θ). Thus, there exist rD vectors w and by the definition of M in the proposition it holds that r ≤ M. Since the zero function is in KFL(V), it follows that f_mll(

Since the zero function is in KFL(V), KFL(V, m) ⊆ KFL(V, m+1). Assume to the contrary that KFL(V, m+1) = KFL(V, m), and let θ be a tensor of size V and rank M. Then for some MD vectors w ), thus by our assumption it is also in KFL(V, m). Therefore there exist mD vectors u j. Therefore we may replace the sum of the first m+1 elements in (11) with the sum of the m simple tensors Σ_{i∈[m]} ⊗_{j=1}^D u_j^(i), obtaining a sum of M-1 simple tensors that yields θ, contradicting the assumption that r(θ) = M. D). For each such m and d, define w
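The tensor-rank argument above manipulates lattice parameter tensors that are sums of simple (rank-1, Kronecker) tensors, θ = Σ_m w_1^(m) ⊗ ... ⊗ w_D^(m). A small numpy sketch with illustrative sizes, checking that at a lattice vertex the tensor entry equals the corresponding sum of products of keypoint values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, M = (3, 4, 2), 2  # lattice sizes and ensemble size, both illustrative

# M sets of D vectors, one set per base KFL model in the ensemble.
ws = [[rng.random(v) for v in V] for _ in range(M)]

# theta = sum over m of the simple tensor w_1^(m) ⊗ w_2^(m) ⊗ w_3^(m).
theta = sum(np.einsum('i,j,k->ijk', *w) for w in ws)
assert theta.shape == V

# At any lattice vertex, multilinear interpolation of theta reproduces the
# ensemble's sum of products of the corresponding keypoint values.
vertex = (1, 2, 0)
direct = sum(np.prod([w[d][vertex[d]] for d in range(len(V))]) for w in ws)
assert np.isclose(theta[vertex], direct)
```

This is exactly the storage saving claimed in the abstract: the M·Σ_d V[d] vector entries replace the ∏_d V[d] entries of a full lattice tensor.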

B APPROXIMATION ALGORITHM FOR MONOTONICITY PROJECTION

We use Algorithm 1 to map a vector w ∈ R^D to a close (in L2 norm) vector w̄ that satisfies w̄[1] ≤ ... ≤ w̄[D]. The vector w̄ may not be the closest such vector to w, but in our experiments the algorithm worked well enough as the projection step in the projected SGD optimization and is faster than the pool-adjacent-violators algorithm. We use an analogous algorithm to map w to a close vector with monotonically decreasing components. In the description of the algorithm, we denote by |v| the L2 norm of a real vector v.
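For reference, the exact L2 projection onto the monotone cone that Algorithm 1 approximates is computed by the pool-adjacent-violators algorithm. A minimal sketch of that exact projection (our own implementation, not the paper's faster approximate Algorithm 1):

```python
import numpy as np

def pav_projection(w):
    # Exact L2 projection of w onto {v : v[0] <= ... <= v[-1]} via
    # pool-adjacent-violators: scan left to right, and whenever a new entry
    # violates monotonicity, merge it into the previous block; each block
    # stores (sum, count) and represents the mean of the entries it pooled.
    blocks = []
    for val in w:
        blocks.append([val, 1])
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return np.array(out)

w = np.array([1.0, 3.0, 2.0, 0.0])
print(pav_projection(w))  # [1, 5/3, 5/3, 5/3]: the violators 3, 2, 0 are pooled
```

The decreasing-direction variant described in the text corresponds to projecting -w (or the reversed vector) and mapping back.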

