SCALABLE LEARNING AND MAP INFERENCE FOR NONSYMMETRIC DETERMINANTAL POINT PROCESSES

Abstract

Determinantal point processes (DPPs) have attracted significant attention in machine learning for their ability to model subsets drawn from a large item collection. Recent work shows that nonsymmetric DPP (NDPP) kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, for an item collection of size M, existing NDPP learning and inference algorithms require memory quadratic in M and runtime cubic (for learning) or quadratic (for inference) in M, making them impractical for many typical subset selection tasks. In this work, we develop a learning algorithm with space and time requirements linear in M by introducing a new NDPP kernel decomposition. We also derive a linear-complexity NDPP maximum a posteriori (MAP) inference algorithm that applies not only to our new kernel but also to that of prior work. Through evaluation on real-world datasets, we show that our algorithms scale significantly better, and can match the predictive performance of prior work.

1. INTRODUCTION

Determinantal point processes (DPPs) have proven useful for numerous machine learning tasks. For example, recent uses include summarization (Sharghi et al., 2018), recommender systems (Wilhelm et al., 2018), neural network compression (Mariet & Sra, 2016), kernel approximation (Li et al., 2016), multi-modal output generation (Elfeki et al., 2019), and batch selection, both for stochastic optimization (Zhang et al., 2017) and for active learning (Bıyık et al., 2019). For subset selection problems where the ground set of items to select from has cardinality M, the typical DPP is parameterized by an M × M kernel matrix. Most prior work has been concerned with symmetric DPPs, where the kernel must equal its transpose. However, recent work has considered the more general class of nonsymmetric DPPs (NDPPs) and shown that these have additional useful modeling power (Brunel, 2018; Gartrell et al., 2019). In particular, unlike symmetric DPPs, which can only model negative correlations between items, NDPPs allow modeling of positive correlations, where the presence of item i in the selected set increases the probability that some other item j will also be selected. There are many intuitive examples of how positive correlations can be of practical importance. For example, consider a product recommendation task for a retail website, where a camera is found in a user's shopping cart, and the goal is to display several other items that might be purchased. Relative to an empty cart, the presence of the camera probably increases the probability of buying an accessory like a tripod. Although NDPPs can theoretically model such behavior, the existing approach for NDPP learning and inference (Gartrell et al., 2019) is often impractical in terms of both storage and runtime requirements.
These algorithms require memory quadratic in M and time quadratic (for inference) or cubic (for learning) in M; for the not-unusual M of 1 million, this requires storing 8TB-size objects in memory, with runtime millions or billions of times slower than that of a linear-complexity method. Our contributions address both problems:

Learning: We propose a new decomposition of the NDPP kernel which reduces the storage and runtime requirements of learning and inference to linear in M. Fortuitously, the modified decomposition retains all of the previous decomposition's modeling power, as it covers the same part of the NDPP kernel space. The algebraic manipulations we apply to get linear complexity for this decomposition cannot be applied to prior work, meaning that our new decomposition is crucial for scalability.

Inference: After learning, prior NDPP work applies a DPP conditioning algorithm to do subset expansion (Gartrell et al., 2019), with quadratic runtime in M. However, prior work does not examine the general problem of MAP inference for NDPPs, i.e., the problem of finding the highest-probability subset under a DPP. For symmetric DPPs, there exists a standard greedy MAP inference algorithm that is linear in M. In this work, we develop a version of this algorithm that is also linear for low-rank NDPPs. The low-rank requirement is unique to NDPPs, and highlights the fact that the transformation of the algorithm from the symmetric to the nonsymmetric space is non-trivial. To the best of our knowledge, this is the first MAP algorithm proposed for NDPPs.

We combine the above contributions through experiments that involve learning NDPP kernels and applying MAP inference to these kernels to do subset selection for several real-world datasets. These experiments demonstrate that our algorithms are much more scalable, and that the new kernel decomposition matches the predictive performance of the decomposition from prior work.
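For concreteness, the standard greedy MAP heuristic mentioned above can be sketched as follows. This is a naive illustration that recomputes a determinant per candidate item, not the linear-complexity low-rank algorithm developed in this paper; the function name `greedy_map` is ours.

```python
import numpy as np

def greedy_map(L, k):
    """Naive greedy MAP for a DPP: repeatedly add the item that most
    increases det(L_Y). Illustrative only; far slower than the
    linear-complexity low-rank version."""
    M = L.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_det = None, -np.inf
        for i in range(M):
            if i in selected:
                continue
            Y = selected + [i]
            d = np.linalg.det(L[np.ix_(Y, Y)])
            if d > best_det:
                best_det, best_item = d, i
        if best_item is None:
            break
        selected.append(best_item)
    return selected

# Diagonal kernel with item "qualities" 1, 3, 2: greedy picks items
# in decreasing quality order.
L = np.diag([1.0, 3.0, 2.0])
print(greedy_map(L, 2))  # [1, 2]
```

Each greedy step above costs a determinant per remaining item; the point of the low-rank treatment in this paper is to avoid exactly that cost.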
For intuition about the parameters of the DPP kernel L (defined formally in Section 2), notice that the probabilities of singletons {i} and {j} are proportional to L_ii and L_jj, respectively. Hence, it is common to think of L's diagonal as representing item qualities. The probability of a pair {i, j} is proportional to det(L_{i,j}) = L_ii L_jj − L_ij L_ji. Thus, if −L_ij L_ji < 0, this indicates that i and j interact negatively; similarly, if −L_ij L_ji > 0, then i and j interact positively. Therefore, the off-diagonal terms determine item interactions. (The vague term "interactions" can be replaced by the more precise term "correlations" if we consider the DPP's marginal kernel instead; see Gartrell et al. (2019, Section 2.1) for an extensive discussion.)
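The sign test above can be checked numerically. The 2×2 kernel below is a hypothetical example of ours: opposite-sign off-diagonal entries make −L_ij L_ji positive, so the pair is more probable than the product of qualities alone would suggest.

```python
import numpy as np

# Hypothetical 2x2 nonsymmetric kernel: off-diagonal entries with
# opposite signs give -L_ij * L_ji > 0, a positive interaction.
L = np.array([[ 1.0, 0.5],
              [-0.5, 1.0]])

interaction = -L[0, 1] * L[1, 0]     # 0.25 > 0
pair_det = np.linalg.det(L)          # L_11*L_22 - L_12*L_21 = 1.25
print(interaction, pair_det)         # 0.25 1.25
```

Note that a symmetric kernel could never produce this: with L_ij = L_ji the interaction term is always −L_ij² ≤ 0.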

2. BACKGROUND

In order to ensure that P_L defines a probability distribution, all principal minors of L must be non-negative: det(L_Y) ≥ 0 for all Y. Matrices that satisfy this property are called P_0-matrices (Fang, 1989, Definition 1). There is no known generative method or matrix decomposition that fully covers the space of all P_0 matrices, although there are many that partially cover the space (Tsatsomeros, 2004). One common partial solution is to use a decomposition that covers the space of symmetric P_0 matrices. By restricting to the space of symmetric matrices, one can exploit the fact that L ∈ P_0 if L is positive semidefinite (PSD) (Prussing, 1986). Any symmetric PSD matrix can be written as the Gramian matrix of some set of vectors: L = VV^T, where V ∈ R^{M×K}. Hence, the VV^T decomposition provides an easy means of generating the entire space of symmetric P_0 matrices. It also has a nice intuitive interpretation: we can view the i-th row of V as a length-K feature vector describing item i.

Unfortunately, the symmetry requirement limits the types of correlations that a DPP can capture. A symmetric model is able to capture only nonpositive interactions between items, since L_ij L_ji = L_ij² ≥ 0, whereas a nonsymmetric L can also capture positive correlations. (Again, see Gartrell et al. (2019, Section 2.1) for more intuition.) To expand coverage to nonsymmetric matrices in P_0, it is natural to consider nonsymmetric PSD matrices, i.e., matrices satisfying x^T L x ≥ 0 for all x without any symmetry requirement. In what follows, we denote by P_0^+ the set of all nonsymmetric (and symmetric) PSD matrices. Any nonsymmetric PSD matrix is in P_0 (Gartrell et al., 2019, Lemma 1), so P_0^+ ⊆ P_0. However, unlike in the symmetric case, the set of nonsymmetric PSD matrices does not cover the entire space of nonsymmetric P_0 matrices; the inclusion is strict.
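A standard way to generate a nonsymmetric PSD matrix, sketched below, is to add a skew-symmetric part to a symmetric Gramian: the skew part contributes nothing to the quadratic form x^T L x, so the symmetric PSD part guarantees PSD-ness, and Lemma 1 then places L in P_0. This construction is illustrative and is not the specific kernel decomposition proposed in this paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
M, K = 4, 2

V = rng.standard_normal((M, K))
B = rng.standard_normal((M, M))
C = B - B.T              # skew-symmetric: C^T = -C, so x^T C x = 0
L = V @ V.T + C          # nonsymmetric PSD: x^T L x = x^T V V^T x >= 0

# Check membership in P_0: every principal minor is non-negative
# (up to numerical tolerance).
for r in range(1, M + 1):
    for Y in combinations(range(M), r):
        assert np.linalg.det(L[np.ix_(Y, Y)]) >= -1e-9
print("all principal minors non-negative")
```

Note that L here is genuinely nonsymmetric (L ≠ L^T) whenever C ≠ 0, yet every principal minor is non-negative, as the P_0 property requires.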



Consider a finite set Y = {1, 2, ..., M} of cardinality M, which we will also denote by [[M]]. A DPP on [[M]] defines a probability distribution over all of its 2^M subsets. It is parameterized by a matrix L ∈ R^{M×M}, called the kernel, such that the probability of each subset Y ⊆ [[M]] is proportional to the determinant of its corresponding principal submatrix: Pr(Y) ∝ det(L_Y). The normalization constant for this distribution can be expressed as a single M × M determinant: Σ_{Y ⊆ [[M]]} det(L_Y) = det(L + I) (Kulesza et al., 2012, Theorem 2.1). Hence, Pr(Y) = det(L_Y) / det(L + I). We will use P_L to denote this distribution.
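The normalization identity can be checked by brute force on a tiny ground set; the kernel values below are hypothetical.

```python
import numpy as np
from itertools import combinations

# Verify: sum over all subsets Y of det(L_Y) equals det(L + I).
M = 3
L = np.array([[ 2.0,  0.4, 0.0],
              [-0.4,  1.5, 0.3],
              [ 0.0, -0.3, 1.0]])   # a small nonsymmetric kernel

total = 0.0
for r in range(M + 1):
    for Y in combinations(range(M), r):
        # det of the empty (0x0) submatrix is 1, covering Y = {}.
        total += np.linalg.det(L[np.ix_(Y, Y)])

print(np.isclose(total, np.linalg.det(L + np.eye(M))))  # True
```

The identity holds for any square L, symmetric or not, which is why the same det(L + I) normalizer appears in both the symmetric and nonsymmetric settings.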

