SCALABLE LEARNING AND MAP INFERENCE FOR NONSYMMETRIC DETERMINANTAL POINT PROCESSES

Abstract

Determinantal point processes (DPPs) have attracted significant attention in machine learning for their ability to model subsets drawn from a large item collection. Recent work shows that nonsymmetric DPP (NDPP) kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, for an item collection of size M, existing NDPP learning and inference algorithms require memory quadratic in M and runtime cubic (for learning) or quadratic (for inference) in M, making them impractical for many typical subset selection tasks. In this work, we develop a learning algorithm with space and time requirements linear in M by introducing a new NDPP kernel decomposition. We also derive a linear-complexity NDPP maximum a posteriori (MAP) inference algorithm that applies not only to our new kernel but also to that of prior work. Through evaluation on real-world datasets, we show that our algorithms scale significantly better, and can match the predictive performance of prior work.

Learning: We propose a new decomposition of the NDPP kernel which reduces the storage and runtime requirements of learning and inference to linear in M. Fortuitously, the modified decomposition retains all of the previous decomposition's modeling power, as it covers the same part of the NDPP kernel space. The algebraic manipulations we apply to get linear complexity for this decomposition cannot be applied to prior work, meaning that our new decomposition is crucial for scalability.

Inference: After learning, prior NDPP work applies a DPP conditioning algorithm to do subset expansion (Gartrell et al., 2019), with quadratic runtime in M. However, prior work does not examine the general problem of MAP inference for NDPPs, i.e., solving the problem of finding the highest-probability subset under a DPP. For symmetric DPPs, there exists a standard greedy MAP inference algorithm that is linear in M.
In this work, we develop a version of this algorithm that is also linear for low-rank NDPPs. The low-rank requirement is unique to NDPPs, and highlights the fact that the transformation of the algorithm from the symmetric to the nonsymmetric setting is non-trivial. To the best of our knowledge, this is the first MAP algorithm proposed for NDPPs.

We combine the above contributions through experiments that involve learning NDPP kernels and applying MAP inference to these kernels to do subset selection for several real-world datasets. These experiments demonstrate that our algorithms are much more scalable, and that the new kernel decomposition matches the predictive performance of the decomposition from prior work.

Consider a finite set Y = {1, 2, ..., M} of cardinality M, which we will also denote by [[M]]. A DPP on [[M]] defines a probability distribution over all of its 2^M subsets. It is parameterized by a matrix L ∈ R^{M×M}, called the kernel, such that the probability of each subset Y ⊆ [[M]] is proportional to the determinant of its corresponding principal submatrix: Pr(Y) ∝ det(L_Y). The normalization constant for this distribution can be expressed as a single M × M determinant: Σ_{Y⊆[[M]]} det(L_Y) = det(L + I) (Kulesza et al., 2012, Theorem 2.1). Hence, Pr(Y) = det(L_Y)/det(L + I). We will use P_L to denote this distribution.

* Recall that a matrix L ∈ R^{M×M} is defined to be PSD if and only if xᵀLx ≥ 0 for all x ∈ R^M.
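As a quick numerical sanity check of these definitions (an illustrative sketch, not taken from the paper; the kernel and dimensions below are arbitrary), the 2^M subset probabilities Pr(Y) = det(L_Y)/det(L + I) do sum to one:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, K = 6, 3
V = rng.normal(size=(M, K))
L = V @ V.T  # a symmetric PSD kernel, for simplicity

def dpp_prob(L, Y):
    """Probability of subset Y under the DPP with kernel L."""
    LY = L[np.ix_(Y, Y)]  # principal submatrix; det of a 0x0 matrix is 1
    return np.linalg.det(LY) / np.linalg.det(L + np.eye(len(L)))

# Enumerate all 2^M subsets; they sum to 1 because
# sum_Y det(L_Y) = det(L + I).
total = sum(
    dpp_prob(L, list(Y))
    for r in range(M + 1)
    for Y in itertools.combinations(range(M), r)
)
assert abs(total - 1.0) < 1e-8
```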

1. INTRODUCTION

Determinantal point processes (DPPs) have proven useful for numerous machine learning tasks. For example, recent uses include summarization (Sharghi et al., 2018), recommender systems (Wilhelm et al., 2018), neural network compression (Mariet & Sra, 2016), kernel approximation (Li et al., 2016), multi-modal output generation (Elfeki et al., 2019), and batch selection, both for stochastic optimization (Zhang et al., 2017) and for active learning (Bıyık et al., 2019). For subset selection problems where the ground set of items to select from has cardinality M, the typical DPP is parameterized by an M × M kernel matrix.

Most prior work has been concerned with symmetric DPPs, where the kernel must equal its transpose. However, recent work has considered the more general class of nonsymmetric DPPs (NDPPs) and shown that these have additional useful modeling power (Brunel, 2018; Gartrell et al., 2019). In particular, unlike symmetric DPPs, which can only model negative correlations between items, NDPPs allow modeling of positive correlations, where the presence of item i in the selected set increases the probability that some other item j will also be selected.

There are many intuitive examples of how positive correlations can be of practical importance. For example, consider a product recommendation task for a retail website, where a camera is found in a user's shopping cart, and the goal is to display several other items that might be purchased. Relative to an empty cart, the presence of the camera probably increases the probability of buying an accessory like a tripod. Although NDPPs can theoretically model such behavior, the existing approach for NDPP learning and inference (Gartrell et al., 2019) is often impractical in terms of both storage and runtime requirements.
These algorithms require memory quadratic in M and time quadratic (for inference) or cubic (for learning) in M; for the not-unusual M of 1 million, this requires storing 8TB-size objects in memory, with runtime millions or billions of times slower than that of a linear-complexity method. In this work, we make the contributions summarized above: a new kernel decomposition that yields learning linear in M, and a linear-complexity MAP inference algorithm for NDPPs.

For intuition about the kernel parameters, notice that the probabilities of the singletons {i} and {j} are proportional to L_ii and L_jj, respectively. Hence, it is common to think of L's diagonal as representing item qualities. The probability of a pair {i, j} is proportional to det(L_{i,j}) = L_ii L_jj − L_ij L_ji. Thus, if −L_ij L_ji < 0, this indicates that i and j interact negatively. Similarly, if −L_ij L_ji > 0, then i and j interact positively. Therefore, the off-diagonal terms determine item interactions. (The vague term "interactions" can be replaced by the more precise term "correlations" if we consider the DPP's marginal kernel instead; see Gartrell et al. (2019, Section 2.1) for an extensive discussion.)

In order for P_L to define a probability distribution, all principal minors of L must be non-negative: det(L_Y) ≥ 0 for all Y. Matrices that satisfy this property are called P₀-matrices (Fang, 1989, Definition 1). There is no known generative method or matrix decomposition that fully covers the space of all P₀ matrices, although there are many that partially cover the space (Tsatsomeros, 2004). One common partial solution is to use a decomposition that covers the space of symmetric P₀ matrices. By restricting to the space of symmetric matrices, one can exploit the fact that L ∈ P₀ if L is positive semidefinite (PSD)* (Prussing, 1986). Any symmetric PSD matrix can be written as the Gramian matrix of some set of vectors: L := V Vᵀ, where V ∈ R^{M×K}. Hence, the V Vᵀ decomposition provides an easy means of generating the entire space of symmetric P₀ matrices.
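The quality/interaction reading of the kernel entries can be checked on 2 × 2 examples (an illustrative sketch; the matrices below are made up, not from the paper):

```python
import numpy as np

# Symmetric kernel: L_ij * L_ji = L_ij^2 >= 0, so the pair term
# -L_ij L_ji can only lower det(L_{i,j}): a negative interaction.
L_sym = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# Nonsymmetric kernel: L_ij and L_ji may differ in sign, so
# -L_ij L_ji > 0 is possible: a positive interaction.
L_nonsym = np.array([[1.0, 0.5],
                     [-0.5, 1.0]])

# Pair probability relative to the product of item qualities L_ii L_jj = 1.
assert np.linalg.det(L_sym) < 1.0     # 0.75: pair is suppressed
assert np.linalg.det(L_nonsym) > 1.0  # 1.25: pair is boosted
```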
It also has a nice intuitive interpretation: we can view the i-th row of V as a length-K feature vector describing item i. Unfortunately, the symmetry requirement limits the types of correlations that a DPP can capture. A symmetric model is able to capture only nonpositive interactions between items, since L_ij L_ji = L_ij² ≥ 0, whereas a nonsymmetric L can also capture positive correlations. (Again, see Gartrell et al. (2019, Section 2.1) for more intuition.)

To expand coverage to nonsymmetric matrices in P₀, it is natural to consider nonsymmetric PSD matrices. In what follows, we denote by P₀⁺ the set of all nonsymmetric (and symmetric) PSD matrices. Any nonsymmetric PSD matrix is in P₀ (Gartrell et al., 2019, Lemma 1), so P₀⁺ ⊆ P₀. However, unlike in the symmetric case, the set of nonsymmetric PSD matrices does not fully cover the set of nonsymmetric P₀ matrices. For example, consider

    L = [ 1    5/3 ]
        [ 1/2   1  ],

which has det(L_{1}), det(L_{2}), det(L_{1,2}) ≥ 0, but xᵀLx < 0 for x = (−1, 1)ᵀ. Still, nonsymmetric PSD matrices cover a large enough portion of the P₀ space to be useful in practice, as evidenced by the experiments of Gartrell et al. (2019). That work covered the P₀⁺ space by using the following decomposition: L := S + A, with S := V Vᵀ for V ∈ R^{M×K}, and A := BCᵀ − CBᵀ for B, C ∈ R^{M×K}. This decomposition makes use of the fact that any matrix L can be decomposed uniquely as the sum of a symmetric matrix S = (L + Lᵀ)/2 and a skew-symmetric matrix A = (L − Lᵀ)/2. All skew-symmetric matrices A are trivially PSD, since xᵀAx = 0 for all x ∈ R^M. Hence, the L here is guaranteed to be PSD simply because its S uses the standard Gramian decomposition V Vᵀ. In this work we will also only consider P₀⁺, and leave to future work the problem of finding tractable ways to cover the rest of P₀. We propose a new decomposition of L that also covers the P₀⁺ space, but allows for more scalable learning.
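Both the P₀-but-not-PSD counterexample above and the skew-symmetric construction of Gartrell et al. (2019) can be verified numerically (a sketch with illustrative dimensions and random factors):

```python
import numpy as np

# The counterexample from the text: all principal minors are
# nonnegative (so L is a P0-matrix), yet L is not PSD.
L = np.array([[1.0, 5.0 / 3.0],
              [0.5, 1.0]])
assert L[0, 0] >= 0 and L[1, 1] >= 0 and np.linalg.det(L) >= 0
x = np.array([-1.0, 1.0])
assert x @ L @ x < 0  # equals -1/6: not PSD

# The decomposition of Gartrell et al. (2019): L = V V^T + (B C^T - C B^T).
rng = np.random.default_rng(1)
M, K = 5, 2
V, B, C = (rng.normal(size=(M, K)) for _ in range(3))
A = B @ C.T - C @ B.T
assert np.allclose(A, -A.T)  # skew-symmetric by construction
xs = rng.normal(size=(10, M))
# x^T A x = 0 for every x, so A is trivially PSD.
assert np.allclose((xs * (xs @ A.T)).sum(1), 0)
```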
As in prior work, our decomposition has inner dimension K that could be as large as M , but is usually much smaller in practice. Our algorithms work well for modest values of K. In cases where the natural K is larger (e.g., natural language processing), random projections can often be used to significantly reduce K (Gillenwater et al., 2012a) .

3. NEW KERNEL DECOMPOSITION AND SCALABLE LEARNING

Prior work on NDPPs proposed a maximum likelihood estimation (MLE) learning algorithm (Gartrell et al., 2019). Due to that work's particular kernel decomposition, this algorithm had complexity cubic in the number of items M. Here, we propose a kernel decomposition that reduces this to linear in M.

We begin by showing that our new decomposition covers the space of P₀⁺ matrices. Before diving in, let us define

    Σ_i := [  0    λ_i ]
           [ −λ_i   0  ]

as shorthand for a 2 × 2 block matrix with zeros on the diagonal and opposite values off the diagonal. Then, our proposed decomposition is as follows:

    L := S + A, with S := V Vᵀ and A := B C Bᵀ,   (1)

where V, B ∈ R^{M×K}, and C ∈ R^{K×K} is a block-diagonal matrix with some diagonal blocks of the form Σ_i, with λ_i > 0, and zeros elsewhere. The following lemma shows that this decomposition covers the space of P₀⁺ matrices.

Lemma 1. Let A ∈ R^{M×M} be a skew-symmetric matrix of rank ℓ ≤ M. Then, there exist B ∈ R^{M×ℓ} and positive numbers λ_1, ..., λ_{ℓ/2}, such that A = B C Bᵀ, where C ∈ R^{ℓ×ℓ} is the block-diagonal matrix with ℓ/2 diagonal blocks of size 2 given by Σ_i, i = 1, ..., ℓ/2, and zeros elsewhere.

The proof of Lemma 1 and all subsequent results can be found in Appendix F. With this decomposition in hand, we now proceed to show that it can be used for linear-time MLE learning. To do so, we must show that the corresponding NDPP log-likelihood objective and gradient can be computed in time linear in M. Given a collection of n observed subsets {Y_1, ..., Y_n} composed of items from Y = [[M]], the full formulation of the regularized log-likelihood is:

    φ(V, B, C) = (1/n) Σ_{i=1}^n log det(V_{Y_i} V_{Y_i}ᵀ + B_{Y_i} C B_{Y_i}ᵀ) − log det(V Vᵀ + B C Bᵀ + I) − R(V, B),   (2)

where V_{Y_i} ∈ R^{|Y_i|×K} denotes the matrix composed of the rows of V that correspond to the items in Y_i.
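A minimal sketch of the proposed decomposition (the λ_i and factor matrices below are randomly generated for illustration, not learned):

```python
import numpy as np

rng = np.random.default_rng(2)
M, K = 8, 4  # K even, so C can hold K/2 blocks Sigma_i

# Block-diagonal C with 2x2 blocks Sigma_i = [[0, lam_i], [-lam_i, 0]].
lams = rng.uniform(0.5, 2.0, size=K // 2)
C = np.zeros((K, K))
for i, lam in enumerate(lams):
    C[2 * i, 2 * i + 1] = lam
    C[2 * i + 1, 2 * i] = -lam

V = rng.normal(size=(M, K))
B = rng.normal(size=(M, K))
A = B @ C @ B.T          # skew-symmetric part
L = V @ V.T + A          # L = S + A with S = V V^T

assert np.allclose(C, -C.T) and np.allclose(A, -A.T)
# L is PSD: x^T L x = x^T V V^T x = ||V^T x||^2 >= 0 for every x.
xs = rng.normal(size=(20, M))
assert np.all((xs * (xs @ L.T)).sum(1) >= -1e-9)
```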
The regularization term R(V, B) is defined as follows:

    R(V, B) = α Σ_{i=1}^M (1/μ_i) ||v_i||₂² + β Σ_{i=1}^M (1/μ_i) ||b_i||₂²,   (3)

where μ_i counts the number of occurrences of item i in the training set, v_i and b_i are the rows of V and B, respectively, and α, β > 0 are tunable hyperparameters. This regularization is similar to that of prior work (Gartrell et al., 2017; 2019). We omit regularization for C.

Theorem 1 shows that computing the regularized log-likelihood and its gradient both have time complexity linear in M. The complexities also depend on K, the rank of the NDPP, and K′, the size of the largest observed subset in the data. For many real-world datasets we observe that K′ ≪ M, and we set K = K′. Hence, linearity in M means that we can efficiently perform learning for datasets with very large ground sets, which is impossible with the cubic-complexity decomposition of prior work (Gartrell et al., 2019).

Theorem 1. Given an NDPP with kernel L = V Vᵀ + B C Bᵀ, parameterized by V of rank K, B of rank K, and a K × K matrix C, we can compute the regularized log-likelihood (Eq. 2) and its gradient in O(MK² + K³ + nK′³) time, where K′ is the size of the largest of the n training subsets.
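The key to avoiding an M × M determinant in the normalizer of Eq. 2 is the standard identity det(I_M + X D Xᵀ) = det(I_{2K} + D Xᵀ X); the paper's full proof is in its Appendix F, so the following only illustrates this kind of identity numerically under random factors:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K = 500, 4
V = rng.normal(size=(M, K))
B = rng.normal(size=(M, K))
C = rng.normal(size=(K, K))
C = C - C.T  # any skew-symmetric C works for this identity

# Normalizer term det(L + I) with L = V V^T + B C B^T = X D X^T,
# where X = [V B] and D = blockdiag(I_K, C).
X = np.hstack([V, B])
D = np.block([[np.eye(K), np.zeros((K, K))],
              [np.zeros((K, K)), C]])

# Naive O(M^3) computation vs. the O(M K^2 + K^3) route through
# det(I_M + X D X^T) = det(I_{2K} + D X^T X).
naive = np.linalg.slogdet(np.eye(M) + X @ D @ X.T)[1]
fast = np.linalg.slogdet(np.eye(2 * K) + D @ (X.T @ X))[1]
assert abs(naive - fast) < 1e-6
```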

4. MAP INFERENCE

After learning an NDPP, one can use it to infer the most probable item subsets in various situations. Several inference algorithms have been well studied for symmetric DPPs, including sampling (Kulesza & Taskar, 2011; Anari et al., 2016; Li et al., 2016; Launay et al., 2018; Gillenwater et al., 2019; Poulson, 2019; Dereziński, 2019) and MAP inference (Gillenwater et al., 2012b; Han et al., 2017; Chen et al., 2018; Han & Gillenwater, 2020). We focus on MAP inference:

    argmax_{Y⊆Y} det(L_Y) such that |Y| = k,   (4)

for a cardinality budget k ≤ M. MAP inference is a better fit than sampling when the end application requires the generation of a single output set, which is usually the case in practice (e.g., this is usually true for recommender systems). MAP inference for DPPs is known to be NP-hard even in the symmetric case (Ko et al., 1995; Kulesza et al., 2012). For symmetric DPPs, one usually approximates the MAP via the standard greedy algorithm for submodular maximization (Nemhauser et al., 1978). First, we describe how to efficiently implement this algorithm for NDPPs. Then, in Section 4.1 we prove a lower bound on its approximation quality. To the best of our knowledge, this is the first investigation of how to apply the greedy algorithm to NDPPs.

Greedy begins with an empty set and repeatedly adds the item that maximizes the marginal gain until the chosen set has size k. Here, we design an efficient greedy algorithm for the case where the NDPP kernel is low-rank. For generality, in what follows we write the kernel as L = B C Bᵀ, since one can easily rewrite our matrix decomposition (Eq. 1), as well as that of Gartrell et al. (2019), to take this form. For example, for our decomposition:

    L = V Vᵀ + B C Bᵀ = [V B] [ I  0 ] [V B]ᵀ.
                              [ 0  C ]
Using Schur's determinant identity, we first observe that, for Y ⊆ [[M]] and i ∈ [[M]], the marginal gain of an NDPP can be written as

    det(L_{Y∪{i}}) / det(L_Y) = L_ii − L_{i,Y} (L_Y)⁻¹ L_{Y,i}
                              = b_i C b_iᵀ − (b_i C B_Yᵀ)(B_Y C B_Yᵀ)⁻¹(B_Y C b_iᵀ),   (5)

where b_i ∈ R^{1×K} is the i-th row of B and B_Y ∈ R^{|Y|×K}. A naïve computation of Eq. 5 is O(K² + k³), but the inverse term can be accumulated incrementally across greedy iterations. Writing {a_1, ..., a_k} for the sequence of selected items, we have

    B_Yᵀ (B_Y C B_Yᵀ)⁻¹ B_Y = Σ_{j=1}^k p_jᵀ q_j,   (6)

where the row vectors p_j, q_j ∈ R^{1×K} for j = 1, ..., k satisfy p_1 = b_{a_1}/(b_{a_1} C b_{a_1}ᵀ), q_1 = b_{a_1}, and

    p_{j+1} = (b_{a_{j+1}} − b_{a_{j+1}} C Σ_{i=1}^j q_iᵀ p_i) / (b_{a_{j+1}} C (b_{a_{j+1}} − b_{a_{j+1}} C Σ_{i=1}^j q_iᵀ p_i)ᵀ),
    q_{j+1} = b_{a_{j+1}} − b_{a_{j+1}} C Σ_{i=1}^j p_iᵀ q_i.   (7)

Algorithm 1: Greedy MAP inference / conditioning for low-rank NDPPs
 1: Input: B ∈ R^{M×K}, C ∈ R^{K×K}, the cardinality k   ▷ and {a_1, ..., a_k} for conditioning
 2: Initialize P ← [ ], Q ← [ ], and Y ← ∅
 3: Δ_i ← b_i C b_iᵀ for i ∈ [[M]], where b_i ∈ R^{1×K} is the i-th row of B
 4: a ← argmax_i Δ_i and Y ← Y ∪ {a}   ▷ a ← a_1 for conditioning
 5: while |Y| ≤ k do
 6:     p ← (b_a − b_a C Qᵀ P) / Δ_a
 7:     q ← b_a − b_a C Pᵀ Q
 8:     P ← [P; p] and Q ← [Q; q]
 9:     Δ_i ← Δ_i − (b_i C pᵀ)(q C b_iᵀ) for i ∈ [[M]], i ∉ Y
10:     a ← argmax_{i∉Y} Δ_i and Y ← Y ∪ {a}   ▷ a ← a_{|Y|+1} for conditioning
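For reference, a compact (unoptimized) sketch of the greedy selection rule: it recomputes the Schur-complement gain of Eq. 5 from scratch at each step rather than using the rank-1 updates of Algorithm 1, so it is slower but easy to check against full determinants (dimensions and random factors below are illustrative):

```python
import numpy as np

def greedy_ndpp_map(B, C, k):
    """Greedy MAP for an NDPP with low-rank kernel L = B C B^T.

    Reference version: recomputes the marginal gain of Eq. 5 at every
    step; Algorithm 1 in the text produces the same iterates faster.
    """
    M = B.shape[0]
    Y, gains = [], []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(M):
            if i in Y:
                continue
            bi = B[i]
            gain = bi @ C @ bi  # L_ii
            if Y:
                BY = B[Y]
                G = BY @ C @ BY.T  # = L_Y
                gain -= (bi @ C @ BY.T) @ np.linalg.solve(G, BY @ C @ bi)
            if gain > best_gain:
                best, best_gain = i, gain
        Y.append(best)
        gains.append(best_gain)
    return Y, gains

rng = np.random.default_rng(4)
M, K, k = 30, 4, 3
V = rng.normal(size=(M, K))
D = rng.normal(size=(K, K))
# Kernel in the [V B] blockdiag(I, C) form, with B tied to V as in Section 5.
Bfull = np.hstack([V, V])
Cfull = np.block([[np.eye(K), np.zeros((K, K))],
                  [np.zeros((K, K)), D - D.T]])
L = Bfull @ Cfull @ Bfull.T

Y, gains = greedy_ndpp_map(Bfull, Cfull, k)
# Marginal gains telescope: det(L_Y) equals the product of greedy gains.
assert np.isclose(np.prod(gains), np.linalg.det(L[np.ix_(Y, Y)]))
```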

Table 1: Complexity of learning and MAP inference for our method and prior work. K′ denotes the size of the largest training subset and k the MAP cardinality budget.

    Method                                    | Learning time       | MAP inference time | Learning memory | MAP memory
    Symmetric DPP (Gartrell et al., 2017)     | O(MK² + nK′³)       | O(MKk + MK²)       | O(MK)           | O(MK)
    Nonsymmetric DPP (Gartrell et al., 2019)  | O(M³ + MK² + nK′³)  | O(MKk + MK²)       | O(M²)           | O(MK)†
    Scalable nonsymmetric DPP (this work)     | O(MK² + nK′³)       | O(MKk + MK²)       | O(MK + K²)      | O(MK + K²)

Plugging Eq. 6 into Eq. 5, the marginal gain with respect to Y ∪ {a} can be computed by simply updating the previous gain with respect to Y. That is,

    det(L_{Y∪{a,i}}) / det(L_{Y∪{a}}) = b_i C b_iᵀ − Σ_{j=1}^{|Y|+1} (b_i C p_jᵀ)(q_j C b_iᵀ)   (8)
                                      = det(L_{Y∪{i}}) / det(L_Y) − (b_i C p_{|Y|+1}ᵀ)(q_{|Y|+1} C b_iᵀ).   (9)

The marginal gains when Y = ∅ are equal to the diagonal entries of L and require O(MK²) operations. Then, computing the update terms in Eq. 9 for all i ∈ [[M]] needs O(MK) operations. Since the total number of updates is k, the overall complexity becomes O(MK² + MKk). We provide a full description of the implied greedy algorithm for low-rank NDPPs in Algorithm 1. Table 1 summarizes the complexity of our methods and those of previous work. Note that the full M × M matrix L + I is used to compute the DPP normalization constant in Gartrell et al. (2019), which is why that approach has memory complexity O(M²) for MLE learning.

4.1. APPROXIMATION GUARANTEE FOR GREEDY NDPP MAP INFERENCE

As mentioned above, Algorithm 1 is an instantiation of the standard greedy algorithm used for submodular maximization (Nemhauser et al., 1978). This algorithm has a (1 − 1/e)-approximation guarantee for the problem of maximizing nonnegative, monotone submodular functions. While the function f(Y) = log det(L_Y) is submodular for a symmetric PSD L (Kelmans & Kimelfeld, 1983), it is not monotone. Often, as in Han & Gillenwater (2020), it is assumed that the smallest eigenvalue of L is greater than 1, which guarantees monotonicity. There is no particular evidence that this assumption is true for practical models, but nevertheless the greedy algorithm tends to perform well in practice for symmetric DPPs. Here, we prove a similar approximation guarantee that covers NDPPs as well, even though the function f(Y) = log det(L_Y) is non-submodular when L is nonsymmetric. In Section 5.5, we further observe that, as for symmetric DPPs, the greedy algorithm seems to work well in practice for NDPPs.

We leverage a recent result of Bian et al. (2017), who proposed an extension of greedy algorithm guarantees to non-submodular functions. Their result is based on the submodularity ratio and curvature of the objective function, which measure the extent to which it has submodular properties. Theorem 2 extends this to provide an approximation ratio for greedy MAP inference of NDPPs.

Theorem 2. Consider a nonsymmetric low-rank DPP L = V Vᵀ + B C Bᵀ, where V, B are of rank K, and C ∈ R^{K×K}. Given a cardinality budget k, let σ_min and σ_max denote the smallest and largest singular values of L_Y over all Y ⊆ [[M]] with |Y| ≤ 2k. Assume that σ_min > 1. Then,

    log det(L_{Y^G}) ≥ [4(1 − e^{−1/4}) / (2(log σ_max / log σ_min) − 1)] · log det(L_{Y*}),   (10)

where Y^G is the output of Algorithm 1 and Y* is the optimal solution of the MAP inference problem in Eq. 4.

Thus, when the kernel has a small value of log σ_max / log σ_min, the greedy algorithm finds a near-optimal solution.
In practice, we observe that the greedy algorithm finds a near-optimal solution even for large values of this ratio (see Section 5.5). As remarked above, there is no evidence that the condition σ_min > 1 usually holds in practice. While this condition can be achieved by multiplying L by a constant, doing so leads to a (potentially large) additive term in Eq. 10. We provide Corollary 1 in Appendix D, which drops the σ_min > 1 assumption and quantifies this additive term.

4.2. GREEDY CONDITIONING FOR NEXT-ITEM PREDICTION

We briefly describe here a small modification to the greedy algorithm that is necessary if one wants to use it as a tool for next-item prediction. Given a set Y ⊆ [[M]], Kulesza et al. (2012) showed that a DPP with kernel L conditioned on the inclusion of the items in Y forms another DPP with kernel L^Y := L_{Ȳ} − L_{Ȳ,Y} (L_Y)⁻¹ L_{Y,Ȳ}, where Ȳ = [[M]] \ Y. The singleton probability Pr(Y ∪ {i} | Y) ∝ (L^Y)_{ii} can be used for next-item prediction. We can reuse the machinery from the greedy algorithm's marginal gain computations to efficiently compute these singletons.

More concretely, suppose that we are doing next-item prediction as a shopper adds items to a digital cart. We predict the item that maximizes the marginal gain, conditioned on the current cart contents (the set Y). When the shopper adds the next item to the cart, we update Y to include this item, rather than our predicted item (line 10 in Algorithm 1). We then iterate until the shopper checks out. The comments on the right-hand side of Algorithm 1 summarize this procedure. The runtime of this prediction is the same as that of the greedy algorithm, O(MK² + MK|Y|). We note that this cost is comparable to that of an approach based on the DPP dual kernel from prior work (Mariet et al., 2019), which has O(MK² + K³ + |Y|³) complexity. However, since it is non-trivial to define the dual kernel for NDPPs, the greedy algorithm may be the simpler choice for next-item prediction for NDPPs.
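The equivalence between the conditional kernel's diagonal and the greedy marginal gains follows from Schur's identity and can be checked directly (a sketch on a small random nonsymmetric PSD kernel; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 8
G = rng.normal(size=(M, M))
S = rng.normal(size=(M, M))
# Nonsymmetric PSD kernel: PD symmetric part plus a skew-symmetric part.
L = G @ G.T + 0.3 * (S - S.T) + 0.1 * np.eye(M)

Y = [0, 3]                                # conditioned-on items
rest = [i for i in range(M) if i not in Y]

# Conditional kernel L^Y = L_rest - L_{rest,Y} L_Y^{-1} L_{Y,rest}.
LY_inv = np.linalg.inv(L[np.ix_(Y, Y)])
L_cond = L[np.ix_(rest, rest)] - L[np.ix_(rest, Y)] @ LY_inv @ L[np.ix_(Y, rest)]

# Its diagonal holds the next-item scores det(L_{Y u {i}}) / det(L_Y).
detY = np.linalg.det(L[np.ix_(Y, Y)])
for idx, i in enumerate(rest):
    ratio = np.linalg.det(L[np.ix_(Y + [i], Y + [i])]) / detY
    assert np.isclose(L_cond[idx, idx], ratio)
```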

5. EXPERIMENTS

To further simplify learning and MAP inference, we set B = V, which results in L = V Vᵀ + V C Vᵀ = V (I + C) Vᵀ. This change also simplifies regularization, so that we only regularize V, as indicated in the first term of Eq. 3, leaving us with the single regularization hyperparameter α. While setting B = V restricts the class of nonsymmetric L kernels that can be represented, we compensate for this restriction by relaxing the block-diagonal structure imposed on C, so that we learn a full skew-symmetric K × K matrix C. To ensure that C, and thus A, is skew-symmetric, we parameterize C by setting C = D − Dᵀ, where D ranges over R^{K×K}. Code for all experiments is available at https://github.com/cgartrel/scalable-nonsymmetric-DPPs.
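A minimal sketch of this parameterization (NumPy with random values in place of learned parameters, rather than the actual PyTorch training code):

```python
import numpy as np

rng = np.random.default_rng(6)
M, K = 100, 10
V = rng.uniform(0, 1, size=(M, K))  # B is tied to V
D = rng.normal(size=(K, K))         # unconstrained parameter
C = D - D.T                         # skew-symmetric for any D

L = V @ (np.eye(K) + C) @ V.T       # L = V V^T + V C V^T
S = 0.5 * (L + L.T)                 # symmetric part
A = 0.5 * (L - L.T)                 # skew-symmetric part

assert np.allclose(S, V @ V.T)      # symmetric part is the Gramian V V^T
assert np.allclose(A, V @ C @ V.T)  # skew part is V C V^T
```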

5.1. DATASETS

We perform experiments on several real-world public datasets composed of subsets:

1. Amazon Baby Registries: This dataset consists of registries or "baskets" of baby products, and has been used in prior work on DPP learning (Gartrell et al., 2016; 2019; Gillenwater et al., 2014; Mariet & Sra, 2015). The registries contain items from 15 different categories, such as "apparel", with a catalog of up to 100 items per category. Our evaluation mirrors that of Gartrell et al. (2019); we evaluate on the popular apparel category, which contains 14,970 registries, as well as on a dataset composed of the three most popular categories: apparel, diaper, and feeding, which contains a total of 31,218 registries.

2. UK Retail: This dataset (Chen et al., 2012)

5.2. EXPERIMENTAL SETUP AND METRICS

We use a small held-out validation set, consisting of 300 randomly selected baskets, for tracking convergence during training and for tuning hyperparameters. A random selection of 2,000 of the remaining baskets is used for testing, and the rest are used for training. Convergence is reached during training when the relative change in validation log-likelihood is below a predetermined threshold. We use PyTorch with Adam (Kingma & Ba, 2015) for optimization. We initialize C from the standard Gaussian distribution with mean 0 and variance 1, and B (which we set equal to V) is initialized from the uniform(0, 1) distribution.

Subset expansion task. We use greedy conditioning to do next-item prediction, as described in Section 4.2. We compare methods using a standard recommender system metric: mean percentile rank (MPR) (Hu et al., 2008; Li et al., 2010). An MPR of 50 is equivalent to random selection; an MPR of 100 means that the model perfectly predicts the next item. See Appendix A for a complete description of the MPR metric.

Subset discrimination task. We also test the ability of a model to discriminate observed subsets from randomly generated ones. For each subset in the test set, we generate a subset of the same length by drawing items uniformly at random (ensuring that the same item is not drawn more than once within a subset). We compute the AUC for the model on these observed and random subsets, where the score for each subset is the log-likelihood that the model assigns to it.
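For illustration only (the paper's exact MPR definition lives in its Appendix A and may differ in details), one common percentile-rank formulation scores the held-out item against all candidates:

```python
import numpy as np

def percentile_rank(scores, target):
    """Percentile rank of candidate `target` among all candidate scores.

    100 means the target outranks every other candidate; 0 means it
    ranks last. MPR averages this quantity over test instances.
    """
    worse = np.sum(scores < scores[target])
    return 100.0 * worse / (len(scores) - 1)

# Toy check with hypothetical next-item scores for 4 candidates.
scores = np.array([0.1, 0.9, 0.3, 0.2])
assert percentile_rank(scores, 1) == 100.0  # best-scored candidate
assert percentile_rank(scores, 0) == 0.0    # worst-scored candidate
```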

5.3. PREDICTIVE PERFORMANCE RESULTS FOR LEARNING

Since the focus of our work is on improving NDPP scalability, we use the low-rank symmetric DPP (Gartrell et al., 2017) and the low-rank NDPP of prior work (Gartrell et al., 2019) as baselines for our experiments. Table 2 compares these approaches and our scalable low-rank NDPP. We see that NDPPs generally outperform symmetric DPPs. Furthermore, we see that our scalable NDPP matches or exceeds the predictive quality of the baseline NDPP. We believe that our model sometimes improves upon this baseline due to its simpler kernel decomposition with fewer parameters, which likely leads to a simpler optimization landscape.

As expected, we observe that the scalable NDPP trains far faster than the baseline NDPP for datasets with large ground sets. For the Amazon: 3-category dataset, both approaches show comparable results, with the scalable NDPP converging 1.07× faster than the NDPP. But for the UK Retail dataset, which has a much larger ground set, our scalable NDPP achieves convergence about 8.31× faster. Notice that our scalable NDPP also opens the door to training on datasets with large M, such as Instacart and Million Song, which are infeasible for the baseline NDPP due to high memory and compute costs. For example, NDPP learning using Gartrell et al. (2019) on the Million Song dataset would require approximately 1.1 TB of memory, while our scalable NDPP approach requires approximately 445.9 MB.

5.5. PERFORMANCE RESULTS FOR MAP INFERENCE

We run various approximation algorithms for MAP inference, including the greedy algorithm (Algorithm 1), the stochastic greedy algorithm (Mirzasoleiman et al., 2015), MCMC-based DPP sampling (Li et al., 2016), and greedy local search (Kathuria & Deshpande, 2016). The stochastic greedy algorithm computes the marginal gains of a few items chosen uniformly at random and selects the best among them. MCMC sampling begins with a random subset Y of size k and picks i ∈ Y and j ∉ Y uniformly at random. Then, it swaps them with probability det(L_{Y∪{j}\{i}}) / (det(L_{Y∪{j}\{i}}) + det(L_Y)) and iterates this process. The greedy local search algorithm (Kathuria & Deshpande, 2016) starts from the output of the greedy algorithm, Y^G, and replaces i ∈ Y^G with the j ∉ Y^G that gives the maximum improvement, if such i, j exist. This replacement process iterates until no improvement exists, or until at most k² log(10k) steps have been completed, to guarantee a tight approximation (Kathuria & Deshpande, 2016). We use greedy local search as a baseline since it always returns a solution at least as good as that of greedy. However, it is the slowest among all algorithms, as its time complexity is O(MKk⁴ log k). We choose k = 10, and provide more details on all algorithms in Appendix C.

To evaluate the performance of MAP inference, we report the relative log-determinant ratio, defined as

    (log det(L_{Y*}) − log det(L_Y)) / log det(L_{Y*}),

where Y is the output of a benchmark algorithm and Y* is the greedy local search result. Results are reported in Table 3.

Table 3: Average relative error and 95% confidence intervals of MAP inference algorithms on NDPPs learned from real-world datasets. For all datasets, we evaluate 10 kernels learned with different initializations, and run 100 random trials for stochastic greedy (Mirzasoleiman et al., 2015) and MCMC sampling (Li et al., 2016). All errors are relative to greedy local search (Kathuria & Deshpande, 2016).
We observe that greedy (Algorithm 1) achieves performance close to that of the significantly more expensive greedy local search algorithm, with relative errors of at most 0.045. Stochastic greedy and MCMC sampling have significantly larger errors. For completeness, in Appendix E we also present experiments comparing the performance of greedy and exact MAP on small synthetic NDPPs, for which the exact MAP can be feasibly computed.
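The MCMC baseline's swap step described above can be sketched as follows (illustrative kernel and sizes; the real experiments use learned NDPP kernels):

```python
import numpy as np

def mcmc_kdpp_step(L, Y, rng):
    """One Gibbs swap step of k-DPP sampling (Li et al., 2016)."""
    Y = list(Y)
    out = [j for j in range(len(L)) if j not in Y]
    i = Y[rng.integers(len(Y))]        # element to possibly remove
    j = out[rng.integers(len(out))]    # element to possibly add
    Ynew = [x for x in Y if x != i] + [j]
    d_new = np.linalg.det(L[np.ix_(Ynew, Ynew)])
    d_old = np.linalg.det(L[np.ix_(Y, Y)])
    # Swap with probability det(L_{Y u {j} \ {i}}) / (that + det(L_Y)).
    if rng.random() < d_new / (d_new + d_old):
        return Ynew
    return Y

rng = np.random.default_rng(7)
M, K, k = 12, 4, 3
G = rng.normal(size=(M, K))
L = G @ G.T + 0.1 * np.eye(M)  # PD kernel, so all determinants are positive
Y = list(rng.choice(M, size=k, replace=False))
for _ in range(50):
    Y = mcmc_kdpp_step(L, Y, rng)
assert len(set(Y)) == k  # every swap preserves cardinality
```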

5.6. TIME COMPARISON FOR MAP INFERENCE

We provide the wall-clock times of the above algorithms on the real-world datasets in Table 4. Observe that the greedy algorithm is the fastest method for all datasets except Million Song. For Million Song, MCMC sampling is faster than the other approaches, but it has much larger relative errors in terms of log-determinant (see Table 3), which makes it unsuitable for our purposes.

6. CONCLUSION

We have presented a new decomposition for nonsymmetric DPP kernels that can be learned in time linear in the size of the ground set, a significant improvement over the complexity of prior work. Empirical results indicate that this decomposition matches the predictive performance of the prior decomposition. We have also derived the first MAP inference algorithm for nonsymmetric DPPs, and proved a lower bound on the quality of its approximation. In future work we hope to develop intuition about the meaning of the parameters in the C matrix, and to consider kernel decompositions that cover other parts of the nonsymmetric P₀ space.

• Instacart dataset: K = 100, α = 0.001.
• Million Song dataset: K = 150, α = 0.01.

For all of the above model configurations we use a batch size of 200 during training, except for the scalable NDPPs trained on the Amazon apparel, Amazon three-category, Instacart, and Million Song datasets, where a batch size of 800 is used.

C BENCHMARK ALGORITHMS FOR MAP INFERENCE

We test the following approximate algorithms for MAP inference:

Greedy local search. This algorithm starts from the output of greedy, Y^G, and replaces i ∈ Y^G with the j ∉ Y^G that gives the maximum improvement of the determinant, if such i, j exist. Kathuria & Deshpande (2016) showed that running the search for such a swap O(k² log(k/ε)) times, with an accuracy parameter ε, gives a tight approximation guarantee for MAP inference for symmetric DPPs. We set the number of swaps to k² log(10k), i.e., ε = 0.1, and use greedy local search as a baseline, since it is strictly an improvement on the greedy solution. The proposed greedy conditioning can be used for fast greedy local search: for each i ∈ Y^G, Algorithm 1 can compute the marginal improvements conditioned on Y^G \ {i} in O(MKk) time, and thus the overall runtime can be O(MKk⁴ log(k/ε)). However, it is the slowest among all of our benchmark algorithms.

Stochastic greedy. This algorithm computes the marginal gains of a few items chosen uniformly at random and selects the best among them. Mirzasoleiman et al. (2015) proved that (M/k) log(1/ε) samples are enough to guarantee a (1 − 1/e − ε)-approximation ratio for submodular functions (i.e., symmetric DPPs). We choose ε = 0.1 and set the number of samples to (M/k) log(10). Under this setting, the time complexity of stochastic greedy is O(MKk² log(1/ε)), which is better than that of the naïve exact greedy algorithm. However, we note that it is worse than that of our efficient greedy implementation (Algorithm 1). This is because stochastic greedy uses different random samples in every iteration, and thus cannot take advantage of the amortized computations in Lemma 2. In our experiments, we simply modify line 10 in Algorithm 1 for stochastic greedy (the argmax is taken over a random subset of the marginal gains), hence it can run in O(MKk + (M/k) log(1/ε)) time.
In practice, we observe that stochastic greedy is slightly slower than exact greedy due to the additional cost of the random sampling process.

MCMC sampling. We also compare the inference algorithms with sampling from a nonsymmetric DPP. To the best of our knowledge, exact sampling of a non-Hermitian DPP was studied only in Poulson (2019); it requires a Cholesky decomposition with O(M³) complexity, which is infeasible for large M. To resolve this, Markov chain Monte Carlo (MCMC) sampling is preferred (Li et al., 2016) for symmetric DPPs. In particular, we consider Gibbs sampling for a k-DPP, which begins with a random subset Y of size k and picks i ∈ Y and j ∉ Y uniformly at random. Then, it swaps them with probability

    det(L_{Y∪{j}\{i}}) / (det(L_{Y∪{j}\{i}}) + det(L_Y))

and repeats this process for several steps. Li et al. (2016) showed that O(Mk log(k/ε)) swaps are enough to approximate the ground-truth distribution under symmetric DPPs. However, for a fair runtime comparison to Algorithm 1, we set the number of swaps to 3M/K.

D COROLLARY OF THEOREM 2

Theorem 2 requires the technical condition σ_min > 1, but in practice there is no particular evidence that this condition holds. While the condition can be enforced by multiplying L by a constant, this introduces a (potentially large) additive term relative to Eq. 10. Here, we provide Corollary 1, which removes the σ_min > 1 assumption from Theorem 2 and quantifies this additive term.

Corollary 1. Consider a nonsymmetric low-rank DPP L = V V^⊤ + B C B^⊤, where V, B are of rank K, and C ∈ R^{K×K}. Given a cardinality budget k, let σ_min and σ_max denote the smallest and largest singular values of L_Y over all Y ⊆ [[M]] with |Y| ≤ 2k, and let κ = σ_max/σ_min. Then,

log det(L_{Y^G}) ≥ (4(1 − e^{−1/4}) / (2 log κ + 1)) log det(L_{Y^*}) − (1 − 4(1 − e^{−1/4}) / (2 log κ + 1)) k (1 − log σ_min),

where Y^G is the output of Algorithm 1 and Y^* is the optimal solution of MAP inference in Eq. 4. The proof of Corollary 1 is provided in Appendix F.5. Note that instead of the log(σ_max)/log(σ_min) term of Theorem 2, Corollary 1 has a log(σ_max/σ_min) term in the denominator.

E PERFORMANCE GUARANTEE FOR GREEDY MAP INFERENCE

The matrices learned on real datasets are too large for computing the exact MAP solution to be tractable, but we can compute exact MAP for small matrices. In this section, we explore the performance of the greedy algorithm studied in Theorem 2 on 5×5 synthetic kernel matrices. More formally, we first pick K = 3 singular values s_1, s_2, s_3 from a kernel learned for the "Amazon: 3-category" dataset (a plot of these singular values can be seen in Fig. 2(c)) and generate L = V_1 diag([s_1, s_2, s_3]) V_2^⊤, where V_1, V_2 ∈ R^{5×3} are random orthonormal matrices. To ensure that L is a P_0 matrix, we repeatedly sample V_1, V_2 until all principal minors of L are nonnegative. We also evaluate the performance of the symmetric DPP, whose kernel matrices are generated similarly to the NDPP's, except that we set V_1 = V_2. We set k = 3 and generate 10,000 random kernels for both symmetric DPPs and NDPPs. The results for symmetric and nonsymmetric DPPs are shown in Fig. 2(a) and Fig. 2(b), respectively. We plot the approximation ratio of Algorithm 1, i.e., log det(L_{Y^G})/log det(L_{Y^*}), with respect to log(σ_max/σ_min), from Corollary 1. We observe that the greedy algorithm often achieves approximation ratios close to 1 in both cases. However, the worst-case ratio for NDPPs is worse than that for symmetric DPPs; log det(L_Y) for L ∈ P_0^+ is non-submodular, and the greedy algorithm with a non-submodular function does not have as tight a worst-case bound as in the symmetric case.
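The small-scale study above can be sketched in a few lines. Note two deliberate simplifications in this sketch: we build kernels that are P_0 by construction (a PSD symmetric part plus a skew-symmetric part) rather than rejection-sampling V_1, V_2 against learned singular values, and we obtain the exact MAP solution by brute force:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
M, k = 5, 3

def logdet_sub(L, Y):
    idx = np.asarray(Y)
    sign, val = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return val if sign > 0 else -np.inf

# A matrix whose symmetric part is PSD has all principal minors >= 0 (P0),
# so no rejection sampling is needed for this sketch.
V = rng.standard_normal((M, 3))
B = rng.standard_normal((M, 2))
C = np.array([[0.0, 1.0], [-1.0, 0.0]])          # skew-symmetric block
L = V @ V.T + B @ C @ B.T

# Exact MAP by brute force over all size-k subsets.
Y_star = max(itertools.combinations(range(M), k),
             key=lambda Y: logdet_sub(L, list(Y)))

# Greedy: repeatedly add the item with the largest marginal gain.
Y_greedy = []
for _ in range(k):
    j_best = max((j for j in range(M) if j not in Y_greedy),
                 key=lambda j: logdet_sub(L, Y_greedy + [j]))
    Y_greedy.append(j_best)

# Fig. 2 reports the ratio log det(L_{Y_greedy}) / log det(L_{Y_star}).
assert logdet_sub(L, Y_greedy) <= logdet_sub(L, list(Y_star)) + 1e-9
```

The final assertion holds by construction: the brute-force optimum is a maximum over all size-k subsets, including the greedy one.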

F PROOFS

F.1 PROOF OF LEMMA 1

Lemma 1. Let A ∈ R^{M×M} be a skew-symmetric matrix with rank ℓ ≤ M. Then there exist B ∈ R^{M×ℓ} and positive numbers λ_1, ..., λ_{ℓ/2}, such that A = B C B^⊤, where C ∈ R^{ℓ×ℓ} is the block-diagonal matrix with ℓ/2 diagonal blocks of size 2 given by Σ_i = [0, λ_i; −λ_i, 0], i = 1, ..., ℓ/2, and zeros elsewhere.

Proof. First, we note that the rank ℓ of a skew-symmetric matrix is always even, because all of its nonzero eigenvalues are purely imaginary and come in conjugate pairs. There exist an orthogonal matrix P ∈ R^{M×M} and a block-diagonal matrix

Σ = diag(Σ_1, ..., Σ_{ℓ/2}, 0, ..., 0) ∈ R^{M×M},   with Σ_i = [0, λ_i; −λ_i, 0],

such that A = P Σ P^⊤ (see, e.g., (Thompson, 1988, Proposition 2.1)). Let C be the ℓ×ℓ submatrix of Σ obtained by keeping its first ℓ rows and columns, and let Q = [I_ℓ; 0] ∈ R^{M×ℓ}, where I_ℓ is the ℓ×ℓ identity matrix. Then Σ = Q C Q^⊤, and one can write A = P Q C Q^⊤ P^⊤. Setting B = P Q proves the lemma.
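The construction in this proof is easy to verify numerically (a sketch: we build Σ and a random orthogonal P directly, then check the factorization A = B C B^⊤):

```python
import numpy as np

rng = np.random.default_rng(0)
M, ell = 6, 4  # ground-set size and (even) rank

# Block-diagonal Sigma with 2x2 blocks [[0, lam], [-lam, 0]] and zeros elsewhere.
lams = rng.uniform(0.5, 2.0, size=ell // 2)
Sigma = np.zeros((M, M))
for i, lam in enumerate(lams):
    Sigma[2 * i, 2 * i + 1] = lam
    Sigma[2 * i + 1, 2 * i] = -lam

# Random orthogonal P via QR decomposition.
P, _ = np.linalg.qr(rng.standard_normal((M, M)))
A = P @ Sigma @ P.T          # skew-symmetric with rank ell

B = P[:, :ell]               # B = PQ keeps the first ell columns of P
C = Sigma[:ell, :ell]        # C = leading ell x ell block of Sigma

assert np.allclose(A, -A.T)              # A is skew-symmetric
assert np.linalg.matrix_rank(A) == ell
assert np.allclose(A, B @ C @ B.T)       # the Lemma 1 factorization
```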

F.2 PROOF OF THEOREM 1

Theorem 1. Given an NDPP with kernel L = V V^⊤ + B C B^⊤, parameterized by V of rank K, B of rank K, and a K×K matrix C, we can compute the regularized log-likelihood (Eq. 2) and its gradient in O(MK^2 + K^3 + nK′^3) time, where K′ is the size of the largest of the n training subsets.

Proof. We first show that the log-likelihood can be computed in time linear in M. Writing L = [V, BC][V, B]^⊤ and using the matrix determinant lemma, one can easily verify that the DPP normalization term can be computed as

det(I + L) = det(I_M + [V, BC][V, B]^⊤) = det(I_{2K} + [V, B]^⊤[V, BC]),   (14)

where I_{2K} is the identity matrix of dimension 2K. As Eq. 14 requires a matrix multiplication between (2K)×M and M×(2K) matrices and the determinant of a (2K)×(2K) matrix, this transforms an O(M^3) operation into an O(MK^2 + K^3) one. Having established that the normalization term in the likelihood can be computed in O(MK^2 + K^3) time, we proceed with characterizing the complexity of the other terms in the likelihood. The first term in Eq. 2 consists of determinants of size |Y_i|. Assuming that these never exceed size K′, each can be computed in at most O(K′^3) time. The regularization term is a simple sum of norms that can be computed in O(MK) time. Therefore, the full regularized log-likelihood can be computed in O(MK^2 + K^3 + nK′^3) time.

To prove that the gradient of the log-likelihood can be computed in time linear in M, we begin by showing that the logarithm of the DPP normalization term can be factorized as follows:

Z = log det(I + L)   (15)
  = log det(I_{2K} + [V, B]^⊤[V, B] [I_K, 0; 0, C])   (16)
  = log det([I_K, 0; 0, C^{−1}] + [V, B]^⊤[V, B]) + log det([I_K, 0; 0, C])   (17)
  = log det([I_K + V^⊤V, V^⊤B; B^⊤V, C^{−1} + B^⊤B]) + log det(C)   (18)
  = log det(I_K + V^⊤V) + log det(C^{−1} + B^⊤(I − V(I_K + V^⊤V)^{−1}V^⊤)B) + log det(C),   (19)

where Eq. 17 follows from the commutativity of the determinant over products (i.e., det(AB) = det(A) det(B)), and Eq. 18 and Eq. 19 come from Schur's determinant identity†. For simplicity, we write X = I − V(I_K + V^⊤V)^{−1}V^⊤ and C^{−⊤} = (C^{−1})^⊤, and note that X depends only on V.
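Eq. 14 can be checked numerically (a sketch with arbitrary dimensions and a skew-symmetric C, so that det(I + L) is positive):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 200, 5

V = rng.standard_normal((M, K))
B = rng.standard_normal((M, K))
S = rng.standard_normal((K, K))
C = S - S.T                      # skew-symmetric K x K matrix
L = V @ V.T + B @ C @ B.T

# O(M^3): direct evaluation of log det(I_M + L)
direct = np.linalg.slogdet(np.eye(M) + L)[1]

# O(M K^2 + K^3): matrix determinant lemma with L = [V, BC][V, B]^T
left = np.concatenate([V, B @ C], axis=1)    # M x 2K
right = np.concatenate([V, B], axis=1)       # M x 2K
fast = np.linalg.slogdet(np.eye(2 * K) + right.T @ left)[1]

assert np.isclose(direct, fast)
```

The second computation never forms an M×M matrix, which is the source of the linear-in-M complexity claimed by Theorem 1.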
The gradient of Z has three parts, ∇Z = (∇_V Z, ∇_B Z, ∇_C Z), where

∇_V Z = ∇_V log det(I_K + V^⊤V) + ∇_V log det(C^{−1} + B^⊤XB)   (20)
      = 2V(I_K + V^⊤V)^{−1} − XB((C^{−1} + B^⊤XB)^{−1} + (C^{−⊤} + B^⊤XB)^{−1})B^⊤XV,   (21)
∇_B Z = ∇_B log det(C^{−1} + B^⊤XB)   (22)
      = XB((C^{−1} + B^⊤XB)^{−1} + (C^{−⊤} + B^⊤XB)^{−1}),   (23)
∇_C Z = ∇_C log det(C) + ∇_C log det(C^{−1} + B^⊤XB)   (24)
      = C^{−⊤} − C^{−⊤}(C^{−1} + B^⊤XB)^{−⊤}C^{−⊤}.   (25)

Observe that X combines an M×M identity matrix with M×K matrices, hence multiplying it with an M×K matrix (e.g., XV or XB) can be done in O(MK^2) time. Since each of the remaining matrix inverses in Eq. 21, Eq. 23, and Eq. 25 involves a K×K matrix, with a cost of O(K^3) operations, we have a net computational cost of O(MK^2 + K^3) for computing ∇ log det(I + L). The gradient of the first term in Eq. 2 involves computing gradients of determinants of size at most K′, which reduce to inverses of size-K′ matrices, since for a matrix A we have ∂(log det(A))/∂A_{ij} = (A^{−1})_{ij}. Each of these inverses can be computed in O(K′^3) time. The gradient of the simple sum-of-norms regularization term can be computed in O(MK) time. Therefore, combining these results with the results above for the complexity of the gradient of the normalization term, we have the following overall complexity of the gradient for the full log-likelihood: O(MK^2 + K^3 + nK′^3).

F.3 PROOF OF LEMMA 2

Lemma 2. Given B ∈ R^{M×K}, C ∈ R^{K×K}, and Y = {a_1, ..., a_k} ⊆ [[M]], let b_i ∈ R^{1×K} be the i-th row of B and B_Y ∈ R^{|Y|×K} be the matrix containing the rows of B indexed by Y. Then it holds that B_Y^⊤(B_Y C B_Y^⊤)^{−1} B_Y = Σ_{j=1}^{k} p_j^⊤ q_j, where the row vectors p_j, q_j ∈ R^{1×K} for j = 1, ..., k satisfy p_1 = b_{a_1}/(b_{a_1} C b_{a_1}^⊤), q_1 = b_{a_1}, and

p_{j+1} = (b_{a_{j+1}} − b_{a_{j+1}} C^⊤ Σ_{i=1}^{j} q_i^⊤ p_i) / (b_{a_{j+1}} C (b_{a_{j+1}} − b_{a_{j+1}} C^⊤ Σ_{i=1}^{j} q_i^⊤ p_i)^⊤),
q_{j+1} = b_{a_{j+1}} − b_{a_{j+1}} C Σ_{i=1}^{j} p_i^⊤ q_i.   (7)

† det([A, B; C, D]) = det(A) det(D − C A^{−1} B).

Proof. We prove by induction on k.
When k = 1, the result is trivial because B_Y^⊤(B_Y C B_Y^⊤)^{−1} B_Y = b_{a_1}^⊤(b_{a_1} C b_{a_1}^⊤)^{−1} b_{a_1} = p_1^⊤ q_1. Now assume that the statement holds for k − 1. Let Y′ := {a_1, ..., a_{k−1}} and a := a_k. From the inductive hypothesis, it holds that

B_{Y′}^⊤(B_{Y′} C B_{Y′}^⊤)^{−1} B_{Y′} = Σ_{j=1}^{k−1} p_j^⊤ q_j.   (27)

Now we write

B_{Y′∪{a}}^⊤ (B_{Y′∪{a}} C B_{Y′∪{a}}^⊤)^{−1} B_{Y′∪{a}}   (28)
= [B_{Y′}; b_a]^⊤ ([B_{Y′}; b_a] C [B_{Y′}; b_a]^⊤)^{−1} [B_{Y′}; b_a]   (29)
= [B_{Y′}; b_a]^⊤ [B_{Y′} C B_{Y′}^⊤, B_{Y′} C b_a^⊤; b_a C B_{Y′}^⊤, b_a C b_a^⊤]^{−1} [B_{Y′}; b_a].   (30)

To handle the inverse matrix we employ the Schur complement, which yields

[X, y; z, w]^{−1} = [X^{−1}, 0; 0, 0] + (1/(w − z X^{−1} y)) [X^{−1} y z X^{−1}, −X^{−1} y; −z X^{−1}, 1]   (31)

for any nonsingular square matrix X ∈ R^{k×k}, column vector y ∈ R^{k}, row vector z ∈ R^{1×k}, and scalar w, provided that (w − z X^{−1} y) ≠ 0. Applying this with X = B_{Y′} C B_{Y′}^⊤, y = B_{Y′} C b_a^⊤, z = b_a C B_{Y′}^⊤, and w = b_a C b_a^⊤ yields Eq. 32. Substituting Eq. 32 into Eq. 30, we obtain

B_{Y′∪{a}}^⊤ (B_{Y′∪{a}} C B_{Y′∪{a}}^⊤)^{−1} B_{Y′∪{a}}   (33)
= B_{Y′}^⊤(B_{Y′} C B_{Y′}^⊤)^{−1} B_{Y′} + ((b_a^⊤ − B_{Y′}^⊤(B_{Y′} C B_{Y′}^⊤)^{−1} B_{Y′} C b_a^⊤)(b_a − b_a C B_{Y′}^⊤(B_{Y′} C B_{Y′}^⊤)^{−1} B_{Y′})) / (b_a C (b_a^⊤ − B_{Y′}^⊤(B_{Y′} C B_{Y′}^⊤)^{−1} B_{Y′} C b_a^⊤))   (34)
= Σ_{j=1}^{k−1} p_j^⊤ q_j + ((b_a^⊤ − Σ_{j=1}^{k−1} p_j^⊤ q_j C b_a^⊤)(b_a − b_a C Σ_{j=1}^{k−1} p_j^⊤ q_j)) / (b_a C (b_a^⊤ − Σ_{j=1}^{k−1} p_j^⊤ q_j C b_a^⊤))   (35)
= Σ_{j=1}^{k−1} p_j^⊤ q_j + p_k^⊤ q_k,   (36)

where the third equality holds from the inductive hypothesis (Eq. 27) and the last holds from the definition of p_k, q_k ∈ R^{1×K}.

F.4 PROOF OF THEOREM 2

Recall that in Theorem 2, Y^G is the output of Algorithm 1 and Y^* is the optimal solution of MAP inference in Eq. 4.

Proof. The proof of Theorem 2 relies on an approximation guarantee for nonsubmodular greedy maximization (Bian et al., 2017, Theorem 1). We introduce their result below. Let Y_0 = ∅ and Y_t := {a_1, ..., a_t}, t = 1, ..., k, be the successive sets chosen by the greedy algorithm with budget k. Denote by γ the largest scalar such that

Σ_{i∈X\Y_t} (f(Y_t ∪ {i}) − f(Y_t)) ≥ γ (f(X ∪ Y_t) − f(Y_t)),

for all X ⊆ [[M]], |X| = k and t = 0, . . .
, k − 1; and denote by α the smallest scalar such that

f(Y_{t−1} ∪ {i} ∪ X) − f(Y_{t−1} ∪ X) ≥ (1 − α)(f(Y_{t−1} ∪ {i}) − f(Y_{t−1}))   (38)

for all X ⊆ [[M]], |X| = k and i ∈ Y_{k−1} \ X. Then it holds that

f(Y_k) ≥ (1/α)(1 − e^{−αγ}) f(Y^*).   (39)

In order to apply this result to MAP inference of NDPPs, the objective should be monotone nondecreasing and nonnegative. We first show that σ_min > 1 is a sufficient condition for both monotonicity and nonnegativity (Lemma 3); the proof of Lemma 3 is provided in Appendix F.6. Next, we aim to find proper bounds on α and γ. To this end, we use the upper and lower bounds on the marginal gain of f(Y) = log det(L_Y) given in Lemma 4, which yield Eq. 42, where the last inequality comes from Eq. 40. Similarly, we get

f(X ∪ Y_t) − f(Y_t) = Σ_{j=1}^{r} [f({x_1, ..., x_j} ∪ Y_t) − f({x_1, ..., x_{j−1}} ∪ Y_t)] ≤ r(2 log σ_max − log σ_min),   (44)

where the last inequality comes from Eq. 41. Combining Eq. 42 to Eq. 44, we obtain

(Σ_{i∈X\Y_t} (f(Y_t ∪ {i}) − f(Y_t))) / (f(X ∪ Y_t) − f(Y_t)) ≥ (log σ_min) / (2 log σ_max − log σ_min),   (45)

which allows us to choose γ = (2 log σ_max / log σ_min − 1)^{−1}. To bound α, we similarly use Lemma 4 to obtain

(f(X ∪ Y_{t−1} ∪ {i}) − f(X ∪ Y_{t−1})) / (f(Y_{t−1} ∪ {i}) − f(Y_{t−1})) ≥ (log σ_min) / (2 log σ_max − log σ_min),   (46)



† The exact memory complexity for MAP inference is 3M K, since V , B, and C used in this model are all M × K matrices.



C TIME COMPARISON FOR LEARNING

In Fig. 1, we report the wall-clock training time of the decomposition of Gartrell et al. (2019) (NDPP) and our scalable NDPP for the Amazon: 3-category (Fig. 1(a)) and UK Retail (Fig. 1(b)) datasets.

Figure 1: The negative log-likelihood of the training set for Gartrell et al. (2019)'s NDPP (blue, dashed) and our scalable NDPP (red, solid) versus wall-clock time for the (a) Amazon: 3-category and (b) UK Retail datasets.

Figure 2: Approximation ratios of greedy with respect to different values of log(σ_max/σ_min) from Corollary 1 under (a) a symmetric DPP and (b) a nonsymmetric DPP. (c) The singular values of the kernels learned for the "Amazon: 3-category" dataset. We construct 10,000 random P_0 matrices L ∈ R^{5×5}, with rank K = 3, whose singular values are from the learned kernels.

Theorem 2. Consider a nonsymmetric low-rank DPP L = V V^⊤ + B C B^⊤, where V, B are of rank K, and C ∈ R^{K×K}. Given a cardinality budget k, let σ_min and σ_max denote the smallest and largest singular values of L_Y over all Y ⊆ [[M]] with |Y| ≤ 2k. Assume that σ_min > 1. Then,

log det(L_{Y^G}) ≥ (4(1 − e^{−1/4}) / (2(log σ_max / log σ_min) − 1)) log det(L_{Y^*}).   (10)

(Bian et al., 2017, Theorem 1). Consider a set function f defined on all subsets of {1, ..., M} = [[M]] that is monotone nondecreasing and nonnegative, i.e., 0 ≤ f(Y) ≤ f(X) for all Y ⊆ X ⊆ [[M]]. Given a cardinality budget k ≥ 1, let Y^* be the optimal solution of max_{|Y|=k} f(Y).

Lemma 3. Given a P_0 matrix L ∈ R^{M×M} and a budget k ≥ 0, the set function f(Y) = log det(L_Y) defined on Y ⊆ [[M]] is monotone nondecreasing and nonnegative for |Y| ≤ k when σ_min > 1.

Lemma 4. Let f(Y) = log det(L_Y) and assume that σ_min > 1. Then, for Y ⊆ [[M]], |Y| < 2k and i ∉ Y, it holds that

f(Y ∪ {i}) − f(Y) ≥ log σ_min,   (40)
f(Y ∪ {i}) − f(Y) ≤ 2 log σ_max − log σ_min,   (41)

where σ_min and σ_max are the smallest and largest singular values of L_Y over all Y ⊆ [[M]], |Y| ≤ 2k. The proof of Lemma 4 is provided in Appendix F.7.

To bound γ, we consider X ⊆ [[M]], |X| = k, and denote X \ Y_t = {x_1, ..., x_r} ≠ ∅. Then,

Σ_{i∈X\Y_t} (f(Y_t ∪ {i}) − f(Y_t)) = Σ_{j=1}^{r} [f(Y_t ∪ {x_j}) − f(Y_t)] ≥ r log σ_min.   (42)

since we must invert a |Y| × |Y| matrix, where |Y| ≤ k. However, one can compute Eq. 5 more efficiently by observing that its B_Y^⊤(B_Y C B_Y^⊤)^{−1} B_Y component can actually be expressed without an inverse, as a rank-|Y| matrix, that can be computed in O(K^2) time.

Lemma 2. Given B ∈ R^{M×K}, C ∈ R^{K×K}, and Y = {a_1, ..., a_k} ⊆ [[M]], let b_i ∈ R^{1×K} be the i-th row of B and B_Y ∈ R^{|Y|×K} be the matrix containing the rows of B indexed by Y. Then it holds that B_Y^⊤(B_Y C B_Y^⊤)^{−1} B_Y = Σ_{j=1}^{k} p_j^⊤ q_j (see Eq. 7).

Algorithm complexities for several DPP models. Our model and the symmetric DPP model (Gartrell et al., 2017) can perform both tasks in time linear in the size of the ground set M, but ours is a more general model that can capture positive as well as negative item correlations.

contains baskets representing transactions from an online retail company that sells unique all-occasion gifts. We omit baskets with more than 100 items, leaving us with a dataset containing 19,762 baskets drawn from a catalog of M = 3,941 products. Baskets containing more than 100 items are in the long tail of the basket-size distribution of the data, so omitting larger baskets is reasonable, and allows us to use a low-rank factorization of the DPP with K = 100.

3. Instacart: This dataset (Instacart, 2017) contains baskets purchased by Instacart users. We omit baskets with more than 100 items, resulting in 3.2 million baskets and a catalog of 49,677 products.

4. Million Song: This dataset (McFee et al., 2012) contains playlists ("baskets") of songs played by Echo Nest users. We trim playlists with more than 150 items, leaving us with 968,674 baskets and a catalog of 371,410 songs.

Average MPR, AUC, and test log-likelihood for all datasets, for the low-rank symmetric DPP (Gartrell et al., 2017), the low-rank NDPP (Gartrell et al., 2019), and our scalable NDPP models. MPR and AUC results show 95% confidence estimates obtained via bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. See Appendix B for the hyperparameter settings used in these experiments. The baseline NDPP model cannot be feasibly trained on the Instacart and Million Song datasets, as memory and computational costs are prohibitive due to the large ground set sizes.

Wall-clock time (in milliseconds)  of MAP inference algorithms on NDPPs learned from real-world datasets.

A MEAN PERCENTILE RANK

We begin our definition of MPR by defining percentile rank (PR). First, given a set J, let p_{i,J} = Pr(J ∪ {i} | J). The percentile rank of an item i given a set J is defined as

PR_{i,J} = (Σ_{i′∉J} 1[p_{i,J} ≥ p_{i′,J}] / |{i′ : i′ ∉ J}|) × 100%,

where the sum ranges over those elements of the ground set that are not found in J. For our evaluation, given a test set Y, we select a random element i ∈ Y and compute PR_{i,Y\{i}}. We then average over the set of all test instances T to compute the mean percentile rank (MPR):

MPR = (1/|T|) Σ_{Y∈T} PR_{i,Y\{i}}.
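As a concrete illustration, PR and MPR can be computed as follows (a sketch; `next_item_prob` is a hypothetical stand-in for the model's conditional probability Pr(J ∪ {i} | J), not an API from this work):

```python
import numpy as np

def percentile_rank(i, J, ground_set, next_item_prob):
    """PR of held-out item i given J: the percentage of candidate items i'
    (those not in J) whose conditional probability does not exceed item i's."""
    candidates = [x for x in ground_set if x not in J]
    p_i = next_item_prob(i, J)
    dominated = sum(next_item_prob(x, J) <= p_i for x in candidates)
    return 100.0 * dominated / len(candidates)

def mean_percentile_rank(test_sets, ground_set, next_item_prob, rng):
    """MPR: average PR over test instances, holding out one random item each."""
    prs = []
    for Y in test_sets:
        i = Y[rng.integers(len(Y))]          # random held-out element of Y
        J = [x for x in Y if x != i]
        prs.append(percentile_rank(i, J, ground_set, next_item_prob))
    return float(np.mean(prs))
```

For a model that always ranks the held-out item above every other candidate, MPR approaches 100; a random ranking yields 50 on average.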

B HYPERPARAMETERS FOR EXPERIMENTS IN TABLE 2

Preventing numerical instabilities: The first term on the right side of Eq. 2 will be singular whenever |Y_i| > K, where Y_i is an observed subset. Therefore, to address this in practice we set K to the size of the largest subset observed in the data, K′, as in Gartrell et al. (2017). However, this does not entirely address the issue, as the first term on the right side of Eq. 2 may still be singular even when |Y_i| ≤ K. In this case though, we know that we are not at a maximum, since the value of the objective function is −∞. Numerically, to prevent such singularities, in our implementation we add a small εI correction to each L_{Y_i} when optimizing Eq. 2 (we set ε = 10^{−5} in our experiments).

We perform a grid search using a held-out validation set to select the best performing hyperparameters for each model and dataset. The hyperparameter settings used for each model and dataset are described below.

Symmetric low-rank DPP (Gartrell et al., 2016). For this model, we use K for the number of item feature dimensions for the symmetric component V, and α for the regularization hyperparameter for V. We use the following hyperparameter settings:
• Both Amazon datasets: K = 30, α = 0.
• UK Retail dataset: K = 100, α = 1.
• Instacart dataset: K = 100, α = 0.001.
• Million Song dataset: K = 150, α = 0.0001.

Baseline NDPP (Gartrell et al., 2019). For this model, to ensure consistency with the notation used in Gartrell et al. (2019), we use D to denote the number of item feature dimensions for the symmetric component V, and D′ to denote the number of item feature dimensions for the nonsymmetric components, B and C. As described in Gartrell et al. (2019), α is the regularization hyperparameter for V, while β and γ are the regularization hyperparameters for B and C, respectively.
We use the following hyperparameter settings:
• Both Amazon datasets: D = 30, α = 0.
• Amazon apparel dataset: D′ = 30.
• Amazon three-category dataset: D′ = 100.
• UK Retail dataset: D = 100, D′ = 20, α = 1.
• All datasets: β = γ = 0.

Scalable NDPP. As described in Section 3, we use K to denote the number of item feature dimensions for the symmetric component V and the dimensionality of the nonsymmetric component C. α is the regularization hyperparameter. We use the following hyperparameter settings:
• Amazon apparel dataset: K = 30, α = 0.
• Amazon three-category dataset: K = 100, α = 1.
• UK Retail dataset: K = 100, α = 0.01.

Finally, completing the proof of Theorem 2 (Appendix F.4): we can choose α = 1 − (log σ_min)/(2 log σ_max − log σ_min) = 2(log σ_max − log σ_min)/(2 log σ_max − log σ_min). Now let κ = (log σ_max)/(log σ_min). Then γ = 1/(2κ − 1) and α = 2(κ − 1)/(2κ − 1). Putting γ and α into Eq. 39, we have

f(Y_k) ≥ (1/α)(1 − e^{−αγ}) f(Y^*) ≥ 4(1 − e^{−1/4}) γ f(Y^*) = (4(1 − e^{−1/4})/(2κ − 1)) f(Y^*),

where the second inequality holds from the facts that max_{κ≥1} 2(κ−1)/(2κ−1)^2 = 1/4 and 1 − e^{−x} ≥ 4(1 − e^{−1/4}) x for x ∈ [0, 1/4].

F.5 PROOF OF COROLLARY 1

Corollary 1. Consider a nonsymmetric low-rank DPP L = V V^⊤ + B C B^⊤, where V, B are of rank K, and C ∈ R^{K×K}. Given a cardinality budget k, let σ_min and σ_max denote the smallest and largest singular values of L_Y over all Y ⊆ [[M]] with |Y| ≤ 2k, and let κ = σ_max/σ_min. Then,

log det(L_{Y^G}) ≥ (4(1 − e^{−1/4})/(2 log κ + 1)) log det(L_{Y^*}) − (1 − 4(1 − e^{−1/4})/(2 log κ + 1)) k (1 − log σ_min),

where Y^G is the output of Algorithm 1 and Y^* is the optimal solution of MAP inference in Eq. 4.

Proof. Consider L′ = (e/σ_min) L, where e is the exponential constant. Then σ′_min = σ_min · (e/σ_min) = e > 1 and σ′_max = σ_max · (e/σ_min), so Theorem 2 applies to L′ with (log σ′_max)/(log σ′_min) = 1 + log κ. Using the fact that log det(L′_Y) = log det(L_Y) + |Y|(1 − log σ_min), we obtain the result.
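The rescaling step in this proof relies on the identity det(cA) = c^{|Y|} det(A) for a |Y| × |Y| submatrix, which a quick numerical check confirms (a sketch with an arbitrary positive-definite L and an illustrative value of σ_min):

```python
import numpy as np

rng = np.random.default_rng(3)
M = 8
A = rng.standard_normal((M, M))
L = A @ A.T + np.eye(M)        # positive definite, so all log dets are finite

sigma_min = 0.5                # an assumed value, for illustration only
L_prime = (np.e / sigma_min) * L

Y = [1, 4, 6]
ld = np.linalg.slogdet(L[np.ix_(Y, Y)])[1]
ld_prime = np.linalg.slogdet(L_prime[np.ix_(Y, Y)])[1]

# log det(L'_Y) = log det(L_Y) + |Y| * (1 - log sigma_min)
assert np.isclose(ld_prime, ld + len(Y) * (1 - np.log(sigma_min)))
```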

F.6 PROOF OF LEMMA 3

Before stating the proof, we introduce interlacing properties of singular values.

Theorem 4 (Interlacing Inequality for Singular Values, (Thompson, 1972, Theorem 1)). Consider a real M × N matrix A with singular values σ_1 ≥ σ_2 ≥ ..., and let B be a P × Q submatrix of A, with singular values β_1 ≥ β_2 ≥ .... Then the singular values have the following interlacing properties:

β_i ≤ σ_i for i = 1, 2, ...,   and   β_i ≥ σ_{i+(M−P)+(N−Q)} whenever i + (M−P) + (N−Q) ≤ min(M, N).

Note that when M = N and P = Q = N − 1, it holds that β_i ≥ σ_{i+2} for i = 1, ..., N − 2.

We are now ready to prove Lemma 3.

Proof. Since L ∈ P_0, all of its principal submatrices are also in P_0. By the definition of a P_0 matrix, it holds that det(L_{Y∪{i}}) ≥ 0 and det(L_Y) ≥ 0, hence det(L_{Y∪{i}}) = Π_j σ_j and det(L_Y) = Π_j σ′_j, where σ_j and σ′_j are the j-th largest singular values of L_{Y∪{i}} and L_Y, respectively. Similarly, using Eq. 51, we get

