STREAMING PROBABILISTIC DEEP TENSOR FACTORIZATION

Abstract

Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition and rarely exploit powerful modeling frameworks, such as deep neural networks, to capture the variety of complicated interactions in data. More importantly, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SPIDER, a Streaming ProbabilistIc Deep tEnsoR factorization method. We first use Bayesian neural networks (NNs) to construct a deep tensor factorization model. We assign a spike-and-slab prior over each NN weight to encourage sparsity and to prevent overfitting. We then use the multivariate delta method and moment matching to approximate the posterior of the NN output and to calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation-propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving new tensor entries, and meanwhile selects and inhibits redundant or useless weights. We demonstrate the advantages of our approach in four real-world applications.

1. Introduction

Tensor factorization is a fundamental tool for multiway data analysis. While many tensor factorization methods have been developed (Tucker, 1966; Harshman, 1970; Chu & Ghahramani, 2009; Kang et al., 2012; Choi & Vishwanathan, 2014), most of them conduct a multilinear decomposition and are incapable of capturing complex, nonlinear relationships in data. Deep neural networks (NNs) are a highly flexible and powerful modeling framework, known to be able to estimate all kinds of complicated (e.g., highly nonlinear) mappings. Recent work (Liu et al., 2018; 2019) has attempted to incorporate NNs into tensor factorization and shown improved performance, in spite of the risk of overfitting tensor data, which are typically sparse. Nonetheless, one critical bottleneck for NN-based factorization is the lack of effective approaches for streaming data. In practice, many applications produce huge volumes of data at a fast pace (Du et al., 2018). It is extremely costly to rerun the factorization from scratch every time we receive a new set of entries. Some privacy-demanding applications (e.g., SnapChat) even forbid us from revisiting the previously seen data. Hence, given new data, we need an effective way to update the model incrementally and promptly. A general and popular approach is streaming variational Bayes (SVB) (Broderick et al., 2013), which integrates the current posterior with the new data and then estimates a variational approximation as the updated posterior. Although SVB has been successfully used to develop the state-of-the-art multilinear streaming factorization (Du et al., 2018), it does not perform well for (deep) NN-based factorization. Due to the nested linear and nonlinear coupling of the latent embeddings and NN weights, the variational evidence lower bound (ELBO) that SVB maximizes is analytically intractable, so we have to resort to stochastic optimization, which is unstable and whose convergence is hard to diagnose.
Consequently, the posterior updates are often unreliable and inferior, which in turn hurts the subsequent updates, finally leading to poor model estimates. To address these issues, we propose SPIDER, a streaming probabilistic deep tensor factorization method that not only exploits NNs' expressive power to capture intricate relationships, but also provides efficient, high-quality posterior updates for streaming data. Specifically, we first use Bayesian neural networks to build a deep tensor factorization model, where the input is the concatenation of the factors associated with each tensor entry and the NN output predicts the entry value. To reduce the risk of overfitting, we place a spike-and-slab prior over each NN weight to encourage sparsity. For streaming inference, we use the multivariate delta method (Bickel & Doksum, 2015), which employs a first-order Taylor expansion of the NN output to analytically compute its moments, and match the moments to obtain its current posterior and the running model evidence. We then use back-propagation to calculate the gradient of the log evidence, with which we match the moments and update the posterior of the embeddings and NN weights in the assumed-density-filtering (Boyen & Koller, 1998) framework. Finally, after processing all the newly received entries, we update the spike-and-slab prior approximation with expectation propagation (Minka, 2001a) to select and inhibit redundant or useless weights. In this way, the incremental posterior updates are deterministic, reliable, and efficient. For evaluation, we examined SPIDER on four real-world, large-scale applications, including both binary and continuous tensors. We compared with the state-of-the-art streaming tensor factorization algorithm (Du et al., 2018), which is based on a multilinear form, and with streaming nonlinear factorization methods implemented with SVB.
In both running and final predictive performance, our method consistently outperforms the competing approaches, mostly by a large margin. The running accuracy of SPIDER is also much more stable and smooth than the SVB based methods.
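The first-order moment matching described above can be illustrated with a small sketch of the multivariate delta method: given a Gaussian belief $q(\mathbf{x}) = \mathcal{N}(\mathbf{m}, S)$ over an input and a differentiable network $f$, a first-order Taylor expansion at $\mathbf{m}$ gives $\mathbb{E}[f(\mathbf{x})] \approx f(\mathbf{m})$ and $\mathrm{Var}[f(\mathbf{x})] \approx J S J^\top$, where $J$ is the Jacobian of $f$ at $\mathbf{m}$. The one-hidden-layer network, its sizes, and all variable names below are made up for illustration; this is not SPIDER's actual implementation.

```python
import numpy as np

def f(x, W1, W2):
    # A tiny one-hidden-layer network with a scalar output.
    return W2 @ np.tanh(W1 @ x)

def jacobian(x, W1, W2):
    # Chain rule: df/dx = W2 diag(1 - tanh(W1 x)^2) W1, shape (1, d).
    h = W1 @ x
    return (W2 * (1.0 - np.tanh(h) ** 2)) @ W1

rng = np.random.default_rng(0)
d, h_dim = 3, 8
W1 = rng.standard_normal((h_dim, d))
W2 = rng.standard_normal((1, h_dim))

m = rng.standard_normal(d)      # mean of the Gaussian input belief
S = 0.01 * np.eye(d)            # its covariance

mean_f = f(m, W1, W2)           # delta-method approximation of E[f(x)]
J = jacobian(m, W1, W2)
var_f = J @ S @ J.T             # delta-method approximation of Var[f(x)]
```

Because the output mean and variance are available in closed form, they can be moment-matched to a Gaussian, which is what makes the streaming updates deterministic rather than reliant on stochastic optimization.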

2. Background

Tensor Factorization. We denote a $K$-mode tensor by $\mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$, where mode $k$ includes $d_k$ nodes. We index each entry by a tuple $\mathbf{i} = (i_1, \ldots, i_K)$, which stands for the interaction of the corresponding $K$ nodes. The value of entry $\mathbf{i}$ is denoted by $y_{\mathbf{i}}$. To factorize the tensor, we represent all the nodes by $K$ latent embedding matrices $\mathcal{U} = \{U^1, \ldots, U^K\}$, where each $U^k = [\mathbf{u}^k_1, \ldots, \mathbf{u}^k_{d_k}]$ is of size $d_k \times r_k$, and each $\mathbf{u}^k_j$ is the embedding vector of node $j$ in mode $k$. The goal is to use $\mathcal{U}$ to recover the observed entries in $\mathcal{Y}$. To this end, the classical Tucker factorization (Tucker, 1966) assumes $\mathcal{Y} = \mathcal{W} \times_1 U^1 \times_2 \cdots \times_K U^K$, where $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ is a parametric core tensor and $\times_k$ is the mode-$k$ tensor-matrix multiplication (Kolda, 2006), which resembles matrix-matrix multiplication. If we set all $r_k = r$ and $\mathcal{W}$ to be diagonal, Tucker factorization becomes CANDECOMP/PARAFAC (CP) factorization (Harshman, 1970). The element-wise form is $y_{\mathbf{i}} = \sum_{j=1}^{r} \prod_{k=1}^{K} u^k_{i_k, j} = (\mathbf{u}^1_{i_1} \circ \cdots \circ \mathbf{u}^K_{i_K})^\top \mathbf{1}$, where $\circ$ is the Hadamard (element-wise) product and $\mathbf{1}$ is the vector filled with ones. We can estimate the embeddings $\mathcal{U}$ by minimizing a loss function, e.g., the mean squared error in recovering the observed entries in $\mathcal{Y}$.
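The CP element-wise form can be sketched numerically as follows, verifying that the sum-of-products form and the Hadamard-product form agree. The mode sizes, rank, and entry index are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, K = [4, 5, 6], 3, 3                 # mode sizes, CP rank, number of modes
U = [rng.standard_normal((d[k], r)) for k in range(K)]  # embedding matrices

i = (1, 2, 3)                             # an example entry index (i_1, i_2, i_3)

# Hadamard form: (u^1_{i_1} o u^2_{i_2} o u^3_{i_3})^T 1
rows = [U[k][i[k]] for k in range(K)]
y_hadamard = np.prod(rows, axis=0).sum()

# Explicit sum-of-products form: sum_j prod_k U^k[i_k, j]
y_explicit = sum(np.prod([U[k][i[k], j] for k in range(K)]) for j in range(r))

assert np.isclose(y_hadamard, y_explicit)
```

The Hadamard form is the one typically used in practice, since it vectorizes over the rank dimension and over batches of observed entries.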

Streaming Model Estimation.

A general and popular framework for incremental model estimation is streaming variational Bayes (SVB) (Broderick et al., 2013), which is grounded in the incremental version of Bayes' rule, $p(\theta \mid \mathcal{D}_{\text{old}} \cup \mathcal{D}_{\text{new}}) \propto p(\theta \mid \mathcal{D}_{\text{old}})\, p(\mathcal{D}_{\text{new}} \mid \theta)$, where $\theta$ are the latent random variables of the probabilistic model we are interested in, $\mathcal{D}_{\text{old}}$ all the data seen so far, and $\mathcal{D}_{\text{new}}$ the incoming data. SVB approximates the current posterior $p(\theta \mid \mathcal{D}_{\text{old}})$ with a variational posterior $q_{\text{cur}}(\theta)$. When new data arrive, SVB integrates $q_{\text{cur}}(\theta)$ with the likelihood of the new data to obtain an unnormalized, blended distribution, $\tilde{p}(\theta) = q_{\text{cur}}(\theta)\, p(\mathcal{D}_{\text{new}} \mid \theta)$, which can be viewed as approximately proportional to the joint distribution $p(\theta, \mathcal{D}_{\text{old}} \cup \mathcal{D}_{\text{new}})$. To conduct the incremental update, SVB uses $\tilde{p}(\theta)$ to construct a variational ELBO (Wainwright et al., 2008), $\mathcal{L}(q(\theta)) = \mathbb{E}_q[\log \tilde{p}(\theta) - \log q(\theta)]$, and maximizes the ELBO to obtain the updated posterior, $q^* = \operatorname{argmax}_q \mathcal{L}(q)$. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between $q$ and the normalized $\tilde{p}(\theta)$. We then set $q_{\text{cur}} = q^*$ and prepare the update for the next batch of new data. At the beginning (before any data are received), we set $q_{\text{cur}} = p(\theta)$, the original prior of the model. For efficiency and convenience, a factorized variational posterior $q(\theta) = \prod_j q(\theta_j)$ is usually adopted to enable cyclic, closed-form updates. For example, the state-of-the-art streaming tensor factorization, POST (Du et al., 2018), uses the CP form to build a Bayesian model and applies SVB to update the posterior of the embeddings incrementally when receiving new tensor entries.
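The incremental Bayes recursion above can be made concrete with a deliberately simple conjugate toy problem, where the blended distribution $q_{\text{cur}}(\theta)\, p(\mathcal{D}_{\text{new}} \mid \theta)$ is itself Gaussian and no ELBO optimization is needed: inferring a Gaussian mean with known noise variance from streaming batches. This toy model and all names below are illustrative choices, not part of SVB, POST, or SPIDER; in the factorization setting the blend is intractable and must be approximated.

```python
import numpy as np

def streaming_update(mu, var, batch, noise_var=1.0):
    # Closed-form posterior of q_cur(theta) * prod_n N(y_n | theta, noise_var):
    # precisions add, and the mean is the precision-weighted average.
    n = len(batch)
    post_var = 1.0 / (1.0 / var + n / noise_var)
    post_mu = post_var * (mu / var + np.sum(batch) / noise_var)
    return post_mu, post_var

mu, var = 0.0, 10.0                    # q_cur starts at the prior p(theta)
rng = np.random.default_rng(1)
for _ in range(5):                     # five incoming data batches
    batch = rng.normal(2.0, 1.0, size=20)
    mu, var = streaming_update(mu, var, batch)
# The posterior mean approaches the true mean 2.0 and the variance shrinks,
# without ever revisiting earlier batches.
```

The key property being illustrated is that each update consumes only the current batch and the running posterior, which is exactly the constraint a streaming factorization method must satisfy.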

3. Bayesian Deep Tensor Factorization

Despite the elegance and convenience of the popular Tucker and CP factorization, their multilinear form can severely limit the capability of estimating complicated, highly nonlinear/nonstationary

