GRAPH SIGNAL SAMPLING FOR INDUCTIVE ONE-BIT MATRIX COMPLETION: A CLOSED-FORM SOLUTION

Abstract

Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users appear at test time with ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items into a function (graph signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices — the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical results show that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC, which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable on large data, as verified on public benchmarks.

1. INTRODUCTION

In domains such as recommender systems and social networks, only "likes" (i.e., ones) are observed in the system, and service providers (e.g., Netflix) are interested in discovering potential "likes" for existing users to stimulate demand. This motivates the problem of 1-bit matrix completion (OBMC), the goal of which is to recover missing values in an n-by-m item-user matrix R ∈ {0, 1}^{n×m}. We note that R_{i,j} = 1 means that item i is rated by user j, whereas R_{i,j} = 0 is essentially unlabeled or unknown: a mixture of unobserved positive examples and true negative examples. However, in the real world new users, who are not exposed to the model during training, may appear at the testing stage. This fact stimulates the development of inductive 1-bit matrix completion, which aims to recover an unseen vector y ∈ {0, 1}^n from its partial positive entries Ω⁺ ⊆ {j | y_j = 1} at test time. Fig. 1(a) emphasizes the difference between conventional and inductive approaches. More formally, let M ∈ {0, 1}^{n×(m+1)} denote the underlying matrix, where only a subset of positive examples Ψ is randomly sampled from {(i, j) | M_{i,j} = 1, i ≤ n, j ≤ m} such that R_{i,j} = 1 for (i, j) ∈ Ψ and R_{i,j} = 0 otherwise. Considering the (m+1)-th column y of matrix M, we likewise denote its observations by s_i = 1 for i ∈ Ω⁺ and s_i = 0 otherwise. We note that the sampling process here assumes that there exists a random label noise ξ which flips a 1 to 0 with probability ρ, or equivalently

s = y + ξ, where ξ_i = −1 for i ∈ {j | y_j = 1} − Ω⁺, and ξ_i = 0 otherwise. (1)

Fig. 1(a) presents an example of s, y, ξ to better understand their relationships. Fundamentally, the reconstruction of the true y from the corrupted s bears a resemblance to graph signal sampling.

Figure 1: Our GS-IMC approach regards y as a signal residing on the nodes of a homogeneous item-item graph, and aims to reconstruct the true signal y from its observed values (orange colored) on a subset of nodes (gray shadowed).

Fig. 1(b) shows that the item-user rating matrix R can be used to define a homogeneous item-item graph (see Sec 3.1), such that user ratings y/s on items can be regarded as signals residing on graph nodes. The reconstruction of bandlimited graph signals from certain subsets of vertices (see Sec 2) has been extensively studied in graph signal sampling (Pesenson, 2000; 2008). Despite its popularity in areas such as image processing (Shuman et al., 2013; Pang & Cheung, 2017; Cheung et al., 2018) and matrix completion (Romero et al., 2016; Mao et al., 2018; McNeil et al., 2021), graph signal sampling appears less studied for the specific inductive one-bit matrix completion problem addressed in this paper (see Appendix A for detailed related works). Probably most closely related to our approach are MRFCF (Steck, 2019) and SGMC (Chen et al., 2021), which formulate their solutions as spectral graph filters. However, we argue that these methods are orthogonal to ours, since they optimize a rank minimization problem whereas we optimize a functional minimization problem, thereby making it more convenient and straightforward to process and analyze the matrix data with vertex-frequency analysis (Hammond et al., 2011; Shuman et al., 2013), time-variant analysis (Mao et al., 2018; McNeil et al., 2021), and smoothing and filtering (Kalman, 1960; Khan & Moura, 2008). Furthermore, (Steck, 2019; Chen et al., 2021) can be incorporated as special cases of our unified graph signal sampling framework (see Appendix B for detailed discussions). Another emerging line of research has focused on learning the mapping from side information (or content features) to latent factors (Jain & Dhillon, 2013; Xu et al., 2013; Ying et al., 2018; Zhong et al., 2019).
However, it has been recently shown (Zhang & Chen, 2020; Ledent et al., 2021; Wu et al., 2021) that in general this family of algorithms can suffer from inferior expressiveness when high-quality content is not available. Further, collecting personal data is likely to be unlawful, as well as a breach of the data minimization principle in GDPR (Voigt & Von dem Bussche, 2017). Much effort has also been made to leverage advanced graph neural networks (GNNs) for improvements. van den Berg et al. (2017) represent the data matrix R by a bipartite graph, then generalize the representations to unseen nodes by summing the embeddings over the neighbors. Zhang & Chen (2020) develop graph neural networks which encode the subgraphs around an edge into latent factors, then decode the factors back to the value on the edge. Besides, Wu et al. (2021) consider the problem in a downsampled homogeneous graph (i.e., the user-user graph in recommender systems), then exploit attention networks to yield inductive representations. The key advantage of our approach is not only the closed-form solution, which takes a small fraction of the training time required for GNNs, but also theoretical results that guarantee accurate reconstruction and provide guidance for practical applications. We emphasize the challenges when connecting the ideas and methods of graph signal sampling with inductive 1-bit matrix completion: 1-bit quantization and online learning. Specifically, 1-bit quantization raises challenges for formulating the underlying optimization problem: minimizing the squared loss on the observed positive examples Ω⁺ yields a degenerate solution (the vector with all entries equal to one achieves zero loss), while minimizing the squared loss on the corrupted data s introduces a systematic error due to the random label noise ξ in Eq. (1).
To address this issue, we represent the observed data R as a homogeneous graph, then devise a broader class of regularization functionals on graphs to mitigate the impact of the discrete random noise ξ. Existing theory for total variation denoising (Sadhanala et al., 2016; 2017) and graph regularization (Belkin et al., 2004; Huang et al., 2011), which assumes continuous Gaussian noise, does not sufficiently address recoverability in inductive 1-bit matrix completion (see Sec 3.4). We finally manage to derive a closed-form solution, entitled Graph Sampling for Inductive (1-bit) Matrix Completion (GS-IMC), which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. For online learning, existing matrix factorization methods (Devooght et al., 2015; Volkovs & Yu, 2015; He et al., 2016) incrementally update model parameters via gradient descent, requiring an expensive line search to set the best learning rate. To scale up to large data, we develop a Bayesian extension called BGS-IMC, where a prediction-correction algorithm is devised to instantly refresh the prediction given newly incoming data. The prediction step tracks the evolution of the optimization problem such that the predicted iterate does not drift away from the optimum, while the correction step adjusts for the distance between the current prediction and the new information at each step. The advantage over baselines is that BGS-IMC considers the uncertainties in the graph Fourier domain, and the prediction-correction algorithm can efficiently provide unbiased and minimum-variance predictions in closed form, without using gradient descent techniques. The contributions are: • New Inductive 1-bit Matrix Completion Framework. We propose and technically manage (for the first time to our best knowledge) to introduce graph signal sampling to inductive 1-bit matrix completion.
It opens the possibility of benefiting the analysis and processing of the matrix with the signal processing toolbox, including vertex-frequency analysis (Hammond et al., 2011; Shuman et al., 2013), time-variant analysis (Mao et al., 2018; McNeil et al., 2021), smoothing and filtering (Kalman, 1960; Khan & Moura, 2008), etc. We believe that our unified framework can serve as a new paradigm for 1-bit matrix completion, especially in large-scale and dynamic systems. • Generalized Closed-form Solution. We derive a novel closed-form solution (i.e., GS-IMC) in the graph signal sampling framework, which incorporates existing closed-form solutions as special cases, e.g., (Chen et al., 2021; Steck, 2019). GS-IMC is learned from only positive data with discrete random noise. This is one of the key differences to typical denoising methods (Sadhanala et al., 2016), where effort is spent on removing continuous Gaussian noise from a real-valued signal. • Robustness Enhancement. We consider the online learning scenario and construct a Bayesian extension, i.e., BGS-IMC, where a new prediction-correction algorithm is proposed to instantly yield unbiased and minimum-variance predictions given new incoming data. Experiments in Appendix E show that BGS-IMC is more cost-effective than many neural models such as SASREC (Kang & McAuley, 2018), BERT4REC (Sun et al., 2019) and GREC (Yuan et al., 2020). We believe that this demonstrates the potential for the future application of graph signal sampling to sequential recommendation. • Theoretical Guarantee and Empirical Effectiveness. We extend the Paley-Wiener theorem of (Pesenson, 2009) on real-valued data to positive-unlabeled data with statistical noise. The theory shows that under mild conditions, unseen rows and columns in training can be recovered from a certain subset of their values that is present at test time.
Empirical results on real-world data show that our methods achieve state-of-the-art performance for the challenging inductive Top-N ranking tasks.

2. PRELIMINARIES

In this section, we introduce the notation and provide the necessary background on graph sampling theory. Let G = (V, E, w) denote a weighted, undirected and connected graph, where V is a set of vertices with |V| = n, E is a set of edges formed by pairs of vertices, and the positive weight w(u, v) on each edge is a function of the similarity between vertices u and v. The space L²(G) is the Hilbert space of all real-valued functions f : V → R with norm ‖f‖ = (Σ_{v∈V} |f(v)|²)^{1/2}, and the discrete Laplace operator Ł is defined by the formula (Chung & Graham, 1997):

Łf(v) = (1/√d(v)) Σ_{u∈N(v)} w(u, v) ( f(v)/√d(v) − f(u)/√d(u) ), f ∈ L²(G),

where N(v) signifies the neighborhood of node v and d(v) = Σ_{u∈N(v)} w(u, v) is the degree of v.

Definition 1 (Graph Fourier Transform). Given a function or signal f in L²(G), the graph Fourier transform and its inverse (Shuman et al., 2013) can be defined as f̂_G = Uᵀf and f = U f̂_G, where U represents the eigenfunctions of the discrete Laplace operator Ł, f̂_G denotes the signal in the graph Fourier domain, and f̂_G(λ_l) = ⟨f, u_l⟩ signifies the information at frequency λ_l.

Table 1: Regularization functions R(λ), the induced operators R(Ł), and the resulting spectral filters H(λ) with ϕ = 1 (see Sec 3.3):
- Regularized Laplacian: R(λ) = γλ; R(Ł) = γŁ; H(λ) = 1/(1 + γλ).
- Diffusion process (Stroock & Varadhan, 1969): R(λ) = exp(γλ/2); R(Ł) = exp(γŁ/2); H(λ) = 1/(exp(γλ/2) + 1).
- One-step random walk (Pearson, 1905): R(λ) = (a − λ)^{-1}; R(Ł) = (aI − Ł)^{-1}; H(λ) = (a − λ)/(a − λ + 1).
- Inverse cosine (MacLane, 1947): R(λ) = (cos(λπ/4))^{-1}; R(Ł) = (cos(Łπ/4))^{-1}; H(λ) = 1/(1 + (cos(λπ/4))^{-1}).

Definition 2 (Bandlimitedness). f ∈ L²(G) is called an ω-bandlimited function if its Fourier transform f̂_G has support in [0, ω], and the ω-bandlimited functions form the Paley-Wiener space PW_ω(G).

Definition 3 (Graph Signal Sampling). Given y ∈ PW_ω(G), y can be recovered from its values on the vertices Ω⁺ by minimizing the objective below (Pesenson, 2000; 2008), with positive scalar k:

min_{f ∈ L²(G)} ‖Ł^k f‖  s.t.  f(v) = y(v), ∀v ∈ Ω⁺.
Recall that the observation in inductive 1-bit matrix completion consists of only ones and no zeros (i.e., y(v) = 1 for v ∈ Ω⁺) and Ł^k 1 = 0. It is obvious that minimizing the loss on the observed entries corresponding to ones produces a degenerate solution: the vector with all entries equal to one achieves zero loss. From this point of view, existing theory for sampling real-valued signals (Pesenson, 2000; 2008) is not well suited to the inductive 1-bit matrix completion problem.
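As a minimal illustration of Definitions 1 and 2, the sketch below builds the normalized Laplacian of a small toy graph (the 4-node adjacency matrix is our assumption, not from the paper), computes the graph Fourier transform of a signal, and checks that the inverse transform recovers it:

```python
import numpy as np

# Toy weighted graph: symmetric adjacency matrix (assumed example).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = W.sum(axis=1)                              # vertex degrees d(v)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian operator

lam, U = np.linalg.eigh(L)                     # eigenvalues = graph frequencies
f = np.array([1., 1., 0., 0.])                 # a signal on the vertices
f_hat = U.T @ f                                # graph Fourier transform
f_rec = U @ f_hat                              # inverse transform recovers f
```

Since the graph is connected, the smallest eigenvalue is 0, and `f_hat[lam <= omega]` being the only nonzero coefficients is exactly the ω-bandlimited condition of Definition 2.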

3. CLOSED-FORM SOLUTION FOR 1-BIT MATRIX COMPLETION

This section builds a unified graph signal sampling framework for inductive 1-bit matrix completion that can inductively recover y from the positive ones on set Ω⁺. The rationale behind our framework is that rows with similar observations are likely to have similar reconstructions. This makes a lot of sense in practice; for example, a user (column) is likely to give similar items (rows) similar scores in recommender systems. To achieve this, we need to construct a homogeneous graph G where the connected vertices represent the rows which have similar observations, so that we can design a class of graph regularized functionals that encourage adjacent vertices on graph G to have similar reconstructed values. In particular, we provide a closed-form solution to the matrix completion problem (entitled GS-IMC), together with theoretical bounds and insights.

3.1. GRAPH DEFINITION

We begin by introducing two different methods to construct homogeneous graphs from the zero-one matrix R ∈ {0, 1}^{n×m}: (i) following the definition of hypergraphs (Zhou et al., 2007), matrix R can be regarded as an incidence matrix, so as to formulate the hypergraph Laplacian matrix as Ł = I − D_v^{-1/2} R D_e^{-1} Rᵀ D_v^{-1/2}, where D_v ∈ R^{n×n} (D_e ∈ R^{m×m}) is the diagonal degree matrix of the vertices (edges); and (ii) for regular graphs, one of the most popular approaches is to utilize the covariance between rows to form the adjacency matrix A_{i,j} = Cov(R_i, R_j) for i ≠ j, so that we can define the graph Laplacian matrix as Ł = I − D_v^{-1/2} A D_v^{-1/2}.
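Construction (i) can be sketched in a few lines. The toy binary matrix R below is our assumption (4 items as vertices, 3 users as hyperedges); the snippet forms the hypergraph Laplacian Ł = I − D_v^{-1/2} R D_e^{-1} Rᵀ D_v^{-1/2} and we check it is positive semi-definite, as required for the regularization operators of Sec 3.2:

```python
import numpy as np

# Toy item-user incidence matrix (rows = items/vertices, cols = users/hyperedges).
R = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
d_v = R.sum(axis=1)                            # vertex degrees
d_e = R.sum(axis=0)                            # hyperedge degrees
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
De_inv = np.diag(1.0 / d_e)

# Hypergraph Laplacian of construction (i).
L = np.eye(R.shape[0]) - Dv_inv_sqrt @ R @ De_inv @ R.T @ Dv_inv_sqrt
```

On real data R is large and sparse, so in practice this product would be kept in sparse form rather than materialized densely.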

3.2. GRAPH SIGNAL SAMPLING FRAMEWORK

Given a graph G = (V, E), any real-valued column y ∈ R^n can be viewed as a function on G that maps from V to R; specifically, the i-th vector component y_i is equivalent to the function value y(i) at the i-th vertex. Now it is obvious that the problem of inductive matrix completion, the goal of which is to recover column y from its values on entries Ω⁺, bears a resemblance to the problem of graph signal sampling that aims to recover function y from its values on vertices Ω⁺. However, most existing graph signal sampling methods (Romero et al., 2016; Mao et al., 2018; McNeil et al., 2021) yield degenerate solutions when applied to the 1-bit matrix completion problem. A popular heuristic is to treat some or all of the zeros as negative examples Ω⁻, then to recover y by optimizing the following functional minimization problem, given any k = 2^l, l ∈ N:

min_{f ∈ L²(G)} ‖[R(Ł)]^k f‖  s.t.  ‖s_Ω − f_Ω‖ ≤ ε,  (5)

Figure 2: Recall results on the Netflix data for very-high degree vertices (left), high degree vertices (left middle), medium degree vertices (right middle) and low degree vertices (right) on top-100 ranking tasks, where λ_50 on the x-axis corresponds to the assumption of space PW_{λ50}(G), i.e., we use the eigenfunctions whose eigenvalues are not greater than λ_50 to make predictions. The results show that low (high)-frequency functions reflect user preferences on the popular (cold) items.

where, recall, s = y + ξ is the observed data corrupted by the discrete random noise ξ, and s_Ω (f_Ω) signifies the values of s (f) only on Ω = Ω⁺ ∪ Ω⁻; R(Ł) = Σ_l R(λ_l) u_l u_lᵀ denotes the regularized Laplace operator, in which {λ_l} and {u_l} are respectively the eigenvalues and eigenfunctions of operator Ł. It is worth noting that s(i) = y(i) + ξ(i) = 0 for i ∈ Ω⁻ is not true negative data, and hence Ω⁻ introduces a systematic bias whenever there exists i ∈ Ω⁻ such that y(i) = 1. The choice of the regularization function R(λ) needs to account for two critical criteria: 1) the resulting regularization operator R(Ł) needs to be semi-positive definite; 2) as mentioned before, we expect the reconstruction ŷ to have similar values on adjacent nodes, so uneven functions should be penalized more than even functions. To account for this, we adopt the family of positive, monotonically increasing functions (Smola & Kondor, 2003) presented in Table 1. To this end, we summarize two natural questions concerning our framework: 1) What are the benefits of introducing the regularized Laplacian penalty? It is obvious that minimizing the discrepancy between s_Ω and f_Ω alone does not provide the generalization ability to recover unknown values on the remaining vertices V − Ω; Theorems 4 and 5 answer this question by examining the error bounds. 2) What kind of R(Ł) constitutes a reasonable choice?
Huang et al. (2011) studied this question and showed that R(Ł) is most appropriate if it is unbiased: an unbiased R(Ł) reduces variance without incurring any bias in the estimator. We also highlight the empirical study in Appendix C that evaluates how the performance is affected by the definition of graph G and the regularization function R(λ).

3.3. CLOSED-FORM SOLUTION

In what follows, we provide a closed-form solution for our unified framework by treating all of the zeros as negative examples, i.e., s(v) = 1 for v ∈ Ω⁺ and s(v) = 0 otherwise. Then, by using the method of Lagrange multipliers, we reformulate Eq. (5) as the following problem:

min_{f ∈ L²(G)} (1/2) ⟨f, R(Ł)f⟩ + (ϕ/2) ‖s − f‖²,  (6)

where ϕ > 0 is a hyperparameter. This problem has a closed-form solution:

ŷ = (I + R(Ł)/ϕ)^{-1} s = Σ_l (1 + R(λ_l)/ϕ)^{-1} u_l u_lᵀ s = H(Ł) s,  (7)

where H(Ł) = Σ_l H(λ_l) u_l u_lᵀ with kernel 1/H(λ_l) = 1 + R(λ_l)/ϕ, and we exemplify H(λ) for ϕ = 1 in Table 1. From the viewpoint of spectral graph theory, our GS-IMC approach is essentially a spectral graph filter that amplifies (attenuates) the contributions of low (high)-frequency functions.

Remark. To understand low-frequency and high-frequency functions, Figure 2 presents case studies in the context of recommender systems on the Netflix prize data (Bennett et al., 2007). Specifically, we divide the vertices (items) into four classes: very-high degree (> 5000), high degree (> 2000), medium degree (> 100) and low degree vertices. Then, we report the recall results of all four classes in different Paley-Wiener spaces PW_{λ50}(G), ..., PW_{λ1000}(G) for top-100 ranking prediction. The interesting observations are: (1) the low-frequency functions with eigenvalues less than λ_100 contribute nothing to low degree vertices; and (2) the high-frequency functions whose eigenvalues are greater than λ_500 do not help to increase the performance on very-high degree vertices. This finding implies that low (high)-frequency functions reflect the user preferences on the popular (cold) items. From this viewpoint, the model defined in Eq. (7) aims to exploit the items with high click-through rate with high certainty, which makes sense in commercial applications.
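The closed-form filter ŷ = H(Ł)s in Eq. (7) amounts to an eigendecomposition, a pointwise kernel, and a change of basis. The sketch below is a hedged toy illustration (the 3-node path graph, the observed signal, and the use of the regularized-Laplacian kernel R(λ) = γλ are our assumptions; a full-scale implementation would use truncated/sparse eigensolvers and the bandlimited cutoff):

```python
import numpy as np

def gs_imc_predict(L, s, phi=10.0, gamma=1.0, omega=None):
    """Closed-form reconstruction y_hat = H(L) s with R(lambda) = gamma * lambda."""
    lam, U = np.linalg.eigh(L)              # spectrum of the Laplacian
    if omega is not None:                   # bandlimited assumption PW_omega(G)
        keep = lam <= omega
        lam, U = lam[keep], U[:, keep]
    H = 1.0 / (1.0 + gamma * lam / phi)     # spectral filter, 1/H = 1 + R(lam)/phi
    return U @ (H * (U.T @ s))              # amplify low, attenuate high frequencies

# Toy item-item path graph 0 - 1 - 2 and a one-bit signal observed on node 0.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
d = W.sum(axis=1)
L = np.eye(3) - np.diag(d ** -0.5) @ W @ np.diag(d ** -0.5)
s = np.array([1., 0., 0.])                  # only positive entries are observed
y_hat = gs_imc_predict(L, s)
```

The reconstruction keeps the observed node's score highest and assigns the adjacent node a larger score than the two-hop node, which is exactly the "similar values on adjacent vertices" bias discussed above.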

3.4. ERROR ANALYSIS

Our GS-IMC approach defined in Eq. (7) bears a similarity to total variation denoising (Sadhanala et al., 2016; 2017), graph-constrained regularization (Belkin et al., 2004; 2006), and particularly Laplacian shrinkage methods (Huang et al., 2011). However, we argue that the proposed GS-IMC approach is fundamentally different from previous works. Specifically, they operate on real-valued data while GS-IMC deals with positive-unlabeled data. We believe that our problem setting is more complicated, since the unlabeled data is a mixture of unobserved positive examples and true negative examples. In addition, existing methods analyze recoverability by assuming the statistical noise to be continuous Gaussian, e.g., Theorem 3 of (Sadhanala et al., 2016), Theorem 1.1 of (Pesenson, 2009), etc. In contrast, we study the upper bound of GS-IMC in the presence of discrete random label noise ξ. Specifically, Theorem 4 extends the Paley-Wiener theorem of (Pesenson, 2009) on real-valued data to positive-unlabeled data, showing that a bandlimited function y can be recovered from its values on a certain set Ω. Theorem 5 takes into account the statistical noise ξ and shows that a bandlimited function y can be accurately reconstructed if C_n² = C > 0 is a constant, not growing with n.

Theorem 4 (Error Analysis, extension of Theorem 1.1 in (Pesenson, 2009)). Given R(λ) with λ ≤ R(λ) on graph G = (V, E), assume that Ω^c = V − Ω admits the Poincaré inequality ‖φ‖ ≤ Λ‖Łφ‖ for any φ ∈ L²(Ω^c) with Λ > 0. Then for any y ∈ PW_ω(G) with 0 < ω ≤ R(ω) < 1/Λ,

‖y − ŷ_k‖ ≤ 2 (Λ R(ω))^k ‖y‖  and  y = lim_{k→∞} ŷ_k,

where k is a pre-specified hyperparameter and ŷ_k is the solution of Eq. (5) with ε = 0.

Remark. Theorem 4 indicates that a better estimate of y can be achieved by simply using a higher k, but there is a trade-off between the accuracy of the estimate on one hand, and complexity and numerical stability on the other.
Our experiments show that GS-IMC with k = 1 can achieve state-of-the-art results for inductive top-N recommendation on benchmarks. We provide more discussion in Appendix G.

Theorem 5 (Error Analysis, with label noise). Suppose that ξ is the random noise with flip rate ρ, and positive λ_1 ≤ ... ≤ λ_n are the eigenvalues of Laplacian Ł. Then for any function y ∈ PW_ω(G),

E[MSE(y, ŷ)] ≤ (C_n²/n) ( ρ / (R(λ_1)(1 + R(λ_1)/ϕ)²) + 1/(4ϕ) ),

where C_n² = R(ω)‖y‖², ϕ is the regularization parameter and ŷ is defined in Eq. (7).

Remark. Theorem 5 shows that for a constant C_n² = C > 0 (not growing with n), the reconstruction error converges to zero as n grows large. Also, the reconstruction error decreases as R(ω) declines, which means low-frequency functions can be recovered more easily than high-frequency functions. We provide more discussion on ϕ, ρ in Appendix H.

4. BAYESIAN GS-IMC FOR ONLINE LEARNING

In general, an inductive learning approach, such as GAT (Veličković et al., 2017) and SAGE (Hamilton et al., 2017), can naturally cope with the online learning scenario where the prediction is refreshed given a newly observed example. Essentially, GS-IMC is an inductive learning approach that can update the prediction more effectively than previous matrix completion methods (Devooght et al., 2015; He et al., 2016). Let ∆s denote the newly coming data, which might be one-hot as in Fig. 3(a), and let ŷ denote the original prediction based on data s; then we can efficiently update ŷ to ŷ_new as follows:

ŷ_new = H(Ł)(s + ∆s) = ŷ + H(Ł)∆s.  (10)

However, we argue that GS-IMC ingests the new data in an unrealistic, suboptimal way. Specifically, it does not take into account the model uncertainties, assuming that the observed positive data is noise-free. This assumption limits the model's fidelity and flexibility for real applications. In addition, it assigns a uniform weight to each sample, assuming that the innovation, i.e., the difference between the current a priori prediction and the current observation, is equal for all samples.

4.1. PROBLEM FORMULATION

To model the uncertainties, we denote a measurement by z = Uᵀŷ (with Fourier basis U), which represents the prediction ŷ in the graph Fourier domain, and we assume that z is determined by a stochastic process.

Figure 3: GS-IMC updates the prediction directly as in Eq. (10), while BGS-IMC operates in the graph Fourier domain. The measurement z/z_new is the graph Fourier transform of the prediction ŷ/ŷ_new, and we assume hidden states x/x_new determine these measurements under noise ν. To achieve this, x/x_new should obey the evolution of ŷ/ŷ_new, and thus Eq. (11) represents Eq. (10) under noise η in the graph Fourier domain.

In Fig. 3(b), the measurement z is governed by the hidden state x, and the noise ν captures the data uncertainties in an implicit manner. The choice of the state transition equation needs to account for two critical criteria: (1) the model uncertainties need to be considered; (2) the transition from state x to state x_new needs to represent the evolution of the predictions ŷ/ŷ_new defined in Eq. (10). To account for this, we propose a Bayesian extension of GS-IMC, entitled BGS-IMC, which considers the stochastic filtering problem in a dynamic state-space form:

x_new = x + F∆s + η,  (11)
z_new = x_new + ν,  (12)

where Eq. (11) essentially follows Eq. (10) in the graph Fourier domain, i.e., it is obtained by multiplying both sides of Eq. (10) by Uᵀ. In control theory, F = UᵀH(Ł) is called the input matrix and ∆s represents the system input vector. The state equation (11) describes how the true states x, x_new evolve under the impact of the process noise η ∼ N(0, Σ_η), and the measurement equation (12) characterizes how a measurement z_new = Uᵀ(s + ∆s) of the true state x_new is corrupted by the measurement noise ν ∼ N(0, Σ_ν). It is worth noting that a large determinant of Σ_ν means that the data points are more dispersed, while a large determinant of Σ_η implies that BGS-IMC is not sufficiently expressive and it is better to use the measurement for decision making, i.e., BGS-IMC reduces to GS-IMC.
Using Bayes rule, the posterior is given by: p(x new |∆s, z new ) ∝ p(z new |x new )p(x new |∆s), where p(z new |x new ) and p(x new |∆s) follow a Gauss-Markov process.

4.2. PREDICTION-CORRECTION UPDATE ALGORITHM

To make an accurate prediction, we propose a prediction-correction update algorithm, resembling the workhorse Kalman filtering-based approaches (Kalman, 1960; Wiener et al., 1964). To our knowledge, the class of prediction-correction methods appears less studied in the domain of 1-bit matrix completion, despite its popularity in time-series forecasting (Simonetto et al., 2016; de Bézenac et al., 2020) and computer vision (Matthies et al., 1989; Scharstein & Szeliski, 2002). In the prediction step, we follow the evolution of the state as defined in Eq. (11) to compute the mean and the covariance of the conditional p(x_new|∆s):

E[x_new|∆s] = x̂ + F∆s ≜ x̄_new  and  Var(x_new|∆s) = P + Σ_η ≜ P̄_new,  (13)

where x̂ is the estimate of the state x and P is the estimate covariance, i.e., P = E[(x − x̂)(x − x̂)ᵀ], while x̄_new, P̄_new are the extrapolated estimate state and covariance respectively. Meanwhile, it is easy to obtain the mean and the covariance of the conditional p(z_new|x_new):

E[z_new|x_new] = E[x_new + ν] = x_new  (14)  and  Var(z_new|x_new) = E[ννᵀ] = Σ_ν.  (15)

In the correction step, we combine Eq. (13) with Eqs. (14) and (15):

p(x_new|∆s, z_new) ∝ exp( −(1/2)(x_new − z_new)ᵀ Σ_ν^{-1} (x_new − z_new) − (1/2)(x_new − x̄_new)ᵀ P̄_new^{-1} (x_new − x̄_new) ).

By solving ∂ ln p(x_new|∆s, z_new)/∂x_new = 0, we obtain the following corrected estimate state x̂_new and covariance P_new, where we recall that the new measurement is defined as z_new = Uᵀ(s + ∆s):

x̂_new = x̄_new + K(z_new − x̄_new),  (16)
P_new = (I − K) P̄_new (I − K)ᵀ + K Σ_ν Kᵀ,  (17)
K = P̄_new (P̄_new + Σ_ν)^{-1},  (18)

where K is the Kalman gain and z_new − x̄_new is called the innovation. It is worth noting that Eq. (16) adjusts the predicted iterate x̄_new in terms of the innovation, the key difference to GS-IMC and existing methods, e.g., GAT (Veličković et al., 2017) and SAGE (Hamilton et al., 2017). Remark. The BGS-IMC approach is highly scalable in Paley-Wiener spaces.
Let PW_ω(G) be the span of k (≪ n) eigenfunctions whose eigenvalues are no greater than ω; then the transition matrix F in (11) is k-by-n and every covariance matrix is of size k × k. Proposition 6 shows that x̂_new obtained in Eq. (16) is an unbiased and minimum-variance estimator. Proposition 6. Given an observation ∆s, provided F is known, x̂_new obtained in Eq. (16) is the optimal linear estimator in the sense that it is unbiased and has minimum variance. To summarize, the complete procedure of BGS-IMC is to first specify Σ_η, Σ_ν, P using prior knowledge, then to calculate the extrapolated state x̄_new using (13), and finally to obtain x̂_new using (16), so that the updated model prediction ŷ_new = U x̂_new ingests the new observation.
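The prediction and correction steps can be sketched compactly. This is a hedged toy illustration in a k-dimensional Fourier subspace (the function name, the identity choice of F, and the 2-dimensional usage values are our assumptions, not the paper's code; in the paper F would be UᵀH(Ł) and z_new the transformed observation):

```python
import numpy as np

def predict_correct(x, P, F, delta_s, z_new, Sig_eta, Sig_nu):
    """One BGS-IMC-style update: state (x, P), input delta_s, measurement z_new."""
    # Prediction step: propagate state mean and covariance.
    x_bar = x + F @ delta_s
    P_bar = P + Sig_eta
    # Correction step: Kalman gain, innovation update, Joseph-form covariance.
    K = P_bar @ np.linalg.inv(P_bar + Sig_nu)
    x_new = x_bar + K @ (z_new - x_bar)          # adjust prediction by innovation
    I = np.eye(len(x))
    P_new = (I - K) @ P_bar @ (I - K).T + K @ Sig_nu @ K.T
    return x_new, P_new

# Toy usage in a 2-dimensional Fourier subspace (all values assumed).
x, P = np.zeros(2), np.eye(2)
x_new, P_new = predict_correct(x, P, np.eye(2), np.array([1., 0.]),
                               np.array([1., 1.]), 0.01 * np.eye(2), 0.1 * np.eye(2))
```

Two limiting cases match the Σ_η/Σ_ν discussion above: as Σ_ν → 0 the update trusts the measurement (x_new → z_new), and as Σ_ν grows the update keeps the extrapolated prediction x̄_new.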

5. EXPERIMENT

This section evaluates GS-IMC (in Section 3) and BGS-IMC (in Section 4) on real-world datasets. All experiments are conducted on machines with a Xeon 3175X CPU, 128GB memory and a P40 GPU with 24GB memory. The source code and models will be made publicly available.

5.1. EXPERIMENTAL SETUP

We adopt three large real-world datasets widely used for evaluating recommendation algorithms: (1) Koubei (1,828,250 ratings of 212,831 users on 10,213 items); (2) Tmall (7,632,826 ratings of 320,497 users on 21,876 items); (3) Netflix (100,444,166 ratings of 400,498 users on 17,770 items). For each dataset, we follow the experimental protocols in (Liang et al., 2018; Wu et al., 2017a) for inductive top-N ranking, where the users are split into training/validation/test sets with ratio 8:1:1. Then, we use all the data from the training users to optimize the model parameters. In the testing phase, we sort all interactions of the validation/test users in chronological order, holding out the last interaction for testing and inductively generating the necessary representations using the rest of the data. The results in terms of hit-rate (HR) and normalized discounted cumulative gain (NDCG) are reported on the test set for the model which delivers the best results on the validation set. We implement our method in Apache Spark with Intel MKL, where matrix computation is parallelized and distributed. In the experiments, we denote the item-user rating matrix by R and further define the Laplacian Ł = I − D_v^{-1/2} R D_e^{-1} Rᵀ D_v^{-1/2}. We set a = 4, γ = 1, ϕ = 10 for GS-IMC, while we set the covariances to Σ_η = Σ_ν = 10^{-4} I and initialize P using the validation data for BGS-IMC. In the test stage, if a user has |Ω| training interactions, BGS-IMC uses the first |Ω| − 1 interactions to produce the initial state x̂, then feeds the last interaction to simulate the online update. In the literature, there are few existing works that enable inductive inference for top-N ranking using only the ratings. To make thorough comparisons, we prefer to strengthen IDCF with GCMC for improved performance (IDCF+ for short) rather than report the results of IDCF (Wu et al., 2021) and GCMC (van den Berg et al., 2017) as individuals.
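The leave-last-out evaluation above reduces to scoring a single held-out interaction against a full ranking. A minimal sketch of the HR@N and NDCG@N computation (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def hr_ndcg_at_n(scores, held_out_item, n=10):
    """HR@N and NDCG@N for one user with a single held-out relevant item."""
    top_n = np.argsort(-scores)[:n]        # indices of the N highest-scored items
    if held_out_item not in top_n:
        return 0.0, 0.0                    # miss: both metrics are zero
    rank = int(np.where(top_n == held_out_item)[0][0])   # 0-based position
    hr = 1.0                               # hit
    ndcg = 1.0 / np.log2(rank + 2)         # DCG of one relevant item; IDCG = 1
    return hr, ndcg

# Toy scores over 4 items; the held-out item 3 ranks second.
scores = np.array([0.1, 0.9, 0.3, 0.7])
hr, ndcg = hr_ndcg_at_n(scores, held_out_item=3, n=2)
```

Per-user values are then averaged over the validation/test users to produce the reported tables.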
Furthermore, we study their performance with different graph neural networks including ChebyNet (Defferrard et al., 2016), GAT (Veličković et al., 2017), GraphSage (Hamilton et al., 2017), SGC (Wu et al., 2019) and ARMA (Bianchi et al., 2021). We adopt the Adam optimizer (Kingma & Ba, 2015) with the learning rate decayed by 0.98 every epoch. We grid-search the learning rate and L2 regularizer in {0.1, 0.01, ..., 0.00001}, the dropout rate over {0.1, 0.2, ..., 0.7} and the latent factor size over {32, 64, ..., 512} for the optimal performance. In addition, we also report the results of the shallow models, i.e., MRFCF (Steck, 2019) and SGMC (Chen et al., 2021), which are most closely related to our proposed method. The software provided by the authors is used in the experiments.

Table 2: Hit-Rate results against the baselines for inductive top-N ranking. Note that SGMC (Chen et al., 2021) is a special case of our method using the cut-off regularization, and MRFCF (Steck, 2019) is the full rank version of our method with (one-step) random walk regularization. The standard errors of the ranking metrics are less than 0.005 for all three datasets.

Table 3: NDCG results of GS-IMC and BGS-IMC against the baselines for inductive top-N ranking on Koubei (density 0.08%), Tmall (density 0.10%) and Netflix (density 1.41%), reporting N@10, N@50 and N@100 per dataset. Results of (Wang et al., 2019) are omitted, as its accuracies were found below par in SGMC (Chen et al., 2021) and IDCF (Wu et al., 2021).

5.2. ACCURACY COMPARISON

In this section, GS-IMC and BGS-IMC assume that the underlying signal is λ_1000-bandlimited, and we compare them with eight state-of-the-art graph-based baselines, including spatial graph models (i.e., IDCF (Wu et al., 2021), IDCF+GAT (Veličković et al., 2017), IDCF+GraphSAGE (Hamilton et al., 2017)), approximate spectral graph models with high-order polynomials (i.e., IDCF+SGC (Wu et al., 2019), IDCF+ChebyNet (Defferrard et al., 2016), IDCF+ARMA (Bianchi et al., 2021)) and exact spectral graph models (i.e., MRFCF (Steck, 2019) and SGMC (Chen et al., 2021)). In Table 2 and Table 3, the results on the real-world Koubei, Tmall and Netflix datasets show that BGS-IMC outperforms all the baselines on all the datasets. Note that MRFCF (Steck, 2019) is the full-rank version of GS-IMC with (one-step) random walk regularization. We can see that MRFCF underperforms its counterpart on all three datasets, which demonstrates the advantage of the bandlimited assumption for inductive top-N ranking tasks. Further, BGS-IMC consistently outperforms GS-IMC on all three datasets by a margin, which proves the efficacy of the prediction-correction algorithm for incremental updates. Additionally, we provide extensive ablation studies in Appendix C, scalability studies in Appendix D and more comparisons with SOTA sequential models in Appendix E. To summarize, the proposed method further improves the prediction accuracy because 1) GS-IMC exploits the structural information in the 1-bit matrix to mitigate the negative influence of discrete label noise in the graph vertex domain; and 2) BGS-IMC additionally considers continuous Gaussian noise in the graph Fourier domain and yields unbiased and minimum-variance predictions using the prediction-correction update algorithm.

6. CONCLUSION

We have introduced a unified graph signal sampling framework for inductive 1-bit matrix completion, together with theoretical bounds and insights. Specifically, GS-IMC is devised to learn the structural information in the 1-bit matrix to mitigate the negative influence of discrete label noise in the graph vertex domain. Second, BGS-IMC takes into account the model uncertainties in the graph Fourier domain and provides a prediction-correction update algorithm to obtain unbiased and minimum-variance reconstructions. Both GS-IMC and BGS-IMC have closed-form solutions and are highly scalable. Experiments on the task of inductive top-N ranking have shown their superiority. In the appendices, we present the detailed related work in Appendix A, the generalization of SGMC and MRFCF in Appendix B, extensive ablation studies in Appendix C, scalability studies in Appendix D, limitations and future work in Appendix F, proofs of the theoretical results in Appendices G-I, and more implementation details in Appendix J.

A RELATED WORK

Inductive matrix completion. There has been a flurry of research on the problem of inductive matrix completion (Chiang et al., 2018; Jain & Dhillon, 2013; Xu et al., 2013; Zhong et al., 2019), which leverages side information (or content features) in the form of feature vectors to predict inductively on new rows and columns. The intuition behind this family of algorithms is to learn mappings from the feature space to the latent factor space, such that inductive matrix completion methods can adapt to new rows and columns without retraining. However, it has recently been shown (Zhang & Chen, 2020; Ledent et al., 2021; Wu et al., 2021) that inductive matrix completion methods provide limited performance due to the inferior expressiveness of the feature space. Moreover, the prediction accuracy depends strongly on the content quality, but in practice high-quality content is becoming hard to collect due to legal risks (Voigt & Von dem Bussche, 2017). By contrast, one advantage of our approach is the capacity for inductive learning without using side information.

Graph neural networks. Inductive representation learning over graph-structured data has received significant attention recently due to its ubiquitous applicability. Among the existing works, GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017) propose to generate embeddings for previously unseen data by sampling and aggregating features from a node's local neighbors, while spectral approaches such as ChebyNet (Defferrard et al., 2016) approximate graph convolutions with polynomial filters. To leverage recent advances in graph neural networks, LightGCN (He et al., 2020), GCMC (van den Berg et al., 2017) and PinSAGE (Ying et al., 2018) represent the matrix by a bipartite graph and then generalize the representations to unseen nodes by summing the content-based embeddings over the neighbors.
Differently, IGMC (Zhang & Chen, 2020) trains graph neural networks that encode the subgraphs around an edge into latent factors and then decode the factors back to the value on the edge. Recently, IDCF (Wu et al., 2021) studies the problem in a downsampled homogeneous graph (i.e., the user-user graph in recommender systems) and then applies attention networks to yield inductive representations. Probably most closely related to our approach are IDCF (Wu et al., 2021) and IGMC (Zhang & Chen, 2020), which do not assume any side information such as user profiles and item properties. The key advantages of our approach are not only the closed-form solution for efficient training, but also the theoretical results which guarantee the reconstruction of unseen rows and columns, together with practical guidance for potential improvements.

Graph signal sampling. In general, graph signal sampling aims to reconstruct real-valued functions defined on the vertices (i.e., graph signals) from their values on a certain subset of vertices. Existing approaches commonly build upon the assumption of bandlimitedness, by which the signal of interest lies in the span of the leading eigenfunctions of the graph Laplacian (Pesenson, 2000; 2008). It is worth noting that we are not the first to consider the connections between graph signal sampling and matrix completion: recent work by Romero et al. (2016) proposed a unifying kernel-based framework that broadens both the graph signal sampling and matrix completion perspectives. However, we argue that Romero's work and its successors (Benzi et al., 2016; Mao et al., 2018; McNeil et al., 2021) are orthogonal to our approach, as they mainly focus on real-valued matrix completion in the transductive manner. Specifically, our approach addresses two challenging problems that arise when connecting the ideas and methods of graph signal sampling with inductive one-bit matrix completion: one-bit quantization and online learning.
To satisfy the requirement of online learning, existing works learn the parameters for new rows and columns by performing either stochastic gradient descent, as in MCEX (Giménez-Febrer et al., 2019), or alternating least squares, as in eALS (He et al., 2016). The advantage of BGS-IMC is threefold: (i) BGS-IMC has closed-form solutions, bypassing the well-known difficulty of tuning the learning rate; (ii) BGS-IMC considers the random Gaussian noise in the graph Fourier domain, characterizing the uncertainties in measurement and modeling; and (iii) the prediction-correction algorithm, resembling Kalman filtering, can provide unbiased and minimum-variance reconstructions. Probably most closely related to our approach are SGMC (Chen et al., 2021) and MRFCF (Steck, 2019), in the sense that both formulate their solutions as spectral graph filters and can be regarded as methods for data filtering in the domain of discrete signal processing. More specifically, SGMC optimizes latent factors V, U by minimizing the normalized matrix reconstruction error min_{U,V} ‖D_v^{-1/2} R D_e^{-1/2} − V U^⊤‖ subject to norm bounds on U and V, while MRFCF minimizes the matrix reconstruction error min_X ‖R − XR‖ + λ‖X‖ s.t. diag(X) = 0, where the diagonal entries of the parameter X are forced to zero. It is obvious now that both SGMC and MRFCF focus on minimizing a matrix reconstruction error. This is one of the key differences to our graph signal sampling framework, which optimizes the functional minimization problem defined in Eq. 5. We argue that our problem formulation is more suitable for inductive one-bit matrix completion, since it focuses on the reconstruction of bandlimited functions, no matter whether the function is observed during training or at test time. Perhaps more importantly, both methods (Chen et al., 2021; Steck, 2019) can be included as special cases of our framework.
We believe that a unified framework bridging graph signal sampling and inductive matrix completion could benefit both fields, since the modeling knowledge from both domains can be more deeply shared.

Advantages of graph signal sampling perspectives. A graph signal sampling perspective requires modeling 1-bit matrix data as signals on a graph and formulating the objective in the functional space. Doing so opens the possibility of processing, filtering and analyzing the matrix data with vertex-frequency analysis (Hammond et al., 2011; Shuman et al., 2013), time-variant analysis (Mao et al., 2018; McNeil et al., 2021), smoothing and filtering (Kalman, 1960; Khan & Moura, 2008), etc. In this paper, we technically explore the use of graph spectral filters to inductively recover the missing values of the matrix, a Kalman-filtering based approach to deal with streaming data in the online learning scenario, and vertex-frequency analysis to discover the advantages of the dynamic BERT4REC model over the static BGS-IMC model. We believe that our graph signal sampling framework can serve as a new paradigm for 1-bit matrix completion, especially in large-scale and dynamic systems.

B GENERALIZING SGMC AND MRFCF

This section shows how GS-IMC generalizes SGMC (Chen et al., 2021) and MRFCF (Steck, 2019).

GS-IMC generalizes SGMC. Given the observation R, we follow the standard routine of hypergraphs (Zhou et al., 2007) to calculate the hypergraph Laplacian matrix Ł = I − D_v^{-1/2} R D_e^{-1} R^⊤ D_v^{-1/2}, where D_v (D_e) is the diagonal degree matrix of vertices (edges). Then the rank-k approximation (see Eq. (9) in (Chen et al., 2021)) is equivalent to our result using the bandlimited norm R(λ) = 1 if λ ≤ λ_k and R(λ) = ∞ otherwise:

ŷ = Σ_l (1 + R(λ_l)/ϕ)^{-1} u_l u_l^⊤ s = Σ_{l≤k} u_l u_l^⊤ s = U_k U_k^⊤ s,

where we set ϕ = ∞, so that lim_{ϕ→∞} R(λ_l)/ϕ = ∞ for λ_l > λ_k, and matrix U_k comprises the k leading eigenvectors whose eigenvalues are less than or equal to λ_k.
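As a quick numerical illustration of this equivalence, the sketch below (our code, not the paper's; a small random symmetric PSD matrix stands in for Ł) builds the GS-IMC spectral filter with the bandlimited norm and checks that a very large ϕ reduces it to the projection U_k U_k^⊤ s:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

# A small symmetric PSD matrix as a stand-in "Laplacian" for illustration.
B = rng.standard_normal((n, n))
L = B @ B.T
lam, U = np.linalg.eigh(L)                    # eigenvalues in ascending order

s = rng.integers(0, 2, size=n).astype(float)  # an observed 0/1 signal

def gs_imc(s, lam, U, reg, phi):
    """Closed-form filter: y_hat = sum_l (1 + R(lam_l)/phi)^-1 u_l u_l^T s."""
    gain = 1.0 / (1.0 + reg(lam) / phi)
    return U @ (gain * (U.T @ s))

# Bandlimited norm: R(lam) = 1 if lam <= lam_k, infinity otherwise.
lam_k = lam[k - 1]
reg_band = lambda x: np.where(x <= lam_k, 1.0, np.inf)

# As phi grows, the filter reduces to U_k U_k^T s (the SGMC special case).
y_filter = gs_imc(s, lam, U, reg_band, phi=1e12)
Uk = U[:, :k]
y_sgmc = Uk @ (Uk.T @ s)
print(np.allclose(y_filter, y_sgmc, atol=1e-8))
```

Frequencies above λ_k receive zero gain (1/(1+∞) = 0), while frequencies at or below λ_k receive gain 1/(1+1/ϕ) → 1, which is exactly the rank-k projection.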

GS-IMC generalizes MRFCF.

Given R, we simply adopt the correlation relationship to construct the affinity matrix and define the Laplacian as Ł = 2I − D_v^{-1/2} R R^⊤ D_v^{-1/2}. Then the matrix approximation (see Eq. (4) in (Steck, 2019)) is equivalent to our GS-IMC approach using the one-step random walk regularization.

Table 6: HR, NDCG on the Netflix prize data of GS-IMC (w/ random walk regularization), where we adopt different methods for constructing the homogeneous graph for inductive top-N ranking.

GS-IMC w/ Hypergraph: HR@10 0.09660±0.0006, HR@50 0.22328±0.0002, HR@100 0.32235±0.0011, NDCG@10 0.05452±0.0004, NDCG@50 0.08158±0.0004, NDCG@100 0.09759±0.0002
GS-IMC w/ Covariance: HR@10 0.09767±0.0012, HR@50 0.22388±0.0006, HR@100 0.31312±0.0052, NDCG@10 0.05454±0.0005, NDCG@50 0.08171±0.0007, NDCG@100 0.09613±0.0007

C.2 IMPACT OF GRAPH DEFINITIONS

Table 6 presents the HR and NDCG results of GS-IMC with one-step random walk regularization on the Netflix prize data. To avoid clutter, we omit the results of GS-IMC with other regularization functions, since they share the same trends. The regular graph that uses the covariance matrix as the affinity matrix achieves better HR and NDCG results when recommending 10 and 50 items, while the hypergraph helps achieve better results when recommending 100 items.

D SCALABILITY STUDIES

The solution for either GS-IMC or BGS-IMC requires computing the leading eigenvectors whose eigenvalues are less than or equal to a pre-specified ω. However, one might argue that this is computationally intractable on industry-scale datasets. To address this concern, one feasible approach is to perform the Nyström method (Fowlkes et al., 2004) to obtain the leading eigenvectors. For completeness, we present the pseudo-code of the approximate eigendecomposition (Chen et al., 2021) in Algorithm 1, whose computational complexity is O(lnk + k³), where n is the number of columns in Ł, l is the number of sampled columns and k is the number of eigenvectors to compute. This reduces the overhead from O(n³) to O(lnk + k³), linear in the number of vertices. To evaluate how the proposed GS-IMC and BGS-IMC methods perform with the approximate eigenvectors, we conduct experiments on the largest Netflix prize data. Table 7 reports the HR, NDCG and runtime results for the standard GS-IMC and BGS-IMC methods and their scalable versions, entitled GS-IMCs and BGS-IMCs. To make the comparison complete, we also present the results of the neural IDCF (Wu et al., 2021) model equipped with ChebyNet (Defferrard et al., 2016). It is obvious that the standard GS-IMC and BGS-IMC methods consume only a small fraction of the training time required by graph neural networks. Meanwhile, GS-IMCs achieves comparable ranking

Algorithm 1 Approximate Eigendecomposition

Require: n × l matrix C derived from l columns sampled from the n × n kernel matrix Ł without replacement; l × l matrix A composed of the intersection of these l columns; l × l matrix W; rank k; oversampling parameter p; number of power iterations q.
Ensure: approximate eigenvalues Σ and eigenvectors U.
1: Generate a random Gaussian matrix Ω ∈ R^{l×(k+p)}, then compute the sample matrix A^q Ω.
2: Perform QR decomposition on A^q Ω to obtain an orthonormal matrix Q that satisfies A^q Ω = Q Q^⊤ A^q Ω, then solve Z Q^⊤ Ω = Q^⊤ W Ω.
3: Compute the eigenvalue decomposition of the (k+p)-by-(k+p) matrix Z, i.e., Z = U_Z Σ_Z U_Z^⊤, to obtain U_W = Q U_Z[:, :k] and Σ_W = Σ_Z[:k, :k].
4: Return Σ ← Σ_W, U ← C A^{-1/2} U_W Σ_W^{-1/2}.

Table 7: Hit-Rate, NDCG and Runtime of the enhanced IDCF (Wu et al., 2021) model equipped with ChebyNet (Defferrard et al., 2016), GS-IMC, BGS-IMC (w/ random walk regularization) and their scalable versions (i.e., GS-IMCs and BGS-IMCs) for inductive top-N ranking on Netflix data.

IDCF+ChebyNet: HR@10 0.08735±0.0016, HR@50 0.19335±0.0042, HR@100 0.27470±0.0053, NDCG@10 0.04996±0.0010, NDCG@50 0.07268±0.0017, NDCG@100 0.08582±0.0037, Runtime 598 min
GS-IMC: HR@10 0.09660±0.0006, HR@50 0.22328±0.0002, HR@100 0.32235±0.0011, NDCG@10 0.05452±0.0004, NDCG@50 0.08158±0.0004, NDCG@100 0.09759±0.0002, Runtime 12.0 min
GS-IMCs: HR@10 0.09638±0.0007, HR@50 0.22258±0.0009, HR@100 0.31994±0.0015, NDCG@10 0.05352±0.0006, NDCG@50 0.08135±0.0006, NDCG@100 0.09657±0.0002, Runtime 1.5 min
BGS-IMC: HR@10 0.09988±0.0006, HR@50 0.23390±0.0005, HR@100 0.33063±0.0009, NDCG@10 0.05593±0.0004, NDCG@50 0.08400±0.0004, NDCG@100 0.09982±0.0001, Runtime 12.5 min
BGS-IMCs: HR@10 0.10005±0.0011, HR@50 0.23318±0.0014, HR@100 0.32750±0.0020, NDCG@10 0.05508±0.0006, NDCG@50 0.08365±0.0006, NDCG@100 0.09890±0.0001, Runtime 2.0 min
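The core Nyström idea behind Algorithm 1 can be sketched compactly; the simplification below (our code, not the paper's) eigendecomposes the small intersection matrix directly instead of using randomized power iterations, which is sufficient for small l:

```python
import numpy as np

def nystrom_eig(K, idx, k):
    """Approximate top-k eigenpairs of a PSD matrix K from sampled columns idx."""
    C = K[:, idx]                  # n x l sampled columns
    W = K[np.ix_(idx, idx)]        # l x l intersection matrix
    sw, Uw = np.linalg.eigh(W)     # small, exact eigendecomposition
    order = np.argsort(sw)[::-1][:k]
    sw, Uw = sw[order], Uw[:, order]
    U = C @ Uw / sw                # extend eigenvectors to all n rows
    return sw, U

rng = np.random.default_rng(1)
n, r, l = 50, 4, 8
X = rng.standard_normal((n, r))
K = X @ X.T                        # rank-r PSD kernel

idx = rng.choice(n, size=l, replace=False)
sw, U = nystrom_eig(K, idx, k=r)

# For a rank-r kernel with l >= r sampled columns, the Nystrom
# reconstruction U diag(sw) U^T recovers K exactly.
print(np.allclose(U * sw @ U.T, K, atol=1e-6))
```

Only the small l×l matrix is decomposed exactly; the n×l multiplication that extends the eigenvectors to all rows gives the O(lnk) term of the stated complexity.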


Figure 4: Spectrum analysis for static GS-IMC, BGS-IMC and dynamic BERT4REC on the Koubei dataset. Compared to BERT4REC, the energy of GS-IMC and BGS-IMC is concentrated on the low frequencies, since the high-frequency functions are highly penalized during minimization.

performance to GS-IMC, while improving the efficiency by 8X. Likewise, BGS-IMCs enjoys improved system scalability without significant loss in prediction accuracy. The overall results demonstrate that GS-IMC and BGS-IMC are highly scalable on very large data.

E SPECTRUM ANALYSIS AND DISCUSSION WITH SEQUENTIAL MODELS

We compare BGS-IMC with recent sequential recommendation models, including the Transformer-based SASREC (Kang & McAuley, 2018), the BERT-based BERT4REC (Sun et al., 2019) and the causal-CNN based GREC (Yuan et al., 2020). We choose an embedding size of 256 and search for the optimal hyper-parameters by grid. Each model is configured using the same parameters provided by the original paper, i.e., two attention blocks with one head for SASREC, three attention blocks with eight heads for BERT4REC, and six dilated CNNs with degrees 1, 2, 2, 4, 4, 8 for GREC. Table 8 presents the HR and NDCG results on Koubei for inductive top-N ranking. Note that BGS-IMC only accepts the most recent behavior to update the obsolete state for incremental learning, whereas SASREC, BERT4REC and GREC focus on modeling the dynamic patterns in the sequence. Hence, such a comparison is not in favor of BGS-IMC. Interestingly, we see that static BGS-IMC achieves comparable HR results to SOTA sequential models, while consuming a small fraction of the running time. From this viewpoint, BGS-IMC is more cost-effective than the compared methods. To fully understand the performance gap in NDCG, we analyze GS-IMC, BGS-IMC and the best baseline BERT4REC in the graph spectral domain, where we limit the ℓ2 norm of each user's spectral signals to one and visualize their averaged values in Figure 4.
As expected, the energy of GS-IMC and BGS-IMC is concentrated on the low frequencies, since the high-frequency functions are highly penalized during minimization. Furthermore, the proposed prediction-correction update algorithm increases the energy of the high-frequency functions. This bears a similarity with BERT4REC, whose high-frequency functions are not constrained and can aggressively raise the rankings of unpopular items. This explains why BERT4REC and BGS-IMC have better NDCG than GS-IMC.
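The per-frequency energy underlying this analysis takes a few lines to compute; in the sketch below (ours, not the paper's code) a random orthonormal basis Q stands in for the graph Fourier basis U:

```python
import numpy as np

def spectral_energy(y_hat, U):
    """Per-frequency energy of a prediction, after normalizing its spectrum."""
    coeff = U.T @ y_hat                    # graph Fourier transform
    coeff = coeff / np.linalg.norm(coeff)  # limit the l2 norm to one
    return coeff ** 2                      # energy per frequency

rng = np.random.default_rng(2)
n = 16
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # stand-in GFT basis
y_hat = rng.standard_normal(n)

e = spectral_energy(y_hat, Q)
print(float(e.sum()))   # energies of a unit-norm spectrum sum to one
```

Averaging these energy vectors over users, as described above, yields the curves plotted in Figure 4.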

F LIMITATION AND FUTURE WORK

Limitation on sequence modeling. The proposed BGS-IMC method is simple and cannot capture the sophisticated dynamics in the sequence. However, we believe that our work opens the possibility of benefiting sequential recommendation with graph signal processing techniques, for example the extended Kalman filter, KalmanNet and the particle filter.

Limitation on sample complexity. The sample complexity is not provided in the paper, and we believe that this is an open problem due to the lack of regularity in the graph, which prevents us from defining the idea of sampling "every other node" (the reader is referred to (Anis et al., 2016; Ortega et al., 2018) for more details).

Future work on deep graph learning. Though GS-IMC and BGS-IMC are mainly compared with neural graph models, we note that our approach can help improve the performance of existing graph neural networks including GAT (Veličković et al., 2017) and SAGE (Hamilton et al., 2017), etc. We summarize the following directions for future work: 1) it is interesting to see how GS-IMC takes advantage of content features, and one feasible idea is to use GS-IMC as multi-scale wavelets.

G.1 EXTRA DISCUSSION

In (Pesenson, 2008), the complementary set S = Ω^c = V − Ω which admits the Poincaré inequality is called the Λ-set. Theorem 4 in our paper and Theorem 1.1 in (Pesenson, 2009) state that bandlimited functions y ∈ PW_ω can be reconstructed from their values on a uniqueness set Ω = V − S. To better understand the concept of the Λ-set, we restate Lemma 9 from (Pesenson, 2008), which presents the conditions for a Λ-set. It is worth pointing out that (i) the second condition suggests that the vertices from the Λ-set would likely be sparsely connected with the uniqueness set Ω; and (ii) the vertices in the Λ-set are disconnected from each other or isolated in the subgraph constructed by the vertices S, since otherwise there would always exist a non-zero function φ ∈ L²(S), φ ≠ 0, which makes Łφ = 0.

Lemma 9 (restated from Lemma 3.6 in (Pesenson, 2008)).
Suppose that for a set of vertices S ⊂ V (finite or infinite) the following holds true: 1. every point of S is adjacent to a point from the boundary bS, the set of all vertices in V which are not in S but adjacent to a vertex in S; 2. for every v ∈ S there exists at least one adjacent point u_v ∈ bS whose adjacency set intersects S only over v; 3. the number Λ = sup_{v∈S} d(v) is finite. Then the set S is a Λ-set which admits the Poincaré inequality ‖φ‖ ≤ Λ‖Łφ‖, φ ∈ L²(S).

In our experiments on recommender systems, each user's ratings might not comply with the Poincaré inequality. This is because some users prefer niche products/movies (low-degree nodes). As shown in Fig. 2, user preferences on low-degree nodes are determined by high-frequency functions. When R(ω) is not large enough, the Poincaré inequality does not hold for such users. This also explains why our model performs poorly for cold items. Regarding the choice of parameter k, empirical results show that using k ≥ 2 does not help improve the performance; note that when k is large enough, all kernels reduce to the bandlimited norm, i.e., R(λ) = 1 if λ ≤ λ_k ≤ 1, since the gap between eigenvalues shrinks.

H PROOF OF THEOREM 5

Proof. Let ξ denote the random label noise which flips a 1 to 0 with rate ρ, and assume that the sample s = y + ξ is observed from y under noise ξ. Then for a graph spectral filter H_ϕ = (I + R(Ł)/ϕ)^{-1} with ϕ > 0, we have

E[MSE(y, ŷ)] = (1/n) E‖y − H_ϕ(y + ξ)‖² ≤ (1/n) E‖H_ϕ ξ‖² + (1/n) ‖(I − H_ϕ)y‖²,

where the inequality holds due to the triangle property of the matrix norm. To bound E‖H_ϕ ξ‖², let C_n = R^{1/2}(ω)‖y‖; then

E‖H_ϕ ξ‖² =(a) Σ_{y(v)=1} [ρ (H_{ϕ,(*,v)} × (−1))² + (1 − ρ)(H_{ϕ,(*,v)} × 0)²]
= ρ Σ_{y(v)=1} (H_{ϕ,(*,v)} y(v))² = ρ ‖H_ϕ y‖²
≤(b) sup_{‖R^{1/2}(Ł)y‖ ≤ C_n} ρ ‖H_ϕ y‖² = sup_{‖z‖ ≤ C_n} ρ ‖H_ϕ R^{−1/2}(Ł) z‖²
= ρ C_n² σ²_max(H_ϕ R^{−1/2}(Ł)) = ρ C_n² max_{l=1,…,n} [1/(1 + R(λ_l)/ϕ)²] [1/R(λ_l)]
≤ ρ ϕ² C_n² / [R(λ_1)(ϕ + R(λ_1))²],

where (a) follows the definition of the flip random noise ξ and (b) holds due to the fact that y is in the Paley-Wiener space PW_ω(G). As for the second term,

‖(I − H_ϕ)y‖² ≤ sup_{‖R^{1/2}(Ł)y‖ ≤ C_n} ‖(I − H_ϕ)y‖²
=(a) sup_{‖z‖ ≤ C_n} ‖(I − H_ϕ) R^{−1/2}(Ł) z‖² = C_n² σ²_max((I − H_ϕ) R^{−1/2}(Ł))
= C_n² max_{l=1,…,n} [1 − 1/(1 + R(λ_l)/ϕ)]² [1/R(λ_l)]
= (C_n²/ϕ) max_{l=1,…,n} (R(λ_l)/ϕ) / (R(λ_l)/ϕ + 1)²
≤(b) C_n² / (4ϕ),

where (a) holds due to the fact that the eigenvectors of I − H_ϕ are the eigenvectors of R(Ł), and (b) follows the simple upper bound x/(1 + x)² ≤ 1/4 for x ≥ 0. Combining everything together, we conclude the result

E[MSE(y, ŷ)] ≤ (C_n²/n) [ρ ϕ² / (R(λ_1)(ϕ + R(λ_1))²) + 1/(4ϕ)].

H.1 EXTRA DISCUSSION

Choosing ϕ to balance the two terms on the right-hand side above gives ϕ* = ∞ for ρ < 1/8 and 1 + R(λ_1)/ϕ* = 2ρ^{1/3} for ρ ≥ 1/8. This result implies that we can use a large ϕ to obtain an accurate reconstruction when the flip rate ρ is no greater than 1/8, whereas ϕ needs to be carefully tuned when ρ is greater than 1/8.
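The balancing choice of ϕ can be sanity-checked numerically. The sketch below (ours, not from the paper) substitutes t = 1 + R(λ_1)/ϕ with r = R(λ_1) = 1, so the bound becomes g(t) = ρ/t² + (t − 1)/4, and verifies by grid search that the minimizer is t = 2ρ^{1/3}:

```python
import numpy as np

rho = 0.5                               # an example flip rate >= 1/8
t = np.linspace(1.0, 4.0, 300001)       # grid over t = 1 + R(lam_1)/phi
g = rho / t**2 + (t - 1.0) / 4.0        # the substituted upper bound (r = 1)
t_star = t[np.argmin(g)]
print(t_star, 2 * rho ** (1 / 3))       # the two values should agree closely
```

Setting g'(t) = −2ρ/t³ + 1/4 = 0 gives t³ = 8ρ, i.e., t = 2ρ^{1/3}, which requires ρ ≥ 1/8 for t ≥ 1; otherwise the unconstrained bound is decreasing on [1, ∞) in ϕ and ϕ* = ∞.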

I PROOF OF PROPOSITION 6

Below we present the proof in a Bayesian framework; the reader is referred to (Maybeck, 1982) for a geometric interpretation via Monte Carlo estimate statistics.

Proof of the minimal variance

To minimize the estimate variance, we need to minimize the main diagonal of the covariance P_new:

trace(P_new) = trace((I − K) P̄_new (I − K)^⊤ + K Σ_ν K^⊤).

This implies that the variance of the estimate x̂_new is minimized when K = P̄_new (Σ_ν + P̄_new)^{-1}.

Proof of the unbiasedness

Suppose that the obsolete estimate x̂ is unbiased, i.e., E[x̂] = x. Then, using Eq. (11), we have

E[x̄_new] = E[x̂ + F∆s] = x + F∆s = x_new.

Because of Eq. (12) and the fact that the measurement noise ν has zero mean, it gives

E[z_new] = E[x_new + ν] = x_new.

Putting everything together, we conclude

E[x̂_new] = E[x̄_new + K(z_new − x̄_new)] = x_new + K(x_new − x_new) = x_new.

This implies that the estimated state x̂_new is unbiased.
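The prediction-correction step analyzed in this proposition can be sketched as follows; this is a minimal illustration in which F, the covariances and all variable names are our own choices, not the paper's code:

```python
import numpy as np

def predict_correct(x_hat, P, F, ds, z_new, Sig_eta, Sig_nu):
    # Prediction: propagate the obsolete state and inflate its covariance.
    x_bar = x_hat + F @ ds
    P_bar = P + Sig_eta
    # Correction: optimal gain K = P_bar (P_bar + Sigma_nu)^-1.
    K = P_bar @ np.linalg.inv(P_bar + Sig_nu)
    x_new = x_bar + K @ (z_new - x_bar)
    I = np.eye(len(x_hat))
    P_new = (I - K) @ P_bar @ (I - K).T + K @ Sig_nu @ K.T  # Joseph form
    return x_new, P_new, K, P_bar

rng = np.random.default_rng(3)
k = 4
x_hat = rng.standard_normal(k)
P = np.diag(rng.uniform(0.1, 1.0, k))
F = np.eye(k)
ds = rng.standard_normal(k)
z = rng.standard_normal(k)
Sig = 1e-4 * np.eye(k)                  # small diagonal covariances

x_new, P_new, K, P_bar = predict_correct(x_hat, P, F, ds, z, Sig, Sig)
# At the optimal gain, the Joseph form collapses to (I - K) P_bar.
print(np.allclose(P_new, (np.eye(k) - K) @ P_bar))
```

The final check is the standard Kalman identity: the Joseph-form covariance equals (I − K)P̄_new only when K takes the minimum-variance value derived above.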

J IMPLEMENTATION DETAILS

In this section, we present the details of our implementation in Section 5, including additional dataset details, evaluation protocols and model architectures for reproducibility. All experiments are conducted on machines with a Xeon 3175X CPU, 128G memory and a P40 GPU with 24 GB memory. The configurations of our environment and packages are listed below:

• Ubuntu 16.04
• CUDA 10.2
• Python 3.7
• Tensorflow 1.15.3
• Pytorch 1.10
• DGL 0.7.1
• NumPy 1.19.0 with Intel MKL

J.1 ADDITIONAL DATASET DETAILS

We use three real-world datasets which are processed in line with (Liang et al., 2018; Steck, 2019): (1) for Koubei, we keep users with at least 5 records and items that have been purchased by at least 100 users; (2) for Tmall, we keep users who click at least 10 items and items which have been seen by at least 200 users; and (3) for Netflix, we keep all of the users and items. In addition, we chose the random seed 9876 when splitting the users into training/validation/test sets.

• Use tanh as the activation.
• Use the inner product between user embedding and item embedding as the ranking score.

GraphSAGE. We use the SAGEConv layer available in DGL for implementation. The detailed architecture description is as below:
• A sequence of two-layer SAGEConv.
• Add self-loops and use batch normalization for the graph convolution in each layer.
• Use ReLU as the activation.
• Use the inner product between user embedding and item embedding as the ranking score.

SGC. We use the SGConv layer available in DGL for implementation. The detailed architecture description is as below:
• One-layer SGConv with two hops.
• Add self-loops and use batch normalization for the graph convolution in each layer.
• Use ReLU as the activation.
• Use the inner product between user embedding and item embedding as the ranking score.

ChebyNet. We use the ChebConv layer available in DGL for implementation. The detailed architecture description is as below:
• One-layer ChebConv with two hops.
• Add self-loops and use batch normalization for the graph convolution in each layer.
• Use ReLU as the activation.
• Use the inner product between user embedding and item embedding as the ranking score.

ARMA. We use the ARMAConv layer available in DGL for implementation. The detailed architecture description is as below:
• One-layer ARMAConv with two hops.
• Add self-loops and use batch normalization for the graph convolution in each layer.
• Use tanh as the activation.
• Use the inner product between user embedding and item embedding as the ranking score.

We also summarize the implementation details of the compared sequential baselines as follows.

SASREC. We use the software provided by the authors for experiments. The detailed architecture description is as below:
• A sequence of two-block Transformer with one head.
• Use a maximum sequence length of 30.
• Use the inner product between user embedding and item embedding as the ranking score.

BERT4REC. We use the software provided by the authors for experiments. The detailed architecture description is as below:
• A sequence of three-block Transformer with eight heads.
• Use a maximum sequence length of 30 with masked probability 0.2.
• Use the inner product between user embedding and item embedding as the ranking score.



To be consistent with (Shuman et al., 2013), u_l (the l-th column of matrix U) is the l-th eigenvector associated with the eigenvalue λ_l, and the graph Laplacian eigenvalues carry a notion of frequency.

Dataset and code links:
• Koubei: https://tianchi.aliyun.com/dataset/dataDetail?dataId=53
• Tmall: https://tianchi.aliyun.com/dataset/dataDetail?dataId=35680
• Netflix: https://kaggle.com/netflix-inc/netflix-prize-data
• SASREC: https://github.com/kang205/SASRec
• BERT4REC: https://github.com/FeiSun/BERT4Rec



Figure 1: (a) Conventional 1-bit matrix completion focuses on recovering missing values in matrix R, while inductive approaches aim to recover a new column y from observations s that are observed at the testing stage. ξ denotes discrete noise that randomly flips ones to zeros. (b) Our GS-IMC approach, which regards y as a signal residing on the nodes of a homogeneous item-item graph, aims to reconstruct the true signal y from its observed values (orange colored) on a subset of nodes (gray shadowed).

Figure 3: (a) The online learning scenario requires the model to refresh its predictions based on newly arriving data ∆s that is one-hot (orange colored). (b) GS-IMC deals with this problem in the graph vertex domain using Eq. (10), while BGS-IMC operates in the graph Fourier domain. The measurement z/z_new is the graph Fourier transform of the prediction ŷ/ŷ_new, and we assume hidden states x/x_new determine these measurements under noise ν. To achieve this, x/x_new should obey the evolution of ŷ/ŷ_new, and thus Eq. (11) represents Eq. (10) under noise η in the graph Fourier domain.

Then, we differentiate the trace of P_new with respect to K:

d trace(P_new)/dK = 2K P̄_new − 2 P̄_new + 2K Σ_ν.

The optimal K which minimizes the variance should satisfy d trace(P_new)/dK = 0, which gives K(Σ_ν + P̄_new) = P̄_new.

Figure 5: Evaluation protocols, where the users in the top block (green) are used for training and the ones in the bottom block (pink) are used for evaluation. (a) Transductive ranking, where the model performance is evaluated on the users already known during model training; (b) inductive ranking, where the model performance is evaluated on users unseen during model training.

Regularization functions, operators, kernels with free parameters γ ≥ 0, a ≥ 2.

Computationally, when P, Σ_η, Σ_ν are diagonal, it takes O(k²) time to compute x̄_new and P̄_new, and O(nk) time for x̂_new and P̂_new. The total time complexity is O(nk + k²), linear in the number of vertices n. Further, Proposition 6 shows that x̂_new is an unbiased and minimum-variance estimate.

Table 8: Comparisons to neural sequential models for the task of inductive top-N ranking on Koubei.

Plugging the choice 1 + R(λ_1)/ϕ* = 2ρ^{1/3} into the bound, we obtain the upper bound E[MSE(y, ŷ)] ≤ C_n²(3ρ^{1/3} − 1)/(4nR(λ_1)) if ρ ≥ 1/8.

FUNDING

* Junchi Yan is the corresponding author, who is also with Shanghai AI Laboratory. The work was in part supported by NSFC (62222607) and STCSM (22511105100).


For the random walk norm, we set ϕ = 1, and a ≥ λ_max is a pre-specified parameter of the random walk regularization.

C ABLATION STUDIES

This study evaluates how GS-IMC and BGS-IMC perform with different choices of the regularization function and the graph definition. In the following, we assume the underlying signal to recover lies in the Paley-Wiener space PW_{λ1000}(G), and hence we only take the first 1000 eigenfunctions whose eigenvalues are not greater than λ_1000 to make predictions.

C.1 IMPACT OF REGULARIZATION FUNCTIONS

Tables 4 and 5 show that for the proposed GS-IMC models, Tikhonov regularization produces the best HR and NDCG results on both Koubei and Netflix, while diffusion process regularization performs best on Tmall. Meanwhile, BGS-IMC with random walk regularization achieves the best HR and NDCG results on Koubei, while Tikhonov regularization and diffusion process regularization are best on Tmall and Netflix. Perhaps more importantly, BGS-IMC consistently outperforms GS-IMC on all three datasets by a margin, which proves the efficacy of the prediction-correction algorithm. We highlight that BGS-IMC can further improve over GS-IMC because it considers Gaussian noise in the Fourier domain, and the prediction-correction update algorithm is capable of providing unbiased and minimum-variance predictions.

G PROOF OF THEOREM 4

Proof. This proof is analogous to that of Theorem 1.1 in (Pesenson, 2009), where we extend their results from the Sobolev norm to a broader class of positive, monotonically increasing functionals.

Proof of the first part of Theorem 4. Suppose that the Laplacian operator Ł has bounded inverse and the fitting error is 0. If y ∈ PW_ω(G) and ŷ_k interpolates y on a set Ω = V − Ω^c, where Ω^c admits the Poincaré inequality ‖φ‖ ≤ Λ‖Łφ‖ for any φ ∈ L²(Ω^c), then y − ŷ_k ∈ L²(Ω^c). At this point, we can apply Lemma 7 with Λ = a and φ = y − ŷ_k, which gives the corresponding inequality for all k = 2^l, l = 0, 1, 2, . . . Since R(λ) is a positive and monotonically increasing function, and the interpolant ŷ_k minimizes the norm ‖R(Ł)^k · ‖, putting everything together concludes the first part of Theorem 4.

Proof of the second part of Theorem 4. Since ΛR(ω) < 1 holds, the corresponding limit vanishes; together with the non-negativity of the norm, this implies the second part of Theorem 4.

Lemma 7 (restated from Lemma 4.1 in (Pesenson, 2009)). Suppose that Ł is a bounded self-adjoint positive definite operator in a Hilbert space L²(G), and ‖φ‖ ≤ a‖Łφ‖ holds true for any φ ∈ L²(G) and a positive scalar a > 0. Then for all k = 2^l, l = 0, 1, . . ., the stated inequality holds true.

Lemma 8 (restated from Theorem 2.1 in (Pesenson, 2008)). A function f ∈ L²(G) belongs to PW_ω(G) if and only if the corresponding Bernstein inequality holds true for all s ∈ R₊.

Published as a conference paper at ICLR 2023.

J.2 EVALUATION PROTOCOLS

In Figure 5, we illustrate the difference between the transductive and inductive ranking evaluation protocols. In the transductive ranking problem, model performance is evaluated on users already seen during training, whereas in the inductive ranking problem it is evaluated on unseen users. It is worth noting that in the testing phase, we sort all interactions of the validation/test users in chronological order, hold out the last interaction for testing, and inductively generate the necessary representations from the remaining data. In a nutshell, we evaluate our approach and the baselines on the challenging inductive next-item prediction problem.
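The leave-last-out split described above can be sketched in a few lines of Python. The helper name `leave_last_out` and the (user, item, timestamp) triple format are our illustrative assumptions, not the authors' data pipeline:

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Sort each test user's interactions chronologically, hold out the last
    one as the prediction target, and keep the rest as the observed input.
    `interactions` is a list of (user, item, timestamp) triples."""
    by_user = defaultdict(list)
    for u, i, t in interactions:
        by_user[u].append((t, i))
    inputs, targets = {}, {}
    for u, events in by_user.items():
        events.sort()                      # chronological order
        *rest, (_, last_item) = events
        inputs[u] = [i for _, i in rest]   # observed history for inference
        targets[u] = last_item             # held-out next item
    return inputs, targets
```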

J.3 EVALUATION METRICS

We adopt hit-rate (HR) and normalized discounted cumulative gain (NDCG) to evaluate model performance. Suppose that the model provides N recommended items R_u for user u, and let T_u denote the items the user has interacted with. Then HR is computed as

    HR@N = 1_{|T_u ∩ R_u|},

where 1_{|Ω|} is equal to 1 if the set Ω is not empty and equal to 0 otherwise. NDCG evaluates ranking performance by taking the positions of correct items into consideration:

    NDCG@N = (1/Z) DCG@N,  DCG@N = Σ_{j=1}^{N} (2^{1[R_u(j) ∈ T_u]} − 1) / log_2(j + 1),

where R_u(j) denotes the j-th item in the ranked list and Z is the normalizing constant that represents the maximum value of DCG@N for T_u.
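Under these standard definitions (binary relevance, log2 position discount), HR and NDCG can be computed as below. This is a plain-Python sketch with hypothetical helper names, not the authors' evaluation script:

```python
import numpy as np

def hr_at_n(recommended, interacted):
    """HR = 1 if any interacted item appears in the top-N list, else 0."""
    return int(len(set(recommended) & set(interacted)) > 0)

def ndcg_at_n(recommended, interacted):
    """Binary-relevance DCG, normalized by the ideal DCG (the constant Z)."""
    gains = [1.0 / np.log2(rank + 2)       # rank is the 0-based position
             for rank, item in enumerate(recommended) if item in interacted]
    ideal = [1.0 / np.log2(rank + 2)       # all correct items ranked first
             for rank in range(min(len(interacted), len(recommended)))]
    z = sum(ideal)
    return sum(gains) / z if z > 0 else 0.0
```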

J.4 GRAPH LAPLACIAN

Let R denote the item-user rating matrix, and let D_v and D_e denote the diagonal degree matrices of vertices and edges, respectively. The graph Laplacian matrix used in our experiments is then defined as

    Ł = I − D_v^{−1/2} R D_e^{−1} R^⊤ D_v^{−1/2},

where I is the identity matrix.
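Under our reading of this definition (each user acts as a hyperedge over the items it rated), the Laplacian can be assembled as below; `hypergraph_laplacian` is a hypothetical helper name, and the exact form in the paper may differ:

```python
import numpy as np

def hypergraph_laplacian(R):
    """Normalized Laplacian from an item-user matrix R:
        L = I - Dv^{-1/2} R De^{-1} R^T Dv^{-1/2}
    where Dv holds vertex (item) degrees and De holds edge (user) degrees."""
    dv = R.sum(axis=1)                              # vertex degrees
    de = R.sum(axis=0)                              # edge degrees
    dv_inv_sqrt = np.where(dv > 0, dv ** -0.5, 0.0)
    de_inv = np.where(de > 0, 1.0 / de, 0.0)
    S = (dv_inv_sqrt[:, None] * R) * de_inv[None, :]        # Dv^{-1/2} R De^{-1}
    return np.eye(R.shape[0]) - S @ (dv_inv_sqrt[:, None] * R).T
```

The resulting matrix is symmetric positive semi-definite with spectrum in [0, 1], which is what the spectral truncation in Appendix C relies on.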

J.5 DISCUSSION ON PREDICTION FUNCTIONS

In the experiments, we focus on making personalized recommendations to the users, so we are interested in the ranks of the items for each user. Specifically, for the top-k ranking problem we choose the items with the k largest predicted ratings. More importantly, our proposed method is also suitable for the link prediction problem, where the goal is to classify whether an edge between two vertices exists or not. This can be done by choosing a splitting point to partition the candidate edges into two parts. There are many different ways of choosing such a splitting point; for example, one can select the optimal splitting point based on the ROC or AUC results on the validation set.
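Top-k selection over predicted ratings can be done with a partial sort in O(n). A small numpy sketch; the `exclude` handling for already-seen items is our illustrative addition, not something the paper specifies:

```python
import numpy as np

def top_k_items(scores, k, exclude=()):
    """Return indices of the k largest predicted ratings, optionally
    skipping items the user has already interacted with."""
    scores = scores.copy()
    scores[list(exclude)] = -np.inf        # never re-recommend seen items
    idx = np.argpartition(-scores, k)[:k]  # O(n) selection of the top k
    return idx[np.argsort(-scores[idx])]   # sort only the k winners by score
```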

J.6 MODEL ARCHITECTURES

As mentioned before, we equip IDCF (Wu et al., 2021) with different GNN architectures as the backbone. Here we introduce the details for them.

GAT. We use the GATConv layer available in DGL for the implementation. The detailed architecture is as follows:
• A sequence of one-layer GATConv with four heads.
• Add self-loops and use batch normalization for the graph convolution in each layer.

GREC. We use the software provided by the authors for the experiments. The detailed architecture is as follows:
• A sequence of six dilated CNN layers with dilation factors 1, 2, 2, 4, 4, 8.
• Set the maximum sequence length to 30 with masking probability 0.2.
• Use the inner product between the user embedding and the item embedding as the ranking score.
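For intuition, the computation performed by one multi-head graph-attention layer can be sketched in plain numpy. This is an illustrative re-implementation of the standard GAT update (per-head projection, LeakyReLU-scored edge softmax, neighbor aggregation, head concatenation), not DGL's GATConv, and all helper names are ours:

```python
import numpy as np

def gat_layer(X, A, Ws, a_ls, a_rs, slope=0.2):
    """One multi-head graph-attention layer.

    X: (n, d) node features; A: (n, n) adjacency; Ws: per-head (d, d') weights;
    a_ls / a_rs: per-head (d',) attention vectors for source / destination."""
    A = A + np.eye(A.shape[0])                     # add self-loops, as in our setup
    mask = A > 0
    heads = []
    for W, al, ar in zip(Ws, a_ls, a_rs):
        H = X @ W                                  # per-head linear projection
        e = (H @ al)[:, None] + (H @ ar)[None, :]  # raw attention logits e_ij
        e = np.where(e > 0, e, slope * e)          # LeakyReLU
        e = np.where(mask, e, -np.inf)             # only attend along edges
        e -= e.max(axis=1, keepdims=True)          # numerically stable softmax
        alpha = np.exp(e)
        alpha /= alpha.sum(axis=1, keepdims=True)
        heads.append(alpha @ H)                    # weighted neighbor aggregation
    return np.concatenate(heads, axis=1)           # concatenate the heads
```

With four (W, a_l, a_r) triples this mirrors the "one-layer GATConv with four heads" configuration above (batch normalization omitted for brevity).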

