COLLABORATIVE FILTERING WITH SMOOTH RECON-STRUCTION OF THE PREFERENCE FUNCTION Anonymous

Abstract

The problem of predicting the rating of a set of users to a set of items in a recommender system based on partial knowledge of the ratings is widely known as collaborative filtering. In this paper, we consider a mapping of the items into a vector space and study the prediction problem by assuming an underlying smooth preference function for each user, the quantization at each given vector yields the associated rating. To estimate the preference functions, we implicitly cluster the users with similar ratings to form dominant types. Next, we associate each dominant type with a smooth preference function; i.e., the function values for items with nearby vectors shall be close to each other. The latter is accomplished by a rich representation learning in a so called frequency domain. In this framework, we propose two approaches for learning user and item representations. First, we use an alternating optimization method in the spirit of k-means to cluster users and map items. We further make this approach less prone to overfitting by a boosting technique. Second, we present a feedforward neural network architecture consisting of interpretable layers which implicitely clusters the users. The performance of the method is evaluated on two benchmark datasets (ML-100k and ML-1M). Albeit the method benefits from simplicity, it shows a remarkable performance and opens a venue for future research. All codes are publicly available on the GitLab.

1. INTRODUCTION

Nowadays, recommender systems (RS) are among the most effective ways for large companies to attract more customers. A few statistics are sufficient to attract attention towards the importance of RS: 80 percent of watched movies on Netflix and 60 percent of video clicks on Youtube are linked with recommendations (Gomez-Uribe & Hunt, 2015; Davidson et al., 2010) . However, the world of RS is not limited to video industry. In general, recommender systems can be categorized into three groups (Zhang et al., 2019) : collaborative filtering (CF), content-based RS, and hybrid RS depending on the used data type. In this paper, we focus on CF, which uses historical interactions to make recommendations. There might be some auxiliary information available to the CF algorithm (like the user personal information); however, a general CF method does not take such side information into account (Zhang & Chen, 2019) . This includes our approach in this paper. Recently, deep learning has found its way to RS and specifically CF methods. Deep networks are able to learn non-linear representations with powerful optimization tools, and their efficient implementations have made then promising CF approaches. However, a quick look at some pervasive deep networks in RS (e.g., He et al. (2017) and Wu et al. (2016)) shows that the utilization of deep architectures is limited to shallow networks. Still, it is unclear why networks have not gone deeper in RS in contrast to other fields like computer vision (Zhang et al., 2019) . We suppose that the fundamental reason that limits the application of a deeper structure is the absence of interpretability (look at Seo et al. (2017) , for example). Here, interpretability can be defined in two ways (Zhang et al., 2019) ; first, users be aware of the purpose behind a recommendation, and second, the system operator should know how manipulation of the system will affect the predictions (Zhang et al., 2018) . This paper addresses both issues by formulating the recommendation as a smooth reconstruction of user preferences. Particularly, our contributions are: • The CF problem is formulated as the reconstruction of user preference functions by minimal assumptions. • An alternating optimization method is proposed that effectively optimizes a non-convex loss function and extracts user and item representations. In this regard, effective clustering methods are proposed and tested. • A feed-forward shallow architecture is introduced, which has interpretable layers and performs well in practice. • Despite the simplicity and interpretability of the methods, their performance on benchmark datasets is remarkable.

1.1. RELATED WORKS

The applied methods in CF are versatile and difficult to name. Below, we explain a number of methods which are well-known and are more related to our work. Multilayer perceptron based models. A natural extension of matrix factorization (MF) methods (Mnih & Salakhutdinov, 2008) are Neural Collaborative Filtering (NCF) (He et al., 2017) and Neural Network Matrix Factorization (NNMF) (Dziugaite & Roy, 2015) . Both methods extend the idea behind MF and use the outputs of two networks as the user and the item representations. The innerproduct makes the prediction of two representations. Although our work has some similarity to this method, we model users by functions and represent these functions in a so-called frequency domain. Thus, user and item representations are not in the same space. AutoEncoder based models. AutoRec (Sedhain et al., 2015) and CFN (Strub et al., 2016) are wellknown autoencoder (AE) structures that transform partial observations (user-based or item-based) into full row or column data. Our method differs from AE structures as our network use item (user) representations and predicts user (item) ratings.

2. SMOOTH RECONSTRUCTION FROM NON-UNIFORM SAMPLES

Rating as the output of the preference function. Most of the time, a finite set of features can characterize users and items that constitute the recommendation problem. Although no two users or items are exactly the same, the number of characterization features can be considerably small without losing much information. Assume that item i is characterized by the vector x i ∈ X ⊂ R d . We further assume that all users observe similar features of an item and user u's ratings are determined by a preference function f u : X → [c min , c max ]. The recovery of a general preference function might need an indefinite number of samples, i.e., observed ratings. However, we do not expect user attitudes to change too much with small changes in an item's feature. E.g., if the price is the determinative factor in someone's preference, small changes in the price must not change the preference over this item significantly (look at figure 1 ).  s[n] = M m=-M s[m]e j2π(mn/N ) (1) So 2M + 1 distinct samples from s would be enough to calculate s. For an analytical approach, it is useful to interpret equation 1 as a discretization of a trigonometric continuous equation: h(x) = M (m=-M ) a m e j2π(mx) , a ∈ C 2M +1 , x ∈ [0, 1) Mirroring. Smoothness usually is used to refer to bandlimited signals around the zero-frequency which can be represented by equation 1. However, we use the smooth finite-length signal to refer to a real-valued finite-length signal that has intuitively smooth behavior in its non-zero domain. figure 2 shows an example. The trigonometric functions in equation 2 can not approximate such signals well even if we shift and scale the domain to [0,1] because the original signal is not periodic. One possible solution to make the trigonometric functions still a good representative for the finitelength signal would be mirroring. figure 2 shows the shifted, scaled, and mirrored signals in 1D space. ! ! 1 Extension to multi-dimensional real-valued mirrored signals. equation 2 will be simplified for a real-valued s just to include cosine terms. One can obtain the extension of equation 2 for real-valued mirrored signals as: h(x) = h(x 1 , x 2 , . . . , x d ) = M m1=0 . . . M m d =0 A m1,m2,. . . ,m d cos π(m 1 x 1 + ... + m d x d ) , (3) where x ∈ [0, 1] d and A is a d-dimensional real tensor. To simplify the notation, we use m T x to refer m 1 x 1 + ... + m d x d and a to refer vectorized A. Starting from m 1 , m 2 , . . . , m d all to be 0, one can put all the possible values of m as the columns of matrix C = [c] (M +1) d ×d with the same order as they appear in vectorizing of A. Now we can rewrite equation 3 with matrix operations: h(x) = (M +1) d k=1 a k cos πC k,: x Vandermonde matrix. Given r non-uniform samples in [0, 1] d , the Fourier coefficients, a, are the solution to the linear system of equations h(x i ) = s i , i = 1, 2, . . . , r. The Vandermonde matrix for this system is defined as: V = cos C[x 1 , x 2 , . . . , x r ] , where the cos(.) is the element-wise cosine, and [. . . ] shows the stacking operator of column vectors. So, the linear system of equations can be shortened by: V T a = s. Here s is the column vector of r observed s i put together. In contrast to the 1D case, there is no simple theorem on the conditions to estimate a correctly. Roughly speaking, this needs the rank of V to be larger than the number of unknowns, i.e., (M + 1) d or in other words, the number of samples (r) should be enough larger than (M + 1) d . Reconstruction of the preference function from the observed ratings. We can state the problem of rating prediction, as the reconstruction of the preference function of each user (f u ) given the observed ratings of that user (I u ). If we assign a d-dimensional characterization vector (x i ) to each item i that is assumed to lie in X = [0, 1] d , we can estimate the user u Fourier coefficients as a u = (V T u ) † s u . At the starting point we do not know how items are distributed in X which means V u will be inaccurate. So, optimizing the reconstruction loss gives fair characterstics for the items in X : min xi,i∈I u∈U V T u (V T u ) † s u -s u 2 . ( )

3. LEARNING REPRESENTATIONS BY MINIMIZING RECONSTRUCTION LOSS

Minimizing equation 6, aside from the non-convexity of the cost function, implicitly involves solving V T u a u = s u , which can in general be an ill-condition system of linear equations, specially when the user u has few recorder ratings. To reliably estimate the Fourier coefficients a u (user representations), group similar users into a number of clusters and use a single representative for each cluster (virtually increasing the number of available ratings). In addition, we further consider a Tikhonov (L 2 ) regularizer to improve the condition number. With this approach, we need to solve min {xi,i∈I},c L = min u∈U V T c(u) a c(u) -s u 2 + λ k∈C a k 2 , s.t. 0 ≤ x i < 1, where c : U → C is the mapping of the users into clusters, C is the set of clusters and V k is the Vandermonde matrix associated with the cluster k (considering all the users in a cluster as a superuser). Hence, V k is a function of {x i , i ∈ ∪ {u: c(u)=k} I u }. The penalty parameter λ shall be tuned via cross validation. Moreover, the Fourier coefficients a k for the cluster k are obtained by: a k = (V k V T k + λI) -1 V k s k , where s k is the vector of all observed ratings in the cluster k. In the sequel, we propose two approaches for minimizing equation 7. In the first approach (Section 3.1), we alternatively find min {xi,i∈I} L and min c L; as min c L requires a combinatorial search, we introduce an approximate algorithm, named k-representation (Section 3.1.1) inspired by the k-means technique. Each iteration of k-representation consists of assigning each user the cluster with the lowest reconstruction loss, and updating the cluster representatives. In the second approach (Section 3.2), we train a neural network to jointly characterize the items and cluster the users. For this, the loss function of equation 7 is modified to accommodate for soft clustering. 3.1 ALTERNATING OPTIMIZATION

3.1.1. k-REPRESENTATION FOR CLUSTERING THE USERS

The total loss in equation 7 can be divided into partial losses of the form L k = V T k a k -s k 2 (9) for each cluster k. We propose the k-representation (Algorithm 1) to minimize the overall cost iteratively. We first randomly set a k s, k ∈ C; then, each user is assigned to a cluster k for which its reconstruction loss (equation 7) is minimized. After dividing the users into clusters, the representative of each cluster is updated via equation 8, and we return again to the clustering task. Similar to the k-means, there is no theoretical guarantee that the method converges to the global optimizer; nevertheless, the overall loss is decreased in each iteration. We shall evaluate the performance of the method in the Section ?? on both synthetic and real data. Boosted k-representation. By introducing both the clustering and L 2 regularization in equation 8, we have improved the robustness of the inverse problem. However, increasing the number of clusters is still a potential issue in estimating the cluster representatives. Here, we propose to learn an ensemble of weak binary clusterings instead of learning all clusters together (Algorithm 2). The idea is to find the residuals of predicted ratings for each user and fit a new clustering to the residuals. Due to the linearity of the prediction, the final representation is the sum of weak representations for each user.  user clustering c : U → C cluster representatives {a k , k ∈ C} procedure k-REPRESENTATION init. a k from N (0, σ 2 ) for all k ∈ C do Calculate V k from x i s repeat for all u ∈ U do c(u) ← argmin k V T u a k -s u 2 for all k ∈ C do user representatives {a u , u ∈ U} procedure BOOSTED k-REPRESENTATION for all u ∈ U do a u ← 0, , s u ← s u for l = 1 : log 2 |C| do {a (l) k }, c (l) ← k-representation({s (l-1) u }) for all u ∈ U do a (l) u ← a (l) c (l) (u) , s (l) u ← s (l-1) u -V T u a (l) u , a u ← a u + a (l) u end for end procedure

3.1.2. OPTIMIZING ITEM CHARACTERISTICS

The second part of the alternating optimization is to minimize equation 7 w.r.t. item characteristics; i.e., {x i , i ∈ I}, subject to 0 ≤ x i < 1. To simplify the equations, we rewrite the total loss in equation 7 with small modification as: L = (u,i)∈O + ρ(l u,i ) where l u,i is |S u,i -v T i a u | 2 and O + is the set of observed ratings. Here, ρ is a saturating function that reduces the effect of outliers. In equation 7, it is simply the identity function; however, we chose ρ(y) = 2( (1 + y) -1) to better bound the unpredictable errors. We use the Trust Region Reflective (TRF) algorithm of the scipy python package as the optimization algorithm, which selects the updating steps adaptively and without supervision. To facilitate the optimization, we need to calculate the gradient of l u,i w.r.t. {x i }. It is obvious that ∇ xj l u,i is zero for all j = i. For j = i and q = 1, ..., d (dimension of X ) we have: ∂ ∂x i,q l u,i = 2 (S u,i -v T i a u ) (M +1) d n=1 πa u,n sin(πC n,: x i )C n,q . Pre-search. Before using the TRF to optimize the loss, we make use of the current function evaluations to update item characteristics. Consider we are in the t th iteration and the values $%" $ %# $ % & X-layer #neurons = ( = 3 ! #," ! #, # ! # ,& Item-one-hot-layer #neurons = ℐ = 7 1 0 0 0 … 0 0 0 $ / $ / $ / … … 012 V-layer #neurons = / + 1 4 User-layer x (t-1) i , V (t-1) i , a (t-1) i from the previous iteration are available. We define x (t-1 2 ) i = x (t-1) argmin j∈I si-V (t-1) j a (t-1) i 2 . ( ) Then, we use x (t-1 2 ) i as the input to TRM and run a few iterations to get {x (t) i }. This pre-search makes the optimization process less prone to stagnate in local minima.

RECONSTRUCTION

Recent advances in deep learning have made the neural networks great tools even for traditional well-known algebraic calculations. Specifically, neural networks can be optimized effectively with various methods and efficient implementations that boost training. Further, some useful techniques like batch normalization, drop-out, etc., are available, which helps to avoid overfitting. In this section, we will reformulate equation 7 for soft clusters and design an architecture that learns items' characteristics and users' representations concurrently. First we modify equation 7 for soft clustering. Consider c : U → C |C| is the assignment function that determines how much each user is belonged to each cluster. We intentionally do not constraint norm of the c and let it to be scaled appropriately for different user. The total reconstruction loss is: L = u∈U ( k∈C c k (u)V T k a k ) -s u 2 + λ k∈C a k 2 (13) figure 3 shows an inspired architecture from 13. It consists of six layers (four hiddens) which three of them are trainable. The observed ratings will be supplied per item. Each neuron corresponds to an item at the input layer, and the input data should be one-hot vectors. The next layer is X-layer, which is a dense layer with d units. We interpret the weights from unit i in the input layer to the X-layer as x i . At the V-layer, the item's representation (x i ) is multiplied by C (EQ) and forms v i . V-layer does not have any trainable parameters. Next, there is the soft-clustering layer, a dense layer with n 1 neurons. We interpret weights going from the V-layer to unit k of the soft-clustering layer as a k , i.e., the representation of the k th soft cluster. Finally, the output layer or equivalently the User-layer is a dense layer with |U| units. Weights going from the soft-clustering layer to unit u determine c(u). The drop-out layer (not depicted) after V-layer has an important role in preventing the network from overfitting. Dropping-out makes sense because the observed ratings usually come with a lot of uncertainty in a real application, i.e., the same user might rate the same item differently when asked for re-rating, and this is the nature of real data. The drop-out layer prevents overfitting by stopping the network from relying on a specific part of the items' characteristics. One way to increase the capacity of the method and capture non-linear interactions is to let the user's representation be a nonlinear function of clusters' reconstructed ratings. i.e., to change equation 13 RECONST-NET 3.2 ML-100k 3 10 10 n 1 = 10, n 2 = 100, n 3 = 10, drop-out= 0.1 ML-1M 4 10 10 n 1 = 15, n 2 = 100, n 3 = 15, drop-out= 0.1 as: L = u∈U k∈C c k (u)g({V T q a q }) -s u 2 + λ k∈C a k 2 here g can be an arbitrary non-linear function. In figure 3 we have proposed g as two additional hidden layers in right panel with tanh activation. Here, we usually choose n 1 = n 3 and equals our expectation from the number of soft clusters in the data. Still one can interpret the last hidden layer as the soft clustering layer.

3.3. COMBINING MULTI PREDICTORS

Combining user-based and item-based methods. Till now, we have assumed that items are mapped to a vector space, and users have preference functions (user-based method), but there is no reason not to consider it reversely (item-based method). A simple way of combining user-based and itembased methods is to do a linear regression from each method's output to the observed ratings. The validation part of the data will be used for estimating the coefficients of regression. We will see that combining user-based and item-based methods significantly improve the prediction of test data. Leveraging ensemble of predictors. A more complicated but effective way of combining is to leverage an ensemble of combined user-based and item-based methods. Consider we have a predictor f (t) (S) at iteration t that predicts observed ratings S for training the model. At each iteration, we calculate the residuals of the predicted ratings and pass it to the next predictor: t) ). The next predictor, f (t+1) , uses S (t+1) for training its model. The final predictor leveraged from f (t) , t = 1, 2, ..., T , uses the sum of all predictions: S (t+1) = S (t) -f (t) (S ( f (S) = T t=1 f (t) (S).

4. EXPERIMENTS

To evaluate the proposed methods, we use three datasets: a synthetic dataset besides the well known ML-100k and ML-1M datasets (Harper & Konstan, 2015) . The details of each dataset can be found in Table 1 . The synthetic data is created to assess our clustering methods; for this purpose, a number of cluster representatives are randomly chosen in the low-frequncy domain (Guaranteed to be smooth) and then, each user is placed randomly around one of the representatives. The distance of the users to the associated representative (within-cluster variance) is varied in different tests. Look at the Appendix for detailed discussion. The MovieLens datasetsfoot_0 ML-100k and ML-1M are two of the most common benchmarks for collaborative filtering. For the ML-100k, we follow the pre-specified train-test split; i.e., 75%, 5%, and 20% of the total available data is used for training, validation, and test, respectively. For ML-1M, these numbers are 90%, 5%, and 5%, respectively. (Kalofolias et al., 2014) 0.996 -Factorized EAE (Hartford et al., 2018) 0.920 0.860 IGMC (Zhang & Chen, 2019) 0.905 0.857 GC-MC (Berg et al., 2017) 0.910 0.832 Bayesian timeSVD++ (Rendle et al., 2019) 0.886 0.816 sRGCNN (Monti et al., 2017) 0.929 -GRALS (Rao et al., 2015) 0 2 . Figure 4 shows the RMSE on two different datasets during the training stage. The validation is reserved for parameter tuning, and performance on test data is reported for both methods. Figure 4a clearly reveals the performance gain achieved by combining the user-and item-based techniques. Further, the stair shape decrease of the training loss (and validation loss) confirms the suitability of leveraging the ensemble of predictors. Performance comparison. We further conduct experiments on ML-100k and ML-1M datasets. We have ignored the available side information in both datasets. Therefore, for the performance comparison in Table 3 , we have included only methods that do not take into account these side information. Although neither of the alternating optimization and RECONST-NET record the best RMSE, they yield very good results despite their simplicity and interpretability.

5. CONCLUSION

In this article, we formulated the rating prediction problem as a reconstruction problem with a smoothness assumption. The proposed methods are all simple and interpretable but show significant performance comparing to state-of-the-art methods. Specifically, we interpreted different layers of the designed network and evaluated our interpretation in synthetic design. The proposed architecture and rich frequency-domain feature can be a basis for future research on interpretable recommender systems. Here, we study their performances via experiments on synthetic data. We recall that the mentioned methods do not explicitly penalize miss-clustering; instead, they minimize the within-cluster reconstruction loss. As a result, we expect these methods to perform fairly well when the clusters are distinguishable. To measure the matching between the identified clusters and the original ones, we employ the Adjusted Rank Index (ARI) (look at Vinh et al. ( 2010)). For two clusterings c 1 and c 2 with the same domain U, if we form the contingency matrix N with the (i, j) element as |{u, c 1 (u) = i, c 2 (u) = j}|, then, ARI defined based on the elements, column and row sums of N , intuitively shows the rate of agreement between the two clusterings if u is randomly chosen from U. It takes the maximum value 1 for identical clusterings and the minimum value 0 when the clusterings are perceived as fully random with respect to each other. As explained, ARI has the advantage of comparing two clusterings even with different number of clusters. To use the ARI metric, we need a hard clustering of the users. For this purpose, we associate each user in our User-layer (soft clustering layer) in Figure 3 to the neuron (cluster) in the last hidden unit with the largest absolute weight. Although this technique violates the main goal in soft clustering, it provides us with a measure of clustering accuracy. In Figure 5 , the performance of k-representation (1), boosted k-representation (2) and modified results from the soft clustering layer (Figure 3 ) are depicted for three scenarios. As expected, we obtain inferior result from the soft-clustering layer; however, as ARI is above zero, the clustering of this layer is not irrelevant. In the left plot in Figure 5 , the ARI curves in terms of the discrimination index of the clusters (the ratio of between-cluster to within-cluster variances) are presented. We observe that the boosted k-representation works better in cases with higher discrimination index, but loses its performance in case of cluttered clusters. The center plot in Figure 5 , shows ARI changes by varying the density in the rating matrix (ratio of the number of observed to non-observed entries). While the k-representation has the best performance at low densities, its performance drops when the density exceeds a threshold; this might be due to the involved regularization term. Finally, in the right plot of Figure 5 , we have changed the number of clusters in all methods; the correct number of clusters is kept fixed at 4. As we see in these plots, none of the k-representation and its boosted version dominate the other one in all regimes.



https://grouplens.org/datasets/movielens/



Figure 2: Mirroring, shifting and scaling.

Figure 3: Neural network inspired from equation 13.

Figure 4: Training process for different methods

Figure 5: The performance of the clustering techniques on synthetic data

Reconstruction of bandlimited 1D signals. Let us start with the simplest case. Consider s[n], n = 0, 1, . . . , N -1, a 1D signal with length N . We call s to have bandwidth M < N 2 if there is a representation ŝ[m], m = -M, -M + 1, . . . , M -1, M that can represent s as:

Algorithm 1 k-representation Input: item characteristics {x i , i ∈ I} available ratings by each user {I u , s u , u ∈ U} number of clusters |C| initialization variance of cluster representations σ 2 L 2 penalty parameter λ. Output:

Update a k via equation 8 until convergence return {a k } and c

Datasets summary



