Untangle: Critiquing Disentangled Recommendations

Abstract

The core principle behind most collaborative filtering methods is to embed users and items in latent spaces, where the individual dimensions are learned independently of any particular item attributes. It is therefore difficult for users to control their recommendations based on particular aspects (critiquing). In this work, we propose Untangle: a recommendation model that gives users control over the recommendation list with respect to specific item attributes (e.g., less violent, funnier movies) that have a causal relationship with user preferences. Untangle uses a refined two-phase training procedure: (i) a (partially) supervised β-VAE that disentangles the item representations, and (ii) a second phase that optimizes the model to generate recommendations for users. Untangle enables critiquing of recommendations based on user preferences without sacrificing recommendation accuracy. Moreover, only a tiny fraction of labeled items is needed to create preference representations disentangled over attributes.

1. Introduction

Figure 1: The Untangle model is trained in two phases. Disentangling phase: the input to the encoder is a one-hot representation of an item (green dotted line), and the obtained representation is disentangled across A attributes. Recommendation phase: the input to the encoder is the set of items the user interacted with (solid red line), and the model recommends new items.

User and item representations form the basis of typical collaborative filtering recommendation models. These representations can be learned through techniques such as matrix factorization (1; 2), or constructed dynamically during inference, e.g., the hidden state of RNNs in session-based recommendation (3; 4). As most standard recommendation models aim solely at increasing the performance of the system, no special care is taken to ensure interpretability of the user and item representations. These representations do not explicitly encode user preferences over item attributes. Hence, they cannot easily be used by users to change, i.e., critique (5), the recommendations. For instance, a user of a recipe recommendation system cannot ask for less spicy recipes, since spiciness is not explicitly encoded in the latent space. Moreover, the explainability of the recommendations provided by such systems is very limited. In this work, we enrich a state-of-the-art recommendation model to explicitly encode preferences over item attributes in the user latent space while simultaneously optimizing for recommendation performance. Our work is motivated by disentangled representations in other domains, e.g., manipulating generative models of images with specific characteristics (6) or text with certain attributes (7). Variational autoencoders (VAEs), particularly β-VAEs (8) (which we adapt here), are commonly used to learn such disentangled representations. Intuitively, they optimize embeddings to capture meaningful aspects of users and items independently.
Consequently, such embeddings are more usable for critiquing. There are two types of disentangling β-VAEs: unsupervised and supervised. In the former, the representations are disentangled into explanatory factors of variation in an unsupervised manner, i.e., without assuming additional information on the existence (or not) of specific aspects. As used in the original β-VAE approach (8), the lack of supervision often results in inconsistent and unstable disentangled representations (9). In contrast, supervised disentangling assumes that a small subset of the data carries side information (i.e., a label or a tag). This small subset is then used to disentangle the representation into meaningful factors (10; 9). As critiquing requires user control using familiar terms/attributes, we incorporate supervised disentanglement in a β-VAE architecture in this work. To explicitly encode preferences over item attributes in the embedding space, we refine the training strategy of the Untangle model. We train in two phases: (i) Disentangling phase: we explicitly disentangle the item representations, using very few supervised labels. (ii) Recommendation phase: we encode the user, using the bag-of-words representation of the items they interacted with, and then generate the list of recommended items. Untangle gives fine-grained control over the recommendations across item attributes compared to the baseline. We achieve this with a tiny fraction of attribute labels over items, while maintaining recommendation performance comparable to state-of-the-art baselines.

2. Related Work

Deep-learning-based autoencoder architectures are routinely used in collaborative filtering and recommendation models (11; 12; 13). In particular, (11; 12) adopt denoising autoencoder architectures, whereas (13) uses variational autoencoders. The internal (hidden) representations generated by the encoders of these models are not interpretable and hence cannot be used for critiquing or for explaining recommendations. Recent work on variational autoencoders across domains has focused on the task of generating disentangled representations. One of the first approaches to that end was β-VAE (8; 14; 15), which essentially enforces a stronger KL-divergence constraint on the VAE objective (multiplying that term by β > 1). Such representations are more controllable and interpretable than those of plain VAEs. A drawback of β-VAE is that the disentanglement of the factors cannot be controlled; the factors are relatively unstable and hard to reproduce, particularly when the factors of variance are subtle (9; 8; 14; 16; 17). This has motivated methods that explicitly supervise the disentangling (10), relying either on selecting a good set of disentangled representations using multiple runs and the label information (18), or on adding a supervised loss term to the β-VAE objective (10). As supervised disentangling methods offer better explainability and can provide control over desired attributes, we build on (19) for better critiquing in VAE-based recommendation systems. In recommender systems, similar methods that utilize side information have recently been used to build models that enable critiquing of recommendations. These models allow users to tune the recommendations across some provided attributes/dimensions. Notable examples are (20; 21), where the models are augmented with a classifier over the features used to control the recommendation.
Adjusting the features at the output of the classifier modifies the internal hidden state of the model and leads to recommendations that exhibit, or do not exhibit, the requested attribute. Note that this style of critiquing is quite different from our approach, which allows for gradual adjustment of the attributes. Moreover, the models in (20; 21) require a fully labeled dataset with respect to the attributes, while our approach only requires a small fraction of labeled data. Unsupervised disentanglement was also recently used to identify, and potentially use, factors of variation in purely collaborative data, i.e., data generated by user interactions with items (22). Note, though, that this method focuses mainly on recommendation performance and does not allow for seamless critiquing, as it is not clear which aspects of the data get disentangled.

3. Untangle

The aim of the Untangle model is to obtain controllable user (and item) representations for better critiquing while also optimizing for recommendation performance. To this end, we incorporate a simple supervised disentanglement technique to disentangle across the item attributes/characteristics over which we want to give users explicit control. We index users with u ∈ {1, . . . , n} and items with i ∈ {1, . . . , m}. X ∈ R^{n×m} is the matrix of user-item interactions (x_ui = 1 if user u interacted with item i, and 0 otherwise). A subset of items is assumed to have binary labels for the attributes in A. Our model is a modified β-VAE architecture with feed-forward encoder and decoder networks. In Figure 1, user u is represented by [z : c], where : stands for concatenation. The z part of the representation is non-interpretable by default, while onto the c part we map (through a refined learning step) the representation of the attributes over which we would like the user to have control. Each dimension of c is mapped to exactly one attribute a; throughout the paper, we refer to the dimension associated with attribute a as c_a. The user representation is sampled from the distribution parameterized by the encoder q_φ: q_φ([z : c] | x_u*) = N(µ_φ(x_u*), diag(σ_φ(x_u*))). The input to the encoder is the bag-of-words representation of the items u interacted with, i.e., the u-th row x_u* of the matrix X. Given the user representation [z : c], the decoder produces a probability distribution over the m items, π([z : c]) ∝ exp(f_dec_θ([z : c])). The likelihood typically used in recommender-system settings (3; 23; 24; 25) is the multinomial likelihood: log p_θ(x_u* | [z : c]) = Σ_i x_ui log π_i([z : c]).
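As a concrete illustration, the sampling step and the multinomial log-likelihood above can be sketched in NumPy as follows; the function names and toy values are ours, not from the paper, and the decoder is reduced to a fixed logit vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_representation(mu, log_sigma):
    """Reparameterized sample [z:c] ~ N(mu, diag(sigma^2))."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def multinomial_log_likelihood(x_u, logits):
    """log p(x_u | [z:c]) = sum_i x_ui * log pi_i, where pi is a softmax
    over the m items and the logits stand in for f_dec([z:c])."""
    log_pi = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(np.sum(x_u * log_pi))

# Toy example: m = 5 items; the user interacted with items 0 and 3.
x_u = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, 0.1, -1.0, 1.5, 0.0])
ll = multinomial_log_likelihood(x_u, logits)
```

The likelihood is maximized when the decoder places its probability mass on the items the user actually interacted with.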

3.1. Learning

Training is conducted in two phases, the recommendation phase and the disentanglement phase, as outlined in Algorithm 1.

Recommendation Phase

The objective in this phase is to optimize the encoder (parameterized by φ) and the decoder (parameterized by θ) to generate personalized recommendations. We train our model with the following objective:

(1) L(x_u*, θ, φ) ≡ E_{q_φ([z:c]|x_u*)}[log p_θ(x_u* | [z : c])] − β KL(q_φ([z : c] | x_u*) || p([z : c]))

Intuitively, this is the negative reconstruction error minus the Kullback-Leibler divergence enforcing the posterior distribution to be close to the Gaussian prior p([z : c]) = N(0, I_d). The diagonal covariance matrix enforces a degree of independence among the individual factors of the representation. Consequently, increasing the weight of the KL-divergence term with β > 1 strengthens the feature-independence criterion, leading to disentangled representations. This ensures that even in the recommendation phase, the learnt user representations are nudged towards disentanglement.

Disentanglement Phase

Attribute labels are available at the item level, for a subset of items. In this phase, we first obtain the item representation in the user latent space (as depicted in the highlighted green box in Figure 1): we pass the one-hot encoding 1_i of an item through the encoder and obtain its representation in the latent user space. We then disentangle the obtained representation using the following objective:

(2) L(1_i, θ, φ) ≡ E_{q_φ([z:c]|1_i)}[log p_θ(1_i | [z : c])] − β KL(q_φ([z : c] | 1_i) || p([z : c])) + γ E_{q_φ(c|1_i)}[l(q_φ(c | 1_i), a)]

Algorithm 1: Untangle: Training
Data: X ∈ R^{n×m} containing user-item interactions, with a subset of items having labels for the A attributes
initialize model parameters: Encoder(φ), Decoder(θ)
do
    if is_disentangle then                       // Disentangle representations
        1_i ← random mini-batch from the set of items labelled with A
        [z : c] ← sample from N(µ_φ(1_i), diag(σ_φ(1_i)))
        x̂_i* ← Decoder([z : c])
        compute gradients ∇L_φ, ∇L_θ using Objective 2
        φ ← φ + ∇L_φ ; θ ← θ + ∇L_θ
    end
    if is_recommend then                         // Recommend items
        x_u* ← random mini-batch from the dataset
        [z : c] ← sample from N(µ_φ(x_u*), diag(σ_φ(x_u*)))
        x̂_u* ← Decoder([z : c])
        compute gradients ∇L_φ, ∇L_θ using Objective 1
        φ ← φ + ∇L_φ ; θ ← θ + ∇L_θ
    end
while the model has not converged

As in (10), we modify the β-VAE objective (Objective 1) to incorporate a classification loss over the factors c across which we disentangle. This loss penalizes discrepancies between the attribute-label prediction from factor c_a and the label of attribute a, nudging the disentanglement for each attribute to happen over the corresponding factor c_a.
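The per-batch losses of the two phases can be sketched as follows, using the closed-form KL divergence between a diagonal Gaussian and N(0, I). A binary cross-entropy stands in for the supervised loss l, which the paper leaves abstract; all names are ours, and the sketch computes loss values only (no gradients):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    sigma2 = np.exp(2 * log_sigma)
    return float(0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2 * log_sigma))

def bce(c_logits, labels):
    """Stand-in for the supervised attribute loss l(q(c|1_i), a):
    binary cross-entropy, one attribute per dimension of c."""
    p = 1.0 / (1.0 + np.exp(-c_logits))
    return float(-np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

def untangle_loss(neg_ll, mu, log_sigma, beta,
                  attr_logits=None, attr_labels=None, gamma=0.0):
    """Minimized loss: Objective 1 (recommendation phase) when no attribute
    labels are given, Objective 2 (disentanglement phase) otherwise."""
    loss = neg_ll + beta * kl_to_standard_normal(mu, log_sigma)
    if attr_labels is not None:
        loss += gamma * bce(attr_logits, attr_labels)
    return loss
```

With mu = 0 and sigma = 1 the KL term vanishes, so the recommendation-phase loss reduces to the reconstruction term; the disentanglement phase adds the γ-weighted attribute loss on top.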

4. Datasets

Movielens Dataset: We use the Movielens-1m and Movielens-20m datasets (26), which contain 1 million and 20 million user-movie interactions, respectively. For the latter, we filter out movies with fewer than 5 ratings and users who rated ≤ 10 movies. We use the relevance scores provided with the Movielens dataset for 10,381 movies across 1,000 different tags to select attributes for disentangling; e.g., the movie Mission: Impossible has a high relevance score (0.79) for the action tag. We take the top 100 tags by mean relevance score across all movies. Among these 100 tags, some pairs, like (funny, very funny), are by definition entangled. Therefore, to identify distinct tags, we cluster these 100 tags (each a vector in R^10381 over the movies) into 20 clusters using K-means. Finally, we select a subset of these 20 clusters for disentangling, as given in Table 1. We assign the new clustered tag (Table 1, Column 1) to a movie if the average relevance score (the mean of the relevance scores over the tags in the corresponding cluster) is higher than 0.5.
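The cluster-assignment rule above can be sketched as follows; the relevance values and cluster indices are toy placeholders, not actual Movielens data:

```python
import numpy as np

# Toy relevance matrix (movies x tags), values in [0, 1].
relevance = np.array([
    [0.79, 0.70, 0.10],   # an action-heavy movie
    [0.20, 0.15, 0.90],   # a comedy
])
action_cluster = [0, 1]   # column indices of the tags grouped into one cluster

# Average relevance over the cluster's tags, thresholded at 0.5.
avg_relevance = relevance[:, action_cluster].mean(axis=1)
has_clustered_tag = avg_relevance > 0.5
```

Only movies whose mean relevance over the cluster's tags exceeds 0.5 are labelled with the clustered tag.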

Goodreads Dataset:

The GoodReads dataset (27) contains user-book interactions for different genres. We use the Children and Comics genres to evaluate our model. We filter out items rated ≤ 5 times and users who rated ≤ 10 books. The final statistics are given in Appendix A. We extract the tags for disentangling from the user-generated shelf names, e.g., historical-fiction, to-read. We retrieve the top 100 shelf names. Some tags (like "books-i-have") are not useful for revising recommendations; therefore, we only consider item attributes that all the authors deemed informative for critiquing recommendations. As this set still contains correlated attributes such as historical-fiction and fiction, we select a subset for disentangling. The selected attributes, with the number of books in which each attribute is present, are {horror: 1080, humor: 9318, mystery: 3589, romance: 1399} for Goodreads-Children and {adventure: 8162, horror: 5518, humor: 8314, mystery: 5194, romance: 7508, sci-fi: 7928} for Goodreads-Comics.

5. Evaluation Metrics

We evaluate Untangle on three criteria: (i) the quality of the recommended items, (ii) the extent of disentanglement, and (iii) the control/critiquing enabled by the disentangled representations.

Ranking-Based Metrics: We evaluate the quality of the recommended items using two ranking-based metrics, Recall@k and normalized discounted cumulative gain (NDCG@k). The latter is rank-sensitive, whereas Recall@k weighs each relevant item in the top-k equally:

Recall@k := ( Σ_{i=1}^{k} I[item[i] ∈ S] ) / min(k, |S|)

DCG@k := Σ_{i=1}^{k} ( 2^{I[item[i] ∈ S]} − 1 ) / log(i + 1)

NDCG@k normalizes DCG@k by the largest possible DCG@k.

Disentanglement Metrics: We use the Disentanglement and Completeness metrics introduced in (28). Disentanglement measures the extent to which each dimension captures at most one attribute; e.g., if a dimension captures all attributes, its disentanglement score is 0. We compute the importance p_aj of the a-th attribute on the j-th dimension of [z : c] ∈ R^d with gradient-boosted trees, as in (9). Using the p_aj scores, the disentanglement score is defined as:

H_|A|(P_j) = − Σ_{a=0}^{|A|−1} p_aj log_|A| p_aj ,   D_j = 1 − H_|A|(P_j)

D = Σ_{j=0}^{d−1} ρ_j D_j ,   ρ_j = ( Σ_{a=0}^{|A|−1} p_aj ) / ( Σ_{j=0}^{d−1} Σ_{a=0}^{|A|−1} p_aj )

That is, we compute the entropy H_|A|(P_j) of the attribute importances for the j-th dimension; the disentanglement score of dimension j is 1 minus this entropy, and the final disentanglement score of the system is the weighted average of D_j over all d dimensions, where ρ_j is the dimension's relative importance.

Completeness measures the extent to which an attribute a is encoded in a single dimension of [z : c]. For a 16-dimensional latent representation and 2 attributes, if 8 dimensions encode attribute a_1 and the other 8 encode a_2, Disentanglement is 1 but Completeness is only 0.25.
Completeness is defined as:

H_d(P_a) = − Σ_{j=0}^{d−1} p_aj log_d p_aj ,   C_a = 1 − H_d(P_a)

C = Σ_{a=0}^{|A|−1} ρ_a C_a ,   ρ_a = ( Σ_{j=0}^{d−1} p_aj ) / ( Σ_{a=0}^{|A|−1} Σ_{j=0}^{d−1} p_aj )

Controller Metric: We propose a simple metric to quantify the extent of control that the disentangled dimension c_a provides over the recommendations when critiquing attribute a. With supervised disentanglement, the mapping between the dimensions c of the latent representation and the attributes across which we disentangle is known. These dimensions allow the user to control/critique the corresponding attribute in the generated recommendations; for instance, less violence can be requested by reducing the value of the corresponding dimension (violence) in c. We evaluate this by probing whether the items in which the attribute is present (S_a) are ranked higher when the dimension value c_a is amplified by a factor g in the user representation. We extract the items I_a(g) recommended by the decoder for the modified user representation in which only c_a is replaced by g × c_a, and compare I_a(g) against S_a using any of the ranking-based metrics described above. We further vary g over a range [−G, G] and study whether the ranking of S_a improves. The Controller Metric is defined as:

Controller_Metric(k, G) := |Recall@k(I_a(G), S_a) − Recall@k(I_a(−G), S_a)| / Recall@k(I_a(−G), S_a)

To compute the Controller Metric for a system, we take the median across all the attributes disentangled in c. Note that the metric value depends on k and on the chosen range.
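A minimal sketch of Recall@k and the Controller Metric built on it; the toy lists are ours, and note that the metric is undefined when the denominator Recall@k(I_a(−G), S_a) is zero:

```python
def recall_at_k(ranked_items, relevant, k):
    """Recall@k = |top-k ∩ S| / min(k, |S|)."""
    hits = sum(1 for it in ranked_items[:k] if it in relevant)
    return hits / min(k, len(relevant))

def controller_metric(rank_pos, rank_neg, relevant, k):
    """|Recall@k(I_a(+G)) - Recall@k(I_a(-G))| / Recall@k(I_a(-G)),
    where rank_pos / rank_neg are the recommendation lists obtained
    after scaling c_a by +G and -G respectively."""
    r_pos = recall_at_k(rank_pos, relevant, k)
    r_neg = recall_at_k(rank_neg, relevant, k)
    return abs(r_pos - r_neg) / r_neg  # undefined if r_neg == 0

relevant = {1, 2, 3}            # S_a: items carrying attribute a
rank_pos = [1, 2, 3, 7, 8]      # after amplifying c_a, S_a rises to the top
rank_neg = [7, 8, 1, 9, 4]      # after suppressing c_a
score = controller_metric(rank_pos, rank_neg, relevant, k=3)
```

Here Recall@3 moves from 1/3 to 1 between the two lists, giving a controller-metric score of 2.0.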

6. Results and Discussions

Recommendation and Disentanglement Performance: We train the Untangle model with the parameter settings given in Appendix B. We compare Untangle with the MultiDAE and MultiVAE models (13). We also compare against a stronger disentanglement baseline, β-VAE, which disentangles the representation in an unsupervised way. We present our results in Table 2. Note that the supervised disentanglement results in Table 2 were trained with 300 (1%), 1030 (5%), 1500 (5%), and 1550 (5%) labelled items for Movielens-1m, Movielens-20m, Goodreads-Children, and Goodreads-Comics, respectively. We observe that our model's performance on the ranking-based metrics (Recall@k and NDCG@k) is comparable to the baselines across all datasets; disentangling the latent representation thus does not hurt recommendation performance. We also quantify disentanglement using the Disentanglement and Completeness metrics discussed in Section 5. Table 2 shows that the disentanglement achieved by all the mentioned strategies is significantly higher than that of the baselines: disentangling with a tiny fraction of labeled items leads to a significant gain compared to β-VAE. We further evaluate the controllability of the disentangled representations by computing the Controller Metric, which measures the control exerted through variation of the attribute dimension c_a. We use the multiplicative range [−150, +150] to amplify c_a and measure the ranking performance using Recall@10 across this range; the rest of the representation remains unchanged. We observe significantly higher controllability for the Untangle model compared to the baseline approaches, especially on the Movielens-20m and Goodreads-Comics datasets.
By reducing c_a we can diminish the presence of items with attribute a in the recommendation list, and by gradually increasing the magnitude of c_a we can increase the presence of items with this attribute, up to saturation.

Critiquing Recommendations: The primary aim of our model is to obtain controllable representations for critiquing. Having quantified controllability with the Controller Metric, we now analyze the incremental impact of changing the attribute dimension. In this analysis, we visualize the effect on the recommendations of adjusting the disentangled factor c_a of each attribute a by a multiplicative factor g, shown in Figure 2 and Figure 3 for the baseline MultiVAE and for Untangle. Note that for the baseline (MultiVAE), we adjust the dimension with the highest feature-importance score for attribute a, computed using a gradient-boosting classifier. For the movies domain (Figure 2), we observe that for MultiVAE (row 1) the variation of c_a has no clear correlation with the presence or absence of items with the attribute in the recommendations. In contrast, for the Untangle model we consistently observe a significant and gradual variation across all the explicitly disentangled attributes A; even for subtle attributes like suspense, we obtain the complete range of Recall@10 from 0.0 to 1.0. We observe similar results for the Goodreads-Comics dataset (Figure 3), where we again obtain a gradual and significant change (of approximately 1) across all the disentangled attributes.

Correlation between Relevance Scores and c_a: We have observed that disentangling item representations leads to fine-grained control for critiquing. We further verify whether the achieved controllability stems from a high correlation between the factor c_a and the true relevance score of attribute a across movies in the Movielens-20m dataset. We randomly sample 500 movies and obtain their latent representations from the encoder.
In Figure 4, we plot the obtained c_a value against the true relevance score for the attribute action.

Fewer Labels for Disentanglement: One of the advantages of Untangle is that it disentangles with very few labels. We train Untangle with decreasing numbers of labeled items; each point in Figure 5 is an average over 5 runs with different random seeds. For Movielens-20m, just 1% of attribute labels yields a disentanglement score of 0.51, which gradually increases up to 0.92 when training with all labels. For Goodreads-Comics, with 1% of the books labelled we achieve 0.52 disentanglement, which gradually increases to 0.93 when the model is trained with all labels. Note that even with 1% labelled items, the disentanglement and completeness scores obtained are significantly higher than those of the β-VAE model (0.21 and 0.19 on Movielens-20m and Goodreads-Comics, respectively).

Controllable Attributes: The above analysis established that Untangle yields controllable representations. In this experiment, we examine whether the controllability is restricted to the chosen set of attributes. To this end, we apply Untangle to a larger set of tags for the Movielens-20m dataset. We cluster all 1181 tags present in the dataset into 50 clusters using K-means, following the clustering strategy described in Section 4. We then evaluate the controllability of each clustered tag b: we explicitly encode the corresponding clustered tag b using Untangle with 5% of labelled items, and obtain its controller-metric score across 5 runs. In each run, we sub-sample four clustered tags out of the 40 to be disentangled alongside the corresponding clustered tag b; this models the impact of disentangling a given attribute together with other attributes present in the dataset. We find that, across the 40 clustered tags, we obtain a controller-metric score > 11.0 for over 21 tags.
Some of the attributes that do not have a high controller-metric score include 80s, crappy, and philosophical; these attributes are also unlikely to be critiqued by users. The most controllable and least controllable tags are listed in Appendix D.

7. Conclusion

Untangle achieves the goals we set: it provides control and critiquing over the user's recommendations across a set of predefined item attributes. It does so without sacrificing recommendation quality, and it needs only a small fraction of labeled items.



(a) MultiVAE:Sad (b) MultiVAE:Romantic (c) MultiVAE:Suspense (d) MultiVAE:Violence (e) Untangle:Sad (f) Untangle:Romantic (g) Untangle:Suspense (h) Untangle:Violence

Figure 2: Control over recommendations when the factor value c_a is adjusted by a multiplicative factor g ∈ [−150, 150]. Recommendation lists are evaluated by Recall@{5, 10, 20}; relevance is determined by the presence of attribute a in the retrieved items. We compare MultiVAE (top) with the Untangle model (bottom) for sad, romantic, suspense, and violence on ML-20m.

(a) MultiVAE:Adventure (b) MultiVAE:Sci-Fi (c) MultiVAE:Mystery (d) MultiVAE:Humor (e) Untangle:Adventure (f) Untangle:Sci-Fi (g) Untangle:Mystery (h) Untangle:Humor

Figure 3: We compare MultiVAE (top) with the Untangle model (bottom) for the adventure, sci-fi, mystery, and humor attributes on Goodreads-Comics, using the same analysis as in Figure 2.

Figure 4: Correlation between the learnt dimension value c_a and the true relevance score across 500 movies for Movielens-20m.

We can infer from Figure 4 that the representations obtained from Untangle have a high Pearson correlation of 0.53, compared to the MultiVAE model (Pearson correlation: −0.03). The graphs for other attributes/tags are presented in Appendix C.

Figure 5: Variation of the Disentanglement and Completeness metrics when the model is trained with fewer labels, for Movielens-20m and GoodReads-Comics.

(a) MultiVAE:Romantic (b) MultiVAE:Sad (c) MultiVAE:Suspense (d) MultiVAE:Violence (e) Untangle:Romantic (f) Untangle:Sad (g) Untangle:Suspense (h) Untangle:Violence

Figure 6: We compare MultiVAE (top) with the Untangle model (bottom) on the correlation between factor c_a and the true relevance scores.

Table 1: Each cluster was manually assigned a human-readable label. Column 2 lists the number of movies that had a high relevance score for the tags in each cluster, and Column 3 lists some of the tags present in each cluster.

Table 2: Recommendation and disentanglement performance on the Movielens-(1m, 20m) and Goodreads-(Comics, Children) datasets, on the corresponding test splits.

A Dataset Statistics

The numbers of interactions, users, and items for the Movielens and Goodreads datasets are given in Table 3.

B Implementation Details

We divide the set of users into train, validation, and test splits; the validation and test splits each consist of 10% of the users, across all datasets. For each user in the validation and test splits, we use only 80% of the items they rated to learn the user representation; the remaining 20% are used to evaluate the model's performance. This strategy is similar to the one used by (13). For all experiments, the user's latent representation is restricted to 32 dimensions. The encoder and decoder consist of two layers with [600, 200] and [200, 600] hidden units, respectively, each with ReLU activation. We tune the hyperparameters β and γ over the values [5, 10, 50] and [5, 10, 50, 500], respectively. The threshold M for identifying movies in which an attribute is present is set to 0.5 for Movielens-20m and 0.4 for MovieLens-1m. All models are run for up to 50 epochs. We select the best model based on its validation performance on both NDCG@100 and the disentanglement score. We select less than 5% of items for the supervised β-VAE using stratified sampling.
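The per-user 80/20 fold-in split described above can be sketched as follows; the helper name and the toy item ids are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def fold_in_split(item_ids, frac=0.8):
    """For a held-out user: a random 80% of their rated items is fed to the
    encoder to build the user representation, and the remaining 20% is held
    back to score the resulting recommendations."""
    perm = rng.permutation(item_ids)
    cut = int(len(perm) * frac)
    return perm[:cut], perm[cut:]

# Toy user who rated 20 items.
fold_in, held_out = fold_in_split(np.arange(20))
```

The model never sees the held-out 20% at inference time, so ranking them highly is evidence of generalization rather than memorization.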

C Correlation between dimension value c a and true relevance scores across items

We compare the dimension value c_a associated with an attribute a to the true relevance scores present in the Movielens-20m dataset. We show in Figure 6 that, across all tags, the correlation is consistently higher for Untangle than for MultiVAE.

D Controllable Attributes

Using Untangle, we identify the clustered tags that are most controllable for revising user recommendations. We list some of the most and least controllable tags in Table 4, together with the absolute recall difference obtained for each cluster and the tags it contains.

