IEDR: A CONTEXT-AWARE INTRINSIC AND EXTRINSIC DISENTANGLED RECOMMENDER SYSTEM

Abstract

Intrinsic and extrinsic factors jointly affect users' decisions in item selection (e.g., click, purchase). Intrinsic factors reveal users' real interests and are invariant across different contexts (e.g., time, weather), whereas extrinsic factors can change w.r.t. different contexts. Analyzing these two factors is an essential yet challenging task in recommender systems. However, in existing studies, factor analysis is either largely neglected, or designed for a specific context (e.g., the time context in sequential recommendation), which limits the applicability of such models. In this paper, we propose a generic model, IEDR, to learn intrinsic and extrinsic factors from various contexts for recommendation. IEDR contains two key components: a contrastive learning component, and a disentangling component. The two components collaboratively enable our model to learn context-invariant intrinsic factors and context-based extrinsic factors from all available contexts. Experimental results on real-world datasets demonstrate the effectiveness of our model in factor learning and show a significant improvement in recommendation accuracy over the state-of-the-art methods.

1. INTRODUCTION

Recommender systems aim to predict the probability of a user selecting a given item (e.g., click, purchase). This prediction is challenging because each decision is jointly affected by multiple factors (Ma et al., 2019). Psychological research has revealed that users' decision making is mainly influenced by two kinds of factors: intrinsic and extrinsic factors (Bénabou & Tirole, 2003; Vallerand, 1997). An intrinsic factor is an internal motivation for inherent satisfaction, which is often stable for an individual. In contrast, an extrinsic factor is a contextual motivation triggered by the environment (external stimulation), and it often varies among different contexts (e.g., weather, time) (Ryan & Deci, 2000). For example, on a day with heavy rain, a user decides to take an Uber (a taxi-calling app) to work. In this case, the choice of Uber over other taxi-calling apps is because the user is more comfortable with this app's user interface (intrinsic factor), while the choice of taking a ride to work is motivated by the weather condition (extrinsic factor). Although the importance of capturing these factors in recommender systems has been recognized, their full potential has not been explored by existing work. (1) Some studies neglect intrinsic and extrinsic factor disentangling, and the final prediction mainly relies on learning entangled representations (Barkan & Koenigstein, 2016; Covington et al., 2016; Wu et al., 2019). With the intrinsic and extrinsic factors entangled behind each decision, the real factors that drive the decision may be incorrectly inferred, resulting in suboptimal recommendations (Wang et al., 2020). (2) Some studies learn intrinsic and extrinsic factors, but only under a specific context.
For example, some sequential recommendation models leverage the time context (order sequence) to learn intrinsic and extrinsic factors (which they call long- and short-term interests) (Hidasi et al., 2016; Yu et al., 2019b); some point-of-interest recommendation models leverage the spatial context (geographic distance) to learn the two factors (Li et al., 2017; Wu et al., 2020). In such models, the factor learning approaches are domain-specific, so it would be difficult to generalize them to other contexts. Meanwhile, the factors may be influenced by multiple contexts, so focusing on a single context may result in inferior factor learning. Therefore, it remains an open question how to effectively incorporate various kinds of context information for learning intrinsic and extrinsic factors in recommender systems. Focusing on this question, we propose a generic recommendation framework that can learn intrinsic and extrinsic factors from various contexts. We first formally define context-agnostic intrinsic and extrinsic factors for recommendation tasks. Following these definitions, we propose an Intrinsic-Extrinsic Disentangled Recommendation (IEDR) model, which contains two modules: a recommendation prediction (RP) module and a contrastive intrinsic-extrinsic disentangling (CIED) module. For each user-item interaction, the RP module constructs all the contexts as a graph (context graph), and the context representation is obtained by learning over this graph. The same procedure is applied to obtain the user and item representations from their attributes (e.g., user gender, item category), respectively. Then, the intrinsic and extrinsic factors are learned from these representations for the user and item perspectives. Meanwhile, the CIED module contains two components: a contrastive learning component that learns context-invariant intrinsic factors, and a disentangling component that disentangles the intrinsic and extrinsic factors via mutual information minimization.
The two components jointly enable IEDR to learn intrinsic and extrinsic factors. In this paper, we make the following contributions: • To better analyze the factors influencing users' decisions, we formalize context-agnostic intrinsic and extrinsic factors for recommender systems. Following these definitions, we propose IEDR to learn disentangled intrinsic and extrinsic factors from various contexts for recommendation. • IEDR comprises a context-invariant contrastive learning component and a mutual information minimization-based disentangling component to effectively disentangle the learned factors. • Extensive experiments on real-world datasets show that (1) IEDR significantly outperforms state-of-the-art baselines when various contexts are available, and (2) our proposed CIED module can successfully learn intrinsic and extrinsic factors.

2. RELATED WORK

This section summarizes current research related to our work on recommender systems and contrastive learning.

Feature interaction modeling. Many recommender systems leverage feature interactions to improve prediction accuracy. One of the most common techniques is the factorization machine (FM) (Rendle, 2010), which models feature interactions through dot products and has achieved great success. Recent studies extend FM with deep neural networks for more powerful feature interaction modeling (Xiao et al., 2017; He & Chua, 2017; Yu et al., 2019a). The Wide & Deep model (WDL) (Cheng et al., 2016) proposes a framework that combines shallow and deep modeling of features for recommendation. Guo et al. (2017) combine FM and WDL by replacing the shallow part of WDL with an FM model. Su et al. (2021) leverage the relational reasoning power of graph neural networks for feature interaction modeling. However, these models do not incorporate context information for better factor analysis; we overcome this issue by leveraging such information to learn disentangled intrinsic and extrinsic factors for recommendation.

Factor disentanglement. Intrinsic and extrinsic factors are considered the two basic factors behind individual decision making in psychological research (Ryan & Deci, 2000; Bénabou & Tirole, 2003; Vallerand, 1997). Recent recommender systems have borrowed the idea of capturing these two factors to achieve more accurate recommendation. For example, in sequential recommendation, Hidasi et al. (2016) are the first to leverage recurrent neural networks to capture users' long- and short-term (LS-term) interests from their interacted item sequences. Yu et al. (2019b) propose a time-aware controller to capture the differences between LS-term interests for more accurate interest learning. Zheng et al. (2022) further emphasize the disentanglement between LS-term interests at different time scales.
In point-of-interest recommendation, studies leverage the spatial context to capture the intrinsic and extrinsic factors (Li et al., 2017; Wu et al., 2020). However, all of the above studies focus on specific contexts. As a result, their factor learning approaches are hard to apply to other recommendation domains, which may result in suboptimal solutions when other contexts jointly influence these factors. Some studies learn users' multiple factors without knowing the meaning of each factor (i.e., implicit factors). They first define the number of factors (e.g., 4) to be learned, and then disentangle the representations of each pair of factors (Ma et al., 2019; Wang et al., 2020). These models only ensure that the learned factor representations are disentangled, but cannot guarantee that they refer to important factors. Our IEDR model instead incorporates various contexts for explicit intrinsic and extrinsic factor learning.

Contrastive learning. Contrastive learning has achieved great success in computer vision (Chen et al., 2020; Chuang et al., 2020; Khosla et al., 2020; Tian et al., 2020; Chen & He, 2021) and natural language processing (Oord et al., 2018; Yang et al., 2019; Gao et al., 2021; Gunel et al., 2021). Recently, it has also attracted attention in recommender systems. Yao et al. (2021) conduct contrastive learning on users and items, respectively, in a two-tower framework to learn robust user and item representations. In addition, Wu et al. (2021) propose a contrastive learning framework on a user-item bipartite graph to capture robust high-order relationships between users and items. Lin et al. (2021) and Jiang et al. (2021) leverage contrastive learning to eliminate popularity bias. In contrast, we propose a contrastive learning method that learns intrinsic factor representations invariant to the context.

3. PROBLEM STATEMENT AND DEFINITIONS

Let U, V, and C denote the user set, item set, and context set, respectively. Each user u ∈ U consists of a set of user features u = {z^u_1, z^u_2, ..., z^u_p} (e.g., user ID, gender). Similarly, each item v ∈ V is represented by a set of item features v = {z^v_1, z^v_2, ..., z^v_q} (e.g., brand, color). A context c ∈ C is a set of context features c = {z^c_1, z^c_2, ..., z^c_m}, denoting the context state when a user selects an item (e.g., weather, daytime). Let D be a dataset containing N instances (i.e., data samples) of (u, v, c), each with a corresponding label y ∈ {1, 0} indicating whether or not the user u selects the item v under the context c. The recommendation task can be formulated as predicting the selection probability y′ = p(u, v, c). In our proposed IEDR model, the intrinsic factor o_in and the extrinsic factor o_ex are explicitly inferred for both users and items, and jointly leveraged to perform the prediction. Next, we formally define intrinsic and extrinsic factors. We believe these two factors exist from both the users' and the items' perspectives. This is reasonable since a user selecting an item not only relates to the factors (motivations) of the user, e.g., preferring a ride (intrinsic factor) over walking to work on a rainy day (extrinsic factor), but also relates to the factors (attractiveness) of the item, e.g., a taxi-calling app that has a comfortable user interface (intrinsic factor) and offers a discount (extrinsic factor). In the following, we define intrinsic and extrinsic factors from the users' perspective only, as the definitions from the items' perspective are similar.

Definition 1 (Intrinsic Factor and Extrinsic Factor). Consider a user u and a set of contexts C; an intrinsic factor of the user u is a factor that is invariant to the contexts in C, i.e., f_in(u, c) = f_in(u, c′), where f_in is a function learning intrinsic factor representations, and c and c′ are two arbitrary contexts in C.
On the other hand, an extrinsic factor of the user u is a factor that changes w.r.t. the context, i.e., there exist contexts c and c′ in C such that f_ex(u, c) ≠ f_ex(u, c′), where f_ex is a function learning extrinsic factor representations.
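Definition 1 can be sanity-checked numerically. The snippet below uses toy linear maps as hypothetical stand-ins for f_in and f_ex (they are not the model's learned functions); it only illustrates the invariance property that separates the two kinds of factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical, for illustration only): f_in ignores the
# context, while f_ex mixes user and context information.
W_u = rng.normal(size=(4, 8))   # maps a user feature vector to a factor
W_c = rng.normal(size=(4, 6))   # maps a context feature vector to a factor

def f_in(u, c):
    # intrinsic: a function of the user alone, hence context-invariant
    return W_u @ u

def f_ex(u, c):
    # extrinsic: depends on the context, hence varies across contexts
    return W_u @ u + W_c @ c

u = rng.normal(size=8)
c1, c2 = rng.normal(size=6), rng.normal(size=6)

assert np.allclose(f_in(u, c1), f_in(u, c2))      # invariant across contexts
assert not np.allclose(f_ex(u, c1), f_ex(u, c2))  # changes with the context
```

The CIED module described in Section 4.2 is what pushes the learned representations toward satisfying exactly these two properties.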

4. INTRINSIC-EXTRINSIC DISENTANGLED RECOMMENDATION MODEL

The overview of our model is visualized in Figure 1. More specifically, our proposed IEDR model consists of the following two modules, which are detailed in the next subsections: • A recommendation prediction (RP) module that takes a user and an item as input, and combines them with a set of contexts, to generate intrinsic and extrinsic factor representations for both the user and the item. The predicted probability y′ is then jointly learned from these representations. • A contrastive intrinsic-extrinsic disentangling (CIED) module that is applied to both the user and the item sides to support the intrinsic and extrinsic factor learning. The module contains a context-invariant contrastive learning component and a disentangling component, to ensure the learned factors satisfy Definition 1.

4.1. THE RECOMMENDATION PREDICTION MODULE

The recommendation prediction (RP) module is a symmetric structure that generates user intrinsic and extrinsic factor representations (o^u_in, o^u_ex) from the user side, and item intrinsic and extrinsic factor representations (o^v_in, o^v_ex) from the item side. We adopt SIGN (Su et al., 2021) to generate the representations. SIGN has been proven effective in user/item/context representation learning through modeling feature interactions via graph neural networks. Appendix A.1 provides a detailed description of SIGN. More formally, let f_u(u): R^{p×d} → R^d be the function for SIGN-based feature modeling. f_u(u) first maps each user feature z^u_i ∈ u into a d-dimensional feature embedding z^u_i. Then, it models these feature embeddings to output the user representation u. Similarly, SIGN learns the context representation c through f_c. Next, a factor generation function f^u_ie(u, c): R^{2×d} → R^{2×d} (e.g., a neural network) takes the user representation and the context representation as input, and simultaneously generates a user intrinsic representation o^u_in and a user extrinsic representation o^u_ex. Here, the output is a 2d-dimensional vector, with the first d dimensions as o^u_in and the rest as o^u_ex. On the item side, a similar module structure is adopted. We use a different SIGN-based function for the item representation learning, v = f_v(v), while using the same context representation as on the user side. A factor generation function f^v_ie(v, c) is applied to obtain the item intrinsic factor representation o^v_in and extrinsic factor representation o^v_ex. Finally, we jointly learn the prediction y′ = f_pred(o^u_in, o^u_ex, o^v_in, o^v_ex). We linearly combine the intrinsic and extrinsic representations and use the dot product as the prediction function: f_pred(o^u_in, o^u_ex, o^v_in, o^v_ex) = (o^u_in + o^u_ex)^⊤ (o^v_in + o^v_ex). A cross-entropy loss is adopted to minimize the prediction error: L_RP(u, v, c) := -[y log(y′) + (1 - y) log(1 - y′)].
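The prediction function and loss above can be sketched as follows. Squashing the dot product through a sigmoid to obtain a probability is our assumption, since the text specifies only f_pred and the cross-entropy form:

```python
import numpy as np

def predict(o_u_in, o_u_ex, o_v_in, o_v_ex):
    """f_pred: dot product of the linearly combined factor representations."""
    logit = np.dot(o_u_in + o_u_ex, o_v_in + o_v_ex)
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid to a probability (assumed)

def rp_loss(y, y_pred, eps=1e-12):
    """Binary cross-entropy L_RP for a single (u, v, c) instance."""
    return -(y * np.log(y_pred + eps) + (1 - y) * np.log(1 - y_pred + eps))

rng = np.random.default_rng(0)
d = 16
o_u_in, o_u_ex, o_v_in, o_v_ex = (rng.normal(scale=0.1, size=d) for _ in range(4))
y_pred = predict(o_u_in, o_u_ex, o_v_in, o_v_ex)
assert 0.0 < y_pred < 1.0
```

In the full model, the four representations would come from the SIGN-based factor generation functions rather than random draws.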

4.2. THE CONTRASTIVE INTRINSIC-EXTRINSIC DISENTANGLING MODULE

While the RP module can generate factor representations, this module alone cannot correctly distinguish intrinsic representations from extrinsic ones. To address this, we propose a contrastive intrinsic-extrinsic disentangling (CIED) module and apply it to both the user and the item sides. In the following, we only describe the CIED on the user side, as the module on the item side has the same structure. CIED consists of a context-invariant contrastive learning component and a disentangling component. The contrastive learning component learns intrinsic representations that are invariant across different contexts, while the disentangling component leverages a mutual information minimization task to disentangle the intrinsic and extrinsic representations. Below, we describe these two components in detail.

4.2.1. THE CONTEXT-INVARIANT CONTRASTIVE LEARNING COMPONENT

We propose a context-invariant contrastive learning component to learn the intrinsic representations. More specifically, we maximize the agreement between intrinsic representation pairs generated from the same user under different contexts (positive pairs), and simultaneously minimize the agreement between intrinsic representation pairs generated from the same context with different users (negative pairs). More formally, we write an intrinsic representation with subscripts as (o^u_in)_{ij} if it is generated from user u_i (of the i-th data sample) and context c_j (of the j-th data sample), i.e., (o^u_in)_{ij} = f^u_ie(u_i, c_j). For the i-th data sample (u_i, v_i, c_i) ∈ D, we calculate the objective function based on InfoNCE (Oord et al., 2018):

L^u_CICL(u_i, c_i) := -log [ exp(sim((o^u_in)_{ii}, (o^u_in)_{ij}) / τ) / Σ_{u_ℓ ∈ U} exp(sim((o^u_in)_{ii}, (o^u_in)_{ℓi}) / τ) ],   (1)

where c_j is an arbitrary context, sim(•) is the cosine similarity, and τ is a temperature value.

Implementation. To optimize Equation (1), we need to generate c_j (c_i ≠ c_j). We adopt a simple method that randomly samples contexts within the same batch. In addition, inspired by (Gao et al., 2021), where a vector with a large dropout rate can be considered a new vector, we propose a dropout-based method to generate new context representations (e.g., with a dropout rate larger than 50%). In practice, we integrate the two methods to generate different contexts c_j for each c_i (we empirically show in Appendix G.4 that the integrated generation method results in better prediction accuracy than either method alone). Meanwhile, we would need to iterate over all users to generate (o^u_in)_{ℓi} (the intrinsic representations generated from an arbitrary user u_ℓ and context c_i), which is prohibitive when the number of users is large. Hence, we randomly sample L users from the same batch to generate the negative intrinsic representations (o^u_in)_{ℓi}, where ℓ = 1, 2, ..., L.
We use the categorical cross-entropy to optimize Equation (1) following (Oord et al., 2018) .
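A minimal sketch of the contrastive objective in Equation (1), using randomly generated representations and the standard InfoNCE convention of including the positive pair in the denominator (an implementation assumption; the equation sums over in-batch negatives):

```python
import numpy as np

def cicl_loss(o_ii, o_ij, negs, tau=0.1):
    """Context-invariant contrastive loss for one sample (Equation (1) sketch).

    o_ii: intrinsic rep from (u_i, c_i); o_ij: same user, different context
    (the positive); negs: reps (o_in)_{li} from other users under c_i.
    """
    def sim(a, b):
        # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    logits = np.array([sim(o_ii, o_ij)] + [sim(o_ii, n) for n in negs]) / tau
    logits -= logits.max()                       # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
anchor = rng.normal(size=32)
pos = anchor + 0.01 * rng.normal(size=32)        # near-identical positive
negs = [rng.normal(size=32) for _ in range(5)]

# A well-aligned positive yields a much lower loss than a random "positive".
assert cicl_loss(anchor, pos, negs) < cicl_loss(anchor, negs[0], [pos] + negs[1:])
```

Minimizing this loss pulls the two intrinsic representations of the same user together regardless of which context generated them, which is exactly the invariance required by Definition 1.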

4.2.2. THE DISENTANGLING COMPONENT

We then disentangle the intrinsic and extrinsic factors by minimizing the mutual information between the o^u_in and o^u_ex generated from f^u_ie(u, c). Inspired by vCLUB (Cheng et al., 2020), we minimize their mutual information by estimating a vCLUB-based upper bound. However, the asymmetry of the original vCLUB may lead to less robust and inferior intrinsic-extrinsic disentangling (further discussion of this drawback can be found in Appendix D). Therefore, we propose a simple yet effective bidirectional extension of vCLUB that restores symmetry, is more robust, and achieves better disentanglement. In the bidirectional vCLUB, two variational distributions (e.g., approximated via neural networks) q^u_1(o^u_ex | o^u_in; θ^u_1) and q^u_2(o^u_in | o^u_ex; θ^u_2), with parameters θ^u_1 and θ^u_2, are proposed to predict the two types of factors, respectively. Then a bidirectional vCLUB-based mutual information upper bound can be obtained as:

I_bi-vCLUB(o^u_in; o^u_ex) := 1/2 [ E_{p(o^u_in, o^u_ex)}[log q^u_1(o^u_ex | o^u_in)] - E_{p(o^u_in)p(o^u_ex)}[log q^u_1(o^u_ex | o^u_in)] + E_{p(o^u_in, o^u_ex)}[log q^u_2(o^u_in | o^u_ex)] - E_{p(o^u_ex)p(o^u_in)}[log q^u_2(o^u_in | o^u_ex)] ].   (2)

By minimizing the bidirectional upper bound I_bi-vCLUB(o^u_in; o^u_ex) above, we minimize the mutual information between o^u_in and o^u_ex. Experimental results in Section 5.2.2 show that the bidirectional vCLUB is more robust and achieves better factor learning than the original vCLUB.

Implementation. The optimization of the disentangling component is conducted in two steps, iteratively. In the first step, we estimate the upper bound by training θ^u_1 and θ^u_2 to minimize the loss function

L^u_bi-appr(u_i, c_i) := -1/2 [ log q^u_1((o^u_ex)_{ii} | (o^u_in)_{ii}) + log q^u_2((o^u_in)_{ii} | (o^u_ex)_{ii}) ].   (3)

Following (Cheng et al., 2020), we use the mean squared error to optimize q^u_1 and q^u_2.
In the second step, we freeze θ^u_1 and θ^u_2, and minimize the mutual information between o^u_in and o^u_ex by training the other parameters to minimize the upper bound L^u_Dis(u_i, c_i) := I_bi-vCLUB((o^u_in)_{ii}; (o^u_ex)_{ii}).
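The bidirectional upper bound admits a simple sample-based estimate. The sketch below assumes unit-variance Gaussian variational distributions with linear mean predictors as stand-ins for the trained q^u_1 and q^u_2 (an assumption; the paper uses neural networks trained with MSE). On toy data, the estimate is large for strongly dependent representations and near zero for independent ones:

```python
import numpy as np

def log_q(y, mu):
    """Log-density of a unit-variance Gaussian q(y | x) with mean mu
    (the additive constant is dropped, as it cancels in the bound)."""
    return -0.5 * np.sum((y - mu) ** 2, axis=-1)

def bi_vclub(o_in, o_ex, W1, W2):
    """Sample estimate of the bidirectional vCLUB bound (Equation (2) sketch).

    W1, W2: linear mean predictors standing in for q1 and q2.
    o_in, o_ex: (N, d) batches of intrinsic / extrinsic representations.
    """
    mu1, mu2 = o_in @ W1, o_ex @ W2      # q1 predicts o_ex, q2 predicts o_in
    joint1 = log_q(o_ex, mu1).mean()                          # E over joint
    marg1 = log_q(o_ex[None, :, :], mu1[:, None, :]).mean()   # E over all pairs
    joint2 = log_q(o_in, mu2).mean()
    marg2 = log_q(o_in[None, :, :], mu2[:, None, :]).mean()
    return 0.5 * ((joint1 - marg1) + (joint2 - marg2))

rng = np.random.default_rng(0)
N, d = 64, 8
o_in = rng.normal(size=(N, d))
dependent = o_in + 0.1 * rng.normal(size=(N, d))   # strongly coupled to o_in
independent = rng.normal(size=(N, d))              # statistically independent
W = np.eye(d)

assert bi_vclub(o_in, dependent, W, W) > bi_vclub(o_in, independent, W, np.zeros((d, d)))
```

Driving this estimate down during training therefore pushes the intrinsic and extrinsic representations toward statistical independence.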

4.3. A MULTI-TASK TRAINING

We perform a two-step multi-task training to minimize the empirical risk of IEDR. The two steps alternate until convergence. Appendix H provides the pseudo-code of the training procedure. In the first step, we freeze all parameters except θ^u_1, θ^u_2, θ^v_1, and θ^v_2, where θ^v_1 and θ^v_2 are the parameters of q^v_1(o^v_ex | o^v_in; θ^v_1) and q^v_2(o^v_in | o^v_ex; θ^v_2) in the disentangling component on the item side. We then minimize

R(θ^u_1, θ^u_2, θ^v_1, θ^v_2) = 1/N Σ_{i=1}^N [ L^u_bi-appr(u_i, c_i) + L^v_bi-appr(v_i, c_i) ].

In the second step, we freeze θ^u_1, θ^u_2, θ^v_1, and θ^v_2, and minimize the following function:

arg min_ω R(ω) = 1/N Σ_{i=1}^N [ L_RP(u_i, v_i, c_i) + λ_1 (L^u_CICL(u_i, c_i) + L^v_CICL(v_i, c_i)) + λ_2 (L^u_Dis(u_i, c_i) + L^v_Dis(v_i, c_i)) ],

where L^v_bi-appr, L^v_CICL, and L^v_Dis are the corresponding losses on the item side, λ_1 and λ_2 are weight factors, and ω denotes all trainable parameters except θ^u_1, θ^u_2, θ^v_1, and θ^v_2.
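The alternating freeze/unfreeze pattern of the two-step training can be sketched with toy scalar objectives (the real losses are replaced by simple quadratics; only the alternation structure is meaningful here, not the loss functions themselves):

```python
import numpy as np

# theta stands for the variational-network parameters (theta_1^u, theta_2^u,
# theta_1^v, theta_2^v); omega stands for all other trainable parameters.
theta = np.array([1.0])
omega = np.array([2.0])
lr = 0.1

for _ in range(200):
    # Step 1: freeze omega, update theta to tighten the bound
    # (gradient of the toy objective (theta - omega)^2 w.r.t. theta).
    theta -= lr * 2 * (theta - omega)
    # Step 2: freeze theta, update omega to minimize the combined
    # recommendation + disentangling objective (toy: omega^2 + 0.5*(omega-theta)^2).
    omega -= lr * (2 * omega + (omega - theta))

# The alternation converges: the two parameter groups settle jointly.
assert abs(omega[0]) < 1e-2 and abs(theta[0] - omega[0]) < 1e-2
```

In IEDR, each "update" would be an optimizer step over a mini-batch with the frozen group's gradients disabled, exactly mirroring the two-step risks above.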

4.4. THEORETICAL ANALYSIS: CONTEXT-INVARIANT CONTRASTIVE LEARNING IN INFORMATION THEORY

In this section, we justify the context-invariant contrastive learning from the perspective of information theory. As formally stated in Theorem 1, optimizing Equation (1) is equivalent to maximizing the mutual information between the intrinsic representations and the user representations, while simultaneously minimizing the mutual information between the intrinsic representations and the context representations. The theorem on the item side can be derived in the same fashion. The proof of this equivalence can be found in Appendix B.

Theorem 1 (Equivalence of the contrastive loss L^u_CICL). Optimizing the contrastive loss is equivalent to solving:

arg min Σ_{i=1}^N L^u_CICL(u_i, c_i) = arg max [ I(o^u_in; u) - I(o^u_in; c) ].   (4)

5. EXPERIMENTS

We conduct extensive experiments to demonstrate the effectiveness of our model. In this section, we focus on 1) the recommendation performance of IEDR compared to the state-of-the-art methods; 2) the effectiveness of each component in IEDR; and 3) the ability to disentangle intrinsic and extrinsic factors of IEDR. We discuss the datasets, baselines, implementations, and parameter settings in detail in Appendix F.

5.1. OVERALL PERFORMANCE

We evaluate the recommendation performance of our model, by comparing it with various baselines in two scenarios. In the first scenario, we learn intrinsic and extrinsic factors from various contexts. In the second scenario, we learn the factors from a specific (time) context and compare with sequential recommendation baselines that also learn intrinsic and extrinsic factors (i.e., long and short term interests). We use three common evaluation metrics for recommender systems: AUC, NDCG@k, and HR@k with k being 5 and 10.

5.1.1. FACTOR LEARNING FROM VARIOUS CONTEXTS

We run our model and feature interaction-based baselines on datasets that contain various contexts: the Frappe dataset (Baltrunas et al., 2015) for app recommendation, and the Yelp dataset (Wu et al., 2022) for restaurant recommendation. As baselines, we use feature interaction-based approaches (Xiao et al., 2017; He & Chua, 2017; Song et al., 2019; Guo et al., 2017; Cheng et al., 2016; Wang et al., 2021; Yu et al., 2019a; Su et al., 2021) that capture these contexts but neglect factor learning. We also compare with the implicit factor disentanglement methods DisRec (Ma et al., 2019) and DGCF (Wang et al., 2020). The performance results are reported in Table 1. Each result is the average of 10 runs. For each metric, the best results are in bold, and the best baseline results are underlined. The rows Improv (standing for Improvements) and p-value show the improvement and statistical significance (via the Wilcoxon signed-rank test (Wilcoxon, 1992)) of IEDR over the best baseline results, respectively.

Table 1: Comparing the prediction performance (in percentage) with the baselines. The Improv and p-value rows show the relative improvements and the statistical significance of IEDR over the best performed baselines, respectively. N@k refers to NDCG@k and H@k refers to HR@k.

From Table 1, we observe that our model significantly outperforms all baselines under all evaluation metrics, and the p-values are all below the 5% threshold, indicating the significance of the improvements. IEDR outperforms the feature interaction-based baselines because it captures the intrinsic and extrinsic factors, while these baselines neglect factor learning. Meanwhile, the implicit factor disentanglement methods (DisRec and DGCF) also perform worse than our model. One reason is that implicit factor disentanglement is not the best way to infer these factors.
In Section 5.2, we empirically verify that replacing the disentangling module in IEDR with an implicit approach (as in DisRec and DGCF) leads to a decrease in recommendation accuracy. Another reason is that the factor disentanglement of DisRec and DGCF is purely based on user-item interactions and does not consider context information, which may lead to a critical loss of information for recommendation. Our model exploits context information to learn disentangled intrinsic and extrinsic factors, and hence achieves better prediction accuracy.

5.1.2. FACTOR LEARNING FROM A SPECIFIC CONTEXT

We then evaluate IEDR on two Amazon datasets (Movies and CDs) (McAuley et al., 2015) that contain only the time context. IEDR takes the (bucketed) time context as features to learn intrinsic and extrinsic factors. We compare with state-of-the-art sequential recommendation baselines (Hidasi et al., 2016; Yu et al., 2019b; Zheng et al., 2022) that learn LS-term interests from item sequences ordered by time. The experimental results are reported in Table 2, where our model achieves competitive accuracy compared to the baselines. This demonstrates that our model achieves state-of-the-art recommendation accuracy in the context-specific scenario, even compared with models designed for that context. Moreover, IEDR is more versatile and can be applied to various contexts.

5.2. EFFECTIVENESS OF OUR MODEL'S COMPONENTS

This section evaluates the components of IEDR in detail. We report results in NDCG@10 due to space limitations; the other metrics show similar trends.

5.2.1. ABLATION STUDY OF CONTRASTIVE INTRINSIC-EXTRINSIC DISENTANGLING MODULE

The contrastive intrinsic-extrinsic disentangling (CIED) module contains a context-invariant contrastive learning component and a disentangling component. In this section, we conduct an ablation study to show the impact of these components. We run our model in three variations: 1) without the contrastive learning component (noCL); 2) without the disentangling component (noDis); and 3) without both components (noCIED), i.e., the CIED module is not applied. Figure 2a compares our IEDR model with these three variants. The inferior performance of noCIED compared with the full IEDR model demonstrates the importance of learning intrinsic and extrinsic factors for accurate recommendation prediction. noCL can be regarded as performing implicit factor learning. The inferior performance of noCL indicates that explicit intrinsic and extrinsic factor learning is superior to implicit factor learning (as in the factor disentangling baselines) in inferring the real reasons behind users' decisions. noDis learns intrinsic factors but does not guarantee extrinsic factor learning, so it also obtains worse results than the full model. Employing both components achieves better performance than using only one, because neither component alone can successfully learn the intrinsic and extrinsic factors, highlighting the importance of jointly learning the two factors for accurate recommendation.

5.2.2. DISENTANGLING COMPONENT EVALUATION

We propose a bidirectional vCLUB-based disentangling method (BiDis) that disentangles the intrinsic and extrinsic factors. In this section, we compare our BiDis method with the original vCLUB method (vCLUB). From Figure 2b , we observe that our BiDis achieves a better performance than vCLUB, and its variance is much smaller than vCLUB. This indicates that BiDis generates more robust predictions, which is consistent with our analysis in Section 4.2.2. In addition, we visualize the intrinsic and extrinsic representations learned by the two disentanglement methods with t-SNE in Appendix G.7, and observe that using BiDis results in a better intrinsic-extrinsic disentanglement. This is the reason why BiDis delivers a better performance than vCLUB.

5.3. DISENTANGLEMENT VERIFICATION

This section verifies the intrinsic and extrinsic factor disentangling ability of IEDR, including a visualization of the learned intrinsic and extrinsic representations and a case study to show the differences of these factors in users' decision-making.

5.3.1. VISUALIZATION OF INTRINSIC AND EXTRINSIC REPRESENTATIONS

In Figure 3 , we visualize intrinsic and extrinsic factor representations learned from our model. We see that when the model is equipped with the CIED module (IEDR), the factors are well disentangled. However, when we do not use the CIED module (noCIED), the intrinsic and extrinsic factor representations are mixed together. This indicates that these factors cannot be well learned and disentangled without our CIED module. We provide more visualizations and analysis for our model with different component combinations in Appendix G.7.

5.3.2. CASE STUDY

We conducted a case study to analyze the differences between the learned intrinsic and extrinsic factors. We randomly choose a user from the Frappe dataset and generate the intrinsic matching scores (the dot product of the user's intrinsic representation and the items' (apps') intrinsic representations) in two different contexts (Weekday and Weekend), and likewise for the extrinsic matching scores. We sort the matching scores for the intrinsic and extrinsic factors, respectively, and list the top-ranked items. From weekday to weekend, the extrinsic scores vary considerably, while the intrinsic scores remain invariant. These observations demonstrate that, in different contexts, the user has different extrinsic factors, as well as consistent intrinsic factors. Then, we illustrate how intrinsic and extrinsic factors may have different impacts on users' choices. Table 3 lists the categories of the items with the 10 highest intrinsic/extrinsic scores for two users, respectively. We observe that users have individual intrinsic interests that reveal their real hobbies in personal time, e.g., User1 prefers sports and fitness apps while User2 prefers gaming apps. On the other hand, extrinsic factors give a higher rank to items based on the context (Workday), e.g., Tool (Google Search) and Communication (Gmail) rank highest in User1's extrinsic scores.

Remark. In addition to the above experiments, we further evaluate our model by 1) running our model with other context/feature modeling methods (Appendix G.1), 2) comparing the proposed model with a naive baseline in learning intrinsic factor representations (Appendix G.3), and 3) studying how the hyperparameter settings influence the performance (Appendix G.6).
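The matching-score computation in this case study is a pair of dot products. The sketch below uses randomly generated stand-ins for the learned representations (hypothetical values, not Frappe data) to show why intrinsic rankings stay fixed across contexts while extrinsic scores shift:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 8, 20

# Hypothetical learned representations: the user's intrinsic vector is
# shared across contexts, while the extrinsic vector depends on them.
user_in = rng.normal(size=d)
user_ex_weekday = rng.normal(size=d)
user_ex_weekend = rng.normal(size=d)
item_in = rng.normal(size=(n_items, d))
item_ex = rng.normal(size=(n_items, d))

# Intrinsic matching scores: dot products with the items' intrinsic vectors.
intrinsic_scores = item_in @ user_in          # identical in both contexts
ex_scores_weekday = item_ex @ user_ex_weekday
ex_scores_weekend = item_ex @ user_ex_weekend

# Extrinsic scores (and hence the extrinsic ranking) change with the context.
assert not np.allclose(ex_scores_weekday, ex_scores_weekend)
```

Sorting `intrinsic_scores` and the two extrinsic score vectors reproduces the structure of the case study: one context-independent ranking and one ranking per context.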

6. CONCLUSION

A.1 THE STATISTICAL INTERACTION GRAPH NETWORK (SIGN)

The statistical interaction graph network (SIGN) (Su et al., 2021) explicitly models feature interactions through a graph neural network. IEDR learns a user representation u (as well as a context representation and an item representation) using the SIGN-based model f_u(u). Given u = {z^u_1, z^u_2, ..., z^u_p}, f_u regards the user u as a user graph G(V, E), where V = u is the node set in which each feature z^u_i is a node, and E is the edge set containing all combinations of pairwise feature interactions, with each feature interaction ⟨z^u_i, z^u_j⟩ being an edge linking the corresponding nodes. Accordingly, user representation learning becomes a graph learning problem. In SIGN, each feature z^u_i is first mapped into a feature embedding z^u_i ∈ R^d of d dimensions as the node embedding. The embeddings are randomly initialized and updated through training. Then, SIGN learns the user graph using the function f_u:

f_u(G) = ϕ({ψ({e_ij h(z^u_i, z^u_j)}_{j∈V})}_{i∈V}),   (5)

where ϕ and ψ are aggregation functions (e.g., element-wise mean), h(•): R^{2×d} → R^d is an MLP that models each feature interaction, and e_ij ∈ {0, 1} is the edge indicator (since we use all pairwise feature interactions, e_ij = 1 for all edges). f_u outputs the modeled user representation u ∈ R^d of d dimensions.
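A minimal sketch of the SIGN aggregation above, with an MLP stand-in for h and element-wise mean for both ψ and ϕ; self-pairs ⟨z_i, z_i⟩ are included for simplicity, which is an assumption of this sketch:

```python
import numpy as np

def sign_forward(feat_emb, h, psi=np.mean, phi=np.mean):
    """Minimal sketch of f_u(G) over a complete feature-interaction graph
    (all e_ij = 1). feat_emb: (p, d) node embeddings; h: pairwise model."""
    p, d = feat_emb.shape
    node_msgs = []
    for i in range(p):
        # model every pairwise interaction <z_i, z_j>, then aggregate with psi
        pair_msgs = [h(feat_emb[i], feat_emb[j]) for j in range(p)]
        node_msgs.append(psi(pair_msgs, axis=0))
    return phi(node_msgs, axis=0)    # graph-level readout with phi

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))         # stand-in weights for the interaction MLP h
h = lambda zi, zj: np.tanh(np.concatenate([zi, zj]) @ W)

u_repr = sign_forward(rng.normal(size=(5, 8)), h)   # 5 user features, d = 8
assert u_repr.shape == (8,)
```

The real model would use trained embeddings and a trained MLP for h; the sketch only shows how the nested ψ/ϕ aggregations collapse the pairwise interactions into a single d-dimensional representation.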

A.2 VARIATIONAL CONTRASTIVE LOG-RATIO UPPER BOUND (VCLUB) OF MUTUAL INFORMATION

Suppose we aim to learn an intrinsic representation o^u_in and an extrinsic representation o^u_ex whose mutual information is minimized. In the vCLUB method (Cheng et al., 2020), a variational distribution q^u_1(o^u_ex | o^u_in; θ^u_1) with parameters θ^u_1 (e.g., an MLP) is introduced to predict the extrinsic factor given the intrinsic factor. Then, the vCLUB-based mutual information upper bound can be derived as:
$$I_{vCLUB}(o^u_{in}; o^u_{ex}) := \mathbb{E}_{p(o^u_{in}, o^u_{ex})}[\log q^u_1(o^u_{ex} | o^u_{in})] - \mathbb{E}_{p(o^u_{in})p(o^u_{ex})}[\log q^u_1(o^u_{ex} | o^u_{in})], \quad (6)$$
where p(o^u_in, o^u_ex) is the joint distribution and p(o^u_in)p(o^u_ex) is the product of the marginal distributions. vCLUB performs mutual information estimation and minimization in two steps iteratively. In the first step, to ensure that Equation (6) holds as an upper bound, θ^u_1 is trained to maximize the log-likelihood $L_{appr}(u, c) := \frac{1}{N}\sum_{i=1}^{N} \log q^u_1((o^u_{ex})_i \,|\, (o^u_{in})_i)$ (Theorem 3.2 of Cheng et al., 2020). In the second step, θ^u_1 is frozen, and the other parameters (e.g., those that generate o^u_in and o^u_ex) are trained to minimize I_vCLUB(o^u_in; o^u_ex) so that the mutual information is minimized.
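The two expectations in the bound can be estimated from samples. Below is a minimal NumPy sketch of the vCLUB estimate, where the variational network q^u_1 is replaced by a linear-Gaussian stand-in fitted by least squares; this substitution is an assumption for illustration only (the paper uses an MLP trained by gradient descent in step one).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 64, 8
o_in = rng.normal(size=(N, d))               # intrinsic representations
o_ex = 0.5 * o_in + rng.normal(size=(N, d))  # extrinsic, correlated on purpose

# Stand-in for q^u_1(o_ex | o_in): a unit-variance Gaussian with mean o_in @ W.
# Least squares plays the role of step 1 (maximizing the log-likelihood).
W, *_ = np.linalg.lstsq(o_in, o_ex, rcond=None)

def log_q(x, y):
    # log N(y; x @ W, I) up to an additive constant (the constant cancels below)
    return -0.5 * np.sum((y - x @ W) ** 2)

positive = np.mean([log_q(o_in[i], o_ex[i]) for i in range(N)])                   # joint
negative = np.mean([log_q(o_in[i], o_ex[j]) for i in range(N) for j in range(N)]) # marginals
I_vclub = positive - negative   # sample estimate of the upper bound
print(I_vclub > 0)  # True: correlated pairs give a positive estimate
```

In step two of vCLUB the representations would then be updated to reduce this estimate while W (i.e., θ^u_1) stays frozen.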

B PROOF OF THEOREM 1

Proof. Since the mutual information is not explicitly tractable, we approximate the right side of Equation (4) with a lower bound (i.e., MINE (Belghazi et al., 2018)) and an upper bound (i.e., CLUB (Cheng et al., 2020)) of mutual information, respectively. More formally,
$$I(o^u_{in}, u) \geq I_{MINE}(o^u_{in}, u) := \mathbb{E}_{p(o^u_{in}, u)}[\log p(o^u_{in}, u)] - \log \mathbb{E}_{p(o^u_{in})p(u)}[p(o^u_{in}, u)]; \quad (7)$$
$$I(o^u_{in}, c) \leq I_{CLUB}(o^u_{in}, c) := \mathbb{E}_{p(o^u_{in}, c)}[\log p(o^u_{in} | c)] - \mathbb{E}_{p(o^u_{in})p(c)}[\log p(o^u_{in} | c)]. \quad (8)$$
With the approximated terms above, proving Equation (4) turns into verifying:
$$\arg\min \sum_{i=1}^{N} L_{CICL}(u_i, c_i) = \arg\max \big( I_{MINE}(o^u_{in}, u) - I_{CLUB}(o^u_{in}, c) \big).$$
By minimizing L_CICL, we aim to make (o^u_in)_ii similar to (o^u_in)_ij. This procedure can be interpreted probabilistically as increasing the probability of f^u_ie(u_i, c_j) predicting (o^u_in)_ii. Therefore, maximizing exp(sim((o^u_in)_ii, (o^u_in)_ij)/τ) in Equation (1) is equivalent to maximizing p((o^u_in)_ii | u_i, c_j) (exp(·) is monotonically increasing and hence does not influence the conclusion). Similarly, minimizing exp(sim((o^u_in)_ii, (o^u_in)_ℓi)/τ) is equivalent to minimizing p((o^u_in)_ii | u_ℓ, c_i). Therefore, we have
$$\begin{aligned}
-\sum_{i=1}^{N} L_{CICL}(u_i, c_i) &= \sum_{i=1}^{N} \log \frac{\exp(sim((o^u_{in})_{ii}, (o^u_{in})_{ij})/\tau)}{\sum_{u_\ell \in U} \exp(sim((o^u_{in})_{ii}, (o^u_{in})_{\ell i})/\tau)} \\
&= \sum_{i=1}^{N} \log\big[\exp(sim((o^u_{in})_{ii}, (o^u_{in})_{ij})/\tau)\big] - \sum_{i=1}^{N} \log\Big[\sum_{u_\ell \in U} \exp(sim((o^u_{in})_{ii}, (o^u_{in})_{\ell i})/\tau)\Big] \\
&= \sum_{i=1}^{N} \log\big[p((o^u_{in})_{ii} | u_i, c_j)\big] - \sum_{i=1}^{N} \log\Big[\sum_{u_\ell \in U} p((o^u_{in})_{ii} | u_\ell, c_i)\Big].
\end{aligned}$$
Equation (1) only samples one context c_j for each data point. However, during training, all contexts in C are expected to be sampled.
If we count all contexts, we have
$$\begin{aligned}
&\sum_{i=1}^{N} \log\big[p((o^u_{in})_{ii} | u_i, c_j)\big] - \sum_{i=1}^{N} \log\Big[\sum_{u_\ell \in U} p((o^u_{in})_{ii} | u_\ell, c_i)\Big] \\
&= \sum_{i=1}^{N} \sum_{c_j \in C} \log\big[p((o^u_{in})_{ii} | u_i, c_j)\big] - \sum_{i=1}^{N} \log\Big[\sum_{u_\ell \in U} p((o^u_{in})_{ii} | u_\ell, c_i)\Big] \\
&= \mathbb{E}_{p(o^u_{in}, u)p(c)}[\log p(o^u_{in} | u, c)] - \mathbb{E}_{p(o^u_{in}, c)} \log \mathbb{E}_{p(u)}[p(o^u_{in} | u, c)]. \quad (10)
\end{aligned}$$
Equation (10) is the probability form of the objective function of the context-invariant contrastive learning component (Equation (1)). Equation (10) maximizes the likelihood p(o^u_in | u, c) under the joint distribution of users and intrinsic factors, with the marginal distribution of contexts. Meanwhile, it minimizes the likelihood p(o^u_in | u, c) under the joint distribution of contexts and intrinsic factors, with the marginal distribution of users. From Equation (10), we further have (writing o := o^u_in for brevity):
$$\begin{aligned}
&\mathbb{E}_{p(o,u)p(c)}[\log p(o|u,c)] - \mathbb{E}_{p(o,c)} \log \mathbb{E}_{p(u)}[p(o|u,c)] \\
&\overset{(a)}{=} \mathbb{E}_{p(o,u)p(c)}[\log p(o|u,c)] - \mathbb{E}_{p(o,c)p(u)}[\log p(o|u,c)] \\
&= \mathbb{E}_{p(o,u)p(c)}[\log p(o|u,c)] - \mathbb{E}_{p(o,c)p(u)}[\log p(o|u,c)] + \mathbb{E}_{p(u)}[\log p(u)] - \mathbb{E}_{p(u)}[\log p(u)] \\
&= \mathbb{E}_{p(o,u)p(c)}[\log p(o|u,c)p(u)] - \mathbb{E}_{p(o,c)p(u)}[\log p(o|u,c)] - \mathbb{E}_{p(u)}[\log p(u)] \\
&= \mathbb{E}_{p(o,u)p(c)}[\log p(o,u|c)] - \mathbb{E}_{p(o,c)p(u)}[\log p(o|u,c)] - \mathbb{E}_{p(u)}[\log p(u)] \\
&\quad + \mathbb{E}_{p(o)p(u)p(c)}[\log p(o|u,c)] - \mathbb{E}_{p(o)p(u)p(c)}[\log p(o|u,c)] \\
&= \mathbb{E}_{p(o,u)p(c)}[\log p(o,u|c)] - \mathbb{E}_{p(o)p(u)p(c)}[\log p(o,u|c)] - \Big( \mathbb{E}_{p(o,c)p(u)}[\log p(o|u,c)] - \mathbb{E}_{p(o)p(u)p(c)}[\log p(o|u,c)] \Big) \\
&= \mathbb{E}_{p(c)}\Big[ \mathbb{E}_{p(o,u)}[\log p(o,u|c)] - \mathbb{E}_{p(o)p(u)}[\log p(o,u|c)] \Big] - \mathbb{E}_{p(u)}\Big[ \mathbb{E}_{p(o,c)}[\log p(o|u,c)] - \mathbb{E}_{p(o)p(c)}[\log p(o|u,c)] \Big]. \quad (11)
\end{aligned}$$
(a): In the second term, pushing the log inside the expectation does not change the minimizer. Comparing Equation (7) and the first term of Equation (11), they both act like classifiers whose objectives maximize the expected log-ratio of the joint distribution over the product of marginal distributions (Hjelm et al., 2019). Therefore, maximizing this term in Equation (11) has the same effect as maximizing Equation (7). We can interpret the first term of Equation (11) as maximizing the mutual information between users and the corresponding intrinsic factors, conditioned on a given context. Similarly, maximizing the negative of the second term of Equation (11) has the same effect as minimizing Equation (8), which can be interpreted as minimizing the mutual information between contexts and the corresponding intrinsic factors, conditioned on a given user. Therefore, we can conclude that:
$$\arg\min_{(u_i, v_i, c_i) \in D} \sum_i L_{CICL}(u_i, c_i) = \arg\max \big( I_{MINE}(o^u_{in}, u) - I_{CLUB}(o^u_{in}, c) \big).$$
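For intuition, the contrastive loss L_CICL manipulated in this proof can be sketched as a standard InfoNCE-style computation. The cosine similarity and the inclusion of the positive pair in the denominator are common InfoNCE conventions assumed here for illustration; the exact normalization in Equation (1) may differ.

```python
import numpy as np

def sim(a, b):
    # cosine similarity
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def l_cicl(o_ii, o_ij, o_neg, tau=0.5):
    """InfoNCE-style loss: o_ii and o_ij are intrinsic reps of the SAME user
    under two contexts (positive pair); o_neg are intrinsic reps of OTHER
    users under the original context (negatives)."""
    pos = np.exp(sim(o_ii, o_ij) / tau)
    neg = sum(np.exp(sim(o_ii, o_l) / tau) for o_l in o_neg)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
user = rng.normal(size=8)
# context-invariant intrinsic reps: same user, two contexts, small perturbations
o_ii = user + 0.05 * rng.normal(size=8)
o_ij = user + 0.05 * rng.normal(size=8)
o_neg = [rng.normal(size=8) for _ in range(5)]   # other users

aligned = l_cicl(o_ii, o_ij, o_neg)              # positive pair truly invariant
shuffled = l_cicl(o_ii, rng.normal(size=8), o_neg)  # positive pair broken
print(aligned < shuffled)  # True: invariant intrinsic reps yield a lower loss
```

The comparison mirrors the proof's argument: minimizing L_CICL raises p((o^u_in)_ii | u_i, c_j) relative to p((o^u_in)_ii | u_ℓ, c_i).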

C PREVENTING THE TRIVIAL SOLUTION OF CIED

The two components of the CIED module, the contrastive learning component and the disentangling component, jointly ensure the success of the intrinsic and extrinsic factor representation learning. However, CIED may fall into a trivial solution: f^u_ie(u, c) maps u to o^u_in without considering c, and maps c to o^u_ex without considering u. Although this trivial solution minimizes L_CICL(u, c) and L_Dis(u, c), o^u_in (resp. o^u_ex) is not the intrinsic (resp. extrinsic) factor, but merely a mapping of the user information (resp. context information). We prove that this trivial solution can be avoided by making f^u_ie(u, c) a non-linear function, which forces u and c to statistically interact.

Statistical Interaction

We start by introducing the statistical interaction (or non-additive interaction), which requires that the joint influence of several variables on an output variable is not additive (Tsang et al., 2018). Based on Sorokina et al. (2008), F(X) exhibits a statistical interaction between variables x_i and x_j if, for all f_\i and f_\j, F(X) cannot be expressed as:
$$F(X) = f_{\setminus i}(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) + f_{\setminus j}(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n). \quad (12)$$
More generally, if we use v_i ∈ R^d to describe the i-th variable with a d-dimensional vector (Rendle, 2010; Su et al., 2021), e.g., a variable embedding, each variable can be described in vector form as u_i = x_i v_i. Then, we define the pairwise statistical interaction in vector form by changing Equation (12) into:
$$F(X) \neq f_{\setminus i}(u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) + f_{\setminus j}(u_1, \ldots, u_{j-1}, u_{j+1}, \ldots, u_n). \quad (13)$$
Preventing the Trivial Solution. Based on the definition of statistical interaction, we can express the trivial solution as f^u_ie(u, c) learning no statistical interaction between u and c:
$$f^u_{ie}(u, c) = \lambda_1 f_1(u) + \lambda_2 f_2(c), \quad (14)$$
where f_1 outputs o^u_in, f_2 outputs o^u_ex, and the λ's are scalar weights. To prevent the trivial solution, we need to ensure that f^u_ie(u, c) cannot be modeled in the form of Equation (14). Therefore, if u and c statistically interact in f^u_ie(u, c), the trivial solution is prevented. Since f^u_ie(u, c) only takes u and c as inputs, we just need f^u_ie to be a non-additive model. That is, f^u_ie(u, c) should contain a third term f_3(u, c):
$$f^u_{ie}(u, c) = \lambda_1 f_1(u) + \lambda_2 f_2(c) + \lambda_3 f_3(u, c), \quad (15)$$
where f_3 is a non-additive model and λ_3 ≠ 0. Therefore, in the optimized situation, o^u_in = λ_1 f_1(u) learns the part of the user information that does not interact with the context information, and o^u_ex = λ_2 f_2(c) + λ_3 f_3(u, c) learns the context information (f_2(c)) and the information that changes given different contexts (f_3(u, c)). In Appendix G.2, we empirically analyze how the trivial solution influences the prediction performance.
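The role of the non-additive term λ_3 f_3(u, c) can be illustrated with a small NumPy sketch, in which random linear maps stand in for f_1, f_2, and a ReLU network for f_3 (placeholder weights, not trained components of the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
u, c = rng.normal(size=d), rng.normal(size=d)

# Placeholder sub-networks: f1 uses u only, f2 uses c only, f3 mixes both.
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
C = rng.normal(size=(2 * d, d))
f1 = lambda u: u @ A
f2 = lambda c: c @ B
f3 = lambda u, c: np.maximum(np.concatenate([u, c]) @ C, 0.0)  # non-additive (ReLU)

def f_ie(u, c, lam3=1.0):
    o_in = f1(u)                     # intrinsic: context-invariant part
    o_ex = f2(c) + lam3 * f3(u, c)   # extrinsic: context part + interaction part
    return o_in, o_ex

# With lam3 = 0 the model degenerates to the additive (trivial) form of Eq. (14):
# the extrinsic output ignores the user entirely, so swapping users changes nothing.
u2 = rng.normal(size=d)
_, ex_a = f_ie(u, c, lam3=0.0)
_, ex_b = f_ie(u2, c, lam3=0.0)
print(np.allclose(ex_a, ex_b))   # True: trivial solution, o_ex blind to u

_, ex_c = f_ie(u, c, lam3=1.0)
_, ex_d = f_ie(u2, c, lam3=1.0)
print(np.allclose(ex_c, ex_d))   # False: the interaction term couples u and c
```

The contrast between the two cases is exactly the distinction between Equation (14) and Equation (15).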

D POTENTIAL PROBLEMS OF THE ASYMMETRIC VCLUB METHOD

The vCLUB-based mutual information minimization method proposed in (Cheng et al., 2020) is asymmetric. Appendix A.2 introduces how the vCLUB method performs mutual information minimization. In this section, we explain a possible reason why vCLUB is less robust and performs worse than our proposed bidirectional vCLUB method (BiDis). If we directly apply vCLUB to our disentangling component, the parameters θ^u_1 of the variational distribution q^u_1(o^u_ex | o^u_in; θ^u_1) are trained so that Equation (6) approaches the vCLUB-based upper bound (Step 1). Then, θ^u_1 is frozen, and o^u_ex and o^u_in are trained to minimize I(o^u_in; o^u_ex) via minimizing the upper bound I_vCLUB(o^u_in; o^u_ex) (Step 2). However, this way of minimizing mutual information may result in an unexpected outcome: the mutual information may be minimized by making o^u_in contain as little information as possible. To better illustrate this outcome, suppose q^u_1 is a linear function that is well trained in Step 1, so that Equation (6) is an upper bound of I(o^u_in; o^u_ex). Figure 5 shows how the unexpected result may occur. In Step 2, o^u_ex and o^u_in are trained to minimize Equation (6). To achieve this goal, training ensures that q^u_1 cannot predict o^u_ex given the corresponding o^u_in from the joint distribution (the first term of Equation (6)), and at the same time ensures that the output of q^u_1 is similar to the other o^u_ex's from the marginal distribution (the second term of Equation (6)). From the o^u_in perspective (blue circles), this goal can be achieved by pushing each o^u_in away from its original position (optimizing the first term of Equation (6)) and toward the mean of the other o^u_in's (optimizing the second term of Equation (6)).
From the o^u_ex perspective (red circles), the goal can be achieved by pushing each o^u_ex away from its original position (optimizing the first term of Equation (6)) and away from the mean of the other o^u_ex's (optimizing the second term of Equation (6)). This clusters all the o^u_in's together, making them contain less information, while all the o^u_ex's spread apart, making them contain more information. The mutual information minimization procedure thus "transfers" information from the o^u_in's to the o^u_ex's, which is not what we expect. BiDis, in contrast, is a symmetric disentangling method over the o^u_in's and o^u_ex's and therefore does not suffer from this issue. This may be the reason why vCLUB performs worse and is less robust than our proposed symmetric disentangling component.

E TIME COMPLEXITY ANALYSIS

Briefly speaking, the time complexity of the whole model is comparable to that of feature interaction-based recommender systems (e.g., AutoInt, SIGN), and the overhead of the alternating optimization procedure for the disentangling component is marginal within the whole optimization procedure. The most time-consuming computations are the feature interaction learning steps that produce the user, item, and context representations, which must model every pair of feature interactions. This procedure is also performed by other feature interaction-based models; therefore, the time complexity of the proposed model is comparable to those methods. Our model performs additional computations for the context-invariant contrastive learning (CICL) component and the disentangling component. For the CICL component, we do not need to perform feature interaction modeling again but reuse the generated user/item/context representations, which saves the majority of the overhead. We only need to run f_ie (L+1) times, where L is the number of negative samples. Since f_ie is a one-hidden-layer MLP (with 128 hidden units), the overhead is marginal. For the disentangling component, we also reuse the generated user/item/context representations. As discussed in the paper, we use a two-step learning policy to train our model. The first step of the two-step learning takes very little overhead, because it only optimizes the parameters of the functions q_1 and q_2 (Eq. 2), which are two MLPs with one hidden layer each; for each data sample, we only run q_1 and q_2 once using o_in and o_ex. In summary, since none of the computations above require feature interaction modeling (the most time-consuming procedure in all feature interaction-based models), the small imposed overhead is acceptable considering the effectiveness of our model in capturing accurate intrinsic and extrinsic factors.

F EXPERIMENTAL SETTING

Datasets. We evaluate our model in two scenarios with various contexts: mobile app recommendation and restaurant recommendation. For mobile app recommendation, we use the Frappe (Baltrunas et al., 2015) dataset, which records mobile app usage logs. Each data sample logs a user's app usage in a certain context (e.g., weather, time, location). In addition, some relevant properties of the apps are also captured (e.g., category, developer). For restaurant recommendation, we use the Yelp dataset (Wu et al., 2022), which records users' reviews of local restaurants. Since a user usually goes to restaurants in the same city, the dataset exhibits geographic isolation; we therefore select the records in New York City. We regard each record as a data sample indicating that the user has been to the restaurant. We leverage the user/item features and context features (e.g., day of the week) to predict whether a user will go to a given restaurant in a specific context. We also evaluate our model on two Amazon datasets (Movies and CDs) (McAuley et al., 2015) that have been used in sequential recommendation tasks (Yu et al., 2019b). These datasets contain user-item interactions with timestamps. For sequential recommendation, we use the same IEDR model structure as for the Frappe and Yelp datasets, but modify the data input to fit our model. More specifically, we do not directly learn behavior sequences, but consider each behavior as a data sample with time context information. That is, we treat the bucketed timestamp of each user behavior as a time context (we consider one month as a categorized time context). Therefore, behaviors in the same time interval have the same time context, indicating that these behaviors share some similar short-term (extrinsic) interests (e.g., item popularity). For each dataset, only users with more than 5 records (Frappe and Yelp) or more than 20 records (Movies and CDs) are kept.
We choose the last record of each user for testing and the second-to-last record for validation. The rest of the records are used for training. Each of these data samples is considered a positive sample (y = 1). In addition, for each positive data sample in the training set, we randomly choose 2 items (keeping the user and contexts) as negative samples (y = 0), meaning the user did not select those 2 items in that context. For each test/validation data sample, we randomly choose 99 items as negative samples to ensure a more robust evaluation. The statistics of the datasets are shown in Table 4. Baseline methods. IEDR models the feature interactions of users, items, and contexts. Therefore, we compare our model with competitive feature interaction-based recommendation methods: attentional factorization machine (AFM) (Xiao et al., 2017), neural factorization machine (NFM) (He & Chua, 2017), self-attention-based feature interaction model (AutoInt) (Song et al., 2019), deep factorization machine (DeepFM) (Guo et al., 2017), wide & deep model (WDL) (Cheng et al., 2016), improved deep & cross network (DCNv2) (Wang et al., 2021), and input-aware factorization machine (IFM) (Yu et al., 2019a). We implement these methods using the DeepCTR package. The statistical interaction graph neural network (SIGN) (Su et al., 2021) is applied based on the released code. The above methods model all the factors in a unified representation without distinguishing the factors that affect users' decisions. Meanwhile, we compare IEDR with methods that learn implicit factors: the disentangled variational auto-encoder for recommendation (DisRec) (Ma et al., 2019) and disentangled graph collaborative filtering (DGCF) (Wang et al., 2020). We implement these methods based on their released code. Note that since DisRec and DGCF do not consider any features, their task is simply to predict whether a user will select an item.
IEDR and the other baseline models, however, consider user-item interactions in specific contexts (a user's decision to select an item may differ across contexts). For DisRec and DGCF, to prevent test data samples from appearing in the training set, we remove the data samples from the training set that appear in the test set (these samples have different contexts in the other models). For a fair comparison, we set the factor number to 4 for DisRec and DGCF. For sequential recommendation baselines, we compare our model with models that consider long- and short-term (LS-term) interests: the session-based recommender system with recurrent neural networks (GRU4Rec) (Hidasi et al., 2016), the Short-term and Long-term preference Integrated Recommender system (SLI-Rec) (Yu et al., 2019b), and the Contrastive learning framework of Long and Short-term interests for Recommendation (CLSR) (Zheng et al., 2022). We use the same MLP structure for feature interaction modeling and the same embedding size for features as in our IEDR model.
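The data preparation protocol described above (leave-last-two split per user, with 2 training negatives and 99 evaluation negatives per positive) can be sketched as follows; the function name and the record format are illustrative assumptions, not the paper's actual preprocessing code.

```python
import random

def build_samples(user_records, n_items, n_train_neg=2, n_eval_neg=99, seed=0):
    """Leave-last-two split: last record -> test, second-to-last -> validation,
    rest -> train. Negatives keep the user and context but swap in an item the
    user did not select (a sketch of the protocol in the paper)."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for user, records in user_records.items():  # records: [(item, context), ...]
        *hist, second_last, last = records
        for item, ctx in hist:
            train.append((user, item, ctx, 1))
            for neg in rng.sample([i for i in range(n_items) if i != item], n_train_neg):
                train.append((user, neg, ctx, 0))
        for split, (item, ctx) in ((valid, second_last), (test, last)):
            split.append((user, item, ctx, 1))
            for neg in rng.sample([i for i in range(n_items) if i != item], n_eval_neg):
                split.append((user, neg, ctx, 0))
    return train, valid, test

records = {"u1": [(3, "weekday"), (7, "weekend"), (1, "weekday"), (5, "weekend")]}
train, valid, test = build_samples(records, n_items=200)
print(len(train), len(valid), len(test))  # 6 100 100
```

With one user holding four records, two go to training (each with two negatives), and validation and test each hold one positive plus 99 negatives.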

Implementation details

In IEDR, all the MLPs have the same hidden structure: one hidden layer of 128 dimensions followed by a ReLU activation. The input and output sizes of the MLPs vary based on their roles. We set the embedding dimension to 32 for all features. f_ie is an MLP that outputs a 64-dimensional vector, with the first 32 dimensions being the intrinsic factor representation and the last 32 dimensions being the extrinsic factor representation. For the second (dropout-based) negative context generation method in the context-invariant contrastive learning component, the dropout rate is set to 0.5. The number of negative pairs for contrastive learning is 40 for each data sample (note that the actual number of negative pairs is doubled, since both (o^u_in)_ii and (o^u_in)_ij generate 40 negative pairs). The temperature τ is set to 0.5. In the disentangling component, q_1 and q_2 are MLPs that output vectors of the same dimension as the intrinsic/extrinsic factor representations. The number of negative samples for the bidirectional vCLUB-based method is 5 for each direction. We set λ_1 to 0.1 for the Frappe dataset and 0.01 for the Yelp dataset, and λ_2 to 0.1 for both datasets. λ_1 and λ_2 are both 0.01 for the Movies and CDs datasets. We run all experiments on a machine equipped with an Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz and an Nvidia Tesla V100 GPU. The model structures of IEDR and its variations used in the experiments are detailed in Table 5 and Table 6. Note that the component structures of the variations are the same as in IEDR if not specified.

G ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide further experimental results that are not included in the main paper.

G.1 OTHER FEATURE MODELING METHODS

In the RP module, although we use a SIGN-based method (Su et al., 2021) to learn user, item, and context features, the module can use any feature modeling method. Here, we use other methods to evaluate whether our model still performs well. Specifically, we run our model with three other variations using different feature modeling methods: 1) averaging feature embeddings (MEAN); 2) adding an MLP on top of the averaged feature embedding (MLP); and 3) modeling and aggregating feature interactions through a Bi-interaction layer proposed in (He & Chua, 2017) (BI). The results are shown in Figure 6. We report the results of each variation with and without the CIED module. From this figure, we can see that, when equipped with the CIED module, all feature modeling methods perform better than those without the module. This shows that our proposed CIED module can learn intrinsic and extrinsic factors for more accurate recommendation when different feature modeling methods are applied. Meanwhile, we can see that the feature modeling method impacts performance. MEAN is just a linear aggregation of features, resulting in the worst performance. MLP and BI both have better feature modeling ability and hence perform better than MEAN. The SIGN-based feature modeling (SIGN) is the state-of-the-art feature interaction modeling method and performs the best.

Table 5: Implementation details of the feature modeling variations ("-" represents the operation is the same as our original IEDR setting).

Variation | Representation learning† | Factor learning‡
SIGN | ϕ({ψ({h(z^u_i ⊙ z^u_j)}_{j∈u})})_{i∈u} → u | MLP(u • c) → [o^u_in, o^u_ex]
AVG | ψ(z^u_i)_{i∈u} → u | -
MLP | MLP(ψ(z^u_i)_{i∈u}) → u | -
BI | ψ(z^u_i ⊙ z^u_j)_{i,j∈u} → u | -
Linear | - | W[u, c] → [o^u_in, o^u_ex]
Nonlinear | - | σ(W[u, c]) → [o^u_in, o^u_ex]
IEDRsp | - | MLP_1(u) → o^u_in, MLP_2(u • c) → o^u_ex

† Here we use user representation learning as an example. The item and context learning have the same structure. ϕ and ψ are both element-wise averaging functions, and ⊙ is the element-wise product.
‡ Here we use user factor learning as an example. • is a flexible operation to combine two vectors, i.e., the element-wise product for the Frappe dataset and element-wise summation for the Yelp dataset. [•, •] is the concatenation operation. W is a linear transformation matrix, and σ is a ReLU activation.

Table 6: Implementation details of different variations of the contrastive intrinsic-extrinsic disentanglement module ("-" represents the operation is the same as our original IEDR setting; "×" represents the variation does not contain the component).

Variation | Contrastive learning component | Disentangling component
IEDR | positive sample: f^u_ie(u_i, c_j) → (o^u_in)_ij; negative samples: f^u_ie(u_ℓ, c_i) → (o^u_in)_ℓi, f^u_ie(u_ℓ, c_j) → (o^u_in)_ℓj; with c_j = randChoice(NegGen1, NegGen2) | MLP_θ1(o^u_in) → (o^u_ex)' (q^u_1), MLP_θ2(o^u_ex) → (o^u_in)' (q^u_2)
noDis | - | ×
noCL | × | -
noCIED | × | ×
NegGen1 | c_j is generated from NegGen1 | -
NegGen2 | c_j is generated from NegGen2 | -
NegGen1&2 | - | -
vCLUB | - | MLP_θ1(o^u_in) → (o^u_ex)' (q^u_1)
BiDis | - | -
IEDRsp* | positive sample: dropout((o^u_in)_i) → (o^u_in)^p; negative samples: dropout((o^u_in)_ℓ) → (o^u_in)^n | -

* For IEDRsp, the positive samples (o^u_in)^p are generated through a dropout of the intrinsic representation of the user, and the negative samples (o^u_in)^n are generated through a dropout of the intrinsic representations of random users.

G.3 COMPARISON WITH A NAIVE BASELINE FOR INTRINSIC FACTOR LEARNING

IEDRsp is a naive baseline that learns intrinsic factor representations through dropout-based self-supervised contrastive learning (Yao et al., 2021). Table 7 compares our model with IEDRsp using different dropout rates (p = 0.1 and p = 0.5) in the contrastive learning component, and different component combinations (noDis, noCL, noCIED). We can see that our model outperforms this variation in recommendation accuracy, which indicates that IEDRsp cannot ensure successful intrinsic factor learning and hence incurs worse recommendation accuracy. Unlike IEDR, IEDRsp gains better performance with a lower dropout rate.
This is because, in IEDRsp, the dropout generates views representing the same user rather than different users, which is consistent with the conclusion in (Gao et al., 2021).

G.4 DIFFERENT NEGATIVE CONTEXT GENERATION METHODS

We propose two negative context generation methods for the contrastive learning component: 1) sampling other contexts; 2) applying a large dropout rate to the original context. We evaluate the two methods in this section. Table 8 shows the results of our model when using only NegGen1, only NegGen2, and NegGen1&2. We can see that NegGen1 results in better performance than NegGen2. This is because NegGen1 uses true context representations, which are consistent with what may appear in the test samples. Meanwhile, we see that NegGen1&2 results in the best performance. This is because NegGen2 provides more unseen (randomly generated) context representations, which strengthens the generalization ability of our model. Next, we evaluate NegGen2 with different dropout rates in Figure 9. The best performance is achieved when the dropout rate ranges from 0.5 to 0.7. This is consistent with our claim in Section 4.2.1. The reason is that a small dropout rate (e.g., 0.1) pushes the generated context representation too close to the original one; hence it cannot be considered a different context. However, a very large dropout rate (e.g., 0.9) loses too much information; hence, the result is no longer a valid context representation. In addition, for NegGen1&2 with all dropout rates, the results consistently outperform those using only NegGen2.
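The two generation methods can be sketched as follows. This is a minimal NumPy illustration; the unscaled dropout mask in NegGen2 (no 1/(1-rate) rescaling) is a simplifying assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_gen1(context_pool, i):
    """NegGen1: sample the representation of a different, real context."""
    j = rng.choice([k for k in range(len(context_pool)) if k != i])
    return context_pool[j]

def neg_gen2(c, rate=0.5):
    """NegGen2: apply a large dropout to the original context representation,
    yielding an unseen (corrupted) context."""
    mask = rng.random(c.shape) >= rate
    return c * mask

contexts = rng.normal(size=(10, 8))   # 10 context representations of dim 8
c_neg_real = neg_gen1(contexts, i=0)  # a true but different context
c_neg_drop = neg_gen2(contexts[0])    # a synthetic, heavily corrupted context
print(c_neg_real.shape, c_neg_drop.shape)
```

NegGen1 stays on the data manifold of real contexts, while NegGen2 produces off-manifold contexts; the experiments above suggest mixing both works best.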

G.5 EMPIRICAL ANALYSIS OF TIME COMPLEXITY

We summarize the overall time consumption of IEDR and several feature interaction-based baseline models in Table 9. The results are recorded by running the models for one batch (batch size 1024) of the Frappe dataset on a machine with a 12th Gen Intel(R) Core(TM) i9-12900K CPU, 32GB RAM, and an NVIDIA GeForce RTX 3090 GPU. We can see that our model's overall time consumption is slightly higher than that of the other baselines. Next, we summarize the time cost of the critical procedures in IEDR in Table 10. The first four rows are model forwarding procedures, and the last two rows are (alternating) model optimization procedures. Table 10 shows that the feature interaction modeling procedure takes most of the time, which is consistent with our analysis above. The CICL and disentangling forward procedures (rows 2-4) do not impose much overhead since they reuse the feature interaction modeling results. Optimization (step 1) is the alternating training step that updates the parameters of the model's disentangling component (q_1 and q_2). This alternating optimization procedure produces little overhead (2.21 ms), which is negligible in the whole procedure.

G.6 EFFECTIVENESS OF MODEL HYPER PARAMETERS

We evaluate our model with different hyperparameter settings, including embedding dimensions, numbers of negative samples, and loss weight values. Below, we summarize our observations.

G.6.1 EMBEDDING DIMENSION

We run our model with different feature embedding dimensions and show the results in Figure 10. Choosing the embedding dimension is a trade-off between expressiveness and efficiency. From the figure, we can see that larger dimensions result in better prediction accuracy. However, the improvement is not significant once the dimension exceeds 32, and a larger dimension may even reduce performance due to overfitting (e.g., dimension 256 for the Frappe dataset).

G.6.2 THE NUMBER OF NEGATIVE SAMPLES AND THE LOSS WEIGHT

The contrastive learning and disentangling components are both contrastive-based methods that require negative sampling. This section evaluates how the number of negative samples influences performance. We also compare the influence of different loss weights for the two components. We run our model with different numbers of negative samples and loss weights for the two components, respectively. From Figure 11, we can see that a large loss weight or a large number of negative samples does not necessarily result in better performance; there is an optimal combination of the loss weight and the number of negative samples for each component. A loss weight that is too large or too small may unbalance the multi-task training, harming the final performance. For the number of negative samples, too small a number makes the training insufficient, while too large a number may cause overfitting.

G.7 MORE VISUALIZATIONS OF INTRINSIC AND EXTRINSIC REPRESENTATIONS

This section provides complete intrinsic and extrinsic representation visualizations for three variations: 1) the contrastive learning component removed (noCL); 2) the disentangling component removed (noDis); and 3) the asymmetric disentanglement method (vCLUB). Figure 12 compares these results. Our main observations are:
• The intrinsic and extrinsic factors are perfectly disentangled with our CIED module (IEDR).
• Without the disentangling component (noDis), the intrinsic and extrinsic disentangling procedure may not succeed. This is because there is no restriction on the extrinsic representations: they can contain any information, including information about the intrinsic factor.
• noCL also has worse disentangling performance than IEDR. This is because the factors disentangled by noCL are implicit. Implicit factors only ensure disentanglement between the factors of the same data sample, not between the factors of different data samples. For example, some context information may be stored in the intrinsic representation of data sample 1 but in the extrinsic representation of data sample 2.
• noCIED performs worst among all variations, which is reasonable since it does not distinguish the intrinsic and extrinsic representations.
• vCLUB performs disentanglement but is not very stable in some situations. This is consistent with our analysis in Section 4.2.2.

H ALGORITHM

This section provides the training process of our IEDR model in Algorithm 1. In each epoch, we use the batch stochastic gradient descent method. The part of the algorithm included in this excerpt is listed below.

Algorithm 1: Batch stochastic gradient descent training of IEDR.
1: Input: D = {(u_i, v_i, c_i)}_{i=1:N} with the corresponding true label y_i for each data sample.
2: Hyperparameters: B: batch size; L: negative sample number for the context-invariant contrastive learning component; L_dis: negative sample number for the disentangling component.
3: Parameters: θ^u_1, θ^u_2, θ^v_1, θ^v_2: parameters for q^u_1, q^u_2, q^v_1, q^v_2, respectively; ω: parameters of IEDR except for θ^u_1, θ^u_2, θ^v_1, θ^v_2.
4: function ContrastiveLearningUser({(u_i, c_i)}_{i=1:B})
5:   for i = 1, ..., B do
6:     (o^u_in)_ii ← f^u_ie(u_i, c_i)
7:     ContextGen ← RandomChoice(NegGen1, NegGen2)
8:     c_j ← ContextGen(c_i)
9:     (o^u_in)_ij ← f^u_ie(u_i, c_j)    ▷ Generate positive samples.
...
(o^u_ex)^pred_ii ← q_θ1((o^u_in)_ii), (o^u_in)^pred_ii ← q_θ2((o^u_ex)_ii)    ▷ Predict each factor from the other for the disentangling component.
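The visible part of Algorithm 1, the per-batch contrastive sample generation, can be sketched in Python as follows. The function and variable names are illustrative assumptions, and f_ie here stands in for the intrinsic-part output of the factor extractor only.

```python
import random

def contrastive_batch(users, contexts, f_ie, neg_gens, L=2, seed=0):
    """Sketch of Algorithm 1's per-batch sample generation: for each (u_i, c_i),
    build the anchor (o_in)_ii, a positive (o_in)_ij under a generated context
    c_j, and L negatives from other users under the original context."""
    rng = random.Random(seed)
    samples = []
    for i, (u, c) in enumerate(zip(users, contexts)):
        anchor = f_ie(u, c)                       # (o_in)_ii
        c_j = rng.choice(neg_gens)(c)             # RandomChoice(NegGen1, NegGen2)
        positive = f_ie(u, c_j)                   # (o_in)_ij
        others = [l for l in range(len(users)) if l != i]
        negatives = [f_ie(users[l], c) for l in rng.sample(others, L)]  # (o_in)_li
        samples.append((anchor, positive, negatives))
    return samples

# Toy instantiation: representations are strings, f_ie concatenates its inputs.
f_ie = lambda u, c: f"{u}|{c}"
neg_gens = [lambda c: c + "'", lambda c: c + "*"]   # stand-ins for NegGen1/NegGen2
batch = contrastive_batch(["u1", "u2", "u3"], ["c1", "c2", "c3"], f_ie, neg_gens)
print(len(batch), len(batch[0][2]))  # 3 2
```

In the real model the anchors, positives, and negatives would then feed the L_CICL loss, and the q_θ1/q_θ2 predictions would feed the disentangling loss.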



I_bi-vCLUB(o^u_in; o^u_ex) is the average of two vCLUB-based upper bounds in the two directions. Therefore, I_bi-vCLUB(o^u_in; o^u_ex) is clearly still an upper bound of I(o^u_in; o^u_ex). Note that Equation (10) and Equation (1) are equivalent only if f^u_ie(u, c) is a many-to-one (or one-to-one) mapping. Otherwise, given a sample pair (u, c), f^u_ie(u, c) may produce different o^u_in outputs (i.e., one-to-many). In this situation, the first term of Equation (10) cannot guarantee that the same user under different contexts will have the same intrinsic factor (i.e., they may have various intrinsic factor representations while still meeting the objective of the first term of Equation (10)). We use an MLP as f^u_ie(u, c), which is a many-to-one mapping function. Therefore, we can ensure the equivalence between Equation (10) and Equation (1).
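A sample-based sketch of this bidirectional bound (averaging the two directional vCLUB bounds) is given below, with an identity-mean Gaussian stand-in for the two variational networks; the stand-in is an assumption for illustration, not the trained q^u_1/q^u_2 of the model.

```python
import numpy as np

def vclub(x, y, log_q):
    """One-directional vCLUB sample bound: E_joint[log q] - E_marginals[log q]."""
    n = len(x)
    pos = np.mean([log_q(x[i], y[i]) for i in range(n)])
    neg = np.mean([log_q(x[i], y[j]) for i in range(n) for j in range(n)])
    return pos - neg

def bi_vclub(o_in, o_ex, log_q1, log_q2):
    """Bidirectional bound (BiDis): average the two directions so neither
    representation alone absorbs the information-minimization pressure."""
    return 0.5 * (vclub(o_in, o_ex, log_q1) + vclub(o_ex, o_in, log_q2))

rng = np.random.default_rng(0)
o_in = rng.normal(size=(32, 4))
o_ex = o_in + 0.1 * rng.normal(size=(32, 4))     # nearly identical -> high MI
log_q = lambda a, b: -0.5 * np.sum((b - a) ** 2) # identity-mean Gaussian stand-in
bound = bi_vclub(o_in, o_ex, log_q, log_q)
print(bound > 0)  # True: strongly dependent pairs give a large positive bound
```

Because the penalty is symmetric, gradient updates shrink the estimate by decorrelating the two representations rather than by collapsing the o^u_in's, which is the failure mode of the one-directional bound discussed in Appendix D.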



Figure 2: (a) Ablation studies with different component(s) removed. (b) The performance and variance statistics of vCLUB and BiDis. The vertical axis is NDCG@10.

Figure 3: Visualization of the learned intrinsic and extrinsic representations with t-SNE for the Frappe dataset. (a) User IEDR. (b) Item IEDR. (c) User noCIED. (d) Item noCIED. The blue dots are the intrinsic representations, and the red dots are the extrinsic representations.

Learning accurate intrinsic and extrinsic factors from contexts is an essential research topic in recommender systems. To address the problem that existing studies either neglect factor learning or learn the factors from only one specific context, we propose the intrinsic-extrinsic disentangled recommendation (IEDR) model. This generic model effectively learns intrinsic and extrinsic factors from various contexts for more accurate recommendation. IEDR comprises a context-invariant contrastive learning component and a mutual information minimization-based disentangling component, which together ensure the success of the factor learning. Extensive experiments demonstrate our model's ability to learn intrinsic and extrinsic factors and to leverage the learned factors for more accurate recommendation prediction. In future work, we may discover other types of factors beyond the intrinsic and extrinsic ones, and learn more fine-grained intrinsic and extrinsic factors (e.g., multiple intrinsic factors).

A PRELIMINARY

A.1 STATISTICAL INTERACTION GRAPH NETWORK (SIGN)

Figure 5: An illustrative example demonstrating the potential problem of asymmetric learning in vCLUB. The blue circles are intrinsic representations, and the red circles are extrinsic representations. The dotted arrows are the directions in which vCLUB pushes o^u_in and o^u_ex in their representation spaces.

Figure 9: The performance of different dropout rates for method 2 (NegGen2).

Figure 10: Hyperparameter study: different embedding dimensions d.

Figure 11: The performance of different numbers of negative samples and the loss weights in the risk minimization function for the contrastive learning component (left) and the disentangling component (right), respectively.

10:   for ℓ = 1, ..., L do    ▷ Generate negative samples.
11:     u_{ℓ1} ← randomChoice({u_i}_{i=1:B}), (o^u_in)_{ℓ1,i} ← f^u_ie(u_{ℓ1}, c_i)
12:     u_{ℓ2} ← randomChoice({u_i}_{i=1:B}), (o^u_in)_{ℓ2,j} ← f^u_ie(u_{ℓ2}, c_j)
13:   end for
14:   L_CICL(u_i, c_i) ← Equation (4) based on the above positive and negative samples.
15: end for
16: return average({L_CICL(u_i, c_i)}_{i=1:B})
17: end function
18: function ContrastiveLearning_Item({(v_i, c_i)}_{i=1:B})
19:   Symmetric to ContrastiveLearning_User.
20: end function
21: function Disentanglement_User({(u_i, c_i)}_{i=1:B})
22:   for i = 1, ..., B do
23:     (o^u_in)_{ii}, (o^u_ex)_{ii} ← f^u_ie(u_i, c_i)
24:

Comparing the performance of IEDR and the baselines on the time context-specific scenario.

Items (in category) of the highest intrinsic and extrinsic scores for different users in Weekday.

Dataset statistics. "Count" refers to the number of users/items; "Features" refers to the number of distinct features (for User and Item, the number of features excludes the user/item IDs).

Implementation details of different variations on the recommendation prediction module. "-" indicates that the operation is the same as in our original IEDR setting.

Comparing the performance of IEDRsp with different dropout rates (for NegGen2).

The overall time consumption of different models in one batch training.

The time consumption of critical procedures in IEDR in one batch training.

      (o^u_in)_r, (o^u_ex)_r ← randomChoice({((o^u_in)_{ii}, (o^u_ex)_{ii})}_{i=1:B})
      return average({(L_bi-appr)_i}_{i=1:B}), average({(L_Dis)_i}_{i=1:B})
37: end function
38: function Disentanglement_Item({(v_i, c_i)}_{i=1:B})

G.2 FALLING INTO TRIVIAL SOLUTIONS

As discussed in Section C, our model may fall into a trivial solution if f^u_ie(u, c) is a linear mapping. To evaluate how the trivial solution influences factor learning, we run our model with a linear f_ie. Specifically, we concatenate u and c and feed them into an MLP without a hidden layer or activation (i.e., a linear mapping), making it easy to fall into the trivial solution. We call this variation Linear. We then avoid this by simply adding a nonlinear activation function (ReLU) after the linear mapping. We call this variation Nonlinear.

Figure 7 shows the weight values of f_ie for the two variations. The color indicates the weights mapping from user/context representations to intrinsic/extrinsic representations; the darker the color, the larger the weight (i.e., the more user/context information is mapped into the intrinsic/extrinsic representation). The figure shows that in the Linear variation, user information is largely mapped into the intrinsic representation (user-intrinsic block) but not the extrinsic representation (user-extrinsic block), and context information is largely mapped into the extrinsic representation (context-extrinsic block) but not the intrinsic representation (context-intrinsic block). This means that the Linear variation falls into the trivial solution. In contrast, in the Nonlinear variation, user information is also mapped into the extrinsic representation (user-extrinsic block), showing that the extrinsic representation contains both user and context information.

Figure 8 shows the performance of the two variations. The Linear variation performs worse than the Nonlinear variation, which shows that learning intrinsic and extrinsic factors yields better performance than simply mapping user and context information into two separate representations (the trivial solution).
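The structural difference between the two variations can be sketched as follows. This is a toy NumPy stand-in with illustrative dimensions and random weights; the actual f_ie is an MLP over learned feature embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # toy embedding dimension
W = rng.normal(size=(2 * d, 2 * d)) / np.sqrt(2 * d)

def f_ie_linear(u, c):
    """Linear variation: prone to the trivial solution, where the weight block
    from u to o_ex (and from c to o_in) can simply collapse toward zero."""
    o = np.concatenate([u, c]) @ W
    return o[:d], o[d:]  # o_in, o_ex

def f_ie_nonlinear(u, c):
    """Nonlinear variation: a ReLU after the linear map breaks the clean block
    structure, letting o_ex mix user and context information."""
    o = np.maximum(np.concatenate([u, c]) @ W, 0.0)
    return o[:d], o[d:]  # o_in, o_ex
```

In the linear case, each output coordinate is a fixed linear combination of the user and context blocks, so zeroing the cross-blocks of W satisfies the objectives trivially; the ReLU removes this degenerate optimum.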

G.3 COMPARING THE IMPACT OF DIFFERENT CONTRASTIVE LEARNING VARIATIONS

To learn the intrinsic factor, we propose a context-invariant contrastive learning method. A seemingly more direct alternative is to generate intrinsic factor representations from user information alone, i.e., o^u_in = f^u_ie(u). We argue that representations learned this way are not guaranteed to capture intrinsic factors: the information in the learned representations can still vary with different contexts, since these factors are never modeled w.r.t. the contexts. In this section, we empirically show that this approach to learning intrinsic factors is inferior to our context-invariant contrastive learning method in producing accurate recommendations. To do so, we design a variation (IEDR_sp) that splits the intrinsic-extrinsic factor generation into two functions: o^u_in = f^u_in(u) and o^u_ex = f^u_ex(u, c). Both f_in and f_ex have the same structure as f_ie, with the output dimension halved to keep the factor representation sizes consistent.

Algorithm 1 Batch stochastic gradient descent training of IEDR (continued).
    for i = 1, ..., B do    ▷ Lines 45-47 are the recommendation prediction module.
45:
    end for
50: L_RP ← average({(L_RP)_i}_{i=1:B})
51:
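The split factor generation of the IEDR_sp variation described above can be sketched as follows. This is a toy NumPy stand-in with random weights; the paper's f_in and f_ex are MLPs, and all dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8  # toy embedding dimension

# Hypothetical toy encoders for the IEDR_sp variation: the intrinsic
# representation sees only the user, while the extrinsic one sees user + context.
# Output dimensions are halved so the concatenated factors match f_ie's size.
W_in = rng.normal(size=(d, d // 2))
W_ex = rng.normal(size=(2 * d, d // 2))

def f_in(u):
    return np.tanh(u @ W_in)                       # o^u_in = f^u_in(u)

def f_ex(u, c):
    return np.tanh(np.concatenate([u, c]) @ W_ex)  # o^u_ex = f^u_ex(u, c)
```

Note that here o^u_in is context-invariant merely by construction (the context never enters f_in), rather than being learned to be invariant across contexts, which is exactly why IEDR_sp cannot guarantee that this representation captures the intrinsic factor.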

