STABLEDR: STABILIZED DOUBLY ROBUST LEARNING FOR RECOMMENDATION ON DATA MISSING NOT AT RANDOM

Abstract

In recommender systems, users tend to rate only the items they like, which leads to data missing not at random (MNAR) and poses a great challenge for the unbiased evaluation and learning of prediction models. Doubly robust (DR) methods have been widely studied for this problem and demonstrate superior performance. However, in this paper we show that DR methods are unstable: their bias, variance, and generalization bounds are unbounded in the presence of extremely small propensities. Moreover, DR's heavier reliance on extrapolation can lead to suboptimal performance. To address these limitations while retaining double robustness, we propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation. Theoretical analysis shows that StableDR has a bounded bias, variance, and generalization error bound simultaneously under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for StableDR that cyclically updates the imputation, propensity, and prediction models, achieving more stable and accurate predictions. Extensive experiments show that our approaches significantly outperform the existing methods.

1. INTRODUCTION

Modern recommender systems (RSs) are rapidly evolving with the adoption of sophisticated deep learning models (Zhang et al., 2019). However, it is well documented that directly using advanced deep models usually achieves sub-optimal performance due to the existence of various biases in RS (Chen et al., 2020; Wu et al., 2022b), and these biases can be amplified over time (Mansoury et al., 2020; Wen et al., 2022). A large number of debiasing methods have emerged and debiasing has gradually become a trend. For many practical tasks in RS, such as rating prediction (Schnabel et al., 2016; Wang et al., 2020a; 2019), post-view click-through rate prediction (Guo et al., 2021), post-click conversion rate prediction (Zhang et al., 2020; Dai et al., 2022), and uplift modeling (Saito et al., 2019; Sato et al., 2019; 2020), a critical challenge is to combat the selection bias and confounding bias that lead to a significant difference between the training sample and the target population (Hernán & Robins, 2020). Various methods have been designed to address this problem, and among them, doubly robust (DR) methods (Wang et al., 2019; Zhang et al., 2020; Chen et al., 2021; Dai et al., 2022; Ding et al., 2022) play the dominant role due to their better performance and theoretical properties. The success of DR is attributed to its double robustness and joint-learning technique. However, the DR methods still have many limitations. Theoretical analysis shows that inverse probability scoring (IPS) and DR methods may have infinite bias, variance, and generalization error bounds in the presence of extremely small propensity scores (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Li et al., 2023b). In addition, because users are more inclined to evaluate their preferred items, the problem of data missing not at random (MNAR) often occurs in RS.
This causes selection bias and results in inaccuracy for methods that rely more on extrapolation, such as error-imputation-based (EIB) (Marlin et al., 2007; Steck, 2013) and DR methods.

Figure 1: Two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) uses a fixed imputation/propensity model (Left), whereas DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) use alternating learning between the imputation/propensity model and the prediction model (Middle). The proposed learning approach updates the three models cyclically with stabilization (Right).

To overcome the above limitations while maintaining double robustness, we propose a stabilized doubly robust (SDR) estimator with a weaker reliance on extrapolation, which reduces the negative impact of extrapolation and the MNAR effect on the imputation model. Through theoretical analysis, we demonstrate that SDR has a bounded bias and generalization error bound for arbitrarily small propensities, which further indicates that SDR can achieve more stable predictions. Furthermore, we propose a novel cycle learning approach for SDR. Figure 1 shows the differences between the proposed cycle learning of SDR and the existing unbiased learning approaches. Two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) first obtains an imputation/propensity model to estimate the ideal loss and then updates the prediction model by minimizing the estimated loss. DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) alternately update the model used to estimate the ideal loss and the prediction model. The proposed learning method cyclically uses different losses to update the three models with the aim of achieving more stable and accurate prediction results.
We have conducted extensive experiments on two real-world datasets, and the results show that the proposed approach significantly improves debiasing and convergence performance compared to the existing methods.

2.1. PROBLEM SETTING

In RS, because users are more inclined to evaluate their preferred items, the collected ratings are always missing not at random (MNAR). We formulate the data MNAR problem using the widely adopted potential outcome framework (Neyman, 1990; Imbens & Rubin, 2015). Let $\mathcal{U} = \{1, 2, ..., U\}$, $\mathcal{I} = \{1, 2, ..., I\}$, and $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ be the index sets of users, items, and all user-item pairs, respectively. For each $(u,i) \in \mathcal{D}$, we have a treatment $o_{u,i} \in \{0,1\}$, a feature vector $x_{u,i}$, and an observed rating $r_{u,i}$, where $o_{u,i} = 1$ if user $u$ rated item $i$ in the logging data, and $o_{u,i} = 0$ if the rating is missing. Let $r_{u,i}(1)$ be the rating that would be observed if item $i$ had been rated by user $u$, which is observable only for $\mathcal{O} = \{(u,i) \mid (u,i) \in \mathcal{D}, o_{u,i} = 1\}$. Many tasks in RS can be formulated as predicting the potential outcome $r_{u,i}(1)$ from the feature $x_{u,i}$ for each $(u,i)$. Let $\hat{r}_{u,i}(1) = f(x_{u,i}; \phi)$ be a prediction model with parameters $\phi$. If all the potential outcomes $\{r_{u,i}(1) : (u,i) \in \mathcal{D}\}$ were observed, the ideal loss function for solving $\phi$ would be
$$\mathcal{L}_{ideal}(\phi) = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} e_{u,i},$$
where $e_{u,i}$ is the prediction error, e.g., the squared loss $e_{u,i} = (r_{u,i}(1) - \hat{r}_{u,i}(1))^2$. $\mathcal{L}_{ideal}(\phi)$ can be regarded as a benchmark unbiased loss function, even though it is infeasible due to the missingness of $\{r_{u,i}(1) : o_{u,i} = 0\}$. As such, a variety of methods have been developed that approximate $\mathcal{L}_{ideal}(\phi)$ to address the selection bias, among which the propensity-based estimators show relatively superior performance (Schnabel et al., 2016; Wang et al., 2019). The IPS and DR estimators are
$$\mathcal{E}_{IPS} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} e_{u,i}}{\hat{p}_{u,i}}, \qquad \mathcal{E}_{DR} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \Big[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \Big],$$
where $\hat{p}_{u,i}$ is an estimate of the propensity score $p_{u,i} := \mathbb{P}(o_{u,i} = 1 \mid x_{u,i})$, and $\hat{e}_{u,i}$ is an estimate of $e_{u,i}$.
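The estimators above can be illustrated with a short NumPy sketch (the toy data and variable names are our own assumptions, not from the paper): when the propensities are accurate, both IPS and DR recover the ideal loss, even though the imputed errors fed to DR are deliberately inaccurate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000  # number of user-item pairs in D (toy size)

# Toy ground truth: prediction errors e_{u,i}, true propensities p_{u,i},
# and exposure indicators o_{u,i} ~ Bernoulli(p_{u,i}).
e = rng.uniform(0.0, 1.0, n)            # true prediction errors
p = rng.uniform(0.1, 0.9, n)            # true propensities
o = rng.binomial(1, p)                  # observation indicators

p_hat = p.copy()                        # accurate propensities in this sketch
e_hat = e + rng.normal(0, 0.3, n)       # deliberately inaccurate imputed errors

L_ideal = e.mean()                                   # infeasible benchmark
E_ips = np.mean(o * e / p_hat)                       # IPS estimate
E_dr = np.mean(e_hat + o * (e - e_hat) / p_hat)      # DR estimate

# With accurate propensities, both estimates are close to L_ideal,
# illustrating the double robustness of DR under wrong imputed errors.
print(L_ideal, E_ips, E_dr)
```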

2.2. RELATED WORK

Debiased learning in recommendation. The data collected in RS suffer from various biases (Chen et al., 2020; Wu et al., 2022b), which are entangled with the true preferences of users and pose a great challenge to unbiased learning. There has been increasing interest in coping with different biases in recent years (Zhang et al., 2021; Ai et al., 2018; Liu et al., 2016; Liu et al., 2021). Schnabel et al. (2016) proposed the inverse propensity score (IPS) and self-normalized IPS (SNIPS) methods to address the selection bias on data missing not at random, and Saito (2019) and Saito et al. (2020) extended them to implicit feedback data. Marlin et al. (2007) and Steck (2013) derived an error-imputation-based (EIB) unbiased learning method. These three approaches adopt two-phase learning (Wang et al., 2021), which first learns a propensity/imputation model and then applies it to construct an unbiased estimator of the ideal loss to train the recommendation model. A doubly robust joint learning (DR-JL) method (Wang et al., 2019) was proposed by combining the IPS and EIB approaches. Subsequently, strands of enhanced joint learning methods were developed, including MRDR (Guo et al., 2021), Multi-task DR (Zhang et al., 2020), DR-MSE (Dai et al., 2022), BRD-DR (Ding et al., 2022), TDR (Li et al., 2023b), uniform-data-aware methods (Bonner & Vasile, 2018; Liu et al., 2020; Chen et al., 2021; Wang et al., 2021; Li et al., 2023c) that seek better recommendation strategies by leveraging a small uniform dataset, and the multiple robust method (Li et al., 2023a) that specifies multiple propensity and imputation models and achieves unbiased learning if any of the propensity models, imputation models, or even a linear combination of these models can accurately estimate the true propensities or prediction errors. Chen et al. (2020) reviewed various biases in RS and discussed the recent progress on debiasing tasks. Wu et al. (2022b) established the connections between the biases in RS and those in causal inference, thereby presenting formal causal definitions of the biases in RS.

Stabilized causal effect estimation. The proposed method builds on stabilized average causal effect estimation approaches in causal inference. Molenberghs et al. (2015) summarized the limitations of doubly robust methods, including instability under small propensities (Kang & Schafer, 2007; Wu et al., 2022a), unboundedness (van der Laan & Rose, 2011), and large variance (Tan, 2007). These issues inspired a series of stabilized causal effect estimation methods in statistics (Kang & Schafer, 2007; Bang & Robins, 2005; van der Laan & Rose, 2011; Molenberghs et al., 2015). Unlike previous works that focused only on achieving unbiased learning in RS, this paper provides a new perspective for developing doubly robust estimators with much more stable statistical properties.

3. STABILIZED DOUBLY ROBUST ESTIMATOR

In this section, we elaborate on the limitations of DR methods and propose a stabilized DR (SDR) estimator with a weaker reliance on extrapolation. Theoretical analysis shows that SDR has a bounded bias and generalization error bound for arbitrarily small propensities, whereas IPS and DR do not.

3.1. MOTIVATION

Even though the DR estimator has the double robustness property, its performance could be significantly improved if the following three stabilization aspects were enhanced.

More stable to small propensities. As shown in Schnabel et al. (2016), Wang et al. (2019), and Guo et al. (2021), if there exist some extremely small estimated propensity scores, the IPS/DR estimator and its bias, variance, and tail bound are unbounded, deteriorating the prediction accuracy. Moreover, such problems are widespread in practice, given that there are many long-tailed users and items in RS, resulting in the presence of extreme propensities.

More stable through weakening extrapolation. DR relies more on extrapolation because the imputation model in DR is learned from the exposed events $\mathcal{O}$ and extrapolated to the unexposed events. If the distribution of $e_{u,i}$ differs greatly between $o_{u,i} = 0$ and $o_{u,i} = 1$, the imputed errors are likely to be inaccurate on the unexposed events and incur bias in DR. Therefore, developing an enhanced DR method with a weaker reliance on extrapolation helps reduce bias.

More stable training process for updating the prediction model. In general, alternating training between models results in better performance. As shown in Figure 1, Wang et al. (2019) propose joint learning for DR, alternately updating the error imputation and prediction models given estimated propensities. Double learning (Guo et al., 2021) further incorporates parameter sharing between the imputation and prediction models. Bi-level optimization (Wang et al., 2021; Chen et al., 2021) can be viewed as alternately updating the prediction model and the other parameters used to estimate the loss.
To the best of our knowledge, this is the first paper to propose an algorithm that updates the three models (i.e., the error imputation model, the propensity model, and the prediction model) separately using different optimizers, which may result in more stable and accurate rating predictions.

3.2. STABILIZED DOUBLY ROBUST ESTIMATOR

We propose a stabilized doubly robust (SDR) estimator that has a weaker dependence on extrapolation and is robust to small propensities. The SDR estimator is constructed in the following three steps.

Step 1 (Initialize imputed errors). Pre-train an imputation model $\hat{e}_{u,i}$, and let $\hat{E} \triangleq |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \hat{e}_{u,i}$.

Step 2 (Learn constrained propensities). Learn a propensity model $\hat{p}_{u,i}$ satisfying the stabilization constraint
$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}}{\hat{p}_{u,i}} \big( \hat{e}_{u,i} - \hat{E} \big) = 0. \tag{1}$$

Step 3 (SDR estimator). The SDR estimator is given as
$$\mathcal{E}_{SDR} = \frac{\sum_{(u,i) \in \mathcal{D}} o_{u,i} e_{u,i} / \hat{p}_{u,i}}{\sum_{(u,i) \in \mathcal{D}} o_{u,i} / \hat{p}_{u,i}} \triangleq \sum_{(u,i) \in \mathcal{D}} w_{u,i} e_{u,i}, \quad \text{where } w_{u,i} = \frac{o_{u,i} / \hat{p}_{u,i}}{\sum_{(u,i) \in \mathcal{D}} o_{u,i} / \hat{p}_{u,i}}. \tag{3}$$

The SDR estimator has the same form as the SNIPS estimator, but the propensities are learned differently: in SDR, the estimation of the propensity model relies on the imputed errors, whereas in SNIPS it does not. Each step in the construction of the SDR estimator plays a different role. Specifically, Step 2 is designed to enable the double robustness property, as shown in Theorem 1 (see Appendix A.1 for proofs).

Theorem 1 (Double Robustness). $\mathcal{E}_{SDR}$ is an asymptotically unbiased estimator of $\mathcal{L}_{ideal}$ when either the learned propensities $\hat{p}_{u,i}$ or the imputed errors $\hat{e}_{u,i}$ are accurate for all user-item pairs.

We provide an intuitive illustration of the rationale of SDR. On the one hand, suppose the propensities can be accurately estimated (i.e., $\hat{p}_{u,i} = p_{u,i}$) by a common model (e.g., logistic regression) without imposing constraint (1). Then the expectation of the left-hand side of constraint (1) becomes
$$\mathbb{E}_O \left[ \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}}{\hat{p}_{u,i}} \big( \hat{e}_{u,i} - \hat{E} \big) \right] = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \hat{e}_{u,i} - \hat{E} \equiv 0,$$
which indicates that constraint (1) always holds as the sample size goes to infinity, by the strong law of large numbers, irrespective of the accuracy of the imputed errors $\hat{e}_{u,i}$. This implies that constraint (1) imposes almost no restriction on the estimation of propensities. In this case, the SDR estimator is almost equivalent to the original SNIPS estimator.
On the other hand, suppose the propensities cannot be accurately estimated by a common model, but the imputed errors are accurate (i.e., $\hat{e}_{u,i} = e_{u,i}$). In this case, $\hat{E}$ is an unbiased estimator of $\mathcal{L}_{ideal}$. Specifically, $\mathcal{E}_{SDR}$ satisfies
$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} (e_{u,i} - \mathcal{E}_{SDR})}{\hat{p}_{u,i}} = 0. \tag{2}$$
Combining constraint (1) and equation (2) gives
$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} + \frac{o_{u,i} (\hat{E} - \mathcal{E}_{SDR})}{\hat{p}_{u,i}} \right] = 0,$$
where the first term equals 0 if $\hat{e}_{u,i} = e_{u,i}$. This implies that $\mathcal{E}_{SDR} = \hat{E}$, and the unbiasedness of $\mathcal{E}_{SDR}$ then follows immediately from the unbiasedness of $\hat{E}$.

In addition, Step 3 is designed to achieve stability for two main reasons. First, $\mathcal{E}_{SDR}$ is more robust to extrapolation than DR, because the propensities are learned from the entire data and thus place less demand on extrapolation. Second, $\mathcal{E}_{SDR}$ is more stable to small propensities, since the self-normalization forces each weight $w_{u,i}$ to fall in the interval $[0, 1]$. In summary, forcing the propensities to satisfy constraint (1) makes the SDR estimator not only doubly robust but also able to capture the advantages of both the SNIPS and DR estimators. The design of SDR enables the constrained propensities to adaptively find the direction of debiasing if either the learned propensities without constraint (1) or the imputed errors are accurate.
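This second half of the double robustness argument can be checked numerically. The sketch below uses toy data of our own; in particular, the bisection-based tilt that enforces constraint (1) is purely an illustrative device, not the paper's training procedure. Starting from deliberately wrong propensities, tilting them until constraint (1) holds, and using accurate imputed errors makes the SDR estimate coincide with the ideal loss.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
e = rng.uniform(0.0, 1.0, n)         # true prediction errors
p = rng.uniform(0.2, 0.9, n)         # true propensities
o = rng.binomial(1, p)               # exposure indicators
e_hat = e.copy()                     # accurate imputed errors (Theorem 1's case)

# Deliberately misspecified base propensities, then tilted so that
# constraint (1), mean(o * (e_hat - E_hat) / p_hat) = 0, holds.
p0 = np.full(n, 0.5)
d = e_hat - e_hat.mean()
dm = d[o == 1]                       # the constraint only involves observed pairs
g = lambda t: np.mean(o * d / (p0 * (1.0 + t * d)))  # constraint residual
lo, hi = -1.0 / dm.max() + 1e-9, -1.0 / dm.min() - 1e-9
for _ in range(200):                 # g is strictly decreasing: bisection on t
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
p_hat = p0 * (1.0 + 0.5 * (lo + hi) * d)

# Self-normalized SDR estimate: wrong propensities, accurate imputed errors.
E_sdr = np.sum(o * e / p_hat) / np.sum(o / p_hat)
print(abs(E_sdr - e.mean()))         # essentially zero
```

Here the single tilt parameter stands in for the freedom the propensity model has to satisfy constraint (1); the real method instead trains the propensity model with a stabilization regularizer (Section 4).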

3.3. THEORETICAL ANALYSIS OF STABLENESS

Through theoretical analysis, we note that previous debiasing estimators such as IPS (Schnabel et al., 2016) and DR-based methods (Wang et al., 2019; Guo et al., 2021) tend to have infinite biases, variances, tail bounds, and corresponding generalization error bounds in the presence of extremely small estimated propensities. Remarkably, the proposed SDR estimator does not suffer from such problems and is stable to arbitrarily small propensities, as shown in the following theorems (see Appendices A.2, A.3, and A.4 for proofs).

Theorem 2 (Bias of SDR). Given imputed errors $\hat{e}_{u,i}$ and learned propensities $\hat{p}_{u,i}$ satisfying the stabilization constraint (1), with $\hat{p}_{u,i} > 0$ for all user-item pairs, the bias of $\mathcal{E}_{SDR}$ is
$$\mathrm{Bias}(\mathcal{E}_{SDR}) = \left| \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \delta_{u,i} - \frac{\sum_{(u,i) \in \mathcal{D}} \delta_{u,i}\, p_{u,i} / \hat{p}_{u,i}}{\sum_{(u,i) \in \mathcal{D}} p_{u,i} / \hat{p}_{u,i}} \right| + O(|\mathcal{D}|^{-1}),$$
where $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$ is the error deviation.

Theorem 2 shows that the bias of the SDR estimator consists of a dominant term, given by the difference between the average of $\delta_{u,i}$ and a weighted average of $\delta_{u,i}$, and a negligible term of order $O(|\mathcal{D}|^{-1})$. Since $\delta_{u,i}$ and its convex combinations are bounded, the bias is bounded for arbitrarily small $\hat{p}_{u,i}$. In contrast,
$$\mathrm{Bias}(\mathcal{E}_{IPS}) = |\mathcal{D}|^{-1} \Big| \sum_{(u,i) \in \mathcal{D}} \frac{(p_{u,i} - \hat{p}_{u,i}) e_{u,i}}{\hat{p}_{u,i}} \Big|, \qquad \mathrm{Bias}(\mathcal{E}_{DR}) = |\mathcal{D}|^{-1} \Big| \sum_{(u,i) \in \mathcal{D}} \frac{(p_{u,i} - \hat{p}_{u,i}) \delta_{u,i}}{\hat{p}_{u,i}} \Big|,$$
which indicates that IPS and DR can have extremely large bias when there exists an extremely small $\hat{p}_{u,i}$.

Theorem 3 (Variance of SDR). Under the conditions of Theorem 2, the variance of $\mathcal{E}_{SDR}$ is
$$\mathrm{Var}(\mathcal{E}_{SDR}) = \frac{\sum_{(u,i) \in \mathcal{D}} p_{u,i} (1 - p_{u,i}) h_{u,i}^2 / \hat{p}_{u,i}^2}{\big( \sum_{(u,i) \in \mathcal{D}} p_{u,i} / \hat{p}_{u,i} \big)^2} + O(|\mathcal{D}|^{-2}),$$
where $h_{u,i} = (e_{u,i} - \hat{e}_{u,i}) - \sum_{(u,i) \in \mathcal{D}} \{ p_{u,i} (e_{u,i} - \hat{e}_{u,i}) / \hat{p}_{u,i} \} \,/\, \sum_{(u,i) \in \mathcal{D}} \{ p_{u,i} / \hat{p}_{u,i} \}$ is a bounded difference between $e_{u,i} - \hat{e}_{u,i}$ and its weighted average.

Theorem 3 shows that the variance of the SDR estimator consists of a dominant term and a negligible term of order $O(|\mathcal{D}|^{-2})$.
The boundedness of the variance for arbitrarily small $\hat{p}_{u,i}$ follows directly from the fact that SDR has a bounded range due to its self-normalized form. In contrast,
$$\mathrm{Var}(\mathcal{E}_{IPS}) = |\mathcal{D}|^{-2} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i} (1 - p_{u,i}) e_{u,i}^2}{\hat{p}_{u,i}^2}, \qquad \mathrm{Var}(\mathcal{E}_{DR}) = |\mathcal{D}|^{-2} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i} (1 - p_{u,i}) (e_{u,i} - \hat{e}_{u,i})^2}{\hat{p}_{u,i}^2},$$
which indicates that the variance of IPS and DR becomes extremely large (tends to infinity) when there exists an extremely small $\hat{p}_{u,i}$ (tending to 0).

Theorem 4 (Tail Bound of SDR). Under the conditions of Theorem 2, for any prediction model, with probability $1 - \eta$, the deviation of $\mathcal{E}_{SDR}$ from its expectation has the following tail bound:
$$|\mathcal{E}_{SDR} - \mathbb{E}_O(\mathcal{E}_{SDR})| \leq \sqrt{ \frac{1}{2} \log \frac{4}{\eta} \sum_{(u,i) \in \mathcal{D}} \frac{(\delta_{max} - \delta_{u,i})^2 + (\delta_{u,i} - \delta_{min})^2}{\big\{ 1 + \hat{p}_{u,i} \big( \sum_{\mathcal{D} \setminus (u,i)} p_{u,i} / \hat{p}_{u,i} - \epsilon' \big) \big\}^2 } },$$
where $\delta_{min} = \min_{(u,i) \in \mathcal{D}} \delta_{u,i}$, $\delta_{max} = \max_{(u,i) \in \mathcal{D}} \delta_{u,i}$, $\epsilon' = \sqrt{ \log(4/\eta)/2 \cdot \sum_{\mathcal{D} \setminus (u,i)} 1 / \hat{p}_{u,i}^2 }$, and $\mathcal{D} \setminus (u,i)$ is the set $\mathcal{D}$ excluding the element $(u,i)$.

Note that $\sum_{\mathcal{D} \setminus (u,i)} p_{u,i} / \hat{p}_{u,i} = O(|\mathcal{D}|)$ and $\epsilon' = O(|\mathcal{D}|^{1/2})$ in Theorem 4, so the tail bound of the SDR estimator converges to 0 for large samples. In addition, the tail bound is bounded for arbitrarily small $\hat{p}_{u,i}$. In contrast, with probability $1 - \eta$, the tail bounds of IPS and DR are
$$|\mathcal{E}_{IPS} - \mathbb{E}_O[\mathcal{E}_{IPS}]| \leq \sqrt{ \frac{\log(2/\eta)}{2 |\mathcal{D}|^2} \sum_{(u,i) \in \mathcal{D}} \Big( \frac{e_{u,i}}{\hat{p}_{u,i}} \Big)^2 }, \qquad |\mathcal{E}_{DR} - \mathbb{E}_O[\mathcal{E}_{DR}]| \leq \sqrt{ \frac{\log(2/\eta)}{2 |\mathcal{D}|^2} \sum_{(u,i) \in \mathcal{D}} \Big( \frac{\delta_{u,i}}{\hat{p}_{u,i}} \Big)^2 },$$
both of which are unbounded when $\hat{p}_{u,i} \to 0$. For SDR, in the prediction model training phase, the boundedness of the generalization error bound (see Theorem 5 in Appendix A.5) follows immediately from the boundedness of the bias and tail bound. The above analysis demonstrates that SDR comprehensively mitigates the negative effects caused by extreme propensities and yields more stable predictions. Theorems 2-5 are stated under constraint (1).
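A small numeric check makes the contrast concrete (toy numbers of our own choosing): as one learned propensity shrinks toward 0, the IPS estimate diverges, while the self-normalized SDR estimate remains a convex combination of the errors and so stays inside $[\min e, \max e]$.

```python
import numpy as np

# Toy illustration of Theorems 2-4: shrink one learned propensity toward 0.
e = np.array([0.2, 0.4, 0.6, 0.8])   # bounded prediction errors
o = np.array([1, 1, 1, 1])           # all pairs observed, for simplicity

for tiny in [1e-2, 1e-4, 1e-8]:
    p_hat = np.array([tiny, 0.5, 0.5, 0.5])
    E_ips = np.mean(o * e / p_hat)                        # blows up as tiny -> 0
    E_sdr = np.sum(o * e / p_hat) / np.sum(o / p_hat)     # stays in [0.2, 0.8]
    print(f"p_hat[0]={tiny:.0e}  IPS={E_ips:.2f}  SDR={E_sdr:.4f}")
```

The same mechanism bounds the DR error term: replacing $e_{u,i}$ by $\delta_{u,i}$ above reproduces the DR blow-up described after Theorem 3.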
If the propensities are estimated under constraint (1) but the constraint somehow fails to hold exactly, the associated bias, variance, and generalization error bound of SDR are presented in Appendix B.

4. CYCLE LEARNING WITH STABILIZATION

In this section, we propose a novel SDR-based cycle learning approach that not only exploits the stable statistical properties of the SDR estimator itself, but also carefully designs the updating process among the various models to achieve higher stability. In general, inspired by the idea of value iteration in reinforcement learning (Sutton & Barto, 2018), alternately updating the models tends to achieve better predictive performance, as existing debiasing training approaches suggest (Wang et al., 2019; Guo et al., 2021; Chen et al., 2021). As shown in Figure 1, the proposed approach dynamically interacts with three models, utilizing the propensity model and imputation model simultaneously in a differentiated way, and can be regarded as an extension of these methods. In cycle learning, given pre-trained propensities, the inverse-propensity-weighted imputation error loss is first used to obtain an imputation model; then constraint (1) is taken as a regularization term to train a stabilized propensity model and ensure the double robustness of SDR. Finally, the prediction model is updated by minimizing the SDR loss and is used to readjust the imputed errors. By repeating these updates cyclically, the cycle learning approach fully utilizes and combines the advantages of the three models to achieve more accurate rating predictions. Specifically, data MNAR leads to missing $r_{u,i}(1)$, so not all $e_{u,i}$ can be used directly. Therefore, we obtain imputed errors by learning a pseudo-labeling model $\tilde{r}_{u,i}(1)$ parameterized by $\beta$, and the imputed errors $\hat{e}_{u,i} = \mathrm{CE}(\hat{r}_{u,i}(1), \tilde{r}_{u,i}(1))$ are updated by minimizing
$$\mathcal{L}_e(\phi, \alpha, \beta) = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} (\hat{e}_{u,i} - e_{u,i})^2}{\pi(x_{u,i}; \alpha)} + \lambda_e \|\beta\|_F^2,$$
where $e_{u,i} = \mathrm{CE}(r_{u,i}(1), \hat{r}_{u,i}(1))$, $\lambda_e \geq 0$, $\hat{p}_{u,i} = \pi(x_{u,i}; \alpha)$ is the propensity model, and $\| \cdot \|_F^2$ is the Frobenius norm. For each observed rating, the inverse of the estimated propensity is used as a weight to account for MNAR effects.
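The imputation loss $\mathcal{L}_e$ can be written compactly; the sketch below is a minimal NumPy version (the function and argument names are our own, and the Frobenius-norm regularizer is reduced to a plain squared norm over a parameter array):

```python
import numpy as np

def imputation_loss(e_hat, e, o, pi, beta=None, lam_e=0.0):
    """Inverse-propensity-weighted imputation loss L_e: squared deviation
    between imputed errors e_hat and actual errors e, counted only on
    observed events (o = 1), plus an optional squared-norm penalty."""
    loss = np.mean(o * (e_hat - e) ** 2 / pi)
    if beta is not None:
        loss += lam_e * np.sum(beta ** 2)
    return loss

# Toy check: a perfect imputation incurs zero loss, and unobserved
# events (o = 0) contribute nothing regardless of their imputed error.
e = np.array([0.1, 0.2])
o = np.array([1, 0])
pi = np.array([0.5, 0.5])
print(imputation_loss(e, e, o, pi))                  # 0.0
print(imputation_loss(np.array([0.3, 0.9]), e, o, pi))
```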
Next, we consider two methods for estimating propensity scores: Naive Bayes with Laplace smoothing and logistic regression. The former provides a wide range of opportunities for achieving the stabilization constraint (1) through the selection of smoothing coefficients. The latter requires user and item embeddings, which are obtained by employing MF before performing cycle learning. The learned propensities need both to be accurate, which is evaluated with cross entropy, and to meet constraint (1) for stabilization and double robustness. The propensity model $\pi(x_{u,i}; \alpha)$ is updated using the loss $\mathcal{L}_{ce}(\phi, \alpha, \beta) + \eta \cdot \mathcal{L}_{stable}(\phi, \alpha, \beta)$, where $\mathcal{L}_{ce}(\phi, \alpha, \beta)$ is the cross-entropy loss of the propensity model,
$$\mathcal{L}_{stable}(\phi, \alpha, \beta) = \left( |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i}}{\pi(x_{u,i}; \alpha)} \big( \hat{e}_{u,i} - \hat{E} \big) \right)^2 + \lambda_{stable} \|\alpha\|_F^2,$$
$\lambda_{stable} \geq 0$, and $\eta$ is a trade-off hyper-parameter. Finally, the prediction model $f(x_{u,i}; \phi)$ is updated by minimizing the SDR loss
$$\mathcal{L}_{sdr}(\phi, \alpha, \beta) = \frac{\sum_{(u,i) \in \mathcal{D}} o_{u,i} e_{u,i} / \pi(x_{u,i}; \alpha)}{\sum_{(u,i) \in \mathcal{D}} o_{u,i} / \pi(x_{u,i}; \alpha)} + \lambda_{sdr} \|\phi\|_F^2,$$
where the first term is equivalent to the left-hand side of equation (3) and $\lambda_{sdr} \geq 0$. In cycle learning, the updated prediction model is then used to re-update the imputation model on the next sample batch. Notably, the designed algorithm strictly follows the proposed SDR estimator in Section 3.2. From Figure 1 and Alg. 1, our algorithm first updates the imputed errors $\hat{e}$ in Step 1, and then learns propensities $\hat{p}$ based on the learned $\hat{e}$ to satisfy constraint (1) in Step 2. The main purpose of the first two steps is to ensure that the SDR estimator in Step 3 is doubly robust and has a lower extrapolation dependence than previous DR methods. Finally, in Step 3, we update the predicted ratings $\hat{r}$ by minimizing the estimate of the ideal loss given by the proposed SDR estimator.
For the next round, instead of re-initializing, Step 1 updates the imputed errors $\hat{e}$ according to the new prediction model, Step 2 then re-updates the constrained propensities $\hat{p}$, and Step 3 updates the prediction model $\hat{r}$ again, and so on. We summarize the cycle learning approach in Alg. 1.
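One round of this cycle can be sketched end to end. The toy below is our own simplification, not the paper's algorithm: it uses squared loss instead of cross entropy, a one-parameter tilt solved by bisection instead of gradient training with $\mathcal{L}_{ce} + \eta \mathcal{L}_{stable}$, and a halving step instead of SGD on $\mathcal{L}_{sdr}$. It runs the Step 1 → Step 2 → Step 3 cycle and keeps constraint (1) satisfied throughout:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
p_true = rng.uniform(0.2, 0.9, n)
o = rng.binomial(1, p_true)          # exposure indicators
r = rng.uniform(0.0, 1.0, n)         # toy true ratings
r_pred = np.full(n, 0.5)             # prediction model output (one value per pair)
p0 = rng.uniform(0.2, 0.8, n)        # stand-in for a misspecified propensity model

def constrain(p_base, eh, obs, iters=200):
    """Toy Step 2: tilt p_base -> p_base * (1 + t * (eh - mean(eh))) and bisect
    on t so that constraint (1), mean(obs * (eh - mean(eh)) / p_hat) = 0, holds.
    The real method instead trains pi(x; alpha) with L_ce + eta * L_stable."""
    d = eh - eh.mean()
    dm = d[obs == 1]                 # the constraint only involves observed pairs
    g = lambda t: np.mean(obs * d / (p_base * (1.0 + t * d)))
    lo, hi = -1.0 / dm.max() + 1e-9, -1.0 / dm.min() - 1e-9
    for _ in range(iters):           # g is strictly decreasing on (lo, hi)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return p_base * (1.0 + 0.5 * (lo + hi) * d)

for cycle in range(5):
    # Step 1: refresh imputed errors from current predictions; unexposed pairs
    # get the mean observed error (a stand-in for the pseudo-labeling model).
    e_obs = (r - r_pred) ** 2
    e_hat = np.where(o == 1, e_obs, e_obs[o == 1].mean())
    # Step 2: re-learn constrained propensities.
    p_hat = constrain(p0, e_hat, o)
    # Step 3: update predictions toward minimizing the SDR loss (toy step:
    # move each observed prediction halfway to its target rating).
    r_pred = np.where(o == 1, r_pred + 0.5 * (r - r_pred), r_pred)
```

In the paper's actual algorithm, all three updates are gradient steps on $\mathcal{L}_e$, $\mathcal{L}_{ce} + \eta \mathcal{L}_{stable}$, and $\mathcal{L}_{sdr}$ with separate optimizers; the bisection tilt here only mimics the effect of the stabilization regularizer.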

5. REAL-WORLD EXPERIMENTS

In this section, several experiments are conducted to evaluate the proposed methods on two real-world benchmark datasets. We conduct experiments to answer the following questions: RQ1. Do the proposed Stable-DR and Stable-MRDR approaches improve debiasing performance compared to the existing studies? RQ2. Do our methods stably perform well under various propensity models? RQ3. How does the performance of our method change under different strengths of the stabilization constraint?

5.1. EXPERIMENTAL SETUP

Dataset and preprocessing. To answer the above RQs, we need datasets that contain both MNAR ratings and missing-at-random (MAR) ratings. Following previous studies (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Chen et al., 2021), we conduct experiments on two commonly used datasets. Coat contains ratings from 290 users on 300 items. Each user evaluates 24 items, yielding 6,960 MNAR ratings in total; in addition, each user evaluates 16 items at random, which generates 4,640 MAR ratings. Yahoo! R3 contains 311,704 MNAR and 54,000 MAR ratings in total, from 15,400 users on 1,000 items.

Baselines. In our experiments, we take Matrix Factorization (MF) (Koren et al., 2009) and NCF (He et al., 2017) as the base models, respectively, and compare the proposed methods with the following baselines: Base Model, IPS (Saito et al., 2020; Schnabel et al., 2016), SNIPS (Swaminathan & Joachims, 2015), IPS-AT (Saito, 2020), CVIB (Wang et al., 2020b), DR (Saito, 2020), DR-JL (Wang et al., 2019), and MRDR-JL (Guo et al., 2021). In addition, Naive Bayes with Laplace smoothing and logistic regression are used to build the propensity models, respectively.

Experimental protocols and details. The following four metrics are used simultaneously to evaluate debiasing performance: MSE, AUC, NDCG@5, and NDCG@10. All experiments are implemented in PyTorch with Adam as the optimizer. We tune the learning rate in {0.005, 0.01, 0.05, 0.1}, the weight decay in {1e-6, 5e-6, ..., 5e-3, 1e-2}, the constraint parameter η in {50, 100, 150, 200} for Coat and {500, 1000, 1500, 2000} for Yahoo! R3, and the batch size in {128, 256, 512, 1024, 2048} for Coat and {1024, 2048, 4096, 8192, 16384} for Yahoo! R3. In addition, for the Laplace smoothing parameter in the Naive Bayes model, the initial value is set to 0 and the learning rate is tuned in {5, 10, 15, 20} for Coat and {50, 100, 150, 200} for Yahoo! R3.

5.2. PERFORMANCE COMPARISON (RQ1)

Table 1 summarizes the performance of the proposed Stable-DR and Stable-MRDR methods compared with previous methods. First, the causally-inspired methods perform better than the base model, verifying the necessity of handling the selection bias in rating prediction. Among the previous methods, SNIPS, CVIB, and DR demonstrate competitive performance. Second, the proposed Stable-DR and Stable-MRDR achieve the best performance on all four metrics. On the one hand, our methods outperform SNIPS, which can be attributed to the inclusion of the propensity model in the training process, as well as the boundedness and double robustness of SDR. On the other hand, our methods outperform DR-JL and MRDR-JL, which can be attributed to the stabilization constraint introduced in the training of the propensity model. This further demonstrates the benefit of cycle learning, in which the propensity model acts as a mediator between the imputation and prediction models during training, rather than the prediction model being updated from the imputation model directly.

5.3. ABLATION AND PARAMETER SENSITIVITY STUDY (RQ2, RQ3)

Figure 2 shows the debiasing performance under different stabilization constraint strengths and propensity models. First, the proposed Stable-DR and Stable-MRDR outperform DR-JL and MRDR-JL when either Naive Bayes with Laplace smoothing or logistic regression is used as the propensity model, indicating that our methods have better debiasing ability in both the feature-based and collaborative filtering scenarios. Second, when the strength of the stabilization constraint is zero, our methods perform similarly to SNIPS and slightly worse than DR-JL and MRDR-JL, which indicates that simply using the cross-entropy loss to update the propensity model is not effective in improving model performance. However, as the strength of the stabilization constraint increases, Stable-DR and Stable-MRDR with cycle learning show stable and significant improvements over DR-JL and MRDR-JL. Our methods achieve optimal performance at an appropriate constraint strength, which can be interpreted as simultaneously accounting for accuracy and stability to ensure the boundedness and double robustness of SDR.

6. CONCLUSION

In this paper, we propose an SDR estimator for data MNAR that maintains double robustness and improves the stability of DR in the following three aspects: first, we show that SDR has a weaker extrapolation dependence than DR and can result in more stable and accurate predictions in the presence of MNAR effects. Next, through theoretical analysis, we show that the proposed SDR has bounded bias, variance, and generalization error bounds under inaccurate imputed errors and arbitrarily small estimated propensities, while DR does not. Finally, we propose a novel learning approach for SDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approach significantly outperforms the existing methods in terms of both convergence and prediction accuracy.

APPENDIX

Throughout, following existing studies (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Dai et al., 2022) , we assume that the indicator matrix O contains independent random variables and each o u,i follows a Bernoulli distribution with probability p u,i .

A PROOF OF THEOREMS

A.1 PROOF OF THEOREM 1

Proof of Theorem 1. To demonstrate the double robustness of SDR, first note that $\mathbb{P}\big( \lim_{|\mathcal{D}| \to \infty} \mathcal{E}_{SDR} = \mathcal{L}_{ideal} \big) = 1$ if the learned propensities are accurate (Swaminathan & Joachims, 2015), since $|\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} o_{u,i} / p_{u,i}$ converges to 1 almost surely as $|\mathcal{D}|$ goes to infinity and IPS is unbiased. Besides, constraint (1) is constructed to ensure the unbiasedness of $\mathcal{E}_{SDR}$ if the error imputation model is correctly specified. In fact, $\mathcal{E}_{SDR}$ satisfies
$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} (e_{u,i} - \mathcal{E}_{SDR})}{\hat{p}_{u,i}} = 0. \tag{4}$$
Combining constraint (1) and equation (4) gives
$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} + \frac{o_{u,i} (\hat{E} - \mathcal{E}_{SDR})}{\hat{p}_{u,i}} \right] = 0,$$
where the first term equals 0 when the imputation model is correctly specified. This implies that $\mathcal{E}_{SDR} = \hat{E}$, and the unbiasedness of $\mathcal{E}_{SDR}$ then follows immediately from the unbiasedness of $\hat{E}$.

A.2 PROOF OF THEOREM 2

Proof of Theorem 2. Equation (3) implies that $\mathcal{E}_{\mathrm{SDR}}$ can be expressed as
$$\mathcal{E}_{\mathrm{SDR}} = \frac{\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} o_{u,i}(e_{u,i}-\hat{e}_{u,i}+\hat{\mathcal{E}})/\hat{p}_{u,i}}{\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} o_{u,i}/\hat{p}_{u,i}}.$$
For notational simplicity, let $w_{u,i} \triangleq o_{u,i}/\hat{p}_{u,i}$ and $v_{u,i} \triangleq o_{u,i}(e_{u,i}-\hat{e}_{u,i}+\hat{\mathcal{E}})/\hat{p}_{u,i}$; then $\mathcal{E}_{\mathrm{SDR}}$ can be written as the ratio statistic
$$\mathcal{E}_{\mathrm{SDR}} = \frac{\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} v_{u,i}}{\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} w_{u,i}} \triangleq f(\bar{v}, \bar{w}),$$
where $f(\bar{v}, \bar{w}) = \bar{v}/\bar{w}$, $\bar{v} = |\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}} v_{u,i}$, and $\bar{w} = |\mathcal{D}|^{-1}\sum_{(u,i)\in\mathcal{D}} w_{u,i}$. Applying a Taylor expansion around $(\mu_v, \mu_w) \triangleq (\mathbb{E}[\bar{v}], \mathbb{E}[\bar{w}])$ yields
$$f(\bar{v},\bar{w}) = f(\mu_v,\mu_w) + f'_v(\mu_v,\mu_w)(\bar{v}-\mu_v) + f'_w(\mu_v,\mu_w)(\bar{w}-\mu_w) + \frac{1}{2}\Big[f''_{vv}(\mu_v,\mu_w)(\bar{v}-\mu_v)^2 + 2f''_{vw}(\mu_v,\mu_w)(\bar{v}-\mu_v)(\bar{w}-\mu_w) + f''_{ww}(\mu_v,\mu_w)(\bar{w}-\mu_w)^2\Big] + R(\tilde{v},\tilde{w}),$$
where $R(\tilde{v},\tilde{w})$ is the remainder term. Note that $f''_{vv}(\mu_v,\mu_w) = 0$, $f''_{vw}(\mu_v,\mu_w) = -1/\mu_w^2$, and $f''_{ww}(\mu_v,\mu_w) = 2\mu_v/\mu_w^3$; taking expectations on both sides of the Taylor expansion leads to
$$\mathbb{E}(\bar{v}/\bar{w}) = \frac{\mu_v}{\mu_w} - \frac{\mathrm{Cov}(\bar{v},\bar{w})}{\mu_w^2} + \frac{\mathrm{Var}(\bar{w})\,\mu_v}{\mu_w^3} + \mathbb{E}[R(\tilde{v},\tilde{w})].$$

A.4 PROOF OF THEOREM 4

Proof of Theorem 4. McDiarmid's inequality states that for independent bounded random variables $X_1, X_2, \dots, X_n$, where $X_i \in \mathcal{X}_i$ for all $i$, and a mapping $f: \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n \to \mathbb{R}$, if there exist constants $c_1, c_2, \dots, c_n$ such that for all $i$,
$$\sup_{x_1,\dots,x_{i-1},x_i,x'_i,x_{i+1},\dots,x_n} \big|f(x_1,\dots,x_{i-1},x_i,x_{i+1},\dots,x_n) - f(x_1,\dots,x_{i-1},x'_i,x_{i+1},\dots,x_n)\big| \le c_i,$$
then, for any $\epsilon > 0$,
$$P\big(|f(X_1,\dots,X_n) - \mathbb{E}[f(X_1,\dots,X_n)]| \ge \epsilon\big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\Big).$$
In fact, equation (5) implies that the SDR estimator can be written as
$$\mathcal{E}_{\mathrm{SDR}} = \frac{\sum_{(u,i)\in\mathcal{D}} o_{u,i}(e_{u,i}-\hat{e}_{u,i})/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} o_{u,i}/\hat{p}_{u,i}} + \hat{\mathcal{E}},$$
denoted as $f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I})$. Note that
$$\sup_{o_{u,i},o'_{u,i}} \big|f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I}) - f(o_{1,1},\dots,o'_{u,i},\dots,o_{U,I})\big| \le
\begin{cases}
\delta_{\max} - \dfrac{\delta_{u,i}/\hat{p}_{u,i} + \delta_{\max}\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i}}{1/\hat{p}_{u,i} + \sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i}}, & \text{if } \delta_{u,i} \le (\delta_{\min}+\delta_{\max})/2, \\[2mm]
\dfrac{\delta_{\min}\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i} + \delta_{u,i}/\hat{p}_{u,i}}{\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i} + 1/\hat{p}_{u,i}} - \delta_{\min}, & \text{if } \delta_{u,i} > (\delta_{\min}+\delta_{\max})/2,
\end{cases}$$
where $\mathcal{D}\setminus(u,i)$ is the set $\mathcal{D}$ excluding the element $(u,i)$. Next, we analyze $\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i}$. Hoeffding's inequality states that for independent bounded random variables $X_1,\dots,X_n$ taking values in intervals of sizes $\rho_1,\dots,\rho_n$ with probability 1, and for any $\epsilon > 0$,
$$P\Big(\Big|\sum_k X_k - \mathbb{E}\Big(\sum_k X_k\Big)\Big| \ge \epsilon\Big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_k \rho_k^2}\Big).$$
For $\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i}$, we have
$$P\Big(\Big|\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i} - \sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}\Big| \ge \epsilon\Big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_{\mathcal{D}\setminus(u,i)} 1/\hat{p}_{u,i}^2}\Big).$$
Setting the last term equal to $\eta/2$ and solving for $\epsilon$ gives that, with probability at least $1-\eta/2$, the following inequality holds:
$$\Big|\sum_{\mathcal{D}\setminus(u,i)} o_{u,i}/\hat{p}_{u,i} - \sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}\Big| \le \sqrt{\frac{1}{2}\log\frac{4}{\eta}\sum_{\mathcal{D}\setminus(u,i)} \frac{1}{\hat{p}_{u,i}^2}} \triangleq \epsilon'.$$
Therefore, combining (7) and (8) yields that, with probability at least $1-\eta/2$,
$$\sup_{o_{u,i},o'_{u,i}} \big|f(o_{1,1},\dots,o_{u,i},\dots,o_{U,I}) - f(o_{1,1},\dots,o'_{u,i},\dots,o_{U,I})\big| \le
\begin{cases}
\delta_{\max} - \dfrac{\delta_{u,i}/\hat{p}_{u,i} + \big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big)\delta_{\max}}{1/\hat{p}_{u,i} + \big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big)}, & \text{if } \delta_{u,i} \le (\delta_{\min}+\delta_{\max})/2, \\[2mm]
\dfrac{\big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big)\delta_{\min} + \delta_{u,i}/\hat{p}_{u,i}}{\big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big) + 1/\hat{p}_{u,i}} - \delta_{\min}, & \text{if } \delta_{u,i} > (\delta_{\min}+\delta_{\max})/2,
\end{cases}$$
$$\le
\begin{cases}
(\delta_{\max}-\delta_{u,i})\big/\big\{1+\hat{p}_{u,i}\big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big)\big\}, & \text{if } \delta_{u,i} \le (\delta_{\min}+\delta_{\max})/2, \\
(\delta_{u,i}-\delta_{\min})\big/\big\{1+\hat{p}_{u,i}\big(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon'\big)\big\}, & \text{if } \delta_{u,i} > (\delta_{\min}+\delta_{\max})/2,
\end{cases}$$
where $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$ is the error deviation, $\delta_{\min} = \min_{(u,i)\in\mathcal{D}} \delta_{u,i}$, and $\delta_{\max} = \max_{(u,i)\in\mathcal{D}} \delta_{u,i}$.
Invoking McDiarmid's inequality leads to
$$P\big(|\mathcal{E}_{\mathrm{SDR}} - \mathbb{E}_O(\mathcal{E}_{\mathrm{SDR}})| \ge \epsilon\big) \le 2\exp\Bigg(\frac{-2\epsilon^2}{\sum\limits_{(u,i):\,\delta_{u,i}\le\frac{\delta_{\min}+\delta_{\max}}{2}} \frac{(\delta_{\max}-\delta_{u,i})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2} + \sum\limits_{(u,i):\,\delta_{u,i}>\frac{\delta_{\min}+\delta_{\max}}{2}} \frac{(\delta_{u,i}-\delta_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}}\Bigg)$$
$$\le 2\exp\Bigg(\frac{-2\epsilon^2}{\sum_{(u,i)} \{(\delta_{\max}-\delta_{u,i})^2+(\delta_{u,i}-\delta_{\min})^2\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}\Bigg).$$
Setting the last term equal to $\eta/2$ and solving for $\epsilon$ completes the proof.
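The Hoeffding step in the proof can be verified empirically. The following sketch (illustrative, with $\hat{p}_{u,i} = p_{u,i}$ for simplicity) draws many realizations of the observation indicators and checks how often the deviation of $\sum o_{u,i}/\hat{p}_{u,i}$ from its mean stays within the bound $\epsilon'$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, eta = 5_000, 2_000, 0.05
p = rng.uniform(0.05, 0.95, n)    # propensities; here p_hat = p for simplicity

# Hoeffding bound used in the proof: with probability >= 1 - eta/2,
# |sum o/p_hat - sum p/p_hat| <= sqrt(log(4/eta)/2 * sum 1/p_hat^2) = eps
eps = np.sqrt(np.log(4 / eta) / 2 * np.sum(1.0 / p**2))

devs = np.empty(trials)
for t in range(trials):
    o = rng.binomial(1, p)
    devs[t] = abs(np.sum(o / p) - n)  # sum p/p_hat = n when p_hat = p
coverage = float(np.mean(devs <= eps))
print(coverage)
```

The empirical coverage is well above $1-\eta/2$, reflecting that Hoeffding's inequality is conservative.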

A.5 GENERALIZATION BOUND UNDER INACCURATE MODELS

Theorem 5 (Generalization Bound under Inaccurate Models). For any finite hypothesis space of predictions $\mathcal{H} = \{\hat{Y}_1, \dots, \hat{Y}_{|\mathcal{H}|}\}$, with probability $1-\eta$, the true risk $\mathcal{R}(\hat{Y}^{\dagger})$ deviates from the SDR estimator, with imputed errors $\hat{e}_{u,i}$ and learned propensities $\hat{p}_{u,i}$ satisfying stabilization constraint (1), by at most
$$\mathcal{R}(\hat{Y}^{\dagger}) \le \hat{\mathcal{E}}_{\mathrm{SDR}}(\hat{Y}^{\dagger}) + \underbrace{\Bigg|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} \delta^{\dagger}_{u,i} - \frac{\sum_{(u,i)\in\mathcal{D}} \delta^{\dagger}_{u,i}\, p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}}\Bigg|}_{\text{Bias Term}} + \underbrace{\sqrt{\frac{1}{2}\log\frac{4|\mathcal{H}|}{\eta} \sum_{(u,i)\in\mathcal{D}} \frac{(\delta_{\max}-\delta^{\S}_{u,i})^2 + (\delta^{\S}_{u,i}-\delta_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}}}_{\text{Variance Term}},$$
where $\delta^{\S}_{u,i}$ is the error deviation corresponding to the prediction model
$$\hat{Y}^{\S} = \arg\max_{\hat{Y}_h \in \mathcal{H}} \sum_{(u,i)\in\mathcal{D}} \frac{(\delta_{\max}-\delta^{h}_{u,i})^2 + (\delta^{h}_{u,i}-\delta_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}.$$

Proof of Theorem 5. Theorem 4 shows that for all predictions $\hat{Y}_h \in \mathcal{H}$,
$$P\big(|\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h)]| \ge \epsilon\big) \le 2\exp\Bigg(\frac{-2\epsilon^2}{\sum_{(u,i)\in\mathcal{D}} \{(\delta_{\max}-\delta^{h}_{u,i})^2 + (\delta^{h}_{u,i}-\delta_{\min})^2\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}\Bigg).$$
Taking a union bound over $\mathcal{H}$ and bounding each exponent by the one attained at $\hat{Y}^{\S}$ gives
$$P\big(|\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger}) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger})]| \ge \epsilon\big) \le 2|\mathcal{H}|\exp\Bigg(\frac{-2\epsilon^2}{\sum_{(u,i)\in\mathcal{D}} \{(\delta_{\max}-\delta^{\S}_{u,i})^2 + (\delta^{\S}_{u,i}-\delta_{\min})^2\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}\Bigg) < \eta.$$
Solving the last inequality for $\epsilon$, it is concluded that, with probability $1-\eta$, the following inequality holds:
$$\mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger})] - \mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger}) \le \sqrt{\frac{1}{2}\log\frac{4|\mathcal{H}|}{\eta} \sum_{(u,i)\in\mathcal{D}} \frac{(\delta_{\max}-\delta^{\S}_{u,i})^2 + (\delta^{\S}_{u,i}-\delta_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}}.$$
Theorem 2 shows that for the optimal prediction model $\hat{Y}^{\dagger}$, the following inequality holds:
$$\mathcal{R}(\hat{Y}^{\dagger}) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger})] \le \Bigg|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} \delta^{\dagger}_{u,i} - \frac{\sum_{(u,i)\in\mathcal{D}} \delta^{\dagger}_{u,i}\, p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}}\Bigg|.$$
The stated result is obtained by adding the two inequalities above.
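The two terms of the bound are directly computable given the error deviations and propensities. Below is an illustrative helper (the function name and argument conventions are ours, not the paper's) that evaluates the bias and variance terms of the Theorem 5 bound for given arrays.

```python
import numpy as np

def sdr_bound_terms(delta, p_true, p_hat, eta=0.05, H_size=1):
    """Bias and variance terms of the Theorem 5 bound (illustrative names)."""
    # Bias term: |mean deviation - propensity-weighted mean deviation|
    bias = abs(delta.mean()
               - np.sum(delta * p_true / p_hat) / np.sum(p_true / p_hat))
    d_min, d_max = delta.min(), delta.max()
    # Leave-one-out sums over D \ (u,i) and the eps' of Theorem 4, vectorized
    s = np.sum(p_true / p_hat) - p_true / p_hat
    eps_p = np.sqrt(np.log(4 / eta) / 2
                    * (np.sum(1 / p_hat**2) - 1 / p_hat**2))
    denom = (1 + p_hat * (s - eps_p))**2
    var = np.sqrt(0.5 * np.log(4 * H_size / eta)
                  * np.sum(((d_max - delta)**2 + (delta - d_min)**2) / denom))
    return bias, var
```

Note that the bias term vanishes exactly when the learned propensities equal the true ones, matching the discussion of doubly robust behavior.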

B FURTHER THEORETICAL ANALYSIS OF SDR

Without loss of generality, assume
$$\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} \frac{o_{u,i}}{\hat{p}_{u,i}}(\hat{e}_{u,i} - \hat{\mathcal{E}}) = \lambda, \quad \lambda \ne 0.$$
In this case, the learned propensities must be inaccurate; otherwise, constraint (1) would hold naturally as the sample size increases. Thus, if the imputed errors are accurate, then $\mathcal{L}_{ideal} = \hat{\mathcal{E}}$. By exactly the same argument as for equation (3), the tail bound of SDR is
$$|\mathcal{E}_{\mathrm{SDR}} - \mathbb{E}_O(\mathcal{E}_{\mathrm{SDR}})| \le \sqrt{\frac{1}{2}\log\frac{4}{\eta}\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e_{u,i})^2+(e_{u,i}-e_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}},$$
where $e_{\min} = \min_{(u,i)\in\mathcal{D}} e_{u,i}$, $e_{\max} = \max_{(u,i)\in\mathcal{D}} e_{u,i}$, $\epsilon' = \sqrt{\log(4/\eta)/2 \cdot \sum_{\mathcal{D}\setminus(u,i)} 1/\hat{p}_{u,i}^2}$, and $\mathcal{D}\setminus(u,i)$ is the set $\mathcal{D}$ excluding the element $(u,i)$. In addition, we can derive the generalization error bound of SDR. Given a finite hypothesis space $\mathcal{H}$ of the prediction model, for any prediction model $h \in \mathcal{H}$, with probability $1-\eta$, the true risk $\mathcal{R}(h)$ deviates from the SDR estimator by at most
$$\mathcal{R}(h) \le \hat{\mathcal{E}}_{\mathrm{SDR}}(h) + \mathrm{Bias}(\mathcal{E}_{\mathrm{SDR}}) + \sqrt{\frac{1}{2}\log\frac{4|\mathcal{H}|}{\eta}\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e^{\S}_{u,i})^2+(e^{\S}_{u,i}-e_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}},$$
where $e^{\S}_{u,i}$ is the error corresponding to the prediction model
$$h^{\S} = \arg\max_{h\in\mathcal{H}}\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e^{h}_{u,i})^2+(e^{h}_{u,i}-e_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}.$$



Footnotes:
Asymptotically unbiased means unbiased as the sample size goes to infinity; this is why we adopt the term "asymptotically unbiased".
https://www.cs.cornell.edu/~schnabts/mnar/
http://webscope.sandbox.yahoo.com/
For all experiments, we use an NVIDIA GeForce RTX 3090 as the computing resource.



Figure 1: When training the prediction model, two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) uses a fixed imputation/propensity model (left), whereas DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) alternate between updating the imputation/propensity models and the prediction model (middle). The proposed learning approach updates the three models cyclically with stabilization (right).

Algorithm 1: The Proposed Stable-DR (MRDR) Cycle Learning, Stable-DR (MRDR)
Input: observed ratings Y^o, and η, λ_e, λ_stable, λ_sdr ≥ 0
while stopping criteria are not satisfied do
    for number of steps for training the imputation model do
        Sample a batch of user-item pairs {(u_j, i_j)}_{j=1}^{J} from O;
        Update β by descending along the gradient ∇_β L_e(φ, α, β);
    end
    for number of steps for training the propensity model do
        Sample a batch of user-item pairs {(u_k, i_k)}_{k=1}^{K} from D;
        Calculate the gradient of the propensity cross-entropy loss ∇_α L_ce(φ, α, β);
        Calculate the gradient of the propensity stabilization constraint (1) ∇_α L_stable(φ, α, β);
        Update α by descending along the gradient ∇_α L_ce(φ, α, β) + η · ∇_α L_stable(φ, α, β);
    end
    for number of steps for training the prediction model do
        Sample a batch of user-item pairs {(u_l, i_l)}_{l=1}^{L} from O;
        Update φ by descending along the gradient ∇_φ L_sdr(φ, α, β);
    end
end
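The cyclic update can be sketched end to end on toy data. The following is a heavily simplified, self-contained illustration, not the paper's implementation: linear models stand in for the imputation, propensity, and prediction networks, full-batch finite-difference gradients replace mini-batch backpropagation, and the losses L_e, L_ce, L_stable, and L_sdr are simplified stand-ins for the losses named in the algorithm.

```python
import numpy as np

# Toy data: linear ratings, MNAR observation depending on the first feature
rng = np.random.default_rng(0)
n, d = 2000, 3
x = rng.normal(size=(n, d))
y = x @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=n)
sig = lambda z: 1 / (1 + np.exp(-z))
p_true = sig(x[:, 0])                  # true propensities
o = rng.binomial(1, p_true)            # observation indicators o_{u,i}

# Three tiny linear "models": prediction (phi), propensity (alpha), imputation (beta)
def L_e(phi, alpha, beta):             # imputation loss on observed pairs O
    e = (y - x @ phi)**2
    return np.sum(o * (x @ beta - e)**2 / sig(x @ alpha)) / n

def L_ce(phi, alpha, beta):            # propensity cross-entropy on all pairs D
    q = sig(x @ alpha)
    return -np.mean(o * np.log(q) + (1 - o) * np.log(1 - q))

def L_stable(phi, alpha, beta):        # squared violation of constraint (1)
    e_hat = x @ beta
    return np.mean(o * (e_hat - e_hat.mean()) / sig(x @ alpha))**2

def L_sdr(phi, alpha, beta):           # stabilized DR loss for the prediction model
    e, e_hat, q = (y - x @ phi)**2, x @ beta, sig(x @ alpha)
    return e_hat.mean() + np.sum(o * (e - e_hat) / q) / np.sum(o / q)

def grad(loss, which, params, h=1e-5): # finite differences, for brevity only
    g = np.zeros(d)
    for j in range(d):
        up = [p.copy() for p in params]
        dn = [p.copy() for p in params]
        up[which][j] += h
        dn[which][j] -= h
        g[j] = (loss(*up) - loss(*dn)) / (2 * h)
    return g

phi, alpha, beta = np.zeros(d), np.zeros(d), np.zeros(d)
eta_s, lr = 10.0, 0.1
for _ in range(200):  # cycle: imputation -> propensity (with stabilization) -> prediction
    params = [phi, alpha, beta]
    beta -= lr * grad(L_e, 2, params)
    alpha -= lr * (grad(L_ce, 1, params) + eta_s * grad(L_stable, 1, params))
    phi -= lr * grad(L_sdr, 0, params)
```

Under this toy setup the prediction model recovers the rating function despite MNAR observation; swapping the linear models for MF or NCF and the finite differences for backpropagation recovers the algorithm's intended setting.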

Figure 2: MSE, AUC, and Increasing Ratio (IR) of Stable-DR and Stable-MRDR compared with two baseline algorithms, DR-JL and MRDR-JL, under two different propensity model settings: Naive Bayes with Laplace smoothing (top) and logistic regression (bottom).

Performance on Coat and Yahoo!R3, using MF, SLIM, and NCF as the base models.

McDiarmid's inequality and the union bound ensure the following uniform convergence result:
$$P\big(|\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger}) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}^{\dagger})]| \le \epsilon\big) \ge 1-\eta \impliedby P\big(\max_{\hat{Y}_h\in\mathcal{H}} |\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h)]| \le \epsilon\big) \ge 1-\eta,$$
which holds provided that
$$\sum_{\hat{Y}_h\in\mathcal{H}} P\big(|\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h) - \mathbb{E}[\mathcal{E}_{\mathrm{SDR}}(\hat{Y}_h)]| \ge \epsilon\big) \le 2|\mathcal{H}|\exp\Bigg(\frac{-2\epsilon^2}{\sum_{(u,i)\in\mathcal{D}} \{(\delta_{\max}-\delta^{\S}_{u,i})^2 + (\delta^{\S}_{u,i}-\delta_{\min})^2\}\big/\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i} - \epsilon')\}^2}\Bigg) < \eta.$$

This means that the degree of violation of constraint (1) determines the size of the bias of SDR. Furthermore, we can compute the bias, variance, tail bound, and generalization error bound of SDR. Specifically, if both the learned propensities and imputed errors are inaccurate, constraint (3) does not hold either, and the bias of SDR is nonzero. The variance of SDR becomes
$$\mathrm{Var}(\mathcal{E}_{\mathrm{SDR}}) = \frac{\sum_{(u,i)} p_{u,i}(1-p_{u,i})\,\tilde{h}_{u,i}^2/\hat{p}_{u,i}^2}{\big(\sum_{(u,i)} p_{u,i}/\hat{p}_{u,i}\big)^2} + O(|\mathcal{D}|^{-2}),$$
where
$$\tilde{h}_{u,i} = e_{u,i} - \frac{\sum_{(u,i)\in\mathcal{D}} p_{u,i} e_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}}.$$
The tail bound of SDR is given as
$$|\mathcal{E}_{\mathrm{SDR}} - \mathbb{E}_O(\mathcal{E}_{\mathrm{SDR}})| \le \sqrt{\frac{1}{2}\log\frac{4}{\eta}\sum_{(u,i)\in\mathcal{D}}\frac{(e_{\max}-e_{u,i})^2+(e_{u,i}-e_{\min})^2}{\{1+\hat{p}_{u,i}(\sum_{\mathcal{D}\setminus(u,i)} p_{u,i}/\hat{p}_{u,i}-\epsilon')\}^2}}.$$
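The delta-method variance formula for the self-normalized estimator can be checked by simulation. Below is an illustrative sketch (all names are ours): taking $\hat{e}_{u,i} \equiv 0$ so that the SDR estimate reduces to its self-normalized IPS part and $\tilde{h}_{u,i}$ involves $e_{u,i}$ directly, the formula is compared against the Monte Carlo variance over fresh draws of the observation indicators.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 8000
e = rng.uniform(0, 2, n)                       # errors e_{u,i}
p = rng.uniform(0.1, 0.9, n)                   # true propensities
p_hat = np.clip(p * rng.uniform(0.7, 1.3, n), 0.05, 0.95)  # inaccurate propensities

# Delta-method variance from the formula (e_hat = 0, so h uses e directly)
h = e - np.sum(p * e / p_hat) / np.sum(p / p_hat)
var_formula = np.sum(p * (1 - p) * h**2 / p_hat**2) / np.sum(p / p_hat)**2

# Monte Carlo variance of the self-normalized estimate over fresh draws of o
O = rng.binomial(1, p, size=(trials, n))
est = np.sum(O * e / p_hat, axis=1) / np.sum(O / p_hat, axis=1)
var_mc = float(est.var())
print(var_formula, var_mc)
```

The two values agree up to Monte Carlo noise and the $O(|\mathcal{D}|^{-2})$ remainder.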

ACKNOWLEDGMENTS

The work was supported by the National Key R&D Program of China under Grant No. 2019YFB1705601.

ETHICS STATEMENT

This work is mostly theoretical, and the experiments are based on synthetic and public datasets. We believe this work does not present any foreseeable negative societal impact.

REPRODUCIBILITY STATEMENT

Code is provided in Supplementary Materials to reproduce the experimental results.

By some calculations, using $\mathbb{E}[v_{u,i}] = p_{u,i}(e_{u,i}-\hat{e}_{u,i}+\hat{\mathcal{E}})/\hat{p}_{u,i}$ and $\mathbb{E}[w_{u,i}] = p_{u,i}/\hat{p}_{u,i}$, we have
$$\frac{\mu_v}{\mu_w} = \frac{\sum_{(u,i)\in\mathcal{D}} p_{u,i}(e_{u,i}-\hat{e}_{u,i})/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}} + \hat{\mathcal{E}},$$
and the second-order terms of the expansion are of order $O(|\mathcal{D}|^{-1})$. Thus, the bias of $\mathcal{E}_{\mathrm{SDR}}$ is given as
$$\mathrm{Bias}(\mathcal{E}_{\mathrm{SDR}}) = \Bigg|\frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} \delta_{u,i} - \frac{\sum_{(u,i)\in\mathcal{D}} \delta_{u,i}\, p_{u,i}/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}}\Bigg| + O(|\mathcal{D}|^{-1}),$$
where $\delta_{u,i} = e_{u,i} - \hat{e}_{u,i}$ is the error deviation.

A.3 PROOF OF THEOREM 3

Proof of Theorem 3. According to the proof of Theorem 2, we have $\mathcal{E}_{\mathrm{SDR}} = f(\bar{v}, \bar{w})$ with $f(\bar{v}, \bar{w}) = \bar{v}/\bar{w}$. Applying the delta method to $f$ around $(\mu_v, \mu_w)$, the variance of $\mathcal{E}_{\mathrm{SDR}}$ can be decomposed as
$$\mathrm{Var}(\mathcal{E}_{\mathrm{SDR}}) = \frac{\sum_{(u,i)\in\mathcal{D}} p_{u,i}(1-p_{u,i})\, h_{u,i}^2/\hat{p}_{u,i}^2}{\big(\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}\big)^2} + O(|\mathcal{D}|^{-2}),$$
where
$$h_{u,i} = (e_{u,i}-\hat{e}_{u,i}) - \frac{\sum_{(u,i)\in\mathcal{D}} p_{u,i}(e_{u,i}-\hat{e}_{u,i})/\hat{p}_{u,i}}{\sum_{(u,i)\in\mathcal{D}} p_{u,i}/\hat{p}_{u,i}}$$
is the bounded difference between $e_{u,i}-\hat{e}_{u,i}$ and its weighted average. The conclusion that the SDR variance is bounded for arbitrary propensities follows directly from the self-normalized form of SDR, i.e., the bounded range of SDR (up to the constant $\hat{\mathcal{E}}$) is $[\delta_{\min}, \delta_{\max}]$.
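The boundedness conclusion can be illustrated numerically: with extremely small propensities, individual DR terms $o_{u,i}/\hat{p}_{u,i}$ explode, while the self-normalized SDR estimate always stays within the range of the observed errors. A toy sketch (names illustrative, with $\hat{e}_{u,i} \equiv 0$ so that $\delta_{u,i} = e_{u,i}$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
e = rng.uniform(0, 2, n)              # errors, so the ideal loss lies in [0, 2]
p_hat = rng.uniform(1e-6, 0.5, n)     # learned propensities, some extremely small
o = rng.binomial(1, p_hat)            # observations (here p_hat is also the true p)

e_hat = np.zeros(n)                   # crude imputation e_hat = 0, so E_hat = 0
dr = np.mean(e_hat + o * (e - e_hat) / p_hat)   # vanilla DR: terms up to 1/p_hat
sdr = e_hat.mean() + np.sum(o * (e - e_hat) / p_hat) / np.sum(o / p_hat)
print(dr, sdr)
```

The self-normalized form keeps the SDR estimate within $[e_{\min}, e_{\max}]$ regardless of how small the propensities are, whereas individual DR terms can be as large as $10^6$ here, producing unstable estimates.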

