TDR-CL: TARGETED DOUBLY ROBUST COLLABORATIVE LEARNING FOR DEBIASED RECOMMENDATIONS

Abstract

Bias is a common problem inherent in recommender systems, which is entangled with users' preferences and poses a great challenge to unbiased learning. For debiasing tasks, the doubly robust (DR) method and its variants show superior performance due to the double robustness property, that is, DR is unbiased when either imputed errors or learned propensities are accurate. However, our theoretical analysis reveals that DR usually has a large variance. Meanwhile, DR would suffer unexpectedly large bias and poor generalization caused by inaccurate imputed errors and learned propensities, which usually occur in practice. In this paper, we propose a principled approach that can effectively reduce the bias and variance simultaneously for existing DR approaches when the error imputation model is misspecified. In addition, we further propose a novel semi-parametric collaborative learning approach that decomposes imputed errors into parametric and nonparametric parts and updates them collaboratively, resulting in more accurate predictions. Both theoretical analysis and experiments demonstrate the superiority of the proposed methods compared with existing debiasing methods.

1. INTRODUCTION

Addressing various tasks in recommender systems (RSs) with causality-based methods has become increasingly popular (Wu et al., 2022b). Causality-based recommendation has shown great potential in both numerical experiments and theoretical analyses across an extensive literature (Chen et al., 2020; Wang et al., 2019). The basic question in RS is "what would the feedback be if an item were recommended to a user", which requires estimating the causal effect of a recommendation on user feedback. To answer this question, many methods have been proposed, such as inverse propensity score (IPS) (Schnabel et al., 2016), self-normalized inverse propensity score (SNIPS) (Swaminathan & Joachims, 2015), error imputation based (EIB) methods (Steck, 2010), and doubly robust (DR) methods (Chen et al., 2021; Wang et al., 2019; 2021; Dai et al., 2022; Ding et al., 2022). Among them, the DR method and its variants show superior performance. We compare and evaluate these methods in terms of three desired properties: double robustness (Hernán & Robins, 2020; Wu et al., 2022c), robustness to small propensities (Rosenbaum, 2020), and low variance (Tan, 2007). Failing to meet any of them may lead to sub-optimal performance (Molenberghs et al., 2015; van der Laan & Rose, 2011). Our theoretical analysis shows that DR has much greater variance and is less robust to small propensities than EIB (Kang & Schafer, 2007), even when the imputed errors and the learned propensities are accurate. Meanwhile, DR suffers unexpectedly large bias and poor generalization when the imputed errors and learned propensities are inaccurate, which is usually the case in practice. In this paper, we first propose a novel targeted doubly robust (TDR) method that captures the merits of both DR and EIB by leveraging the targeted learning technique (van der Laan & Rose, 2011; 2018).
TDR can effectively reduce the bias and variance simultaneously for existing DR approaches when the imputed errors are less accurate. Remarkably, TDR provides a model-

$\mathcal{L}_{ideal} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} e_{u,i}$, where $e_{u,i}$ is the prediction error, e.g., the squared loss $e_{u,i} = (r_{u,i}(1) - f_\theta(x_{u,i}))^2$. However, since $r_{u,i}(1)$ is observed only when $o_{u,i} = 1$, the ideal loss is non-computable. Restricting the analysis to non-missing data yields biased conclusions, as the observed data may form an unrepresentative sample of the target population. Different debiasing methods are designed to approximate and substitute the ideal loss. For example, the IPS and EIB estimators are given as

$$\mathcal{L}_{IPS} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} e_{u,i}}{\hat p_{u,i}}, \qquad \mathcal{L}_{EIB} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \left[ o_{u,i} e_{u,i} + (1 - o_{u,i}) \hat e_{u,i} \right],$$

where $\hat p_{u,i}$ is an estimate of the propensity score $p_{u,i} := P(o_{u,i} = 1 \mid x_{u,i})$, and $\hat e_{u,i}$ is an estimate of the prediction error $g_{u,i} := E[e_{u,i} \mid x_{u,i}]$, i.e., it fits $e_{u,i}$ using $x_{u,i}$. The DR estimator is formulated as

$$\mathcal{L}_{DR} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \left[ \hat e_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat e_{u,i})}{\hat p_{u,i}} \right],$$

which enjoys the doubly robust property, i.e., it is an unbiased estimator of the ideal loss when either the imputed errors or the learned propensities are accurate.
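As a minimal sketch of the estimators defined above, the following NumPy functions compute the ideal, IPS, EIB and DR losses on a toy example. The array values and the function names are hypothetical and chosen only for illustration; in practice $e_{u,i}$ is unknown wherever $o_{u,i} = 0$, which is harmless here because every estimator multiplies $e_{u,i}$ by $o_{u,i}$.

```python
import numpy as np

def ideal_loss(e):
    """Average prediction error over all pairs (requires every e_{u,i})."""
    return e.mean()

def ips_loss(e, o, p_hat):
    """Inverse propensity score estimator: mean of o * e / p_hat."""
    return np.mean(o * e / p_hat)

def eib_loss(e, o, e_hat):
    """Error imputation based estimator: observed errors plus imputed ones."""
    return np.mean(o * e + (1 - o) * e_hat)

def dr_loss(e, o, p_hat, e_hat):
    """Doubly robust estimator: imputed error plus weighted residual."""
    return np.mean(e_hat + o * (e - e_hat) / p_hat)

# Hypothetical toy data: 4 user-item pairs, 2 of them observed.
e = np.array([1.0, 4.0, 0.25, 2.25])       # true errors (unknown where o = 0)
o = np.array([1.0, 0.0, 1.0, 0.0])         # observation indicators
p_hat = np.array([0.5, 0.25, 0.5, 0.25])   # learned propensities
e_hat = np.array([1.0, 3.0, 0.5, 2.0])     # imputed errors

print(ideal_loss(e), ips_loss(e, o, p_hat),
      eib_loss(e, o, e_hat), dr_loss(e, o, p_hat, e_hat))
```

Only the ideal loss reads the entries of `e` at unobserved positions; the other three can be evaluated from observed data, learned propensities and imputed errors alone.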

3. MOTIVATION

DR approaches have been extensively studied in RS for various debiasing tasks because of their double robustness, e.g., rating prediction (Wang et al., 2019; 2020a; Li et al., 2023b; c), learning-to-rank (LTR) (Saito, 2020; Oosterhuis, 2022), and post-click conversion rate prediction (Guo et al., 2021; Dai et al., 2022). However, DR still has several limitations that need to be resolved. We first show that DR has a large variance and is sensitive to small propensities, as stated in Proposition 1 (see Appendix A for proofs).

Proposition 1. If $\hat p_{u,i}$ and $\hat e_{u,i}$ are accurate estimates of $p_{u,i}$ and $g_{u,i}$, respectively, i.e., $\hat p_{u,i} = p_{u,i}$ and $\hat e_{u,i} = g_{u,i}$, then the IPS, EIB and DR estimators are unbiased, and their variances satisfy

$$\mathrm{Var}(\mathcal{L}_{EIB}) \le \mathrm{Var}(\mathcal{L}_{DR}) \le \mathrm{Var}(\mathcal{L}_{IPS}),$$

where the equality holds if and only if $p_{u,i} = 1$ for all $(u, i) \in \mathcal{D}$. In addition, when $p_{u,i}$ tends to 0, $\mathrm{Var}(\mathcal{L}_{IPS})$ and $\mathrm{Var}(\mathcal{L}_{DR})$ tend to infinity, and $\mathrm{Var}(\mathcal{L}_{EIB})$ tends to its minimum.

Proposition 1 shows that the EIB estimator has low variance and is robust to small propensities (Tan, 2007; Imbens & Rubin, 2015; Wu et al., 2021; 2022a). In RS, some small propensities inevitably appear due to the sparsity of the exposed data, resulting in a significant difference between $\mathrm{Var}(\mathcal{L}_{EIB})$ and $\mathrm{Var}(\mathcal{L}_{DR})$. Nevertheless, EIB usually has a large bias and is not preferred in practice. Proposition 1 motivates the development of an estimator that combines the low variance and robustness to small propensities of EIB with the double robustness of DR. In summary, DR outperforms IPS in terms of both bias and variance. When compared with EIB, if $\hat e_{u,i}$ is inaccurate but $\hat p_{u,i}$ is accurate, DR tends to have a smaller bias, but if both $\hat e_{u,i}$ and $\hat p_{u,i}$ are accurate, then EIB has a smaller variance. If $\hat e_{u,i}$ is accurate but $\hat p_{u,i}$ is inaccurate, then EIB may be superior to DR in terms of both bias and variance.
In practice, both $\hat p_{u,i}$ and $\hat e_{u,i}$ are likely to be at least mildly inaccurate, so choosing between EIB and DR involves a bias-variance trade-off. Ideally, it is desirable to develop a method that is robust to small propensities, with lower bias and variance than previous DR methods, while maintaining double robustness.
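The variance ordering in Proposition 1 can be checked numerically. The sketch below is a small Monte Carlo experiment under simplifying assumptions that are not part of the paper's setup: the features are held fixed, the propensities and conditional means are drawn once from uniform ranges, and $e_{u,i}$ is simply a noisy quantity with conditional mean $g_{u,i}$. All constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma = 500, 2000, 0.5

# Accurate propensities and imputed errors (the premise of Proposition 1).
p = rng.uniform(0.1, 0.5, size=n)   # true propensities, some fairly small
g = rng.uniform(0.5, 1.5, size=n)   # true conditional means of e

# Each row is one independent replication of the observed data.
o = (rng.random((reps, n)) < p).astype(float)
e = g + sigma * rng.standard_normal((reps, n))

ips = (o * e / p).mean(axis=1)
eib = (o * e + (1 - o) * g).mean(axis=1)
dr = (g + o * (e - g) / p).mean(axis=1)

var_ips, var_eib, var_dr = ips.var(), eib.var(), dr.var()
print("EIB:", var_eib, "DR:", var_dr, "IPS:", var_ips)
```

With these settings the three empirical variances differ by large factors, and shrinking the lower end of the propensity range widens the gap between EIB and the two propensity-weighted estimators.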

4.1. TARGETED DOUBLY ROBUST ESTIMATOR

We first bridge the explicit forms of the DR estimator and the EIB estimator by noting that

$$\mathcal{L}_{DR} = \underbrace{\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ o_{u,i} e_{u,i} + (1 - o_{u,i}) \hat e_{u,i} \right]}_{\mathcal{L}_{EIB}} + \underbrace{\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} o_{u,i} (e_{u,i} - \hat e_{u,i}) \frac{1 - \hat p_{u,i}}{\hat p_{u,i}}}_{\text{correction term}}, \quad (1)$$

where $\mathcal{L}_{DR}$ is formally equivalent to adding a correction term using the learned propensities to $\mathcal{L}_{EIB}$. The correction term plays an important role in the bias-variance trade-off for estimating the ideal loss, as shown in Proposition 1. Specifically, compared with $\mathcal{L}_{EIB}$, $\mathcal{L}_{DR}$ can reduce bias by adding the correction term. As a compromise, the correction term increases the variance of the DR estimator. Thus, if $\hat e_{u,i}$ is computed in a manner that ensures that

$$\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} o_{u,i} (e_{u,i} - \hat e_{u,i}) \frac{1 - \hat p_{u,i}}{\hat p_{u,i}} = 0, \quad (2)$$

then the EIB estimator would have small bias and the DR estimator would have small variance. For equation (2) to hold, a naive method is to impose it as a constraint when training the error imputation model. However, the constraint (2) may degrade the accuracy of the imputed errors because it restricts the hypothesis space of the error imputation model. Instead of directly estimating $\hat e_{u,i}$ subject to the constraint (2), we propose to exploit the extra information in the propensities when training the error imputation model. The basic idea of the proposed TDR estimator consists of the following two steps.

Step 1 (Initialization). Let $\hat e_{u,i}$ be the imputed error obtained by using any of the existing DR methods.

Step 2 (Targeting). Update $\hat e_{u,i}$ by fitting an extended one-parameter model

$$\tilde e_{u,i}(\eta) = \hat e_{u,i} + \eta (1/\hat p_{u,i} - 1),$$

which includes a single variable $1/\hat p_{u,i} - 1$ and the offset $\hat e_{u,i}$. The parameter $\eta$ is solved by minimizing the squared loss between $\tilde e_{u,i}(\eta)$ and $e_{u,i}$ over the exposed events. The proposed TDR estimator is then given as

$$\mathcal{L}_{TDR} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \left[ \tilde e_{u,i} + \frac{o_{u,i} (e_{u,i} - \tilde e_{u,i})}{\hat p_{u,i}} \right].$$

The targeting step enlarges the hypothesis space of $\tilde e_{u,i}$ compared to $\hat e_{u,i}$ and does not sacrifice the accuracy of the error imputation model, due to the introduction of the error correction term $1/\hat p_{u,i} - 1$ for estimating $e_{u,i}$. Theorem 1 shows the validity and preservation of TDR (see Appendix B for proofs).

Theorem 1. The imputed error $\tilde e_{u,i}$ obtained with TDR satisfies the following properties: (a) (validity) $\tilde e_{u,i}$ satisfies equation (2), which implies that TDR would have smaller bias than EIB and smaller variance than DR based on the initial imputed error $\hat e_{u,i}$. (b) (preservation) $\eta$ in the targeting step converges to 0 and renders $\tilde e_{u,i} = \hat e_{u,i}$ when $\hat e_{u,i}$ already satisfies equation (2).

From Theorem 1, TDR guarantees that equation (2) always holds, regardless of the choice of the initial imputed errors. In addition, TDR inherits the desirable properties of EIB, such as low variance and robustness to small propensities, since equations (1) and (2) imply that the TDR estimator can be regarded as an EIB estimator. Since TDR reduces the variance of DR as shown in Theorem 1, a further question is whether this variance reduction comes at the expense of an increase in bias. Remarkably, TDR does not sacrifice bias. Specifically, it can be shown (see Appendix C) that the biases of both $\mathcal{L}_{DR}$ and $\mathcal{L}_{TDR}$ are composed of the product of the errors of the propensity model and the imputation model, weighted by $1/\hat p_{u,i}$. Therefore, given the same learned propensities, the more accurate the imputed errors are, the smaller the bias is. Since TDR updates $\hat e_{u,i}$ by adding the extra term $\eta(1/\hat p_{u,i} - 1)$, $\tilde e_{u,i}$ is expected to be more accurate than $\hat e_{u,i}$, resulting in a smaller bias for $\mathcal{L}_{TDR}$ than for $\mathcal{L}_{DR}$. Importantly, TDR provides a model-agnostic framework due to the free choice of the initial imputed errors in Step 1, which has great potential for recommendation.
TDR can be assembled into any competing DR approach (Wang et al., 2019; Guo et al., 2021; Dai et al., 2022) by updating its error imputation model with the targeting step. This extra targeting step tends to reduce both the bias and the variance of the competing DR approach, resulting in more accurate predictions. Next, Theorem 2 establishes the double robustness of the TDR estimator (see Appendix C for proofs).

Theorem 2. The proposed TDR estimator has the following properties: (a) (unbiasedness under accurate imputed errors) $\mathcal{L}_{TDR}$ is unbiased if $\tilde e_{u,i}$ accurately estimates $g_{u,i}$. (b) (unbiasedness under accurate learned propensities) Suppose that $\hat p_{u,i}$ accurately estimates $p_{u,i}$ and the validity of $\hat e_{u,i}$ does not hold; then $\mathcal{L}_{EIB}$ is biased, while $\mathcal{L}_{TDR}$ is unbiased.

Theorem 2(b) also reveals that TDR can remove the bias of $\mathcal{L}_{EIB}$ even when the initial imputed errors are inaccurate, provided the learned propensities are accurate.
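The targeting step has a closed-form solution because the model is linear in the single parameter $\eta$. The sketch below, written in NumPy with hypothetical function names and synthetic data, fits $\eta$ by least squares on the exposed pairs and then checks two consequences stated above: the correction term vanishes for the targeted errors (validity), and a second targeting pass returns $\eta \approx 0$ (preservation). It is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def targeting_step(e, o, p_hat, e_hat):
    """Fit eta in e_tilde = e_hat + eta * (1/p_hat - 1) on exposed pairs."""
    z = 1.0 / p_hat - 1.0                      # the single targeting variable
    denom = np.sum(o * z ** 2)
    eta = 0.0 if denom == 0 else np.sum(o * (e - e_hat) * z) / denom
    return e_hat + eta * z, eta

def correction_term(e, o, p_hat, e_imp):
    """Correction term that separates the DR loss from the EIB loss."""
    return np.mean(o * (e - e_imp) * (1.0 - p_hat) / p_hat)

def tdr_loss(e, o, p_hat, e_tilde):
    """DR-style loss evaluated with the targeted imputed errors."""
    return np.mean(e_tilde + o * (e - e_tilde) / p_hat)

# Synthetic data with a deliberately crude initial imputation.
rng = np.random.default_rng(0)
n = 1000
p_hat = rng.uniform(0.05, 0.9, n)
o = (rng.random(n) < p_hat).astype(float)
e = rng.uniform(0.0, 1.0, n)
e_hat = np.full(n, 0.3)

e_tilde, eta = targeting_step(e, o, p_hat, e_hat)
print("eta:", eta)
print("correction before:", correction_term(e, o, p_hat, e_hat))
print("correction after: ", correction_term(e, o, p_hat, e_tilde))
```

Because the correction term is zero after targeting, `tdr_loss` evaluated at `e_tilde` coincides with the EIB loss evaluated at the same imputed errors, which is the sense in which TDR can be read as an EIB estimator.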

4.2. SEMI-PARAMETRIC COLLABORATIVE LEARNING

In this subsection, we propose a novel TDR-based collaborative learning (TDR-CL) approach, in which the imputed errors $\tilde e_{u,i}$ are decomposed into a parametric error imputation model part $\hat e_{u,i}$ and a nonparametric targeting part $\omega_{u,i} \triangleq \eta(1/\hat p_{u,i} - 1)$ as in Section 4.1, i.e., $\tilde e_{u,i} = \hat e_{u,i} + \omega_{u,i}$. The latter corrects the residuals of the error imputation model. By updating the parametric and nonparametric parts collaboratively, the bias and variance of the TDR estimator can be further reduced, resulting in more accurate predictions.

First, the embedding of each user $u$ and item $i$ is obtained by matrix factorization, and the stack layer forms the embedding $x_{u,i}$ by concatenation. TDR-based learning methods require estimated propensities for all user-item pairs, so the Naive Bayes approach is no longer applicable. To handle this problem, the pre-trained propensities are obtained by logistic regression of $o_{u,i}$ on $x_{u,i}$, and the model parameters are used to initialize $p_\xi(x_{u,i})$ in the iterative learning process. Given both the parametric error imputation part $\hat e_{u,i} = g_\phi(x_{u,i})$ and the nonparametric targeting part $\omega_{u,i}$, the propensity model $p_\xi(x_{u,i})$ and the prediction model $f_\theta(x_{u,i})$ are updated simultaneously using the training loss

$$\mathcal{L}_{TDR\text{-}CL}(\theta, \xi, \phi) = \mathcal{L}_{TDR} + |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \left[ -o_{u,i} \log \hat p_{u,i} - (1 - o_{u,i}) \log(1 - \hat p_{u,i}) \right],$$

where $\hat p_{u,i} = p_\xi(x_{u,i})$, $e_{u,i} = (f_\theta(x_{u,i}) - r_{u,i}(1))^2$, and $\tilde e_{u,i} = (f_\theta(x_{u,i}) - g_\phi(x_{u,i}) - \omega_{u,i} - \perp(f_\theta(x_{u,i})))^2$, with $\perp$ the operator that sets the gradient of the operand to zero, so that $\nabla_\theta \perp(f_\theta(x_{u,i})) = 0$ and $\perp(f_\theta(x_{u,i})) = f_\theta(x_{u,i})$.

Then, unlike traditional alternating learning algorithms that directly use the parametric part $g_\phi(x_{u,i})$ as $\tilde e_{u,i}$, the proposed collaborative learning additionally uses $\omega_{u,i}$ as a nonparametric correction term summed with $g_\phi(x_{u,i})$ to correct the estimation of $e_{u,i}$. Specifically, given the prediction model and the propensity model, $\tilde e_{u,i}$ first updates its parametric part $g_\phi(x_{u,i})$ by minimizing

$$\mathcal{L}_e(\theta, \xi, \phi) = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} (\tilde e_{u,i} - e_{u,i})^2}{\hat p_{u,i}},$$

where $e_{u,i} = r_{u,i}(1) - f_\theta(x_{u,i})$ and $\tilde e_{u,i} = g_\phi(x_{u,i}) + \omega_{u,i}$. Next, the targeting step described in Section 4.1 is applied to further update the imputed errors $\tilde e_{u,i}$. By calculating the optimal step size for the line search,

$$\eta^* = \arg\min_\eta \sum_{(u,i) \in \mathcal{D}} o_{u,i} \left( e_{u,i}(\theta) - \hat e_{u,i}(\phi) - \eta(1/\hat p_{u,i} - 1) \right)^2,$$

the nonparametric targeted error term $\omega_{u,i}$ is updated by adding $\eta^*(1/\hat p_{u,i} - 1)$.

In summary, the proposed learning approach collaboratively updates the parametric term $\hat e_{u,i} = g_\phi(x_{u,i})$ and the nonparametric term $\omega_{u,i}$ to achieve a better trade-off in estimating $e_{u,i}$. This can reduce the bias of existing DR methods such as DR-JL (Wang et al., 2019) and MRDR-DL (Guo et al., 2021) by further modeling the fitted residuals of the parametric part $\hat e_{u,i}$. On the other hand, as shown in Theorems 1 and 2, when $\hat e_{u,i}$ is already an accurate estimate of $e_{u,i}$, the introduction of the targeted error term $\tilde e_{u,i}$ satisfies the no-harm property and unbiasedness is maintained. We summarize the proposed TDR-CL approach in Alg. 1.
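To make the interplay of the two parts concrete, the following NumPy sketch runs a few collaborative rounds on synthetic data. It departs from the paper's setup in several ways that are assumptions of this illustration: the parametric part $g_\phi$ is a linear model fitted in closed form by weighted least squares instead of a neural model trained by gradient steps, the prediction and propensity models are held fixed, and the targeting residual is measured against the current $\tilde e_{u,i} = g_\phi(x_{u,i}) + \omega_{u,i}$ so that repeated rounds do not double count earlier corrections.

```python
import numpy as np

def collaborative_round(e, o, p_hat, X, omega):
    """One round: refit the parametric part, then apply the targeting update."""
    z = 1.0 / p_hat - 1.0
    # Parametric part: minimise sum o * (X @ phi + omega - e)^2 / p_hat over phi,
    # i.e. weighted least squares with weights o / p_hat.
    sw = np.sqrt(o / p_hat)
    phi, *_ = np.linalg.lstsq(X * sw[:, None], (e - omega) * sw, rcond=None)
    g = X @ phi
    # Nonparametric part: one-parameter line search along z on exposed pairs.
    resid = e - (g + omega)
    eta = np.sum(o * resid * z) / np.sum(o * z ** 2)
    return g, omega + eta * z

# Synthetic data; the true error is nonlinear in the features on purpose.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
p_hat = rng.uniform(0.05, 0.9, n)
o = (rng.random(n) < p_hat).astype(float)
e = X[:, 1] ** 2 + 0.5 * rng.random(n)

omega = np.zeros(n)
for _ in range(3):
    g, omega = collaborative_round(e, o, p_hat, X, omega)
e_tilde = g + omega
```

Because the targeting update is the last operation of each round, the combined imputation `e_tilde` satisfies equation (2) at the end of every round, whatever the quality of the parametric fit.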

5. SEMI-SYNTHETIC EXPERIMENTS

In this section, following previous studies (Schnabel et al., 2016; Wang et al., 2019; Saito, 2020; Guo et al., 2021), we aim to answer the following research question (RQ) on semi-synthetic datasets: RQ1. Does the proposed TDR estimator achieve both lower bias and lower variance when estimating the ideal loss in the presence of selection bias?

5.1. EXPERIMENTAL SETUP

Dataset and Preprocessing. MovieLens 100K (ML-100K) is a dataset of 100,000 missing-not-at-random (MNAR) ratings from 943 users and 1,682 movies collected from movie recommendation ratings. MovieLens 1M (ML-1M) is a larger dataset of 1,000,209 MNAR ratings from 6,040 users and 3,952 movies. Following the data preprocessing procedure of previous studies (Schnabel et al., 2016; Wang et al., 2019; Saito, 2020; Guo et al., 2021), we first use matrix factorization (Koren et al., 2009) to complete the rating matrix on the five-point scale. Then, for each predicted rating $R_{u,i} \in \{1, 2, 3, 4, 5\}$, we assign the propensity $p_{u,i} \in (0, 1)$ with $p_{u,i} = p\,\alpha^{\max(1,\, 5 - R_{u,i})}$. Finally, we replace the predicted ratings $R_{u,i}$ with $r^{true}_{u,i}$ in $\{0.1, 0.3, 0.5, 0.7, 0.9\}$ and sample the binary click indicator and conversion label with the Bernoulli sampling $o_{u,i} \sim \mathrm{Bern}(p_{u,i})$, $r_{u,i} \sim \mathrm{Bern}(r^{true}_{u,i})$, $\forall (u, i) \in \mathcal{D}$, where $\mathrm{Bern}(\cdot)$ denotes the Bernoulli distribution.

Dataset | Method | ONE | THREE | FIVE | ROTATE | SKEW | CRS
ML-100K | Naive | … ± 0.0025 | 0.0790 ± 0.0028 | 0.1027 ± 0.0028 | 0.1378 ± 0.0011 | 0.0265 ± 0.0021 | 0.1062 ± 0.0022
ML-100K | EIB | 0.5442 ± 0.0016 | 0.5878 ± 0.0017 | 0.6167 ± 0.0018 | 0.2533 ± 0.0004 | 0.3584 ± 0.0007 | 0.1443 ± 0.0007
ML-100K | IPS | 0.0338 ± 0.0033 | 0.0390 ± 0.0037 | 0.0511 ± 0.0033 | 0.0696 ± 0.0026 | 0.0129 ± 0.0027 | 0.0526 ± 0.0026
ML-100K | DR | 0.0140 ± 0.0034 | 0.0180 ± 0.0037 | 0.0150 ± 0.0034 | 0.0401 ± 0.0016 | 0.0101 ± 0.0027 | 0.0237 ± 0.0025
ML-100K | TDR | 0.0053 ± 0.0026* | 0.0035 ± 0.0025* | 0.0066 ± 0.0032* | 0.0325 ± 0.0015* | 0.0029 ± 0.0020* | 0.0193 ± 0.0025*
ML-1M | Naive | 0.0682 ± 0.0007 | 0.0783 ± 0.0007 | 0.1014 ± 0.0008 | 0.1377 ± 0.0005 | 0.0256 ± 0.0007 | 0.1054 ± 0.0006
ML-1M | EIB | 0.5437 ± 0.0005 | 0.5872 ± 0.0005 | 0.6157 ± 0.0005 | 0.2531 ± 0.0001 | 0.3575 ± 0.0002 | 0.1442 ± 0.0001
ML-1M | IPS | 0.0343 ± 0.0009 | 0.0394 ± 0.0009 | 0.0508 ± 0.0009 | 0.0687 ± 0.0006 | 0.0130 ± 0.0008 | 0.0528 ± 0.0007
ML-1M | DR | 0.0130 ± 0.0009 | 0.0168 ± 0.0009 | 0.0133 ± 0.0009 | 0.0399 ± 0.0005 | 0.0090 ± 0.0008 | 0.0229 ± 0.0007
ML-1M | TDR | 0.0054 ± 0.0009* | 0.0031 ± 0.0009* | 0.0076 ± 0.0009* | 0.0324 ± 0.0005* | 0.0031 ± 0.0008* | 0.0187 ± 0.0007*
Note: * means statistically significant results (p-value ≤ 0.001) using the paired t-test compared with the best baseline.

Predicted Metrics. The following prediction matrices are used to evaluate the debiasing performance under different scenarios.
• ONE: $\hat r_{u,i}$ is identical to $r^{true}_{u,i}$, except that $|\{(u, i) \mid r^{true}_{u,i} = 0.9\}|$ randomly selected $r^{true}_{u,i}$ of 0.1 are flipped to 0.9.
• THREE: Same as ONE, but flipping $r^{true}_{u,i}$ of 0.3 instead.
• FIVE: Same as ONE, but flipping $r^{true}_{u,i}$ of 0.5 instead.
• ROTATE: $\hat r_{u,i} = r^{true}_{u,i} - 0.2$ when $r^{true}_{u,i} \ge 0.3$, and $\hat r_{u,i} = 0.9$ when $r^{true}_{u,i} = 0.1$.
• SKEW: $\hat r_{u,i}$ follows the truncated Gaussian distribution $\mathcal{N}_{[0.1, 0.9]}(\mu = r^{true}_{u,i}, \sigma = (1 - r^{true}_{u,i})/2)$.
• CRS: $\hat r_{u,i} = 0.2$ if $r^{true}_{u,i} \le 0.6$; otherwise, $\hat r_{u,i} = 0.6$.

Experimental Details. For each prediction matrix $\hat R = \{\hat r_{u,i} : (u, i) \in \mathcal{D}\}$, the proposed TDR is compared with the Naive (Koren et al., 2009), EIB (Hernández-Lobato et al., 2014; Steck, 2010), IPS (Saito et al., 2020; Schnabel et al., 2016), and DR (Wang et al., 2019; Saito, 2020) methods. We obtain the propensities by $1/\hat p_{u,i} = (1 - \beta)/p_{u,i} + \beta/p_e$, where $p_e = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} o_{u,i}$, and $\beta$ is randomly sampled from $[0, 1]$ to introduce noise. Define $\hat e_{u,i} = \mathrm{CE}(\sum_{(u,i) \in \mathcal{O}} r_{u,i} w_{u,i}, \hat r_{u,i})$, where $w_{u,i} = (1/\hat p_{u,i}) / (\sum_{(u,i) \in \mathcal{O}} 1/\hat p_{u,i})$ and CE denotes the cross-entropy loss. For EIB and DR, the imputed error is computed as $\tilde e_{u,i} = \hat e_{u,i}$. For TDR, $\tilde e_{u,i} = \hat e_{u,i} + \eta^*(1/\hat p_{u,i} - 1)$, where $\eta^* = \arg\min_\eta \sum_{(u,i) \in \mathcal{O}} (e_{u,i} - \hat e_{u,i} - \eta(1/\hat p_{u,i} - 1))^2$. The performance of the estimators is measured by the absolute relative error (RE) between the estimated and ideal loss, $\mathrm{RE}(\mathcal{L}_{est}) = |\mathcal{L}_{ideal}(\hat R) - \mathcal{L}_{est}(\hat R)| / \mathcal{L}_{ideal}(\hat R)$, where $\mathcal{L}_{est}$ denotes the estimator to be compared. RE evaluates the accuracy of the estimated loss, and a smaller RE value indicates a higher estimation accuracy.
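The building blocks of this setup are simple enough to sketch directly. The helper names below are hypothetical, and the constants passed to `assign_propensity` are placeholders, since the values of $p$ and $\alpha$ are not stated in this section.

```python
import numpy as np

def assign_propensity(R, p, alpha):
    """Propensity p * alpha^max(1, 5 - R) for predicted ratings R in {1,...,5}."""
    return p * alpha ** np.maximum(1, 5 - R)

def noisy_inverse_propensity(p_true, o, beta):
    """Mix the true inverse propensity with the inverse marginal exposure rate."""
    p_e = o.mean()
    return (1.0 - beta) / p_true + beta / p_e

def relative_error(l_ideal, l_est):
    """Absolute relative error between an estimated loss and the ideal loss."""
    return abs(l_ideal - l_est) / l_ideal

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=1000)                  # completed ratings
p_true = assign_propensity(R, p=0.5, alpha=0.5)    # placeholder constants
o = (rng.random(R.size) < p_true).astype(float)    # o ~ Bern(p)
beta = rng.uniform(0.0, 1.0)                       # noise level in [0, 1]
p_hat = 1.0 / noisy_inverse_propensity(p_true, o, beta)
```

Setting `beta = 0` recovers the true propensities, while `beta = 1` replaces every propensity by the overall exposure rate, so `beta` interpolates between an accurate and an uninformative propensity model.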

5.2. EXPERIMENT RESULTS (RQ1)

In Table 1, we report the means and standard deviations of the RE of the five estimators for each prediction matrix over 20 sampling runs. On the one hand, the average RE of the IPS, DR and TDR methods is significantly lower than that of the Naive method, verifying the validity of causal-based debiasing methods. The proposed TDR achieves the lowest RE in all settings, which is attributed to the introduced correction term $\omega_{u,i}$ for estimating $e_{u,i}$ that further reduces the bias of DR. The direct application of the EIB method performs even worse than the Naive method, owing to the difficulty of accurately estimating $e_{u,i}$. On the other hand, consistent with the conclusion of Theorem 1, the standard deviation of the EIB method is significantly lower than those of the IPS and DR methods. The proposed TDR method inherits this advantage of EIB, with lower standard deviations than IPS and DR in all settings, reflecting stronger robustness. It can be concluded that the estimation accuracy and robustness of the proposed method are significantly improved compared to the previous methods.

6. REAL-WORLD EXPERIMENTS

In this section, we conduct experiments to evaluate the proposed methods on two real-world benchmark datasets containing missing-at-random (MAR) ratings. Throughout, our methods are implemented without uniform data to estimate the propensities, which differs from the existing Naive Bayes approach. We aim to answer the following RQs: RQ2. How do the proposed methods compare with the existing methods in terms of debiasing performance in practice? RQ3. How does the collaborative learning phase design affect the performance of our methods? RQ4. Do our methods stably perform well under different learned propensities?

6.1. EXPERIMENTAL SETUP

Dataset and Preprocessing. MAR ratings are necessary to evaluate the performance of debiasing methods on real-world datasets. Following previous studies, we take the following two benchmark datasets: Coat Shopping has 4,640 MAR and 6,960 MNAR ratings from 290 users on 300 coats. Music! R3 has 54,000 MAR and 311,704 MNAR ratings from 15,400 users on 1,000 songs.

Baselines. We take the widely used Matrix Factorization (MF) as the base model (Koren et al., 2009), and compare the proposed methods with the following baselines: Base Model (Koren et al., 2009), IPS (Schnabel et al., 2016), SNIPS (Swaminathan & Joachims, 2015), IPS with asymmetric training (IPS-AT) (Saito, 2020), CVIB (Wang et al., 2020b), DIB (Liu et al., 2021), DR (Saito, 2020), DR-JL (Wang et al., 2019), DR-CL, MRDR-JL (Guo et al., 2021), and MRDR-CL, where DR-CL and MRDR-CL are performed using the proposed Alg. 1 but without the targeting step update (lines 9-11), for comparison purposes. In addition, the proposed TDR-based methods include TDR, TDR-JL, and TMRDR-JL, implemented with a single targeting step, and TDR-CL and TMRDR-CL, implemented with the collaborative learning approach shown in Alg. 1. The real-world experimental protocols and details are provided in Appendix D.

6.2. PERFORMANCE COMPARISON (RQ2)

In Table 2, we report the performance of various debiasing methods using MSE, AUC, NDCG@5, and NDCG@10 as evaluation metrics. Among previous debiasing methods, the propensity-based IPS, SNIPS, and IPS-AT and the information bottleneck-based CVIB and DIB all outperform the base model. The doubly robust methods that use alternating learning, such as DR-JL, DR-CL, MRDR-JL, and MRDR-CL, outperform DR and are considered the most competitive baselines. The proposed TDR estimators are implemented with both single-step and collaborative learning, using DR and MRDR as the initial error imputation models, and outperform the baseline methods significantly on the AUC, NDCG@5, and NDCG@10 metrics, which is attributed to the effectiveness of the introduced nonparametric correction term. The collaborative version of TDR, which applies the proposed targeting step repeatedly, achieves the best performance within both the DR and the MRDR families. The fact that TDR-JL and TMRDR-JL, which apply the targeting step only at the final training of the prediction models, outperform DR-JL and MRDR-JL, respectively, further demonstrates the effectiveness of the proposed targeting step in correcting imputed errors.

6.3. IN-DEPTH ANALYSIS (RQ3, RQ4)

Ablation Study (RQ3). To illustrate the specific reasons for the effectiveness of the TDR-CL algorithm, we conduct ablation studies on DR-based and MRDR-based methods, respectively. From Figure 1, DR-CL and DR-JL perform similarly on the MSE, AUC and NDCG@5 metrics, and the MRDR approaches show similar behavior, which indicates that the direct use of the collaborative learning approach without targeting steps cannot improve prediction performance. However, the proposed TDR-CL and TMRDR-CL methods achieve a significant performance improvement over the DR-CL and MRDR-CL methods without targeting steps. This ablation study reveals that the improvement of the proposed TDR-CL and TMRDR-CL originates from the nonparametric correction term of the imputed errors, not from introducing additional model parameters for updating.

Effect on Learned Propensities (RQ4). An important fact is that the nonparametric correction term in the TDR estimator is based on the given learned propensities. To examine whether the proposed targeting steps stably help to improve the prediction accuracy under different learned propensities obtained by setting different clipping thresholds, we conducted repeated experiments to quantify the sensitivity of the TDR-CL method to the propensity clipping threshold. From Figure 2, the proposed method outperforms the DR-JL and DR-CL methods in terms of AUC, NDCG@5, and NDCG@10 for all clipping thresholds. The optimal performance is reached when the clipping threshold is equal to 0.15, which is interpreted as achieving the best trade-off between information utilization and robustness.
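The trade-off behind the clipping threshold can be seen in a few lines. The sketch below assumes that clipping means bounding the learned propensities from below by the threshold, which is a common convention but is not spelled out in this section; the function name and the sample values are hypothetical.

```python
import numpy as np

def clip_propensity(p_hat, threshold):
    """Bound learned propensities from below so inverse weights stay finite."""
    return np.clip(p_hat, threshold, 1.0)

p_hat = np.array([0.01, 0.08, 0.30, 0.90])     # hypothetical learned propensities
for tau in (0.05, 0.10, 0.15, 0.20):           # thresholds tuned in Appendix D
    clipped = clip_propensity(p_hat, tau)
    z = 1.0 / clipped - 1.0                    # targeting variable 1/p_hat - 1
    print(tau, clipped, z.max())
```

A larger threshold caps the inverse weights and the targeting variable at $1/\tau - 1$, which stabilizes the estimator, but it also overwrites more of the learned propensities with the constant $\tau$.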

7. RELATED WORK

Debiasing in Recommendation. Bias is a common problem inherent in RS (Chen et al., 2020; Wu et al., 2022b), such as popularity bias (Zhang et al., 2021), model selection bias (Yuan et al., 2019), user self-selection bias (Saito, 2020), position bias (Ai et al., 2018), and conformity bias (Liu et al., 2016). Various methods have been proposed for unbiased learning. For example, Schnabel et al. (2016) considered the recommendation as a treatment and introduced the IPS and self-normalized IPS (SNIPS) methods for debiasing explicit feedback data. Saito et al. (2020) extended it to implicit recommendation. Wang et al. (2019) proposed a doubly robust joint learning approach that improved the IPS method. Subsequently, a series of enhanced DR methods were developed, such as MRDR (Guo et al., 2021), Multi-task DR (Zhang et al., 2020), DR-MSE (Dai et al., 2022), BRD-DR (Ding et al., 2022), and SDR (Li et al., 2023c). Li et al. (2023a) proposed a multiple robust method that takes advantage of multiple propensity and error imputation models. In addition, several new debiasing algorithms have been designed using an extra small uniform dataset (Bonner & Vasile, 2018; Chen et al., 2021; Liu et al., 2020; Wang et al., 2021; Li et al., 2023b). Chen et al. (2020) provided a thorough discussion of the recent progress on debiasing tasks in RS. Wu et al. (2022b) established a unified causal analysis framework and gave formal causal definitions of various biases in RS from the perspective of causal inference. Unlike the existing enhanced DR approaches that pursue a better bias-variance trade-off, the proposed TDR reduces both the bias and the variance and is theoretically guaranteed.

Figure 2: Learning performance on the MAR test set in terms of AUC (left), NDCG@5 (middle), and NDCG@10 (right) with varying levels of the propensity clipping threshold.

Targeted Learning. Targeted learning is a general framework in causal inference (van der Laan & Rose, 2011) that includes many field-specific approaches to accommodate scientific problems in different fields, such as survival analysis (Stitelman et al., 2012), genomics (Gruber & van der Laan, 2010), and epidemiology (Rose & van der Laan, 2014). More application scenarios of targeted learning can be found in the two excellent monographs (van der Laan & Rose, 2011; 2018). Shi et al. (2019) proposed adapting neural networks for estimating average treatment effects based on targeted learning. Different from the existing literature on targeted learning, this paper deals with the estimator and the learning problem simultaneously. To the best of our knowledge, this is the first paper that extends targeted learning to the field of debiased recommendation.

8. CONCLUSION

In this paper, we propose a TDR estimator for debiased recommendation that enjoys the properties of double robustness, boundedness, low variance, and robustness to small propensities simultaneously. Theoretical analysis shows that TDR can effectively reduce the bias and variance simultaneously for any DR estimator when the error imputation model is less accurate. In addition, we further propose a novel uniform-data-free TDR-based collaborative learning approach that adaptively implements the targeting step, thus making the prediction model more robust. We conducted experiments on both semi-synthetic and real-world data, which demonstrate the superiority of the proposed method over existing debiasing methods. Throughout, we adopt $1/\hat p_{u,i} - 1$ as the key choice in the targeting step to satisfy equation (2), which can be regarded as first-order targeted learning (Carone et al., 2014). In future work, we will explore higher-order targeted learning and more effective feature selection in the proposed targeting step.

A PROOF OF PROPOSITION 1

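Recall that $p_{u,i} = P(o_{u,i} = 1 \mid x_{u,i}) = E[o_{u,i} \mid x_{u,i}]$ and $g_{u,i} = E[e_{u,i} \mid x_{u,i}]$, both of which are functions of $x_{u,i}$. Throughout, we maintain the common unconfoundedness assumption (i.e., $r_{u,i}(1) \perp\!\!\!\perp o_{u,i} \mid x_{u,i}$) and the consistency assumption (i.e., $r_{u,i}(1) = r_{u,i}$ if $o_{u,i} = 1$). All lower-case letters denote random variables for simplicity.

Proof of Proposition 1. The property of unbiasedness is obvious. Next, we focus on analysing the variance. Define $\sigma^2(x_{u,i}) = \mathrm{Var}(e_{u,i} \mid x_{u,i}) = E[(e_{u,i} - g_{u,i})^2 \mid x_{u,i}]$; then $E[e_{u,i}^2 \mid x_{u,i}] = \sigma^2(x_{u,i}) + g_{u,i}^2$. The variance of the IPS estimator is given by

$$
\begin{aligned}
\mathrm{Var}(\mathcal{L}_{IPS}) &= |\mathcal{D}|^{-1} \cdot \mathrm{Var}\Big( \frac{o_{u,i} e_{u,i}}{p_{u,i}} \Big) \\
&= |\mathcal{D}|^{-1} \cdot \Big\{ E\Big[ \frac{o_{u,i}^2 e_{u,i}^2}{p_{u,i}^2} \Big] - \Big( E\Big[ \frac{o_{u,i} e_{u,i}}{p_{u,i}} \Big] \Big)^2 \Big\} \\
&= |\mathcal{D}|^{-1} \cdot \Big\{ E\Big[ \frac{E[o_{u,i} \mid x_{u,i}] \cdot E[e_{u,i}^2 \mid x_{u,i}]}{p_{u,i}^2} \Big] - \Big( E\Big[ \frac{E[o_{u,i} \mid x_{u,i}] \cdot E[e_{u,i} \mid x_{u,i}]}{p_{u,i}} \Big] \Big)^2 \Big\} \\
&= |\mathcal{D}|^{-1} \cdot \Big\{ E\Big[ \frac{e_{u,i}^2}{p_{u,i}} \Big] - (E[e_{u,i}])^2 \Big\} \\
&= |\mathcal{D}|^{-1} \cdot \Big\{ E\Big[ \frac{E(e_{u,i}^2 \mid x_{u,i})}{p_{u,i}} \Big] - (E[e_{u,i}])^2 \Big\} \\
&= |\mathcal{D}|^{-1} \cdot \Big\{ E\Big[ \frac{\sigma^2(x_{u,i}) + g_{u,i}^2}{p_{u,i}} \Big] - (E[e_{u,i}])^2 \Big\},
\end{aligned}
$$

where the third equation follows by the law of iterated expectations and the unconfoundedness assumption. The variance of the DR estimator is derived by

$$
\begin{aligned}
|\mathcal{D}| \cdot \mathrm{Var}(\mathcal{L}_{DR}) &= \mathrm{Var}\Big( g_{u,i} + \frac{o_{u,i} (e_{u,i} - g_{u,i})}{p_{u,i}} \Big) \\
&= \mathrm{Var}\Big( e_{u,i} + \frac{o_{u,i} - p_{u,i}}{p_{u,i}} (e_{u,i} - g_{u,i}) \Big) \\
&= \mathrm{Var}(e_{u,i}) + \mathrm{Var}\Big( \frac{o_{u,i} - p_{u,i}}{p_{u,i}} (e_{u,i} - g_{u,i}) \Big) \\
&= E[e_{u,i}^2] - [E(e_{u,i})]^2 + \mathrm{Var}\Big( \frac{o_{u,i} - p_{u,i}}{p_{u,i}} (e_{u,i} - g_{u,i}) \Big) \\
&= E[\sigma^2(x_{u,i}) + g_{u,i}^2] - [E(e_{u,i})]^2 + E\Big[ \frac{(o_{u,i} - p_{u,i})^2}{p_{u,i}^2} (e_{u,i} - g_{u,i})^2 \Big] \\
&= E[\sigma^2(x_{u,i}) + g_{u,i}^2] - [E(e_{u,i})]^2 + E\Big[ \frac{E\{(o_{u,i} - p_{u,i})^2 \mid x_{u,i}\}}{p_{u,i}^2} \cdot E\{(e_{u,i} - g_{u,i})^2 \mid x_{u,i}\} \Big] \\
&= E[\sigma^2(x_{u,i}) + g_{u,i}^2] - [E(e_{u,i})]^2 + E\Big[ \frac{p_{u,i} (1 - p_{u,i}) \sigma^2(x_{u,i})}{p_{u,i}^2} \Big] \\
&= E\Big[ \frac{\sigma^2(x_{u,i})}{p_{u,i}} + g_{u,i}^2 \Big] - [E(e_{u,i})]^2,
\end{aligned}
$$

where the decomposition into $\mathrm{Var}(e_{u,i})$ plus the variance of the second term holds by noting that

$$E\Big[ e_{u,i} \frac{(o_{u,i} - p_{u,i})}{p_{u,i}} (e_{u,i} - g_{u,i}) \Big] = E\Big[ \frac{E(o_{u,i} - p_{u,i} \mid x_{u,i})}{p_{u,i}} \cdot E\{e_{u,i} (e_{u,i} - g_{u,i}) \mid x_{u,i}\} \Big] = 0.$$

Since $E[o_{u,i} e_{u,i} + (1 - o_{u,i}) g_{u,i}] = E[g_{u,i}] = E[e_{u,i}]$, we have

$$
\begin{aligned}
|\mathcal{D}| \cdot \mathrm{Var}(\mathcal{L}_{EIB}) &= \mathrm{Var}\big( o_{u,i} e_{u,i} + (1 - o_{u,i}) g_{u,i} \big) \\
&= E\big[ \{o_{u,i} e_{u,i} + (1 - o_{u,i}) g_{u,i}\}^2 \big] - [E(e_{u,i})]^2 \\
&= E\big[ o_{u,i} e_{u,i}^2 + (1 - o_{u,i}) g_{u,i}^2 \big] - [E(e_{u,i})]^2 \\
&= E\big[ p_{u,i} \{\sigma^2(x_{u,i}) + g_{u,i}^2\} + (1 - p_{u,i}) g_{u,i}^2 \big] - [E(e_{u,i})]^2 \\
&= E\big[ p_{u,i} \sigma^2(x_{u,i}) + g_{u,i}^2 \big] - [E(e_{u,i})]^2.
\end{aligned}
$$

Comparing the three expressions and using $0 < p_{u,i} \le 1$, which gives $p_{u,i} \sigma^2(x_{u,i}) \le \sigma^2(x_{u,i})/p_{u,i}$ and $g_{u,i}^2 \le g_{u,i}^2/p_{u,i}$, yields $\mathrm{Var}(\mathcal{L}_{EIB}) \le \mathrm{Var}(\mathcal{L}_{DR}) \le \mathrm{Var}(\mathcal{L}_{IPS})$.

B PROOF OF THEOREM 1

Proof of Theorem 1. The parameter $\eta$ is solved by minimizing

$$\sum_{(u,i) \in \mathcal{D}} o_{u,i} \cdot \Big\{ e_{u,i} - \hat e_{u,i} - \eta \Big( \frac{1}{\hat p_{u,i}} - 1 \Big) \Big\}^2.$$

Taking the first derivative of the above loss with respect to $\eta$ and setting it to zero leads to

$$\sum_{(u,i) \in \mathcal{D}} o_{u,i} \cdot \Big\{ e_{u,i} - \hat e_{u,i} - \eta \Big( \frac{1}{\hat p_{u,i}} - 1 \Big) \Big\} \cdot (1/\hat p_{u,i} - 1) = 0, \quad (4)$$

which implies that

$$\sum_{(u,i) \in \mathcal{D}} o_{u,i} \cdot \{e_{u,i} - \tilde e_{u,i}\} \cdot (1/\hat p_{u,i} - 1) = 0,$$

namely, equation (2) holds. This finishes the proof of Theorem 1(a). If $\hat e_{u,i}$ already satisfies equation (2), then $\eta = 0$ is a solution of equation (4). Let $\hat\eta$ be another solution of equation (4). Since the solution of equation (4) is unique, $\hat\eta$ converges to 0. This proves the conclusion of Theorem 1(b).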
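As a worked step that the proof leaves implicit, the first-order condition in equation (4) is linear in $\eta$ and can be solved explicitly. The display below assumes that at least one exposed pair has $\hat p_{u,i} < 1$, so that the denominator is strictly positive.

```latex
\eta^{*}
  = \frac{\sum_{(u,i) \in \mathcal{D}} o_{u,i}\,(e_{u,i} - \hat e_{u,i})\,(1/\hat p_{u,i} - 1)}
         {\sum_{(u,i) \in \mathcal{D}} o_{u,i}\,(1/\hat p_{u,i} - 1)^{2}},
\qquad
\frac{\partial^{2}}{\partial \eta^{2}}
  \sum_{(u,i) \in \mathcal{D}} o_{u,i}
  \Big\{ e_{u,i} - \hat e_{u,i} - \eta \Big( \frac{1}{\hat p_{u,i}} - 1 \Big) \Big\}^{2}
  = 2 \sum_{(u,i) \in \mathcal{D}} o_{u,i}\,(1/\hat p_{u,i} - 1)^{2} > 0 .
```

The positive second derivative gives the uniqueness used in part (b), and the numerator of $\eta^{*}$ is $|\mathcal{D}|$ times the left-hand side of equation (2), so $\eta^{*} = 0$ exactly when $\hat e_{u,i}$ already satisfies equation (2).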

C PROOF OF THEOREM 2

Proof of Theorem 2. The result of Theorem 2(a) is obvious. To show Theorem 2(b), we first claim that if $\hat e_{u,i}$ is an accurate estimate of $g_{u,i}$, i.e., $\hat e_{u,i} = g_{u,i}$, then it will satisfy equation (2

D REAL-WORLD EXPERIMENTAL PROTOCOLS AND DETAILS

Experimental protocols and details. For real-world experiments, the following four metrics were considered as the evaluation metrics: MSE, AUC, NDCG@5, and NDCG@10. For fast convergence in the learning phase, Adam is utilized as the optimizer for all models. We tune the learning rate in {0.001, 0.005, 0.01, 0.05, 0.1}, weight decay in [1e-6, 1e-2] at a 10x multiplicative ratio, and batch size in {128, 256, 512, 1024, 2048} for Coat and {1024, 2048, 4096, 8192, 16384} for Music! R3. Specifically for the propensity training, we tune the clipping threshold in {0.05, 0.10, 0.15, 0.20}. After finding the best configuration on the validation set, we evaluate the trained models on the MAR test set. Experiments are conducted using an NVIDIA GeForce RTX 3090.
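The search space above can be written down as a small configuration sketch. The dictionary keys, the dataset identifiers and the exhaustive enumeration are assumptions of this example; the text does not say whether the space was searched exhaustively or by another strategy.

```python
from itertools import product

learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
weight_decays = [10.0 ** k for k in range(-6, -1)]   # 1e-6, 1e-5, ..., 1e-2
batch_sizes = {
    "coat": [128, 256, 512, 1024, 2048],
    "music_r3": [1024, 2048, 4096, 8192, 16384],
}
clip_thresholds = [0.05, 0.10, 0.15, 0.20]

def grid(dataset):
    """Yield every hyperparameter combination for the given dataset key."""
    for lr, wd, bs, tau in product(learning_rates, weight_decays,
                                   batch_sizes[dataset], clip_thresholds):
        yield {"lr": lr, "weight_decay": wd, "batch_size": bs, "clip": tau}

print(sum(1 for _ in grid("coat")), "configurations per dataset")
```

Each configuration would be scored on the validation set, and only the selected one evaluated on the MAR test set, as described above.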



MovieLens 100K: https://grouplens.org/datasets/movielens/100k/
MovieLens 1M: https://grouplens.org/datasets/movielens/1m/
Coat Shopping: https://www.cs.cornell.edu/~schnabts/mnar/
Music! R3: http://webscope.sandbox.Music.com/



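Algorithm 1: The Proposed Targeted Doubly Robust Collaborative Learning, TDR-CL
Input: observed ratings $R^o$, pre-trained learned propensities $P$, and $\omega_{u,i} = 0$.
1  while stopping criteria is not satisfied do
2      for number of steps for training the prediction and propensity model do
3          Sample a batch of user-item pairs $\{(u_j, i_j)\}_{j=1}^{J}$ from $\mathcal{D}$;
4          Update $\theta$ and $\xi$ by descending along the gradient $\nabla_{\theta,\xi} \mathcal{L}_{TDR\text{-}CL}(\theta, \xi, \phi)$;
5      end
6      for number of steps for training the imputation model with targeting step do
7          Sample a batch of user-item pairs $\{(u_k, i_k)\}_{k=1}^{K}$ from $\mathcal{O}$;
8          Update $\phi$ by descending along the gradient $\nabla_\phi \mathcal{L}_e(\theta, \xi, \phi)$;
9          Sample a batch of user-item pairs $\{(u_l, i_l)\}_{l=1}^{L}$ from $\mathcal{D}$;
10         $\eta^* \leftarrow \arg\min_\eta \sum_{l=1}^{L} o_{u_l,i_l} (e_{u_l,i_l}(\theta) - \hat e_{u_l,i_l}(\phi) - \eta(1/\hat p_{u_l,i_l} - 1))^2$;
11         Update $\omega_{u,i} \leftarrow \omega_{u,i} + \eta^*(1/\hat p_{u,i} - 1)$ for all user-item pairs.
12     end
13 end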

(a) Comparison of DR-JL, DR-CL and TDR-CL in terms of MSE, AUC and NDCG@5. (b) Comparison of MRDR-JL, MRDR-CL and TMRDR-CL in terms of MSE, AUC and NDCG@5.

Figure 1: Ablation studies on DR methods (top) and MRDR methods (bottom), where DR-CL and MRDR-CL skip the targeting steps in TDR-CL and TMRDR-CL, respectively.

Table 1: Mean and standard deviation of the relative error (RE) of the Naive, EIB, IPS, DR and TDR estimators on ML-100K and ML-1M.

Table 2: MSE, AUC, NDCG@5, and NDCG@10 on the MAR test set of Coat and Music! R3. We bold the outperforming DR-based and MRDR-based models. The proposed TDR methods implemented with a single targeting step are marked with *, and those implemented with collaborative learning are marked with †.

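and the bias of $\mathcal{L}_{TDR}$ is

$$\mathrm{Bias}(\mathcal{L}_{TDR}) = E\Big[ e_{u,i} + \frac{(o_{u,i} - p_{u,i})}{p_{u,i}} (e_{u,i} - \tilde e_{u,i}) \Big] - E[e_{u,i}] = E\Big[ \frac{E(o_{u,i} - p_{u,i} \mid x_{u,i})}{p_{u,i}} E\{e_{u,i} - \tilde e_{u,i} \mid x_{u,i}\} \Big] = 0.$$

This proves the result of Theorem 2(b).

Biases of DR and TDR. Given $\hat p_{u,i}$ and $\tilde e_{u,i}$ for all $(u, i) \in \mathcal{D}$, the bias of TDR is

$$\mathrm{Bias}(\mathcal{L}_{TDR}) = E\Big[ e_{u,i} + \frac{(o_{u,i} - \hat p_{u,i})}{\hat p_{u,i}} (e_{u,i} - \tilde e_{u,i}) \Big] - E[e_{u,i}] = E\Big[ \frac{E(o_{u,i} - \hat p_{u,i} \mid x_{u,i})}{\hat p_{u,i}} E\{e_{u,i} - \tilde e_{u,i} \mid x_{u,i}\} \Big] = E\Big[ \frac{(p_{u,i} - \hat p_{u,i})}{\hat p_{u,i}} \cdot (g_{u,i} - \tilde e_{u,i}) \Big].$$

Similarly, given $\hat p_{u,i}$ and $\hat e_{u,i}$ for all $(u, i) \in \mathcal{D}$, the bias of DR is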

ACKNOWLEDGMENTS

The work was supported by the National Key R&D Program of China under Grant No. 2019YFB1705601.

ETHICS STATEMENT

This work is mostly theoretical and experiments are based on synthetic and public datasets. We claim that this work does not present any foreseeable negative social impact.

REPRODUCIBILITY STATEMENT

Code is provided in Supplementary Materials to reproduce the experimental results.

