STABLEDR: STABILIZED DOUBLY ROBUST LEARNING FOR RECOMMENDATION ON DATA MISSING NOT AT RANDOM

Abstract

In recommender systems, users tend to rate their favorite items, which leads to data missing not at random and poses a great challenge for unbiased evaluation and learning of prediction models. Doubly robust (DR) methods have been widely studied and demonstrate superior performance. However, in this paper, we show that DR methods are unstable, with unbounded bias, variance, and generalization bounds under extremely small propensities. Moreover, DR's heavier reliance on extrapolation can lead to suboptimal performance. To address these limitations while retaining double robustness, we propose a stabilized doubly robust (StableDR) learning approach with a weaker reliance on extrapolation. Theoretical analysis shows that StableDR simultaneously has bounded bias, variance, and generalization error, even under inaccurate imputed errors and arbitrarily small propensities. In addition, we propose a novel learning approach for StableDR that updates the imputation, propensity, and prediction models cyclically, achieving more stable and accurate predictions. Extensive experiments show that our approaches significantly outperform the existing methods.

1. INTRODUCTION

Modern recommender systems (RSs) are rapidly evolving with the adoption of sophisticated deep learning models (Zhang et al., 2019). However, it is well documented that directly applying advanced deep models usually achieves sub-optimal performance due to the existence of various biases in RS (Chen et al., 2020; Wu et al., 2022b), and these biases can be amplified over time (Mansoury et al., 2020; Wen et al., 2022). A large number of debiasing methods have emerged and gradually become a trend. For many practical tasks in RS, such as rating prediction (Schnabel et al., 2016; Wang et al., 2020a; 2019), post-view click-through rate prediction (Guo et al., 2021), post-click conversion rate prediction (Zhang et al., 2020; Dai et al., 2022), and uplift modeling (Saito et al., 2019; Sato et al., 2019; 2020), a critical challenge is to combat the selection bias and confounding bias that lead to a significant difference between the training sample and the target population (Hernán & Robins, 2020). Various methods have been designed to address this problem, and among them, doubly robust (DR) methods (Wang et al., 2019; Zhang et al., 2020; Chen et al., 2021; Dai et al., 2022; Ding et al., 2022) play the dominant role due to their better performance and theoretical properties. The success of DR is attributed to its double robustness and joint-learning technique. However, DR methods still have many limitations. Theoretical analysis shows that inverse propensity scoring (IPS) and DR methods may have infinite bias, variance, and generalization error bounds in the presence of extremely small propensity scores (Schnabel et al., 2016; Wang et al., 2019; Guo et al., 2021; Li et al., 2023b). In addition, because users are more inclined to rate their preferred items, the problem of data missing not at random (MNAR) often occurs in RS.
This causes selection bias and results in inaccurate predictions for methods that rely more heavily on extrapolation, such as error-imputation-based (EIB) methods (Marlin et al., 2007; Steck, 2013) and DR methods. To overcome the above limitations while maintaining double robustness, we propose a stabilized doubly robust (StableDR) estimator with a weaker reliance on extrapolation, which reduces the negative impact of extrapolation and the MNAR effect on the imputation model. Through theoretical analysis, we demonstrate that StableDR has a bounded bias and generalization error bound for arbitrarily small propensities, which further indicates that StableDR can achieve more stable predictions. Furthermore, we propose a novel cycle learning approach for StableDR. Figure 1 shows the differences between the proposed cycle learning and existing unbiased learning approaches. Two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) first obtains an imputation/propensity model to estimate the ideal loss and then updates the prediction model by minimizing the estimated loss. DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) alternately update the model used to estimate the ideal loss and the prediction model. The proposed learning method cyclically uses different losses to update the three models, aiming for more stable and accurate prediction results. We have conducted extensive experiments on two real-world datasets, and the results show that the proposed approach significantly improves debiasing and convergence performance compared to the existing methods.
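To make the cyclic update scheme concrete, here is a minimal, self-contained sketch on synthetic one-dimensional MNAR data. The three models are reduced to scalar parameters, and the losses (a logistic propensity fit, an inverse-propensity-weighted imputation fit, and a DR loss for the prediction model) are illustrative simplifications of the general recipe, not the paper's exact StableDR objectives:

```python
import numpy as np

# Toy cycle learning on synthetic 1-d MNAR data. Each of the three
# "models" is a single scalar updated by full-batch gradient steps,
# and each update uses the current state of the other two models.

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)                      # features for n user-item pairs
r = 2.0 * x + rng.normal(0.0, 0.1, size=n)  # true ratings (slope 2.0)
# MNAR: the probability of observing a rating grows with x.
o = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(float)

w_prop, w_imp, w_pred, lr = 0.0, 0.0, 0.0, 0.05
for _ in range(500):
    # 1) Propensity step: logistic regression of o on x (log-likelihood ascent).
    p_hat = 1 / (1 + np.exp(-w_prop * x))
    w_prop += lr * np.mean((o - p_hat) * x)
    p_hat = np.clip(1 / (1 + np.exp(-w_prop * x)), 0.05, 1.0)

    # 2) Imputation step: fit pseudo-ratings r_tilde = w_imp * x on the
    #    observed entries, weighted by inverse propensities.
    w_imp -= lr * np.mean(2 * (o / p_hat) * x * (w_imp * x - r))

    # 3) Prediction step: one gradient step on the DR loss, plugging in
    #    the current imputation and propensity models.
    r_tilde = w_imp * x
    grad = np.mean(2 * x * ((w_pred * x - r_tilde) * (1 - o / p_hat)
                            + (o / p_hat) * (w_pred * x - r)))
    w_pred -= lr * grad

print(w_prop, w_imp, w_pred)  # w_pred should approach the true slope 2.0
```

The structural point illustrated here is that each of the three updates consumes the current state of the other two models, and the cycle then repeats, in contrast to two-phase learning (propensity/imputation frozen) and alternating learning (two models only).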

2.1. PROBLEM SETTING

In RS, because users are more inclined to rate items they prefer, the collected ratings are typically missing not at random (MNAR). We formulate the data MNAR problem using the widely adopted potential outcome framework (Neyman, 1990; Imbens & Rubin, 2015). Let $\mathcal{U} = \{1, 2, \dots, U\}$, $\mathcal{I} = \{1, 2, \dots, I\}$, and $\mathcal{D} = \mathcal{U} \times \mathcal{I}$ be the index sets of users, items, and all user-item pairs, respectively. For each $(u, i) \in \mathcal{D}$, we have a treatment $o_{u,i} \in \{0, 1\}$, a feature vector $x_{u,i}$, and an observed rating $r_{u,i}$, where $o_{u,i} = 1$ if user $u$ rated item $i$ in the logging data and $o_{u,i} = 0$ if the rating is missing. Let $r_{u,i}(1)$ be the potential outcome, i.e., the rating that would be observed if item $i$ had been rated by user $u$, which is observable only for $\mathcal{O} = \{(u, i) \mid (u, i) \in \mathcal{D}, o_{u,i} = 1\}$. Many tasks in RS can be formulated as predicting the potential outcome $r_{u,i}(1)$ from the feature $x_{u,i}$ for each $(u, i)$. Let $\hat{r}_{u,i}(1) = f(x_{u,i}; \phi)$ be a prediction model with parameters $\phi$. If all the potential outcomes $\{r_{u,i}(1) : (u, i) \in \mathcal{D}\}$ were observed, the ideal loss function for solving $\phi$ would be
$$\mathcal{L}_{\text{ideal}}(\phi) = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} e_{u,i},$$
where $e_{u,i}$ is the prediction error, such as the squared loss $e_{u,i} = (\hat{r}_{u,i}(1) - r_{u,i}(1))^2$. $\mathcal{L}_{\text{ideal}}(\phi)$ can be regarded as a benchmark unbiased loss function, even though it is infeasible due to the missingness of $\{r_{u,i}(1) : o_{u,i} = 0\}$. As such, a variety of methods have been developed to approximate $\mathcal{L}_{\text{ideal}}(\phi)$ and thereby address the selection bias, among which the propensity-based estimators show relatively superior performance (Schnabel et al., 2016; Wang et al., 2019). The IPS and DR estimators are
$$\mathcal{E}_{\text{IPS}} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} e_{u,i}}{\hat{p}_{u,i}}, \qquad \mathcal{E}_{\text{DR}} = |\mathcal{D}|^{-1} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{e}_{u,i} + \frac{o_{u,i} (e_{u,i} - \hat{e}_{u,i})}{\hat{p}_{u,i}} \right],$$
where $\hat{p}_{u,i}$ is an estimate of the propensity score $p_{u,i} := \mathbb{P}(o_{u,i} = 1 \mid x_{u,i})$, and $\hat{e}_{u,i}$ is an estimate of $e_{u,i}$.
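As a concrete illustration, both estimators can be computed in a few lines. The following NumPy sketch uses small, made-up toy arrays for $o_{u,i}$, $e_{u,i}$, $\hat{e}_{u,i}$, and $\hat{p}_{u,i}$ (all values are hypothetical):

```python
import numpy as np

# Hypothetical toy values over |D| = 6 user-item pairs.
o = np.array([1, 0, 1, 1, 0, 1], dtype=float)     # observation indicators o_{u,i}
e = np.array([1.0, 2.0, 0.5, 1.5, 3.0, 0.25])     # prediction errors e_{u,i} (usable only where o = 1)
e_hat = np.array([0.8, 1.5, 0.6, 1.2, 2.5, 0.3])  # imputed errors \hat{e}_{u,i}
p_hat = np.array([0.5, 0.2, 0.8, 0.4, 0.3, 0.9])  # estimated propensities \hat{p}_{u,i}

# IPS: inverse-propensity-weighted average of the observed errors.
E_IPS = np.mean(o * e / p_hat)

# DR: imputed errors everywhere, plus an inverse-propensity-weighted
# correction on the observed entries.
E_DR = np.mean(e_hat + o * (e - e_hat) / p_hat)

print(E_IPS, E_DR)
```

Note the double robustness: if the imputed errors were exact (`e_hat == e`), the correction term vanishes and `E_DR` reduces to the ideal loss regardless of `p_hat`; conversely, if `p_hat` equals the true propensities, `E_DR` is unbiased even with inaccurate `e_hat`. The instability studied in this paper stems from the `1 / p_hat` factor, which blows up when a propensity is extremely small.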



Figure 1: During the training of the prediction model, two-phase learning (Marlin et al., 2007; Steck, 2013; Schnabel et al., 2016) uses a fixed imputation/propensity model (Left), whereas DR-JL (Wang et al., 2019), MRDR-DL (Guo et al., 2021), and AutoDebias (Chen et al., 2021) alternate between updating the imputation/propensity model and the prediction model (Middle). The proposed learning approach updates the three models cyclically with stabilization (Right).


Published as a conference paper at ICLR 2023 

