PERSONALIZED REWARD LEARNING WITH INTERACTION-GROUNDED LEARNING (IGL)

Abstract

In an era of countless content offerings, recommender systems alleviate information overload by providing users with personalized content suggestions. Due to the scarcity of explicit user feedback, modern recommender systems typically optimize for the same fixed combination of implicit feedback signals across all users. However, this approach disregards a growing body of work highlighting that (i) implicit signals can be used by users in diverse ways, signaling anything from satisfaction to active dislike, and (ii) different users communicate preferences in different ways. We propose applying the recent Interaction Grounded Learning (IGL) paradigm to address the challenge of learning representations of diverse user communication modalities. Rather than requiring a fixed, human-designed reward function, IGL is able to learn personalized reward functions for different users and then optimize directly for the latent user satisfaction. We demonstrate the success of IGL with experiments using simulations as well as with real-world production traces.

1. INTRODUCTION

From shopping to reading the news, modern Internet users have access to an overwhelming amount of content and choices from online services. Recommender systems offer a way to improve user experience and decrease information overload by providing a customized selection of content. A key challenge for recommender systems is the rarity of explicit user feedback, such as ratings or likes/dislikes (Grčar et al., 2005). Rather than explicit feedback, practitioners typically use more readily available implicit signals, such as clicks (Hu et al., 2008), webpage dwell time (Yi et al., 2014), or inter-arrival times (Wu et al., 2017) as a proxy for user satisfaction. These implicit signals serve as the reward objective in recommender systems, with the popular Click-Through Rate (CTR) metric as the gold standard for the field (Silveira et al., 2019). However, directly using implicit signals as the reward function presents several issues.

Implicit signals do not directly map to user satisfaction. Although clicks are routinely equated with user satisfaction, there are examples of unsatisfied users interacting with content via clicks. Clickbait exploits cognitive biases such as caption bias (Hofmann et al., 2012) or the curiosity gap (Scott, 2021) so that low-quality content attracts more clicks. Direct optimization of the CTR degrades user experience by promoting clickbait items (Wang et al., 2021). Recent work shows that users will even click on content that they know a priori they will dislike: in a study of online news reading, Lu et al. (2018a) discovered that 15% of the time, users clicked on articles that they strongly disliked. Similarly, although longer webpage dwell times are associated with satisfied users, a study by Kim et al. (2014) found that dwell time is also significantly impacted by page topic, readability, and content length.

Different users communicate in different ways.
Demographic background is known to affect the ways in which users engage with recommender systems. A study by Beel et al. (2013) shows that older users have a CTR more than 3x higher than their younger counterparts. Gender also has an impact on interactions; e.g., men are more likely to leave dislikes on YouTube videos than women (Khan, 2017). At the same time, a growing body of work shows that recommender systems do not provide consistent performance across demographic subgroups. For example, multiple studies on ML fairness in recommender systems show that women on average receive less accurate recommendations than men (Ekstrand et al., 2018; Mansoury et al., 2020). Current systems are also unfair across different age brackets, with statistically significant degradation in recommendation utility as the age of the user increases (Neophytou et al., 2022). The work of Neophytou et al. identifies usage features as the most predictive of mean recommender utility, hinting that the inconsistent performance of recommendation algorithms across subgroups arises from differences in how users interact with the recommender system.

These challenges motivate the need for personalized reward functions. However, extensively modeling the ways in which implicit signals are used, or how demographics affect interaction style, is costly and inefficient. Current state-of-the-art systems rely on reward functions that are manually engineered combinations of implicit signals, typically refined through laborious trial and error. Yet as recommender systems and their users evolve, so do the ways in which users implicitly communicate preferences. Any extensive models or hand-tuned reward functions developed now could easily become obsolete within a few years. To this end, we propose Interaction Grounded Learning (IGL) (Xie et al., 2021) for personalized reward learning (IGL-P).
IGL is a learning paradigm in which a learner optimizes for unobservable rewards by interacting with the environment and associating observable feedback with the true latent reward. Prior IGL approaches assume the feedback depends either on the reward alone (Xie et al., 2021) or on the reward and action (Xie et al., 2022). These methods are unable to disambiguate personalized feedback that depends on the context. Other approaches, such as reinforcement learning and traditional contextual bandits, are sensitive to the choice of reward function. Our proposed personalized IGL, IGL-P, resolves the two challenges above while making minimal assumptions about the value of observed user feedback. Our new approach is able to incorporate both explicit and implicit signals, leverage ambiguous user feedback, and adapt to the different ways in which users interact with the system.

Our Contributions: We present IGL-P, the first IGL strategy for context-dependent feedback, the first use of inverse kinematics as an IGL objective, and the first IGL strategy for more than two latent states. Our proposed approach provides an alternative to agent learning methods that require handcrafted reward functions. Using simulations and real production data, we demonstrate that IGL-P learns personalized rewards when applied to the domain of online recommender systems, which requires at least three reward states.
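To make the context-dependent feedback problem concrete, the following toy sketch shows why a single fixed reward function fails and what a personalized decoder must accomplish. The user types, feedback labels, and the decoder table `psi` are all hypothetical illustrations, not the IGL-P algorithm itself: IGL-P must *learn* such a mapping from interaction alone, whereas here it is given as a lookup table over three latent states (dislike / neutral / like).

```python
# Toy illustration (hypothetical): the same observable signal carries
# different meanings for different users, so feedback can only be decoded
# in the context of who produced it.
def feedback(user_type, latent_reward):
    """Map a latent satisfaction state (0=dislike, 1=neutral, 2=like)
    to the signal each (hypothetical) user type actually emits."""
    if user_type == 0:   # "clicker": clicks when satisfied
        return {0: "skip", 1: "click", 2: "click+dwell"}[latent_reward]
    else:                # "hate-clicker": clicks on disliked content too
        return {0: "click", 1: "skip", 2: "dwell"}[latent_reward]

# A personalized decoder psi(context, feedback) -> latent reward, given
# here as a table; learning it from interaction is what IGL-P addresses.
psi = {
    (0, "skip"): 0, (0, "click"): 1, (0, "click+dwell"): 2,
    (1, "click"): 0, (1, "skip"): 1, (1, "dwell"): 2,
}

for user_type in (0, 1):
    for latent in (0, 1, 2):
        y = feedback(user_type, latent)
        assert psi[(user_type, y)] == latent  # decoder recovers latent state

# A context-free rule cannot: "click" means neutral for one user type
# and dislike for the other.
assert psi[(0, "click")] != psi[(1, "click")]
```

Note that a reward function fixed across users would score the "click" signal identically for both user types, which is exactly the failure mode the hate-clicking study of Lu et al. (2018a) documents.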

2.1. CONTEXTUAL BANDITS

The contextual bandit (Auer et al., 2002; Langford & Zhang, 2007) is a statistical model of myopic decision making which is pervasively applied in recommender systems (Bouneffouf et al., 2020). IGL operates via reduction to contextual bandits; hence, we briefly review contextual bandits here. The contextual bandit problem proceeds over T rounds. At each round t ∈ [T], the learner receives a context x_t ∈ X (the context space), selects an action a_t ∈ A (the action space), and then observes a reward r_t(a_t), where r_t : A → [0, 1] is the underlying reward function. We assume that for each round t, conditioned on x_t, r_t is sampled from a distribution P_{r_t}(· | x_t). A contextual bandit (CB) algorithm attempts to minimize the cumulative regret

    Reg_CB(T) := Σ_{t=1}^{T} [ r_t(π⋆(x_t)) − r_t(a_t) ]    (1)

relative to an optimal policy π⋆ over a policy class Π. In general, both the contexts x_1, …, x_T and the distributions P_{r_1}, …, P_{r_T} can be selected in an arbitrary, potentially adaptive fashion based on the history. In the sequel we describe IGL in a stochastic environment, but the reduction induces a nonstationary contextual bandit problem, and therefore the existence of adversarial contextual bandit algorithms is relevant.
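The protocol above can be sketched with a minimal epsilon-greedy learner on a stochastic environment. This is a generic illustration of the contextual bandit loop and of the regret in Eq. (1), not any algorithm from this paper; the three-context environment, the reward function, and the exploration rate are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(context, action):
    # Hypothetical latent reward: each context prefers a single action.
    return 1.0 if action == context % 3 else 0.0

n_contexts, n_actions, T, eps = 3, 3, 5000, 0.1
counts = np.zeros((n_contexts, n_actions))  # pulls per (context, action)
sums = np.zeros((n_contexts, n_actions))    # reward totals per pair

regret = 0.0
for t in range(T):
    x = int(rng.integers(n_contexts))       # receive context x_t
    if rng.random() < eps:                  # explore with probability eps
        a = int(rng.integers(n_actions))
    else:                                   # exploit empirical means
        means = sums[x] / np.maximum(counts[x], 1)
        a = int(np.argmax(means))
    r = true_reward(x, a)                   # observe only r_t(a_t)
    counts[x, a] += 1
    sums[x, a] += r
    # Accumulate the regret of Eq. (1) against the optimal policy,
    # which here always plays action x % 3.
    regret += true_reward(x, x % 3) - r

print(regret / T)  # average per-round regret; stays small (roughly eps * 2/3)
```

The key contrast with IGL is the line `r = true_reward(x, a)`: a contextual bandit observes the reward directly, whereas an IGL learner would instead observe only a feedback signal from which the latent reward must be inferred.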

