RESACT: REINFORCING LONG-TERM ENGAGEMENT IN SEQUENTIAL RECOMMENDATION WITH RESIDUAL ACTOR

Abstract

Long-term engagement is preferred over immediate engagement in sequential recommendation as it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Meanwhile, reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, due to expensive online interactions, it is very difficult for RL algorithms to perform state-action value estimation, exploration and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct, which seeks a policy that is close to, but better than, the online-serving policy. In this way, we can collect sufficient data near the learned policy so that state-action values can be properly estimated, and there is no need to perform online interaction. ResAct optimizes the policy by first reconstructing the online behaviors and then improving upon them via a Residual Actor. To extract long-term information, ResAct utilizes two information-theoretical regularizers to ensure the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a large-scale industrial dataset which consists of tens of millions of recommendation requests. Experimental results show that our method significantly outperforms the state-of-the-art baselines in various long-term engagement optimization tasks.

1. INTRODUCTION

In recent years, sequential recommendation has achieved remarkable success in various fields such as news recommendation (Wu et al., 2017; Zheng et al., 2018; de Souza Pereira Moreira et al., 2021), digital entertainment (Donkers et al., 2017; Huang et al., 2018; Pereira et al., 2019), e-commerce (Chen et al., 2018; Tang & Wang, 2018) and social media (Zhao et al., 2020b; Rappaz et al., 2021). Real-life products, such as TikTok and Kuaishou, have influenced the daily lives of billions of people with the support of sequential recommender systems. Different from traditional recommender systems, which assume that the number of recommended items is fixed, a sequential recommender system keeps recommending items to a user until the user quits the current service/session (Wang et al., 2019; Hidasi et al., 2016). In sequential recommendation, as depicted in Figure 1, users have the option to browse endless items in one session and can restart a new session after they quit the old one (Zhao et al., 2020c). To this end, an ideal sequential recommender system would be expected to achieve i) low return time between sessions, i.e., high frequency of user visits; and ii) large session length, so that more items can be browsed in each session. We denote these two characteristics, i.e., return time and session length, as long-term engagement, in contrast to immediate engagement, which is conventionally measured by click-through rates (Hidasi et al., 2016). Long-term engagement is preferred over immediate engagement in sequential recommendation as it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Despite its great importance, unfortunately, how to effectively improve long-term engagement in sequential recommendation remains largely uninvestigated: relating changes in long-term user engagement to a single recommendation is a tough problem (Wang et al., 2022).
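For concreteness, the two long-term engagement signals named above, return time and session length, can be computed directly from session logs. A minimal sketch (the log schema and the values are illustrative, not from the paper):

```python
from datetime import datetime

# Hypothetical session log for one user: (session_start, session_end, items_consumed).
sessions = [
    (datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 9, 20), 15),
    (datetime(2023, 1, 1, 21, 0), datetime(2023, 1, 1, 21, 5), 4),
    (datetime(2023, 1, 2, 8, 30), datetime(2023, 1, 2, 8, 50), 12),
]

# Return time: gap (in hours) between the end of one session and the start of the next.
return_times = [
    (sessions[i + 1][0] - sessions[i][1]).total_seconds() / 3600.0
    for i in range(len(sessions) - 1)
]
avg_return_time = sum(return_times) / len(return_times)

# Session length: number of items consumed per session.
avg_session_length = sum(s[2] for s in sessions) / len(sessions)
```

A recommender optimizing long-term engagement would aim to drive `avg_return_time` down and `avg_session_length` up, in contrast to immediate-engagement metrics such as per-item click-through rate.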
Existing works on sequential recommendation have typically focused on estimating the probability of immediate engagement with various neural network architectures (Hidasi et al., 2016; Tang & Wang, 2018). However, they neglect to explicitly improve user stickiness, such as increasing the frequency of visits or extending the average session length. There have been some recent efforts to optimize long-term engagement in sequential recommendation, but they usually rest on strong assumptions, e.g., that recommendation diversity will increase long-term engagement (Teo et al., 2016; Zou et al., 2019). In fact, the relationship between recommendation diversity and long-term engagement is largely empirical, and how to measure diversity properly is also unclear (Zhao et al., 2020c). Recently, reinforcement learning has achieved impressive advances in various sequential decision-making tasks, such as games (Silver et al., 2017; Schrittwieser et al., 2020), autonomous driving (Kiran et al., 2021) and robotics (Levine et al., 2016). Reinforcement learning in general focuses on learning policies which maximize cumulative reward from a long-term perspective (Sutton & Barto, 2018). To this end, it offers us a promising framework to optimize long-term engagement in sequential recommendation (Chen et al., 2019). We can formulate the recommender system as an agent, with users as the environment, and assign rewards to the recommender system based on users' responses, for example, the return time between two sessions. However, in practice there are significant challenges. First, the evolution of user stickiness lasts for a long period, usually days or months, which makes the evaluation of state-action values difficult. Second, probing for rewards in previously unexplored areas, i.e., exploration, requires live experiments and may hurt user experience. Third, rewards of long-term engagement only occur at the beginning or end of a session and are therefore sparse compared to immediate user responses. As a result, representations of states may not contain sufficient information about long-term engagement.
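To illustrate the sparsity of the long-term signal, consider how a session-level reward might be assigned to the per-request decisions within a session. The reward shape and the 24-hour normalizer below are illustrative assumptions, not the paper's exact scheme:

```python
def session_rewards(num_requests, return_time_hours, max_gap_hours=24.0):
    """Assign per-request rewards for one session (illustrative scheme).

    Intermediate requests receive zero reward; only the terminal request
    carries the long-term signal, which is why this reward is sparse
    compared to immediate click feedback.
    """
    rewards = [0.0] * num_requests
    # A shorter return time means the user came back sooner, so we map it
    # to a higher terminal reward, clipped at zero.
    rewards[-1] = max(0.0, 1.0 - return_time_hours / max_gap_hours)
    return rewards

r = session_rewards(num_requests=5, return_time_hours=6.0)
# r == [0.0, 0.0, 0.0, 0.0, 0.75]
```

With such a reward, a session of many requests produces only one informative feedback signal, whereas immediate engagement yields one label per request — the third challenge described above.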


To mitigate the aforementioned challenges, we propose to learn a recommendation policy that is close to, but better than, the online-serving policy. In this way, i) we can collect sufficient data near the learned policy so that state-action values can be properly estimated; and ii) there is no need to perform online interaction. However, directly learning such a policy is quite difficult since we would need to perform optimization over the entire policy space. Instead, our method, ResAct, achieves this by first reconstructing the online behaviors of previous recommendation models, and then improving upon the predictions via a Residual Actor. The original optimization problem is thus decomposed into two sub-tasks which are easier to solve. Furthermore, to learn better representations, two information-theoretical regularizers are designed to ensure the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a real-world dataset consisting of tens of millions of recommendation requests. The results show that ResAct significantly outperforms previous state-of-the-art methods in various long-term engagement optimization tasks.
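The decomposition can be sketched as follows: one network reconstructs the online-serving action, and a second network adds a small, bounded residual on top of it. The layer sizes, the tanh bound, and the residual scale are illustrative assumptions; training (the behavior-reconstruction loss, actor-critic updates, and the information-theoretical regularizers) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Randomly initialized MLP weights (illustrative; training omitted)."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x

state_dim, action_dim = 8, 4
reconstructor = mlp([state_dim, 64, action_dim])    # imitates the online-serving policy
residual_actor = mlp([state_dim, 64, action_dim])   # learns the improvement
residual_scale = 0.1  # bounds how far the final policy can drift from the online one

def resact_action(state):
    a_online = forward(reconstructor, state)          # reconstructed online behavior
    delta = np.tanh(forward(residual_actor, state))   # bounded residual correction
    return a_online + residual_scale * delta

a = resact_action(np.ones(state_dim))
```

Because the residual is bounded, the resulting policy stays in a neighborhood of the online-serving policy where logged data is plentiful, which is what makes state-action value estimation feasible without online interaction.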

2. PROBLEM STATEMENT

In sequential recommendation, users interact with the recommender system on a session basis. A session starts when a user opens the App and ends when he/she leaves. As in Figure 1, when a user starts a session, the recommendation agent begins to feed items to the user, one for each recommendation request, until the session ends. For each request, the user can choose to consume the recommended item or quit the current session. A user may start a new session after he/she exits the old one, and can consume an arbitrary number of items within a session. An ideal recommender system with a goal of long-term engagement would be expected to minimize the average return time between sessions while maximizing the average number of items consumed in a session. Formally, we describe the sequential recommendation problem as a Markov Decision Process (MDP) defined by a tuple ⟨S, A, P, R, γ⟩. S = S_h × S_l is the continuous state space, and s ∈ S indicates the state of a user. Considering the session-request structure in sequential recommendation, we decompose S into two disjoint sub-spaces, i.e., S_h and S_l, which are used to represent session-level



Figure 1: Sequential recommendation.

