RESACT: REINFORCING LONG-TERM ENGAGEMENT IN SEQUENTIAL RECOMMENDATION WITH RESIDUAL ACTOR

Abstract

Long-term engagement is preferred over immediate engagement in sequential recommendation as it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Meanwhile, reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, since online interactions are expensive, it is very difficult for RL algorithms to perform state-action value estimation, exploration, and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct, which seeks a policy that is close to, but better than, the online-serving policy. In this way, sufficient data can be collected near the learned policy so that state-action values can be properly estimated, without the need for online interaction. ResAct optimizes the policy by first reconstructing the online behaviors and then improving upon them via a Residual Actor. To extract long-term information, ResAct utilizes two information-theoretic regularizers to ensure the expressiveness and conciseness of the learned features. We conduct experiments on a benchmark dataset and a large-scale industrial dataset consisting of tens of millions of recommendation requests. Experimental results show that our method significantly outperforms state-of-the-art baselines on various long-term engagement optimization tasks.

1. INTRODUCTION

In recent years, sequential recommendation has achieved remarkable success in various fields such as news recommendation (Wu et al., 2017; Zheng et al., 2018; de Souza Pereira Moreira et al., 2021), digital entertainment (Donkers et al., 2017; Huang et al., 2018; Pereira et al., 2019), e-commerce (Chen et al., 2018; Tang & Wang, 2018) and social media (Zhao et al., 2020b; Rappaz et al., 2021). Real-life products, such as TikTok and Kuaishou, have influenced the daily lives of billions of people with the support of sequential recommender systems. Unlike traditional recommender systems, which assume that the number of recommended items is fixed, a sequential recommender system keeps recommending items to a user until the user quits the current service/session (Wang et al., 2019; Hidasi et al., 2016). In sequential recommendation, as depicted in Figure 1, users can browse an endless stream of items within one session and can start a new session after they quit the old one (Zhao et al., 2020c). Accordingly, an ideal sequential recommender system would be expected to achieve i) short return time between sessions, i.e., high frequency of user visits; and ii) long session length, so that more items can be browsed in each session. We denote these two characteristics, i.e., return time and session length, as long-term engagement, in contrast to immediate engagement, which is conventionally measured by click-through rates (Hidasi et al., 2016). Long-term engagement is preferred over immediate engagement in sequential recommendation as it directly affects product operational metrics such as daily active users (DAUs) and dwell time.
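To make the two engagement metrics concrete, the following sketch computes them from a session log. The log format, field names, and example values are illustrative assumptions, not part of the paper: return time is taken as the gap between the end of one session and the start of the user's next session, and session length as the number of items browsed per session.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical session log: (user_id, session_start, session_end, items_browsed).
sessions = [
    ("u1", datetime(2023, 5, 1, 9, 0),  datetime(2023, 5, 1, 9, 12), 30),
    ("u1", datetime(2023, 5, 1, 21, 0), datetime(2023, 5, 1, 21, 5), 12),
    ("u2", datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 10, 8), 20),
    ("u2", datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 2, 10, 4), 9),
]

def engagement_metrics(sessions):
    """Return (mean return time in hours, mean session length in items)."""
    by_user = defaultdict(list)
    for user, start, end, n_items in sessions:
        by_user[user].append((start, end, n_items))
    return_times = []
    for sess in by_user.values():
        sess.sort(key=lambda s: s[0])
        # Return time: gap between consecutive sessions of the same user.
        for (_, prev_end, _), (next_start, _, _) in zip(sess, sess[1:]):
            return_times.append((next_start - prev_end).total_seconds() / 3600)
    mean_return = sum(return_times) / len(return_times)
    mean_length = sum(s[3] for s in sessions) / len(sessions)
    return mean_return, mean_length
```

A recommender optimizing long-term engagement would aim to decrease the first quantity and increase the second, whereas click-through rate only reflects behavior within a single recommendation request.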

