RESACT: REINFORCING LONG-TERM ENGAGEMENT IN SEQUENTIAL RECOMMENDATION WITH RESIDUAL ACTOR

Abstract

Long-term engagement is preferred over immediate engagement in sequential recommendation because it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Meanwhile, reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, owing to expensive online interactions, it is very difficult for RL algorithms to perform state-action value estimation, exploration, and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct, which seeks a policy that is close to, but better than, the online-serving policy. In this way, we can collect sufficient data near the learned policy so that state-action values can be properly estimated, and there is no need to perform online interaction. ResAct optimizes the policy by first reconstructing the online behaviors and then improving them via a Residual Actor. To extract long-term information, ResAct utilizes two information-theoretical regularizers to ensure the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a large-scale industrial dataset consisting of tens of millions of recommendation requests. Experimental results show that our method significantly outperforms state-of-the-art baselines on various long-term engagement optimization tasks.

1. INTRODUCTION

In recent years, sequential recommendation has achieved remarkable success in various fields such as news recommendation (Wu et al., 2017; Zheng et al., 2018; de Souza Pereira Moreira et al., 2021), digital entertainment (Donkers et al., 2017; Huang et al., 2018; Pereira et al., 2019), e-commerce (Chen et al., 2018; Tang & Wang, 2018) and social media (Zhao et al., 2020b; Rappaz et al., 2021). Real-life products, such as TikTok and Kuaishou, have influenced the daily lives of billions of people with the support of sequential recommender systems. Different from traditional recommender systems, which assume that the number of recommended items is fixed, a sequential recommender system keeps recommending items to a user until the user quits the current service/session (Wang et al., 2019; Hidasi et al., 2016). In sequential recommendation, as depicted in Figure 1, users can browse endless items in one session and can start a new session after they quit the old one (Zhao et al., 2020c). To this end, an ideal sequential recommender system would be expected to achieve i) low return time between sessions, i.e., high frequency of user visits; and ii) large session length, so that more items can be browsed in each session. We denote these two characteristics, i.e., return time and session length, as long-term engagement, in contrast to immediate engagement, which is conventionally measured by click-through rates (Hidasi et al., 2016). Long-term engagement is preferred over immediate engagement in sequential recommendation because it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Despite its great importance, unfortunately, how to effectively improve long-term engagement in sequential recommendation remains largely uninvestigated. Relating changes in long-term user engagement to a single recommendation is a tough problem (Wang et al., 2022).
Existing works on sequential recommendation have typically focused on estimating the probability of immediate engagement with various neural network architectures (Hidasi et al., 2016; Tang & Wang, 2018). However, they neglect to explicitly improve user stickiness, such as increasing the frequency of visits or extending the average session length. There have been some recent efforts to optimize long-term engagement in sequential recommendation, but they usually rest on strong assumptions, such as the assumption that recommendation diversity increases long-term engagement (Teo et al., 2016; Zou et al., 2019). In fact, the relationship between recommendation diversity and long-term engagement is largely empirical, and how to measure diversity properly is also unclear (Zhao et al., 2020c). Recently, reinforcement learning has achieved impressive advances in various sequential decision-making tasks, such as games (Silver et al., 2017; Schrittwieser et al., 2020), autonomous driving (Kiran et al., 2021) and robotics (Levine et al., 2016). Reinforcement learning in general focuses on learning policies that maximize cumulative reward from a long-term perspective (Sutton & Barto, 2018). To this end, it offers a promising framework for optimizing long-term engagement in sequential recommendation (Chen et al., 2019). We can formulate the recommender system as an agent, with users as the environment, and assign rewards to the recommender system based on users' responses, for example, the return time between two sessions. In practice, however, there are significant challenges. First, the evolution of user stickiness unfolds over a long period, usually days or months, which makes the estimation of state-action values difficult. Second, probing for rewards in previously unexplored areas, i.e., exploration, requires live experiments and may hurt user experience.
Third, rewards of long-term engagement only occur at the beginning or end of a session and are therefore sparse compared to immediate user responses. As a result, representations of states may not contain sufficient information about long-term engagement.


To mitigate the aforementioned challenges, we propose to learn a recommendation policy that is close to, but better than, the online-serving policy. In this way, i) we can collect sufficient data near the learned policy so that state-action values can be properly estimated; and ii) there is no need to perform online interaction. However, directly learning such a policy is quite difficult since we would need to perform optimization over the entire policy space. Instead, our method, ResAct, achieves this by first reconstructing the online behaviors of previous recommendation models and then improving upon the predictions via a Residual Actor. The original optimization problem is thus decomposed into two sub-tasks that are easier to solve. Furthermore, to learn better representations, two information-theoretical regularizers are designed to ensure the expressiveness and conciseness of features. We conduct experiments on a benchmark dataset and a real-world dataset consisting of tens of millions of recommendation requests. The results show that ResAct significantly outperforms previous state-of-the-art methods on various long-term engagement optimization tasks.

2. PROBLEM STATEMENT

In sequential recommendation, users interact with the recommender system on a session basis. A session starts when a user opens the App and ends when he/she leaves. As in Figure 1, when a user starts a session, the recommendation agent begins to feed items to the user, one per recommendation request, until the session ends. For each request, the user can choose to consume the recommended item or quit the current session. A user may start a new session after exiting the old one, and can consume an arbitrary number of items within a session. An ideal recommender system aiming at long-term engagement would be expected to minimize the average return time between sessions while maximizing the average number of items consumed in a session. Formally, we describe the sequential recommendation problem as a Markov Decision Process (MDP) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$. $\mathcal{S} = \mathcal{S}^h \times \mathcal{S}^l$ is the continuous state space, where $s \in \mathcal{S}$ indicates the state of a user. Considering the session-request structure in sequential recommendation, we decompose $\mathcal{S}$ into two disjoint sub-spaces, $\mathcal{S}^h$ and $\mathcal{S}^l$, which represent session-level (high-level) and request-level (low-level) features, respectively. $\mathcal{A}$ is the continuous action space (Chandak et al., 2019; Zhao et al., 2020a), where $a \in \mathcal{A}$ is a vector representing a recommended item. $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition function, where $p(s_{t+1}|s_t, a_t)$ defines the state transition probability from the current state $s_t$ to the next state $s_{t+1}$ after recommending item $a_t$. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, where $r(s_t, a_t)$ is the immediate reward for recommending $a_t$ at state $s_t$; the reward function should be related to return time and/or session length. $\gamma$ is the discount factor for future rewards.
Given a policy $\pi(a|s): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, we define a state-action value function $Q^\pi(s, a)$ which outputs the expected cumulative reward (return) of taking action $a$ at state $s$ and thereafter following $\pi$:
$$Q^\pi(s_t, a_t) = \mathbb{E}_{(s_{t'}, a_{t'}) \sim \pi}\Big[ r(s_t, a_t) + \sum_{t'=t+1}^{\infty} \gamma^{(t'-t)} \cdot r(s_{t'}, a_{t'}) \Big].$$
The optimization objective is to seek a policy $\pi(a|s)$ such that the return obtained by the recommendation agent is maximized, i.e., $\max_\pi \mathcal{J}(\pi) = \mathbb{E}_{s_t \sim d^\pi_t(\cdot),\, a_t \sim \pi(\cdot|s_t)}[Q^\pi(s_t, a_t)]$. Here $d^\pi_t(\cdot)$ denotes the state visitation frequency at step $t$ under the policy $\pi$.
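The discounted sum inside the expectation can be evaluated directly on a logged reward sequence. The following minimal numpy sketch is our own illustration (not part of the paper's implementation) of computing the return at every step of one trajectory:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """For each step t, compute r_t + sum_{t'>t} gamma^(t'-t) * r_t'.

    `rewards` stands in for logged user responses (e.g., return-time or
    session-length based rewards); a single backward pass suffices.
    """
    g = 0.0
    returns = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

The value at index 0 is a Monte-Carlo estimate of $Q^\pi(s_0, a_0)$ for the trajectory's behavior policy.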

3. REINFORCING LONG-TERM ENGAGEMENT WITH RESIDUAL ACTOR

To improve long-term engagement, we propose to learn a recommendation policy which is broadly consistent with, but better than, the online-serving policy. In this way, i) we have access to sufficient data near the learned policy, so state-action values can be properly estimated because the notorious extrapolation error is minimized (Fujimoto et al., 2019); and ii) the potential for harming the user experience is reduced, as we can easily control the divergence between the learned policy and the deployed (online-serving) policy, and there is no need to perform online interaction. Despite these advantages, directly learning such a policy is rather difficult because we would need to perform optimization throughout the entire, huge policy space. Instead, we propose to achieve it by first reconstructing the online-serving policy and then improving it. By doing so, the original optimization problem is decomposed into two more manageable sub-tasks. Specifically, let $\pi(a|s)$ denote the policy we want to learn; we decompose it as $\hat{a} = a_{on} + \Delta(s, a_{on})$, where $a_{on}$ is sampled from the online-serving policy $\pi_{on}$, i.e., $a_{on} \sim \pi_{on}(a|s)$, and $\Delta(s, a_{on})$ is the residual, which is determined by a deterministic actor. We expect that adding the residual leads to a higher expected return, i.e., $\mathcal{J}(\pi) \geq \mathcal{J}(\pi_{on})$. As in Figure 2, our algorithm, ResAct, works in three phases: i) Reconstruction: ResAct first reconstructs the online-serving policy, i.e., $\hat{\pi}_{on}(a|s) \approx \pi_{on}(a|s)$, by supervised learning, and then samples $n$ actions from the reconstructed policy, $\{\tilde{a}^i_{on} \sim \hat{\pi}_{on}(a|s)\}_{i=1}^{n}$, as estimators of $a_{on}$; ii) Prediction: for each estimator $\tilde{a}^i_{on}$, ResAct predicts the residual and applies it, i.e., $\tilde{a}^i = \tilde{a}^i_{on} + \Delta(s, \tilde{a}^i_{on})$.
We need to learn the residual actor to predict $\Delta(s, \tilde{a}_{on})$ such that $\tilde{a}$ is better than $\tilde{a}_{on}$ in general; iii) Selection: ResAct selects the best action from $\{\tilde{a}^i\}_{i=1}^{n}$ as the final output, i.e., $\arg\max_{\tilde{a}} Q^\pi(s, \tilde{a})$ for $\tilde{a} \in \{\tilde{a}^i\}_{i=1}^{n}$. In sequential recommendation, state representations may not contain sufficient information about long-term engagement. To address this, we design two information-theoretical regularizers to improve the expressiveness and conciseness of the extracted state features: one maximizes the mutual information between state features and long-term engagement, while the other penalizes redundant information in the state features. The overview of ResAct is depicted in Figure 3, and we elaborate on the details in the subsequent subsections. A formal description of the ResAct algorithm is given in Appendix A.
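The three-phase inference loop can be sketched end to end. In the numpy sketch below, `decoder`, `residual_actor`, and `critic` are hypothetical stand-ins for the learned networks (the names and toy bodies are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(state, latent):           # reconstructs an online action from a latent
    return np.tanh(latent + state.mean())

def residual_actor(state, action):    # predicts the residual Delta(s, a_on)
    return 0.1 * (state.mean() - action)

def critic(state, action):            # scores a candidate action, i.e., Q(s, a)
    return -np.abs(action - 0.5)

def resact_step(state, n=10):
    """One inference step: Reconstruction -> Prediction -> Selection."""
    latents = rng.standard_normal(n)                          # c_i ~ N(0, 1)
    a_on = np.array([decoder(state, c) for c in latents])     # estimators of a_on
    a_tilde = a_on + np.array([residual_actor(state, a) for a in a_on])
    best = int(np.argmax([critic(state, a) for a in a_tilde]))
    return a_tilde[best]
```

Because the $n$ candidates are independent, the reconstruction and prediction phases parallelize trivially across estimators.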


Figure 3: Schematics of our approach. The CVAE-Encoder generates an action embedding distribution, from which a latent vector is sampled for the CVAE-Decoder to reconstruct the action. The reconstructed action $\tilde{a}_{on}$, together with the state features extracted by the high-level and low-level state encoders, is fed to the residual actor to predict the residual $\Delta$. After adding the residual, the action and the state are sent to the state-action value networks, from which the policy gradient can be generated. The framework can be trained in an end-to-end manner.

3.1. RECONSTRUCTING ONLINE BEHAVIORS

To reconstruct the behaviors of the online-serving policy, we should learn a mapping $\hat{\pi}_{on}(a|s)$ from states to action distributions such that $\hat{\pi}_{on}(a|s) \approx \pi_{on}(a|s)$, where $\pi_{on}(a|s)$ is the online-serving policy. A naive approach is to use a model $D(a|s; \theta_d)$ with parameters $\theta_d$ to approximate $\pi_{on}(a|s)$ and optimize $\theta_d$ by minimizing
$$\mathbb{E}_{s,\, a_{on} \sim \pi_{on}(a|s)}\big[ (D(a|s; \theta_d) - a_{on})^2 \big]. \quad (2)$$
However, such deterministic action generation only allows for a single action instance and will cause a large deviation if the only estimator is imprecise. To mitigate this, we propose to encode $a_{on}$ into a latent distribution conditioned on $s$, and to decode samples from the latent space to obtain estimators of $a_{on}$. In this way, we can generate multiple action estimators by sampling from the latent distribution. The key idea is inspired by the conditional variational auto-encoder (CVAE) (Kingma & Welling, 2014). We define the latent distribution $C(s, a_{on})$ as a multivariate Gaussian whose parameters, i.e., mean and variance, are determined by an encoder $E(\cdot|s, a_{on}; \theta_e)$ with parameters $\theta_e$. Then, for each latent vector $c \sim C(s, a_{on})$, we can use a decoder $D(a|s, c; \theta_d)$ with parameters $\theta_d$ to map it back to an action. To improve generalization, we apply a KL regularizer which controls the deviation between $C(s, a_{on})$ and its prior, chosen as the multivariate normal distribution $\mathcal{N}(0, 1)$. Formally, we optimize $\theta_e$ and $\theta_d$ by minimizing the following loss:
$$\mathcal{L}^{Rec}_{\theta_e, \theta_d} = \mathbb{E}_{s, a_{on}, c}\big[ (D(a|s, c; \theta_d) - a_{on})^2 + \mathrm{KL}(C(s, a_{on}; \theta_e) \,\|\, \mathcal{N}(0, 1)) \big],$$
where $a_{on} \sim \pi_{on}(a|s)$ and $c \sim C(s, a_{on}; \theta_e)$. When performing behavior reconstruction for an unknown state $s$, we do not know its $a_{on}$ and therefore cannot build $C(s, a_{on}; \theta_e)$. As a mitigation, we sample $n$ latent vectors from the prior of $C(s, a_{on})$, i.e., $\{c_i \sim \mathcal{N}(0, 1)\}_{i=1}^{n}$. Then, for each $c_i$, we generate an estimator of $a_{on}$ using the decoder: $\tilde{a}^i_{on} = D(a|s, c_i; \theta_d)$.
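With a diagonal-Gaussian latent, both terms of $\mathcal{L}^{Rec}$ have simple closed forms. A minimal numpy sketch (the tensors below are placeholders for encoder/decoder outputs, not the paper's actual networks):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def cvae_reconstruction_loss(a_on, a_rec, mu, logvar):
    """L_Rec = squared reconstruction error + KL regularizer.

    `a_on` is the logged online action, `a_rec` the decoder output for a
    latent sampled from N(mu, exp(logvar)); (mu, logvar) come from the
    encoder E(.|s, a_on).
    """
    mse = np.sum((a_rec - a_on) ** 2)
    return mse + kl_to_standard_normal(mu, logvar)
```

When the encoder matches the prior ($\mu = 0$, $\log\sigma^2 = 0$) and reconstruction is exact, the loss is zero, which is the behavior the regularizer trades off against reconstruction fidelity.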

3.2. LEARNING TO PREDICT THE OPTIMAL RESIDUAL

Having learned the CVAE, which consists of $E(\cdot|s, a_{on}; \theta_e)$ and $D(a|s, c; \theta_d)$, we can easily reconstruct the online-serving policy and sample multiple estimators of $a_{on}$ via $\{\tilde{a}^i_{on} = D(a|s, c_i; \theta_d),\ c_i \sim \mathcal{N}(0, 1)\}_{i=1}^{n}$. For each $\tilde{a}^i_{on}$, we should predict the residual $\Delta(s, \tilde{a}^i_{on})$ such that $\tilde{a}^i = \tilde{a}^i_{on} + \Delta(s, \tilde{a}^i_{on})$ is better than $\tilde{a}^i_{on}$. We use a model $f(\Delta|s, a; \theta_f)$ with parameters $\theta_f$ to approximate the residual function $\Delta(s, a)$. In particular, the residual actor $f(\Delta|s, a; \theta_f)$ consists of a state encoder and a sub-actor, which extract features from a user state and predict the residual based on the extracted features, respectively. Considering the bi-level session-request structure in sequential recommendation, we design a hierarchical state encoder consisting of a high-level encoder $f_h(s^h; \theta_h)$ and a low-level encoder $f_l(s^l; \theta_l)$ for extracting features from the session-level (high-level) state $s^h$ and the request-level (low-level) state $s^l$, respectively. In summary, the residual actor $f(\Delta|s, a; \theta_f) = \{f_h, f_l, f_a\}$ works as follows:
$$z^h = f_h(s^h; \theta_h), \quad z^l = f_l(s^l; \theta_l); \quad z = \mathrm{Concat}(z^h, z^l); \quad \Delta = f_a(z, a; \theta_a),$$
where $z^h$ and $z^l$ are the extracted high-level and low-level features, respectively; $z$ is their concatenation; and $f_a(z, a; \theta_a)$, parameterized by $\theta_a$, is the sub-actor. Here, $\theta_f = \{\theta_h, \theta_l, \theta_a\}$. Given a state $s$ and a sampled latent vector $c \sim \mathcal{N}(0, 1)$, ResAct generates an action with a deterministic policy $\pi(a|s, c) = D(\tilde{a}_{on}|s, c; \theta_d) + f(\Delta|s, \tilde{a}_{on}; \theta_f)$. We want to optimize the parameters $\{\theta_d, \theta_f\}$ of $\pi(a|s, c)$ so that the expected cumulative reward $\mathcal{J}(\pi)$ is maximized. Based on the Deterministic Policy Gradient (DPG) theorem (Silver et al., 2014; Lillicrap et al., 2016), we derive the following performance gradients (a detailed derivation can be found in Appendix B):
$$\nabla_{\theta_f} \mathcal{J}(\pi) = \mathbb{E}_{s, c}\big[ \nabla_a Q^\pi(s, a)|_{a=\pi(a|s,c)}\, \nabla_{\theta_f} f(\Delta|s, a; \theta_f)|_{a=D(a|s,c;\theta_d)} \big]; \quad (5)$$
$$\nabla_{\theta_d} \mathcal{J}(\pi) = \mathbb{E}_{s, c}\big[ \nabla_a Q^\pi(s, a)|_{a=\pi(a|s,c)}\, \nabla_{\theta_d} D(a|s, c; \theta_d) \big]. \quad (6)$$
Here $\pi(a|s, c) = D(\tilde{a}_{on}|s, c; \theta_d) + f(\Delta|s, \tilde{a}_{on}; \theta_f)$, $p(\cdot)$ denotes the probability function of a random variable, and $Q^\pi(s, a)$ is the state-action value function of $\pi$. To learn the state-action value function (the critic) $Q^\pi(s, a)$ in Eq. (5) and Eq. (6), we adopt Clipped Double Q-learning (Fujimoto et al., 2018) with two models, $Q_1(s, a; \theta_{q1})$ and $Q_2(s, a; \theta_{q2})$, to approximate it. For transitions $(s_t, a_t, r_t, s_{t+1})$ from logged data, we optimize $\theta_{q1}$ and $\theta_{q2}$ to minimize the following Temporal Difference (TD) loss:
$$\mathcal{L}^{TD}_{\theta_{qj}} = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\big[ (Q_j(s_t, a_t; \theta_{qj}) - y)^2 \big], \ j \in \{1, 2\};$$
$$y = r_t + \gamma \min\big( Q'_1(s_{t+1}, \pi'(a_{t+1}|s_{t+1}); \theta'_{q1}),\ Q'_2(s_{t+1}, \pi'(a_{t+1}|s_{t+1}); \theta'_{q2}) \big),$$
where $Q'_1$, $Q'_2$, and $\pi'$ are target models whose parameters are soft-updated to match the corresponding models (Fujimoto et al., 2018). According to the DPG theorem, we can update the parameters $\theta_f$ in the direction of $\nabla_{\theta_f} \mathcal{J}(\pi)$ to gain an improvement in $\mathcal{J}(\pi)$: $\theta_f \leftarrow \theta_f + \nabla_{\theta_f} \mathcal{J}(\pi)$, $\theta_f = \{\theta_h, \theta_l, \theta_a\}$. For $\theta_d$, since it also needs to minimize $\mathcal{L}^{Rec}_{\theta_e, \theta_d}$, the update direction is $\theta_d \leftarrow \theta_d + \nabla_{\theta_d} \mathcal{J}(\pi) - \nabla_{\theta_d} \mathcal{L}^{Rec}_{\theta_e, \theta_d}$. Based on $\pi(a|s, c)$, we can in theory obtain the policy $\pi(a|s)$ by marginalizing out the latent vector $c$: $\pi(a|s) = \int p(c)\, \pi(a|s, c)\, dc$. This integral can be approximated as $\pi(a|s) \approx \frac{1}{n} \sum_{i=1}^{n} \pi(a|s, c_i)$, where $\{c_i \sim \mathcal{N}(0, 1)\}_{i=1}^{n}$. However, given that we already have a critic $Q_1(s, a; \theta_{q1})$, we can instead use the critic to select the final output:
$$\pi(a|s) = \pi(a|s, c^*); \quad c^* = \arg\max_{c} Q_1(s, \pi(a|s, c); \theta_{q1}), \ c \in \{c_i \sim \mathcal{N}(0, 1)\}_{i=1}^{n}. \quad (10)$$
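The clipped TD target and the soft update of the target networks are the two mechanical pieces of the critic training loop. A minimal sketch, where the scalar inputs stand in for network outputs:

```python
import numpy as np

def td_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped Double Q-learning target: y = r + gamma * min(Q1', Q2').

    `q1_next` / `q2_next` are the two target critics evaluated at
    (s_{t+1}, pi'(s_{t+1})); taking the minimum curbs overestimation bias.
    """
    if done:
        return r
    return r + gamma * min(q1_next, q2_next)

def soft_update(target_params, params, tau=0.005):
    """Polyak averaging: theta' <- (1 - tau) * theta' + tau * theta."""
    return [(1 - tau) * tp + tau * p for tp, p in zip(target_params, params)]
```

Each critic is then regressed toward `y` with a squared loss, matching the TD loss above.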

3.3. FACILITATING FEATURE EXTRACTION WITH INFORMATION-THEORETICAL REGULARIZERS

Good state representations always ease the learning of models (Nielsen, 2015). Considering that session-level states $s^h \in \mathcal{S}^h$ contain rich information about long-term engagement, we design two information-theoretical regularizers to facilitate feature extraction. Generally, we expect the learned features to have expressiveness and conciseness. To learn features with the desired properties, we propose to encode the session-level state $s^h$ into a stochastic embedding space instead of a deterministic vector. Specifically, $s^h$ is encoded into a multivariate Gaussian distribution $\mathcal{N}(\mu^h, \sigma^h)$ whose parameters $\mu^h$ and $\sigma^h$ are predicted by the high-level encoder $f_h(s^h; \theta_h)$. Formally, $(\mu^h, \sigma^h) = f_h(s^h; \theta_h)$ and $z^h \sim \mathcal{N}(\mu^h, \sigma^h)$, where $z^h$ is the representation of the session-level state. Next, we introduce how to achieve expressiveness and conciseness in $z^h$.
Expressiveness. We expect the extracted features to contain as much information as possible about long-term engagement rewards, which suggests maximizing the mutual information between $z^h$ and $r(s, a)$. However, estimating and maximizing the mutual information
$$I_{\theta_h}(z^h; r) = \int p_{\theta_h}(z^h)\, p(r|z^h) \log \frac{p(r|z^h)}{p(r)}\, dz^h\, dr$$
is practically intractable. Instead, we derive a lower bound for the mutual information objective based on variational inference (Alemi et al., 2017):
$$I_{\theta_h}(z^h; r) \geq \int p_{\theta_h}(z^h)\, p(r|z^h) \log \frac{o(r|z^h; \theta_o)}{p(r)}\, dz^h\, dr = \int p_{\theta_h}(z^h)\, p(r|z^h) \log o(r|z^h; \theta_o)\, dz^h\, dr + H(r),$$
where $o(r|z^h; \theta_o)$ is a variational neural estimator of $p(r|z^h)$ with parameters $\theta_o$, and $H(r) = -\int p(r) \log p(r)\, dr$ is the entropy of the reward distribution.
Since $H(r)$ only depends on user responses and stays fixed for the given environment, we can instead maximize a lower bound of $I_{\theta_h}(z^h; r)$, which leads to the following expressiveness loss (the derivation is in Appendix C):
$$\mathcal{L}^{Exp}_{\theta_h, \theta_o} = \mathbb{E}_{s,\, z^h \sim p_{\theta_h}(z^h|s^h)}\big[ H(p(r|s) \,\|\, o(r|z^h; \theta_o)) \big],$$
where $s$ is the state, $s^h$ is the session-level state, $p_{\theta_h}(z^h|s^h) = \mathcal{N}(\mu^h, \sigma^h)$, and $H(\cdot\|\cdot)$ denotes the cross entropy between two distributions. By minimizing $\mathcal{L}^{Exp}_{\theta_h, \theta_o}$, we ensure the expressiveness of $z^h$.
Conciseness. If maximizing $I_{\theta_h}(z^h; r)$ were the only objective, we could always ensure a maximally informative representation by taking the identity encoding of the session-level state ($z^h = s^h$) (Alemi et al., 2017); however, such an encoding is not useful. Thus, apart from expressiveness, we want $z^h$ to be concise enough to filter out redundant information from $s^h$. To this end, we also minimize
$$I_{\theta_h}(z^h; s^h) = \int p(s^h)\, p_{\theta_h}(z^h|s^h) \log \frac{p_{\theta_h}(z^h|s^h)}{p_{\theta_h}(z^h)}\, ds^h\, dz^h$$
so that $z^h$ approaches a minimal sufficient statistic of $s^h$ for inferring $r$. Computing the marginal distribution $p_{\theta_h}(z^h)$ is usually intractable, so we introduce $m(z^h)$ as a variational approximation to it, conventionally chosen as the multivariate normal distribution $\mathcal{N}(0, 1)$. Since $\mathrm{KL}(p_{\theta_h}(z^h) \,\|\, m(z^h)) \geq 0$, we have the following upper bound:
$$I_{\theta_h}(z^h; s^h) \leq \int p(s^h)\, p_{\theta_h}(z^h|s^h) \log \frac{p_{\theta_h}(z^h|s^h)}{m(z^h)}\, ds^h\, dz^h.$$
Minimizing this upper bound leads to the following conciseness loss:
$$\mathcal{L}^{Con}_{\theta_h} = \int p(s^h) \int p_{\theta_h}(z^h|s^h) \log \frac{p_{\theta_h}(z^h|s^h)}{m(z^h)}\, dz^h\, ds^h = \mathbb{E}_s\big[ \mathrm{KL}(p_{\theta_h}(z^h|s^h) \,\|\, m(z^h)) \big].$$
By minimizing $\mathcal{L}^{Con}_{\theta_h}$, we achieve conciseness in $z^h$.
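With a diagonal-Gaussian encoder and a Gaussian reward estimator, both regularizers have per-sample closed forms. A minimal numpy sketch (our own illustration; the inputs stand in for network outputs, and we assume $o(r|z^h)$ is Gaussian):

```python
import numpy as np

def conciseness_loss(mu, logvar):
    """L_Con: closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def expressiveness_loss(r, r_mean, r_logvar):
    """Monte-Carlo surrogate for L_Exp: negative log-likelihood of the
    observed reward under the estimator o(r | z_h) = N(r_mean, exp(r_logvar)),
    where (r_mean, r_logvar) are produced from a sampled z_h."""
    return 0.5 * (np.log(2 * np.pi) + r_logvar
                  + (r - r_mean) ** 2 / np.exp(r_logvar))
```

Minimizing the first term pulls the embedding toward the prior (conciseness); minimizing the second forces $z^h$ to remain predictive of the reward (expressiveness).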

4. EXPERIMENT

We conduct experiments on a synthetic dataset, MovieLensL-1m, and a real-world dataset, RecL-25m, to demonstrate the effectiveness of ResAct. We are particularly interested in whether ResAct achieves consistent improvements over previous state-of-the-art methods and, if so, why.

Datasets.

As there is no public dataset explicitly containing signals about long-term engagement, we synthesized a dataset named MovieLensL-1m based on MovieLens-1m (a popular benchmark for evaluating recommendation algorithms) and collected a large-scale industrial dataset, RecL-25m, from a real-life streaming platform of short-form videos. MovieLensL-1m is constructed by assuming that long-term engagement is proportional to the movie ratings (5-star scale) in MovieLens-1m. RecL-25m is collected by tracking the behaviors of 99,899 users (randomly selected from the platform) for months and recording their long-term engagement indicators, i.e., return time and session length. The statistics of RecL-25m are provided in Table 1, where 25% and 75% denote the corresponding percentiles. We did not report the average return time because some users appear only once, so their return time may go to infinity. The state of a user contains information about gender, age, and historical interactions such as like rate and forward rate. The item to recommend is determined by comparing the inner product of an action with the embeddings of videos (Zhao et al., 2020a). Rewards are designed to measure the relative influence of an item on long-term engagement (details are in Appendix D).

Evaluation Metric and Baselines.

We adopt Normalised Capped Importance Sampling (NCIS) (Swaminathan & Joachims, 2015), a standard offline evaluation method (Gilotte et al., 2018; Farajtabar et al., 2018), to assess the performance of different policies. Given that $\pi_\beta$ is the behavior policy and $\pi$ is the policy to assess, we evaluate the value by
$$\mathcal{J}_{NCIS}(\pi) = \frac{1}{|\mathcal{T}|} \sum_{\xi \in \mathcal{T}} \frac{\sum_{(s,a,r) \in \xi} \rho_{\pi, \pi_\beta}(s, a)\, r}{\sum_{(s,a,r) \in \xi} \rho_{\pi, \pi_\beta}(s, a)}, \quad \rho_{\pi, \pi_\beta}(s, a) = \min\Big( c,\ \frac{\phi_{\pi(s)}(a)}{\phi_{\pi_\beta(s)}(a)} \Big). \quad (15)$$
Here $\mathcal{T}$ is the testing set of usage trajectories, $\phi_{\pi(s)}$ denotes a multivariate Gaussian distribution whose mean is given by $\pi(s)$, and $c$ is a clipping constant to stabilize the evaluation. We compare our method with various baselines, including classic reinforcement learning methods (DDPG, TD3), reinforcement learning with offline training (TD3_BC, BCQ, IQL), and imitation learning methods (IL, IL_CVAE). Detailed introductions to the baselines are given in Appendix E. Our method emphasizes the learning and execution paradigm, and is therefore orthogonal to approaches that focus on designing neural network architectures, e.g., GRU4Rec (Hidasi et al., 2016). MovieLensL-1m. We first evaluate our method on the benchmark dataset MovieLensL-1m, which contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. We sample the data of 5,000 users as the training set and use the data of the remaining users as the test set (with 50 users as the validation set). As shown in Table 2, our method, ResAct, outperforms all the baselines, indicating its effectiveness. We also provide the learning curves in Figure 4; ResAct learns faster and more stably than the baselines on MovieLensL-1m.
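The NCIS estimator in Eq. (15) is straightforward to compute from logged trajectories. A minimal numpy sketch (our own illustration; `ratio_fn` is a hypothetical helper returning the uncapped density ratio $\phi_{\pi(s)}(a) / \phi_{\pi_\beta(s)}(a)$ for one step):

```python
import numpy as np

def ncis_value(trajectories, ratio_fn, clip=10.0):
    """Normalised Capped Importance Sampling over a test set.

    Each trajectory is a list of (state, action, reward) tuples. Per
    trajectory we form a self-normalized, clipped importance-weighted
    average of rewards, then average over trajectories.
    """
    values = []
    for traj in trajectories:
        weights = np.array([min(clip, ratio_fn(s, a)) for s, a, _ in traj])
        rewards = np.array([r for _, _, r in traj])
        values.append(np.sum(weights * rewards) / np.sum(weights))
    return float(np.mean(values))
```

Self-normalization makes the estimate invariant to a common scaling of the weights, and clipping bounds the variance at the cost of some bias.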

4.2. OVERALL PERFORMANCE

RecL-25m. We test the performance of ResAct on RecL-25m in three modes: i) Return Time mode, where the reward signal $r(\delta)$ is calculated by Eq. 21; ii) Session Length mode, where the reward signal $r(\eta)$ is calculated by Eq. 22; and iii) Both, where the reward signal is generated by a convex combination of $r(\delta)$ and $r(\eta)$ with weights of 0.7 and 0.3, respectively. The weights are determined by real-world demands on the operational metrics; we also perform a sensitivity analysis on the reward weights in Appendix G. Among the 99,899 users, we randomly selected 80% of the users as the training set, of which 500 users were reserved for validation; the remaining 20% of users constitute the test set. As shown in Table 3, our method significantly outperforms the baselines in all the settings. The classic reinforcement learning algorithms, e.g., DDPG and TD3, perform poorly on these tasks, which indicates that directly predicting an action is difficult; the decomposition of actions effectively facilitates the learning process. Another finding is that the offline reinforcement learning algorithms, e.g., IQL, also perform poorly, even though they are specifically designed to learn from logged data. By comparing with imitation learning, we find that the residual actor has successfully found a policy to improve an action, because behavior reconstruction alone cannot achieve good performance. To compare the learning processes, we provide learning curves for the RL-based algorithms. Returns are calculated on the validation set with approximately 30,000 sessions. As shown in Figure 5, the performance of ResAct increases faster and is more stable than the other methods, suggesting that it is easier and more efficient to predict the residual than to predict an action directly.

How does ResAct work?

To understand the working process of ResAct, we plot the t-SNE (Van der Maaten & Hinton, 2008) embedding of actions generated in the execution phase of ResAct. As in Figure 7, the reconstructed actions, denoted by the red dots, are located around the initial action (the red star), suggesting that ResAct successfully samples several online-behavior estimators. The blue dots are the t-SNE embedding of the improved actions, which are generated by imposing residuals on the reconstructed actions. The blue star denotes the executed action of ResAct. We can find that the blue dots are near the initial action but cover a wider area than the red dots. Effect of the CVAE. We design the CVAE for online-behavior reconstruction because of its ability to generate multiple estimators. To explore the effect of the CVAE and whether a deterministic action reconstructor can achieve similar performance, we disable the CVAE in ResAct and replace it with a feed-forward neural network, which is trained using the loss in Equation 2. Since the feed-forward neural network is deterministic, ResAct does not need to perform the selection phase, as there is only one candidate action. We provide the learning curves of ResAct with and without the CVAE in Figure 6. As we can see, there is a significant drop in improvement if we disable the CVAE. We deduce that this is because a deterministic behavior reconstructor can only generate one estimator, and if the prediction is inaccurate, performance will be severely harmed. Number of Online-behavior Estimators. Knowing that generating only one action estimator might hurt performance, we further investigate how the number of estimators affects the performance of ResAct. We first train a ResAct and then change the number of online-behavior estimators to 5, 10, 15, 20 and 25.
As in Figure 8, consistent improvements in performance can be observed across all three tasks as we increase the number of estimators. This suggests that generating more action candidates benefits performance, in line with our intuition. We also analyze how the quality of online-behavior estimators affects performance in Appendix H. Because the sampling of action estimators is independent, ResAct is straightforward to parallelize, and we can easily speed up inference. Information-theoretical Regularizers. To explore the effect of the designed regularizers, we disable $\mathcal{L}^{Exp}_{\theta_h, \theta_o}$, $\mathcal{L}^{Con}_{\theta_h}$, and both of them in ResAct, respectively. As shown in Table 4, the removal of either regularizer results in a significant drop in performance, suggesting that the regularizers facilitate the extraction of features and thus ease the learning process. An interesting finding is that removing both regularizers does not necessarily result in worse performance than removing only one. This suggests that we cannot simply pursue either expressiveness or conciseness of features alone, but rather the combination of both.

5. RELATED WORK

Sequential Recommendation. Sequential recommendation has been used to model real-world recommendation problems where the browse length is not fixed (Zhao et al., 2020c). Many existing works have focused on encoding users' previous records with various neural network architectures. For example, GRU4Rec (Hidasi et al., 2016) utilizes Gated Recurrent Units to exploit users' interaction histories; BERT4Rec (Sun et al., 2019) employs a deep bidirectional self-attention structure to learn sequential patterns. However, these works focus on optimizing immediate engagement such as click-through rates. FeedRec (Zou et al., 2019) was proposed to improve long-term engagement in sequential recommendation; however, it rests on the strong assumption that recommendation diversity leads to improvements in user stickiness. Reinforcement Learning in Recommender Systems. Reinforcement learning (RL) has attracted much attention from the recommender system research community for its ability to capture potential future rewards (Zheng et al., 2018; Zhao et al., 2018; Zou et al., 2019; Zhao et al., 2020b; Chen et al., 2021; Cai et al., 2023b;a). Shani et al. (2005) first proposed to treat recommendation as a Markov Decision Process (MDP) and designed a model-based RL method for book recommendation. Dulac-Arnold et al. (2015) brought RL to MDPs with large discrete action spaces and demonstrated its effectiveness on various recommendation tasks with up to one million actions. Chen et al. (2019) scaled a batch RL algorithm, i.e., REINFORCE with off-policy correction, to real-world products serving billions of users. Despite these successes, previous works require RL agents to learn in the entire policy space. Considering the expensive online interactions and huge state-action spaces, learning the optimal policy in the entire MDP is quite difficult.
Our method instead learns a policy near the online-serving policy to achieve local improvement (Kakade & Langford, 2002; Achiam et al., 2017), which is much easier.

B THE DERIVATION OF PERFORMANCE GRADIENTS

We begin by deriving the gradients of J(π) with respect to the parameters θ_f of the residual actor:

∇_{θ_f} J(π) = ∫ p(c) p^π(s) ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_f} π(a|s, c) dc ds
             = ∫ p(c) p^π(s) ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_f} f(∆|s, a; θ_f)|_{a=D(a|s,c;θ_d)} dc ds
             = E_{s,c} [ ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_f} f(∆|s, a; θ_f)|_{a=D(a|s,c;θ_d)} ].

The decoder D(a|s, c; θ_d) also affects the policy. The gradients of J(π) with respect to θ_d are derived similarly:

∇_{θ_d} J(π) = ∫ p(c) p^π(s) ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_d} π(a|s, c) dc ds
             = ∫ p(c) p^π(s) ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_d} D(a|s, c; θ_d) dc ds
             = E_{s,c} [ ∇_a Q^π(s, a)|_{a=π(a|s,c)} ∇_{θ_d} D(a|s, c; θ_d) ].

C DERIVING THE EXPRESSIVENESS LOSS

We expect the extracted features to contain as much information as possible about long-term engagement rewards, which suggests maximizing the mutual information between z_h and r(s, a). The mutual information I_{θ_h}(z_h; r) is defined as

I_{θ_h}(z_h; r) = ∫ p(z_h, r) log [ p(z_h, r) / (p(z_h) p(r)) ] dz_h dr
               = ∫ p_{θ_h}(z_h) p(r|z_h) log [ p(r|z_h) / p(r) ] dz_h dr.

However, estimating and maximizing mutual information is practically intractable. Inspired by variational inference (Alemi et al., 2017), we derive a tractable lower bound for the mutual information objective. Since KL(p(r|z_h) || q(r|z_h)) ≥ 0, by the definition of the KL-divergence we have

∫ p(r|z_h) log p(r|z_h) dr ≥ ∫ p(r|z_h) log q(r|z_h) dr,

where q(r|z_h) is an arbitrary distribution. Here, we introduce o(r|z_h; θ_o), with parameters θ_o, as a variational neural estimator of p(r|z_h). Then

I_{θ_h}(z_h; r) ≥ ∫ p_{θ_h}(z_h) p(r|z_h) log [ o(r|z_h; θ_o) / p(r) ] dz_h dr
              = ∫ p_{θ_h}(z_h) p(r|z_h) log o(r|z_h; θ_o) dz_h dr + H(r),

where H(r) = -∫ p(r) log p(r) dr is the entropy of the reward distribution.
Since H(r) only depends on user responses and stays fixed for a given environment, we can instead maximize a lower bound of I_{θ_h}(z_h; r), which leads to the following expressiveness loss:

L^Exp_{θ_h, θ_o} = -∫ p_{θ_h}(z_h) p(r|z_h) log o(r|z_h; θ_o) dz_h dr
                = -∫ p(s) p_{θ_h}(z_h|s) p(r|s, z_h) log o(r|z_h; θ_o) ds dz_h dr
                = -∫ p(s) p_{θ_h}(z_h|s) p(r|s) log o(r|z_h; θ_o) ds dz_h dr.
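As a concrete illustration, the loss above can be estimated by Monte-Carlo sampling over (z_h, r) pairs when o(r|z_h; θ_o) is parameterized as a Gaussian whose mean and variance are predicted from z_h. The PyTorch sketch below is our own minimal rendering under that assumption; the paper does not prescribe this parameterization, and the module and function names are ours.

```python
import torch
import torch.nn as nn


class RewardEstimator(nn.Module):
    """Variational estimator o(r | z_h; θ_o), sketched as a Gaussian whose
    mean and log-variance are predicted from z_h by a small MLP."""

    def __init__(self, z_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, z_h: torch.Tensor) -> torch.distributions.Normal:
        mu, log_var = self.net(z_h).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_var.exp().sqrt())


def expressiveness_loss(estimator: RewardEstimator,
                        z_h: torch.Tensor,
                        r: torch.Tensor) -> torch.Tensor:
    # Monte-Carlo estimate of L^Exp = -E[log o(r | z_h; θ_o)] over a batch
    return -estimator(z_h).log_prob(r).mean()
```

Minimizing this negative log-likelihood with respect to both θ_h (the feature extractor producing z_h) and θ_o tightens the variational lower bound on the mutual information.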

D DATASETS

D.1 MovieLensL-1m

MovieLensL-1m is synthesized from MovieLens-1m, a representative benchmark dataset for sequential recommendation. MovieLens-1m provides 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. Ratings are made on a 5-star scale. As MovieLens-1m does not contain any information about long-term user engagement, to generalize it to the long-term engagement problem we assume that a user's long-term engagement is proportional to the movie ratings. Specifically, recommending a movie which a user rates 2 stars does not affect engagement, while movies rated 3, 4, and 5 stars benefit long-term engagement by 1, 2, and 3, respectively. Recommending a 1-star movie is harmful to engagement and is given a negative reward of -1. The task in MovieLensL-1m is to maximize cumulative benefits on long-term engagement.
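For illustration, the assumed rating-to-reward mapping can be written as a one-line function (the function name is ours; the mapping itself follows the description above):

```python
def engagement_reward(rating: int) -> int:
    """Map a 1-5 star rating to the synthetic long-term engagement reward:
    1 star -> -1 (harmful), 2 stars -> 0 (neutral),
    3/4/5 stars -> +1/+2/+3 (beneficial)."""
    if rating == 1:
        return -1
    return max(rating - 2, 0)
```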

D.2 DESIGN OF REWARDS IN RecL-25m

The rewards of long-term engagement in RecL-25m are designed based on the statistics of the dataset. As a general guideline, we expect rewards to reflect the influence of recommending an item on a user. However, user behaviors have large variance, which makes this influence difficult to measure. For example, if we simply made rewards proportional to session length, or inversely proportional to return time, the recommender system would focus on improving the experience of high-activity users, because doing so yields larger rewards. In reality, however, it is equally if not more important to facilitate the conversion of low-activity users into high-activity users, which requires improving the experience of low-activity users. To address this issue, we turn to measuring the relative influence of an item. Concretely, we calculate the average return time δ^u_avg and the average session length η^u_avg for each user u, and use these two statistics to quantify rewards. For user u, given a time duration δ_u between two sessions, the corresponding reward is

r(δ_u) = clip(⌊ min(δ^u_avg, δ_75%) / δ_u ⌋, 0, 5),

where δ_75% is the 75th percentile of the average return time over all users, designed to differentiate active and inactive users. The reward for session length is calculated similarly as

r(η_u) = clip(⌊ η_u / (0.8 × η^u_avg) ⌋, 0, 5),

where η_u is the length of a session in the logged data of user u. Since δ_u and η_u can only be calculated at session level, without loss of generality, we provide rewards at the end of each session, where the reward for return time is assigned to the previous session.
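A minimal sketch of the two reward functions, assuming scalar per-user statistics (function and variable names are ours):

```python
import math


def return_time_reward(delta_u: float, delta_avg_u: float, delta_75: float) -> int:
    """r(δ_u) = clip(⌊ min(δ_avg^u, δ_75%) / δ_u ⌋, 0, 5): returning sooner
    than the user's own (capped) average yields a larger reward."""
    return min(max(math.floor(min(delta_avg_u, delta_75) / delta_u), 0), 5)


def session_length_reward(eta_u: float, eta_avg_u: float) -> int:
    """r(η_u) = clip(⌊ η_u / (0.8 · η_avg^u) ⌋, 0, 5): sessions longer than
    80% of the user's average session length yield positive rewards."""
    return min(max(math.floor(eta_u / (0.8 * eta_avg_u)), 0), 5)
```

Because both rewards are normalized by per-user averages, a low-activity user who returns unusually quickly can earn as much reward as a high-activity user, matching the relative-influence design described above.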

E BASELINES

Our method is compared with various baselines, including classic reinforcement learning methods (DDPG, TD3), offline reinforcement learning algorithms (TD3_BC, BCQ, IQL), and imitation learning methods (IL, IL_CVAE):

• DDPG (Lillicrap et al., 2016): A reinforcement learning algorithm which concurrently learns a Q-function and a policy, using the Q-function to guide the optimization of the policy.
• TD3 (Fujimoto et al., 2018): An off-policy reinforcement learning algorithm which applies clipped double-Q learning, delayed policy updates, and target policy smoothing.
• TD3_BC (Fujimoto & Gu, 2021): A reinforcement learning algorithm designed for offline training; it adds a behavior cloning (BC) term to the policy update of TD3.
• BCQ (Fujimoto et al., 2019): An off-policy algorithm which restricts the action space in order to force the agent to behave similarly to the on-policy data.
• IQL (Kostrikov et al., 2022): An offline reinforcement learning method which uses a state-conditional upper expectile to estimate the value of the best actions in a state.
• IL: Imitation learning, which treats the training set as expert knowledge and learns a mapping between observations and actions from the expert demonstrations.
• IL_CVAE (Kingma & Welling, 2014): An imitation learning method whose policy is modeled by a conditional variational auto-encoder.

H QUALITY OF ONLINE-BEHAVIOR ESTIMATORS

Even with the selection phase, the quality of the online-behavior estimators, i.e., the action candidates, still significantly affects performance. On the one hand, the chosen candidate directly constitutes the final action. On the other hand, the sampled candidates serve as inputs to the residual module and the selection module. Sampling an unlimited number of action candidates would eventually cover the best action reconstructable by the CVAE. However, candidates far from the online-behavior policy may be incorrectly selected: the residual and selection modules are unlikely to have encountered such out-of-distribution (OOD) actions during training and therefore cannot make accurate predictions on them. The distribution of action candidates should thus stay as close as possible to the online-serving distribution so that the outputs of the residual and selection modules remain reliable. We conduct experiments by uniformly sampling 20 extra action candidates and adding them to the candidates reconstructed by the CVAE. As shown in Table 7, performance decreases significantly even though the number of candidates increases.



Footnotes. The online-serving policy is a historical policy, or a mixture of policies, which generated the logged data used to approximate the MDP in sequential recommendation. C(s, a_on; θ_e) is parameterized by θ_e because it is a multivariate Gaussian whose mean and variance are the output of the encoder E(·|s, a_on; θ_e). Data samples and code can be found at https://www.dropbox.com/sh/btf0drgm99vmpfe/AADtkmOLZPQ0sTqmsA0f0APna?dl=0.

CONCLUSION

In this work, we propose ResAct to reinforce long-term engagement in sequential recommendation. ResAct works by first reconstructing the behaviors of the online-serving policy, and then improving the reconstructed policy by imposing an action residual. By doing so, ResAct learns a policy which is close to, but better than, the deployed recommendation model. To facilitate feature extraction, two information-theoretical regularizers are designed to make state representations both expressive and concise. We conduct extensive experiments on a benchmark dataset, MovieLensL-1m, and a real-world dataset, RecL-25m. Experimental results demonstrate the superiority of ResAct over previous state-of-the-art algorithms in all tasks.



Figure 1: Sequential recommendation.

Figure 2: Workflow of ResAct.

Figure 4: Learning curves of RLbased methods on MovieLensL-1m.

Figure 5: Learning curves of RL-based methods on RecL-25m, averaged over 5 runs.

Figure 8: Ablations for the number of onlinebehavior estimators.

Algorithm 2: ResAct-EXECUTION
Input: state s, number of estimators n
// Reconstruction
1: Generate n estimators of a_on: {ã^i_on = D(a|s, c^i; θ_d), c^i ~ N(0, 1)}_{i=0}^{n}
// Prediction
2: for ã_on in {ã^i_on}_{i=0}^{n} do
3:     Predict the residual ∆ = f(∆|s, ã_on; θ_f)
4:     Apply the residual: ã = ã_on + ∆
5: end for
// Selection
6: a* = argmax_a Q_1(s, a; θ_q1), a ∈ {ã^i}_{i=0}^{n}
Output: action a*
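The execution procedure above can be sketched in a few lines of Python. The callable interfaces for the decoder, residual network, and critic are our assumptions; in practice these would be trained neural networks.

```python
import numpy as np


def resact_execute(state, decoder, residual_fn, q1, n=20, seed=None):
    """Sketch of ResAct-EXECUTION: reconstruct n online-behavior candidates,
    improve each with the predicted residual, and select the candidate that
    the critic Q1 scores highest. All interfaces are hypothetical."""
    rng = np.random.default_rng(seed)
    # Reconstruction: decode candidates from latent codes c^i ~ N(0, 1)
    candidates = [decoder(state, rng.standard_normal()) for _ in range(n)]
    # Prediction: apply the residual, ã = ã_on + Δ
    improved = [a_on + residual_fn(state, a_on) for a_on in candidates]
    # Selection: a* = argmax_a Q1(s, a)
    return max(improved, key=lambda a: q1(state, a))
```

With toy stand-ins (e.g., a decoder that returns the latent code itself and a quadratic critic peaked at 1), the selected action is simply the improved candidate closest to the critic's peak.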

Statistics of RecL-25m.

Performance comparison on MovieLensL-1m. The "±" indicates 95% confidence intervals.

Performance comparison on RecL-25m in various tasks. The "±" indicates 95% confidence intervals.

Ablations for the information-theoretical regularizers. The "±" indicates 95% confidence intervals.


Performance comparison between ResAct and ResAct with uniformly augmented action candidates. The "±" indicates 95% confidence intervals.

ResAct:                              … ±0.0067 | 0.5433 ±0.0045 | 0.6675 ±0.0053
ResAct + 20 candidates (uniform): 0.5501 ±0.0068 | 0.3489 ±0.0041 | 0.4839 ±0.0054

ACKNOWLEDGEMENT

This research is supported by the National Research Foundation, Singapore under its Industry Alignment Fund -Pre-positioning (IAF-PP) Funding Initiative. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.


Ethics Statement. ResAct is designed to increase long-term user engagement, i.e., the time and frequency with which people use the product. It may therefore contribute to addiction issues. To mitigate this problem, we may use certain features, e.g., the age of users, to control the strength of personalized recommendation, which may help avoid addiction to some extent.

Reproducibility Statement. We describe the implementation details of ResAct in Appendix F, and also provide our source code and data in the supplementary material and at an external link.

F EXPERIMENTAL DETAILS

Across all methods and experiments, for a fair comparison, each network uses the same architecture (a 3-layer MLP with 256 neurons in each hidden layer) and the same hyper-parameters. We provide the hyper-parameters for ResAct in Table 5. All methods are implemented with PyTorch.
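For reference, the shared network template could look like the following PyTorch sketch. The paper specifies only the layer sizes; the activation choice and helper name are our assumptions.

```python
import torch
import torch.nn as nn


def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """A 3-layer MLP with 256 neurons in each hidden layer, matching the
    architecture shared across all compared methods."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )
```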

G SENSITIVITY ANALYSIS FOR THE REWARD WEIGHTS

When setting the reward weights in the Both mode, we use common empirical values that follow real-world requirements for operational metrics. The reward occurs only at the end of each session, which is representative of sequential recommendation. Alternative designs would change only the value of the reward, not the frequency of the learning signal. To show that our algorithm is robust to different reward weights, we perform a sensitivity analysis on the weights (return time : session length) of the rewards in the Both mode. As shown in Table 6, our algorithm consistently outperforms the baselines under different reward weights.

