OFFLINE ADAPTIVE POLICY LEARNING IN REAL-WORLD SEQUENTIAL RECOMMENDATION SYSTEMS

Abstract

The training process of RL requires much trial and error, which is costly in real-world applications. To avoid this cost, a promising solution is to learn the policy from an offline dataset, e.g., by learning a simulator from the dataset and training optimal policies in the simulator. In this approach, the quality of the policies relies heavily on the fidelity of the simulator. Unfortunately, due to the stochasticity and non-stationarity of the real world and the unavailability of online sampling, distortion of the simulator is inevitable. In this paper, based on model-learning techniques, we propose a new paradigm for learning an RL policy from offline data in real-world sequential recommendation systems (SRS). Instead of increasing the fidelity of models for policy learning, we handle the distortion issue by learning to adapt to a diverse set of simulators generated from the offline dataset. The adaptive policy suits real-world environments whose dynamics change and exhibit stochasticity in the offline setting. Experiments are conducted in synthetic environments and on a real-world ride-hailing platform. The results show that the method overcomes the distortion problem and produces robust recommendations in the unseen real world.

1. INTRODUCTION

Recent studies have shown that reinforcement learning (RL) is a promising approach for real-world applications, e.g., sequential recommendation systems (SRS) (Wang et al., 2018; Zhao et al., 2019; Cai et al., 2017), which make multiple rounds of recommendations to customers and maximize long-term recommendation performance. However, the high cost of trial and error in the real world obstructs further applications of RL methods (Strehl et al., 2006; Levine et al., 2018). Offline (batch) RL learns policies from a static dataset collected by behavior policies, without additional interactions with the environment (Levine et al., 2020; Siegel et al.; Wang et al., 2020; Kumar et al., 2019). Since it avoids costly trial and error in the real environment, offline RL algorithms are promising for cost-sensitive applications (Levine et al., 2020). One scheme of offline RL is to learn a simulator from the dataset, so that RL policies can be learned directly in the simulator. Although prior work on model-based learning has achieved significant efficiency improvements in online RL by learning dynamics models (Kaiser et al., 2020; Wang et al., 2019; Heess et al.; Luo et al., 2019), building an accurate simulator remains difficult, especially in offline RL. In particular, the offline dataset may not cover the whole state-action space, and there is no way to sample in the real world to correct the prediction error of the learned simulator. The learned policies tend to exploit regions where insufficient data are available, which destabilizes policy learning (Kurutach et al., 2018; Zhang et al., 2015; Viereck et al., 2017). To overcome this problem, recent studies in offline model-based RL (Yu et al., 2020; Kidambi et al., 2020) have made significant progress in MuJoCo environments (Todorov et al., 2012). These methods learn policies with uncertainty penalties.
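The simulator-learning scheme above can be sketched as supervised regression on offline transitions. The following is a minimal illustration, not the paper's method: a linear dynamics model fitted by least squares to a toy offline dataset stands in for the neural dynamics models used in practice, and all dimensions and matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy offline dataset of (state, action, next_state) transitions collected
# by some behavior policy; the true dynamics below are an assumption made
# only to generate data for this sketch.
true_A = np.array([[0.9, 0.1], [0.0, 0.95]])
true_B = np.array([[0.5], [1.0]])
states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 1))
next_states = (states @ true_A.T + actions @ true_B.T
               + 0.01 * rng.normal(size=(500, 2)))

# Learn a linear simulator s' ~ [s, a] @ W by least squares; with a neural
# network this would be the dynamics-model step of model-based offline RL.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

# The learned simulator fits the data it saw, but nothing constrains its
# predictions in regions the dataset does not cover.
mse = np.mean((X @ W - next_states) ** 2)
```

Note that a small training error here says nothing about fidelity outside the dataset's coverage, which is exactly the distortion problem the paper targets.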
The uncertainty here is a function that evaluates the confidence of prediction correctness, often implemented with ensemble techniques (Lowrey et al., 2019; Osband et al., 2018). By applying large reward penalties (Yu et al., 2020) or truncating trajectories (Kidambi et al., 2020) where the dynamics models have large uncertainty, policy exploration is constrained to regions where the uncertainty of model prediction is small, so as to avoid optimizing the policy to exploit regions with poor generalization. However, in real-world applications like SRS, current offline learning methods ignore several realistic problems. First, taking SRS as an example, customer behaviors (i.e., the environment) are non-stationary and thus change across different periods and locations. Therefore, besides the prediction error induced by model approximation, the transitions in the offline dataset can also become inaccurate in the future (Krueger et al., 2019; Chen et al., 2018; Zhao et al., 2018; Thomas et al., 2017; Li & de Rijke, 2019). Second, unlike traditional RL environments, which are largely deterministic (Brockman et al., 2016), real-world environments often introduce stochasticity. For example, after recommending a product to a customer, it is hard to model the customer's decision (e.g., buying it or not) without stochasticity, even with arbitrarily large data. Hidden confounding factors (Forney et al.; Bareinboim et al., 2015; Shang et al., 2019) preclude deterministic predictions. As a result, the uncertain regions grow drastically, and the exploration of policy learning is obstructed. In this paper, instead of constraining policy exploration to high-confidence regions, we handle the offline issue by learning to adapt. We propose an adaptive policy that is trained to take optimal actions efficiently in regions where model predictions have high confidence.
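The ensemble-based uncertainty penalty described above (as in Yu et al., 2020) can be sketched as follows. This is a hedged illustration, not the cited implementation: the ensemble members are random linear models standing in for trained dynamics networks, and the per-dimension standard deviation across members serves as the uncertainty proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of learned dynamics models: each maps (state,
# action) to a predicted next state. Random linear models stand in for
# trained neural networks here.
ensemble = [(rng.normal(size=(4, 4)), rng.normal(size=(2, 4)))
            for _ in range(5)]

def predict(model, state, action):
    W_s, W_a = model
    return state @ W_s + action @ W_a

def penalized_reward(state, action, reward, lam=1.0):
    """Uncertainty-penalized reward: subtract a penalty proportional to
    the ensemble's disagreement, a proxy for model uncertainty."""
    preds = np.stack([predict(m, state, action) for m in ensemble])
    # Disagreement = maximum per-dimension std across ensemble members.
    uncertainty = preds.std(axis=0).max()
    return reward - lam * uncertainty

s = rng.normal(size=4)
a = rng.normal(size=2)
r = penalized_reward(s, a, reward=1.0, lam=0.5)
```

Policies trained against such penalized rewards are steered away from low-confidence regions; the paper's point is that in stochastic, non-stationary SRS environments those regions become very large, so penalization alone over-constrains exploration.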
In regions with large uncertainty, by contrast, the policy is trained to identify a representation of each dynamics model and adapt its decisions to that representation. When the policy is deployed in the environment, it identifies the real-world dynamics through interaction and then adapts its behavior. The module that represents the dynamics models is named the environment-context extractor. The extractor and the adaptive policy should be learned on a diverse simulator set so that they can generalize to unknown situations. As a solution, we propose to use model-learning techniques with augmentation approaches to generate a simulator set that covers real-world situations. In this way, with a sufficiently large simulator set, the learned policy can adapt robustly in unknown real-world environments. To learn the adaptive policy and the environment-context extractor, we first analyze and formulate the environment-context representation problem in SRS. In SRS scenarios, the recommendation platform interacts with customers, and each customer can be regarded as an environment in the RL paradigm. The environments have a two-level structure: at the high level, a recommendation platform serves customers from multiple domains (e.g., different cities and countries); at the low level (i.e., within each domain), there are numerous customers with different behaviors, and the behaviors depend on the domain they are currently in. Although there has been recent interest in learning representations of environment parameters from the agent's trajectories in the robotics domain (Peng et al., 2018; Akkaya et al., 2019; Zhu et al., 2018; Sadeghi et al.), the environment-context representation problem in SRS has never been proposed, and the two-level environment structure makes the environment context unidentifiable from a single customer's trajectory without considering the domain he/she is currently in.
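An environment-context extractor of the kind described above can be sketched as pooling a customer's recent transitions and fusing the result with a domain embedding, reflecting the two-level structure. This is an illustrative assumption of one possible architecture, not the paper's network: all names, sizes, and the mean-pooling choice are hypothetical, and the weights would be learned rather than random.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: number of domains, domain-embedding width,
# transition-feature width, and context width. All illustrative.
N_DOMAINS, D_EMB, D_TRANS, D_CTX = 3, 4, 6, 8

domain_embedding = rng.normal(size=(N_DOMAINS, D_EMB))  # learned in practice
W_trans = rng.normal(size=(D_TRANS, D_CTX))
W_dom = rng.normal(size=(D_EMB, D_CTX))

def extract_context(transitions, domain_id):
    """Environment-context extractor sketch: pool the customer's recent
    (s, a, s') transition features and fuse them with a domain embedding,
    since a single trajectory alone is ambiguous under the two-level
    structure."""
    pooled = np.mean([t @ W_trans for t in transitions], axis=0)
    dom = domain_embedding[domain_id] @ W_dom
    return np.tanh(pooled + dom)

traj = [rng.normal(size=D_TRANS) for _ in range(10)]
ctx = extract_context(traj, domain_id=1)
```

The key design point mirrored here is that the domain identity enters the context computation directly; without the `dom` term, two customers with identical trajectories in different domains would receive the same context.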
As a solution, we use a dedicated network to embed the domain information and show that this additional domain information is necessary for representing environment contexts in SRS. As a result, we propose Offline learning with an Adaptive Policy in sequential Recommendation Systems (OapRS), a new paradigm for solving the offline problem in which policies can be applied to real-world applications without any additional online samples. By learning to adapt with a representation of the dynamics, OapRS suits real-world scenarios in which environments are non-stationary and stochastic. To the best of our knowledge, this is also the first study of the reality gap and the environment-context representation problem in SRS. We conduct experiments in a reproducible synthetic environment and in a real-world recommendation scenario: the driver program recommendation system of a ride-hailing platform. Our empirical evaluations demonstrate that OapRS learns reasonable environment contexts and makes robust recommendations in unseen environments.
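The deployment behavior described above, identifying the dynamics through interaction and then adapting, can be sketched as a simple loop in which a context-conditioned policy re-estimates its context from a rolling buffer of recent real-world transitions. This is a minimal sketch under assumed components: the random matrices stand in for a trained extractor and policy, and the environment response is simulated.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)
D_S, D_A, D_CTX = 4, 2, 3

# Hypothetical learned components (random stand-ins for trained networks).
W_ctx = rng.normal(size=(D_S + D_A + D_S, D_CTX))  # context extractor
W_pi = rng.normal(size=(D_S + D_CTX, D_A))         # context-conditioned policy

def act(state, context):
    # The policy conditions on both the observation and the current
    # estimate of the environment context.
    return np.tanh(np.concatenate([state, context]) @ W_pi)

buffer = deque(maxlen=20)     # recent transitions observed at deployment
context = np.zeros(D_CTX)     # prior context before any interaction
state = rng.normal(size=D_S)

for _ in range(30):
    action = act(state, context)
    next_state = rng.normal(size=D_S)  # stand-in for the real env response
    buffer.append(np.concatenate([state, action, next_state]))
    # Re-identify the dynamics from recent experience, then adapt behavior.
    context = np.tanh(np.mean([t @ W_ctx for t in buffer], axis=0))
    state = next_state
```

No gradient update happens at deployment; adaptation comes entirely from the context estimate changing as evidence accumulates, which is what allows the policy to remain fixed after offline training.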

2. RELATED WORK

Reinforcement learning (RL) has been shown to be a promising approach for real-world sequential recommendation systems (SRS) (Wang et al., 2018; Zhao et al., 2019; Cai et al., 2017) to make optimal recommendations with respect to long-term performance. However, the numerous online, unconstrained trial-and-error interactions in RL training obstruct further applications of RL in safety-critical SRS scenarios, since they may result in large economic losses (Levine et al., 2020; Gilotte et al.; Theocharous et al., 2015; Thomas et al., 2017). Many studies propose to overcome the problem with offline (batch) RL (Lange et al., 2012). Most prior work on offline RL introduces model-free algorithms. To overcome the extrapolation error, which is introduced by the mismatch between the offline dataset and the true state-action occupancy (Wang et al., 2020), these methods are designed to constrain the target policy to stay close to the behavior policies (Wang et al., 2020; Kumar et al., 2019; Wu et al., 2019), apply ensemble

