OFFLINE ADAPTIVE POLICY LEARNING IN REAL-WORLD SEQUENTIAL RECOMMENDATION SYSTEMS

Abstract

The training process of RL requires many trials and errors, which are costly in real-world applications. A promising way to avoid this cost is to learn the policy from an offline dataset, e.g., to learn a simulator from the dataset and train optimal policies in the simulator. With this approach, the quality of the policies relies heavily on the fidelity of the simulator. Unfortunately, due to the stochasticity and non-stationarity of the real world and the unavailability of online sampling, distortion of the simulator is inevitable. In this paper, building on model-learning techniques, we propose a new paradigm for learning an RL policy from offline data in a real-world sequential recommendation system (SRS). Instead of increasing the fidelity of the models used for policy learning, we handle the distortion issue by learning to adapt to diverse simulators generated from the offline dataset. The resulting adaptive policy suits real-world environments whose dynamics are changing and stochastic in the offline setting. Experiments are conducted in synthetic environments and on a real-world ride-hailing platform. The results show that the method overcomes the distortion problem and produces robust recommendations in the unseen real-world environment.

1. INTRODUCTION

Recent studies have shown that reinforcement learning (RL) is a promising approach for real-world applications, e.g., sequential recommendation systems (SRS) (Wang et al., 2018; Zhao et al., 2019; Cai et al., 2017), which make multiple rounds of recommendations for customers and maximize long-term recommendation performance. However, the high trial-and-error cost in the real world obstructs further applications of RL methods (Strehl et al., 2006; Levine et al., 2018). Offline (batch) RL learns policies from a static dataset collected by behavior policies, without additional interactions with the environment (Levine et al., 2020; Siegel et al.; Wang et al., 2020; Kumar et al., 2019). Since it avoids costly trial and error in the real environment, offline RL is promising for cost-sensitive applications (Levine et al., 2020). One scheme of offline RL is to learn a simulator from the dataset, so that RL policies can be learned from the simulator directly. Although prior work on model-based learning has achieved significant efficiency improvements in online RL by learning dynamics models (Kaiser et al., 2020; Wang et al., 2019; Heess et al.; Luo et al., 2019), building an accurate simulator remains difficult, especially in offline RL. In particular, the offline dataset may not cover the whole state-action space, and there is no way to sample from the real world to correct the prediction error of the learned simulator. The learned policies tend to exploit regions where insufficient data are available, which destabilizes policy learning (Kurutach et al., 2018; Zhang et al., 2015; Viereck et al., 2017). To overcome this problem, recent studies in offline model-based RL (Yu et al., 2020; Kidambi et al., 2020) have made significant progress in MuJoCo environments (Todorov et al., 2012). These methods learn policies with uncertainty penalties.
The uncertainty here is a function that evaluates the confidence in the correctness of a prediction, and it is often implemented with ensemble techniques (Lowrey et al., 2019; Osband et al., 2018). By assigning large reward penalties (Yu et al., 2020) or truncating trajectories (Kidambi et al., 2020) where the uncertainty of the dynamics models is large, policy exploration is constrained to regions where the uncertainty of the model prediction is small, so as to avoid optimizing the policy to exploit regions where the model generalizes poorly. However, in real-world applications such as SRS, current offline learning methods ignore several realistic problems. First, taking SRS as an example, customer behaviors (i.e., the environment)

