DEEP EVIDENTIAL REINFORCEMENT LEARNING FOR DYNAMIC RECOMMENDATIONS

Abstract

Reinforcement learning (RL) has been applied to build recommender systems (RS) that capture users' evolving preferences and continuously improve the quality of recommendations. In this paper, we propose a novel deep evidential reinforcement learning (DERL) framework that learns a more effective recommendation policy by integrating both the expected reward and evidence-based uncertainty. In particular, DERL conducts evidence-aware exploration to locate items that are likely to interest a user in the future. DERL consists of two central components: a customized recurrent neural network (RNN) and an evidential-actor-critic (EAC) module. The former generates the current state of the environment by aggregating historical information with a sliding window that contains the current user interactions as well as newly recommended items that may encode future interest. The latter performs evidence-based exploration by maximizing a uniquely designed evidential Q-value to derive a policy that prefers items with high predicted ratings that remain largely unknown to the system (due to a lack of evidence). The two components are jointly trained by supervised learning and reinforcement learning. Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of DERL and its capability to capture long-term user interests.

1. INTRODUCTION

Recommender systems (RS) have been widely used to provide personalized recommendations in diverse fields such as media, entertainment, and e-commerce by effectively improving user experience (Su & Khoshgoftaar, 2009; Sun et al., 2014; Xie et al., 2018). Various methods have been introduced to tackle the recommendation problem. Traditional methods include collaborative filtering, which captures user preferences using information from similar users (Koren, 2008); content-based methods, where extra information is used for better latent preference and item representations (Mooney & Roy, 2000); and hybrid methods, which integrate collaborative and content-based approaches for more effective recommendation (Burke, 2002). Deep learning (DL) has also been increasingly used to build RS due to its ability to model complex and non-linear user-item relationships (Cheng et al., 2016; Guo et al., 2017). Most of the RS methods mentioned above treat recommendation as a static process, which fails to account for users' evolving preferences. Some efforts have been devoted to capturing users' evolving preferences by shifting the latent user preference over time (Koren, 2009; Charlin et al., 2015; Gultekin & Paisley, 2014). Similarly, sequential recommendation methods (Kang & McAuley, 2018; Tang & Wang, 2018) attempt to incorporate users' dynamic behavior by leveraging previously interacted items. However, both static and dynamic recommendation methods primarily focus on maximizing the immediate (short-term) reward when making recommendations. As a result, they fail to consider whether the recommended items will lead to long-term returns in the future, which is essential for maintaining a stable user base in the long run. Several recent works have adopted reinforcement learning (RL) in RS (Chen et al., 2019b; Zhao et al., 2017). RL has already achieved great success in diverse fields such as robotics (Kober et al., 2013) and games (Silver et al., 2017).
The core idea of RL is to learn an optimal policy that maximizes the total expected reward in the long run. RL methods treat the recommendation procedure as sequential interactions between users and an RL agent in order to learn effective recommendation policies. Although RL approaches show promising results in RS (Chen et al., 2019b; Zheng et al., 2018), they primarily rely on standard exploration strategies (e.g., ϵ-greedy), which are less effective in a large item space with sparse reward signals, given the limited interactions available for most users. Consequently, they may not learn the optimal policy that provides the most informative recommendations to capture user preferences effectively and achieve the maximum expected reward in the long run.

To address these key challenges, we propose a novel deep evidential reinforcement learning (DERL) method that balances exploitation (of items with high predicted ratings) and exploration (guided by evidence-based uncertainty) for effective recommendations. We formulate an evidential RL framework that augments the maximum-reward RL objective with evidence-based uncertainty maximization. More importantly, the evidence-based uncertainty formulation substantially improves exploration and robustness by acquiring diverse behaviors that are indicative of a user's long-term interest. As shown in Figure 1b, DERL focuses strongly on more diverse genres (denoted by 'others' in the figure), and many of these capture the long-term interest of the user, as verified by the detailed recommendation list in Table 1. Here, we refer to a user's long-term interest as the genres of movies frequently watched by the user in the later phase of interactions (i.e., after time step 8 in the given example). These genres are Musical, Action, Adventure, Comedy, and Animation, which match almost perfectly with what DERL recommends in Table 1. DERL seamlessly integrates two major components: a customized RNN and an evidential-actor-critic (EAC) module.
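To make the exploration bonus concrete, the sketch below illustrates one way evidence-based uncertainty could augment an expected reward. It follows the standard subjective-logic treatment of Dirichlet evidence (vacuity u = K/S); the specific scoring rule, the weight `lam`, and the example evidence vectors are our own illustrative assumptions, not the paper's exact evidential Q-value formulation:

```python
import numpy as np

def evidential_q(evidence, ratings, lam=0.5):
    """Score an item by its expected rating plus an evidence-based
    uncertainty bonus (illustrative; `lam` is a hypothetical weight).

    evidence : non-negative evidence counts per rating class, shape [K]
    ratings  : numeric value of each rating class, shape [K]
    """
    alpha = evidence + 1.0              # Dirichlet parameters
    S = alpha.sum()                     # Dirichlet strength
    p = alpha / S                       # expected class probabilities
    expected_rating = (p * ratings).sum()
    vacuity = len(evidence) / S         # u = K/S: high when evidence is scarce
    return expected_rating + lam * vacuity

ratings = np.arange(1.0, 6.0)                              # 1..5 star classes
well_known = evidential_q(np.array([0., 1., 2., 30., 10.]), ratings)
unknown    = evidential_q(np.array([0., 0., 1., 2., 1.]), ratings)
```

With abundant evidence the bonus vanishes and the score reduces to the expected rating; with scarce evidence the bonus pushes the agent to explore items the system knows little about.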
The former focuses on generating the current state of the environment by aggregating the previous state, the current items captured by a sliding window, and future recommended items from the RL agent. This provides an effective means of dynamic state representation for better future recommendations. Meanwhile, the EAC module leverages evidence-based uncertainty to explore the item space most effectively and identify items that potentially align with the user's long-term interest. It learns the optimal policy by maximizing a novel evidential Q-value to achieve the maximum long-term cumulative reward. The main contribution of this paper is fourfold:

• A novel recommendation model that integrates reinforcement learning with evidential learning to provide uncertainty-aware recommendations.

• Evidence-based uncertainty maximization to enable stable and effective exploration.

• An off-policy formulation that promotes the reuse of previously collected data while stabilizing model training, which is important to address data scarcity in recommender systems.

• Seamless integration of a customized RNN, an actor-critic network, and an evidential network in an end-to-end training process.
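The state-generation step described above can be sketched as a recurrent update that folds the sliding window and the newly recommended items into the previous state. The minimal GRU-style cell below (without a reset gate) and the random weights are illustrative assumptions; the paper's customized RNN is not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding/state dimension (illustrative)

# Hypothetical weights of a minimal gated recurrent cell
Wz, Uz = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wh, Uh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(state, x):
    """One simplified gated update folding item embedding x into the state."""
    z = sigmoid(Wz @ x + Uz @ state)      # update gate
    h = np.tanh(Wh @ x + Uh @ state)      # candidate state
    return (1 - z) * state + z * h

def next_state(prev_state, window_items, recommended_items):
    """Aggregate the previous state, the sliding window of current
    interactions, and newly recommended items into the next state."""
    state = prev_state
    for x in list(window_items) + list(recommended_items):
        state = gru_step(state, x)
    return state

s0 = np.zeros(d)
window = [rng.normal(size=d) for _ in range(3)]   # current interactions
recs = [rng.normal(size=d) for _ in range(2)]     # newly recommended items
s1 = next_state(s0, window, recs)
```

Including the recommended items in the update is what lets the state encode prospective future interest rather than only past interactions.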



Figure 1: Different recommendation behavior between an existing RL model and DERL

Table 1: Examples of recommended movies

