DEEP EVIDENTIAL REINFORCEMENT LEARNING FOR DYNAMIC RECOMMENDATIONS

Abstract

Reinforcement learning (RL) has been applied to build recommender systems (RS) to capture users' evolving preferences and continuously improve the quality of recommendations. In this paper, we propose a novel deep evidential reinforcement learning (DERL) framework that learns a more effective recommendation policy by integrating both the expected reward and evidence-based uncertainty. In particular, DERL conducts evidence-aware exploration to locate items that a user will most likely take interest in the future. Two central components of DERL include a customized recurrent neural network (RNN) and an evidential-actor-critic (EAC) module. The former module is responsible for generating the current state of the environment by aggregating historical information and a sliding window that contains the current user interactions as well as newly recommended items that may encode future interest. The latter module performs evidence-based exploration by maximizing a uniquely designed evidential Q-value to derive a policy giving preference to items with good predicted ratings while remaining largely unknown to the system (due to lack of evidence). These two components are jointly trained by supervised learning and reinforcement learning. Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of DERL and its capability to capture long-term user interests.

1. INTRODUCTION

Recommender systems (RS) have been widely used for providing personalized recommendations in diverse fields such as media, entertainment, and e-commerce by effectively improving user experience (Su & Khoshgoftaar, 2009; Sun et al., 2014; Xie et al., 2018). Various methods have been introduced to tackle the recommendation problem. Traditional methods include collaborative filtering, which captures user preferences using information from similar users (Koren, 2008); content-based methods, which use extra information for better latent preference and item representation (Mooney & Roy, 2000); and hybrid methods, which integrate both collaborative and content-based methods for more effective recommendation (Burke, 2002). Deep learning (DL) has also been increasingly used to build RS due to its ability to model complex and non-linear user-item relationships (Cheng et al., 2016; Guo et al., 2017).

Most RS methods mentioned above consider recommendation as a static process, which fails to consider users' evolving preferences. Some efforts have been devoted to capturing users' evolving preferences by shifting the user latent preference over time (Koren, 2009; Charlin et al., 2015; Gultekin & Paisley, 2014). Similarly, sequential recommendation methods (Kang & McAuley, 2018; Tang & Wang, 2018) attempt to incorporate users' dynamic behavior by leveraging previously interacted items. However, both static and dynamic recommendation methods primarily focus on maximizing the immediate (short-term) reward when making recommendations. As a result, they fail to take into account whether the recommended items will lead to long-term returns, which is essential to maintain a stable user base for the system in the long run.

Several recent works have adapted reinforcement learning (RL) to RS (Chen et al., 2019b; Zhao et al., 2017). RL has already gained huge success in diverse fields, such as robotics (Kober et al., 2013) and games (Silver et al., 2017).
The core idea of RL is to learn an optimal policy to maximize the total expected reward in the long run. RL methods consider a recommendation procedure as sequential interactions between users and RL agents to learn the optimal recommendation policies effectively. Although RL approaches show promising results in RS (Chen et al., 2019b; Zheng et al., 2018), they primarily rely on standard exploration strategies (ϵ-greedy), which are less effective in a large item space with sparse reward signals given the limited interactions for most users. Therefore, they may not learn the optimal policy that provides the most informative recommendations to capture effective user preferences and achieve maximum expected reward in the long run.

Figure 1 further illustrates the limitation of existing RL methods using a standard ϵ-greedy strategy for exploration. The RL agent primarily focuses on highly-rated items in early steps, as shown in Figure 1a. Most of these items come from the same genre, as shown in Figure 1b, which is further verified by the detailed recommendation list given in Table 1. Such a recommendation behavior leads to a lower cumulative reward in the later steps, as shown in Figure 1c. As Table 1 shows, ϵ-greedy mostly focuses on Drama movies based on the user's current preference. It only captures one novel genre (i.e., Musical, bold in the table) that matches the user's long-term interest. It is clear that more systematic exploration is essential to discover users' long-term interests to maximize the future reward.

To address the above key challenges, we propose a novel deep evidential reinforcement learning (DERL) method that utilizes a balanced exploitation (with high predicted ratings) and exploration (with evidence-based uncertainty) strategy for effective recommendations. We formulate an evidential RL framework that augments the maximum reward RL objective with evidence-based uncertainty maximization.
More importantly, the evidence-based uncertainty formulation substantially improves exploration and robustness by acquiring diverse behaviors that are indicative of a user's long-term interest. As shown in Figure 1b, DERL devotes a strong focus to more diverse genres (denoted by 'others' in the figure), and many of these capture the long-term interest of the user, as verified by the detailed recommendation list in Table 1. In this case, we refer to a user's long-term interest as the genres of movies frequently watched by the user in the later phase of interactions (i.e., after time step 8 in the given example). These genres are Musical, Action, Adventure, Comedy, and Animation, which match almost perfectly with what DERL recommends in Table 1.

DERL seamlessly integrates two major components: a customized RNN and an evidential-actor-critic (EAC) module. The former primarily focuses on generating the current state of the environment by aggregating the previous state, current items captured by a sliding window, and future recommended items from the RL agent. This provides an effective means of dynamic state representation for better future recommendations. Meanwhile, the EAC module leverages evidence-based uncertainty to effectively explore the item space and identify items that potentially align with the user's long-term interest. It encourages learning the optimal policy by maximizing a novel evidential Q-value to achieve a maximum long-term cumulative reward.

The main contribution of this paper is fourfold:
• A novel recommendation model that integrates reinforcement learning with evidential learning to provide uncertainty-aware recommendations.
• Evidence-based uncertainty maximization to enable stability and effective exploration.
• An off-policy formulation to effectively promote the reuse of previously collected data while stabilizing model training, which is important to address data scarcity in recommender systems.
• Seamless integration of a customized RNN, an actor-critic network, and an evidential network to provide an end-to-end integrated training process.

We conduct extensive experiments over four real-world datasets and compare with state-of-the-art baselines to demonstrate the effectiveness of the proposed model.

2. RELATED WORK

Static models. Matrix Factorization (MF) leverages user and item latent factors to infer user preferences (Koren et al., 2009; Funk, 2006; Koren, 2008). MF is further extended with Bayesian Personalized Ranking (BPR) (Rendle et al., 2012) and Factorization Machine (FM) (Rendle, 2010). Recently, deep learning-based recommender systems (Cheng et al., 2016; Guo et al., 2017) have achieved impressive performance. DeepFM (Guo et al., 2017) integrates traditional FM and deep learning to learn low- and high-order feature interactions. Both wide and deep networks are jointly trained in (Cheng et al., 2016) for better memorization and generalization. In graph-based methods (Berg et al., 2017), users and items are represented as a bipartite graph and links are predicted to provide recommendations. Similarly, Neural Graph Collaborative Filtering (Wang et al., 2019) explicitly encodes the collaborative signal via high-order connectivities in the user-item bipartite graph through embedding propagation.

Dynamic and sequential models. Dynamic models shift the latent user preference over time to incorporate temporal information. TimeSVD++ (Koren, 2009) [...] (Sun et al., 2019). Sequential models neglect users' long-term preferences. The proposed DERL model aims to fill this critical gap by performing evidence-guided exploration and maximizing the total expected reward.

RL-based models. RL-based RS models aim to learn an effective policy to maximize the total expected reward in the long run. On-policy learning with contextual bandits (Li et al., 2010) and Markov Decision Processes (MDP) (Zheng et al., 2018) learns by interacting with real customers in an online environment. A collaborative contextual bandit algorithm called CoLin (Wu et al., 2016) utilizes graph structure in a collaborative manner. On the other hand, off-policy learning utilizes Monte Carlo (MC) and temporal-difference (TD) methods to achieve stable and efficient learning from users' history (Farajtabar et al., 2018).
Similarly, model-based RL models user-agent interaction via a generative adversarial network (Bai et al., 2019). Pseudo Dyna-Q (Zou et al., 2020) further integrates both direct and indirect RL approaches in a single unified framework without requiring real customer interactions. However, the above methods utilize random exploration strategies, which are less effective at capturing users' long-term preferences. In contrast, our DERL utilizes evidence-based uncertainty to systematically explore the item space to maximize the long-term reward.

3. PRELIMINARIES

We first introduce the standard RS setup in RL and provide an overview of evidential theory.

Recommendation Formulation with RL. We formulate recommendation tasks in an RL setting, where an RL agent interacts with the environment (users and items) to recommend the next items to a user over time in sequential order to maximize the cumulative reward. We model this problem as an MDP, which includes a sequence of states, actions, and rewards. More formally, the MDP is defined by a tuple $(S, A, p, r)$, where $S$ is the state space, $A$ is the action space, $p$ is the state transition probability, and $r$ is the reward function.

Uncertainty and the Evidential Theory. The theory of evidence is a generalization of Bayesian theory to subjective probabilities (Dempster, 1968). We briefly introduce subjective logic (SL) (Jøsang, 2016) and discuss uncertainty estimation based on SL. SL is a probabilistic logic that is built upon probability theory and belief theory. It represents uncertainty by introducing vacuity of evidence in its opinion, which is defined over a multinomial random variable $y$ in domain $Y = \{1, ..., K\}$. This opinion can be equivalently represented by a $K$-dimensional Dirichlet distribution $\text{Dir}(p \mid \alpha)$, where $\alpha$ is a strength over $K$ classes and $p = (p_1, ..., p_K)^\top$ governs a categorical distribution over $Y$. The term evidence is the measure of the number of supportive observations from data for each class. It has a fixed relationship with the concentration parameter $\alpha$ given a non-informative prior. Let $e_k$ be the evidence for class $k$, so that $\alpha_k = e_k + 1$. SL measures different types of second-order uncertainty through evidence, including vacuity, dissonance, and a few others (Jøsang et al., 2018). In particular, vacuity corresponds to the uncertainty mass of a subjective opinion $\omega$:

$$\text{vac}(\omega) = U = \frac{K}{S}, \qquad S = \sum_{k=1}^{K} (e_k + 1) \tag{1}$$

Since vacuity is defined by a lack of evidence in the data sample, it provides a natural way to facilitate the exploration of an RL agent, which will be detailed next.

Overview. We propose a deep evidential reinforcement learning model to perform dynamic recommendations, as shown in Figure 2.
The model includes a recurrent neural network (RNN) to maintain a dynamic state space, and an evidential-actor-critic (EAC) module to explore the item space by introducing the evidence-based uncertainty (vacuity) into a new evidential RL setting. By incorporating previous state information, recent items captured by a sliding window, and the recommended items from the RL agent, the RNN module generates the current state $s_t$. This state is further passed to the action network, which predicts the mean and variance to form a Gaussian policy distribution. We sample a current action $a_t$ from the policy distribution; it corresponds to the latent preference of the user and simultaneously captures the past (via the previous state), current (through a sliding window), and future interest (through RL exploration). By leveraging the current action and item embeddings, the evidence network provides the evidence that can be used to form the rating prediction for exploitation while estimating the uncertainty for effective exploration. The Q-network (critic) generates an evidential Q-value for evidential policy updates of the action network. Table 4 in Appendix A summarizes the major notations.
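To make the role of vacuity concrete, the following is a minimal sketch of the vacuity formula U = K / S with S equal to the sum of (e_k + 1) over the K rating classes. The function names and the example evidence vectors are illustrative assumptions, not part of the DERL implementation.

```python
def vacuity(evidence):
    """Vacuity U = K / S, where S = sum_k (e_k + 1) and K = number of classes."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence values must be non-negative")
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def expected_probabilities(evidence):
    """Expected class probabilities (e_k + 1) / S of the Dirichlet opinion."""
    strength = sum(e + 1.0 for e in evidence)
    return [(e + 1.0) / strength for e in evidence]


# Hypothetical evidence vectors for a 1-5 rating scale (K = 5).
no_evidence = [0.0, 0.0, 0.0, 0.0, 0.0]       # nothing observed yet
strong_evidence = [0.0, 0.0, 0.0, 2.0, 18.0]  # many observations for rating 5

print(vacuity(no_evidence))      # maximal vacuity: the item is unknown
print(vacuity(strong_evidence))  # low vacuity: the item is well understood
```

With no evidence, S equals K and the vacuity reaches its maximum of 1; as evidence accumulates for any class, S grows and the vacuity shrinks toward 0, which is why high-vacuity items are natural exploration targets.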

4.1. ENVIRONMENT SETUP

We start by describing the environment of the proposed evidential RL agent. The environment consists of user-interacted items (an item pool $I$) from the user's interaction history $H_u$, an embedding matrix $E$ to generate the user embedding, the RNN for dynamic state generation, and an evidential reward process (ERP) that specifies an incentive mechanism for each action of the agent. Our reward process encourages a balance between exploitation (based on predicted ratings) and exploration (based on evidence-based uncertainty) when making recommendations to users. In particular, a recommended list consists of a limited number of items. The proposed ERP ranks candidate items according to an evidential score that integrates the predicted rating and evidence-based uncertainty:

$$\text{score}_{u,i} = \text{rating}_{u,i} + \lambda\, U_\pi(i \mid s_t, a_t) \tag{2}$$

where $\lambda$ balances the rating and the uncertainty, and $\text{rating}_{u,i}$ is the predicted rating. Given $K$ possible rating classes, the evidence network (introduced later in this section) outputs an evidence vector $e_i = (e_{i1}, ..., e_{iK})^\top$ for each item $i$. This allows us to evaluate $\text{rating}_{u,i}$ as $\sum_{k=1}^{K} p_{ik} \times k$, where $p_{ik}$ is the rating probability given by (9). Meanwhile, the uncertainty $U_\pi(i \mid s_t, a_t)$ for item $i$ can be evaluated through (1). Based on the evidential score, an RL agent will choose the top-$N$ items to form a list $N_u$ and recommend them to the user. As the feedback to the agent, the user provides the actual rating for each recommended item. Consequently, the evidential reward is

$$r^e_\pi(s_t, a_t) = \frac{1}{N} \sum_{i \in N_u} \Big[ (\text{rating}_{u,i} - \tau) + \lambda\, U_\pi(i \mid s_t, a_t) \Big] \tag{3}$$

where $\text{rating}_{u,i}$ is the user-assigned ground-truth rating and $\tau$ is a threshold chosen based on the rating mechanism ($\tau = 3$ for a 1-5 rating system). Given the evidential reward, we introduce an evidential Q-value, which can be computed by repeatedly applying the Bellman operator $\mathcal{B}^\pi$:

$$\mathcal{B}^\pi Q^e(s_t, a_t) \triangleq r^e_\pi(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \pi} [V(s_{t+1})] \tag{4}$$

where $V(s_t) = \mathbb{E}_{a_t \sim \pi} [Q^e(s_t, a_t)]$. The evidential Q-value will be used for the update of the EAC module, which is introduced later in this section.
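The ranking and reward steps of the ERP can be sketched as follows. This is a minimal illustration under stated assumptions: evidence vectors are given directly rather than produced by a network, the uncertainty term is averaged over the recommended items together with the rating term, and all function names, item identifiers, and numeric values are hypothetical.

```python
def vacuity(evidence):
    """Vacuity U = K / S with S = sum_k (e_k + 1)."""
    return len(evidence) / sum(e + 1.0 for e in evidence)


def predicted_rating(evidence):
    """Expected rating sum_k p_k * k with p_k = (e_k + 1) / S and k = 1..K."""
    strength = sum(e + 1.0 for e in evidence)
    return sum((e + 1.0) / strength * (k + 1) for k, e in enumerate(evidence))


def evidential_score(evidence, lam):
    """Evidential score: predicted rating plus lambda-weighted vacuity."""
    return predicted_rating(evidence) + lam * vacuity(evidence)


def recommend_top_n(item_evidence, lam, n):
    """Rank candidate items by evidential score and keep the top n."""
    ranked = sorted(item_evidence,
                    key=lambda item: evidential_score(item_evidence[item], lam),
                    reverse=True)
    return ranked[:n]


def evidential_reward(true_ratings, vacuities, lam, tau=3.0):
    """Average of (rating - tau) + lambda * vacuity over the recommended list."""
    n = len(true_ratings)
    return sum((r - tau) + lam * u for r, u in zip(true_ratings, vacuities)) / n


# Hypothetical candidates on a 1-5 scale.
items = {
    "well_known_good": [0.0, 0.0, 0.0, 0.0, 20.0],  # strong evidence for rating 5
    "unknown": [0.0, 0.0, 0.0, 0.0, 0.0],           # no evidence at all
    "well_known_bad": [20.0, 0.0, 0.0, 0.0, 0.0],   # strong evidence for rating 1
}

print(recommend_top_n(items, lam=0.0, n=1))  # pure exploitation
print(recommend_top_n(items, lam=3.0, n=1))  # exploration-heavy ranking
```

With λ = 0 the ranking reduces to the predicted rating alone, while a larger λ lets a completely unknown item outrank a well-known good one, which is the exploration behavior the evidential score is designed to produce.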

4.2. THE CUSTOMIZED RNN FOR LATENT STATE GENERATION

A specially designed RNN is used to maintain the state space of a dynamic RS environment. In particular, a state $s_t$ is generated by aggregating three pieces of information: the previous state $s_{t-1}$, items interacted with by the user in the current step, and newly recommended items. Here, an item is represented by an embedding vector that encodes item entity information. By aggregating all this information, the current state can evolve from the previous state by effectively capturing the past preference and future predicted preference of the user. In particular, newly interacted items are extracted from the user's interaction history $H_u$ using a sliding window, and the currently recommended items are obtained by invoking the previous action $a_{t-1}$. Assume that a total of $M$ items are obtained, with half from the sliding window and the rest from the action. These $M$ items then go through an embedding matrix to produce a user embedding $u_t$ for time step $t$. Then, $s_t$ is formed by

$$s_t = \text{RNN}(s_{t-1}, u_t) \tag{5}$$

To train the customized RNN, we collect additional data tuples $[s_{t-1}, u_t]$ into the replay buffer. We then sample batches from the buffer and send $u_t$ and $s_{t-1}$ to the RNN module, which generates the current state $s_t$. After that, we send $s_t$ to the action network, which samples $a_t$ from the action distribution. Action $a_t$ then goes through the evidence network to predict the evidence vector for each candidate item. Finally, we compute the evidential loss $J_{Evi}$ as defined in (10) and conduct backpropagation with respect to the RNN parameter $\omega$:

$$\nabla_\omega J_{RNN}(\omega) = \nabla_\omega J_{Evi}(\psi) \tag{6}$$

In this way, the computing graph is maintained even in the offline setting and the RNN can be trained as in the standard supervised setting.
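The state update s_t = RNN(s_{t-1}, u_t) can be illustrated with a small forward-pass sketch. Several choices here are assumptions made only for illustration: the user embedding u_t is formed by mean-pooling the M item embeddings, the recurrent cell is a vanilla tanh cell, and the weights are random rather than trained. The paper's customized RNN may differ in all three respects.

```python
import math
import random


def mean_embedding(item_vectors):
    """Average a list of equal-length item embedding vectors (assumed pooling)."""
    dim = len(item_vectors[0])
    return [sum(vec[d] for vec in item_vectors) / len(item_vectors)
            for d in range(dim)]


def rnn_step(prev_state, user_emb, w_state, w_input, bias):
    """One vanilla tanh RNN step: s_t = tanh(W_s s_{t-1} + W_u u_t + b)."""
    new_state = []
    for j in range(len(bias)):
        z = bias[j]
        z += sum(w * s for w, s in zip(w_state[j], prev_state))
        z += sum(w * u for w, u in zip(w_input[j], user_emb))
        new_state.append(math.tanh(z))
    return new_state


rng = random.Random(0)
EMB_DIM, STATE_DIM = 4, 3  # hypothetical sizes


def rand_matrix(rows, cols):
    """Random matrix with entries in [-0.5, 0.5], standing in for learned values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w_state = rand_matrix(STATE_DIM, STATE_DIM)
w_input = rand_matrix(STATE_DIM, EMB_DIM)
bias = [0.0] * STATE_DIM

# M = 10 items: half from the sliding window, half recommended by the agent.
window_items = rand_matrix(5, EMB_DIM)
agent_items = rand_matrix(5, EMB_DIM)
u_t = mean_embedding(window_items + agent_items)

s_prev = [0.0] * STATE_DIM
s_t = rnn_step(s_prev, u_t, w_state, w_input, bias)
print(s_t)
```

In training, the weights of such a cell would be updated through the evidential loss rather than fixed at random values.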

4.3. EVIDENTIAL ACTOR CRITIC (EAC)

Training goal. A standard RL model maximizes the expected sum of rewards. We consider a generalized evidential reward function $r^e_\pi$ defined in (3), which augments the standard RL objective with the average evidence-based uncertainty of the recommended items to encourage exploration of the item space. We achieve our training goal by updating the evidential actor network that finds the optimal policy to maximize the expected cumulative evidential reward:

$$J_\pi = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim D} \big[ r^e_\pi(s_t, a_t) \big] \tag{7}$$

where $D$ is the distribution of $(s_t, a_t)$ from the data or the replay buffer and $T$ is the total number of time steps in the episode. A novel benefit of the new objective is that it allows the agent to interact with more informative items for more effective exploration of a large item space. EAC consists of three key networks: the action network, the evidence network, and the critic network, which are detailed next.

Action network. The action network (or policy network) utilizes the current state $s_t$ from the offline replay buffer and outputs a policy distribution $\pi(\cdot \mid s_t)$, which is modeled as a Gaussian. From this distribution, we sample an action $a_t$ that is used in the evidence and the critic networks to provide recommendations and direct the policy update, respectively. For the action network update, we use backward update signals from the critic network:

$$\nabla_\phi J_\pi(\phi) = \big( -\nabla_{a_t} Q^e(s_t, a_t) \big) \times \nabla_\phi \pi(\cdot \mid s_t, \phi) \tag{8}$$

This gradient extends the DDPG-style policy update (Lillicrap et al., 2015) by applying the chain rule through the Q-network to update the action network.

Evidence network. The evidence network predicts a Dirichlet distribution of class probabilities, which can be considered as an evidence collection process. The learned evidence is informative for quantifying the predictive uncertainty of recommended items. The network takes action $a_t$ from the replay buffer and item pool $I$ to provide class-level evidence.
Then, the probability of rating $k$ is

$$p_{ik} = \frac{e_{ik} + 1}{S_i} \tag{9}$$

where $e_{ik}$ is the evidence collected for rating $k$ for item $i$. To train the evidence network, we define a standard evidential loss by utilizing the MSE loss between the rating class probability $p_{ik}$ and the one-hot ground truth label $y_i$, in which $y_{ik} = 1$ if $k$ is the correct rating and $y_{ik} = 0$ otherwise:

$$J_{Evi}(\psi) = \sum_{k=1}^{K} \left[ (y_{ik} - p_{ik})^2 + \frac{p_{ik}(1 - p_{ik})}{S_i + 1} \right] \tag{10}$$

We update the network by backpropagating the evidential loss $J_{Evi}(\psi)$ with respect to its parameters $\psi$.

Critic network. The critic network is designed to approximate the evidential Q-value utilizing the current state $s_t$ and action $a_t$ in a fully connected neural network $Q_\theta(s_t, a_t)$. This Q-value judges whether the action generated by the agent matches the requirements of the current state $s_t$. We derive an update formulation for the critic network following the recent double DQN-based method (Mnih et al., 2015) that utilizes two critic networks to stabilize the training process, achieve faster convergence, and provide a better Q-value:

$$\hat{Q}^e(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim D,\, a_{t+1} \sim \pi} \Big[ r^e_\pi(s_t, a_t) + \gamma \times \min\big\{ Q^e(s_{t+1}, a_{t+1}),\, \hat{Q}^e(s_{t+1}, a_{t+1}) \big\} \Big] \tag{11}$$

where $\hat{Q}^e(s_{t+1}, a_{t+1})$ is a target network, which is updated slowly to stabilize the training process. The evidential Q-function parameters are trained by minimizing the temporal difference (TD) error:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}, a_{t+1}, r^e_\pi(s_t, a_t)) \sim D} \left[ \frac{1}{2} \big( Q^e(s_t, a_t) - \hat{Q}^e(s_t, a_t) \big)^2 \right] \tag{12}$$

where $D$ is the distribution of $(s_t, a_t, s_{t+1}, a_{t+1}, r^e_\pi(s_t, a_t))$ in an offline buffer. Furthermore, the Q-network is optimized with stochastic gradient descent. The overall recommendation algorithm is shown in Algorithm 1 of Appendix C.
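The two training signals described above, the evidential MSE loss and the TD error with a minimum over two Q estimates, can be sketched for a single item and a single transition. This is an illustration only: it operates on scalar values rather than networks, and the function names and numbers are assumptions.

```python
def evidential_mse_loss(evidence, true_rating):
    """Per-item loss: sum_k (y_k - p_k)^2 + p_k (1 - p_k) / (S + 1).

    evidence:    non-negative evidence values for ratings 1..K
    true_rating: observed rating in 1..K (converted to a one-hot label)
    """
    strength = sum(e + 1.0 for e in evidence)
    loss = 0.0
    for k, e in enumerate(evidence):
        p = (e + 1.0) / strength
        y = 1.0 if (k + 1) == true_rating else 0.0
        loss += (y - p) ** 2 + p * (1.0 - p) / (strength + 1.0)
    return loss


def td_target(reward, gamma, q_next_a, q_next_b):
    """TD target using the smaller of two next-step Q estimates."""
    return reward + gamma * min(q_next_a, q_next_b)


def critic_loss(q_pred, target):
    """Half squared TD error for one transition."""
    return 0.5 * (q_pred - target) ** 2


# Hypothetical evidence vectors for an item whose true rating is 5.
uniform = [0.0, 0.0, 0.0, 0.0, 0.0]
confident_right = [0.0, 0.0, 0.0, 0.0, 20.0]
confident_wrong = [20.0, 0.0, 0.0, 0.0, 0.0]

for name, ev in [("uniform", uniform),
                 ("confident_right", confident_right),
                 ("confident_wrong", confident_wrong)]:
    print(name, evidential_mse_loss(ev, true_rating=5))

target = td_target(reward=1.0, gamma=0.9, q_next_a=2.0, q_next_b=3.0)
print(target, critic_loss(q_pred=2.0, target=target))
```

The loss is smallest when strong evidence supports the correct rating and largest when strong evidence supports a wrong one. Taking the minimum of two estimates in the target is a common way to limit overestimation of Q-values.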

4.4. DERIVATION OF EVIDENTIAL POLICY ITERATION

We derive evidential policy iteration as a general method for learning optimal uncertainty policies by alternating between evidential policy evaluation and evidential policy improvement in the maximum uncertainty framework. We compute the value of a policy $\pi$ according to the maximum uncertainty objective of Eq. (7). DERL expresses a policy as a Gaussian distribution whose mean and covariance are given by an action neural network. With the above settings, we show that the evidential policy iteration can achieve the optimal policy at convergence.

Lemma 1 (Evidential Policy Evaluation). Given the Bellman operator $\mathcal{B}^\pi$ in Eq. (4) and $Q^{n+1} = \mathcal{B}^\pi Q^n$, the Q-value will converge to the evidential Q-value of policy $\pi$ as $n \to \infty$.

Lemma 2 (Evidential Policy Improvement). Given a new policy $\pi_{new}$ that is updated via Eq. (8), $Q^e_{\pi_{new}}(s_t, a_t) \geq Q^e_{\pi_{old}}(s_t, a_t)$ for all $(s_t, a_t)$.

Theorem 3 (Evidential Policy Iteration). Alternating between evidential policy evaluation and evidential policy improvement for any policy $\pi \in \Pi$ converges to an optimal evidential policy $\pi^*$ such that $Q^e_{\pi^*}(s_t, a_t) \geq Q^e_\pi(s_t, a_t)$ for all $(s_t, a_t)$.

Table 2: Performance of Recommendation (average P@N and nDCG@N). Columns: Category, Model, and P@5 and nDCG@5 on MovieLens-1M, MovieLens-100K, Netflix, and Yahoo! Music.

Remark: The novel use of vacuity, which is an evidence-based second-order uncertainty, for exploration in RL can effectively identify uncertain and informative items (from a large item space) that are indicative of users' long-term interest. In particular, the proposed evidential reward encourages the RL agent to recommend items about which the model has the least knowledge (as indicated by a high vacuity). After collecting the user feedback, the RL agent can most effectively gain knowledge of the user preference to make better recommendations in the long run.
It should be noted that maximum entropy-based exploration, such as soft-actor-critic (SAC) (Haarnoja et al., 2018), may not reach an optimum policy. It has been shown that a high entropy may imply either high vacuity (lack of evidence) or high dissonance (conflict of strong evidence) (Shi et al., 2020). However, dissonance is not effective for exploration in RS due to its focus on confusing items mostly derived from the users' current interest. Lemma 2 shows that the evidential reward results in the evidential Q-value that is optimal for the policy improvement. We have also shown this experimentally in a qualitative study by demonstrating better recommendation performance than SAC-based exploration.

5. EXPERIMENTS

We conduct extensive experiments on four real-world datasets that contain explicit ratings: Movielens-1M, Movielens-100K, Netflix, and Yahoo! Music. For baseline comparisons, we use dynamic models: timeSVD++ (Koren, 2009), CKF (Gultekin & Paisley, 2014); sequential models: CASER (Tang & Wang, 2018), SASRec (Kang & McAuley, 2018), BERT4Rec (Sun et al., 2019); and RL-based models: ϵ-greedy (Zhao et al., 2013), DRN (Zheng et al., 2018), LIRD (Zhao et al., 2017), CoLin (Wu et al., 2016). We evaluate rewards based on available ground-truth ratings, which prevents the model from learning from simulated rewards for non-interacted items that may lead to ineffective recommendations. Further details about datasets, experimental setting, and baseline models are provided in Appendices D, E, and F.

Evaluation metrics. We use two standard metrics to measure the recommendation performance. We also use cumulative rewards for the RL-based methods.
• Precision@N: The fraction of the top-N items recommended in each step of the episode that are positive (rating > τ) to the user. We average over all test users to obtain the final precision.
• nDCG@N: Normalized Discounted Cumulative Gain (nDCG) measures ranking quality, considering the relevant items within the top-N of the ranking list in each step of the RL episode.
• Cumulative Reward: The average reward computed from the rewards of the top-N recommended items in each step of the RL episode.
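The two ranking metrics above can be sketched for a single recommendation step as follows. The gain definition is an assumption: the paper does not state the gain used for nDCG, so this sketch uses binary relevance (1 if rating > τ, else 0) and normalizes against the best ordering of the same candidate list. Function names are illustrative.

```python
import math


def precision_at_n(ranked_ratings, n, tau=3.0):
    """Fraction of the top-n recommended items with ground-truth rating > tau."""
    top = ranked_ratings[:n]
    return sum(1 for r in top if r > tau) / n


def dcg(gains):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg_at_n(ranked_ratings, n, tau=3.0):
    """nDCG@n with binary gains (assumption: gain = 1 if rating > tau)."""
    gains = [1.0 if r > tau else 0.0 for r in ranked_ratings]
    ideal_dcg = dcg(sorted(gains, reverse=True)[:n])
    if ideal_dcg == 0.0:
        return 0.0  # no relevant item available in this step
    return dcg(gains[:n]) / ideal_dcg


# Ground-truth ratings of recommended items, listed in recommended order.
step_ratings = [5, 2, 4, 1, 3]
print(precision_at_n(step_ratings, n=5))
print(ndcg_at_n(step_ratings, n=5))
```

Per-step values such as these would then be averaged over the steps of an episode and over all test users.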

5.1. RECOMMENDATION PERFORMANCE COMPARISON

Table 2 summarizes the recommendation performance of all models. The proposed model benefits from both the RNN module and the EAC module, so it provides better results on all datasets. The dynamic and sequential models achieve less ideal performance due to their focus on short-term user interest and inability to capture long-run or future preference. RL methods show a clear advantage due to their focus on maximizing expected long-term rewards. Thanks to the evidence-based uncertainty exploration, DERL achieves the best performance among all DL-based models. We further show the step-wise performance on both the precision@5 (P@5) and nDCG@5 metrics, considering the top-5 recommended items, on all datasets in Figure 3. We show the average precision and nDCG of the test users over each step after the model is fully trained to demonstrate the [...]

Under review as a conference paper at ICLR 2023

[...] 2). This is as expected due to the lack of user interactions. All the models start to improve after the initial stage. Dynamic models and sequential models still have poor performance compared to the RL-based methods. The proposed DERL model provides consistently better performance over the entire process. However, it has a smaller advantage at the beginning due to its strong focus on exploration. It is also worth noting that the difference between DERL and other models appears smaller on the plots because of the wide range of the y-values (0.2 to 1 in most cases) needed to cover the entire recommendation cycle.

5.2. ABLATION STUDY

First, we provide a comparison of the average cumulative reward to demonstrate how the proposed model achieves a higher cumulative reward than other RL-based models. Second, we analyze the impact of the hyperparameter λ that balances exploitation and exploration in the proposed model.

Cumulative reward for RL-based methods. We consider cumulative reward to measure test users' recommendation performance. We plot the average cumulative rewards for the proposed DERL model and baseline RL models in Figure 4.

As can be seen from Figure 5, dynamically adjusting λ achieves consistently better performance on all datasets. This supports the intuition that in the early steps, a large λ allows the model to conduct sufficient exploration. Once the model gains sufficient knowledge from the environment and is able to make accurate recommendations, reducing λ allows the model to exploit its knowledge to provide effective recommendations.

We conduct a qualitative analysis to show the advantage of using evidence-based uncertainty (vacuity) for RL exploration over two other competitive baselines: entropy-guided exploration as in the soft actor-critic (SAC) (Haarnoja et al., 2018) and a contextual bandit algorithm (CoLin) (Wu et al., 2016). We select a random test user (ID: 4967) from the Movielens-1M dataset and show the genre counts, positive counts of recommended items, and cumulative reward in each step in Figure 6. In the initial few steps, SAC has more positive counts but is less effective in exploration. DERL is able to explore more informative items (evidenced by more diverse genres), where different genres are denoted as Romance (R), Drama (D), Comedy (C), Thriller (T), and Others (O) in the left plot. In later steps, DERL consistently outperforms both competitive models due to its better utilization of evidence-based uncertainty to discover more informative future items that could reflect the user's long-term preference.
Table 3 shows the predicted vacuity for each recommended item. The overall higher vacuity scores indicate that DERL recommends more items that are currently unknown to the users, which is instrumental in exploring their long-term interests. It also explores more diverse genres (5 vs. 3) of items than the baselines. The results show a consistent trend: DERL focuses on exploration in the earlier phase by recommending more diverse items (left plot), while both competitive methods achieve better performance (more positive counts) during this phase (middle plot). Due to better exploration, DERL eventually achieves a much better cumulative reward in the later phase (right plot). More specifically, as Table 3 shows, in early steps DERL identifies four out of five important items (genre types in bold) that are recommended based on (estimated) future long-term interest. In contrast, SAC finds two important movies based on genre and CoLin finds three. SAC selects three comedy movies, which only reflect the user's current preference from the current and past interactions. Similarly, CoLin is also more focused on comedy rather than exploring diverse movies.

6. CONCLUSION

In this paper, we propose a novel deep evidential reinforcement learning framework for dynamic recommendations. The proposed DERL framework learns a more effective recommendation policy by integrating both the expected reward and evidence-based uncertainty. DERL integrates a customized RNN to generate the current state that accurately captures user interest and an evidential-actor-critic module to perform evidence-based exploration to optimize the policy by improving an evidential Q-value. We theoretically prove the convergence behavior of the proposed evidential policy iteration strategy. Experimental results on real-world data and comparison with state-of-the-art competitive models demonstrate the effectiveness of the proposed model.

Appendix

Organization of Appendix. In this Appendix, we first summarize the major mathematical notations in Appendix A. We then present the proofs of lemmas and theorems in Appendix B. We show the detailed DERL algorithm in Appendix C. We present the details of the datasets in Appendix D, the experimental setting in Appendix E, and the baseline models in Appendix F. Further, we include some additional comparison results in Appendix G. The link to the source code is given in Appendix H.

A SUMMARY OF NOTATIONS

B PROOFS OF THEORETICAL RESULTS

In this section, we provide proofs of all lemmas and the theorem.

B.1 PROOF OF LEMMA 1

Given the evidential reward defined as

$$r^e_\pi(s_t, a_t) = \frac{1}{N} \sum_{i \in N_u} \Big[ (\text{rating}_{u,i} - \tau) + \lambda\, U_\pi(i \mid s_t, a_t) \Big]$$

the update rule for the evidential Q-value can be written as:

$$Q^e(s_t, a_t) = \mathbb{E}_\pi \left[ \sum_{t'=t}^{\infty} \gamma^{t'-t}\, r^e_\pi(s_{t'}, a_{t'}) \right] = r^e_\pi(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}, a_{t+1}} \big[ Q^e(s_{t+1}, a_{t+1}) \big]$$

Then, based on the evaluation convergence rule (Sutton et al., 1999) with a finite action space, it is guaranteed that the Q-value will converge to the evidential Q-value of policy $\pi$.
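The convergence claimed by Lemma 1 can be checked numerically on a toy problem. The sketch below repeatedly applies the backup Q ← r + γ P Q for a fixed policy until the values stop changing. The two state-action pairs, their rewards, and the transition matrix are made-up values chosen so that the fixed point can be verified by hand; none of them come from the paper.

```python
def bellman_backup(q, rewards, transitions, gamma):
    """One backup: Q(i) <- r(i) + gamma * sum_j P(i, j) * Q(j).

    Index i enumerates (state, action) pairs; P already folds in the fixed
    policy, so row i gives the distribution over the next (state, action) pair.
    """
    return [rewards[i] + gamma * sum(p * q[j] for j, p in enumerate(row))
            for i, row in enumerate(transitions)]


def evaluate_policy(rewards, transitions, gamma, tol=1e-10, max_iters=10000):
    """Iterate the backup from Q = 0 until the largest change is below tol."""
    q = [0.0] * len(rewards)
    for iteration in range(1, max_iters + 1):
        new_q = bellman_backup(q, rewards, transitions, gamma)
        gap = max(abs(a - b) for a, b in zip(new_q, q))
        q = new_q
        if gap < tol:
            return q, iteration
    return q, max_iters


# Toy problem: pair 0 earns reward 1 and stays or moves to pair 1 with equal
# probability; pair 1 is absorbing with zero reward.
rewards = [1.0, 0.0]
transitions = [[0.5, 0.5],
               [0.0, 1.0]]
q_values, iterations = evaluate_policy(rewards, transitions, gamma=0.9)
print(q_values, iterations)
```

Solving the fixed-point equations by hand gives Q(1) = 0 and Q(0) = 1 / (1 - 0.45), which the iteration approaches geometrically because the backup is a contraction for γ < 1.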

B.2 PROOF OF LEMMA 2

The policy can be updated towards the new Q-value function. Consider the updated policy $\pi_{new}$ as the optimizer of the maximization problem:

$$\pi_{new} = \arg\max_{\pi'} J_\pi(\phi) = \arg\max_{\pi'} \mathbb{E}_{s_t \sim D,\, a_t \sim \pi'} \big[ Q^e_{\pi'}(s_t, a_t) \big]$$

Denote the old policy as $\pi_{old}$. Using the update rule specified in Eq. (8) with a sufficiently small step size, we get an updated policy $\pi_{new}$ that satisfies E at∼πnew [Q e π old (

B.3 PROOF OF THEOREM 1

Let $\pi_i$ denote the policy at iteration $i$. We have already shown that the sequence $Q^e_{\pi_i}(s_t, a_t)$ is monotonically increasing. Since $Q^e_\pi(s_t, a_t)$ is bounded above, the sequence converges to some $\pi^*$. At convergence, it must be the case that $J_{\pi^*}(\pi^*(\cdot \mid s_t)) \le J_{\pi^*}(\pi(\cdot \mid s_t))$ for $\pi \ne \pi^*$. Based on Lemma 2, we have $Q^e_{\pi^*}(s_t, a_t) > Q^e_\pi(s_t, a_t)$ for all $(s_t, a_t)$. In other words, the evidential Q-value of any other policy $\pi$ is lower than that of the converged policy $\pi^*$. Therefore, convergence to an optimal policy $\pi^*$ is guaranteed such that:

$$Q^e_{\pi^*}(s_t, a_t) \ge Q^e_\pi(s_t, a_t)$$

C DEEP EVIDENTIAL REINFORCEMENT LEARNING ALGORITHM

D DESCRIPTION OF THE DATASETS

We evaluated DERL on four public benchmark datasets that contain explicit ratings:
• Movielens-1M (https://grouplens.org/datasets/movielens/1M/): This dataset includes 1M explicit feedback entries (ratings) made by 6,040 anonymous users on 3,900 distinct movies from 04/2000 to 02/2003.
• Movielens-100K (https://grouplens.org/datasets/movielens/100k/): This dataset contains 100,000 explicit ratings on a scale of 1-5 from 943 users on 1,682 movies. Each user rated at least 20 movies from September 19, 1997 through April 22, 1998.
• Netflix (Bennett et al., 2007): This dataset has around 100 million interactions, 480,000 users, and nearly 18,000 movies rated between 1998 and 2005. We pre-processed the dataset and selected 6,042 users with user-item interactions from 01/2002 to 12/2005.
• Yahoo! Music rating (Dror et al., 2012): The dataset includes approximately 300,000 user-supplied ratings, and exactly 54,000 ratings for randomly selected songs. The ratings for randomly selected songs were collected between August 22, 2006 and September 7, 2006.

E EXPERIMENTAL SETTING

We consider each user an episode for the RL setting and split users into 70% training users and 30% test users. For each user, we select the first $M = 10$ interacted items to represent an initial state $s_0$. For each subsequent state, we take the previous state representation and concatenate five item embeddings from the sliding window with five item embeddings from the RL agent, then pass them through the RNN module to generate the current state $s_t$. Then, the action network generates the mean and covariance of a Gaussian policy from which an action is sampled. This action is further passed to the evidence network, which utilizes the embeddings of user-interacted items to produce the corresponding evidence for each item. We use a classification setting, where explicit ratings are used as class labels. With that evidence, we compute the evidential score by evaluating the evidence-based rating and uncertainty to rank those items, which provides a list of top-N final recommendations. We then evaluate the evidential reward. We put 10 interactions into each period and set $\tau = 3$.
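The ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the standard evidential deep learning construction (Dirichlet parameters equal to evidence plus one, vacuity equal to the number of classes divided by the Dirichlet strength) and assumes the evidential score is the expected rating plus λ times the vacuity. The function names, the toy evidence values, and the exact form of the score are hypothetical and may differ from the DERL implementation.

```python
import numpy as np

def evidential_scores(evidence, lam=0.5):
    """Score items from per-class evidence (classes = rating levels 1..K).

    evidence: (num_items, K) non-negative evidence for each rating class.
    Returns (score, expected_rating, vacuity), each of shape (num_items,).
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.shape[1]
    alpha = evidence + 1.0                          # Dirichlet parameters
    strength = alpha.sum(axis=1)                    # Dirichlet strength S
    probs = alpha / strength[:, None]               # expected class probabilities
    expected_rating = probs @ np.arange(1, K + 1)   # evidence-based rating
    vacuity = K / strength                          # high when evidence is scarce
    return expected_rating + lam * vacuity, expected_rating, vacuity

def top_n(scores, n):
    """Indices of the n highest-scoring items, best first."""
    return np.argsort(-scores, kind="stable")[:n]

# Toy evidence for three items over rating classes 1..5 (assumed values).
evidence = np.array([
    [0, 0, 1, 8, 20],   # well-observed item, mostly rated 5
    [0, 0, 0, 1, 1],    # barely observed item, leaning high
    [15, 6, 2, 0, 0],   # well-observed item, mostly rated low
])
scores, ratings, vacuity = evidential_scores(evidence, lam=0.5)
ranking = top_n(scores, n=2)
print(ranking, np.round(scores, 3), np.round(vacuity, 3))
```

In this toy case the barely observed item receives the largest vacuity bonus, so it is ranked ahead of the well-observed, low-rated item even though little is known about it.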

F COMPARISON BASELINES

We compare with dynamic, sequential, and reinforcement learning models:
• Dynamic models include the standard dynamic matrix factorization model timeSVD++ (Koren, 2009) as the time-evolving latent factorization model and collaborative Kalman filtering (CKF) (Gultekin & Paisley, 2014).
• Sequential models include Sequential Recommendation via Convolutional Sequence Embedding (Caser) (Tang & Wang, 2018), the attention-based sequential recommendation model (SASRec) (Kang & McAuley, 2018), and sequential recommendation with a bidirectional encoder (BERT4Rec) (Sun et al., 2019).
• Reinforcement learning-based models include ϵ-greedy (Zhao et al., 2013), deep Q-network based news recommendation (DRN) (Zheng et al., 2018), actor-critic based list-wise recommendation (LIRD) (Zhao et al., 2017), and the contextual bandit based method CoLin (Wu et al., 2016).

G ADDITIONAL EXPERIMENTS AND COMPARISON RESULTS

In this section, we present additional experiments, with a focus on comparing different types of baselines.

G.1 COMPARISON WITH CONTEXTUAL BANDIT BASED METHODS

In the main paper, we compared with a state-of-the-art collaborative contextual bandit based recommendation method, CoLin (Wu et al., 2016). Here, we include two additional bandit based models (https://github.com/HCDM/BanditLib):

G.2 COMPARISON WITH OTHER BASELINES

In this section, we include two recent models for comparison: DeepFM (Guo et al., 2017) and DCNv2 (Wang et al., 2021). DeepFM integrates a traditional factorization machine with deep learning to learn low- and high-order feature interactions. Similarly, DCNv2 learns feature interactions more expressively while remaining cost-efficient. We also include one classical RL-based method called REINFORCE (Chen et al., 2019a), which applies off-policy learning to handle data bias. Table 5 and Figure 8 report the test P@5 and nDCG@5 of DERL and these three baselines on two datasets, Movielens-1M and Movielens-100K. Although the two deep learning-based recommender models achieve reasonable recommendation performance, they lack a mechanism for handling users' temporal preferences and hence perform worse than the proposed DERL method. Furthermore, REINFORCE has limited exploration power and cannot effectively capture long-term user preferences, hence its performance is also lower than that of DERL. We further add a static recommendation model, LightGCN (He et al., 2020), which computes user and item embeddings via a linear aggregation of their neighbors. Its performance is lower than that of DERL on both datasets because it is ineffective in handling highly sparse user interactions in a dynamic setting. In addition, we also compare with two recent RL-based methods: DRR (Fu et al., 2021) and DHRC (Liu et al., 2020). The significant advantage over these baselines, which stems from more effective exploration, further confirms the outstanding recommendation performance achieved by DERL.

G.3 IMPACT OF VACUITY FOR EXPLORATION

Uncertainty is commonly used to support the exploration of RL agents in different applications. Evidential deep learning leverages second-order uncertainty (i.e., vacuity) to perform effective exploration. The proposed evidential reward encourages the RL agent to recommend items about which the model has the least knowledge (as indicated by a high vacuity). After collecting the user feedback, the RL agent can most effectively gain knowledge of the user's preferences to make better recommendations in the long run. The effectiveness of vacuity-guided exploration has been demonstrated in both the motivating example in the introduction and our empirical evaluations. We further investigate the effect of vacuity in our proposed model by comparing DERL with an alternative design without vacuity. Furthermore, we also compare with exploration using first-order uncertainty, which is employed by soft actor-critic (SAC). We show the comparison results on two datasets in Figure 9. It can be seen that without uncertainty-guided exploration, the model collects the least cumulative reward in the long run. SAC utilizes entropy-based exploration and achieves a better cumulative reward than the variant without uncertainty-guided exploration. This provides evidence that the role of exploration is crucial in RL-based recommendation. However, SAC performs worse than the vacuity-based DERL method. This is because vacuity-guided exploration allows our model to focus its exploration on the most informative items that help the model gain the most knowledge to form an optimal policy. The advantage over entropy-based exploration is also consistent with our earlier discussion in Section 4.4 of the main paper.
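The difference between the two kinds of uncertainty can be seen in a small worked example. The sketch below assumes the usual Dirichlet-based definition of vacuity (number of classes divided by the Dirichlet strength) and uses the entropy of the expected rating distribution as a simple stand-in for first-order uncertainty. SAC itself regularizes the entropy of the policy, which is not modeled here, and the two Dirichlet parameter vectors are made-up values.

```python
import numpy as np

def vacuity(alpha):
    """Second-order uncertainty K / S of a Dirichlet with parameters alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha.size / alpha.sum()

def predictive_entropy(alpha):
    """Shannon entropy of the expected class probabilities alpha / S."""
    alpha = np.asarray(alpha, dtype=float)
    p = alpha / alpha.sum()
    return float(-(p * np.log(p)).sum())

# Two items with the same expected rating distribution (uniform over 5 classes).
little_evidence = np.array([2, 2, 2, 2, 2])      # few, conflicting observations
much_evidence = np.array([40, 40, 40, 40, 40])   # many, conflicting observations

for name, alpha in [("little", little_evidence), ("much", much_evidence)]:
    print(name, round(predictive_entropy(alpha), 4), round(vacuity(alpha), 4))
```

Both items have identical entropy, so an entropy-driven explorer cannot tell them apart, whereas vacuity is twenty times larger for the item with little evidence. This is the sense in which vacuity points exploration at items the model knows least about.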

H LINK FOR THE SOURCE CODE

https://anonymous.4open.science/r/EvidentialRecommendation-2BDE/README.md



https://grouplens.org/datasets/movielens/1M/
https://grouplens.org/datasets/movielens/100k/
https://github.com/HCDM/BanditLib



Figure 1: Different recommendation behavior between an existing RL model and DERL

Figure 2: Overview of the DERL framework

Figure 3: Performance comparison in each time step: (a)-(d): P@5; (e)-(h): nDCG@5

to show their average reward over different training epochs in all four datasets. As can be seen, the cumulative rewards for DERL and the RL-based models in the initial epochs are quite close. But in later epochs, DERL clearly outperforms the other baselines. This is because the model explores more effectively during the training process to enhance its knowledge. Further, the reward gains for those baselines are largely similar, whereas DERL shows a significant improvement over them.

Figure 6: Genre count, positive count, and cumulative reward for a given user

Impact of hyperparameter (λ). The hyperparameter (λ) plays a critical role in recommending the top-N items and generating the evidential reward. We test three different settings: λ = 0.1, λ = 0.5, and gradually reducing λ from 0.5 to 0.1. As can be seen from Figure 5, dynamically adjusting λ achieves consistently better performance on all datasets. This supports the intuition that in the early steps, a large λ allows the model to conduct sufficient exploration. Once the model gains sufficient knowledge from the environment and is able to make accurate recommendations, reducing λ will allow the model to exploit its knowledge to provide effective recommendations.

Figure 8: Performance comparison in each time step

Figure 7: Comparison between bandit-based models and ours in four different datasets.

Figure 9: Comparison of exploration strategies in Movielens-1M (left) and Movielens-100K (right).

Examples of recommended movies

• State space ($S$): A state $s_t = \text{RNN}(\cdot \mid s_{t-1}, u_t) \in S$ is generated by a customized RNN that utilizes the previous state $s_{t-1}$ and the current user embedding $u_t$, which is generated from the concatenation of $M$ recently interacted items provided by a sliding window (see details later) and an RL agent.
• Action space ($A$): An action $a_t \in A$ is represented as a continuous parameter vector that recommends top-N items for a user based on the current state $s_t$ at time $t$.
• Transition probability ($p$): The transition probability $p(s_{t+1} \mid s_t, a_t)$ quantifies the probability of moving from state $s_t$ to $s_{t+1}$ with an action $a_t$.
• Reward ($r$): The environment provides an immediate reward as feedback based on the items recommended (actions $a_t$) to the user in state $s_t$.
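The state update in the first bullet can be sketched with a plain Elman-style recurrence. This is only a stand-in: the customized RNN used by DERL is not specified here, and the class name `SimpleStateRNN`, the embedding sizes, and the random weights are hypothetical. The split of the input into five sliding-window items and five agent-recommended items follows the experimental setting in Appendix E.

```python
import numpy as np

class SimpleStateRNN:
    """Elman-style recurrence s_t = tanh(W_s s_{t-1} + W_u u_t + b)."""

    def __init__(self, state_dim, item_dim, num_items=10, seed=0):
        rng = np.random.default_rng(seed)
        input_dim = item_dim * num_items
        self.W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.W_u = rng.normal(scale=0.1, size=(state_dim, input_dim))
        self.b = np.zeros(state_dim)

    def step(self, prev_state, window_items, agent_items):
        # u_t: concatenation of sliding-window and agent-recommended embeddings
        u_t = np.concatenate([window_items.ravel(), agent_items.ravel()])
        return np.tanh(self.W_s @ prev_state + self.W_u @ u_t + self.b)

rng = np.random.default_rng(1)
rnn = SimpleStateRNN(state_dim=8, item_dim=4)
s_prev = np.zeros(8)                 # assumed all-zero starting state
window = rng.normal(size=(5, 4))     # five items from the sliding window
agent = rng.normal(size=(5, 4))      # five items proposed by the RL agent
s_t = rnn.step(s_prev, window, agent)
print(s_t.shape)
```

Calling `step` again with `s_t` as the previous state and a new pair of item batches yields the next state, which is how the historical information is carried forward in this simplified picture.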

Recommend movies for UserID: 4967

Summary of Notations

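$$\mathbb{E}_{a_t \sim \pi_{\text{new}}}\left[Q^e_{\pi_{\text{old}}}(s_t, a_t)\right] \ge \mathbb{E}_{a_t \sim \pi_{\text{old}}}\left[Q^e_{\pi_{\text{old}}}(s_t, a_t)\right] \qquad (15)$$

Given Eq. (15), we have the following inequality:

$$\begin{aligned} Q^e_{\pi_{\text{old}}}(s_t, a_t) &\le r^e(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1},\, a_{t+1} \sim \pi_{\text{new}}}\left[Q^e_{\pi_{\text{old}}}(s_{t+1}, a_{t+1})\right] \\ &\le r^e(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1},\, a_{t+1} \sim \pi_{\text{new}}}\left[r^e(s_{t+1}, a_{t+1})\right] \end{aligned}$$

where $r^e(s_t, a_t)$ is an evidential reward in step $t$. Therefore, we show that the new policy $\pi_{\text{new}}$ ensures $Q^e_{\pi_{\text{new}}}(s_t, a_t) \ge Q^e_{\pi_{\text{old}}}(s_t, a_t)$ for all $(s_t, a_t)$.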

Add $(u_t, s_{t-1}, r^e_\pi(s_t, a_t), U_\pi(\cdot \mid s_t, a_t), \text{done})$ into replay buffer

Comparison of Recommendation Performance (average P@N and nDCG@N)

