GRAPH NEURAL BANDITS

Abstract

Contextual bandits aim to choose the arm with the highest reward from a set of candidates based on their contextual information, and various bandit algorithms have been applied to personalized recommendation because they address the exploitation-exploration dilemma. Motivated by online recommendation scenarios, in this paper we propose a framework named Graph Neural Bandits (GNB) that leverages the collaborative nature among users through graph neural networks (GNNs). Instead of estimating rigid user clusters, we model the "fine-grained" collaborative effects through estimated user graphs, built separately for exploitation and for exploration. Then, to refine the recommendation strategy, we apply separate GNN-based models to the estimated user graphs for exploitation and adaptive exploration. We provide theoretical analysis and experimental results on multiple real data sets, in comparison with state-of-the-art baselines, to demonstrate the effectiveness of the proposed framework.

1. INTRODUCTION

Contextual bandits are a specific type of multi-armed bandit problem in which additional contextual information (contexts) about the arms is available at each round, and the learner refines its selection strategy based on the received arm contexts and rewards. Various contextual bandit algorithms have been applied to real-world tasks such as online content recommendation and advertising (Li et al., 2010; Wu et al., 2016) and clinical trials (Durand et al., 2018; Villar et al., 2015). Meanwhile, collaborative effects among users provide an opportunity to design better recommendation strategies, since the target user's preference can be inferred from other similar users. Such effects have been studied by many bandit works (Gentile et al., 2014; Li et al., 2019; Gentile et al., 2017; Li et al., 2016; Ban & He, 2021). Different from conventional collaborative filtering methods (He et al., 2017; Wang et al., 2019), bandit-based approaches focus on more dynamic environments (such as news and short-video platforms) and on the exploitation-exploration dilemma inherent in recommendation decisions.

Existing works on clustering of bandits (Gentile et al., 2014; Li et al., 2019; Gentile et al., 2017; Li et al., 2016; Ban & He, 2021; Ban et al., 2022a) model user correlations (collaborative effects) by clustering users into rigid groups and assigning each group an estimator that learns the assumed reward function, combined with an Upper Confidence Bound (UCB) strategy for exploration. However, these works only consider "coarse-grained" user correlations. To be specific, they assume that users from the same group share identical preferences, i.e., the users from the same group are compelled to make equal contributions to the final decision (arm selection) for the target user.
Such a formulation of user correlations ("coarse-grained" collaborative effects) evidently fails to comply with real-world application scenarios, since users within the same group tend to have similar but subtly different preferences rather than completely identical tastes. Therefore, given a target user, it is more practical to assume that the remaining users impose different levels of (collaborative) effects on this user.

Motivated by these limitations of existing works, in this paper we propose a novel framework, named Graph Neural Bandits (GNB), to formulate the "fine-grained" collaborative effects, where the correlation of each user pair is preserved by user graphs. Given a target user, other users are allowed to make different contributions to the final decision based on the strength of their correlation with the target user, which corresponds to the "fine-grained" collaborative effects. In particular, GNB constructs two kinds of user graphs with distinct purposes, called "user exploitation graphs" and "user exploration graphs". We then apply two separate graph neural network (GNN) models to these two kinds of user graphs, to incorporate the collaborative effects for both exploitation and exploration in the final decision-making process.

Our main contributions can be summarized as follows:

1. Different from existing works that only formulate the "coarse-grained" collaborative effects by neglecting the divergence within user groups, we introduce a new problem setting to model the "fine-grained" user collaborative effects via user graphs. In our setting, the pair-wise user correlations are preserved so that users contribute differently to the decision-making.
2. We propose a framework named GNB, which builds two kinds of user graphs with two different purposes, i.e., exploitation and adaptive exploration. GNB then uses GNN-based models for a refined arm selection strategy by leveraging the user correlations encoded in these two kinds of user graphs.
3. Under standard assumptions, we provide a theoretical analysis showing that GNB achieves a regret upper bound of $\mathcal{O}(\sqrt{T \log(Tn)})$, where $T$ is the number of rounds and $n$ is the number of users. This bound is sharper than those of existing related works.
4. Extensive experiments comparing GNB with nine state-of-the-art algorithms are conducted on various real data sets, and they demonstrate the effectiveness of our proposed method.

After introducing the problem definition in Section 2, we provide the details of our proposed framework in Section 3. We then present the theoretical analysis in Section 4 and the experiments in Section 5. Finally, we conclude the paper in Section 6. Due to the page limit, we defer the review of related work to Section A of the Appendix.

2. GRAPH NEURAL BANDITS: PROBLEM DEFINITION AND NOTATION

Suppose there are a total of $n$ users with the user set $U = \{1, \dots, n\}$. At each time step $t \in [T]$, the learner receives a user $u_t \in U$ to serve, along with a set of candidate arms $X_t = \{x_{i,t}\}_{i \in [a]}$ for selection. The cardinality of this arm set is $|X_t| = a$, and each arm is described by a $d$-dimensional context vector $x_{i,t} \in \mathbb{R}^d$ with $\|x_{i,t}\|_2 = 1$. Meanwhile, each arm $x_{i,t} \in X_t$ is associated with a reward $r_{i,t}$. As the user correlation is one important factor in determining the reward, we define the following reward function:

$$r_{i,t} = h(x_{i,t}, u_t, G^{(1),*}_{i,t}) + \epsilon_{i,t},$$

where $h(\cdot)$ is the unknown reward mapping function, and $\epsilon_{i,t}$ is zero-mean noise such that $\mathbb{E}[r_{i,t}] = h(x_{i,t}, u_t, G^{(1),*}_{i,t})$. Here, $G^{(1),*}_{i,t} = (U, E, W^{(1),*}_{i,t})$ is the unknown user graph induced by arm $x_{i,t}$, which encodes the "fine-grained" user correlations in terms of expected rewards. In graph $G^{(1),*}_{i,t}$, each user $u \in U$ corresponds to a graph node; $E = \{e(u, u')\}_{u, u' \in U}$ is the set of edges, and the set $W^{(1),*}_{i,t} = \{w^{(1),*}_{i,t}(u, u')\}_{u, u' \in U}$ stores the weight of each edge in $E$.

In real-world application scenarios, users sharing the same preference for certain arms (e.g., sports news) may have distinct tastes over other arms (e.g., political news). Thus, we allow each arm $x_{i,t} \in X_t$ to induce a different user graph $G^{(1),*}_{i,t}$. Motivated by various real applications (e.g., online recommendation with normalized ratings), we consider bounded rewards $r_{i,t} \in [0, 1]$ in this paper, which is standard in existing works (e.g., Gentile et al. (2014; 2017); Ban & He (2021); Ban et al. (2022a)). Note that as long as $r_{i,t} \in [0, 1]$, we do not impose any distributional assumption (e.g., sub-Gaussianity) on the noise term $\epsilon_{i,t}$.
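
To make the interaction protocol concrete, the following is a minimal simulation sketch of one round of this setting. The paper treats both the reward mapping and the arm-induced user graph as unknown, so everything that generates them here is an assumption made purely for illustration: the logistic `base_preference`, the exponential-kernel `arm_graph`, the weighted-average `expected_reward` standing in for the reward mapping, and the Bernoulli reward draw are all hypothetical choices, not the authors' model.

```python
import math
import random


def unit_vector(rng, dim):
    """Draw a random context vector and scale it to unit Euclidean norm."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def base_preference(theta_u, x):
    """Hypothetical per-user preference in (0, 1): a logistic score."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(theta_u, x))))


def arm_graph(prefs, bandwidth=0.5):
    """Hypothetical arm-induced weights: closer preferences, heavier edge."""
    return [[math.exp(-abs(p - q) / bandwidth) for q in prefs] for p in prefs]


def expected_reward(weights, prefs, user):
    """Hypothetical reward mapping: weighted average of all users' preferences."""
    row = weights[user]
    return sum(w * p for w, p in zip(row, prefs)) / sum(row)


def play_round(rng, thetas, n_arms):
    """One round: receive a user and n_arms candidate arms, then draw rewards."""
    dim = len(thetas[0])
    user = rng.randrange(len(thetas))
    arms = [unit_vector(rng, dim) for _ in range(n_arms)]
    means, rewards = [], []
    for x in arms:
        prefs = [base_preference(th, x) for th in thetas]
        # Each arm induces its own user graph, so weights are rebuilt per arm.
        mean = expected_reward(arm_graph(prefs), prefs, user)
        means.append(mean)
        # Bernoulli draw: the reward lies in [0, 1] and the noise has zero mean.
        rewards.append(1.0 if rng.random() < mean else 0.0)
    return user, arms, means, rewards


rng = random.Random(0)
thetas = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]  # 5 users, d = 4
user, arms, means, rewards = play_round(rng, thetas, n_arms=3)
```

The Bernoulli draw is one simple way to satisfy the bounded-reward condition without assuming a sub-Gaussian noise model; any other distribution supported on $[0, 1]$ with the same mean would fit the setting equally well.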

Comparison with Existing Problem Definitions. The problem definition of existing user clustering works (e.g., Gentile et al. (2014); Li et al. (2019); Gentile et al. (2017); Ban & He (2021); Ban et al. (2022a)) can only formulate "coarse-grained" user correlations. In their settings, given a user group $N \subseteq U$, all the users in $N$ are forced to share the same reward function given an arm $x_{i,t}$, i.e., $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] = h_N(x_{i,t}), \forall u \in N$. In contrast, our definition of the reward function enables us to model the pair-wise "fine-grained" user correlations by introducing two additional factors, $u$ and $G^{(1),*}_{i,t}$. With our formulation, each user is allowed to produce a different reward when facing the same arm, i.e., $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] = h(x_{i,t}, u, G^{(1),*}_{i,t}), \forall u \in N$. Here, for different users $u$, the corresponding expected reward $h(x_{i,t}, u, G^{(1),*}_{i,t})$ can be different. Therefore, our definition of the reward function is more generic, and it readily recovers the setting of existing user clustering algorithms (with "coarse-grained" user correlations) by letting each user group form an isolated sub-graph in $G^{(1),*}_{i,t}$, with no connections across different sub-graphs.
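
The reduction to the user clustering setting can be illustrated with a small sketch. It assumes, hypothetically, that the reward mapping is a weighted average of per-user preferences over graph neighbours; the function names, the preference values, and the 0/1 edge weights are illustrative assumptions and are not taken from the paper. Under that assumption, a graph made of isolated, uniformly weighted sub-graphs gives every user in a group the same expected reward.

```python
def expected_reward(weights, prefs, user):
    """Hypothetical reward mapping: weighted average of all users' preferences."""
    row = weights[user]
    return sum(w * p for w, p in zip(row, prefs)) / sum(row)


def cluster_graph(groups, n_users):
    """Weight 1 inside a group and 0 across groups: isolated sub-graphs."""
    weights = [[0.0] * n_users for _ in range(n_users)]
    for group in groups:
        for u in group:
            for v in group:
                weights[u][v] = 1.0
    return weights


prefs = [0.2, 0.4, 0.9, 0.7, 0.5]   # hypothetical per-user preferences for one arm
groups = [[0, 1], [2, 3, 4]]        # two rigid user groups
weights = cluster_graph(groups, len(prefs))

# Users in the same group see identical rows, so they get identical rewards.
rewards = [expected_reward(weights, prefs, u) for u in range(len(prefs))]
```

Here users 0 and 1 share one expected reward (the mean of their two preferences) and users 2, 3, and 4 share another, which mirrors the group-level reward function of the clustering setting. Replacing the 0/1 weights with pair-specific values would generally give each user a different expected reward.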

