GRAPH NEURAL BANDITS

Abstract

Contextual bandits aim to choose the optimal arm, i.e., the one with the highest reward, from a set of candidates based on their contextual information. Various bandit algorithms have been applied to personalized recommendation because of their ability to handle the exploitation-exploration dilemma. Motivated by online recommendation scenarios, we propose a framework named Graph Neural Bandits (GNB) that leverages the collaborative nature among users with graph neural networks (GNNs). Instead of estimating rigid user clusters, we model the "fine-grained" collaborative effects through estimated user graphs, separately for exploitation and for exploration. Then, to refine the recommendation strategy, we apply separate GNN-based models on the estimated user graphs for exploitation and adaptive exploration. We provide theoretical analysis and experimental results on multiple real data sets, in comparison with state-of-the-art baselines, to demonstrate the effectiveness of the proposed framework.

1. INTRODUCTION

Contextual bandits are a specific type of multi-armed bandit problem in which additional contextual information (contexts) about the arms is available in each round, and the learner refines its selection strategy based on the received arm contexts and rewards. Various contextual bandit algorithms have been applied in real-world tasks, such as online content recommendation and advertising (Li et al., 2010; Wu et al., 2016), and clinical trials (Durand et al., 2018; Villar et al., 2015). Meanwhile, collaborative effects among users offer the opportunity to design better recommendation strategies, since the target user's preference can be inferred from other similar users. Such effects have been studied by many bandit works (Gentile et al., 2014; Li et al., 2019; Gentile et al., 2017; Li et al., 2016; Ban & He, 2021). Different from conventional collaborative filtering methods (He et al., 2017; Wang et al., 2019), bandit-based approaches focus on more dynamic environments (such as news and short-video platforms) and on the exploitation-exploration dilemma inherent in recommendation decisions.

Existing works on clustering of bandits (Gentile et al., 2014; Li et al., 2019; Gentile et al., 2017; Li et al., 2016; Ban & He, 2021; Ban et al., 2022a) model the user correlations (collaborative effects) by clustering users into rigid groups and assigning each group an estimator that learns the assumed reward function, combined with an Upper Confidence Bound (UCB) strategy for exploration. However, these works only consider "coarse-grained" user correlations. Specifically, they assume that users from the same group share identical preferences, i.e., the users from the same group are compelled to make equal contributions to the final decision (arm selection) for the target user.
Such a formulation of user correlations ("coarse-grained" collaborative effects) fails to comply with real-world application scenarios, since users within the same group tend to have similar but subtly different preferences instead of completely identical tastes. Therefore, given a target user, it is more practical to assume that the rest of the users impose different levels of (collaborative) effects on this user.

Motivated by these limitations of existing works, we propose a novel framework, named Graph Neural Bandits (GNB), to formulate the "fine-grained" collaborative effects, where the correlation of each user pair is preserved by user graphs. Given a target user, other users are allowed to make different contributions to the final decision based on the strength of their correlation with the target user, which corresponds to the "fine-grained" collaborative effects. In particular, GNB constructs two kinds of user graphs with distinct purposes, called "user exploitation graphs" and "user exploration graphs". Then, we apply two separate graph neural network (GNN) models on these two kinds of user graphs, to incorporate the collaborative effects for both exploitation and exploration in the final decision-making process.

Our main contributions can be summarized as follows:

1. Different from existing works that only formulate the "coarse-grained" collaborative effects and neglect the divergence within user groups, we introduce a new problem setting to model the "fine-grained" user collaborative effects via user graphs. In our setting, the pair-wise user correlations are preserved and contribute differently to the decision-making.

2. We propose a framework named GNB, which builds two kinds of user graphs with two different purposes, i.e., exploitation and adaptive exploration. Then, GNB utilizes GNN-based models for a refined arm selection strategy by leveraging the user correlations encoded in these two kinds of user graphs.

3. Under standard assumptions, we provide a theoretical analysis showing that GNB achieves a regret upper bound of complexity $O(\sqrt{T \log(Tn)})$, where $T$ is the number of rounds and $n$ is the number of users. This bound is sharper than those of existing related works.

4. Extensive experiments comparing GNB with nine state-of-the-art algorithms are conducted on various real data sets, and they demonstrate the effectiveness of our proposed method.

After introducing the problem definition in Section 2, we provide the details of our proposed framework in Section 3. Then, we present the theoretical analysis in Section 4 and the experiments in Section 5. Finally, we conclude the paper in Section 6. Due to the page limit, we leave the review of related works to Section A of the Appendix.

2. GRAPH NEURAL BANDITS: PROBLEM DEFINITION AND NOTATION

Suppose there are a total of $n$ users with the user set $\mathcal{U} = \{1, \dots, n\}$. At each time step $t \in [T]$, the learner receives a user $u_t \in \mathcal{U}$ to serve, along with a set of candidate arms $\mathcal{X}_t = \{x_{i,t}\}_{i \in [a]}$ for selection. The cardinality of this arm set is $|\mathcal{X}_t| = a$, and each arm is described by a $d$-dimensional context vector $x_{i,t} \in \mathbb{R}^d$ with $\|x_{i,t}\|_2 = 1$. Meanwhile, each arm $x_{i,t} \in \mathcal{X}_t$ is associated with a reward $r_{i,t}$. As the user correlation is one important factor in determining the reward, we define the following reward function:

$$ r_{i,t} = h(x_{i,t}, u_t, G^{(1),*}_{i,t}) + \epsilon_{i,t} \tag{1} $$

where $h(\cdot)$ is the unknown reward mapping function, and $\epsilon_{i,t}$ stands for some zero-mean noise such that $\mathbb{E}[r_{i,t}] = h(x_{i,t}, u_t, G^{(1),*}_{i,t})$. Here, $G^{(1),*}_{i,t} = (\mathcal{U}, E, W^{(1),*}_{i,t})$ is the unknown user graph induced by arm $x_{i,t}$, which encodes the "fine-grained" user correlations in terms of expected rewards. In graph $G^{(1),*}_{i,t}$, each user $u \in \mathcal{U}$ corresponds to a graph node; meanwhile, $E = \{e(u, u')\}_{u, u' \in \mathcal{U}}$ refers to the set of edges, and the set $W^{(1),*}_{i,t} = \{w^{(1),*}_{i,t}(u, u')\}_{u, u' \in \mathcal{U}}$ stores the weights for each edge from $E$. Under real-world application scenarios, users sharing the same preference for certain arms (e.g., sports news) may have distinct tastes over other arms (e.g., political news). Thus, we allow each arm $x_{i,t} \in \mathcal{X}_t$ to induce a different user graph $G^{(1),*}_{i,t}$. Then, motivated by various real applications (e.g., online recommendation with normalized ratings), we consider $r_{i,t}$ to be bounded, $r_{i,t} \in [0, 1]$, which is standard in existing works (e.g., Gentile et al. (2014; 2017); Ban & He (2021); Ban et al. (2022a)). Note that as long as $r_{i,t} \in [0, 1]$, we do not impose any distribution assumption (e.g., sub-Gaussian distribution) on the noise term $\epsilon_{i,t}$.

Comparison with Existing Problem Definitions. The problem definition of existing user clustering works (e.g., Gentile et al. (2014); Li et al. (2019); Gentile et al. (2017); Ban & He (2021); Ban et al. (2022a)) can only formulate "coarse-grained" user correlations. In their settings, given a user group $\mathcal{N} \subseteq \mathcal{U}$, all the users in $\mathcal{N}$ are forced to share the same reward function given an arm $x_{i,t}$, i.e., $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] = h_{\mathcal{N}}(x_{i,t}), \forall u \in \mathcal{N}$. In contrast, our definition of the reward function enables us to model the pair-wise "fine-grained" user correlations by introducing another two important factors, $u$ and $G^{(1),*}_{i,t}$. With our formulation, each user is allowed to produce a different reward for the same arm, i.e., $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] = h(x_{i,t}, u, G^{(1),*}_{i,t}), \forall u \in \mathcal{N}$. Here, with different users $u$, the corresponding expected reward $h(x_{i,t}, u, G^{(1),*}_{i,t})$ can be different. Therefore, our definition of the reward function is more generic, and it readily covers existing user clustering algorithms (with "coarse-grained" user correlations) by letting each user group form an isolated sub-graph in $G^{(1),*}_{i,t}$ with no connections across different sub-graphs.

To bridge user collaborative effects with user preferences (rewards), we consider the following constraint for the reward function in Eq. 1. The intuition is that any two users with comparable user correlations would share similar tastes over the items with a high probability. For arm $x_{i,t}$, we consider the difference of expected rewards between any two users $u, u' \in \mathcal{U}$ to be governed by

$$ \big| h(x_{i,t}, u, G^{(1),*}_{i,t}) - h(x_{i,t}, u', G^{(1),*}_{i,t}) \big| \le \Psi\big( G^{(1),*}_{i,t}[u, :],\ G^{(1),*}_{i,t}[u', :] \big) \tag{2} $$

where $G^{(1),*}_{i,t}[u, :]$ stands for the adjacency matrix row of $G^{(1),*}_{i,t}$ that corresponds to user (node) $u$, and $\Psi : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ denotes an unknown mapping function. The reward function definition (Eq. 1) and the constraint (Eq. 2) motivate us to design the GNB framework, to be introduced in Section 3.

Since the true user graph $G^{(1),*}_{i,t}$ in Eq. 1 reflects user correlations in terms of the expected reward, it refers exactly to the user exploitation correlation, where users with high correlations tend to have similar expected rewards (Eq. 2). We proceed to give the formulation of $G^{(1),*}_{i,t}$ below.

Definition 1 (User Correlation for Exploitation). In round $t$, for any two users $u, u' \in \mathcal{U}$, their exploitation correlation score $w^{(1),*}_{i,t}(u, u')$ w.r.t. a candidate arm $x_{i,t} \in \mathcal{X}_t$ is defined as

$$ w^{(1),*}_{i,t}(u, u') = \Psi^{(1)}\big( \mathbb{E}[r_{i,t} \mid u, x_{i,t}],\ \mathbb{E}[r_{i,t} \mid u', x_{i,t}] \big) $$

where $\mathbb{E}[r_{i,t} \mid u, x_{i,t}]$, $i \in [a]$, is the expected reward in terms of the user-arm pair $(u, x_{i,t})$. Given two users $u, u' \in \mathcal{U}$, the function $\Psi^{(1)} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ maps their expected rewards to their user exploitation score $w^{(1),*}_{i,t}(u, u')$.

Given an arm $x_{i,t} \in \mathcal{X}_t$, the user correlation for exploitation measures the correlation of user preferences (i.e., expected rewards) between two users $u, u' \in \mathcal{U}$, and the corresponding exploitation score $w^{(1),*}_{i,t}(u, u')$ is the edge weight between these two users (nodes) in the exploitation graph $G^{(1),*}_{i,t}$. In this paper, we consider the mapping function $\Psi^{(1)}$ as prior knowledge; in practice, it can be a function such as the radial basis function (RBF) kernel or the normalized absolute difference.
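As a minimal sketch of Definition 1, the snippet below builds the edge weights of an exploitation graph from a list of per-user expected rewards for one arm. The function names, the bandwidth parameter, and the toy reward values are illustrative assumptions, and `1 - |a - b|` is only one possible reading of "normalized absolute difference" for rewards in [0, 1].

```python
import math

def rbf_weight(a: float, b: float, bandwidth: float = 1.0) -> float:
    """RBF-kernel choice for Psi^(1): close expected rewards give a weight near 1."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def normalized_abs_weight(a: float, b: float) -> float:
    """Assumed form of the normalized absolute difference for rewards in [0, 1]."""
    return 1.0 - abs(a - b)

def exploitation_weights(expected_rewards, psi=rbf_weight):
    """Return the n x n weight matrix with entry [u][v] = psi(reward_u, reward_v)."""
    return [[psi(ru, rv) for rv in expected_rewards] for ru in expected_rewards]

# Hypothetical expected rewards of three users for the same arm.
rewards = [0.9, 0.85, 0.1]
W = exploitation_weights(rewards)
W_abs = exploitation_weights(rewards, psi=normalized_abs_weight)
```

Users 0 and 1 have close expected rewards, so their edge weight is larger than the weight between users 0 and 2 under either choice of mapping.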

Modeling User Exploration Correlations with the User Exploration Graph $G^{(2),*}_{i,t}$. To deal with the exploration-exploitation dilemma, for each candidate arm $x_{i,t} \in \mathcal{X}_t$, we formulate the user exploration graph $G^{(2),*}_{i,t} = (\mathcal{U}, E, W^{(2),*}_{i,t})$ to model the user correlations in terms of the uncertainty of the reward estimation model, which are formulated by the set of edge weights $W^{(2),*}_{i,t} = \{w^{(2),*}_{i,t}(u, u')\}_{u, u' \in \mathcal{U}}$. Inspired by Ban et al. (2022b), before defining the second kind of user correlation (i.e., user exploration correlation), we first introduce the definition of the expected potential gain for reward estimation, which measures the prediction uncertainty of reward estimators. Note that our formulation is distinct from Ban et al. (2022b), since they only focus on the single-bandit setting with no user collaboration, where all users are treated identically.

Definition 2 (Expected Potential Gain). Given a user $u \in \mathcal{U}$ at time step $t$, a candidate arm $x_{i,t} \in \mathcal{X}_t$, $i \in [a]$, and a reward estimation function $f_u(\cdot)$ corresponding to user $u$, the expected potential gain for the reward estimation $f_u(x_{i,t})$ is defined as $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] - f_u(x_{i,t})$.

Here, the potential gain for reward estimation formulates the uncertainty of the model $f_u(\cdot)$ by measuring the difference between the expected reward $\mathbb{E}[r_{i,t} \mid u, x_{i,t}]$ and the prediction $f_u(x_{i,t})$. Next, we introduce the second kind of user correlation, i.e., user exploration correlation.

Definition 3 (User Correlation for Exploration). In round $t$, given two users $u, u' \in \mathcal{U}$ and an arm $x_{i,t} \in \mathcal{X}_t$, their underlying exploration correlation score $w^{(2),*}_{i,t}(u, u')$ is

$$ w^{(2),*}_{i,t}(u, u') = \Psi^{(2)}\big( \mathbb{E}[r_{i,t} \mid u, x_{i,t}] - f_u(x_{i,t}),\ \mathbb{E}[r_{i,t} \mid u', x_{i,t}] - f_{u'}(x_{i,t}) \big) $$

with $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] - f_u(x_{i,t})$, $i \in [a]$, being the potential gain for the user-arm pair $(u, x_{i,t})$.
Here, $f_u(\cdot)$ is the reward estimation function specific to user $u$, and $\Psi^{(2)} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is the mapping from user potential gains $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] - f_u(x_{i,t})$ to their exploration correlation score. For the arm $x_{i,t}$ and two users $u, u' \in \mathcal{U}$, the user exploration correlation score $w^{(2),*}_{i,t}(u, u')$ refers to the correlation of prediction uncertainty between the two user-specific functions $f_u(\cdot)$ and $f_{u'}(\cdot)$. Then, $w^{(2),*}_{i,t}(u, u')$ is taken as the edge weight between these two nodes (users) $u, u'$ in the true user exploration graph $G^{(2),*}_{i,t}$. Analogous to $\Psi^{(1)}$, we also consider the mapping function $\Psi^{(2)}$ to be given as prior knowledge. Intuitively, when the exploration score $w^{(2),*}_{i,t}(u, u')$ is high, we can apply similar exploration strategies for both users $u, u'$. For example, given arm $x_{i,t}$, if the reward estimation error (i.e., prediction uncertainty) is large for both $u$ and $u'$, we may want to explore the two user-arm pairs $(u, x_{i,t})$ and $(u', x_{i,t})$ more to gain additional knowledge.

Learning Objective. For the received user $u_t$ in round $t$, the learner is expected to recommend an arm $x_t \in \mathcal{X}_t$ (with reward $r_t$) in order to minimize the cumulative pseudo-regret

$$ R(T) = \mathbb{E}\Big[ \sum_{t=1}^{T} (r^*_t - r_t) \Big] $$

where $r^*_t$ is the reward of the optimal arm, with $\mathbb{E}[r^*_t \mid u_t, \mathcal{X}_t] = \max_{x_{i,t} \in \mathcal{X}_t} h(x_{i,t}, u_t, G^{(1),*}_{i,t})$.

Notation. Denoting by $\mathcal{T}_{u,t} \subseteq [t]$ the collection of time steps at which user $u \in \mathcal{U}$ is served up to round $t$, we use $\mathcal{P}_{u,t} = \{(x_\tau, r_\tau)\}_{\tau \in \mathcal{T}_{u,t}}$ to represent the collection of received arm-reward pairs associated with user $u$, and $T_{u,t} = |\mathcal{T}_{u,t}|$ refers to the number of rounds in which user $u$ has been served. Here, $x_\tau \in \mathcal{X}_\tau$ and $r_\tau \in \mathbb{R}$ refer to the chosen arm and the actual received reward in round $\tau \in \mathcal{T}_{u,t}$, respectively. Similarly, we use $\mathcal{P}_t = \{(x_\tau, r_\tau)\}_{\tau \in [t]}$ to denote all the past records (i.e., arm-reward pairs) up to round $t$. For any graph $G$, we denote by $A \in \mathbb{R}^{n \times n}$ its adjacency matrix (with added self-loops), and by $D \in \mathbb{R}^{n \times n}$ its degree matrix. Next, we introduce our proposed solution, the GNB framework.

3. GRAPH NEURAL BANDITS: PROPOSED FRAMEWORK

The workflow of our proposed GNB framework is illustrated in Figure 1, and it consists of four major components: (1) estimating the true user exploitation graph $G^{(1),*}$ and the true user exploration graph $G^{(2),*}$, with the estimates denoted by $G^{(1)}$ and $G^{(2)}$, to model the user correlations in terms of exploitation and exploration respectively; (2) applying the GNN models $f^{(1)}_{gnn}(\cdot)$, $f^{(2)}_{gnn}(\cdot)$ on the estimated user graphs $G^{(1)}$ and $G^{(2)}$, to collaboratively derive the estimated reward for exploitation and the potential gain for exploration; (3) selecting the arm $x_t$ based on the estimated reward and potential gain; and (4) training the parameters of the GNN models and user neural networks with gradient descent (GD).

3.1. USER GRAPH ESTIMATION WITH USER NETWORKS

Based on the definition of the unknown true user graphs $G^{(1),*}_{i,t}$, $G^{(2),*}_{i,t}$ w.r.t. arm $x_{i,t} \in \mathcal{X}_t$ (Definitions 1 and 3), we proceed to derive their estimations $G^{(1)}_{i,t}$, $G^{(2)}_{i,t}$, $i \in [a]$, with individual user networks $f^{(1)}_u$, $f^{(2)}_u$, $u \in \mathcal{U}$. With these two kinds of estimated user graphs, we can model the user behaviors under the exploitation setting and the exploration setting separately. Due to the page limit, pseudo-code summarizing the workflow is presented in Alg. 2 in Section D of the Appendix.

User Exploitation Network $f^{(1)}_u$. For each user $u \in \mathcal{U}$, we apply an exploitation network $f^{(1)}_u(\cdot)$ to learn user $u$'s preference for arm $x_{i,t}$, i.e., $\mathbb{E}[r_{i,t} \mid u, x_{i,t}]$. Here, $f^{(1)}_u(\cdot)$ is trained on the past records (arm contexts and rewards) $\mathcal{P}_{u,t}$ from user $u$, and the loss function is the quadratic loss between the predicted reward and the actual reward. Following Definition 1, we are then able to construct the estimated exploitation graph $G^{(1)}_{i,t}$: we set the edge weight between two user nodes $u, u'$ to $w^{(1)}_{i,t}(u, u') = \Psi^{(1)}\big( f^{(1)}_u(x_{i,t}),\ f^{(1)}_{u'}(x_{i,t}) \big)$, where $\Psi^{(1)}(\cdot, \cdot)$ is the mapping function mentioned in Definition 1 (line 11, Alg. 2).

User Exploration Network $f^{(2)}_u$. To estimate the potential gain (i.e., the uncertainty of the reward estimation) $\mathbb{E}[r_{i,t} \mid u, x_{i,t}] - f^{(1)}_u(x_{i,t})$, we adopt an additional user exploration network $f^{(2)}_u(\cdot)$ inspired by Ban et al. (2022b). As the confidence interval of reward estimation can be expressed as a function of network gradients (Zhou et al., 2020; Qi et al., 2022), we apply $f^{(2)}_u(\cdot)$ to the gradients of $f^{(1)}_u$, i.e., the past gradients $\{\nabla f^{(1)}_u(x_\tau)\}_{\tau \in \mathcal{T}_{u,t}}$ serve as inputs, and the uncertainties of the reward predictions $\{r_\tau - f^{(1)}_u(x_\tau)\}_{\tau \in \mathcal{T}_{u,t}}$ serve as labels. Analogously, for the estimated user exploration graph $G^{(2)}_{i,t}$ and two user nodes $u, u'$, we let their edge weight be $w^{(2)}_{i,t}(u, u') = \Psi^{(2)}\big( f^{(2)}_u(\nabla f^{(1)}_u(x_{i,t})),\ f^{(2)}_{u'}(\nabla f^{(1)}_{u'}(x_{i,t})) \big)$, where $\nabla f^{(1)}_u(x_{i,t})$ stands for the gradient of $f^{(1)}_u(\cdot)$ given arm $x_{i,t}$ as the input (line 12, Alg. 2), and $\Psi^{(2)}(\cdot, \cdot)$ is the mapping function in Definition 3.

Network Architecture. For the theoretical analysis and experiments, we apply separate $L$-layer ($L \ge 2$) fully-connected (FC) networks for both kinds of user networks. The weight matrix entries of the first $L - 1$ layers are drawn from the Gaussian distribution $N(0, 2/m)$, and the entries of the last ($L$-th) layer are sampled from $N(0, 1/m)$. Complementary details are in Appendix Section C.
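The initialization and the training label described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the ReLU activation for the user networks, the toy dimensions, and the observed reward of 1.0 are all hypothetical. For an exploration network, the input dimension `d` would be the dimension of the gradient of the exploitation network rather than of the arm context.

```python
import math
import random

def init_fc_network(d, m, L, seed=0):
    """Weights of an L-layer FC net: first L-1 layers ~ N(0, 2/m), last layer ~ N(0, 1/m)."""
    rng = random.Random(seed)
    sizes = [d] + [m] * (L - 1)  # input dimension, then L-1 hidden layers of width m
    hidden = [
        [[rng.gauss(0.0, math.sqrt(2.0 / m)) for _ in range(sizes[l + 1])]
         for _ in range(sizes[l])]
        for l in range(L - 1)
    ]
    last = [rng.gauss(0.0, math.sqrt(1.0 / m)) for _ in range(m)]
    return hidden, last

def forward(x, hidden, last):
    """Forward pass with ReLU activations (an assumption) returning a scalar prediction."""
    h = list(x)
    for W in hidden:
        h = [max(0.0, sum(h[i] * W[i][j] for i in range(len(h))))
             for j in range(len(W[0]))]
    return sum(hj * wj for hj, wj in zip(h, last))

hidden, last = init_fc_network(d=4, m=8, L=3)
x = [0.5, 0.5, 0.5, 0.5]               # toy context with unit Euclidean norm
prediction = forward(x, hidden, last)  # plays the role of f^(1)_u(x)
exploration_label = 1.0 - prediction   # r - f^(1)_u(x) for a hypothetical reward r = 1
```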

3.2. EXPLOITATION AND EXPLORATION WITH USER GRAPHS

With two kinds of estimated user graphs encoding user correlations, we apply two GNN models to separately estimate arm rewards and potential gains for a refined arm selection strategy, which enables us to utilize the past records from all the users compared with single-bandit algorithms (i.e., methods with no user collaboration).

3.2.1. ARCHITECTURE OF GNN MODELS

In round $t$, with the user exploitation graph $G^{(1)}_{i,t}$ for each arm $x_{i,t} \in \mathcal{X}_t$, we apply the exploitation GNN model $f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; \Theta^{(1)}_{gnn})$ to collaboratively estimate the arm reward $\hat{r}_{i,t}$ for the received user $u_t \in \mathcal{U}$. We start from learning the aggregated hidden representation over the user graph, denoted as

$$ H_{agg} = \sigma\big( (S^{(1)}_{i,t})^k \cdot (\mathbf{X}_{i,t} \Theta^{(1)}_{agg}) \big) \in \mathbb{R}^{n \times m} \tag{3} $$

where $S^{(1)}_{i,t} = (D^{(1)}_{i,t})^{-1/2} A^{(1)}_{i,t} (D^{(1)}_{i,t})^{-1/2}$ is the symmetrically normalized adjacency matrix of $G^{(1)}_{i,t}$, and $\sigma$ represents the ReLU activation function. With $m$ being the network width, we have $\Theta^{(1)}_{agg} \in \mathbb{R}^{nd \times m}$ as the trainable weight matrix. After propagating the information for $k$ hops over the user graph, each row of $H_{agg}$ corresponds to the aggregated $m$-dimensional hidden representation for one specific user-arm pair $(u, x_{i,t})$, $u \in \mathcal{U}$. In this way, the propagation of multi-hop information can provide a global perspective over the users, since it also involves the neighborhood information of users' neighbors (Zhou et al., 2004). In Eq. 3, the embedding matrix $\mathbf{X}_{i,t}$ for the arm $x_{i,t} \in \mathcal{X}_t$, $i \in [a]$, is defined as

$$ \mathbf{X}_{i,t} = \begin{pmatrix} x_{i,t}^\top & 0 & \cdots & 0 \\ 0 & x_{i,t}^\top & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_{i,t}^\top \end{pmatrix} \in \mathbb{R}^{n \times nd} \tag{4} $$

to partition the weight matrix $\Theta^{(1)}_{agg}$ for different users. In this way, it generates individual $m$-dimensional representations w.r.t. each user-arm pair $(u, x_{i,t})$, $u \in \mathcal{U}$, which correspond to the rows of the matrix product $(\mathbf{X}_{i,t} \Theta^{(1)}_{agg}) \in \mathbb{R}^{n \times m}$. Then, with $H_0 = H_{agg}$, we feed the aggregated representations to the $L$-layer ($L \ge 2$) FC network as

$$ H_l = \sigma(H_{l-1} \cdot \Theta^{(1)}_l) \in \mathbb{R}^{n \times m},\ l \in [L-1], \qquad \hat{r}_{all}(x_{i,t}) = H_{L-1} \cdot \Theta^{(1)}_L \in \mathbb{R}^n \tag{5} $$

where $\hat{r}_{all}(x_{i,t}) \in \mathbb{R}^n$ represents the reward estimation for all the users in $\mathcal{U}$, given the arm $x_{i,t}$. Given the received user $u_t$ in round $t$, the reward estimation for the user-arm pair $(u_t, x_{i,t})$ is the corresponding element of $\hat{r}_{all}$ (line 8, Alg. 1), represented by

$$ \hat{r}_{i,t} = f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; \Theta^{(1)}_{gnn}) = [\hat{r}_{all}(x_{i,t})]_{u_t}. \tag{6} $$

For the FC network, the weight matrices of the first $L - 1$ layers are $\Theta^{(1)}_l \in \mathbb{R}^{m \times m}$, $l \in \{1, \dots, L-1\}$, and for the $L$-th layer we have $\Theta^{(1)}_L \in \mathbb{R}^m$. Here, we use $\Theta^{(1)}_{gnn} = [\mathrm{vec}(\Theta^{(1)}_{agg})^\top, \mathrm{vec}(\Theta^{(1)}_1)^\top, \dots, \mathrm{vec}(\Theta^{(1)}_L)^\top]^\top \in \mathbb{R}^p$ to represent the trainable parameters of the exploitation GNN model. The exploitation GNN model $f^{(1)}_{gnn}(\cdot)$ is trained with GD on all the received records $\mathcal{P}_t$, using the quadratic loss between the reward predictions $\{f^{(1)}_{gnn}(x_\tau, G^{(1)}_\tau; \Theta^{(1)}_{gnn})\}_{\tau \in [t]}$ of the chosen arms $x_\tau$ and the actual received rewards $\{r_\tau\}_{\tau \in [t]}$.

Connection with the Reward Function Definition (Eq. 1) and Constraint (Eq. 2). It is known that when the width $m$ is large enough, the FC network is naturally Lipschitz continuous with respect to the input (Allen-Zhu et al., 2019). In our case, with the aggregated hidden representations $H_{agg}$ being the input to the FC network (Eq. 5), the difference of reward estimations $\hat{r}_{i,t}$ is bounded by the distance between rows of the matrix $H_{agg}$ (i.e., aggregated hidden representations). Therefore, given arm $x_{i,t} \in \mathcal{X}_t$ and two users $u_i, u_j \in \mathcal{U}$, the difference of their estimated rewards $|[\hat{r}_{all}(x_{i,t})]_{u_i} - [\hat{r}_{all}(x_{i,t})]_{u_j}|$ can be bounded by the distance between their estimated correlation vectors (i.e., the corresponding rows of $S^{(1)}_{i,t}$). This matches the reward function definition and the constraint presented in Eq. 1-2.

Exploration GNN Model. To achieve adaptive exploration with user collaborations, we apply a second GNN model $f^{(2)}_{gnn}(\nabla[f^{(1)}_{gnn}]_{i,t}, G^{(2)}_{i,t}; \Theta^{(2)}_{gnn})$ to evaluate the potential gain $\hat{b}_{i,t}$ of the reward estimation $f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; \Theta^{(1)}_{gnn})$ (line 8, Alg. 1).
Here, the input is the user exploration graph $G^{(2)}_{i,t}$, and the corresponding input graph signal is the gradient of the exploitation GNN model, $\nabla[f^{(1)}_{gnn}]_{i,t} = \nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; \Theta^{(1)}_{gnn})$. Analogous to $f^{(1)}_{gnn}(\cdot)$, the architecture of $f^{(2)}_{gnn}(\cdot)$ can also be represented by Eq. 3-6. Note that while $f^{(1)}_{gnn}(\cdot)$ and $f^{(2)}_{gnn}(\cdot)$ have the same network width and number of layers, the dimensionalities of the weight matrices $\Theta^{(1)}_{agg} \in \mathbb{R}^{nd \times m}$ and $\Theta^{(2)}_{agg} \in \mathbb{R}^{np \times m}$ are different. Similarly, the exploration GNN model is trained with GD. With the quadratic loss function, we aim to minimize the difference between the estimated potential gains $\{f^{(2)}_{gnn}(\nabla[f^{(1)}_{gnn}]_\tau, G^{(2)}_\tau; \Theta^{(2)}_{gnn})\}_{\tau \in [t]}$ and the actual ones $\{r_\tau - f^{(1)}_{gnn}(x_\tau, G^{(1)}_\tau; \Theta^{(1)}_{gnn})\}_{\tau \in [t]}$. Instead of calculating non-negative UCB intervals (upward exploration only) as in existing works (e.g., Gentile et al. (2014); Ban et al. (2022a)), the exploration GNN model $f^{(2)}_{gnn}(\cdot)$ directly predicts the potential gain $\hat{b}_{i,t}$, which can be positive or negative, to achieve adaptive exploration (downward and upward exploration).

Remark 3.1 (Reducing Input Complexity). The input of $f^{(2)}_{gnn}(\cdot)$ is the gradient $\nabla_\Theta f^{(1)}_{gnn}(x)$ given the arm $x$, and its dimensionality is naturally $p = (nd \times m) + (L - 1) \times m^2 + m$, which can be large when increasing the network width $m$ and depth $L$. Inspired by Convolutional Neural Networks (CNNs), e.g., Radenović et al. (2018), we apply average pooling to compute an approximation of the original gradient vector in practice. In this way, we save the running time of large matrix multiplications and reduce the space complexity at the same time. Note that this approach is also compatible with the user networks in Subsection 3.1. To demonstrate its effectiveness, we apply this method to GNB in all the experiments in Section 5.

Remark 3.2 (Working with Large Systems). When facing a large number of users, to deal with the potentially high computational cost, we can apply approximated user neighborhoods to reduce the running time of GNB. Given the user graphs $G^{(1)}_{i,t}$, $G^{(2)}_{i,t}$ in terms of arm $x_{i,t}$, we derive approximated user neighborhoods $\tilde{\mathcal{N}}^{(1)}(u_t), \tilde{\mathcal{N}}^{(2)}(u_t) \subset \mathcal{U}$ for the target user $u_t$, with the size $|\tilde{\mathcal{N}}^{(1)}(u_t)| = |\tilde{\mathcal{N}}^{(2)}(u_t)| = \tilde{n}$, where $\tilde{n} \ll n$. For instance, we can choose a subset of $\tilde{n}$ representative users (e.g., users who consistently post high-quality reviews on e-commerce platforms) to form $\tilde{\mathcal{N}}^{(1)}(u_t)$, $\tilde{\mathcal{N}}^{(2)}(u_t)$ for the downstream GNN models, which can significantly reduce the computation cost. Related experiments are provided in Subsection 5.3 and Appendix Section B.
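The average pooling mentioned in Remark 3.1 can be illustrated with a short sketch. The non-overlapping window and the handling of a shorter final window are assumptions made for this example; the source does not specify the pooling configuration.

```python
def average_pool(gradient, window):
    """Non-overlapping average pooling over a flat gradient vector.

    The last window may be shorter than `window`; it is averaged over its own length.
    """
    pooled = []
    for start in range(0, len(gradient), window):
        chunk = gradient[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled

# Hypothetical flattened gradient of the exploitation model (p = 5 entries).
grad = [0.2, 0.4, -0.1, 0.3, 1.0]
pooled = average_pool(grad, window=2)  # 3 entries instead of 5
```

With a window of size `w`, the input dimension of the exploration model shrinks from `p` to roughly `p / w`, which is where the savings in time and memory would come from.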

ALGORITHM 1: Graph Neural Bandits (GNB)

Input: Number of rounds $T$, network width $m$, information propagation hops $k$. Functions for edge weight estimation $\Psi^{(1)}(\cdot, \cdot), \Psi^{(2)}(\cdot, \cdot) : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$.
Output: Arm recommendation $x_t$ for each time step $t$.
Initialization: Initialize parameters $\Theta_0$ for all models.
for $t = 1, 2, \dots, T$ do
    Receive a user $u_t$ and a set of arm contexts $\mathcal{X}_t = \{x_{i,t}\}_{i \in [a]}$.
    Construct two kinds of user graphs $\{G^{(1)}_{i,t}\}_{i \in [a]}$, $\{G^{(2)}_{i,t}\}_{i \in [a]}$ for arm set $\mathcal{X}_t$ with Algorithm 2.
    for each arm $x_{i,t} \in \mathcal{X}_t$ do
        Compute the reward estimation $\hat{r}_{i,t} = f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1})$ and the potential gain $\hat{b}_{i,t} = f^{(2)}_{gnn}(\nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1}), G^{(2)}_{i,t}; [\Theta^{(2)}_{gnn}]_{t-1})$.
    end
    Play arm $x_t = \arg\max_{x_{i,t} \in \mathcal{X}_t} (\hat{r}_{i,t} + \hat{b}_{i,t})$, and observe its true reward $r_t$.
    Train the user networks $f^{(1)}_u(\cdot; \Theta^{(1)}_u)$, $f^{(2)}_u(\cdot; \Theta^{(2)}_u)$ and the GNN models $f^{(1)}_{gnn}(\cdot; \Theta^{(1)}_{gnn})$, $f^{(2)}_{gnn}(\cdot; \Theta^{(2)}_{gnn})$ with gradient descent, according to Algorithm 3.
end

Weight Matrices Initialization. For both GNN models with parameters $\Theta^{(1)}_{gnn}$ and $\Theta^{(2)}_{gnn}$, the entries of the aggregation weight matrix $\Theta_{agg}$ and of the first $L - 1$ FC layers $\{\Theta_1, \dots, \Theta_{L-1}\}$ are drawn from the Gaussian distribution $N(0, 2/m)$. Then, for the last layer $\Theta_L$, we draw its entries from $N(0, 1/m)$.
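To make the aggregation step of Eq. 3-4 and the initialization above concrete, here is a small pure-Python sketch. It is an illustration under assumptions rather than the authors' code: the function names and toy values are hypothetical, and the block-diagonal embedding matrix is never materialized, since row $u$ of $\mathbf{X}_{i,t} \Theta_{agg}$ only touches the $d$ rows of $\Theta_{agg}$ that belong to user $u$.

```python
import math
import random

def init_theta_agg(n, d, m, seed=0):
    """Aggregation weights Theta_agg in R^{nd x m}, entries drawn from N(0, 2/m)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, math.sqrt(2.0 / m)) for _ in range(m)] for _ in range(n * d)]

def normalize_adjacency(A):
    """Symmetric normalization S = D^{-1/2} A D^{-1/2}, with D the degree matrix of A."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[u][v] / math.sqrt(deg[u] * deg[v]) for v in range(n)] for u in range(n)]

def aggregate(x, A, theta_agg, k=1):
    """H_agg = ReLU(S^k (X Theta_agg)) for one arm context x and adjacency A (self-loops included)."""
    n, d, m = len(A), len(x), len(theta_agg[0])
    # Row u of X Theta_agg: x^T times the u-th (d x m) block of theta_agg.
    H = [[sum(x[i] * theta_agg[u * d + i][j] for i in range(d)) for j in range(m)]
         for u in range(n)]
    S = normalize_adjacency(A)
    for _ in range(k):  # propagate for k hops over the user graph
        H = [[sum(S[u][v] * H[v][j] for v in range(n)) for j in range(m)] for u in range(n)]
    return [[max(0.0, h) for h in row] for row in H]

# Toy example: n = 2 users, d = 2, m = 2, hand-set weights for readability.
A = [[1.0, 0.5], [0.5, 1.0]]
x = [1.0, 0.0]
theta = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
H_agg = aggregate(x, A, theta, k=1)
```

In this toy case the per-user representations before propagation are the two standard basis vectors, so `H_agg` equals the normalized adjacency matrix itself, which makes the mixing between neighboring users easy to inspect.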

3.2.2. ARM SELECTION MECHANISM AND MODEL TRAINING

In round $t$, with the current parameters $[\Theta^{(1)}_{gnn}]_{t-1}$, $[\Theta^{(2)}_{gnn}]_{t-1}$ of the GNN models before training, the selected arm is chosen as

$$ x_t = \arg\max_{x_{i,t} \in \mathcal{X}_t} \Big[ f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1}) + f^{(2)}_{gnn}\big( \nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1}),\ G^{(2)}_{i,t}; [\Theta^{(2)}_{gnn}]_{t-1} \big) \Big] $$

based on the estimated reward and potential gain (line 10, Alg. 1). After receiving the true reward $r_t$, we update the user networks and GNN models with GD and the quadratic loss function (line 11, Alg. 1). Pseudo-code summarizing the training procedure is shown in Alg. 3 (Appendix Section D).
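The selection rule reduces to an argmax over the sum of the two model outputs. The sketch below assumes the per-arm reward estimates and potential gains have already been computed; the function name and the numbers are purely illustrative.

```python
def select_arm(reward_estimates, potential_gains):
    """Return the index of the arm maximizing estimated reward plus potential gain."""
    scores = [r + b for r, b in zip(reward_estimates, potential_gains)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical outputs of the exploitation and exploration GNN models for three arms.
r_hat = [0.6, 0.5, 0.7]
b_hat = [0.05, 0.3, -0.2]  # a negative gain lowers the score (downward exploration)
chosen = select_arm(r_hat, b_hat)
```

Here the arm with the highest estimated reward (index 2) is not selected, because its negative potential gain pulls its score below that of arm 1.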

4. THEORETICAL ANALYSIS

In this section, we present the theoretical analysis for the proposed GNB. Here, we consider each user $u \in \mathcal{U}$ to be evenly served $T/n$ rounds up to time step $T$, i.e., $|\mathcal{T}_{u,T}| = T_{u,T} = T/n$, which is standard in closely related works (e.g., Gentile et al. (2014); Ban & He (2021)). To ensure that the neural models are able to efficiently learn the underlying reward mapping, we make the following mild assumption on arm separateness.

Assumption 4.1 ($\rho$-Separateness of Arms). After a total of $T$ rounds, for every pair $x_{i,t}, x_{i',t'}$ with $t, t' \in [T]$ and $i, i' \in [a]$, if $(t, i) \ne (t', i')$, we have $\|x_{i,t} - x_{i',t'}\|_2 \ge \rho$, where $0 < \rho \le O(1/L)$.

Note that the above assumption is mild, and it has been repeatedly applied in previous works on neural bandits (Ban et al., 2022b) and over-parameterized neural networks (Allen-Zhu et al., 2019). Meanwhile, Assumption 4.2 in Zhou et al. (2020) and Assumption 3.4 in Zhang et al. (2021) also imply that no two arms are the same, and they measure the arm separateness in terms of the minimum eigenvalue $\lambda_0$ (with $\lambda_0 > 0$) of the Neural Tangent Kernel (NTK) (Jacot et al., 2018) matrix, which is comparable with our Euclidean separateness $\rho$. Since $L$ can be manually set (e.g., $L = 2$), we can easily satisfy the condition $0 < \rho \le O(1/L)$ as long as no two arms are identical.

Based on Definition 1 and Definition 3, given an arm $x_{i,t} \in \mathcal{X}_t$, we have the adjacency matrices $A^{(1),*}_{i,t}$ and $A^{(2),*}_{i,t}$ for the true user graphs $G^{(1),*}_{i,t}$, $G^{(2),*}_{i,t}$. For the sake of analysis, given any adjacency matrix $A$, we derive the normalized adjacency matrix $S$ by scaling the elements of $A$ with $1/n$. We also set the neighborhood parameter $k = 1$, and define the mapping functions $\Psi^{(1)}(a, b) = \Psi^{(2)}(a, b) := \exp(-|a - b|)$ given the inputs $a, b \in \mathbb{R}$. Note that our results can be readily generalized to other mapping functions with Lipschitz-continuity properties.
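The analysis setting above (mapping $\exp(-|a - b|)$ and scaling by $1/n$) is easy to instantiate. The following sketch uses hypothetical per-user values; the function name is an illustrative assumption.

```python
import math

def scaled_adjacency(values):
    """S = A / n with A[u][v] = exp(-|a_u - a_v|), as in the analysis setting."""
    n = len(values)
    return [[math.exp(-abs(a - b)) / n for b in values] for a in values]

# Hypothetical expected rewards (or potential gains) of three users for one arm.
S = scaled_adjacency([0.9, 0.4, 0.1])
```

Because every entry of $A$ lies in $(0, 1]$, each entry of $S$ lies in $(0, 1/n]$ and every row of $S$ sums to at most 1; the diagonal equals $1/n$, which plays the role of the self-loops.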
We proceed to derive the regret bound for $T$ time steps, denoted as $R(T)$. The following Theorem 4.2 offers a cumulative regret bound covering both types of error: (1) the estimation error of user graphs; and (2) the approximation error of neural models. Let $\eta_1, J_1$ be the learning rate and number of iterations for the user networks, and let $\eta_2, J_2$ denote the learning rate and number of iterations for the GNN models.

Theorem 4.2. Let $\delta \in (0, 1)$, $0 < \xi_1, \xi_2 \le O(1/T)$ and $0 < \rho \le O(1/L)$. With the user networks defined in Eq. 7 and the GNN models defined in Eq. 3-5 with $L$ FC layers, let their width be $m \ge \Omega\big( \mathrm{Poly}(T, L, a, \tfrac{1}{\rho}) \cdot \log(1/\delta) \big)$. With the training process in Algorithm 3, set the parameters

$$ \eta_1 = \Theta\Big( \frac{\rho}{m \cdot \mathrm{Poly}(T, n, a, L)} \Big), \quad \eta_2 = \Theta\Big( \frac{\rho}{m \cdot \mathrm{Poly}(T, a, L)} \Big), $$

$$ J_1 = \Theta\Big( \frac{\mathrm{Poly}(T, n, a, L)}{\rho \cdot \delta^2} \cdot \log\frac{1}{\xi_1} \Big), \quad J_2 = \Theta\Big( \frac{\mathrm{Poly}(T, a, L)}{\rho \cdot \delta^2} \cdot \log\frac{1}{\xi_2} \Big). $$

Then, following Algorithm 1 and Algorithm 2 for arm pulling and user graph construction, with probability at least $1 - \delta$, the $T$-round pseudo-regret $R(T)$ of GNB can be bounded by

$$ R(T) \le \sqrt{T} \cdot O(L) + \sqrt{T} \cdot O(L^3) + \sqrt{T} \cdot O(L^2) \cdot \sqrt{\log\Big( \frac{Tn \cdot a}{\delta} \Big)} + O(L^2) + O(1). $$

The proof of Theorem 4.2 and the full regret bound are presented in the Appendix. Unlike Ban et al. (2022b), we avoid making the i.i.d. assumption for the arms by applying a martingale-based analysis. For real-world applications, their i.i.d. assumption can be strong, since the candidate arm pool is always conditioned on the received records, and the candidate arms for a specific round can also come from different distributions.

5. EXPERIMENTS

In this section, we evaluate the proposed GNB framework on multiple real data sets against nine state-of-the-art algorithms, including: CLUB (Gentile et al., 2014) , SCLUB (Li et al., 2019) , LOCB (Ban & He, 2021) , DynUCB (Nguyen & Lauw, 2014), COFIBA (Li et al., 2016) , Neural-UCB-Pool (Neural-Pool) (Zhou et al., 2020) , Neural-UCB-Ind (Neural-Ind) (Zhou et al., 2020) , EE-Net (Ban et al., 2022b) , and Meta-Ban (Ban et al., 2022a) . We will include the descriptions for the baselines in the Appendix Section B.

5.1. REAL DATA SETS

Recommendation Data Sets. The "MovieLens rating dataset" (https://www.grouplens.org/datasets/movielens/20m/) includes reviews from $1.6 \times 10^5$ users towards $6 \times 10^4$ movies. Since the genome-scores of user-specified tags are provided for each movie, we select the 10 tags with the highest score variance to generate the movie features $v_i \in \mathbb{R}^d$, $d = 10$. The user features $v_u \in \mathbb{R}^d$, $u \in \mathcal{U}$, are obtained through singular value decomposition (SVD) of the rating matrix. We use K-means to divide users into 50 groups based on $v_u$, and the group information is unknown to the models. In each round $t$, a user $u_t$ is drawn from a randomly selected group. For the arm pool $\mathcal{X}_t$ of 10 arms, we randomly choose one bad movie (with two stars or less, out of five) rated by $u_t$ with reward 1, and randomly pick the other 9 good movies with reward 0.

For the "Yelp" data set (https://www.yelp.com/dataset), we extract ratings and build the rating matrix w.r.t. the top 2,000 users and top 10,000 arms with the most reviews. Then, we use SVD to extract a normalized 10-dimensional feature vector for each user and restaurant. Given the rating for a specific user-item pair, if the user's rating is greater than three stars (out of five stars), the reward is set to 1; otherwise, the reward is 0. Similarly, we apply K-means clustering to divide users into 50 groups based on user features. In each round $t$, a target user $u_t$ is sampled from a randomly selected group. For the arm pool $\mathcal{X}_t$, we randomly choose one good restaurant rated by $u_t$ with reward 1 and randomly pick the other 9 bad restaurants with reward 0.

Classification Data Sets. In addition to the two recommendation data sets above, we also perform experiments on two real classification data sets under the recommendation setting, namely the "MNIST" (http://yann.lecun.com/exdb/mnist/) and "Shuttle" (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)) data sets.
Similar to previous works (Zhou et al., 2020; Ban et al., 2022a) , given a sample x ∈ R d , we transform it into C different arms, as x 1 = (x, 0, . . . , 0), x 2 = (0, x, . . . , 0), . . . , x |C| = (0, 0, . . . , x) ∈ R d+C-1 where we add C -1 zero digits as the padding. The received reward r t will be 1 if we select the arm of the correct class, else the reward will be 0.
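The arm construction for the classification data sets can be sketched as follows. This is only an illustration under an assumed layout: arm c places c zero digits before x and C - 1 - c after it, which is one layout consistent with the stated dimension d + C - 1. The names `sample_to_arms` and `reward` are hypothetical and not taken from any released code.

```python
import numpy as np

def sample_to_arms(x, num_classes):
    """Turn one sample x in R^d into C arm contexts in R^(d + C - 1)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    arms = np.zeros((num_classes, d + num_classes - 1))
    for c in range(num_classes):
        # Assumed layout: c zero digits, then x, then C - 1 - c zero digits.
        arms[c, c:c + d] = x
    return arms

def reward(selected_arm, true_class):
    """Reward is 1 only when the arm of the correct class is selected."""
    return 1.0 if selected_arm == true_class else 0.0

x = np.array([0.2, 0.5, 0.3])
arms = sample_to_arms(x, num_classes=4)
print(arms.shape)  # (4, 6)
```

Each arm keeps the full sample, so only the position of x distinguishes the classes.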

5.2. EXPERIMENT RESULTS

Figure 2 illustrates the cumulative regret results on the four data sets, where our proposed GNB achieves the best performance against all these strong benchmarks. First, since the MovieLens data set involves real arm features, unlike the Yelp data set that includes high inherent noise, the performance of different algorithms on the MovieLens data set tends to diverge more. Among those regret results, the algorithms with neural architectures (Neural-Pool, EE-Net, Meta-Ban) generally perform better than linear algorithms due to the approximation power of neural networks. However, as Neural-Ind considers no collaboration among users, it performs the worst among all baselines on these two data sets. EE-Net outperforms Neural-Pool thanks to its adaptive exploration strategy. For the classification data sets, Meta-Ban performs better than the other baselines by modeling user correlations under the non-linear setting. Different from the recommendation data sets, the classification data sets involve more complicated reward mapping functions, which might explain the poor performance of the linear algorithms. Our proposed GNB consistently outperforms all baselines by modeling fine-grained user correlations and utilizing the adaptive exploration strategy simultaneously. In addition, GNB takes at most 75% of Meta-Ban's running time to finish the experiments (Table 1), since Meta-Ban trains the framework individually for each arm before making predictions.
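For reference, the cumulative regret curves discussed above can be computed from a reward log as in the minimal sketch below. It assumes the empirical regret of a round is the optimal reward minus the received reward, and that the optimal reward is 1 whenever the pool contains exactly one reward-1 arm, as in the setups of Section 5.1. The function name `cumulative_regret` is hypothetical.

```python
import numpy as np

def cumulative_regret(received_rewards, optimal_rewards=None):
    """Running sum of (optimal reward - received reward) over rounds."""
    received = np.asarray(received_rewards, dtype=float)
    if optimal_rewards is None:
        # Assumption: every arm pool holds exactly one arm with reward 1.
        optimal = np.ones_like(received)
    else:
        optimal = np.asarray(optimal_rewards, dtype=float)
    return np.cumsum(optimal - received)

regret = cumulative_regret([1, 0, 0, 1, 1])
print(regret)  # [0. 1. 2. 2. 2.]
```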

5.3. SUPPLEMENTARY EXPERIMENTS

Due to the page limit, we present additional supplementary experiments in the Appendix Section B, including: (1) experiments on additional data sets; (2) with increasing number of users, experiments demonstrating the effectiveness of applying approximated user neighborhoods (Remark 3.2); (3) experiments showing the potential performance impact on GNB when there exist underlying user clusters; (4) the parameter sensitivity study showing that our adaptive exploration strategy can indeed improve the performance of GNB, and the effects of different hops k for information propagation.

6. CONCLUSION

In this paper, we propose a novel framework named GNB to model the fine-grained user collaborative effects. Instead of modeling user correlations through the estimation of rigid user groups, we estimate the user graphs to preserve the pair-wise user correlations for exploitation and exploration separately, and utilize individual GNN-based models to achieve the adaptive exploration. Moreover, under standard assumptions, we also demonstrate the improvement of regret bounds over existing methods from a new perspective of "fine-grained" user collaborative effects and GNNs. Extensive experiments are conducted to show the effectiveness of our proposed framework against strong baselines.

A RELATED WORKS

In this section, we briefly review the existing works related to our proposed GNB framework. Assuming the reward mapping to be linear, linear upper confidence bound (UCB) algorithms (Chu et al., 2011; Li et al., 2010; Auer et al., 2002; Abbasi-Yadkori et al., 2011) were first proposed to solve the exploitation-exploration dilemma. After kernel-based methods (Valko et al., 2013; Deshmukh et al., 2017) were used to address the non-linear setting where the reward mapping is a kernel-based function, neural algorithms (Zhou et al., 2020; Zhang et al., 2021; Ban et al., 2021) have been proposed to utilize neural networks to estimate the reward function and confidence bound. Meanwhile, AGG-UCB (Qi et al., 2022) adopts a GNN to model the arm group correlations. GCN-UCB (Upadhyay et al., 2020) applies a GNN model to embed arm contexts for downstream linear regression, and GNN-PE (Kassraie et al., 2022) utilizes a UCB based on information gains to achieve exploration for classification tasks on graphs. Note that the above neural algorithms with UCB-based exploration strategies all suffer from the space complexity $O(p^2)$ needed to store their gradient matrix, where $p$ is the number of model parameters. This space cost becomes especially large when the number of model parameters grows with larger network width $m$ and depth $L$. Instead of UCB, EE-Net (Ban et al., 2022b) achieves adaptive exploration by using neural models to estimate prediction uncertainty. Assuming a finite number of arms, Maillard & Mannor (2014) and Hong et al. (2020) discuss latent bandits, where there exist latent states that affect the reward generation. Nonetheless, all of these works fail to consider the collaborative effects among users in real-world application scenarios. In order to model the user correlations, Wu et al. (2016) and Cesa-Bianchi et al. (2013) assume the user social graph is known, and apply an ensemble of linear estimators.
Without prior knowledge of user correlations, CLUB (Gentile et al., 2014) introduces the user clustering problem in contextual bandits with graph connected components, SCLUB (Li et al., 2019) adopts dynamic user sets and applies set operations to update user clusters, and DynUCB (Nguyen & Lauw, 2014) assigns users to their nearest estimated clusters. Then, CAB (Gentile et al., 2017) studies arm-specific user clustering, and LOCB (Ban & He, 2021) estimates soft-margin user groups through a random-seed based approach. COFIBA (Li et al., 2016) utilizes user and arm clustering for collaborative filtering. Apart from these linear algorithms, we note a concurrent work Meta-Ban (Ban et al., 2022a), which applies a neural meta-model to adapt to different user groups. However, all algorithms mentioned in this paragraph consider rigid user groups, where users from the same group are treated equally with no internal differentiation. GNNs (Welling & Kipf, 2017; Chen et al., 2018; Wu et al., 2019; Gasteiger et al., 2019; He et al., 2020; Satorras & Estrach, 2018) are a class of neural models operating on graph data, and have proven effective for various tasks, e.g., community detection (You et al., 2019) and recommender systems (Ying et al., 2018). In this work, we leverage GNNs to learn from user correlations and arm contexts simultaneously.
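To give a rough sense of the $O(p^2)$ space cost noted above for UCB-based neural algorithms, the sketch below counts the entries of a $p \times p$ gradient matrix. It assumes a fully connected network with scalar output and parameter count $p = md + m^2(L-1) + m$, and 4-byte floats; both are illustrative assumptions rather than figures from the experiments. The values $d = 10$, $m = 100$, $L = 2$ mirror the settings reported in this paper.

```python
def fc_param_count(d, m, L):
    """Assumed parameter count of a width-m, depth-L FC network with scalar output."""
    return m * d + m * m * (L - 1) + m

def gradient_matrix_bytes(p, bytes_per_entry=4):
    """Memory for a dense p x p matrix (assumed 4-byte floats)."""
    return p * p * bytes_per_entry

p = fc_param_count(d=10, m=100, L=2)
print(p)  # 11100
print(gradient_matrix_bytes(p) / 1e6)  # size in megabytes
```

Even at this small width the dense matrix needs several hundred megabytes, and the cost grows quadratically in $p$.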

B EXPERIMENT SETTINGS AND SUPPLEMENTARY EXPERIMENTS

B.1 BASELINES AND EXPERIMENT SETTINGS

The descriptions for our nine baseline methods are:
• CLUB (Gentile et al., 2014) regards connected components of the estimated user graph as user groups, and adopts a UCB-type exploration strategy;
• SCLUB (Li et al., 2019) estimates dynamic user sets as user groups, and allows set operations for group updates;
• LOCB (Ban & He, 2021) applies soft-clustering among users with random seeds and chooses the best user group for reward and confidence bound estimations;
• DynUCB (Nguyen & Lauw, 2014) dynamically assigns users to their nearest estimated cluster;
• COFIBA (Li et al., 2016) estimates user clustering and arm clustering simultaneously, and ensembles linear estimators for reward and confidence bound estimations;
• Neural-Pool adopts one single Neural-UCB (Zhou et al., 2020) model for all the users with a UCB-type exploration strategy;
• Neural-Ind assigns each user their own separate Neural-UCB (Zhou et al., 2020) model;
• EE-Net (Ban et al., 2022b) achieves adaptive exploration by applying additional neural models for the exploration and decision making;
• Meta-Ban (Ban et al., 2022a) utilizes individual neural models for each user's behavior, and applies a meta-model to adapt to estimated user groups.

Baseline Settings. For all the UCB-based baselines, we choose their exploration parameters individually with grid search in the range {0.01, 0.1, 1}. We set $L = 2$ for all the deep learning models, and set the network width $m = 100$. The learning rate of all neural algorithms is selected by grid search in the range {0.0001, 0.001, 0.01}. For EE-Net, we follow the default setting in their paper by using a hybrid decision maker, where the estimation is $f_1 + f_2$ for the first 500 time steps, and then we apply an additional neural network for decision making afterwards. For Meta-Ban, we follow the settings in their paper by tuning the clustering parameter $\gamma$ through grid search over {0.1, 0.2, 0.3, 0.4}.
For GNB, we choose the k-hop user neighborhood k ∈ {1, 2, 3} with grid search. Reported results are the average of 5 runs.
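The search ranges listed above can be collected into a single configuration, as in the sketch below. The dictionary keys and the `grid` helper are hypothetical names chosen for illustration, and pairing the learning rate with the hop number $k$ for GNB is an assumption about how the two grids are combined.

```python
from itertools import product

# Search ranges as stated in the baseline settings.
search_space = {
    "ucb_exploration": [0.01, 0.1, 1],
    "learning_rate": [0.0001, 0.001, 0.01],
    "meta_ban_gamma": [0.1, 0.2, 0.3, 0.4],
    "gnb_hops_k": [1, 2, 3],
}
# Values held fixed across all neural models.
fixed = {"depth_L": 2, "width_m": 100, "num_runs": 5}

def grid(space, keys):
    """Yield one configuration dict per combination of the chosen keys."""
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

gnb_configs = list(grid(search_space, ["learning_rate", "gnb_hops_k"]))
print(len(gnb_configs))  # 9
```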

B.2 EXPERIMENTS ON ADDITIONAL DATA SETS

Due to the page limit in the main body and to better compare our GNB with the benchmarks, we include the experiments on two additional classification data sets in this subsection. They are: (1) the "Letter" data set with $C = 26$ different classes (https://archive.ics.uci.edu/ml/datasets/letter+recognition), and (2) the "Pendigits" data set with $C = 10$ classes (https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits), under the recommendation settings. Analogous to the settings of the "MNIST" and the "Shuttle" data sets, we consider each class to be a user. Given a sample $x \in \mathbb{R}^d$, we transform it into $C$ arms for different classes similar to previous works (Zhou et al., 2020; Ban et al., 2022a), namely $x_1 = (x, 0, \dots, 0), x_2 = (0, x, \dots, 0), \dots, x_C = (0, 0, \dots, x) \in \mathbb{R}^{d+C-1}$, where $C - 1$ additional zero digits are added as the padding. The reward will be $r = 1$ if we choose the correct arm that represents the sample's true class; otherwise, the reward will be 0. The experiment results for these two additional data sets are presented in Figure 3. It is worth noting that EE-Net continues to outperform the two Neural-UCB baselines, which is further evidence of the effectiveness of the adaptive exploration strategy. On the other hand, our exploration strategy, inspired by EE-Net, further incorporates the user exploration graphs to exploit the encoded "fine-grained" user collaborative effects. Therefore, analogous to the experiment results in the main body (Figure 2), our proposed GNB framework consistently outperforms the other benchmarks by exploiting and adaptively exploring the "fine-grained" correlations among different classes at the same time.

B.3 EFFECTS OF THE ADAPTIVE EXPLORATION AND EFFECTS OF INFORMATION PROPAGATION HOPS

In order to demonstrate the necessity of the adaptive exploration strategy, we consider an alternative arm selection approach (different from line 10, Alg. 1) at each time step $t$, with the following form:

$$x_t = \arg\max_{x_{i,t} \in \mathcal{X}_t} \Big[ f^{(1)}_{gnn}\big(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1}\big) + \alpha \cdot f^{(2)}_{gnn}\big(\nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_{i,t}, G^{(1)}_{i,t}; [\Theta^{(1)}_{gnn}]_{t-1}), G^{(2)}_{i,t}; [\Theta^{(2)}_{gnn}]_{t-1}\big) \Big].$$

Here, we introduce an additional parameter $\alpha \in [0, 1]$ as the exploration coefficient to control the level of exploration (larger $\alpha$ values lead to higher levels of exploration). We show the experiment results with $\alpha \in \{0, 0.1, 0.3, 0.7, 1.0\}$ on the "MNIST" and the "Yelp" data sets. In Figure 4, we illustrate the effects of different exploration coefficients. Regarding the results in the left figure ("Yelp" data set), the adaptive exploration indeed helps to improve the performance of GNB, but the performance of GNB does not differ dramatically across different $\alpha$ values. As the "Yelp" data set contains inherent noise, the curves of cumulative regret (including those of the other benchmark algorithms) tend to follow a near-linear growth rate. However, our adaptive exploration strategy based on user exploration graphs still helps to improve the overall performance, which is validated by the fact that setting $\alpha = 1$ leads to better performance than the case where no exploration strategy is involved ($\alpha = 0$). On the other hand, based on the figure on the right hand side ("MNIST" data set), different $\alpha$ values tend to give relatively divergent results. A possible reason is that in the "MNIST" data set, the mapping from arm contexts to rewards is more complicated than that of the "Yelp" data set. Thus, the adaptive exploration strategy is able to markedly improve the performance of GNB by flexibly estimating the potential gains of different classes with the estimated "fine-grained" user (class) correlations.
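The role of the exploration coefficient $\alpha$ in this alternative selection rule can be illustrated with a minimal sketch. The two score vectors stand in for the outputs of the exploitation and exploration GNN models on each candidate arm; their values here are made up for illustration, and `select_arm` is a hypothetical helper name.

```python
import numpy as np

def select_arm(exploit_scores, explore_scores, alpha=1.0):
    """Pick the arm maximizing exploitation score + alpha * exploration score."""
    exploit = np.asarray(exploit_scores, dtype=float)
    explore = np.asarray(explore_scores, dtype=float)
    return int(np.argmax(exploit + alpha * explore))

# Illustrative stand-ins for the two model outputs on three candidate arms.
exploit = [0.50, 0.45, 0.10]
explore = [0.00, 0.20, 0.30]
print(select_arm(exploit, explore, alpha=0.0))  # 0
print(select_arm(exploit, explore, alpha=1.0))  # 1
```

With $\alpha = 0$ the rule is purely greedy, while $\alpha = 1$ recovers the sum of the two model outputs.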
Recall that the GNB framework has a parameter $k$ (Eq. 3), which controls the user neighborhood hops that the two GNN models learn from. In this subsection, we present the experiment results with $k \in \{1, 2, 3\}$ on the "MNIST" and the "Yelp" data sets, shown in Figure 5. On both data sets, we observe that setting $k = 1$, namely making GNB learn directly from the 1-hop neighborhood, tends to yield the best result. A possible reason is that our user graphs remain connected graphs with the user correlations encoded by the edge weights, so learning directly from the 1-hop neighbors is sufficient: the pair-wise user correlations between the target user and every other user have already been taken into consideration. Meanwhile, with larger $k$ values ($k = 2, 3$), raising the adjacency matrix to the power of $k$ leads to more even entry values across the matrix, which can be related to the over-smoothing problem (Xu et al., 2018; 2021). The figure on the right hand side ("MNIST" data set) may support this claim. Since Figure 2 has already shown that applying one single estimator across all users (classes), i.e., Neural-Pool and EE-Net, leads to poor performance, the "MNIST" data set tends to have complex correlations among different classes. In this case, when we increase $k$, different user pairs tend to have similar correlations because the entries of the adjacency matrix become closer to each other, which may lead to extra estimation error. Moreover, we also conduct experiments on the MNIST data set with different combinations of the $\alpha$ and $k$ parameters, as shown in Figure 6. Consistent with our conclusion above, setting $k = 1$ generally leads to better results, and the adaptive exploration strategy offers considerable help in improving GNB's performance.
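The smoothing effect of larger $k$ can be seen on a toy example. The sketch below uses a hand-picked 3-user weighted graph (not taken from the experiments) and the normalization $S = A/n$ that appears in the proofs, and measures how unevenly one user weighs the other two after $k$ hops. The helper names are hypothetical.

```python
import numpy as np

def propagated_adjacency(adj, k):
    """Return S^k with S = A / n."""
    S = adj / adj.shape[0]
    return np.linalg.matrix_power(S, k)

def offdiag_ratio(mat, row):
    """Largest / smallest off-diagonal weight seen by one user."""
    others = np.delete(mat[row], row)
    return float(others.max() / others.min())

# Toy graph: user 0 is strongly tied to user 1 and weakly tied to user 2.
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
ratios = [offdiag_ratio(propagated_adjacency(A, k), row=0) for k in (1, 2, 3)]
print([round(v, 2) for v in ratios])
```

For this particular matrix the ratio shrinks from about 9 at $k = 1$ to below 4 at $k = 3$, so the target user's view of the other users becomes less differentiated as $k$ grows.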
One phenomenon to note is that when we increase the value of the parameter $k$, the performance difference of GNB across different $\alpha$ values shrinks. One reason is that when we increase $k$, the propagated adjacency matrix of the user graph becomes more "smooth", which makes the users closer to each other in terms of similarity. In this case, the effect of the adaptive exploration strategy can be weakened, as the estimated user correlations are less divergent.

B.4 EXPERIMENTS WITH DIFFERENT NUMBER OF UNDERLYING USER GROUPS

To better understand the influence of potential underlying user clusters, we conduct experiments on the MovieLens and the Yelp data sets with a controlled number of underlying user groups. The underlying user groups are derived by applying hierarchical clustering to the user features, and we maintain approximately a total of 50 users. Here, we apply five representative baselines with relatively good performance, which are DynUCB (Nguyen & Lauw, 2014) [fixed number of user clusters], LOCB (Ban & He, 2021) [fixed number of user clusters], CLUB (Gentile et al., 2014) [distance-based user clustering], Neural-UCB-Pool (Zhou et al., 2020) [neural single-bandit algorithm], and Meta-Ban (Ban et al., 2022a) [neural user clustering bandits]. In particular, DynUCB and LOCB are provided with the true cluster number as prior knowledge to determine the number of initial user clusters / random seeds. The experiment results are shown in Fig. 7 and Fig. 8.

As we can see from the results, our proposed GNB consistently outperforms the other baselines across different data sets and numbers of user groups. In particular, with more underlying user groups, the performance improvement of GNB over the baselines slightly increases, due to the increasingly complicated user correlations. The modeling of fine-grained user correlations and the representation power of our GNN-based architecture can help explain GNB's good performance and its ability to utilize user correlations.

B.5 EXPERIMENTS WITH APPROXIMATED USER NEIGHBORHOOD

In this subsection, we conduct experiments to support our claim that applying approximated user neighborhoods is a feasible solution when the number of users increases (Remark 3.2). We consider three scenarios where the number of users $n \in \{200, 300, 500\}$. Meanwhile, we fix the size of the approximated user neighborhoods $\tilde{\mathcal{N}}^{(1)}(u_t), \tilde{\mathcal{N}}^{(2)}(u_t)$ to $\tilde{n} = |\tilde{\mathcal{N}}^{(1)}(u_t)| = |\tilde{\mathcal{N}}^{(2)}(u_t)| = 50$. The experiment results are shown in Figure 9. Here, we see that the proposed GNB still outperforms the baselines as the number of users increases. In particular, given a total of 500 users, the approximated neighborhood is only 1/10 (50 users) of the overall user pool. These results support that applying approximated user neighborhoods (Remark 3.2) is a practical way to scale up GNB in real-world application scenarios.

Although the other baselines, especially the linear baselines, tend to run much faster than our proposed GNB, their experiment performance (Section 5) is not comparable to that of GNB, as their linear assumption is too strong for most application scenarios. In particular, for data sets with a large arm context dimension $d$, the mapping from the arm context to the reward will be much more complicated. In this case, as shown by the experiments on the MNIST data set ($d = 784$) in Figure 2, the neural algorithms achieve a substantial improvement over the linear algorithms while keeping a reasonable running time.

Here, the numbers in the brackets "[]" are the time consumption for the actual recommendation process. We have the following remarks: (1) Based on the running time in the brackets, we see that for the two recommendation tasks, GNB takes approximately 0.4 seconds per round to make the arm recommendation for the received user, which is reasonable in real-world cases; (2) In all the experiments, we train the GNB framework once per 100 rounds after $T > 1000$ and still manage to achieve good performance. Thus, the running time of GNB in the long run could be further reduced significantly by lowering the training frequency, since by then we already have enough data and an accurate framework; (3) Moreover, since we are actually predicting the rewards and potential gains for all the nodes within the user graph (or the "approximated" user graph), GNB is able to handle multiple users in each round simultaneously without running the recommendation procedure multiple times, which is efficient in real-world cases.
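Remark (2) concerns how often the models are retrained. The training step being repeated (Algorithm 3) fits an exploitation model on rewards and an exploration model on gradient inputs with residual targets. The sketch below mimics that two-stage structure with linear models swapped in for the neural networks, so the gradient of the exploitation model with respect to its parameters is simply the context. The data, step size, and helper names are illustrative assumptions, and restarting gradient descent from the initial parameters at every round is also an assumption.

```python
import numpy as np

def gd_fit(inputs, targets, theta0, step, num_steps):
    """Plain gradient descent on sum_tau (theta . z_tau - y_tau)^2."""
    theta = theta0.copy()
    for _ in range(num_steps):
        err = inputs @ theta - targets
        theta = theta - step * 2.0 * (inputs.T @ err)
    return theta

def sq_loss(inputs, targets, theta):
    return float(np.sum((inputs @ theta - targets) ** 2))

rng = np.random.default_rng(0)
d, rounds = 4, 40
contexts = rng.normal(size=(rounds, d))
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)  # unit-length contexts
rewards = (contexts @ np.array([0.5, -0.2, 0.1, 0.3]) > 0).astype(float)

theta1 = np.zeros(d)  # exploitation parameters before any training
grads, residuals = [], []
for tau in range(rounds):
    x, r = contexts[tau], rewards[tau]
    # Linear stand-in: the gradient of f1(x; theta) = theta . x w.r.t. theta is x.
    grads.append(x)
    # The residual uses the parameters available before training on round tau.
    residuals.append(r - float(theta1 @ x))
    # Stage 1: refit the exploitation model on all (context, reward) pairs so far.
    theta1 = gd_fit(contexts[:tau + 1], rewards[:tau + 1], np.zeros(d),
                    step=0.01, num_steps=50)

grads, residuals = np.array(grads), np.array(residuals)
# Stage 2: fit the exploration model to map gradients to residuals.
theta2 = gd_fit(grads, residuals, np.zeros(d), step=0.01, num_steps=50)
```

The exploration model therefore learns the gap between the observed reward and what the exploitation model predicted at the time, which is the quantity an adaptive exploration term is meant to estimate.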

ALGORITHM 3: Model Training

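Input: Initial parameter $\Theta_0$, step sizes $\eta_1, \eta_2$, training steps $J_1, J_2$, network width $m$. Updated user graphs $G^{(1)}_t, G^{(2)}_t$. Served user $u_t$.
Output: Updated model parameters $[\Theta^{(1)}_{u_t}]_t$, $[\Theta^{(2)}_{u_t}]_t$, $[\Theta^{(1)}_{gnn}]_t$ and $[\Theta^{(2)}_{gnn}]_t$.

$[\Theta^{(1)}_{u_t}]_t, [\Theta^{(2)}_{u_t}]_t$ = User-Model-Training$(u_t, [\Theta^{(1)}_{u_t}]_0, [\Theta^{(2)}_{u_t}]_0)$.
for $\forall u' \in \mathcal{U}, u' \neq u_t$ do
    $[\Theta^{(1)}_{u'}]_t \leftarrow [\Theta^{(1)}_{u'}]_{t-1}$, $[\Theta^{(2)}_{u'}]_t \leftarrow [\Theta^{(2)}_{u'}]_{t-1}$
end
$[\Theta^{(1)}_{gnn}]_t, [\Theta^{(2)}_{gnn}]_t$ = GNN-Model-Training$([\Theta^{(1)}_{gnn}]_0, [\Theta^{(2)}_{gnn}]_0)$.
Return $[\Theta^{(1)}_{u_t}]_t$, $[\Theta^{(2)}_{u_t}]_t$, $[\Theta^{(1)}_{gnn}]_t$, $[\Theta^{(2)}_{gnn}]_t$.

Procedure User-Model-Training$(u_t, [\Theta^{(1)}_{u_t}]_0, [\Theta^{(2)}_{u_t}]_0)$
    $[\Theta^{(1)}_{u_t}]^0 \leftarrow [\Theta^{(1)}_{u_t}]_0$, $[\Theta^{(2)}_{u_t}]^0 \leftarrow [\Theta^{(2)}_{u_t}]_0$.
    # Training of $f^{(1)}_u(\cdot)$
    Let $\mathcal{L}(\Theta^{(1)}_{u_t}) := \sum_{\tau \in \mathcal{T}_{u_t,t}} \big|f^{(1)}_u(x_\tau; \Theta^{(1)}_{u_t}) - r_\tau\big|^2$
    for $j = 1, 2, \dots, J_1$ do
        $[\Theta^{(1)}_{u_t}]^j = [\Theta^{(1)}_{u_t}]^{j-1} - \eta_1 \cdot \nabla_\Theta \mathcal{L}([\Theta^{(1)}_{u_t}]^{j-1})$
    end
    # Training of $f^{(2)}_u(\cdot)$
    Let $\mathcal{L}(\Theta^{(2)}_{u_t}) := \sum_{\tau \in \mathcal{T}_{u_t,t}} \big|f^{(2)}_u\big(\nabla_{\Theta^{(1)}_{u_t}} f^{(1)}_u(x_\tau; [\Theta^{(1)}_{u_t}]_{\tau-1}); \Theta^{(2)}_{u_t}\big) - \big(r_\tau - f^{(1)}_u(x_\tau; [\Theta^{(1)}_{u_t}]_{\tau-1})\big)\big|^2$
    for $j = 1, 2, \dots, J_1$ do
        $[\Theta^{(2)}_{u_t}]^j = [\Theta^{(2)}_{u_t}]^{j-1} - \eta_1 \cdot \nabla_\Theta \mathcal{L}([\Theta^{(2)}_{u_t}]^{j-1})$
    end
    Let $[\widehat{\Theta}^{(1)}_{u_t}]_t \leftarrow [\Theta^{(1)}_{u_t}]^{J_1}$, $[\widehat{\Theta}^{(2)}_{u_t}]_t \leftarrow [\Theta^{(2)}_{u_t}]^{J_1}$
    Sample and return new parameters $([\Theta^{(1)}_{u_t}]_t, [\Theta^{(2)}_{u_t}]_t) \sim \{([\widehat{\Theta}^{(1)}_{u_t}]_\tau, [\widehat{\Theta}^{(2)}_{u_t}]_\tau)\}_{\tau \in [t]}$.
end

Procedure GNN-Model-Training$([\Theta^{(1)}_{gnn}]_0, [\Theta^{(2)}_{gnn}]_0)$
    $[\Theta^{(1)}_{gnn}]^0 \leftarrow [\Theta^{(1)}_{gnn}]_0$, $[\Theta^{(2)}_{gnn}]^0 \leftarrow [\Theta^{(2)}_{gnn}]_0$.
    # Training of $f^{(1)}_{gnn}(\cdot)$
    Let $\mathcal{L}(\Theta^{(1)}_{gnn}) := \sum_{\tau \in [t]} \big|f^{(1)}_{gnn}(x_\tau, G^{(1)}_\tau; \Theta^{(1)}_{gnn}) - r_\tau\big|^2$
    for $j = 1, 2, \dots, J_2$ do
        $[\Theta^{(1)}_{gnn}]^j = [\Theta^{(1)}_{gnn}]^{j-1} - \eta_2 \cdot \nabla_\Theta \mathcal{L}([\Theta^{(1)}_{gnn}]^{j-1})$
    end
    # Training of $f^{(2)}_{gnn}(\cdot)$
    Apply $f^{(1)}_{gnn}(x_\tau)$ to denote $f^{(1)}_{gnn}(x_\tau, G^{(1)}_\tau; [\Theta^{(1)}_{gnn}]_{\tau-1})$.
    Let $\mathcal{L}(\Theta^{(2)}_{gnn}) := \sum_{\tau \in [t]} \big|f^{(2)}_{gnn}\big(\nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_\tau), G^{(2)}_\tau; \Theta^{(2)}_{gnn}\big) - \big(r_\tau - f^{(1)}_{gnn}(x_\tau, G^{(1)}_\tau)\big)\big|^2$
    for $j = 1, 2, \dots, J_2$ do
        $[\Theta^{(2)}_{gnn}]^j = [\Theta^{(2)}_{gnn}]^{j-1} - \eta_2 \cdot \nabla_\Theta \mathcal{L}([\Theta^{(2)}_{gnn}]^{j-1})$
    end
    Let $[\widehat{\Theta}^{(1)}_{gnn}]_t \leftarrow [\Theta^{(1)}_{gnn}]^{J_2}$, $[\widehat{\Theta}^{(2)}_{gnn}]_t \leftarrow [\Theta^{(2)}_{gnn}]^{J_2}$
    Sample and return new parameters $([\Theta^{(1)}_{gnn}]_t, [\Theta^{(2)}_{gnn}]_t) \sim \{([\widehat{\Theta}^{(1)}_{gnn}]_\tau, [\widehat{\Theta}^{(2)}_{gnn}]_\tau)\}_{\tau \in [t]}$.
end

E PROOF OF THEOREM 4.2

Before presenting the regret bound after $T$ rounds, we proceed to bound the regret at a single time step $t \in [T]$. Recall that there are two kinds of estimated user graphs $\{G^{(1)}_{i,t}\}_{i \in [a]}$, $\{G^{(2)}_{i,t}\}_{i \in [a]}$ at each time step $t$, while we can also build the true user exploitation graphs $\{G^{(1),*}_{i,t}\}_{i \in [a]}$ and the true user exploration graphs $\{G^{(2),*}_{i,t}\}_{i \in [a]}$. With $r_t$, $r^*_t$ separately being the rewards for the actual selected arm $x_t \in \mathcal{X}_t$ and the optimal arm $x^*_t \in \mathcal{X}_t$, we formulate the pseudo-regret for a single round $t$ as $R_t = \mathbb{E}[r^*_t \mid u_t, \mathcal{X}_t] - \mathbb{E}[r_t \mid u_t, \mathcal{X}_t]$, based on the candidate arms $\mathcal{X}_t$ and the received user $u_t$ for the current round $t$. Here, regarding our arm pulling mechanism in Algorithm 1, we have

$$f_{gnn}(x_t) = f^{(1)}_{gnn}(x_t, G^{(1)}_t; [\Theta^{(1)}_{gnn}]_{t-1}) + f^{(2)}_{gnn}(\nabla f^{(1)}_t(x_t), G^{(2)}_t; [\Theta^{(2)}_{gnn}]_{t-1})$$

given the selected arm $x_t$ with the input gradient $\nabla f^{(1)}_t(x_t) = \frac{\nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(x_t, G^{(1)}_t; [\Theta^{(1)}_{gnn}]_{t-1})}{c_g L}$ ($c_g > 0$ is the normalization factor, such that $\|\nabla f^{(1)}_t(x_t)\|_2 \le 1$), and the estimated user graphs $G^{(1)}_t, G^{(2)}_t$ related to $x_t$. Analogously, we also have the estimated user graphs $G^{(1)}_{t,*}, G^{(2)}_{t,*}$ for the optimal arm $x^*_t$.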
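Then, in round $t \in [T]$, the single-round regret $R_t$ can be bounded as

$$\begin{aligned}
R_t &= \mathbb{E}[r^*_t \mid u_t, \mathcal{X}_t] - \mathbb{E}[r_t \mid u_t, \mathcal{X}_t] \\
&= \mathbb{E}[r^*_t \mid u_t, \mathcal{X}_t] - f_{gnn}(x_t) + f_{gnn}(x_t) - \mathbb{E}[r_t \mid u_t, \mathcal{X}_t] \\
&\overset{(i)}{\le} \mathbb{E}[r^*_t \mid u_t, \mathcal{X}_t] - f_{gnn}(x^*_t) + f_{gnn}(x_t) - \mathbb{E}[r_t \mid u_t, \mathcal{X}_t] \\
&\le \mathbb{E}\big[|r^*_t - f_{gnn}(x^*_t)| \,\big|\, u_t, \mathcal{X}_t\big] + \mathbb{E}\big[|r_t - f_{gnn}(x_t)| \,\big|\, u_t, \mathcal{X}_t\big] \\
&= \mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1)}_t(x^*_t), G^{(2)}_{t,*}; [\Theta^{(2)}_{gnn}]_{t-1}) - (r^*_t - f^{(1)}_{gnn}(x^*_t, G^{(1)}_{t,*}; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big] \\
&\quad + \mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1)}_t(x_t), G^{(2)}_t; [\Theta^{(2)}_{gnn}]_{t-1}) - (r_t - f^{(1)}_{gnn}(x_t, G^{(1)}_t; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big] \\
&= CB_t(x_t) + CB_t(x^*_t)
\end{aligned}$$

where inequality (i) is due to the arm pulling mechanism, i.e., $f_{gnn}(x_t) \ge f_{gnn}(x^*_t)$, and $CB_t(\cdot)$ is the regret bound function at round $t$ formulated by the last equation. Then, given an arbitrary candidate arm $x \in \mathcal{X}_t$ with reward $r$, and its estimated user graphs $G^{(1)}, G^{(2)}$, we have

$$\begin{aligned}
CB_t(x) &= \mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1)}_t(x), G^{(2)}; [\Theta^{(2)}_{gnn}]_{t-1}) - (r - f^{(1)}_{gnn}(x, G^{(1)}; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big] \\
&\le \underbrace{\mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1),*}_t(x), G^{(2),*}; [\Theta^{(2)}_{gnn}]_{t-1}) - (r - f^{(1)}_{gnn}(x, G^{(1),*}; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big]}_{I_1} \\
&\quad + \underbrace{\mathbb{E}\big[|f^{(1)}_{gnn}(x, G^{(1),*}; [\Theta^{(1)}_{gnn}]_{t-1}) - f^{(1)}_{gnn}(x, G^{(1)}; [\Theta^{(1)}_{gnn}]_{t-1})| \,\big|\, u_t, \mathcal{X}_t\big]}_{I_2} \\
&\quad + \underbrace{\mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1),*}_t(x), G^{(2),*}; [\Theta^{(2)}_{gnn}]_{t-1}) - f^{(2)}_{gnn}(\nabla f^{(1),*}_t(x), G^{(2)}; [\Theta^{(2)}_{gnn}]_{t-1})| \,\big|\, u_t, \mathcal{X}_t\big]}_{I_3} \\
&\quad + \underbrace{\mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1),*}_t(x), G^{(2)}; [\Theta^{(2)}_{gnn}]_{t-1}) - f^{(2)}_{gnn}(\nabla f^{(1)}_t(x), G^{(2)}; [\Theta^{(2)}_{gnn}]_{t-1})| \,\big|\, u_t, \mathcal{X}_t\big]}_{I_4}.
\end{aligned}$$

Here, the term $I_1$ represents the estimation error induced by the GNN model parameters $\{[\Theta^{(1)}_{gnn}]_{t-1}, [\Theta^{(2)}_{gnn}]_{t-1}\}$, and the term $I_2$ denotes the error caused by the estimation of the user exploitation graph. Then, the error term $I_3$ is caused by the estimation of the user exploration graph, and the term $I_4$ is the output difference given the input gradients $\nabla f^{(1),*}_t(x)$ and $\nabla f^{(1)}_t(x)$, which are individually generated with the true user exploitation graph $G^{(1),*}$ and the estimated exploitation graph $G^{(1)}$. These four terms $I_1, I_2, I_3, I_4$ are respectively bounded by Lemma G.2 (Corollary G.3 and the bounds in Subsection G.1), Lemma G.4, Lemma G.5, and Lemma G.7 in the appendix. Then, with the notation from Theorem 4.2, the pseudo regret after $T$ rounds, namely $R(T)$, can be bounded by

$$\begin{aligned}
R(t) = \sum_{t \in [T]} R_t &\le 2 \cdot \sqrt{t} \cdot \Big(\sqrt{2\xi_2} + \frac{3L}{\sqrt{2}} + (1 + \gamma_2)\sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) \\
&\quad + \Big(1 + O\Big(\frac{tL^3 \log^{5/6}(m)}{\rho^{1/3} m^{1/6}}\Big)\Big) \cdot O\Big(\frac{t^3 L}{\rho\sqrt{m}} \log(m)\Big) + O\Big(\frac{t^4 L^2 \log^{11/6}(m)}{\rho^{4/3} m^{1/6}}\Big) \\
&\quad + 2 \cdot O(L) \cdot \sqrt{8t} \cdot \Big(\sqrt{2\xi_1} + \frac{3L}{\sqrt{2}} + (1 + \gamma_1)\sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + O\Big(\frac{tL^5 \log^{5/6}(m)}{\rho^{1/3} m^{1/6}}\Big) \\
&\quad + O(L^2) \cdot \sqrt{8t} \cdot \Big(\sqrt{2\xi_1} + \frac{3L}{\sqrt{2}} + (1 + \gamma_1)\sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + 4\Gamma_t \\
\Longrightarrow R(T) &\le 2 \cdot \sqrt{T}\Big(\sqrt{2\xi_2} + \frac{3L}{\sqrt{2}} + (1 + \gamma_2)\sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) \\
&\quad + \sqrt{T} \cdot O(L^2) \cdot \Big(\sqrt{2\xi_1} + \frac{3L}{\sqrt{2}} + (1 + \gamma_1)\sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + O(1) \\
&\le \sqrt{T} \cdot \Big(\big(\sqrt{8\xi_2} + O(L^2)\sqrt{2\xi_1}\big) + O(L^3) + O(L^2) \cdot \sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + \sqrt{T} \cdot O(L) + O(1)
\end{aligned}$$

where the second inequality holds because we have a sufficiently large network width $m \ge \Omega\big(\mathrm{Poly}(T, L, a, \frac{1}{\rho}) \cdot \log(1/\delta)\big)$ as indicated in Theorem 4.2. Here, since $m \ge \Omega(\mathrm{Poly}(T))$, the terms $\gamma_1, \gamma_2$ can also be bounded by $O(1)$. Therefore,

$$\begin{aligned}
R(T) &\le \sqrt{T} \cdot \Big(\big(\sqrt{8\xi_2} + O(L^2)\sqrt{2\xi_1}\big) + O(L^3) + O(L^2) \cdot \sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + \sqrt{T} \cdot O(L) + O(1) \\
&\le \sqrt{T} \cdot \Big(O(L^3) + O(L^2) \cdot \sqrt{2\log\big(\tfrac{Tn \cdot a}{\delta}\big)}\Big) + \sqrt{T} \cdot O(L) + O(L^2) + O(1) \\
&= \sqrt{T} \cdot O(L) + \sqrt{T} \cdot O(L^3) + \sqrt{T} \cdot O(L^2) \cdot \sqrt{\log\big(\tfrac{Tn \cdot a}{\delta}\big)} + O(L^2) + O(1)
\end{aligned}$$

when we have $\xi_1, \xi_2 \le O(\frac{1}{T})$. The proof is then completed.

Apart from the two remarks in the main body (Remarks 4.3 and 4.4), we also want to mention another improvement over existing works with Remark E.1 below.

Remark E.1 (Removing $d$, $\tilde{d}$ Terms).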
Existing neural single-bandit (i.e., with no user collaboration) algorithms (Zhou et al., 2020; Zhang et al., 2021) T log(T )) with the term of arm dimension $d$, which can be large given arm contexts in a high-dimensional space. Here, we improve their bounds by a multiplicative factor of $\log(T)$ and remove the dimension terms $d$, $\tilde{d}$. We apply the generalization bound for over-parameterized neural networks (Allen-Zhu et al., 2019; Cao & Gu, 2019) instead of regression-based analysis to remove the $\log(T)$ term, and the generalization error is also unrelated to $d$ or $\tilde{d}$ for over-parameterized neural networks.
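The first step of the proof above, bounding the single-round regret by the two estimation errors at the selected and the optimal arm, can be checked numerically. The sketch below is a simplified sanity check rather than part of the analysis: it replaces the conditional expectations with fixed per-arm expected rewards, models $f_{gnn}$ as those rewards plus Gaussian noise, and uses hypothetical names throughout.

```python
import numpy as np

def regret_decomposition_holds(num_trials=1000, num_arms=10, seed=1):
    """Check E[r*] - E[r_t] <= |E[r*] - f(x*)| + |f(x_t) - E[r_t]| for greedy selection."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        expected_reward = rng.uniform(0.0, 1.0, size=num_arms)
        # Stand-in for f_gnn: a noisy estimate of each arm's expected reward.
        score = expected_reward + rng.normal(0.0, 0.3, size=num_arms)
        chosen = int(np.argmax(score))           # arm pulled by the algorithm
        best = int(np.argmax(expected_reward))   # optimal arm
        regret = expected_reward[best] - expected_reward[chosen]
        bound = (abs(expected_reward[best] - score[best])
                 + abs(score[chosen] - expected_reward[chosen]))
        if regret > bound + 1e-12:
            return False
    return True

print(regret_decomposition_holds())  # True
```

The check relies only on the selected arm having the largest score, which is exactly the property used in inequality (i).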

G PROOF OF THE REGRET BOUND

In this section, we present the generalization results of the GNN models $f^{(1)}_{gnn}(\cdot; \Theta^{(1)}_{gnn})$, $f^{(2)}_{gnn}(\cdot; \Theta^{(2)}_{gnn})$. Recall that up to round $t$, we have all the past arm-reward pairs $\mathcal{P}_t = \{(x_\tau, r_\tau)\}_{\tau \in [t-1]}$ for the previous $t - 1$ time steps. Analogous to the generalization analysis of user models in Section F, we adopt the operation in Eq. 9 on the gradients $\nabla_{\Theta^{(1)}_{gnn}} f^{(1)}_{gnn}(\cdot; \Theta^{(1)}_{gnn})$ to comply with the assumptions of unit length and separateness, and the transformed gradient input is denoted as $\nabla f^{(1)}(x)$ given the arm $x$.

G.1 BOUNDING THE PARAMETER ESTIMATION ERROR

Regarding Eq. 8, given an arbitrary candidate arm $x \in \mathcal{X}_t$ with its reward $r$, and its user graphs $G^{(1)}, G^{(2)}$, we have the bound for the estimation error as

$$\begin{aligned}
CB_t(x) &= \mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1)}_t(x), G^{(2)}; [\Theta^{(2)}_{gnn}]_{t-1}) - (r_t - f^{(1)}_{gnn}(x, G^{(1)}; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big] \\
&\le \underbrace{\mathbb{E}\big[|f^{(2)}_{gnn}(\nabla f^{(1),*}_t(x), G^{(2),*}; [\Theta^{(2)}_{gnn}]_{t-1}) - (r_t - f^{(1)}_{gnn}(x, G^{(1),*}; [\Theta^{(1)}_{gnn}]_{t-1}))| \,\big|\, u_t, \mathcal{X}_t\big]}_{I_1} + I_2 + I_3 + I_4
\end{aligned}$$

where the term $I_1$ represents the estimation error induced by the GNN model parameters $\{[\Theta^{(1)}_{gnn}]_{t-1}, [\Theta^{(2)}_{gnn}]_{t-1}\}$. Based on our arm selection strategy given in Algorithm 1, we have the selected arms and their rewards $\{x_\tau, r_\tau\}_{\tau \in [t-1]}$ up to round $t$. We first proceed to bound the term $I_1$ w.r.t. the selected arm $x_t$, i.e., $CB_t(x_t)$. Analogous to the user-specific models, we also have bounded outputs for the GNN models, as shown in the following lemma.

Lemma G.1. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_2 \in (0, 1)$, and the past records $\mathcal{P}_t$ up to time step $t$, we suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2. Then, with probability at least $1 - \delta$ and given an arm-reward pair $(x, r)$, we have

$$|f^{(1)}_{gnn}(x; [\widehat{\Theta}^{(1)}_{gnn}]_t)| \le \gamma_2$$

where $\gamma_2 = 2 + O\big(\frac{t^3 L}{\rho\sqrt{m}} \log m\big) + O\big(\frac{L^2 t^4}{\rho^{4/3} m^{1/6}} \log^{11/6}(m)\big)$.

Proof. The proof of this lemma follows an analogous approach as in Lemma F.1, where we have proved the conclusion for the FC networks. Given an arm $x$, we denote the adjacency matrix of its estimated user graph $G^{(1)}$ as $A^{(1)}$, and we have the normalized adjacency matrix $S^{(1)} = A^{(1)}/n$. For the received user $u_t \in \mathcal{U}$, we can regard the corresponding row of the matrix multiplication $S \cdot X$, represented by $h_{u_t} = [S \cdot X]_{i:}$, as the aggregated input of the network for the user-arm pair $(x, u_t)$. Note that in this way, the rest of the network can be regarded as an $(L+1)$-layer FC network (one-layer GNN + $L$-layer FC network), where the weight matrix of the first layer is $\Theta^{(1)}_{agg}$. Then, to make sure each aggregated input has a norm of 1, we apply an additional transformation mentioned in Eq. 9 as $\tilde{h}_{u_t} = \phi(h_{u_t}, x) = \big(\frac{h_{u_t}}{\sqrt{2}}, \frac{x}{2}, c_{u_t}\big)$ where $c_{u_t} = \sqrt{\frac{3}{4} - \frac{1}{2}\|h_{u_t}\|_2^2}$. This transformation ensures $\|\tilde{h}_{u_t}\|_2 = 1$ while preserving the original information w.r.t. the user-arm pair $(x, u_t)$, as it does not change the original aggregated hidden representation. Meanwhile, this transformation also ensures the separateness of the transformed contexts to be at least $\frac{\rho}{2}$, which fits the original data separateness assumption (Assumption 4.1). Finally, applying a similar approach as for the FC networks (Lemma F.1) to the transformed aggregated hidden representations completes the proof.

Regarding the definition of the true reward mapping function in Section 2, we have the following lemma for the term $I_1$ given the arm-reward pair $(x_t, r_t)$.

Lemma G.2. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_2 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t}$, we suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2, and randomly draw the parameters $[\Theta^{(1)}_{gnn}]_t \sim \{[\widehat{\Theta}^{(1)}_{gnn}]_\tau\}_{\tau \in [t]}$, $[\Theta^{(2)}_{gnn}]_t \sim \{[\widehat{\Theta}^{(2)}_{gnn}]_\tau\}_{\tau \in [t]}$. Then, with probability at least $1 - \delta$ given a sampled arm-reward pair $(x, r)$, we have

$$\sum_{\tau \in [t]} \mathbb{E}\Big[\big|f^{(2)}_{gnn}\big(\nabla f^{(1),*}_t(x_t), G^{(2),*}_t; [\Theta^{(2)}_{gnn}]_{t-1}\big) - \big(r_t - f^{(1)}_{gnn}(x_t, G^{(1),*}_t; [\Theta^{(1)}_{gnn}]_{t-1})\big)\big| \,\Big|\, u_t, \mathcal{X}_t\Big] \le \sqrt{t} \cdot \Big(\sqrt{2\xi_2} + \frac{3L}{\sqrt{2}} + (1 + \gamma_2)\sqrt{2\log\big(\tfrac{tn \cdot a}{\delta}\big)}\Big)$$

where $\gamma_2 = 2 + O\big(\frac{t^3 L}{\rho\sqrt{m}} \log m\big) + O\big(\frac{L^2 t^4}{\rho^{4/3} m^{1/6}} \log^{11/6}(m)\big)$.

Proof. Based on the conclusion of Lemma G.1, we have the upper bound

$$\big|f^{(2)}_{gnn}\big(\nabla f^{(1),*}_t(x_t), G^{(2),*}_t; [\Theta^{(2)}_{gnn}]_{t-1}\big) - \big(r_t - f^{(1)}_{gnn}(x_t, G^{(1),*}_t; [\Theta^{(1)}_{gnn}]_{t-1})\big)\big| \le 1 + 2\gamma_2$$

by simply using the triangle inequality.
Then we proceed to define the sequence V τ , τ ∈ [t] as V τ =E Xτ f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) -f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) . And since the candidate arms and the corresponding rewards are associated with the same reward mapping function h(•), the sequence V τ is a martingale difference sequence with the expectation E[V τ F τ ] = E Xτ f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) -E Xτ f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) = 0. where F τ denotes the filtration of all the past records P τ up to time step τ . Then, we will have the mean value for this sequence as 1 t τ ∈[t] V τ = 1 t τ ∈[t] E Xτ f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) - 1 t τ ∈[t] f (2) gnn (∇f (1), * τ (x τ ), G (2), * τ ; [Θ (2) gnn ] τ -1 ) -(r τ -f (1) gnn (x τ , G (1), * τ ; [Θ (1) gnn ] τ -1 )) . As it has shown that the sequence is a martingale difference sequence, by directly applying the Azuma-Hoeffding inequality, we could bound the difference between the mean and its expectation as P 1 t τ ∈[t] V τ - 1 t τ ∈[t] E[V τ ] ≥ (1 + 2γ 2 ) 2 log(1/δ) t ≤ δ with the probability at least 1 -2δ. Since it has shown that the V τ is of zero expectation, we have the second term on the LHS of the inequality to be zero. 
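Then, the inequality above is equivalent to

$$\frac{1}{t} \sum_{\tau \in [t]} V_\tau \leq (1 + 2\gamma_2) \sqrt{\frac{2 \log(1/\delta)}{t}} \implies \frac{1}{t} \sum_{\tau \in [t]} \mathbb{E}_{\mathcal{X}_\tau}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; [\Theta_{gnn}^{(2)}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big| \Big] \leq \frac{1}{t} \sum_{\tau \in [t]} \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; [\Theta_{gnn}^{(2)}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big| + (1 + 2\gamma_2) \sqrt{\frac{2 \log(1/\delta)}{t}}$$

with probability at least $1 - 2\delta$. Then, for the RHS of the above inequality, by further applying Lemma G.8 and Lemma G.12, we have

$$\frac{1}{t} \sum_{\tau \in [t]} \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; [\Theta_{gnn}^{(2)}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big| \leq \frac{1}{t} \sum_{\tau \in [t]} \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; \widetilde{\Theta}_{gnn}^{(2)}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big| + \frac{3L}{\sqrt{2t}}$$

with regard to the parameter $\widetilde{\Theta}_{gnn}^{(2)}$ s.t. $\|\widetilde{\Theta}_{gnn}^{(2)} - [\Theta_{gnn}^{(2)}]_0\|_2 \leq O\big(\frac{t^3}{\rho \sqrt{m}} \log m\big)$. Therefore, by applying the conclusion from Lemma G.8, we can bound the empirical loss w.r.t. $\widetilde{\Theta}_{gnn}^{(2)}$ as

$$\frac{1}{t} \sum_{\tau \in [t]} \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; \widetilde{\Theta}_{gnn}^{(2)}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big| \leq \frac{1}{\sqrt{t}} \sqrt{ \sum_{\tau \in [t]} \Big( f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau), G_\tau^{(2),*}; \widetilde{\Theta}_{gnn}^{(2)}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau, G_\tau^{(1),*}; [\Theta_{gnn}^{(1)}]_{\tau-1})\big) \Big)^2 } \leq \sqrt{\frac{2\xi_2}{t}}.$$

Finally, assembling all the components and applying the union bound completes the proof.

Analogous to Lemma G.1, we also have the following corollary of the generalization results for the optimal arms and their rewards {x * τ , r  gnn (•) trained on corresponding gradients and residuals.

Corollary G.3. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_2 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t}$, we suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2, and randomly draw the parameters $[\Theta_{gnn}^{(1),*}]_t \sim \{[\widehat{\Theta}_{gnn}^{(1),*}]_\tau\}_{\tau \in [t]}$, $[\Theta_{gnn}^{(2),*}]_t \sim \{[\widehat{\Theta}_{gnn}^{(2),*}]_\tau\}_{\tau \in [t]}$.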
Then, with probability at least $1 - \delta$ given a sampled arm-reward pair $(x, r)$, we have

$$\sum_{\tau \in [t]} \mathbb{E}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_{t,*}^{(2),*}; [\Theta_{gnn}^{(2)}]_{t-1}\big) - \big(r_t^* - f_{gnn}^{(1)}(x_t^*, G_{t,*}^{(1),*}; [\Theta_{gnn}^{(1)}]_{t-1})\big) \Big| \,\Big|\, u_t, \mathcal{X}_t \Big] \leq \sqrt{t} \cdot \Big( \sqrt{2\xi_2} + \frac{3L}{\sqrt{2}} + (1 + \gamma_2)\sqrt{2 \log\big(\tfrac{tn \cdot a}{\delta}\big)} \Big) + \Gamma_t$$

where $\gamma_2 = 2 + O\big(\frac{t^3 L}{\rho \sqrt{m}} \log m\big) + O\big(\frac{L^2 t^4}{\rho^{4/3} m^{1/6}} \log^{11/6}(m)\big)$.

Proof. The proof of this corollary is comparable to the proof of Lemma G.2. At each time step $t$, regarding the definition of the optimal arm, we have $x_t^* = \arg\max_{x_{i,t} \in \mathcal{X}_t} \mathbb{E}[r_{i,t} \mid u_t, x_{i,t}]$. Then, analogously, we can define the difference sequence as

$$V_\tau^* = \mathbb{E}_{\mathcal{X}_\tau}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau^*), G_\tau^{(2),*}; [\Theta_{gnn}^{(2),*}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau^*, G_\tau^{(1),*}; [\Theta_{gnn}^{(1),*}]_{\tau-1})\big) \Big| \Big] - \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau^*), G_\tau^{(2),*}; [\Theta_{gnn}^{(2),*}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau^*, G_\tau^{(1),*}; [\Theta_{gnn}^{(1),*}]_{\tau-1})\big) \Big|$$

where, by reusing the notation, we denote $G_\tau^{(1),*}$, $G_\tau^{(2),*}$ to be the true user graphs w.r.t. the optimal arm $x_\tau^*$ here. Then, similar to the proof of Lemma G.2, the sequence is a martingale difference sequence, as

$$\mathbb{E}[V_\tau^* \mid \mathcal{F}_\tau^*] = \mathbb{E}_{\mathcal{X}_\tau}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau^*), G_\tau^{(2),*}; [\Theta_{gnn}^{(2),*}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau^*, G_\tau^{(1),*}; [\Theta_{gnn}^{(1),*}]_{\tau-1})\big) \Big| \Big] - \mathbb{E}_{\mathcal{X}_\tau}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_\tau^{(1),*}(x_\tau^*), G_\tau^{(2),*}; [\Theta_{gnn}^{(2),*}]_{\tau-1}\big) - \big(r_\tau - f_{gnn}^{(1)}(x_\tau^*, G_\tau^{(1),*}; [\Theta_{gnn}^{(1),*}]_{\tau-1})\big) \Big| \Big] = 0$$

with $\mathcal{F}_\tau^*$ being the filtration of past optimal arms up to round $\tau$. Then, we can also apply the Azuma-Hoeffding inequality to bound the difference between the mean $\frac{1}{t} \sum_{\tau \in [t]} V_\tau^*$ and its expectation $\frac{1}{t} \sum_{\tau \in [t]} \mathbb{E}[V_\tau^*]$. Finally, as in the proof of Lemma G.2, applying the conclusions from Lemma G.8 and Lemma G.12 completes the proof.

Then, recalling the definition of the confidence bound function $CB_t(x_t^*)$ w.r.t.
the optimal arm $x_t^*$, we have the corresponding term $I_1$ as

$$I_1 = \mathbb{E}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_t^{(2),*}; [\Theta_{gnn}^{(2)}]_{t-1}\big) - \big(r_t - f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1)}]_{t-1})\big) \Big| \,\Big|\, u_t, \mathcal{X}_t \Big].$$

It can be further decomposed as

$$\Big| f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_t^{(2),*}; [\Theta_{gnn}^{(2)}]_{t-1}\big) - \big(r_t - f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1)}]_{t-1})\big) \Big| \leq \Big| f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_t^{(2),*}; [\Theta_{gnn}^{(2),*}]_{t-1}\big) - \big(r_t - f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1),*}]_{t-1})\big) \Big| + \Big| f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1),*}]_{t-1}) - f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1)}]_{t-1}) \Big| + \Big| f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_t^{(2),*}; [\Theta_{gnn}^{(2),*}]_{t-1}\big) - f_{gnn}^{(2)}\big(\nabla f_t^{(1),*}(x_t^*), G_t^{(2),*}; [\Theta_{gnn}^{(2)}]_{t-1}\big) \Big|$$

where the first term on the RHS can be bounded by Corollary G.3. For the second term, we first denote $h_i^* \in \mathbb{R}^m$ to be the aggregated hidden representation w.r.t. the user-arm pair $(u_i, x_t^*)$, where $u_i$ is the $i$-th user. Here, $h_t^*$ is essentially the row in the aggregated representation matrix $H_{agg}$ corresponding to the user-arm pair $(u_t, x_t^*)$. Therefore, for the received user $u_t \in \mathcal{U}$, the reward estimations under the two sets of parameters have the same input $h_t^*$. Since the outputs w.r.t. the two sets of parameters share the same input $h_t^*$, we can apply the conclusion from Lemma G.14, which leads to

$$\Big| f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1),*}]_{t-1}) - f_{gnn}^{(1)}(x_t^*, G_t^{(1),*}; [\Theta_{gnn}^{(1)}]_{t-1}) \Big| \leq \Big( 1 + O\Big(\frac{t L^3 \log^{5/6}(m)}{\rho^{1/3} m^{1/6}}\Big) \Big) \cdot O\Big(\frac{t^3 L}{\rho \sqrt{m}} \log(m)\Big) + O\Big(\frac{t^4 L^2 \log^{11/6}(m)}{\rho^{4/3} m^{1/6}}\Big).$$

Analogously, we have the same bound for the third term on the RHS. Summing up the bounds for the three terms on the RHS finishes deriving the upper bound for term $I_1$.

with the probability at least 1 -δ.
Finally, applying the union bound for all the (n 2 -n)/2 user pairs and re-scaling the δ would give us the estimation error bound for the reward difference for each pair of users. To achieve the upper bound, we apply the Corollary F.3 by considering the trajectory Pu,t consists of the past arm-reward pairs {x iτ ,τ , r iτ ,τ } τ ∈[t] , where arm x iτ ,τ leads to the largest estimation error of the estimation model f (1) uτ (•) in each round τ ∈ [t] . Thus, we have the bound for the edge weight difference, where the difference of an arbitrary i-th row could be bounded by τ ∈[t] ∥[A (1) τ ] i: -[A (1), * τ ] i: ∥ 2 ≤ 2n √ t • 2ξ 1 + 3L √ 2 + (1 + γ 1 ) 2 log( tn • a δ ) + Γ t , which implies τ ∈[t] ∥[S (1) τ ] i: -[S (1), * τ ] i: ∥ 2 ≤ 2 √ t • 2ξ 1 + 3L √ 2 + (1 + γ 1 ) 2 log( tn • a δ ) + Γ t . Therefore, applying the conclusions from Lemma F.2, it leads to τ ∈[t] ||r i,τ -r j,τ | -|f (1) u (x i,τ ;[Θ (1) ui ] t ) -f (1) u (x j,τ ; [Θ (1) uj ] t )|| ≤ 2 t n • 2ξ 1 + 3L √ 2 + (1 + γ 1 ) 2 log( tn • a δ ) + Γ t Afterwards, recalling the transformation at the beginning of this subsection, and given an user-arm pair (u i , x) for the i-th user, we denote h = [S (1) • X] i: and h * = [S (1), * • X] i: . Based the aforementioned transformation in Eq. 9, their transformed form could naturally be h = ( Analogous to the procedure for the user exploitation graph, we have the following lemma to bound the error induced by user exploitation graph estimation. Lemma G.5. For the constants ρ ∈ (0, O( 1 L )) and ξ 1 ∈ (0, 1), given past records P t-1 , we suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Theorem 4.2, and randomly draw the parameter Proof. The proof of this lemma could be derived based on a similar approach as in Lemma G.4. √ 2 2 h, x 2 , c) and h * = ( √ 2 2 h * , x 2 , c * ) with ∥x∥ 2 = 1. 
[Θ (2) gnn ] t ∼ {[ Θ Recall that for the exploration GNN model f  u,t (x) = ∇ [Θ (1) u ] t f (1) u (x;[Θ (1) u ]t) c ′ g L as the input given an arm x and user u ∈ U, whose norm ∥∇f (1) u,t (x)∥ 2 ≤ 1. Given two users u i , u j ∈ U and an arbitrary arm x ∈ X t , we denote their individual reward as r i , r j separately. Then, we could bound the absolute difference between the potential gain estimations as ||(r i -f (1) u (x; [Θ (1) ui ] t )) -(r j -f (1) u (x; [Θ Following a similar approach as in the proof of Lemma G.4, we proceed to consider the aggregated hidden representations for the input gradients. Since the entries between A (2) -A (2), * (and also the distance between S (2) -S (2), * ) are bounded, by adopting the aforementioned transformation in Eq. 9 on the aggregated hidden representations for the input gradients and the initial arm contexts x, we would end up with the bound for the difference between transformed representations for input gradients. Finally, combining the conclusion from Lemma G.13 would give the proof.
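The way a bounded adjacency estimation error carries over to the aggregated representations, as used in the graph-estimation arguments above, can be checked numerically. The following is a minimal sketch and not the authors' implementation: it assumes a hypothetical context matrix `X` with one unit-norm row per user, adjacency entries in [0, 1], and the normalization S = A / n from this appendix. The helper names `aggregate` and `row_error_bound` are illustrative only.

```python
import numpy as np

def aggregate(A, X):
    """Return S @ X with S = A / n; row i is the aggregated input of user i."""
    n = A.shape[0]
    return (A / n) @ X

def row_error_bound(A_est, A_true, X, i):
    """Upper bound on ||h_i - h_i^*||_2 via ||s_i - s_i^*||_2 * ||X||_F."""
    n = A_est.shape[0]
    row_diff = np.linalg.norm((A_est[i] - A_true[i]) / n)
    # The Frobenius norm of X upper-bounds its spectral norm.
    return row_diff * np.linalg.norm(X)

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows (assumption)

A_true = rng.uniform(size=(n, n))              # hypothetical true edge weights
A_est = np.clip(A_true + 0.05 * rng.normal(size=(n, n)), 0.0, 1.0)

H_est, H_true = aggregate(A_est, X), aggregate(A_true, X)
gaps = np.linalg.norm(H_est - H_true, axis=1)
bounds = np.array([row_error_bound(A_est, A_true, X, i) for i in range(n)])
row_norms = np.linalg.norm(H_est, axis=1)
```

Under these assumptions every entry of `gaps` stays below the matching entry of `bounds`, and every aggregated row has norm at most 1, which is the property that makes the subsequent unit-norm transformation well defined.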

G.4 BOUNDING THE GRADIENT INPUT ESTIMATION ERROR

For the last term I 4 in the confidence bound function CB t (x), we have Proof. Following the aggregation procedure and transformation procedure shown in section G.2, we have the transformed representations for given an user-arm pair (u i , x) with the i-th user, which are h = [S (1) • X] i: and h * = [S (1), * • X] i: . And their transformed form could naturally be h = (h, c) and h * = (h * , c * ). From the conclusion of Lemma G.4, we have I 4 = E |f (2) gnn (∇f τ ∈[t] ∥ hτ - h * τ ∥ 2 ≤ √ 8t • 2ξ 1 + 3L √ 2 + (1 + γ 1 ) 2 log( tn • a δ ) + Γ t . Then, applying the conclusion from Lemma G.13 would complete the proof. Then, we have te following lemma to bound the term I 4 . Lemma G.7. For the constants ρ ∈ (0, O( 1 L )) and ξ 1 ∈ (0, 1), given past records P t-1 , we suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Theorem 4.2, and randomly draw the parameter [Θ (1) gnn ] t ∼ {[ Θ ≤ O( t 2 L 5 log 5/6 (m) ρ 1/3 m 1/6 ) + O(L 2 ) • √ 8t • 2ξ 1 + 3L √ 2 + (1 + γ 1 ) 2 log( tn • a δ ) + Γ t . Proof. We again follow the aggregation procedure and transformation procedure presented in section G.2. Then, the aggregated and transformed input gradient could be denoted as we have the transformed representations for given an user-arm pair (u i , x) with the i-th user, which are g = [S (2) • G] i: and g * = [S (2) • G * ] i: , where G denotes the gradient matrix embedded w.r.t. Eq. 4. And their transformed form could be g = ( |f (2) gnn (∇f 



Figure 1: Workflow of the proposed Graph Neural Bandits (GNB) framework.

Remark 4.3 (Reducing √n to log(n)). While our O( T log(T )) bound matches the theoretical bound of the state-of-the-art EE-Net (Ban et al., 2022b), EE-Net only considers the single-bandit setting with no collaboration among users. Compared with Meta-Ban (Ban et al., 2022a), we provide the theoretical analysis from a new perspective regarding the fine-grained user collaborative effect and GNNs. In particular, existing user clustering works (e.g., Ban et al. (2022a); Gentile et al. (2014); Li et al. (2019); Ban & He (2021)) impose an additional √n factor (where n is the number of users) to incorporate user collaborative effects, whereas our GNB only ends up with the log(n) term by adopting GNN models for user collaboration, which is sharper than existing works.

Remark 4.4 (Removing i.i.d. Assumption). Compared with existing clustering of bandits algorithms (e.g., Gentile et al. (2014); Li et al. (2019); Gentile et al. (2017); Ban et al. (2022a)) and the single-bandit algorithm EE-Net

Figure 2: Cumulative regrets on the recommendation and classification data sets.

Figure 3: Cumulative regrets on the two additional classification data sets.

Figure 4: Cumulative regrets for different exploration coefficients α.

Figure 5: Cumulative regrets for different neighborhood hops k.

Figure 6: Cumulative regrets for different neighborhood hops k and exploration parameter α for the MNIST data set.

Figure 7: Cumulative regrets for different number of underlying user groups (MovieLens data set).

Figure 8: Cumulative regrets for different number of underlying user groups (Yelp data set).

Figure 9: Cumulative regrets for different number of users with approximated user neighborhood (MovieLens data set).

} i∈[a] and true user exploration graph {G (2), * i,t } i∈[a] based on the Definition 1 and Definition 3 respectively. Comparably, the true normalized adjacency matrices of G (1), * i,t , i ∈ [a] are represented as S

derive the bound O( d√ T log(T )) based on neural gradient mappings and ridge regression, and they involve the effective dimension term d of the NTK matrix, which can grow along with the scale of network parameters and number of rounds T . The linear user clustering algorithms (e.g., Li et al. (2019); Ban & He (2021); Gentile et al. (2017)) have the bound O(d √

we have the gradients of the GNN exploitation model ∇f

gnn ] τ } τ ∈[t]  . Then, with probability at least 1 -δ, given an arm x ∈ R d , we haveτ ∈[t] |f (2) gnn (∇f (1), * τ (x), G (2) ; [Θ (2) gnn ] τ -1 ) -f (2) gnn (∇f (1) τ (x), G (2) ; [Θ (2) gnn ] τ -1 )|

, c * ). Then, according to the definition of Eq. 4, we could naturally have∥g -g * ∥ 2 ≤ ∥[S (2) ] i: ∥ 2 • ∥∇f

of the adjacency matrix ensures that each of its rows has norm smaller than 1. Finally, applying the conclusions from Lemma G.6 and Lemma G.13 leads to τ ∈[t]

given the candidate arm set X t = {x i,t } i∈[a] and the model parameters [Θ

Average running time results (seconds) on real data sets. The running time in the brackets "[]" is the actual time consumption for recommendation w/o the time consumption for training.

From Table 1, we see that compared with the most closely related work, Meta-Ban, our proposed GNB is generally faster, since GNB is not required to re-train the model for each candidate arm.

Without loss of generality, we let $c > c^*$. Then, we could have

Here, (i) is because $c, c^* \geq \frac{1}{2}$, (ii) is because of c 2 + , and (iii) is due to $\|h\|_2, \|h^*\|_2 \leq 1$.

Then, we proceed to bound $\|h - h^*\|_2$. Recall the definition from Eq. 4. Extending the above conclusion across different rounds $\tau \in [t]$, we will have

G.3 BOUNDING THE EXPLORATION GRAPH ESTIMATION ERROR

Again, recall the definition of the confidence bound function $CB_t(x)$, which is

$$CB_t(x) = \mathbb{E}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f(x), G^{(2)}; [\Theta_{gnn}^{(2)}]_{t-1}\big) - \big(r - f_{gnn}^{(1)}(x, G^{(1)}; [\Theta_{gnn}^{(1)}]_{t-1})\big) \Big| \,\Big|\, u_t, \mathcal{X}_t \Big]$$

2) gnn ] τ } τ ∈[t]. Then, with probability at least 1 -δ, given an arm x ∈ R d , we have

Afterwards, applying the conclusion from Lemma F.4 would lead to the result that

which represents the estimation error induced by the difference of input gradients. And we first bound the gradient difference with the following lemma Lemma G.6. For the constants ρ ∈ (0, O( 1 L )) and ξ 1 ∈ (0, 1), given past records P t-1 , we suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Theorem 4.2, and randomly draw the parameter gnn ] τ } τ ∈[t] . Then, with probability at least 1 -δ, given an arm x ∈ R d , we have


C USER NETWORKS ARCHITECTURE

Here, we can choose different architectures for $f_u^{(1)}(\cdot)$, $f_u^{(2)}(\cdot)$ to deal with various application scenarios (e.g., Convolutional Neural Networks [CNNs] for recommendation tasks on visual content). In this paper, for the theoretical analysis and experiments, we apply separate $L$-layer fully-connected (FC) networks for the user exploitation models and exploration models, as given in Eq. 7, with $\Theta_u = [\mathrm{vec}(\Theta_1)^\intercal, \ldots, \mathrm{vec}(\Theta_L)^\intercal]^\intercal$ being the trainable parameters, and $\sigma$ being the ReLU activation. Here, since $f_u^{(1)}(\cdot)$ and $f_u^{(2)}(\cdot)$ are both $L$-layer networks as shown in Eq. 7, the input $\chi$ can be either the arm $x$ or the network gradient $\nabla_{\Theta_u^{(1)}} f_u^{(1)}(\cdot)$.

Initialization. The weight matrix of the input layer is different for the two user networks, where $\Theta_1^{(1)} \in \mathbb{R}^{m \times d}$ and $\Theta_1^{(2)} \in \mathbb{R}^{m \times p}$. The rest of the layers are the same for the two kinds of user networks, which are

D PSEUDO-CODE FOR ESTIMATING USER GRAPHS AND TRAINING THE GNB FRAMEWORK

ALGORITHM 2: Estimating Arm-Specific User Graphs
Input: Model parameters $\Theta_{t-1}$. Functions for edge weight estimation
Output: Updated user graphs $\{G_{i,t}^{(1)}\}_{i \in [a]}$.
for each user $u \in \mathcal{U}$ do
    for each arm $x_{i,t} \in \mathcal{X}_t$, $i \in [a]$ do
        Compute $r_{u,i} = f_u^{(1)}(x_{i,t}; [\Theta_u^{(1)}]_{t-1})$, and $b_{u,i} = f_u^{(2)}\big(\nabla f_u^{(1)}(x_{i,t}; [\Theta_u^{(1)}]_{t-1}); [\Theta_u^{(2)}]_{t-1}\big)$.
    end
end
for each arm $x_{i,t} \in \mathcal{X}_t$ do
    for each user pair $(u, u') \in \mathcal{U} \times \mathcal{U}$ do
        For edge weight $w_{i,t}^{(1)}(u, u') \in W_{i,t}^{(1)}$, update $w_{i,t}^{(1)}(u, u') = \Psi^{(1)}(r_{u,i}, r_{u',i})$.

For edge weight w

(2)
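The pairwise edge-weight step of Algorithm 2 can be sketched in vectorized form. This is only an illustration under stated assumptions: the mapping `psi` below (an exponential of the negative absolute difference) is a hypothetical choice that merely satisfies the 1-Lipschitz property mentioned in the analysis, and computing exploration weights from the potential-gain estimates in the same way is an assumption, since the listing above is truncated. The function name `estimate_user_graphs` is illustrative.

```python
import numpy as np

def psi(a, b):
    """Hypothetical edge-weight mapping: closer estimates give larger weights."""
    return np.exp(-np.abs(a - b))

def estimate_user_graphs(r_hat, b_hat):
    """Build one weighted user graph per arm for exploitation and exploration.

    r_hat, b_hat: arrays of shape (n_users, n_arms) with per-user reward
    estimates and potential-gain estimates for each candidate arm.
    Returns two arrays of shape (n_arms, n_users, n_users).
    """
    r = r_hat.T[:, :, None]            # (n_arms, n_users, 1)
    b = b_hat.T[:, :, None]
    W1 = psi(r, np.swapaxes(r, 1, 2))  # broadcast over all user pairs
    W2 = psi(b, np.swapaxes(b, 1, 2))
    return W1, W2

rng = np.random.default_rng(0)
r_hat = rng.uniform(size=(4, 3))       # 4 users, 3 arms (toy values)
b_hat = rng.uniform(size=(4, 3))
W1, W2 = estimate_user_graphs(r_hat, b_hat)
```

With this choice each graph is symmetric and has self-loop weight 1. In the actual framework the per-user estimates would come from the trained user networks rather than from random numbers.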

F GENERALIZATION OF USER NETWORKS AFTER GD

In this section, we present the generalization results of the user networks $f_u^{(1)}(\cdot; \Theta_u^{(1)})$ and $f_u^{(2)}(\cdot; \Theta_u^{(2)})$, $u \in \mathcal{U}$. Up to a certain time step $t$ and for a given user $u \in \mathcal{U}$, we have all its past arm-reward pairs $\mathcal{P}_{u,t-1} = \{(x_\tau, r_\tau)\}_{\tau \in T_{u,t}}$. Before presenting the bounds, with two vectors $x, \tilde{x}$ as the input such that $\|x\|_2 \leq 1$, $\|\tilde{x}\|_2 = 1$, inspired by (Allen-Zhu et al., 2019), we first define the following operator

$$\phi(x, \tilde{x}) = \Big( \frac{x}{\sqrt{2}}, \frac{\tilde{x}}{2}, c \Big) \quad (9)$$

as the concatenation of the two vectors $\frac{x}{\sqrt{2}}$, $\frac{\tilde{x}}{2}$ and one constant $c$, where $c = \sqrt{\frac{3}{4} - \big(\frac{\|x\|_2}{\sqrt{2}}\big)^2} \geq \frac{1}{2}$. This operator makes the transformed vector satisfy $\|\phi(x, \tilde{x})\|_2 = 1$, since $\frac{\|x\|_2^2}{2} + \frac{1}{4} + c^2 = 1$. The idea of this operator is to make the gradients $\nabla_{\Theta_u^{(1)}} f_u^{(1)}(\cdot)$ of the user exploitation model, which are the input of the user exploration model $f_u^{(2)}(\cdot)$, comply with the normalization requirement and the separateness assumption (Assumption 4.1). For the sake of analysis, we will adopt this operation in the following proof. Note that this operator is just one possible solution, and our results could be easily generalized to other forms of input gradients under the unit-length and separateness assumption. Similar ideas are also applied in previous works (Ban et al., 2022b).
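The unit-norm property of the operator in Eq. 9 is easy to verify numerically. The following is a minimal sketch, assuming NumPy and toy input vectors chosen purely for illustration; the function name `phi` mirrors the paper's symbol but the implementation is not taken from the authors' code.

```python
import numpy as np

def phi(x, x_tilde):
    """Concatenate x / sqrt(2), x_tilde / 2 and the constant c from Eq. 9.

    Assumes ||x||_2 <= 1 and ||x_tilde||_2 == 1.
    """
    x = np.asarray(x, dtype=float)
    x_tilde = np.asarray(x_tilde, dtype=float)
    sq_norm = float(x @ x)
    if sq_norm > 1.0 + 1e-9:
        raise ValueError("x must have norm at most 1")
    # c = sqrt(3/4 - ||x||^2 / 2), which is at least 1/2 when ||x|| <= 1.
    c = np.sqrt(0.75 - 0.5 * min(sq_norm, 1.0))
    return np.concatenate([x / np.sqrt(2.0), x_tilde / 2.0, [c]])

x = np.array([0.3, -0.4, 0.5])   # toy vector with norm below 1
x_tilde = np.array([0.6, 0.8])   # toy vector with norm exactly 1
z = phi(x, x_tilde)
z_norm = np.linalg.norm(z)
```

The squared norm splits into three parts: half the squared norm of the first input, one quarter from the unit-norm second input, and the squared constant, which together sum to 1.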

F.1 USER EXPLOITATION MODEL

With the convergence result presented in Lemma F.6, we can bound the output of the user exploitation model $f_u^{(1)}(\cdot)$ after GD with the following lemma.

Lemma F.1. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t-1}$ up to time step $t$, we suppose $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2. Then, with probability at least $1 - \delta$, given an arm-reward pair $(x, r)$, we have

Proof. For brevity, we use Θ(1)u ] t . The LHS of the inequality could be written as u -[ Θu -[ Θu ] 0 ⟩|. Here, we can bound the first term on the RHS with Lemma F.7. Applying Lemma F.8 on the second term, and recalling ∥ Θ. Then, with $T_{u,t} = \frac{t}{n}$, applying the conclusion of Lemma F.6 would lead to

Then, under the assumption of arm separateness (Assumption 4.1), we proceed to bound the reward estimation error of the user exploitation network $f_u^{(1)}(\cdot; [\Theta_u^{(1)}]_t)$ in the current round $t$.

Lemma F.2. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t-1}$, we suppose $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2, and randomly draw the parameter $[\Theta_u^{(1)}]_t \sim \{[\widehat{\Theta}_u^{(1)}]_\tau\}_{\tau \in T_{u,t}}$. Consider that the past records $\mathcal{P}_{u,t}$ up to round $t$ are generated by a fixed policy when witnessing the candidate arms $\{\mathcal{X}_\tau\}_{\tau \in T_{u,t}}$. Then, with probability at least $1 - \delta$ given an arm-reward pair $(x_t, r_t)$, we have

where $r_\tau$ is the corresponding reward generated by the reward mapping function given an arm $x_\tau$.

Proof. We prove this lemma following a similar approach as in Lemma C.1 from (Ban et al., 2022b) and Lemma D.1 from (Ban et al., 2022a). First, for the LHS and with $\tau \in T_{u,t} \cup \{t\}$, we have u ] τ )| + |r t | ≤ $1 + \gamma_1$ based on the conclusion from Lemma F.1. Then, for user $u$, we define the following martingale difference sequence with regard to the previous records $\mathcal{P}_{u,\tau}$ up to round $\tau$ as u ] τ ) -r τ |.
Since the records in set $\mathcal{P}_{u,\tau}$ share the same reward mapping function, we have the expectation

where $\mathcal{F}_{u,\tau}$ denotes the filtration given the past records $\mathcal{P}_{u,\tau}$. And we have the mean value of $V_\tau^{(1)}$ across different time steps as

with the expectation of zero. Then, we proceed to bound the expected estimation error of the exploitation model with the estimation error from existing samples, following Proposition 1 from (Cesa-Bianchi et al., 2004). Applying the Azuma-Hoeffding inequality, with a constant $\delta \in (0, 1)$, it leads to

As we have the parameter $[\Theta_u^{(1)}]_t \sim \{[\widehat{\Theta}_u^{(1)}]_\tau\}_{\tau \in T_{u,t}}$, with probability at least $1 - \delta$, the expected loss on $[\Theta_u^{(1)}]_t$ could be bounded as

where for the second term on the RHS, we have

where the first inequality is the application of Lemma F.10, and the last inequality is due to Lemma F.6. Summing up all the components and applying the union bound for all $a$ arms, all $n$ users and $t$ time steps would complete the proof.

Then, we also have the following corollary for the rest of the candidate arms $x_{i,t} \in \mathcal{X}_t \setminus \{x_t\}$.

Corollary F.3. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t-1}$, we suppose $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2, and randomly draw the parameter $[\Theta_u^{(1)}]_t \sim \{[\widehat{\Theta}_u^{(1)}]_\tau\}_{\tau \in T_{u,t}}$. For an arm $x_{i,t} \in \mathcal{X}_t$, consider that its union set with the collection of arms, $\mathcal{P}_{u,t} \cup \{x_{i,t}, r_{i,t}\}$, is generated by a fixed policy when witnessing the candidate arms $\{\mathcal{X}_\tau\}_{\tau \in T_{u,t}}$, with $\mathcal{P}_{u,t} = \{x_{i_\tau,\tau}, r_{i_\tau,\tau}\}_\tau$ being the collection of arms chosen by this policy. Then, with probability at least $1 - \delta$, we have

where $r_{i,\tau}$ is the corresponding reward generated by the mapping function given an arm $x_{i,\tau}$, and

Proof. The proof of this corollary follows an analogous approach as in Lemma F.2. First, suppose a shadow model f(1), which is trained on the alternative trajectory $\mathcal{P}_{u,t}$.
Analogous to the proof of Lemma F.2, for user $u$, we can define the following martingale difference sequence with regard to the previous records $\mathcal{P}_{u,\tau}$ up to round $\tau \in [t]$ as u ] τ ) -r iτ ,τ |. Since the records in set $\mathcal{P}_{u,\tau}$ share the same reward mapping function, we have the expectation

where $\mathcal{F}_{u,\tau}$ denotes the filtration given the past records $\mathcal{P}_{u,\tau}$. The mean value of

with the expectation of zero. Afterwards, applying the Azuma-Hoeffding inequality, with a constant $\delta \in (0, 1)$, it leads to

To bound the output difference between the shadow model f(1)u ] t ) and the model we trained based on received records fu ] t ), we apply the conclusion from Lemma G.14, which leads to that given the same input $x$, we have

Finally, assembling all the components together will finish the proof.

F.2 USER EXPLORATION MODEL

To ensure the unit length of the input of $f_u^{(2)}(\cdot)$, we normalize the gradient with Lemma F.8, Lemma F.9 and a normalization constant $c'_g > 0$. Then, to satisfy the separateness assumption (Assumption 4.1), we adopt the operation mentioned in Eq. 9 to derive the transformation $\phi(\cdot)$, which makes sure the transformed input gradient has norm 1 and separateness of at least $\frac{\rho}{\sqrt{2}}$. Analogous to the user exploitation model, regarding the convergence result for FC networks in Lemma F.6, we proceed to present the generalization result of the user exploration model $f_u^{(2)}(\cdot)$ after GD with the following lemma.

Lemma F.4. For the constants $c'_g > 0$, $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t-1}$, we suppose $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2, and randomly draw the u ] τ } τ ∈Tu,t . Consider that the past records $\mathcal{P}_{u,t}$ up to round $t$ are generated by a fixed policy when witnessing the candidate arms $\{\mathcal{X}_\tau\}_{\tau \in T_{u,t}}$. Then, with probability at least $1 - \delta$ given an arm-reward pair $(x_t, r_t)$, we have

Proof. The proof of this lemma is inspired by Lemma C.1 from (Ban et al., 2022b). Following the same procedure as in the proof of Lemma F.2, we bound

≤ $1 + 2\gamma_1$ by the triangle inequality and applying the generalization result of FC networks (Lemma F.1) on f(1)

For brevity, we use ∇f for the following proof. Define the difference sequence as

Since the reward mapping is fixed given the specific user $u$, which means that the past rewards and the received arm-reward pairs $(x_\tau, r_\tau)$ are generated by the same reward mapping function, we have the expectation

where $\mathcal{F}_{u,\tau}$ denotes the filtration given the past records $\mathcal{P}_{u,\tau}$, up to round $\tau \in [t]$. This also gives the fact that $V_\tau^{(2)}$ is a martingale difference sequence.
Then, after applying the martingale difference sequence over $T_{u,t}$, we have

Analogous to the proof of Lemma F.2, by applying the Azuma-Hoeffding inequality, it leads to

Since the expectation of $V_\tau^{(2)}$ is zero, with probability at least $1 - \delta$ and an existing set of parameters Θ(2)

Here, the upper bound (i) is derived by applying the conclusions of Lemma F.6 and Lemma F.10, and the inequality (ii) is derived by adopting Lemma F.6 while defining the empirical loss to be. Finally, applying the union bound would give the aforementioned results.

Corollary F.5. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given user $u \in \mathcal{U}$ and its past records $\mathcal{P}_{u,t-1}$, we suppose $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2, and randomly draw the parameter $[\Theta_u^{(1)}]_t \sim \{[\widehat{\Theta}_u^{(1)}]_\tau\}_{\tau \in T_{u,t}}$. For an arm $x_{i,t} \in \mathcal{X}_t$, consider that its union set with the collection of arms, $\mathcal{P}_{u,t} \cup \{x_{i,t}, r_{i,t}\}$, is generated by a fixed policy when witnessing the candidate arms $\{\mathcal{X}_\tau\}_{\tau \in T_{u,t}}$, with $\mathcal{P}_{u,t} = \{x_{i_\tau,\tau}, r_{i_\tau,\tau}\}_\tau$ being the collection of arms chosen by this policy. Then, with probability at least $1 - \delta$, we have

where $r_{i_\tau,\tau}$ is the corresponding reward generated by the mapping function given an arm $x_{i_\tau,\tau}$, and

This corollary is the direct application of Lemma F.4, and the proof is analogous to that of Corollary F.3.
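The concentration step used in the proofs of Lemma F.2 and Lemma F.4 can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: the differences are i.i.d. Rademacher variables scaled by a bound `c` (a special case of a bounded martingale difference sequence), and the values of `c`, `t`, `delta` and `trials` are arbitrary illustration choices. The helper name `azuma_threshold` is hypothetical.

```python
import numpy as np

def azuma_threshold(c, t, delta):
    """Deviation level for the mean of t martingale differences bounded by c.

    One-sided Azuma-Hoeffding: P(mean >= c * sqrt(2 log(1/delta) / t)) <= delta.
    """
    return c * np.sqrt(2.0 * np.log(1.0 / delta) / t)

rng = np.random.default_rng(0)
c, t, delta, trials = 1.0, 200, 0.05, 5000

# Each row is one sequence V_1, ..., V_t with zero mean and |V_tau| <= c.
V = c * rng.choice([-1.0, 1.0], size=(trials, t))
means = V.mean(axis=1)

# Fraction of sequences whose mean exceeds the Azuma-Hoeffding level.
violation_rate = float(np.mean(means >= azuma_threshold(c, t, delta)))
```

In the lemmas the role of `c` is played by the uniform bound on the difference sequence (for example $1 + \gamma_1$), so the deviation term scales with that bound and shrinks at rate $1/\sqrt{t}$.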

F.3 LEMMAS FOR OVER-PARAMETERIZED USER NETWORKS

Applying P u,t-1 as the training data, we have the following convergence result for the user exploitation network f(1) u (•; Θ (1) u ) after GD. Lemma F.6 (Theorem 1 from (Allen-Zhu et al., 2019)). For any 0 < ξ 1 ≤ 1, 0 < ρ ≤ O( 1 L ). Given user u ∈ U and its past records P u,t-1 , suppose m, η 1 , J 1 satisfy the conditions in Theorem 4.2, then with probability at least 1 -δ, we could have 1. L(Θ (1) u ) ≤ ξ 1 after J 1 iterations of GD.

2. For any

In particular, Lemma F.6 above provides the convergence guarantee for $f_u^{(1)}(\cdot; \Theta_u^{(1)})$ after certain rounds of GD training on the past records $\mathcal{P}_{u,t-1}$.

Lemma F.7 (Lemma 4.1 in (Cao & Gu, 2019)). Assume a constant $\omega$ such that

Lemma F.8. Assume $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2 and $[\Theta_u^{(1)}]_0$ is randomly initialized. Then, with probability at least $1 - \delta$ and given an arm with $\|x\|_2 = 1$, we have

Proof. The conclusion (1) is a direct application of Lemma 7.1 in (Allen-Zhu et al., 2019). For conclusion (2), applying Lemma 7.3 in (Allen-Zhu et al., 2019), for each layer $\Theta_l \in \{\Theta_1, \ldots, \Theta_L\}$, we have

Then, we could have the conclusion that

Lemma F.9 (Theorem 5 in (Allen-Zhu et al., 2019)). Assume $m, \eta_1, J_1$ satisfy the conditions in Theorem 4.2 and $[\Theta_u^{(1)}]_0$ is randomly initialized. Then, with probability at least $1 - \delta$, and for all parameters $\Theta_u^{(1)}$ such that. Assume $m, \eta_1$ satisfy the condition in Theorem 4.2. With probability at least $1 - \delta$, we have

Proof. With the notation from Lemma 4.3 in (Cao & Gu, 2019), set R =. Then, considering the loss function to be $L(\Theta_u^{(1)}) := \sum_{\tau \in T_{u,t}} |f_u^{(1)}(x_\tau; \Theta_u^{(1)}) - r_\tau|$ would complete the proof.

G.2 BOUNDING THE EXPLOITATION GRAPH ESTIMATION ERROR

Then, we proceed to bound the error induced by the estimation of the user exploitation graph, i.e., the error term $I_2$. Recall that the confidence bound function $CB_t(x)$ for the given arm $x \in \mathcal{X}_t$ is

$$CB_t(x) = \mathbb{E}\Big[ \Big| f_{gnn}^{(2)}\big(\nabla f(x), G^{(2)}; [\Theta_{gnn}^{(2)}]_{t-1}\big) - \big(r - f_{gnn}^{(1)}(x, G^{(1)}; [\Theta_{gnn}^{(1)}]_{t-1})\big) \Big| \,\Big|\, u_t, \mathcal{X}_t \Big]$$

given an arbitrary arm $x \in \mathcal{X}_t$. For arm $x$, we use the following lemma to bound the error caused by the difference between the estimated exploitation graph $G^{(1)}$ and the true exploitation graph $G^{(1),*}$ associated with arm $x$.

Denoting the adjacency matrix of the estimated graph $G^{(1)}$ as $A^{(1)}$, and the adjacency matrix of the true user exploitation graph $G^{(1),*}$ as $A^{(1),*}$, we have the normalized adjacency matrices $S^{(1)} = A^{(1)}/n$ and $S^{(1),*} = A^{(1),*}/n$. For the $i$-th user $u_i \in \mathcal{U}$, we could deem the $i$-th row of the matrix multiplication $S \cdot X$, represented by $h_{0,i} = [S \cdot X]_{i:}$, as the aggregated input of the network for the user-arm pair $(x, u_i)$. Note that in this way, the rest of the network could be regarded as an $(L+1)$-layer FC network, where the weight matrix of the first layer is $\Theta_{agg}^{(1)}$. Then, to make sure each aggregated input has the norm of 1, we apply the additional transformation from Eq. 9 as

$$\tilde{h}_{0,i} = \phi(h_{0,i}, x) = \Big( \frac{h_{0,i}}{\sqrt{2}}, \frac{x}{2}, c_{0,i} \Big), \quad \text{where } c_{0,i} = \sqrt{\frac{3}{4} - \frac{1}{2}\|h_{0,i}\|_2^2}.$$

This transformation ensures $\|\tilde{h}_{0,i}\|_2 = 1$ and $c_{0,i} \geq \frac{1}{2}$. Since this transformation does not alter the original aggregated representation $h_{0,i}$, it will not impair the original information w.r.t. the user-arm pair $(x, u_i)$. Meanwhile, note that this transformation also ensures that the separateness of the transformed contexts is at least $\frac{\rho}{2}$.

Lemma G.4. For the constants $\rho \in (0, O(\frac{1}{L}))$ and $\xi_1 \in (0, 1)$, given past records $\mathcal{P}_{t-1}$, we suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2, and randomly draw the parameter. Then, with probability at least $1 - \delta$, given an arm $x \in \mathbb{R}^d$, we have

Proof. By the conclusion of Lemma F.2, at time step $t$, the reward estimation error of the user exploitation model could be bounded as

with probability at least $1 - \delta$.
Given two users $u_i, u_j \in \mathcal{U}$ and an arbitrary arm $x \in \mathcal{X}_t$, we denote their individual rewards as $r_i$, $r_j$ respectively. We omit the expectation notation below for simplicity. Then, we could bound the absolute difference between the reward estimations as

Based on the definition of the mapping function $\Psi^{(1)}$, it would naturally be Lipschitz continuous with the coefficient of 1, which is

G.5 LEMMAS FOR OVER-PARAMETERIZED NETWORKS

Applying P t-1 as the training data, we have the following convergence result for the exploitation GNN network f(1) u (•; Θ (1) gnn ) after GD. Lemma G.8 (Theorem 1 from (Allen-Zhu et al., 2019) ). For any 0 < ξ 2 ≤ 1, 0 < ρ ≤ O( 1 L ). Given past records P t-1 , suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Theorem 4.2, then with probability at least 1 -δ, we could have 1. L(Θ (1) gnn ) ≤ ξ 2 after J 2 iterations of GD.

2. For any

In particular, Lemma G.8 above provides the convergence guarantee for f(1) u (•; Θ (1) gnn ) after certain rounds of GD training on the past records $\mathcal{P}_{t-1}$.

Lemma G.9 (Lemma 4.1 in (Cao & Gu, 2019)). Assume a constant $\omega$ such that $O(m^{-3/2} L^{-3/2} [\log(TnL^2/\delta)]^{3/2}) \leq \omega \leq O(L^{-6} [\log m]^{-3/2})$ and $n$ training samples. With randomly initialized $[\Theta_{gnn}^{(1)}]_0$, for parameters

with probability at least $1 - \delta$.

Lemma G.10. Assume $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2 and $[\Theta_{gnn}^{(1)}]_0$ is randomly initialized. Then, with probability at least $1 - \delta$ and given an arm with $\|x\|_2 = 1$, we have

Proof. The conclusion (1) is a direct application of Lemma 7.1 in (Allen-Zhu et al., 2019). For conclusion (2), for each weight matrix $\Theta_l \in \{\Theta_0^{(1)}, \Theta_1, \ldots, \Theta_L\}$ where $\Theta_0^{(1)} = \Theta_{agg}^{(1)}$, we have

by applying Lemma 7.3 in (Allen-Zhu et al., 2019), where $h$ denotes the aggregated hidden representation for each user-arm pair, namely the corresponding row in $H_{agg}$. Therefore, by combining the bounds for all the weight matrices, we could have

which finishes the proof.

Lemma G.11 (Theorem 5 in (Allen-Zhu et al., 2019)). Assume the training parameters $m, \eta_2, J_2$ satisfy the conditions in Theorem 4.2 and $[\Theta_{gnn}^{(1)}]_0$ is randomly initialized. Then, with probability at least $1 - \delta$, and for all parameters $\Theta_{gnn}^{(1)}$ such that ∥Θ

Lemma G.12. Assume $m, \eta_2$ satisfy the condition in Theorem 4.2. With probability at least $1 - \delta$, we have

Proof. With the notation from Lemma 4.3 in (Cao & Gu, 2019), set $R = \frac{t^3 \log(m)}{\delta}$, $\nu = R^2$, and $\epsilon = \frac{LR}{\sqrt{2\nu t}}$. Then, considering the loss function to be $L(\Theta_{gnn}^{(1)}) := \sum_{\tau \in [t]} |f(x_\tau; \Theta_{gnn}^{(1)}) - r_\tau|$ would complete the proof.

Lemma G.13. Consider an $L$-layer fully-connected network $f(\cdot; \Theta_t)$ initialized w.r.t. Subsection 3.2.1. For any $0 < \xi_2 \leq 1$, $0 < \rho \leq O(\frac{1}{L})$.
Given the training data set with $t$ samples satisfying the unit-length and the $\rho$-separateness assumption, suppose the training parameters $m, \eta_2, J_2$ satisfy the conditions in Theorem 4.2. Then, with probability at least $1 - \delta$, we have

when given two new samples $x, x'$.

Proof. Denoting $D_l$ to be the diagonal sign matrix of the $l$-th layer such that

Based on Lemma 7.3 from (Allen-Zhu et al., 2019) and Lemma C.4 from (Ban et al., 2022b), we have

still holds, which proves this statement.

Then, for the bound on the gradients, we have

Firstly, we have

based on Lemma 7.3 from (Allen-Zhu et al., 2019), and this leads to $\|\nabla_{\Theta_0} f(x; \Theta_0)\|_2 \leq O(L)$. Analogously, we also derive

Then, according to Theorem 5 from (Allen-Zhu et al., 2019) and with $\|\Theta_0 - \Theta_t\|_2 \leq \omega$, we could have $\|\nabla_{\Theta_0} f(x; \Theta_0) - \nabla_{\Theta_t} f(x; \Theta_t)\|_2 \leq O(\omega^{1/3} L^2 \sqrt{\log(m)}) \cdot \|\nabla_{\Theta_0} f(x; \Theta_0)\|_2$. Substituting the $\omega$ value with the conclusion from Lemma G.8, we could have

$= O\big(\frac{t L^4 \log^{5/6}(m)}{\rho^{1/3} m^{1/6}}\big)$.

Finally, assembling all parts together will lead to the conclusion.

Lemma G.14. Consider an $L$-layer fully-connected network $f(\cdot; \Theta_t)$ initialized w.r.t. Section 3.2.1. For any $0 < \xi_2 \leq 1$, $0 < \rho \leq O(\frac{1}{L})$. Let there be two sets of training samples $\mathcal{P}_t$, $\mathcal{P}'_t$ with the unit-length and the $\rho$-separateness assumption, and let $\Theta_t$ be the trained parameter on $\mathcal{P}_t$ while $\Theta'_t$ is the trained parameter on $\mathcal{P}'_t$. Suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Theorem 4.2. Then, with probability at least $1 - \delta$, we have

when given a new sample $x \in \mathbb{R}^d$.

Proof. First, based on the conclusion from Theorem 1 from (Allen-Zhu et al., 2019) and regarding the $t$ samples, the trained parameters satisfy $\|\Theta_t - \Theta_0\|_2, \|\Theta'_t - \Theta_0\|_2 \leq O\big(\frac{t^3}{\rho \sqrt{m}} \log(m)\big) = \omega$, where $\Theta_0$ is the randomly initialized parameter. Then, we could have

w.r.t. the conclusion from Theorem 1 and Theorem 5 of (Allen-Zhu et al., 2019).
Then, regarding Lemma 4.1 from (Cao & Gu, 2019), we would have

Therefore, our target could be reformulated as

Substituting $\omega$ with its value would complete the proof.

H COMPUTATIONAL RESOURCES

All the experiments are conducted on a Windows machine with an Intel Core i7 CPU, 64GB RAM, and two RTX 5000 GPUs.

