NEURAL COLLABORATIVE FILTERING BANDITS VIA META LEARNING

Abstract

Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in personalized recommendation. Moreover, collaborative effects among users carry significant potential to improve recommendation. In this paper, we introduce and study the problem of 'Neural Collaborative Filtering Bandits', where rewards can be non-linear functions and groups are formed dynamically for specific contents. To solve this problem, we propose a meta-learning based bandit algorithm, Meta-Ban (meta-bandits), where a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with an informative UCB-based exploration strategy. Furthermore, we prove that Meta-Ban achieves a regret bound of O(√(nT log T)), which is sharper than those of state-of-the-art related works. In the end, we conduct extensive experiments showing that Meta-Ban outperforms six strong baselines.

1. INTRODUCTION

The contextual multi-armed bandit has been extensively studied in machine learning to resolve the exploitation-exploration dilemma in sequential decision making, with wide applications in personalized recommendation (Li et al., 2010), online advertising (Wu et al., 2016), etc. Recommender systems play an indispensable role in many online businesses, such as e-commerce platforms and online streaming services. It is well known that user collaborative effects are strongly associated with user preference. Thus, discovering and leveraging collaborative information in recommender systems has been studied for decades. In relatively static environments, e.g., a movie recommendation platform where catalogs are known and accumulated ratings for items are provided, classic collaborative filtering methods can be easily deployed (e.g., matrix/tensor factorization (Su and Khoshgoftaar, 2009)). However, such methods can hardly adapt to more dynamic settings, such as news or short-video recommendation, due to: (1) the lack of cumulative interactions for new users or items; (2) the difficulty of balancing the exploitation of current user-item preference knowledge with the exploration of new potential matches (e.g., presenting new items to users). To address this problem, a line of works on clustering of bandits (collaborative filtering bandits) (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021) has been proposed to incorporate collaborative effects among users, which are largely neglected by conventional bandit algorithms (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Valko et al., 2013; Ban and He, 2020). These works use graph-based methods to adaptively cluster users and explicitly or implicitly utilize collaborative effects on the user side while selecting an arm.
However, this line of works has a significant limitation: they all build on the linear bandit framework (Abbasi-Yadkori et al., 2011), and user groups are represented by simple linear combinations of individual user parameters. The linear reward assumption and the linear representation of groups may not hold in real-world applications (Valko et al., 2013). To learn non-linear reward functions, neural bandits (Collier and Llorens, 2018; Zhou et al., 2020; Zhang et al., 2021; Kassraie and Krause, 2022) have attracted much attention, where a neural network is assigned to learn the reward function along with an exploration strategy (e.g., Upper Confidence Bound (UCB) or Thompson Sampling (TS)). However, this class of works does not incorporate any collaborative effects among users, overlooking their crucial potential for improving recommendation. In this paper, to overcome the above challenges, we introduce the problem of Neural Collaborative Filtering Bandits (NCFB), built on either linear or non-linear reward assumptions while introducing relative groups. Groups are formed by users sharing similar interests/preferences/behaviors. However, such groups are usually not static across specific contents (Li et al., 2016). For example, two users may both like "country music" but have different opinions on "rock music". "Relative groups" are introduced in NCFB to formulate groups given a specific content, which is more practical in real problems. To solve NCFB, we propose a meta-learning based bandit algorithm, Meta-Ban (Meta-Bandits), distinct from existing related works (i.e., graph-based clustering of linear bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021)). Inspired by recent advances in meta-learning (Finn et al., 2017; Yao et al., 2019), in Meta-Ban, a meta-learner is assigned to represent and rapidly adapt to dynamic groups, which allows a non-linear representation of collaborative effects.
In addition, a user-learner is assigned to each user to discover the underlying relative groups. Here, we use neural networks to formulate both the meta-learner and the user-learners, in order to learn linear or non-linear reward functions. To solve the exploitation-exploration dilemma in bandits, Meta-Ban has an informative UCB-type exploration. In the end, we provide rigorous regret analysis and empirical evaluation for Meta-Ban. To the best of our knowledge, this is the first work incorporating collaborative effects in neural bandits. The contributions of this paper can be summarized as follows: (1) Problem. We introduce the problem of Neural Collaborative Filtering Bandits (NCFB), to incorporate collaborative effects among users with either linear or non-linear reward assumptions. (2) Algorithm. We propose a meta-learning based bandit algorithm working in NCFB, Meta-Ban, where the meta-learner is introduced to represent and rapidly adapt to dynamic groups, along with a new informative UCB-type exploration that utilizes both meta-side and user-side information. Meta-Ban allows a non-linear representation of relative groups based on user-learners. (3) Theoretical analysis. Under the standard assumptions of over-parameterized neural networks, we prove that Meta-Ban achieves a regret upper bound of complexity O(√(nT log T)), where n is the number of users and T is the number of rounds. Our bound is sharper than those of existing related works. Moreover, we provide a correctness guarantee for the groups detected by Meta-Ban. (4) Empirical performance. We evaluate Meta-Ban on 10 real-world datasets and show that it significantly outperforms 6 strong baselines. Next, after introducing the problem definition in Section 2, we present the proposed Meta-Ban in Section 3, together with theoretical analysis in Section 4. In the end, we show the experiments in Section 5 and conclude the paper in Section 6. More discussion regarding related work is placed in Appendix Section A.1.

2. NEURAL COLLABORATIVE FILTERING BANDITS

In this section, we introduce the problem of Neural Collaborative Filtering Bandits, motivated by generic recommendation scenarios. Suppose there are n users, N = {1, . . . , n}, to serve on a platform. In the t-th round, the platform receives a user u_t ∈ N and prepares the corresponding k arms (items) X_t = {x_{t,1}, x_{t,2}, . . . , x_{t,k}}, in which each arm is represented by its d-dimensional feature vector x_{t,i} ∈ R^d, ∀i ∈ {1, . . . , k}. Then, as in the conventional bandit problem, the platform selects an arm x_{t,i} ∈ X_t and recommends it to the user u_t. In response to this action, u_t produces a corresponding reward (feedback) r_{t,i}. We use r_{t,i} | u_t to represent the reward produced by u_t given x_{t,i}, because different users may generate different rewards for the same arm. Group behavior (collaborative effects) exists among users and has been exploited in recommender systems. In fact, group behavior is item-varying, i.e., users who have the same preference on a certain item may have different opinions on another item (Gentile et al., 2017; Li et al., 2016). Therefore, we define a relative group as a set of users with the same opinion on a certain item.

Definition 2.1 (Relative Group). In round t, given an arm x_{t,i} ∈ X_t, a relative group N(x_{t,i}) ⊆ N with respect to x_{t,i} satisfies:
1) ∀u, u' ∈ N(x_{t,i}), E[r_{t,i} | u] = E[r_{t,i} | u'];
2) ∄ N' ⊆ N such that N' satisfies 1) and N(x_{t,i}) ⊂ N'.

Such a flexible group definition allows users to agree on certain items while disagreeing on others, which is consistent with real-world scenarios. Therefore, given an arm x_{t,i}, the user pool N can be divided into q_{t,i} non-overlapping groups: N_1(x_{t,i}), N_2(x_{t,i}), . . . , N_{q_{t,i}}(x_{t,i}), where q_{t,i} ≤ n. Note that the group information is unknown to the platform. We expect users from different groups to have distinct behavior with respect to x_{t,i}. Thus, we impose the following constraint among groups.
Definition 2.2 (γ-gap). Given two different groups N(x_{t,i}), N'(x_{t,i}), there exists a constant γ > 0 such that ∀u ∈ N(x_{t,i}), u' ∈ N'(x_{t,i}), |E[r_{t,i} | u] − E[r_{t,i} | u']| ≥ γ.

For any two groups in N, we assume that they satisfy the γ-gap constraint. Note that such an assumption is standard in the literature on online clustering of bandits to differentiate groups (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021).

Reward function. The reward r_{t,i} is assumed to be governed by an unknown function of x_{t,i} given u_t: r_{t,i} | u_t = h_{u_t}(x_{t,i}) + ζ_{t,i}, where h_{u_t} is an either linear or non-linear but unknown reward function associated with u_t, and ζ_{t,i} is a noise term with zero expectation, E[ζ_{t,i}] = 0. We assume the reward r_{t,i} ∈ [0, 1] is bounded, as in many existing works (Gentile et al., 2014; 2017; Ban and He, 2021). Note that online clustering of bandits assumes h_{u_t} is a linear function of x_{t,i} (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021).

Regret analysis. In this problem, the goal is to minimize the pseudo-regret over T rounds: R_T = Σ_{t=1}^{T} E[r*_t − r_t | u_t], where r_t is the reward received in round t and E[r*_t | u_t, X_t] = max_{x_{t,i} ∈ X_t} h_{u_t}(x_{t,i}). The problem definition above naturally formulates many recommendation scenarios. For example, for a music streaming service provider, when recommending a song to a user, the platform can exploit the knowledge of other users who have the same opinion on this song, i.e., all 'like' or 'dislike' it. Unfortunately, the potential group information is usually not available to the platform before the user's feedback. To solve this problem, we introduce, in the next section, an approach that can infer and exploit such group information to improve recommendation. Notation. Denote by [k] the sequential list {1, . . . , k}.
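To make Definitions 2.1 and 2.2 concrete, the following sketch (our own illustration, not part of the paper's algorithm) partitions a user pool into relative groups for one fixed arm given each user's expected reward, and checks the γ-gap condition. The function names and the tolerance-bucket grouping are assumptions introduced for illustration.

```python
from collections import defaultdict

def relative_groups(expected_rewards, tol=1e-9):
    """Partition users into relative groups for a single arm.

    expected_rewards: dict user -> E[r | u] for the fixed arm x.
    Users with equal expected reward (up to tol) form one group
    (Definition 2.1); maximality holds by construction.
    """
    groups = defaultdict(set)
    for u, r in expected_rewards.items():
        # bucket by tolerance so tiny float noise does not split groups
        groups[round(r / tol) * tol].add(u)
    return list(groups.values())

def satisfies_gamma_gap(expected_rewards, groups, gamma):
    """Check Definition 2.2: any two users from different groups
    differ in expected reward by at least gamma."""
    for i, g1 in enumerate(groups):
        for g2 in groups[i + 1:]:
            for u in g1:
                for v in g2:
                    if abs(expected_rewards[u] - expected_rewards[v]) < gamma:
                        return False
    return True
```

For instance, with expected rewards {1: 0.9, 2: 0.9, 3: 0.2} on one arm, users 1 and 2 form one relative group and user 3 another, and the partition satisfies a γ-gap of up to 0.7.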
Let x_t be the arm selected in round t and r_t the reward received in round t. We use ∥x_t∥_2 and ∥x_t∥_1 to represent the Euclidean norm and the Taxicab norm. For each user u ∈ N, let µ^u_t be the number of rounds that user u has been served up to round t, i.e., µ^u_t = Σ_{τ=1}^{t} 1{u_τ = u}, and T^u_t be all of u's historical data up to round t, i.e., T^u_t = {(x_τ, r_τ) : u_τ = u ∧ τ ∈ [t]}. m is the width and L the depth of the neural network in the proposed approach. Given a group N, all its data up to round t is denoted by {T^u_t}_{u∈N} = {T^u_t | u ∈ N}. We use standard O, Θ, and Ω notation to hide constants.

3. PROPOSED ALGORITHM

In this section, we propose a meta-learning based bandit algorithm, Meta-Ban, to tackle the following challenges in the NCFB problem: (1) Challenge 1 (C1): Given an arm, how to infer a user's relative group, and whether the returned group is the true relative group? (2) Challenge 2 (C2): Given a relative group, how to represent the group's behavior in a parametric way? (3) Challenge 3 (C3): How to generate a model that efficiently adapts to the rapidly-changing relative groups? (4) Challenge 4 (C4): How to balance exploitation and exploration in bandits with relative groups?

Algorithm 1: Meta-Ban
Input: T (number of rounds), ν, γ (group exploration parameters), α (exploration parameter), λ (regularization parameter), δ (confidence level), J_1 (number of iterations for user), J_2 (number of iterations for meta), η_1 (user step size), η_2 (meta step size), L (depth of neural network).
1: Initialize Θ_0; θ^u_0 = Θ_0, µ^u_0 = 0, ∀u ∈ N
2: Observe one datum for each u ∈ N
3: for t = 1, 2, . . . , T do
4:   Receive a user u_t ∈ N and observe k arms X_t = {x_{t,1}, . . . , x_{t,k}}
5:   for i ∈ [k] do
6:     Determine u_t's relative group: N̂_{u_t}(x_{t,i}) = {u ∈ N : |f(x_{t,i}; θ^u_{t−1}) − f(x_{t,i}; θ^{u_t}_{t−1})| ≤ ((ν−1)/ν) γ}
7:     Θ_{t,i} = GradientDescent_Meta(N̂_{u_t}(x_{t,i}), Θ_{t−1})
8:     U_{t,i} = f(x_{t,i}; Θ_{t,i}) + α ( ∥g(x_{t,i}; Θ_{t,i}) − g(x_{t,i}; θ^{u_t}_0)∥_2 / √t + (L+1) / √(2µ^{u_t}_t) + √( log(t/δ) / µ^{u_t}_t ) )
9:   end for
10:  i' = arg max_{i ∈ [k]} U_{t,i}
11:  Play x_{t,i'} and observe reward r_{t,i'}
12:  x_t = x_{t,i'}; r_t = r_{t,i'}; Θ_t = Θ_{t,i'}
13:  µ^{u_t}_t = µ^{u_t}_{t−1} + 1
14:  θ^{u_t}_t = GradientDescent_User(u_t, Θ_t)
15:  for u ∈ N and u ≠ u_t do
16:    θ^u_t = θ^u_{t−1}; µ^u_t = µ^u_{t−1}

Meta-Ban has one meta-learner Θ to represent the group behavior and n user-learners {θ^u}_{u∈N}, one per user, all sharing the same neural network f. Given an arm x_{t,i}, we use g(x_{t,i}; θ) = ∇_θ f(x_{t,i}; θ) to denote the gradient of f for brevity. The workflow of Meta-Ban is divided into three parts as follows. Group inference (to C1).
As defined in Section 2, each user u ∈ N is governed by an unknown function h_u. It is natural to use a universal approximator (Hornik et al., 1989), a neural network f (defined in Section 4), to learn h_u. In round t ∈ [T], let u_t be the user to serve. Given u_t's past data up to round t−1, T^{u_t}_{t−1}, we can train the parameters θ^{u_t} by minimizing the following loss: L(T^{u_t}_{t−1}; θ^{u_t}) = (1/2) Σ_{(x,r) ∈ T^{u_t}_{t−1}} (f(x; θ^{u_t}) − r)^2. Let θ^{u_t}_{t−1} represent θ^{u_t} trained with T^{u_t}_{t−1} in round t−1. The training of θ^{u_t} can be conducted by (stochastic) gradient descent, e.g., as described in Algorithm 3. Therefore, for each u ∈ N, we can obtain the trained parameters θ^u_{t−1}. Then, given u_t and an arm x_{t,i}, we return u_t's estimated group with respect to x_{t,i} as N̂_{u_t}(x_{t,i}) = {u ∈ N : |f(x_{t,i}; θ^u_{t−1}) − f(x_{t,i}; θ^{u_t}_{t−1})| ≤ ((ν−1)/ν) γ}, where γ ∈ (0, 1) represents the assumed γ-gap and ν > 1 is a tuning parameter that trades off between the exploration of group members and the cost of playing rounds. Meta learning (to C2 and C3). In this paper, we propose to use one meta-learner Θ to represent and adapt to the behavior of dynamic groups. In meta-learning, the meta-learner is trained on a number of different tasks and can quickly adapt to new tasks with a small amount of new data (Finn et al., 2017). Here, we consider each user u ∈ N as a task and its collected data T^u_t as the task distribution. Therefore, Meta-Ban has two phases: user adaptation and meta adaptation. User adaptation. In the t-th round, given u_t, after receiving the reward r_t, we have the available data T^{u_t}_t. Then, the user parameter θ^{u_t} is updated in round t based on the meta-learner Θ, denoted by θ^{u_t}_t, as described in Algorithm 3. Meta adaptation. In the t-th round, given a group N̂_{u_t}(x_{t,i}), we have the available collected data {T^u_{t−1}}_{u ∈ N̂_{u_t}(x_{t,i})}. The goal of the meta-learner is to fast adapt to these users (tasks).
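The group-inference step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a linear model stands in for the deep ReLU network f, `train_user` mirrors the per-user squared-loss descent of Algorithm 3, and `infer_relative_group` applies the ((ν−1)/ν)γ threshold.

```python
import numpy as np

def predict(x, theta):
    # stand-in for the neural user-learner f(x; theta); a linear model
    # keeps the sketch short (the paper uses a deep ReLU network)
    return float(theta @ x)

def train_user(theta, history, lr=0.1, iters=100):
    """Sketch of Algorithm 3: gradient descent on the squared loss
    over the user's collected data T^u_t = [(x, r), ...]."""
    theta = theta.copy()
    for _ in range(iters):
        for x, r in history:
            grad = (predict(x, theta) - r) * x   # d/dtheta of 0.5*(f - r)^2
            theta -= lr * grad
    return theta

def infer_relative_group(x, thetas, serving_user, nu, gamma):
    """Group-inference rule: admit users whose prediction on arm x lies
    within ((nu-1)/nu) * gamma of the serving user's prediction."""
    base = predict(x, thetas[serving_user])
    thr = (nu - 1) / nu * gamma
    return {u for u, th in thetas.items() if abs(predict(x, th) - base) <= thr}
```

With ν = 2 and γ = 0.4, two users whose predictions on an arm differ by 0.05 land in the same estimated group, while a user 0.7 away is excluded.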
Thus, given an arm x_{t,i}, we update Θ in round t, denoted by Θ_{t,i}, by minimizing the following meta loss:

L_{N̂_{u_t}(x_{t,i})} = Σ_{u ∈ N̂_{u_t}(x_{t,i})} L(θ^u_{µ^u_{t−1}}) + (λ/√m) Σ_{u ∈ N̂_{u_t}(x_{t,i})} ∥θ^u_{µ^u_{t−1}}∥_1,

where θ^u_{µ^u_{t−1}} are the user parameters stored by Algorithm 3 at round t−1. Here, we add L1-regularization on the meta-learner to prevent overfitting in practice and to neutralize the vanishing gradient in the convergence analysis. Then, the meta-learner is updated by Θ = Θ − η_2 ∇_{{θ^u_{µ^u_{t−1}}}_{u ∈ N̂_{u_t}(x_{t,i})}} L_{N̂_{u_t}(x_{t,i})}, where η_2 is the meta learning rate and the gradient is the sum of the gradients of L_{N̂_{u_t}(x_{t,i})} with respect to all the user-learners in the group N̂_{u_t}(x_{t,i}). Algorithm 2 shows the meta update with stochastic gradient descent (SGD).

Algorithm 2: GradientDescent_Meta(N, Θ_{t−1})
1: Θ^(0) = Θ_{t−1} (or Θ_0)
2: for j = 1, 2, . . . , J_2 do
3:   for u ∈ N do
4:     Collect T^u_{t−1}; randomly choose T̃^u ⊆ T^u_{t−1}
5:     L(θ^u_{µ^u_{t−1}}) = (1/2) Σ_{(x,r) ∈ T̃^u} (f(x; θ^u_{µ^u_{t−1}}) − r)^2
6:   L_N = Σ_{u ∈ N} L(θ^u_{µ^u_{t−1}}) + (λ/√m) Σ_{u ∈ N} ∥θ^u_{µ^u_{t−1}}∥_1
7:   Θ^(j) = Θ^(j−1) − η_2 ∇_{{θ^u_{µ^u_{t−1}}}_{u ∈ N}} L_N
8: Return: Θ_t = Θ^(J_2)

Algorithm 3: GradientDescent_User(u, Θ_t)
1: Collect T^u_t  # historical data of u up to round t
2: θ^u_(0) = Θ_t (or Θ_0)
3: for j = 1, 2, . . . , J_1 do
4:   Randomly choose T̃^u ⊆ T^u_t
5:   L(T̃^u; θ^u) = (1/2) Σ_{(x,r) ∈ T̃^u} (f(x; θ^u) − r)^2
6:   θ^u_(j) = θ^u_(j−1) − η_1 ∇_{θ^u_(j−1)} L(T̃^u; θ^u)
7: Return: θ^u_t = θ^u_(J_1)

Note that linear clustering of bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021) represents the group behavior Θ by a linear combination of user-learners, e.g., Θ = (1/|N̂_{u_t}(x_{t,i})|) Σ_{u ∈ N̂_{u_t}(x_{t,i})} θ^u_{µ^u_{t−1}}. This may not hold in the real world. Instead, we use the meta adaptation to update the meta-learner Θ according to N̂_{u_t}(x_{t,i}), which can represent non-linear combinations of user-learners (Finn et al., 2017; Wang et al., 2020b). UCB Exploration (to C4).
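A single sketched meta step, under our first-order reading of the update rule above, could look like the following. The linear model stands in for the network, and the gradient is taken at each stored user-learner (not at Θ itself), summed over the group, with the (λ/√m) L1 term added; this is an illustrative simplification, not a verified implementation of Algorithm 2.

```python
import numpy as np

def meta_step(Theta, user_params, user_histories, lam=0.01, m=100, eta2=0.05):
    """One sketched meta step: sum the squared-loss gradients (plus an
    L1-regularization term) evaluated at each user-learner in the group,
    then move the meta parameter Theta along the summed direction."""
    grad = np.zeros_like(Theta)
    for u, theta_u in user_params.items():
        for x, r in user_histories[u]:
            grad += (theta_u @ x - r) * x              # gradient of 0.5*(f - r)^2
        grad += (lam / np.sqrt(m)) * np.sign(theta_u)  # L1 term, (lam/sqrt(m))*||theta||_1
    return Theta - eta2 * grad
```

If a group member under-predicts its reward, the summed gradient points Θ toward higher predictions, which is the intended fast adaptation of the meta-learner to the group.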
To balance the trade-off between exploiting the current group information and exploring new matches, we introduce the following UCB-based selection criterion. Based on Lemma C.2, with probability at least 1 − δ, after T rounds, the cumulative error induced by the meta-learner is upper bounded by

Σ_{t=1}^{T} E_{r_t | x_t}[ |f(x_t; Θ_t) − r_t| | u_t ] ≤ Σ_{t=1}^{T} O( ∥g(x_t; Θ_t) − g(x_t; θ^{u_t}_0)∥_2 / √t )  [meta-side info]  +  Σ_{u ∈ N} µ^u_t O( (L+1)/√(2µ^u_t) + √( 2 log(t/δ) / µ^u_t ) )  [user-side info],

where g(x_t; Θ_t) incorporates the discriminative information of the meta-learner acquired from the collaborative effects within the relative group N̂_{u_t}(x_t), and the O(1/√(µ^u_t)) terms reflect the shrinking confidence interval of the user-learner for a specific user u_t. This bound identifies the information we should include in the selection criterion (U_{t,i} in Algorithm 1) and paves the way for the regret analysis (Theorem 4.2). Therefore, the bound U_{t,i} leverages both the collaborative effects existing in N̂_{u_t}(x_{t,i}) and u_t's personal behavior for exploitation and exploration. Then, we select an arm according to x_t = arg max_{x_{t,i} ∈ X_t} U_{t,i}. To sum up, Algorithm 1 depicts the workflow of Meta-Ban. In each round, given a served user and a set of arms, we compute the meta-learner and its bound for each relative group (Lines 5-9). Then, we choose the arm according to the UCB-type strategy (Line 10). After receiving the reward, we update the meta-learner for the next round (Line 12) and update the user-learner θ^{u_t} (Lines 13-14), because only u_t's collected data is updated. In the end, we update all the other parameters (Lines 15-16).
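The selection criterion can be sketched numerically as follows. The exact grouping of α over the confidence terms is our reading of the formula in Algorithm 1, so treat the constants as assumptions; the point is that the score combines the meta prediction, a meta-side gradient-distance term shrinking with t, and user-side terms shrinking with the user's serving count µ^u_t.

```python
import numpy as np

def ucb_score(f_val, g_meta, g_user0, t, mu_u, L=2, alpha=0.01, delta=0.1):
    """Sketch of U_{t,i}: meta-learner prediction f_val plus a meta-side
    gradient-distance term and user-side confidence terms."""
    meta_side = np.linalg.norm(g_meta - g_user0) / np.sqrt(t)
    user_side = (L + 1) / np.sqrt(2 * mu_u) + np.sqrt(np.log(t / delta) / mu_u)
    return f_val + alpha * (meta_side + user_side)

def select_arm(scores):
    """x_t = argmax_i U_{t,i}."""
    return int(np.argmax(scores))
```

As µ^u_t grows, the user-side terms shrink and the score approaches the pure exploitation value f(x; Θ), which is exactly the intended exploration-to-exploitation transition.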

4. REGRET ANALYSIS

In this section, we provide the regret analysis of Meta-Ban and a comparison with closely related works. The analysis is built in the framework of meta-learning under the over-parameterized neural network regime (Jacot et al., 2018; Allen-Zhu et al., 2019; Zhou et al., 2020). Given an arm x_{t,i} ∈ R^d with ∥x_{t,i}∥_2 = 1, t ∈ [T], i ∈ [k], without loss of generality, we define f as a fully-connected network with depth L ≥ 2 and width m:

f(x_{t,i}; θ or Θ) = W_L σ(W_{L−1} σ(W_{L−2} · · · σ(W_1 x_{t,i}))),

where σ is the ReLU activation function, W_1 ∈ R^{m×d}, W_l ∈ R^{m×m} for 2 ≤ l ≤ L−1, W_L ∈ R^{1×m}, and θ, Θ = [vec(W_1)^⊺, vec(W_2)^⊺, . . . , vec(W_L)^⊺]^⊺ ∈ R^p. To conduct the analysis, we need the following initialization and mild assumptions. Initialization. For l ∈ [L−1], each entry of W_l is drawn from the normal distribution N(0, 2/m); each entry of W_L is drawn from the normal distribution N(0, 1/m). Assumption 4.1 (Arm Separability). For any pair x_{t,i}, x_{t',i'}, t, t' ∈ [T], i, i' ∈ [k], (t, i) ≠ (t', i'), there exists a constant 0 < ρ ≤ O(1/L) such that ∥x_{t,i} − x_{t',i'}∥_2 ≥ ρ. Assumption 4.1 is satisfied as long as no two arms are identical. It is the standard input assumption for over-parameterized neural networks (Allen-Zhu et al., 2019). Moreover, most existing neural bandit works (e.g., Assumption 4.2 in (Zhou et al., 2020), 3.4 in (Zhang et al., 2021), 4.1 in (Kassraie and Krause, 2022)) make comparable assumptions with equivalent constraints: they require that the smallest eigenvalue λ_0 of the neural tangent kernel (NTK) matrix formed by all arm contexts is positive, which implies that no two arms can be identical. As L can be set manually, the condition 0 < ρ ≤ O(1/L) is easily satisfied (e.g., L = 2). Then, we provide the following regret upper bound for Meta-Ban with gradient descent. Theorem 4.2.
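The network definition and its initialization scheme above can be instantiated directly. The following sketch (our own NumPy rendering, with hypothetical helper names) builds the weight list with entries drawn from N(0, 2/m) for the hidden layers and N(0, 1/m) for the output layer, and evaluates the ReLU forward pass.

```python
import numpy as np

def init_network(d, m, L, seed=0):
    """Weights of the fully-connected network from Section 4, with the
    stated initialization: hidden entries ~ N(0, 2/m) (std sqrt(2/m)),
    output entries ~ N(0, 1/m) (std sqrt(1/m))."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]            # W_1
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
           for _ in range(L - 2)]                                    # W_2 .. W_{L-1}
    Ws.append(rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m)))        # W_L
    return Ws

def forward(Ws, x):
    """f(x; theta) = W_L sigma(W_{L-1} ... sigma(W_1 x)) with ReLU sigma."""
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)
    return float(Ws[-1] @ h)
```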
Given the number of rounds T, assume that each user is served uniformly and set T^u = T^u_t, ∀t ∈ [T]. For any δ ∈ (0, 1), 0 < ρ ≤ O(1/L), 0 < ϵ_1 ≤ ϵ_2 ≤ 1, λ > 0, suppose m, η_1, η_2, J_1, J_2 satisfy

m ≥ Ω( max{ poly(T, L, ρ^{−1}), e^{√(log(O(Tk)/δ))} } ),
η_1 = Θ( ρ / (poly(T, L) · m) ),
η_2 = min{ Θ( √n ρ / (T^4 L^2 m) ), Θ( √ρ ϵ_2 / (T^2 L^2 λ n^2) ) },
J_1 = Θ( (poly(T, L)/ρ^2) log(1/ϵ_1) ),
J_2 = max{ Θ( T^5 (O(T log^2 m) − ϵ_2) L^2 m / (√n ϵ_2 ρ) ), Θ( T^3 L^2 λ n^2 (O(T log^2 m) − ϵ_2) / (ρ ϵ_2) ) }.   (5)

Then, with probability at least 1 − δ over the random initialization, Algorithms 1-3 have the following regret upper bound:

R_T ≤ O(√n) ( √T + L√T + √(2T log(O(T)/δ)) ) + O(1).

Comparison with clustering of bandits. The existing works on clustering of bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021) are all based on the linear reward assumption and achieve the following regret bound complexity: R_T ≤ O(d √(Tn) log T). Comparison with neural bandits. The regret analysis for a single neural bandit (Zhou et al., 2020; Zhang et al., 2021) has been developed recently (n = 1 in this case), achieving R_T ≤ O(d̃ √T log T), where d̃ = log det(I + H/λ) / log(1 + Tn/λ), H is the neural tangent kernel (NTK) matrix (Zhou et al., 2020; Jacot et al., 2018), and λ is a regularization parameter. d̃ is the effective dimension, first introduced by Valko et al. (2013) to measure the underlying non-linear dimensionality of the NTK kernel space. Remark 4.3 (Improving by O(√log T)). It is easy to observe that Meta-Ban achieves O(√(T log T)), an improvement by a multiplicative factor of O(√(log T)) over the above existing works. Note that these works (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021; Zhou et al., 2020; Zhang et al., 2021) all explicitly apply the Confidence Ellipsoid Bound (Theorem 2 in (Abbasi-Yadkori et al., 2011)) in their analysis, which inevitably introduces the complexity term O(log T).
In contrast, Meta-Ban builds a generalization bound for the user-learner (Lemma E.1), inspired by recent advances in over-parameterized networks (Cao and Gu, 2019), which only brings in the complexity term O(√(log T)). Then, we show that the estimations of the meta-learner and the user-learner are close enough when θ and Θ are close enough, to bound the error incurred by the meta-learner (Lemma C.1). Thus, we have a different and novel UCB-type analysis from previous works. These techniques lead to the non-trivial improvement of O(√(log T)). Remark 4.4 (Removing the Input Dimension). The regret bound of Meta-Ban contains neither d nor d̃. When the input dimension is large (e.g., d ≥ T), it may cause a considerable amount of error in R_T. The effective dimension d̃ may also incur this predicament when the determinant of H is very large. As (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021) build the confidence ellipsoid for θ* (the optimal parameters) based on the linear function E[r_{t,i} | x_{t,i}] = ⟨x_{t,i}, θ*⟩, their regret bounds contain d because x_{t,i} ∈ R^d. Similarly, (Zhou et al., 2020; Zhang et al., 2021) construct the confidence ellipsoid for θ* according to the linear function E[r_{t,i} | x_{t,i}] = ⟨g(x_{t,i}; θ_0), θ* − θ_0⟩, and thus their regret bounds are affected by d̃ due to g(x_{t,i}; θ_0) ∈ R^p (d̃ reaches p in the worst case). On the contrary, the generalization bound derived in our analysis is comprised only of the convergence error (Lemma D.1) and the concentration bound (Lemma E.3). Both terms are independent of d and d̃, which paves the way for Meta-Ban to remove the curse of d and d̃. Remark 4.5 (Removing the i.i.d. Arm Assumption). We do not impose any assumption on the distribution of arms. However, the related clustering of bandit works (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017) assume that the arms are i.i.d.
drawn from some distribution in each round, which may not be a mild assumption. In our proof, we build the martingale difference sequence depending only on the reward side (Lemma E.3), which is novel, to derive the generalization bound of the user-learner and remove the i.i.d. arm assumption. Relative group guarantee. Compared to the detected group N̂_{u_t}(x_{t,i}) (Eq. (3)), we emphasize that N_{u_t}(x_{t,i}) (with u_t ∈ N_{u_t}(x_{t,i})) is the ground-truth relative group satisfying Definition 2.1. Suppose the γ-gap holds among N; we prove that when t is larger than a constant, i.e., t ≥ T̃ (given below), with probability at least 1 − δ, it is expected over all selected arms that N_{u_t}(x_t) ⊆ N̂_{u_t}(x_t), and N̂_{u_t}(x_t) = N_{u_t}(x_t) if ν ≥ 2. Then, for ν, we have: (1) when ν ↑, we have more chances to explore collaboration with other users, at the cost of more rounds (T̃ ↑); (2) when ν ↓, we limit the potential cooperation with other users while saving exploration rounds (T̃ ↓). More details and the proof of Lemma 4.6 are in Appendix F. Lemma 4.6 (Relative group guarantee). Assume the groups in N satisfy the γ-gap (Definition 2.2) and the conditions of Theorem 4.2 hold. For any ν > 1, with probability at least 1 − δ over the random initialization, there exist constants c_1, c_2 such that when

t ≥ n ( (64ν^2(1+ξ_t)^2 / γ^2) log( 32ν^2(1+ξ_t)^2 / γ^2 ) + (9L^2 c_1^2 + 4ϵ_1 + 2ζ_t^2) / (4(1+ξ_t)^2) − (log δ) / γ^2 ) (1 + 3n log(n/δ)) = T̃,

given a user u ∈ N, it holds uniformly for Algorithms 1-3 that E_{x_τ ∼ T^u_t|x}[ N_u(x_τ) ⊆ N̂_u(x_τ) ] and E_{x_τ ∼ T^u_t|x}[ N̂_u(x_τ) = N_u(x_τ) ] if ν ≥ 2, where x_τ is uniformly drawn from T^u_t|x and T^u_t|x = {x_τ : u_τ = u ∧ τ ∈ [t]} is the set of all historical selected arms when serving u up to round t.
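The trade-off governed by ν can be seen directly from the detection threshold used in group inference. The following one-liner (our illustration) shows that the threshold ((ν−1)/ν)γ increases toward γ as ν grows, admitting more users into the estimated group at the cost of more exploration rounds.

```python
def group_threshold(nu, gamma):
    """Detection threshold ((nu-1)/nu) * gamma used in group inference.
    Larger nu -> threshold approaches gamma -> more inclusive estimated
    groups (more collaboration, more exploration rounds); small nu
    yields a conservative group."""
    assert nu > 1 and 0 < gamma < 1
    return (nu - 1) / nu * gamma
```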

5. EXPERIMENTS

Figure 1: Regret comparison on ML datasets (10 runs). Meta-Ban outperforms all baselines. Specifically, compared to the best baseline, Meta-Ban improves by 26.2% on Mnist and Notmnist, 12.2% on Cifar10, 25.2% on Emnist (Letter), and 28.8% on Shuttle.

In this section, we evaluate Meta-Ban's empirical performance on 8 ML and 2 real-world recommendation datasets, compared to six strong state-of-the-art baselines. We first present the setup and then the results of the experiments. More details are in Appendix A. ML datasets. We use 8 public classification datasets: Mnist (LeCun et al., 1998), Notmnist (Bulatov, 2011), Cifar10 (Krizhevsky et al., 2009), Emnist (Letter) (Cohen et al., 2017), Shuttle (Dua and Graff, 2017), Fashion (Xiao et al., 2017), Mushroom (Dua and Graff, 2017), and Magictelescope (Dua and Graff, 2017). Note that ML datasets are widely used for evaluating the performance of neural bandit algorithms (e.g., (Zhou et al., 2020; Zhang et al., 2021)), as they test an algorithm's ability to learn various non-linear functions between rewards and arm contexts. On ML datasets, we consider each class as a user holding an exclusive reward function. As some classes are correlated, the goal on ML datasets is also to find the classes with strong correlations and leverage this information to improve the quality of classification. Recommendation datasets. We also use two recommendation datasets for evaluation: Movielens (Harper and Konstan, 2015) and Yelp. The descriptions are in Appendix A.2. Baselines. We compare Meta-Ban to six State-Of-The-Art (SOTA) baselines as follows: (1) CLUB (Gentile et al., 2014); (2) COFIBA (Li et al., 2016); (3) SCLUB (Li et al., 2019); (4) LOCB (Ban and He, 2021); (5) NeuUCB-ONE (Zhou et al., 2020); (6) NeuUCB-IND (Zhou et al., 2020).

6. CONCLUSION

In this paper, we introduce the problem of Neural Collaborative Filtering Bandits to incorporate collaborative effects in bandits with generic reward assumptions. Then, we propose Meta-Ban to solve this problem, where a meta-learner is assigned to represent and rapidly adapt to dynamic groups, along with a new informative UCB-type exploration. Moreover, we provide the regret analysis of Meta-Ban and show that it achieves a sharper regret upper bound than closely related works. In the end, we conduct extensive experiments to evaluate its empirical performance compared to SOTA baselines.

Figure 3: Regret comparison on recommendation datasets (10 runs). Meta-Ban outperforms all baselines. Specifically, compared to the best baseline, Meta-Ban improves by 7.02% on Movielens and 2.6% on Yelp.

A SUPPLEMENTARY

In this section, we first introduce the related works and then present the experiment setup, along with extensive ablation studies.

A.1 RELATED WORK

In this section, we briefly review the related works, including clustering of bandits and neural bandits. However, these approaches do not provide regret analysis. Zhou et al. (2020) and Zhang et al. (2021) first provide regret analyses for UCB-based and TS-based neural bandits, where they apply ridge regression in the space of gradients. Ban et al. (2021a) study a combinatorial problem in multiple neural bandits with a UCB-based exploration. Jia et al. (2021) perturb the training samples to incorporate both exploitation and exploration. EE-Net (Ban et al., 2021b) proposes to use another neural network for exploration. Xu et al. (2020) combine the last-layer neural network embedding with linear UCB to improve computational efficiency. Unfortunately, all these methods neglect the collaborative effects among users in contextual bandits. Dutta et al. (2019) use an off-the-shelf meta-learning approach to solve the contextual bandit problem, in which the expected reward is formulated as a Q-function. Santana et al. (2020) propose a hierarchical reinforcement learning framework for recommendation in dynamic experiments, where a meta-bandit is used to select among independent recommender systems. Kassraie and Krause (2022) revisit NeuralUCB-type algorithms and show a sublinear regret bound without restrictive assumptions on the context. Maillard and Mannor (2014); Hong et al. (2020) study the latent bandit problem, where the reward distributions of arms are conditioned on an unknown discrete latent state, and prove sublinear regret bounds for their algorithms as well. Key Differences from Related Work. We emphasize that we make important improvements in each aspect. (1) Compared to (Gentile et al., 2017), the only similarity is that we adopt the idea of leveraging relative groups.
(2) Compared to NeuUCB (Zhou et al., 2020), in addition to the fact that they do not incorporate collaborative filtering effects, we provide important technical improvements. The UCB in NeuUCB has to maintain a gradient outer-product matrix (Z_t in NeuUCB), which occupies R^{p×p} space (θ ∈ R^p), and only incorporates user-side information. The new UCB introduced in our paper does not need to keep the gradient matrix and contains both group-side and user-side information. (3) Compared to (Wang et al., 2020a), we achieve the convergence of the meta-learner in the online learning setting with bandit feedback, where we need to tackle the challenge that the training data of each round may come from different user distributions.

A.2 EXPERIMENTS SETUP AND ADDITIONAL RESULTS

ML Datasets. In all ML datasets, following the evaluation setting of existing works (Zhou et al., 2020; Valko et al., 2013; Deshmukh et al., 2017), we transform the classification problem into a bandit problem. Take Mnist as an example. Given an image x ∈ R^d, it is transformed into 10 arms, x_1 = (x^⊤, 0, . . . , 0)^⊤, x_2 = (0, x^⊤, . . . , 0)^⊤, . . . , x_10 = (0, 0, . . . , x^⊤)^⊤, matching the 10 classes in sequence. The reward is 1 if the index of the selected arm equals x's ground-truth class; otherwise, the reward is 0. In the experiments on Cifar10, Emnist, and Shuttle, we consider each class as a user; we randomly draw a class first and then randomly draw a sample from that class. Note that some classes have strong correlations, and thus these datasets evaluate an approach's ability to detect and leverage these correlated classes. In the experiments on Mnist and Notmnist (in Figure 1), we merge these two datasets, as both are 10-class classification datasets, to increase the difficulty of the problem. Thus, we consider the two datasets as two groups, where each class can be thought of as a user. In each round, we randomly select a group (i.e., Mnist or Notmnist) and then randomly choose an image from a class (user). Note that we also run all approaches on Mnist alone (in Figure 2) instead of on Mnist and Notmnist together (in Figure 1). Movielens (Harper and Konstan, 2015) and Yelp datasets. MovieLens is a recommendation dataset consisting of 25 million reviews between 1.6 × 10^5 users and 6 × 10^4 movies. Yelp is a dataset released in the Yelp dataset challenge, composed of 4.7 million review entries made by 1.18 million users for 1.57 × 10^5 restaurants. For both datasets, we extract the ratings from the reviews and build the rating matrix by selecting the top 2000 users and top 10000 restaurants (movies).
Then, we use singular-value decomposition (SVD) to extract a normalized 10-dimensional feature vector for each user and each restaurant (movie). The goal is to select the restaurants (movies) with bad ratings (due to the imbalance of these two datasets, i.e., most entries have good ratings). Given an entry with a specific user, we generate the reward using the user's rating stars for the restaurant (movie): if the rating is less than 2 stars (out of 5), the reward is 1; otherwise, the reward is 0. Since a single user may not have enough entries to run the experiments, we use K-means to divide the users into 50 clusters, where each cluster forms a new user. Therefore, the user pool consists of 50 users for these two datasets. Then, in each round, a user u_t to serve is randomly drawn from the user pool. For the arm pool, we randomly choose one restaurant (movie) rated by u_t with reward 1 and randomly pick 9 other restaurants (movies) rated by u_t with reward 0, so there are 10 arms in each round. We conduct experiments on these two datasets separately. Baselines. We compare Meta-Ban to six state-of-the-art (SOTA) baselines: (1) CLUB (Gentile et al., 2014) clusters users based on the connected components in the user graph and refines the groups incrementally.
When selecting an arm, it uses the newly formed group parameter instead of the user parameter, with UCB-based exploration; (2) COFIBA (Li et al., 2016) clusters on both the user and arm sides based on evolving graphs and chooses arms using a UCB-based exploration strategy; (3) SCLUB (Li et al., 2019) improves CLUB by allowing groups to merge and split to enhance the group representation; (4) LOCB (Ban and He, 2021) uses seed-based clustering, allows groups to overlap, and chooses the best group candidates when selecting arms; (5) NeuUCB-ONE (Zhou et al., 2020) uses one neural network to model all users and selects arms via a UCB-based strategy; (6) NeuUCB-IND (Zhou et al., 2020) uses one neural network per user (N networks in total) and applies the same strategy to choose arms. Since LinUCB (Li et al., 2010) and KernelUCB (Valko et al., 2013) are outperformed by the above baselines, we do not include them in the comparison. Configurations. All methods have two parameters: λ, which tunes the regularization at initialization, and α, which adjusts the UCB value. To find their best performance, we grid-search λ and α over (0.01, 0.1, 1) and (0.0001, 0.001, 0.01, 0.1), respectively. For LOCB, the number of random seeds is set to 20, following its default setting. For Meta-Ban, we set ν = 5 and γ = 0.4 to tune the group set. For a fair comparison, NeuUCB and Meta-Ban use the same simple neural network with 2 fully connected layers and width m = 100. To save running time, we train the neural networks every 10 rounds in the first 1,000 rounds and every 100 rounds afterwards. In our implementation, gradient descent (Algorithms 2 and 3) stops when the training error falls below 0.001, with J_1 and J_2 capped at 1,000.
In the end, we choose the best results for the comparison and report the mean and standard deviation (shaded regions in figures) over 10 runs for all methods.
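The classification-to-bandit transformation described above (placing the context in one of k disjoint blocks per arm) can be sketched as follows. This is a minimal NumPy sketch under the stated setup; the function names are ours, not from the paper's code.

```python
import numpy as np

def to_arms(x, n_classes=10):
    """Turn one context x in R^d into n_classes arms: arm i is the vector
    (0, ..., x, ..., 0) with x placed in the i-th block of length d."""
    d = x.shape[0]
    arms = np.zeros((n_classes, n_classes * d))
    for i in range(n_classes):
        arms[i, i * d:(i + 1) * d] = x
    return arms

def reward(chosen_arm_index, true_class):
    """Reward is 1 iff the selected arm's index matches the ground-truth class."""
    return 1.0 if chosen_arm_index == true_class else 0.0

x = np.random.rand(784)   # e.g., a flattened 28x28 Mnist image
arms = to_arms(x)         # shape (10, 7840), one row per candidate arm
```

The bandit algorithm then scores the 10 rows and receives reward 1 only when the chosen row's index is the image's true class.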

A.3 SENSITIVITY STUDY FOR ν AND γ

In this section, we conduct a sensitivity study for the group parameter ν. Here, we fix γ = 0.4 and vary ν to measure its effect on Meta-Ban's performance. Figure 4 shows how Meta-Ban's performance varies with respect to ν. When ν = 1.1, the exploration range of groups is very narrow; in each round, the inferred group size |N̂_{u_t}(x_{t,i})| tends to be small. Although the members of the inferred group N̂_{u_t}(x_{t,i}) are then more likely to be true members of u_t's relative group, we may lose many other potential group members in the beginning phase. When ν = 5, the exploration range of groups is wider. This gives us more chances to include additional members in the inferred group, although the group may contain some false positives. With a larger group, the meta-learner Θ can exploit more information. Therefore, Meta-Ban with ν = 5 outperforms ν = 1.1. However, continuing to increase ν does not always improve performance, since the inferred group may then include non-collaborative users, introducing noise. Therefore, in practice, we usually set ν to a relatively large value; one can even set ν as a monotonically decreasing function of t. Figure 5 depicts the sensitivity of Meta-Ban with regard to α. Meta-Ban shows robust performance as α varies, which stems from the strong discriminability of the meta-learner and the derived upper bound. Even as the magnitude of α changes, the order of the arms ranked by Meta-Ban is only slightly affected. Thus, Meta-Ban obtains robust performance, alleviating hyperparameter tuning.
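The role of ν above can be seen directly from the group-inference rule used in the relative-group analysis (Appendix F): a user u′ is admitted to the inferred group when |f(x; θ^u) − f(x; θ^{u′})| ≤ ((ν − 1)/ν)γ, so the admission threshold grows with ν. A toy sketch (our own illustration; function names and numbers are hypothetical):

```python
def group_threshold(nu, gamma=0.4):
    """Admission threshold for the inferred group: (nu - 1) / nu * gamma."""
    return (nu - 1) / nu * gamma

def inferred_group(f_u, f_others, nu, gamma=0.4):
    """Indices of candidate users whose predictions fall within the threshold
    of the serving user's prediction f_u."""
    thr = group_threshold(nu, gamma)
    return [j for j, f in enumerate(f_others) if abs(f_u - f) <= thr]

# Toy predictions for 5 candidate users; a larger nu admits more of them.
preds = [0.10, 0.15, 0.30, 0.45, 0.80]
narrow = inferred_group(0.12, preds, nu=1.1)  # threshold ~ 0.036
wide = inferred_group(0.12, preds, nu=5.0)    # threshold = 0.32
```

With ν = 1.1 only the two closest candidates are admitted, while ν = 5 also picks up a borderline candidate, mirroring the precision-versus-coverage trade-off discussed above.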

A.5 ABLATION STUDY FOR INPUT DIMENSION

We run the experiments on the MovieLens dataset with different input dimensions and report the results as follows. Table 3 summarizes the final regret of each method with different input dimensions (5 runs). Meta-Ban maintains similar regret with small fluctuations. These fluctuations are acceptable given that different input dimensions may contain different amounts of information. NeuUCB-ONE and NeuUCB-IND also use neural networks to learn the reward function, so they exhibit a similar property. In contrast, the regret of the linear bandits (CLUB, COFIBA, SCLUB, LOCB) is affected much more drastically by the input dimension, which complies with their regret analysis.

Suppose $\mathcal{T}_u = \mathcal{T}^u_t, \forall t \in [T]$. For any $\delta \in (0, 1)$, $0 < \rho \le O(\frac{1}{L})$, $0 < \epsilon_1 \le \epsilon_2 \le 1$, $\lambda > 0$, suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy

$$m \ge \Omega\Big( \max\Big\{ \text{poly}(T, L, \rho^{-1}),\ e^{\sqrt{\log(O(Tk)/\delta)}} \Big\} \Big), \qquad \eta_1 = \Theta\Big( \frac{\rho}{\text{poly}(T, L)\cdot m} \Big),$$

$$\eta_2 = \min\Big\{ \Theta\Big( \frac{\sqrt{n}\,\rho}{T^4 L^2 m} \Big),\ \Theta\Big( \frac{\sqrt{\rho\epsilon_2}}{T^2 L^2 \lambda n^2} \Big) \Big\}, \qquad J_1 = \Theta\Big( \frac{\text{poly}(T, L)}{\rho^2} \Big) \log\frac{1}{\epsilon_1},$$

$$J_2 = \max\Big\{ \Theta\Big( \frac{T^5 (O(T\log^2 m) - \epsilon_2) L^2 m}{\sqrt{n}\,\epsilon_2\,\rho} \Big),\ \Theta\Big( \frac{T^3 L^2 \lambda n^2 (O(T\log^2 m - \epsilon_2))}{\rho\,\epsilon_2} \Big) \Big\}. \qquad (6)$$

Then, with probability at least $1 - \delta$, Algorithms 1-3 have the following regret upper bound:

$$R_T \le 2\sqrt{n}\Big[ \epsilon_1 T + O(L\sqrt{T}) + (1 + \xi_1)\sqrt{2T\log(T/\delta)} \Big] + O\big(T\sqrt{\log m}\,\beta_T^{4/3} L^4\big) + Z_T,$$

where

$$\xi_T = 2 + O\Big( \frac{T^4 n L \sqrt{\log m}}{\rho\sqrt{m}} \Big) + O\Big( \frac{T^5 n L^2 \log^{11/6} m}{\rho m^{1/6}} \Big), \qquad \beta_T = \frac{O(n^2 T^3 \sqrt{\epsilon_2}\log^2 m) + O(T^2\log^2 m - T\epsilon_2)\,\rho^{1/2}\lambda n}{O(\rho\sqrt{m}\,\epsilon_2)},$$

$$Z_T = O\Big( \frac{T^5 L^2 \log^{11/6} m}{\rho m^{1/6}} \Big) + T(L+1)^2\sqrt{m\log m}\,\beta_T^{4/3} + O\Big( \frac{L T^4}{\rho\sqrt{m}}\sqrt{\log m} \Big) + O\Big( \frac{L^4 T^5}{\rho^{4/3} m^{2/3}}\log^{4/3} m \Big).$$

With the proper choice of $m$, we have

$$R_T \le O(\sqrt{n})\Big[ \sqrt{T} + L\sqrt{T} + \sqrt{2T\log(T/\delta)} \Big] + O(1).$$

Proof Overview. Different from existing works (Zhou et al., 2020; Zhang et al., 2021), which bound the regret of one round by kernel regression in the Neural Tangent Kernel regime, we directly upper bound the expected regret over all $T$ rounds by building a martingale difference sequence with respect to $h_u$. First, we decompose the regret of $T$ rounds into three key terms (Eq. (9)): the first term is the error induced by the user learner $\theta_u$; the second term is the distance between the user learner and the meta-learner; and the third term is the error induced by the meta-learner $\Theta$. Lemma E.2 provides an upper bound for the first term; it is an extension of Lemma E.3, which is key to removing the dependence on the input dimension. Lemma E.3 has three terms of complexity $O(\sqrt{T})$: the first is the training error induced by a class of functions around initialization, the second is the price of choosing the function class, and the third is the confidence interval induced by a concentration inequality for $f(\cdot; \theta_u)$. Lemma C.1 bounds the distance between the user learner and the meta-learner; as this bound carries a factor $O(1/\sqrt{m})$, it can be driven below $\sqrt{T}$ with a proper choice of $m$. Lemma C.2 bounds the error induced by the meta-learner using the triangle inequality, bridged by the user learner. Bounding the three terms in Eq. (9) completes the proof.

Proof. Let $x^*_t = \arg\max_{x_{t,i} \in X_t} h_{u_t}(x_{t,i})$ given $X_t, u_t$, and let $\Theta^*_t$ be the corresponding parameters trained by Algorithm 2 based on $\hat{N}^{u_t}_t(x^*_t)$. Then, for the regret of one round $t \in [T]$, we have

$$
\begin{aligned}
R_t | u_t &= \mathbb{E}_{r_{t,i}|x_{t,i}, i\in[k]}\big[r^*_t - r_t \mid u_t\big]\\
&= \mathbb{E}_{r_{t,i}|x_{t,i}, i\in[k]}\big[r^*_t - f(x^*_t;\theta^{u_t,*}_{t-1}) + f(x^*_t;\theta^{u_t,*}_{t-1}) - r_t\big]\\
&= \mathbb{E}_{r_{t,i}|x_{t,i}, i\in[k]}\big[r^*_t - f(x^*_t;\theta^{u_t,*}_{t-1}) + f(x^*_t;\theta^{u_t,*}_{t-1}) - f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t) + f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t) - r_t\big]\\
&\le \mathbb{E}_{r^*_t|x^*_t}\big[r^*_t - f(x^*_t;\theta^{u_t,*}_{t-1})\big] + \big|f(x^*_t;\theta^{u_t,*}_{t-1}) - f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t)\big| + \mathbb{E}_{r_t|x_t}\big[f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t) - r_t\big] \qquad (8)
\end{aligned}
$$

where the expectation is taken over $r_{t,i}$ conditioned on $x_{t,i}$ for each $i \in [k]$; $\theta^{u_t,*}_{t-1}$ are intermediate user parameters introduced in Lemma E.4, trained by Algorithm 3 on the Bayes-optimal pairs, e.g., $(x^*_{t-1}, r^*_{t-1})$; and $\Theta^*_t$ are the meta-parameters trained on the group $\hat{N}^{u_t}_t(x^*_t)$ using Algorithm 2.
Then, the cumulative regret of $T$ rounds can be upper bounded by

$$
R_T = \sum_{t=1}^T R_t|u_t \le \sum_{t=1}^T \mathbb{E}_{r^*_t|x^*_t}\big|r^*_t - f(x^*_t;\theta^{u_t,*}_{t-1})\big| + \sum_{t=1}^T \big|f(x^*_t;\theta^{u_t,*}_{t-1}) - f(x^*_t;\Theta^*_t)\big| + \sum_{t=1}^T \mathbb{E}_{r_t|x_t}\big[f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t) - r_t\big]
$$

$$
\overset{(a)}{\le} \sum_{u\in N}\Big[\epsilon_1\mu^u_T + O\big(L\sqrt{\mu^u_t}\big) + (1+\xi_1)\sqrt{2\mu^u_t\log(T/\delta)}\Big] + \sum_{t=1}^T\Big[\beta_t\,\|g(x^*_t;\Theta^*_t) - g(x^*_t;\theta^{u_t,*}_0)\|_2 + Z_t\Big] + \sum_{t=1}^T \mathbb{E}_{r_t|x_t}\big[f(x^*_t;\Theta^*_t)\,|\,\hat{N}^{u_t}_t(x^*_t) - r_t\big]
$$

$$
\overset{(b)}{\le} \sum_{u\in N}\Big[\epsilon_1\mu^u_T + O\big(L\sqrt{\mu^u_t}\big) + (1+\xi_1)\sqrt{2\mu^u_t\log(T/\delta)}\Big] + \sum_{t=1}^T\Big[\beta_t\,\|g(x_t;\Theta_t) - g(x_t;\theta^{u_t}_0)\|_2 + Z_t\Big] + \sum_{t=1}^T \mathbb{E}_{r_t|x_t}\big[f(x_t;\Theta_t)\,|\,\hat{N}^{u_t}_t(x_t) - r_t\big] \qquad (9)
$$

where (a) applies Lemma E.2 and Lemma C.1, and (b) is due to the selection criterion of Algorithm 1, where $\theta^{u_t}_0 = \theta^{u_t,*}_0$ according to our initialization. Thus, we have

$$
R_T \le \sum_{u\in N}\Big[\epsilon_1\mu^u_T + O\big(L\sqrt{\mu^u_t}\big) + (1+\xi_1)\sqrt{2\mu^u_t\log(T/\delta)}\Big] + \sum_{t=1}^T\Big[\beta_t\,\|g(x_t;\Theta_t) - g(x_t;\Theta_0)\|_2 + Z_t\Big] + \sum_{t=1}^T \big|f(x_t;\Theta_t)\,|\,\hat{N}^{u_t}_t(x_t) - f(x_t;\theta^{u_t}_{t-1})\big| + \sum_{t=1}^T \mathbb{E}_{r_t|x_t}\big[f(x_t;\theta^{u_t}_{t-1}) - r_t\big]
$$

$$
\overset{(c)}{\le} 2\sum_{u\in N}\Big[\epsilon_1\mu^u_T + O\big(L\sqrt{\mu^u_t}\big) + (1+\xi_1)\sqrt{2\mu^u_t\log(T/\delta)}\Big] + 2\sum_{t=1}^T\Big[\beta_t\,\|g(x_t;\Theta_t) - g(x_t;\Theta_0)\|_2 + Z_t\Big]
$$

$$
\overset{(d)}{\le} 2\sqrt{n}\Big[\epsilon_1 T + O\big(L\sqrt{T}\big) + \underbrace{(1+\xi_1)\sqrt{2T\log(T/\delta)}}_{I_3}\Big] + \underbrace{2\sum_{t=1}^T \beta_t\,\|g(x_t;\Theta_t) - g(x_t;\Theta_0)\|_2}_{I_1} + \underbrace{2\sum_{t=1}^T Z_t}_{I_2}
$$

where (c) is an application of Lemma E.1 and Lemma C.1, and (d) is an application of Lemma E.1 with the Hoeffding-Azuma inequality. For $I_1$, recall that $\beta_t = \frac{O(n^2 t^3 \sqrt{\epsilon_2}\log^2 m) + O(t^2\log^2 m - t\epsilon_2)\rho^{1/2}\lambda n}{O(\rho\sqrt{m}\,\epsilon_2)}$.
Then, using Theorem 5 in (Allen-Zhu et al., 2019), we have

$$I_1 \le \sum_{t=1}^T \beta_t \cdot O\big(\sqrt{\log m}\,\beta_t^{1/3} L^3\big)\,\|g(x_t;\Theta_0)\|_2 \overset{E_2}{\le} O\big(T\sqrt{\log m}\,\beta_T^{4/3} L^4\big) \overset{E_3}{\le} O(1) \qquad (10)$$

where $E_2$ is by Lemma E.10 and $E_3$ is because of the choice of $m$ ($\beta_t$ has complexity $O(m^{-1/2})$ and $m \ge \Omega(T^{30})$). For $I_2$, recall that

$$Z_t = O\Big(\frac{(t-1)^4 L^2 \log^{11/6} m}{\rho m^{1/6}}\Big) + (L+1)^2\sqrt{m\log m}\,\beta_t^{4/3} + O\Big(\frac{L(t-1)^3}{\rho\sqrt{m}}\sqrt{\log m}\Big) + O(L\beta_t) + O\Big(L^4\Big(\frac{(t-1)^3}{\rho\sqrt{m}}\sqrt{\log m}\Big)^{4/3}\Big).$$

Then, we have

$$I_2 \le O\Big(\frac{T^5 L^2 \log^{11/6} m}{\rho m^{1/6}}\Big) + T(L+1)^2\sqrt{m\log m}\,\beta_t^{4/3} + O\Big(\frac{L T^4}{\rho\sqrt{m}}\sqrt{\log m}\Big) + O\Big(\frac{L^4 T^5}{\rho^{4/3} m^{2/3}}\log^{4/3} m\Big) = Z_T.$$

$I_2$ has complexity $O(m^{-1/6})$; therefore, $I_2 \le O(1)$ when $m \ge \Omega(T^{30})$. For $I_3$, by the choice of $m$, we have $(1 + \xi_1) \le O(1)$. The proof is complete.

C BRIDGE META-LEARNER AND USER-LEARNER

Lemma C.1. For any $\delta \in (0, 1)$, $\rho \in (0, O(\frac{1}{L})]$, $0 < \epsilon_1 \le \epsilon_2 \le 1$, $\lambda > 0$, suppose $m, \eta_1, \eta_2, J_1, J_2$ satisfy the conditions in Eq. (6). Then, with probability at least $1 - \delta$, for any $t \in [T]$ and $x_t$ satisfying $\|x_t\|_2 = 1$, given the serving user $u \in N$ and $\Theta_t$ returned by Algorithm 2 based on $\hat{N}^u_t(x_t)$, it holds uniformly for Algorithms 1-3 that

$$|f(x_t;\theta^u_{t-1}) - f(x_t;\Theta_t)| \le \beta_t\,\|g(x_t;\Theta_t) - g(x_t;\theta^u_0)\|_2 + Z_t,$$

where

$$\beta_t = \frac{O(n^2 t^3 \sqrt{\epsilon_2}\log^2 m) + O(t^2\log^2 m - t\epsilon_2)\rho^{1/2}\lambda n}{O(\rho\sqrt{m}\,\epsilon_2)}, \qquad Z_t = O\Big(\frac{(t-1)^4 L^2 \log^{11/6} m}{\rho m^{1/6}}\Big) + (L+1)^2\sqrt{m\log m}\,\beta_t^{4/3} + O\Big(\frac{L(t-1)^3}{\rho\sqrt{m}}\sqrt{\log m}\Big).$$

Proof. First, we have

$$|f(x_t;\theta^u_{t-1}) - f(x_t;\Theta_t)| \le \underbrace{\big|f(x_t;\theta^u_{t-1}) - \langle g(x_t;\theta^u_{t-1}),\theta^u_{t-1}-\theta^u_0\rangle - f(x_t;\theta^u_0)\big|}_{I_1} + \underbrace{\big|\langle g(x_t;\theta^u_{t-1}),\theta^u_{t-1}-\theta^u_0\rangle + f(x_t;\theta^u_0) - f(x_t;\Theta_t)\big|}_{I_2}$$

where the inequality uses the triangle inequality. For $I_1$, based on Lemma E.9, we have

$$I_1 \le O\big(w^{1/3} L^2\sqrt{m\log m}\big)\,\|\theta^u_{t-1}-\theta^u_0\|_2 \le O\Big(\frac{t^4 L^2 \log^{11/6} m}{\rho m^{1/6}}\Big),$$

where the second inequality is based on Lemma E.8 (4): $\|\theta^u_{t-1}-\theta^u_0\|_2 \le O\big(\frac{(\mu^u_{t-1})^3}{\rho\sqrt{m}}\sqrt{\log m}\big) \le O\big(\frac{(t-1)^3}{\rho\sqrt{m}}\sqrt{\log m}\big) = w$.
For I 2 , we have |⟨g(x t ; θ u t-1 ), θ u t-1 -θ u 0 ⟩ + f (x t ; θ u 0 ) -f (x t ; Θ t )| ≤ E1 |⟨g(x t ; θ u t-1 ), θ u t-1 -θ u 0 ⟩ -⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩| + |⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩ + f (x t ; θ u 0 ) -f (x t ; Θ t )| ≤ E2 |⟨g(x t ; θ u t-1 ), θ u t-1 -θ u 0 ⟩ -⟨g(x t ; θ u 0 ), Θ t -Θ 0 ⟩| I3 + |⟨g(x t ; θ u 0 ), Θ t -Θ 0 ⟩ -⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩| I4 + |⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩ + f (x t ; θ u 0 ) -f (x t ; Θ t )| I5 where E 1 , E 2 use Triangle inequality. For I 3 , we have |⟨g(x t ; θ u t-1 ), θ u t-1 -θ u 0 ⟩ -⟨g(x t ; θ u 0 ), Θ t -Θ 0 ⟩| ≤|⟨g(x t ; θ u t-1 ), θ u t-1 -θ u 0 ⟩ -⟨g(x t ; θ u 0 ), θ u t-1 -θ u 0 ⟩| + |⟨g(x t ; θ u 0 ), θ u t-1 -θ u 0 ⟩ -⟨g(x t ; θ u 0 ), Θ t -Θ 0 ⟩| ≤ ∥g(x t ; θ u t-1 ) -g(x t ; θ u 0 )∥ 2 • ∥θ u t-1 -θ u 0 ∥ 2 M1 + ∥g(x t ; θ u 0 )∥ 2 • ∥θ u t-1 -θ u 0 -(Θ t -Θ 0 )∥ 2 M2 (15) For M 1 , we have M 1 ≤ E3 O (t -1) 3 ρ √ m log m • ∥g(x t ; θ u t-1 ) -g(x t ; θ u 0 )∥ 2 ≤ E4 O L 4 (t -1) 3 ρ √ m log m 4/3 where E 3 is the application of Lemma E.8 and E 4 utilizes Theorem 5 in Allen-Zhu et al. ( 2019) with Lemma E.8. For M 2 , we have ∥g(x t ; Θ 0 )∥ 2 ∥θ u t-1 -θ u 0 -(Θ t -Θ 0 )∥ 2 ≤∥g(x t ; Θ 0 )∥ 2 ∥θ u t-1 -θ u 0 ∥ 2 + ∥Θ t -Θ 0 ∥ 2 ≤ E5 O(L) • O (t -1) 3 ρ √ m log m + β t where E 5 use Lemma E.10, E.8, and D.1. Combining Eq.( 16) and Eq.(C), we have I 3 ≤ O L 4 (t -1) 3 ρ √ m log m 4/3 + O L (t -1) 3 ρ √ m log m + O(Lβ t ). . For I 4 , we have I 4 =|⟨g(x t ; Θ 0 ), Θ t -Θ 0 ⟩ -⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩| ≤∥g(x t ; Θ t ) -g(x t ; Θ 0 )∥ 2 ∥Θ t -Θ 0 ∥ 2 ≤β t • ∥g(x t ; Θ t ) -g(x t ; Θ 0 )∥ 2 (19) where the first inequality is because of Cauchy-Schwarz inequality and the last inequality is by Lemma D.1. For I 5 , we have I 5 = |⟨g(x t ; Θ t ), Θ t -Θ 0 ⟩ + f (x t ; Θ 0 ) -f (x t ; Θ t )| ≤ (L + 1) 2 m log mβ 4/3 t where this inequality uses Lemma D.2 with Lemma D.1. Combing Eq.( 13), ( 18), ( 19), and (20) completes the proof. Lemma C.2. 
For any δ ∈ (0, 1), ρ ∈ (0, O( 1 L )], 0 < ϵ 1 ≤ ϵ 2 ≤ 1, λ > 0, suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Eq.( 6). Then, with probability at least 1 -δ over the random initialization, after t rounds, the error induced by meta-learner is upper bounded by: T t=1 E rt|xt [|f (x t ; Θ t ) -r t | | u t ] ≤ T t=1 O (∥g(x t ; Θ t ) -g(x t ; θ ut 0 )∥ 2 ) √ t + u∈N µ u t O L + 1 √ 2µ u t + 2 log(t/δ) µ u t . where the expectation is taken over r t conditioned on x t . Proof. T t=1 E rt|xt [|f (x t ; Θ t ) -r t ||u t ] = T t=1 E rt|xt [|f (x t ; Θ t ) -f (x t ; θ ut t-1 ) + f (x t ; θ ut t-1 ) -r t | | u t ] ≤ T t=1 |f (x t ; Θ t ) -f (x t ; θ ut t-1 )| I1 + T t=1 E rt|xt [|f (x t ; θ ut t-1 ) -r t | | u t ] I2 . For I 1 , applying Lemma C.1, with probability at least 1 -δ, for any ∥x t,j ∥ 2 = 1, we have I 1 ≤ T t=1 (β t • ∥g(x t ; Θ t ) -g(x t ; θ u 0 )∥ 2 + Z t ) E1 ≤ T t=1 O (∥g(x t ; Θ t ) -g(x t ; θ ut 0 )∥ 2 ) √ t where E 1 is the result of choice of m (m ≥ Ω(T 27 )) for β t and Z t . For I 2 , based on the Lemma E.1, with probability at least 1 -δ, for any ϵ 1 ∈ (0, 1], we have I 2 ≤ u∈N ϵ 1 µ u t + O L µ u t + (1 + ξ t ) 2µ u t log(t/δ) ≤ u∈N µ u t O L + 1 √ 2µ u t + 2 log(t/δ) µ u t . The proof is complete. D ANALYSIS FOR META-LEARNER Lemma D.1. Given any δ ∈ (0, 1), 0 < ϵ 1 ≤ ϵ 2 ≤ 1, λ > 0, ρ ∈ (0, O( 1 L )] , suppose m, η 1 , η 2 , J 1 , J 2 satisfy the conditions in Eq.( 6) and Θ 0 , θ u 0 are randomly initialized ,∀u ∈ N . Then, with probability at least 1 -δ, these hold for Algorithms 1-3: 1. Given any N ⊆ N , define L N (Θ t,i ) = 1 2 u∈N (x,r)∈T u t-1 (f (x; Θ t,i ) -r) 2 , where Θ t,i is returned by Algorithm 2 given N . Then, we have L N (Θ t,i ) ≤ ϵ 2 in J 2 rounds.

2. For any

j ∈ [J 2 ], ∥Θ (j) -Θ (0) ∥ 2 ≤ O(n 2 t 3 √ ϵ2 log 2 m)+O(t 2 log 2 m-tϵ2)ρ 1/2 λn O(ρ √ mϵ2) = β 1 . Proof. Define the sign matrix sign(θ [i] ) = 1 if θ [i] ≥ 0; -1 if θ [i] < 0 where θ [i] is the i-th element in θ. For the brevity, we use θ u t to denote θ u µ u t , For each u ∈ N , we have T u t-1 . Given a group N , then recall that L N = u∈N L θ u t + λ √ m u∈N ∥ θ u t ∥ 1 . Then, in round t + 1, for any j ∈ [J 2 ] we have Θ (j) -Θ (j-1) = η 2 • ▽ { θ u t } u∈N L N = η 2 • n∈N ▽ θ u t L + λ √ m u∈N sign( θ u t ) According to Theorem 4 in (Allen-Zhu et al., 2019) , given Θ (j) , Θ (j-1) , we have L N (Θ (j) ) ≤L N (Θ (j-1) ) -⟨▽ Θ (j-1) L N , Θ (j) -Θ (j-1) ⟩ + tL N (Θ (j-1) ) • w 1/3 L 2 m log m • O(∥Θ (j) -Θ (j-1) ∥ 2 ) + O(tL 2 m)∥Θ (j) -Θ (j-1) ∥ 2 2 ≤ E1 L N (Θ (j-1) ) -η 2 ∥ n∈N ▽ θ u t L + λ √ m u∈N sign( θ u t )∥ 2 ∥▽ Θ (j-1) L N ∥ 2 + + η 2 w 1/3 L 2 tm log m∥ n∈N ▽ θ u t L + λ √ m u∈N sign( θ u t )∥ 2 L N (Θ (j-1) ) + η 2 2 O(tL 2 m)∥ n∈N ▽ θ u t L + λ √ m u∈N sign( θ u t )∥ 2 2 (27) ⇒ L N (Θ (j) ) ≤ L N (Θ (j-1) ) -η 2 √ n u∈N ∥▽ θ u t L∥ 2 ∥▽ Θ (j-1) L N ∥ 2 + + η 2 w 1/3 L 2 tnm log m n∈N ∥▽ θ u t L∥ 2 L N (Θ (j-1) ) + η 2 2 O(tL 2 m)n n∈N ∥▽ θ u t L∥ 2 2 - η 2 λ √ m ∥▽ Θ (j-1) L N ∥ 2 + η 2 w 1/3 nL 2 t log mλ L N (Θ (j-1) ) + O(2η 2 2 tL 2 )λ 2 n 2 (28) ⇒ L N (Θ (j) ) ≤ E2 L N (Θ (j-1) ) -η 2 √ n u∈N ρm tµ u t L( θ u t )L N (Θ (j-1) )+ I1 +η 2 w 1/3 L 2 m tρn log m n∈N L( θ u t )L N (Θ (j-1) ) + η 2 2 t 2 L 2 m 2 n n∈N L( θ u t ) I1 - η 2 λ √ ρ t L N (Θ (j-1) ) + η 2 w 1/3 nL 2 t log mλ L N (Θ (j-1) ) + O(2η 2 2 tL 2 )λ 2 n 2

I2

(29) where E 1 is because of Cauchy-Schwarz inequality inequality, E 2 is due to Theorem 3 in (Allen-Zhu et al., 2019) , i.e., the gradient lower bound. Recall that η 2 = min Θ √ nρ t 4 L 2 m , Θ √ ρϵ 2 t 2 L 2 λn 2 , L N (Θ 0 ) ≤ O(t log 2 m) J 2 = max Θ t 5 (O(t log 2 m) -ϵ 2 )L 2 m √ nϵ 2 ρ , Θ t 3 L 2 λn 2 (O(t log 2 m -ϵ 2 )) ρϵ 2 . ( ) Under review as a conference paper at ICLR 2023 Before achieving L N (Θ (j) ) ≤ ϵ 2 , we have, for each u ∈ N , L( θ u t ) ≤ L N (Θ (j-1) ), for I 1 , we have I 1 ≤ -η 2 √ n u∈N ρm tµ u t L( θ u t )L N (Θ (j-1) )+ + η 2 w 1/3 L 2 m tρn log m n∈N L( θ u t )L N (Θ (j-1) ) + η 2 2 t 2 L 2 m 2 n n∈N L( θ u t )L N (Θ (j-1) ) ≤ - η 2 n √ nρm t 2 n∈N L( θ u t )L N (Θ (j-1) ) + η 2 w 1/3 L 2 m tρn log m + η 2 2 t 2 L 2 m 2 n n∈N L( θ u t )L N (Θ (j-1) ) ≤ E3 -Θ η 2 n √ nρm t 2 n∈N L( θ u t )L N (Θ (j-1) ) ≤ E4 -Θ η 2 n √ nρm t 2 n∈N L( θ u t ) (31) where E 3 is because of the choice of η 2 . As L N (Θ 0 ) ≤ O(t log 2 m), we have L N (Θ (j) ) ≤ ϵ 2 in J Θ rounds. For I 2 , we have I 2 ≤ E5 - η 2 λ √ ρ t √ ϵ 2 + η 2 w 1/3 nL 2 t log mλ L N (Θ (0) ) + O(2η 2 2 tL 2 )λ 2 n 2 ≤ E6 - η 2 λ √ ρ t √ ϵ 2 + η 2 w 1/3 nL 2 t log mλ O(t log 2 m) + O(2η 2 2 tL 2 )λ 2 n 2 ≤ - η 2 √ ρ t √ ϵ 2 + η 2 w 1/3 nL 2 t log m O(t log 2 m) + O(2η 2 2 tL 2 )λn 2 λ ≤ E7 -Θ( η 2 √ ρϵ 2 t )λ where E 5 is by L N (Θ (j-1) ) ≥ ϵ 2 and L N (Θ (j-1) ) ≤ L N (Θ (0) ), E 6 is according to Eq.( 30), and E 7 is because of the choice of η 2 . Combining above inequalities together, we have L N (Θ (j) ) ≤L N (Θ (j-1) ) -Θ η 2 n √ nρm t 2 n∈N L( θ u t ) -Θ( η 2 √ ρϵ 2 t )λ ≤L N (Θ (j-1) ) -Θ( η 2 √ ρϵ 2 t )λ Thus, because of the choice of J 2 , η 2 , we have L N (Θ (J2) ) ≤ L N (Θ (0) ) -J 2 • Θ( η 2 √ ρϵ 2 t )λ ≤ O(t log 2 m) -J 2 • Θ( η 2 √ ρϵ 2 t ) ≤ ϵ 2 . ( ) The proof of (1) is completed. According to Lemma E.8, For any j ∈ [J 1 ], L(θ u (j) ) ≤ (1 -Ω( ηρm dµ u t 2 ))L(θ u (j-1) ). 
Therefore, for any u ∈ [n], we have L( θ u t ) ≤ J1 j=0 L(θ u (j) ) ≤ O (µ u t ) 2 η 1 ρm • L(θ u (0) ) ≤ O (µ u t ) 2 η 1 ρm • O( µ u t log 2 m), where the last inequality is because of Lemma E.8 (3). Second, we have ∥Θ (J2) -Θ 0 ∥ 2 ≤ J2 j=1 ∥Θ (j) -Θ (j-1) ∥ 2 ≤ J2 j=1 η 2 ∥ n∈N ▽ θ u t L + λ √ m u∈N sign( θ u t )∥ 2 ≤ J2 j=1 η 2 ∥ u∈N ▽ θ u t L∥ F I3 + J 2 η 2 λn √ m For I 3 , we have J2 j=1 η 2 ∥ u∈N ▽ θ u t L∥ 2 ≤ J2 j=1 η 2 |N | u∈N ∥▽ θ u t L∥ 2 ≤ E8 J2 j=1 η 2 √ n u∈N ∥▽ θ u t L∥ 2 ≤ E9 O J2 j=1 (η 2 ) √ ntm u∈N L( θ u t ) ⇒ J2 j=1 η 2 ∥ u∈N ▽ θ u t L∥ 2 ≤ E10 O(η 2 ) √ ntm u∈N J2 j=1 L( θ u t ) ≤ E11 O(η 2 ) √ ntm • n • O (µ u t ) 2 η 1 ρm • O( µ u t log 2 m) ≤ O η 2 n 3/2 t 5/2 t log 2 m η 1 ρ √ m (38) where E 1 is because of |N | ≤ n, E 2 is due to Theorem 3 in (Allen-Zhu et al., 2019) , and E 3 is as the result of Eq.( 35). Combining Eq.( 36) and Eq.( 38), we have ∥Θ (J2) -Θ 0 ∥ 2 ≤ O η 2 n 3/2 t 3 log 2 m + J 2 η 2 η 1 ρλn η 1 ρ √ m ≤O η 2 n 3/2 t 3 log 2 m + O(t 2 log 2 m -tϵ 2 ))η 1 √ ρλn η 1 ρ √ mϵ 2 ≤ O(n 2 t 3 ϵ 2 log 2 m) + O(t 2 log 2 m -tϵ 2 )ρ 1/2 λn O(ρ √ mϵ 2 ) =β t . ( ) The proof is completed.

D.1 ANCILLARY LEMMAS

Lemma D.2 ( (Wang et al., 2020a) ). Suppose m satisfies the condition2 in Eq.( 6), if Ω(m -3/2 L -3/2 [log(T kL 2 /δ)] 3/2 ) ≤ ν ≤ O((L + 1) -6 √ m). then with probability at least 1 -δ, for all Θ, Θ ′ satisfying ∥Θ - Θ 0 ∥ 2 ≤ ν and ∥Θ ′ -Θ 0 ∥ 2 ≤ ν, x ∈ R d , ∥x∥ 2 = 1, we have |f (x; Θ) -f (x; Θ ′ ) -⟨▽ Θ f (x; Θ), Θ ′ -Θ⟩| ≤ O(ν 4/3 (L + 1) 2 m log m). Lemma D.3. With probability at least 1 -δ, set η 2 = Θ( ν √ 2tm ), for any Θ ′ ∈ R p satisfying ∥Θ ′ -Θ 0 ∥ 2 ≤ β 1 , such that t τ =1 |f (x τ ; Θ (j) -r τ | ≤ t τ =1 |f (x τ ; Θ ′ ) -r τ | + O 3L √ t √ Proof. Then, the proof is a direct application of Lemma 4.3 in (Cao and Gu, 2019 ) by setting the loss as L τ ( Θ τ ) = |f (x τ ; Θ τ ) -r τ |, R = β 1 √ m, ϵ = LR √ 2νt , and ν = R 2 . E ANALYSIS FOR USER-LEARNER Lemma E.1. For any δ ∈ (0, 1), 0 < ρ ≤ O( 1 L ), suppose 0 < ϵ 1 ≤ 1 and m, η 1 , J 1 satisfy the conditions in Eq.( 6). After T rounds, with probability 1 -δ over the random initialization, the cumulative error induced by the user-learners is upper bounded by 1 T T t=1 E rt|xt [|f (x t ; θ ut t-1 ) -r t | | T ut t-1 , u t ] ≤ √ n ϵ 1 T + O LR √ T + O(1 + ξ t ) 2 log(T /δ) T , where the expectation is taken over r u t conditioned on x u t and T u t is the historical data of u up to round t. Proof. Applying Lemma E.3 over all users, we have 1 T T t=1 E rt|xt [|f (x t ; θ ut t-1 ) -r t | | T ut t-1 , u t ] = 1 T u∈N (xτ ,rτ )∈T u t E rt|xt [|f (x τ ; θ u t-1 ) -r τ | | T u t-1 , u] ≤ 1 T u∈N ϵ 1 µ u T + O L µ u t + (1 + ξ t ) 2µ u t log(T /δ) where we applied the union bound to δ over all n users and so we get log(T /δ) because of u∈N µ u T = T . Then, given a user u, then, µ u T = T t=1 1{u t = u} where 1{u t = u} is the indicator function. Then, applying Hoeffding-Azuma inequality on the sequence µ u T , ∀u ∈ N , we have u∈N µ u T ≤ u∈N E[ µ u T ] + 2n log(1/δ) = √ nT + 2n log(1/δ). 
Then, by simplification, we have 1 T T t=1 E rt|xt [|f (x t ; θ ut t-1 ) -r t | | T ut t-1 , u t ] ≤ √ n ϵ 1 T + O L √ T + O(1 + ξ t ) 2 log(T /δ) T . The proof is complete. Lemma E.2. For any δ ∈ (0, 1), 0 < ρ ≤ O( 1 L ), suppose 0 < ϵ 1 ≤ 1 and m, η 1 , J 1 satisfy the conditions in Eq.( 6). In round t ∈ [T ], given u ∈ N , let x * t = arg max xt,i,i∈[k] h u (x t,i ) the Bayes-optimal arm for u and r * t is the corresponding reward. Then, with probability at least 1 -δ over the random initialization, after T rounds, with probability 1 -δ over the random initialization, the cumulative error induced by the user-learners is upper bounded by: 1 T T t=1 E r * t |x * t [|f (x * t ; θ ut, * t-1 ) -r * t | | T ut, * t-1 , u t ] ≤ √ n ϵ 1 T + O L √ T + O(1 + ξ t ) 2 log(T /δ) T . where the expectation is taken over r * τ conditioned on x * τ , T u, * t = {(x * τ , r * τ ) : u τ = u, τ ∈ [t] } are stored Bayes-optimal pairs up to round t for u, and θ ut, * t-1 are the parameters trained on T ut, * t-1 according to Algorithm 3 in round t -1. Proof. Based on Lemma E.4, we have 1 T T t=1 E r * t |x * t [|f (x * t ; θ ut, * t-1 ) -r * t | | T ut, * t-1 , u t ] = 1 T u∈N (x * τ ,r * τ )∈T u, * t E r * t |x * t [|f (x * τ ; θ u, * t-1 ) -r * τ | | T u, * t-1 , u] ≤ 1 T u∈N ϵ 1 µ u T + O L µ u t + (1 + ξ t ) 2µ u t log(T /δ) where we applied the union bound to δ over all n users and so we get log(T /δ) because of u∈N µ u T = T . Then, given a user u, then, µ u T = T t=1 1{u t = u} where 1{u t = u} is the indicator function. Then, applying Hoeffding-Azuma inequality on the sequence µ u T , ∀u ∈ N , we have u∈N µ u T ≤ u∈N E[ µ u T ] + 2n log(1/δ) = √ nT + 2n log(1/δ). Then, we have 1 T T t=1 E r * t |x * t [|f (x t ; θ ut, * t-1 ) -r * t | | T ut, * t-1 , u t ] ≤ √ n ϵ 1 T + O L √ T + O(1 + ξ t ) 2 log(T /δ) T . ( ) The proof is complete. Lemma E.3. For any δ ∈ (0, 1), 0 < ρ ≤ O( 1 L ), suppose 0 < ϵ 1 ≤ 1 and m, η 1 , J 1 satisfy the conditions in Eq.( 6). 
In a round τ where u ∈ N is serving user, let x τ be the arm selected by some fixed policy π τ and r τ is the corresponding received reward. Then, with probability at least 1 -δ over the randomness of initialization, after t ∈ [T ] rounds, the cumulative regret induced by u is upper bounded by: 1 µ u t (xτ ,rτ )∈T u t E rτ |xτ [|f (x τ ; θ u τ -1 ) -r τ | | π τ , u] ≤ ϵ 1 µ u t + O 3L √ 2µ u t + (1 + ξ t ) 2 log(µ u t /δ) µ u t . where the expectation is taken over r τ conditioned on x τ and T u t = {(x τ , r τ ) : u τ = u, τ ∈ [t]} is the historical data of u up to round t. Lemma E.4. For any δ ∈ (0, 1), 0 < ρ ≤ O( 1 L ), suppose 0 < ϵ 1 ≤ 1 and m, η 1 , J 1 satisfy the conditions in Eq.( 6). In a round τ where u ∈ N is the serving user, let x * τ be the arm selected according to Bayes-optimal policy π * : x * τ = arg max xτ,i,i∈[k] h u (x τ,i ), and r * τ is the corresponding reward. Then, with probability at least 1 -δ over the randomness of initialization, after t ∈ [T ] rounds, the cumulative regret induced by u with policy π * is upper bounded by: 1 µ u t (x * τ ,r * τ )∈T u, * t E r * τ |x * τ [|f (x * τ ; θ u, * τ -1 ) -r * τ | | π * , u] ≤ 2ϵ 1 µ u t + O 3L √ 2µ u t + (1 + ξ t ) 2 log(µ u t /δ) µ u t . where the expectation is taken over r * τ conditioned on x * τ , T u, * t = {(x * τ , r * τ ) : u τ = u, τ ∈ [t] } are stored Bayes-optimal pairs up to round t for u, and θ u, * τ -1 are the parameters trained on T u, * τ -1 according to Algorithm 3 in round τ -1. Proof. This proof is analogous to Lemma E.3. In a round τ where u is the serving user, we define V τ = E r * τ |x * τ [|f (x * τ ; θ u, * τ -1 ) -r * τ |] -|f (x * τ ; θ u, * τ -1 ) -r * τ |. where the expectation is taken over r * τ conditioned on x * τ . Then, we have E[V τ |F τ ] = E r * τ |x * τ [|f (x * τ ; θ u, * τ -1 ) -r τ, * |] -E[|f (x * τ ; θ u, * τ -1 ) -r * τ | | F τ ] = 0 Therefore, V 1 , . . . , V µ u t is the martingale difference sequence. 
Then, following the same procedure of Lemma E.3, we can derive 1 µ u t (x * τ ,r * τ )∈T u, * t E r * τ |x * τ [|f (x * τ ; θ u, * τ -1 ) -r * τ | | u] ≤ 2ϵ 1 µ u t + O 3L √ 2µ u t + (1 + ξ t ) 2 log(1/δ) µ u t . Based on Lemma E.8 (4), for any θ u, * τ , τ ∈ [t], we have ∥ θ u, * τ -θ u 0 ∥ 2 ≤ O (µ u t ) 3 ρ √ m log m . Thus, it holds that ∥θ u, * τ -θ u 0 ∥ 2 ≤ O (µ u t ) 3 ρ √ m log m . E.1 ANCILLARY LEMMAS Lemma E.5. Suppose m, η 1 , η 1 satisfy the conditions in Eq. ( 6). With probability at least 1 -δ, for any x with ∥x∥ 2 = 1 and t ∈ [T ], u ∈ N , it holds that |f (x; θ u t )| ≤ 2 + O t 4 nL log m ρ √ m + O t 5 nL 2 log 11/6 m ρm 1/6 = ξ t . Proof. This is an application of Lemma C.3 in (Ban et al., 2021b) . Let θ 0 be randomly initialized. Then applying Lemma E.9, for any ∥x∥ 2 = 1 and ∥ θ u t -θ 0 ∥ ≤ w, we have |f (x; θ u t )| ≤ |f (x; θ 0 )| I1 +|⟨▽ θ0 f (x i ; θ 0 ), θ u t -θ 0 ⟩| + O(L 2 m log(m))∥ θ u t -θ 0 ∥ 2 w 1/3 ≤ 2∥x∥ 2 I1 + ∥▽ θ0 f (x i ; θ 0 )∥ 2 ∥ θ u t -θ 0 ∥ 2 I2 +O(L 2 m log(m)) ∥ θ u t -θ 0 ∥ 2 w 1/3 I3 ≤ 2 + O(L) • O t 3 ρ √ m log m I2 + O L 2 m log(m) • O t 3 ρ √ m log m 4/3 I3 = 2 + O t 3 L log m ρ √ m + O t 4 L 2 log 11/6 m ρm 1/6 (50) where I 1 is an application of Lemma 7.3 in (Allen-Zhu et al., 2019), I 2 is by Lemma E.10 (1) and Lemma E.8 (4), and I 3 is due to Lemma E.8 (4). Lemma E.6. For any δ ∈ (0, 1), suppose m satisfy the conditions in Eq.( 6) and ν = Θ((µ u t ) 6 /ρ 2 ). Then, with probability at least 1 -δ, set η 1 = Θ( ν √ 2µ u t m ) for algorithm 1-3, for any θ satisfying ∥ θ -θ u 0 ∥ 2 ≤ O (µ u t ) 3 ρ √ m log m such that µ u t τ =1 |f (x τ ; θ u τ -1 ) -r τ | ≤ µ u t τ =1 |f (x τ ; θ) -r τ | + O 3L √ µ u t √ 2 Proof. This is a direct application of Lemma 4.3 in (Cao and Gu, 2019) L τ ( θ) + 3µ u t ϵ. Then, replacing ϵ completes the proof. Lemma E.7 (Lemma C.2 (Ban et al., 2021b) ). For any δ ∈ (0, 1), ρ ∈ (0, O( 1 L )), suppose the conditions in Theorem 4.2 are satisfied. 
Then, with probability at least 1 -δ, in each round t ∈ [T ], for any ∥x∥ 2 = 1, θ u, * t-1 , θ (2)∥▽ θ u t-1 f 1 (x; θ u t-1 )∥ 2 ≤ 1 + O tL 3 log 5/6 m ρ 1/3 m 1/6 O(L) . (54) Lemma E.8 (Theorem 1 in (Allen-Zhu et al., 2019) ). For any 0 < ϵ 1 ≤ 1, 0 < ρ ≤ O(1/L). Given a user u, the collected data {x τ , r u τ } µ u t τ =1 , suppose m, η 1 , J 1 satisfy the conditions in Eq.( 6). Define L (θ u ) = 1 2 (x,r)∈T u t (f (x; θ u ) -r) 2 . Then with probability at least 1 -δ, these hold that: 1. For any j ∈ [J], L(θ u (j) ) ≤ (1 -Ω( η1ρm µ u t 2 ))L(θ u (j-1) ) 2. L( θ u µ u t ) ≤ ϵ 1 in J 1 = poly(µ u t ,L) ρ 2 log(1/ϵ 1 ) rounds. 3. L(θ u 0 ) ≤ O(µ u t log 2 m). 4. For any j ∈ [J], ∥θ u (j) -θ u (0) ∥ 2 ≤ O (µ u t ) 3 ρ √ m log m . Lemma E.9 (Lemma 4.1, (Cao and Gu, 2019) ). Suppose O(m -3/2 L -3/2 [log(T nL 2 /δ)] 3/2 ) ≤ w ≤ O(L -6 [log m] -3/2 ). Then, with probability at least 1 -δ over randomness of θ 0 , for any t ∈ [T ], ∥x∥ 2 = 1, and θ, θ ′ satisfying ∥θ -θ 0 ∥ ≤ w and ∥θ ′ -θ 0 ∥ ≤ w , it holds uniformly that |f (x; θ) -f (x; θ ′ ) -⟨▽ θ ′ f (x; θ ′ ), θ -θ ′ ⟩| ≤ O(w 1/3 L 2 m log(m))∥θ -θ ′ ∥ 2 . Lemma E.10. For any δ ∈ (0, 1), suppose m, η 1 , J 1 satisfy the conditions in Eq.( 6) and θ 0 are randomly initialized. Then, with probability at least 1 -δ, for any ∥x∥ 2 = 1, these hold that 1. ∥▽ θ0 f (x; θ 0 )∥ 2 ≤ O(L),  ∥▽ W l f (x; θ 0 )∥ F ≤ ∥W L DW L-1 • • • DW l+1 ∥ F • ∥DW l+1 • • • x∥ F ≤ O( √ L) where the inequality is according to Lemma 7.2 in (Allen-Zhu et al., 2019) . Therefore, we have ∥▽ θ0 f (x; θ 0 )∥ 2 ≤ O(L).
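The $\sqrt{nT}$ factor used in Lemmas E.1 and E.2 above rests on $\sum_{u\in N}\sqrt{\mu^u_T}$ concentrating around $\sqrt{nT}$. Its deterministic Cauchy-Schwarz analogue, $\sum_u \sqrt{\mu_u} \le \sqrt{n\sum_u \mu_u}$, always holds, as a quick numeric sketch (ours, purely illustrative) confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 1000
# mu[u] = number of rounds in which user u is served; the mu_u sum to T.
mu = rng.multinomial(T, np.ones(n) / n)

lhs = np.sum(np.sqrt(mu))   # sum_u sqrt(mu_u)
rhs = np.sqrt(n * T)        # sqrt(n * sum_u mu_u), by Cauchy-Schwarz
```

Equality holds for the even split $\mu_u = T/n$ for all $u$, which is why the $\sqrt{nT}$ bound is tight up to the concentration term from Hoeffding-Azuma.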

F RELATIVE GROUP GUARANTEE

In this section, we provide a relative group guarantee with the expectation taken over all past selected arms. For u, u′ ∈ N with u ≠ u′, we define This indicates that for any u, u′ ∈ N satisfying |f(x_τ; θ^u_{t−1}) − f(x_τ; θ^{u′}_{t−1})| ≤ ((ν−1)/ν)γ, i.e., u, u′ ∈ N̂_u(x_τ), we have E_{x_τ∼T^u_t|x} E_{r_τ, r′_τ}[|r_τ − r′_τ| | x_τ] ≤ γ. This implies E_{x_τ∼T^u_t|x}[N̂_u(x_τ) ⊆ N_u(x_τ)]. For any u, u′ ∈ N_u(x_τ), we have [N̂_u(x_τ) = N_u(x_τ)] when ν ≥ 2 and t ≥ T. The proof is completed. Corollary F.2. For any δ ∈ (0, 1), ρ ∈ (0, O(1/L)], suppose 0 < ϵ_1 ≤ 1 and m, η_1, J_1 satisfy the conditions in Eq. (6). In each round t ∈ [T], given u ∈ N, let (x_{t,j}, r_{t,j}) be the pair produced by some



https://www.yelp.com/dataset




Figure 2: Regret comparison on ML datasets. Meta-Ban outperforms all baselines. Specifically, compared to the best baseline, Meta-Ban improves by 41.6% on Fashion-Mnist, 28.4% on Mnist, 47.1% on Mushroom, and 61.5% on Magictelescope. Results. Figures 1 and 2 show the regret comparison on ML datasets, in which Meta-Ban outperforms all baselines. Each class can be thought of as a user in these datasets. As the rewards are non-linear in the arms on these datasets, the conventional linear clustering-of-bandits methods (CLUB, COFIBA, SCLUB, LOCB) perform poorly. Thanks to the representation power of neural networks, NeuUCB-ONE obtains better performance. However, it treats all users as one group, neglecting the disparity among groups. In contrast, NeuUCB-IND treats each user individually, not taking collaborative knowledge among users into account. Meta-Ban significantly outperforms all the baselines because it can exploit the common knowledge of the correct group of classes, where samples from these classes have non-trivial correlations, and train the parameters on the previous group to adapt rapidly to new tasks, which existing works cannot do. Figure 3 reports the regret comparison on the recommendation datasets, where Meta-Ban still outperforms all baselines. Since these two datasets contain considerable inherent noise, all algorithms show linear growth of regret. As rewards are almost linear in the arms on these two datasets, the conventional clustering-of-bandits methods (CLUB, COFIBA, SCLUB, LOCB) achieve comparable performance, but they are still outperformed by Meta-Ban, because a single vector cannot accurately represent a user's behavior. Similarly, because Meta-Ban can discover and leverage group information automatically, it obtains the best performance, surpassing NeuUCB-ONE and NeuUCB-IND. Furthermore, a hyperparameter sensitivity study is in Appendix A.3.

Clustering of bandits. CLUB (Gentile et al., 2014) first studies collaborative effects among users in contextual bandits, where each user hosts an unknown vector representing the user's behavior under a linear reward function. CLUB formulates user similarity on an evolving graph and selects an arm by leveraging the clustered groups. Then, Li et al. (2016); Gentile et al. (2017) propose to cluster users based on specific contents and select arms by leveraging the aggregated information of the conditioned groups. Li et al. (2019) improve the clustering procedure by allowing groups to split and merge. Ban and He (2021) use seed-based local clustering to find overlapping groups, in contrast to global clustering on graphs. Korda et al. (2016); Yang et al. (2020); Wu et al. (2021) also study clustering of bandits under various settings in recommender systems. However, all of these works rely on the linear reward assumption, which may fail in many real-world applications.

Neural bandits. Allesiardo et al. (2014) use a neural network to learn each action and then select an arm via a committee of networks with an ϵ-greedy strategy. Lipton et al. (2018); Riquelme et al. (2018) adapt Thompson Sampling to the last layer of deep neural networks to select an action.

Figure 4: Sensitivity study for ν on MovieLens Dataset.

Figure 5: Sensitivity study for α on Mnist Dataset.

by setting the loss as $\mathcal{L}_\tau(\theta^u_{\tau-1}) = |f(x_\tau; \theta^u_{\tau-1}) - r_\tau|$, and $\nu = \nu' R^2$, where $\nu'$ is some small enough absolute constant. Then, for any $\theta$ satisfying $\|\theta - \theta^u_0\|_2 \le \mathcal{O}\big( (\mu^u_t)^3 / (\rho \sqrt{m} \log m) \big)$, there exists a small enough absolute constant $\nu'$ such that
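For concreteness, the (sub)gradient of this absolute loss used in the SGD updates is the standard one (a textbook fact, not a statement from the paper):

```latex
\nabla_{\theta}\,\mathcal{L}_\tau(\theta)
= \operatorname{sign}\!\big(f(x_\tau;\theta) - r_\tau\big)\,\nabla_{\theta} f(x_\tau;\theta)
\qquad \text{wherever } f(x_\tau;\theta) \neq r_\tau .
```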

2. $|f(x; \theta_0)| \le 2$.

Proof. For (2), based on Lemma 7.1 in (Allen-Zhu et al., 2019), we have $|f(x; \theta_0)| \le 2$. Denote by $\mathbf{D}$ the ReLU function. For any $l \in [L]$,

3.1 (Time Complexity). Recall that n is the number of users. It takes O(n) to find the group. Given the detected group N̂_u, let b be the batch size of SGD and J₂ be the number of iterations for the updates of the meta-learner. Thus, it takes O(|N̂_u| b J₂) to update the meta-learner. Thanks to the fast adaptation ability of the meta-learner, J₂ is typically a small number; b is controlled by the practitioner, and |N̂_u| is upper bounded by n. Therefore, the test-time complexity is O(n) + O(|N̂_u| b J₂). In a large recommender system, despite the large number of users, given a serving user u, the computational cost of Meta-Ban is dominated by the inferred relative group N̂_u, i.e., O(|N̂_u| b J₂). Inferring N̂_u is efficient, because it takes O(n) and only requires computing the outputs of the neural networks. Therefore, as long as we can control the size of N̂_u, Meta-Ban can work properly. The first solution is to set the hyperparameter γ to a small value, so that |N̂_u| is usually small. Second, we confine the size of |N̂
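The two cost terms can be made concrete with a small sketch. Everything below (function names, the linear model standing in for the neural network) is our illustrative assumption, not Meta-Ban's actual implementation:

```python
import numpy as np

def infer_group(u, users, scores, gamma):
    """Sketch of O(n) group detection: compare the per-user network outputs
    for the serving user u against the gap threshold gamma."""
    return [v for v in users if abs(scores[v] - scores[u]) <= gamma]

def meta_update(theta, group, history, batch_size, J2, lr=0.01):
    """Sketch of the meta-learner update: J2 rounds of SGD over mini-batches
    drawn from each group member's history -> O(|N_u| * b * J2) work."""
    rng = np.random.default_rng(0)
    for _ in range(J2):
        for v in group:  # one mini-batch per group member
            X, r = history[v]
            idx = rng.choice(len(r), size=min(batch_size, len(r)), replace=False)
            # Linear model standing in for the neural network f(x; theta).
            grad = X[idx].T @ (X[idx] @ theta - r[idx]) / len(idx)
            theta = theta - lr * grad
    return theta
```

The per-call cost is visibly O(n) for `infer_group` plus O(|group| · batch_size · J2) gradient work for `meta_update`, matching the remark above.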

The cumulative regret of 10000 rounds on MovieLens with different input dimensions.

A.6 ABLATION STUDY FOR NETWORK LAYERS

We run the experiments on the MovieLens and Yelp datasets with different numbers of neural network layers and report the results as follows. Meta-Ban achieves the best performance in most cases. In this paper, we propose a generic framework to combine meta-learning and bandits with neural network approximation. Since the UCB in Meta-Ban only depends on the gradient, the neural network can easily be replaced by other structures.
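A minimal sketch of what "the UCB depends only on the gradient" can look like: the arm score is the network's point estimate plus an exploration bonus built from the gradient, so swapping the network only changes where the estimate and gradient come from. This is our simplified illustration; Meta-Ban's actual bonus is the UCB defined in the paper:

```python
import numpy as np

def gradient_ucb_score(f_value, grad, alpha):
    """Point estimate plus an exploration bonus proportional to the
    gradient norm; the network only supplies f_value and grad."""
    return f_value + alpha * np.linalg.norm(grad)

def select_arm(estimates, grads, alpha):
    """Pick the arm with the largest gradient-based UCB score."""
    scores = [gradient_ucb_score(f, g, alpha) for f, g in zip(estimates, grads)]
    return int(np.argmax(scores))
```

With a larger α, arms whose gradients have large norm (high model uncertainty, in this heuristic) are favored, trading exploitation for exploration.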

The cumulative regret of 10000 rounds on MovieLens with different numbers of layers.

The cumulative regret of 10000 rounds on Yelp with different numbers of layers.

Theorem B.1 (Theorem 4.2 restated). Given the number of rounds T, assume that each user is uniformly served and set T

N̂_u(x_τ) is the detected group and N_u(x_τ) is the ground-truth group. Then, we provide the following lemma.

Lemma F.1 (Lemma 4.6 restated). Assume the groups in N satisfy the γ-gap (Definition 2.2) and the conditions of Theorem 4.2 are satisfied. For any δ ∈ (0, 1), ν > 1, with probability at least 1 − δ over the random initialization, there exist constants c₁, c₂ such that when $t \ge 64 n \nu^2 (1+\xi_t)^2 \log\big(32\nu^2(1+\xi_t)^2\big)$, it holds that N̂_u(x_τ) = N_u(x_τ), if ν ≥ 2, where x_τ is uniformly drawn from T^u_t|x, and T^u_t|x = {x_τ : u_t = u ∧ τ ∈ [t]} is the set of all historical selected arms when serving u up to round t.

Recall the Bernoulli random variables x₁, x₂, ..., x_t, where for τ ∈ [t], x_τ = 1 with probability 1/n and x_τ = 0 with probability 1 − 1/n. Then, applying the Chernoff bound to μ^u_t, with probability at least 1 − δ, for each u ∈ N, we have
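The Chernoff step can be made explicit. The following is the standard multiplicative Chernoff lower tail (generic constants, not the paper's c₁, c₂): with $\mu^u_t = \sum_{\tau=1}^{t} x_\tau$ and $\mathbb{E}[\mu^u_t] = t/n$,

```latex
\Pr\!\left[\mu^u_t \le (1-c)\,\frac{t}{n}\right]
\le \exp\!\left(-\frac{c^2 t}{2n}\right), \qquad c \in (0,1).
```

Setting the right-hand side to δ/n and taking a union bound over the n users yields, with probability at least $1-\delta$, $\mu^u_t \ge t/n - \sqrt{(2t/n)\log(n/\delta)}$ simultaneously for all u ∈ N.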


Proof. According to Lemma E.5, with probability at least 1 − δ, given any ∥x∥₂ = 1, r ≤ 1, for any round τ in which u is the serving user, we have |f(x; θ^u_{τ−1}) − r| ≤ ξ_t + 1. Here, we apply the union bound of δ over all μ^u_T rounds, so that this bound holds for every round of u. Then, in a round τ where u is the serving user, let x_τ be the arm selected by some fixed policy π_τ and r_τ the corresponding reward. Then, we define the following quantity, where the expectation is taken over r_τ conditioned on x_τ. This yields a bound in which F^u_τ denotes the σ-algebra generated by T^u_{τ−1}. Thus, we have the following form. For I₁, for any θ satisfying the distance condition, I₂ holds because of Lemma E.6, and I₃ is a direct application of Lemma E.8 (2). Combining Eq. (46) and Eq. (47), we obtain Eq. (48). Then, applying the union bound over δ for any i, and based on Lemma E.8 (4), we apply the union bound of δ over all μ^u_T rounds. The proof is completed.

Proof. Given two users u, u′ ∈ N and an arm x_τ, let r_τ be the reward u generates on x_τ and r′_τ the reward u′ generates on x_τ. Then, in round t ∈ [T], we have Eq. (57), where the expectation is taken over r_τ, r′_τ conditioned on x_τ. According to Lemma E.3 and Corollary F.2, respectively, we obtain the corresponding bounds for each u ∈ N. Due to the setting of Algorithm 1, |f(x_τ; θ^u_{t−1}) − f(x_τ; θ^{u′}_{t−1})| ≤ ((ν−1)/ν) γ for any u, u′ ∈ N̂_{u_t}(x_τ), given x_τ ∈ T^u_t|x. Therefore, the claimed bound follows. Next, we need to lower bound t as follows. By simple calculations, we obtain the stated inequality. Then, based on Lemma 8.1 in (Ban and He, 2021), we have

policy π_j. Then, with probability at least 1 − δ over the random initialization, for the user-learner θ^u_{t−1}, the stated bound holds, where T^u_t|π_j = {(r_{τ,j}, x_{τ,j}) : u_t = u, τ ∈ [t]} is the historical data collected under policy π_j in the rounds where u is the serving user.

Proof. By the application of Lemma E.4, there exists θ^{u,j}_{t−1} satisfying the required distance bound, where I₁ is an application of Lemma E.4 and I₂ holds because of Lemma E.7. The proof is complete.
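For reference, the martingale concentration tool used with filtrations such as F^u_τ in proofs of this shape is typically the Azuma–Hoeffding inequality (standard statement, not taken from the paper):

```latex
\text{If } \{M_\tau\}_{\tau=0}^{t} \text{ is a martingale adapted to } \{\mathcal{F}^u_\tau\}
\text{ with } |M_\tau - M_{\tau-1}| \le c_\tau \text{ a.s., then for any } \epsilon > 0,
\quad
\Pr\big[\,|M_t - M_0| \ge \epsilon\,\big]
\le 2\exp\!\left(-\frac{\epsilon^2}{2\sum_{\tau=1}^{t} c_\tau^2}\right).
```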

