NEURAL COLLABORATIVE FILTERING BANDITS VIA META LEARNING

Abstract

Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in personalized recommendation. In fact, collaborative effects among users carry significant potential to improve recommendation. In this paper, we introduce and study the problem of 'Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and groups are formed dynamically given different specific contents. To solve this problem, we propose a meta-learning based bandit algorithm, Meta-Ban (meta-bandits), where a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with an informative UCB-based exploration strategy. Furthermore, we prove that Meta-Ban achieves a regret bound of O(√(nT log T)), which is sharper than those of state-of-the-art related works. In the end, we conduct extensive experiments showing that Meta-Ban outperforms six strong baselines.

1. INTRODUCTION

The contextual multi-armed bandit has been extensively studied in machine learning to resolve the exploitation-exploration dilemma in sequential decision making, with wide applications in personalized recommendation (Li et al., 2010), online advertising (Wu et al., 2016), etc. Recommender systems play an indispensable role in many online businesses, such as e-commerce platforms and online streaming services. It is well known that user collaborative effects are strongly associated with user preferences. Thus, discovering and leveraging collaborative information in recommender systems has been studied for decades. In relatively static environments, e.g., a movie recommendation platform where catalogs are known and accumulated ratings for items are provided, classic collaborative filtering methods can be easily deployed (e.g., matrix/tensor factorization (Su and Khoshgoftaar, 2009)). However, such methods can hardly adapt to more dynamic settings, such as news or short-video recommendation, due to: (1) the lack of cumulative interactions for new users or items; (2) the difficulty of balancing the exploitation of current user-item preference knowledge with the exploration of new potential matches (e.g., presenting new items to the users). To address this problem, a line of work on clustering of bandits (collaborative filtering bandits) (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021) has been proposed to incorporate collaborative effects among users, which are largely neglected by conventional bandit algorithms (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Valko et al., 2013; Ban and He, 2020). These works use graph-based methods to adaptively cluster users and explicitly or implicitly utilize the collaborative effects on the user side while selecting an arm.
However, this line of work has a significant limitation: it builds on the linear bandit framework (Abbasi-Yadkori et al., 2011), and user groups are represented by simple linear combinations of individual user parameters. The linear reward assumption and the linear representation of groups may not hold in real-world applications (Valko et al., 2013). To learn non-linear reward functions, neural bandits (Collier and Llorens, 2018; Zhou et al., 2020; Zhang et al., 2021; Kassraie and Krause, 2022) have attracted much attention, where a neural network is assigned to learn the reward function along with an exploration strategy (e.g., Upper Confidence Bound (UCB) or Thompson Sampling (TS)). However, this class of works does not incorporate any collaborative effects among users, overlooking their crucial potential for improving recommendation.

In this paper, to overcome the above challenges, we introduce the problem of Neural Collaborative Filtering Bandits (NCFB), built on either linear or non-linear reward assumptions while introducing relative groups. Groups are formed by users sharing similar interests/preferences/behaviors. However, such groups are usually not static across specific contents (Li et al., 2016). For example, two users may both like "country music" but hold different opinions on "rock music". "Relative groups" are introduced in NCFB to formulate groups given a specific content, which is more practical in real problems.

To solve NCFB, we propose a meta-learning based bandit algorithm, Meta-Ban (Meta-Bandits), distinct from existing related works (i.e., graph-based clustering of linear bandits (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021)). Inspired by recent advances in meta-learning (Finn et al., 2017; Yao et al., 2019), Meta-Ban assigns a meta-learner to represent and rapidly adapt to dynamic groups, which allows a non-linear representation of collaborative effects, and assigns a user learner to each user to discover the underlying relative groups. We use neural networks to formulate both the meta-learner and the user learners, in order to learn linear or non-linear reward functions. To solve the exploitation-exploration dilemma in bandits, Meta-Ban adopts an informative UCB-type exploration. Finally, we provide rigorous regret analysis and empirical evaluation for Meta-Ban. To the best of our knowledge, this is the first work incorporating collaborative effects in neural bandits. The contributions of this paper can be summarized as follows:

(1) Problem. We introduce the problem of Neural Collaborative Filtering Bandits (NCFB), to incorporate collaborative effects among users with either linear or non-linear reward assumptions.

(2) Algorithm. We propose a meta-learning based bandit algorithm for NCFB, Meta-Ban, where a meta-learner is introduced to represent and rapidly adapt to dynamic groups, along with a new informative UCB-type exploration that utilizes both meta-side and user-side information. Meta-Ban allows a non-linear representation of relative groups based on user learners.

(3) Theoretical analysis. Under the standard assumptions of over-parameterized neural networks, we prove that Meta-Ban achieves a regret upper bound of O(√(nT log T)), where n is the number of users and T is the number of rounds. Our bound is sharper than those of existing related works. Moreover, we provide a correctness guarantee for the groups detected by Meta-Ban.

(4) Empirical performance. We evaluate Meta-Ban on 10 real-world datasets and show that it significantly outperforms 6 strong baselines.

After introducing the problem definition in Section 2, we present the proposed Meta-Ban in Section 3, together with theoretical analysis in Section 4. In the end, we show the experiments in Section 5 and conclude the paper in Section 6. Further discussion of related work is placed in Appendix Section A.1.
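The regret bound stated in contribution (3) refers to cumulative regret over T rounds. As a point of reference, a standard formulation of this quantity (a typical definition in the contextual bandit literature; the paper's exact formulation appears in the later sections) is:

\[
R_T \;=\; \sum_{t=1}^{T} \mathbb{E}\left[\, r_{t,i^\ast} - r_{t,i_t} \,\middle|\, u_t \right],
\]

where i^\ast is the arm with the highest expected reward for the serving user u_t in round t and i_t is the arm actually selected; the claim is then R_T = O(√(nT log T)).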

2. NEURAL COLLABORATIVE FILTERING BANDITS

In this section, we introduce the problem of Neural Collaborative Filtering Bandits, motivated by generic recommendation scenarios.

Suppose there are n users, N = {1, . . . , n}, to serve on a platform. In the t-th round, the platform receives a user u_t ∈ N and prepares the corresponding k arms (items) X_t = {x_{t,1}, x_{t,2}, . . . , x_{t,k}}, in which each arm is represented by its d-dimensional feature vector x_{t,i} ∈ R^d, ∀i ∈ {1, . . . , k}. Then, as in the conventional bandit problem, the platform selects an arm x_{t,i} ∈ X_t and recommends it to the user u_t. In response to this action, u_t produces a corresponding reward (feedback) r_{t,i}. We use r_{t,i} | u_t to denote the reward produced by u_t given x_{t,i}, because different users may generate different rewards for the same arm.

Group behavior (collaborative effects) exists among users and has been exploited in recommender systems. In fact, group behavior is item-varying, i.e., users who have the same preference for a certain item may have different opinions on another item (Gentile et al., 2017; Li et al., 2016). Therefore, we define a relative group as a set of users with the same opinion on a certain item.
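The interaction protocol above can be sketched in a few lines of code. The following is a minimal simulation, not the paper's algorithm: the function name, the random arm-selection policy, and the toy linear per-user reward model are all illustrative assumptions, standing in for the (possibly non-linear) user-dependent reward r_{t,i} | u_t and for a real bandit policy such as Meta-Ban.

```python
import numpy as np

def simulate_ncfb(n=5, k=10, d=8, T=100, seed=0):
    """Run T rounds of the NCFB interaction protocol with a toy reward model.

    n: number of users, k: arms per round, d: arm feature dimension.
    All modeling choices here are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n, d))      # hidden per-user preference vectors (toy)
    total_reward = 0.0
    for t in range(T):
        u_t = rng.integers(n)            # platform receives a user u_t in N
        X_t = rng.normal(size=(k, d))    # k arms, each a d-dimensional feature vector
        # Toy policy: pick an arm uniformly at random. A bandit algorithm
        # would instead score each arm (estimated reward + exploration bonus).
        i = rng.integers(k)
        r = X_t[i] @ theta[u_t]          # user-dependent reward r_{t,i} | u_t
        total_reward += r
    return float(total_reward)

print(simulate_ncfb())
```

Replacing the uniform selection with a learned score plus a UCB-style bonus turns this loop into the setting an NCFB algorithm operates in.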

