NEURAL COLLABORATIVE FILTERING BANDITS VIA META LEARNING

Abstract

Contextual multi-armed bandits provide powerful tools to solve the exploitation-exploration dilemma in decision making, with direct applications in personalized recommendation. In fact, collaborative effects among users carry significant potential to improve recommendation. In this paper, we introduce and study the problem of 'Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and user groups are formed dynamically depending on the specific content presented. To solve this problem, we propose a meta-learning based bandit algorithm, Meta-Ban (meta-bandits), in which a meta-learner is designed to represent and rapidly adapt to dynamic groups, along with an informative UCB-based exploration strategy. Furthermore, we show that Meta-Ban achieves a regret bound of O(√(nT log T)), which is sharper than those of state-of-the-art related works. In the end, we conduct extensive experiments showing that Meta-Ban outperforms six strong baselines.

1. INTRODUCTION

The contextual multi-armed bandit has been extensively studied in machine learning to resolve the exploitation-exploration dilemma in sequential decision making, with wide applications in personalized recommendation (Li et al., 2010), online advertising (Wu et al., 2016), etc. Recommender systems play an indispensable role in many online businesses, such as e-commerce platforms and online streaming services. It is well known that user collaborative effects are strongly associated with user preference. Thus, discovering and leveraging collaborative information in recommender systems has been studied for decades. In relatively static environments, e.g., a movie recommendation platform where catalogs are known and accumulated ratings for items are provided, classic collaborative filtering methods can be easily deployed (e.g., matrix/tensor factorization (Su and Khoshgoftaar, 2009)). However, such methods can hardly adapt to more dynamic settings, such as news or short-video recommendation, due to: (1) the lack of cumulative interactions for new users or items; (2) the difficulty of balancing the exploitation of current user-item preference knowledge and the exploration of new potential matches (e.g., presenting new items to the users). To address this problem, a line of works, clustering of bandits (collaborative filtering bandits) (Gentile et al., 2014; Li et al., 2016; Gentile et al., 2017; Li et al., 2019; Ban and He, 2021), has been proposed to incorporate collaborative effects among users, which are largely neglected by conventional bandit algorithms (Dani et al., 2008; Abbasi-Yadkori et al., 2011; Valko et al., 2013; Ban and He, 2020). These works use graph-based methods to adaptively cluster users and explicitly or implicitly utilize the collaborative effects on the user side while selecting an arm.
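To make the clustering-of-bandits idea concrete, the following is a minimal, self-contained sketch in the spirit of this line of work (not any specific published algorithm): each user keeps ridge-regression statistics, users with close parameter estimates are grouped, and the group's aggregated statistics drive a UCB-style arm choice. All names, the distance threshold, and the simulation setup are hypothetical simplifications for illustration.

```python
# Hypothetical sketch of graph-based clustering of bandits: users with
# similar estimated reward parameters are pooled before arm selection.
import numpy as np

rng = np.random.default_rng(0)
d, n_users = 5, 6

# Two latent user groups, each sharing an unknown linear reward vector.
theta_groups = [rng.normal(size=d), rng.normal(size=d)]
user_group = [0, 0, 0, 1, 1, 1]

# Per-user ridge statistics: A = I + sum x x^T, b = sum r x.
A = [np.eye(d) for _ in range(n_users)]
b = [np.zeros(d) for _ in range(n_users)]

def estimate(u):
    """Ridge-regression estimate of user u's reward parameter."""
    return np.linalg.solve(A[u], b[u])

for t in range(2000):
    u = rng.integers(n_users)            # a user arrives
    arms = rng.normal(size=(10, d))      # candidate arm contexts
    # Infer u's group: users whose estimates lie within a (hypothetical)
    # threshold of u's estimate -- the edge-deletion idea in miniature.
    theta_u = estimate(u)
    cluster = [v for v in range(n_users)
               if np.linalg.norm(estimate(v) - theta_u) < 0.5]
    # Aggregate statistics over the inferred cluster (subtract the
    # duplicated identity regularizers), then pick the arm maximizing
    # estimated reward plus a UCB bonus sqrt(x^T A_c^{-1} x).
    A_c = sum(A[v] for v in cluster) - (len(cluster) - 1) * np.eye(d)
    b_c = sum(b[v] for v in cluster)
    theta_c = np.linalg.solve(A_c, b_c)
    bonus = np.sqrt(np.einsum('id,dk,ik->i', arms, np.linalg.inv(A_c), arms))
    x = arms[int(np.argmax(arms @ theta_c + 0.5 * bonus))]
    r = x @ theta_groups[user_group[u]] + 0.1 * rng.normal()
    A[u] += np.outer(x, x)               # only the arriving user's
    b[u] += r * x                        # statistics are updated
```

After enough rounds, each user's estimate drifts toward its latent group vector, so the inferred clusters recover the true groups and pooled statistics sharpen the reward estimate, which is precisely the collaborative effect these methods exploit.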
However, this line of work has a significant limitation: it builds on the linear bandit framework (Abbasi-Yadkori et al., 2011), and user groups are represented by simple linear combinations of individual user parameters. The linear reward assumption and the linear representation of groups may not hold in real-world applications (Valko et al., 2013). To learn non-linear reward functions, neural bandits (Collier and Llorens, 2018; Zhou et al., 2020; Zhang et al., 2021; Kassraie and Krause, 2022) have attracted much attention, where a neural network is assigned to learn the reward function along with an exploration strategy (e.g., Upper Confidence Bound (UCB) or Thompson Sampling (TS)). However, this class of works does not incorporate any collaborative effects among users, overlooking their crucial potential for improving recommendation.
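To illustrate the neural-bandit recipe described above, here is a heavily simplified, hypothetical sketch in the NeuralUCB spirit: a tiny two-layer network estimates the (non-linear) reward, and the exploration bonus is a gradient-based confidence width. The network size, learning rate, bonus scale, and reward function are all illustrative assumptions, not the algorithm of any cited paper.

```python
# Hypothetical NeuralUCB-style sketch: a small network learns a
# non-linear reward; exploration uses a gradient-based UCB bonus.
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 16                              # context dim, hidden width

W1 = rng.normal(scale=1 / np.sqrt(d), size=(m, d))
w2 = rng.normal(scale=1 / np.sqrt(m), size=m)

def forward(x):
    h = np.maximum(W1 @ x, 0.0)           # ReLU hidden layer
    return w2 @ h, h

def grad(x):
    """Gradient of the network output w.r.t. all parameters, flattened."""
    h = np.maximum(W1 @ x, 0.0)
    dW1 = np.outer(w2 * (h > 0), x)
    return np.concatenate([dW1.ravel(), h])

p = d * m + m
Z = np.eye(p)                             # gram matrix of played gradients
theta_true = rng.normal(size=d)
f_true = lambda x: np.cos(x @ theta_true)  # non-linear ground-truth reward

lr = 0.05
for t in range(300):
    arms = rng.normal(size=(8, d))
    scores = []
    for x in arms:
        mu, _ = forward(x)
        g = grad(x)
        # UCB score: predicted reward + confidence width sqrt(g^T Z^{-1} g).
        scores.append(mu + 0.1 * np.sqrt(g @ np.linalg.solve(Z, g)))
    x = arms[int(np.argmax(scores))]
    r = f_true(x) + 0.05 * rng.normal()
    # One SGD step on the squared loss, then update the gram matrix.
    mu, h = forward(x)
    g = grad(x)
    err = mu - r
    W1 -= lr * err * np.outer(w2 * (h > 0), x)
    w2 -= lr * err * h
    Z += np.outer(g, g)
```

Note that every statistic here is per-round and global to a single reward function: nothing in the loop shares information across users, which is exactly the gap, relative to clustering of bandits, that motivates this paper.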

