BANDIT LEARNING IN MANY-TO-ONE MATCHING MARKETS WITH UNIQUENESS CONDITIONS

Abstract

An emerging line of research addresses one-to-one matching markets with bandits, where the preferences of one side are unknown and matching must be performed while learning these preferences through multiple rounds of interaction. However, in many real-world applications, such as online recruitment platforms for short-term workers, one side of the market can select more than one participant from the other side, which motivates the study of the many-to-one matching problem. Moreover, the existence of a unique stable matching is crucial to the competitive equilibrium of the market. In this paper, we first introduce a new, more general α-condition that guarantees the uniqueness of the stable matching in many-to-one matching problems; it generalizes established uniqueness conditions such as SPC and Serial Dictatorship, and recovers the known α-condition when the problem reduces to one-to-one matching. Under this new condition, we design an MO-UCB-D4 algorithm with an O(NK log(T)/∆²) regret bound, where T is the time horizon, N is the number of agents, K is the number of arms, and ∆ is the minimum reward gap. Extensive experiments show that our algorithm achieves uniformly good performance under different uniqueness conditions.
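To make the Serial Dictatorship uniqueness condition mentioned above concrete, the following is a minimal sketch (not the paper's algorithm): agents pick in a fixed priority order, each taking its most-preferred arm with remaining capacity, so the resulting many-to-one matching is unique. Agent and arm names are hypothetical.

```python
def serial_dictatorship(priority, prefs, capacities):
    """Agents choose in a fixed priority order; each takes its favorite
    arm that still has capacity. The outcome is the unique stable matching
    under the Serial Dictatorship condition."""
    remaining = dict(capacities)  # arm -> remaining capacity
    match = {}
    for agent in priority:
        for arm in prefs[agent]:
            if remaining.get(arm, 0) > 0:
                match[agent] = arm
                remaining[arm] -= 1
                break
    return match

# Hypothetical instance: three agents, arm "x" can take two agents, arm "y" one.
priority = ["a1", "a2", "a3"]
prefs = {"a1": ["x", "y"], "a2": ["x", "y"], "a3": ["x", "y"]}
capacities = {"x": 2, "y": 1}
match = serial_dictatorship(priority, prefs, capacities)
# a1 and a2 fill arm x's capacity, so a3 falls through to arm y.
```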

1. INTRODUCTION

The data-driven matching market faces the twin problems of learning customer preferences and matching the demand side with the supply side so as to maximize the benefits of both. Online platforms, such as Lyft, Thumbtack, and TaskRabbit, make matching decisions for customers and service providers on the basis of their diversified needs. This setting is abstracted as a matching market with an agent side and an arm side, where each side has a preference profile over the opposite side; participants choose from the other side according to their preferences and form a matching. Specific examples include pool riding in ride-share systems, which matches a driver to multiple riders, and slate ranking in recommender systems, where a user is matched to various content items at a single request Ie et al. (2019). The stability of the matching result is a key property of the market Roth & Sotomayor (1992); Abizada (2016); Park (2017).

This work takes online short-term recruitment as its main example, combining the traditional matching problem Bade (2020); Bogomolnaia & Moulin (2001); Roth & Sotomayor (1992) with the online system Gunn et al. (2022); Malgonde et al. (2020); Johari et al. (2021). Companies with short-term needs accommodate workers who are voluntarily looking for flexible probation periods. The workers' preferences may be unknown in advance, so matching while learning the preferences is necessary. The multi-armed bandit (MAB) framework Thompson (1933); Garivier et al. (2016); Auer et al. (2002) is an important tool for N independent agents in a matching market to simultaneously and adaptively select arms based on the rewards received at each round. The upper confidence bound (UCB) algorithm Auer et al. (2002) is a typical MAB algorithm, which maintains a confidence interval to represent uncertainty. The idea of applying MAB to one-to-one matching problems, introduced by Liu et al. (2020a), assumes that there is a central platform to make decisions for all agents. Following this, other works Liu et al. (2020b); Sankararaman et al. (2021); Basu et al. (2021) consider a more general decentralized setting without a central platform to arrange matchings, and our work is also based on this setting. However, it is not enough to study only the one-to-one setting. In the online short-term worker employment problem, employers have numerous similar short-term tasks to be filled, and workers can only
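The UCB idea referenced above can be illustrated with a minimal single-agent UCB1 sketch (an assumption for exposition, not the paper's MO-UCB-D4 algorithm): each arm's index is its empirical mean plus a confidence bonus that shrinks as the arm is sampled more often.

```python
import math
import random

def ucb1(reward_fn, n_arms, horizon, seed=0):
    """Minimal UCB1: at each round pull the arm with the highest
    mean-plus-confidence-bonus index."""
    random.seed(seed)
    counts = [0] * n_arms       # times each arm was pulled
    means = [0.0] * n_arms      # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # pull each arm once to initialize
        else:
            # bonus sqrt(2 log t / n_a) represents remaining uncertainty
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return counts

# Bernoulli arms with hypothetical success probabilities; arm 2 is best.
probs = [0.3, 0.5, 0.8]
counts = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
              n_arms=3, horizon=2000)
```

Over 2000 rounds the best arm accumulates most of the pulls, which is the behavior the logarithmic regret guarantees formalize.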

