BANDIT LEARNING IN MANY-TO-ONE MATCHING MARKETS WITH UNIQUENESS CONDITIONS

Abstract

An emerging line of research addresses one-to-one matching markets with bandits, where the preferences of one side are unknown and matching must proceed while learning those preferences through multiple rounds of interaction. However, in many real-world applications, such as online recruitment platforms for short-term workers, one side of the market can select more than one participant from the other side, which motivates the study of the many-to-one matching problem. Moreover, the existence of a unique stable matching is crucial to the competitive equilibrium of the market. In this paper, we first introduce a new, more general α-condition that guarantees the uniqueness of the stable matching in many-to-one matching problems. It generalizes established uniqueness conditions such as SPC and Serial Dictatorship, and recovers the known α-condition when the problem reduces to one-to-one matching. Under this new condition, we design an MO-UCB-D4 algorithm with an O(NK log(T)/∆²) regret bound, where T is the time horizon, N is the number of agents, K is the number of arms, and ∆ is the minimum reward gap. Extensive experiments show that our algorithm achieves uniformly good performance under different uniqueness conditions.

1. INTRODUCTION

The data-driven matching market faces the twin problems of learning customer preferences and matching the demand side with the supply side to maximize the benefits of both. Online platforms such as Lyft, Thumbtack, and TaskRabbit make matching decisions for customers and service providers based on their diversified needs. This is abstracted as a matching market with an agent side and an arm side, where each side has a preference profile over the opposite side; participants choose from the other side according to preference and perform a matching. Specific examples include pool riding in ride-share systems, where a driver is matched to multiple riders, and slate ranking in recommender systems, where a user is matched to various content in a single request Ie et al. (2019). The stability of the matching result is a key property of the market Roth & Sotomayor (1992); Abizada (2016). Recent works Liu et al. (2020b); Sankararaman et al. (2021); Basu et al. (2021) consider a more general decentralized setting without a central platform to arrange matchings, and our work is also based on this setting. However, it is not enough to study only the one-to-one setting. In online short-term worker employment, employers have numerous similar short-term tasks to fill, and a worker can choose only one task at a time according to the company's needs, while one company can accept more than one employee. Each company makes a fixed ranking of candidates according to its own requirements, but workers have no knowledge of the companies' preferences. The reward for workers is a comprehensive consideration of salary and job environment. The online matching proceeds iteratively: tasks are short-term, and if an agent does not get an ideal job, he either leaves the platform or starts a new round of competition to select another company. We abstract companies as arms and workers as agents. Each arm has a capacity q, the maximum number of agents it can accommodate; when an arm faces multiple choices, it accepts its q most preferred agents.
Agents thus compete for arms and may receive zero reward if they lose the conflict. It is worth mentioning that an arm with capacity q in the many-to-one matching cannot simply be replaced by q independent replicates with the same preference, since there would be implicit competition among the replicates. In addition, when multiple agents select one arm at a time, collisions are unavoidable, which hinders communication among different agents under the decentralized assumption. Agents cannot distinguish who is more preferred by an arm in one round, since the arm can accept more than one agent, whereas this is possible in the one-to-one case. Communication here lets each agent learn more about the preferences of arms and of other agents, so as to formulate better policies that reduce collisions and learn the stable result faster. This work focuses on a many-to-one market under uniqueness conditions. Previous works Clark (2006); Gutin et al. (2021) emphasize the importance of a unique stable matching for the equilibrium of matching problems, and some existing uniqueness conditions have been studied in many-to-one matching, such as the Sequential Preference Condition (SPC) and Acyclicity Niederle & Yariv (2009); Akahoshi (2014). Our work is motivated by Basu et al. (2021), but the unique one-to-one mapping between arms and agents in their study, which gives a surrogate threshold for arm elimination, does not work in the many-to-one setting. Moreover, uniqueness conditions in many-to-one matching are not well studied, which also makes it challenging to identify and leverage the relationship between the resulting stable matching and the preferences of the two sides in the design of bandit algorithms. We propose an α-condition that guarantees a unique stable matching and recovers the α-condition of Karpov (2019) when reduced to the one-to-one setting. We establish the relationships between our new α-condition and existing uniqueness conditions in the many-to-one setting.
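The arm-side acceptance rule described above can be made concrete with a short sketch. The function name and representation below are illustrative assumptions, not the paper's notation: an arm with capacity q simply keeps its q most preferred agents among the simultaneous proposers, and every other proposer collides and receives zero reward.

```python
def accept(proposers, pref, q):
    """Arm-side acceptance rule: keep the q most preferred of the
    proposing agents; the remaining proposers lose the conflict.

    proposers: set of agent ids that selected this arm this round
    pref:      list of agent ids, most preferred first (the arm's fixed ranking)
    q:         the arm's capacity
    """
    ranked = [j for j in pref if j in proposers]  # proposers in preference order
    return set(ranked[:q])                        # accepted agents

# Example: the arm prefers 0 > 1 > 2 > 3 and has capacity 2.
# Agents 1, 2, 3 propose, so agents 1 and 2 are accepted and 3 collides.
print(accept({1, 2, 3}, [0, 1, 2, 3], q=2))
```

Note that q replicates of the arm with identical preference lists would not reproduce this behavior, since each replicate would independently accept its single most preferred proposer.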
In summary, this paper studies a bandit algorithm for a decentralized many-to-one matching market with uniqueness conditions. Under our newly proposed uniqueness condition, the α-condition, we design an MO-UCB-D4 algorithm with arm elimination that converges to the stable matching. The regret of our algorithm is upper bounded by O(NK log(T)/∆²), where T is the time horizon, N is the number of agents, K is the number of arms, and ∆ is the minimum reward gap; this bound matches the lower bound in terms of T and ∆. Finally, we conduct a series of experiments that simulate our algorithm under Serial Dictatorship, SPC, and the α-condition to study the stability and regret of the algorithm.

2. SETTING

This paper considers a many-to-one matching market M = (K, J, P), where K = [K] is a finite arm set and J = [N] is a finite agent set. Each arm k has a fixed capacity q_k ≥ 1. To guarantee that no agent is left unmatched, we focus on markets with N ≤ Σ_{k=1}^{K} q_k. P is the fixed preference profile of agents and arms, ranked by mean rewards. We assume that arm preferences are over individuals Roth & Sotomayor (1992); Sethuraman et al. (2006); Altinok (2019), and that agents' preferences over arms are unknown and need to be learned. If agent j prefers arm k over k′, i.e., µ_{j,k} > µ_{j,k′}, we write k ≻_j k′. Preferences are strict, so µ_{j,k} ≠ µ_{j,k′} if k ≠ k′. Similarly, each arm k has a preference ≻_k over all agents; in particular, j ≻_k j′ means that arm k prefers agent j over j′. Throughout, we focus on markets where all agent-arm pairs are mutually acceptable, that is, j ≻_k ∅ and k ≻_j ∅ for all k ∈ [K] and j ∈ [N]. Let the mapping m_t denote the matching result: m_t(j) is the arm matched to agent j at time t, and γ_t(k) is the set of agents matched to arm k.¹ At each time step, agent j selects an arm I_t(j), and we use M_t(j) to denote whether j is successfully matched with its selected arm: M_t(j) = 1 if agent j is matched with I_t(j), and M_t(j) = 0 otherwise. If multiple agents select arm k at the same time, only the top q_k agents are successfully matched. An agent j matched with arm k observes the reward X_{j,m_t(j)}(t), where

X_{j,m_t(j)}(t) = M_t(j) · (µ_{j,m_t(j)} + ε_{j,t}),

and ε_{j,t} is zero-mean sub-Gaussian noise; an agent that loses the conflict receives zero reward.
¹ The mapping m_t is not invertible since it is not injective; we therefore use γ_t(k) rather than m_t^{-1}(k).
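The interaction protocol of Section 2 can be summarized by a minimal one-round simulation. The function and its noise model below are illustrative assumptions for exposition, not the paper's algorithm: each agent proposes to one arm, each arm keeps its top q_k proposers, and matched agents observe a noisy reward with mean µ_{j,k} while colliding agents receive zero.

```python
import random

def play_round(choices, arm_prefs, caps, means, rng):
    """One round of the decentralized many-to-one market.

    choices:   choices[j] = arm I_t(j) selected by agent j
    arm_prefs: arm_prefs[k] = list of agents, most preferred first
    caps:      caps[k] = capacity q_k of arm k
    means:     means[j][k] = mean reward mu_{j,k} of agent j for arm k
    Returns (matched, rewards): the indicators M_t(j) and observed rewards.
    """
    n = len(choices)
    matched = [0] * n
    rewards = [0.0] * n
    for k, pref in enumerate(arm_prefs):
        proposers = [j for j in pref if choices[j] == k]
        for j in proposers[:caps[k]]:  # arm k keeps its q_k most preferred proposers
            matched[j] = 1
            rewards[j] = means[j][k] + rng.gauss(0, 1)  # noisy reward, mean mu_{j,k}
    return matched, rewards

rng = random.Random(0)
# 3 agents, 2 arms with capacities (2, 1); all agents select arm 0,
# whose preference is 0 > 1 > 2, so agent 2 loses the conflict.
matched, rewards = play_round([0, 0, 0], [[0, 1, 2], [0, 1, 2]], [2, 1],
                              [[1.0, 0.5]] * 3, rng)
print(matched)
```

Note how agent 2 in the example cannot tell from a single collision whether agents 0 and 1 are both preferred over it, which is exactly the communication difficulty discussed above.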



Our problem, with online short-term recruitment as the main example, combines the traditional matching problem Bade (2020); Bogomolnaia & Moulin (2001); Roth & Sotomayor (1992) with online systems Gunn et al. (2022); Malgonde et al. (2020); Johari et al. (2021). Companies with short-term needs accommodate workers who voluntarily look for flexible probation periods. Worker preferences may be unknown in advance, so matching while learning the preferences is necessary. The multi-armed bandit (MAB) framework Thompson (1933); Garivier et al. (2016); Auer et al. (2002) is an important tool for N independent agents in a matching market who simultaneously select arms and adapt based on the rewards received at each round. The upper confidence bound (UCB) algorithm Auer et al. (2002) is a typical MAB algorithm, which uses a confidence interval to represent uncertainty. The idea of applying MAB to one-to-one matching problems, introduced by Liu et al. (2020a), assumes that a central platform makes decisions for all agents. Following this, later works Liu et al. (2020b); Sankararaman et al. (2021); Basu et al. (2021) study the decentralized setting.
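The UCB idea mentioned above maintains, for each agent-arm pair, an empirical mean plus a confidence radius that shrinks as the pair is sampled. A generic UCB1-style index in the spirit of Auer et al. (2002) is sketched below; this is a standard building block, not the MO-UCB-D4 algorithm itself.

```python
import math

def ucb_index(mean_hat, n_pulls, t):
    """UCB1-style index: empirical mean plus a confidence radius.

    mean_hat: empirical mean reward of this agent-arm pair so far
    n_pulls:  number of times the pair has been matched so far
    t:        current round
    """
    if n_pulls == 0:
        return float("inf")  # force at least one sample of every arm
    return mean_hat + math.sqrt(2 * math.log(t) / n_pulls)

# The confidence radius shrinks with more pulls: with equal empirical
# means, the less-sampled arm gets the larger index, so the agent is
# driven to explore it.
print(ucb_index(0.5, 10, 100) > ucb_index(0.5, 40, 100))
```

An agent then selects the arm maximizing this index each round; in the matching setting, collisions make the observed samples depend on the other agents' choices, which is what the elimination and communication machinery of the paper addresses.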

