BANDIT LEARNING IN MANY-TO-ONE MATCHING MARKETS WITH UNIQUENESS CONDITIONS

Abstract

An emerging line of research is dedicated to the problem of one-to-one matching markets with bandits, where the preferences of one side are unknown, so matching must proceed while the preferences are learned through multiple rounds of interaction. However, in many real-world applications, such as online recruitment platforms for short-term workers, one side of the market can select more than one participant from the other side, which motivates the study of the many-to-one matching problem. Moreover, the existence of a unique stable matching is crucial to the competitive equilibrium of the market. In this paper, we first introduce a new, more general α-condition that guarantees the uniqueness of the stable matching in many-to-one matching problems. It generalizes established uniqueness conditions such as SPC and Serial Dictatorship, and recovers the known α-condition when the problem is reduced to one-to-one matching. Under this new condition, we design an MO-UCB-D4 algorithm with an $O\left(\frac{NK\log(T)}{\Delta^2}\right)$ regret bound, where T is the time horizon, N is the number of agents, K is the number of arms, and ∆ is the minimum reward gap. Extensive experiments show that our algorithm achieves uniformly good performance under different uniqueness conditions.

Under review as a conference paper at ICLR 2023

choose one task according to the company's needs at a time, while one company can accept more than one employee. Each company makes a fixed ranking of candidates according to its own requirements, but workers have no knowledge of companies' preferences. The reward for a worker is a combined measure of salary and job environment. The online matching is iterative: tasks are short-term, and an agent who does not get an ideal job will leave the platform or start a new competition to select another company. We abstract companies as arms and workers as agents. Each arm has a capacity q, which is the maximum number of agents this arm can accommodate.
When an arm faces multiple choices, it accepts its q most preferred agents. Agents thus compete for arms and may receive zero reward if they lose the conflict. It is worth mentioning that an arm with capacity q in many-to-one matching cannot simply be replaced by q independent replicas with the same preference, since there would be implicit competition among them. In addition, when multiple agents select one arm at the same time, collisions are unavoidable, which hinders communication among agents under the decentralized assumption. The agents cannot distinguish who is more preferred by this arm in one round, because the arm can accept more than one agent, whereas this can be done in the one-to-one case. Communication here lets each agent learn more about the preferences of arms and of other agents, so as to formulate better policies that reduce collisions and learn the stable outcome faster.

This work focuses on a many-to-one market under uniqueness conditions. Previous works Clark (2006); Gutin et al. (2021) emphasize the importance of constructing a unique stable matching for the equilibrium of matching problems, and some existing uniqueness conditions have been studied in many-to-one matching, such as the Sequential Preference Condition (SPC) and Acyclicity Niederle & Yariv (2009); Akahoshi (2014). Our work is motivated by Basu et al. (2021), but the unique one-to-one mapping between arms and agents in their study, which gives a surrogate threshold for arm elimination, does not work in the many-to-one setting. Moreover, the uniqueness conditions in many-to-one matching are not well studied, which makes it challenging to identify and leverage the relationship between the resulting stable matching and the preferences of the two sides when designing bandit algorithms. We propose an α-condition that guarantees a unique stable matching and recovers the α-condition of Karpov (2019) when reduced to the one-to-one setting.
We establish the relationships between our new α-condition and existing uniqueness conditions in the many-to-one setting. In summary, this paper studies bandit algorithms for a decentralized many-to-one matching market with uniqueness conditions. Under our newly proposed uniqueness condition, the α-condition, we design an MO-UCB-D4 algorithm with arm elimination to construct a stable matching. The regret of our algorithm is upper bounded by $O\left(\frac{NK\log(T)}{\Delta^2}\right)$, where T is the time horizon, N is the number of agents, K is the number of arms, and ∆ is the minimum reward gap, and the regret reaches the lower bound in terms of T and ∆. Finally, we conduct a series of experiments that simulate our algorithm under Serial Dictatorship, SPC and the α-condition to study its stability and regret.

This paper considers a many-to-one matching market $\mathcal{M} = (\mathcal{K}, \mathcal{J}, \mathcal{P})$, where $\mathcal{K} = [K]$ is a finite arm set and $\mathcal{J} = [N]$ is a finite agent set. Each arm k has a fixed capacity $q_k \ge 1$. To guarantee that no agents will be unmatched, we focus on the market with N ≤

1. INTRODUCTION

Data-driven matching markets face the twin problems of learning customer preferences and matching the demand side with the supply side of the market so as to maximize the benefits of both sides. Online platforms such as Lyft, Thumbtack and Taskrabbit match customers with service providers on the basis of their diversified needs. This is abstracted as a matching market with an agent side and an arm side, where each side has a preference profile over the opposite side. Participants choose from the other side according to their preferences and form a matching. Specific examples include pool riding in ride-share systems, which matches a driver to multiple riders, and slate ranking in recommender systems, where a user is matched to several content items in a single request Ie et al. (2019). The stability of the matching result is a key property of the market Roth & Sotomayor (1992); Abizada (2016); Park (2017).

This work takes online short-term recruitment as the main example and combines the traditional matching problem Bade (2020); Bogomolnaia & Moulin (2001); Roth & Sotomayor (1992) with the online system Gunn et al. (2022); Malgonde et al. (2020); Johari et al. (2021). Companies with short-term needs accommodate workers who are voluntarily looking for flexible probation periods. The worker preferences may be unknown in advance, so matching while learning the preferences is necessary. The multi-armed bandit (MAB) Thompson (1933); Garivier et al. (2016); Auer et al. (2002) is an important tool for N independent agents in a matching market that simultaneously and adaptively select arms based on the rewards received at each round. The upper confidence bound (UCB) algorithm Auer et al. (2002) is a typical MAB algorithm, which uses a confidence interval to represent uncertainty. The idea of applying MAB to one-to-one matching problems, introduced by Liu et al. (2020a), assumes that there is a central platform that makes decisions for all agents. Following this, other works Liu et al.
(2020b); Sankararaman et al. (2021); Basu et al. (2021) consider a more general decentralized setting without a central platform to arrange matchings, and our work is also based on this setting. However, it is not enough to study only the one-to-one setting. In the online short-term worker employment problem, employers have numerous similar short-term tasks to fill, and workers can only

$\sum_{i=1}^{K} q_i$. $\mathcal{P}$ is the fixed preference order of agents and arms, which is ranked by the mean reward. We assume that arm preference is over individuals Roth & Sotomayor (1992); Sethuraman et al. (2006); Altinok (2019), and that arm preferences over agents are unknown and need to be learned. If agent j prefers arm k to arm k′, i.e., $\mu_{j,k} > \mu_{j,k'}$, we write $k \succ_j k'$. Preferences are strict: $\mu_{j,k} \neq \mu_{j,k'}$ if $k \neq k'$. Similarly, each arm k has preferences $\succ_k$ over all agents; in particular, $j \succ_k j'$ means that arm k prefers agent j over j′. Throughout, we focus on markets where all agent-arm pairs are mutually acceptable, that is, $j \succ_k \emptyset$ and $k \succ_j \emptyset$ for all $k \in [K]$ and $j \in [N]$.

Let a mapping m be the matching result. $m_t(j)$ is the matched arm for agent j at time t, and $\gamma_t(k)$ is the set of agents matched with arm k. At each time t, agent j selects an arm $I_t(j)$, and we use $M_t(j)$ to denote whether j is successfully matched with its selected arm: $M_t(j) = 1$ if agent j is matched with $I_t(j)$, and $M_t(j) = 0$ otherwise. If multiple agents select arm k at the same time, only the top $q_k$ agents can successfully match. An agent j matched with arm k observes the reward $X_{j,m_t(j)}(t)$, where the random reward $X_{j,k}(t) \in [0, 1]$ is drawn independently from a fixed distribution with mean $\mu_{j,k}$. An unmatched agent experiences a collision and receives zero reward. In general, the reward obtained by agent j is $X_{j,I_t(j)}(t)\, M_t(j)$.

We say an agent j and an arm k form a blocking pair for a matching m if they prefer each other over their current assignments, i.e., $k \succ_j m(j)$ and $\exists j' \in \gamma(k)$ such that $j \succ_k j'$. We say a matching is individually rational (IR) if $a_j \succ_{p_i} \emptyset$ and $p_i \succ_{a_j} \emptyset$ for all $i \in [N]$ and $j \in [K]$, that is, every worker prefers finding a job to doing nothing, and every company prefers recruiting workers to recruiting no one. Under the IR condition, a matching in the many-to-one setting is stable if there does not exist a blocking pair Salonen & Salonen (2018); Sethuraman et al. (2006).

This paper considers matching markets under the uniqueness condition, so the overall goal is to find the unique stable matching between the agent side and the arm side through iterations. Let $m^*(j)$ be the stable matched arm for agent j under the stable matching $m^*$. The reward obtained by agent j is compared against the reward received by matching with $m^*(j)$ at each time. We aim to minimize the expected stable regret for agent j over time horizon T, which is defined as

$$R_j(T) = T\mu_{j,m^*(j)} - \mathbb{E}\left[\sum_{t=1}^{T} M_t(j)\, X_{j,I_t(j)}(t)\right].$$
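As a minimal sketch of the model above, the following Python code resolves one round of arm selections (each arm keeps its top $q_k$ proposers) and lists blocking pairs exactly as defined in the text. The function names `resolve_round` and `blocking_pairs`, the dictionary-based data layout and the small example market are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

def resolve_round(choices: Dict[int, int],
                  arm_rank: Dict[int, List[int]],
                  capacity: Dict[int, int]) -> Dict[int, int]:
    """Return M_t(j) for every agent: 1 if its selected arm accepts it, else 0.

    choices[j]  : arm I_t(j) selected by agent j
    arm_rank[k] : agents ordered from most to least preferred by arm k
    capacity[k] : capacity q_k of arm k
    """
    matched = {j: 0 for j in choices}
    for k in set(choices.values()):
        proposers = sorted((j for j, a in choices.items() if a == k),
                           key=arm_rank[k].index)
        for j in proposers[:capacity[k]]:  # only the top q_k agents match
            matched[j] = 1
    return matched

def blocking_pairs(match: Dict[int, int],
                   mu: Dict[int, Dict[int, float]],
                   arm_rank: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """List pairs (j, k) with k preferred by j over m(j) and j preferred by k
    over some agent j' currently matched with k (the definition in the text)."""
    pairs = []
    for j, current in match.items():
        for k in arm_rank:
            if mu[j][k] <= mu[j][current]:
                continue  # agent j does not strictly prefer arm k
            gamma_k = [a for a, arm in match.items() if arm == k]
            if any(arm_rank[k].index(j) < arm_rank[k].index(a) for a in gamma_k):
                pairs.append((j, k))
    return pairs

# Hypothetical market: 3 agents, 2 arms, arm 0 has capacity 2.
mu = {0: {0: 0.9, 1: 0.5}, 1: {0: 0.8, 1: 0.4}, 2: {0: 0.7, 1: 0.3}}
arm_rank = {0: [0, 1, 2], 1: [0, 1, 2]}
capacity = {0: 2, 1: 1}
```

In this example all agents prefer arm 0, so when everyone selects arm 0 the lowest-ranked agent collides, and the matching that sends agents 0 and 1 to arm 0 and agent 2 to arm 1 has no blocking pair.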

3. ALGORITHM

In this section, we introduce our MO-UCB-D4 algorithm (Many-to-one UCB with Decentralized Dominated arms Deletion and Local Deletion, Algorithm 1) for the decentralized many-to-one market, where there is no platform to arrange actions for agents. The MO-UCB-D4 algorithm runs in phases, and each phase i mainly includes a regret minimization block (lines 6-12) and a communication block (lines 13-16), with duration $2^{i-1}$, $i = 1, 2, \cdots$.

Algorithm 1 MO-UCB-D4 algorithm (for agent j)
Input: $\theta \in (0, 1/K)$, $\alpha > 1$.
1: Set global dominated set $G_j[0] = \emptyset$
2: for phase $i = 1, 2, \ldots$ do
3:     Reset the collision set $C_{j,k}[i] = 0$, $\forall k \in [K]$;
4:     Reset active arms set $Ch_j[i] = [K] \setminus G_j[i-1]$;
5:     if $t < 2^i + NK(i-1)$ then
6:         Local deletion $L_j[i] = \{k : C_{j,k}[i] \ge \lceil \theta 2^i \rceil\}$;
7:         Play arm $I_t(j) \in \arg\max_{k \in Ch_j[i] \setminus L_j[i]} \hat{\mu}_{j,k}(t-1) + \sqrt{\frac{2\alpha \log(t)}{N_{j,k}(t-1)}}$;
8:         if $k = I_t(j)$ is successfully matched with agent j, i.e. $m_t(j) = k$ then
9:             Update estimate $\hat{\mu}_{j,k}(t)$ and matching count $N_{j,k}(t)$;
10:        else
11:            Update collision counter $C_{j,k}[i] \leftarrow C_{j,k}[i] + 1$;
12:        end if
13:    else if $t = 2^i + NK(i-1)$ then
14:        $O_j[i] \leftarrow$ most matched arm in phase i;
15:        $G_j[i] \leftarrow$ COMMUNICATION$(i, O_j[i])$;
16:    end if
17: end for

For each agent j in phase i, the algorithm adds an arm deletion process to reduce potential conflicts, consisting of global deletion and local deletion. The former eliminates the arms most preferred by agents who rank higher than agent j and yields the active set $Ch_j[i]$ (line 4), and the latter deletes the arms that still have many conflicts with agent j after global deletion (line 6). We set a collision counter $C_{j,k}[i]$ to record the number of collisions for agent j pulling arm k in phase i. In the regret minimization block of phase i, $L_j[i] = \{k : C_{j,k}[i] \ge \lceil \theta 2^i \rceil\}$ denotes the arms whose collision count with agent j has reached the threshold $\lceil \theta 2^i \rceil$. Arms in $L_j[i]$ are first locally deleted to reduce potential collisions for agent j (line 6). After that, agent j selects an action $I_t(j)$ from the remaining arms in $Ch_j[i] \setminus L_j[i]$ according to the UCB index $\hat{\mu}_{j,k}(t-1) + \sqrt{\frac{2\alpha \log(t)}{N_{j,k}(t-1)}}$ (line 7), where $N_{j,k}(t-1)$ is the number of times that agent j and arm k have been matched up to time t − 1. If the selected arm is successfully matched with agent j, the algorithm updates the estimated reward $\hat{\mu}_{j,k}(t) = \frac{1}{N_{j,k}(t)} \sum_{s=1}^{t} \mathbb{1}\{I_s(j) = k \text{ and } M_s(j) = 1\}\, X_{j,k}(s)$ and the count $N_{j,k}(t)$ (line 9). Otherwise, a collision happens (line 11) and agent j receives zero reward.
The regret minimization block identifies the most played arm $O_j[i]$ for agent j in each phase i, which serves as the estimate of the best arm for agent j and thereby guides the policy toward minimizing the expected regret.
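The arm selection step of the regret minimization block (local deletion followed by the UCB index) could be sketched in Python as below. The function names, the dictionary layout, the convention of giving an unplayed arm an infinite index, and the fallback to the full active set when every active arm is locally deleted are assumptions made for illustration; the paper's text does not specify these details.

```python
import math
from typing import Dict, Iterable

def ucb_index(mu_hat: float, n_matched: int, t: int, alpha: float) -> float:
    """UCB index: empirical mean plus sqrt(2 * alpha * log(t) / N)."""
    if n_matched == 0:
        return math.inf  # assumption: an arm never matched so far is tried first
    return mu_hat + math.sqrt(2.0 * alpha * math.log(t) / n_matched)

def select_arm(mu_hat: Dict[int, float],
               counts: Dict[int, int],
               collisions: Dict[int, int],
               active: Iterable[int],
               t: int, phase: int, theta: float, alpha: float) -> int:
    """Pick the arm with the largest UCB index among active arms that are
    not locally deleted in the current phase."""
    active = list(active)
    threshold = math.ceil(theta * 2 ** phase)
    locally_deleted = {k for k in active if collisions[k] >= threshold}
    candidates = [k for k in active if k not in locally_deleted]
    if not candidates:
        candidates = active  # assumption: fall back rather than play nothing
    return max(candidates, key=lambda k: ucb_index(mu_hat[k], counts[k], t, alpha))

# Hypothetical state of one agent in phase 3 with K = 3 arms.
example_choice = select_arm(
    mu_hat={0: 0.5, 1: 0.6, 2: 0.9},
    counts={0: 10, 1: 10, 2: 10},
    collisions={0: 0, 1: 0, 2: 5},
    active=[0, 1, 2],
    t=100, phase=3, theta=0.3, alpha=1.1,
)
```

In the example, arm 2 has the highest empirical mean but has collided 5 times, which exceeds the threshold $\lceil 0.3 \cdot 2^3 \rceil = 3$, so it is locally deleted and the agent plays arm 1 instead.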

Algorithm 2 COMMUNICATION
Input: Phase number i, and most played arms $O_j[i]$ for agent j, $\forall j \in [N]$.
1: Set $C = \emptyset$;
2: for $t = 1, 2, \cdots, NK - 1$ do
3:     if $K(j-1) \le t \le Kj - 1$ then
4:         Agent j plays arm $I_t(j) = (t \bmod K) + 1$;

In the communication block (Algorithm 2), there are N sub-blocks, each with duration K. In the ℓ-th sub-block, only agent ℓ pulls arm 1, arm 2, $\cdots$, arm K in round-robin (line 4), while the other agents select their most preferred arms, estimated as their most played ones. This block aims to detect the globally dominated arms for each agent j: $G_j[i] \subset \{O_{j'}[i] : j' \succ_{O_{j'}[i]} j\}$. Under the stable matching $m^*$, the globally dominated arm set for agent j is denoted by $G^*_j$. After the communication block in phase i, each agent j updates its active arm set $Ch_j[i+1]$ for phase i + 1 by globally deleting the arm set $G_j[i]$, and enters the next phase (line 4 in Algorithm 1). Because the active set is rebuilt in every phase, the active sets of different phases have no inclusion relationship, so an arm deleted by an agent in one phase can still be selected in later phases. This ensures that no agent permanently eliminates its stable matched arm, and a mistaken deletion by agent j does not lead to linear regret.
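The round-robin schedule described above can be illustrated with a short Python sketch. It only covers which arm each agent plays at each step of the communication block; how the dominated set $G_j[i]$ is derived from the observed collisions is not reproduced here. The function name `communication_action` and the 0-based step index t in {0, ..., NK − 1} are assumptions chosen so that the condition K(j − 1) ≤ t ≤ Kj − 1 gives every agent exactly K probing steps.

```python
def communication_action(t: int, j: int, K: int, most_played: int) -> int:
    """Arm played by agent j (1-based) at step t of the communication block.

    Inside its own sub-block, agent j probes arms 1..K in round-robin;
    outside it, agent j keeps playing its most played arm O_j[i].
    """
    if K * (j - 1) <= t <= K * j - 1:
        return (t % K) + 1
    return most_played

# Hypothetical block with N = 2 agents and K = 3 arms.
N, K = 2, 3
schedule = {
    j: [communication_action(t, j, K, most_played=j) for t in range(N * K)]
    for j in (1, 2)
}
```

With these inputs, agent 1 probes arms 1, 2, 3 during the first three steps and then returns to its most played arm, while agent 2 does the opposite.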

4.1.1. α-CONDITION

When the preferences of agents and arms are given by utility functions rather than being random, like payments for workers in labor markets, the stable matching is usually unique. Thus the assumption of a unique stable matching is quite common in real applications. Some uniqueness conditions also have important properties such as consistency, which states that any stable pair leaving the market does not prevent the remaining participants from forming a stable matching. In dynamic markets where agents and arms come and go, the consistency property is desirable because it keeps the majority of the matching static Basu et al. (2021). In this way, the market is divided into prioritized pairs that form a hierarchical structure, so that the design of the algorithm is inductive and the regret is bounded in terms of the number of sub-optimal matchings (Appendix 3). Besides, when the stable matching is unique, there is no dispute about which side's preferred stable matching to adopt, which is fairer to both sides Cen & Shah (2022). Note that the outcome of the GS algorithm favors the proposing side and would be unfair to the other side Clark (2006).

In this section, we propose a new uniqueness condition, the α-condition. First, we introduce uniqueness consistency (Unqc) Karpov (2019), which guarantees robustness and uniqueness of markets.

Definition 1. A preference profile satisfies uniqueness consistency if and only if (i) there exists a unique stable matching $m^*$; (ii) for any subset of arms or agents, the preference profile restricted to this subset together with their stable-matched partners induces a unique stable matching.

Uniqueness consistency guarantees that even if an arbitrary subset of stable pairs is removed from the system, there still exists a unique stable matching among the remaining agents and arms. This condition allows the algorithm to find the unique stable matching by detecting the stable matching pairs iteratively.
To obtain the unique stable matching in the many-to-one market, we propose a new α-condition, which is a necessary and sufficient condition for Unqc (proved in Appendix C). We consider a finite set of arms $[K] = \{1, 2, \cdots, K\}$ and a finite set of agents $[N] = \{1, 2, \cdots, N\}$ with preference profile $\mathcal{P}$. Assume that $[N]_r = \{A_1, A_2, \cdots, A_N\}$ is a permutation of $\{1, 2, \cdots, N\}$ and $[K]_r = \{c_1, c_2, \cdots, c_K\}$ is a permutation of $\{1, 2, \cdots, K\}$. Denote $[N]$, $[K]$ as the left order and $[N]_r$, $[K]_r$ as the right order. Considering arm capacity, we denote by $\gamma^*(c_k)$ (right order) the stable matched agent set for arm $c_k$.

Definition 2. A many-to-one matching market satisfies the α-condition if

(i) the left order of agents and arms satisfies
$$\forall j \in [N],\ \forall k > j,\ k \in [K]:\quad \mu_{j,m^*(j)} > \mu_{j,k},$$
where $m^*(j)$ is agent j's stable matched arm;

(ii) the right order of agents and arms satisfies
$$\forall k < k' \le K,\ c_k \in [K]_r,\ \mathcal{A}_{k'} \subset [N]_r:\quad \gamma^*(c_k) \succ_{c_k} A_{\sum_{i=1}^{k'-1} q_{c_i} + 1},$$
where the statement that the set $\gamma^*(c_k)$ is preferred to $A_{\sum_{i=1}^{k'-1} q_{c_i} + 1}$ means that the least preferred agent in $\gamma^*(c_k)$ is better for $c_k$ than $A_{\sum_{i=1}^{k'-1} q_{c_i} + 1}$.

Under our α-condition, the left order and the right order satisfy the following rule. The left order ranks participants according to agents' preferences. The first agent in the left order set $[N]$ prefers arm 1 in $[K]$ most and has it as its stable matched arm. The same holds for agents 2 to $q_1$, since arm 1 has capacity $q_1$. Then the $(q_1 + 1)$-th agent in the left order set $[N]$ has arm 2 in $[K]$ as its stable matched arm and prefers arm 2 most among all arms except arm 1. The remaining agents follow similarly. Likewise, the right order ranks participants according to arms' preferences. The first arm $c_1$ in the right order set $[K]_r$ most prefers the first $q_{c_1}$ agents in the right order set $[N]_r$ and takes them as its stable matched agents. The remaining arms follow similarly. This condition is more general than the existing SPC condition Reny (2021) and recovers the known α-condition in the one-to-one matching market Karpov (2019). The relationship between existing uniqueness conditions and our proposed condition is analyzed in detail in Section 4.1.2. The main idea in moving from the one-to-one to the many-to-one analysis is to replace individuals with sets.
In general, under the α-condition, the left order satisfies that when arm 1 to arm k − 1 are removed, agents $\sum_{i=1}^{k-1} q_i + 1$ to $\sum_{i=1}^{k} q_i$ prefer arm k most, and the right order means that when agents $A_1$ to $A_{\sum_{i=1}^{k-1} q_{c_i}}$ are removed, arm k prefers agents $\mathcal{A}_k = \{A_{\sum_{i=1}^{k-1} q_{c_i} + 1}, A_{\sum_{i=1}^{k-1} q_{c_i} + 2}, \cdots, A_{\sum_{i=1}^{k} q_{c_i}}\}$, where $\mathcal{A}_k$ is the set of the $q_k$ agents most preferred by arm k among those who have not been matched by arms $1, 2, \cdots, k-1$. The α-condition can be detected as follows: after running the GS algorithm and finding a stable matching, we obtain two orders of arms and agents by sequentially eliminating higher-ranked agents or arms together with their matched partners. The α-condition is satisfied if the two orders are identical. The next theorem gives a summary.

Theorem 1. If a market $\mathcal{M} = (\mathcal{K}, \mathcal{J}, \mathcal{P})$ satisfies the α-condition, then $m^*\left(\sum_{i=1}^{j-1} q_i + 1\right) = m^*\left(\sum_{i=1}^{j-1} q_i + 2\right) = \cdots = m^*\left(\sum_{i=1}^{j} q_i\right) = j$ (the left order), $\gamma^*(c_k) = \mathcal{A}_k$ and $m^*(\mathcal{A}_j) = c_j$ (the right order) under the stable matching.

Under the α-condition, the stable matched arm may not be the most preferred one for each agent $j \in [N]$. Thus (i) $m^*(j)$ need not be dominated only by agents 1 to j − 1, i.e. there may exist $j' > j$ such that $j' \succ_{m^*(j)} j$; (ii) the left order may not be identical to the right order, so we define a mapping lr to match the index of an agent in the left order with its index in the right order, i.e. $A_{lr(j)} = j$. From Theorem 1, the stable matched set for arm k is the set of its first $q_k$ preferred agents, $\gamma^*(c_k) = \mathcal{A}_k$. We define lr as $lr(i) = \max\{j : A_j \in \gamma^*(m^*(i)),\ j \in [N]\}$, that is, in the right order, the mapping for arm $k \in [K]$ is the least preferred one among its $q_k$ most preferred agents. Note that this mapping is not injective, i.e. $\exists j, j'$ such that agent $j = A_{lr(j)} = A_{lr(j')}$. An intuitive representation can be seen in Figure 4 in Appendix B.1.
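The first step of the detection procedure above, running the GS algorithm to obtain a stable matching, could look like the following agent-proposing deferred acceptance sketch for the many-to-one case. This is the standard textbook procedure, not code from the paper; the function name, the data layout and the example market are assumptions, and the subsequent comparison of the left and right orders is not sketched here.

```python
from typing import Dict, List

def deferred_acceptance(agent_pref: Dict[int, List[int]],
                        arm_rank: Dict[int, List[int]],
                        capacity: Dict[int, int]) -> Dict[int, int]:
    """Agent-proposing Gale-Shapley with arm capacities.

    agent_pref[j] : arms ordered from most to least preferred by agent j
    arm_rank[k]   : agents ordered from most to least preferred by arm k
    capacity[k]   : capacity q_k of arm k
    Returns a dict mapping each matched agent to its arm.
    """
    next_choice = {j: 0 for j in agent_pref}   # index of next arm to propose to
    held = {k: [] for k in arm_rank}           # agents tentatively held by arm k
    free = list(agent_pref)
    while free:
        j = free.pop()
        if next_choice[j] >= len(agent_pref[j]):
            continue                           # list exhausted: stays unmatched
        k = agent_pref[j][next_choice[j]]
        next_choice[j] += 1
        held[k].append(j)
        held[k].sort(key=arm_rank[k].index)    # most preferred agents first
        if len(held[k]) > capacity[k]:
            free.append(held[k].pop())         # reject the least preferred one
    return {j: k for k, agents in held.items() for j in agents}

# Hypothetical market: every agent prefers arm 0, which has capacity 2.
gs_matching = deferred_acceptance(
    agent_pref={0: [0, 1], 1: [0, 1], 2: [0, 1]},
    arm_rank={0: [0, 1, 2], 1: [0, 1, 2]},
    capacity={0: 2, 1: 1},
)
```

Here arm 0 keeps its two most preferred agents and the rejected agent 2 moves on to arm 1.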

4.1.2. UNIQUE STABLE CONDITIONS IN MANY-TO-ONE MATCHING

Uniqueness consistency (Unqc) makes the stable matching robust, which is a desirable property in large dynamic markets with constant individual departures Basu et al. (2021). A precondition of Unqc is global unique stability, hence finding uniqueness conditions is essential. The existing uniqueness conditions are well established in the one-to-one setting (analysis can be found in Appendix C). In this section, we focus on the uniqueness conditions in the many-to-one market, such as SPC Reny (2021), Aligned Preference, Serial Dictatorship Top-top match and Acyclicity Niederle & Yariv (2009); Akahoshi (2014); Reny (2021) (Definitions 9, 7, 8, 10 in Appendix C.2). Akahoshi (2014) proposes a necessary and sufficient condition for a unique stable matching in many-to-one matching where unacceptable agents and arms may exist on both sides. We denote this condition as Acyclicity*. Under our setting, all participants on both sides are acceptable, and we first prove that Acyclicity* is a necessary and sufficient condition for uniqueness in this setting (Section C.2.4). We then give the relationships between our new α-condition and other existing uniqueness conditions, illustrated in Figure 1, with proofs in Appendix C.2.

Lemma 1. In a many-to-one matching market $\mathcal{M} = (\mathcal{K}, \mathcal{J}, \mathcal{P})$, both Serial Dictatorship and Aligned Preference produce a unique stable matching, and they are equivalent.

Theorem 2. In a many-to-one matching market $\mathcal{M} = (\mathcal{K}, \mathcal{J}, \mathcal{P})$, our α-condition satisfies: (i) SPC is a sufficient condition for the α-condition; (ii) the α-condition is a necessary and sufficient condition for Unqc; (iii) the α-condition is a sufficient but not necessary condition for Acyclicity*.

4.2. THEORETICAL RESULTS OF REGRET

We then provide theoretical results for the MO-UCB-D4 algorithm under our α-condition. Recall that $G^*_j$ is the set of globally dominated arms for agent j under the stable matching $m^*$. For each arm $k \notin G^*_j$, we define the blocking agents for arm k and agent j as $B_{jk} = \{j' : j' \succ_k j,\ k \notin G^*_j\}$, which contains the agents more preferred by arm k than j. The hidden arms for agent j are $H_j = \{k : k \notin G^*_j\} \cap \{k : B_{jk} \neq \emptyset\}$. The reward gap for agent j and arm k is defined as $\Delta_{jk} = |\mu_{j,m^*(j)} - \mu_{j,k}|$, and the minimum reward gap across all arms and agents is $\Delta = \min_{j \in [N]} \{\min_{k \in [K]} \Delta_{jk}\}$. We assume that the reward is different for each agent, thus $\Delta_{jk} > 0$ for every agent j and arm k.

Theorem 3. (Regret upper bound) Let $J_{\max}(j) = \max\{j + 1, \{j' : \exists k \in H_j,\ j' \in B_{jk}\}\}$ be the max blocking agent for agent j, and let $f_\alpha(j) = j + lr_{\max}(j)$ be a fixed factor that depends on both the left order and the right order for agent j. Following the MO-UCB-D4 algorithm with horizon T, the expected regret of a stable matching under the α-condition (Definition 2) for agent $j \in [N]$ is upper bounded by

$$\begin{aligned}
\mathbb{E}[R_j(T)] \le{}& \sum_{k \notin G^*_j \cup m^*(j)} \frac{8\alpha}{\Delta_{jk}}\left(\log(T) + \sqrt{\frac{\pi}{\alpha}\log(T)}\right) + \sum_{k \notin G^*_j}\ \sum_{j' \in B_{jk}:\, k \notin G^*_{j'}} \frac{8\alpha\,\mu_{j,m^*(j)}}{\Delta^2_{j'k}}\left(\log(T) + \sqrt{\frac{\pi}{\alpha}\log(T)}\right) \\
&+ c_j \log_2(T) + O\left(\frac{N^2K^2}{\Delta^2} + \big(\min(1, \theta|H_j|)\, f_\alpha(J_{\max}(j)) + f_\alpha(j) - 1\big)\, 2^{i^*} + N^2 K i^*\right),
\end{aligned}$$

where $i^* = \max\{8, i_1, i_2\}$ ($i_1, i_2$ are defined in equation 3), and $lr_{\max}(j) = \max\{lr(j') : 1 \le j' \le j\}$ is the maximum right order mapping for agents j′ who rank higher than j. From Theorem 3, the scale of the regret upper bound under the α-condition is $O\left(\frac{NK\log(T)}{\Delta^2}\right)$.

Proof Sketch of Theorem 3. The main idea of the proof is to show how agents settle down to their stable matched arms inductively. Agent 1 finds its stable matched arm 1 first, since arm 1 is the most preferred arm for agent 1. The same is true for agent 2 to agent $q_1$.
When they have all settled down with arm 1, agent $q_1 + 1$ finds its stable arm 2, since agent $q_1 + 1$ has deleted arm 1 in the communication block and thus arm 2 becomes its most preferred arm. We can show by induction that agent j finds its stable matched arm after agents 1 to j − 1 have settled down. The regret of agent j can be decomposed into four parts: sub-optimal play, collision, communication, and local deletion. Both collisions between agent j and other agents in the blocking agent set and sub-optimal play are due to wrong estimation of the UCB index (Lemma 6). Communication regret is bounded by the length of the communication block. Local deletion regret is controlled by the threshold we set (line 6 in Algorithm 1). The regret bound is decomposed as follows, and the complete proof can be seen in Section 3.

Lemma 2. (Regret Decomposition) For a stable matching under the α-condition, the regret of agent $j \in [N]$ under our algorithm can be upper bounded as

$$\begin{aligned}
\mathbb{E}[R_j(T)] \le{}& \mathbb{E}\left[S_{F_{\alpha j}}\right] && \text{(Regret before phase } F_{\alpha j}\text{)} \\
&+ \min(\theta|H_j|, 1)\, \mathbb{E}\left[S_{V_{\alpha j}}\right] && \text{(Local deletion)} \\
&+ \left(K - 1 + |B_{j,m^*(j)}|\right)\log_2(T) + NK\, \mathbb{E}\left[V_{\alpha j}\right] && \text{(Communication)} \\
&+ \sum_{k \notin G^*_j}\ \sum_{j' \in B_{j,k}:\, k \notin G^*_{j'}} \frac{8\alpha\,\mu_{j,m^*(j)}}{\Delta^2_{j',k}}\left(\log(T) + \sqrt{\frac{\pi}{\alpha}\log(T)}\right) && \text{(Collision)} \\
&+ \sum_{k \notin G^*_j \cup m^*(j)} \frac{8\alpha}{\Delta_{j,k}}\left(\log(T) + \sqrt{\frac{\pi}{\alpha}\log(T)}\right) && \text{(Sub-optimal play)} \\
&+ NK\left(1 + (\phi(\alpha) + 1)\frac{8\alpha}{\Delta^2}\right),
\end{aligned}$$

where $F_{\alpha j}$ and $V_{\alpha j}$ are the time points when agent j enters the α-Good phase and the α-Low Collision phase respectively, as defined in Appendix B.2.

5. DIFFICULTIES AND SOLUTIONS

From one-to-one setting to many-to-one setting. First, arm preference is difficult to learn in a decentralized many-to-one setting. Because of capacity, when two agents select one arm at the same time in the communication block, the arm can accept both, so the two agents cannot tell which of them the arm prefers, whereas this can be determined in one-to-one markets. Identifying arm preferences is therefore more challenging for each agent, which in turn affects the total regret. To solve this, we introduce the dominated arm set $G^*_j$ into the communication block to identify the arms that are preferred by agents ranked higher than agent j. The arm set $G^*_j$ is one of the main obstacles that prevent agent j from forming a stable matching, and it is deleted before each phase to reduce collisions.

Second, moving from one-to-one to many-to-one is a transition from individuals to sets. It is natural to split sets into individuals or to map sets to individuals. Although we assume that arm preference is over individuals Roth & Sotomayor (1992); Sethuraman et al. (2006); Altinok (2019), the agents matched by one arm are not independent. In particular, an arm with capacity q cannot simply be replaced by q independent individuals with the same preference, since there would be implicit competition among the different replicas of one arm, and the arm can reject previously accepted agents when it faces a more preferred agent. In addition, because of capacity, the matching result for each arm k is a set rather than an individual. To describe a uniqueness condition, we need to give a threshold for the range of the stable matched agent set. The lr in Basu et al. (2021) is a one-to-one mapping between the agent index in the left order and the agent index in the right order, which is related to the regret bound (Theorem 3 in Basu et al. (2021) and Theorem 3 in our work). However, it does not hold in our setting.
We construct a new mapping lr (Figure 4 in Appendix B) that connects the indices of agents in the two orders in the many-to-one setting. lr maps each arm k to the least preferred of its stable matched agents in the right order, thus giving a mapping between individuals and individuals.

From α-condition to α-condition. In general markets, preferences are difficult to learn when one arm can accommodate multiple agents. We consider the market with a uniqueness condition for two reasons. First, equilibrium plays an important role in the fairness and stability of matching problems. Second, to reduce the conflicts among agents, we adopt an arm deletion idea, and Unqc (Definition 1) ensures that the deletion does not affect the stable matching. Our work extends the α-condition to the many-to-one setting, which requires defining preferences among sets. However, there may be an exponential number of sets due to the combinatorial structure, and simply constraining preferences over all possible sets would lead to high complexity. Motivated by the α-condition, which characterizes properties of matched pairs in the one-to-one setting, we propose a constraint that regards an arm and the least preferred agent in its matched set as the matched pair, and define preferences according to this grouping. It turns out that we only need to define arm preferences over disjoint agent sets to complete this extension, because the α-condition is defined under the stable matching; this also fits the regret analysis well. This α-condition induces a hierarchy in the matching market, which by induction reduces the regret bound from the collision block to the number of matchings with sub-optimal arms, so that the regret reaches the lower bound in terms of the time horizon T and the reward gap ∆ (Appendix D) for matching problems with bandit algorithms Sankararaman et al. (2021).
In summary, there may be other ways to extend the α-condition, but we present a successful attempt that not only gives an extension with similar inclusion relationships but also guarantees a good regret bound.

6. EXPERIMENTS

In this section, we evaluate our MO-UCB-D4 algorithm (Algorithm 1) experimentally in decentralized many-to-one matching markets. For all experiments, the rankings of all agents and arms are sampled uniformly. For each agent, we set the reward of the least preferred arm to 1/N and that of the most preferred arm to 1, so the reward gap between any adjacently ranked arms is ∆ = 1/N. The reward $X_{j,k}(t)$ obtained when agent j matches with arm k at time t is sampled from $\mathrm{Ber}(\mu_{j,k})$. The capacity is set equally as q = N/K. We investigate how the cumulative regret and the cumulative market unstability depend on the size of the market and the number of arms under three different uniqueness conditions: Serial Dictatorship, SPC, and the α-condition. The cumulative regret is the total mean reward gap between the stable matching result and the simulated result, and the cumulative unstability is defined as the number of unstable matchings in round t. In our experiments, all results are averaged over 10 independent runs, hence the error bars are calculated as standard deviations divided by $\sqrt{10}$.

Varying the market size. To test the effects on cumulative regret and cumulative unstability, we first vary N with K fixed, using market sizes of N ∈ {10, 20, 30, 40} agents and K = 5 arms. The number of rounds is set to 100,000. The cumulative regrets in Figure 2 (a)(c)(e) show an increasing trend with convergence as the number of agents increases under these three conditions. When the number of agents increases, there is a high probability of collisions among agents, resulting in an increase of

Varying arm capacity. The number of arms is chosen as K ∈ {2, 5, 10, 20}, with N = 20 and q = N/K. The number of rounds is set to 400,000. As K increases, both the cumulative regret in Figure 3 (a)(c)(e) and the cumulative unstability in Figure 3 (b)(d)(f) increase monotonically.
When K increases, the capacity $q_k$ of each arm k decreases, and the number of collisions increases, which leads to an increase in cumulative regret. It also leads to more unstable pairs, which require more communication blocks to converge to a stable matching. Under these three conditions, the performance of the algorithm is similar.
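The error-bar computation mentioned in this section (standard deviation over runs divided by $\sqrt{10}$) can be written as a short Python sketch. The function name `standard_error` and the sample values are hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics
from typing import Sequence

def standard_error(values: Sequence[float]) -> float:
    """Standard deviation across independent runs divided by sqrt(number of runs)."""
    # Assumption: sample standard deviation (statistics.stdev uses n - 1).
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical final cumulative regrets from 10 independent runs.
runs = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0]
error_bar = standard_error(runs)
```

With 10 runs, the denominator is $\sqrt{10}$ as stated in the text.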

7. CONCLUSIONS

We are the first to study bandit algorithms for the many-to-one matching market under a unique stable matching. This work focuses on a decentralized market. A new α-condition is proposed to guarantee a unique stable outcome in the many-to-one market; it is more general than existing uniqueness conditions such as SPC and Serial Dictatorship, and recovers the usual α-condition in the one-to-one setting. We propose a phase-based algorithm, MO-UCB-D4, with arm elimination, which obtains $O\left(\frac{NK\log(T)}{\Delta^2}\right)$ stable regret under the α-condition. By carefully defining a mapping from each arm to the least preferred agent in its stable matched set, we can effectively relate arms and agents individual-to-individual. A series of experiments under two environments, varying the market size and varying the arm capacity, is conducted. The results show that our algorithm performs well under Serial Dictatorship, SPC and the α-condition.

A RELATED WORKS

The study of matching markets has a long history in economics and operations research Bogomolnaia & Moulin (2001); Bade (2020); Roth & Sotomayor (1992), with real applications such as school enrollment, labor employment, and hospital resource allocation Abizada (2016); Ma (2010); Roth (1986); Hatfield et al. (2014). A salient feature of market matching is making decisions for competing players on both sides Thompson (1933); Gale & Shapley (1962). MAB is an important tool for studying matching problems under uncertainty with the goal of maximizing reward, and the upper confidence bound (UCB) algorithm Auer et al. (2002) is a typical algorithm, which uses a confidence interval to represent uncertainty.

This paper contributes mainly to the intersection of the MAB and two-sided matching market literatures, and we review recent works in this direction. After Das & Kamenica (2005) proposed applying MAB to preference learning, the study of learning in uncertain matching systems inspired the design of online platforms, and a series of algorithm designs followed Liu et ). Following this, a more general decentralized market was studied with the traditional UCB algorithm, obtaining an $O\left(\frac{\exp(N^4) N^5 K^2 \log^2(T)}{\Delta}\right)$ regret by setting a delay parameter to reduce collisions among agents. By restricting preferences, one can obtain algorithms that converge better or that can learn information about unknown preferences. Under the Serial Dictatorship condition, Sankararaman et al. (2021) proposed a phased UCB algorithm with global communication for decentralized markets with nonlocal information. As the Serial Dictatorship condition is too strong, a weaker Uniqueness Consistency condition is applied in this online data-driven market Basu et al. (2021). Under these conditions on preferences, the regret bound in decentralized matching is reduced to $O(NK\log(T))$. However, these valuable articles focus on one-to-one matching, in which one arm can accept only one agent as its stable pair. Motivated by these works, we extend them not only to the many-to-one setting, but also to a weaker uniqueness condition that is first introduced in this work.

In terms of uniqueness conditions, a number of works have proposed descriptive conditions in the one-to-one setting, such as Serial Dictatorship Sankararaman et al. (2021), the No Crossing Condition (NCC) Clark (2006), the Sequential Preference Condition (SPC) Eeckhout (2000), and the α-condition Karpov (2019). However, few works have concentrated on the uniqueness of the stable matching in many-to-one markets. Some existing conditions are SPC Reny (2021), Aligned Preference, Serial Dictatorship, Top-top match and Acyclicity Niederle & Yariv (2009); Akahoshi (2014); Reny (2021), which are strong and therefore not widely applicable in algorithm design.

Research on many-to-one markets has attracted attention recently. Learning preferences and forming a stable matching are also key features in this setting Jagadeesan et al. (2021). Altinok (2019); Özkan & Ward (2020); Johari et al. (2021) studied dynamic many-to-one matching.
On the one hand, these works motivate ours; on the other hand, they suggest further directions for applying MAB to matching.

B.1 MAPPING UNDER α-CONDITION

To connect the two sides of the market, we define a mapping $lr$ from the agent index in the left order to the agent index in the right order under the α-condition,

$$lr(i) = \max\{j : A_j \in \gamma^*(m^*(i)),\ j \in [N]\},$$

since arms in the right order can select more than one agent. From Theorem 1, the stable matching for arm $k$ consists of its first $q_k$ preferred agents, $\gamma^*(c_k) = A_k$. Recall that the preference is strict. Denote the ranking of these first $q_k$ agents by $A^{(1)}_k \succ A^{(2)}_k \succ \cdots \succ A^{(q_k)}_k$. The rule of the mapping $lr$ in the right order is then as follows: the mapping for arm $k \in [K]$ is the least preferred agent among its $q_k$ most preferred agents, that is, $A_{lr(k)} = A^{(q_k)}_k$. An intuitive representation is given in Figure 4. If we assume that $c_{i_2} = c_1$, then the right order can be read from the figure and $lr(q_1 + 1) = \cdots = lr(q_1 + q_{c_1}) = q_{c_1}$ holds.

[Figure 4 diagram: agents in the left order ($1, 2, \ldots, q_1, q_1 + 1, \ldots, N$), arms ($c_{i_1}, c_{i_2}, \ldots, c_{i_K}$, also indexed $1, 2, \ldots, K$), and agents in the right order ($A_1 = A^{(1)}_1, \ldots, A_{q_{c_1}} = A^{(q_{c_1})}_1, A_{q_{c_1}+1} = A^{(1)}_2, \ldots, A_N = A^{(q_{c_K})}_K$).]
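The mapping $lr$ can be illustrated with a small sketch. It assumes one particular encoding that is not in the original text: both orders are given as lists of `(arm, agents)` blocks, agents are numbered consecutively within each order, and the hypothetical helper `lr_mapping` returns, for every left-order position, the largest right-order index among the agents matched to the same arm. The data below are the left and right orders stated for the second example of Table 1 in Appendix C.2.2.

```python
def lr_mapping(left_order, right_order):
    """Map each left-order agent position to the largest right-order index
    among the agents matched with the same stable arm.

    Both arguments are lists of (arm, [agents]) blocks (an assumed encoding).
    Positions are 1-indexed, following the notation in the text.
    """
    last_right_index = {}
    position = 0
    for arm, agents in right_order:
        position += len(agents)
        # The last agent of the block is the arm's least preferred matched agent.
        last_right_index[arm] = position

    lr = {}
    position = 0
    for arm, agents in left_order:
        for _ in agents:
            position += 1
            lr[position] = last_right_index[arm]
    return lr


# Left and right orders of the second example in Table 1 (capacities 2, 1, 2).
left = [("c1", ["s1", "s2"]), ("c2", ["s3"]), ("c3", ["s4", "s5"])]
right = [("c2", ["s3"]), ("c1", ["s1", "s2"]), ("c3", ["s4", "s5"])]
print(lr_mapping(left, right))
```

In this encoding every agent matched to the same arm receives the same value, which mirrors the identity $lr(q_1 + 1) = \cdots = lr(q_1 + q_{c_1}) = q_{c_1}$ stated above.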

Figure 4: The mapping from the left order to the right order (assume that $c_{i_2} = c_1$)

B.2 PROOF FOR REGRET ANALYSIS UNDER α-CONDITION

The proof idea is as follows. We construct phases with good properties and denote by $F_{\alpha j}$ the phase at which agent $j$ reaches its good phase. From phase $F_{\alpha j}$ onwards, agent $j + 1$ will find the globally dominated arm set $G^*_{j+1}$ and will eliminate arm $m^*(j)$ according to Algorithm 1. The process of each agent is therefore divided into two stages: before $F_{\alpha j}$ and after $F_{\alpha j}$. After $F_{\alpha j}$, the regret is divided into four blocks according to its cause: collision, local deletion, communication, and sub-optimal play. Phases before $F_{\alpha j}$ can be bounded by induction. We first give some notations and definitions.

Rank for Each Agent

Recall that if arm $k$ prefers agent $j$ over $j'$, we write $j \succ_k j'$. Under the α-condition, the stable matched arm $m^*(j)$ of agent $j$ is agent $j$'s most preferred arm among the remaining arms that still have vacant seats within their capacity. Denote by $\gamma^*(m^*(j))$ the set of agents matched with the stable matched arm of agent $j$.

Classification of arm sets

The dominated arm set $D_j = \{m^*(j') : j' \succ_{m^*(j')} j\}$ consists of the stable matched arms of the agents that these arms prefer to agent $j$, and the globally dominated arm set under the stable matching $m^*$ is $G^*_j$, a subset of $D_j$. Global deletion here follows the left order. Recall that $O_j[i]$ is the best arm for agent $j$ in phase $i$. In Algorithm 1, the estimated dominated arm set in phase $i$ is $D_j[i] = \{O_{j'}[i] : j' \succ_{O_{j'}[i]} j\}$, and the globally dominated arms in each phase $i$ satisfy $G_j[i] \subset D_j[i]$. For each arm $k \notin G^*_j$, we define the blocking agents for arm $k$ and agent $j$ as $B_{j,k} = \{j' : j' \succ_k j,\ k \notin G^*_j\}$, which contains the agents that arm $k$ prefers to $j$. The hidden arms for agent $j$ are $H_j = \{k : k \notin G^*_j\} \cap \{k : B_{j,k} \neq \emptyset\}$.

Under the SPC condition, the stable matched arm is also the best arm for each agent, and the agents that arm $k$ matches with are its $q_k$ most preferred agents. This follows directly from the definition of Top-top match. Under our α-condition, in contrast, the stable results may not be the best choices for the two sides. We therefore define a set $NTT(j)$, in which each arm is a stable matched arm of some other agent $A_{j'}$, is a sub-optimal arm for $j$, and prefers $j$ to its stable matched agents $\gamma^*(k)$. The set $NTT(j)$ can be understood as the "not Top-top match" stable results, and it can be expressed as

$$NTT(j) = \left\{k : k \in [K],\ \mu_{j,k} < \mu_{j,m^*(j)},\ \exists j' \notin \gamma^*(m^*(j)) \text{ s.t. } k = m^*(A_{j'}) \text{ and } j \succ_k \gamma^*(k)\right\},$$

where $j \succ_k \gamma^*(k)$ means that $k$ prefers $j$ to every agent in $\gamma^*(k)$.

Phases with Good Properties

In the decentralized market with limited information, estimating the preferences of other agents is challenging, so we set a communication block. For agent $j$, this block mainly serves to determine the dominated arms of the agents that rank higher than $j$, where the dominated arm of an agent is identified as the arm with which that agent has been matched most often.
Under our α-condition, the most preferred arm is not necessarily the stable matched arm; hence, if arms in $NTT(j)$ are matched with $j$ too many times, other agents cannot identify the preference of agent $j$. During periods in which matches with arms in $NTT(j)$ are limited, other agents can identify the preferences of $j$, which helps to reduce conflicts.

Definition 3. We say phase $i$ is a Warm-up Phase for some $j \in [N]$ under the α-condition if the following conditions hold for each arm $k \in NTT(j)$: (i) arm $k$ is matched with agent $j$ at most $\frac{10\alpha i}{\Delta^2_{j,k}}$ times in phase $i$, where $\alpha$ is a parameter of the UCB index (line 7 in Algorithm 1); (ii) arm $k$ is not agent $j$'s most matched arm in phase $i$.

Based on this definition, we introduce the Unlocked phase $U_j$, such that in all phases on and after it, agents $A_1$ to $A_j$ are all in warm-up phases. Let $i_1 = \min\left\{i : (N-1)\frac{10\alpha i}{\Delta^2} < \theta 2^{i-1}\right\}$, where $\Delta$ is the minimum reward gap, and let

$$\mathbb{1}_W[i, j] = \begin{cases} 1, & \text{phase } i \text{ is a warm-up phase for agent } j; \\ 0, & \text{otherwise.} \end{cases}$$

$$U_j = \max\left\{ i_1,\ \min\left( \left\{ i : \prod_{j'=1}^{lr(j)-1} \prod_{i' \ge i} \mathbb{1}_W[i', A_{j'}] = 1 \right\} \cup \{\infty\} \right) \right\}.$$

Definition 4. We say phase $i$ is an α-Good Phase for some $j \in [N]$ under the α-condition if the following are all satisfied: (i) The globally dominated arms for agent $j$ are globally deleted in phase $i$, that is, $G_j[i] = G^*_j$ holds. (ii) Phase $i$ is a warm-up phase for all agents in $L_j = \{j' : m^*(j) \in NTT(j')\}$. (iii) Each arm $k \notin G^*_j \cup \{m^*(j)\}$ (neither globally deleted nor the stable matched arm of agent $j$) is successfully matched with agent $j$ at most $\frac{10\alpha i}{\Delta^2_{j,k}}$ times in phase $i$. (iv) The stable matched arm $m^*(j)$ is the arm selected most often in phase $i$.

The α-Good Phase is defined so that, during such a phase, agent $j$ collides only with low probability. When agent $j$ selects an arm for which it competes with an agent that this arm prefers, it receives zero reward with high probability (w.h.p.), so condition (i) in Definition 4 is necessary for a lower regret.
Recall that the stable matched pair may not be the best pair for $j$; condition (ii) limits the arms in other agents' $NTT$ sets to avoid too many conflicts. Conditions (iii) and (iv) help other agents estimate the stable matching of agent $j$. Similarly, we define the α-Low Collision Phase as in Basu et al. (2021):

Definition 5. We say phase $i$ is an α-Low Collision Phase for agent $j$ under the α-condition if: (i) phase $i$ is an α-Good Phase for agents $1$ to $j$; (ii) phase $i$ is an α-Good Phase for every agent $j' \in \cup_{k \in H_j} B_{j,k}$.

Define

$$F_{\alpha j} = \max\left\{ i_1,\ \min\left( \left\{ i : \prod_{i' \ge i} \left( \prod_{j'=1}^{j-1} \mathbb{1}_{G\alpha}[i', j'] \right) \left( \prod_{j'' \in L_j} \mathbb{1}_W[i', j''] \right) = 1 \right\} \cup \{\infty\} \right) \right\},$$

and

$$V_{\alpha j} = \max\left\{ i_1,\ \min\left( \left\{ i : \prod_{i' \ge i} \mathbb{1}_{LC\alpha}[i', j] = 1 \right\} \cup \{\infty\} \right) \right\},$$

where $\mathbb{1}_{LC\alpha}[i, j]$ and $\mathbb{1}_{G\alpha}[i, j]$ are defined analogously to $\mathbb{1}_W[i, j]$. That is, $\mathbb{1}_W[i, j]$, $\mathbb{1}_{LC\alpha}[i, j]$ and $\mathbb{1}_{G\alpha}[i, j]$ indicate whether phase $i$ is a warm-up phase, an α-low collision phase and an α-good phase for agent $j$, respectively. Hence, all phases on and after phase $F_{\alpha j}$ are α-Good Phases and all phases after phase $V_{\alpha j}$ are α-Low Collision Phases for agent $j$. Before we give the complete proof of the regret bound in Theorem 3, we state some propositions.

Proposition 1. The stable matched arm $m^*(j)$ of agent $j$ can be blocked by agents in $L_j$, where $L_j = \{j' : m^*(j) \in NTT(j')\}$.

Proof. Assume that we have the stable matching $m^*$. Suppose, for contradiction, that $j \succ_{m^*(j')} j'$ but $\mu_{j,m^*(j)} < \mu_{j,m^*(j')}$. Then $(j, m^*(j'))$ forms a blocking pair, since each prefers the other to its matched partner while they are unmatched, which contradicts the stability of $m^*$. So, if $j \succ_{m^*(j')} j'$, then $\mu_{j,m^*(j)} > \mu_{j,m^*(j')}$ under the stable matching. By the same argument with the roles of $j$ and $j'$ exchanged, if $j' \succ_{m^*(j)} j$, then $\mu_{j',m^*(j')} > \mu_{j',m^*(j)}$, and hence $m^*(j) \in NTT(j')$.

Proposition 1 tells us that $m^*(j)$ can be blocked only by agents in $L_j$, and the next proposition gives the range of $L_j$.

Proposition 2. For each agent $j \in [N]$, $L_j \subseteq \cup_{j'=1}^{lr(j)-1} A_{j'}$.

Proof. Under the α-condition, for all $k < j \le K$, $c_k \in [K]_r$, $A_j \in [N]_r$, we have $\gamma^*(c_k) \succ_{c_k} A_j$. By Theorem 1, $\gamma^*(c_k) = A_k$. Therefore, for all $j, j' \in [N]$ with $j < j'$, $A_j \succ_{m^*(A_j)} A_{j'}$. In particular, for any $j' > lr(j)$, we have $j = A_{lr(j)} \succ_{m^*(j)} A_{j'}$. This implies that for all $j' \ge lr(j)$ we cannot have $j' \succ_{m^*(j)} j$, hence $m^*(j) \notin NTT(j')$, that is, $j' \notin L_j$ for all $j' \ge lr(j)$. Then $L_j \subseteq \cup_{j'=1}^{lr(j)-1} A_{j'}$.

Proposition 3.
For each agent $j \in [N]$, $F_{\alpha j} \le \max\left\{U_{lr(j)-1},\ \max(F_{\alpha j'} : 1 \le j' \le j-1)\right\}$ holds with probability 1.

Proof. By the definition of $U_j$, on and after phase $U_{lr(j)-1}$, all agents $\{A_{j'} : j' = 1, 2, \cdots, lr(j)-1\}$ are in warm-up phases. By Proposition 2, the set of deadlock agents satisfies $L_j \subseteq \cup_{j'=1}^{lr(j)-1} A_{j'}$. Hence, all agents in $L_j$ are also in warm-up phases on and after $U_{lr(j)-1}$. Further, agents $1$ to $(j-1)$ are in α-good phases from phase $\max\{F_{\alpha j'} : 1 \le j' \le j-1\}$ onwards. Then the proposition holds w.p. 1.

The event decomposition for the regret minimization block in Lemma 6 requires that $m^*(j)$ always remains available, that is, it is never deleted. It is therefore important to find conditions, or a phase with good properties, guaranteeing that $m^*(j)$ will be neither globally nor locally deleted. The next lemma gives this theoretical guarantee.

Lemma 3. Let $i_1 = \min\left\{i : (N-1)\frac{10\alpha i}{\Delta^2} < \theta 2^{i-1}\right\}$. For any phase $i \ge i_1$ and any agent $j \in [N]$, the following properties hold. (a) If phases $i$ and $(i-1)$ are warm-up phases for all $j' \in L_j$, then $m^*(j)$ will not be globally deleted or locally deleted almost surely, i.e. $m^*(j) \notin L_j[i] \cup G_j[i]$. (b) If $i \ge \min\{U_{lr(j)-1}, F_{\alpha j}\} + 1$, then $m^*(j) \notin L_j[i] \cup G_j[i]$ a.s. (c) If phase $i \ge V_{\alpha j} + 1$ is a low collision phase for agent $j$, then $L_j[i] = \emptyset$ a.s.

Proof. (i) By Proposition 1, all agents $j'$ that can block arm $m^*(j)$ are in $L_j$, and $m^*(j) \in NTT(j')$ for any agent $j' \in L_j$ by the definition of $L_j$. Therefore, if all agents in $L_j$ are in a warm-up phase in phase $(i-1)$, then by the definition of the warm-up phase for agent $j'$ and $m^*(j) \in NTT(j')$, the arm $m^*(j)$ is not agent $j'$'s most matched arm. Hence, $m^*(j) \notin G_j[i]$. Furthermore, the total number of times the arm $m^*(j)$ can be deleted is at most $\sum_{i=1}^{lr(j)-1} (q_i - 1)\frac{10\alpha i}{\Delta^2_{j,k}}$ for any $i \ge i_1$, which is less than the local deletion threshold.
So $m^*(j) \notin L_j[i] \cup G_j[i]$ after phase $i_1$. (ii) (a) $L_j \subseteq \cup_{j'=1}^{lr(j)-1} A_{j'}$ holds by Proposition 2, which implies that every phase $i \ge U_{lr(j)-1} + 1$ (i.e. $i - 1 \ge U_{lr(j)-1}$) is a warm-up phase for all agents in $L_j = \{j' : m^*(j) \in NTT(j')\}$. (b) By the definition of $F_{\alpha j}$, all agents in $L_j = \{j' : m^*(j) \in NTT(j')\}$ are in warm-up phases for every phase $i \ge F_{\alpha j} + 1$. Combining (a), (b) and (i) shows that (ii) holds. (iii) This can be easily checked from the definition of $V_{\alpha j}$.
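To make the threshold phase $i_1$ from Lemma 3 concrete, the following sketch searches for the smallest $i$ satisfying $(N-1)\frac{10\alpha i}{\Delta^2} < \theta 2^{i-1}$. The function name `first_valid_phase` and the search cap `max_phase` are hypothetical conveniences, and the parameter values in the example are illustrative choices rather than settings taken from the paper.

```python
def first_valid_phase(n_agents: int, alpha: float, delta: float, theta: float,
                      max_phase: int = 200) -> int:
    """Smallest phase i with (N - 1) * 10 * alpha * i / delta^2 < theta * 2^(i - 1).

    max_phase only bounds the search; it is not part of the definition.
    """
    for i in range(1, max_phase + 1):
        collisions_allowed_by_others = (n_agents - 1) * 10 * alpha * i / delta ** 2
        local_deletion_budget = theta * 2 ** (i - 1)
        if collisions_allowed_by_others < local_deletion_budget:
            return i
    raise ValueError("no phase below max_phase satisfies the inequality")


# Illustrative values only: N = 2 agents, alpha = 1, Delta = 1, theta = 0.5.
print(first_valid_phase(n_agents=2, alpha=1.0, delta=1.0, theta=0.5))
```

Because the left-hand side grows linearly in $i$ while the right-hand side doubles with every phase, such an $i_1$ always exists; a smaller gap $\Delta$ or a larger market pushes it to later phases.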

B.3 PROOF FOR THEOREM 3

After defining $F_{\alpha j}$ and $V_{\alpha j}$, we divide the whole process into two main parts: the process before phase $F_{\alpha j}$ and the process after $F_{\alpha j}$. We denote by $S_i$ the starting time of phase $i$. The regret during the time period $[S_{F_{\alpha j}}, T]$ can be decomposed into four blocks: the Local Deletion Block, the Communication Block, the Collision Block and the Sub-optimal Block. The regret during the time period $[0, S_{F_{\alpha j}}]$ can be bounded by induction on $j$ (Lemma 7).

Local Deletion Block. Lemma 3 implies that there is no collision after phase $V_{\alpha j}$, so we only need to consider the regret from $F_{\alpha j} + 1$ to $V_{\alpha j}$. Following our algorithm, there are at most $\theta 2^{i-1}$ collisions when pulling an arm from the set $H_j$ in each phase. This amounts to

$$\sum_{i=F_{\alpha j}+1}^{V_{\alpha j}} \sum_{k \in H_j} \theta \cdot 2^{i-1} \le \sum_{i=F_{\alpha j}+1}^{V_{\alpha j}} \theta |H_j| \cdot 2^{i-1} < \frac{1 - 2^{V_{\alpha j}-1}}{1 - 2}\,\theta |H_j| = \left(2^{V_{\alpha j}-1} - 1\right)\theta |H_j| = S_{V_{\alpha j}} \cdot \theta |H_j| \le \min(\theta |H_j|, 1) \cdot S_{V_{\alpha j}}.$$

Communication Block. The communication block consists of $N$ sub-blocks, each of duration $K$. Agent $j$ pulls arm $1$, arm $2$, $\cdots$, arm $K$ in order in the $j$-th sub-block and pulls $O_j[i]$ in the other sub-blocks, where $O_j[i]$ is the arm with which it was matched most often in the regret minimization block of phase $i$. In each communication block after phase $F_{\alpha j} + 1$, the best arm for agent $j$ is played in all but $(K-1)$ steps, and every other agent $j'$ collides at most once after phase $V_{\alpha j}$ (since each of them has entered its good phase). Hence, the regret from the communication block is $(K - 1 + |B_{j,m^*(j)}|)\log_2(T) + NK\,\mathbb{E}[V_{\alpha j}]$.

Collision Block. The regret caused by collisions from phase $F_{\alpha j} + 1$ to $V_{\alpha j}$ has already been included in the previous communication block (the bound for the period between $F_{\alpha j} + 1$ and $V_{\alpha j}$ is relatively loose), so we only consider the regret after phase $V_{\alpha j}$. After phase $V_{\alpha j} + 1$, regret comes from the collisions between agent $j$ and the agents in the set $B_{j,k}$.
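By the definition of $V_{\alpha j}$, agent $j$ and every agent $j' \in B_{j,k}$ have deleted their dominated arms, which leads to

$$\sum_{k \notin G^*_j}\ \sum_{j' \in B_{j,k} : k \notin G^*_{j'}} \mu_{j,m^*(j)} \left( N_{j',k}(T) - N_{j',k}(S_{V_{\alpha j}}) \right).$$

By Lemma 6, the number of matchings with sub-optimal arms can be bounded, and the main source of regret is of order $O\left(\frac{NK\log(T)}{\Delta^2}\right)$.

Proof. By Lemma 3, $m^*(j)$ will not be globally deleted or locally deleted in any phase $i \ge F_{\alpha j} + 1$. Denote by $I_j(t)$ the arm that agent $j$ pulls at time $t$. After phase $F_{\alpha j}$, agent $j$ pulls arm $k$ rather than $m^*(j)$ for one of the following reasons: (1) the UCB index of the optimal arm $m^*(j)$ is less than $\mu_{j,m^*(j)} - \epsilon$; (2) $I_j(t) = k$ and its UCB index is larger than $\mu_{j,m^*(j)} - \epsilon$. For any $k \notin G^*_j \cup \{m^*(j)\}$ and $\epsilon > 0$,

$$N_{j,k}(T) - N_{j,k}(S_{F_{\alpha j}}) = \sum_{t=S_{F_{\alpha j}}+1}^{T} \mathbb{1}\{I_j(t) = k\} \le \sum_{t=S_{F_{\alpha j}}+1}^{T} \Big( \underbrace{\mathbb{1}\{(u_{j,k}(t) \ge \mu_{j,m^*(j)} - \epsilon) \wedge (I_j(t) = k)\}}_{(a)} + \underbrace{\mathbb{1}\{u_{j,m^*(j)}(t) \le \mu_{j,m^*(j)} - \epsilon\}}_{(b)} \Big).$$

First, we bound (a):

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=S_{F_{\alpha j}}+1}^{T} \mathbb{1}\{(u_{j,k}(t) \ge \mu_{j,m^*(j)} - \epsilon) \wedge (I_j(t) = k)\}\right]
&\le \mathbb{E}\left[\sum_{t=S_{F_{\alpha j}}+1}^{T} \mathbb{1}\left\{\left(\hat{\mu}_{j,k}(t-1) + \sqrt{\frac{2\alpha\log(t)}{N_{j,k}(t-1)}} \ge \mu_{j,m^*(j)} - \epsilon\right) \wedge (I_j(t) = k)\right\}\right] \\
&\le \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\left\{\left(\hat{\mu}_{j,k}(t-1) + \sqrt{\frac{2\alpha\log(T)}{N_{j,k}(t-1)}} \ge \mu_{j,m^*(j)} - \epsilon\right) \wedge (I_j(t) = k)\right\}\right] \\
&\le \mathbb{E}\left[\sum_{s=1}^{T} \mathbb{1}\left\{\hat{\mu}_{j,k}(s) + \sqrt{\frac{2\alpha\log(T)}{s}} \ge \mu_{j,k} + \Delta_{j,k} - \epsilon\right\}\right] \\
&\le 1 + \frac{2}{(\Delta_{j,k} - \epsilon)^2}\left(\alpha\log(T) + \sqrt{\alpha\pi\log(T)} + 1\right).
\end{aligned}$$

Then we bound (b), writing $\hat{\mu}_{j,m^*(j),s}$ for the empirical mean of arm $m^*(j)$ after $s$ matches:

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=S_{F_{\alpha j}}+1}^{T} \mathbb{1}\{u_{j,m^*(j)}(t) \le \mu_{j,m^*(j)} - \epsilon\}\right]
&\le \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{u_{j,m^*(j)}(t) \le \mu_{j,m^*(j)} - \epsilon\}\right] \\
&\le \sum_{t=1}^{T}\sum_{s=1}^{T} \mathbb{P}\left(\hat{\mu}_{j,m^*(j),s} + \sqrt{\frac{2\alpha\log(t)}{s}} \le \mu_{j,m^*(j)} - \epsilon\right) \\
&\le \sum_{t=1}^{T}\sum_{s=1}^{T} \exp\left(-\frac{s}{2}\left(\sqrt{\frac{2\alpha\log(t)}{s}} + \epsilon\right)^2\right) \\
&\le \sum_{t=1}^{T} t^{-\alpha} \sum_{s=1}^{T} \exp\left(-\frac{s\epsilon^2}{2}\right) \le \psi(\alpha)\frac{2}{\epsilon^2}.
\end{aligned}$$

Choosing $\epsilon = \frac{\Delta_{j,k}}{2}$, we have

$$\mathbb{E}\left[N_{j,k}(T) - N_{j,k}(S_{F_{\alpha j}})\right] \le \psi(\alpha)\frac{8}{\Delta^2_{j,k}} + 1 + \frac{8}{\Delta^2_{j,k}}\left(\alpha\log(T) + \sqrt{\alpha\pi\log(T)} + 1\right).$$

We define $lr_{\max}(j) = \max\{lr(j') : 1 \le j' \le j\}$ and $\tilde{F}_j = \max\left\{U_{lr_{\max}(j)-1},\ \max(\tilde{F}_{j'} : 1 \le j' \le j-1)\right\}$, so that $\tilde{F}_j > F_{\alpha j}$.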
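The index $u_{j,k}(t)$ that appears in terms (a) and (b) of the preceding proof is the empirical mean plus an exploration bonus $\sqrt{2\alpha\log(t)/N_{j,k}(t-1)}$. The sketch below is only an illustration of that formula: the function name `ucb_index` is hypothetical, and returning infinity for an arm that has never been matched is an assumed convention, since Algorithm 1 is not reproduced in this appendix.

```python
import math


def ucb_index(mean_estimate: float, n_matches: int, t: int, alpha: float) -> float:
    """UCB index: empirical mean + sqrt(2 * alpha * log(t) / number of matches).

    Assumption: an arm with no successful match yet gets an infinite index,
    so that it is tried before any arm with a finite index.
    """
    if n_matches == 0:
        return math.inf
    return mean_estimate + math.sqrt(2.0 * alpha * math.log(t) / n_matches)


# The bonus shrinks as an arm is matched more often at a fixed time t.
for n in (1, 10, 100):
    print(n, round(ucb_index(0.5, n, t=1000, alpha=2.0), 3))
```

The shrinking bonus is what the bound on term (a) exploits: once a sub-optimal arm has been matched on the order of $\alpha\log(T)/\Delta^2_{j,k}$ times, its index rarely exceeds $\mu_{j,m^*(j)} - \epsilon$.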
Then we introduce a lemma to bound the probability that a phase i is not an α-Good phase when i ≥ F αj + 1. Lemma 7. For any j ∈ [N ] and m ≥ 1, the following hold with i * (i * = max{8, i 1 , i 2 }) E F m j ≤ 2i 1 + (lr max (j) + j -2) (i * ) m + K(1 + 64 ∆ 2 ) 2 -(α-1)(i * -2) (2 (α-1) -1) 2 , E 2 Fj ≤ 2i 1 + (lr max (j) + j -2) 2 i * + K(1 + 64 ∆ 2 ) 2 -(α-1)(i * -2) (2 (α-1) -1) 2 . The proof is the same as Basu et al. (2021) . Hence, the upper bound of E S Fαj is E S Fαj = E C(F αj -1) + 2 Fαj ≤ E C( Fj -1) + 2 Fj ≤ C(2i 1 -1) + C lr max (j) + j -2 i * + lr max (j) + j -2 2 i * + C + 1 lr max (j) + j -2 K 1 + 64 ∆ 2 -(α-1)(i * -2) (2 (α-1) -1) 2 , where C is a constant term. Then for formula with term E S Vαj , we can transform its upper bound to another term related to E S FJ max(j) since V αj = max F α(j+1) , ∪ k∈Hj ∪ j ′ ∈B jk Fαj ≤ max F(j+1) , ∪ k∈Hj ∪ j ′ ∈B jk F(j+1) = FJmax(j) . Hence, E S Vαj ≤ E S FJmax(j) . Lastly, the regret can be bounded by the decomposition of E S Fαj and phases after S Fαj with properties above, where phases on and after S Fαj contain local deletion, collision, communication, sub-optimal play blocks. E [R j (T )] ≤ E S Fαj + min(θ|H j |, 1)E S Vαj + (K -1 + |B j,m * (j) |) log 2 (T ) + N KE [V αj ] + k / ∈G * j j ′ ∈B j,k :k / ∈G * j ′ 8αµ kj * ∆ 2 j ′ k log(T ) + π α log(T ) + k / ∈G * j ∪m * (j) 8α ∆ j,k (log(T ) + π α log(T )) + N K 1 + (ϕ(α) + 1) 8α ∆ 2 ≤ k / ∈G * j j ′ ∈B j,k :k / ∈G * j ′ 8αµ kj * ∆ 2 j ′ ,k log(T ) + π α log(T ) + k / ∈G * j ∪m * (j) 8α ∆ j,k log(T ) + π α log(T ) + c j log 2 (T ) + O N 2 K 2 ∆ 2 min + min(1, θ|H j |)f α(J max (j)) + f α(j) -1 2 i * + N 2 Ki * . C PROOF FOR UNIQUE STABLE CONDITIONS C.1 UNIQUENESS CONDITIONS IN ONE-TO-ONE MATCHING. There are many existing conditions that guarantee the unique stable matching in one-to-one setting, like the Serial Dictatorship Sankararaman et al. 
(2021), the No Crossing Condition (NCC) Clark (2006), the Sequential Preference Condition (SPC) Eeckhout (2000), and the α-condition Karpov (2019). Previous works tell us that Top-top match and the SPC condition lead to a unique stable matching in both the one-to-one setting Niederle & Yariv (2009); Clark (2006) and the many-to-one setting Reny (2021). Niederle & Yariv (2009) use the Top-top match property instead of α-reducibility, with the same meaning, in the one-to-one setting. Serial Dictatorship in the one-to-one setting means that the arms are ranked heterogeneously by the agents, in an order of arm means that differs for each agent-arm pair, while the agents are ranked homogeneously across all arms, and vice versa. Following Romero-Medina & Triossi (2013); Niederle & Yariv (2009), Aligned Preference is equivalent to Serial Dictatorship in the marriage problem, as both are equivalent to the no-cycle property. NCC and Serial Dictatorship do not include each other, as can be seen in Clark (2006). Hence, the relationship can be represented intuitively in Figure 5.

In this section, we focus on conditions that guarantee a unique stable matching in the many-to-one setting, such as SPC Reny (2021), Aligned Preference, Serial Dictatorship, Top-top match and Acyclicity Niederle & Yariv (2009); Akahoshi (2014); Reny (2021), and give the proofs of the relationships among these uniqueness conditions.

Definition 6. (Aligned Preference.) In a many-to-one market $M = (\mathcal{K}, \mathcal{J}, \mathcal{P})$ with $\mathcal{K} = (k)_{k \in [K]}$ and $\mathcal{J} = (j)_{j \in [N]}$, if the preference profile $\mathcal{P}$ satisfies

$$\forall k \in \mathcal{K},\ j \succ_k j',\ \forall j < j' \quad (1.a)$$
$$\forall j \in \mathcal{J},\ k \succ_j k',\ \forall k < k' \quad (1.b)$$

then the market has aligned preference. The one-to-one setting has the same definition.

Definition 7. (Serial Dictatorship) We say that the system satisfies serial dictatorship if all arms (schools) have the same preference over agents (students) while agents' preferences are heterogeneous (or vice versa).

Definition 8.
(Top-top Match) A stable pair $(k, j)$ is a Top-top match for a sub-market $M' \in M$ if, for arm $k$, agent $j$ is the favorite candidate in $M'$, and vice versa.

Definition 9. (SPC) The SPC condition in the many-to-one setting Reny (2021) requires the existence of a sequence of agents $1, 2, \cdots, N$ in which each agent appears once, and a sequence of arms $1, 2, \cdots, K$ in which each arm appears once for each seat in its capacity, such that $k \succ_j k'$ for every $k' > k$ and $j \in [N]$ and, in addition, $j \succ_k j'$ for every $j' > j$ and $k \in [K]$.

Table 1: Preference Profiles

(a) Exm1: Companies
$c_1 : s_1 > s_2 > s_3 > s_4 > s_5$
$c_2 : s_2 > s_3 > s_4 > s_5 > s_1$
$c_3 : s_3 > s_4 > s_5 > s_1 > s_2$

(b) Exm1: Workers
$s_1 : c_1 > c_2 > c_3$
$s_2 : c_2 > c_3 > c_1$
$s_3 : c_3 > c_2 > c_1$
$s_4 : c_3 > c_1 > c_2$
$s_5 : c_2 > c_1 > c_3$

(c) Exm2: Companies
$c_1 : s_1 > s_2 > s_3 > s_4 > s_5$
$c_2 : s_3 > s_2 > s_1 > s_4 > s_5$
$c_3 : s_1 > s_5 > s_2 > s_4 > s_3$

(d) Exm2: Workers
$s_1 : c_1 > c_3 > c_2$
$s_2 : c_1 > c_2 > c_3$
$s_3 : c_2 > c_1 > c_3$
$s_4 : c_1 > c_2 > c_3$
$s_5 : c_3 > c_2 > c_1$

C.2.1 PROOF FOR LEMMA 1.

Proof. ⇒): Serial Dictatorship ⇒ Aligned Preference. To distinguish the symbols of agents and arms, we consider the arm set $\{c_k : k = 1, 2, \cdots, K\}$ and the agent set $\{s_j : j = 1, 2, \cdots, N\}$. If all arms have the same preference over individual agents, then there is no cycle in the preferences of the arms, i.e. there are no $T$, $s_0, s_1, \cdots, s_T$ and $c_0, c_1, \cdots, c_T$ such that $s_0 \succ_{c_0} s_T \succ_{c_T} s_{T-1} \cdots s_1 \succ_{c_1} s_0$. Otherwise, assume that such a cycle exists. Since all arms have the same preference, $\succ_{c_0} = \succ_{c_1}$. Then $s_0 \succ_{c_0} s_1$ and $s_1 \succ_{c_1} s_0$, hence $s_0 \succ_{c_0} s_1$ and $s_1 \succ_{c_0} s_0$, which yields a contradiction. Now we prove that the no-cycle property implies Aligned Preference.
By contradiction, if there exists a $c_l$ such that $s_k \succ_{c_l} s_j$ for some $k > j$, then we can construct a cycle: $s_k \succ_{c_l} s_j \succ_{c_j} s_{j-1} \cdots s_{k-2} \succ_{c_{k-1}} s_{k-1} \succ_{c_k} s_k$.

⇐):

Aligned Preference ⇒ Serial Dictatorship. We first show that aligned preference leads to the no-cycle property. By contradiction, suppose there is a cycle $s_1 \succ_{c_1} s_T \succ_{c_T} s_{T-1} \cdots s_2 \succ_{c_2} s_1$ for some $s_1, s_2, \cdots, s_T$, $c_1, c_2, \cdots, c_T$ and $T$. This yields $s_1 \succ_{c_1} s_T$ with $T > 1$, which contradicts the aligned principle. In particular, there is no cycle of length two, which implies that all colleges have the same preferences because all students are acceptable to every college; this induces the group serial dictatorship property.

C.2.2 PROOF FOR THEOREM 2.

(i) Proof for the relationship between SPC and α-condition

SPC states that after eliminating all Top-top matches, there is at least one new Top-top match in the remaining system under the restricted preference profile. It therefore satisfies the α-condition naturally. However, the examples below show that the α-condition does not imply SPC. We give two examples to illustrate this relationship, where the order in which agents successfully match with their stable pairs corresponds to the left order and the right order.

Example. Consider a market with three companies and five workers. Assume that the preference profiles of companies $c_1, c_2, c_3$ and workers $s_1, s_2, s_3, s_4, s_5$ are as given in Table 1 and that the capacities of $c_1, c_2, c_3$ are $2, 1, 2$ respectively. The preferences in Table 1(a)(b) satisfy both SPC and the α-condition with valid order $\{(c_2, s_2), (c_3, s_3, s_4), (c_1, s_1, s_5)\}$. The preferences in Table 1(c)(d) satisfy only the α-condition, with valid left order $\{(c_1, s_1, s_2), (c_2, s_3), (c_3, s_4, s_5)\}$ and right order $\{(c_2, s_3), (c_1, s_1, s_2), (c_3, s_4, s_5)\}$, while SPC does not hold.

(ii) Proof for the relationship between Unqc and α-condition

⇐): Sufficiency: If the α-condition holds, then the agent-proposing Gale-Shapley algorithm and the arm-proposing Gale-Shapley algorithm lead to matching $m$ in all consistent restrictions.
⇒): Necessity: We first prove the case $K = 2$, $N = q_1 + q_2$. Assume that there are two arms $c_1, c_2$, each with capacity $q_k$ ($k = 1, 2$), and the agent set $S = \{s_1, s_2, \cdots, s_{q_1+q_2}\}$. By contradiction, assume that Unqc is satisfied while the α-condition is not. Then not all matching pairs are Top-top matches, so there exists an agent $s_k$ with $c_1 \succ_{s_k} c_2$ such that $s_k$ is not among the first $q_1$ agents preferred by $c_1$. The matching may take one of two forms:

(i) $(\underbrace{\cdots\cdots}_{q_1}, c_1)$ and $(s_k, \underbrace{\cdots\cdots}_{q_2-1}, c_2)$;
(ii) $(s_k, \underbrace{\cdots\cdots}_{q_1-1}, c_1)$ and $(\underbrace{\cdots\cdots}_{q_2}, c_2)$.

We first consider matching (ii). If $s_k$ matches $c_1$, then there must be an agent in $A_1$ that matches with $c_2$; call this agent $s_\ell \in A_1$. There are two situations to discuss. If $c_1 \succ_{s_\ell} c_2$, then (ii) is an unstable matching, which we record as case (A). If $s_\ell$ prefers $c_2$ to $c_1$, then (ii) is a stable matching, which we record as case (B). We now apply the two cases (A) and (B) to matching (i). In (A), $c_1$ and $s_\ell$ prefer each other, so there is a Top-top match and the α-condition is satisfied, which contradicts the hypothesis. In (B), there are two stable matchings, which contradicts Unqc.

We use induction to prove the general case. Suppose that for all $(\tilde{N}, \tilde{K})$ with $\tilde{N} \le N$, $\tilde{K} \le K$ and $\tilde{N} \ge q_1 + q_2 + \cdots + q_{\tilde{K}}$, the α-condition is a necessary condition for uniqueness consistency. We then prove the claim for $(N + 1, q_1 + q_2 + \cdots + q_K)$ (similarly, we would have it for $(N, q_1 + q_2 + \cdots + q_K + 1)$ and $q_1 + q_2 + \cdots + q_K \ge N$). Let $X$ be the newly added agent, and select an agent from the original $N$ agents and denote it by $Y$. Let $k^*_X$ and $k^*_Y$ be the arms ranked first by $X$ and $Y$ respectively. By the case $K = 2$, $N = q_1 + q_2$ proved above, $X$ and $Y$ satisfy the α-condition, hence either $X$ or $Y$ matches with its first ranked arm. Denote the agent that matches with its first ranked arm by $s_1$ and the remaining $N$ agents by $s_2, \cdots, s_N$. Excluding $k^*_{s_1}$ and the stable matched agents of $k^*_{s_1}$, there are $N$ agents and $K - 1$ arms, and $N \ge q_1 + q_2 + \cdots + q_K - q_{k^*_{s_1}}$. By the inductive hypothesis, the α-condition is satisfied. The relationship between the α-condition and Acyclicity* is illustrated in Section C.2.4.
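The two examples of Table 1 can be checked mechanically. The sketch below is not part of the paper's algorithm: it runs a textbook agent-proposing deferred acceptance procedure (Gale & Shapley, 1962) adapted to capacities and then lists blocking pairs, using hypothetical helper names `deferred_acceptance` and `blocking_pairs`. It only confirms that the matchings stated in the example are stable; it does not test the SPC or α-condition themselves.

```python
def deferred_acceptance(agent_prefs, arm_prefs, capacity):
    """Agent-proposing deferred acceptance for a many-to-one market."""
    rank = {c: {s: r for r, s in enumerate(p)} for c, p in arm_prefs.items()}
    next_choice = {s: 0 for s in agent_prefs}
    held = {c: [] for c in arm_prefs}
    free = list(agent_prefs)
    while free:
        s = free.pop()
        if next_choice[s] >= len(agent_prefs[s]):
            continue  # s has been rejected by every arm and stays unmatched
        c = agent_prefs[s][next_choice[s]]
        next_choice[s] += 1
        held[c].append(s)
        if len(held[c]) > capacity[c]:
            worst = max(held[c], key=lambda a: rank[c][a])
            held[c].remove(worst)  # the arm rejects its least preferred proposer
            free.append(worst)
    return {c: sorted(members) for c, members in held.items()}


def blocking_pairs(matching, agent_prefs, arm_prefs, capacity):
    """All (agent, arm) pairs that would both rather be matched to each other."""
    rank = {c: {s: r for r, s in enumerate(p)} for c, p in arm_prefs.items()}
    assigned = {s: c for c, members in matching.items() for s in members}
    pairs = []
    for s, prefs in agent_prefs.items():
        for c in prefs:
            if c == assigned.get(s):
                break  # remaining arms are worse than the current match
            has_seat = len(matching[c]) < capacity[c]
            displaces = any(rank[c][s] < rank[c][m] for m in matching[c])
            if has_seat or displaces:
                pairs.append((s, c))
    return pairs


capacity = {"c1": 2, "c2": 1, "c3": 2}
exm1_arms = {"c1": ["s1", "s2", "s3", "s4", "s5"], "c2": ["s2", "s3", "s4", "s5", "s1"],
             "c3": ["s3", "s4", "s5", "s1", "s2"]}
exm1_agents = {"s1": ["c1", "c2", "c3"], "s2": ["c2", "c3", "c1"], "s3": ["c3", "c2", "c1"],
               "s4": ["c3", "c1", "c2"], "s5": ["c2", "c1", "c3"]}
exm2_arms = {"c1": ["s1", "s2", "s3", "s4", "s5"], "c2": ["s3", "s2", "s1", "s4", "s5"],
             "c3": ["s1", "s5", "s2", "s4", "s3"]}
exm2_agents = {"s1": ["c1", "c3", "c2"], "s2": ["c1", "c2", "c3"], "s3": ["c2", "c1", "c3"],
               "s4": ["c1", "c2", "c3"], "s5": ["c3", "c2", "c1"]}

for arms, agents in ((exm1_arms, exm1_agents), (exm2_arms, exm2_agents)):
    m = deferred_acceptance(agents, arms, capacity)
    print(m, blocking_pairs(m, agents, arms, capacity))
```

In both examples the computed matching coincides with the one listed in the text, and no blocking pair is found.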

C.2.3 DIFFICULTIES FROM SPC TO α-CONDITION IN REGRET ANALYSIS

When we use the event decomposition for the regret minimization block to bound the number of times an arm is pulled by agent $j$ (Lemma 6), we need $m^*(j)$ to remain available at all times, that is, never to be deleted. Under the SPC condition, $m^*(j)$ always remains available because the stable matched partner is the most preferred one in the remaining market for the given agent, whereas the α-condition cannot guarantee this property. Hence, it is important to find conditions, or a phase with good properties, guaranteeing that $m^*(j)$ will be neither globally nor locally deleted. We use $F_{\alpha j}$ and $V_{\alpha j}$ in Lemma 3 (in Appendix B.2) to solve this problem. Moreover, since the stable matched pair is not necessarily a Top-top match in the remaining system under the α-condition, while it is under SPC, we introduce a new mapping (Figure 4) to describe the correspondence between stable pairs. In addition, as shown in Figure 1, Acyclicity* is the weakest condition known so far that ensures uniqueness, and Klaus & Klijn (2013) point out that acyclicity has a tight connection with consistency. Hence, whether the α-condition can be further weakened, together with a corresponding new algorithm, remains to be studied.

(a) Exm3: Arms
$c_1 : s_1 > s_2 > s_5 > s_3 > s_4$
$c_2 : s_2 > s_1 > s_4 > s_3 > s_5$
$c_3 : s_1 > s_3 > s_2 > s_4 > s_5$

(b) Exm3: Agents
$s_1 : c_2 > c_3 > c_1$
$s_2 : c_1 > c_2 > c_3$
$s_3 : c_3 > c_1 > c_2$
$s_4 : c_1 > c_2 > c_3$
$s_5 : c_1 > c_2 > c_3$

(c) Exm3: Arms
$c_1 : s_1 > s_2 > s_5$
$c_2 : s_2 > s_1 > s_5$

(d) Exm3: Agents
$s_1 : c_2 > c_1$
$s_2 : c_1 > c_2$
$s_5 : c_1 > c_2$

Theorem 5. Suppose that $(\mathcal{K}, \mathcal{J}, \mathcal{P})$ are arbitrarily fixed, and let $P_c$ and $P_s$ be the preference profiles of arms and agents respectively. Then, $P_c$ satisfies Acyclicity* if and only if there is a unique stable matching in the many-to-one setting for each $P_s$.

Proof. In order to prove this theorem, we first introduce a lemma.

Lemma 8.
For a given $P$, suppose that there are two stable matchings $\mu, \mu'$ under $P$. Then, by Akahoshi (2014),
• $|\mu(s)| = |\mu'(s)|$ for each $s \in \mathcal{J}$ and $|\mu(c)| = |\mu'(c)|$ for each $c \in \mathcal{K}$.
Moreover, for each $c \in \mathcal{K}$ with $\mu'(c) \neq \mu(c)$,
• $|\mu(c)| = |\mu'(c)| = q_c$;
• $\mu(c) \setminus \mu'(c) \neq \emptyset$ and $\mu'(c) \setminus \mu(c) \neq \emptyset$;
• if $\mu'(c) \succ_c \mu(c)$, then for each $s' \in \mu'(c)$ and $s \in \mu(c) \setminus \mu'(c)$, $\{s'\} \succ_c \{s\}$.

(⇒) Necessity: We prove this direction by contradiction. Suppose there are at least two distinct stable matchings under $P$. By the GS algorithm Gale & Shapley (1962), there exist optimal matchings $\mu_s$ and $\mu_c$ such that $\mu_c \succ_c \mu_s$ and $\mu_s \succ_s \mu_c$. Under the multi-stability assumption, $\mu_s \neq \mu_c$. Then there exists $c_0 \in \mathcal{K}$ such that $\mu_s(c_0) \neq \mu_c(c_0)$, and by the optimality of $\mu_c$, $\mu_c(c_0) \succ_{c_0} \mu_s(c_0)$. Consider the following algorithm:
• Step 1: Choose $c_1 \in \mathcal{K}$ such that $\mu_s(c_1) \neq \mu_c(c_1)$, and choose $s_2 \in \mathcal{J}$ such that $s_2 \in \mu_c(c_1) \setminus \mu_s(c_1)$. Choose $c_2 \in \mathcal{K} \setminus \{c_1\}$ with $\{c_2\} = \mu_s(s_2)$. Go to Step 2.
• Step $k$ ($k \geq 2$): Choose $s_{k+1} \in \mathcal{J}$ such that $s_{k+1} \in \mu_c(c_k) \setminus \mu_s(c_k)$, and $c_{k+1} \in \mathcal{K} \setminus \{c_k\}$ such that $\{c_{k+1}\} = \mu_s(s_{k+1})$. If $c_{k+1} \in \{c_1, c_2, \cdots, c_k\}$, then the algorithm terminates. If not, go to the next step.
• Result: If the algorithm terminates at Step $\ell$ ($\ell \geq 2$) with $c_{\ell+1} = c_j$ ($j \geq 1$), then, given the students $\{s_{j+1}, s_{j+2}, \cdots, s_{\ell+1}\}$ and the colleges $\{c_j, c_{j+1}, \cdots, c_\ell\}$, there is a cycle
$$s_{\ell+1} \succ_{c_\ell} s_\ell \cdots s_{j+2} \succ_{c_{j+1}} s_{j+1} \succ_{c_j} s_j,$$
so condition (P) is satisfied. Let $T_k = \mu_c(c_k) \setminus \{s_k\}$, $k \in \{j, j+1, \cdots, \ell\}$.
Hence, there is a cycle (Definition 10), which induces a contradiction.

(⇐) Sufficiency: Assume that there exists a cycle
$$s_{\ell+1} \succ_{c_\ell} s_\ell \cdots s_3 \succ_{c_2} s_2 \succ_{c_1} s_1, \quad s_{\ell+1} \equiv s_1,$$
and $|T_i| = q_{c_i} - 1$, $T_i \subseteq U_{c_i}(s_i)$.

Table 3: Preference Profile of $\mathcal{K}$.
$$
\begin{array}{c|ccccc|ccc}
\text{rank} & c_1 & c_2 & \cdots & c_{\ell-1} & c_\ell & c_{\ell+1} & \cdots & c_K \\
\hline
1 & s_2 & s_3 & \cdots & s_\ell & s_1 & * & \cdots & * \\
2 & s_{\ell+2} & s_{\ell+2} & \cdots & s_{\ell+2} & s_{\ell+2} & * & \cdots & * \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
q_i & s_{\ell+1+q_1} & s_{\ell+1+q_2} & \cdots & s_{\ell+1+q_{\ell-1}} & s_{\ell+1+q_\ell} & & & \\
 & s_{\ell+2+q_1} & s_{\ell+2+q_2} & \cdots & s_{\ell+2+q_{\ell-1}} & s_{\ell+2+q_\ell} & & & \\
 & s_{\ell+3+q_1} & s_{\ell+3+q_2} & \cdots & s_{\ell+3+q_{\ell-1}} & s_{\ell+3+q_\ell} & & & \\
 & \vdots & \vdots & & \vdots & \vdots & & & \\
 & s_N & s_N & \cdots & s_N & s_N & & & \\
\hline
\text{the remaining} & s_1 & s_1 & \cdots & s_1 & s_2 & & & \\
\text{agents of } \{s_1, \cdots, s_\ell\} & s_3 & s_2 & \cdots & s_2 & s_3 & & & \\
\text{are ranked last} & \vdots & s_4 & \cdots & \vdots & \vdots & & & \\
 & s_\ell & s_\ell & \cdots & s_{\ell-1} & s_\ell & & &
\end{array}
$$

Table 4: Preference Profile of $\mathcal{J}$.
$$
\begin{array}{ccccc|ccc}
s_1 & s_2 & \cdots & s_{\ell-1} & s_\ell & s_{\ell+1} & \cdots & s_N \\
\hline
c_1 & c_2 & \cdots & c_{\ell-1} & c_\ell & * & \cdots & * \\
c_\ell & c_1 & \cdots & c_{\ell-2} & c_{\ell-1} & * & \cdots & * \\
[K] \setminus \{c_1, c_\ell\} & [K] \setminus \{c_2, c_1\} & \cdots & [K] \setminus \{c_{\ell-1}, c_{\ell-2}\} & [K] \setminus \{c_\ell, c_{\ell-1}\} & * & \cdots & *
\end{array}
$$

Table 5: $\mu_c$.
$$
\begin{array}{ccccc|ccc}
c_1 & c_2 & \cdots & c_{\ell-1} & c_\ell & c_{\ell+1} & \cdots & c_K \\
\hline
s_2 & s_3 & \cdots & s_\ell & s_1 & * & \cdots & * \\
* & * & \cdots & * & * & * & \cdots & *
\end{array}
$$

Table 6: $\mu_s$.
$$
\begin{array}{ccccc|ccc}
c_1 & c_2 & \cdots & c_{\ell-1} & c_\ell & c_{\ell+1} & \cdots & c_K \\
\hline
s_1 & s_2 & \cdots & s_{\ell-1} & s_\ell & * & \cdots & * \\
* & * & \cdots & * & * & * & \cdots & *
\end{array}
$$

D MORE DISCUSSIONS ABOUT OUR WORK

D.1 STABILITY IN MANY-TO-ONE SETTING

Stable matchings always exist in the one-to-one market Gale & Shapley (1962), while this is not necessarily the case in the many-to-one setting Roth & Sotomayor (1992). Roth & Sotomayor (1992) […]

[…] 2021) applies a phased UCB algorithm with arm elimination in the one-to-one setting. Our MO-UCB-D4 algorithm for many-to-one matching is also carried out in multiple phases for conflict management.
The multiple phases guarantee that the active sets in different phases have no inclusion relationship, so that an arm deleted by an agent in one phase can still be selected in later phases. This ensures that a wrongly deleted arm does not lead to linear regret.

Parameter Selection and Scale. The parameter $\theta \in (0, 1/K)$ in our MO-UCB-D4 algorithm is the local deletion threshold. Increasing the threshold leads to higher regret until local deletion vanishes, because more collisions are allowed before an arm is deleted; on the other hand, a higher threshold allows quick detection of the stable matched arms. Decreasing the threshold results in more aggressive deletion and hence lower collision regret in each phase, at the cost of a longer detection time for the stable matched arms. Therefore, there is a trade-off in choosing $\theta$, and one could design an algorithm that iteratively updates $\theta$ based on previous information.

Baseline experimental design. Although our work mainly focuses on theory and does not put much emphasis on experimental evaluation, we still carefully design our experiments to test the robustness of our algorithm across different environments. Since our work is the first to study the many-to-one setting with uniqueness conditions, there are no comparable baselines. It is possible to design sub-optimal algorithms in which each agent runs a MAB algorithm independently and there is no communication block among agents. However, such an algorithm may not find the stable matching and thus suffers linear regret.

Optimality of our bound and the lower bound. Recall that our bound is $O(NK\log(T)/\Delta^2)$. There exists a lower bound of $\Omega(\log(T)/\Delta^2)$ under the setting where arms have the same and known preferences Sankararaman et al. (2021), which is a special case of our setting. Our bound is optimal in terms of $T$ and $\Delta$.
For $N$, since each agent $j$ faces collisions from non-dominated arms and from other agents, the regret is bounded by a sum over agents, which leads to the factor $O(N)$. In multi-player decentralized settings Avner & Mannor (2014); Rosenski et al. (2016), each agent usually suffers regret of order $N$, since it collides with the other agents. Thus we conjecture that this dependence on $N$ is unavoidable. For $K$, since agents in the decentralized setting have no knowledge of the arms' preferences, each agent needs to try each arm $O(\log(T)/\Delta^2)$ times to identify its stable matched arm, and it may collide when pulling another agent's stable matched arm, which leads to the factor $K$. The factor $K$ might be removed for agents that never experience collisions due to a special market structure.
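The communication-free baseline mentioned under "Baseline experimental design" can be sketched as follows. This is only an illustrative sketch, not an algorithm from this paper: the function name `run_independent_ucb`, the UCB1 index, the Gaussian reward noise, and the choice to record a collision as a pull with zero reward are all assumptions made for the example.

```python
import math
import random


def run_independent_ucb(mu, arm_pref, capacity, horizon, seed=0):
    """Each agent runs UCB1 on its own; arms keep their top-`capacity` proposers.

    mu[j][k]    : mean reward of arm k for agent j (unknown to the agent)
    arm_pref[k] : agents listed from most to least preferred by arm k
    capacity[k] : number of agents arm k can accept per round
    Returns the per-agent cumulative reward and the last-round matching.
    """
    rng = random.Random(seed)
    n_agents, n_arms = len(mu), len(mu[0])
    rank = [{j: r for r, j in enumerate(pref)} for pref in arm_pref]
    pulls = [[0] * n_arms for _ in range(n_agents)]
    means = [[0.0] * n_arms for _ in range(n_agents)]
    total = [0.0] * n_agents
    matching = {}
    for t in range(1, horizon + 1):
        # Every agent picks the arm with the largest UCB index, ignoring the others.
        choice = []
        for j in range(n_agents):
            def index(k):
                if pulls[j][k] == 0:
                    return float("inf")
                return means[j][k] + math.sqrt(2 * math.log(t) / pulls[j][k])
            choice.append(max(range(n_arms), key=index))
        # Each arm accepts its most preferred proposers up to capacity; the rest collide.
        matching = {}
        for k in range(n_arms):
            proposers = sorted((j for j in range(n_agents) if choice[j] == k),
                               key=lambda j: rank[k][j])
            matching[k] = proposers[:capacity[k]]
        for j in range(n_agents):
            k = choice[j]
            # Assumption: a rejected (collided) agent observes reward 0.
            reward = rng.gauss(mu[j][k], 0.1) if j in matching[k] else 0.0
            pulls[j][k] += 1
            means[j][k] += (reward - means[j][k]) / pulls[j][k]
            total[j] += reward
    return total, matching
```

Since the agents never exchange information here, nothing steers them toward the stable matching, which is the weakness of such baselines noted above.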

D.3 STRICT PREFERENCE AND "INDIFFERENT AGENTS"

Our work focuses on strict preferences rather than the more general case that allows indifferent agents. As far as we know, many works on traditional (offline) matching markets assume strict preferences Gale & Shapley (1962); Karpov (2019); Gutin et al. (2021); Nguyen et al. (2021); Akahoshi (2014), perhaps for simplicity. Our work follows these existing settings of offline matching markets and of bandit learning in one-to-one matching markets Basu et al. (2021); Liu et al. (2020a); Sankararaman et al. (2021); Liu et al. (2020b), which also assume strict preferences. Note that if the agents are indifferent (or nearly indifferent) over arms that are far down the ranking lists and do not affect the stable matching, our algorithm and analysis still go through. The gap that appears in the regret bound depends only on the "(nearly) optimal" arms, that is, those that appear in the stable matching or are the best among those that do not. Recall that our objective is to learn a particular stable matching, as in previous works Basu et al. (2021); Liu et al. (2020a); Sankararaman et al. (2021); Liu et al. (2020b) that learn the unique, or the agent-pessimal/optimal, stable matching in the one-to-one setting. Under this objective, if the agents are nearly, but not exactly, indifferent over "(nearly) optimal" arms, then no matter how small the gap is, the agents need to figure out which arm is better, and the gap appears as the learning hardness. This phenomenon is common in multi-armed bandits, where differentiating the optimal arm from the second-best arm is the most difficult part of learning. One might then be curious about the objective of learning a "nearly stable matching". This would be more general, and we leave it as interesting future work.
When agents are exactly indifferent over "(nearly) optimal" arms, the stable matching is not unique. In this case, the communication block and the global deletion set of our algorithm need to be revised to allow each agent to keep more than one stable matched arm. Note that after this revision, the selected matching will not become fixed during the interactions and will switch among all optimal stable matchings, since the learning algorithm needs to keep exploring these arms to guard against the case of a small gap. This results in fast-changing matching selections, in contrast to our setting and most previous works Basu et al. (2021); Liu et al. (2020a); Sankararaman et al. (2021); Liu et al. (2020b), where the learning algorithm tends to stick to a specific matching in the later learning period.
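To make the role of the gap concrete, the short sketch below evaluates the $\log(T)/\Delta^2$ term that appears in the regret bound for a few gap values. The helper name `pulls_to_separate` and the constant `c` are placeholders introduced only for illustration; the true constants depend on the analysis.

```python
import math


def pulls_to_separate(horizon, gap, c=1.0):
    """Order-of-magnitude number of pulls, c * log(T) / gap^2, to tell two arms apart."""
    return c * math.log(horizon) / gap ** 2


# Halving the gap multiplies the required number of pulls by four.
for gap in (0.4, 0.2, 0.1, 0.05):
    print(f"gap={gap:<5} pulls ~ {pulls_to_separate(10_000, gap):.0f}")
```

As the gap shrinks toward exact indifference, this quantity grows without bound, which is why the case of exactly indifferent agents has to be handled separately.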

D.4 FUTURE DIRECTIONS FOR MANY-TO-ONE SETTING

First, we propose some interesting directions about the setting. This paper considers preferences over individuals rather than over sets of agents. For example, when the first and fourth employees have worked together before while the second and third have not, the company may prefer to recruit the 1st and 4th together rather than the 1st and 2nd or the 2nd and 3rd. That is, $\{1, 4\} \succ_k \{2, 3\}$ may occur for arm $k$ and $1, 2, 3, 4 \in [N]$. Further research can take this combination effect as a starting point. We also assume that the arms' preferences over agents are known in our setting. When multiple agents are accepted by one arm simultaneously, the ranking of these agents cannot be inferred if the preference ranking is unknown. Therefore, an algorithm for rank estimation still needs to be designed. Finally, our work is based on fixed, finite sets of agents and arms; how to generalize this setting to a dynamic one remains open.



The mapping $m$ is not invertible as it is not injective, thus we do not use $m_t^{-1}(k)$.

We can obtain $D_j[i] = G_j[i]$ in the one-to-one setting.

Under the α-condition this is no longer the case, as agent 1 is not the most preferred agent for arm 1. For agent $A_1$ and its stable matched arm $c_1$, $c_1$ may not be the best arm for agent $A_1$, but arm $c_1$ has $A_1$ as its best agent. Therefore, agent $A_1$ will not delete its stable matched arm $a_1$, but it will not converge to this arm unless global deletion eliminates the better arms.

It is the α-condition that induces a hierarchy in the matching market, which reduces the regret bound from the collision block to the number of matchings with sub-optimal arms by induction.

Park (2017); Clark (2006) state that a matching problem is α-reducible if there is a top trading single or pair for every sub-problem.

The remark in Niederle & Yariv (2009) tells us that Aligned Preference is stronger than the Top-top match and SPC conditions.

The responsive preference here means that if only one student in the two matchings is different, the college prefers the matching containing the preferred student. This assumption Roth & Sotomayor (1992); Akahoshi (2014); Altinok (2019) in our setting states that the addition of another agent $p_{i''}$ does not influence the preference ranking of an arm between agents $p_i$ and $p_{i'}$, i.e., $p_{i''} \cup p_i \succ_{a_j} p_{i''} \cup p_{i'}$ is equivalent to $p_i \succ_{a_j} p_{i'}$.

The preference profile over arms for agents is unknown in our setting and needs to be learned.



as the left order and $[N]_r$, $[K]_r$ as the right order. The $k$-th arm in the right-order set $[K]_r$ has the index $c_k$ in the left-order set $[K]$, and the $j$-th agent in the right-order set $[N]_r$ has the index $A_j$ in the left-order set $[N]$.

Figure 1: Relations of Uniqueness Conditions in Many-to-one Market.

Figure 2: Cumulative regret and cumulative unstability of MO-UCB-D4 with $N \in \{10, 20, 30, 40\}$ agents and $K = 5$ arms under Serial Dictatorship, SPC, and α-condition.

al. (2020a;b); Sankararaman et al. (2021); Basu et al. (2021); Gunn et al. (2022); Malgonde et al. (2020); Johari et al. (2021). In a general centralized market without conflicts, Liu et al. (2020a) applied the common ETC and UCB algorithms to the matching market, and obtained the regret order of O( N K log(T ) ∆

Figure 5: Relations of Unique Stable Conditions in One-to-one (left) and Many-to-one (right) Setting.

since each agent ultimately matches only one arm, $\mu_c(c_j), \mu_c(c_{j+1}), \cdots, \mu_c(c_\ell)$ are mutually disjoint, and hence $T_j, T_{j+1}, \cdots, T_\ell$ are disjoint. By the definition of $T_k$, $k \in \{j, j+1, \cdots, \ell\}$, $T_k$ does not contain any agent in $\{s_{j+1}, s_{j+2}, \cdots, s_{\ell+1}\}$. By the second property in Lemma 8, $|T_k| = q_{c_k} - 1$, and by the last property, $T_k \subset U_{c_k}(s_k)$.
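The walk used in the necessity part of the proof of Theorem 5 (Step 1 and Step $k$) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `extract_cycle` and the dictionary representation (arm mapped to its set of matched agents) are hypothetical, ties in the "choose" steps are broken by sorted order, and the two input matchings are assumed to be distinct stable matchings so that Lemma 8 guarantees every set difference used below is non-empty.

```python
def extract_cycle(mu_c, mu_s):
    """Follow Step 1 / Step k of the necessity proof.

    mu_c, mu_s: dict arm -> set of agents (arm-optimal and agent-optimal matchings).
    Returns (arms, agents) with arms = [c_j, ..., c_l] and agents = [s_{j+1}, ..., s_{l+1}].
    """
    arm_of = {s: c for c, agents in mu_s.items() for s in agents}  # mu_s(s)
    # Step 1: start from any arm whose two partner sets differ.
    start = next(c for c in sorted(mu_c) if mu_c[c] != mu_s[c])
    arms, agents = [start], []
    while True:
        c_k = arms[-1]
        s_next = min(mu_c[c_k] - mu_s[c_k])  # s_{k+1} in mu_c(c_k) \ mu_s(c_k)
        c_next = arm_of[s_next]              # {c_{k+1}} = mu_s(s_{k+1})
        agents.append(s_next)
        if c_next in arms:                   # c_{l+1} = c_j: the walk has closed up
            j = arms.index(c_next)
            return arms[j:], agents[j:]
        arms.append(c_next)


# Toy one-to-one market in which the two optimal matchings differ.
mu_c = {"c1": {"s2"}, "c2": {"s1"}}
mu_s = {"c1": {"s1"}, "c2": {"s2"}}
print(extract_cycle(mu_c, mu_s))
```

Because the set of arms is finite, the walk must revisit an arm, which is why the procedure always terminates.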

then we construct preference profiles for both arms (Table 3) and agents (Table 4):

we can find two distinct matchings $\mu_c$ and $\mu_s$ (Table 5 and Table 6), which induce a contradiction.

Preference Profiles

points out that responsive preference (RP) can prevent this unexpected outcome. Our work assumes that arm preference profiles are over individuals rather than over sets of agents, which naturally satisfies RP Sethuraman et al. (2006).

D.2 SOME DETAILS ABOUT ALGORITHM

Multi-phases to Reduce Collisions. In previous work, the CA-UCB algorithm Liu et al. (2020b) was proposed to manage conflicts in the decentralized market in combination with a bandit algorithm, but it has limitations for more general preference structures. In CA-UCB, if we set the delay probability of all agents to zero, agents may fall into infinite loops and incur high regret. To avoid linear regret, the paper of Sankararaman et al. (


Sub-optimal Play Block. From phase $F_{\alpha j} + 1$ onwards, agent $j$ incurs regret when it selects an arm $k \notin G^*_j \cup m^*(j)$ and is successfully matched. This amounts to $\sum_{k \notin G^*_j \cup m^*(j)} \Delta_{jk} \big(N_{jk}(T) - N_{jk}(S_{F_{\alpha j}})\big)$ regret, which can be upper bounded by Lemma 6.

Then we illustrate the relationship among those phases with good properties and indicators. We first show that for phases $i \geq U_{\alpha j-1} + 1$, the probability that phase $i$ is not a Warm-up phase for agent $A_j$ is low. Let […] then we have the following lemma.

Lemma 4. For phase $i \geq i^* = \max(8, i_1, i_2)$, and for all $j \in [N]$, $\alpha > 1$, the following holds: […]

Similarly, we give the relationship between $F_{\alpha j}$ and the α-Good phase.

Lemma 5. For any agent $j$ and phase $i \geq i^*$, and for $\alpha > 1$, then […]

We only give the proof of Lemma 4; the other one can be verified similarly.

Proof. […] Inequality (i) holds because if phase $i$ is not a Warm-up phase for agent $A_j$, there exists an arm $k \in NTT(A_j)$ which is played more than $10\alpha i$ times in phase $i$. Next, (ii) holds since the probability of a union is at most the sum of the probabilities. By Lemma 3, […] Hence, inequality (iii) holds since $I_t(A_j) = k$ is equivalent to the UCB index (line 7 in Algorithm 1) of arm $m^*(j) = a_j$ being no less than that of arm $k$.

We now give the upper bound of $\mathbb{E}\big[N_{jk}(T) - N_{jk}(S_{F_{\alpha j}})\big]$, which is helpful to bound the regret resulting from the collision block and the sub-optimal play block.

Lemma 6. For $\forall j \in$ […] $\alpha \log(T) + \pi\alpha \log(T) + 1$.

(P) $\{s_{i+1}\} \succ_{c_i} \{s_i\} \succ_{c_i} \phi$, where $s_{\ell+1} \equiv s_1$, and

C.2.4 Acyclicity

If $P_c$ has no cycle, it satisfies Acyclicity*. Akahoshi (2014) pointed out that Acyclicity* is a necessary and sufficient condition for a unique stable matching in many-to-one matching. They study the problem with responsive preferences, where unacceptable agents and arms may exist on both sides of the market. Under our setting, all participants on both sides are acceptable, and we will prove that Acyclicity* is also a necessary and sufficient condition for uniqueness in our problem.

Theorem 4. In our setting, our new α-condition is a sufficient condition for Acyclicity* (Theorem 2 (iii)).

We first use the example above to explain how to check whether Acyclicity* is satisfied. As mentioned above, the preference profile in Table 1 (1(a))(1(b)) satisfies both SPC and the α-condition with valid order $\{(c_2, s_2), (c_3, s_3, s_4), (c_1, s_1, s_5)\}$. We now check that it also satisfies Acyclicity*.

From preference profile (1(a)), we can find four cycles: […]

Condition (P) in Definition 10 is satisfied, and we then show that condition (Q) is not satisfied, thus Acyclicity* holds. For cycle (i), $T_1, T_2 \subset S \setminus \{s_1, s_2\}$ and $|T_1| = q_{c_1} - 1 = 1$. However, this violates $T_1 \subset U_{c_1}(s_1) = \emptyset$. Similarly, (ii), (iii), (iv) all imply that Acyclicity* is satisfied. For cycle (iv), […] Then, this example also satisfies Acyclicity*.

In fact, we can see from the definitions of these two conditions that Acyclicity* only restricts the preferences of the arm side, while the α-condition restricts the preferences of both sides of the market. Intuitively, Acyclicity* is a more general condition. We now give the theoretical proof.

If the α-condition holds, then Acyclicity* also holds. By contradiction, if Acyclicity* is violated, then there is a cycle (Definition 10). For preference profiles that can produce stable matchings, as long as there is a cycle or a ring structure, we can always construct at least two stable matchings Romero-Medina & Triossi (2013).
For example, consider a fixed agent set $S = \{s_1, s_2, \cdots, s_N\}$ and arm set $C = \{c_1, c_2, \cdots, c_K\}$ with preference profile $P$, and suppose this matching market has a stable matching $m^*$. If there is a cycle $s_1 \succ_{c_1} s_2 \succ_{c_2} s_1$ and $m^*$ contains $(s_1, c_1)$, $(s_2, c_2)$, then, keeping the other matched pairs unchanged, $(s_2, c_1)$, $(s_1, c_2)$ together with the other pairs form a new stable matching. Thus uniqueness is violated, and so is the α-condition.

Conversely, we give a counterexample in which Acyclicity* holds while the α-condition does not. In Table 2, the market with arms $c_1, c_2, c_3$, agents $s_1, s_2, s_3, s_4, s_5$, and capacities $q = (2, 1, 2)$ with preferences (2(a)) and (2(b)) satisfies Acyclicity* and leads to a unique stable matching, but does not satisfy the α-condition. We run the GS algorithm in the many-to-one market and obtain the stable matching $\{(c_1; s_2, s_5), (c_2; s_1), (c_3; s_3, s_4)\}$. Acyclicity* is easily verified. After eliminating $(c_3; s_3, s_4)$, only $s_1, s_2, s_5, c_1, c_2$ remain in the system, and the preference profile becomes (2(c)) and (2(d)) in Table 2. Apparently, this profile can produce two stable matchings. Thus, the α-condition is violated.
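The stable matching reported for the full market of Table 2 (panels (a) and (b), capacities $q = (2, 1, 2)$) can be reproduced with a short deferred-acceptance sketch. The text only says "GS algorithm", so the agent-proposing variant used here, the function names `deferred_acceptance` and `is_stable`, and the dictionary encoding of the preference lists are assumptions made for illustration.

```python
def deferred_acceptance(agent_pref, arm_pref, capacity):
    """Agent-proposing deferred acceptance for a many-to-one market."""
    rank = {c: {s: r for r, s in enumerate(p)} for c, p in arm_pref.items()}
    next_choice = {s: 0 for s in agent_pref}
    held = {c: [] for c in arm_pref}
    free = list(agent_pref)
    while free:
        s = free.pop()
        if next_choice[s] == len(agent_pref[s]):
            continue  # s has been rejected by every arm
        c = agent_pref[s][next_choice[s]]
        next_choice[s] += 1
        held[c].append(s)
        held[c].sort(key=rank[c].__getitem__)
        if len(held[c]) > capacity[c]:
            free.append(held[c].pop())  # arm c rejects its least preferred proposer
    return {c: set(agents) for c, agents in held.items()}


def is_stable(matching, agent_pref, arm_pref, capacity):
    """True if no agent-arm pair blocks the matching (preferences over individuals)."""
    arm_of = {s: c for c, agents in matching.items() for s in agents}
    for s, prefs in agent_pref.items():
        for c in prefs:
            if c == arm_of.get(s):
                break  # the remaining arms are worse than s's current match
            has_room = len(matching[c]) < capacity[c]
            displaces = any(arm_pref[c].index(s) < arm_pref[c].index(t)
                            for t in matching[c])
            if has_room or displaces:
                return False
    return True


# Preference profiles (a) and (b) of the Exm3 market, most preferred first.
arm_pref = {"c1": ["s1", "s2", "s5", "s3", "s4"],
            "c2": ["s2", "s1", "s4", "s3", "s5"],
            "c3": ["s1", "s3", "s2", "s4", "s5"]}
agent_pref = {"s1": ["c2", "c3", "c1"], "s2": ["c1", "c2", "c3"],
              "s3": ["c3", "c1", "c2"], "s4": ["c1", "c2", "c3"],
              "s5": ["c1", "c2", "c3"]}
capacity = {"c1": 2, "c2": 1, "c3": 2}

matching = deferred_acceptance(agent_pref, arm_pref, capacity)
print(matching, is_stable(matching, agent_pref, arm_pref, capacity))
```

The helper `is_stable` checks blocking pairs one agent at a time, which is sufficient here because arm preferences are over individuals.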

