LEARNING FOR EDGE-WEIGHTED ONLINE BIPARTITE MATCHING WITH ROBUSTNESS GUARANTEES

Abstract

Many real-world problems, such as online ad display, can be formulated as online bipartite matching. The crucial challenge lies in the nature of sequentially-revealed online item information, based on which we make irreversible matching decisions at each step. While numerous expert online algorithms have been proposed with bounded worst-case competitive ratios, they may not offer satisfactory performance in average cases. On the other hand, reinforcement learning (RL) has been applied to improve the average performance, but it lacks robustness guarantees and can perform arbitrarily badly. In this paper, we propose a novel RL-based approach to edge-weighted online bipartite matching with robustness guarantees (LOMAR), achieving both good average-case and good worst-case performance. The key novelty of LOMAR is a new online switching operation which, based on a judiciously-designed condition to hedge against future uncertainties, decides whether to follow the expert's decision or the RL decision for each online item arrival. We prove that for any ρ ∈ [0, 1], LOMAR is ρ-competitive against any given expert online algorithm. To improve the average performance, we train the RL policy by explicitly considering the online switching operation. Finally, we run empirical experiments to demonstrate the advantages of LOMAR compared to existing baselines.

1. INTRODUCTION

Online bipartite matching is a classic online problem of practical importance (Mehta, 2013; Kim & Moon, 2020; Fahrbach et al., 2020; Antoniadis et al., 2020b; Huang & Shu, 2021; Gupta & Roughgarden, 2020). In a nutshell, online bipartite matching assigns online items to offline items in two separate sets: when an online item arrives, we need to match it to an offline item subject to applicable constraints (e.g., capacity constraints), with the goal of maximizing the total reward collected (Mehta, 2013). Numerous applications, including scheduling tasks to servers, displaying advertisements to online users, and recommending articles/movies/products, among many others, can be modeled as online bipartite matching or its variants. Owing to its practical importance and substantial algorithmic challenges, online bipartite matching has received extensive attention over the last few decades (Karp et al., 1990; Fahrbach et al., 2020). Concretely, many algorithms have been proposed and studied for various settings of online bipartite matching, ranging from simple yet effective greedy algorithms to sophisticated ranking-based algorithms (Karp et al., 1990; Kim & Moon, 2020; Fahrbach et al., 2020; Aggarwal et al., 2011; Devanur et al., 2013). These expert algorithms typically have robustness guarantees in terms of the competitive ratio, i.e., the ratio of the total reward obtained by an online algorithm to that of a baseline algorithm (commonly the optimal offline algorithm), even under adversarial settings with arbitrarily bad problem inputs (Karp et al., 1990; Huang & Shu, 2021). In some settings, even the optimal competitive ratio for adversarial inputs has been derived (readers are referred to (Mehta, 2013) for an excellent tutorial).
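To make the setup concrete, the following is a minimal sketch of the simple greedy strategy mentioned above for the edge-weighted problem with unit-capacity offline items and no free disposal. All function and variable names are illustrative, not from the paper.

```python
def greedy_online_matching(num_offline, online_items):
    """Match each arriving online item to the highest-weight free offline item.

    online_items: list of weight vectors; online_items[t][u] is the reward
    of matching online item t to offline item u. Decisions are irreversible:
    once an offline item is used, it cannot be reassigned (no free disposal).
    """
    matched = [False] * num_offline   # unit capacity per offline item
    total_reward = 0.0
    assignment = []                   # offline index per online item, or None
    for weights in online_items:      # online items revealed one at a time
        # pick the best still-free offline item, if any remains
        best = max(
            (u for u in range(num_offline) if not matched[u]),
            key=lambda u: weights[u],
            default=None,
        )
        if best is not None and weights[best] > 0:
            matched[best] = True
            total_reward += weights[best]
            assignment.append(best)
        else:
            assignment.append(None)   # item left unmatched forever
    return total_reward, assignment
```

Because each decision is made without knowledge of future arrivals, an adversarial sequence can force this greedy rule far from the offline optimum, which is exactly the gap that competitive-ratio analysis quantifies.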
The abundance of competitive online algorithms clearly demonstrates the importance of performance robustness in terms of the competitive ratio, especially in safety-sensitive applications such as matching mission-critical items or operating under contractual obligations (Fahrbach et al., 2020). Nonetheless, as is commonly known in the literature, the conservativeness needed to guard against worst-case adversarial inputs means that the average performance is typically suboptimal (see, e.g., (Christianson et al., 2022; Zeynali et al., 2021) for discussions in other general online problems). More recently, online optimizers based on reinforcement learning (RL) (Chen et al., 2022; Georgiev & Lió, 2020; Wang et al., 2019; Alomrani et al., 2021; Du et al., 2019; Zuzic et al., 2020) have been proposed in the context of online bipartite matching as well as other online problems. Specifically, by exploiting statistical information about problem inputs, RL models are trained offline and then applied online to produce decisions on unseen problem inputs. These RL-based optimizers can often achieve high average rewards in typical cases. Nonetheless, they may not have any performance robustness guarantee in terms of the competitive ratio. In fact, a crucial pain point is that the worst-case performance of many RL-based optimizers can be arbitrarily bad, due to, e.g., testing distribution shifts, inevitable model generalization errors, finite training samples, and/or even adversarial inputs. Consequently, the lack of robustness guarantees has become a key roadblock to the wide deployment of RL-based optimizers in real-world applications. In this paper, we focus on an important and novel objective, achieving both good average performance and guaranteed worst-case robustness, for edge-weighted online bipartite matching (Fahrbach et al., 2020; Kim & Moon, 2020).
More specifically, our algorithm, called LOMAR (Learning-based approach to edge-weighted Online bipartite MAtching with Robustness guarantees), integrates an expert algorithm with RL. The key novelty of LOMAR lies in a carefully-designed online switching step that dynamically switches between the RL decision and the expert decision online, as well as a switching-aware training algorithm. For both the no-free-disposal and free-disposal settings, we design novel switching conditions that determine when the RL decisions can be safely followed while still guaranteeing ρ-competitiveness against any given expert online algorithm, for any ρ ∈ [0, 1]. Furthermore, if the expert itself has a competitive ratio of λ ≤ 1 against the optimal offline algorithm (OPT), this naturally translates into LOMAR being ρλ-competitive against OPT. To improve the average performance of LOMAR, we train its RL policy by explicitly taking into account the introduced switching operation. Importantly, to avoid the "no supervision" trap during initial RL policy training, we propose to approximate the switching operation probabilistically. Finally, we offer empirical experiments to demonstrate that LOMAR improves the average reward compared to existing expert algorithms, while achieving better worst-case robustness than pure RL-based optimizers.

2. RELATED WORKS

Online bipartite matching has traditionally been approached by expert algorithms (Mehta, 2013; Karande et al., 2011; Huang et al., 2019; Devanur et al., 2013). A simple but widely-used algorithm is the (deterministic) greedy algorithm (Mehta, 2013), which achieves reasonably good competitive ratios and empirical performance (Alomrani et al., 2021). Randomized algorithms have also been proposed to improve the competitive ratio (Ting & Xiang, 2014; Aggarwal et al., 2011), as have competitive algorithms based on the primal-dual framework (Mehta, 2013; Buchbinder et al., 2009). More recently, multi-phase information and predictions have been leveraged to exploit stochasticity within each problem instance and improve algorithm performance (Kesselheim et al., 2013). For example, (Korula & Pál, 2009) designs a secretary matching algorithm that computes a threshold from the information observed in phase one and exploits it for matching in phase two. Note that the stochastic settings considered by expert algorithms (Mehta, 2013; Karande et al., 2011) mean that the arrival orders and/or rewards of different online items within each problem instance are stochastic. By contrast, as shown in equation 2, we focus on an unknown distribution of problem instances, whereas the inputs within each instance can still be arbitrary. Another line of work utilizes RL to improve the average performance (Wang et al., 2019; Georgiev & Lió, 2020; Chen et al., 2022; Alomrani et al., 2021). Even though heuristic methods (such as using adversarial training samples (Zuzic et al., 2020; Du et al., 2022)) are used to empirically improve robustness, they do not provide any theoretically-proven robustness guarantees. ML-augmented algorithms have recently been considered for various problems (Rutten et al., 2022; Christianson et al., 2022; Chłędowski et al., 2021; Lykouris & Vassilvitskii, 2021; Gupta & Roughgarden, 2017).
By viewing the ML prediction as blackbox advice, these algorithms strive to provide good competitive ratios when the ML predictions are nearly perfect, as well as bounded competitive ratios when the ML predictions are bad. But they still focus on the worst case without addressing the average performance or how the ML model is trained. By contrast, the RL model in LOMAR is trained by taking into account the switching operation and performs inference based on the actual state.

