LEARNING FOR EDGE-WEIGHTED ONLINE BIPARTITE MATCHING WITH ROBUSTNESS GUARANTEES

Abstract

Many real-world problems, such as online ad display, can be formulated as online bipartite matching. The crucial challenge lies in the nature of sequentially-revealed online item information, based on which we make irreversible matching decisions at each step. While numerous expert online algorithms have been proposed with bounded worst-case competitive ratios, they may not offer satisfactory performance in average cases. On the other hand, reinforcement learning (RL) has been applied to improve the average performance, but it lacks robustness guarantees and can perform arbitrarily badly. In this paper, we propose a novel RL-based approach to edge-weighted online bipartite matching with robustness guarantees (LOMAR), achieving both good average-case and good worst-case performance. The key novelty of LOMAR is a new online switching operation which, based on a judiciously-designed condition to hedge against future uncertainties, decides whether to follow the expert's decision or the RL decision for each online item arrival. We prove that for any ρ ∈ [0, 1], LOMAR is ρ-competitive against any given expert online algorithm. To improve the average performance, we train the RL policy by explicitly considering the online switching operation. Finally, we run empirical experiments to demonstrate the advantages of LOMAR compared to existing baselines.

1. INTRODUCTION

Online bipartite matching is a classic online problem of practical importance (Mehta, 2013; Kim & Moon, 2020; Fahrbach et al., 2020; Antoniadis et al., 2020b; Huang & Shu, 2021; Gupta & Roughgarden, 2020). In a nutshell, online bipartite matching assigns online items to offline items in two separate sets: when an online item arrives, we need to match it to an offline item given applicable constraints (e.g., capacity constraints), with the goal of maximizing the total reward collected (Mehta, 2013). For example, numerous applications, including scheduling tasks to servers, displaying advertisements to online users, and recommending articles/movies/products, among many others, can all be modeled as online bipartite matching or its variants. The practical importance of online bipartite matching, along with its substantial algorithmic challenges, has received extensive attention in the last few decades (Karp et al., 1990; Fahrbach et al., 2020). Concretely, many algorithms have been proposed and studied for various settings of online bipartite matching, ranging from simple yet effective greedy algorithms to sophisticated ranking-based algorithms (Karp et al., 1990; Kim & Moon, 2020; Fahrbach et al., 2020; Aggarwal et al., 2011; Devanur et al., 2013). These expert algorithms typically have robustness guarantees in terms of the competitive ratio (the ratio of the total reward obtained by an online algorithm to the reward of another baseline algorithm, commonly the optimal offline algorithm), even under adversarial settings with arbitrarily bad problem inputs (Karp et al., 1990; Huang & Shu, 2021). In some settings, even the optimal competitive ratio for adversarial inputs has been derived (readers are referred to (Mehta, 2013) for an excellent tutorial).
The abundance of competitive online algorithms has clearly demonstrated the importance of performance robustness in terms of the competitive ratio, especially in safety-sensitive applications such as matching mission-critical items or under contractual obligations (Fahrbach et al., 2020). Nonetheless, as commonly known in the literature, the conservativeness necessary to address worst-case adversarial inputs means that the average performance is typically not optimal (see, e.g., (Christianson et al., 2022; Zeynali et al., 2021) for discussions in other general online problems). More recently, online optimizers based on reinforcement learning (RL) (Chen et al., 2022; Georgiev & Lió, 2020; Wang et al., 2019; Alomrani et al., 2021; Du et al., 2019; Zuzic et al., 2020) have been proposed in the context of online bipartite matching as well as other online problems. Specifically, by exploiting statistical information of problem inputs, RL models are trained offline and then applied online to produce decisions on unseen problem inputs. These RL-based optimizers can often achieve high average rewards in many typical cases. Nonetheless, they may not have any performance robustness guarantees in terms of the competitive ratio. In fact, a crucial pain point is that the worst-case performance of many RL-based optimizers can be arbitrarily bad, due to, e.g., testing distribution shifts, inevitable model generalization errors, finite samples, and/or even adversarial inputs. Consequently, the lack of robustness guarantees has become a key roadblock for the wide deployment of RL-based optimizers in real-world applications. In this paper, we focus on an important and novel objective, achieving both good average performance and guaranteed worst-case robustness, for edge-weighted online bipartite matching (Fahrbach et al., 2020; Kim & Moon, 2020).
More specifically, our algorithm, called LOMAR (Learning-based approach to edge-weighted Online bipartite MAtching with Robustness guarantees), integrates an expert algorithm with RL. The key novelty of LOMAR lies in a carefully-designed online switching step that dynamically switches between the RL decision and the expert decision online, as well as a switching-aware training algorithm. For both no-free-disposal and free-disposal settings, we design novel switching conditions governing when the RL decisions can be safely followed while still guaranteeing ρ-competitiveness against any given expert online algorithm for any ρ ∈ [0, 1]. Furthermore, if the expert itself has a competitive ratio of λ ≤ 1 against the optimal offline algorithm (OPT), then this naturally translates into LOMAR being ρλ-competitive against OPT. To improve the average performance of LOMAR, we train the RL policy in LOMAR by explicitly taking into account the introduced switching operation. Importantly, to avoid the "no supervision" trap during the initial RL policy training, we propose to approximate the switching operation probabilistically. Finally, we offer empirical experiments to demonstrate that LOMAR can improve the average reward (compared to existing expert algorithms) as well as the empirical competitive ratio (compared to pure RL-based optimizers).

2. RELATED WORKS

Online bipartite matching has been traditionally approached by expert algorithms (Mehta, 2013; Karande et al., 2011; Huang et al., 2019; Devanur et al., 2013). A simple but widely-used algorithm is the (deterministic) greedy algorithm (Mehta, 2013), achieving reasonably good competitive ratios and empirical performance (Alomrani et al., 2021). Randomized algorithms have also been proposed to improve the competitive ratio (Ting & Xiang, 2014; Aggarwal et al., 2011). In addition, competitive algorithms based on the primal-dual framework have also been proposed (Mehta, 2013; Buchbinder et al., 2009). More recently, multi-phase information and predictions have been leveraged to exploit stochasticity within each problem instance and improve the algorithm performance (Kesselheim et al., 2013). For example, (Korula & Pál, 2009) designs a secretary matching algorithm based on a threshold obtained using the information of phase one, and exploits the threshold for matching in phase two. Note that the stochastic settings considered by expert algorithms (Mehta, 2013; Karande et al., 2011) mean that the arrival orders and/or rewards of different online items within each problem instance are stochastic. By contrast, as shown in equation 2, we focus on an unknown distribution of problem instances, whereas the inputs within each instance can still be arbitrary. Another line of algorithms utilizes RL to improve the average performance (Wang et al., 2019; Georgiev & Lió, 2020; Chen et al., 2022; Alomrani et al., 2021). Even though heuristic methods (such as using adversarial training samples (Zuzic et al., 2020; Du et al., 2022)) are used to empirically improve the robustness, they do not provide any theoretically-proven robustness guarantees. ML-augmented algorithms have been recently considered for various problems (Rutten et al., 2022; Christianson et al., 2022; Chłędowski et al., 2021; Lykouris & Vassilvitskii, 2021; Gupta & Roughgarden, 2017).
By viewing the ML prediction as blackbox advice, these algorithms strive to provide good competitive ratios when the ML predictions are nearly perfect, and also bounded competitive ratios when ML predictions are bad. But, they still focus on the worst case without addressing the average performance or how the ML model is trained. By contrast, the RL model in LOMAR is trained by taking into account the switching operation and performs inference based on the actual state (rather than its own independently-maintained state as a blackbox). Assuming a given downstream algorithm, (Wang et al., 2021; Liu & Grigas, 2021; Wilder et al., 2019; Elmachtoub & Grigas, 2017; Du et al., 2021; Anand et al., 2021) focus on learning the ML model to better serve the end goal in completely different (sometimes, offline optimization) problems. LOMAR is relevant to conservative bandits/RL (Wu et al., 2016; Kazerouni et al., 2017; Yang et al., 2022; Garcelon et al., 2020) . With unknown reward functions (as well as transition models if applicable), conservative bandits/RL leverages an existing policy to safeguard the exploration process. But, they only consider the cumulative reward without addressing future uncertainties when deciding exploration vs. rolling back to an existing policy. Thus, as shown in Section 4, this cannot guarantee robustness in our problem. Also, constrained policy optimization (Yang et al., 2020; Kumar et al., 2020; Schulman et al., 2015; Achiam et al., 2017; Thomas et al., 2021; Berkenkamp et al., 2017) focuses on average (cost) constraints in the long run, whereas LOMAR achieves stronger robustness (relative to an expert algorithm) for each episode with even adversarial inputs.

3. PROBLEM FORMULATION

We focus on edge-weighted online bipartite matching, which includes un-weighted and vertex-weighted matching as special cases (Fahrbach et al., 2020; Kim & Moon, 2020). In the following, we also drop "edge-weighted" where applicable when referring to our problem. The goal of the agent is to match items (a.k.a. vertices) between two sets U and V to gain as high a total reward as possible. Suppose that U is fixed and contains offline items u ∈ U, and that the online items v ∈ V arrive sequentially: in each time slot, an online item v ∈ V arrives and the weight/reward information {w_uv | w_u,min ≤ w_uv ≤ w_u,max, u ∈ U} is revealed, where w_uv represents the reward when the online item v is matched to the offline item u ∈ U. We denote one problem instance by G = {U, V, W}, where W = {w_uv | u ∈ U, v ∈ V}. We denote by x_uv ∈ {0, 1} the matching decision indicating whether u is matched to v. Also, any offline item u ∈ U can be matched up to c_u times, where c_u is essentially the capacity of offline item u known to the agent. The objective is to maximize the total collected reward Σ_{v∈V, u∈U} x_uv w_uv. With a slight abuse of notation, we denote by x_v ∈ U the index of the item in U that is matched to item v ∈ V. The set of online items matched to u ∈ U is denoted by V_u = {v ∈ V | x_uv = 1}. The edge-weighted online bipartite matching problem has been mostly studied under two different settings: no free disposal and free disposal (Mehta, 2013). In the no-free-disposal case, each offline item u ∈ U can only be matched strictly up to c_u times; in the free-disposal case, each offline item u ∈ U can be matched more than c_u times, but only the top c_u rewards are counted when more than c_u online items are matched to u.
Compared to the free-disposal case, the no-free-disposal case is significantly more challenging, with the optimal competitive ratio being 0 in the strong adversarial setting unless additional assumptions are made (e.g., w_u,min > 0 for each u ∈ U (Kim & Moon, 2020) and/or random order of online arrivals) (Fahrbach et al., 2020; Mehta, 2013). The free-disposal setting not only makes the problem more tractable, but is also practically motivated by the display ad application, where the advertisers (i.e., offline items u ∈ U) will not be unhappy if they receive more impressions (i.e., online items v ∈ V) than their budgets c_u, even though only the top c_u items count. LOMAR can handle both no-free-disposal and free-disposal settings. For better presentation of our key novelty and due to page limits, we focus on the no-free-disposal setting in the body of the paper, while deferring the free-disposal setting to Appendix B. Specifically, the offline problem with no free disposal can be expressed as:

max_{x_uv ∈ {0,1}} Σ_{u∈U, v∈V} x_uv w_uv,  s.t.  Σ_{v∈V} x_uv ≤ c_u ∀u ∈ U,  and  Σ_{u∈U} x_uv ≤ 1 ∀v ∈ V,   (1)

where the constraints specify the offline item capacity limit and that each online item v ∈ V can be matched to at most one offline item u ∈ U. Given a problem instance G and an online algorithm α, we use f^α_u(G) to denote the total reward collected for offline item u ∈ U, and R^α(G) = Σ_{u∈U} f^α_u(G) to denote the total collected reward. We will also drop the superscript α for notational convenience wherever applicable.

Objective. Solving the problem in equation 1 is very challenging in the online case, where the agent has to make irreversible decisions without knowing the future online item arrivals. Next, we first define the following generalized competitiveness as a metric of robustness.

Definition 1 (Competitiveness).
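For concreteness, the offline problem in equation 1 can be solved exactly by brute force on tiny instances. The sketch below is purely illustrative (the function name and the toy instance are our own, not from the paper); the paper's experiments use the Gurobi solver instead.

```python
from itertools import product

def offline_opt(weights, capacity):
    """Brute-force solver for the offline problem in equation (1).

    weights[u][v] is the reward w_uv; capacity[u] is c_u.
    Exponential in the number of online items -- illustration only.
    """
    n_u, n_v = len(weights), len(weights[0])
    best = 0.0
    # Each online item v is matched to one offline item or skipped (None),
    # which enforces the constraint sum_u x_uv <= 1.
    for assign in product([None] + list(range(n_u)), repeat=n_v):
        load = [0] * n_u
        total = 0.0
        for v, u in enumerate(assign):
            if u is not None:
                load[u] += 1
                total += weights[u][v]
        # Capacity constraint: sum_v x_uv <= c_u for every offline item u.
        if all(load[u] <= capacity[u] for u in range(n_u)):
            best = max(best, total)
    return best

# Toy instance: 2 offline items (capacities 1 and 2), 3 online items.
W = [[3.0, 1.0, 2.0],
     [2.0, 2.5, 0.5]]
print(offline_opt(W, capacity=[1, 2]))  # → 6.5 (v0→u1, v1→u1, v2→u0)
```

Note that the greedy choice of sending v0 to u0 (reward 3) is suboptimal here, which is exactly the irreversibility that makes the online version hard.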
An online bipartite matching algorithm α is said to be ρ-competitive with ρ ≥ 0 against the algorithm π if, for any problem instance G, its total collected reward R^α(G) satisfies R^α(G) ≥ ρ R^π(G) − B, where B ≥ 0 is a constant independent of the problem input and R^π is the total reward of the algorithm π.

Competitiveness against a given (online) algorithm π is common in the literature (Christianson et al., 2022): the greater ρ ≥ 0, the better the robustness of the online algorithm, although the average reward can be worse. Additionally, the constant B ≥ 0 relaxes the strict competitive ratio by allowing an additive regret (Antoniadis et al., 2020a). When B = 0, the competitive ratio becomes the strict one. In this paper, we focus on a setting where the problem instance G = {U, V, W} follows an unknown distribution, whereas both the rewards W and the online arrival order within each instance G can be adversarial. We consider both the average performance and worst-case robustness guarantees, as formalized below:

max E_G [R^α(G)],  s.t.  R^α(G) ≥ ρ R^π(G) − B, ∀G,   (2)

where the expectation E_G [R^α(G)] is over the randomness of G = {U, V, W}. Note carefully that some manually-designed algorithms focus on a stochastic setting where the arrival order is random and/or the rewards {w_uv | w_u,min ≤ w_uv ≤ w_u,max, u ∈ U} of each online item are independently and identically distributed (i.i.d.) within each problem instance G (Mehta, 2013). By contrast, our setting is significantly different: we only assume an unknown distribution for the entire problem instance G = {U, V, W}, while both the rewards W and the online arrival order within each instance G can be adversarial.
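The per-instance robustness constraint in equation 2 is straightforward to check empirically. A minimal sketch (function and variable names are our own):

```python
def is_rho_competitive(alg_rewards, expert_rewards, rho, B=0.0):
    """Check R_alpha(G) >= rho * R_pi(G) - B on every problem instance G.

    alg_rewards / expert_rewards: per-instance total rewards of the
    online algorithm alpha and of the expert pi, in the same order.
    """
    return all(r_alg >= rho * r_pi - B
               for r_alg, r_pi in zip(alg_rewards, expert_rewards))

print(is_rho_competitive([8.0, 5.0], [10.0, 6.0], rho=0.8))  # → True
print(is_rho_competitive([7.0, 5.0], [10.0, 6.0], rho=0.8))  # → False (7 < 0.8 * 10)
```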

4. ONLINE SWITCHING FOR ROBUSTNESS GUARANTEES

We present LOMAR, which includes an online switching operation to dynamically decide whether to follow the ML decision or the expert decision, thereby achieving robustness guarantees with respect to the expert.

4.1. ONLINE SWITCHING

While switching is common in online algorithms, "how to switch" is highly non-trivial and a key merit of algorithm designs (Antoniadis et al., 2020a; Christianson et al., 2022; Rutten et al., 2022). To guarantee robustness (i.e., ρ-competitiveness against a given expert for any ρ ∈ [0, 1]), we propose a novel online algorithm (Algorithm 1). In the algorithm, we independently run an expert online algorithm π: the cumulative reward and item matching decisions are all maintained virtually for the expert, but not used as the actual decisions. Meanwhile, instead of being independently executed to provide blackbox advice based on its own virtual state (as in the prior ML-augmented online algorithm (Christianson et al., 2022)), the RL model in LOMAR makes online decisions based on the actual state at each step. The most crucial step for safeguarding RL decisions is online switching (Lines 13-19 in Algorithm 1). The key idea is to switch between the expert decision x^π_v and the RL decision x̃_v in order to ensure that the actual online decision x_v meets the ρ-competitive requirement (against the expert π). Specifically, we follow the RL decision x̃_v only if it can safely hedge against any future uncertainties (i.e., the expert's future reward increase); otherwise, we roll back to the expert's decision x^π_v to stay on track for robustness. Note that naive switching conditions, e.g., only ensuring that the actual cumulative reward is at least ρ times the expert's cumulative reward at each step (Wu et al., 2016; Yang et al., 2022), can fail to meet the competitive ratio requirement in the end. The reason is that, even though the competitive ratio requirement is met (i.e., R_v ≥ ρ R^π_v − B) at the current step v, the expert can possibly obtain much higher rewards from future online items v + 1, v + 2, ... if it has additional offline item capacity that LOMAR does not have.
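To make the failure of naive per-step checking concrete, consider a minimal two-step scenario (toy numbers of our own) with one offline item u of capacity c_u = 1, ρ = 1, and B = 0:

```python
rho, B = 1.0, 0.0
R, cap = 0.0, 1        # the agent's cumulative reward and remaining capacity
R_pi, cap_pi = 0.0, 1  # the expert's virtual reward and remaining capacity

# Step 1: an online item with reward 1 arrives. The RL model wants to match
# it; the expert (say, a threshold-based one) skips. The naive per-step check
# R_v >= rho * R^pi_v - B passes (1 >= 0), so the RL decision is followed.
if R + 1.0 >= rho * R_pi - B:
    R, cap = R + 1.0, 0

# Step 2: an online item with reward 10 arrives. The expert still has
# capacity and matches it; the agent has none left and must skip.
R_pi, cap_pi = R_pi + 10.0, 0

print(R >= rho * R_pi - B)  # → False: the final robustness constraint fails
```

This is precisely the future reward increase that the hedging term in LOMAR's switching condition accounts for.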
Thus, we must carefully design the switching conditions to hedge against future risks. The no-free-disposal and free-disposal settings require different switching conditions. Due to the page limit, we focus on the no-free-disposal setting below, while referring readers to Appendix B for more details about the free-disposal setting.

Algorithm 1 Inference of Robust Learning-based Online Bipartite Matching (LOMAR)
1: Initialization: the actual set of items matched to u ∈ U after the sequentially-arriving item v's assignment is V_{u,v}, with V_{u,0} = ∅; the actual remaining capacity is b_u = c_u for u ∈ U; and the actual cumulative reward is R_0 = Σ_{u∈U} f_u(V_{u,0}) = 0. The same notations apply to the expert algorithm π by adding the superscript π. Competitive ratio requirement ρ ∈ [0, 1] and slackness B ≥ 0 with respect to the expert algorithm π.
2: for v = 1 to |V| do
3:   Run the algorithm π and match item v to u ∈ U based on the expert's decision u = x^π_v.
4:   Update the expert's decision set and reward for offline item u = x^π_v: V^π_{x^π_v, v} = V^π_{x^π_v, v−1} ∪ {v} and f_{x^π_v} = f_{x^π_v}(V^π_{x^π_v, v}).
5:   Update the expert's cumulative reward R^π_v = Σ_{u∈U} f_u(V^π_{u,v}).
6:   Get the set of available items U_a = {u ∈ U | |V_{u,v−1}| < c_u}.
7:   for u in U_a do
8:     Collect the available history information I_u about item u.
9:     Run the RL model to get the score s_u = w_uv − h_θ(I_u, w_uv), where θ is the network weight.
10:  end for
11:  Calculate the probability of choosing each available item or skipping: {{s̄_u}_{u∈U_a}, s̄_skip} = softmax({{s_u}_{u∈U_a}, 0}).
12:  Obtain the RL decision: x̃_v = argmax_{u ∈ U_a ∪ {skip}} {{s̄_u}_{u∈U_a}, s̄_skip}.
13:  if R_{v−1} + w_{x̃_v, v} ≥ ρ(R^π_v + Σ_{u∈U} [|V_{u,v−1}| − |V^π_{u,v}| + 1_{u=x̃_v}]^+ · w_{u,max}) − B then
14:    Select x_v = x̃_v. // Follow the RL decision
15:  else if x^π_v is available for matching (i.e., |V_{x^π_v, v−1}| < c_{x^π_v}) then
16:    Select x_v = x^π_v. // Follow the expert
17:  else
18:    Select x_v = skip.
19:  end if
20:  Update the assignment and reward: V_{x_v, v} = V_{x_v, v−1} ∪ {v} and R_v = R_{v−1} + w_{x_v, v}.
21: end for

4.2. ROBUSTNESS CONSTRAINT

In the no-free-disposal case, an offline item u ∈ U cannot receive any additional online items once it has been matched c_u times, up to its capacity. By assigning more online items to u ∈ U than the expert algorithm does at step v, LOMAR can possibly receive a higher cumulative reward than the expert's. But such an advantage is only temporary, because the expert may receive an even higher reward in the future by filling up the unused capacity of item u. Thus, to hedge against future uncertainties, LOMAR chooses the RL decision only when the following condition is satisfied:

R_{v−1} + w_{x̃_v, v} ≥ ρ(R^π_v + Σ_{u∈U} [|V_{u,v−1}| − |V^π_{u,v}| + 1_{u=x̃_v}]^+ · w_{u,max}) − B,   (3)

where [·]^+ = max(·, 0) and |V_{u,v−1}| − |V^π_{u,v}| + 1_{u=x̃_v} counts how many more online items LOMAR has assigned to u than the expert if LOMAR follows the RL decision at step v. If LOMAR assigns fewer items than the expert for an offline item u ∈ U, there is no need for any hedging, because LOMAR is guaranteed to receive more rewards by filling up the item u to the expert's assignment level. The term w_{u,max} in equation 3 is set as the maximum possible reward for each assignment. Even when w_{u,max} is unknown in advance, LOMAR still applies by simply setting w_{u,max} = ∞. In this case, LOMAR will be less "greedy" than the expert and never use more resources than the expert at any step for any u ∈ U.
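The switching condition in equation 3 can be sketched as a small helper (names are our own; the [·]^+ term and the indicator follow the equation above):

```python
def can_follow_rl(R_prev, w_rl, R_pi, load, load_pi, rl_choice, w_max,
                  rho, B=0.0):
    """Hedged switching test (no-free-disposal): follow the RL decision only
    if the cumulative reward covers rho times the expert's reward plus the
    worst-case future gain the expert could extract from capacity that
    LOMAR would no longer have.

    load[u] / load_pi[u]: items matched to u so far by LOMAR / the expert.
    """
    hedge = 0.0
    for u in load:
        extra = load[u] - load_pi[u] + (1 if u == rl_choice else 0)
        hedge += max(extra, 0) * w_max[u]   # the [.]^+ term in equation (3)
    return R_prev + w_rl >= rho * (R_pi + hedge) - B

# LOMAR would use one more slot of item 'a' than the expert, so it must
# hedge against the expert later collecting up to w_max['a'] on that slot.
ok = can_follow_rl(R_prev=3.0, w_rl=1.0, R_pi=2.0,
                   load={"a": 1}, load_pi={"a": 0}, rl_choice="a",
                   w_max={"a": 2.0}, rho=0.5)
print(ok)  # → True: 4.0 >= 0.5 * (2.0 + 2 * 2.0)
```

With a stricter ρ (e.g., 0.9), the same situation fails the test and LOMAR rolls back to the expert's decision.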

4.3. ROBUSTNESS ANALYSIS

We now formally show the competitive ratio of LOMAR; the proof is available in the appendix.

Theorem 4.1. For any 0 ≤ ρ ≤ 1, B ≥ 0, and any expert algorithm π, LOMAR achieves a competitive ratio of ρ against the algorithm π, i.e., R ≥ ρR^π − B for any problem input.

The hyperparameters 0 ≤ ρ ≤ 1 and B ≥ 0 govern the level of robustness we would like to achieve, at the potential expense of average reward performance. For example, by setting ρ = 1 and B = 0, we achieve the same robustness as the expert but leave little to no freedom for RL decisions. On the other hand, by setting a small ρ > 0 and/or a large B, we provide higher flexibility to RL decisions for better average performance, while potentially decreasing the robustness. In fact, designing an algorithm that is guaranteed to simultaneously outperform RL and the expert is very challenging, if not impossible, and such a tradeoff is necessary in the broad context of ML-augmented online algorithms (Rutten et al., 2022; Christianson et al., 2022). Additionally, in the case of multiple experts, we can first combine these experts into a single expert and then apply LOMAR as if it works with a single combined expert. While the competitive ratio of all online algorithms against the optimal offline algorithm is zero in the no-free-disposal and general adversarial setting, there exist provably competitive online expert algorithms under some mild assumptions and in other settings (Mehta, 2013). For example, the simple greedy algorithm achieves a competitive ratio of (1 + max_{u∈U} w_u,max/w_u,min)^{−1} under bounded-weights assumptions for the adversarial no-free-disposal setting (Kim & Moon, 2020) and 1/2 for the free-disposal setting (Fahrbach et al., 2020), and there also exist 1/e-competitive algorithms against the optimal offline algorithm for the random-order setting (Mehta, 2013). Thus, an immediate result follows.

Corollary 4.1.1.
For any 0 ≤ ρ ≤ 1 and B ≥ 0, by using Algorithm 1 with an expert online algorithm π that is λ-competitive against the optimal offline algorithm OPT, then under the same assumptions for π to be λ-competitive, LOMAR is ρλ-competitive against OPT.

Corollary 4.1.1 provides a general result that applies to any λ-competitive expert algorithm π under its respective required assumptions. For example, if the expert π assumes an adversarial or random-order setting, then Corollary 4.1.1 also holds true under the same adversarial or random-order setting. Finally, we comment on the randomized setting, where randomization is over the algorithm choice and potentially increases the competitive ratio (Gupta & Roughgarden, 2020; Mehta, 2013). For the randomized setting, the competitive ratio is modified as E[R] ≥ ρ R_OPT − B, where E is the expectation over the algorithm's randomness and R_OPT is the reward of the optimal offline algorithm (Mehta, 2013). We make no assumptions on the expert π in Algorithm 1. Thus, if the expert π itself is randomized, then LOMAR will also be randomized. Also, for any ρ ∈ [0, 1] and B ≥ 0, by directly applying Theorem 4.1 and Corollary 4.1.1 to the randomized setting, LOMAR will be ρλ-competitive against OPT if the randomized expert π itself is λ-competitive against OPT.

5. RL POLICY TRAINING WITH ONLINE SWITCHING

The existing ML-augmented online algorithms typically assume a pre-trained standalone RL model (Christianson et al., 2022; Rutten et al., 2022). While the standalone RL model may perform well on its own, some already-good actions can be replaced by the expert's actions due to switching for robustness during inference. In other words, there will be an objective mismatch between training and testing. To rectify the mismatch, we propose a novel approach to train the RL model in LOMAR by explicitly considering the switching operation.

RL architecture. For solving online bipartite matching, there exist various network architectures, e.g., fully-connected networks and scalable invariant networks for arbitrary graph sizes. The prior study (Alomrani et al., 2021) has shown through extensive empirical experiments that the invariant network architecture, where each offline-online item pair runs through a separate neural network with shared weights among all the item pairs, is empirically advantageous, due to its scalability to large graph sizes and high average performance. We denote the RL model as h_θ(I_u, w_uv), where θ is the network parameter. By feeding the item weight w_uv and applicable history information I_u for each offline-online item pair (u, v), we can use the RL model to output a threshold for each possible item assignment, following threshold-based algorithms (Huang et al., 2019; Mehta, 2013). The history information I_u includes, but is not limited to, the average value and variance of the weights assigned to u, the average in-degree of u, and the maximum weight of the already-matched items. More details about this information can be found in the appendix. Then, with the RL output, we obtain a score s_u = w_uv − h_θ(I_u, w_uv) for each possible assignment, and the RL model uses the offline item u ∈ U (plus "skip" with s_skip = 0 in the no-free-disposal setting) with the largest s_u as its candidate action x̃_v when checking the switching condition in Algorithm 1.
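The scoring step can be sketched in numpy as follows. This is a toy-sized MLP standing in for the experiments' 3-layer, 100-unit network, with hypothetical feature dimensions; the point is that every (u, v) pair is scored by the same shared parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared-weight threshold network h_theta: every (u, v) pair is scored
# by the same parameters, so the policy is invariant to the graph size.
W1 = rng.normal(scale=0.1, size=(8, 16))   # input: [I_u (7 dims), w_uv]
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16,))
b2 = 0.0

def h_theta(I_u, w_uv):
    x = np.concatenate([I_u, [w_uv]])
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU layer
    return float(hidden @ W2 + b2)         # scalar threshold

def rl_scores(history, weights_v):
    """Score s_u = w_uv - h_theta(I_u, w_uv) for each available item u."""
    return {u: w - h_theta(history[u], w) for u, w in weights_v.items()}

history = {u: rng.normal(size=7) for u in ("u1", "u2")}
scores = rl_scores(history, {"u1": 0.9, "u2": 0.4})
scores["skip"] = 0.0                       # s_skip = 0 (no free disposal)
candidate = max(scores, key=scores.get)    # RL candidate before switching
```

In LOMAR, `candidate` is only a proposal: it still has to pass the switching condition in Algorithm 1 before being executed.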

Policy training.

Training the RL model by considering the switching in Algorithm 1 is highly non-trivial. Most critically, the initial RL decisions can perform arbitrarily badly upon policy initialization, which means that the initial RL decisions are almost always overridden by the expert's decisions for robustness. By following the expert's decisions, the RL agent almost always receives a good reward, which actually has nothing to do with the RL agent's own decisions and hence provides little to no supervision to improve the RL policy. Consequently, this creates a gridlock for RL policy training. While using an offline pre-trained standalone RL model that does not consider online switching (e.g., (Alomrani et al., 2021)) as an initial policy may partially address this gridlock, it is certainly inefficient, as we have to spend resources training another RL model, let alone the likelihood of being trapped in the standalone RL model's suboptimal policies (e.g., local minima). To address these issues, we introduce a softmax probability with temperature t to approximate the hard switching process during training. The switching probability depends on the cumulative reward difference R_diff in the switching condition:

R_diff = R_{v−1} + w_{x̃_v, v} − ρ(R^π_v + Σ_{u∈U} [|V_{u,v−1}| − |V^π_{u,v}| + 1_{u=x̃_v}]^+ · w_{u,max}) + B.   (4)

Then, the probability of following the RL decision is p_os = e^{R_diff/t} / (1 + e^{R_diff/t}), where t is the temperature of the softmax function. This probability is differentiable and hence allows backpropagation to supervise the training of the RL model weights θ. We train the RL agent by applying REINFORCE (Williams, 1992) to optimize the policy parameter θ. Denote by τ = {x_1, ..., x_v} an action trajectory sample and by p_θ(τ) the probability of the trajectory under the RL policy, where p_θ is calculated based on the selection probabilities of the RL model and the expert. Our goal is to maximize the expected total reward R_θ = E_{τ∼p_θ} [Σ_{v∈V} w_{x_v, v}].
Thus, at each training step, given an RL policy with parameter θ, we sample n action trajectories {τ_i = {x_{1,i}, ..., x_{v,i}}, i ∈ [n]} and record the corresponding rewards. The approximated average reward is R̂_θ = (1/n) Σ_{i=1}^n Σ_{v∈V} w_{x_{v,i}, v}. Then, we calculate the policy gradient as ∇_θ R̂_θ = (1/n) Σ_{i=1}^n (Σ_{v∈V} ∇_θ log p_θ(x_{v,i} | I_{u,i})) (Σ_{v∈V} w_{x_{v,i}, v}), and update the parameter as θ = θ + α ∇_θ R̂_θ, where α controls the training step size. By changing the temperature t of the softmax, we are able to balance exploration and exploitation. Specifically, at the beginning of policy training, we can set a high temperature to encourage the RL model to explore more aggressively, instead of sticking to the expert's decisions. As the RL model's performance continuously improves, we reduce the temperature in order to make the RL agent more aware of the downstream switching operation. The training process is performed offline, as in the existing RL-based optimizers (Alomrani et al., 2021; Du et al., 2022), and is described in Algorithm 3.
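A schematic sketch of the soft switching probability and the REINFORCE update follows. The "gradients" here are scalars purely for self-containedness; a real implementation backpropagates through log p_θ, which includes the soft switching probabilities:

```python
import math

def soft_switch_prob(R_diff, t):
    """Probability of following the RL decision during training,
    p_os = e^(R_diff/t) / (1 + e^(R_diff/t)), i.e., sigmoid(R_diff / t)."""
    return 1.0 / (1.0 + math.exp(-R_diff / t))

def reinforce_step(theta, trajectories, lr):
    """One REINFORCE update: theta += lr * (1/n) * sum_i (grad log p_i) * R_i.

    trajectories: list of (grad_log_prob, total_reward) pairs, one per
    sampled trajectory; gradients are scalars here for illustration.
    """
    n = len(trajectories)
    grad = sum(g * R for g, R in trajectories) / n
    return theta + lr * grad

# A high temperature early in training keeps p_os away from 0/1, so the RL
# policy still receives gradient signal even when the expert would dominate;
# annealing t sharpens p_os toward the hard switching rule used at inference.
for t in (2.0, 1.0, 0.5):
    print(round(soft_switch_prob(-1.0, t), 3))  # moves toward 0 as t shrinks
```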

6. EXPERIMENT

6.1. SETUP

To validate the effectiveness of LOMAR, we conduct experiments based on a movie recommendation application. Specifically, when a user (i.e., online item v) arrives, we recommend a movie (i.e., offline item u) to this user and receive a reward based on the user-movie preference information. We choose the MovieLens dataset (Harper & Konstan, 2015), which provides a total of 3952 movies, 6040 users, and 1,000,209 ratings. We preprocess the dataset to sample movies and users randomly and generate subgraphs, following the same steps as (Dickerson et al., 2019) and (Alomrani et al., 2021). On the testing dataset, we empirically evaluate each algorithm using the average reward (AVG) and the competitive ratio (CR, against OPT), which represent the average performance and the worst-case performance, respectively. Thus, the value of CR is the empirically worst reward ratio over the testing dataset. For a fair comparison, all the experimental settings, such as the capacities c_u, follow those used in (Alomrani et al., 2021). More details about the problem setup and training are deferred to Appendix A.

Baseline algorithms. We consider the following baselines. All the RL policies are trained offline with the same architecture and applicable hyperparameters.
• OPT: The offline optimal oracle has complete information about the bipartite graph. We use the Gurobi optimizer to find the optimal offline solution.
• Greedy: At each step, Greedy selects the available offline item with the highest weight.
• DRL: It uses the same architecture as in LOMAR, but does not consider online switching for training or inference. That is, the RL model is both trained and tested with ρ = 0. More specifically, our RL architecture has 3 fully-connected layers, each with 100 hidden nodes.
• DRL-OS (DRL-OnlineSwitching): We apply online switching to the same RL policy used by DRL during inference. That is, the RL model is trained with ρ = 0, but tested with a different ρ > 0.
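The two evaluation metrics can be computed as follows (a simple sketch with our own function name):

```python
def avg_and_cr(alg_rewards, opt_rewards):
    """AVG: the mean reward over the testing set; CR: the empirically worst
    reward ratio against OPT, matching the metrics described above."""
    avg = sum(alg_rewards) / len(alg_rewards)
    cr = min(a / o for a, o in zip(alg_rewards, opt_rewards))
    return avg, cr

avg, cr = avg_and_cr([8.0, 9.0, 3.0], [10.0, 10.0, 10.0])
print(avg, cr)  # → 6.666..., 0.3 (the third instance is the empirical worst case)
```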
The choice of baselines includes all those considered in (Alomrani et al., 2021). In the no-free-disposal setting, the best competitive ratio is 0 in general adversarial cases (Mehta, 2013). Here, we use Greedy as the expert algorithm, because the recent study (Alomrani et al., 2021) has shown that Greedy performs better than other alternatives and is a strong baseline.

6.2. RESULTS

We show the comparison of LOMAR with the baseline algorithms in Table 1. First, we see that DRL has the highest average reward, but its empirical competitive ratio is the lowest. The expert algorithm Greedy is fairly robust, but has a lower average reward than the RL-based policies. Second, DRL-OS improves the competitive ratio compared to DRL, but its RL policy is trained alone, without being aware of the online switching. Thus, by making the RL policy aware of online switching, LOMAR improves the average reward compared to DRL-OS. Specifically, by training LOMAR with the same ρ used for testing, we obtain both the highest average reward and the highest competitive ratio. One exception is the minor decrease in the competitive ratio when ρ = 0.8 for testing. This is likely due to the dataset: a few hard instances can affect the empirical competitive ratio, which also explains why the empirical competitive ratio is not necessarily monotonically increasing in ρ ∈ [0, 1]. Nonetheless, unlike DRL, which may only work well empirically without guarantees, LOMAR offers provable robustness guarantees while exploiting the power of RL to improve the average performance. The boxplots in Fig. 1 visualize the reward ratio distribution of LOMAR, which further validates the importance of switching-aware training. To show the effect of switching with different ρ, we calculate bi-competitive reward ratios. Specifically, for each problem instance, the bi-competitive ratios compare the actual reward against those of Greedy and the RL model, respectively. To highlight the effect of online switching, we focus on DRL-OS (i.e., training the RL model with ρ = 0), whose RL training process is not affected by ρ, because an RL model trained with ρ > 0 in LOMAR does not necessarily perform well on its own, and the reward ratio of LOMAR to its own RL model is not meaningful. The histograms of the bi-competitive ratios are visualized in Fig. 2.
When ρ = 0, the ratio of DRL-OS/DRL is always 1, unsurprisingly, since DRL-OS is essentially the same as DRL in this case (i.e., both trained and tested with ρ = 0). With a large ρ (e.g., 0.8) for testing, the reward ratios of DRL-OS/Greedy are around 1 for most samples, which means robustness is achieved, as proven by our theoretical analysis. On the other hand, DRL-OS then has limited flexibility and is less able to exploit the good average performance of DRL. Thus, the hyperparameter ρ ∈ [0, 1] governs the tradeoff between average performance and robustness relative to the expert and, like other hyperparameters, can be tuned to maximize the average performance subject to the robustness requirement. We also consider a crowdsourcing application, as provided by the gMission dataset (Chen et al., 2014). Additional results for LOMAR and baselines on gMission are deferred to Appendix A.
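The bi-competitive reward ratios and the empirical competitive ratio used in this section can be computed directly from per-instance rewards; a minimal sketch (function names are ours, not from the paper's code):

```python
def bi_competitive_ratios(alg_rewards, expert_rewards, rl_rewards):
    """Per-instance reward of the algorithm relative to the expert (Greedy)
    and to the standalone RL policy, as described in the text."""
    return [(a / e, a / r) for a, e, r in zip(alg_rewards, expert_rewards, rl_rewards)]

def empirical_competitive_ratio(alg_rewards, opt_rewards):
    """Worst-case reward ratio against the offline optimum over all test instances."""
    return min(a / o for a, o in zip(alg_rewards, opt_rewards))
```

The histograms in Fig. 2 are then obtained by binning the first and second components of the returned pairs.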

7. CONCLUSION

In this paper, we propose LOMAR to achieve both good average-case and good worst-case performance for edge-weighted online bipartite matching. LOMAR includes a novel online switching operation to decide whether to follow the expert's decision or the RL decision for each online item arrival. We prove that for any ρ ∈ [0, 1], LOMAR is ρ-competitive against any given expert online algorithm, which directly translates into a bounded competitive ratio against OPT if the expert algorithm itself has one. We also train the RL policy by explicitly considering the online switching operation so as to improve the average performance. Finally, we run empirical experiments to validate LOMAR. The input to our considered RL model is the edge weights w_{uv} revealed by the online items plus some historical information, which includes: the mean and variance of each offline node's weights; the average degree of each offline node; the normalized step size; the percentage of offline nodes connected to the current node; statistics of the already-matched nodes' weights (maximum, minimum, mean, and variance); the ratio of matched offline nodes; the ratio of skips up to now; and the reward normalized by the number of offline nodes. For more details of the historical information, readers are referred to Table 1 in Alomrani et al. (2021). Additionally, training the RL model in LOMAR usually takes less than 8 hours on a shared research cluster with one NVIDIA K80 GPU, which is almost the same as training the model for DRL in a standalone manner (i.e., setting ρ = 0 without considering online switching).
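For illustration, the per-item input described above can be assembled into a plain feature vector. The sketch below covers a subset of the listed statistics; the function name, argument names, and exact normalizations are our assumptions, not the paper's implementation:

```python
from statistics import mean, pvariance

def history_features(step, num_steps, weights_u, matched_weights_u,
                     degree_u, num_offline, num_matched, num_skips, reward):
    """Illustrative per-(u, v) input features mirroring part of the list above."""
    stats = matched_weights_u or [0.0]  # weights of items already matched to u
    return [
        mean(weights_u), pvariance(weights_u),  # mean/variance of u's observed weights
        degree_u,                               # average degree of offline node u
        step / num_steps,                       # normalized step size
        max(stats), min(stats), mean(stats), pvariance(stats),
        num_matched / num_offline,              # ratio of matched offline nodes
        num_skips / max(step, 1),               # ratio of skips up to now
        reward / num_offline,                   # reward normalized by offline nodes
    ]
```

In the actual model of Alomrani et al. (2021), these statistics are fed, together with w_{uv}, into the scoring network h_θ.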

A.1.2 ADDITIONAL RESULTS

In Table 1, we have empirically demonstrated that LOMAR achieves the best tradeoff between the average reward and the competitive ratio. In Fig. 3, we further demonstrate that LOMAR not only achieves a better worst-case competitive ratio (at the 100.0% percentile) but also a good tail reward ratio compared to the baseline algorithms. Specifically, we show the percentiles of the reward ratios (compared to the optimal offline algorithm), where the 100% percentile is the worst-case empirical reward ratio (i.e., the competitive ratio). We see that DRL has a bad high-percentile reward ratio and lacks performance robustness, although its lower-percentile reward ratio is better, which is consistent with the good average performance of DRL. Because of online switching, both DRL-OS and LOMAR achieve better robustness, and LOMAR is even better due to its awareness of the online switching operation during its training process. The expert Greedy has a fairly stable reward ratio, showing its good robustness, but it can be outperformed by the other algorithms at lower percentiles.
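The percentile curves discussed above can be reproduced from the per-instance reward ratios; a minimal sketch, under our convention that the 100% entry is the worst instance (i.e., the empirical competitive ratio):

```python
def reward_ratio_percentiles(alg_rewards, opt_rewards, qs=(50, 90, 100)):
    """Empirical percentiles of the per-instance reward ratio against OPT.
    Ratios are sorted from best to worst, so q = 100 returns the worst
    instance, i.e., the empirical competitive ratio."""
    ratios = sorted((a / o for a, o in zip(alg_rewards, opt_rewards)), reverse=True)
    n = len(ratios)
    # index into the sorted list at the q% tail, clamped to the last element
    return {q: ratios[min(n - 1, int(n * q / 100))] for q in qs}
```

A robust algorithm keeps the q = 100 entry high even when its median (q = 50) is only moderate.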

A.1.3 RESULTS FOR ANOTHER EXPERT ALGORITHM

Optimally competitive expert algorithms have been developed under the assumptions of random order and/or i.i.d. rewards of different online items. In particular, in the random-order setting, OSM (online secretary matching) has the optimal competitive ratio of 1/e (Kesselheim et al., 2013). Note that the competitive ratio of OSM is averaged over the random order of online items, while the rewards can be adversarially chosen. We show the empirical results in Fig. 4. As OSM skips the first |V|/e online items, it actually does not perform (in terms of the empirical worst-case reward ratio) as well as the default expert Greedy in our experiments, despite its guaranteed competitive ratio against OPT. That said, we still observe the same trend as when using Greedy as the expert: by tuning ρ ∈ [0, 1], LOMAR achieves good average performance while guaranteeing competitiveness against the expert OSM (and against OPT, as OSM itself is optimally competitive against OPT). Fig. 4 shows the empirical results on our testing dataset, which does not strictly satisfy the random-order assumption required by OSM. Next, to satisfy the random-order assumption, we select a typical problem instance and randomly vary the arrival order of online items. We show the reward ratio averaged over the random arrival orders in Table 2. Specifically, we calculate each average reward ratio over 100 different random orders, and repeat this process 100 times. We report the mean and standard deviation of these average reward ratios (each averaged over 100 different random orders). We see that LOMAR improves the average reward ratio compared to OSM under the random-order assumption. While DRL has a better average reward for this particular instance, it does not provide any guaranteed worst-case robustness as LOMAR does.

A.2 GMISSION

The gMission dataset (Chen et al., 2014) considers a crowdsourcing application, where the goal is to assign tasks (online items) to workers (offline items).
The edge weight between an online task and each worker is the product of the task reward and the worker's success probability, which is determined by the worker's physical location and the task type. Our goal is to maximize the total reward given the capacity of each worker, which fits directly into our formulation in Eqn. (1). We use the same data processing and RL architecture design as introduced in Section A.1.1. We train LOMAR with different ρ on the gMission dataset by setting u_0 = 10, v_0 = 60, w_{u,max} = 1. Again, we use Greedy as the expert, which is an empirically strong baseline algorithm as shown in Alomrani et al. (2021). Our results are all consistent with those presented in Alomrani et al. (2021).

A.2.1 TESTING ON 10 × 60

In our first test, we generate a testing dataset with u_0 = 10 and v_0 = 60, which is the same setting as our training dataset. In other words, the training and testing datasets have similar distributions. Specifically, Greedy's average reward and competitive ratio are 4.508 and 0.432, while these two values for DRL are 5.819 and 0.604, respectively. Thus, DRL outperforms Greedy in both average performance and worst-case performance. Next, we show the results for LOMAR and DRL-OS under different ρ ∈ [0, 1] in Table 3. In general, with a larger ρ for inference, both LOMAR and DRL-OS stay closer to the expert algorithm Greedy, because there is less freedom for the RL decisions. As a result, when ρ increases for inference, the average rewards of both DRL-OS and LOMAR decrease, although they have guaranteed robustness whereas DRL does not. Moreover, by training the RL model with explicit awareness of online switching, LOMAR achieves a higher average reward than DRL-OS, which reconfirms the benefits of training the RL model by considering its downstream operation.
Interestingly, setting ρ identical for both training and testing does not always yield the highest average reward for LOMAR. This is partially due to the empirical testing dataset. Another reason is that, in this test, DRL alone already performs the best (both on average and in the worst case). Hence, by setting a smaller ρ for inference, LOMAR works better empirically even though it is trained under a different ρ. Nonetheless, this does not void the benefits of the guaranteed robustness of LOMAR: the empirically better performance of DRL lacks guarantees, as we show next.

A.2.2 TESTING ON 100 × 100

In our second test, we consider the opposite case. We generate a testing dataset with u_0 = 100 and v_0 = 100, which differs from the training dataset setting. As a result, the training and testing datasets have very different distributions, making DRL perform very badly. Specifically, Greedy's average reward and competitive ratio are 40.830 and 0.824, while these two values for DRL are 32.938 and 0.576, respectively. DRL has an even lower average reward than Greedy, showing its lack of performance robustness. We show the results for LOMAR and DRL-OS under different ρ ∈ [0, 1] in Table 4. In general, with a larger ρ for inference, both LOMAR and DRL-OS stay closer to the expert algorithm Greedy. As Greedy works empirically much better than DRL in terms of both the average and the worst-case performance, both LOMAR and DRL-OS perform better when we increase ρ to let Greedy safeguard the RL decisions more aggressively. Moreover, by training the RL model with explicit awareness of online switching, LOMAR achieves a higher average reward than DRL-OS, which further demonstrates the benefits of training the RL model by considering its downstream operation.
Also, interestingly, setting ρ identical for both training and testing does not always yield the highest average reward for LOMAR, partially because of the empirical testing dataset. Another reason is that, in this test, DRL alone already performs very badly (both on average and in the worst case) due to the significant training-testing distributional discrepancy. Hence, by setting a higher ρ, LOMAR works better empirically even though it is trained under a different ρ. An exception is when testing LOMAR with ρ = 0.9: setting ρ = 0.6/0.8 for training makes LOMAR perform slightly better in terms of the average performance and the worst-case performance, respectively. But setting ρ = 0.9 for training still brings benefits to LOMAR compared to DRL-OS, which does not consider the downstream online switching operation. To sum up, our experimental results under different settings demonstrate: LOMAR's empirical improvement in terms of the average reward compared to DRL-OS; the improved competitive ratio of LOMAR and DRL-OS compared to DRL, especially when the training and testing distributions differ significantly; and the improved average reward of LOMAR compared to Greedy when RL is good. Therefore, LOMAR can exploit the power of RL while provably guaranteeing performance robustness.

B FREE DISPOSAL

The offline version of bipartite matching with free disposal can be expressed as:

With Free Disposal: max_{x_{uv} ∈ {0,1}} Σ_{u∈U} max_{S⊆V_u, |S|≤c_u} Σ_{v∈S} w_{uv}, s.t. Σ_{u∈U} x_{uv} ≤ 1, ∀v ∈ V,

where V_u = {v ∈ V | x_{uv} = 1} is the set of online items matched to u ∈ U, and the inner objective max_{S⊆V_u, |S|≤c_u} Σ_{v∈S} w_{uv} indicates that only up to the top c_u rewards are counted for u ∈ U. In the free-disposal setting, it is more challenging to design the switching condition to guarantee robustness. The reason is the additional flexibility allowed for matching decisions: each offline item u ∈ U may be matched more than c_u times, although only up to the top c_u rewards actually count (Mehta, 2013; Fahrbach et al., 2020). For example, even if LOMAR and the expert assign the same number of online items to an offline item u ∈ U and LOMAR is better than the expert at a certain step, future high-reward online items can still be assigned to u ∈ U, increasing the expert's total reward or even equalizing the rewards of LOMAR and the expert (i.e., high-reward future online items become the top c_u items for u ∈ U for both LOMAR and the expert). Thus, the temporarily "higher" rewards received by LOMAR must be hedged against such future uncertainties. Before designing our switching condition for the free-disposal setting, we first define the set containing the top c_u online items for u ∈ U after the assignment of v:

E_{u,v}(V_{u,v}) = argmax_{E⊆V_{u,v}, |E|=c_u} Σ_{v'∈E} w_{uv'},

where V_{u,v} is the set of all online items matched to u ∈ U so far after the assignment of item v ∈ V. When there are fewer than c_u items in V_{u,v}, we simply add null items with reward 0 to E_{u,v} such that |E_{u,v}| = c_u. We also sort the online items e_{u,i}, for i = 1, ..., c_u, contained in E_{u,v} according to their weights in increasing order, such that w_{u,e_{u,1}} ≤ ... ≤ w_{u,e_{u,c_u}}.
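The per-node objective with free disposal, counting only the top c_u weights among all items matched to u, is a one-liner; a minimal sketch (the function name is ours):

```python
def disposal_reward(weights_matched, capacity):
    """Reward collected at an offline node u under free disposal:
    only the top-c_u edge weights among all items matched to u count."""
    return sum(sorted(weights_matched, reverse=True)[:capacity])
```

This is the function f_u used throughout the algorithm and proofs below.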
Similarly, we define the same top-c_u item set for the expert algorithm π by adding the superscript π. Next, we define the following value, which indicates the maximum possible additional reward for the expert algorithm π if LOMAR simply switches to the expert and follows it for all the future steps v + 1, v + 2, ...:

G(x̃_v, {V_{u,v-1}}_{u∈U}, {V^π_{u,v}}_{u∈U}) = Σ_{u∈U} [ max_{i=1,...,c_u} Σ_{j=1}^{i} (w_{u,e_{u,j}} − w_{u,e^π_{u,j}}) ]^+,

where e^π_{u,j} ∈ E^π_u(V^π_{u,v}), and e_{u,j} ∈ E_u(Ṽ_{u,v}), in which Ṽ_{u,v} = V_{u,v-1} if x̃_v ≠ u and Ṽ_{u,v} = V_{u,v-1} ∪ {v} if x̃_v = u.

Algorithm 2 Inference of LOMAR (Free Disposal)
1: Initialization: the actual set of items matched to u ∈ U after the sequentially-arriving item v's assignment is V_{u,v}, with V_{u,0} = ∅; the actual remaining capacity is b_u = c_u for u ∈ U; and the actual cumulative reward is R_0 = Σ_{u∈U} f_u(V_{u,0}) = 0. The same notations apply to the expert algorithm π by adding the superscript π. Competitive-ratio requirement ρ ∈ [0, 1] and slackness B ≥ 0 with respect to the expert algorithm π.
2: for v = 1 to |V| do
3:   Run the expert algorithm π and match the item v to u ∈ U based on the expert's decision u = x^π_v.
4:   Update the expert's decision set and reward for the offline item u = x^π_v: V^π_{x^π_v,v} = V^π_{x^π_v,v-1} ∪ {v} and f_{x^π_v} = f_{x^π_v}(V^π_{x^π_v,v}).
5:   Update the expert's cumulative reward R^π_v = Σ_{u∈U} f_u.
6:   for u in U do
7:     Collect the available history information I_u about item u.
8:     Run the RL model to get the score s_u = w_{uv} − h_θ(I_u, w_{uv}), where θ are the network weights.
9:   end for
10:  Calculate the probability of choosing each item u: {s̃_u}_{u∈U} = softmax({s_u}_{u∈U}).
11:  Obtain the RL decision: x̃_v = argmax_{u∈U∪{skip}} {s̃_u}_{u∈U}.
12:  Compute Δf_{x̃_v} in Eqn. (8) and G(x̃_v, {V_{u,v-1}}_{u∈U}, {V^π_{u,v}}_{u∈U}) in Eqn. (14).
13:  if R_{v-1} + Δf_{x̃_v} ≥ ρ(R^π_v + G(x̃_v, {V_{u,v-1}}_{u∈U}, {V^π_{u,v}}_{u∈U})) − B then
14:    Select x_v = x̃_v and update the assignment and reward: V_{x̃_v,v} = V_{x̃_v,v-1} ∪ {v} and R_v = Σ_{u∈U} f_u(V_{u,v}).
...
19: end for

The interpretation is as follows. Suppose that LOMAR follows the RL decision for the online item v. If LOMAR has a higher reward for the j-th item in the top-c_u item set E_{u,v} than the expert algorithm π, then the expert can still possibly offset the reward difference w_{u,e_{u,j}} − w_{u,e^π_{u,j}} by receiving a high-reward future online item that replaces the j-th item for both LOMAR and the expert. Nonetheless, in the free-disposal model, the items in the top-c_u set E_{u,v} are removed sequentially: the lowest-reward item is removed first from the sorted set E_{u,v}, followed by the next lowest-reward item, and so on. Thus, in order for a high-reward item to replace the i-th item in the sorted set E_{u,v}, the first (i − 1) items have to be removed first by other high-reward online items.
As a result, if LOMAR has a lower reward for the j-th item (for j ≤ i) in the top-c_u item set E_{u,v} than the expert algorithm π, then this negatively impacts the expert's additional reward gain in the future. Therefore, for each item u ∈ U we only need to find the highest total reward difference, max_{i=1,...,c_u} Σ_{j=1}^{i} (w_{u,e_{u,j}} − w_{u,e^π_{u,j}}), that can be offset for the expert algorithm π, by considering that i items are replaced by future high-reward online items for i = 1, ..., c_u. If max_{i=1,...,c_u} Σ_{j=1}^{i} (w_{u,e_{u,j}} − w_{u,e^π_{u,j}}) is negative (i.e., the expert algorithm cannot possibly gain higher rewards than LOMAR by receiving high-reward online items to replace its existing ones), then we use 0 as the hedging reward. Finally, by summing up the hedging rewards over all the offline items u ∈ U, we obtain the total hedging reward in Eqn. (7). Based on this hedging reward, we have the condition (Line 28 in Algorithm 1) for LOMAR to follow the RL decision:

R_{v-1} + Δf_{x̃_v} ≥ ρ(R^π_v + G(x̃_v, {V_{u,v-1}}_{u∈U}, {V^π_{u,v}}_{u∈U})) − B,

where Δf_{x̃_v}, defined below, is the additional reward that would be obtained by following the RL decision:

Δf_{x̃_v} = f_{x̃_v}(V_{x̃_v,v-1} ∪ {v}) − f_{x̃_v}(V_{x̃_v,v-1}).

By the robustness assumption of the previous step, we have

R_v = R_{v-1} ≥ ρ(R^π_{v-1} + Σ_{u∈U} (|V_{u,v-1}| − |V^π_{u,v-1}|)^+ · w_{u,max}) − B
    ≥ ρ(R^π_{v-1} + w_{x^π_v,v} − w_{x^π_v,max} + Σ_{u∈U} (|V_{u,v-1}| − |V^π_{u,v-1}|)^+ · w_{u,max}) − B
    = ρ(R^π_v + Σ_{u∈U} (|V_{u,v}| − |V^π_{u,v}|)^+ · w_{u,max}) − B,

where the second inequality holds because w_{x^π_v,v} ≤ w_{x^π_v,max}, and the last equality holds because (|V_{u,v}| − |V^π_{u,v}|)^+ − (|V_{u,v-1}| − |V^π_{u,v-1}|)^+ = −1 if u = x^π_v, and (|V_{u,v}| − |V^π_{u,v}|)^+ − (|V_{u,v-1}| − |V^π_{u,v-1}|)^+ = 0 otherwise. We next prove by induction that the condition

R_v ≥ ρ(R^π_v + Σ_{u∈U} (|V_{u,v}| − |V^π_{u,v}|)^+ · w_{u,max}) − B   (10)

holds at every step of Algorithm 1. At the first step, if x̃_v is not the same as x^π_v and w_{x̃_v,v} ≥ ρ(w_{x^π_v,v} + w_{u,max}) − B, we select the RL decision x_v = x̃_v, and the robustness condition (10) is satisfied.
Otherwise, we select the expert action x_v = x^π_v, and the condition still holds since the matched sets of LOMAR and the expert coincide at the first step and w_{x_v,v} ≥ ρ w_{x^π_v,v} − B holds when ρ ≤ 1 and B ≥ 0. Then, assuming that the robustness condition (10) is satisfied after making the decision for v − 1, we need to prove that it is also satisfied after making the decision for v. If the condition (3) in Algorithm 1 is satisfied, then x_v = x̃_v, and so (10) holds naturally. Otherwise, if the expert action x^π_v is available for matching, we select the expert action x_v = x^π_v. Then, we have w_{x_v,v} ≥ ρ w_{x^π_v,v} − B and |V_{u,v}| − |V^π_{u,v}| = |V_{u,v-1}| − |V^π_{u,v-1}| for all u ∈ U, and hence condition (10) still holds. Other than these two cases, we also have the option to skip, i.e., x_v = skip; by Lemma C.1, condition (10) still holds. Therefore, we prove that condition (10) holds at every step. After the last step v = |V|, we must have

R_v ≥ ρ(R^π_v + Σ_{u∈U} (|V_{u,v}| − |V^π_{u,v}|)^+ · w_{u,max}) − B ≥ ρ R^π_v − B,

where R_v and R^π_v are the total rewards of LOMAR and the expert algorithm π after the last step v = |V|, respectively. This completes the proof for the no-free-disposal case.
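The induction argument above can also be checked numerically. The sketch below is a toy instantiation of the no-free-disposal switching loop with unit capacities and a greedy expert; the "RL" decisions are an arbitrary input sequence, and all names are our assumptions:

```python
def run_lomar_nfd(weights, rl_choices, rho, B=0.0):
    """Toy no-free-disposal LOMAR loop: follow the RL choice only if the
    switching condition holds; otherwise follow a greedy expert, else skip.
    weights[v][u] is the reward of matching online item v to offline item u;
    rl_choices[v] is an offline index or None (skip)."""
    n_u = len(weights[0])
    used, used_exp = [False] * n_u, [False] * n_u
    w_max = [max(w[u] for w in weights) for u in range(n_u)]  # per-node w_{u,max}
    R, R_exp = 0.0, 0.0
    for v, w in enumerate(weights):
        # greedy expert: match v to the highest-reward available offline item
        avail_exp = [u for u in range(n_u) if not used_exp[u]]
        if avail_exp:
            xe = max(avail_exp, key=lambda u: w[u])
            used_exp[xe] = True
            R_exp += w[xe]
        xr = rl_choices[v]
        if xr is not None and not used[xr]:
            # reservation: nodes LOMAR would have used but the expert has not
            resv = sum(w_max[u] for u in range(n_u)
                       if (used[u] or u == xr) and not used_exp[u])
            if R + w[xr] >= rho * (R_exp + resv) - B:
                used[xr] = True
                R += w[xr]
                continue
        # otherwise follow the expert if its item is still free for us, else skip
        if avail_exp and not used[xe]:
            used[xe] = True
            R += w[xe]
    return R, R_exp
```

By the induction above, the returned LOMAR reward satisfies R ≥ ρR^π − B regardless of the RL choice sequence.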

C.2 WITH FREE DISPOSAL

We now turn to the free-disposal setting, which is more challenging than the no-free-disposal setting because of the possibility of using future high-reward items to replace existing low-reward ones. We first denote by Δf_{x^π_v} the actual additional reward obtained by following the expert's decision x^π_v:

Δf_{x^π_v} = f_{x^π_v}(V_{x^π_v,v-1} ∪ {v}) − f_{x^π_v}(V_{x^π_v,v-1}).

Additionally, we denote by Δf^π_{x^π_v} the expert's additional reward of choosing x^π_v:

Δf^π_{x^π_v} = f_{x^π_v}(V^π_{x^π_v,v-1} ∪ {v}) − f_{x^π_v}(V^π_{x^π_v,v-1}).

For presentation convenience, we rewrite the hedging reward G({V_{u,v}}_{u∈U}, {V^π_{u,v}}_{u∈U}) as

G({V_{u,v}}_{u∈U}, {V^π_{u,v}}_{u∈U}) = Σ_{u∈U} [ max_{i=1,...,c_u} Σ_{j=1}^{i} (w_{u,e_{u,j}} − w_{u,e^π_{u,j}}) ]^+,

where e^π_{u,j} ∈ E^π_u(V^π_{u,v}), e_{u,j} ∈ E_u(V_{u,v}), and E_u is defined in Eqn. (6).
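The hedging reward G can be computed by padding each top-c_u set with zero-reward null items, sorting in increasing order, and taking the best prefix sum of the differences, clipped at zero to match the outer (·)^+; a minimal sketch with names of our choosing:

```python
def top_cu(weights, c):
    """Top-c rewards at an offline node, padded with 0-reward null items
    and sorted in increasing order, as in the definition of E_{u,v}."""
    top = sorted(weights, reverse=True)[:c]
    return sorted(top + [0.0] * (c - len(top)))

def hedging_reward(alg_sets, expert_sets, capacities):
    """Total hedging term G: for each u, the largest prefix sum of the
    sorted reward differences, clipped at zero."""
    g = 0.0
    for u, c in capacities.items():
        e_alg = top_cu(alg_sets.get(u, []), c)
        e_exp = top_cu(expert_sets.get(u, []), c)
        prefix, best = 0.0, 0.0   # best = (max over i of prefix sums)^+
        for j in range(c):
            prefix += e_alg[j] - e_exp[j]
            best = max(best, prefix)
        g += best
    return g
```

If LOMAR's top-c_u rewards never exceed the expert's item-by-item, every prefix sum is non-positive and G contributes nothing for that node.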



where I_{u=x̃_v} = 1 if and only if u = x̃_v and 0 otherwise, (·)^+ = max(·, 0), and ρ ∈ [0, 1] and B ≥ 0 are the hyperparameters indicating the desired robustness with respect to the expert algorithm π. The interpretation of condition (3) is as follows. The left-hand side is the total reward of LOMAR after assigning the online item v based on the RL decision (i.e., x̃_v). The right-hand side is the expert's cumulative reward R^π_v, plus the term Σ_{u∈U} (|V_{u,v-1}| − |V^π_{u,v}| + I_{u=x̃_v})^+ · w_{u,max}, which indicates the maximum reward that can possibly still be received by the expert in the future. This reservation term is crucial, especially when the expert has more unused capacity than LOMAR. Specifically, |V_{u,v-1}| is the number of online items (after assigning v − 1 items) already assigned to the offline item u ∈ U, and hence (|V_{u,v-1}| − |V^π_{u,v}| + I_{u=x̃_v})^+ represents the number of additional online items that LOMAR would have assigned to u compared to the expert, each of which the expert could still match in the future with reward up to w_{u,max}.
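Condition (3) with its reservation term can be evaluated in a few lines; a sketch under our naming, where counts_alg[u] = |V_{u,v-1}| and counts_exp[u] = |V^π_{u,v}|:

```python
def follow_rl(R_prev, w_rl, R_exp, counts_alg, counts_exp, w_max,
              rl_choice, rho, B=0.0):
    """Switching test (condition (3), no free disposal): follow the RL choice
    only if the cumulative reward stays rho-competitive against the expert
    after reserving the expert's maximum possible future gain."""
    reservation = sum(
        max(counts_alg[u] - counts_exp[u] + (1 if u == rl_choice else 0), 0) * w_max[u]
        for u in w_max
    )
    return R_prev + w_rl >= rho * (R_exp + reservation) - B
```

A larger ρ inflates the right-hand side, so the test fails more often and LOMAR falls back to the expert more aggressively.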

Figure 1: Boxplots of the reward ratio for different ρ on the testing dataset. Greedy and DRL-OS are also shown for comparison. The best average performance in each figure is achieved by choosing the same ρ for training and testing.

Figure 3: Tail reward ratio comparison. In this experiment, we set ρ = 0.4 for DRL-OS and LOMAR.

For the applicable algorithms (i.e., DRL, DRL-OS, and LOMAR), we train the RL model for 300 epochs on the training dataset with a batch size of 100. In LOMAR, we set B = 0 to follow the strict definition of the competitive ratio. We test the algorithms on the testing dataset to obtain the average reward and the empirical worst-case competitive ratio. By setting ρ = 0 for training, LOMAR is equivalent to the vanilla inv-ff-hist RL model (i.e., DRL) used in Alomrani et al. (2021). Using the same problem setup, we can reproduce the results shown in Alomrani et al. (2021), which reaffirms the correctness of our data generation and training process.

Figure 4: Reward ratio distribution (OSM as the expert)

Comparison under different ρ. In the top row, LOMAR (ρ = x) means LOMAR is trained with ρ = x. The average reward and competitive ratio are denoted by AVG and CR, respectively (the higher, the better). The highest value in each testing setup is highlighted in bold. The AVG and CR of DRL are 12.909 and 0.544, respectively. The average reward of OPT is 13.209.

Performance comparison on gMission 10 × 60 for different ρ. LOMAR with ρ = y means LOMAR is trained with ρ = y.

Performance comparison on gMission 100 × 100 for different ρ. LOMAR with ρ = y means LOMAR is trained with ρ = y.


REPRODUCIBILITY STATEMENT

The details of proving Theorem 4.1 are included in Appendix C. The experimental code and settings are based on the open-source resources of Alomrani et al. (2021). The implementation of LOMAR mainly involves adding the switching condition for training and testing on top of the standard DRL-based algorithm of Alomrani et al. (2021), and will be released upon publication.

APPENDIX

In the appendix, we show the experimental setup and additional results (Appendix A), algorithm details for the free-disposal setting (Appendix B), and finally the proof of Theorem 4.1 (Appendix C).

A EXPERIMENTAL SETTINGS AND ADDITIONAL RESULTS

Our implementation of all the considered algorithms, including LOMAR, is based on the source code provided by Alomrani et al. (2021), which includes code for training the RL model, data pre-processing, and performance evaluation. We conduct experiments on two real-world datasets: MovieLens (Harper & Konstan, 2015) and gMission (Chen et al., 2014).

A.1.1 SETUP AND TRAINING

We first sample u_0 movies from the original MovieLens dataset (Harper & Konstan, 2015). We then sample v_0 users and make sure each user gets at least one movie; otherwise, we remove the users that have no matched movies and resample new users. After obtaining the topology graph, we use Gurobi to find the optimal matching decision. In our experiment, we set u_0 = 10 and v_0 = 60 to generate the training and testing datasets. The numbers of graph instances in the training and testing datasets are 20000 and 1000, respectively. For the sake of reproducibility and fair comparison, our settings follow the same setup as Alomrani et al. (2021). In particular, the general movie recommendation problem belongs to online submodular optimization, but it can be equivalently mapped to edge-weighted online bipartite matching with no free disposal under the setting considered in Alomrani et al. (2021). So, by default, the capacity c_u of each offline node is set to 1 and w_{u,max} = 5. While LOMAR can use any RL architecture, we follow the design of inv-ff-hist proposed by Alomrani et al. (2021), which empirically demonstrates the best performance among all the considered architectures.

Algorithm 3 Training LOMAR with Online Switching (No Free Disposal)
1: Input: available history information {I_u | u ∈ U}; the set of weights {w_{uv}, u ∈ U} for the online item v; the expert's cumulative reward up to the online item v.
2: for each u ∈ U: compute the score s_u = w_{uv} − h_θ(I_u, w_{uv})
3: end for
4: Calculate the probabilities of selecting each available item or skipping ...
...: The probability of each expert action is {s^π_u}_{u∈U_a}, s̃^π_skip = {{0}_{u∈U_a}, 1}
11: end if
12: With online switching, the probabilities of following the RL decision or the expert are {s̃_OS, s̃^π_OS} = softmax{R_diff / t, 0}
13: Calibrated probabilities of choosing an available offline item u or skipping ...
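The soft switching step in Line 12 of Algorithm 3 relaxes the hard switching test into a differentiable softmax over the reward slack R_diff; a minimal numerically-stable sketch, where the temperature t and the function name are our notation:

```python
import math

def switch_probabilities(r_diff, t=1.0):
    """Soft version of the hard switching test used during training:
    softmax over (R_diff / t, 0) gives the probabilities of following
    the RL decision vs. the expert; t is a temperature."""
    a, b = r_diff / t, 0.0
    m = max(a, b)                         # subtract the max for stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)
```

When R_diff is large and positive (the RL decision comfortably satisfies the robustness condition), the probability of following RL approaches 1; at R_diff = 0 the two actions are equally likely.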
The training process for the free-disposal setting is the same as that for the no-free-disposal setting, except for Line 6 of Algorithm 3, in which we need to modify R_diff based on the switching condition (i.e., Line 13 of Algorithm 2) for the free-disposal setting.

C PROOF OF THEOREM 4.1

The key idea of proving Theorem 4.1 is to show that there always exists a feasible action (either following the expert or skipping) that guarantees robustness if we follow the switching condition. Next, we prove Theorem 4.1 for the no-free-disposal and free-disposal settings, respectively.

C.1 NO FREE DISPOSAL

Denote by V_{u,v} the actual set of items matched to u ∈ U after making the decision for v, and by V^π_{u,v} the expert's set of items matched to u ∈ U. We first prove a technical lemma.

Lemma C.1. Assume that the robustness condition is met after making the decision for v − 1. If, at the step when v arrives, the expert's decision x^π_v is not available for matching, then skipping preserves the robustness condition.

Proof. If the item x^π_v is not available for matching, it must have been fully consumed before v arrives, which means that LOMAR has matched more items to x^π_v than the expert. Then, by the robustness assumption, the claim follows.

Lemma C.2. Assume that the robustness condition is met after making the decision for v − 1. By definition, the hedging reward can be written as

G({V_{u,v-1}}_{u∈U}, {V^π_{u,v-1}}_{u∈U}) = Σ_{u∈U} [ max_{i=1,...,c_u} Σ_{j=1}^{i} (w_{u,e_{u,j}} − w_{u,e^π_{u,j}}) ]^+,

where e^π_{u,j} ∈ E^π_u(V^π_{u,v-1}) and e_{u,j} ∈ E_u(V_{u,v-1}). To prove the lemma, we consider four possible cases for w_{u,v} to cover all situations.

Case 1: If the reward for v is small enough such that w_{u,v} < w_{u,e_{u,1}} and w_{u,v} < w_{u,e^π_{u,1}}, then Δf_{x^π_v} = Δf^π_{x^π_v} = 0, since neither the expert nor LOMAR can gain any reward from the online item v. From Eqn. (15), we find that the right-hand side is also 0. Therefore, the conclusion in Lemma C.2 holds with equality.

Case 2: If the reward for v is large enough such that w_{u,v} > w_{u,e_{u,1}} and w_{u,v} > w_{u,e^π_{u,1}}, then the smallest-reward item is removed from both top-c_u sets. In this case, Δf_{x^π_v} = w_{u,v} − w_{u,e_{u,1}} and Δf^π_{x^π_v} = w_{u,v} − w_{u,e^π_{u,1}}. The inequality holds because (w_{u,e_{u,1}} − w_{u,e^π_{u,1}})^+ ≥ w_{u,e_{u,1}} − w_{u,e^π_{u,1}}.
Therefore, the conclusion in Lemma C.2 holds.

Case 3: If the reward for v satisfies w_{u,v} ≥ w_{u,e_{u,1}} and w_{u,v} ≤ w_{u,e^π_{u,1}}, then v ∈ E_u(V_{u,v-1} ∪ {v}) and v ∉ E^π_u(V^π_{u,v-1} ∪ {v}). In other words, even though the online item v produces an additional reward for LOMAR, the reward of v is still smaller than the smallest reward in the expert's top-c_u set. Then the conclusion in Lemma C.2 still holds.

Case 4: If the reward for v satisfies w_{u,v} ≤ w_{u,e_{u,1}} and w_{u,v} ≥ w_{u,e^π_{u,1}}, then only the expert's current smallest-reward item is replaced with v, while the reward of LOMAR remains unchanged. Then the conclusion in Lemma C.2 still holds with equality.

We next prove by induction that the condition

R_v ≥ ρ(R^π_v + G({V_{u,v}}_{u∈U}, {V^π_{u,v}}_{u∈U})) − B   (16)

holds at every step of Algorithm 1. At the first step, by choosing x_v = x^π_v, it is obvious that the condition in (16) is satisfied. Thus, there is at least one feasible solution x_v = x^π_v for our robustness condition (16). Starting from the second step, assume that after step v − 1 we already have R_{v-1} ≥ ρ(R^π_{v-1} + G({V_{u,v-1}}_{u∈U}, {V^π_{u,v-1}}_{u∈U})) − B. If the condition in Line 28 of Algorithm 1 is already satisfied, we can simply use x_v = x̃_v, which directly satisfies (16). Otherwise, we need to follow the expert by setting x_v = x^π_v. We now prove that x_v = x^π_v satisfies the robustness condition at any step v. From Lemma C.2, since 0 ≤ ρ ≤ 1 and Δf_{x^π_v} ≥ 0, the per-step reward gain of following the expert is bounded below by ρ times the expert's gain plus the change in the hedging reward; substituting this back into Eqn. (17), we obtain condition (16) at step v. Therefore, after the last step v = |V|, LOMAR must satisfy

R_v ≥ ρ(R^π_v + G({V_{u,v}}_{u∈U}, {V^π_{u,v}}_{u∈U})) − B ≥ ρ R^π_v − B,

where R_v and R^π_v are the total rewards of LOMAR and the expert algorithm π after the last step v = |V|, respectively. Thus, we complete the proof for the free-disposal setting.

