ROBUST LEARNING FOR CONGESTION-AWARE ROUTING

Abstract

We consider the problem of routing users through a network with unknown congestion functions over an infinite time horizon. On each time step $t$, the algorithm receives a routing request and must select a valid path. For each edge $e$ in the selected path, the algorithm incurs a cost $c^t_e = f_e(x^t_e) + \eta^t_e$, where $x^t_e$ is the flow on edge $e$ at time $t$, $f_e$ is the congestion function, and $\eta^t_e$ is a noise sample drawn from an unknown distribution. The algorithm observes $c^t_e$, and can use this observation in future routing decisions. The routing requests are supplied adversarially. We present an algorithm with cumulative regret $\tilde{O}(|E| t^{2/3})$, where the regret on each time step is defined as the difference between the total cost incurred by our chosen path and the minimum cost among all valid paths. Our algorithm has space complexity $O(|E| t^{1/3})$ and time complexity $O(|E| \log t)$. We also validate our algorithm empirically using graphs from New York City road networks.

1. INTRODUCTION

Modern navigation applications such as Google Maps and Apple Maps are critical tools in large-scale mobility solutions that route billions of users from their source to their destination. In order to be effective, the application should be able to accurately estimate the time required to traverse each road segment (edge) along the user route in the road network (graph): we call this the cost of the edge. In general, the cost of an edge also depends on the current traffic level (the flow) on the edge. Furthermore, costs may scale differently on different edges: a highway can tolerate more traffic than a small residential street. We model this using congestion functions that map traffic flows to edge costs.

Usually, cost information is not readily available to the routing engine and can only be inferred indirectly. For example, this can be done using location pings from vehicles, or with loop detectors that record when vehicles cross a particular marker. All realistic methods of measuring the cost of an edge (such as those above) require the presence of a vehicle that reports this information back to the routing platform.

In this paper, we assume that whenever a vehicle traverses an edge, we observe the time spent on the edge. We can then use this observation in future routing decisions. This induces a natural exploration/exploitation trade-off: we may wish to send vehicles on underexplored routes, even if those routes currently seem suboptimal. In this paper, we propose a learning model for congestion-aware routing, and present an algorithm which seeks to minimize the total driving time across all vehicles. Our algorithm applies to arbitrary networks and arbitrary (Lipschitz-continuous) congestion functions, even when observations are noisy.
The algorithm is also robust to changes in traffic conditions in a strong sense: we show that even when request endpoints are chosen adversarially, and even when traffic congestion on the edges is updated adversarially between requests, our algorithm learns an optimal routing policy.

1.1. MODEL

Consider a directed graph $(V, E)$. Each edge $e$ has a deterministic and fixed (but unknown) congestion function $f_e : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$. We assume each $f_e$ is $L$-Lipschitz continuous and nondecreasing, with $L$ known. For simplicity, we also assume that $f_e(0) = 0$ for all $e \in E$ (if not, we can simply translate and extend the function appropriately).

We consider an infinite horizon of time steps, starting from $t = 0$. At each time $t$, a new car arrives. An adversary tells us the current amount of flow on each edge, and the source and destination of the new car. Let $x^t_e$ be the flow on edge $e$ at time $t$ and let $P_t$ be the set of paths between the source and destination for the time $t$ arrival. Let $x^t_{\max} = \max_{e,t} x^t_e$. We must choose how to route the car, i.e., we must select a path $p_t \in P_t$. Based on our choice of $p_t$, we incur a cost based on the flow and the congestion functions on the edges in our chosen path: $c_t = \sum_{e \in p_t} f_e(x^t_e)$.

For each $e \in p_t$, we observe $c^t_e = f_e(x^t_e) + \eta^t_e$, where $\eta^t_e$ is a random variable with expectation 0. The distribution of $\eta^t_e$ is unknown, and can vary between edges and time steps. The distributions can be correlated across edges, as long as for a given edge, all the individual samples (i.e., $\eta^1_e, \eta^2_e, \ldots$) are independent. We assume that there exists $\beta$ such that $\eta^t_e \in [-\beta/2, \beta/2]$ for all edges $e$ and times $t$, and that $\beta$ (or an upper bound on $\beta$) is known.

The optimal cost at time $t$ is $c^*_t = \min_{p \in P_t} \sum_{e \in p} f_e(x^t_e)$, so the regret of our algorithm over the first $t$ time steps is $R_t = \sum_{r=1}^{t} \mathbb{E}[c_r - c^*_r]$, where the expectation is over the randomness in the noise samples.
Note also that we do not include the noise we observe in the objective function, since the noise has expectation 0. Any algorithm with sublinear regret can be said to learn an optimal routing policy: if $R_t = o(t)$, then $\mathbb{E}[c_t - c^*_t]$ must shrink as $t$ goes to infinity, i.e., the difference between our algorithm's cost and the optimal cost must go to 0 on average. In this paper, we give an algorithm with regret $\tilde{O}(t^{2/3})$.
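To make the model concrete, here is a minimal simulation of a single time step (the graph, the congestion functions, and the flow values are illustrative choices of ours, not from the paper): the adversary sets flows $x^t_e$, the learner picks a path, pays the sum of $f_e(x^t_e)$ along it, and receives a noisy observation $c^t_e$ on every edge of the chosen path.

```python
import random

# One round of the model: adversarial flows, path choice, cost, noisy feedback.
random.seed(0)
edges = {("s", "a"): lambda x: 0.5 * x,   # each f_e is L-Lipschitz with f_e(0) = 0
         ("a", "d"): lambda x: 0.9 * x,
         ("s", "b"): lambda x: 0.2 * x,
         ("b", "d"): lambda x: 0.4 * x}
beta = 0.1                                 # noise bound: eta_e^t in [-beta/2, beta/2]

adj = {}
for (u, v) in edges:
    adj.setdefault(u, []).append(v)

def simple_paths(u, dest, seen=()):
    """Enumerate simple s-d paths as tuples of edges (fine for tiny graphs)."""
    if u == dest:
        yield ()
        return
    for v in adj.get(u, []):
        if v not in seen:
            for rest in simple_paths(v, dest, seen + (u,)):
                yield ((u, v),) + rest

flows = {e: random.random() for e in edges}        # adversarial flows x_e^t
paths = list(simple_paths("s", "d"))
cost = lambda p: sum(edges[e](flows[e]) for e in p)

chosen = paths[0]                                  # the learner's (possibly bad) choice
c_t, c_star = cost(chosen), min(cost(p) for p in paths)
regret_t = c_t - c_star                            # per-step regret (noise excluded, E[eta] = 0)
observed = {e: edges[e](flows[e]) + random.uniform(-beta / 2, beta / 2)
            for e in chosen}                       # noisy per-edge feedback c_e^t
assert regret_t >= 0
```

Note that, as in the model, the learner only ever sees `observed` for edges on the path it actually traversed, which is what creates the exploration/exploitation trade-off.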

1.2. OUR CONTRIBUTION

Our main result is the following theorem:

Theorem 1.1. The expected regret of Algorithm 1 after $t$ time steps is
$$R_t = O\left(t^{2/3} \cdot |E| \log x^t_{\max} \left(\beta\sqrt{\log t} + x^t_{\max} L\right)\right).$$
The space complexity on time step $t$ is $O(|E| t^{1/3})$, and the time complexity is $O(\mathrm{SP}(|E|, |V|) + |E| \log t)$.

Here $\mathrm{SP}(|E|, |V|)$ denotes the time complexity of computing the shortest path between two vertices in a graph with nonnegative weights. For example, this can be done by Dijkstra's algorithm in time $O(|E| + |V| \log |V|)$. We also validate our algorithm's performance using graphs from New York City road networks.
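The shortest-path subroutine $\mathrm{SP}$ is entirely standard; for concreteness, a textbook Dijkstra sketch using a binary heap (the function name and the example graph are ours):

```python
import heapq

def dijkstra(adj, source, dest):
    """Shortest path with nonnegative weights. adj: {u: [(v, w), ...]}.
    Returns (distance, path). Assumes dest is reachable from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue          # stale heap entry
        done.add(u)
        if u == dest:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dest]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[dest], path[::-1]
```

In the algorithm, the edge weights fed to this routine are the cost estimates $u^t_e$, which are nonnegative by construction.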

1.3. RELATED WORK

Comparison with multi-armed bandits. Perhaps the simplest model of exploration vs. exploitation is the classical multi-armed bandit (MAB) problem (Slivkins, 2019). A MAB instance consists of $n$ arms, each with an unknown but fixed distribution. At each time step, the algorithm selects an arm and observes a reward drawn randomly from that arm's distribution. Several algorithms obtaining regret $O(\sqrt{t})$ are known, and this is also known to be the best possible, up to logarithmic factors. Our routing model generalizes the MAB problem. In particular, when the graph consists of $n$ parallel edges, the congestion functions are constant, and each edge has a single fixed noise distribution, our model reduces to the MAB problem: each edge is an arm, and the reward distribution for each edge is (the negative of) the congestion constant plus the noise distribution. Our model can be thought of as extending the MAB problem to the case where (1) each reward distribution has a parameter (in our case, the flow), and (2) the arms are edges in an arbitrary graph and their usage is constrained according to paths in the graph. We believe that routing is the most natural application of this model, but there could be other applications as well. Since our problem is strictly harder than the MAB problem, we know that regret $o(\sqrt{t})$ is impossible. We conjecture that even regret $\Theta(\sqrt{t})$ is impossible for our problem, but leave this as an open question.

Comparison with other routing work. Our work is also related to the body of literature on shortest paths under uncertainty, in which it is typically assumed that edge lengths follow known independent distributions. A canonical problem is to find an $s$-$t$ path whose length is below a threshold $L$ with highest probability. When the edge length distributions are Gaussian, Nikolova et al. (2006) present a quasi-polynomial time algorithm for this problem via connections to quasi-convex maximization.
A related probing problem is the Canadian Traveler Problem (Nikolova & Karger, 2008; Papadimitriou & Yannakakis, 1991), where the length of an edge is revealed when we reach one of its endpoints and the goal is to minimize the expected cost. There are no efficient algorithms known for this problem, except under special assumptions such as no backtracking (Bnaya et al., 2009). Awerbuch & Kleinberg (2004) study a conceptually similar adaptive routing problem. In their version, there is no noise in the observations, but the algorithm observes only the total cost of the path, and not the cost of each individual edge. Their model also does not have the same notion of a parametrized congestion function. Although their model is technically distinct from ours, it is noteworthy that they also obtain a regret bound of $\tilde{O}(t^{2/3})$. A number of subsequent works have also sought to apply bandit algorithms to shortest path selection problems in the domain of network signal routing (Chen & Ji, 2005; György et al., 2007; Liu & Zhao, 2012; Zou et al., 2014; Talebi et al., 2018). Moreover, there is a long line of other works broadly dealing with regret minimization in Internet routing and congestion control settings (Dong et al., 2015; 2018; Jiang et al., 2016; 2017; Talebi et al., 2018). Bandit algorithms for shortest path selection have also been applied to traffic route planning (Chorus, 2010; de Oliveira Ramos et al., 2017). However, these works differ from ours in that they do not consider the more general congestion functions that we consider in our model. A more recent work (Zhou et al., 2019) considers a similar routing problem, labeled the Multi-Armed Bandit On-Time Arrival Problem. In this setting, regret is quantified in terms of on-time arrival reliability, and the arms are joint route and departure times. Kveton et al. (2014) also study (among other applications) a routing setting in the context of matroid bandits, but this work assumes static edge costs and does not factor in congestion. Routing in the presence of congestion functions has been studied in the context of selfish routing (Roughgarden & Tardos, 2002) and in the atomic congestion games setting (Awerbuch et al., 2005; Christodoulou & Koutsoupias, 2005). The difference from our setting is that the congestion games and selfish routing models assume drivers have full information about the costs in the network and route themselves on an optimal path. In our setting, costs are learned as time progresses, and there is a centralized routing control, as given by navigation applications.

2. DESCRIPTION OF THE ALGORITHM

In this section, we define and give intuition for our algorithm for learning an optimal routing policy.

2.1. INTUITION BEHIND THE ALGORITHM

On each step, the algorithm uses past observations to form an estimate for the cost of each edge at the current flow, i.e., an estimate for $f_e(x^t_e)$. It then selects the shortest path according to these estimated costs. The heart of the algorithm is the cost estimation scheme. There are two sources of error in this estimation: we may not have observed this exact flow before (type 1 error), and noise (type 2 error).

For type 1 error, suppose all the observations we use are based on flows $y$ such that $|y - x^t_e| \leq \varepsilon$. Then Lipschitz continuity implies that $|f_e(y) - f_e(x^t_e)| \leq \varepsilon L$, and so if $\varepsilon$ is small, this aspect of our estimation should be accurate. However, we do not actually observe $f_e(y)$: instead, we observe $f_e(y) + \eta_y$, where $\eta_y$ is a random noise term with expectation 0. Even if we have observed the exact flow $x^t_e$ before, this noise will prevent us from having a perfect estimate. However, the more observations we use to form our estimate, the less impact the noise has: we can use Hoeffding's Inequality to show that the average noise $\left|\frac{1}{k}\sum_y \eta_y\right|$ goes to 0 quickly as the number of observations $k$ grows, and thus our type 2 error shrinks. But as we increase the number of observed flows $y$ that we use (for a fixed time step), we also (in general) increase the maximum distance $\max_y |y - x^t_e|$. Concretely, if $k(\varepsilon, x)$ is the number of observations based on flows within $\varepsilon$ of $x$, then $k(\varepsilon, x)$ is weakly increasing in $\varepsilon$. This is diametrically opposed to our plan for the type 1 error. In order for the average error to approach 0 as $t$ increases, we will have to carefully manage this tradeoff. We will need to ensure that the number of observations we use for estimation tends to infinity, but also that the maximum distance tends to 0.
Intuitively, this should be possible: if observations are somewhat uniformly distributed across edges and across the flow spectrum, then we should have $\Theta(t)$ observations per edge and $\Theta(t\varepsilon/x^t_{\max})$ observations on every interval of size $\varepsilon$ (treating $|E|$ as a constant here). Setting $\varepsilon$ to be something like $\Theta(t^{-1/2})$ would lead to $\Theta(\sqrt{t}/x^t_{\max})$ observations being used for each estimation with a maximum distance of $\Theta(t^{-1/2})$, which fits our requirements.

Observations are of course not guaranteed to be uniform, and can even be adversarial. For example, suppose we have observed many flows less than 1 on some edge $e$, but no flows above 1. When estimating the cost of a flow less than 1, we are in good shape (if those observations are uniformly distributed, say). But suppose we are asked to estimate the cost for flow 2: we may incur a large estimation error, since we have no information about that portion of the flow spectrum. But our lack of information also implies that so far we have not incurred any error in that part of the flow spectrum, since we have never used it! The more we use part of the flow spectrum on a given edge (i.e., our chosen path used that edge at a flow in that range), the more error we incur, but the more we learn. Thus we will need an analysis that is decomposable across edges and across the flow spectrum: for each interval $I$, we need to bound our cumulative error over all estimation queries that fall within that interval as a function of the length of the interval and the number of such queries.

There is one more complicating factor: what if we consistently overestimate the cost of a particular flow, so that we never use it? Then we may repeatedly incur a large error from that flow, without ever learning about it.
For this reason, we will want to ensure that with high likelihood, all of our estimates are underestimates: this will allow us to bound the regret on a given time step in terms of the estimation error only on edges in the path that we used.
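The type 1 / type 2 tradeoff above can be illustrated numerically. The following toy experiment (all parameters are our choices, not the paper's) estimates $f_e(x^t_e)$ by averaging noisy observations taken at flows within $\varepsilon$ of the target, so the error splits into a Lipschitz term of at most $\varepsilon L$ and a Hoeffding term that decays like $\sqrt{\ln \delta^{-1}/(2k)}$:

```python
import random, math

# Toy illustration: average k noisy observations taken within eps of x0.
random.seed(0)
L, beta, x0 = 1.0, 0.5, 0.6
f = lambda x: L * x                        # an L-Lipschitz congestion function

def estimate(eps, k):
    ys = [x0 + random.uniform(-eps, eps) for _ in range(k)]     # nearby flows
    obs = [f(y) + random.uniform(-beta / 2, beta / 2) for y in ys]  # noisy costs
    return sum(obs) / k

results = {}
for eps, k in [(0.2, 10), (0.05, 10), (0.05, 1000)]:
    err = abs(estimate(eps, k) - f(x0))
    # Type 1 term eps*L plus a type 2 term valid with probability >= 0.99:
    bound = eps * L + beta * math.sqrt(math.log(100) / (2 * k))
    results[(eps, k)] = err
    print(f"eps={eps}, k={k}: |error|={err:.3f} (0.99-bound {bound:.3f})")
```

Shrinking $\varepsilon$ shrinks the first term but (for fixed data) also shrinks $k$ and inflates the second, which is exactly the tension the bucketing scheme of Section 2.2 manages.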

2.2. OUR COST ESTIMATION SCHEME

The cost estimation works roughly as follows. The algorithm maintains a "bucketing" system for each edge, inspired by hash tables. The interval $[0, x^t_{e,\max}]$ is partitioned into a set of intervals, with each interval being a "bucket". Whenever we observe a new flow, we identify which bucket that flow falls into, and use all the observations in that bucket to estimate the cost. (We also subtract a term proportional to $\sqrt{\log t}$ in order to ensure that we get an underestimate with high probability; this is similar to the Upper Confidence Bound algorithm for multi-armed bandits.) We then observe the resulting cost for this flow (for this edge), and insert it into the bucket. If we observe a flow that falls outside of our bucketing system, we create a new depth 0 bucket so that the observed flow falls into our new bucket. If the number of elements in the bucket surpasses a certain threshold, we split it into two new buckets (whose associated intervals are of equal length). The threshold depends on the depth (i.e., how many times one of its ancestors was split). We denote the "lifetime" of a bucket at depth $m$ by $h(m)$. Our analysis will be "decomposable" across buckets: we will bound the total error any single bucket of depth $m$ can contribute over the course of its lifetime.

Let $b^t_e$ be the bucket we used for estimating the cost of edge $e$ at time $t$. The smaller the interval associated with $b^t_e$, the smaller the type 1 error: if our estimation only uses observed flows $y$ within this interval, and the interval has length $\varepsilon$, then the type 1 error is at most $\varepsilon L$, since our target flow $x^t_e$ is also in the interval. This incentivizes us to make $h(m)$ small, so that we can more quickly get to buckets with small interval lengths. If we only cared about type 1 error (i.e., if there were no noise), choosing $h(m) = 1$ would be optimal. On the other hand, the more observations stored in $b^t_e$, the smaller the type 2 error. This incentivizes us to make $h(m)$ large, so that each bucket contains a large number of observations for a larger portion of its lifetime. If we only cared about type 2 error (i.e., if $f_e$ were constant), choosing $h(m) = \infty$ would be optimal. This is exactly the tradeoff we discussed in Section 2.1. Eventually, $h(m) = 2^{2m}$ will give us the best regret bound.
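The following few lines tabulate this tradeoff for $h(m) = 2^{2m}$ (the normalization of the depth 0 interval to length 1 is our illustrative choice): each split halves a bucket's interval, shrinking the type 1 error, while the quadrupling lifetime lets the bucket accumulate more observations, shrinking the type 2 error, which scales like $1/\sqrt{|b|}$.

```python
# Per-depth interval length vs. bucket lifetime under h(m) = 2^(2m).
h = lambda m: 2 ** (2 * m)        # lifetime of a depth-m bucket
root_len = 1.0                    # a depth-0 bucket over [0, 1]

for m in range(5):
    interval = root_len / 2 ** m  # type 1 error is at most L * interval
    lifetime = h(m)               # bucket splits after this many insertions
    print(m, interval, lifetime)
```

So deeper buckets trade a $2\times$ reduction in interval length for a $4\times$ longer wait before the next refinement.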

2.3. DEFINING THE ALGORITHM

Algorithm 1 provides pseudocode for our algorithm. We use $B_e$ to denote the entire bucketing system for edge $e$. We treat $B_e$ as a set, whose elements are "buckets". Each bucket $b$ supports the following operations: $\mathrm{depth}(b)$, $\mathrm{dom}(b)$, $|b|$, $\mathrm{estimate}(b)$, create, and INSERT. The first four operations will simply be numerical fields stored in $b$. We also assume that the create operation takes constant time and space, and that the aforementioned four fields are initialized appropriately ($\mathrm{estimate}(b)$ should be 0). However, we will need to implement INSERT ourselves in order to ensure that it is done in an efficient fashion (hence the differing font). In general, we will reserve the variables $w, z$ for $\mathrm{dom}(b)$, $y$ for arbitrary elements of $b$, and $x = x^t_e$ for the proposed flow on the current time step. To limit space usage, we will not actually "store" each relevant observation in the bucket; instead, we will simply incorporate it into the estimate field that we store. However, in the analysis, it will be useful to use "$y \in b$" to denote that a particular observation $(y, c_y)$ has been used in that bucket. For each $y \in b$, let $\eta_y$ denote the noise sample associated with this observation. Note that $\eta_y$ is not known to the algorithm; we will simply use this notation in our analysis. Also, let $\mathrm{len}(\mathrm{dom}(b))$ denote the length of the domain, i.e., $z - w$ if $\mathrm{dom}(b) = [w, z]$. We can think of the bucketing system in terms of binary trees. Each bucket of depth 0 is a "root", and whenever we split a bucket, we create two "children". Thus $B_e$ is essentially a forest of binary trees, where each node represents a bucket. Any bucket $b \in B_e$ has exactly one ancestor of depth 0.
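A sketch of this bucketing system in code (field names follow the text; the running-mean form of INSERT, the $L = 1$ normalization, and the domain given to new depth 0 buckets are our guesses at details not fully restated here):

```python
import math

class Bucket:
    def __init__(self, depth, dom):
        self.depth = depth        # number of times an ancestor was split
        self.dom = dom            # domain [w, z] of flows this bucket covers
        self.size = 0             # |b|: number of observations inserted
        self.estimate = 0.0       # running mean of the adjusted observations u_y

    def insert(self, y, c_y):
        # u_y shifts the observation toward the left endpoint w; assumes L = 1.
        w = self.dom[0]
        u_y = c_y - max(0.0, y - w)
        self.size += 1
        self.estimate += (u_y - self.estimate) / self.size   # O(1) mean update

def find(buckets, x):
    return next((b for b in buckets if b.dom[0] <= x <= b.dom[1]), None)

def update_buckets(y, c_y, buckets, h):
    b = find(buckets, y)
    if b is None:                 # flow outside the system: new depth 0 bucket
        b = Bucket(0, [y, 2 * y] if y > 0 else [0.0, 1.0])
        buckets.append(b)
    b.insert(y, c_y)
    if b.size > h(b.depth):       # split into two equal-length children
        w, z = b.dom
        mid = (w + z) / 2
        b1, b2 = Bucket(b.depth + 1, [w, mid]), Bucket(b.depth + 1, [mid, z])
        (b1 if y <= mid else b2).insert(y, c_y)  # new buckets start non-empty
        buckets.remove(b)
        buckets += [b1, b2]

def pessimistic_estimate(x, buckets, t, alpha):
    b = find(buckets, x)
    if b is None or b.size == 0:
        return 0.0                # unexplored region: estimate 0 (= f_e(0))
    return max(0.0, b.estimate - math.sqrt(alpha * math.log(t) / b.size))
```

For instance, with $h(m) = 2^{2m}$, a depth 0 bucket splits on its second insertion, mirroring the split rule in UPDATEBUCKETS; a production version would keep the buckets of each tree sorted so `find` runs in $O(\log t)$ rather than linear time.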

3. EXPERIMENTS

In this section, we empirically validate the algorithm's performance on graphs from New York City road networks. We first describe the methodology, and then reflect on the results.

3.1. METHODOLOGY

We used two different graphs, corresponding to different regions within New York City. Graph 1 consisted of 303 vertices and 657 edges, and Graph 2 consisted of 429 vertices and 1204 edges. Congestion functions with domain $[0, 1]$ were generated randomly for each run of the algorithm, with each congestion function consisting of three pieces of slope chosen uniformly from $[0, 1]$. Uniform noise distributions were used for all runs, with varying values of $\beta$. The algorithm was initialized with $L = 1$ and $\alpha = 2\beta^2$. On each time step, the following steps were performed:

1. Independently sample a uniformly random flow $x^t_e$ in $[0, 1]$ for each edge $e$.

Algorithm 1 Algorithm for learning an optimal social routing policy.

1: function ROUTINGALGORITHM(E, L, α, h)
2:   for each e ∈ E do
3:     B_e ← {create(0, [0, 1])}    ▷ Start with a single bucket of depth 0 and constant length
4:   for each t ∈ N_{>0} do
5:     x^t_e ← proposed flow on edge e
6:     for each e ∈ E do
7:       u^t_e ← ESTIMATE(x^t_e, B_e)
8:     p_t ← arg min_{p ∈ P_t} Σ_{e ∈ p} u^t_e    ▷ Shortest path under the estimated costs
9:     for each e ∈ p_t do
10:      observe c^t_e and call UPDATEBUCKETS(x^t_e, c^t_e, B_e, h)

1: function ESTIMATE(x, B)
2:   if ∃b ∈ B s.t. x ∈ dom(b) then
3:     return max(0, estimate(b) − √(|b|⁻¹ ln t^α))
4:   else
5:     return 0

1: function UPDATEBUCKETS(y, c_y, B, h)
2:   if ∃b ∈ B s.t. y ∈ dom(b) then    ▷ Insert our observation into the corresponding bucket
3:     INSERT(b, (y, c_y))
4:     m ← depth(b)
5:     if |b| > h(m) then    ▷ Split the bucket if it is too big
6:       [w, z] ← dom(b)
7:       b_1 ← create(m + 1, [w, (w + z)/2])
8:       b_2 ← create(m + 1, [(w + z)/2, z])
9:       INSERT(b_1, (y, c_y))    ▷ Insert the observation into new buckets so they aren't empty.
⋮
13:  else
14:    B ← B ∪ {create(0, ...)}    ▷ New depth 0 bucket whose domain contains y

Figure 1 shows the results of all runs.
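The synthetic congestion functions can be generated as sketched below (the placement of the breakpoints at thirds is our assumption; the text only specifies three pieces with slopes drawn uniformly from $[0, 1]$, which makes each function nondecreasing and 1-Lipschitz with $f(0) = 0$):

```python
import random

def random_congestion_function(pieces=3):
    """Piecewise-linear f on [0, 1] with `pieces` equal-width segments,
    each with a slope drawn uniformly from [0, 1]."""
    slopes = [random.random() for _ in range(pieces)]
    width = 1.0 / pieces
    def f(x):
        cost, covered = 0.0, 0.0
        for s in slopes:
            step = min(width, max(0.0, x - covered))  # portion of this piece below x
            cost += s * step
            covered += width
        return cost
    return f

f = random_congestion_function()
assert f(0.0) == 0.0
assert f(1.0) >= f(0.5) >= f(0.0)     # nondecreasing
```

With $L = 1$ in the algorithm, this matches the initialization used in the experiments.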

3.2. RESULTS

All runs of the algorithm display the general behavior we would hope to see: concavity of the cumulative regret curve. Also as expected, higher noise levels lead to both larger absolute regret and weaker concavity: learning takes longer. Most of the runs display substantial concavity within 100,000 time steps, but some of the β = .5 runs (on both graphs) require closer to 200,000 steps in order for the concavity to clearly manifest. With regard to absolute regret, β has a dramatic effect. On the extremes, the total regret for β = 0 after 1,000,000 time steps is less than 5,000 for Graph 1 and less than 9,000 for Graph 2. (The regret for Graph 2 tends to be larger, due to the graph being larger.) However, for β = .5, the regret after 100,000 time steps is already more than 12,000 and 20,000 for Graph 1 and Graph 2, respectively. If we look at the average regret for β = 0 over 1,000,000 time steps, we get <.005 and <.009 for Graphs 1 and 2, respectively. Since the expected cost of each edge on a given time step is .25 (expected flow is .5, expected slope is .5), we view this as quite good. In contrast, for β = .5, the average regret over 1,000,000 time steps is >.48 and >.88 for Graphs 1 and 2, respectively. Relative to the average edge cost of .25, we view this as quite poor. This behavior again matches our expectations. Finally, we note that there are many other experimental parameters that we have not included in these plots. For example: non-i.i.d. flows, non-i.i.d. sources and destinations, non-i.i.d. congestion functions, congestion functions with different numbers of pieces, congestion functions with different Lipschitz constants, other noise distributions, correlated noise, and many more. We have indeed tested many of these, but due to space constraints, there is a limit to the number of plots we can include in the paper. We assure the reader that we have yet to find a parameter combination that causes unexpected behavior from the algorithm.

4. CONCLUSION

In this paper, we presented an algorithm which learns an optimal routing policy for any graph, any (Lipschitz continuous) congestion functions, any (bounded) noise distributions, and adversarial routing requests (i.e., source, destination, and the current flow on each edge). Our algorithm has cumulative regret $\tilde{O}(|E| t^{2/3})$.

There are many interesting directions for future work. The first relates to the dependence on $t$ in the regret bound. Since our problem generalizes the multi-armed bandit problem, we immediately inherit a $\Omega(\sqrt{t})$ lower bound. However, that still leaves substantial room to improve the dependence on $t$. It would also be interesting to improve the dependence on $|E|$. We have made several worst-case assumptions in this paper that are perhaps overly pessimistic: for example, in practice, neighboring edges may have correlated congestion functions, and flows may not change arbitrarily between time steps. On the empirical side, we evaluated our algorithm using real-world graphs, but synthetic congestion functions. Future experiments could use real data for the congestion functions as well: either by fitting a congestion function to the data, or by randomly sampling data points to simulate a congestion function.

Figure 1: The experimental performance of Algorithm 1. The x-axis in each plot is the time step, and the y-axis is the cumulative regret. There are eight plots, each consisting of four curves with different noise levels. Each curve is a single run of the algorithm. The upper four plots have time horizons of 100,000, and the bottom four have time horizons of 1,000,000. All plots in the left column use Graph 1, and all plots in the right column use Graph 2. Finally, since we tested seven values of β, we separated them into two different plots to avoid overcrowding a single plot. For example, the top two plots in the left column have the same time horizon of 100,000 and both use Graph 1. The top left plot contains the lower noise levels (0, .01, .02, .05), and the second-from-the-top left contains the higher noise levels (.05, .1, .2, .5). Note that β = .05 is included in both to allow visual comparison between the plots.
A. PROOFS

Lemma A.2 (Hoeffding's Inequality). Let $X_1, \ldots, X_n$ be independent random variables, each supported on an interval of length $\beta > 0$, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any $\varepsilon > 0$,
$$\Pr\left[\left|\bar{X} - \mathbb{E}[\bar{X}]\right| > \varepsilon\right] \leq \exp\left(\frac{-2n\varepsilon^2}{\beta^2}\right).$$

Lemma A.3. Fix an edge $e$ and a bucket $b \in B_e$. With probability at least $1 - \delta$,
$$\left|\frac{1}{|b|}\sum_{y \in b} \eta_y\right| \leq \beta\sqrt{\frac{\ln \delta^{-1}}{2|b|}}.$$

Proof. If $\beta = 0$, the claim is trivially true. Otherwise, since $\mathbb{E}[\eta_y] = 0$, and all $\eta_y$ are independent for a given edge, Hoeffding's Inequality gives us
$$\Pr\left[\left|\frac{1}{|b|}\sum_{y \in b} \eta_y\right| > \beta\sqrt{\frac{\ln \delta^{-1}}{2|b|}}\right] \leq \exp\left(\frac{\beta^2 \ln \delta^{-1}}{2|b|} \cdot \frac{-2|b|}{\beta^2}\right) = \delta,$$
as required.

We will now start the process of bounding the costs incurred by our algorithm. Moving forward, we will need to separately handle the time steps where $x^t_e$ does not fall into any bucket we have, and we must create a new depth 0 bucket (on line 14 of UPDATEBUCKETS). Let $T_e$ be the set of time steps $t$ where (1) $e \in p_t$, (2) $\exists r < t$ such that $e \in p_r$ (i.e., this is not the first time we are using this edge), and (3) we do not create a new depth 0 bucket. The reader can think of $T_e$ as the set of time steps when edge $e$ is used in a "normal" way.

Let $\zeta_e(t)$ be the indicator of the event that the concentration bound of Lemma A.3 (with $\delta = t^{-\alpha/\beta^2}$) holds for edge $e$ at time $t$, so that $\Pr[\zeta_e(t) = 1] \geq 1 - t^{-\alpha/\beta^2}$. The next lemma states that as long as $\zeta_e(t) = 1$, our algorithm's estimate $u^t_e$ will always be an underestimate of the true cost $f_e(x^t_e)$.

Lemma A.4. Fix a time $t$ and edge $e$. If $\zeta_e(t) = 1$, then $f_e(x^t_e) \geq u^t_e$.

Proof. If $u^t_e = 0$, then trivially $u^t_e \leq f_e(x^t_e)$. Thus assume $u^t_e > 0$. This implies $t \in T_e$, and so $b^t_e$ exists.
For each $y \in b^t_e$, define $u_y$ as in the INSERT function in Algorithm 1, and let $[w, z] = \mathrm{dom}(b^t_e)$. For brevity, let $x = x^t_e$. Note that $x \geq w$ and thus $f_e(x) \geq f_e(w)$. Also recall that $c_y = f_e(y) + \eta_y$ by definition. We first claim that $f_e(x) \geq u_y - \eta_y$ for all $y \in b^t_e$.

Case 1: $w > y$. Then $x > y$, so we have $u_y - \eta_y = c_y - \eta_y = f_e(y) \leq f_e(x)$, as required.

Case 2: $y \geq w$. Then $u_y - \eta_y = c_y - \eta_y - L(y - w) = f_e(y) - L(y - w)$. Because $f_e$ is $L$-Lipschitz, we have $f_e(y) \leq f_e(w) + L(y - w)$, so $u_y - \eta_y = f_e(y) - L(y - w) \leq f_e(w) + L(y - w) - L(y - w) \leq f_e(x)$.

Therefore, since $\zeta_e(t) = 1$,
$$f_e(x) \geq \frac{1}{|b^t_e|}\sum_{y \in b^t_e}(u_y - \eta_y) \geq \mathrm{estimate}(b^t_e) - \beta\sqrt{\frac{\ln t^{\alpha/\beta^2}}{2|b^t_e|}} \geq \mathrm{estimate}(b^t_e) - \sqrt{|b^t_e|^{-1}\ln t^\alpha} = u^t_e.$$

The next lemma is the corresponding upper bound.

Lemma A.5. Fix an edge $e$ and a time $t \in T_e$, and let $m = \mathrm{depth}(b^t_e)$. If $\zeta_e(t) = 1$, then $f_e(x^t_e) \leq u^t_e + 2^{1-m} x^t_{\max} L + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}$.

Proof. Let $[w, z] = \mathrm{dom}(b^t_e)$, and write $x = x^t_e$ for brevity. We first claim that $f_e(x) \leq u_y - \eta_y + 2^{1-m} x^t_{\max} L$ for all $y \in b^t_e$. Since $x \geq w$ and $f_e$ is $L$-Lipschitz, we have $f_e(x) \leq f_e(w) + L(x - w)$.

Case 1: $y \geq w$. Then $f_e(w) \leq f_e(y)$ by monotonicity, so
$$f_e(x) \leq f_e(y) + L(x - w) = c_y - \eta_y + L(x - y) + L(y - w) = u_y - \eta_y + L(x - w) \leq u_y - \eta_y + 2^{1-m} x^t_{\max} L,$$
where the last step follows from Lemma A.1.

Case 2: $w > y$. Then $f_e(w) \leq f_e(y) + L(w - y)$, so
$$f_e(x) \leq f_e(y) + L(w - y) + L(x - w) = c_y - \eta_y + L(x - y) \leq u_y - \eta_y + L(x - y) \leq u_y - \eta_y + 2^{1-m} x^t_{\max} L,$$
again using Lemma A.1 on the last step.

Therefore
$$f_e(x) \leq \frac{1}{|b^t_e|}\sum_{y \in b^t_e}\left(u_y - \eta_y + 2^{1-m} x^t_{\max} L\right) \leq u^t_e + 2^{1-m} x^t_{\max} L + \left|\frac{1}{|b^t_e|}\sum_{y \in b^t_e}\eta_y\right| + \sqrt{|b^t_e|^{-1}\ln t^\alpha} \leq u^t_e + 2^{1-m} x^t_{\max} L + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}.$$

The next lemma is a simple bound on the maximum flows on time steps when we create a new depth 0 bucket. This will be needed later to bound the error on those time steps.

Lemma A.6. Fix an edge $e \in E$ and a time step $t > 0$. Let $t_0, \ldots, t_k$ be the time steps up to and including time $t$ on which a new depth 0 bucket is created in $B_e$ (where $t_0 = 0$). Then $k \leq 1 + \log x^t_{\max}$. Furthermore, $\sum_{i=1}^{k} x^{t_i}_e \leq 2 x^t_{\max}$.

Proof. Assume $t_0 < t_1 < \cdots < t_k$. Let $y^r_{\max} = \max\left(\bigcup_{b \in B^r_e} \mathrm{dom}(b)\right)$, where $B^r_e$ denotes the set of buckets for edge $e$ at the beginning of time step $r$.
This is the maximum flow that falls into any bucket in $B_e$ at time $r$. Note that $y^{r+1}_{\max} = y^r_{\max}$ for all $r \notin \{t_1, \ldots, t_k\}$. Furthermore, for $r \in \{t_1, \ldots, t_k\}$, we have $y^{r+1}_{\max} = 2x^r_e > y^r_{\max}$. Therefore for each $i \in \{1, \ldots, k\}$,
$$y^{t_i+1}_{\max} = 2x^{t_i}_e > 2y^{t_i}_{\max}.$$
Thus for all $i, j \in \{1, \ldots, k\}$ with $i \leq j$, $y^{t_j}_{\max} > 2^{j-i}\, y^{t_i}_{\max}$. In particular, $y^{t_k}_{\max} > 2^k \cdot y^{t_0}_{\max} = 2^k$, where $y^{t_0}_{\max} = 1$ because we initialize $B_e$ with a single bucket corresponding to the interval $[0, 1]$. We know that $y^{t_k+1}_{\max} = 2x^{t_k}_e \leq 2x^t_{\max}$, so $2^k \leq 2x^t_{\max}$. Therefore $k \leq 1 + \log x^t_{\max}$. Finally,
$$\sum_{i=1}^k x^{t_i}_e = \frac{1}{2}\sum_{i=1}^k y^{t_i+1}_{\max} \leq \frac{y^{t_k+1}_{\max}}{2}\sum_{i=1}^k \frac{1}{2^{k-i}} \leq \frac{y^{t_k+1}_{\max}}{2} \cdot 2 \leq 2x^t_{\max},$$
as required.

Lemma A.7 gives our first bound on the cumulative regret of the algorithm. This bound is quite complex, and it will take substantial work later on to show that this bound is in fact $\tilde{O}(t^{2/3})$.

Lemma A.7. Assume $\alpha = 2\beta^2$. Then for all $t$, we have
$$R_t \leq \frac{|E| x^t_{\max} L (\pi^2 + 18)}{6} + \sum_{r=1}^t \sum_{e \in p_r}\left(\frac{x^t_{\max} L}{2^{\mathrm{depth}(b^r_e) - 1}} + 2\sqrt{|b^r_e|^{-1}\ln t^\alpha}\right).$$

Proof. We first prove a lower bound on $\mathbb{E}[c^*_t]$, the expected cost incurred by the optimal algorithm. By definition, $c^*_t = \min_{p \in P_t}\sum_{e \in p} f_e(x^t_e)$. By Lemma A.4, we have $f_e(x^t_e) \geq u^t_e$ with probability at least $1 - t^{-\alpha/\beta^2}$. For brevity, let $\delta = t^{-\alpha/\beta^2}$. Since $c^*_t \geq 0$ always, we have
$$\mathbb{E}[c^*_t] \geq (1 - \delta)\cdot\min_{p \in P_t}\sum_{e \in p} u^t_e.$$

Next, we prove an upper bound on $\mathbb{E}[c_t]$, the expected cost incurred by our algorithm. Note that
$$\mathbb{E}[c_t] = \sum_{e \in p_t} f_e(x^t_e) = \sum_{e \in p_t : t \in T_e} f_e(x^t_e) + \sum_{e \in p_t : t \notin T_e} f_e(x^t_e).$$
We will first handle the first term, and then the second term. Since $f_e$ is $L$-Lipschitz and $f_e(0) = 0$, we have $f_e(x) \leq x^t_{\max} L$ for all $x \leq x^t_{\max}$. Since $x^t_e \leq x^t_{\max}$, we have $f_e(x^t_e) \leq x^t_{\max} L$ as a broad upper bound for all cases. If $t \in T_e$, then Lemma A.5 implies that $f_e(x^t_e) \leq u^t_e + 2^{1 - \mathrm{depth}(b^t_e)} x^t_{\max} L + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}$ with probability at least $1 - \delta$.
Also recall that $p_t = \arg\min_{p \in P_t}\sum_{e \in p} u^t_e$ by definition. Thus
$$\sum_{e \in p_t : t \in T_e} f_e(x^t_e) \leq \sum_{e \in p_t}\left[\delta x^t_{\max} L + (1 - \delta)\left(u^t_e + \frac{x^t_{\max} L}{2^{\mathrm{depth}(b^t_e)-1}} + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}\right)\right] \leq \delta|E| x^t_{\max} L + (1 - \delta)\cdot\min_{p \in P_t}\sum_{e \in p} u^t_e + \sum_{e \in p_t}\left(\frac{x^t_{\max} L}{2^{\mathrm{depth}(b^t_e)-1}} + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}\right).$$
Combining this with the lower bound on $\mathbb{E}[c^*_t]$ gives
$$\mathbb{E}[c_t - c^*_t] \leq |\{e \in p_t : t \notin T_e\}|\, x^t_{e,\max} L + \frac{|E| x^t_{\max} L}{t^{\alpha/\beta^2}} + \sum_{e \in p_t}\left(\frac{x^t_{\max} L}{2^{\mathrm{depth}(b^t_e)-1}} + 2\sqrt{|b^t_e|^{-1}\ln t^\alpha}\right).$$
Summing the second term over time and using $\alpha = 2\beta^2$ gives us
$$\sum_{r=1}^t \frac{|E| x^r_{\max} L}{r^{\alpha/\beta^2}} = |E| x^t_{\max} L \sum_{r=1}^t \frac{1}{r^2} \leq |E| x^t_{\max} L \sum_{r=1}^{\infty}\frac{1}{r^2} = \frac{|E| x^t_{\max} L \pi^2}{6}.$$

For the first term, we further divide time steps $t \notin T_e$ into (1) time steps where we create a new depth 0 bucket, and (2) time steps where we use edge $e$ for the first time. For case (1), let $Q_e$ denote the set of these time steps. Recall that we only create a depth 0 bucket when the current flow $x^t_e$ does not fall into any bucket we already have. In particular, that implies it is the maximum flow we have seen on this edge: formally, $t \in Q_e$ implies $x^t_e = x^t_{e,\max}$, and Lemma A.6 bounds the total cost over $Q_e$ by $2|E| x^t_{\max} L$. For case (2), let $z_e(t)$ be the indicator variable which takes on value 1 if $t$ is the first time that $e$ is used, and 0 otherwise. Since this happens at most once per edge, case (2) contributes at most $|E| x^t_{\max} L$; together, the $t \notin T_e$ time steps account for the additive $\frac{18|E| x^t_{\max} L}{6}$ term in the bound.

Note that we use $|b^r_e|$ instead of $|b|$ in order to denote the size of the bucket on time step $r$ specifically. For a bucket $b$, let $A_t(b)$ denote the set of time steps $r \leq t$ on which $b$ was the bucket used for edge $e$ (i.e., $b^r_e = b$), and let $S^t_e$ denote the set of all buckets created for edge $e$ up to time $t$. Let
$$R_t(b) = \sum_{r \in A_t(b)}\left(2^{1 - \mathrm{depth}(b)} x^t_{\max} L + 2\sqrt{|b^r_e|^{-1}\ln t^\alpha}\right)$$
be the total regret incurred by bucket $b$, and let $R_t(e) = \sum_{b \in S^t_e} R_t(b)$ be the total regret incurred by edge $e$. The rest of the proof is devoted to bounding $R_t(e)$; after we have done so, we can simply sum this bound across all edges.

We next give a brief proof of a standard inequality that will be useful to us.

Lemma A.8. For all $n$, $\sum_{i=1}^n \frac{1}{\sqrt{i}} \leq 2\sqrt{n}$.

Proof. The proof is by induction on $n$.
The base case of $n = 1$ is trivial, so consider an arbitrary $n > 1$ and assume the claim holds for $n - 1$. Then
$$\sum_{i=1}^n \frac{1}{\sqrt{i}} = \frac{1}{\sqrt{n}} + \sum_{i=1}^{n-1} \frac{1}{\sqrt{i}} \le \frac{1}{\sqrt{n}} + 2\sqrt{n-1} \le 2(\sqrt{n} - \sqrt{n-1}) + 2\sqrt{n-1} = 2\sqrt{n},$$
where the last inequality uses $\frac{1}{\sqrt{n}} = \frac{2}{2\sqrt{n}} \le \frac{2}{\sqrt{n} + \sqrt{n-1}} = 2(\sqrt{n} - \sqrt{n-1})$, as required.

We will now bound the maximum regret a single bucket of depth $m$ can contribute.

Lemma A.9. Let $b \in S^t_e$ be a bucket with depth $m$. Then
$$R_t(b) \le 4\sqrt{h(m) \ln t^\alpha} + \frac{x^t_{\max} L\, h(m)}{2^{m-1}}.$$

Proof. Substituting $m = \operatorname{depth}(b)$, $R_t(b)$ satisfies
$$R_t(b) \le \frac{x^t_{\max} L\, |A_t(b)|}{2^{m-1}} + 2 \sum_{r \in A_t(b)} \sqrt{|b^r_e|^{-1} \ln t^\alpha} \le \frac{x^t_{\max} L\, |A_t(b)|}{2^{m-1}} + 2\sqrt{\ln t^\alpha} \sum_{i=1}^{|A_t(b)|} \frac{1}{\sqrt{i}}$$
$$\le \frac{x^t_{\max} L\, h(m)}{2^{m-1}} + 2\sqrt{\ln t^\alpha} \sum_{i=1}^{h(m)} \frac{1}{\sqrt{i}} \le \frac{x^t_{\max} L\, h(m)}{2^{m-1}} + 4\sqrt{h(m) \ln t^\alpha} \quad \text{(Lemma A.8)}$$
as required.

Next, we will analyze a linear program which will be useful in bounding the total regret contributed by a single edge. Consider Program 1, parameterized by $t$, $n$, and $k$:
$$\max_{x_1, \dots, x_n \in \mathbb{R}_{\ge 0}} \sum_{m=1}^n x_m 2^{km} \quad \text{s.t.} \quad \sum_{m=1}^n 2^{2m-3} x_m \le t, \qquad x_m \le 2^m \ \ \forall m \in \{1, \dots, n\}. \tag{1}$$

Lemma A.10. Assume $0 \le k < 2$, let $(x_1, \dots, x_n)$ be an optimal solution, and suppose $x_m > 0$. Then for all $i < m$, $x_i = 2^i$.

Proof. Suppose not, so that $x_i < 2^i$ for some $i < m$, and let $\delta = \min\!\big(2^{2m-3} x_m,\, 2^{2i-3}(2^i - x_i)\big) > 0$. Define a new solution $(y_1, \dots, y_n)$ by $y_i = x_i + \frac{\delta}{2^{2i-3}}$, $y_m = x_m - \frac{\delta}{2^{2m-3}}$, and $y_j = x_j$ for $j \notin \{i, m\}$. We first claim that $y$ is feasible. By the choice of $\delta$, we have $y_m \ge x_m - x_m = 0$ and $y_i \le x_i + (2^i - x_i) = 2^i$. Since $y_j = x_j$ for $j \notin \{i, m\}$, we have $0 \le y_j \le 2^j$ for all $j$. Also,
$$\sum_{j=1}^n 2^{2j-3} y_j = 2^{2i-3}\!\left(x_i + \frac{\delta}{2^{2i-3}}\right) + 2^{2m-3}\!\left(x_m - \frac{\delta}{2^{2m-3}}\right) + \sum_{j \notin \{i,m\}} 2^{2j-3} x_j = \sum_{j=1}^n 2^{2j-3} x_j,$$
which must be at most $t$, since $(x_1, \dots, x_n)$ is assumed to be feasible. Finally, we claim that $(y_1, \dots, y_n)$ has a better objective value. We have
$$\sum_{j=1}^n 2^{kj} y_j - \sum_{j=1}^n 2^{kj} x_j = \frac{\delta\, 2^{ki}}{2^{2i-3}} - \frac{\delta\, 2^{km}}{2^{2m-3}} = 8\delta\left(\frac{1}{2^{(2-k)i}} - \frac{1}{2^{(2-k)m}}\right).$$
Since $2 - k > 0$ and $m > i$, we have $2^{(2-k)i} < 2^{(2-k)m}$, so the above expression is strictly positive. Therefore $(y_1, \dots$
$, y_n)$ has a higher objective value, which contradicts the optimality of $(x_1, \dots, x_n)$.

For now, we are primarily interested in $k = 1$. Let $g(t, n)$ denote the maximum value of Program 1 for parameters $t$ and $n$, with $k = 1$. Lemma A.10 implies that for a fixed $t$, the value of $n$ does not actually matter, provided it is sufficiently large: the constraint $\sum_{m=1}^n 2^{2m-3} x_m \le t$ will be saturated by smaller values of $m$, so any additional variables will always have value 0. Formally, for all $t$, there exists $n_t$ such that $g(t, n_t) = g(t, n)$ for all $n \ge n_t$; for example, $n_t = t$ suffices. Thus we can simply take $n$ to be large enough and write $g(t) = g(t, n_t)$.

The next lemma bounds the total regret contributed by each "tree", i.e., a set of buckets that share a depth-0 ancestor. At this point we also substitute in $h(m) = 2^{2m}$.

Lemma A.11. Fix an edge $e$ and time $t$. Let $S \subseteq S^t_e$ be a set of buckets which all have the same depth-0 ancestor. Then
$$\sum_{b \in S} R_t(b) \le \left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) g(t).$$

Proof. Let $x_m$ denote the number of buckets $b \in S$ of depth $m$. Substituting $h(m) = 2^{2m}$ into Lemma A.9 gives
$$R_t(b) \le 4\sqrt{h(m) \ln t^\alpha} + \frac{x^t_{\max} L\, h(m)}{2^{m-1}} = \left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) 2^m.$$
Thus
$$\sum_{b \in S} R_t(b) \le \left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) \sum_{m=0}^{n_t} x_m 2^m.$$
We next claim that $(x_1, \dots, x_{n_t})$ is feasible for Program 1. We know that $x_m \le 2^m$, since within a single binary tree, there are at most $2^m$ nodes of depth $m$. It remains to show that $\sum_{m=1}^{n_t} 2^{2m-3} x_m \le t$. Note that all buckets of depth at least 1 come in pairs: spending $h(m-1)$ steps on a bucket of depth $m-1$ creates two buckets of depth $m$. Thus the number of time steps we spent on buckets of depth $m-1$ is exactly $\frac{x_m}{2}\, h(m-1) = 2^{2m-2}\, \frac{x_m}{2} = 2^{2m-3} x_m$; summing over $m$ shows that $\sum_{m=1}^{n_t} 2^{2m-3} x_m$ is at most the total number of time steps $t$. Since $(x_1, \dots, x_{n_t})$ is feasible, $\sum_m x_m 2^m$ is at most the optimal value $g(t)$, and the claim follows.

We are now ready to bound the total regret contributed by a single edge.

Lemma A.12. For all $e, t$, we have
$$R_t(e) \le (1 + \log x^t_{\max})\left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) g(t).$$

Proof. Let $S_1, \dots$
$, S_q$ be a partition of $S^t_e$ such that each $S_i$ contains exactly one depth-0 bucket, and all buckets in $S_i$ have the same depth-0 ancestor (i.e., each $S_i$ constitutes a single binary tree). By Lemma A.6, $q \le 1 + \log x^t_{\max}$. Therefore
$$R_t(e) = \sum_{i=1}^q \sum_{b \in S_i} R_t(b) \le \sum_{i=1}^q \left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) g(t) \quad \text{(Lemma A.11)}$$
$$= q \left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) g(t) \le (1 + \log x^t_{\max})\left(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\right) g(t)$$
as required.

We are almost done. One of our last remaining tasks is to bound $g(t)$, the optimal objective value of Program 1.

Lemma A.13. For all $t$, we have $g(t) \le \frac{2^8}{3}\, t^{2/3}$.

Proof. Fix a time $t$, and let $(x_1, \dots, x_n)$ be an optimal solution to Program 1 for parameter $t$. Let $m$ be the minimum integer such that $\sum_{i=1}^m 2^{2i-3} \cdot 2^i > t$. Then there exists $i \le m$ such that $x_i < 2^i$: otherwise, $\sum_{i=1}^m 2^{2i-3} x_i = \sum_{i=1}^m 2^{2i-3} \cdot 2^i > t$ by assumption, which would imply that $(x_1, \dots, x_n)$ is infeasible. Since there exists $i \le m$ with $x_i < 2^i$, Lemma A.10 implies that $x_j = 0$ for all $j > m$. Therefore
$$g(t) = \sum_{i=1}^n 2^i x_i = \sum_{i=1}^m 2^i x_i \le \sum_{i=1}^m 2^{2i}.$$
By the definition of $m$, we have $t \ge \sum_{i=1}^{m-1} 2^{2i-3} \cdot 2^i = \sum_{i=1}^{m-1} 2^{3i-3} \ge 2^{3m-9}$. Therefore
$$3m - 9 \le \log t, \qquad m \le \tfrac{1}{3} \log t + 3 = \log t^{1/3} + 3.$$
Therefore
$$g(t) \le \sum_{i=1}^m 2^{2i} \le \tfrac{4}{3}\, 2^{2m} \le \tfrac{4}{3}\, 2^{2 \log t^{1/3} + 6} = \tfrac{4}{3} \cdot 2^6 \cdot t^{2/3} = \tfrac{2^8}{3}\, t^{2/3}$$
as required.

Our last lemma analyzes the space and time complexity of our algorithm.

Lemma A.14. The space complexity and time complexity of Algorithm 1 on time step $t$ are $O(|E|\, t^{1/3} \log x^t_{\max})$ and $O\big(|E|(\log t + \log\log x^t_{\max}) + \mathrm{ShortestPath}(|E|, |V|)\big)$, respectively.

Proof. For the space complexity, note that each bucket $b$ requires a constant amount of space: it simply needs to store $|b|$, $\operatorname{dom}(b)$, $\operatorname{depth}(b)$, and $\operatorname{estimate}(b)$, each of which takes constant space. Thus the space complexity is proportional to the total number of buckets at time $t$. In fact, we will bound $|S^t_e|$, the number of buckets that have existed at any time up to time $t$ for edge $e$. As in Lemma A.12, let $S_1, \dots, S_q$ be a partition of $S^t_e$ such that each $S_i$ contains exactly one depth-0 bucket, and all buckets in $S_i$ have the same depth-0 ancestor. We claim that $|S_i| = O(t^{1/3})$ for all $i$.
We will show this using Program 1, but now with $k = 0$ instead of $k = 1$; the analysis proceeds similarly to Lemmas A.11 and A.13. Let $g'(t)$ be the optimal objective value of Program 1 with $k = 0$ and parameter $t$. We claim that $|S_i| \le g'(t)$. Let $x_m$ denote the number of buckets of depth $m$ in $S_i$ that have existed up to and including time $t$. Since the constraints of Program 1 do not depend on $k$, we already showed in the proof of Lemma A.11 that $(x_1, \dots, x_n)$ is feasible. Since $|S_i| = \sum_{m=1}^n x_m$ is exactly the objective for $k = 0$, we have $|S_i| \le g'(t)$. It remains to bound $g'(t)$. Let $m$ be the minimum integer such that $\sum_{i=1}^m 2^{2i-3} \cdot 2^i > t$. Then, as argued in Lemma A.13, $m \le \log t^{1/3} + 3$. Therefore
$$g'(t) \le \sum_{i=1}^m x_i \le \sum_{i=1}^m 2^i \le 2^{m+1} \le 2^{\log t^{1/3} + 4} = O(t^{1/3}).$$
By Lemma A.6, $q \le 1 + \log x^t_{\max}$. Therefore the total number of buckets for edge $e$ at time $t$ is at most
$$|S^t_e| = \sum_{i=1}^q |S_i| \le q \cdot O(t^{1/3}) = O(t^{1/3} \log x^t_{\max}),$$
and thus the total space complexity is $O(|E|\, t^{1/3} \log x^t_{\max})$.

For the time complexity, each time step involves two tasks that may take non-constant time: computing $u^t_e$ for each $e \in E$, and computing $\arg\min_{p \in P_t} \sum_{e \in p} u^t_e$. Once the first task has been done, the second can be accomplished by running any shortest-path algorithm on the directed graph $(V, E)$ where edge $e$ has nonnegative weight $u^t_e$; this takes time $\mathrm{ShortestPath}(|E|, |V|)$. For the first task, there are two parts that may take non-constant time. The first part is searching for a bucket $b$ such that $y \in \operatorname{dom}(b)$. The buckets correspond to non-overlapping intervals, so this can be done by keeping the buckets in sorted order and using binary search. This approach yields a time complexity of $O(\log |S^t_e|) = O(\log t + \log\log x^t_{\max})$ per edge and thus $O(|E|(\log t + \log\log x^t_{\max}))$ in total. Note that to keep the buckets in sorted order, creating a bucket now requires us to insert it into the right place in the ordering.
Using a self-balancing binary search tree, this can be done in time $O(\log |S^t_e|)$ per operation as well (and at most two buckets are created per edge per time step). The second part that may take non-constant time is computing $y_{\max} = \max \bigcup_{b \in B_e} \operatorname{dom}(b)$, if applicable. If the buckets are stored in sorted order, this can be done in constant time by looking at the last bucket in the ordering. This yields the desired bound.

We are finally ready to prove Theorem 1.1.

Theorem 1.1. The expected regret of Algorithm 1 after $t$ time steps is
$$R_t = O\big(t^{2/3} \cdot |E| \log x^t_{\max}\, (\beta \sqrt{\log t} + x^t_{\max} L)\big).$$
Proof. Combining Lemma A.7 with Lemmas A.12 and A.13 yields the regret bound, and combining this with Lemma A.14 proves the theorem.
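As a numerical sanity check on Lemma A.13, the following sketch (our own illustration, not code from the paper) computes the upper bound $\sum_{i=1}^m 2^{2i}$ on $g(t)$ derived in the proof and verifies it against $\frac{2^8}{3} t^{2/3}$:

```python
def g_upper(t):
    """Upper bound on g(t) from the proof of Lemma A.13: find the minimum m
    with sum_{i=1}^m 2^(3i-3) > t, then g(t) <= sum_{i=1}^m 2^(2i)."""
    m, budget_used = 0, 0
    while budget_used <= t:
        m += 1
        budget_used += 2 ** (3 * m - 3)
    return sum(2 ** (2 * i) for i in range(1, m + 1))

# The bound of Lemma A.13 holds with plenty of slack on these inputs.
for t in (10, 1000, 10 ** 6):
    assert g_upper(t) <= (2 ** 8 / 3) * t ** (2 / 3)
```

For example, $t = 1000$ gives $m = 5$ and an upper bound of $1364$, far below $\frac{2^8}{3} \cdot 1000^{2/3} \approx 8533$.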



Note that $E[c_t - c^*_t]$ may not decrease monotonically: a sublinear-regret algorithm can still incur large errors sporadically, as long as the frequency of the large errors goes to 0.

With regard to the chosen range of $\beta$ values, note that the expected maximum cost on each edge is $0.5$, since the domain of each $f_e$ is $[0, 1]$ and the expected slope of each piece is $0.5$. Thus, in some sense, $\beta = 0.5$ represents "as much noise as signal", $\beta = 0.2$ represents "2/5 as much noise as signal", and so on.

This means that $y$ is actually not in $\operatorname{dom}(b^t_e)$. This only happens if $y$ was inserted into $b^t_e$ on the time step when $b^t_e$ was created.

Note that we could have chosen any $\alpha \ge 2\beta^2$ and simply obtained a different constant.



$|b|$ → returns the number of (flow, cost) pairs stored in $b$
$\operatorname{dom}(b)$ → returns the interval $[w, z] \subseteq [0, K]$ associated with $b$
$\operatorname{depth}(b)$ → returns the depth of bucket $b$
$\operatorname{estimate}(b)$ → returns the cost estimate based on the current observations stored in $b$
$\operatorname{create}(m, [w, z])$ → returns a new empty bucket $b$ with $\operatorname{dom}(b) = [w, z]$ and $\operatorname{depth}(b) = m$
$\operatorname{INSERT}(b, (y, c_y))$ → inserts the flow $y$ and its associated observed cost $c_y$ into $b$
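A minimal Python sketch of this bucket interface (the class layout and field names are our own; the paper specifies only the operations listed above):

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    depth: int             # depth(b): depth of the bucket in its binary tree
    dom: tuple             # dom(b): the interval [w, z] within [0, K] covered by b
    estimate: float = 0.0  # estimate(b): cost estimate from the stored observations
    size: int = 0          # |b|: number of (flow, cost) pairs stored in b

def create(m, interval):
    """create(m, [w, z]): return a new empty bucket with the given depth and domain."""
    return Bucket(depth=m, dom=interval)
```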

$(b_2, (y, c_y))$   ▷ The per-bucket (Type 1) error remains proportional to the domain lengths
11: $B_e \leftarrow (B_e \setminus \{b\}) \cup \{b_1\} \cup \{b_2\}$
12: else   ▷ This flow does not fall into any bucket: create a new depth-0 bucket
13: $y_{\max} \leftarrow \max \bigcup_{b \in B_e} \operatorname{dom}(b)$
14: $b \leftarrow \operatorname{create}(0, [y_{\max}, 2x])$
15: $B_e \leftarrow B_e \cup \{b\}$

1: function INSERT$(b, (y, c_y))$
2: $[w, z] \leftarrow \operatorname{dom}(b)$
3: if $y < w$ then   ▷ $y \notin \operatorname{dom}(b)$ is possible if we created $b$ on this time step
4:  $u_y \leftarrow c_y$
5: else
6:  $u_y \leftarrow c_y - L(y - w)$
7: $\operatorname{estimate}(b) \leftarrow \frac{|b| \cdot \operatorname{estimate}(b) + u_y}{|b| + 1}$   ▷ Running average of $\{u_y : y \in b\}$
8: $|b| \leftarrow |b| + 1$

2. Independently sample a uniformly random noise term $\eta^t_e \in [-\beta/2, \beta/2]$ for each edge $e$.
3. Independently sample a uniformly random source and destination from the graph.
4. Compute the path $p_t$ chosen by our algorithm.
5. Compute $c_t - c^*_t = \sum_{e \in p_t} f_e(x^t_e) - \min_{p \in P_t} \sum_{e \in p} f_e(x^t_e)$ and record the cumulative regret $R_t$.

Each run was characterized by the following parameters:
1. Time horizon (either 100,000 or 1,000,000)
2. The graph used (Graph 1 or Graph 2)
3. The noise parameter $\beta \in \{0, .01, .02, .05, .1, .2, .5\}$
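The INSERT routine above can be sketched in Python as follows (the dict-based bucket representation and the value of the Lipschitz constant L are our own stand-ins; the shift by $L(y - w)$ mirrors line 6 of the listing):

```python
L = 1.0  # Lipschitz constant of the congestion functions (assumed known here)

def insert(b, y, c_y):
    """INSERT(b, (y, c_y)): fold the observation into the bucket's running average.
    Costs are shifted down by L * (y - w) so observations taken at different
    flows within dom(b) = [w, z] become comparable."""
    w, z = b['dom']
    # y < w is possible only if b was created on this very time step
    u_y = c_y if y < w else c_y - L * (y - w)
    b['estimate'] = (b['size'] * b['estimate'] + u_y) / (b['size'] + 1)
    b['size'] += 1

b = {'dom': (0.0, 1.0), 'estimate': 0.0, 'size': 0}
insert(b, 0.5, 2.0)  # u_y = 2.0 - 1.0 * 0.5 = 1.5
insert(b, 1.0, 3.0)  # u_y = 3.0 - 1.0 * 1.0 = 2.0
```

After these two insertions, the running average is $(1.5 + 2.0)/2 = 1.75$.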


$\sum_{r=1}^t z_e(r) \le 1$ for all $e \in E$. Therefore,
$$L \sum_{r=1}^t |\{e \in p_r : r \notin T_e\}|\, x^r_{e,\max} = L \sum_{r=1}^t x^r_{e,\max} \Big( |\{e : r \in Q_e\}| + |\{e \in p_r : z_e(r)

Next, observe that $|A_t(b)| \le h(m)$: $|b| = 1$ initially, and once $|b| > h(m)$, we split $b$ into two new buckets and never use it again. Thus we can rewrite the above bound to sum over the number of elements in $b$:
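The splitting rule can be sketched as follows; this is a reconstruction under our own assumptions (halving the domain and resetting the children's counts), with $h(m) = 2^{2m}$ taken from Lemma A.11:

```python
def h(m):
    # Splitting threshold used in the analysis: h(m) = 2^(2m)
    return 2 ** (2 * m)

def maybe_split(b):
    """Once a bucket holds more than h(depth) observations, retire it and
    return two fresh children of depth + 1 whose domains halve the parent's."""
    if b['size'] <= h(b['depth']):
        return [b]
    w, z = b['dom']
    mid = (w + z) / 2
    def child(lo, hi):
        return {'dom': (lo, hi), 'depth': b['depth'] + 1, 'estimate': 0.0, 'size': 0}
    return [child(w, mid), child(mid, z)]
```

This matches the pairing used in the feasibility argument: every bucket of depth $m \ge 1$ is created together with a sibling when its parent fills up.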

Consider the following linear program (Program 1), parameterized by $t$, $n$, and $k$:
$$\max_{x_1, \dots, x_n \in \mathbb{R}_{\ge 0}} \sum_{m=1}^n x_m 2^{km} \tag{1}$$
$$\text{s.t.} \quad \sum_{m=1}^n 2^{2m-3} x_m \le t, \qquad x_m \le 2^m \quad \forall m \in \{1, \dots, n\}$$
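Because Lemma A.10 forces any optimum (for $k < 2$) to saturate the lowest indices first, the program can be solved by a simple greedy pass rather than a general LP solver. A hedged Python sketch (our own illustration):

```python
def solve_program1(t, n, k=1):
    """Greedy optimum of Program 1: fill x_m = 2^m for m = 1, 2, ... while the
    budget constraint sum 2^(2m-3) * x_m <= t allows, then spend any leftover
    budget fractionally on the next index. Valid because objective-per-unit-
    budget 2^((k-2)m + 3) is decreasing in m when k < 2."""
    budget, value = float(t), 0.0
    for m in range(1, n + 1):
        x_m = min(2 ** m, budget / 2 ** (2 * m - 3))
        value += x_m * 2 ** (k * m)
        budget -= x_m * 2 ** (2 * m - 3)
        if budget <= 0:
            break
    return value
```

For instance, with $t = 8$ the greedy pass sets $x_1 = 2$ and $x_2 = 3.5$, giving objective $2 \cdot 2 + 3.5 \cdot 4 = 18$.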

Throughout the analysis, for each $e \in E$ and time $t$, let $b^t_e$ denote the bucket that is used on line 3 of ESTIMATEEDGECOST, if such a bucket exists. For each $t \in T_e$, we are guaranteed that $b^t_e$ exists, since Condition 3 above ensures that there exists $b \in B_e$ with $x^t_e \in \operatorname{dom}(b)$, and Condition 2 ensures that $|b| > 0$. Throughout the proof, we will use $|b^t_e|$ to denote the number of observations in the bucket on that time step, before inserting the new element (if one is inserted). Also, for each $e \in E$ and time $t$, let $\zeta_e(t)$ be the indicator variable for the event that

Lemma A.5 states that if $t \in T_e$ and $\zeta_e(t) = 1$, then our estimate $u^t_e$ also should not be too much more than $f_e(x^t_e)$.

Lemma A.5. Fix a time $t \in T_e$ and edge $e$, and let $m = \operatorname{depth}(b^t_e)$. If $\zeta_e(t) = 1$, then
$$f_e(x^t_e) \le u^t_e + 2^{1-m}\, x^t_{\max} L + 2\sqrt{|b^t_e|^{-1} \ln t^\alpha}.$$

Proof. As before, for each $y \in b^t_e$, define $u_y$ as in INSERT, let $[w, z] = \operatorname{dom}(b^t_e)$, and let $x = x^t_e$.

Next, for the second term $\sum_{e \in p_t : t \notin T_e} f_e(x^t_e)$, we have

be the set of buckets associated with edge $e$ that have existed up to and including time $t$, and let $A_t(b) = \{r \le t : e \in p_r \text{ and } b^r_e = b\}$ be the set of time steps on which we used bucket $b$. We know that every time we use edge $e$ (i.e., $e \in p_t$), we use exactly one bucket $b^t_e$. Thus we can rewrite the bound from Lemma A.7 as

$, \dots, x_m)$ is a feasible solution to Program 1, and $\sum_{b \in S} R_t(b)$ is at most $\big(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\big)$ times the objective value of Program 1 at $(x_1, \dots, x_m)$. Since the objective value at $(x_1, \dots, x_m)$ is at most the optimal objective value $g(t)$, we have $\sum_{b \in S} R_t(b) \le \big(4\sqrt{\ln t^\alpha} + 2 x^t_{\max} L\big)\, g(t)$.

The space complexity and time complexity of Algorithm 1 on time step $t$ are $O(|E|\, t^{1/3} \log x^t_{\max})$ and $O\big(|E|(\log t + \log\log x^t_{\max}) + \mathrm{ShortestPath}(|E|, |V|)\big)$, respectively.
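The sorted-interval bucket lookup from the proof of Lemma A.14 can be sketched with Python's bisect module (in the real data structure the sorted list of left endpoints would be maintained incrementally; rebuilding it per query, as below, is for brevity only):

```python
import bisect

def find_bucket(buckets, y):
    """Return the index of the bucket whose interval contains flow y, or None.
    buckets is a list of disjoint (w, z) intervals sorted by left endpoint."""
    lefts = [w for (w, _) in buckets]
    i = bisect.bisect_right(lefts, y) - 1
    if i >= 0 and y <= buckets[i][1]:
        return i
    return None  # y falls in a gap: the algorithm creates a new depth-0 bucket

buckets = [(0.0, 1.0), (1.5, 2.0), (4.0, 8.0)]
# y_max is the right endpoint of the last bucket: constant time.
y_max = buckets[-1][1]
```

Keeping the intervals sorted makes both the membership query and the $y_{\max}$ lookup cheap, which is exactly the point of the sorted-order bookkeeping in the lemma.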

$\cdot\; |E| \log x^t_{\max}\, \big(\beta \sqrt{\log t} + x^t_{\max} L\big)$. The space complexity and time complexity on time step $t$ are $O(|E|\, t^{1/3} \log x^t_{\max})$ and $O\big(|E|(\log t + \log\log x^t_{\max}) + \mathrm{SP}(|E|, |V|)\big)$, respectively.


A PROOF OF THEOREM 1.1

In this section, we prove our main theorem.

Theorem 1.1. The expected regret of Algorithm 1 after $t$ time steps is $O\big(t^{2/3} \cdot |E| \log x^t_{\max}\, (\beta \sqrt{\log t} + x^t_{\max} L)\big)$. The space complexity and time complexity on time step $t$ are $O(|E|\, t^{1/3} \log x^t_{\max})$ and $O\big(|E|(\log t + \log\log x^t_{\max}) + \mathrm{SP}(|E|, |V|)\big)$, respectively.

We restate the algorithm below for the reader's convenience. Our first lemma upper bounds the distance between two flows associated with the same bucket. Lemma A.2 is a standard concentration inequality due to Hoeffding. Lemma A.3 is essentially a restatement of Hoeffding's lemma for our setting.
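Lemma A.2 (Hoeffding's inequality) can be illustrated empirically; the sketch below (with parameters of our own choosing) estimates $P(|\bar{X} - 1/2| \ge \varepsilon)$ for means of $n$ i.i.d. Uniform[0, 1] samples and compares it to the bound $2e^{-2n\varepsilon^2}$:

```python
import math
import random

def deviation_rate(n, eps, trials=2000, seed=0):
    """Fraction of trials where the sample mean of n Uniform[0, 1] draws
    deviates from its expectation 1/2 by at least eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            bad += 1
    return bad / trials

rate = deviation_rate(100, 0.1)
bound = 2 * math.exp(-2 * 100 * 0.1 ** 2)  # Hoeffding bound, about 0.271
```

The empirical rate sits far below the bound, as expected: Hoeffding is loose but dimension-free, which is all the analysis needs.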

