COMBINING PHYSICS AND MACHINE LEARNING FOR NETWORK FLOW ESTIMATION

Abstract

The flow estimation problem consists of predicting missing edge flows in a network (e.g., traffic, power, and water) based on partial observations. These missing flows depend on both the underlying physics (edge features and a flow conservation law) and the observed edge flows. This paper introduces an optimization framework for computing missing edge flows and solves the problem using bilevel optimization and deep learning. More specifically, we learn regularizers that depend on edge features (e.g., number of lanes in a road, resistance of a power line) using neural networks. Empirical results show that our method accurately predicts missing flows, outperforming the best baseline, and is able to capture relevant physical properties in traffic and power networks.

1. INTRODUCTION

In many applications, ranging from road traffic to supply chains to power networks, the dynamics of flows on the edges of a graph are governed by physical laws/models (Bressan et al., 2014; Garavello & Piccoli, 2006). For instance, the LWR model describes equilibrium equations for road traffic (Lighthill & Whitham, 1955; Richards, 1956). However, it is often difficult to fully observe flows in these applications and, as a result, practitioners rely on off-the-shelf machine learning models to make predictions about missing flows (Li et al., 2017; Yu et al., 2018). A key limitation of these machine learning models is that they disregard the physics governing the flows. So, the question arises: can we combine physics and machine learning to make better flow predictions?

This paper investigates the problem of predicting missing edge flows based on partial observations and the underlying domain-specific physics defined by flow conservation and edge features (Jia et al., 2019). Edge flows depend on the graph topology due to a flow conservation law, i.e., the total in-flow at every vertex is approximately its total out-flow. Moreover, the flow at an edge also depends on its features, which might regularize the space of possible flow distributions in the graph. Here, we propose a model that learns how to predict missing flows from data using bilevel optimization (Franceschi et al., 2017) and neural networks. More specifically, features are given as inputs to a neural network that produces edge flow regularizers. Weights of the network are then optimized via reverse-mode differentiation based on a flow estimation loss from multiple train-validation pairs.

Our work falls under a broader effort towards incorporating physics knowledge into machine learning, which is relevant for natural science and engineering applications where data availability is limited (Rackauckas et al., 2020). Conservation laws (of energy, mass, momentum, charge, etc.) are essential to our understanding of the physical world. The classical Noether's theorem shows that such laws arise from symmetries in nature (Hanc et al., 2004). However, flow estimation, which is an inverse problem (Tarantola, 2005; Arridge et al., 2019), is ill-posed under conservation alone. Regularization enables us to apply domain knowledge in the solution of inverse problems.

We motivate our problem and evaluate its solutions using two application scenarios. The first is road traffic networks (Coclite et al., 2005), where vertices represent locations, edges are road segments, flows are counts of vehicles that traverse a segment, and features include numbers of lanes and speed limits. The second scenario is electric power networks (Dörfler et al., 2018), where vertices represent power buses, edges are power lines, flows are amounts of power transmitted, and edge features include resistances and lengths of lines. Irrigation channels, gas pipelines, blood circulation, supply chains, air traffic, and telecommunication networks are other examples of flow graphs.

Our contributions can be summarized as follows: (1) we introduce a missing flow estimation problem with applications in a broad class of flow graphs; (2) we propose a model for flow estimation that is able to learn the physics of flows by combining reverse-mode differentiation and neural networks; (3) we show that our model outperforms the best baseline by up to 18%; and (4) we provide evidence that our model learns interpretable physical properties, such as the role played by resistance in a power transmission network and by the number of lanes in a road traffic network.

2. FLOW ESTIMATION PROBLEM

We introduce the flow estimation problem, which consists of inferring missing flows in a network based on a flow conservation law and edge features. We provide a list of symbols in the Appendix.

Flow Graph. Let G(V, E, X) be a flow graph with vertices V (n = |V|), edges E (m = |E|), and edge feature matrix X ∈ R^{m×d}, where X[e] are the features of edge e. A flow vector f ∈ R^m contains the (possibly noisy) flow f_e for each edge e ∈ E. In case G is directed, f ∈ R^m_+; otherwise, a flow is negative if it goes against the arbitrary orientation of its edge. We assume that flows are induced by the graph, and thus the total flow (in plus out) at each vertex is approximately conserved:

    \sum_{(v_i,u) \in E} f_{(v_i,u)} \approx \sum_{(u,v_o) \in E} f_{(u,v_o)}, \quad \forall u \in V    (1)

In the case of a road network, flow conservation implies that vehicles mostly remain on the road.

Flow Estimation Problem. Given a graph G(V, E, X) with partial flow observations f̂ ∈ R^{m′} for a subset E′ ⊆ E of edges (f̂_e is the flow for e ∈ E′, m′ = |E′| < m), predict flows for edges in E \ E′.

In our road network example, partial vehicle counts f̂ might be measured by sensors placed at a few segments, and the goal is to estimate counts at the remaining segments. One would expect flows not to be fully conserved in most applications due to the existence of inputs and outputs, such as parking lots and power generators/consumers. In case these input and output values are known exactly, they can easily be incorporated into our problem as flow observations. Moreover, if they are known approximately, we can apply them as priors (as will be detailed in the next section). For the remainder of this paper, we assume that inputs and outputs are unknown and employ flow conservation as an approximation of the system. Thus, different from classical flow optimization problems, such as min-cost flow (Ahuja et al., 1988), we assume that flows are conserved only approximately.

Notice that our problem is similar to the one studied in Jia et al. (2019). However, while their definition also assumes flow conservation, it does not take edge features into account. We claim that these features play an important role in capturing the physics of flows. Our main contribution is a new model that is able to learn how to regularize flows based on edge features using neural networks.
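To make the approximate conservation law concrete, here is a minimal sketch on a toy directed graph; the edges and flow values are invented for illustration and are not from the paper's datasets:

```python
# A toy check of the (approximate) conservation law: at each vertex,
# total in-flow should roughly equal total out-flow.
edges = {(0, 1): 2.0, (1, 2): 2.1, (0, 2): 3.0}  # directed edge -> flow

def net_divergence(edges, u):
    """In-flow minus out-flow at vertex u; ~0 when flow is conserved."""
    inflow = sum(f for (s, t), f in edges.items() if t == u)
    outflow = sum(f for (s, t), f in edges.items() if s == u)
    return inflow - outflow

print(net_divergence(edges, 1))  # ~0: vertex 1 approximately conserves flow
```

Vertices 0 and 2 act as an unobserved source and sink, respectively, which is exactly the situation where conservation holds only approximately.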

3. OUR APPROACH: PHYSICS+LEARNING

In this section, we introduce our approach for the flow estimation problem, which is summarized in Figure 1. We formulate flow estimation as an optimization problem (Section 3.1), where the interplay between the flow network topology and edge features is defined by the physics of flow graphs. Flow estimation is shown to be equivalent to a regularized least-squares problem (Section 3.2). Moreover, we describe how the effect of edge features and the graph topology can be learned from data using bilevel optimization and neural networks in Section 3.3. Finally, we propose a reverse-mode differentiation algorithm for flow estimation in Section 3.4.

[Figure 1: Overview of our approach. Observed, missing, and validation flows in the flow graph; a regularizer Q_{e,e} is produced for each edge; flow estimation (conservation + regularization) is performed under K-fold cross-validation.]

3.1. FLOW ESTIMATION VIA OPTIMIZATION

The extent to which flow conservation holds for flows in a graph is known as divergence and can be measured using the oriented incidence matrix B ∈ R^{n×m} of G. The matrix is defined as follows: B_{ij} = 1 if ∃u such that e_j = (u, v_i) ∈ E, B_{ij} = −1 if ∃u such that e_j = (v_i, u) ∈ E, and B_{ij} = 0 otherwise. Given B and f, the divergence at a vertex u can be computed as:

    (Bf)_u = \sum_{(v_i,u) \in E} f_{(v_i,u)} − \sum_{(u,v_o) \in E} f_{(u,v_o)}

And thus, we can compute the total (squared) divergence in the graph as ||Bf||_2^2 = f^T B^T B f = \sum_{u \in V} ((Bf)_u)^2.

One could try to solve the flow estimation problem by minimizing ||Bf||_2^2 while keeping the observed flows fixed. However, this problem is ill-posed: there might be multiple solutions to the optimization. The standard approach in such a scenario is to resort to regularization. In particular, we apply a generic regularization function Φ with parameters Θ as follows:

    f* = \arg\min_{f \in Ω} ||Bf||_2^2 + Φ(f, X; f^{(0)}; Θ)   s.t.   f_e = f̂_e, ∀e ∈ E′    (2)

where Ω is the domain of f, f^{(0)} ∈ R^m is a prior for flows, f_e (f̂_e) are the entries of f (f̂) for edge e, and the constraint guarantees that observed flows are not changed. Priors f^{(0)}, not to be confused with observed flows f̂, should be set according to the application (e.g., as zero, based on a black-box model, or from historical data). Regarding the domain Ω, we consider Ω = R^m and Ω = R^m_+. The second case is relevant for directed graphs, when flows must follow edge orientations (e.g., traffic).

In Jia et al. (2019), the authors set Φ(f, X; f^{(0)}; Θ) to λ^2 ||f||_2^2 for a regularization parameter λ, which implies a uniform zero prior with an L2 penalty over edges. We claim that the regularization function plays an important role in capturing the physics of flow graphs. As an example, for a power network, Φ should account for the resistance of the lines. Thus, we propose learning the regularization from data.
Our approach is based on a least-squares formulation, which will be described next.
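As a concrete illustration of Equation 2 with the simple λ^2||f||^2 regularizer of Jia et al. (2019), the following toy sketch runs gradient descent on f while resetting the observed entries after each step to enforce the equality constraint. The graph, flows, and hyperparameters are invented for illustration:

```python
import numpy as np

# Toy path graph 0 -> 1 -> 2 -> 3; edges 0 and 2 are observed (flow 2.0),
# edge 1 is missing. Conservation should pull the missing flow toward 2.0.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)

# Oriented incidence matrix: +1 where an edge enters a vertex, -1 where
# it leaves, so (B @ f)[u] is the in-flow minus out-flow at vertex u.
B = np.zeros((n, m))
for j, (u, v) in enumerate(edges):
    B[v, j], B[u, j] = 1.0, -1.0

observed = {0: 2.0, 2: 2.0}      # edge index -> observed flow
lam2, step = 0.05, 0.1           # lambda^2 and the learning rate

f = np.zeros(m)
for e, val in observed.items():
    f[e] = val
for _ in range(500):
    grad = 2.0 * B.T @ (B @ f) + 2.0 * lam2 * f   # grad of ||Bf||^2 + lam2*||f||^2
    f = f - step * grad
    for e, val in observed.items():               # enforce f_e = observed value
        f[e] = val

print(round(f[1], 2))            # missing flow, pulled toward 2.0 by conservation
```

With this penalty the closed-form optimum for the missing flow is 8/4.1 ≈ 1.95, slightly below 2.0 because the L2 term shrinks flows toward zero.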

3.2. REGULARIZED LEAST-SQUARES FORMULATION

The flow estimation problem can be viewed as an inverse problem (Tarantola, 2005). Let x ∈ R^{m−m′} be the vector of missing flows and H ∈ R^{m×(m−m′)} be a matrix such that H_{ij} = 1 if f_i maps to x_j (i.e., they are associated with the same edge), and H_{ij} = 0 otherwise. Moreover, let f̄ ∈ R^m be such that f̄_e = f̂_e if e ∈ E′ and f̄_e = 0 otherwise. Using this notation, we define flow estimation as:

    BHx = −Bf̄ + ε

where BH is a forward operator, projecting x to a vector of vertex divergences, and −Bf̄ is the observed data, capturing (negative) vertex divergences for observed flows. The error ε can be interpreted as noise in observations or some level of model misspecification. We can also define a regularized least-squares problem with the goal of recovering missing flows x:

    x* = \arg\min_{x \in Ω′} ||BHx + Bf̄||_2^2 + ||x − x^{(0)}||^2_{Q(X;Θ)}    (3)

where Ω′ is a projection of the domain of f to the space of x, ||x||^2_M = x^T M x is the matrix-scaled norm of x, and x^{(0)} ∈ R^{m−m′} are priors for missing flows. The regularization function Φ(f, X; f^{(0)}; Θ) has the form ||x − x^{(0)}||^2_{Q(X;Θ)}, where the matrix Q(X; Θ) is a function of parameters Θ and edge features X. We focus on the case where Q(X; Θ) is non-negative and diagonal.

Equation 3 has a Bayesian interpretation, with x being a maximum likelihood estimate under a Gaussian assumption, i.e., x ∼ N(x^{(0)}, Q(X; Θ)^{−1}) and Bf̄ ∼ N(0, I) (Tarantola, 2005). Thus, Q(X; Θ) captures the variance of the prior estimates x^{(0)} relative to that of the flow observations f̂. This allows the regularization function to adapt to different edges based on their features. For instance, in our road network example, Q(X; Θ) might place a lower weight on flow conservation for flows at road segments with a small number of lanes, which are possible traffic bottlenecks.

Given the least-squares formulation described in this section, how do we model the regularization function Q and learn its parameters Θ?
We would like Q to be expressive enough to capture complex physical properties of flows, while allowing Θ to be computed accurately and efficiently. We will address these challenges in the remainder of this paper.
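For a fixed diagonal Q, Equation 3 is an unconstrained quadratic and can be solved directly from the normal equations (H^T B^T B H + Q) x = −H^T B^T B f̄ + Q x^{(0)}. The sketch below assumes a zero prior and toy data of our own choosing:

```python
import numpy as np

# Toy path graph 0 -> 1 -> 2 -> 3; edges 0 and 2 observed, edge 1 missing.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)
B = np.zeros((n, m))                    # oriented incidence matrix
for j, (u, v) in enumerate(edges):
    B[v, j], B[u, j] = 1.0, -1.0

observed = {0: 2.0, 2: 2.0}             # edge index -> observed flow
missing = [e for e in range(m) if e not in observed]

# H maps the missing-flow vector x into edge space; f_bar holds the
# observed flows (and zeros for the missing edges).
H = np.zeros((m, len(missing)))
for col, e in enumerate(missing):
    H[e, col] = 1.0
f_bar = np.zeros(m)
for e, val in observed.items():
    f_bar[e] = val

Q = np.diag([0.1])                      # diagonal regularizer (fixed here)
A = H.T @ B.T @ B @ H + Q               # normal-equation matrix
b = -H.T @ B.T @ (B @ f_bar)            # right-hand side (zero prior x0)
x_star = np.linalg.solve(A, b)
print(x_star)                           # estimate for the missing flow on edge (1, 2)
```

The solve returns 4/2.1 ≈ 1.90: conservation pulls the missing flow toward the observed value 2.0, while the regularizer shrinks it slightly toward the zero prior.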

3.3. BILEVEL OPTIMIZATION FOR META-LEARNING THE PHYSICS OF FLOWS

This section introduces a model for flow estimation that is able to learn the regularization function Q(X; Θ) in Equation 3 from data using bilevel optimization and neural networks.

Bilevel formulation. We learn the parameters Θ that determine the regularization function Q(X; Θ) using the following bilevel optimization formulation:

    Θ* = \arg\min_Θ E[||x̄ − x*||_2^2]    (4)
    s.t.  x* = \arg\min_{x \in Ω′} ||BHx + Bf̄||_2^2 + ||x − x^{(0)}||^2_{Q(X;Θ)}    (5)

where the inner (lower) problem is the same as Equation 3 and the outer (upper) problem is the expected loss with respect to ground-truth flows x̄, which we estimate using cross-validation. Notice that optimal values for the parameters Θ and the missing flows x are both unknown in the bilevel optimization problem. The expectation in Equation 4 is a function of multiple instances of the inner problem (Equation 5). Each inner problem instance has an optimal solution x* that depends on the parameters Θ. In general, bilevel optimization is not only non-convex but also NP-hard (Colson et al., 2007). However, recent gradient-based solutions for bilevel optimization have been successfully applied to large-scale problems, such as hyper-parameter optimization and meta-learning (Franceschi et al., 2018; Lorraine et al., 2020). We will first describe how we model the function Q(X; Θ) and then discuss how this problem can be solved efficiently using reverse-mode differentiation.

We propose to model Q(X; Θ) using a neural network, where X are the inputs, Θ are learnable weights, and the outputs are the diagonal entries of the regularization matrix. This is a natural choice due to the expressive power of neural nets (Cybenko, 1989; Xu et al., 2018).

Multi-Layer Perceptron (MLP). An MLP-based Q(X; Θ) has the following form: Q(X; Θ) = diag(MLP(X; Θ)), where MLP(X; Θ) ∈ R^{m−m′}.
For instance, Q(X; Θ) can be based on a 2-layer MLP: Q(X; Θ) = diag(a(b(X W^{(1)}) W^{(2)})), where Θ = {W^{(1)}, W^{(2)}}, W^{(1)} ∈ R^{d×h}, W^{(2)} ∈ R^{h×1}, h is the number of nodes in the hidden layer, a and b are activation functions, and the bias is omitted for convenience.

Graph Neural Network (GNN). The MLP-based approach assumes that each entry [Q(X; Θ)]_{e,e} associated with an edge e is a function of its features X[e] only. However, we are also interested in how entries [Q(X; Θ)]_{e,e} might depend on the features of the neighborhood of e in the flow graph topology. Thus, we also consider the case where Q(X; Θ) is a GNN, which is described in the Appendix.
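The 2-layer MLP form above can be sketched as follows. The weights here are random placeholders rather than learned values, and the activation choices (ReLU inside, softplus at the output to keep the diagonal positive) are one reasonable instantiation of a and b, not necessarily the paper's:

```python
import numpy as np

# Sketch of the MLP regularizer Q(X; Theta) = diag(MLP(X; Theta)):
# edge features in, one non-negative diagonal entry per missing edge out.
rng = np.random.default_rng(0)
m_missing, d, h = 5, 3, 8               # missing edges, feature dim, hidden width

X = rng.random((m_missing, d))          # edge feature matrix
W1 = rng.standard_normal((d, h))        # Theta = {W1, W2}, biases omitted
W2 = rng.standard_normal((h, 1))

relu = lambda z: np.maximum(z, 0.0)
softplus = lambda z: np.log1p(np.exp(z))  # keeps diagonal entries positive

q = softplus(relu(X @ W1) @ W2).ravel()   # one regularizer weight per edge
Q = np.diag(q)                            # diagonal regularization matrix
print(Q.shape)                            # (5, 5)
```

Each row of X produces exactly one diagonal entry, which is what makes the MLP variant analyzable edge by edge (as exploited in Section 4.4).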

3.4. FLOW ESTIMATION ALGORITHM

We now focus on how to solve our bilevel optimization problem (Equations 4 and 5). Our solution applies gradient-based approaches (e.g., SGD (Bottou & Bousquet, 2008), Adam (Kingma & Ba, 2014)) and, for simplicity, our description is based on the particular case of gradient descent and assumes a zero prior (x^{(0)} = 0). A key challenge in our problem is to efficiently approximate the gradient of the outer objective with respect to the parameters Θ, which, by the chain rule, depends on the gradient of the inner objective with respect to Θ.

We first introduce extra notation to describe the outer problem (Equation 4). Let (f̂_k, ĝ_k) be one of K train-validation folds, both containing ground-truth flow values, such that f̂_k ∈ R^p and ĝ_k ∈ R^q. For each fold k, we apply the inner problem (Equation 5) to estimate missing flows x_k. Estimates for all folds are concatenated into a single vector x = [x_1; x_2; ...; x_K], and the same for validation sets: ĝ = [ĝ_1; ĝ_2; ...; ĝ_K]. We define a matrix R ∈ R^{q×(m−m′)} such that R_{ij} = 1 if prediction x_j corresponds to validation flow ĝ_i and R_{ij} = 0 otherwise. Using this representation, we can approximate the expectation in the outer objective as Ψ(x, Θ) = (1/K)||Rx − ĝ||_2^2, where x depends implicitly on Θ. We also introduce Υ_Θ(x) as the inner problem objective. Moreover, let Γ_j(x_{k,j−1}, Θ_{i−1}) be one step of gradient descent on the value of x_k at iteration j with learning rate β:

    Γ_j(x_{k,j−1}, Θ_{i−1}) = x_{k,j−1} − β∇_x Υ_Θ(x_{k,j−1}) = x_{k,j−1} − 2β[H_k^T B^T (B H_k x_{k,j−1} + B f̄_k) + 2 Q_k x_{k,j−1}]

where H_k, Q_k, and f̄_k are the matrix H, a sub-matrix of Q(X; Θ_{i−1}), and the observed-flows vector f̄ (see Section 3.2) for the specific fold k. We have assumed the domain Ω′ of the flows x_{k,j} to be the set of real vectors. For non-negative flows, we add the appropriate proximal operator to Γ_j.
Our algorithm applies Reverse-Mode Differentiation (RMD) (Domke, 2012; Franceschi et al., 2017) to estimate ∇_Θ Ψ and optimizes Θ also using an iterative algorithm. The main idea of RMD is to first unroll and store a finite number of iterations of the inner problem, x_1, x_2, ..., x_J, and then reverse over those iterations to estimate ∇_Θ Ψ, which is computed as follows:

    ∇_Θ Ψ(x_J, Θ_i) = ∇_x Ψ(x_J, Θ_i) \sum_{j=1}^{J} \left[ \prod_{s=j+1}^{J} \frac{∂Γ_s(x_{s−1}, Θ_i)}{∂x_{s−1}} \right] \frac{∂Γ_j(x_{j−1}, Θ_i)}{∂Θ}

In particular, our reverse iteration is based on the following equations:

    ∇_x Ψ(x_J, Θ_i) = (2/K) R^T (R x_J − ĝ)
    ∂Γ_s(x_{s−1}, Θ_i)/∂x_{s−1} = I − 2β(H^T B^T B H + 2 Q(X; Θ_i))
    ∂Γ_j(x_{j−1}, Θ_i)/∂Θ = −4β (∂Q(X; Θ_i)/∂Θ) x_{j−1}

where ∂Q(X; Θ_i)/∂Θ is the gradient of the regularization function Q(X; Θ) evaluated at Θ_i. In our case, this gradient is given by the neural network gradients and is omitted here for convenience. Algorithm 1 describes our RMD approach for flow estimation. It receives as inputs the flow network G(V, E, X), K train-validation folds {(f̂_k, ĝ_k)}_{k=1}^K, and also hyperparameters T, J, α, and β,

Algorithm 1 RMD Algorithm for Flow Estimation

Require: Flow network G(V, E, X), train-validation folds {(f̂_k, ĝ_k)}_{k=1}^K, number of outer iterations T and inner iterations J, learning rates α and β
Ensure: Regularization parameters Θ
1: Initialize parameters Θ_0
2: ĝ ← [ĝ_1; ...; ĝ_K]
3: B ← incidence matrix of G
4: for outer iterations i = 1, ..., T do
5:   Initialize missing flows x_{k,0} for all k
6:   for inner iterations j = 1, ..., J do
7:     for folds k = 1, ..., K do
8:       x_{k,j} ← x_{k,j−1} − 2β[H_k^T B^T (B H_k x_{k,j−1} + B f̄_k) + 2 Q_k x_{k,j−1}]
9:     end for
10:    x_j ← [x_{1,j}; ...; x_{K,j}]
11:  end for
12:  z_J ← (2/K) R^T (R x_J − ĝ)
13:  for reverse inner iterations j = J−1, ..., 1 do
14:    \overleftarrow{Θ} ← \overleftarrow{Θ} − 4β z_{j+1}^T (∂Q(X; Θ_{i−1})/∂Θ) x_{j+1}
15:    z_j ← z_{j+1}[I − 2β(H^T B^T B H + 2 Q(X; Θ_{i−1}))]
16:  end for
17:  Update Θ_i ← Θ_{i−1} − α \overleftarrow{Θ}
18: end for
19: return parameters Θ_T

The remaining inputs correspond to the number of outer and inner iterations and the learning rates for the outer and inner problems, respectively. The algorithm's output is a vector of optimal parameters Θ for the regularization function Q(X; Θ) according to the bilevel objective in Equations 4 and 5. We use \overleftarrow{Θ} to indicate our estimate of ∇_Θ Ψ(Θ_i). Iterations of the inner problem are stored for each train-validation fold in lines 4-12. Reverse steps, which produce the estimate \overleftarrow{Θ}, are performed in lines 13-16. We then use \overleftarrow{Θ} to update our estimate of Θ in line 17. The time and space complexities of the algorithm are O(TJKm) and O(Jm), respectively, due to the cost of computing and storing the inner problem iterations.

As discussed in the previous section, bilevel optimization is non-convex and thus we cannot guarantee that Algorithm 1 will return a global optimum. In particular, the learning objective of our regularization function Q(X; Θ) is non-convex, since it is a neural network. However, the inner problem (Equation 5) in our formulation has a convex (least-squares) objective. In Franceschi et al. (2018), the authors have shown that this property implies convergence.
We also find that our algorithm often converges to a good estimate of the parameters in our experiments.
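The unroll-then-reverse pattern of Algorithm 1 can be illustrated end to end on a toy problem. The sketch below uses a single fold and a single scalar parameter θ with Q = softplus(θ)·I (all data and sizes invented for illustration; the real algorithm batches over folds and differentiates a full neural network). It unrolls J inner steps, reverses over the stored iterates to accumulate an estimate of ∇_Θ Ψ, and checks the result against finite differences:

```python
import numpy as np

# Toy RMD sketch: path graph 0 -> 1 -> 2 -> 3, edge 1 missing.
edges = [(0, 1), (1, 2), (2, 3)]
n, m = 4, len(edges)
B = np.zeros((n, m))
for j, (u, v) in enumerate(edges):
    B[v, j], B[u, j] = 1.0, -1.0

H = np.zeros((m, 1)); H[1, 0] = 1.0     # selector for the missing edge
f_bar = np.array([2.0, 0.0, 2.0])       # observed flows (zero where missing)
g_val = np.array([2.0])                 # held-out "validation" flow
R = np.eye(1)                           # maps predictions to validation flows

beta, J = 0.05, 60
softplus = lambda t: np.log1p(np.exp(t))

def forward(theta):
    """Unroll J inner gradient steps; store all iterates for the reverse pass."""
    q = softplus(theta)
    xs = [np.zeros(1)]
    for _ in range(J):
        x = xs[-1]
        grad = H.T @ B.T @ (B @ H @ x + B @ f_bar) + 2.0 * q * x
        xs.append(x - 2.0 * beta * grad)
    return xs

def rmd_grad(theta):
    """Reverse over the stored iterates to accumulate dPsi/dtheta."""
    q = softplus(theta)
    dq_dtheta = 1.0 / (1.0 + np.exp(-theta))   # derivative of softplus
    xs = forward(theta)
    z = 2.0 * R.T @ (R @ xs[-1] - g_val)       # dPsi/dx_J (K = 1 here)
    M = np.eye(1) - 2.0 * beta * (H.T @ B.T @ B @ H + 2.0 * q * np.eye(1))
    g = 0.0
    for j in range(J - 1, -1, -1):
        g += z @ (-4.0 * beta * dq_dtheta * xs[j])   # dGamma/dtheta term
        z = z @ M                                    # back through one inner step
    return float(g)

theta0, eps = 0.3, 1e-5
psi = lambda t: float(np.sum((R @ forward(t)[-1] - g_val) ** 2))
fd = (psi(theta0 + eps) - psi(theta0 - eps)) / (2 * eps)
print(abs(rmd_grad(theta0) - fd) < 1e-6)   # RMD matches finite differences
```

Because everything here is linear in x, the reverse pass reproduces the exact derivative of the unrolled iteration, which is the property Algorithm 1 relies on at scale.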

4. EXPERIMENTS

We evaluate our approaches for the flow estimation problem using two real datasets and a representative set of baselines and metrics. Due to space limitations, we provide an extended version of this section, with more details on datasets, experimental settings, and additional results in the Appendix.

4.1. DATASETS

This section summarizes the datasets used in our evaluation. We normalize flow values to [0, 1] and map discrete features to real vector dimensions using one-hot encoding.

Traffic: Vertices represent locations and directed edges represent road segments between two locations in Los Angeles County, CA. Flows are daily average vehicle counts measured by sensors placed along highways in the year 2018. We assign each sensor to an edge in the graph based on proximity and other sensor attributes. Our road network covers the Los Angeles County area, with 5,749 vertices and 7,498 edges, of which 2,879 edges (38%) have sensors. The following features were mapped to an 18-dimensional vector: lat-long coordinates, number of lanes, max speed, highway type (motorway, motorway link, trunk, etc.), in-degree, out-degree, and centrality (PageRank). The in-degree and centrality of an edge are computed based on its source vertex. Similarly, the out-degree of an edge is the out-degree of its target vertex.

Power: Vertices represent buses in Europe, undirected edges are power transmission lines, and edge flows measure the total active power (in MW) being transmitted through the lines. The dataset is obtained from PyPSA-Eur (Hörsch et al., 2018; Brown et al., 2017), an optimization model of the European power transmission system, which generates realistic power flows based on solutions of optimal linear power flow problems with historical production and consumption data. Default values were applied for the PyPSA-Eur settings. The resulting graph has 2,048 vertices, 2,729 edges, and 14-dimensional feature vectors capturing resistance, reactance, length, number of parallel lines, nominal power, edge degree, etc. Please see the Appendix for more details.

4.2. EXPERIMENTAL SETTINGS

Evaluation metrics: We apply Pearson's correlation (CORR), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) to compare ground-truth and predicted flows. These metrics are formally defined in the Appendix.

Baselines: Divergence minimization (Div) (Jia et al., 2019) maximizes flow conservation using a single regularization parameter λ, which we optimize using line search on a validation set of flows. Multi-Layer Perceptron (MLP) is a 2-layer neural network with ReLU activations for all layers that learns to predict flows based on edge features. Graph Convolutional Network (GCN) is a 2-layer graph neural network, also with ReLU activations and Chebyshev convolutions of degree 2, that learns to predict the flows using both edge features and the topology but disregarding flow conservation (Kipf & Welling, 2016; Defferrard et al., 2016). We also consider two hybrid baselines. MLP-Div applies the predictions from MLP as priors to Div. Similarly, predictions from GCN are used as priors for GCN-Div. For both hybrid models, we also optimize the parameter λ.

Our approaches: We consider three variations of Algorithm 1. One important modification is that we perform the reverse iterations for each fold separately, i.e., folds are treated as batches in SGD. Bil-MLP and Bil-GCN apply our reverse-mode differentiation approach using an MLP and a GCN as the regularizer, respectively. Both approaches use zero as the prior x^{(0)}. Bil-GCN-Prior applies the GCN predictions as flow priors. The architectures of the neural nets are the same as the baselines'.
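For reference, the four metrics can be written out directly; y_true and y_pred below are placeholder arrays, not values from the paper:

```python
import numpy as np

def corr(y_true, y_pred):
    """Pearson's correlation between ground-truth and predicted flows."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (y_true must be nonzero)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.5])
print(mae(y_true, y_pred))   # mean of [0.5, 0.0, 0.5] = 0.333...
```

MAPE's division by the true value is also what makes it blow up on small flows, matching the high MAPE observed for Power in Section 4.3.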

4.3. FLOW ESTIMATION ACCURACY

Table 1 compares our methods and the baselines in terms of several metrics using the Traffic and Power datasets. Values of CORR achieved by MLP and GCN for Traffic are missing because they were undefined: these models generated predictions with zero variance for at least one of the train-test folds. All methods suffer from high MAPE errors for Power, which is due to an over-estimation of small flows. Bil-GCN achieves the best results in both datasets in terms of all metrics, with 6% and 18% lower RMSE than the best baseline for Traffic and Power, respectively. However, notice that Bil-MLP and Bil-GCN achieve very similar performance for Power, and Bil-GCN-Prior does not outperform our other methods. We also show scatter plots with the true vs. predicted flows for some of the best approaches in Figure 2. Traffic has proven to be the more challenging dataset, which can be explained, in part, by training data sparsity: only 38% of its edges are labeled.

4.4. ANALYSIS OF REGULARIZERS

Figure 3 illustrates the regularization function learned by Bil-MLP. We focus on Bil-MLP because it can be analyzed independently of the topology. Figures 3a-3c show scatter plots where the x and y axes represent the value of the regularizer and features, respectively. For Power, Bil-MLP captures the effect of resistance over flows (Fig. 3a). However, mostly high values of resistance are affected, which is the reason few points can be seen and also explains the good results for Div. We did not find a significant correlation for other features, with the exception of reactance, which is related to resistance. For Traffic, the model learns how the number of lanes constrains the flow at a road segment (Fig. 3b). Results for speed limit are more surprising: 45mph roads are less regularized (Fig. 3c). This is evidence that regularization mostly affects traffic bottlenecks in highways, with few lanes but a 65mph speed limit. To further investigate this result, we also show the regularizers over the Traffic topology in Figure 3d. High regularization overlaps with well-known congested areas in Los Angeles, CA (e.g., Highway 5, Southeast). These results are strong evidence that our methods are able to learn the physics of flows in road traffic and power networks.

[Figure 3 caption: Our model learns the effect of resistance for Power. In Traffic, a higher number of lanes is correlated with less regularization, and lower-speed (45mph) roads are less regularized. The regularization is also correlated with congested areas in Los Angeles, CA.]

5. RELATED WORK

Flow graphs are ubiquitous in engineering, biomedical, and social sciences. Two important properties of flow graphs are that their state space is defined by a graph topology and their dynamics are governed by the physics (or logic) of the problem of interest. We refer to Bressan et al. (2014) for a unified characterization of the mathematical treatment of flow graphs. Notice that these studies do not address the flow inference problem and their applications to real data are limited (Herrera et al., 2010; Work et al., 2010). Moreover, we focus on long-term flows (e.g., daily vehicle traffic flows) and not on the dynamics. This simplifies the equations of our model to the conservation law.

Flow inference via divergence minimization was originally proposed in Jia et al. (2019). However, their work did not consider edge features and instead applied a single regularization parameter to the norm of the flow vector f in Equation 2. Our work leverages relevant edge features to learn the interplay between flow conservation and local predictions (priors). Thus, we generalize the formulation from Jia et al. (2019) to the case of a learnable regularization function Q(X; Θ). Our experiments show that the proposed approach achieves superior results in two datasets.

Flow optimization problems, such as min-cost flow, max-flow, and multi-commodity flow, have a long history in computer science (Ahuja et al., 1988; Ford Jr & Fulkerson, 2015). These problems impose flow conservation as a hard constraint, requiring full knowledge of source and sink vertices and noiseless flow observations. Our approach relaxes these requirements by minimizing the flow divergence (see Equation 2). Moreover, our problem does not assume edge capacities and costs. The relationship between flow estimation and inverse problems is of particular interest due to the role played by regularization (Engl et al., 1996) in the solution of ill-posed problems.
Recent work on inverse problems has also focused on learning to regularize based on data, and even on learning the forward operator as well; see Arridge et al. (2019) for a review. The expression "learning the physics" is also popular in the context of the universal differential equation framework, which enables the incorporation of domain knowledge from scientific models into machine learning (Raissi et al., 2019; Long et al., 2018; Rackauckas et al., 2020).

Bilevel optimization in machine learning has been popularized due to its applications in hyperparameter optimization (Bengio, 2000; Larsen et al., 1996). In the last decade, deep learning has motivated novel approaches able to optimize millions of hyperparameters using gradient-based schemes (Maclaurin et al., 2015; Lorraine et al., 2020; Pedregosa, 2016). Our flow estimation algorithm is based on reverse-mode differentiation, which is a scalable approach for bilevel optimization (Franceschi et al., 2017; Domke, 2012; Maclaurin et al., 2015). Another application of bilevel optimization closely related to ours is meta-learning (Franceschi et al., 2018; Grefenstette et al., 2019).

Our problem is also related to semi-supervised learning on graphs (Zhu et al., 2003; Belkin et al., 2006; Zhou et al., 2004), which is the inference of vertex labels given partial observations. These approaches can be applied to flow estimation via the line graph transformation (Jia et al., 2019). The duality between a recent approach for predicting vertex labels (Hallac et al., 2015) and min-cost flows was shown in Jung (2020). However, the same relation does not hold for flow estimation. Graph neural network models, which generalize deep learning to graph data, have been shown to outperform traditional semi-supervised learning methods in many tasks (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2018). These models have also been applied to traffic forecasting (Li et al., 2017; Yu et al., 2018; Yao et al., 2019). Different from our approach, traditional GNNs do not conserve flows. We show that our models outperform GNNs at flow prediction. Moreover, we also apply GNNs as a regularization function in our model.

6. CONCLUSIONS

We have introduced an approach for flow estimation on graphs that combines a conservation law and edge features. Our model learns the physics of flows from data by combining bilevel optimization and deep learning. Experiments using traffic and power networks have shown that the proposed model outperforms a set of baselines and learns interpretable physical properties of flow graphs. While we have focused on learning a diagonal regularization matrix, we want to apply our framework to the case of a full matrix. We are also interested in combining different edge measurements in order to learn more complex physical laws, such as those described by the fundamental diagram in the LWR model (Lighthill & Whitham, 1955; Daganzo, 1994; 1995; Garavello & Piccoli, 2006).

A LIST OF SYMBOLS

Table 2: List of symbols.

Symbol | Meaning
G | Flow graph
V | Set of vertices in G
n | Size of V
E | Set of edges in G
m | Size of E
E′ ⊆ E | Set of observed edges
m′ | Size of E′
X ∈ R^{m×d} | Edge feature matrix
X[e] ∈ R^d | Features of edge e
f ∈ R^m | Complete flow vector
f_e ∈ R | Flow for edge e
f̂ ∈ R^{m′} | Observed flow vector
f̂_e ∈ R | Observed flow for edge e
B | Incidence matrix of G
Φ(f, X; f^{(0)}; Θ) ∈ R_+ | Regularization function
f^{(0)} ∈ R^m | Flow prior
Θ | Regularization parameters
Ω | Domain of f
x ∈ R^{m−m′} | Estimated vector of missing flows
x̄ ∈ R^{m−m′} | True vector of missing flows
x^{(0)} ∈ R^{m−m′} | Prior for missing flows
H ∈ R^{m×(m−m′)} | Maps missing flows x to edge space
f̄ ∈ R^m | Vector with observed flows (0 otherwise)
Ψ(x, Θ) | Outer objective
Υ_Θ(x) | Inner objective
Q(X; Θ) ∈ R^{(m−m′)×(m−m′)} | Regularization matrix
Γ_j(x_{k,j−1}, Θ_{i−1}) | One step of gradient descent
\overleftarrow{Θ} | Estimate of ∇_Θ Ψ(x, Θ_{i−1})
H_k | Matrix H for fold k
Q_k | Matrix Q for fold k
f̄_k | Vector f̄ for fold k

B BILEVEL OPTIMIZATION WITH GRAPH NEURAL NETWORKS

This section is an extension of Section 3.3. Here, we consider the case where Q(X; Θ) is a GNN: Q(X; Θ) = diag(GNN(X; Θ, G)). For instance, we apply a 2-layer spectral Graph Convolutional Network (GCN) with Chebyshev convolutions (Defferrard et al., 2016; Kipf & Welling, 2016; Hammond et al., 2011):

    Q(X; Θ) = diag\left( ReLU\left( \sum_{z′=1}^{Z} T_{z′}(\tilde{L}) \, ReLU\left( \sum_{z=1}^{Z} T_z(\tilde{L}) X W^{(1)}_z \right) W^{(2)}_{z′} \right) \right)

where \tilde{L} = (2/λ_max)L − I, L is the normalized Laplacian of the undirected version of the line graph G′ of G, λ_max is the largest eigenvalue of L, T_z(\tilde{L}) is the Chebyshev polynomial of \tilde{L} with order z, and W^{(i)}_z is the matrix of learnable weights for the z-th order polynomial at layer i. In a line graph, each vertex represents an edge of the undirected version of G and two vertices are connected if their corresponding edges in G are adjacent. Moreover, L = I − D^{−1/2} A D^{−1/2}, where A and D are the adjacency and degree matrices of G′. Chebyshev polynomials are defined recursively, with T_z(y) = 2yT_{z−1}(y) − T_{z−2}(y), T_1(y) = y, and T_0(y) = 1 (the identity for matrix arguments). In our experiments, we compare GCN against MLP regularization functions. We have also applied the more popular non-spectral graph convolutional operator (Kipf & Welling, 2016), but preliminary results have shown that the Chebyshev operator achieves better performance in flow estimation.
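The spectral ingredients above (normalized Laplacian, rescaling by λ_max, Chebyshev recursion) can be sketched on a small graph. The 3-vertex path below simply stands in for the line graph G′; it is a toy example, not the paper's data:

```python
import numpy as np

# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2} of a 3-vertex path.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt

# Rescale so the spectrum lies in [-1, 1], as Chebyshev polynomials require.
lam_max = np.linalg.eigvalsh(L).max()
L_tilde = (2.0 / lam_max) * L - np.eye(3)

def chebyshev(L_tilde, Z):
    """Return [T_1(L~), ..., T_Z(L~)] via T_z = 2 L~ T_{z-1} - T_{z-2}."""
    T = [L_tilde]                                   # T_1 = L~ (T_0 = I)
    if Z >= 2:
        T.append(2.0 * L_tilde @ L_tilde - np.eye(len(L_tilde)))
    for _ in range(3, Z + 1):
        T.append(2.0 * L_tilde @ T[-1] - T[-2])
    return T

T1, T2, T3 = chebyshev(L_tilde, 3)
# Sanity check: the recursion reproduces the closed form T_3(y) = 4y^3 - 3y.
print(np.allclose(T3, 4 * np.linalg.matrix_power(L_tilde, 3) - 3 * L_tilde))  # True
```

In the actual regularizer, these T_z(L̃) matrices multiply the feature matrix X and the learnable weights W^{(i)}_z, mixing information from topological neighbors at increasing hop distances as z grows.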

C EXTENDED EXPERIMENTAL SECTION

This section is an extension of Section 4.

C.1 MORE DETAILS ON DATASETS

Traffic: Flow data was collected from PeMS (Performance Measurement System), maintained by Caltrans, the California Department of Transportation. 1 Sensors are placed at major highways in the state. We use sensor geo-locations and other attributes to approximately match them to a compressed version of the road network extracted from Openstreetmap. 2 The compression merges any sequence of segments without a branch, as these extra edges would not affect the flow estimation results. We emphasize that this dataset is not of as high quality as Power, due to possible sensor malfunctions and matchings of sensors to the wrong road segments. This explains why flow estimation is more challenging in Traffic. Figure 4 is a visualization of our traffic dataset with geographically (lat-long) located vertices and colors indicating light versus heavy traffic (compared to the average). The road segments in the graph (approximately) cover the LA County area. We show the map (from Openstreetmap) of the area covered by our road network in Figure 5.

Power: We now provide more details on how we built the power dataset. PyPSA (Python for Power System Analysis) is a toolbox for the simulation of power systems (Brown et al., 2017). We applied the European transmission system (PyPSA-Eur), which covers the ENTSO-E area (Hörsch et al., 2018), to generate a single network snapshot. Besides the PyPSA-Eur original set of edges, which we will refer to as line edges, we have added a set of bus edges. These extra edges allow us to represent power generation and consumption as edge flows. For the line edges, we cover the following PyPSA attributes (with their respective PyPSA identifiers 3): reactance (x), resistance (r), capacity (s_nom), whether the capacity s_nom can be extended (s_nom_extendable), the capital cost of extending s_nom (capital_cost), the length of the line (length), the number of parallel lines (num_parallel), and the optimized capacity (s_nom_opt).
For bus edges, the only attribute is the control strategy (PQ, PV, or Slack). Notice that we create a single vector representation for both line and bus edges by adding an extra indicator position (line or bus). Moreover, categorical attributes (e.g., the control strategy) were represented using one-hot encoding. Figure 6 is a visualization of our power dataset with geographically (lat-long) located vertices and colors indicating high versus low power (compared to the average).
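As an illustration (not the exact pipeline), the unified line/bus edge representation could be built as follows. The attribute names mirror the PyPSA identifiers listed above, while the function name and dictionary layout are hypothetical.

```python
import numpy as np

# PyPSA line attributes used as numeric features (zero-filled for bus edges).
LINE_ATTRS = ["x", "r", "s_nom", "s_nom_extendable", "capital_cost",
              "length", "num_parallel", "s_nom_opt"]
# Categorical control strategy, one-hot encoded (empty for line edges).
CONTROLS = ["PQ", "PV", "Slack"]

def edge_features(edge):
    """edge: dict with 'kind' in {'line', 'bus'} plus the attributes above.
    Returns a single fixed-length vector with a line/bus indicator position."""
    is_line = 1.0 if edge["kind"] == "line" else 0.0
    numeric = [float(edge.get(a, 0.0)) for a in LINE_ATTRS]
    control = [1.0 if edge.get("control") == c else 0.0 for c in CONTROLS]
    return np.array([is_line] + numeric + control)
```

Every edge thus maps to the same d-dimensional space, so a single regularization network can process both edge types.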

C.2 EVALUATION METRICS

We apply the following evaluation metrics for flow estimation. Let f_true and f_pred be (m - m')-dimensional vectors with the true and predicted values for the missing flows associated with edges in E \ E'.

Correlation (Corr):

Corr = cov(f_pred, f_true) / (σ(f_pred) · σ(f_true)), where cov is the covariance and σ is the standard deviation.

Implementation: We have implemented Algorithm 1 using PyTorch, CUDA, and Higher (Grefenstette et al., 2019), a meta-learning framework that greatly facilitates the implementation of bilevel optimization algorithms by implicitly performing the reverse iterations for a list of optimization algorithms, including SGD. Moreover, our GCN implementation is based on the Deep Graph Library (DGL) (Wang et al., 2019). 4

Hardware: We ran our experiments on a single machine with 4 NVIDIA GeForce RTX 2080 GPUs (each with 8 GB of RAM) and 32 Intel Xeon CPUs (2.10 GHz and 128 GB of RAM).
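The Corr metric defined above can be sketched as follows (the function name `corr` is our choosing); it returns NaN when either vector is constant, in which case the metric is undefined.

```python
import numpy as np

def corr(f_pred, f_true):
    """Corr = cov(f_pred, f_true) / (sigma(f_pred) * sigma(f_true))."""
    f_pred = np.asarray(f_pred, dtype=float)
    f_true = np.asarray(f_true, dtype=float)
    c = np.cov(f_pred, f_true, bias=True)[0, 1]   # population covariance
    s = f_pred.std() * f_true.std()               # population std deviations
    return c / s if s > 0 else float("nan")
```

This matches Pearson correlation; perfectly proportional predictions give Corr = 1 regardless of scale.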

D.2 RUNNING TIME

Table 4 shows the average running times, over the 10-fold cross-validation, of our methods and the baselines for the Traffic and Power datasets. We show both training and test times. The results show that our reverse-mode differentiation algorithm adds significant overhead to training time for Traffic, taking up to 4 times longer than Min-Div to finish. As described in Section 3.4, this is mainly due to the cost of computing and storing the inner problem iterations. On the other hand, all the methods are efficient at test time. GCN converged quickly (due to early stopping) for both datasets. However, it achieved poor results for Power, as shown in Table 1, which is a sign of overfitting or underfitting. Notice that the reported results are the best in terms of RMSE.



Footnotes:
1. http://pems.dot.ca.gov/
2. https://www.openstreetmap.org
3. https://pypsa.readthedocs.io/en/latest/components.html
4. https://github.com/arleilps/flow-estimation



Figure 1: Summary of the proposed approach for predicting missing flows in a graph based on partial observations and edge features. We learn to combine features and a flow conservation law, which together define the physics of the flow graph. A regularization function Q(X; Θ), modeled as a neural network with parameters Θ, takes as input edge features X[e]. A flow estimation algorithm applies the regularization, partial observations (f̂), prior flows (x^(0)), and flow conservation to predict missing flows x. Network parameters Θ are learned based on a K-fold cross-validation loss with respect to validation flows x. Our model is trained end-to-end using reverse-mode differentiation.

Figure 2: Scatter plots with true (x-axis) and predicted (y-axis) flows for two approaches on each dataset. The results are consistent with Table 1 and show that our methods are more accurate than the baselines.

Figure 3: Edge regularizer learned by Bil-MLP vs. feature values (a-c) and visualization of regularizers on the Traffic topology (d). Our model is able to learn the effect of the resistance for Power. In Traffic, a higher number of lanes is correlated with less regularization and lower-speed roads (45 mph) are less regularized. The regularization is also correlated with congested areas in Los Angeles, CA.

Figure 4: Visualization of our traffic network with geo-located vertices. Edges in grey have missing flows, edges in red have traffic above the average and edges in blue have traffic below the average. Better seen in color. See Figure 5 for map of the area.

Figure 5: Road map covered by the road network shown in Figure 4 (from Openstreetmap)

Table 1: Average flow estimation accuracy for the baselines (Div, MLP and GCN) and our methods (Bil-MLP, Bil-GCN and Bil-GCN-Prior) using the Traffic and Power datasets. RMSE, MAE and MAPE are errors (the lower the better) and CORR is a correlation (the higher the better). Values of correlation for MLP and GCN using Traffic were undefined. Bil-GCN (ours) outperforms the best baseline on all metrics, with up to 20% lower RMSE than Div using Power.

Table 2 lists the main symbols used in our paper.

Figure 8: Visualization of regularizers on the Power network topology. We highlight edges with large values of the regularizer. Better seen in color.

Table 4: Average training and test times (in seconds) for our methods and the baselines.

ACKNOWLEDGEMENTS

Research partially funded by the grants NSF IIS #1817046 and DTRA #HDTRA1-19-1-0017.


Hyperparameter settings: We have selected the hyperparameters for each method based on RMSE, using grid search with the learning rate over [10^0, 10^-1, 10^-2, 10^-3] and the number of nodes in the hidden layer over [4, 8, 16]. The total number of iterations was set to 3000 for Min-Div and 5000 for MLP and GCN, all with early stopping upon convergence after 10 iterations. For our methods (both based on Algorithm 1), we set T = 10, J = 300, α = 10^-2, β = 10^-2, and K = 10 in all experiments.
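The grid search described above can be sketched as follows; `train_eval` is a hypothetical callback that trains one configuration and returns its validation RMSE.

```python
import itertools

def grid_search(train_eval,
                lrs=(1e0, 1e-1, 1e-2, 1e-3),
                hidden=(4, 8, 16)):
    """Exhaustively evaluate every (learning rate, hidden size) pair and
    return the configuration with the lowest validation RMSE."""
    best = None
    for lr, h in itertools.product(lrs, hidden):
        rmse = train_eval(lr, h)
        if best is None or rmse < best[0]:
            best = (rmse, lr, h)
    return best  # (best_rmse, best_lr, best_hidden)
```

With 4 learning rates and 3 hidden sizes, each method is trained 12 times per selection round.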

C.4 DIVERGENCE RESULTS

Although the main goal of flow estimation is to minimize the flow prediction loss, we also evaluate how our methods and the baselines perform in terms of divergence (i.e., violation of flow conservation) in Table 3. As expected, MLP and GCN do not conserve flows. Interestingly, however, our methods (Bil-MLP and Bil-GCN) achieve better flow conservation than Min-Div. This is due to the regularization parameter λ, which is tuned based on a set of validation flows.
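For concreteness, vertex divergences can be computed from the incidence matrix B and a flow vector f as B·f. The sketch below assumes a sign convention of +1 for outgoing and -1 for incoming edges, which may differ in detail from the metric aggregated in Table 3.

```python
import numpy as np

def incidence_matrix(n, edges):
    """B[v, e] = 1 if directed edge e leaves vertex v, -1 if it enters v."""
    B = np.zeros((n, len(edges)))
    for e, (u, v) in enumerate(edges):
        B[u, e] = 1.0
        B[v, e] = -1.0
    return B

def divergence(B, f):
    """Net out-flow at each vertex; a perfectly conserved flow gives B @ f = 0."""
    return B @ f
```

On a directed cycle carrying a uniform flow, every vertex has zero divergence, so the conservation law is satisfied exactly.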

D.1 VISUALIZATION OF REGULARIZER FOR POWER

Figure 8 shows the regularizers over the Power network topology. As discussed in Section 4.4, the regularizer mostly affects a few of the highest-resistance edges. For the remaining ones, regularizers have small values. Notice that these high-resistance edges are associated with lines transmitting small amounts of power, as shown in Figure 6, and have a large impact on the overall flow estimation accuracy.

