LEARNING PROXIMAL OPERATORS TO DISCOVER MULTIPLE OPTIMA

Abstract

Finding multiple solutions of non-convex optimization problems is a ubiquitous yet challenging task. Most past algorithms either apply single-solution optimization methods from multiple random initial guesses or search in the vicinity of found solutions using ad hoc heuristics. We present an end-to-end method to learn the proximal operator of a family of training problems so that multiple local minima can be quickly obtained from initial guesses by iterating the learned operator, emulating the proximal-point algorithm that has fast convergence. The learned proximal operator can be further generalized to recover multiple optima for unseen problems at test time, enabling applications such as object detection. The key ingredient in our formulation is a proximal regularization term, which elevates the convexity of our training loss: by applying recent theoretical results, we show that for weakly-convex objectives with Lipschitz gradients, training of the proximal operator converges globally with a practical degree of over-parameterization. We further present an exhaustive benchmark for multi-solution optimization to demonstrate the effectiveness of our method.

1. INTRODUCTION

Searching for multiple optima of an optimization problem is a ubiquitous yet under-explored task. In applications like low-rank recovery (Ge et al., 2017) , topology optimization (Papadopoulos et al., 2021) , object detection (Lin et al., 2014) , and symmetry detection (Shi et al., 2020) , it is desirable to recover multiple near-optimal solutions, either because there are many equally-performant global optima or due to the fact that the optimization objective does not capture user preferences precisely. Even for single-solution non-convex optimization, typical methods look for multiple local optima from random initial guesses before picking the best local optimum. Additionally, it is often desirable to obtain solutions to a family of optimization problems with parameters not known in advance, for instance, the weight of a regularization term, without having to restart from scratch. Formally, we define a multi-solution optimization (MSO) problem to be the minimization min x∈X f τ (x), where τ ∈ T encodes parameters of the problem, X is the search space of the variable x, and f τ : R d → R is the objective function depending on τ . The goal of MSO is to identify multiple solutions for each τ ∈ T , i.e., the set {x * ∈ X : f τ (x * ) = min x∈X f τ (x)}, which can contain more than one element or even infinitely many elements. In this work, we assume that X ⊂ R d is bounded and that d is small, and that T is, in a loose sense, a continuous space, such that the objective f τ changes continuously as τ varies. To make gradient-based methods viable, we further assume that each f τ is differentiable almost everywhere. As finding all global minima in the general case is extremely challenging, realistically our goal is to find a diverse set of local minima. As a concrete example, for object detection, T could parameterize the space of images and X could be the 4-dimensional space of bounding boxes (ignoring class labels). 
Then, f τ (x) could be the minimum distance between the bounding box x ∈ X and any ground truth box for image τ ∈ T . Minimizing f τ (x) would yield all object bounding boxes for image τ . Object detection can then be cast as solving this MSO on a training set of images and extrapolating to unseen images (Section 5.5). Object detection is a singular example of MSO where the ground truth annotation is widely available. In such cases, supervised learning can solve MSO by predicting a fixed number of solutions together with confidence scores using a set-based loss such as the Hausdorff distance. Unfortunately, such annotation is not available for most optimization problems in the wild where we only have access to the objective functions -this is the setting that our method aims to tackle. Our work is inspired by the proximal-point algorithm (PPA), which applies the proximal operator of the objective function to an initial point iteratively to refine it to a local minimum. PPA is known to converge faster than gradient descent even when the proximal operator is approximated, both theoretically (Rockafellar, 1976; 2021) and empirically (e.g., Figure 2 of Hoheisel et al. (2020) ). If the proximal operator of the objective function is available, then MSO can be solved efficiently by running PPA from a variety of initial points. However, obtaining a good approximation of the proximal operator for generic functions is difficult, and typically we have to solve a separate optimization problem for each evaluation of the proximal operator (Davis & Grimmer, 2019) . In this work, we approximate the proximal operator using a neural network that is trained using a straightforward loss term including only the objective and a proximal term that penalizes deviation from the input point. Crucially, our training does not require accessing the ground truth proximal operator. 
Additionally, neural parameterization allows us to learn the proximal operator for all {f τ } τ ∈T by treating τ as an input to the network along with an application-specific encoder. Once trained, the learned proximal operator allows us to effortlessly run PPA from any initial point to arrive at a nearby local minimum; from a generative modeling point of view, the learned proximal operator implicitly encodes the solutions of an MSO problem as the pushforward of a prior distribution by iterated application of the operator. Such a formulation bypasses the need to predict a fixed number of solutions and can represent infinitely many solutions. The proximal term in our loss promotes the convexity of the formulation: applying recent results (Kawaguchi & Huang, 2019) , we show that for weakly-convex objectives with Lipschitz gradients-in particular, objectives with bounded second derivatives-with practical degrees of over-parameterization, training converges globally and the ground truth proximal operator is recovered (Theorem 3.1 below). Such a global convergence result is not known for any previous learning-to-optimize method (Chen et al., 2021) . Literature on MSO is scarce, so we build a benchmark with a wide variety of applications including level set sampling, non-convex sparse recovery, max-cut, 3D symmetry detection, and object detection in images. When evaluated on this benchmark, our learned proximal operator reliably produces high-quality results compared to reasonable alternatives, while converging in a few iterations.

2. RELATED WORKS

Learning to optimize. Learning-to-optimize (L2O) methods utilize past optimization experience to optimize future problems more effectively; see Chen et al. (2021) for a survey. Model-free L2O uses recurrent neural networks to discover new optimizers suitable for similar problems (Andrychowicz et al., 2016; Li & Malik, 2016; Chen et al., 2017; Cao et al., 2019); while shown to be practical, these methods have almost no theoretical guarantee that training converges (Chen et al., 2021). In comparison, we learn a problem-dependent proximal operator so that at test time we do not need access to objective functions or their gradients, which can be costly to evaluate (e.g., symmetry detection in Section 5.4) or unavailable (e.g., object detection in Section 5.5). Model-based L2O substitutes components of a specialized optimization framework or schematically unrolls an optimization procedure with neural networks. Related to proximal methods, Gregor & LeCun (2010) emulate a few iterations of proximal gradient descent using neural networks for sparse recovery with an $\ell_1$ regularizer, extended to non-convex regularizers by Yang et al. (2020); a similar technique is applied to susceptibility-tensor imaging in Fang et al. (2022). Gilton et al. (2021) propose a deep equilibrium model with proximal gradient descent for inverse problems in imaging that circumvents the expensive backpropagation through unrolled iterations. Meinhardt et al. (2017) use a fixed denoising neural network as a surrogate proximal operator for inverse imaging problems. All these works use schematics of proximal methods to design a neural network that is then trained with strong supervision. In contrast, we learn the proximal operator directly, requiring only access to the objectives; we do not need ground truth for inverse problems during training.
Existing L2O methods are not designed to recover multiple solutions: without a proximal term as in (2), the learned operator can degenerate even with multiple starts (Appendix D.3).
Finding multiple solutions. Many heuristic methods have been proposed to discover multiple solutions, including niching (Brits et al., 2007; Li, 2009), parallel multi-starts (Larson & Wild, 2018), and deflation (Papadopoulos et al., 2021). However, none of these methods generalize to similar but unseen problems. Predicting multiple solutions at test time is universal in deep learning tasks like multi-label classification (Tsoumakas & Katakis, 2007) and detection (Liu et al., 2020). The typical solution is to ask the network to predict a fixed number of candidates along with confidence scores that indicate how likely each candidate is to be a solution (Ren et al., 2015; Li et al., 2019; Carion et al., 2020). The solutions are then chosen from the candidates using heuristics such as non-maximum suppression (Neubeck & Van Gool, 2006). Models that output a fixed number of solutions without taking into account the unordered set structure can suffer from "discontinuity" issues: a small change in set space requires a large change in the neural network outputs (Zhang et al., 2019). Furthermore, this approach cannot handle the case when the solution set is continuous.
Wasserstein gradient flow. Our formulation (2) corresponds to one step of the JKO discretization of the Wasserstein gradient flow where the energy functional is the linear functional dual to the MSO objective function (Jordan et al., 1998; Benamou et al., 2016). See the details in Appendix E.
Compared to recent works on neural Wasserstein gradient flows (Mokrov et al., 2021; Hwang et al., 2021; Bunne et al., 2022) , where a separate network parameterizes the pushforward map for every JKO step, our functional's linearity makes the pushforward map identical for each step, allowing end-to-end training using a single neural network. We additionally let the network input a parameter τ , in effect learning a continuous family of JKO-discretized gradient flows.

3.1. PRELIMINARIES

Given the objective $f_\tau : \mathbb{R}^d \to \mathbb{R}$ of an MSO problem parameterized by $\tau$, the corresponding proximal operator (Moreau, 1962; Rockafellar, 1976; Parikh & Boyd, 2014) is defined, for a fixed $\lambda \in \mathbb{R}_{>0}$, as
$$\operatorname{prox}(x; \tau) := \arg\min_{y} \; f_\tau(y) + \frac{\lambda}{2}\|y - x\|_2^2. \qquad (1)$$
The weight $\lambda$ in the proximal term $\frac{\lambda}{2}\|y - x\|_2^2$ controls how close $\operatorname{prox}(x; \tau)$ is to $x$: increasing $\lambda$ will reduce $\|\operatorname{prox}(x; \tau) - x\|_2$. For the arg min in (1) to be unique, a sufficient condition is that $f_\tau$ is $\xi$-weakly convex with $\xi < \lambda$, so that $f_\tau(y) + \frac{\lambda}{2}\|y - x\|_2^2$ is strongly convex. The class of weakly convex functions is deceptively broad: for instance, any twice differentiable function with bounded second derivatives (e.g., any $C^2$ function on a compact set) is weakly convex. When the function is convex, $\operatorname{prox}(x; \tau)$ is precisely one step of the backward Euler discretization of integrating the vector field $-\nabla f_\tau$ with time step $1/\lambda$ (see Section 4.1.1 of Parikh & Boyd (2014)).
The proximal-point algorithm (PPA) for finding a local minimum of $f_\tau$ iterates
$$x_k := \operatorname{prox}(x_{k-1}; \tau), \quad \forall k \in \mathbb{N}_{\ge 1},$$
with initial point $x_0$ (Rockafellar, 1976). In practice, $\operatorname{prox}(x; \tau)$ often can only be approximated, resulting in inexact PPA. When the objective function is locally indistinguishable from a convex function and $x_0$ is sufficiently close to the set of local minima, then with a reasonable stopping criterion, inexact PPA converges linearly to a local minimum of the objective: the smaller $\lambda$ is, the faster the convergence rate becomes (Theorems 2.1-2.3 of Rockafellar (2021)).
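To make the PPA iteration concrete, the following is a minimal sketch rather than the implementation used in our experiments. It assumes a one-dimensional objective $f(x) = \cos(x)$, which is 1-weakly convex with a 1-Lipschitz gradient, and approximates each proximal step in (1) by plain gradient descent on the strongly convex inner problem. The function names `prox_numeric` and `ppa`, the choice $\lambda = 2$, and the iteration counts are illustrative assumptions.

```python
import math

def prox_numeric(grad_f, x, lam, lipschitz, inner_steps=200):
    """Approximate prox(x) = argmin_y f(y) + lam/2 * (y - x)^2 by gradient descent."""
    y = x
    # The inner objective has a (lipschitz + lam)-Lipschitz gradient, so this step is safe.
    step = 1.0 / (lipschitz + lam)
    for _ in range(inner_steps):
        y -= step * (grad_f(y) + lam * (y - x))
    return y

def ppa(grad_f, x0, lam, lipschitz, outer_steps=100):
    """Inexact proximal-point algorithm: x_k = prox(x_{k-1})."""
    x = x0
    for _ in range(outer_steps):
        x = prox_numeric(grad_f, x, lam, lipschitz)
    return x

def grad_cos(x):
    """Gradient of f(x) = cos(x); the minima of f are the odd multiples of pi."""
    return -math.sin(x)

# Several initial points lead to several distinct local minima.
starts = [-7.0, -1.0, 1.0, 5.0, 7.0]
minima = [ppa(grad_cos, x0, lam=2.0, lipschitz=1.0) for x0 in starts]
```

Because $\lambda = 2$ exceeds the weak-convexity constant, every inner problem is strongly convex, and each start slides to the minimum of the basin it begins in. The inner loop is exactly the per-evaluation cost that a learned operator is meant to avoid.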

3.2. LEARNING PROXIMAL OPERATORS

The fast convergence rate of PPA makes it a strong candidate for MSO: to obtain a diverse set of solutions for any $\tau \in \mathcal{T}$, we only need to run a few iterations of PPA from random initial points. The proximal term penalizes big jumps and prevents points from collapsing to a single solution. However, running a subroutine to approximate $\operatorname{prox}(x; \tau)$ for every pair $(x, \tau)$ can be costly. To overcome this issue, we learn the operator $\operatorname{prox}(\cdot\,; \cdot)$ given access to $\{f_\tau\}_{\tau \in \mathcal{T}}$. A naïve way to learn $\operatorname{prox}(\cdot\,; \cdot)$ is to first solve (1) to produce ground truth for a large number of $(x, \tau)$ pairs independently using gradient-based methods and then learn the operator using a mean-squared error loss. However, this approach is costly as the space $\mathcal{X} \times \mathcal{T}$ can be large. Moreover, this procedure requires a stopping criterion for the minimization in (1), which is hard to design a priori. Instead, we formulate the following end-to-end optimization over the space of functions:
$$\min_{\Phi : \mathcal{X} \times \mathcal{T} \to \mathcal{X}} \; \mathbb{E}_{x \sim \mu,\, \tau \sim \nu} \left[ f_\tau(\Phi(x, \tau)) + \frac{\lambda}{2}\|\Phi(x, \tau) - x\|_2^2 \right], \qquad (2)$$
where $x$ is sampled from $\mu$, a distribution on $\mathcal{X}$, and $\tau$ is sampled from $\nu$, a distribution on $\mathcal{T}$. To get (2) from (1), we essentially substitute $y$ with the output $\Phi(x, \tau)$ and integrate over the product probability distribution $\mu \otimes \nu$. To solve (2), we parameterize $\Phi : \mathcal{X} \times \mathcal{T} \to \mathcal{X}$ using a neural network with additive and multiplicative residual connections (Appendix B). Intuitively, the implicit regularization of neural networks aligns well with the regularity of $\operatorname{prox}(\cdot\,; \cdot)$: for a fixed $\tau$, the proximal operator $\operatorname{prox}(\cdot\,; \tau)$ is 1-Lipschitz in local regions where $f_\tau$ is convex, while as the parameter $\tau$ varies, $f_\tau$ changes continuously, so $\operatorname{prox}(x; \tau)$ should not change too much. To make (2) computationally practical during training, we realize $\nu$ as a training dataset.
For the choice of $\mu$, we employ an importance sampling technique from Wang & Solomon (2019) as opposed to using $\operatorname{unif}(\mathcal{X})$, the uniform distribution over $\mathcal{X}$, so that the learned operator can refine near-optimal points (Appendix C). To train $\Phi$, we sample a mini-batch of $(x, \tau)$ to evaluate the expectation and optimize using Adam (Kingma & Ba, 2014). For problems where the space $\mathcal{T}$ is structured (e.g., images or point clouds), we first embed $\tau$ into a Euclidean feature space through an encoder before passing it to $\Phi$. This encoder is trained together with the operator network $\Phi$, which allows us to use efficient domain-specific encoders (e.g., convolutional networks) to facilitate generalization to unseen $\tau$. To extract multiple solutions at test time for a problem with parameter $\tau$, we sample a batch of $x$'s from $\operatorname{unif}(\mathcal{X})$ and apply the learned $\Phi(\cdot, \tau)$ to the batch of samples a few times. Each application of $\Phi$ approximates a single step of PPA. From a distributional perspective, for $k \in \mathbb{N}_{\ge 0}$, we can view $\Phi^k$ (the operator $\Phi$ applied $k$ times) as a generative model whose pushforward distribution, $(\Phi^k)_\#(\operatorname{unif}(\mathcal{X}))$, approximately concentrates on the set of local minima as $k$ increases. An advantage of our representation is that it can represent an arbitrary number of solutions even when the set of minima is continuous (Figure 2). This procedure differs from those in existing L2O methods (Chen et al., 2021): at test time, we do not need access to $\{f_\tau\}_{\tau \in \mathcal{T}}$ or their gradients, which can be costly to evaluate or unavailable; instead we only need $\tau$ (e.g., in the case of object detection, $\tau$ is an image).
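As a toy illustration of training with the loss in (2) without ever computing a ground truth proximal operator, consider the following sketch. It is not our network or training setup: it assumes the one-dimensional family $f_\tau(y) = \frac{1}{2}(y - \tau)^2$, replaces the neural network with a hypothetical two-parameter linear model `phi(x, tau) = a * x + b * tau`, and uses full-batch gradient descent instead of Adam. For this family the true operator is $(\tau + \lambda x)/(1 + \lambda)$, so with $\lambda = 1$ training should recover $a = b = 0.5$.

```python
import random

random.seed(0)
lam = 1.0
# Training samples (x, tau): x ~ mu = unif[-1, 1] and tau ~ nu = unif[-1, 1].
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(256)]

def phi(x, tau, a, b):
    """Toy operator model Phi(x, tau) = a * x + b * tau (stand-in for a neural network)."""
    return a * x + b * tau

def train(data, lam, lr=0.5, steps=500):
    """Minimize the empirical version of loss (2) over (a, b) by gradient descent."""
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, tau in data:
            out = phi(x, tau, a, b)
            # Derivative of f_tau(out) + lam/2 * (out - x)^2 with respect to out,
            # where f_tau(y) = 0.5 * (y - tau)^2.
            d_out = (out - tau) + lam * (out - x)
            grad_a += d_out * x / n
            grad_b += d_out * tau / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

a, b = train(data, lam)

def iterate(x, tau, k):
    """Test-time use: apply the learned operator k times, emulating k PPA steps."""
    for _ in range(k):
        x = phi(x, tau, a, b)
    return x
```

Calling `iterate(0.9, -0.3, 40)` moves the initial point to the minimizer $\tau = -0.3$ using only the learned operator and $\tau$; neither $f_\tau$ nor its gradient is evaluated at test time.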

3.3. CONVERGENCE OF TRAINING

We have turned the problem of finding multiple solutions for each $f_\tau$ in the space $\mathcal{X}$ into the problem of finding a single solution for (2) in the space of functions. If the $f_\tau$'s are $\xi$-weakly convex with $\xi < \lambda$ and $\mu$, $\nu$ have full support, then the arg min in (1) is unique for every pair $(x, \tau)$ and hence the functional solution of (2) is the unique proximal operator $\operatorname{prox}(\cdot\,; \tau)$. If in addition the gradients of the objectives are Lipschitz, using recent learning theory results (Kawaguchi & Huang, 2019) we can show that with practical degrees of over-parameterization, gradient descent on the neural network parameters of $\Phi$ converges globally during training. Suppose our training dataset is $S = \{(x_i, \tau_i)\}_{i=1}^n \subset \mathcal{X} \times \mathcal{T}$. Define the training loss, a discretized version of (2) using $S$, to be, for $g : \mathcal{X} \times \mathcal{T} \to \mathcal{X}$,
$$L(g) := \frac{1}{n} \sum_{i=1}^{n} \left[ f_{\tau_i}(g(x_i, \tau_i)) + \frac{\lambda}{2}\|g(x_i, \tau_i) - x_i\|_2^2 \right]. \qquad (3)$$
Theorem 3.1 (informal). Suppose for any $\tau \in \mathcal{T}$, the objective $f_\tau$ is differentiable, $\xi$-weakly convex, and $\nabla f_\tau$ is $\zeta$-Lipschitz with $\xi \le \lambda$. Then for any feed-forward neural network with $\Omega(n)$ total parameters and common activation units, when the initial weights are drawn from a Gaussian distribution, with high probability, gradient descent on its weights using a fixed learning rate will eventually reach the minimum loss $\min_{g : \mathcal{X} \times \mathcal{T} \to \mathcal{X}} L(g)$. The number of iterations needed to achieve $\epsilon > 0$ training error is $O((\lambda + \zeta)/\epsilon)$, and when this occurs, if $\xi < \lambda$, then the mean-squared error of the learned proximal operator compared to the true one is $O(2\epsilon/(\lambda - \xi))$ on training data.
We state and prove Theorem 3.1 formally in Appendix A. Even though the optimization over network weights is non-convex, training can still result in a globally minimal loss and the true proximal operator can be recovered. In Appendix D.2, we empirically verify that when the objective is the $\ell_1$ norm, the trained operator converges to the true proximal operator, the shrinkage operator.
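The last claim of Theorem 3.1 can be motivated by a short strong-convexity argument. The following is only a sketch under simplifying assumptions, not the formal proof of Appendix A: we assume $\xi < \lambda$, that the training pairs $(x_i, \tau_i)$ are distinct so that the minimization over $g$ decouples across samples, and that each per-sample minimizer is an unconstrained stationary point. We write $\ell_i(y) := f_{\tau_i}(y) + \frac{\lambda}{2}\|y - x_i\|_2^2$ for the per-sample loss and $y_i^\ast := \operatorname{prox}(x_i; \tau_i)$ for its minimizer.

```latex
% Each \ell_i is (\lambda - \xi)-strongly convex with minimizer y_i^\ast, hence for all y:
\ell_i(y) - \ell_i(y_i^\ast) \;\ge\; \frac{\lambda - \xi}{2}\,\|y - y_i^\ast\|_2^2 .

% Setting y = g(x_i, \tau_i), averaging over i, and using L^\ast = \frac{1}{n}\sum_{i=1}^n \ell_i(y_i^\ast):
L(g) - L^\ast \;\ge\; \frac{\lambda - \xi}{2} \cdot \frac{1}{n}\sum_{i=1}^{n} \|g(x_i,\tau_i) - \operatorname{prox}(x_i;\tau_i)\|_2^2 .

% Therefore, if L(g) \le L^\ast + \epsilon:
\frac{1}{n}\sum_{i=1}^{n} \|g(x_i,\tau_i) - \operatorname{prox}(x_i;\tau_i)\|_2^2 \;\le\; \frac{2\epsilon}{\lambda - \xi} .
```

The bound degrades as $\lambda$ approaches $\xi$, which is consistent with the requirement $\xi < \lambda$ for the proximal operator to be well defined.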
In Appendix D.3, we study the effect of λ in relation to the weakly-convex constant ξ for the 2D cosine problem and compare to an L2O particle-swarm method (Cao et al., 2019) . We note a few gaps between Theorem 3.1 and our implementation. First, we use SGD with minibatching instead of gradient descent. 

4. PERFORMANCE MEASURES

Figure 1: Interpretation of $D_t$. In this example, the witness $W$ is drawn uniformly from the union of four squares. If $A_t$ (resp. $B_t$) is the set of red (resp. blue) points, then $P(D_t \approx 0) = 3/4$ and $P(D_t \approx 0.5) = 1/4$, since $D_t$ is only non-zero when $W$ is in the rightmost square. This aligns well with the intuition that $3/4$ of the red points match with the blue ones. In comparison, the Hausdorff distance between $A_t$ and $B_t$ is approximately 1, which is the same as the Hausdorff distance between the orange point and $B_t$, despite the fact that most of the red points are close to the blue ones.
Metrics. Designing a single-valued metric for MSO is challenging since one needs to consider the diversity of the solutions as well as each solution's level of optimality. For an MSO problem with parameter $\tau$ and objective $f_\tau$, the output of an MSO algorithm can be represented as a (possibly infinite) set of solutions $\{x_\alpha\}_\alpha \subset \mathcal{X}$ with objective values $u_\alpha := f_\tau(x_\alpha)$. Suppose we have access to ground truth solutions $\{y_\beta\}_\beta \subset \mathcal{X}$ with $v_\beta := f_\tau(y_\beta)$. Pick a threshold $t \in \mathbb{R}$ and denote $A_t := \{x_\alpha : u_\alpha \le t\}$, $B_t := \{y_\beta : v_\beta \le t\}$. Let $W$ be a random variable that is uniformly distributed on $\mathcal{X}$. Define a random variable
$$D_t := \frac{1}{2}\|\pi_{A_t}(W) - \pi_{B_t}(\pi_{A_t}(W))\|_2 + \frac{1}{2}\|\pi_{B_t}(W) - \pi_{A_t}(\pi_{B_t}(W))\|_2, \qquad (4)$$
where $\pi_S(x) := \arg\min_{s \in S} \|x - s\|_2$. We call $W$ a witness of $D_t$, as it witnesses how different $A_t$ and $B_t$ are near $W$. To summarize the law of $D_t$, we define the witnessed divergence and witnessed precision at $\delta > 0$ as
$$\mathrm{WD}_t := \mathbb{E}[D_t] \quad \text{and} \quad \mathrm{WP}^\delta_t := P(D_t < \delta). \qquad (5)$$
Witnesses help handle unbalanced clusters that can appear in the solution sets. These metrics are agnostic to duplicates, unlike the chamfer distance or optimal transport metrics. Compared to alternatives like the Hausdorff distance, $\mathrm{WD}_t$ remains low if only a small portion of $A_t$, $B_t$ are mismatched. We illustrate these metrics in Figure 1.
One can interpret $\mathrm{WD}_t$ as a weighted chamfer distance whose weight is proportional to the volume of the $\ell_2$-Voronoi cell at each point in either set.
Particle Descent: Ground Truth Generation. A naïve method for MSO is to run gradient descent until convergence on randomly sampled particles in $\mathcal{X}$ for every $\tau \in \mathcal{T}$. We use this method to generate approximate ground truth solutions to compute the metrics in (5) when the ground truth is not available. This method is not directly comparable to ours since it cannot generalize to unseen $\tau$'s at test time. Remarkably, for highly non-convex objectives, particle descent can produce worse solutions than the ones obtained using the learned proximal operator (Figure D.7).
Learning Gradient Descent Operators. As there is no readily available application-agnostic baseline for MSO, we propose the following method that learns iterations of the gradient descent operator. Fix $Q \in \mathbb{N}_{\ge 1}$ and a step size $\eta > 0$. We optimize an operator $\Psi$ via
$$\min_{\Psi : \mathcal{X} \times \mathcal{T} \to \mathcal{X}} \; \mathbb{E}_{x \sim \mu,\, \tau \sim \nu} \, \|\Psi(x, \tau) - \Psi^\ast_Q(x; \tau)\|_2^2, \qquad (6)$$
where $\Psi^\ast_Q(x; \tau)$ is the result of $Q$ steps of gradient descent on $f_\tau$ starting at $x$, i.e., $\Psi^\ast_0(x; \tau) = x$ and $\Psi^\ast_k(x; \tau) = \Psi^\ast_{k-1}(x; \tau) - \eta \nabla f_\tau(\Psi^\ast_{k-1}(x; \tau))$. Each iteration of minimizing (6) requires $Q$ evaluations of $\nabla f_\tau$, which can be costly (e.g., for symmetry detection in Section 5.4). We use importance sampling similar to Appendix C. An ODE interpretation is that $\Psi$ performs $Q$ iterations of forward Euler on the negative gradient field $-\nabla f_\tau$, whereas the learned proximal operator performs a single iteration of backward Euler. We choose $Q = 10$ for all experiments except for symmetry detection (Section 5.4), where we choose $Q = 1$ because otherwise training would take more than 200 hours. As we will see in Figure D.6, aside from slower training, this approach struggles with non-smooth objectives due to the fixed step size $\eta$, while the learned proximal operator has no such issues.
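The witnessed metrics in (5) can be estimated by Monte Carlo sampling of the witness $W$. The sketch below is illustrative and is not taken from our evaluation code: it assumes finite solution sets that have already been thresholded at $t$, and the names `nearest`, `witnessed_metrics`, and `sample_witness` are hypothetical.

```python
import math
import random

def nearest(point, points):
    """pi_S(x): the element of the finite set `points` closest to `point`."""
    return min(points, key=lambda s: math.dist(point, s))

def witnessed_metrics(A, B, sample_witness, delta, num_witnesses=1000):
    """Monte Carlo estimates of WD = E[D] and WP^delta = P(D < delta) for finite sets A, B."""
    total, hits = 0.0, 0
    for _ in range(num_witnesses):
        w = sample_witness()
        a, b = nearest(w, A), nearest(w, B)
        # D as in (4): match the witness's neighbor in one set to the other set, both ways.
        d = 0.5 * math.dist(a, nearest(a, B)) + 0.5 * math.dist(b, nearest(b, A))
        total += d
        hits += d < delta
    return total / num_witnesses, hits / num_witnesses

random.seed(0)

def unit_square():
    """Witness distribution: uniform on X = [0, 1]^2 (an assumption for this example)."""
    return (random.random(), random.random())

same = [(0.2, 0.2), (0.8, 0.8)]
wd_same, wp_same = witnessed_metrics(same, same, unit_square, delta=0.1)
wd_far, wp_far = witnessed_metrics([(0.0, 0.0)], [(1.0, 0.0)], unit_square, delta=0.1)
```

When the two sets coincide, every witness yields $D_t = 0$; when each set is a single point at distance 1 from the other, every witness yields $D_t = 1$. Duplicating a point in either set leaves both estimates unchanged, which reflects the duplicate-agnostic property noted above.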

5. APPLICATIONS

We consider five applications to benchmark our MSO method, chosen to highlight the ubiquity of MSO in diverse settings. We abbreviate POL for proximal operator learning (proposed method), GOL for gradient operator learning (Section 4), and PD for particle descent (Section 4). Further details about each application can be found in Appendix D. The source code for all experiments can be found at https://github.com/lingxiaoli94/POL. Formulation. Level sets provide a concise and resolution-free implicit shape representation (Museth et al., 2002; Park et al., 2019; Sitzmann et al., 2020). Yet they are less intuitive to work with, even for tasks that are straightforward on discretized domains (meshes, point clouds), such as visualization or integration over the domain. We present an MSO formulation to sample from level sets, enabling the adaptation of downstream tasks to level sets.

5.1. SAMPLING FROM LEVEL SETS

Given a family of functions $\{g_\tau : \mathcal{X} \to \mathbb{R}^q\}_{\tau \in \mathcal{T}}$, for each $\tau$ suppose we want to sample from the 0-level set $g_\tau^{-1}(0)$. We formulate an MSO problem with objective $f_\tau(x) := \|g_\tau(x)\|_2^2$, whose global optima are precisely $g_\tau^{-1}(0)$. We do not need assumptions on level set topology or that the implicit function represents a distance field, unlike most existing methods (Park et al., 2019; Deng et al., 2020; Chen et al., 2020). Benchmark. We consider sampling from conic sections. We keep this experiment simple so as to visualize the solutions easily. Let $\mathcal{X} = [-5, 5]^2$ and $\mathcal{T} = [-1, 1]^6$. For $\tau = (A, B, C, D, E, F) \in \mathcal{T}$, define $g_\tau$ to be $g_\tau(x_1, x_2) := A x_1^2 + B x_1 x_2 + C x_2^2 + D x_1 + E x_2 + F$. Since $f_\tau = (g_\tau)^2$ is defined on a compact $\mathcal{X}$, it satisfies the conditions of Theorem 3.1 for a large $\lambda$, but a large $\lambda$ corresponds to a small PPA step size. Empirically, a small $\lambda$ for POL gave decent results compared to GOL: Figure 2 illustrates that POL consistently produces sharper level sets for both hyperbolas ($B^2 - 4AC > 0$) and ellipses ($B^2 - 4AC < 0$). Figure D.4 shows that POL yields significantly higher $\mathrm{WP}^\delta_t$ than GOL for small $\delta$, implying that details are well recovered. Figure D.5 verifies that iterating the trained operator of POL converges much faster than that of GOL. It is straightforward to extend this setting to sample from more complicated implicit shapes parameterized by $\tau$.
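For reference, the conic-section objective above can be written in a few lines. This is a small sketch with hypothetical function names (`conic`, `objective`, `conic_type`); classifying the zero-discriminant case as a parabola is an addition for completeness that ignores degenerate conics.

```python
def conic(tau, x1, x2):
    """g_tau(x1, x2) = A x1^2 + B x1 x2 + C x2^2 + D x1 + E x2 + F."""
    A, B, C, D, E, F = tau
    return A * x1 * x1 + B * x1 * x2 + C * x2 * x2 + D * x1 + E * x2 + F

def objective(tau, x1, x2):
    """MSO objective f_tau = g_tau^2, which vanishes exactly on the 0-level set of g_tau."""
    return conic(tau, x1, x2) ** 2

def conic_type(tau):
    """Classify by the discriminant B^2 - 4AC (degenerate conics are not handled)."""
    A, B, C = tau[:3]
    disc = B * B - 4 * A * C
    if disc > 0:
        return "hyperbola"
    if disc < 0:
        return "ellipse"
    return "parabola"

unit_circle = (1, 0, 1, 0, 0, -1)      # x1^2 + x2^2 - 1 = 0
unit_hyperbola = (1, 0, -1, 0, 0, -1)  # x1^2 - x2^2 - 1 = 0
```

Both example parameters lie in $\mathcal{T} = [-1, 1]^6$, and the point $(1, 0)$ is on both level sets, so `objective` evaluates to zero there for either $\tau$.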

5.2. SPARSE RECOVERY

Formulation. In signal processing, the sparse recovery problem aims to recover a signal $x^\ast \in \mathcal{X} \subset \mathbb{R}^d$ from a noisy measurement $y \in \mathbb{R}^m$ distributed according to $y = Ax^\ast + e$, where $A \in \mathbb{R}^{m \times d}$, $m < d$, and $e$ is measurement noise (Beck & Teboulle, 2009). In applications like imaging and speech recognition, the signals are sparse, with few non-zero entries (Marques et al., 2018). Hence, the goal of sparse recovery is to recover a sparse $x^\ast$ given $A$ and $y$. A common way to encourage sparsity is to solve least-squares plus an $\ell_p$ norm on the signal:
$$\min_{x \in \mathcal{X}} \; \|Ax - y\|_2^2 + \alpha \|x\|_p^p, \qquad (7)$$
for $\alpha, p > 0$ and $\|x\|_p^p := \sum_{i=1}^{d} (x_i^2 + \epsilon)^{p/2}$ for a small $\epsilon$ to prevent instability. We consider the non-convex case where $0 < p < 1$. Compared to convex alternatives like LASSO ($p = 1$), non-convex $\ell_p$ norms require milder conditions under which the global optima of (7) are the desired sparse $x^\ast$ (Chartrand & Staneva, 2008; Chen & Gu, 2014). To apply our MSO framework, we define $\tau = (\alpha, p) \in \mathcal{T}$ and $f_\tau$ to be the objective (7) with corresponding $\alpha$, $p$. Compared to existing methods for non-convex sparse recovery (Lai et al., 2013), our method can recover multiple solutions from the non-convex landscape for a family of $\alpha$'s and $p$'s without having to restart. The user can adjust the parameters $\alpha$, $p$ to quickly generate candidate solutions before choosing a solution based on their preference. Benchmark. Let $\mathcal{X} = [-2, 2]^8$, $\mathcal{T} = [0, 1] \times [0.2, 0.5]$. We consider highly non-convex $\ell_p$ norms with $p \in [0.2, 0.5]$ to test our method's limits. We choose $d = 8$ and $m = 4$, and sample the sparse signal $x^\ast$ uniformly in $\mathcal{X}$ with half of the coordinates set to 0. We then sample entries in $A$ i.i.d. from $N(0, 1)$ and generate $y = Ax^\ast + e$ where $e \sim N(0, 0.1)$. Although $\|x\|_p^p$ is not weakly convex, POL achieves decent results (Figure D.6).
Notably, POL often reaches a better objective than PD (Figure D.7) while retaining diversity, even though POL uses a much bigger step size ($1/\lambda = 0.1$ compared to PD's $10^{-5}$) and needs to learn a different operator for an entire family of $\tau \in \mathcal{T}$. In Figure D.8, we additionally compare POL with proximal gradient descent (Tibshirani et al., 2010) for $p = 1/2$, where the corresponding thresholding formula has a closed form (Cao et al., 2013). Remarkably, we have observed superior performance of POL against such a strong baseline.
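The smoothed objective (7) is simple to evaluate directly. The sketch below uses plain Python lists to stay self-contained; the function name `sparse_objective` and the default `eps=1e-8` are illustrative assumptions, not values taken from our experiments.

```python
def sparse_objective(A, y, x, alpha, p, eps=1e-8):
    """Evaluate ||A x - y||_2^2 + alpha * sum_i (x_i^2 + eps)^(p/2), as in (7)."""
    residual = [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - y_i
        for row, y_i in zip(A, y)
    ]
    data_term = sum(r * r for r in residual)
    # Smoothed l_p penalty; eps > 0 keeps the term differentiable at x_i = 0.
    penalty = sum((x_i * x_i + eps) ** (p / 2) for x_i in x)
    return data_term + alpha * penalty

identity = [[1.0, 0.0], [0.0, 1.0]]
measurement = [3.0, -4.0]
# With eps = 0 and p = 1 the penalty reduces to the l_1 norm: |3| + |-4| = 7.
value_l1 = sparse_objective(identity, measurement, [3.0, -4.0], alpha=2.0, p=1.0, eps=0.0)
# At x = 0 with p = 0.5 the smoothed penalty is 2 * (1e-8)^(1/4) = 0.02.
value_zero = sparse_objective(identity, measurement, [0.0, 0.0], alpha=1.0, p=0.5)
```

For $0 < p < 1$ the penalty has very steep slopes near zero entries when $\epsilon$ is small, which is one reason a fixed-step gradient method needs a tiny step size on this problem.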

5.3. RANK-2 RELAXATION OF MAX-CUT

Formulation. MSO can be applied to solve combinatorial problems that admit smooth non-convex relaxations. Here, we consider the classical problem of finding the maximum cut of an undirected graph $G = (V, E)$, where $V = \{1, \ldots, n\}$, $E \subset V \times V$, with edge weights $\{w_{ij}\} \subset \mathbb{R}$ so that $w_{ij} = 0$ if $(i, j) \notin E$. The goal is to find $\{x_i\} \in \{-1, +1\}^V$ to maximize $\sum_{i,j} w_{ij}(1 - x_i x_j)$. Burer et al. (2002) propose solving $\min_{\theta \in \mathbb{R}^n} \sum_{i,j} w_{ij} \cos(\theta_i - \theta_j)$, a rank-2 non-convex relaxation of the max-cut problem. This objective inherits weak convexity from cosine, so it satisfies the conditions of Theorem 3.1. In practice, instead of using angles as the variables, which are ambiguous up to $2\pi$, we represent each variable as a point on the unit circle $S^1$, so we choose $\mathcal{X} = (S^1)^n$ and $\mathcal{T}$ to be the space of all edge weights with $n$ vertices. For $\tau = \{\tau_{ij}\} \in \mathcal{T}$ corresponding to a graph with edge weights $\{\tau_{ij}\}$, we define, for $x \in \mathcal{X}$, $f_\tau(x) := \sum_{i,j} \tau_{ij} x_i^\top x_j$. After minimizing $f_\tau$, we can find cuts using a Goemans-Williamson-type procedure (Goemans & Williamson, 1995). Instead of using heuristics to find optima near a solution (Burer et al., 2002), our method can help the user effortlessly explore the set of near-optimal solutions without hand-designed heuristics. Benchmark. We apply our formulation to $K_8$, the complete graph with 8 vertices. Hence $\mathcal{X} = (S^1)^8 \subset \mathbb{R}^{16}$. We choose $\mathcal{T} = [0, 1]^{28}$ as there are 28 edges in $K_8$. We mix two types of random graphs with 8 vertices in training and testing: Erdős-Rényi graphs with $p = 0.5$ and $K_8$ with uniform edge weights in $[0, 1]$. Figure 3 shows that POL can generate a diverse set of max cuts. Quantitatively, compared to GOL, POL achieves better witnessed metrics (Figure D.10).
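The relaxed objective and a simple rounding step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in our experiments: each undirected edge is stored once in a dictionary, the cut value is taken to be the total weight of the edges crossing the cut, and `round_to_cut` is a simplified half-plane rounding for a single hypothetical direction `phi` (in practice one would try several directions and keep the best cut).

```python
import math

def relaxed_objective(weights, thetas):
    """Sum over stored edges (i, j) of w_ij * cos(theta_i - theta_j).

    For unit vectors x_i = (cos theta_i, sin theta_i), this equals sum of w_ij * <x_i, x_j>.
    """
    return sum(w * math.cos(thetas[i] - thetas[j]) for (i, j), w in weights.items())

def round_to_cut(thetas, phi):
    """Assign +1 to vertices whose unit vector has a non-negative inner product
    with the direction at angle phi, and -1 to the rest."""
    return [1 if math.cos(t - phi) >= 0 else -1 for t in thetas]

def cut_value(weights, signs):
    """Total weight of the edges whose endpoints receive different signs."""
    return sum(w for (i, j), w in weights.items() if signs[i] != signs[j])

# A 4-cycle with unit weights; alternating angles 0 and pi minimize the relaxation.
cycle = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
thetas = [0.0, math.pi, 0.0, math.pi]
signs = round_to_cut(thetas, phi=0.0)
```

Here the alternating angles place adjacent vertices at antipodal points of the circle, so the rounding cuts every edge of the cycle.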

5.4. SYMMETRY DETECTION OF 3D SHAPES

Formulation. Geometric symmetries are omnipresent in natural and man-made objects. Knowing symmetries can benefit downstream tasks in geometry and vision (Mitra et al., 2013; Shi et al., 2020; Zhou et al., 2021). We consider the problem of finding all reflection symmetries of a 3D surface. Let $\tau$ be a shape representation (e.g., point cloud, multi-view scan), and let $M_\tau \subset \mathbb{R}^3$ denote the corresponding triangular mesh that is available for the training set. As reflections are determined by the reflection plane, we set $\mathcal{X} = S^2 \times \mathbb{R}_{\ge 0}$, where $x = (n, d) \in \mathcal{X}$ denotes the plane with unit normal $n \in S^2 \subset \mathbb{R}^3$ and intercept $d \in \mathbb{R}_{\ge 0}$ (we assume $d \ge 0$ to remove the ambiguity of $(-n, -d)$ representing the same plane). Let $R_x : \mathbb{R}^3 \to \mathbb{R}^3$ denote the corresponding reflection. Perfect symmetries of $M_\tau$ satisfy $R_x(M_\tau) = M_\tau$. Let $s_\tau : \mathbb{R}^3 \to \mathbb{R}$ be the (unsigned) distance field of $M_\tau$ given by $s_\tau(p) = \min_{q \in M_\tau} \|p - q\|_2$. Inspired by Podolak et al. (2006), we define the MSO objective to be
$$f_\tau(x) := \mathbb{E}_{p \sim M_\tau}\left[ s_\tau(R_x(p)) \right], \qquad (9)$$
where a batch of $p$ is sampled uniformly from $M_\tau$ when evaluating the expectation. Although $f_\tau$ is stochastic, since we use point-to-mesh distances to compute $s_\tau$, perfect symmetries will make (9) zero with probability one. Compared to existing methods that either require ground truth symmetries obtained by human annotators (Shi et al., 2020) or detect only a small number of symmetries (Gao et al., 2020), our method applied to (9) finds arbitrary numbers of symmetries, including continuous ones, and can generalize to unseen shapes without needing ground truth symmetries as supervision. Benchmark. We detect reflection symmetries for mechanical parts in the MCB dataset (Kim et al., 2020). We choose $\mathcal{T}$ to be the space of 3D point clouds representing mechanical parts. From the mesh of each shape, we uniformly sample 2048 points with their normals and use DGCNN (Wang et al., 2019) to encode the oriented point clouds.
Benchmark. We apply the above MSO formulation to the COCO2017 dataset (Lin et al., 2014). As $\tau$ is an image, we fine-tune ResNet-50 (He et al., 2016) to encode $\tau$ into a vector $z$ that can be consumed by the operator network (Figure B.1).
Table 1: Object detection results. $\mathrm{WD}_\infty$ (resp. $\mathrm{WP}^{0.1}_\infty$) is the witnessed divergence (resp. precision) in (5) with $t = \infty$ (i.e., keeping all solutions), averaged over 10 trials (standard deviation $< 10^{-3}$). Precision and recall are computed with Hungarian matching as no confidence score is available for the usual greedy matching (see Appendix D.8).
In addition to GOL, we design a baseline method FN that uses the same ResNet-50 backbone and predicts a fixed number of boxes using the chamfer distance as the training loss. Table 1 compares the proposed method with these alternatives and the highly optimized Faster R-CNN (Ren et al., 2015) on the test dataset. Since we do not output confidence scores, the metrics are computed solely based on the set of predicted boxes. Our method achieves significantly better results than FN and GOL. Compared to Faster R-CNN, we achieve slightly worse results with 40.7% fewer network parameters. While Faster R-CNN contains highly specialized modules such as the region proposal network, in our method we simply feed the image feature vector output by ResNet-50 to a general-purpose operator network. Incorporating specialized architectures like region proposal networks into our proximal operator learning framework for object detection is an exciting future direction. We visualize the effect of PPA using the learned proximal operator in Figure 5.

6. CONCLUSION

Our work provides a straightforward and effective method to learn the proximal operator of MSO problems with varying parameters. Iterating the learned operator on randomly initialized points efficiently yields multiple optima to the MSO problems. Beyond promising results on our benchmark tasks, we see many exciting future directions that will further improve our pipeline. A current limitation is that at test time the optimal number of iterations to apply the learned operator is not known ahead of time (see end of Appendix D.1). One way to overcome this limitation would be to train another network that estimates when to stop. The quantity it estimates can be the objective itself if the optimum value is known a priori (e.g., sampling from level sets) or the gradient norm if objectives are smooth. Another future direction is to learn a proximal operator that adapts to multiple $\lambda$'s. This way, the user can easily experiment with different $\lambda$'s and run PPA with growing step sizes for super-linear convergence (Rockafellar, 1976; 2021). A further direction is to study how much we can relax the assumption that $\mathcal{X}$ is a low-dimensional Euclidean space. Our method could remain effective when $\mathcal{X}$ is a low-dimensional submanifold of a high-dimensional Euclidean space. The challenges would be to constrain the proximal operator to a submanifold and to design a proximal term that is more suitable than the ambient $\ell_2$ norm. Reproducibility statement. The complete source code for all experiments can be found at https://github.com/lingxiaoli94/POL. Detailed instructions are given in README.md. We have further included a tutorial on how to extend the framework to custom problems: see the "Extending to custom problems" section, where we include a toy physics problem of finding all rest configurations of an elastic spring. For all our experiments, the important details are provided in the main text, while the remaining details needed to reproduce results exactly are included in the appendix.

A CONVERGENCE OF TRAINING

We formally state and prove Theorem 3.1 via the following Proposition A.1 and Proposition A.2.

Proposition A.1. Suppose
1. $\mathcal{T} \subset \mathbb{R}^r$ for some $r \in \mathbb{N}_{\geq 1}$;
2. for any $\tau \in \mathcal{T}$, the objective $f_\tau$ is differentiable, $\xi$-weakly convex, and $\nabla f_\tau$ is $\zeta$-Lipschitz, i.e., $\|\nabla f_\tau(x_1) - \nabla f_\tau(x_2)\|_2 \leq \zeta \|x_1 - x_2\|_2$, with $\xi \leq \lambda$;
3. the activation function $\sigma(x)$ used is proper, real analytic, monotonically increasing and 1-Lipschitz, e.g., sigmoid, hyperbolic tangent.

For any $\delta > 0$, $H \geq 2$, $n \in \mathbb{N}_{\geq 1}$, assume $\Phi$ is an $H$-layer feed-forward neural network with hidden layer sizes $m_1, \ldots, m_H$ satisfying
\[ m_1, \ldots, m_{H-2} \geq \Omega\big(H^2 \log(Hn^2/\delta)\big), \qquad m_{H-1} \geq \Omega\big(\log(Hn^2/\delta)\big), \qquad m_H \geq \Omega(n). \]
Let $D$ denote the total number of weights in $\Phi$. Then $D = \Omega(n)$. Moreover, there exists a learning rate $\eta \in \mathbb{R}^D$ such that for any dataset $S = \{(x_i, \tau_i)\}_{i=1}^n$ of size $n$ with the training loss $L$ defined as in (3), for any $\epsilon > 0$, with probability at least $1 - \delta$ (over random Gaussian initial weights $\theta_0$ of $\Phi$), there exists $t = O(c_r(\lambda + \zeta)/\epsilon)$ such that $L(\Phi(\cdot, \cdot; \theta_t)) \leq L^* + \epsilon$, where $\|\theta_t\|_2^2$ stays bounded, $L^* := \min_{g : \mathcal{X} \times \mathcal{T} \to \mathcal{X}} L(g)$ is the global minimum of the functional $L$, $(\theta_k)_{k \in \mathbb{N}}$ is the sequence generated by gradient descent $\theta_{k+1} := \theta_k - \eta \odot \nabla_\theta L(\Phi(\cdot, \cdot; \theta_k))$, and $c_r$ depends only on $L$ and the initialization $\theta_0$.

Proof of Proposition A.1. The proposition is an application of Theorem 1 in Kawaguchi & Huang (2019) with the following modifications. For $i \in [n]$, define $\ell_i(x) := f_{\tau_i}(x) + \frac{\lambda}{2} \|x - x_i\|_2^2$. To check Assumption 1 of Kawaguchi & Huang (2019), observe
\[ \nabla_x \ell_i(x) = \nabla f_{\tau_i}(x) + \lambda (x - x_i), \qquad \nabla_x^2 \ell_i(x) = \nabla^2 f_{\tau_i}(x) + \lambda I_d. \]
Hence the assumption that $f_{\tau_i}$ is $\xi$-weakly convex with $\xi \leq \lambda$ implies that $\nabla^2 f_{\tau_i}(x) + \lambda I_d \succeq \nabla^2 f_{\tau_i}(x) + \xi I_d \succeq 0$. Hence $\ell_i$ is convex. The assumption that $\nabla f_{\tau_i}$ is $\zeta$-Lipschitz implies, for any $x_1, x_2 \in \mathcal{X}$,
\[ \|\nabla \ell_i(x_1) - \nabla \ell_i(x_2)\|_2 = \|\nabla f_{\tau_i}(x_1) - \nabla f_{\tau_i}(x_2) + \lambda (x_1 - x_2)\|_2 \leq \|\nabla f_{\tau_i}(x_1) - \nabla f_{\tau_i}(x_2)\|_2 + \lambda \|x_1 - x_2\|_2 \leq (\lambda + \zeta) \|x_1 - x_2\|_2. \]
Hence $\nabla \ell_i$ is $(\lambda + \zeta)$-Lipschitz.

An input vector to the neural network $\Phi$ is the concatenation $(x, \tau) \in \mathbb{R}^{d+r}$. Kawaguchi & Huang (2019) assume that the input data points are normalized to have unit length. This is not an issue, as we can scale down $(x_i, \tau_i)$ uniformly to be contained in a unit ball, then pad $\tau_i$ with one extra coordinate to make $\|(x_i, \tau_i)\|_2 = 1$ for all $i \in [n]$, similar to the argument given in the footnotes before Assumption 2.1 of Allen-Zhu et al. (2019).

Lastly, we state explicit lower bounds for the layer sizes that are used in the proof of Theorem 1 of Kawaguchi & Huang (2019) (see the paragraph below Lemma 3), instead of stating a single bound on the total number of weights as in the statement of Theorem 1. This is because Theorem 1 only states that there exists a network of size $\Omega(n)$ for which training converges, whereas every network satisfying the layer-wise bounds will have the same convergence guarantee.

Next we show that once the training loss is within $\epsilon$ of the global minimum, the approximation error on the training data is small in the mean-squared sense: i.e., the learned operator $\Phi(\cdot, \cdot; \theta)$ is close to the true proximal operator (1).

Proposition A.2. Suppose for any $\tau \in \mathcal{T}$, the objective $f_\tau$ is differentiable and $\xi$-weakly convex with $\xi < \lambda$, where $\lambda$ is the proximal regularization weight of the training loss $L(g)$ defined in (3). Let $\theta$ be the weights of the network $\Phi$ such that $L(\Phi(\cdot, \cdot; \theta)) \leq L^* + \epsilon$, where $L^* := \min_{g : \mathcal{X} \times \mathcal{T} \to \mathcal{X}} L(g)$ is the global minimum of the functional $L$. Let $\mathrm{prox}(\cdot; \cdot)$ be the true proximal operator defined in (1). Then the mean-squared error on the training data is bounded by
\[ \frac{1}{n} \sum_{i=1}^n \|\Phi(x_i, \tau_i; \theta) - \mathrm{prox}(x_i; \tau_i)\|_2^2 \leq \frac{2\epsilon}{\lambda - \xi}. \tag{10} \]

Proof. Clearly $L^* = L(\mathrm{prox}(\cdot; \cdot))$, i.e., the minimum of $L$ is achieved with the true proximal operator. Define $h_i : \mathcal{X} \to \mathbb{R}$ by $h_i(x) := f_{\tau_i}(x) + \frac{\lambda}{2} \|x - x_i\|_2^2$, so that we can write
\[ L(g) = \frac{1}{n} \sum_{i=1}^n h_i(g(x_i, \tau_i)). \]
By the assumption on weak convexity, each $h_i$ is $(\lambda - \xi)$-strongly convex. This implies for any $x, y \in \mathcal{X}$,
\[ h_i(x) \geq h_i(y) + \nabla h_i(y)^\top (x - y) + \frac{\lambda - \xi}{2} \|x - y\|_2^2. \tag{11} \]
The minimum of $h_i$ is achieved at $\mathrm{prox}(x_i; \tau_i)$ by the definition of prox. Differentiability and convexity imply $\nabla h_i(\mathrm{prox}(x_i; \tau_i)) = 0$. Hence setting $y = \mathrm{prox}(x_i; \tau_i)$ in (11) implies, for any $x \in \mathcal{X}$,
\[ h_i(x) - h_i(\mathrm{prox}(x_i; \tau_i)) \geq \frac{\lambda - \xi}{2} \|x - \mathrm{prox}(x_i; \tau_i)\|_2^2. \]
Now by the definition of (3),
\[ \epsilon \geq L(\Phi(\cdot, \cdot; \theta)) - L^* = L(\Phi(\cdot, \cdot; \theta)) - L(\mathrm{prox}(\cdot; \cdot)) = \frac{1}{n} \sum_{i=1}^n \big[ h_i(\Phi(x_i, \tau_i; \theta)) - h_i(\mathrm{prox}(x_i; \tau_i)) \big] \geq \frac{1}{n} \sum_{i=1}^n \frac{\lambda - \xi}{2} \|\Phi(x_i, \tau_i; \theta) - \mathrm{prox}(x_i; \tau_i)\|_2^2. \]
Rearranging terms we obtain the desired result.
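As a sanity check of the bound in Proposition A.2, the following is a minimal numerical sketch on a toy problem that is not from the paper. It assumes the hypothetical 1-D objective f(x) = -cos(x), which is 1-weakly convex, and it stands in for a learned operator by adding a fixed offset to the true proximal operator. All function names are illustrative.

```python
import math

XI = 1.0  # f(x) = -cos(x) has f''(x) = cos(x) >= -1, so it is 1-weakly convex


def f(x):
    return -math.cos(x)


def h(y, x_i, lam):
    # Proximal objective h_i(y) = f(y) + (lam / 2) * (y - x_i)^2.
    return f(y) + 0.5 * lam * (y - x_i) ** 2


def prox(x_i, lam, iters=200):
    # Gradient descent on h(., x_i). Its second derivative lies in
    # [lam - 1, lam + 1], so the step size 1 / (lam + 1) is a contraction.
    y = x_i
    step = 1.0 / (lam + 1.0)
    for _ in range(iters):
        y -= step * (math.sin(y) + lam * (y - x_i))
    return y


def check_bound(points, lam, offset):
    # Stand-in "learned" operator: the true prox plus a fixed offset.
    eps = 0.0
    mse = 0.0
    for x_i in points:
        p = prox(x_i, lam)
        q = p + offset
        eps += h(q, x_i, lam) - h(p, x_i, lam)  # contributes to L(Phi) - L*
        mse += (q - p) ** 2
    n = len(points)
    eps /= n
    mse /= n
    # Return the mean-squared error and the bound 2 * eps / (lam - xi).
    return mse, 2.0 * eps / (lam - XI)
```

With λ = 3 and ξ = 1, calling `check_bound` on a grid of points returns a mean-squared error that does not exceed the second returned value, as the proposition predicts.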

B NETWORK ARCHITECTURES

The network architecture we use to parameterize the operators for both POL and GOL is identical and is shown in Figure B.1. The encoder of τ is chosen depending on the application. For our conic section (5.1), sparse recovery (5.2), and max-cut (5.3) benchmarks, the encoder is just the identity map. For symmetry detection (5.4), τ is a point cloud and we use DGCNN (Wang et al., 2019). For object detection (5.5), we use ResNet-50 (He et al., 2016). Inspired by Dinh et al. (2016), we include both additive and multiplicative coupling in the residual blocks. At the same time, since we need neither bijectivity of the operator (proximal operators should not be bijective) nor access to the determinant of the Jacobian, we do not restrict ourselves to a map with triangular structure as in Dinh et al. (2016). We use 3 residual blocks for all applications, except for symmetry detection where we use 5 blocks, which gives slightly improved performance. Our architecture is economical: the model size (excluding the application-specific encoder) is under 2MB for all applications we consider. This also makes iterating the operators fast at test time. Note that the application-specific encoder only needs to be run once for each test-time τ, as the encoded vector z can be reused (Figure B.1).

C IMPORTANCE SAMPLING VIA UNFOLDING PPA

Directly optimizing (2) or (6) using mini-batching may not yield an operator that can refine a near-optimal solution if µ is taken to be unif(X), the uniform measure on X (more precisely, the d-dimensional Lebesgue measure restricted to X and normalized to a probability distribution). Instead, we would like to sample from a distribution that puts more probability density on near-optimal solutions. We achieve this goal as follows, inspired by Wang & Solomon (2019). Let Φ denote the network with weights after t training iterations. For $k \in \mathbb{N}_{\geq 0}$, denote $\mu_k := (\Phi^k)_\# (\mathrm{unif}(\mathcal{X}))$. For a fixed $K \in \mathbb{N}_{\geq 1}$, we set
\[ \mu := \frac{1}{K+1} \sum_{k=0}^{K} \mu_k. \]
Then, for training iteration t + 1, we optimize the objective (2) or ( 6) with the constructed µ. Note this modification does not introduce any bias for POL (similarly for GOL), in the sense that the optimal solution to (2) is still the true proximal operator since µ has full support, yet it puts more density in near-optimal regions as t increases. In practice, we choose K = 5 or K = 10. For the choice of other hyper-parameters, see Appendix D.1.
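One way to draw a sample from the mixture µ described above is to pick k uniformly from {0, ..., K}, draw a uniform point, and apply the current operator k times. The sketch below illustrates only this sampling step. The names `sample_unfolded`, `phi`, and `sample_uniform` are hypothetical, and the paper's actual implementation may differ (for example in batching or in how gradients are handled).

```python
import random


def sample_unfolded(phi, sample_uniform, K, rng):
    """Draw one sample from mu = 1/(K+1) * sum_{k=0}^{K} (phi^k)_# unif(X).

    phi            -- current operator, a map X -> X (stand-in for the network)
    sample_uniform -- function rng -> one point from unif(X)
    K              -- maximum number of unfolded applications of phi
    rng            -- a random.Random instance
    """
    k = rng.randrange(K + 1)   # mixture component index, uniform on {0, ..., K}
    x = sample_uniform(rng)    # a sample from mu_0 = unif(X)
    for _ in range(k):         # push forward through phi exactly k times
        x = phi(x)
    return x
```

Because the component k = 0 always has weight 1/(K+1), the resulting distribution keeps full support on X, which is the property the text relies on.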

D.1 HYPER-PARAMETERS

Unless mentioned otherwise, the following hyper-parameters are used. In each training iteration of POL and GOL, we sample 32 problem parameters from the training dataset of T, and 256 x's from unif(X) when computing (2) or (6) using the importance sampling trick in Appendix C. The learning rate of the operator is kept at $10^{-4}$ for both POL and GOL, and by default we train the operator network for $2 \times 10^5$ iterations. This is sufficient for the loss to converge for both POL and GOL in most cases. Since GOL requires multiple evaluations of the gradient of the objective, it typically trains two or more times slower than POL. For the proximal weight λ of POL, we choose it based on the scale of the objective and the dimension of X; see Table D.1. For the step size η in GOL, we start with 1/λ (so the same step size as POL in the forward/backward Euler sense) and then slowly increase it (so fewer iterations are needed for convergence) without degrading the metrics.

Table D.1: Choices of λ for all applications considered. X is the search space of solutions, and d is the dimension of the Euclidean space where we embed X (so it might be greater than the intrinsic dimension of X).

Application                X               d     λ
conic section (5.1)        [-5, 5]^2       2     0.1
sparse recovery (5.2)      [-2, 2]^8       8     10.0
max-cut (5.3)              (S^1)^8         16    10.0
symmetry detection (5.4)   S^2 × R_{≥0}    4     1.0
object detection (5.5)     [0, 1]^4        4     1.0

When evaluating (6), we set Q = 10 in all experiments except for symmetry detection, where we use Q = 1 because otherwise the training would take more than 200 hours. For PD, we choose a step size small enough so as to not miss significant minima and a sufficient number of iterations for the loss (i.e., the objective) to fully converge. For evaluation, the number of iterations to apply the trained operators is chosen to be large enough that the objective converges. This number is chosen separately for each application and method. By default, 1024 solutions are extracted from each method, and 1024 witnesses are sampled to compute $\mathrm{WD}_t$ and $\mathrm{WP}^\delta_t$, averaged over the test dataset and over 10 trials with standard deviation provided (in most cases the standard deviation is two orders of magnitude smaller than the metrics). We filter out solutions that do not lie in X.

A limitation of both POL and GOL is that when the solution set is continuous, too many applications of the learned operator can cause the solutions to collapse. We suspect this is because even with the importance sampling trick (Appendix C), during training the operators may never see enough near-optimal inputs to learn the correct refinement needed to recover the continuous solution set. A future direction is to have another network predict a confidence score for each x ∈ X so that at test time the user knows when to stop iterating the operator, e.g., when the objective value and its gradient are small enough; see the discussion in Section 6.
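To make the test-time procedure concrete, here is a minimal 1-D sketch of iterating an operator with a gradient-norm stopping rule, the kind of criterion mentioned above for smooth objectives. It is an illustration under stated assumptions, not the paper's implementation: `iterate_operator`, the tolerance, and the toy operator are all hypothetical.

```python
def iterate_operator(phi, grad_f, x0, tol=1e-3, max_iters=100):
    """Apply phi repeatedly until |grad_f(x)| < tol or max_iters is reached.

    Returns the final point and the number of times phi was applied.
    """
    x = x0
    for k in range(max_iters):
        if abs(grad_f(x)) < tol:  # stopping rule based on the gradient norm
            return x, k
        x = phi(x)                # one step of the (emulated) proximal-point algorithm
    return x, max_iters
```

For example, with f(x) = x²/2 and λ = 1 the exact proximal operator is x ↦ x/2, so `iterate_operator(lambda x: 0.5 * x, lambda x: x, 1.0)` stops after 10 applications.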

D.2 CONVERGENCE TO THE PROXIMAL OPERATOR

To empirically verify Proposition A.2, i.e., that our method can faithfully approximate the true proximal operators of the objectives, we conduct the following simple experiment. We consider the function $f(x) = \|x\|_1$ for $x \in \mathcal{X} = [-1, 1]^d$ and treat T as a singleton. Its proximal operator $\mathrm{prox}(x) = \arg\min_y \|y\|_1 + \|y - x\|_2^2$ is known in closed form as the shrinkage operation, defined coordinate-wise as
\[ \mathrm{prox}(x)_i = \begin{cases} x_i - 1/2 & x_i \geq 1/2 \\ 0 & |x_i| \leq 1/2 \\ x_i + 1/2 & x_i \leq -1/2. \end{cases} \tag{12} \]
For each dimension $d \in \{2, 4, 8, 16, 32\}$, we train an operator network Φ (Figure B.1) using (2) as the loss with learning rate $10^{-3}$. Figure D.1 shows the mean-squared error $\|\Phi(x) - \Phi^*(x)\|_2^2$ scaled by 1/d and averaged over 1024 samples vs. the training iterations, where $\Phi^*$ is the shrinkage operation (12). We see that the trained operator indeed converges to $\Phi^*$ as predicted by Proposition A.2, and the convergence is faster in smaller dimensions.
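The shrinkage formula (12) can be checked directly against a brute-force minimization of the per-coordinate objective |y| + (y - x)². The sketch below does this with a simple grid search; the function names are illustrative and the grid resolution is an arbitrary choice.

```python
def shrink(x, thresh=0.5):
    """Coordinate-wise shrinkage (12): the prox of ||.||_1 with a unit-weight squared term."""
    out = []
    for v in x:
        if v > thresh:
            out.append(v - thresh)
        elif v < -thresh:
            out.append(v + thresh)
        else:
            out.append(0.0)
    return out


def prox_brute_force(v, lo=-2.0, hi=2.0, n=4001):
    """Grid search for argmin_y |y| + (y - v)^2 on [lo, hi] with n grid points."""
    best_y, best_val = lo, float("inf")
    for i in range(n):
        y = lo + (hi - lo) * i / (n - 1)
        val = abs(y) + (y - v) ** 2
        if val < best_val:
            best_y, best_val = y, val
    return best_y
```

Since the per-coordinate objective is strongly convex, the grid minimizer lies within one grid step of the true minimizer, so the two functions should agree up to the grid resolution.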

D.3 EFFECT OF THE PROXIMAL TERM

In this section, we study the necessity of the proximal term $\frac{\lambda}{2} \|\Phi(x, \tau) - x\|_2^2$ in (2). Without such a term, the learned operator can degenerate. For example, consider Equation (1) of Chen et al. (2021), which minimizes $\min_\Phi \sum_{t=1}^T w_t f(x_t)$ with $x_{t+1} = x_t - \Phi(x_t, \nabla f(x_t), \ldots)$ for all t (with adapted notation). Suppose $x^*$ is one global optimum of f but is not the only one. Then $\Phi(x, \ldots) := x - x^*$ clearly minimizes the objective, yet the update steps will always set $x_t = x^*$ regardless of the initial positions.

To further illustrate the effect of different choices of λ, consider the 2D cosine function $f(x) = -\sum_{i=1}^{2} 10 \cos(2\pi x_i)$ for $x \in \mathcal{X} = [-5, 5]^2$ and a singleton T. This function is ξ-weakly convex with $\xi = 40\pi^2 < 400$ and has global minima forming a grid (all local minima are global minima). On the left of Figure D.2, we see that when λ = 400 > ξ (in which case the condition of Theorem 3.1 is met), POL recovers all optima. In comparison, for λ = 10, the outer ring of solutions is missing, and with λ = 0 or λ = 1 most optima in the grid are missing.

To demonstrate how existing L2O methods can fail to recover multiple solutions, we conduct the same experiment on the L2O particle-swarm method by Cao et al. (2019), which recovers a swarm of particles that are close to optima. We use the default parameters in the provided source code except changing the objective to the 2D cosine function and the standard deviation of the initial random particles to 1. As the method by Cao et al. (2019) could produce particles outside $\mathcal{X} = [-5, 5]^2$, we add an additional term 0.01∥x∥ 2 to the objective f(x); without such a term the particle swarm simply collapses to a single point far away from the origin. The results are shown on the right of Figure D.2. We see that even with 256 independent random starts and with population size 4, this method fails to recover most of the optima, in particular in non-positive quadrants.
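The weak-convexity constant quoted for the 2D cosine function can be verified numerically. The sketch below assumes the per-coordinate form -10 cos(2πs), whose second derivative is 40π² cos(2πs), and checks on a grid whether adding the proximal curvature λ makes the per-coordinate objective convex. Function names and the grid size are illustrative.

```python
import math

XI = 40.0 * math.pi ** 2  # claimed weak-convexity constant, about 394.8


def f_second_derivative(s):
    # Second derivative of the per-coordinate term -10 * cos(2 * pi * s).
    return 40.0 * math.pi ** 2 * math.cos(2.0 * math.pi * s)


def min_curvature_with_prox(lam, n=10001):
    # Smallest value of f''(s) + lam over a grid on [-5, 5].
    # The proximal objective is convex along this coordinate iff this is >= 0.
    return min(f_second_derivative(-5.0 + 10.0 * i / (n - 1)) + lam for i in range(n))
```

Under these assumptions, `min_curvature_with_prox(400.0)` is positive while `min_curvature_with_prox(10.0)` is strongly negative, which matches the regime λ > ξ required by Theorem 3.1.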

D.4 SAMPLING FROM CONIC SECTIONS

Setup. For this problem, the training dataset contains $2^{20}$ samples of τ ∈ T, while the test dataset has size 256. In our implementation, and similarly in other benchmarks, we do not store the datasets on disk, but instead generate them on the fly with fixed randomness. The τ's are sampled uniformly in T. PD is run for $5 \times 10^4$ steps with learning rate 1.0. For step sizes, we choose λ = 0.1 for POL and η = 1.0 for GOL. We found that the training of GOL explodes when η > 1.0. Meanwhile, POL is able to take bigger (1/λ = 10.0) steps while staying stable during training (but might fail to recover solutions due to the large step size). To obtain solutions, we use 5 iterations for POL, while for GOL we use 100 iterations since it converges more slowly (and more iterations do not improve the results).

Results. We visualize the solutions for the conic section problem in Figure D.3.

In Figure D.4, $\rho_{\mathrm{gt}}$ indicates the percentage of PD solutions that have objectives ≤ t, and ρ similarly indicates the percentage of solutions for each method with objectives below t. We sample 1024 witnesses to compute $\mathrm{WP}^\delta_t$, averaged over 256 test problem instances. The plot is averaged over 10 trials of witness sampling (the fill-in region's width indicates the standard deviation). Here the standard deviations are all less than $10^{-3}$, so the fill-in regions are too small to be visible.

For the sparse recovery problem (Section 5.2), we extract 4096 solutions from each method after training. For PD, we run $5 \times 10^5$ steps of gradient descent with learning rate $10^{-5}$. We found that due to the highly non-convex landscape of the problem, bigger learning rates cause PD to miss significant local minima. For step sizes, we choose λ = 10 for POL (this corresponds to step size 0.1 for backward Euler) and η = 0.1 for GOL. To obtain solutions, POL requires fewer than 20 iterations to converge, while for GOL over 100 iterations are needed.

Comparison with proximal gradient descent. For p = 1/2, 2/3, thresholding formulas exist for the $\ell_p$ norm (Cao et al., 2013).
That is, the proximal operator of $\|x\|_p^p$ has a closed form. This allows us to apply proximal gradient descent (Tibshirani et al., 2010) to solve (7). When p = 1 (i.e., when the problem reduces to LASSO), this reduces to the popular iterative soft-thresholding algorithm (ISTA), which converges significantly faster than gradient descent.
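For the p = 1 case, ISTA alternates a gradient step on the data term with soft-thresholding. The sketch below is a plain-Python illustration that assumes the objective has the form ‖Ax - y‖² + α‖x‖₁; equation (7) is not reproduced in this appendix, so this form, the zero initialization, and the function names are assumptions.

```python
import math


def soft_threshold(v, t):
    # Coordinate-wise prox of t * ||.||_1.
    return [math.copysign(max(abs(a) - t, 0.0), a) for a in v]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    # Computes A^T r for a matrix stored as a list of rows.
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def ista(A, y, alpha, step, iters):
    """Minimize ||Ax - y||_2^2 + alpha * ||x||_1 by proximal gradient descent."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = [2.0 * g for g in rmatvec(A, residual)]   # gradient of the smooth term
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * alpha)
    return x
```

The step size should be at most the reciprocal of the Lipschitz constant of the smooth term's gradient, which is twice the largest eigenvalue of AᵀA. With A equal to the identity, the minimizer is the soft-thresholding of y at α/2.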

Results

We compare the convergence speed of POL to that of proximal gradient descent (denoted PGD) for the p = 1/2 case. We also include PD for reference. The generation of data (i.e., A and y in (7)) is the same as before. For POL we use the same setup with λ = 10 as before (corresponding to step size 0.1) except we restrict p to 1/2 during training and we train only for 1000 steps (note the test-time α is unseen during training). For PGD we use 0.04 as the step size because using 0.05 would lead to divergence of PGD (the objective would go to infinity). For PD we use 0.05 as the step size. We run all three methods for 200 steps (for POL this corresponds to 200 steps of PPA after training) and visualize the convergence and histograms of the objective for each method in Figure D.8. We see that POL converges faster than PGD even though PGD is highly specialized to the p = 1/2 case (where the thresholding formula has a closed form).

Other sparsity-inducing regularizers. Our method can be applied to sparse recovery problems with other sparsity-inducing regularizers in a straightforward manner. Consider the minimax concave penalty (MCP) from Yang et al. (2020), defined component-wise as
\[ \mathrm{MCP}(x; \tau) := \begin{cases} |x| - \tau x^2 & |x| \leq \frac{1}{2\tau} \\ \frac{1}{4\tau} & |x| > \frac{1}{2\tau}, \end{cases} \]
which is τ-weakly convex. We repeat the same setup as in Section 5.2 but with objectives
\[ f_\tau(x) := \|Ax - y\|_2^2 + \sum_{i=1}^d \mathrm{MCP}(x_i; \tau), \tag{13} \]
for τ ∈ [0.5, 2]. Note PGD is viable to solve (13) because the proximal operator of MCP has a closed form. We run PGD for $2 \times 10^4$ iterations with step size $10^{-4}$ to make sure it converges fully. We show the histogram of the solutions' objective values for PD, GOL, POL, and PGD in Figure D.9. Our results are consistent with those in Figure D.7: POL is on par with PD and significantly outperforms GOL. POL also performs better than PGD, which is only applicable because the regularizer MCP has a closed-form proximal operator.
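The MCP penalty and the objective (13) are simple to evaluate directly. The following sketch implements both in plain Python, with the matrix A stored as a list of rows; the function names are illustrative and the code only evaluates the objective (it does not implement the MCP proximal operator).

```python
def mcp(x, tau):
    """Minimax concave penalty, component-wise definition used in (13)."""
    a = abs(x)
    if a <= 1.0 / (2.0 * tau):
        return a - tau * a * a
    return 1.0 / (4.0 * tau)  # constant beyond the breakpoint |x| = 1 / (2 * tau)


def mcp_objective(A, y, x, tau):
    """Evaluate f_tau(x) = ||Ax - y||_2^2 + sum_i MCP(x_i; tau)."""
    residual = [sum(a * b for a, b in zip(row, x)) - yi for row, yi in zip(A, y)]
    data_term = sum(r * r for r in residual)
    return data_term + sum(mcp(xi, tau) for xi in x)
```

At the breakpoint |x| = 1/(2τ) the two branches agree, since 1/(2τ) - τ/(4τ²) = 1/(4τ), so the penalty is continuous.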

D.6 RANK-2 RELAXATION OF MAX-CUT

Setup. An additional feature of (8) is that the variables are constrained to $(S^1)^8 \subset \mathbb{R}^{16}$. Hence for POL and GOL we always project the output of the operator network onto the constrained set (normalizing to unit length before computing the loss or before iterating), while for PD we apply the projection after each gradient step. We generate a training dataset of $2^{20}$ graphs and a test dataset of 1024 graphs using the procedure described in Section 5.3: half of the graphs are Erdős-Rényi graphs with p = 0.5 and the remaining half are $K_8$ with edge weights drawn uniformly from [0, 1]. For PD, we use learning rate $10^{-4}$. For step sizes of POL and GOL, we choose λ = 10.0 and η = 10.0. We choose to directly feed the edge weight vector $\tau \in \mathbb{R}^{28}$ to the operator network (Appendix B). We find this simple encoding works better than alternatives such as graph convolutional networks. This is likely because $x \in \mathcal{X} = (S^1)^8$ requires order information from the encoded τ, so graph pooling operations can be detrimental to the operator network architecture. Designing an equivariant operator network that is capable of effectively consuming larger graphs is an interesting direction for future work.

Results. If a cut happens to be a local minimum of the relaxation, then it is a maximum cut (Theorem 3.4 of Burer et al. (2002)). However, finding all the local minima of the relaxation is not enough to find all max cuts, as max cuts can also appear as saddle points (see the discussion after Theorem 3.4 of Burer et al. (2002)). Hence solving the MSO (8) is not enough to identify all the max cuts. Nevertheless, we can still compare POL and GOL against PD based solely on the relaxed MSO problem corresponding to the objective (8). Figure D.10 compares the solutions obtained by POL and GOL to those of PD. We see that POL more faithfully recovers the solutions generated by PD, with consistently higher witnessed precision.

Empirically, we found the proposed POL can identify a diverse family of cuts. We visualize the multiple cuts obtained by POL for a number of graphs in Figure D.12. Although some cuts are not maximal, this is likely due to the relaxation (not all fractional solutions correspond to a cut) and not because of the proposed method. As evident in Figure D.10, they are still very close to the local minima of (8) generated by PD.
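The projection onto $(S^1)^8$ described in the setup amounts to normalizing each 2D block of the 16-dimensional output. The sketch below assumes the coordinates of each circle are stored as consecutive pairs, which is a layout assumption and not something the text specifies; the handling of a zero block is also an arbitrary choice.

```python
import math


def project_to_circles(v):
    """Project a flat vector of 2D blocks onto a product of unit circles."""
    out = []
    for i in range(0, len(v), 2):
        a, b = v[i], v[i + 1]
        norm = math.hypot(a, b)
        if norm == 0.0:
            # The projection is not unique at the origin; pick (1, 0) arbitrarily.
            a, b, norm = 1.0, 0.0, 1.0
        out.extend([a / norm, b / norm])
    return out
```

Applying this after each network evaluation (for POL and GOL) or after each gradient step (for PD) keeps every iterate in the constrained set.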

D.7 SYMMETRY DETECTION OF 3D SHAPES

Setup. Since the variables in (9) are constrained to $\mathcal{X} = S^2 \times \mathbb{R}_{\geq 0}$, we always project the output of the operator network onto the constrained set: for x = (n, d) ∈ X, we normalize n to have unit length and take the absolute value of d. The same projection is applied after each gradient step in PD. To generate training and test datasets, we use the original train/test split of the MCB dataset (Kim et al., 2020) but filter out meshes with more than 5000 triangles and keep up to 100 meshes per category to make the categories more balanced. During each training iteration, a fresh batch of point clouds is sampled (these are the τ's) from the meshes in the current batch. For step sizes, we choose λ = 1.0 for POL and η = 10.0 for GOL. The training of POL and GOL takes about 30 hours. For PD, we run gradient descent for 500 iterations for each model, which is sufficient for convergence. We use the official implementation of DGCNN by Wang et al. (2019) as the encoder, with the modifications that we change the number of input channels to 6 to consume oriented point clouds and that we turn off the dropout layers, which do not improve performance. The objective (9) involves $s_\tau$, which requires point-to-mesh projection. We implemented custom CUDA functions to speed up the projection. Even so, it remains the bottleneck of training. Since GOL requires multiple evaluations, it is extremely slow and can take more than a week. As such, we set Q = 1 in (6). Both POL and GOL are trained for $10^5$ iterations with batch size 8. At test time, iterating the operator networks requires evaluating neither the objective nor the $s_\tau$'s; moreover, only point clouds are needed.
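The projection onto $S^2 \times \mathbb{R}_{\geq 0}$ described in the setup can be written in a few lines. In the sketch below a candidate reflection is a pair (n, d) with n a 3-vector and d a scalar; the function name and the handling of a zero normal are assumptions made for illustration.

```python
import math


def project_reflection(n, d):
    """Project (n, d) onto S^2 x R_{>=0}: unit-normalize n and take |d|."""
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        # No unique nearest unit vector; fall back to an arbitrary axis.
        unit = (1.0, 0.0, 0.0)
    else:
        unit = tuple(c / norm for c in n)
    return unit, abs(d)
```

This is the map applied to the operator network's output and after each PD gradient step.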

D.8 OBJECT DETECTION IN IMAGES

Setup. We use the training and validation splits of COCO2017 (Lin et al., 2014) as the training and test datasets, keeping only images with at most 10 ground truth bounding boxes. For training, we use common augmentation techniques such as random resize/crop, horizontal flip, and random RGB shift to generate a 400 × 400 patch from each training batch image, with batch size 32. For evaluation, we crop a 400 × 400 image patch from each test image. For step sizes, we choose λ = 1.0 for POL and η = 1.0 for GOL. We train both POL and GOL for $10^6$ steps. This takes about 100 hours. To extract solutions, we use 100 iterations for POL (for most images it only needs 5 iterations to converge) and 1000 iterations for GOL (the convergence is very slow, so we run it for a large number of iterations). We fine-tune PyTorch's pretrained ResNet-50 (He et al., 2016) with the following modifications. We first delete the last fully-connected layer. Then we add an additional linear layer to turn the 2048 channels into 256. We then add sinusoidal positional encodings to pixels in the feature image output by ResNet-50, followed by fully-connected layers with hidden layer sizes 256, 256, 256. Finally, average pooling is used to obtain a single feature vector for the image. For Faster R-CNN (FRCNN), we use the pretrained model from PyTorch with a ResNet-50 backbone and a region proposal network. It should be noted that FRCNN is designed for a different task that includes prediction of class labels, and thus it is trained with more supervision (object class labels) than our method and it uses additional loss terms for class labels. For the alternative method FN that predicts a fixed number of boxes, we attach fully-connected layers of hidden sizes [256, 256, 80] with ReLU activation to consume the pooled feature vector from ResNet-50. The output vector of dimension 80 is then reshaped to 20 × 4, representing the box parameters of 20 boxes.
We use chamfer distance between the set of predicted boxes and the set of ground truth boxes as the training loss.

Results.

In Table 1, we compute witness metrics and traditional metrics including precision and recall. As our method does not output confidence scores, we cannot use common evaluation metrics such as average precision. To calculate precision and recall, which normally would require an order given by the confidence scores, we instead build a bipartite graph between the predicted boxes and the ground truth, adding an edge if the Intersection over Union (IoU) between two boxes is greater than 0.5. Then we consider predictions that appear in the Hungarian max matching as true positives, and the unmatched ones as false positives. Precision is then defined as the number of true positives over the total number of predictions, while recall is defined as the number of true positives over the total number of ground truth boxes. When computing metrics for POL and GOL, we run the mean-shift algorithm with RBF bandwidth 0.01 to find the centers of clusters and use them as the predictions. As shown in Figure 5, the clusters formed by POL are usually extremely sharp after a few steps, and any reasonable bandwidth will result in the same clusters. In Figure D.14, we show the detection results of our method for a large number of test images chosen at random. In each image we display all 1024 bounding boxes without clustering, most of which are perfectly overlapping. For some images there is a bounding box for the whole image (check if the image has an orange border). There is no class label associated with each box.
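The matching-based precision and recall described above can be sketched as follows. Boxes are assumed to be in the (x, y, w, h) center format used in Section 5.5. Instead of the Hungarian algorithm, this illustration uses a simple augmenting-path maximum bipartite matching, which yields the same number of matched pairs on an unweighted graph; all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def max_matching(adj, n_right):
    """Size of a maximum bipartite matching; adj[u] lists right-side neighbors of u."""
    match_right = [-1] * n_right

    def try_assign(u, seen):
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_assign(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    return sum(1 for u in range(len(adj)) if try_assign(u, [False] * n_right))


def precision_recall(preds, gts, thresh=0.5):
    # Edge between a prediction and a ground truth box iff IoU > thresh.
    adj = [[j for j, g in enumerate(gts) if iou(p, g) > thresh] for p in preds]
    true_pos = max_matching(adj, len(gts))
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    return precision, recall
```

Because each ground truth box can be matched at most once, duplicate predictions of the same object count as false positives, which is why clustering the POL outputs before evaluation matters.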



The usual convention is to use the reciprocal of λ in front of the proximal term. We use a different convention to associate λ with the convexity of (1).

We use $\tilde{\Omega}$ notation in the standard way, i.e., $f \in \tilde{\Omega}(n) \iff \exists k \in \mathbb{N}_{\geq 0}$ such that $f \in \Omega(n \log^k n)$.

$\sum_{i=1}^{2} 10 \cos(2\pi x_i)$ for $x \in \mathcal{X} = [-5, 5]^2$ and a singleton T. This function is ξ-weakly convex



Figure 2: Visualization of the solutions for the conic section problem. Red, green, and blue indicate the solutions by PD, GOL, and POL respectively. See Figure D.3 for more examples.

Figure 3: 18 different max cuts (max cut value 10) of a graph generated by our method. Red and blue vertices indicate the two vertex sets separated by the cut. Vertex 0 is set to blue to remove the duplicates obtained by swapping the colors. See Figure D.12 for more results.

Figure 4 shows our method's results on a selection of models in the test dataset; for per-iteration PPA results of our method, see Figure D.13. Figure D.11 shows that POL achieves much higher witnessed precision compared to GOL.

Figure 4: Symmetry detection results. Each reflection is represented as a colored line segment representing the normal of the reflection plane with one endpoint on the plane. Pink indicates better objective values, while blue indicates worse. Our method is capable of detecting complicated discrete symmetries as well as continuous families of cylindrical reflectional symmetries.

5.5 OBJECT DETECTION IN IMAGES

Formulation. Identifying objects in an image is a central problem in vision on which recent works have made significant progress (Ren et al., 2015; Carion et al., 2020; Liu et al., 2021). We consider a simplified task where we drop the class labels and predict only bounding boxes. Let $b = (x, y, w, h) \in \mathcal{X} = [0, 1]^4$ denote a box with (normalized) center coordinates (x, y), width w, and height h. We choose T to be the space of images. Suppose an image τ has $K_\tau$ ground truth object bounding boxes $\{b_i^\tau\}_{i=1}^{K_\tau}$. We define the MSO objective to be $f_\tau(x) := \min_{i=1,\ldots,K_\tau} \|b_i^\tau - x\|_1$; its minimizers are exactly $\{b_i^\tau\}_{i=1}^{K_\tau}$. Although the objective may seem trivial, its gradients reveal the $\ell_1$-Voronoi diagram formed by the $b_i^\tau$'s when training the proximal operator. Different from existing approaches, we encode the distribution of bounding boxes conditioned on each image in the learned proximal operator without needing to predict confidence scores or a fixed number of boxes. A similar idea based on diffusion was recently proposed by Chen et al. (2022).
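The object detection objective defined in the formulation is the ℓ1 distance to the nearest ground truth box. A minimal sketch, with illustrative function names and boxes given as 4-tuples (x, y, w, h):

```python
def l1_distance(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def detection_objective(x, gt_boxes):
    """f_tau(x) = min_i ||b_i - x||_1 over the ground truth boxes of one image."""
    return min(l1_distance(b, x) for b in gt_boxes)


def nearest_box_index(x, gt_boxes):
    """Index of the l1-Voronoi cell containing x (ties go to the lowest index)."""
    return min(range(len(gt_boxes)), key=lambda i: l1_distance(gt_boxes[i], x))
```

The objective is zero exactly at the ground truth boxes, and away from cell boundaries its gradient points along the sign pattern of x minus the nearest box, which is how the ℓ1-Voronoi structure enters training.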

Further qualitative results (Figure D.14) and details can be found in Appendix D.8.

Figure 5: First 4 iterations of PPA using the learned proximal operator on 20 randomly initialized boxes (leftmost column). Only a few iterations are needed for the boxes to form distinctive clusters.

Figure D.1: Convergence to the true proximal operator of f (x) = ∥x∥ 1 .

Figure D.2: Left: the result of POL after 10 iterations with λ = 0, 1, 10, 400 for the 2D cosine function, which has weak convexity constant $\xi = 40\pi^2 < 400$. Right: particle swarms recovered by Cao et al. (2019) after 10 iterations from 256 independent runs. The population size of the swarm is 4 (default value in their source code).

Figure D.3 shows the solutions for 16 randomly chosen τ ∈ T. In Figure D.4 we plot δ vs. $\mathrm{WP}^\delta_t$ (5) to quantitatively verify how good POL and GOL are at recovering the level sets, where we treat the results of PD as the ground truth. Both visually and quantitatively, we see that POL outperforms GOL. Figure D.5 compares the convergence speed when applying the learned iterative operators at test time: clearly POL converges much faster.

Figure D.3: Visualization of the solutions for the conic section problem. Red indicates the solutions by PD which we treat as ground truth. Green and blue indicate the solutions by GOL (Section 4) and POL (proposed method) respectively.

Figure D.5: Convergence speed comparison at test time for the conic section problem. For POL and GOL, the x-axis is the number of iterations used. For PD, the x-axis is the number of gradient descent steps, multiplied by 100. The horizontal axis shows the number of iterations, and the vertical axis shows the value of f τ (x), averaged over all current solutions (fill-in region's width indicates standard deviation). The three plots shown correspond to the problem instances in the first three columns in Figure D.3. Once the operator has been trained, POL converges in less than 5 steps, while GOL converges slower (GOL is already trained with the largest step size without causing training to explode).

We show the histogram of the solutions' objective values for PD, GOL, and POL in Figure D.7 for 4 problem instances. Figure D.6 visualizes the solutions for 8 problem instances projected onto the last two coordinates. GOL fails badly in all instances. Remarkably, despite the non-convexity of the problem and the much larger step size (0.1 compared to $10^{-5}$), POL yields solutions on par with or better than PD when p is small. For instance, for the second and third columns in Figure D.6 (corresponding to the second and third columns in Figure D.7), PD (in red) misses near-optimal solutions that POL (in blue) captures. As such, the results of PD can be suboptimal, so we do not compute witness metrics here.

Figure D.6: Visualization of the solutions' objective values for the conic section problem. Red indicates the solutions by PD which we treat as ground truth. Green and blue indicate the solutions by GOL and POL (proposed method) respectively.

Figure D.8: Convergence (left figure) and histograms (right figure) of the objective for the p = 1/2 non-convex sparse recovery problem with α = 0.627. For the convergence figure on the left, the horizontal axis shows the number of iterations, and the vertical axis shows the value of $f_\alpha(x)$, averaged over all current solutions (fill-in region's width indicates standard deviation). PGD denotes the proximal gradient descent method (Tibshirani et al., 2010) with the closed-form thresholding formula by Cao et al. (2013).

Figure D.9: Histograms of the objectives for sparse recovery with minimax concave penalty (MCP) on 4 problem instances.

Figure D.10: The plot of δ vs. $\mathrm{WP}^\delta_t$ for the max-cut problem. See the caption of Figure D.4 for the meaning of the symbols. As different edge weights lead to different minimum values, we choose the threshold t in a relative manner: the actual threshold used will be t times the best objective value found by PD.

Figure D.11: The plot of δ vs. $\mathrm{WP}^\delta_t$ for symmetry detection on the test dataset of Kim et al. (2020) ($t = 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}$). See the caption of Figure D.4 for the meaning of the various notations. We do not show $\mathrm{WD}_t$ as the vertical bars here because GOL's $\mathrm{WD}_t$ is much higher than POL's and is out of the range of the horizontal axis. We sample 1024 witnesses to compute $\mathrm{WP}^\delta_t$, averaged over 10 trials of witness sampling (the fill-in region's width indicates the standard deviation).

Figure D.13: Visualization of PPA with learned proximal operator on selected models from the test dataset of Kim et al. (2020). Iterations 0, 1, 2, 5, 10, 15, 20 are shown, where the 0th iteration contains the initial samples from unif(X ). Pink indicates lower objective value in (9), while light blue indicates higher.

Figure D.14: Randomly selected object detection results by POL on the COCO17 validation split. Each test image is a 400 × 400 patch cropped from the original image after random scaling by a factor between 0.5 and 1. In each image we display all 1024 bounding boxes without clustering, most of which overlap perfectly. For some images there is a bounding box covering the whole image (visible as an orange border around the image). There is no class label associated with each box.

means keeping predictions with confidence ≥ S% for Faster R-CNN.

Table D.1. All training is done on a single NVIDIA RTX 3090 GPU.

Acknowledgements We thank Chenyang Yuan for suggesting the rank-2 relaxation of max-cut problems. The MIT Geometric Data Processing group acknowledges the generous support of Army Research Office grants W911NF2010168 and W911NF2110293, of Air Force Office of Scientific Research award FA9550-19-1-031, of National Science Foundation grants IIS-1838071 and CHS-1955697, from the CSAIL Systems that Learn program, from the MIT-IBM Watson AI Laboratory, from the Toyota-CSAIL Joint Research Center, from a gift from Adobe Systems, and from a Google Research Scholar award.

Results. We show the witness metrics in Figure D.11; quantitatively, POL exhibits far higher witnessed precision values than GOL. We show a visualization of iterations of PPA with the learned proximal operator in Figure D.13. In particular, our method is capable of detecting complicated discrete reflectional symmetries as well as a continuous family of reflectional symmetries for cylindrical objects.

E CONNECTION TO WASSERSTEIN GRADIENT FLOWS

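In this section we show that we can view (2) as solving the JKO discretization of Wasserstein gradient flows at every time step, under the assumption that the measures along the JKO discretization are absolutely continuous.

If $\mathcal{F}(\mu)$ is a linear functional of the form $\mathcal{F}(\mu) = \int f \, d\mu$ on $\mathcal{P}(\mathcal{X})$, the space of probability distributions in $\mathcal{X}$ with $\mathcal{X}$ compact, then the JKO discretization of the gradient flow of $\mathcal{F}$ at step $t + 1$ with step size $1/\lambda$ is

$$\mu_{t+1} \in \arg\min_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} \int f \, d\mu + \frac{\lambda}{2} W_2^2(\mu_t, \mu),$$

where $W_2$ is the Wasserstein-2 distance and we assume $\mu_t$ is absolutely continuous. Let

$$F(\mu) := \int f \, d\mu + \frac{\lambda}{2} W_2^2(\mu_t, \mu).$$

Let us also define another functional $G$ such that for a Borel map $T : \mathcal{X} \to \mathcal{X}$,

$$G(T) := \int f(T(x)) \, d\mu_t(x) + \frac{\lambda}{2} \int \|T(x) - x\|_2^2 \, d\mu_t(x).$$

First, given $\mu \in \mathcal{P}(\mathcal{X})$, since $\mathcal{X}$ is compact (so in particular all probability distributions have finite second moments), by Brenier's theorem (Ambrosio et al., 2005, Theorem 6.2.4), there exists a Borel map $T$ (the Monge map) such that $T_\# \mu_t = \mu$ and $W_2^2(\mu_t, \mu) = \int \|T(x) - x\|_2^2 \, d\mu_t(x)$. Hence for such $\mu$ and $T$ we have $G(T) = F(\mu)$, and thus $\min_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} F(\mu) \geq \min_T G(T)$.

Next, given a Borel map $T$, let $\mu = T_\# \mu_t$. By Brenier's theorem, let $T'$ be the Monge map corresponding to $W_2(\mu_t, \mu)$, so that $\mu = T'_\# \mu_t$ and

$$\int \|T'(x) - x\|_2^2 \, d\mu_t(x) = W_2^2(\mu_t, \mu) \leq \int \|T(x) - x\|_2^2 \, d\mu_t(x).$$

Since $T$ and $T'$ push $\mu_t$ forward to the same $\mu$, we have $\int f(T(x)) \, d\mu_t(x) = \int f \, d\mu = \int f(T'(x)) \, d\mu_t(x)$, so $G(T) \geq G(T') = F(\mu)$ and therefore $\min_T G(T) \geq \min_{\mu \in \mathcal{P}_2(\mathbb{R}^d)} F(\mu)$. Combining the two inequalities, the two minima coincide.

If $\mu_t$ has full support, then the best $T^* := \arg\min_T G(T)$ is obtained pointwise and it becomes the proximal operator of $f$ (cf. (2)). In particular, $T^*$ does not depend on $\mu_t$.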
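The pointwise nature of $T^*$ can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper's experiments: it assumes a one-dimensional double-well objective f(y) = (y^2 - 1)^2, an assumed value λ = 8 (larger than the weak-convexity constant 4 of this f, so the proximal subproblem is strongly convex), and an empirical measure of 64 uniform samples standing in for µ_t. The helper names `prox`, `G`, and `ppa` are hypothetical. The proximal operator is computed per sample by bisection, compared against two other maps under an empirical version of G with the λ/2 quadratic penalty, and then iterated as in the proximal-point algorithm.

```python
import random

LAM = 8.0  # assumed proximal weight; exceeds 4, the weak-convexity constant of f


def f(y):
    """Toy double-well objective with local minima at y = -1 and y = +1."""
    return (y * y - 1.0) ** 2


def prox(x, lam=LAM, lo=-3.0, hi=3.0, steps=80):
    """Return argmin_y f(y) + lam/2 * (y - x)^2 for x in [-2, 2].

    For lam > 4 the derivative of this objective is strictly increasing in y,
    so its unique root can be bracketed in [lo, hi] and found by bisection.
    """
    def dobj(y):
        return 4.0 * y ** 3 - 4.0 * y + lam * (y - x)

    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if dobj(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def G(T, samples, lam=LAM):
    """Empirical G(T): average of f(T(x)) + lam/2 * (T(x) - x)^2 over samples."""
    total = sum(f(T(x)) + 0.5 * lam * (T(x) - x) ** 2 for x in samples)
    return total / len(samples)


def ppa(samples, iters=200):
    """Apply the proximal operator repeatedly to every sample."""
    xs = list(samples)
    for _ in range(iters):
        xs = [prox(x) for x in xs]
    return xs


rng = random.Random(0)
samples = [rng.uniform(-2.0, 2.0) for _ in range(64)]  # stand-in for mu_t

# Two alternative maps to compare against the pointwise minimizer.
identity = lambda x: x
grad_step = lambda x: x - (4.0 * x ** 3 - 4.0 * x) / LAM

g_prox = G(prox, samples)
g_identity = G(identity, samples)
g_grad = G(grad_step, samples)

final = ppa(samples)
```

Because `prox` minimizes the integrand of G separately at each sample, `g_prox` cannot exceed the value of G at any other map evaluated on the same samples, whichever samples are drawn. Iterating the operator sends each sample to the local minimum on its own side of the origin, so both minima are recovered from a single batch of initial guesses.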

