PROJECTIVE PROXIMAL GRADIENT DESCENT FOR A CLASS OF NONCONVEX NONSMOOTH OPTIMIZATION PROBLEMS: FAST CONVERGENCE WITHOUT KURDYKA-ŁOJASIEWICZ (KŁ) PROPERTY

Abstract

Nonconvex and nonsmooth optimization problems are important and challenging for statistics and machine learning. In this paper, we propose Projective Proximal Gradient Descent (PPGD), which solves a class of nonconvex and nonsmooth optimization problems in which the nonconvexity and nonsmoothness come from a nonsmooth regularization term that is nonconvex but piecewise convex. In contrast with existing convergence analyses of accelerated PGD methods for nonconvex and nonsmooth problems, which are based on the Kurdyka-Łojasiewicz (KŁ) property, we provide a new theoretical analysis showing the local fast convergence of PPGD. It is proved that, under mild assumptions, PPGD achieves a fast convergence rate of $O(1/k^2)$ when the iteration number $k \ge k_0$ for a finite $k_0$ on a class of nonconvex and nonsmooth problems; this is locally Nesterov's optimal convergence rate of first-order methods on smooth and convex objective functions with Lipschitz continuous gradient. Experimental results demonstrate the effectiveness of PPGD.

1. INTRODUCTION

Nonconvex and nonsmooth optimization problems are challenging and have received a lot of attention in statistics and machine learning (Bolte et al., 2014; Ochs et al., 2015). In this paper, we consider fast optimization algorithms for a class of nonconvex and nonsmooth problems of the form

$\min_{x \in \mathbb{R}^d} F(x) = g(x) + h(x)$,  (1)

where $g$ is convex and $h(x) = \sum_{j=1}^d h_j(x_j)$ is a separable regularizer in which each $h_j$ is piecewise convex. A piecewise convex function is defined in Definition 1.1. For simplicity of analysis we let $h_j = f$ for all $j \in [d]$, where $f$ is a piecewise convex function and $[d]$ denotes the set of natural numbers between $1$ and $d$ inclusively. $f$ can be either nonconvex or convex, and all the results in this paper extend straightforwardly to the case where the $\{h_j\}$ are different.

Definition 1.1. A univariate function $f: \mathbb{R} \to \mathbb{R}$ is piecewise convex if $f$ is lower semicontinuous and there exist intervals $\{R_m\}_{m=1}^M$ such that $\mathbb{R} = \bigcup_{m=1}^M R_m$ and $f$ restricted to $R_m$ is convex for each $m \in [M]$. The left and right endpoints of $R_m$ are denoted by $q_{m-1}$ and $q_m$ for all $m \in [M]$, where $\{q_m\}_{m=0}^M$ are the endpoints with $q_0 = -\infty \le q_1 < q_2 < \ldots < q_M = +\infty$. Furthermore, $f$ is either left continuous or right continuous at each endpoint $q_m$ for $m \in [M-1]$.

$\{R_m\}_{m=1}^M$ are also referred to as convex pieces throughout this paper. It is important to note that for all $m \in [M-1]$, when $f$ is continuous at the endpoint $q_m$ or $f$ is only left continuous at $q_m$, then $q_m \in R_m$ and $q_m \notin R_{m+1}$; if $f$ is only right continuous at $q_m$, then $q_m \notin R_m$ and $q_m \in R_{m+1}$. This ensures that any point in $\mathbb{R}$ lies in exactly one convex piece. When $M = 1$, $f$ is a convex function on $\mathbb{R}$ and problem (1) is a convex problem. We consider $M > 1$ throughout this paper; our proposed algorithm trivially extends to the case $M = 1$. We also allow the special case that an interval $R_m = \{q_m\}$ for some $m \in [M-1]$ is a single-point set, in which case $q_{m-1} = q_m$.
When there are no single-point intervals in $\{R_m\}_{m=1}^M$, all the endpoints $\{q_m\}_{m=1}^{M-1}$ are distinct. It is worthwhile to emphasize that the nonconvexity and nonsmoothness of many popular optimization problems come from piecewise convex regularizers. Below are three examples of piecewise convex functions with corresponding regularizers that have been widely used in constrained optimization and sparse learning problems.

Example 1.1. (1) The indicator penalty $f(x) = \lambda \mathbb{1}_{\{x < \tau\}}$ is piecewise convex with $R_1 = (-\infty, \tau)$, $R_2 = [\tau, \infty)$. (2) The capped-$\ell_1$ penalty $f(x) = f(x; \lambda, b) = \lambda \min\{|x|, b\}$ is piecewise convex with $R_1 = (-\infty, -b]$, $R_2 = [-b, b]$, $R_3 = [b, \infty)$. (3) The leaky capped-$\ell_1$ penalty (Wangni & Lin, 2017) $f(x) = \lambda \min\{|x|, b\} + \beta \mathbb{1}_{\{|x| \ge b\}} |x - b|$ is piecewise convex with $R_1 = (-\infty, -b]$, $R_2 = [-b, b]$, $R_3 = [b, \infty)$. The three functions are illustrated in Figure 1. While not illustrated, $f(x) = \mathbb{1}_{\{x \ne 0\}}$, which gives the $\ell_0$-norm via $h(x) = \|x\|_0$, is also piecewise convex.
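As a concrete companion to the examples above, the three penalties can be evaluated with a few lines of code. This is a minimal sketch; the parameter values below are arbitrary illustrations, not values used in the paper.

```python
import numpy as np

def indicator_penalty(x, lam=1.0, tau=0.0):
    # Example (1): f(x) = lam * 1{x < tau}, convex on (-inf, tau) and on [tau, inf)
    return lam * (np.asarray(x, dtype=float) < tau)

def capped_l1(x, lam=1.0, b=1.0):
    # Example (2): f(x) = lam * min(|x|, b), convex on (-inf, -b], [-b, b], [b, inf)
    x = np.asarray(x, dtype=float)
    return lam * np.minimum(np.abs(x), b)

def leaky_capped_l1(x, lam=1.0, b=1.0, beta=0.1):
    # Example (3): f(x) = lam * min(|x|, b) + beta * 1{|x| >= b} * |x - b|
    x = np.asarray(x, dtype=float)
    return lam * np.minimum(np.abs(x), b) + beta * (np.abs(x) >= b) * np.abs(x - b)

def h(x, f=capped_l1):
    # the separable regularizer h(x) = sum_j f(x_j)
    return float(np.sum(f(x)))
```

Each of these functions is flat or linear away from its endpoints, which is exactly where the nonconvexity and nonsmoothness concentrate.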

1.1. MAIN ASSUMPTION

The main assumption of this paper is that $g$ and $h$ satisfy the following conditions.

Assumption 1 (Main Assumption). (a) $g$ is convex with $L_g$-smooth gradient, that is, $\|\nabla g(x) - \nabla g(y)\|_2 \le L_g \|x - y\|_2$. $F$ is coercive, that is, $F(x) \to \infty$ as $\|x\|_2 \to \infty$, and $\inf_{x \in \mathbb{R}^d} F(x) > -\infty$. (b) $f: \mathbb{R} \to \mathbb{R}$ is piecewise convex and lower semicontinuous. Furthermore, there exists a small positive constant $s_0 < R_0$ such that $f$ is differentiable on $(q_m - s_0, q_m)$ and $(q_m, q_m + s_0)$ for all $m \in [M-1]$. (c) The proximal mapping $\mathrm{prox}_{sf_m}$ has a closed-form solution for all $m \in [M]$, where $\mathrm{prox}_{sf_m}(x) := \arg\min_{v \in \mathbb{R}} \frac{1}{2s}(v - x)^2 + f_m(v)$. (d) $f$ has "negative curvature" at each endpoint $q_m$ where $f$ is continuous, for all $m \in [M-1]$; that is, $\lim_{x \to q_m^-} f'(x) > \lim_{x \to q_m^+} f'(x)$. We define

$C := \min_{m \in [M-1]:\, f \text{ continuous at } q_m} \Big( \lim_{x \to q_m^-} f'(x) - \lim_{x \to q_m^+} f'(x) \Big) > 0.$

In addition, $f$ has bounded Fréchet subdifferential, that is, $\sup_{x \in \bigcup_{m=1}^M R_m^o} \sup_{v \in \hat\partial f(x)} \|v\|_2 \le F_0$ for some absolute constant $F_0 > 0$, where $R^o$ denotes the interior of an interval $R$. The Fréchet subdifferential is formally defined in Section 4. It is noted that on each $R_m^o$, the convex subdifferential of $f$ coincides with its Fréchet subdifferential. We define the minimum jump value of $f$ at discontinuous endpoints by

$J := \min_{m \in [M-1]:\, f \text{ not continuous at } q_m} \Big( \max\Big\{ \lim_{y \to q_m^-} f(y),\ \lim_{y \to q_m^+} f(y) \Big\} - f(q_m) \Big).$  (3)

Assumption 1 is mild, as explained below. The smooth gradient in Assumption 1(a) is commonly required in the standard analysis of proximal methods for nonconvex problems, such as Bolte et al. (2014); Ghadimi & Lan (2016). The objective $F$ is widely assumed to be coercive in the nonconvex and nonsmooth optimization literature, such as Li & Lin (2015); Li et al. (2017). In addition, Assumptions 1(b)-(d) are mild: they hold for all three piecewise convex functions in Example 1.1 as well as for the $\ell_0$-norm. It is noted that (1) covers a broad range of optimization problems in machine learning and statistics.
The nonnegative programming problem $\min_{x \in \mathbb{R}^d: x_i \ge 0, i \in [d]} g(x)$ can be reduced to the regularized problem $\min_x g(x) + h(x)$ with $f$ being the indicator penalty $f(x) = \lambda \mathbb{1}_{\{x < \tau\}}$ for a properly large $\lambda$. When $g$ is the squared loss, that is, $g(x) = \|y - Dx\|_2^2$ where $y \in \mathbb{R}^n$ and $D \in \mathbb{R}^{n \times d}$ is the design matrix, (1) is the well-known regularized sparse linear regression problem with convex or nonconvex regularizers such as the capped-$\ell_1$ or $\ell_0$-norm penalty.
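To make the sparse linear regression instance concrete, here is a hedged sketch of the capped-$\ell_1$ regularized least squares objective. The data below are synthetic illustrations, not part of the paper's setup.

```python
import numpy as np

def capped_l1_objective(x, D, y, lam=0.2, b=1.0):
    # F(x) = g(x) + h(x), with g the squared loss ||y - Dx||_2^2
    # and h the separable capped-l1 penalty sum_j lam * min(|x_j|, b)
    g = float(np.sum((y - D @ x) ** 2))
    h = float(np.sum(lam * np.minimum(np.abs(x), b)))
    return g + h

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 3))          # design matrix: n = 5 samples, d = 3
x_true = np.array([0.5, -2.0, 0.0])      # a sparse-ish ground truth
y = D @ x_true                           # noiseless observations
# at x_true the squared loss vanishes, so only the penalty remains:
# 0.2*min(0.5, 1) + 0.2*min(2.0, 1) + 0 = 0.1 + 0.2 = 0.3
```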

1.2. MAIN RESULTS AND CONTRIBUTIONS

We propose a novel method termed Projective Proximal Gradient Descent (PPGD), which extends the existing Accelerated Proximal Gradient (APG) method (Beck & Teboulle, 2009b; Nesterov, 2013) to solve problem (1). PPGD enjoys a fast convergence rate thanks to a novel projection operator and a new negative-curvature-exploitation procedure. Our main results are summarized below.

1. Using a novel and carefully designed projection operator and the Negative-Curvature-Exploitation algorithm (Algorithm 2), PPGD achieves a fast convergence rate for the nonconvex and nonsmooth optimization problem (1) which locally matches Nesterov's optimal convergence rate of first-order methods on smooth and convex objective functions with Lipschitz continuous gradient. In particular, it is proved that under two mild assumptions, Assumption 1 and Assumption 2 to be introduced in Section 4, there exists a finite $k_0 \ge 1$ such that for all $k > k_0$, $F(x^{(k)}) - F(x^*) \le O(1/k^2)$, where $x^*$ is any limit point of $\{x^{(k)}\}_{k \ge 1}$ lying on the same convex pieces as $\{x^{(k)}\}_{k > k_0}$. Details are deferred to Theorem 4.4 in Section 4. It should be emphasized that this is the same convergence rate as that of regular APG on convex problems (Beck & Teboulle, 2009b;a).

2. Our analysis provides insights into accelerated PGD methods for a class of challenging nonconvex and nonsmooth problems. In contrast to most existing accelerated PGD methods (Li & Lin, 2015; Li et al., 2017), which employ the Kurdyka-Łojasiewicz (KŁ) property to analyze convergence rates, PPGD provides a new perspective of convergence analysis without the KŁ property while locally matching Nesterov's optimal convergence rate. Our analysis reveals that the objective function makes progress, that is, its value decreases by a positive amount, when the iterate sequence generated by PPGD transits from one convex piece to another. This observation opens the door for future research on analyzing the convergence rate of accelerated proximal gradient methods without the KŁ property.

It should be emphasized that the conditions in Assumption 1 can be relaxed while PPGD still enjoys the same order of convergence rate. First, the assumption that $f$ is piecewise convex can be relaxed to a weaker one, detailed in Remark 4.6. Second, the proximal mapping in Assumption 1 does not need to have a closed-form solution.

Assuming that both $g$ and $h$ satisfy the KŁ property defined in Definition A.1 in Section A.1 of the supplementary, the accelerated PGD algorithms in Li & Lin (2015); Li et al. (2017) have a linear convergence rate when $\theta \in [1/2, 1)$ and a sublinear rate $O(k^{-\frac{1}{1-2\theta}})$ when $\theta \in (0, 1/2)$, where $\theta$ is the KŁ exponent; both rates are for objective values. To the best of our knowledge, most existing convergence analyses of accelerated PGD methods for nonconvex and nonsmooth problems, where the nonconvexity and nonsmoothness are due to the regularizer $h$, are based on the KŁ property. While Ghadimi & Lan (2016) analyze an accelerated PGD method for nonconvex and nonsmooth problems, the nonconvexity there comes from the smooth part of the objective function ($g$ in our notation), so it cannot handle the problems discussed in this paper. Furthermore, the fast PGD method by Yang & Yu (2019) is restricted to $\ell_0$-regularized problems.
As a result, the convergence rate of accelerated PGD methods for problem (1) remains an interesting and important open problem. Another line of related work focuses on accelerated gradient methods for smooth objective functions. For example, Jin et al. (2018) propose a perturbed accelerated gradient method which achieves faster convergence by exploiting negative curvature to decrease a quantity named the Hamiltonian, the sum of the current objective value and the scaled squared distance between consecutive iterates. The very recent work of Li & Lin (2022) further removes the polylogarithmic factor in the time complexity of Jin et al. (2018) by a restarting strategy. Because our objective is nonsmooth, the Hamiltonian cannot be decreased by exploiting the negative curvature of the objective in the same way as Jin et al. (2018). However, we still manage to design a "Negative-Curvature-Exploitation" algorithm which decreases the objective value by an absolute positive amount when the iterates of PPGD cross endpoints. Our results are among the very few in the optimization literature on fast optimization methods for nonconvex and nonsmooth problems that are not limited to $\ell_0$-regularized problems.

1.3. NOTATIONS

Throughout this paper, we use bold letters for matrices and vectors and regular lowercase letters for scalars. A bold letter with a subscript indicates the corresponding element of a matrix or vector. $\|\cdot\|_p$ denotes the $\ell_p$-norm of a vector or the $p$-norm of a matrix. $|A|$ denotes the cardinality of a set $A$. When $R \subseteq \mathbb{R}$ is an interval, $\overline{R}$ is its closure, $R^+$ denotes the set of all points in $\mathbb{R}$ to the right of $R$ but not in $R$, and $R^-$ is defined similarly as the set of all points to the left of $R$. The domain of any function $u$ is denoted by $\mathrm{dom}\, u$. $\lim_{x \to a^-}$ and $\lim_{x \to a^+}$ denote the left and right limits at a point $a \in \mathbb{R}$. $\mathbb{N}$ is the set of all natural numbers.

2. ROADMAP TO FAST CONVERGENCE BY PPGD

Two essential components of PPGD contribute to its fast convergence rate. The first component, a combination of a carefully designed Negative-Curvature-Exploitation algorithm and a new projection operator, decreases the objective function by a positive amount whenever the iterates generated by PPGD transit from one convex piece to another. As a result, there can be only a finite number of such transitions: after finitely many iterations, all iterates must stay on the same convex pieces, and restricted to these convex pieces, problem (1) is convex. The second component comprises $M$ surrogate functions. Restricted to each convex piece, the piecewise convex function $f$ is convex; every surrogate function is designed to be an extension of this restricted function to the entire $\mathbb{R}$, and PPGD performs proximal mapping only on the surrogate functions. This ensures that, after the iterates reach their final convex pieces, they are operated on in the same way as by a regular APG, so that the convergence rate of PPGD locally matches Nesterov's optimal rate achieved by APG.

3. ALGORITHMS

Before presenting the proposed algorithm, we first define the proximal mapping and then describe how to build a surrogate function for each convex piece.

Definition 3.1 (Proximal Mapping). The proximal mapping associated with a function $u$ is defined as $\mathrm{prox}_u(x) := \arg\min_{v \in \mathbb{R}^d} u(v) + \frac{1}{2}\|v - x\|_2^2$ for $x \in \mathbb{R}^d$.
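For intuition, the proximal mapping of the (convex) $\ell_1$ penalty has the classic soft-thresholding closed form. The brief sketch below checks the closed form against a brute-force grid search; the grid search is only for illustration.

```python
import numpy as np

def prox_l1(x, s, lam):
    # closed form of argmin_v (1/(2s)) * (v - x)^2 + lam * |v|:
    # coordinatewise soft-thresholding at level s * lam
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - s * lam, 0.0)

# brute-force check on a 1-d grid
x0, s, lam = 1.3, 0.5, 1.0
grid = np.linspace(-3.0, 3.0, 600001)
vals = (grid - x0) ** 2 / (2.0 * s) + lam * np.abs(grid)
v_brute = grid[np.argmin(vals)]   # agrees with prox_l1(1.3, 0.5, 1.0) = 0.8 up to grid accuracy
```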

3.1. CONSTRUCTION OF THE SURROGATE FUNCTIONS $\{f_m\}_{m=1}^M$

Given that $f$ is convex on each convex piece $R_m$, we describe how to construct a surrogate function $f_m: \mathbb{R} \to \mathbb{R}$ such that $f_m(x) = f(x)$ for all $x \in R_m$. The key idea is to extend the domain of $f_m$ from $R_m$ to $\mathbb{R}$ with the simplest structure, that is, $f_m$ is linear outside of $R_m$. More concretely, if the right endpoint $q = q_m$ is not $+\infty$ and $f$ is continuous at $q$, then $f_m$ extends $f|_{R_m}$ such that $f_m$ is linear on $(q, +\infty)$. A similar extension applies when $q = q_{m-1}$ is the left endpoint of $R_m$. Formally, for all $x \in R_m^+$, if $q_m \ne +\infty$ we let

$f_m(x) = \begin{cases} f(q_m) + v^-(x - q_m) & f \text{ is continuous at } q_m, \\ \lim_{y \to q_m^-} f(y) + v^-(x - q_m) & f(q_m) < \lim_{y \to q_m^-} f(y), \\ \lim_{y \to q_m^+} f(y) & f(q_m) < \lim_{y \to q_m^+} f(y), \end{cases}$  (5)

where $v^- = \lim_{x \to q_m^-} f'(x)$. The surrogate function $f_m$ is defined similarly on $R_m^-$: for all $x \in R_m^-$, if $q_{m-1} \ne -\infty$ we let

$f_m(x) = \begin{cases} f(q_{m-1}) + v^+(x - q_{m-1}) & f \text{ is continuous at } q_{m-1}, \\ \lim_{y \to q_{m-1}^+} f(y) + v^+(x - q_{m-1}) & f(q_{m-1}) < \lim_{y \to q_{m-1}^+} f(y), \\ \lim_{y \to q_{m-1}^-} f(y) & f(q_{m-1}) < \lim_{y \to q_{m-1}^-} f(y), \end{cases}$  (6)

where $v^+ = \lim_{x \to q_{m-1}^+} f'(x)$.

[Figure: illustration of $f$ and its surrogate $f_m$ near the endpoint $q_m$, for the continuous case $f_m(x) = f(q_m) + v^-(x - q_m)$ and the discontinuous cases of (5).]

For example, the indicator penalty $f(x) = \lambda \mathbb{1}_{\{x < \tau\}}$ has surrogate functions $f_1 \equiv \lambda$ and $f_2 = f$. The proximal mappings are $\mathrm{prox}_{sf_1}(x) = x$, and $\mathrm{prox}_{sf_2}(x) = \tau$ when $\tau - \sqrt{2\lambda s} < x < \tau$ and $\mathrm{prox}_{sf_2}(x) = x$ otherwise, except at the point $x = \tau - \sqrt{2\lambda s}$. While $\mathrm{prox}_{sf_2}(x)$ has two values at $x = \tau - \sqrt{2\lambda s}$, it will never be evaluated at this point in the PPGD algorithm to be introduced.
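To illustrate the construction, here is a hedged sketch of the surrogates and the closed-form $\mathrm{prox}_{sf_2}$ for the indicator penalty, with illustrative values $\lambda = 1$, $\tau = 0$, $s = 0.1$, together with a brute-force sanity check.

```python
import numpy as np

lam, tau, s = 1.0, 0.0, 0.1   # illustrative values

def f1(v):
    # surrogate for R_1 = (-inf, tau): f restricted to R_1 is the constant lam,
    # extended (with slope 0) to the right, so f_1 is constant
    return lam

def f2(v):
    # surrogate for R_2 = [tau, inf): here f_2 coincides with f itself
    return lam if v < tau else 0.0

def prox_sf2(x):
    # closed form from the text: snap to tau on (tau - sqrt(2*lam*s), tau),
    # identity elsewhere (the tie at x = tau - sqrt(2*lam*s) is never queried by PPGD)
    t = tau - np.sqrt(2.0 * lam * s)
    return tau if t < x < tau else x

# brute-force check of prox_sf2 at a point inside the snapping interval
x0 = -0.1                                  # inside (tau - sqrt(0.2), tau)
grid = np.linspace(-2.0, 2.0, 400001)
vals = (grid - x0) ** 2 / (2.0 * s) + lam * (grid < tau)
v_brute = grid[np.argmin(vals)]            # approximately tau = 0
```

The snapping interval reflects the trade-off in the prox objective: paying the penalty $\lambda$ by staying at $x$ versus paying the quadratic cost of moving to $\tau$.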

3.2. PROJECTIVE PROXIMAL GRADIENT DESCENT

We now define a basic operator $P$ which returns the index of the convex piece that the input lies in, that is, $P(x) = m$ for $x \in R_m$. It is noted that $P(x)$ is a single-valued function, so $P(x) \ne P(y)$ indicates that $x$ and $y$ are on different convex pieces, that is, no convex piece contains both $x$ and $y$. Given a vector $x \in \mathbb{R}^d$, we define the vector $P(x)$ by $[P(x)]_i = P(x_i)$ for all $i \in [d]$. We then introduce a novel projection operator $\mathcal{P}_{x,R_0}: \mathbb{R}^d \to \mathbb{R}^d$ for any $x \in \mathbb{R}^d$, defined for all $u \in \mathbb{R}^d$ by

$[\mathcal{P}_{x,R_0}(u)]_i := \arg\min_{v \in \overline{R_{P(x_i)}} \cap B(x_i, R_0)} |v - u_i|$ for all $i \in [d]$,  (11)

where $B(x_i, R_0) := [x_i - R_0, x_i + R_0]$. Since $\overline{R_{P(x_i)}} \cap B(x_i, R_0)$ is a closed convex set, (11) is the projection of $u_i$ onto this set, which is well defined. $\mathcal{P}_{x,R_0}(u)$ can be computed very efficiently: when $u_i$ lies outside the interval $\overline{R_{P(x_i)}} \cap B(x_i, R_0)$, set $[\mathcal{P}_{x,R_0}(u)]_i$ to the endpoint of the interval closer to $u_i$; otherwise $[\mathcal{P}_{x,R_0}(u)]_i = u_i$.

Algorithm 1 Projected Proximal Gradient Descent ($x^{(0)}$)
1: $z^{(1)} = x^{(1)} = x^{(0)}$, $t_0 = 0$, $t_1 = 1$, endpoints $\{q_m\}_{m=1}^{M-1}$, step size $s$, constant $w_0 \in (0, 1]$.
2: for $k = 1, \ldots$ do
3:   $u^{(k)} = x^{(k)} + \frac{t_{k-1}}{t_k}(z^{(k)} - x^{(k)}) + \frac{t_{k-1}-1}{t_k}(x^{(k)} - x^{(k-1)})$  (7)
4:   $w^{(k)} = \mathcal{P}_{x^{(k)},R_0}(u^{(k)})$  (8)
5:   for $i = 1, \ldots, d$ do $z_i^{(k+1)} = \mathrm{prox}_{s f_{P(x_i^{(k)})}}\big(\big[w^{(k)} - s\nabla g(w^{(k)})\big]_i\big)$  (9)
6:   $t_{k+1} = \frac{1 + \sqrt{4t_k^2 + 1}}{2}$
7:   if $F_{P(x^{(k)})}(z^{(k+1)}) \le F(x^{(k)})$ then $x^{(k+1)} = \text{Negative-Curvature-Exploitation}(x^{(k)}, z^{(k+1)}, w_0)$ else $x^{(k+1)} = x^{(k)}$

Algorithm 2 Negative-Curvature-Exploitation ($x^{(k)}$, $z^{(k+1)}$, $w_0$)
1: if $P(z^{(k+1)}) = P(x^{(k)})$ then
2:   $x^{(k+1)} = z^{(k+1)}$
3:   return $x^{(k+1)}$
4: Flag = false
5: $z' = z^{(k+1)}$
6: for $i = 1, \ldots, d$ do
7:   if $P(z_i^{(k+1)}) \ne P(x_i^{(k)}) = m_i$ then
8:     if $f$ is continuous at $q(w_i^{(k)})$ and $d_{i,1} \ge w_0 d_{i,0}$ then
9:       Flag = true
10:    if $f$ is not continuous at $q(w_i^{(k)})$ then
11:      Flag = true
12:    if $R_{P(z_i^{(k+1)})} = \{q(w_i^{(k)})\}$ then
13:      $z'_i = q(w_i^{(k)})$
14: if Flag = false then
15:   $x^{(k+1)} = x^{(k)}$
16: else
17:   $x^{(k+1)} = z'$
18: return $x^{(k+1)}$

The Projected Proximal Gradient Descent (PPGD) algorithm is described in Algorithm 1. The following functions and quantities used by PPGD are defined as follows. Let $q(w_i^{(k)})$ be the endpoint closest to $w_i^{(k)}$ which lies in $[w_i^{(k)}, z_i^{(k+1)}]$ or $[z_i^{(k+1)}, w_i^{(k)}]$ when $P(z_i^{(k+1)}) \ne P(x_i^{(k)})$ at iteration $k$ of Algorithm 1. We define $F_{P(x)}(v) := g(v) + \sum_{i=1}^d f_{P(x_i)}(v_i)$, $d_{i,0} := |z_i^{(k+1)} - w_i^{(k)}|$, and $d_{i,1} := |z_i^{(k+1)} - q(w_i^{(k)})|$.

The idea of PPGD is to use $z^{(k+1)}$ as a probe for the next convex pieces that the current iterate $x^{(k)}$ should transit to. Compared to the regular APG described in Algorithm 4 in Section A.2 of the supplementary, the projection of $u^{(k)}$ onto a closed convex set is used to compute $z^{(k+1)}$. This makes sure that $u^{(k)}$ is "dragged" back to the convex pieces of $x^{(k)}$, so that the only variable that can explore new convex pieces is $z^{(k+1)}$. The novel Negative-Curvature-Exploitation (NCE) algorithm described in Algorithm 2 then decides how PPGD should update the next iterate $x^{(k+1)}$. In particular, if $z^{(k+1)}$ is on the same convex pieces as $x^{(k)}$, then $x^{(k+1)}$ is updated to $z^{(k+1)}$ only if $z^{(k+1)}$ has a smaller objective value. Otherwise, the NCE algorithm carefully checks whether either of two sufficient conditions is met under which the objective value can be decreased. If so, Flag is set to true and $x^{(k+1)}$ is updated to $z'$, which indicates that $x^{(k+1)}$ transits to convex pieces different from those of $x^{(k)}$. Otherwise, $x^{(k+1)}$ is set to the previous value $x^{(k)}$.
The two sufficient conditions are checked in lines 8 and 10 of Algorithm 2. These properties of NCE, along with the convergence of PPGD, will be proved in the next section.
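The projection operator in (11) reduces to interval clipping. Below is a minimal sketch, assuming for illustration the convex pieces of the capped-$\ell_1$ penalty with $b = 1$; the tie-breaking convention for shared endpoints is ours (the paper fixes endpoint membership via one-sided continuity).

```python
import numpy as np

# illustrative convex pieces of the capped-l1 penalty with b = 1:
# R_1 = (-inf, -1], R_2 = [-1, 1], R_3 = [1, inf)
PIECES = [(-np.inf, -1.0), (-1.0, 1.0), (1.0, np.inf)]

def P(xi):
    # 1-based index of the convex piece containing xi
    # (shared endpoints are assigned to the lower-index piece, a convention)
    for m, (lo, hi) in enumerate(PIECES, start=1):
        if lo <= xi <= hi:
            return m
    raise ValueError(xi)

def project(x, u, R0):
    # [P_{x,R0}(u)]_i = projection of u_i onto closure(R_{P(x_i)}) ∩ [x_i - R0, x_i + R0],
    # computed coordinatewise by clipping u_i to the closed interval
    out = np.empty(len(u))
    for i, (xi, ui) in enumerate(zip(x, u)):
        lo, hi = PIECES[P(xi) - 1]
        lo, hi = max(lo, xi - R0), min(hi, xi + R0)
        out[i] = min(max(ui, lo), hi)
    return out
```

For example, `project(np.array([0.5, 2.0]), np.array([1.7, 0.2]), 1.0)` drags each coordinate of $u$ back onto the piece of the corresponding $x_i$, giving `[1.0, 1.0]`: exactly the "dragging back" behavior used to compute $w^{(k)}$ in Algorithm 1.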

4. CONVERGENCE ANALYSIS

In this section, we present the analysis of the convergence rate of PPGD in Algorithm 1. Before that, we present the definition of the Fréchet subdifferential, which is used to define critical points of the objective function.

• For a given $x \in \mathrm{dom}\, u$, the Fréchet subdifferential of $u$ at $x$, denoted by $\hat\partial u(x)$, is the set of all vectors $v \in \mathbb{R}^d$ which satisfy
$\liminf_{y \ne x,\, y \to x,\, y \in \mathrm{dom}\, u} \frac{u(y) - u(x) - \langle v, y - x \rangle}{\|y - x\|} \ge 0.$
• The limiting subdifferential of $u$ at $x \in \mathbb{R}^d$, denoted by $\partial u(x)$, is defined by
$\partial u(x) = \{v \in \mathbb{R}^d \mid \exists\, x^k \to x,\ u(x^k) \to u(x),\ \tilde v^k \in \hat\partial u(x^k) \to v\}.$
The point $x$ is a critical point of $u$ if $0 \in \partial u(x)$.

The following assumption is useful for our analysis when the convex pieces have one or more endpoints at which $f$ is continuous.

Assumption 2 (Nonvanishing Gradient at Continuous Endpoints). Let $q_m$ with $m \in [M-1]$ be an endpoint at which $f$ is continuous, and define $p^+ := \lim_{y \to q_m^+} f'(y)$ and $p^- := \lim_{y \to q_m^-} f'(y)$. There exists a positive constant $\varepsilon_0$ such that the following two conditions hold. If $q_m$ is a left endpoint, then $\inf_{x \in \mathbb{R}^d:\, x_i = q_m} |[\nabla g(x)]_i + p^+| \ge \varepsilon_0$ for all $i \in [d]$. If $q_m$ is a right endpoint, then $\inf_{x \in \mathbb{R}^d:\, x_i = q_m} |[\nabla g(x)]_i + p^-| \ge \varepsilon_0$ for all $i \in [d]$.

We note that Assumption 2 is rather mild. For example, when $f(x) = \lambda \min\{|x|, b\}$ is the capped-$\ell_1$ penalty, Assumption 2 holds when $\lambda > G$, and $\varepsilon_0$ can be set to $\lambda - G$, where $G$ is defined below. Such a requirement of a large $\lambda$ is commonly used in the analysis of the consistency of sparse linear models, such as Loh (2017). In addition, when there are no endpoints at which $f$ is continuous, Assumption 2 is not required for the provable convergence rate of PPGD. Details are deferred to Remark 4.5.

We then have the important Theorem 4.3, which directly follows from Lemma 4.1 and Lemma 4.2 below. Theorem 4.3 states that the Negative-Curvature-Exploitation algorithm guarantees that, if the convex pieces change across consecutive iterates $x^{(k)}$ and $x^{(k+1)}$, then the objective value is decreased by a positive amount.
Before presenting these results, we note that the level set $L := \{x \mid F(x) \le F(x^{(0)})\}$ is bounded because $F$ is coercive. Since $\nabla g$ is continuous, we define the supremum of $\|\nabla g\|_2$ over an enlarged version of $L$ as $G := \sup_{x \in L_{R_0}} \|\nabla g(x)\|_2$, where $L_{R_0} := \{x \in \mathbb{R}^d \mid \mathrm{dist}(x, L) \le R_0\}$. $G$ is finite because $L_{R_0}$ is compact, and it serves as an upper bound for $\|\nabla g(w^{(k)})\|_2$.

Lemma 4.1. Define $\kappa_1 := sCw_0\big(\varepsilon_0 - sL_g(1-w_0)(G+F_0)\big)$, $\kappa_2 := J - 2sF_0(G+F_0)$, $\kappa_0 := \min\{\kappa_1, \kappa_2\}$, and

$\kappa := \kappa_0 - s\big(sL_g(G + \sqrt{d}F_0) + G\big)(G + \sqrt{d}F_0).$

Let the step size satisfy

$s < s_1 := \min\Big\{ \frac{s_0}{G+F_0},\ \frac{\varepsilon_0}{L_g(1-w_0)(G+F_0)},\ \frac{J}{F_0 G + G^2/2},\ \frac{J}{2F_0(G+F_0)},\ \frac{-G + \sqrt{G^2 + 4A\kappa_0/(G+\sqrt{d}F_0)}}{2A} \Big\},$  (13)

where $A := L_g(G + \sqrt{d}F_0)$, and suppose $\|\nabla g(w^{(k)})\|_2 \le G$. If $P(x^{(k+1)}) \ne P(x^{(k)})$ for some $k \ge 1$ in the sequence $\{x^{(k)}\}_{k \ge 1}$ generated by the PPGD algorithm, then under Assumption 1 and Assumption 2, $F(x^{(k+1)}) \le F(x^{(k)}) - \kappa$.

Lemma 4.2. The sequences $\{x^{(k)}\}_{k \ge 1}$ and $\{w^{(k)}\}_{k \ge 1}$ generated by the PPGD algorithm described in Algorithm 1 satisfy $F(x^{(k)}) \le F(x^{(k-1)})$ and $\|\nabla g(w^{(k)})\|_2 \le G$ for all $k \ge 1$.

Theorem 4.3. Suppose the step size satisfies $s < s_1$, where $s_1$ is defined in (13). If $P(x^{(k+1)}) \ne P(x^{(k)})$ for some $k \ge 1$ in the sequence $\{x^{(k)}\}_{k \ge 1}$ generated by the PPGD algorithm, then under Assumption 1 and Assumption 2, $F(x^{(k+1)}) \le F(x^{(k)}) - \kappa$, where $\kappa$ is defined in Lemma 4.1.

Due to Theorem 4.3, there can only be a finite number of iterations $k$ at which the convex pieces change across consecutive iterates $x^{(k)}$ and $x^{(k+1)}$. As a result, after finitely many iterations all the iterates generated by PPGD lie on the same convex pieces, so the nice properties of convex optimization apply to these iterates. Formally, we have the following theorem stating the convergence rate of PPGD.

Theorem 4.4. Suppose the step size satisfies $s < s_1$. (1) There exist a finite $k_0 \ge 1$ and $m^* \in [M]^d$ such that $P(x^{(k)}) = m^*$ for all $k > k_0$. (2) Let $\Omega := \{x \mid x \text{ is a limit point of } \{x^{(k)}\}_{k \ge 1},\ P(x) = m^*\}$ be the set of all limit points of the sequence $\{x^{(k)}\}_{k \ge 1}$ generated by Algorithm 1 lying on the convex pieces indexed by $m^*$.
Then for any $x^* \in \Omega$, $F(x^*) = \inf_{\{x \in \mathbb{R}^d \mid P(x) = m^*\}} F(x)$, and

$F(x^{(k)}) - F(x^*) \le \frac{4U(k_0)}{k^2}$ for all $k > k_0$,  (17)

where $U(k_0) := \frac{1}{2s}\big\|(t_{k_0-1}-1)x^{(k_0-1)} - t_{k_0-1}z^{(k_0)} + x^*\big\|_2^2 + t_{k_0-1}^2\big(F(x^{(k_0)}) - F(x^*)\big)$. Moreover, if $f_{m_i^*}$ does not take the third case in (5) or (6) for any $i \in [d]$, then $x^*$ is a critical point of $F$, that is, $0 \in \partial F(x^*)$.

We remark that the piecewise convexity of $f$ is only needed on the final convex pieces: it suffices that the piece $R_{m_i^*}$ is convex for all $i \in [d]$, because we only need the convexity of $f$ on the final convex pieces to prove Nesterov's optimal convergence rate in (17).

Roadmap of the proof of Theorem 4.4. The proof of Theorem 4.4 is presented in Section B.6 of the supplementary and consists of three steps. In step 1, it is proved that there exists a finite $k_0 \ge k_1$ such that for all $k > k_0$, $F(x^{(k)}) - F_{m^*}(\bar{x}) \le O(1/k^2)$, where $\bar{x}$ is a minimizer of $F_{m^*}$. Noting that $F(x^{(k)}) = F_{m^*}(x^{(k)}) \ge F_{m^*}(\bar{x})$ by the optimality of $\bar{x}$, this inequality combined with the monotone nonincreasing property of $\{F(x^{(k)})\}$ indicates that $F(x^{(k)}) \downarrow F_{m^*}(\bar{x})$. In step 2, it is proved that $F(x') = F_{m^*}(\bar{x})$ for any limit point $x' \in \Omega$. By the definition of $F_{m^*}$ and the optimality of $\bar{x}$, it follows that $F(x')$ is a local minimum of $F$ and a global minimum of $F_{m^*}$ over $R^*$. In step 3, it is proved that any limit point $x' \in \Omega$ is a critical point of $F$ under a mild condition, following the argument in step 2.

5. EXPERIMENTAL RESULTS

In this section, PPGD and the baseline optimization methods are used to solve the capped-$\ell_1$ regularized logistic regression problem

$\min_x \frac{1}{n}\sum_{i=1}^n \log\big(1 + \exp(-y_i x^\top x_i)\big) + h(x),$

where $\{x_i, y_i\}_{i=1}^n$ are the training data, $g(x) = \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i x^\top x_i))$, and $h(x) = \sum_{i=1}^d f(x_i)$ with $f(x) = \lambda \min\{|x|, b\}$. We set $\lambda = 0.2$ and conduct experiments on the MNIST handwritten digits dataset (LeCun et al., 1998), with $\{y_i\}$ being the corresponding class labels ($\pm 1$). We compare PPGD to the regular monotone Accelerated Proximal Gradient (APG) method described in Algorithm 4 and the monotone Accelerated Proximal Gradient (mAPG) algorithm (Li & Lin, 2015). Due to the observation that monotone versions of accelerated gradient methods usually converge faster than their nonmonotone counterparts, the regular nonmonotone APG and the nonmonotone Accelerated Proximal Gradient (nmAPG) method (Li & Lin, 2015) are not included in the baselines. Figure 3 illustrates the decrease of the objective value with respect to the iteration number on the MNIST dataset. It can be observed that PPGD always converges faster than all the competing baselines, evidencing the fast convergence results proved in this paper.
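For concreteness, here is a hedged sketch of the experimental objective and its gradient on synthetic data; this is not the MNIST setup, and the data sizes below are illustrative only.

```python
import numpy as np

def logistic_loss(x, X, y):
    # g(x) = (1/n) * sum_i log(1 + exp(-y_i * <x, x_i>)); returns (value, gradient)
    z = -y * (X @ x)
    val = float(np.mean(np.log1p(np.exp(z))))
    sig = 1.0 / (1.0 + np.exp(-z))              # derivative of log(1 + e^z)
    grad = X.T @ (-y * sig) / len(y)
    return val, grad

def capped_l1(x, lam=0.2, b=1.0):
    # h(x) = sum_i lam * min(|x_i|, b)
    return float(lam * np.sum(np.minimum(np.abs(x), b)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))               # 100 synthetic samples, d = 5
y = np.sign(X @ np.array([1.0, -1.0, 0.0, 0.0, 0.5]))
y[y == 0] = 1.0                                 # guard against exact zeros
x0 = np.zeros(5)
F0, g0 = logistic_loss(x0, X, y)
# at x = 0 the logistic loss equals log(2) and the penalty vanishes
```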

6. CONCLUSION

We present Projective Proximal Gradient Descent (PPGD), which solves a class of challenging nonconvex and nonsmooth optimization problems. Using a novel projection operator and a carefully designed Negative-Curvature-Exploitation algorithm, PPGD locally matches Nesterov's optimal convergence rate of first-order methods on smooth and convex objective functions. The effectiveness of PPGD is evidenced by experiments.

A MORE DETAILS ABOUT THIS PAPER

A.1 KURDYKA-ŁOJASIEWICZ (KŁ) PROPERTY

Definition A.1 (Kurdyka-Łojasiewicz (KŁ) Property). Let $u: \mathbb{R}^d \to (-\infty, \infty]$ be proper and lower semicontinuous. For $\eta \in (0, \infty]$, let $\Phi_\eta$ be the class of all concave and continuous functions $\varphi: [0, \eta) \to \mathbb{R}_+$ which satisfy the following conditions: (1) $\varphi(0) = 0$; (2) $\varphi$ is continuously differentiable on $(0, \eta)$ and continuous at $0$; (3) $\varphi'(r) > 0$ for all $r \in (0, \eta)$. The function $u$ is said to satisfy the KŁ property at $\bar u \in \mathrm{dom}\,\partial u := \{u' \in \mathbb{R}^d: \partial u(u') \ne \emptyset\}$ if there exist $\eta \in (0, \infty]$, a neighborhood $U$ of $\bar u$, and a function $\varphi \in \Phi_\eta$, such that for all $u' \in U \cap \{u': u(\bar u) < u(u') < u(\bar u) + \eta\}$, the following inequality holds:

$\varphi'\big(u(u') - u(\bar u)\big)\,\mathrm{dist}\big(0, \partial u(u')\big) \ge 1.$

If $u$ satisfies the KŁ property at each point of $\mathrm{dom}\,\partial u$, then $u$ is called a KŁ function.

A.2 PROXIMAL GRADIENT DESCENT (PGD) AND MONOTONE ACCELERATED PROXIMAL GRADIENT (APG) DESCENT

We describe PGD and the regular monotone APG in this subsection. In the $k$-th iteration of PGD for $k \ge 1$, gradient descent is performed on the smooth term $g$ to obtain the intermediate variable $x^{(k)} - s\nabla g(x^{(k)})$, where $s > 0$ is the step size and $1/s$ is usually chosen to be larger than $L_g$. Then $x^{(k+1)}$ is computed by the proximal mapping applied to $x^{(k)} - s\nabla g(x^{(k)})$.

Algorithm 3 Proximal Gradient Descent ($x^{(0)}$)
1: for $k = 1, \ldots$ do
2:   $x^{(k+1)} = \mathrm{prox}_{sh}\big(x^{(k)} - s\nabla g(x^{(k)})\big)$

Algorithm 4 Monotone Accelerated Proximal Gradient Descent ($x^{(0)}$)
1: $z^{(1)} = x^{(1)} = x^{(0)}$, $t_0 = 0$, $t_1 = 1$.
2: for $k = 1, \ldots$ do
3:   $u^{(k)} = x^{(k)} + \frac{t_{k-1}}{t_k}(z^{(k)} - x^{(k)}) + \frac{t_{k-1}-1}{t_k}(x^{(k)} - x^{(k-1)})$
4:   $z^{(k+1)} = \mathrm{prox}_{sh}\big(u^{(k)} - s\nabla g(u^{(k)})\big)$
5:   $t_{k+1} = \frac{1 + \sqrt{4t_k^2 + 1}}{2}$
6:   $x^{(k+1)} = z^{(k+1)}$ if $F(z^{(k+1)}) \le F(x^{(k)})$, and $x^{(k+1)} = x^{(k)}$ otherwise.

The iterations start from $k = 1$ and continue until the sequence $\{F(x^{(k)})\}_k$ or $\{x^{(k)}\}_k$ converges or the maximum iteration number is reached. Optimization of problem (1) by PGD is described in Algorithm 3. In practice, the time complexity of optimization by PGD is $O(Mdn)$, where $M$ here denotes the number of iterations (or maximum number of iterations) of PGD. The regular monotone Accelerated Proximal Gradient (APG) descent algorithm is described in Algorithm 4.
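A minimal runnable sketch of Algorithm 4 for a convex instance is given below; here $g$ is a small least-squares loss and $h = \lambda\|\cdot\|_1$, so the prox is soft-thresholding. The problem data are illustrative only.

```python
import numpy as np

def monotone_apg(x0, grad_g, prox_sh, F, s, iters=1000):
    # Algorithm 4: Nesterov momentum on (x, z) plus a monotone safeguard on F
    x_prev, x, z = x0.copy(), x0.copy(), x0.copy()
    t_prev, t = 0.0, 1.0                       # t_0 = 0, t_1 = 1
    for _ in range(iters):
        u = x + (t_prev / t) * (z - x) + ((t_prev - 1.0) / t) * (x - x_prev)
        z = prox_sh(u - s * grad_g(u))
        t_prev, t = t, (1.0 + np.sqrt(4.0 * t * t + 1.0)) / 2.0
        x_prev, x = x, (z if F(z) <= F(x) else x)   # monotone safeguard
    return x

# convex instance: min 0.5*||Ax - b||^2 + lam*||x||_1
A = np.diag([2.0, 1.0])
b = np.array([2.0, -1.0])
lam = 0.1
s = 0.2                                        # step size with 1/s > L_g = 4
grad_g = lambda x: A.T @ (A @ x - b)
F = lambda x: 0.5 * float(np.sum((A @ x - b) ** 2)) + lam * float(np.sum(np.abs(x)))
prox_sh = lambda v: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)
x_hat = monotone_apg(np.zeros(2), grad_g, prox_sh, F, s)
# first-order optimality gives x* = (0.975, -0.9) for this instance
```

The monotone safeguard in the last line of the loop is exactly line 6 of Algorithm 4: the candidate $z^{(k+1)}$ is accepted only when it does not increase the objective.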

B PROOFS

The following lemma shows that Fermat's rule still applies in the sense of the Fréchet subdifferential; that is, a local minimum has $0$ in its Fréchet subdifferential and also in its limiting subdifferential.

Lemma B.1. If $x$ is a local minimum point of $u$, then $0 \in \hat\partial u(x) \subseteq \partial u(x)$.

Proof. Since $x$ is a local minimum point, there exists a neighborhood $U(x)$ of $x$ such that $u(y) \ge u(x)$ for all $y \in U(x)$. It follows that
$\liminf_{y \ne x,\, y \to x,\, y \in \mathrm{dom}\, u} \frac{u(y) - u(x)}{\|y - x\|} \ge 0.$
By (19), $0 \in \hat\partial u(x)$. Because $\hat\partial u(x) \subseteq \partial u(x)$, we have $0 \in \hat\partial u(x) \subseteq \partial u(x)$.

We need the following two lemmas before proving Lemma 4.1.

Lemma B.2 (Decrease of $f$ at continuous endpoints). Suppose Assumption 1 and Assumption 2 hold, $s < \min\big\{\frac{s_0}{G+F_0}, \frac{\varepsilon_0}{L_g(1-w_0)(G+F_0)}\big\}$, $\|\nabla g(w^{(k)})\|_2 \le G$, $P(x_i^{(k+1)}) \ne P(x_i^{(k)}) = m_i$ for some $k \ge 0$ and some $i \in [d]$, and $f$ is continuous at $q(w_i^{(k)})$. Then $f(x_i^{(k+1)}) \le f_{m_i}(z_i^{(k+1)})$. Furthermore, if $d_{i,1} \ge w_0 d_{i,0}$, then $f(x_i^{(k+1)}) \le f_{m_i}(z_i^{(k+1)}) - \kappa_1$, where $\kappa_1 := sCw_0\big(\varepsilon_0 - sL_g(1-w_0)(G+F_0)\big)$.

Lemma B.3 (Decrease of $f$ at discontinuous endpoints). Suppose Assumption 1 holds, $s < \min\big\{\frac{s_0}{G+F_0}, \frac{J}{2F_0(G+F_0)}\big\}$, $\|\nabla g(w^{(k)})\|_2 \le G$, and $P(x_i^{(k+1)}) \ne P(x_i^{(k)}) = m_i$ for some $k \ge 0$ and some $i \in [d]$. Then $f(x_i^{(k+1)}) \le f_{m_i}(z_i^{(k+1)}) - \kappa_2$, where $\kappa_2 := J - 2sF_0(G+F_0)$.

B.1 PROOF OF LEMMA B.2

Proof of Lemma B.2. Let $q = q(w_i^{(k)})$. We must have $x_i^{(k+1)} = z_i^{(k+1)}$ by the PPGD algorithm and the Negative-Curvature-Exploitation algorithm described in Algorithm 1 and Algorithm 2. Without loss of generality, we assume $x_i^{(k+1)} \in R_{m_i}^+$; the case $x_i^{(k+1)} \in R_{m_i}^-$ can be proved in a similar manner. We have

$f_{m_i}(x_i^{(k+1)}) = f(q) + v^-\big(x_i^{(k+1)} - q\big)$  (22)

by definition (5), where $v^- = \lim_{x \to q^-} f'(x)$. On the other hand, we have

$z_i^{(k+1)} - w_i^{(k)} = -s\big([\nabla g(w^{(k)})]_i + f'_{m_i}(z_i^{(k+1)})\big) = -s\big([\nabla g(w^{(k)})]_i + v^-\big).$  (23)

It follows from (23) that

$\big|x_i^{(k+1)} - q\big| = \big|z_i^{(k+1)} - q\big| \le \big|z_i^{(k+1)} - w_i^{(k)}\big| = s\big|[\nabla g(w^{(k)})]_i + v^-\big| \le s(G + F_0).$  (24)

As a result, $s < \frac{s_0}{G+F_0}$ guarantees that $x_i^{(k+1)} \in (q, q + s_0)$. Therefore, we have

$f(x_i^{(k+1)}) \le f(x) + f'(x_i^{(k+1)})\big(x_i^{(k+1)} - x\big)$  (25)

for $x \in (q, x_i^{(k+1)})$. Letting $x \to q$ in (25), we have $f(x_i^{(k+1)}) \le f(q) + v^+\big(x_i^{(k+1)} - q\big)$, where $v^+ = \lim_{x \to q^+} f'(x)$. By the negative curvature condition in Assumption 1(d), we have $v^- - v^+ \ge C > 0$; therefore,

$f(x_i^{(k+1)}) \le f_{m_i}(x_i^{(k+1)}) - C\big(x_i^{(k+1)} - q\big) \le f_{m_i}(x_i^{(k+1)}) - Cw_0\big|z_i^{(k+1)} - w_i^{(k)}\big|.$  (27)

Because $z_i^{(k+1)} \in R_{m_i}^+$, we have $z_i^{(k+1)} \ge w_i^{(k)}$. By (27) we have

$f(x_i^{(k+1)}) \le f_{m_i}(x_i^{(k+1)}) = f_{m_i}(z_i^{(k+1)}).$  (28)

Let $\tilde w \in \mathbb{R}^d$ with $\tilde w_j = w_j^{(k)}$ for all $j \ne i$ and $\tilde w_i = q$. By the definition of $\tilde w$ and (23), we have

$\|w^{(k)} - \tilde w\|_2 = \big|w_i^{(k)} - q\big| \le (1 - w_0)\big|z_i^{(k+1)} - w_i^{(k)}\big| = s(1-w_0)\big|[\nabla g(w^{(k)})]_i + v^-\big| \le s(1-w_0)(G+F_0).$  (29)

We have

$\big|z_i^{(k+1)} - w_i^{(k)}\big| = s\big|[\nabla g(w^{(k)})]_i + v^-\big| = s\big|[\nabla g(w^{(k)}) - \nabla g(\tilde w)]_i + [\nabla g(\tilde w)]_i + v^-\big| \ge s\big(\varepsilon_0 - L_g\|w^{(k)} - \tilde w\|_2\big) \ge s\big(\varepsilon_0 - sL_g(1-w_0)(G+F_0)\big),$  (30)

where the last inequality follows from (29). It follows from (27) and (30) that

$f(x_i^{(k+1)}) \le f_{m_i}(x_i^{(k+1)}) - sCw_0\big(\varepsilon_0 - sL_g(1-w_0)(G+F_0)\big).$  (31)

Noting that $x_i^{(k+1)} = z_i^{(k+1)}$ in (31) completes the proof.

B.2 PROOF OF LEMMA B.3

Before presenting the proof, we introduce and prove the following lemma.

Lemma B.4. Suppose Assumption 1 holds, $\|\nabla g(w^{(k)})\|_2 \le G$, $s < \min\big\{\frac{J}{F_0 G + G^2/2}, \frac{s_0}{G}\big\}$, $P(x_i^{(k+1)}) \ne P(x_i^{(k)}) = m_i$ for some $k \ge 0$ and some $i \in [d]$, and $f$ is not continuous at $q(w_i^{(k)})$. Then $f(q(w_i^{(k)})) < \lim_{y \to q(w_i^{(k)})^-} f(y)$ if $q(w_i^{(k)})$ is the right endpoint of $R_{m_i}$ for $1 \le m_i < M$, and $f(q(w_i^{(k)})) < \lim_{y \to q(w_i^{(k)})^+} f(y)$ if $q(w_i^{(k)})$ is the left endpoint of $R_{m_i}$ for $1 < m_i \le M$.

Proof of Lemma B.4. We prove the case when $q(w_i^{(k)})$ is the right endpoint of $R_{m_i}$; the case when it is the left endpoint can be proved in a similar manner. Suppose the claimed result $f(q(w_i^{(k)})) < \lim_{y \to q(w_i^{(k)})^-} f(y)$ does not hold. Since $f$ is not continuous at $q(w_i^{(k)})$, we must then have $\lim_{y \to q(w_i^{(k)})^+} f(y) > f(q(w_i^{(k)}))$. In this case, $x_i^{(k+1)} = z_i^{(k+1)}$ and $z_i^{(k+1)} \in (q(w_i^{(k)}), \infty)$. Define

$p(v) := \frac{1}{2s}\big(v - [w^{(k)} - s\nabla g(w^{(k)})]_i\big)^2 + f_{m_i}(v),$

so that $z_i^{(k+1)} = \mathrm{prox}_{s f_{P(x_i^{(k)})}}\big([w^{(k)} - s\nabla g(w^{(k)})]_i\big)$ is a minimizer of $p$. The optimality of $z_i^{(k+1)}$ gives

$p(z_i^{(k+1)}) \le p(w_i^{(k)}).$  (33)

Published as a conference paper at ICLR 2023

We have $w_i^{(k)} \in \big(q(w_i^{(k)}) - s_0,\ q(w_i^{(k)})\big)$ due to $s < \frac{s_0}{G}$. On the other hand, we have $p(w_i^{(k)}) = \frac{s}{2}\big([\nabla g(w^{(k)})]_i\big)^2 + f_{m_i}(w_i^{(k)})$, $p(z_i^{(k+1)}) \ge f_{m_i}(z_i^{(k+1)}) = \lim_{y \to q(w_i^{(k)})^+} f(y)$, and

$f_{m_i}(w_i^{(k)}) - f_{m_i}(q(w_i^{(k)})) = f(w_i^{(k)}) - f(q(w_i^{(k)})) \le F_0\big|w_i^{(k)} - q(w_i^{(k)})\big| \le F_0\big|w_i^{(k)} - z_i^{(k+1)}\big| \le sF_0\big|[\nabla g(w^{(k)})]_i\big| \le sF_0 G,$  (34)

where (34) follows because the subdifferential of $f$ is bounded by $F_0$ and $q(w_i^{(k)})$ lies between $w_i^{(k)}$ and $z_i^{(k+1)}$. It follows from (33) and (34) that

$p(z_i^{(k+1)}) \ge f(q(w_i^{(k)})) + J \overset{(a)}{>} f_{m_i}(q(w_i^{(k)})) + sF_0 G + \frac{s}{2}G^2 \ge f_{m_i}(w_i^{(k)}) + \frac{s}{2}G^2 \ge p(w_i^{(k)}),$

where (a) follows from $s < \frac{J}{F_0 G + G^2/2}$. This contradicts (33), so we must have $\lim_{y \to q(w_i^{(k)})^+} f(y) \le f(q(w_i^{(k)}))$. Since $f$ is not continuous at $q(w_i^{(k)})$, we must have $f(q(w_i^{(k)})) < \lim_{y \to q(w_i^{(k)})^-} f(y)$.

Proof of Lemma B.3.
According to Lemma B.4, the following claim holds: when $q(w^{(k)}_i)$ is a right endpoint of $R_{m_i}$, then $f(q(w^{(k)}_i)) < \lim_{y \to q(w^{(k)}_i)^-} f(y)$; when $q(w^{(k)}_i)$ is a left endpoint of $R_{m_i}$, then $f(q(w^{(k)}_i)) < \lim_{y \to q(w^{(k)}_i)^+} f(y)$. Let $q = q(w^{(k)}_i)$. Without loss of generality, we assume $x^{(k+1)}_i \in R^+_{m_i}$; the case $x^{(k+1)}_i \in R^-_{m_i}$ can be proved in a similar manner.

We first consider the case that $R_{m_i + 1}$ is not a single-point set, that is, $R_{m_i + 1} \ne \{q\}$. According to the definition of the surrogate function (5), we have
$$f_{m_i}(x^{(k+1)}_i) = \lim_{y \to q^-} f(y) + v^-\big(x^{(k+1)}_i - q\big), \quad (36)$$
where $v^- = \lim_{x \to q^-} f'(x)$. If $x^{(k+1)}_i \ne q$, then applying the argument of (24) in the proof of Lemma B.2, $s < \frac{s_0}{G + F_0}$ guarantees that $x^{(k+1)}_i \in (q, q + s_0)$. So we have
$$f(x^{(k+1)}_i) \le f(x) + f'(x^{(k+1)}_i)\big(x^{(k+1)}_i - x\big) \quad (37)$$
for $x \in (q, x^{(k+1)}_i)$. Letting $x \to q^+$ in (37), we have
$$f(x^{(k+1)}_i) \le f(q) + v^+\big(x^{(k+1)}_i - q\big), \quad (38)$$
where $v^+ = \lim_{x \to q^+} f'(x)$. In addition, it follows by (38) and (24) in the proof of Lemma B.2 that
$$f(x^{(k+1)}_i) \le f(q) + s F_0 (G + F_0). \quad (39)$$
We note that (39) holds for all $x^{(k+1)}_i \in [q, q + s_0)$. Moreover, it follows by (36) and (24) in the proof of Lemma B.2 that
$$f_{m_i}(x^{(k+1)}_i) \ge \lim_{y \to q^-} f(y) - |v^-|\big(x^{(k+1)}_i - q\big) \ge \lim_{y \to q^-} f(y) - s F_0 (G + F_0) \ge f(q) + J - s F_0 (G + F_0). \quad (40)$$
Combining (39) and (40), we have
$$f(x^{(k+1)}_i) \le f_{m_i}(x^{(k+1)}_i) - \big(J - 2 s F_0 (G + F_0)\big) = f_{m_i}(z^{(k+1)}_i) - \big(J - 2 s F_0 (G + F_0)\big). \quad (41)$$
If $R_{P(z^{(k+1)}_i)} = \{q\}$, then $x^{(k+1)}_i = q$ and $f(x^{(k+1)}_i) = f(q)$. By the fact that $f_{m_i}(z^{(k+1)}_i) = \lim_{y \to q^-} f(y) + v^-\big(z^{(k+1)}_i - q\big)$ and $\big|z^{(k+1)}_i - q\big| \le \big|z^{(k+1)}_i - w^{(k)}_i\big|$, following an argument similar to (40) we have
$$f_{m_i}(z^{(k+1)}_i) \ge f(q) + J - s F_0 (G + F_0) = f(x^{(k+1)}_i) + J - s F_0 (G + F_0),$$
and it follows that
$$f(x^{(k+1)}_i) \le f_{m_i}(z^{(k+1)}_i) - \big(J - s F_0 (G + F_0)\big). \quad (42)$$
The proof is completed by combining (41) and (42). $\square$

B.3 PROOF OF LEMMA 4.1

Proof of Lemma 4.1.
We split $[d]$ into three disjoint subsets, $[d] = S_1 \cup S_2 \cup S_3$, defined by
$$S_1 := \big\{i \in [d] \colon P(x^{(k+1)}_i) \ne P(x^{(k)}_i),\ f \text{ is continuous at } q(w^{(k)}_i)\big\},$$
$$S_2 := \big\{i \in [d] \colon P(x^{(k+1)}_i) \ne P(x^{(k)}_i),\ f \text{ is not continuous at } q(w^{(k)}_i)\big\},$$
$$S_3 := \big\{i \in [d] \colon P(x^{(k+1)}_i) = P(x^{(k)}_i)\big\}.$$
Let $m_i = P(x^{(k)}_i)$. According to Lemma B.2, for all $i \in S_1$ such that $d_{i,1} \ge w_0 d_{i,0}$, we have
$$f(x^{(k+1)}_i) \le f_{m_i}(z^{(k+1)}_i) - \kappa_1. \quad (44)$$
In addition, for all $i \in S_1$, we have
$$f(x^{(k+1)}_i) \le f_{m_i}(z^{(k+1)}_i). \quad (45)$$
According to Lemma B.3, we have
$$f(x^{(k+1)}_i) \le f_{m_i}(z^{(k+1)}_i) - \kappa_2 \quad (46)$$
for all $i \in S_2$. For all $i \in S_3$, we have
$$f(x^{(k+1)}_i) = f_{m_i}(x^{(k+1)}_i) = f_{m_i}(z^{(k+1)}_i). \quad (47)$$
The Negative-Curvature-Exploitation algorithm described in Algorithm 2 guarantees that when $P(x^{(k+1)}) \ne P(x^{(k)})$, there exists at least one $i \in S_1 \cup S_2$ such that (44) or (46) holds. It follows by (44)–(47) and the above argument that
$$\sum_{i=1}^d f(x^{(k+1)}_i) \le \sum_{i=1}^d f_{m_i}(z^{(k+1)}_i) - \kappa_0. \quad (48)$$
Let $\tilde S_{k+1} := \{i \in [d] \colon x^{(k+1)}_i \ne z^{(k+1)}_i\}$. It can be verified from Algorithm 2 that $\tilde S_{k+1} \subseteq S_2$. If $\tilde S_{k+1} = \emptyset$, then $x^{(k+1)} = z^{(k+1)}$, and it follows from (48) that
$$F(x^{(k+1)}) \le g(x^{(k+1)}) + \sum_{i=1}^d f_{m_i}(z^{(k+1)}_i) - \kappa_0 = F_{P(x^{(k)})}(z^{(k+1)}) - \kappa_0. \quad (49)$$
If $\tilde S_{k+1} \ne \emptyset$, then it is possible that $x^{(k+1)} \ne z^{(k+1)}$. To handle this case, we first bound $\|x^{(k+1)} - z^{(k+1)}\|_2$. Define $h_m(x) := \sum_{i=1}^d f_{m_i}(x_i)$ for $m \in \mathbb{N}^d$ with $m_i = P(x^{(k)}_i)$ for all $i \in [d]$. By the optimality of $z^{(k+1)}$, we have
$$\frac{1}{s}\big(w^{(k)} - z^{(k+1)}\big) - \nabla g(w^{(k)}) \in \partial h_m(z^{(k+1)}). \quad (50)$$
It follows from (50) that $\|w^{(k)} - z^{(k+1)}\|_2 \le s(G + \sqrt{d} F_0)$, and
$$\|x^{(k+1)} - z^{(k+1)}\|_2 \le \|w^{(k)} - z^{(k+1)}\|_2 \le s(G + \sqrt{d} F_0). \quad (51)$$
We then bound $g(x^{(k+1)}) - g(z^{(k+1)})$ by
$$g(x^{(k+1)}) - g(z^{(k+1)}) \overset{\text{①}}{=} \big\langle \nabla g(\zeta), x^{(k+1)} - z^{(k+1)} \big\rangle \le \|\nabla g(\zeta)\|_2 \|x^{(k+1)} - z^{(k+1)}\|_2 \le \big(\|\nabla g(\zeta) - \nabla g(w^{(k)})\|_2 + \|\nabla g(w^{(k)})\|_2\big) \cdot s(G + \sqrt{d} F_0) \overset{\text{②}}{\le} \big(L_g \|\zeta - w^{(k)}\|_2 + G\big) \cdot s(G + \sqrt{d} F_0) \le s\big(s L_g (G + \sqrt{d} F_0) + G\big)(G + \sqrt{d} F_0).$$
Here $\zeta$ in ① lies in the line segment between $x^{(k+1)}$ and $z^{(k+1)}$ by the mean value theorem for differentiable functions, and ② follows from the fact that $\|\zeta - w^{(k)}\|_2 \le \|w^{(k)} - z^{(k+1)}\|_2 \le s(G + \sqrt{d} F_0)$. We then have
$$F(x^{(k+1)}) \le g(x^{(k+1)}) + \sum_{i=1}^d f_{m_i}(z^{(k+1)}_i) - \kappa_0 \le g(z^{(k+1)}) + \sum_{i=1}^d f_{m_i}(z^{(k+1)}_i) + \big(g(x^{(k+1)}) - g(z^{(k+1)})\big) - \kappa_0 \le F_{P(x^{(k)})}(z^{(k+1)}) - \Big(\kappa_0 - s\big(s L_g (G + \sqrt{d} F_0) + G\big)(G + \sqrt{d} F_0)\Big) = F_{P(x^{(k)})}(z^{(k+1)}) - \kappa. \quad (52)$$
The PPGD algorithm guarantees that
$$F_{P(x^{(k)})}(z^{(k+1)}) \le F(x^{(k)}). \quad (53)$$
It follows from (49), (52), and (53) that
$$F(x^{(k+1)}) \le F(x^{(k)}) - \Big(\kappa_0 - s\big(s L_g (G + \sqrt{d} F_0) + G\big)(G + \sqrt{d} F_0)\Big).$$
Since $s < \frac{-G + \sqrt{G^2 + \frac{4 A \kappa_0}{G + \sqrt{d} F_0}}}{2A}$, we have $\kappa > 0$, which completes the proof. $\square$
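The step-size condition above trades the per-switch decrease $\kappa_0$ against the smoothness error of $g$. The $L_g$-smoothness (descent-lemma) bound used in (52), $g(z) \le g(w) + \langle \nabla g(w), z - w\rangle + \frac{L_g}{2}\|z - w\|_2^2$, can be sanity-checked numerically on a quadratic, whose gradient-Lipschitz constant is its largest eigenvalue. This is an illustrative check under those assumptions, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic g(x) = 0.5 * x^T A x with A symmetric PSD; grad g(x) = A x.
# Its gradient is Lipschitz with constant L_g = largest eigenvalue of A.
A = rng.standard_normal((5, 5))
A = A.T @ A
L_g = float(np.linalg.eigvalsh(A).max())

g = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Descent lemma: g(z) <= g(w) + <grad g(w), z - w> + (L_g / 2) * ||z - w||^2
for _ in range(100):
    w, z = rng.standard_normal(5), rng.standard_normal(5)
    rhs = g(w) + grad(w) @ (z - w) + 0.5 * L_g * np.dot(z - w, z - w)
    assert g(z) <= rhs + 1e-9
```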

B.4 PROOF OF LEMMA 4.2

Proof of Lemma 4.2. The PPGD algorithm described in Algorithm 1 ensures that $F(x^{(k)}) \le F(x^{(k-1)})$ and $\|\nabla g(w^{(k)})\|_2 \le G$ for $k = 1$. Suppose that $F(x^{(k)}) \le F(x^{(k-1)})$ and $\|\nabla g(w^{(k)})\|_2 \le G$ hold for all $1 \le k \le k'$ with $k' \ge 1$. With the chosen step size, the proof of Lemma 4.1 gives $F(x^{(k'+1)}) \le F(x^{(k')})$. This indicates that $x^{(k'+1)} \in L$ and $w^{(k'+1)} \in L_{R_0}$, so $\|\nabla g(w^{(k'+1)})\|_2 \le G$. It follows by induction that $F(x^{(k)}) \le F(x^{(k-1)})$ and $\|\nabla g(w^{(k)})\|_2 \le G$ hold for all $k \ge 1$.

B.5 PROOF OF THEOREM 4.3

Proof of Theorem 4.3. By Lemma 4.2, $\|\nabla g(w^{(k)})\|_2 \le G$ holds for all $k \ge 1$. The conclusion of this theorem then follows directly from Lemma 4.1.

B.6 PROOF OF THEOREM 4.4

The following lemma is crucial in the proof of Theorem 4.4. It shows that after sufficiently many iterations, all coordinates of $x^{(k)}$ belong to the same convex pieces, indexed by $m^*$; that is, $P(x^{(k)}) = m^*$. Moreover, the objective value $F(x^{(k)})$ is not greater than the surrogate objective value $F_{m^*}(z^{(k)}) = g(z^{(k)}) + \sum_{i=1}^d f_{m^*_i}(z^{(k)}_i)$.

Lemma B.5. Suppose Assumption 1 and Assumption 2 hold, and $s < \min\big\{s_1, \frac{\varepsilon_0}{L_g(G + \sqrt{d} F_0)}\big\}$. Then there must exist a finite $\bar k \in \mathbb{N}$ such that $P(x^{(k)}) = m^* \in \mathbb{N}^d$ and $F(x^{(k)}) \le F_{m^*}(z^{(k)})$ for all $k \ge \bar k$.

Proof of Lemma B.5. According to Theorem 4.3, when $P(x^{(k+1)}) \ne P(x^{(k)})$, we have $F(x^{(k+1)}) \le F(x^{(k)}) - \kappa$. Because $\inf_{x \in \mathbb{R}^d} F(x) > -\infty$, there can only be a finite number of $k$'s such that $F(x^{(k+1)}) \le F(x^{(k)}) - \kappa$. As a result, there must exist a finite $k_1 \in \mathbb{N}$ such that $P(x^{(k)}) = m^* \in \mathbb{N}^d$ for all $k \ge k_1$.

We now prove that there exists a finite $k_2 > k_1$ such that $F(x^{(k)}) \le F_{m^*}(z^{(k)})$ for all $k \ge k_2$. Suppose this is not the case, so that there are infinitely many $k \ge k_1$ with $F(x^{(k)}) > F_{m^*}(z^{(k)})$. It follows that there exists a sequence $\{m_k\}_{k \ge 1}$ such that $F(x^{(m_k)}) > F_{m^*}(z^{(m_k)})$ with $m_k > k_1$ for all $k \ge 1$, and $\lim_{k \to \infty} m_k = \infty$. According to the PPGD algorithm and the Negative-Curvature-Exploitation algorithm described in Algorithm 1 and Algorithm 2, this is possible only if for all $k \ge 1$ there exists $i \in [d]$ such that $P(z^{(m_k)}_i) \ne P(x^{(m_k - 1)}_i) = m^*_i$, $\big|z^{(m_k)}_i - q(w^{(m_k - 1)}_i)\big| < w_0 \big|z^{(m_k)}_i - w^{(m_k - 1)}_i\big|$, $x^{(m_k)} = x^{(m_k - 1)}$, and $f$ is continuous at $q(w^{(m_k - 1)}_i)$.

We consider the case that $q(w^{(m_k - 1)}_i) = q_{m^*_i}$ is the right endpoint of $R_{m^*_i}$; the case that $q(w^{(m_k - 1)}_i) = q_{m^*_i - 1}$ is the left endpoint of $R_{m^*_i}$ can be proved in a similar manner. In this case, $P(z^{(m_k)}_i) = m^*_i + 1$. Let $\tilde w \in \mathbb{R}^d$ with $\tilde w_j = w^{(m_k - 1)}_j$ for all $j \ne i$ and $\tilde w_i = q_{m^*_i}$.
By the definition of $\tilde w$ and the same argument as (29) in the proof of Lemma B.2, we have
$$z^{(m_k)}_i - w^{(m_k - 1)}_i = -s\big([\nabla g(w^{(m_k - 1)})]_i + v^-\big) = -s\big([\nabla g(w^{(m_k - 1)}) - \nabla g(\tilde w)]_i + [\nabla g(\tilde w)]_i + v^-\big). \quad (54)$$
Because $\|\nabla g(w^{(m_k - 1)}) - \nabla g(\tilde w)\|_2 \le s L_g (1 - w_0)(G + F_0)$ and $s < \frac{\varepsilon_0}{L_g (1 - w_0)(G + F_0)}$, we must have $[\nabla g(\tilde w)]_i + v^- \le -\varepsilon_0$ due to Assumption 2 and the fact that $z^{(m_k)}_i - w^{(m_k - 1)}_i > 0$. With $k \to \infty$, we have $\frac{t_{m_k - 1}}{t_{m_k}} \to 1$ and
$$u^{(m_k)} = x^{(m_k)} + \frac{t_{m_k - 1}}{t_{m_k}}\big(z^{(m_k)} - x^{(m_k)}\big) + \frac{t_{m_k - 1} - 1}{t_{m_k}}\big(x^{(m_k)} - x^{(m_k - 1)}\big) = x^{(m_k)} + \frac{t_{m_k - 1}}{t_{m_k}}\big(z^{(m_k)} - x^{(m_k)}\big) \xrightarrow{k \to \infty} z^{(m_k)}, \quad (55)$$
and it follows that $w^{(m_k)} = P_{x^{(m_k)}, R_0}(u^{(m_k)}) \xrightarrow{k \to \infty} P_{x^{(m_k)}, R_0}(z^{(m_k)})$, with $\big[P_{x^{(m_k)}, R_0}(z^{(m_k)})\big]_i = \big[P_{x^{(m_k - 1)}, R_0}(z^{(m_k)})\big]_i = q_{m^*_i}$. By the updating rule (9) in the PPGD algorithm and the first equality in (54), we have
$$\big\|z^{(m_k)} - w^{(m_k - 1)}\big\|_2 \le s(G + \sqrt{d} F_0). \quad (56)$$
It follows from (56) that
$$\big\|\nabla g\big(P_{x^{(m_k)}, R_0}(z^{(m_k)})\big) - \nabla g(\tilde w)\big\|_2 \le L_g \big\|P_{x^{(m_k)}, R_0}(z^{(m_k)}) - \tilde w\big\|_2 \overset{\text{①}}{\le} L_g \big\|P_{x^{(m_k)}, R_0}(z^{(m_k)}) - P_{x^{(m_k)}, R_0}(w^{(m_k - 1)})\big\|_2 \overset{\text{②}}{\le} L_g \big\|z^{(m_k)} - w^{(m_k - 1)}\big\|_2 \le s L_g (G + \sqrt{d} F_0). \quad (57)$$
Here ① follows from $\big[P_{x^{(m_k)}, R_0}(z^{(m_k)})\big]_i = \tilde w_i = q_{m^*_i}$, $w^{(m_k - 1)} = P_{x^{(m_k)}, R_0}(w^{(m_k - 1)})$, and $x^{(m_k)} = x^{(m_k - 1)}$; ② follows from the contraction property of projection onto a closed convex set. Combining (57) and the fact that $[\nabla g(\tilde w)]_i + v^- \le -\varepsilon_0$, we have
$$\big[\nabla g\big(P_{x^{(m_k)}, R_0}(z^{(m_k)})\big)\big]_i + v^- = [\nabla g(\tilde w)]_i + v^- + \Big(\big[\nabla g\big(P_{x^{(m_k)}, R_0}(z^{(m_k)})\big)\big]_i - [\nabla g(\tilde w)]_i\Big) \le -\varepsilon_0 + s L_g (G + \sqrt{d} F_0) < 0 \quad (58)$$
due to $s < \frac{\varepsilon_0}{L_g (G + \sqrt{d} F_0)}$. Now noting that $w^{(m_k)} = P_{x^{(m_k)}, R_0}(u^{(m_k)}) \xrightarrow{k \to \infty} P_{x^{(m_k)}, R_0}(z^{(m_k)})$, it follows from (58) and the smoothness of $g$ that there exists a large enough $k'$ such that for all $k \ge k'$,
$$\big[\nabla g(w^{(m_k)})\big]_i + v^- < 0. \quad (59)$$
We now analyze the next iterate $z^{(m_k + 1)}_i$.
By the updating rule $z^{(m_k + 1)}_i = \mathrm{prox}_{s f_{P(x^{(m_k)}_i)}}\big(\big[w^{(m_k)} - s \nabla g(w^{(m_k)})\big]_i\big)$, we must have $z^{(m_k + 1)}_i \in R^+_{m^*_i}$, and
$$\big[P_{x^{(m_k)}, R_0}(z^{(m_k)})\big]_i = q_{m^*_i} = q(w^{(m_k)}_i). \quad (60)$$
In iteration $m_k + 1$, we have $d_{i,0} = \big|z^{(m_k + 1)}_i - w^{(m_k)}_i\big|$, $d_{i,1} = \big|z^{(m_k + 1)}_i - q(w^{(m_k)}_i)\big|$, and $w^{(m_k)}_i \xrightarrow{k \to \infty} q(w^{(m_k)}_i)$. Therefore, with a sufficiently large $k'$, we have $d_{i,1} \ge w_0 d_{i,0}$ due to $d_{i,1} \xrightarrow{k \to \infty} d_{i,0}$. It follows from the Negative-Curvature-Exploitation algorithm described in Algorithm 2 that $P(x^{(m_k + 1)}_i) = m^*_i + 1$, which contradicts the fact that $P(x^{(k)}) = m^* \in \mathbb{N}^d$ for all $k \ge k_1$. This contradiction shows that there exists a finite $k_2 > k_1$ such that $F(x^{(k)}) \le F_{m^*}(z^{(k)})$ for all $k \ge k_2$. Setting $\bar k = k_2$ completes the proof. $\square$

Proof of Theorem 4.4. According to Lemma B.5, there exists a finite $k_1 > 1$ such that $P(x^{(k)}) = m^* \in \mathbb{N}^d$ and $F(x^{(k)}) \le F_{m^*}(z^{(k)})$ for all $k \ge k_1$. Furthermore, the proof of Lemma 4.2 shows that the sequence $\{x^{(k)}\}_{k \ge 1}$ generated by PPGD satisfies $\{x^{(k)}\}_{k \ge 1} \subseteq L$, which is a compact set, so there exists at least one limit point of $\{x^{(k)}\}$ and $\Omega \ne \emptyset$.

Define $h_m(x) := \sum_{i=1}^d f_{m_i}(x_i)$ for $m \in \mathbb{N}^d$ with $m_i \in [M]$ for all $i \in [d]$, and $F_m := g + h_m$. Note that for all $i \in [d]$, $f_{m^*_i}$ is convex except for the third case in (5) or (6). In such a case, either event ①: $f_{m^*_i}(x) = \lim_{y \to q_{m^*_i}^+} f(y)$ for $x > q_{m^*_i}$ and $f(q_{m^*_i}) < \lim_{y \to q_{m^*_i}^+} f(y)$, or event ②: $f_{m^*_i}(x) = \lim_{y \to q_{m^*_i - 1}^-} f(y)$ for $x < q_{m^*_i - 1}$ and $f(q_{m^*_i - 1}) < \lim_{y \to q_{m^*_i - 1}^-} f(y)$. It follows by the proof of Lemma B.4 that the sequences $\{x^{(k)}_i\}_{k \ge k_1}$ and $\{z^{(k)}_i\}_{k \ge k_1 + 1}$ satisfy $\{x^{(k)}_i\}_{k \ge k_1} \subseteq (-\infty, q_{m^*_i}]$ and $\{z^{(k)}_i\}_{k \ge k_1 + 1} \subseteq (-\infty, q_{m^*_i}]$ if event ① happens, and $\{x^{(k)}_i\}_{k \ge k_1} \subseteq [q_{m^*_i - 1}, +\infty)$ and $\{z^{(k)}_i\}_{k \ge k_1 + 1} \subseteq [q_{m^*_i - 1}, +\infty)$ if event ② happens. For all $i \in [d]$, define $R^*_i$ as the region over which $f_{m^*_i}$ is convex. It is clear that $R^*_i = \mathbb{R}$ if neither event ① nor event ② happens for $f_{m^*_i}$.
If only event ① happens for $f_{m^*_i}$, then $R^*_i = (-\infty, q_{m^*_i}]$. If only event ② happens, then $R^*_i = [q_{m^*_i - 1}, +\infty)$. If both events ① and ② happen, then $R^*_i = [q_{m^*_i - 1}, q_{m^*_i}]$.

Let $\tilde x \in \mathbb{R}^d$ be an optimal solution to $\min_{x_i \in R^*_i, i \in [d]} F_{m^*}(x)$. The existence of $\tilde x$ is proved as follows. First, it can be verified that the convex surrogate function $F_{m^*}$ is continuous over the convex region $R^* := \{x \in \mathbb{R}^d \colon x_i \in R^*_i \text{ for all } i \in [d]\}$, and $R^*$ is a closed set in the usual Euclidean topology. By the coercivity assumption and the continuity of $F_{m^*}$, the set $\widetilde R_0 := \{x \in R^* \colon F_{m^*}(x) \le F_{m^*}(x^{(0)})\}$, where $x^{(0)}$ is the initialization point of PPGD, is bounded and closed, so $\widetilde R_0 \cap R^*$ is bounded and closed, thus a compact set. Therefore, the minimizer $\tilde x$ is by its definition a minimizer of $F_{m^*}$ over $R^*$, which is also a minimizer of the continuous function $F_{m^*}$ over the compact set $\widetilde R_0 \cap R^*$. The existence of $\tilde x$ follows from the existence of a minimizer of a continuous function over a compact set.

Roadmap of the proof. We prove this theorem in three steps. In Step 1, it is proved that there exists a finite $k_0 \ge k_1$ such that $F(x^{(k)}) - F_{m^*}(\tilde x) \le O(1/k^2)$ for all $k > k_0$. Noting that $F(x^{(k)}) = F_{m^*}(x^{(k)}) \ge F_{m^*}(\tilde x)$ by the optimality of $\tilde x$, this inequality combined with the monotone nonincreasing property of $\{F(x^{(k)})\}$ indicates that $F(x^{(k)}) \downarrow F_{m^*}(\tilde x)$. In Step 2, we prove that $F(x') = F_{m^*}(\tilde x)$ for any limit point $x' \in \Omega$; according to the definition of $F_{m^*}$ and the optimality of $\tilde x$, it follows that $F(x')$ is a local minimum of $F$ and a global minimum of $F_{m^*}$ over $R^*$. In Step 3, we prove that any limit point $x' \in \Omega$ is a critical point of $F$ under a mild condition, following the argument in Step 2.

Step 1. We consider $k \ge k_1$ in the sequel. We have
$$f_{m^*_i}(z^{(k+1)}_i) \le f_{m^*_i}(v) + p\big(z^{(k+1)}_i - v\big) \quad (61)$$
for all $i \in [d]$, all $v \in R^*_i$, and all $p \in \partial f_{m^*_i}(z^{(k+1)}_i)$.
It follows that if $v \in \mathbb{R}^d$ with $v_i \in R^*_i$ for all $i \in [d]$, then
$$h_{m^*}(z^{(k+1)}) \le h_{m^*}(v) + \big\langle p, z^{(k+1)} - v \big\rangle \quad (62)$$
for all $p \in \partial h_{m^*}(z^{(k+1)})$. Because $z^{(k+1)}_i = \mathrm{prox}_{s f_{P(x^{(k)}_i)}}\big(\big[w^{(k)} - s \nabla g(w^{(k)})\big]_i\big)$ in (9) of Algorithm 1, it follows by the optimality of $z^{(k+1)}_i$ and Fermat's rule (Lemma B.1) that
$$\frac{1}{s}\big(w^{(k)} - z^{(k+1)}\big) - \nabla g(w^{(k)}) \in \partial h_{m^*}(z^{(k+1)}). \quad (63)$$
It follows by (62) and (63) that
$$h_{m^*}(z^{(k+1)}) \le h_{m^*}(v) + \Big\langle \frac{1}{s}\big(w^{(k)} - z^{(k+1)}\big) - \nabla g(w^{(k)}), \, z^{(k+1)} - v \Big\rangle \quad (64)$$
for any $v \in \mathbb{R}^d$ such that $v_i \in R^*_i$ for all $i \in [d]$. For such $v$, we have
$$\begin{aligned} F_{m^*}(z^{(k+1)}) &\le g(v) + \big\langle \nabla g(w^{(k)}), z^{(k+1)} - v \big\rangle + \frac{L_g}{2}\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 + h_{m^*}(z^{(k+1)}) \\ &\overset{\text{①}}{\le} g(v) + \big\langle \nabla g(w^{(k)}), z^{(k+1)} - v \big\rangle + \frac{L_g}{2}\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 + h_{m^*}(v) + \Big\langle \nabla g(w^{(k)}) + \frac{1}{s}\big(z^{(k+1)} - w^{(k)}\big), \, v - z^{(k+1)} \Big\rangle \\ &= F_{m^*}(v) + \frac{1}{s}\big\langle z^{(k+1)} - w^{(k)}, v - z^{(k+1)} \big\rangle + \frac{L_g}{2}\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 \\ &= F_{m^*}(v) + \frac{1}{s}\big\langle z^{(k+1)} - w^{(k)}, v - w^{(k)} \big\rangle - \frac{1}{s}\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 + \frac{L_g}{2}\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 \\ &= F_{m^*}(v) + \frac{1}{s}\big\langle z^{(k+1)} - w^{(k)}, v - w^{(k)} \big\rangle - \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|z^{(k+1)} - w^{(k)}\big\|_2^2. \end{aligned} \quad (65)$$
Here ① follows from (64). Letting $v = x^{(k)}$ and $v = \tilde x$ in (65), we have
$$F_{m^*}(z^{(k+1)}) \le F_{m^*}(x^{(k)}) + \frac{1}{s}\big\langle z^{(k+1)} - w^{(k)}, x^{(k)} - w^{(k)} \big\rangle - \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|z^{(k+1)} - w^{(k)}\big\|_2^2 \quad (66)$$
and
$$F_{m^*}(z^{(k+1)}) \le F_{m^*}(\tilde x) + \frac{1}{s}\big\langle z^{(k+1)} - w^{(k)}, \tilde x - w^{(k)} \big\rangle - \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|z^{(k+1)} - w^{(k)}\big\|_2^2. \quad (67)$$
Computing $(66) \times (t_k - 1) + (67)$, we have
$$t_k F_{m^*}(z^{(k+1)}) - (t_k - 1) F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x) \le \frac{1}{s}\Big\langle z^{(k+1)} - w^{(k)}, (t_k - 1)\big(x^{(k)} - w^{(k)}\big) + \tilde x - w^{(k)} \Big\rangle - t_k \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|z^{(k+1)} - w^{(k)}\big\|_2^2. \quad (68)$$
It follows that
$$t_k \big(F_{m^*}(z^{(k+1)}) - F_{m^*}(\tilde x)\big) - (t_k - 1)\big(F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x)\big) \le \frac{1}{s}\Big\langle z^{(k+1)} - w^{(k)}, (t_k - 1)\big(x^{(k)} - w^{(k)}\big) + \tilde x - w^{(k)} \Big\rangle - t_k \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|z^{(k+1)} - w^{(k)}\big\|_2^2. \quad (69)$$
Multiplying both sides of (69) by $t_k$ and using $t_k^2 - t_k = t_{k-1}^2$, we have
$$\begin{aligned} & t_k^2 \big(F_{m^*}(z^{(k+1)}) - F_{m^*}(\tilde x)\big) - t_{k-1}^2 \big(F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x)\big) \\ &\le \frac{1}{s}\Big\langle t_k\big(z^{(k+1)} - w^{(k)}\big), (t_k - 1)\big(x^{(k)} - w^{(k)}\big) + \tilde x - w^{(k)} \Big\rangle - \Big(\frac{1}{s} - \frac{L_g}{2}\Big)\big\|t_k\big(z^{(k+1)} - w^{(k)}\big)\big\|_2^2 \\ &\le \frac{1}{s}\Big\langle t_k\big(z^{(k+1)} - w^{(k)}\big), (t_k - 1)\big(x^{(k)} - w^{(k)}\big) + \tilde x - w^{(k)} \Big\rangle - \frac{1}{2s}\big\|t_k\big(z^{(k+1)} - w^{(k)}\big)\big\|_2^2 \\ &= \frac{1}{2s}\Big(\big\|(t_k - 1)x^{(k)} - t_k w^{(k)} + \tilde x\big\|_2^2 - \big\|(t_k - 1)x^{(k)} - t_k z^{(k+1)} + \tilde x\big\|_2^2\Big). \end{aligned} \quad (70)$$
Since $t_k \ge \frac{k+1}{2}$ for $k \ge 1$, we have $t_k \to \infty$ and $\lim_{k \to \infty} \big\|\big(1 - \frac{1}{t_k}\big)x^{(k)} + \frac{1}{t_k}\tilde x - x^{(k)}\big\|_2 = 0$. It follows that there exists a finite $k_2$ such that
$$\Big[\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x\Big]_i \in R_{P(x^{(k)}_i)} \cap B\big(x^{(k)}_i, R_0\big) \quad \text{for all } k \ge k_2 \text{ and all } i \in [d]. \quad (71)$$
It follows from (71) that
$$\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x = P_{x^{(k)}, R_0}\Big(\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x\Big). \quad (72)$$
Now let $k \ge k_0 := \max\{k_1, k_2\}$. We have
$$\big\|(t_k - 1)x^{(k)} - t_k w^{(k)} + \tilde x\big\|_2 = t_k \Big\|\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x - w^{(k)}\Big\|_2 \overset{\text{①}}{=} t_k \Big\|P_{x^{(k)}, R_0}\Big(\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x\Big) - P_{x^{(k)}, R_0}\big(u^{(k)}\big)\Big\|_2 \overset{\text{②}}{\le} t_k \Big\|\Big(1 - \frac{1}{t_k}\Big)x^{(k)} + \frac{1}{t_k}\tilde x - u^{(k)}\Big\|_2 = \big\|(t_k - 1)x^{(k)} - t_k u^{(k)} + \tilde x\big\|_2, \quad (73)$$
where ① follows from (72) and $w^{(k)} = P_{x^{(k)}, R_0}(u^{(k)})$, and ② follows from the contraction property of projection onto a closed convex set. It follows by (70) and (73) that
$$t_k^2 \big(F_{m^*}(z^{(k+1)}) - F_{m^*}(\tilde x)\big) - t_{k-1}^2 \big(F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x)\big) \le \frac{1}{2s}\Big(\big\|(t_k - 1)x^{(k)} - t_k u^{(k)} + \tilde x\big\|_2^2 - \big\|(t_k - 1)x^{(k)} - t_k z^{(k+1)} + \tilde x\big\|_2^2\Big). \quad (74)$$
Define $Q^{(k+1)} := (t_k - 1)x^{(k)} - t_k z^{(k+1)} + \tilde x$, so that $Q^{(k)} = (t_{k-1} - 1)x^{(k-1)} - t_{k-1} z^{(k)} + \tilde x$. It can be verified that $Q^{(k)} = (t_k - 1)x^{(k)} - t_k u^{(k)} + \tilde x$. Therefore,
$$t_k^2 \big(F_{m^*}(z^{(k+1)}) - F_{m^*}(\tilde x)\big) - t_{k-1}^2 \big(F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x)\big) \le \frac{1}{2s}\Big(\|Q^{(k)}\|_2^2 - \|Q^{(k+1)}\|_2^2\Big). \quad (75)$$
Noting that $F(x^{(k+1)}) \le F_{m^*}(z^{(k+1)})$ for $k \ge k_0$, and summing $t_k^2 \big(F_{m^*}(x^{(k+1)}) - F_{m^*}(\tilde x)\big) - t_{k-1}^2 \big(F_{m^*}(x^{(k)}) - F_{m^*}(\tilde x)\big)$ over $k = k_0, \ldots, m$ for $m \ge k_0$, we have
$$t_m^2 \big(F_{m^*}(x^{(m+1)}) - F_{m^*}(\tilde x)\big) - t_{k_0 - 1}^2 \big(F_{m^*}(x^{(k_0)}) - F_{m^*}(\tilde x)\big) \le \frac{1}{2s}\Big(\|Q^{(k_0)}\|_2^2 - \|Q^{(m+1)}\|_2^2\Big) \le \frac{1}{2s}\|Q^{(k_0)}\|_2^2 = \frac{1}{2s}\big\|(t_{k_0 - 1} - 1)x^{(k_0 - 1)} - t_{k_0 - 1} z^{(k_0)} + \tilde x\big\|_2^2. \quad (76)$$
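The two facts about the momentum coefficients used above, $t_k^2 - t_k = t_{k-1}^2$ and $t_k \ge \frac{k+1}{2}$, can be checked numerically, assuming the standard Nesterov/FISTA update $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$ with $t_1 = 1$, which is consistent with the identities invoked here:

```python
import math

def next_t(t):
    # Standard FISTA momentum update; satisfies t_next^2 - t_next = t^2
    return (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0

t = [1.0]  # t[0] corresponds to t_1 = 1
for k in range(1, 50):
    t.append(next_t(t[-1]))

# t_k^2 - t_k = t_{k-1}^2, the identity used to telescope (69) into (70)
for k in range(1, 50):
    assert abs((t[k] ** 2 - t[k]) - t[k - 1] ** 2) < 1e-8

# t_k >= (k + 1) / 2, which yields the O(1/k^2) rate in (77)-(78)
for k in range(50):
    assert t[k] >= (k + 2) / 2.0 - 1e-12  # t[k] is t_{k+1}
```

The quadratic growth of $t_k$ is exactly what turns the telescoped bound (76) into the $4/(m+1)^2$ factor in (77).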
Since $t_k \ge \frac{k+1}{2}$ for $k \ge 1$, it follows from (76) that
$$F_{m^*}(x^{(m+1)}) - F_{m^*}(\tilde x) \le \frac{4}{(m+1)^2}\Big(\frac{1}{2s}\big\|(t_{k_0 - 1} - 1)x^{(k_0 - 1)} - t_{k_0 - 1} z^{(k_0)} + \tilde x\big\|_2^2 + t_{k_0 - 1}^2 \big(F_{m^*}(x^{(k_0)}) - F_{m^*}(\tilde x)\big)\Big) =: \frac{4}{(m+1)^2} U^{(k_0)}. \quad (77)$$
Noting that $F(x^{(m+1)}) = F_{m^*}(x^{(m+1)})$, we have $F(x^{(m+1)}) - F_{m^*}(\tilde x) \le \frac{4}{(m+1)^2} U^{(k_0)}$. Replacing $m + 1$ with $k$, we have
$$F(x^{(k)}) - F_{m^*}(\tilde x) \le \frac{4}{k^2} U^{(k_0)} \quad (78)$$
for all $k > k_0$. Because $F(x^{(k)}) = F_{m^*}(x^{(k)}) \ge F_{m^*}(\tilde x)$ due to the optimality of $\tilde x$, we have $F(x^{(k)}) \downarrow F_{m^*}(\tilde x)$ as $k \to \infty$ based on (78).

Step 2. We now prove that any limit point $x' \in \Omega$ achieves a local minimum of $F$. Let $x' \in \Omega$ be an arbitrary limit point of $\{x^{(k)}\}_{k \ge 1}$. By Lemma 4.2, we have $F(x^{(k)}) \downarrow F(x')$ as $k \to \infty$. To see this, we first note that $F_{m^*}$ is continuous over the set $R^*$ by the definition of $R^*$ in the beginning of this proof. It follows from the beginning of this proof that $x^{(k)}_i \in R^*_i$ for all $k \ge k_0$ and all $i \in [d]$. In addition, $x'_i \in R_{m^*_i} \subseteq R^*_i$ for all $i \in [d]$. Therefore, $\{x^{(k)}\}_{k \ge k_0}$ and $x'$ belong to $R^*$, on which $F_{m^*}$ is continuous, so $F(x^{(k)}) = F_{m^*}(x^{(k)}) \downarrow F_{m^*}(x') = F(x')$. We also have $F(x^{(k)}) \downarrow F_{m^*}(\tilde x)$ as $k \to \infty$ due to Step 1. As a result,
$$F(x') = F_{m^*}(\tilde x) \quad \text{for any } x' \in \Omega; \quad (79)$$
that is, $F$ has constant value on $\Omega$. It is noted that $F_{m^*} = F$ on the set $\{x \in \mathbb{R}^d \mid P(x) = m^*\} \subseteq R^*$, so the optimality of $\tilde x$ indicates that
$$F(x') = F_{m^*}(\tilde x) \le \inf_{\{x \in \mathbb{R}^d \mid P(x) = m^*\}} F(x). \quad (80)$$
On the other hand, since $P(x') = m^*$, we have $F(x') \ge \inf_{\{x \in \mathbb{R}^d \mid P(x) = m^*\}} F(x)$. Combining this inequality and (80), we have $F(x') = \inf_{\{x \in \mathbb{R}^d \mid P(x) = m^*\}} F(x)$.

Step 3. We now prove that any limit point $x' \in \Omega$ is a critical point of $F$ under the mild condition that $f_{m^*_i}$ does not take the third case in (5) or (6) for all $i \in [d]$. As explained in the beginning of this proof, under this condition $R^*_i = \mathbb{R}$ for all $i \in [d]$, and $F_{m^*}$ is convex over $\mathbb{R}^d$. Because $F(x') = F_{m^*}(\tilde x)$ for any $x' \in \Omega$, $x'$ is an optimal solution to $\min_{x \in \mathbb{R}^d} F_{m^*}(x)$. The optimality of $x'$ for this convex programming problem indicates that $0 \in \partial F_{m^*}(x')$. Because $F_{m^*} = F$ on the set $\{x \in \mathbb{R}^d \mid P(x) = m^*\}$ and $x' \in \{x \in \mathbb{R}^d \mid P(x) = m^*\}$, we have $0 \in \partial F(x')$.
This can be verified using the definition of the limiting subdifferential by considering the constant sequence $x_k = x'$ for all $k \ge 1$: we have $0 \in \partial F(x_k)$ because $0 \in \partial F_{m^*}(x')$ and $P(x') = m^*$.
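Steps ② in (57) and (73) rely on the nonexpansiveness (contraction property) of the Euclidean projection onto a closed convex set: $\|P(x) - P(y)\|_2 \le \|x - y\|_2$. A quick numerical sanity check with a box (a product of intervals, the same shape as the coordinate-wise projection regions used by PPGD; an illustration, not the exact operator $P_{x^{(k)}, R_0}$):

```python
import numpy as np

def project_box(v, lo, hi):
    # Euclidean projection onto the box [lo, hi]^d, a closed convex set;
    # for a box it is simply coordinate-wise clipping.
    return np.clip(v, lo, hi)

rng = np.random.default_rng(1)
lo, hi = -1.0, 1.0
for _ in range(1000):
    x, y = rng.standard_normal(4) * 3, rng.standard_normal(4) * 3
    px, py = project_box(x, lo, hi), project_box(y, lo, hi)
    # Nonexpansiveness: ||P(x) - P(y)|| <= ||x - y||
    assert np.linalg.norm(px - py) <= np.linalg.norm(x - y) + 1e-12
```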



minimum interval length is defined by $R_0 = \min_{m \in [M]} |R_m|$. Otherwise, the minimum interval length is $R_0 = \min_{m \in [M] \colon |R_m| \ne 0} |R_m|$, where $|\cdot|$ denotes the length of an interval.

Figure 1: Illustration of three piecewise convex functions.

Example 1.1. (1) The indicator penalty function $f(x) = \lambda \mathbb{1}_{\{x < \tau\}}$ is piecewise convex with $R_1 = (-\infty, \tau)$, $R_2 = [\tau, \infty)$. (2) The capped-$\ell_1$ penalty function $f(x) = f(x; \lambda, b) = \lambda \min\{|x|, b\}$ is piecewise convex with $R_1 = (-\infty, -b]$, $R_2 = [-b, b]$, $R_3 = [b, \infty)$. (3) The leaky capped-$\ell_1$ penalty function (Wangni & Lin, 2017) $f(x) = \lambda \min\{|x|, b\} + \beta \mathbb{1}_{\{|x| \ge b\}} |x - b|$ is piecewise convex with $R_1 = (-\infty, -b]$, $R_2 = [-b, b]$, $R_3 = [b, \infty)$. The three functions are illustrated in Figure 1. While not illustrated, $f(x) = \mathbb{1}_{\{x \ne 0\}}$ for the $\ell_0$-norm with $h(x) = \|x\|_0$ is also piecewise convex.
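A minimal sketch of the first two penalties, with assertions checking their values on each convex piece and the failure of global convexity; the parameter defaults $\lambda = b = 1$ and $\tau = 0$ are arbitrary choices for illustration:

```python
def capped_l1(x, lam=1.0, b=1.0):
    # Capped-l1 penalty f(x) = lam * min(|x|, b): convex on each of the
    # pieces R1 = (-inf, -b], R2 = [-b, b], R3 = [b, inf), but not globally.
    return lam * min(abs(x), b)

def indicator_penalty(x, lam=1.0, tau=0.0):
    # Indicator penalty f(x) = lam * 1{x < tau}: constant (hence convex) on
    # each of R1 = (-inf, tau) and R2 = [tau, inf).
    return lam if x < tau else 0.0

# Values on the pieces
assert capped_l1(0.5) == 0.5 and capped_l1(3.0) == 1.0
assert indicator_penalty(-1.0) == 1.0 and indicator_penalty(0.0) == 0.0

# Global convexity fails: the midpoint value exceeds the chord value,
# violating Jensen's inequality for the points 0 and 2.
assert capped_l1(1.0) > 0.5 * (capped_l1(0.0) + capped_l1(2.0))
```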

$$\min_{m \in [M-1] \colon f \text{ is only right continuous at } q_m} \Big(\lim_{y \to q_m^-} f(y) - f(q_m)\Big), \; \min_{m \in [M-1] \colon f \text{ is only left continuous at } q_m} \Big(\lim_{y \to q_m^+} f(y) - f(q_m)\Big)\bigg\}.$$

Figure 2 illustrates the surrogate function $f_m(x)$ with $x \in R^+_m$ for the three different cases in (5).

Figure 2: Illustration of the surrogate function $f_m$ for the three different cases in (5).

Definition 4.1 (Fréchet subdifferential and critical points (Rockafellar & Wets, 2009)). Let $u \colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a proper and lower semicontinuous function, which may be nonconvex.

Theorem 4.4 (Convergence rate of PPGD). Suppose the step size $s < \min\big\{s_1, \frac{\varepsilon_0}{L_g(G + \sqrt{d} F_0)}, \frac{1}{L_g}\big\}$ with $s_1$ defined by (13), and that Assumption 1 and Assumption 2 hold. Then there exists a finite $k_0 \ge 1$ such that the following statements hold. (1) $P(x^{(k)}) = m^*$ for some $m^* \in \mathbb{N}^d$ for all $k > k_0$.

If the convex pieces $\{R_m\}_{m=1}^M$ have no endpoints at which $f$ is continuous, then Theorem 4.4 holds without requiring Assumption 2. In particular, Theorem 4.4 holds without Assumption 2 for problem (1) with the indicator penalty or the $\ell_0$-norm regularizer. Remark 4.6. All the statements of Theorem 4.4 still hold if $f$ is not piecewise convex but satisfies all the other conditions of Assumption 1 and Assumption 2, provided that $f$ restricted on the final convex

Figure 3: Illustration of the objective value with respect to the iteration number on the MNIST data for capped-$\ell_1$ regularized logistic regression.

Lemma B.1 (Fermat's rule for the Fréchet subdifferential and the limiting subdifferential; also in Rockafellar & Wets (2009, Theorem 10.1)). Let $u \colon \Omega \to \mathbb{R}$ be a real-valued function defined on $\Omega \subseteq \mathbb{R}^d$ with a local minimum at $x \in \Omega$. Then $0 \in \hat\partial u(x) \subseteq \partial u(x)$.






