GRADIENT PROPERTIES OF HARD THRESHOLDING OPERATOR

Abstract

Sparse optimization has received increasing attention in many applications such as compressed sensing, variable selection in regression problems, and, more recently, neural network compression in machine learning. For example, compressing a neural network is a bi-level, stochastic, and nonconvex problem that can be cast as a sparse optimization problem; developing efficient methods for sparse optimization therefore plays a critical role in applications. The goal of this paper is to develop analytical techniques for general, large-scale sparse optimization problems using the hard thresholding operator. To this end, we study the iterative hard thresholding (IHT) algorithm, which has been extensively studied in the literature because it is scalable, fast, and easily implementable. Despite this extensive research, we develop several new techniques that not only recover many known results but also lead to new ones. Specifically, we first establish a new and critical gradient descent property of the hard thresholding (HT) operator. This gradient descent bound can be related to the distance between sparse points, whereas the distance between sparse points cannot provide any information about the gradient in the sparse setting; to the best of our knowledge, this direction (from the gradient to the distance) has not been shown in the literature. Our gradient descent property also allows one to study the IHT algorithm when the stepsize is less than or equal to 1/L, where L > 0 is the Lipschitz constant of the gradient of the objective function. Note that the existing techniques in the literature can only handle the case when the stepsize is strictly less than 1/L. Exploiting this, we introduce and study HT-stable and HT-unstable stationary points and show that, no matter how close an initialization is to a HT-unstable stationary point (a saddle point in the sparse sense), the IHT sequence leaves it.
Finally, we show that, no matter which sparse initial point is selected, the IHT sequence converges provided the function values at distinct HT-stable stationary points are distinct, a new assumption that has not appeared in the literature. We provide a video of 4,000 independent runs in which the IHT algorithm is initialized very close to a HT-unstable stationary point; the sequences escape it.

1. INTRODUCTION

Solving sparse problems has gained increasing attention in the fields of statistics, finance, and engineering. These problems emerge in statistics as variable selection in linear regression Fan & Li (2001); Zou & Hastie (2005); Chun & Keleş (2010); Desboulets (2018), as mixed-integer programs Bourguignon et al. (2015); Liu et al. (2017); Dedieu et al. (2021), in portfolio optimization in finance Brodie et al. (2009); Chang et al. (2000), in compressed sensing in signal processing Foucart & Rauhut (2013); Eldar & Kutyniok (2012), and in compressing deep neural networks in machine learning Damadi et al. (2022); Molchanov et al. (2017); Gale et al. (2019), just to name a few. Due to the use of the ℓ0-(pseudo) norm[1], these problems are discontinuous and nonconvex. The ℓ0-norm case has been addressed by hard thresholding (HT) techniques, especially the iterative HT (IHT) scheme Blumensath & Davies (2008); Beck & Eldar (2013); Lu (2014); Zhou et al. (2021). The Lasso-type, Basis Pursuit (BP)-type, and BP denoising (BPDN)-type problems consider the ℓ1-norm as a convex approximation of the ℓ0-norm Tibshirani (1996); Mousavi & Shen (2019). Nonconvex approximation of the ℓ0-norm by the ℓp-(pseudo) norm (0 < p < 1) has also been studied extensively Chartrand (2007); Foucart & Lai (2009); Lai & Wang (2011); Wang et al. (2011); Zheng et al. (2017); Won et al. (2022). Sparse optimization problems can also be formulated as mixed-integer programs Burdakov et al. (2016). The intrinsic combinatorics involved in sparse optimization makes it an NP-hard problem (even for a quadratic loss Davis (1994); Natarajan (1995)), so it is difficult to find a global minimizer. However, greedy algorithms have been developed to find local minimizers.

Algorithm 1 The iterative hard thresholding (IHT)
Require: x^0 ∈ R^n such that ∥x^0∥_0 ≤ s and stepsize γ > 0.
1: x^{k+1} ∈ H_s(x^k - γ∇f(x^k)) for k = 0, 1, . . .
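As a concrete reading of the IHT iteration x^{k+1} ∈ H_s(x^k - γ∇f(x^k)), the following NumPy sketch may help; the helper names (`hard_threshold`, `iht`, `grad_f`) are our own, and the tie-breaking in `argsort` simply picks one valid element of the set H_s(x).

```python
import numpy as np

def hard_threshold(x, s):
    """One valid output of H_s: keep the s largest-magnitude entries, zero the rest.
    Ties are broken by argsort's ordering, selecting one element of the set H_s(x)."""
    out = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))[:s]
    out[idx] = x[idx]
    return out

def iht(grad_f, x0, gamma, s, n_iter=500):
    """Algorithm 1: iterate x_{k+1} in H_s(x_k - gamma * grad f(x_k))."""
    x = hard_threshold(np.asarray(x0, dtype=float), s)  # start from a feasible (s-sparse) point
    for _ in range(n_iter):
        x = hard_threshold(x - gamma * grad_f(x), s)
    return x
```

For instance, with f(x) = (1/2)∥x - b∥² (so ∇f(x) = x - b and L = 1), running `iht` with γ = 1/L recovers the s largest entries of b.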
To this end, following the ideas of matching pursuit (MP) and orthogonal MP (OMP) Mallat & Zhang (1993); Pati et al. (1993) as greedy algorithms, numerous other greedy algorithms have been developed, such as stagewise OMP (StOMP) Donoho et al. (2012), regularized OMP (ROMP) Needell & Vershynin (2009; 2010), Compressive Sampling MP (CoSaMP) Needell & Tropp (2009), and Gradient Support Pursuit (GraSP) Bahmani et al. (2013). It should be noted that sparse optimization is not restricted to finding a sparse vector. For example, Fornasier et al. (2011); Haeffele et al. (2014); Davenport & Romberg (2016) consider finding a low-rank matrix, the counterpart of finding a sparse vector in applications dealing with matrices. In addition to devising algorithms for solving sparse optimization problems, first- and second-order optimality conditions have also been well studied Pan et al. (2017a); Bauschke et al. (2014); Beck & Hallak (2016); Li & Song (2015); Lu (2015); Pan et al. (2015); Bucher & Schwartz (2018). The general sparse optimization problem is the following:

min f(x) s.t. x ∈ Cs ∩ X, (1)

where Cs = {x ∈ R^n | ∥x∥0 ≤ s} (the sparsity constraint) is the union of finitely many subspaces of dimension s with 1 ≤ s < n, X is a constraint set in R^n, and the objective function f : R^n → R is lower bounded and continuously differentiable, i.e., C1. In this paper we address the special case of Problem (1) where X = R^n:

(P): min f(x) s.t. x ∈ Cs. (2)

To address Problem (2), the fundamental questions (Q1)-(Q4) arise; by considering the IHT algorithm, we will answer these questions. This algorithm has been extensively studied in the literature. It was originally devised for solving compressed sensing problems in 2008 Blumensath & Davies (2008; 2009). Since then, a large body of literature has studied IHT-type algorithms from different standpoints.
For example, Beck & Eldar (2013); Lu (2014; 2015); Pan et al. (2017b); Zhou et al. (2021) consider convergence of iterations, Jain et al. (2014); Liu & Foygel Barber (2020) study the limit of the objective function value sequence, Liu et al. (2017); Zhu et al. (2018) address duality, Zhou et al. (2020); Zhao et al. (2021) extend it to Newton-type IHT, Chen & Gu (2016); Li et al. (2016); Liang et al. (2020); Zhou et al. (2018) consider the stochastic version, Blumensath (2012); Khanna & Kyrillidis (2018); Vu & Raich (2019); Wu & Bian (2020) address accelerated IHT, and Wang et al. (2019); Bahmani et al. (2013) solve the logistic regression problem using the IHT.

SUMMARY OF CONTRIBUTIONS

By considering the IHT Algorithm 1 for Problem (2), we develop the following results:

• We establish a new and critical gradient descent property of the hard thresholding (HT) operator that has not appeared in the literature. This gradient descent bound can be related to the distance between sparse points; however, the distance between sparse points cannot provide any information about the gradient in the sparse setting, and to the best of our knowledge the converse direction (from the gradient to the distance) has not been shown so far. This property allows one to study the IHT algorithm when the stepsize is less than or equal to 1/L, where L > 0 is the Lipschitz constant of the gradient of the objective function. Note that the existing techniques in the literature can only handle the case when the stepsize is strictly less than 1/L. As an example, one can refer to Liu & Foygel Barber (2020), which needs the stepsize to be greater than or equal to 1/L.

• We introduce the notion of HT-stable and HT-unstable stationary points. Using them, we establish the escapability property of HT-unstable stationary points (saddle points in the sparse sense) and the local reachability property of strictly HT-stable stationary points. We provide a video of 4,000 independent runs in which the IHT algorithm is initialized very close to a HT-unstable stationary point; the sequences escape it.

• We show that the IHT sequence converges globally under a new assumption that has not appeared in the literature. In addition, Q-linear convergence of the IHT algorithm towards a local minimum is shown when the objective function is both RSS and restricted strictly convex.

According to our results, we address (Q1) and (Q2) by establishing a new gradient descent property of the hard thresholding (HT) operator and introducing the notion of HT-stable/unstable stationary points. By considering the RSS, restricted strictly convex, and RSC properties, we address (Q3) and (Q4).
Introduced earlier (in 2012) and first used by Bahmani et al. (2013) for sparsity optimization problems, these have become standard restrictions for analyzing sparsity optimization problems. Under the RSS and RSC properties of the objective function, one is able to address (Q3) and (Q4). Finding a closed-form expression for P_{Cs∩X} when X is an arbitrary set is difficult. However, Beck & Hallak (2016) show that the orthogonal projection of a point onto Cs ∩ X can be computed efficiently when X is a symmetric closed convex set. In this context, two types of sets are of interest: nonnegative symmetric sets and sign free sets. To address (Q1) in a more general setting, Beck & Hallak (2016) characterize L-stationary points of Problem (1) when X is either a nonnegative symmetric set or a sign free set. Also, Lu (2015) considers the same setting as Beck & Hallak (2016) and introduces a new optimality condition that is stronger than L-stationarity; he devises a Nonmonotone Projected Gradient (NPG) algorithm and shows that an accumulation point of the NPG sequence is a global optimum of Problem (1). Pan et al. (2017a) consider Problem (1) when X = R^n_+. They develop an Improved IHT algorithm (IIHT) that employs an Armijo-type stepsize rule and show that when the objective function is RSS and RSC, the IIHT sequence converges to a local minimum. A recent work by Zhou et al. (2021) develops the Newton Hard-Thresholding Pursuit (NHTP) for solving Problem (2); they show that when accumulation points of the NHTP sequence are L-stationary and isolated, the sequence converges with a locally Q-quadratic rate.

Table 1 (excerpt):
- f: L-LG; x ∈ Cs; 0 < γ < 1/L; optimality: x* ∈ Hs(x* - (1/L)∇f(x*)); algorithm: IHT; result: any accumulation point of the IHT sequence satisfies x* ∈ Hs(x* - (1/L)∇f(x*)) (Theorem 3.1).
- Lu (2014): f(x) + λ∥x∥0 with f convex; x ∈ Cs, l ≤ x ≤ u; 0 < γ < 1/L; optimality: local minimizer; result: the IHT sequence converges to a local minimizer (Theorem 3.3).
- Beck & Hallak (2016): f: L-LG; x ∈ Cs ∩ X, X nonnegative symmetric or sign free; 0 < γ < 1/L; optimality: basic feasibility (BFS); result: the sequence of BFSs converges to a basic feasible point (Lemma 7.1).
- Lu (2015): f: L-LG; x ∈ Cs ∩ X, X nonnegative symmetric or sign free; 0 < γ < 1/L; optimality: x* = P_{Cs∩X}(x* - (1/L)∇f(x*)); algorithm: NPG; result: an accumulation point of the NPG sequence satisfies the optimality condition (Theorem 4.3).
- Pan et al. (2017b): f: Ls-RSS and s-RC; x ∈ Cs ∩ R^n_+; 0 < γ < 1/Ls; optimality: x* ∈ P_{Cs∩R^n_+}(x* - (1/Ls)∇f(x*)), i.e., Ls-stationary; algorithm: IIHT; result: any accumulation point of the IIHT sequence is an Ls-stationary point (Theorem 3.1).
- Pan et al. (2017b): f: Ls-RSS and βs-RSC; x ∈ Cs ∩ R^n_+; 0 < γ < 1/Ls; optimality: Ls-stationary; algorithm: IIHT.

3. DEFINITIONS

We provide some definitions that will be used throughout the paper: the HT operator (HTO), and RSS and RSC functions.

Definition 1 (Restricted Strong Smoothness (RSS)). A differentiable function f : R^n → R is said to be restricted strongly smooth with modulus Ls > 0, or Ls-RSS, if
f(y) ≤ f(x) + ⟨∇f(x), y - x⟩ + (Ls/2)∥y - x∥²₂ for all x, y ∈ R^n such that ∥x∥0 ≤ s and ∥y∥0 ≤ s. (3)

Definition 2 (Restricted Strong Convexity (RSC)). A differentiable function f : R^n → R is said to be restricted strongly convex with modulus βs > 0, or βs-RSC, if
f(y) ≥ f(x) + ⟨∇f(x), y - x⟩ + (βs/2)∥y - x∥²₂ for all x, y ∈ R^n such that ∥x∥0 ≤ s and ∥y∥0 ≤ s. (4)

Definition 3 (The HT operator). The HT operator Hs(•) denotes the orthogonal projection onto the union of subspaces of R^n of dimension s, 1 ≤ s < n, that is,
Hs(x) ∈ arg min_{∥z∥0≤s} ∥z - x∥₂. (5)

Claim 1. The HT operator keeps the s largest entries of its input in absolute value. For a vector x ∈ R^n, I_s^x ⊂ {1, . . . , n} denotes a set of indices corresponding to the s largest elements of x in absolute value. For example, H₂([1, -3, 1]⊤) is either [0, -3, 1]⊤ or [1, -3, 0]⊤, with I_2^y = {2, 3} and I_2^y = {1, 2}, respectively. Therefore, the output of the HTO may not be unique. This clearly shows why the HTO is not a convex operator and why there is an inclusion in (5) rather than an equality.
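The non-uniqueness in Claim 1 is easy to see in code. The snippet below is a small illustration (our own helper, not the paper's); a deterministic sort picks one element of H_s(x), but with tied magnitudes other outputs are equally valid.

```python
import numpy as np

def hard_threshold(x, s):
    # A stable sort makes the choice deterministic, but when entries tie in
    # magnitude, other outputs are equally valid elements of the set H_s(x).
    out = np.zeros_like(x, dtype=float)
    idx = np.argsort(-np.abs(x), kind="stable")[:s]
    out[idx] = x[idx]
    return out

# H_2([1, -3, 1]) has two valid outputs: [1, -3, 0] and [0, -3, 1]
print(hard_threshold(np.array([1., -3., 1.]), 2))
```

Either output solves the projection problem in (5), which is why (5) is stated with an inclusion.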

4. RESULTS

We consider solving Problem (2) using the IHT Algorithm 1 and develop results on the HT operator. Using these results, we characterize the behavior of the IHT sequence generated by Algorithm 1. Toward this end, the statements of the main results are provided here, and all technical proofs are postponed to the Appendix.

4.1. GRADIENT DESCENT PROPERTY

First, we establish a new and critical gradient descent property of the hard thresholding (HT) operator.

Theorem 1. Let f : R^n → R be a differentiable function that is Ls-RSS, let x be a sparse vector such that ∥x∥0 ≤ s with any I_s^x, and let y ∈ Hs(x - γ∇f(x)) with any I_s^y and 0 < γ ≤ 1/Ls. Then,
(γ/2)(1 - Lsγ)∥∇_{I_s^x ∪ I_s^y} f(x)∥²₂ ≤ f(x) - f(y),
where ∇_{I_s^x ∪ I_s^y} f(x) is the restriction of the gradient vector to the union of the index sets I_s^x and I_s^y.

Theorem 1 provides a lower bound on the difference between the function value at the current point x and the function value at the updated point y provided by the HTO. Note that y may not be a unique vector with s nonzero elements; nonetheless, as stated in Theorem 1, the inequality holds for any y that might be the output of the HTO. As one can clearly see, the descent can be characterized by looking only at the entries of the gradient restricted to the union of the s largest elements of x and y; the rest of the gradient can be ignored. Since one may be interested in characterizing the descent using the distance between x and y, we provide the following corollary.

Corollary 1. Assume all the assumptions in Theorem 1 hold. Then,
((1 - Lsγ)/(6γ))∥y - x∥²₂ ≤ (γ/2)(1 - Lsγ)∥∇_{I_s^x ∪ I_s^y} f(x)∥²₂ ≤ f(x) - f(y).

This result shows the strength of our gradient bound: the gradient bound can be related to the distance between sparse points, whereas the distance between sparse points cannot provide any information about the gradient. To the best of our knowledge, this direction (from the gradient to the distance) has not been shown so far in the literature. Algebraically speaking, characterizing a descent of the function value solely with information at the current iterate x is of more interest. To this end, we provide another corollary to Theorem 1 that ties the descent to x only.

Corollary 2. Assume all the assumptions in Theorem 1 hold.
Then, the norm of the gradient restricted to any I_s^x can be bounded as follows:
(γ/2)∥∇_{I_s^x} f(x)∥²₂ ≤ f(x) - f(y).

By this point, we have shown that applying the HTO once can result in a smaller function value provided the gradient over the s largest entries of x is nonzero. This can be utilized to show that the sequence generated by the IHT algorithm is nonincreasing. In particular, if the generated sequence has an accumulation point, the sequence of objective values converges to the objective value at the accumulation point.[2]

Corollary 3. Let f : R^n → R be a bounded below differentiable function that is Ls-RSS, and let {x^k}_{k≥0} be the IHT sequence with 0 < γ ≤ 1/Ls. Then {f(x^k)}_{k≥0} is nonincreasing and converges. Also, if x* is an accumulation point of {x^k}_{k≥0}, then {f(x^k)}_{k≥0} → f(x*).

Next, we look at basic stationary points of Problem (2) and show their properties.
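The descent bound in Corollary 2 and the monotonicity in Corollary 3 can be checked numerically. The sketch below is our own sanity check on a random least-squares instance with the boundary stepsize γ = 1/L (the instance, seed, and helper names are assumptions, not the paper's setup).

```python
import numpy as np

def hard_threshold(x, s):
    out = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))[:s]
    out[idx] = x[idx]
    return out

def descent_gaps(seed=0, m=6, n=8, s=3, n_iter=50):
    """Run IHT on a random least-squares problem with gamma = 1/L and return
    (f(x_k) - f(x_{k+1}), (gamma/2)*||grad f(x_k) restricted to I_s^x||^2) pairs."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad = lambda x: A.T @ (A @ x - b)
    L = np.linalg.eigvalsh(A.T @ A).max()  # f is L-smooth, hence L_s-RSS for any s
    gamma = 1.0 / L                        # the boundary stepsize allowed by Theorem 1
    x = hard_threshold(rng.standard_normal(n), s)
    pairs = []
    for _ in range(n_iter):
        y = hard_threshold(x - gamma * grad(x), s)
        g_top = grad(x)[np.argsort(-np.abs(x))[:s]]  # gradient restricted to I_s^x
        pairs.append((f(x) - f(y), 0.5 * gamma * float(g_top @ g_top)))
        x = y
    return pairs
```

On every iteration, the observed descent f(x_k) - f(x_{k+1}) should dominate the restricted-gradient bound, and in particular should never be negative.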

4.2. OPTIMALITY CONDITION BASED ON THE HT PROPERTIES

In this subsection, we show that not all basic stationary points of Problem (2) are reachable when the IHT algorithm is run. To do so, the notion of HT stationary points is introduced as follows.

4.2.1. HT STATIONARY POINTS

Definition 4. For a given constant γ > 0, we say that a sparse vector x* ∈ Cs is a HT-stable stationary point of Problem (2) associated with γ if ∇_{supp(x*)} f(x*) = 0 and
min{|x*_i| : i ∈ I_s^{x*}} ≥ γ max{|∇_j f(x*)| : j ∉ supp(x*)} = γ∥∇_{(supp(x*))^c} f(x*)∥_∞. (9)
(Note that min{|x*_i| : i ∈ I_s^{x*}} is unique and does not depend on the choice of I_s^{x*}.) If ∇_{supp(x*)} f(x*) = 0 but (9) fails, we say that x* is a HT-unstable stationary point associated with γ. Moreover, if ∇_{supp(x*)} f(x*) = 0 and (9) holds strictly, namely,
min{|x*_i| : i ∈ I_s^{x*}} > γ max{|∇_j f(x*)| : j ∉ supp(x*)}, (10)
we say that x* is a strictly HT-stable stationary point associated with γ.

Note that when ∥x*∥0 = s, I_s^{x*} is unique and equals supp(x*), so supp(x*) in the above definition can be replaced by I_s^{x*}. Moreover, if x* is a strictly HT-stable stationary point, then we must have I_s^{x*} = supp(x*) (or equivalently ∥x*∥0 = s), because otherwise 0 = min{|x*_i| : i ∈ I_s^{x*}} > γ∥∇_{(supp(x*))^c} f(x*)∥_∞, which is impossible.

Remark 1. As stated in Definition 4, a basic stationary point is a point whose gradient is zero over the nonzero elements. For example, suppose x = [0, 4, 0, 2]⊤ ∈ R⁴ is a basic stationary point. Then ∇f(x) = [c1, 0, c2, 0]⊤, where c1, c2 are scalars. The main idea is that a HT-stable stationary point has to be a basic stationary point, but x can be a basic stationary point without being a HT-stable stationary point. This is the analogue of non-sparse optimization, where a point x with ∇f(x) = 0 need not be a local or global minimizer; it can be a saddle point.

Remark 2. The main message of Definition 4 is the following: only by looking at the gradient restricted to the nonzero entries of a basic feasible point, one cannot say whether or not it is a local minimizer of Problem (2).
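Definition 4 amounts to comparing the smallest of the s largest magnitudes of x* against γ times the largest off-support gradient entry. The helper below (our own names and example values, building on Remark 1's x = [0, 4, 0, 2]) classifies a basic stationary point accordingly.

```python
import numpy as np

def ht_stationary_status(x, grad, gamma, s):
    """Classify a basic stationary point per Definition 4 (helper names ours).
    Returns 'strictly HT-stable', 'HT-stable', or 'HT-unstable'."""
    g = grad(x)
    supp = np.flatnonzero(x)
    assert np.allclose(g[supp], 0.0), "not a basic stationary point"
    # smallest magnitude among the s largest entries of x (zero if ||x||_0 < s)
    min_top = np.sort(np.abs(x))[::-1][:s].min()
    off = np.setdiff1d(np.arange(x.size), supp)
    g_off = np.max(np.abs(g[off])) if off.size else 0.0
    if min_top > gamma * g_off:
        return "strictly HT-stable"
    if min_top >= gamma * g_off:
        return "HT-stable"
    return "HT-unstable"
```

With x = [0, 4, 0, 2], s = 2, and a gradient [1, 0, 5, 0], the same point is HT-unstable for γ = 0.5 (since 2 < 0.5 · 5) but strictly HT-stable for γ = 0.1, showing that stability depends on γ.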
A HT-stable stationary point associated with γ is equivalent to the 1/γ-stationary point of Problem (2) defined in (Beck & Eldar, 2013, Definition 2.3). Thus, by (Beck & Eldar, 2013, Lemma 2.2), x* is a HT-stable point if and only if x* ∈ Hs(x* - γ∇f(x*)). The notion of a HT-unstable stationary point is novel and is a key ingredient in proving Theorem 2. Theorem 2 is the foundation for the proof of part b) of Theorem 3, as well as of Theorem 4, which characterizes the accumulation points of the IHT sequence. In addition, we have introduced another type of stationary point, the strictly HT-stable stationary point, which is a crucial concept for local convergence of the IHT sequence. In the following, we present a result that characterizes a HT-unstable stationary point. In essence, it shows that around a sparse HT-unstable stationary point, whose gradient is zero over the nonzero elements, there always exists a neighborhood within which the function value can be decreased by a fixed amount.

Theorem 2. Suppose f : R^n → R is C¹ and Ls-RSS. Given any 0 < γ ≤ 1/Ls, if a vector x̄ ∈ Cs is such that ∇_{supp(x̄)} f(x̄) = 0 and min{|x̄_i| : i ∈ I_s^{x̄}} < γ∥∇_{(supp(x̄))^c} f(x̄)∥_∞ for some I_s^{x̄}, then there exist a constant ν > 0 and a neighborhood N of x̄ such that f(y) ≤ f(x) - ν for any x ∈ N ∩ Cs and any y ∈ Hs(x - γ∇f(x)).

The above result leads to the following necessary optimality conditions for a global minimizer of Problem (2) in terms of the hard thresholding operator Hs. For the case γ = 1/Ls, i.e., part b), one needs to use Theorem 2; indeed, to the best of our knowledge, no proof of this case has appeared in the literature, and establishing the result in part b) is one of our contributions. For the case γ < 1/Ls it is proven that x* = Hs(x* - γ∇f(x*)). Note that the condition for 0 < γ < 1/Ls has been obtained in (Beck & Eldar, 2013, Theorem 2.2) without using gradient properties of the HT operator.
Theorem 3. Suppose f : R^n → R is Ls-RSS and x* is a global minimizer. Then x* is a HT-stable (or 1/γ-) stationary point for any 0 < γ ≤ 1/Ls. In particular, the following hold:
a) For any 0 < γ < 1/Ls, x* = Hs(x* - γ∇f(x*)).
b) For γ = 1/Ls, x* ∈ Hs(x* - γ∇f(x*)).

The following result shows that any accumulation point of an IHT sequence must be a HT-stable stationary point.

Theorem 4. Let f : R^n → R be an Ls-RSS and C¹ function. Suppose f is bounded below on Cs. Consider an IHT sequence {x^k}_{k≥0} associated with an arbitrary γ ∈ (0, 1/Ls], and let x* be an accumulation point of {x^k}_{k≥0}. Then x* is a HT-stable stationary point of Problem (2).

Remark 3. The above theorem shows that any accumulation point of an IHT sequence is a HT-stable stationary point of Problem (2). Since each HT-stable stationary point must be a basic stationary point, any accumulation point x* of an IHT sequence must satisfy ∇_{supp(x*)} f(x*) = 0 when ∥x*∥0 = s, or ∇f(x*) = 0 when ∥x*∥0 < s.

The following result pertains to the objective function values of HT-stable and HT-unstable stationary points.

Corollary 4. Let f : R^n → R be an Ls-RSS and C¹ function. Suppose that every (nonempty) sub-level set of f contained in Cs is bounded, i.e., for any α ∈ R, {x ∈ Cs | f(x) ≤ α} is bounded (and closed). Consider an arbitrary γ ∈ (0, 1/Ls]. For any HT-unstable stationary point x̄* associated with γ, there exists a HT-stable stationary point x* associated with γ such that f(x̄*) > f(x*).

Based on the above corollary, it is easy to see that if there are finitely many HT-unstable stationary points (which happens when the function is RSC), then there is a HT-stable stationary point x* such that f(x̄*) > f(x*) for any HT-unstable stationary point x̄*. The following result provides sufficient conditions for the convergence of an IHT sequence. Corollary 5 aims to remove any restrictions on the initial condition.
This corollary shows that no matter what initial point in Cs is selected, the IHT sequence will converge to a HT-stable stationary point. We say a HT-stable/unstable stationary point x* associated with γ ∈ (0, 1/Ls] is isolated if there exists a neighborhood N of x* such that N contains no HT stationary point other than x*.

Corollary 5. Let f : R^n → R be an Ls-RSS and C¹ function. Suppose that every (nonempty) sub-level set of f contained in Cs is bounded. Consider an arbitrary γ ∈ (0, 1/Ls]. Assume that
A.1: for any two distinct HT-stable stationary points x* and y* associated with γ, f(x*) ≠ f(y*).
Then, for any x⁰ ∈ Cs, the IHT sequence {x^k}_{k≥0} converges to a HT-stable stationary point associated with γ. This convergence result also holds under the following assumption:
A.2: when 0 < γ < 1/Ls, each HT-stable stationary point associated with γ is isolated.

The following corollary shows that any IHT sequence always "escapes" from a HT-unstable stationary point.

Corollary 6. Let f : R^n → R be an Ls-RSS and C¹ function. Suppose f is bounded below on Cs. For any given γ ∈ (0, 1/Ls] and any HT-unstable stationary point x* associated with γ, there exists a neighborhood N of x* such that for any IHT sequence starting from any x⁰ ∈ Cs, there exists K ∈ ℕ such that x^k ∉ N ∩ Cs for all k ≥ K.

The next result shows the attraction towards a strictly HT-stable stationary point in a neighborhood of such a stationary point. In what follows, for each index subset J with |J| = s, the subspace S_J := {x ∈ R^n | x_{J^c} = 0} associated with J is defined. Clearly, Cs is the union of the S_J's over all J with |J| = s.

Proposition 1. Let f : R^n → R be an Ls-RSS and C¹ function. Suppose f is bounded below on Cs and f is strictly convex on S_J for any index subset J with |J| = s. Let x* be a strictly HT-stable stationary point associated with a given γ ∈ (0, 1/Ls].
Then there exists a neighborhood B of x* such that for every x⁰ ∈ B ∩ Cs, the IHT sequence {x^k}_{k≥0} converges to x*. Moreover, if f is strongly convex on S_J for every index subset J with |J| = s, then for every x⁰ ∈ B ∩ Cs, the IHT sequence {x^k}_{k≥0} converges Q-linearly to x*. Next, we provide an example illustrating the escapability property of HT-unstable points.

5. SIMULATION

To illustrate the theoretical results, including Corollary 3, Theorem 2, the notion of HT-stationary points, Corollary 6 (the escapability property of HT-unstable stationary points), and Proposition 1 (reachability of HT-stable stationary points), we use the quadratic function
f(x) = (1/m) Σ_{i=1}^m (A_{i•}x - y_i)² = (1/m)∥Ax - y∥²,
where A ∈ R^{m×n}, A_{i•} is the i-th row of A, x ∈ R^n is the optimization variable, and y ∈ R^m is the target. This function is both RSS and RSC, so both Corollary 6 and Proposition 1 apply. To better visualize the process, we let m = n = 4 and s = 2. Therefore, there are six HT-stationary points at which the gradient over the nonzero elements is zero. We use PyTorch (Paszke et al., 2019) to generate A and y: setting the random seed to 45966, we draw a 4 × 4 matrix A with standard normal entries and, keeping the same seed, generate y. The restricted Lipschitz constant Ls for this quadratic function is (2/m) × λmax(AᵀA), where λmax is the maximum eigenvalue of AᵀA. Thus, for this choice of A and y, the maximum allowable stepsize is γ = 1/Ls = 0.06. Once γ is fixed, one can determine the stability of each stationary point. Tabulating the six HT-stationary points, with coordinates (x1, x2, x3, x4), their stability status, and their gradient entries (g1, g2, g3, g4), one can see that the gradient entries corresponding to the nonzero elements of each HT-stationary point are zero. Since HT-stationary points are vectors in R⁴, there is no way to show all of them on one 2-d plane. Thus, we use six 2-d planes, each showing two coordinates of the HT-stationary points.
Each 2-d plane contains six points, one for each HT-stationary point, shown in the plane's two specified coordinates. In Figure 1, the points marked with red stars are HT-unstable and the blue ones are HT-stable. For example, the first 2-d plane (first row, first column), with coordinates x1-x2, shows the x1 and x2 coordinates of all six HT-stationary points. On this plane, the first HT-stationary point stands out because it is the only one whose two nonzero elements lie in the x1-x2 coordinates. We can also see three points with x2 = 0, two of which are HT-unstable and one of which is HT-stable; this is clearer if one inspects the x2 column of the tabulated HT-stationary points. It is also clear that there are three HT-unstable points with x1 = 0 on the first 2-d plane. We perturb the nonzero coordinates of all HT-unstable points with zero-mean normal noise with standard deviation σ = 0.5 to create 4,000 different initialization points. These points form four clouds around the HT-unstable points, shown in Figure 2. We then run the IHT algorithm for 400 steps. After 300 steps, all of these initializations have escaped the HT-unstable points and converged to one of the HT-stable stationary points on the x1-x2 or x1-x4 2-d planes; in fact, these two HT-stable stationary points are sparse local minimizers. Figure 3 shows the 300-th step of the IHT algorithm. A video in the supplementary materials shows all 400 steps of the IHT algorithm for the 4,000 runs. These numerical results corroborate our theoretical results: one can easily see the escapability property of HT-unstable stationary points and the reachability of HT-stable stationary points.
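The six HT-stationary points of such an instance can be enumerated by solving one restricted least-squares problem per support of size s. The sketch below does this for a random instance of our own; the paper's exact A and y (drawn in PyTorch with seed 45966) are not reproduced here, and the function and variable names are assumptions.

```python
import numpy as np
from itertools import combinations

def classify_stationary_points(A, y, s):
    """Enumerate the C(n, s) basic stationary points of f(x) = (1/m)||Ax - y||^2
    (one restricted least-squares solve per support J) and label each one as
    HT-stable or HT-unstable for the boundary stepsize gamma = 1/L_s, where L_s
    is bounded by the full Lipschitz constant (2/m) * lambda_max(A^T A)."""
    m, n = A.shape
    grad = lambda x: (2.0 / m) * A.T @ (A @ x - y)
    gamma = m / (2.0 * np.linalg.eigvalsh(A.T @ A).max())
    labels = []
    for J in map(list, combinations(range(n), s)):
        xJ, *_ = np.linalg.lstsq(A[:, J], y, rcond=None)  # normal equations: grad is 0 on J
        x = np.zeros(n)
        x[J] = xJ
        g = grad(x)
        off = [j for j in range(n) if j not in J]
        stable = np.min(np.abs(x[J])) >= gamma * np.max(np.abs(g[off]))
        labels.append((x, "HT-stable" if stable else "HT-unstable"))
    return labels
```

For m = n = 4 and s = 2 this yields the six basic stationary points, each with zero gradient on its support, matching the setup described above; the perturb-and-run experiment then starts IHT from noisy copies of the HT-unstable ones.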

6. CONCLUSION

This paper provides theoretical results that help to understand the IHT algorithm. These include a critical gradient descent property of the hard thresholding (HT) operator, which is used to show that the sequence generated by the IHT algorithm is nonincreasing, so that repeated application yields ever smaller objective values. This property also allows one to study the IHT algorithm when the stepsize is less than or equal to 1/Ls, where Ls > 0 is the Lipschitz constant of the gradient of the objective function. We introduced HT-stable and HT-unstable stationary points and showed that, no matter how close an initialization is to a HT-unstable stationary point, the IHT sequence leaves it; a video of 4,000 independent runs, in which the IHT algorithm is initialized very close to a HT-unstable stationary point, shows the sequences escaping it. This property is used to prove that the IHT sequence converges to a HT-stable stationary point. Also, we established a condition, with respect to γ = 1/Ls, for a HT-stable stationary point that is a global minimizer. Finally, we showed that the IHT sequence always converges if the function values at distinct HT-stable stationary points are distinct, a new assumption that has not appeared in the literature.



[1] ℓ0 is not mathematically a norm: for any norm ∥•∥ and α ∈ R, ∥αθ∥ = |α|∥θ∥, whereas ∥αθ∥0 = |α|∥θ∥0 holds if and only if |α| = 1 (for θ ≠ 0 and α ≠ 0).
[2] A sequence may fail to converge yet still have an accumulation point. For example, 1, -1, 1, -1, . . . is not a convergent sequence, but it has two accumulation points.




Figure 1: Illustration of HT-stationary points on 2-d plains.

Figure 2: Illustration of 4,000 initializations close to four HT-unstable stationary points.

Figure 3: Illustration of 4,000 IHT sequences initialized close to four HT-unstable stationary points after 400 steps (please refer to the video of all 400 steps).

Table 1 is provided to compare our results with those in the literature. It shows what has been done chronologically and demonstrates our results.

Table 1: Comparison of results for the deterministic IHT-type algorithms.

(Table 1, continued) Pan et al. (2017b), βs-RSC row: optimality Ls-stationary; algorithm IIHT; result: the IIHT sequence converges to a local minimizer (Theorem 3.2), and if ∥x*∥0 < s, then x* is a global minimizer. Zhou et al. (2021): when ∥x*∥0 = s, x* ∈ Hs(x* - (1/Ls)∇f(x*)), i.e., Ls-stationary; algorithm NHTP; result: any accumulation point x* of the NHTP sequence is an Ls-stationary point (Theorem 9), and if x* is isolated, the entire sequence converges.

