CONVERGENCE OF THE MINI-BATCH SIHT ALGORITHM

Abstract

The Iterative Hard Thresholding (IHT) algorithm has been studied extensively as an effective deterministic algorithm for solving sparse optimization problems. The IHT algorithm benefits from the information of the batch (full) gradient at each point, and this information is a crucial key for the convergence analysis of the generated sequence. However, this strength becomes a weakness in machine learning and high-dimensional statistical applications, because calculating the batch gradient at each iteration is computationally expensive or impractical. Fortunately, in these applications the objective function has a summation structure that can be exploited to approximate the batch gradient by the stochastic mini-batch gradient. In this paper, we study the mini-batch Stochastic IHT (SIHT) algorithm for solving sparse optimization problems. As opposed to previous works, where an increasing, variable mini-batch size is necessary for the derivations, we fix the mini-batch size for all steps according to a lower bound that we derive. To prove stochastic convergence of the sequence of function values, we first establish a critical sparse stochastic gradient descent property. Using this property, we show that the sequence generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one. Unlike previous work, we do not assume the function to be restricted strongly convex. To the best of our knowledge, in the regime of sparse optimization, this is the first time in the literature that the sequence of stochastic function values is shown to converge with probability one with a mini-batch size that is fixed for all steps.

1. INTRODUCTION

We consider the following sparse optimization problem:

(P):  min f(x, Ξ) := (1/N) Σ_{i=1}^N f^(i)(x, ξ^(i))  s.t.  x ∈ C_s,

where f^(i): R^n × Ξ → R for i = 1, ..., N, Ξ = {ξ^(1), ..., ξ^(N)}, and C_s = {x ∈ R^n | ∥x∥_0 ≤ s} (the sparsity constraint) is the union of finitely many subspaces whose dimension is at most the sparsity level s, with 1 ≤ s < n. The importance of Problem (P) is due to the fact that finding a sparse network whose accuracy is on a par with a dense network amounts to solving a bi-level, constrained, stochastic, nonconvex, and non-smooth sparse optimization problem Damadi et al. (2022). Thus, finding efficient algorithms that solve Problem (P) can be beneficial for compressing deep neural networks. Among algorithms for solving sparse optimization problems, the Iterative Hard Thresholding (IHT) algorithm has been very successful due to the simplicity of its implementation. The IHT algorithm has not only been practically efficient but also shows promising theoretical results. It was originally devised for solving compressed sensing problems in 2008 Blumensath & Davies (2008; 2009). Since then, a large body of literature has studied it from different perspectives. For example, Beck & Eldar (2013); Lu (2014; 2015); Pan et al. (2017); Zhou et al. (2021) consider convergence of iterates, Jain et al. (2014); Liu & Foygel Barber (2020) study the limit of the objective function value sequence, Liu et al. (2017); Zhu et al. (2018) address duality, Zhou et al. (2020); Zhao et al. (2021) extend it to Newton-type IHT, Blumensath (2012); Khanna & Kyrillidis (2018); Vu & Raich (2019); Wu & Bian (2020) address accelerated IHT, and Wang et al. (2019); Bahmani et al. (2013) solve the logistic regression problem using the IHT algorithm.
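To make the finite-sum structure of Problem (P) concrete, the following minimal sketch (our own illustration, assuming a least-squares instance; the data V, y and helper names are illustrative, not from the paper) checks that the batch gradient is the average of the individual per-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 5
V = rng.standard_normal((N, n))
y = rng.standard_normal(N)

# Least-squares instance of Problem (P): f(x) = (1/N) * sum_i (V[i] @ x - y[i])^2.
def grad_i(x, i):
    # Gradient of the i-th summand f^(i)(x) = (V[i] @ x - y[i])^2.
    return 2.0 * (V[i] @ x - y[i]) * V[i]

def batch_grad(x):
    # Closed-form batch gradient of f: (2/N) V^T (Vx - y).
    return (2.0 / N) * V.T @ (V @ x - y)

x = rng.standard_normal(n)
avg = sum(grad_i(x, i) for i in range(N)) / N
```

The average of the individual gradients coincides with the batch gradient, which is exactly the structure the mini-batch approximation exploits.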
Recently, Damadi & Shen (2022) introduced the concept of HT-unstable stationary points (saddle points in the sense of sparse optimization) and showed escapability from HT-unstable stationary points as one of the crucial properties of the IHT algorithm. They also showed Q-linear convergence of the IHT algorithm towards strictly HT-stable stationary points. However, these desirable properties require computing the batch (full) gradient at each iteration, which is computationally expensive or impractical with current GPUs. On the other hand, almost all training of deep neural networks is done using the mini-batch stochastic gradient, a combination of stochastic approximation Robbins & Monro (1951) implemented via the backpropagation algorithm Rumelhart et al. (1986). By taking the mini-batch stochastic approximation, we consider solving Problem (P) using the mini-batch Stochastic Iterative Hard Thresholding algorithm outlined in Algorithm 1.

Algorithm 1 The mini-batch stochastic iterative hard thresholding
Require: x_0 ∈ C_s such that ∥x_0∥_0 ≤ s, a stepsize 0 < γ < 1/L_s, and S_B ∈ N with 1 ≤ S_B ≤ N such that S_B ≥ N / (1 + ((1 − L_sγ)/(1 + L_sγ)) · ((N − 1)/(c/N − 1))) for some c > 0.
1: for k = 0, 1, ... do
2:   Construct B_k by selecting S_B elements from {1, ..., N} uniformly without replacement, so that |B_k| = S_B.
3:   Calculate the stochastic mini-batch gradient G(X_k, Ξ, B_k) = (1/S_B) Σ_{i∈B_k} ∇f^(i)(X_k, ξ^(i)).
4:   X_{k+1} ∈ H_s(X_k − γ G(X_k, Ξ, B_k)).

Similar to practice, where the mini-batch size is fixed beforehand, we fix the mini-batch size at the beginning, which differs from previous work in this area Zhou et al. (2018). Also, to show our theoretical results we work directly with the mini-batch stochastic gradient, in contrast to previous works Chen & Gu (2016); Li et al. (2016) where the batch (full) gradient is used to establish the theoretical results.
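The update loop of Algorithm 1 can be sketched as follows (a minimal illustration under our own assumptions; the per-sample gradient oracle `grad_i` and the function names are ours, not the paper's):

```python
import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries of x and zero out the rest.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

def minibatch_siht(grad_i, N, x0, s, gamma, S_B, iters, rng):
    """Sketch of Algorithm 1: mini-batch stochastic iterative hard thresholding."""
    x = hard_threshold(x0, s)                      # ensure x_0 ∈ C_s
    for _ in range(iters):
        # Step 2: draw S_B indices uniformly without replacement.
        batch = rng.choice(N, size=S_B, replace=False)
        # Step 3: mini-batch stochastic gradient G(x, Ξ, B_k).
        g = sum(grad_i(x, i) for i in batch) / S_B
        # Step 4: gradient step followed by hard thresholding.
        x = hard_threshold(x - gamma * g, s)
    return x
```

On a noiseless sparse least-squares instance with a stepsize below 1/L and a large enough fixed batch, this iteration drives the objective down while keeping every iterate s-sparse.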
As opposed to other works where restricted strong convexity is necessary for deriving convergence results Liang et al. (2020); Zhou et al. (2018), the only assumption we make is restricted strong smoothness of the objective function, not of each individual summand. We also assume that the objective function is bounded below, which is the case for objective functions used in machine learning applications.

SUMMARY OF CONTRIBUTIONS

By considering the mini-batch SIHT Algorithm 1 for Problem (P), we develop the following results:
• We establish a new critical sparse stochastic gradient descent property of the hard thresholding (HT) operator that has not appeared in the literature.
• For a given stepsize 0 < γ < 1/L_s, we find a lower bound on the mini-batch size that guarantees expected descent of the objective function value after hard thresholding.
• Using the sparse stochastic gradient descent property, we show that the sequence generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one.
• We show that for a certain class of functions in Problem (P), where f(x, ξ^(i)) := f^(i)(V_{i•}x) with f^(i): R → R, the sum of the squared norms of the individual gradients restricted to an index set J, i.e., Σ_{i=1}^N ∥∇_J f^(i)∥², evaluated at any point, is proportional to the squared norm of the restricted batch gradient ∥∇_J f∥², where the proportionality constant depends only on the data. Moreover, the dependency of the proportionality constant on the data is restricted to the set J, not the entire data.

2. RELATED WORK

In order to improve the computational efficiency of the IHT algorithm, algorithms based on stochastic hard thresholding exploit the finite-sum structure of Problem (P) Nguyen et al. (2017); Li et al. (2016); Shen & Li (2017). The StoIHT algorithm is introduced in Nguyen et al. (2017), where at each iteration a random element of the sum in Problem (P) is drawn and the associated gradient is calculated; essentially, the gradient is approximated by a mini-batch stochastic gradient of size one. The StoIHT algorithm defines a sparse subspace and then projects the updated vector onto it. To show the theoretical results in Nguyen et al. (2017), the restricted strong smoothness condition is required for each individual function in Problem (P), as well as restricted strong convexity of the objective function. In addition, the StoIHT algorithm needs the restricted condition number to be at most 4/3, which is hard to meet in practice. The stochastic variance reduced gradient hard thresholding (SVRG-HT) algorithm Li et al. (2016); Shen & Li (2017) tries to mitigate the variance at the cost of calculating the (batch) full gradient at some stages; this batch-gradient information is the key to reducing the variance. Similar to the StoIHT algorithm, the SVRG-HT algorithm requires the restricted strong smoothness condition for each individual function in Problem (P) as well as restricted strong convexity of the objective function. The Accelerated Stochastic Block Coordinate Gradient Descent with Hard Thresholding (ASBCDHT) algorithm in Chen & Gu (2016) is a randomized version of the StoIHT algorithm and suffers from the same drawbacks, i.e., calculating the full gradient and requiring the restricted strong conditions. The Hybrid Stochastic Gradient Hard Thresholding (HSG-HT) algorithm in Zhou et al. (2018) is a variant of stochastic IHT algorithms that uses a mini-batch stochastic gradient at each step.
However, from the theoretical perspective, the mini-batch size has to increase as the algorithm progresses. This makes the algorithm almost deterministic in calculating the gradient and defeats the purpose of using the mini-batch stochastic gradient. The stochastically controlled stochastic gradients (SCSG-HT) algorithm in Liang et al. (2020) uses mini-batch stochastic gradients with a large batch size, as opposed to the SVRG-HT and ASBCDHT algorithms, to reduce the variance with less computation, i.e., without calculating the batch gradient at some steps. We present the mini-batch stochastic IHT algorithm and show that the stochastic sequence of function values is a supermartingale and converges with probability one. To show our result, we assume the objective function has the restricted strong smoothness property and is bounded below, which is the case for objective functions used in machine learning applications. Also, to the best of our knowledge, in the regime of sparse optimization, this is the first time in the literature that the sequence of stochastic function values is shown to converge with probability one with a mini-batch size that is fixed for all steps.

3. DEFINITIONS

We provide some definitions that will be used throughout the paper.

Definition 1 (Restricted Strong Smoothness (RSS)). A differentiable function f: R^n → R is said to be restricted strongly smooth with modulus L_s > 0, or L_s-RSS, if

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L_s/2)∥y − x∥²  for all x, y ∈ R^n such that ∥x∥_0 ≤ s and ∥y∥_0 ≤ s. (2)

Definition 2 (The HT operator). The HT operator H_s(·) denotes the orthogonal projection onto the union of subspaces of R^n of dimension s, where 1 ≤ s < n, that is,

H_s(x) ∈ argmin_{∥z∥_0 ≤ s} ∥z − x∥². (3)

Claim 1. The HT operator keeps the s largest entries of its input in absolute value. For a vector x ∈ R^n, I_s^x ⊂ {1, ..., n} denotes a set of indices corresponding to s largest elements of x in absolute value. For example, H_2([1, −3, 1]^⊤) is either [0, −3, 1]^⊤ or [1, −3, 0]^⊤, with I_2^y = {2, 3} or I_2^y = {1, 2}, respectively. Therefore, the output of the HT operator may not be unique. This clearly shows why the HT operator is not a convex operator and why there is an inclusion in (3) rather than an equality.

Definition 3 (Convergence with probability one). A random sequence (x_k ∈ R^n) in a sample space Ω converges to a random variable x* with probability one if P(ω ∈ Ω : lim_{k→∞} ∥x_k(ω) − x*∥ = 0) = 1.
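The non-uniqueness in the example of Claim 1 can be reproduced numerically (our own sketch; the helper name is hypothetical):

```python
import numpy as np

def hard_threshold(x, s):
    # Keep the s entries of largest absolute value; ties are broken
    # arbitrarily by the sort, which is exactly why H_s is set-valued.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

y = hard_threshold(np.array([1.0, -3.0, 1.0]), 2)
# y must be one of the two minimizers from Claim 1: [0,-3,1] or [1,-3,0];
# which one is returned depends on how the sort breaks the tie between
# the two entries of absolute value 1.
```

Any concrete implementation silently picks one element of the set H_s(x); the analysis in the paper is careful to hold for every such selection.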

4. RESULTS

We consider solving Problem (P) using the mini-batch SIHT Algorithm 1 and develop results that guarantee the convergence of the sequence of function values generated by the SIHT algorithm. We present our results in two subsections. The first provides stochastic results characterizing expectations of functions involving the sample average of given vectors. In the subsequent subsection, we use those results to prove Theorem 3, which establishes a stochastic gradient result that is the foundation for the convergence of the function value sequence.

4.1. STOCHASTIC RESULTS FOR SAMPLE AVERAGE

In this subsection, we consider a sample average whose elements are drawn uniformly and without replacement. We first prove Lemma 2, which calculates the expected value of the squared norm of the sample average based on the covariance matrix of a random vector whose elements are Bernoulli random variables indicating membership in the sample average. Next, in Corollary 1, we use Lemma 2 to calculate the expected squared distance between the sample average and the overall average. This result is extended in Theorem 1, where this expectation is expressed in terms of the individual vectors and the overall average. We start with the following well-known lemma.

Lemma 1 (Mathai & Provost (1992)). Let Λ ∈ R^{n×n} be a deterministic matrix and ξ ∈ R^n be a random vector distributed according to some probability distribution P. Then

E_ξ[ξ^⊤ Λ ξ] = trace(Λ Cov(ξ)) + E_ξ[ξ]^⊤ Λ E_ξ[ξ].

To invoke the above lemma, notice that one can define a random vector whose elements are Bernoulli random variables indicating whether the associated vector is included in the sample average or not. Thus we prove the following lemma.

Lemma 2. Let g^(1), ..., g^(N) ∈ R^n be N deterministic vectors and B ⊆ {1, ..., N} be a random set. Let ḡ := (1/N) Σ_{i=1}^N g^(i), Ḡ(B) := (1/|B|) Σ_{i∈B} g^(i), G := [g^(1) ··· g^(N)] ∈ R^{n×N}, and z(B) = [z_1(B), ..., z_N(B)]^⊤, where z_i(B) is a Bernoulli random variable such that z_i(B) = 1 if i ∈ B and z_i(B) = 0 otherwise, for i = 1, ..., N. Assume E_B[Ḡ(B)] = ḡ. Then, for any random set B with fixed size |B|, the following holds:

E_B ∥Ḡ(B)∥² = (1/|B|²) trace(G^⊤ G Cov(z(B))) + ∥ḡ∥². (5)

Once the above result is established, it is straightforward to show the following by observing that the sample average is an unbiased estimator of the overall average, i.e., E_B[Ḡ(B)] = ḡ.

Corollary 1. Assume all the assumptions in Lemma 2 hold.
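Identity (5) can be checked numerically by enumerating all equally likely subsets B of a fixed size (a sketch on our own small random instance; variable names are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, n, b = 5, 3, 2
G = rng.standard_normal((n, N))          # columns are g^(1), ..., g^(N)
gbar = G.mean(axis=1)                    # overall average ḡ

subsets = list(combinations(range(N), b))
Z = np.zeros((len(subsets), N))          # one Bernoulli indicator vector per subset
for k, B in enumerate(subsets):
    Z[k, list(B)] = 1.0

# Left side: direct expectation of ||Ḡ(B)||^2 over the equally likely subsets.
lhs = np.mean([np.sum(G[:, list(B)].mean(axis=1) ** 2) for B in subsets])

# Right side of (5): (1/|B|^2) trace(G^T G Cov(z)) + ||ḡ||^2, with Cov(z)
# computed exactly as the population covariance over the enumerated subsets.
cov = np.cov(Z, rowvar=False, bias=True)
rhs = np.trace(G.T @ G @ cov) / b**2 + np.sum(gbar**2)
```

Because the identity follows from Lemma 1 with Λ = G^⊤G, the two sides agree to machine precision, not just approximately.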
Then, for any random set B with fixed size |B|, the following holds:

E_B ∥Ḡ(B) − ḡ∥² = (1/|B|²) trace(G^⊤ G Cov(z(B))). (6)

Finally, we use the above results to prove the following, which expresses the expected squared distance between the sample average and the overall average in terms of the individual vectors and the overall average. The following result is critical because, as we will see later, Equation (7) connects the mini-batch stochastic gradient, the batch gradient, and the individual gradients in Problem (P).

Theorem 1. Assume all the assumptions in Lemma 2 hold. If the elements of the random set B are drawn uniformly and without replacement, then

E_B ∥Ḡ(B) − ḡ∥² = (N − |B|)/(|B|N(N − 1)) · (Σ_{i=1}^N ∥g^(i)∥² − N∥ḡ∥²) = (N − |B|)/(|B|N) · (1/(N − 1)) Σ_{i=1}^N ∥g^(i) − ḡ∥². (7)
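Equation (7) can likewise be verified by exhaustive enumeration of the subsets (our own sketch; the instance is random but fixed by a seed):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, n, b = 6, 4, 3
g = rng.standard_normal((N, n))          # rows are g^(1), ..., g^(N)
gbar = g.mean(axis=0)

subsets = list(combinations(range(N), b))
# Direct expectation of ||Ḡ(B) - ḡ||^2 over all equally likely subsets.
lhs = np.mean([np.sum((g[list(B)].mean(axis=0) - gbar) ** 2) for B in subsets])

# Both closed forms on the right-hand side of (7):
rhs1 = (N - b) / (b * N * (N - 1)) * (np.sum(g**2) - N * np.sum(gbar**2))
rhs2 = (N - b) / (b * N) * np.sum((g - gbar) ** 2) / (N - 1)
```

This is the classical variance formula for sampling without replacement: the factor (N − |B|)/(|B|N) is the finite-population correction times 1/|B|.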

4.2. STOCHASTIC RESULTS FOR HARD THRESHOLDING OPERATOR

The goal of this subsection is to show that the random sequence (f(X_k))_{k≥1} generated by the mini-batch SIHT algorithm converges with probability one. To show this, we prove that the random sequence of function values is a supermartingale, so the expected function value sequence is decreasing. To achieve our goal, we first prove the following lemma, which provides an upper bound on the function value evaluated at a thresholded vector. Notice that the following result does not require the input to be a vector updated by the gradient.

Lemma 3. Let f: R^n → R be in C¹ and L_s-RSS. Then, for a fixed x ∈ C_s with any I_s^x, any 0 < γ ≤ 1/L_s, and any given vector g ∈ R^n, the following holds for any y ∈ H_s(x − γg) with any I_s^y:

f(y) ≤ f(x) − (γ/2)(1 − L_sγ)∥g_{I_s^y}∥² − (γ/2)∥g_{I_s^x}∥² + γ⟨δ_{I_s^y}, g_{I_s^y}⟩ + γ⟨δ_{I\I_s^y}, x_{I\I_s^y}⟩, (8)

where I = I_s^x ∪ I_s^y and δ = g − ∇f(x).

Observe that in the above lemma the vector g can be any vector in R^n; it need not be the gradient or the mini-batch gradient. However, in the following lemma we prove that if g is an unbiased stochastic approximation of the gradient at an arbitrary point, then the following result holds.

Lemma 4. Let f: R^n → R be in C¹ and L_s-RSS. Let g(x, ω) be an unbiased stochastic approximation of the gradient at x ∈ R^n, where ω ∼ D for some distribution D, i.e., E_ω[g(x, ω)] = ∇f(x). Then, for a fixed x ∈ C_s with any I_s^x and 0 < γ ≤ 1/L_s, the following holds for any Y(ω) ∈ H_s(x − γ g(x, ω)) with any I_s^{Y(ω)}:

E_ω[f(Y(ω))] ≤ f(x) − (γ/2)(1 − L_sγ) E_ω[∥g_{I_s^{Y(ω)}}(x, ω)∥²] − (γ/2)∥∇_{I_s^x} f(x)∥² + γ E_ω[∥δ_{I_s^{Y(ω)}}(ω)∥²], (9)

where I(ω) = I_s^x ∪ I_s^{Y(ω)} and δ(ω) = g(x, ω) − ∇f(x).

The following theorem is the climax of our technical results because it establishes a stochastic gradient descent property for the expectation of the function value.
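In the deterministic special case g = ∇f(x) (so δ = 0), the upper bound of Lemma 3 implies plain descent, f(y) ≤ f(x), whenever γ ≤ 1/L_s. A quick sketch on a quadratic (our own example; we use the global smoothness constant, which upper-bounds L_s):

```python
import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries and zero out the rest.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(2)
n, s = 8, 3
A = rng.standard_normal((12, n))
b = rng.standard_normal(12)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()   # global smoothness constant, >= L_s
gamma = 1.0 / L

x = hard_threshold(rng.standard_normal(n), s)   # a fixed sparse point x ∈ C_s
y = hard_threshold(x - gamma * grad(x), s)      # one exact-gradient IHT step
# With δ = 0, the right-hand side of (8) is at most f(x), forcing f(y) <= f(x).
```

The descent follows because y is a closest s-sparse point to x − γ∇f(x), and x itself is s-sparse, so ⟨∇f(x), y − x⟩ ≤ −∥y − x∥²/(2γ); plugging this into the smoothness bound gives f(y) ≤ f(x) for γ ≤ 1/L.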
Later we will see how Inequality (11) is used in Theorem 3 to show that the sequence of function values generated by the mini-batch SIHT algorithm is a supermartingale.

Theorem 2. Let f^(i): R^n × Ξ → R be in C¹ for i = 1, ..., N, let Ξ = {ξ^(1), ..., ξ^(N)} be a given set, and let f(x, Ξ) = (1/N) Σ_{i=1}^N f^(i)(x, ξ^(i)) be an L_s-RSS function. Assume there exists a c > 0 such that

E_J[Σ_{i=1}^N ∥∇_J f^(i)(x, ξ^(i))∥²] ≤ c E_J[∥∇_J f(x, Ξ)∥²] (10)

for all x ∈ R^n and any random index set J ⊆ {1, ..., n} with |J| ≤ s. Let G(x, Ξ, B) = (1/|B|) Σ_{i∈B} ∇f^(i)(x, ξ^(i)) be the mini-batch stochastic gradient at any x ∈ R^n, where B ⊆ {1, ..., N} is a random set of fixed size |B| whose elements are drawn uniformly from {1, ..., N} without replacement. For a fixed 0 < γ < 1/L_s, assume the size of B is fixed such that |B| ≥ N / (1 + ((1 − L_sγ)/(1 + L_sγ)) · ((N − 1)/(c/N − 1))), and let ζ := (N − |B|)/(|B|(N − 1)) for N ≥ 2. Then, for a fixed x ∈ C_s with any I_s^x, the following holds for any Y(B) ∈ H_s(x − γ G(x, Ξ, B)) with any I_s^{Y(B)}:

E_B[f(Y(B), Ξ)] ≤ f(x, Ξ) − (γ/2)∥∇_{I_s^x} f(x)∥² − (γ/2)(1 + L_sγ) ζ (1 − c/N + ((1 − L_sγ)/(1 + L_sγ)) · (1/ζ)) E[∥∇_{I_s^{Y(B)}} f(x, Ξ)∥²], (11)

where 1 − c/N + ((1 − L_sγ)/(1 + L_sγ)) · (1/ζ) ≥ 0.

A crucial assumption for proving Inequality (11) is the one made in Inequality (10). In the following claim, we show that for a certain class of functions such a c > 0 always exists and does not depend on the function; we will prove that for this special class the value of c depends only on the data.

Claim 2. Let the given set Ξ in Problem (P) be defined as Ξ := {V_{1•}, ..., V_{N•}}, where V_{i•} is the i-th row of a given matrix V ∈ R^{N×n}. Then the objective function in Problem (P) can be written as f(x, Ξ) := (1/N) Σ_{i=1}^N f^(i)(V_{i•} x) with f^(i): R → R, and the following holds:

Σ_{i=1}^N ∥∇_J f^(i)(V_{i•} x)∥² ≤ (N² / σ²_min(V I_{J•}^⊤ I_{J•} V^⊤)) · max_{r=1,...,N} ∥(V_{r•}^⊤)_J∥² · ∥∇_J f(x, V)∥², (12)
where J ⊆ {1, ..., n} with |J| ≤ s, I_{J•} ∈ R^{|J|×n} is the restriction of the identity matrix to the rows associated with indices in J, V I_{J•}^⊤ I_{J•} V^⊤ = Σ_{i∈J} V_{•i} V_{•i}^⊤, σ_min(·) is the smallest singular value, V_{•i} is the i-th column of V, and (·)_J is a vector restricted to the indices in J.

Remark 1. The above claim shows that for the class of functions f(x, Ξ) := (1/N) Σ_{i=1}^N f^(i)(V_{i•} x), the constant c > 0 in Theorem 2 always exists and does not depend on the value of x or its gradient, whether the batch (full) gradient or an individual one. An example of a function in this class is the mean squared error loss used for linear regression:

f(x, V) = (1/N)∥Vx − y∥² = (1/N) Σ_{i=1}^N (V_{i•}x − y_i)²,

where V ∈ R^{N×n}, V_{i•} is the i-th row of V, x ∈ R^n is the optimization variable, and y ∈ R^N is the target. The logistic regression loss (binary cross entropy) is another function for which the c > 0 in Inequality (12) always exists, since it can be written as

f(x, V) = (1/N) Σ_{i=1}^N [−y^(i)(V_{i•}x) + log(1 + e^{V_{i•}x})],

where V ∈ R^{N×n} has an all-ones last column, V_{i•} is the i-th row of V, x = [w, b]^⊤ ∈ R^n with w ∈ R^{n−1} and b ∈ R the optimization variables, and y^(i) ∈ {0, 1} for i = 1, ..., N.

Now we can provide a result showing that, by fixing a sparse point, one can use the stochastic mini-batch gradient with the fixed mini-batch size determined in Theorem 2 and decrease the function value in expectation.

Theorem 3. Assume all the assumptions in Theorem 2 hold. Then, for a fixed x ∈ C_s with any I_s^x, the following holds for any Y(B) ∈ H_s(x − γ G(x, Ξ, B)):

E_B[f(Y(B), Ξ) | x] ≤ f(x, Ξ) − (γ/2)∥∇_{I_s^x} f(x)∥². (13)

The above result is the analogue of (Damadi & Shen, 2022, Corollary 1).

Theorem 4. Assume all the assumptions in Theorem 2 hold. Let f be a bounded-below differentiable function and (X_k)_{k≥0} be the stochastic IHT sequence.
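The existence of a data-dependent c for the mean squared error loss can be probed numerically (an illustrative sketch, not the proof of Claim 2; we simply track the ratio from Inequality (10) at random points for one fixed support J):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, s = 10, 6, 3
V = rng.standard_normal((N, n))
y = rng.standard_normal(N)
J = np.array([0, 2, 5])                 # a fixed support with |J| <= s

def grads(x):
    r = V @ x - y                       # residuals V_{i.} x - y_i
    gi = 2.0 * r[:, None] * V           # row i is ∇f^(i)(x) = 2 r_i V_{i.}
    g = gi.mean(axis=0)                 # batch gradient ∇f(x)
    return gi, g

# Ratio sum_i ||∇_J f^(i)||^2 / ||∇_J f||^2 at several random points; for this
# loss the ratio stays bounded by a constant depending only on V and J,
# illustrating that a c satisfying Inequality (10) exists.
ratios = []
for _ in range(20):
    gi, g = grads(rng.standard_normal(n))
    ratios.append(np.sum(gi[:, J] ** 2) / np.sum(g[J] ** 2))
c_emp = max(ratios)
```

The empirical maximum c_emp is only a lower estimate of a valid c; Claim 2 gives a closed-form data-dependent bound.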
Then, (f(X_k, Ξ))_{k≥1} is a supermartingale sequence with respect to the filtration generated by the batches B_0, ..., B_{k−1}, and it converges to a random variable f* with probability one.
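The conclusion of Theorem 4 can be observed empirically: running the mini-batch SIHT iteration on a least-squares instance (our own toy setup; names and data are illustrative) produces a function-value trajectory that trends downward and settles:

```python
import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries and zero out the rest.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(4)
N, n, s, S_B = 20, 10, 3, 15
V = rng.standard_normal((N, n))
x_true = np.zeros(n); x_true[[1, 4, 7]] = [2.0, -1.5, 1.0]
y = V @ x_true                           # noiseless sparse regression target

f = lambda x: np.mean((V @ x - y) ** 2)  # objective of Problem (P)
L = 2.0 * np.linalg.eigvalsh(V.T @ V).max() / N
gamma = 0.5 / L                          # stepsize below 1/L

x = hard_threshold(rng.standard_normal(n), s)
vals = [f(x)]
for _ in range(300):
    B = rng.choice(N, size=S_B, replace=False)          # fixed mini-batch size
    G = 2.0 * (V[B].T @ (V[B] @ x - y[B])) / S_B        # mini-batch gradient
    x = hard_threshold(x - gamma * G, s)
    vals.append(f(x))
# Individual steps may increase f, but the trajectory decreases in
# expectation, which is the supermartingale behavior of Theorem 4.
```

A single run is of course only suggestive; the theorem asserts almost-sure convergence of the whole random sequence.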

5. CONCLUSION

We showed that the stochastic sequence generated by the mini-batch stochastic IHT algorithm is a supermartingale converging with probability one. To show this result, we used a stochastic gradient descent property that we derived by exploiting the structure of the mini-batch stochastic gradient as the sample average of a finite sum.



Footnote 1: C¹ is the class consisting of all differentiable functions whose derivative is continuous.
Footnote 2: In Remark 1, we explain why such a c always exists for widespread objective functions in machine learning applications.




