CONVERGENCE OF THE MINI-BATCH SIHT ALGORITHM

Abstract

The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimization problems. The IHT algorithm benefits from the information of the batch (full) gradient at each point, and this information is crucial for the convergence analysis of the generated sequence. However, this strength becomes a weakness in machine learning and high-dimensional statistical applications, because calculating the batch gradient at each iteration is computationally expensive or impractical. Fortunately, in these applications the objective function has a summation structure that can be exploited to approximate the batch gradient by the stochastic mini-batch gradient. In this paper, we study the mini-batch Stochastic IHT (SIHT) algorithm for solving sparse optimization problems. As opposed to previous works, where an increasing and variable mini-batch size is necessary for the analysis, we fix the mini-batch size according to a lower bound that we derive. To prove stochastic convergence of the objective function values, we first establish a critical sparse stochastic gradient descent property. Using this stochastic gradient descent property, we show that the sequence generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one. Unlike previous work, we do not assume the objective function to be restricted strongly convex. To the best of our knowledge, in the regime of sparse optimization, this is the first result in the literature showing that the sequence of stochastic function values converges with probability one while fixing the mini-batch size for all steps.

1. INTRODUCTION

We consider the following sparse optimization problem:

(P):   min f(x, Ξ) := (1/N) Σ_{i=1}^{N} f^{(i)}(x, ξ^{(i)})   s.t.   x ∈ C_s,

where f^{(i)} : R^n × Ξ → R for i = 1, . . . , N, Ξ = {ξ^{(1)}, . . . , ξ^{(N)}}, and C_s = {x ∈ R^n | ∥x∥_0 ≤ s} (the sparsity constraint) is the union of finitely many subspaces whose dimension is at most the sparsity level s, with 1 ≤ s < n.

The importance of Problem (P) stems from the fact that finding a sparse network whose accuracy is on a par with that of a dense network amounts to solving a bi-level, constrained, stochastic, nonconvex, and nonsmooth sparse optimization problem Damadi et al. (2022). Thus, efficient algorithms for solving Problem (P) can be beneficial for the compression of deep neural networks. Among algorithms for solving sparse optimization problems, the Iterative Hard Thresholding (IHT) algorithm has been very successful due to the simplicity of its implementation. The IHT algorithm is not only practically efficient but also enjoys promising theoretical guarantees. It was originally devised for solving compressed sensing problems Blumensath & Davies (2008; 2009). Since then, a large body of literature has studied it from different perspectives.

Recently, Damadi & Shen (2022) showed the escapability property of HT-unstable stationary points as one of the crucial properties of the IHT algorithm. They also showed Q-linear convergence of the IHT algorithm towards strictly HT-stable stationary points. However, these desirable properties require computing the batch (full) gradient at each iteration, which is computationally expensive or impractical on current GPUs. On the other hand, almost all training of deep neural networks is done using the mini-batch stochastic gradient, which combines stochastic approximation Robbins & Monro (1951) with the backpropagation algorithm Rumelhart et al. (1986). Adopting the mini-batch stochastic approximation, we consider solving Problem (P) using the mini-batch Stochastic Iterative Hard Thresholding algorithm outlined in Algorithm 1.

Algorithm 1 The mini-batch stochastic iterative hard thresholding
Require: x_0 ∈ C_s (i.e., ∥x_0∥_0 ≤ s), a stepsize 0 < γ < 1/L_s, and a mini-batch size S_B ∈ N with S_B ≥ 1 such that S_B ≥ N / (1 + ((1 − L_s γ)/(1 + L_s γ)) · ((N − 1)/c)) for some c > 0.
1: for k = 0, 1, . . . do
2:   Construct B_k by selecting S_B elements from {1, . . . , N} uniformly without replacement, so that |B_k| = S_B.
3:   Calculate the stochastic mini-batch gradient as G(X_k, Ξ, B_k) = (1/S_B) Σ_{i∈B_k} ∇f^{(i)}(X_k, ξ^{(i)}).
4:   X_{k+1} ∈ H_s(X_k − γ G(X_k, Ξ, B_k)).
5: end for
Here H_s denotes the hard thresholding operator, which keeps the s largest-magnitude entries of its argument and sets the remaining entries to zero.

Similar to practice, where the mini-batch size is fixed beforehand, we fix the mini-batch size at the beginning, which is different from previous work Zhou et al. (2018) in this area. Moreover, we derive our theoretical results directly with the mini-batch stochastic gradient, which is different from previous works Chen & Gu (2016); Li et al. (2016) where the batch (full) gradient is used to establish the theoretical results. As opposed to other works where restricted strong convexity is necessary for deriving convergence results Liang et al. (2020); Zhou et al. (2018), the only assumption we make is restricted strong smoothness of the objective function, not of each individual component. We also assume that the objective function is bounded below, which is the case for objective functions used in machine learning applications.
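To make the steps of Algorithm 1 concrete, the following is a minimal NumPy sketch of the iteration. The helper names (hard_threshold, minibatch_siht, grad_i) and the least-squares usage example at the end are illustrative assumptions, not part of the paper; the stepsize and mini-batch size are simply assumed to satisfy the bounds stated in the Require line.

import numpy as np

def hard_threshold(x, s):
    # H_s: keep the s largest-magnitude entries of x and zero out the rest.
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]   # indices of the s largest |x_j|
    out[keep] = x[keep]
    return out

def minibatch_siht(grad_i, x0, N, s, gamma, S_B, num_iters=1000, seed=0):
    # Minimal sketch of Algorithm 1 (mini-batch SIHT).
    # grad_i(x, i): gradient of the i-th component function f^(i) at x.
    # gamma:        stepsize, assumed to satisfy 0 < gamma < 1/L_s.
    # S_B:          fixed mini-batch size, assumed to satisfy the paper's lower bound.
    rng = np.random.default_rng(seed)
    x = hard_threshold(np.asarray(x0, dtype=float), s)
    for _ in range(num_iters):
        # Step 2: sample S_B indices from {1, ..., N} uniformly without replacement.
        batch = rng.choice(N, size=S_B, replace=False)
        # Step 3: stochastic mini-batch gradient G(X_k, Xi, B_k).
        g = np.mean([grad_i(x, i) for i in batch], axis=0)
        # Step 4: gradient step followed by hard thresholding H_s.
        x = hard_threshold(x - gamma * g, s)
    return x

# Illustrative usage with least-squares components f^(i)(x) = 0.5 * (v_i @ x - y_i)^2
# (V, y, and the chosen constants are synthetic, not from the paper).
rng = np.random.default_rng(1)
N, n, s = 200, 50, 5
V, y = rng.standard_normal((N, n)), rng.standard_normal(N)
grad_i = lambda x, i: (V[i] @ x - y[i]) * V[i]
x_hat = minibatch_siht(grad_i, np.zeros(n), N, s, gamma=1e-3, S_B=32)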

SUMMARY OF CONTRIBUTIONS

By considering the mini-batch SIHT Algorithm 1 for Problem (P), we develop the following results:

• We establish a new critical sparse stochastic gradient descent property of the hard thresholding (HT) operator that has not appeared in the literature.

• For a given stepsize 0 < γ < 1/L_s, we find a lower bound on the mini-batch size that guarantees the expected descent of the objective function value after hard thresholding.

• Using the sparse stochastic gradient descent property, we show that the sequence generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one.

• We show that for a certain class of functions in Problem (P), namely functions of the form f^{(i)}(x, ξ^{(i)}) := f^{(i)}(V_{i•} x) where V_{i•} is the i-th row of a data matrix V, the sum of the squared norms of the individual gradients restricted to an index set J, i.e., Σ_{i=1}^{N} ∥∇_J f^{(i)}∥²_2, evaluated at any point is proportional to the squared norm of the batch gradient ∥∇_J f∥²_2, where the proportionality constant depends only on the data. Moreover, the dependency of the proportionality constant on the data is restricted to the set J, not the entire data (the quantities involved are illustrated in the sketch below).
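As a small numerical illustration of the objects in the last bullet, the snippet below instantiates Problem (P) with least-squares component functions f^(i)(x, ξ^(i)) = ½(V_{i•}x − y_i)², an assumed example of the form f^(i)(V_{i•}x) with synthetic V, y, and support set J, and evaluates the two quantities being compared, Σ_i ∥∇_J f^(i)∥²_2 and ∥∇_J f∥²_2. The proportionality constant relating them is the subject of the paper's analysis and is not computed here.

import numpy as np

rng = np.random.default_rng(0)
N, n, s = 200, 50, 5
V = rng.standard_normal((N, n))    # data matrix; row V[i] plays the role of V_{i.}
y = rng.standard_normal(N)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)  # an s-sparse point
J = np.flatnonzero(x)              # index set J: here, the support of x

residual = V @ x - y               # r_i = V_{i.} x - y_i
grads = residual[:, None] * V      # row i is grad f^(i)(x) = r_i * V_{i.}
sum_sq_individual = np.sum(np.linalg.norm(grads[:, J], axis=1) ** 2)  # sum_i ||grad_J f^(i)||^2
batch_grad = grads.mean(axis=0)    # grad f(x) = (1/N) sum_i grad f^(i)(x)
sq_batch = np.linalg.norm(batch_grad[J]) ** 2                          # ||grad_J f||^2

print(sum_sq_individual, sq_batch)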

2. RELATED WORK

In order to improve the computational efficiency of the IHT algorithm, algorithms based on stochastic hard thresholding try to exploit the finite-sum structure of Problem (P) Nguyen et al. (2017); Li et al. (2016).



For example, Beck & Eldar (2013); Lu (2014; 2015); Pan et al. (2017); Zhou et al. (2021) consider convergence of the iterates, Jain et al. (2014); Liu & Foygel Barber (2020) study the limit of the objective function value sequence, Liu et al. (2017); Zhu et al. (2018) address duality, Zhou et al. (2020); Zhao et al. (2021) extend it to Newton-type IHT, Blumensath (2012); Khanna & Kyrillidis (2018); Vu & Raich (2019); Wu & Bian (2020) address accelerated IHT, and Wang et al. (2019); Bahmani et al. (2013) solve the logistic regression problem using the IHT algorithm. Recently, Damadi & Shen (2022) introduced the concept of HT-unstable stationary points (saddle points in the sense of sparse optimization) and analyzed how the IHT algorithm behaves around such points.

