CONVERGENCE OF THE MINI-BATCH SIHT ALGORITHM

Abstract

The Iterative Hard Thresholding (IHT) algorithm has been studied extensively as an effective deterministic algorithm for solving sparse optimization problems. The IHT algorithm exploits the information of the batch (full) gradient at each iterate, and this information is crucial for the convergence analysis of the generated sequence. However, this strength becomes a weakness in machine learning and high-dimensional statistical applications, where calculating the batch gradient at each iteration is computationally expensive or impractical. Fortunately, in these applications the objective function has a summation structure that can be exploited to approximate the batch gradient by a stochastic mini-batch gradient. In this paper, we study the mini-batch Stochastic IHT (SIHT) algorithm for solving sparse optimization problems. In contrast to previous works, where an increasing and variable mini-batch size is necessary for the derivation, we fix the mini-batch size according to a lower bound that we derive. To prove stochastic convergence of the objective function values, we first establish a critical sparse stochastic gradient descent property. Using this property, we show that the sequence generated by the mini-batch SIHT algorithm is a supermartingale and converges with probability one. Unlike previous work, we do not assume the function to be restricted strongly convex. To the best of our knowledge, in the regime of sparse optimization, this is the first time in the literature that the sequence of stochastic function values is shown to converge with probability one while the mini-batch size is fixed for all steps.

1. INTRODUCTION

We consider the following sparse optimization problem:

$$(\mathrm{P}):\quad \min_{x \in C_s} \; f(x, \Xi) := \frac{1}{N}\sum_{i=1}^{N} f^{(i)}\big(x, \xi^{(i)}\big),$$

where $f^{(i)}: \mathbb{R}^n \times \Xi \to \mathbb{R}$ for $i = 1, \dots, N$, $\Xi = \{\xi^{(1)}, \dots, \xi^{(N)}\}$, and $C_s = \{x \in \mathbb{R}^n \mid \|x\|_0 \le s\}$ (the sparsity constraint) is the union of finitely many subspaces whose dimension is at most the sparsity level $s$, with $1 \le s < n$. Problem (P) is important because finding a sparse network whose accuracy is on a par with that of a dense network amounts to solving a bi-level, constrained, stochastic, nonconvex, and non-smooth sparse optimization problem Damadi et al. (2022). Efficient algorithms for solving Problem (P) can therefore benefit the compression of deep neural networks. Among algorithms for solving sparse optimization problems, the Iterative Hard Thresholding (IHT) algorithm has been very successful due to the simplicity of its implementation. The IHT algorithm is not only practically efficient but also shows promising theoretical results. It was originally devised in 2008 for solving compressed sensing problems Blumensath & Davies (2008; 2009). Since then, a large body of literature has studied it from different perspectives.
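To make the setting concrete, the sketch below shows the projection onto $C_s$ (the hard thresholding operator, which keeps the $s$ largest-magnitude entries) and one mini-batch SIHT iteration built on it. This is a minimal illustration, not the paper's implementation; the function names, the list of per-sample gradient callables, and the fixed step size are assumptions made for the example.

```python
import numpy as np

def hard_threshold(x, s):
    """Project x onto C_s = {x : ||x||_0 <= s}: keep the s entries of
    largest magnitude and set the rest to zero."""
    z = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]  # indices of the s largest |x_i|
    z[idx] = x[idx]
    return z

def minibatch_siht_step(x, sample_grads, batch_idx, lr, s):
    """One mini-batch SIHT iteration (sketch): take a gradient step using
    the average gradient over the sampled mini-batch, then hard-threshold.
    `sample_grads` is a hypothetical list of per-sample gradient functions,
    one per f^{(i)}."""
    g = np.mean([sample_grads[i](x) for i in batch_idx], axis=0)
    return hard_threshold(x - lr * g, s)
```

For instance, with per-sample objectives $f^{(i)}(x) = \tfrac{1}{2}\|x - a_i\|^2$ the per-sample gradient is simply $x - a_i$, and repeated `minibatch_siht_step` calls drive the iterate toward an $s$-sparse approximation of the mean of the $a_i$.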

For example, Beck & Eldar (2013); Lu (2014; 2015); Pan et al. (2017); Zhou et al. (2021) consider convergence of the iterates, Jain et al. (2014); Liu & Foygel Barber (2020) study the limit of the objective function value sequence, Liu et al. (2017); Zhu et al. (2018) address duality, Zhou et al. (2020); Zhao et al. (2021) extend it to Newton-type IHT, Blumensath (2012); Khanna & Kyrillidis (2018); Vu & Raich (2019); Wu & Bian (2020) address accelerated IHT, and Wang et al. (2019); Bahmani et al. (2013) solve the logistic regression problem using the IHT algorithm. Recently, Damadi & Shen (2022) introduced the concepts of HT-unstable stationary points (saddle points in the sense of sparse optimization) and

