LEARNING SPARSE GROUP MODELS THROUGH BOOLEAN RELAXATION

Abstract

We introduce an efficient algorithmic framework for learning sparse group models formulated as the natural convex relaxation of a cardinality-constrained program with Boolean variables. We provide theoretical techniques to characterize the equivalent condition under which the relaxation achieves the exact integral optimal solution, as well as a rounding algorithm that produces a feasible integral solution when the optimal relaxation solution is fractional. We demonstrate the power of our equivalent condition by applying it to two ensembles of random problem instances that are challenging and widely used in the literature, and we prove that our method achieves exactness with overwhelming probability and nearly optimal sample complexity. Empirically, we use synthetic datasets to demonstrate that our proposed method significantly outperforms state-of-the-art group sparse learning models in terms of individual and group support recovery when the number of samples is small. Furthermore, we show that our method also outperforms these models in cancer drug response prediction.

1. INTRODUCTION

Sparsity is one of the most important concepts in statistical machine learning, and it is closely connected to the data and computational efficiency, generalizability, and interpretability of the model. Traditional sparse estimation tasks aim at selecting sparse features at the individual level Tibshirani (1996); Negahban et al. (2012). However, in many real-world scenarios, prior knowledge suggests structural properties among the individual features, and leveraging these structures may improve both model accuracy and learning efficiency Gramfort & Kowalski (2009); Kim & Xing (2012). In this paper, we focus on learning sparse group models for intersection-closed group sparsity, where groups of variables are either selected or discarded together. The general task of learning sparse group models has been investigated extensively in the literature, where most prior studies are based on structured sparsity-inducing norm regularization Friedman et al. (2010); Huang et al. (2011); Zhao et al. (2009); Simon et al. (2013), which stems from the Lasso Tibshirani (1996), the traditional and popular technique for sparse estimation at the individual feature level. As reviewed in Bach et al. (2012); Jenatton et al. (2011), the structured sparsity-inducing norm is quite general and can encode structural assumptions such as trees Kim & Xing (2012); Liu & Ye (2010), contiguous groups Rapaport et al. (2008), directed acyclic graphs Zheng et al. (2018), and general overlapping groups Yuan et al. (2011). Another type of approach for learning sparse group models is to view the task as a cardinality-constrained program, where the constraint set encodes the group structures and restricts the number of groups of variables being selected. Baldassarre et al. (2013) investigate the projection onto such a cardinality-constrained set.
However, due to the combinatorial nature of the projection, directly applying projected gradient descent with the projection of Baldassarre et al. (2013) to solve general learning problems with typical loss functions might not yield good results Kyrillidis et al. (2015). Recent work Pilanci et al. (2015) studies the Boolean relaxation of the learning problem with cardinality constraints on the individual variables. That setting can be viewed as a special case of sparse group models in which each group contains only one variable. Both the original work of Pilanci et al. (2015) and several follow-up papers Bertsimas & Parys (2020); Bertsimas et al. (2020) show that the Boolean relaxation empirically outperforms sparse estimation methods using sparsity-inducing norms (Lasso Tibshirani (1996) and elastic net Zou & Hastie (2005)), especially when the sample size is small and the feature dimension is large. However, the results in Pilanci et al. (2015) cannot be applied to sparse group models with arbitrary group structures. To fill this gap, in this paper we study sparse group models through a cardinality-constrained program. We first propose the Boolean relaxation for sparse group models. We further establish an analytical and algorithmic framework for our Boolean relaxation, which includes a theorem stating the equivalent condition for the relaxation to achieve exactness (i.e., the optimal integral solution) and a rounding scheme that produces an integral solution when the optimal relaxation solution is fractional. We demonstrate the power of our equivalent condition theorem by applying it to two ensembles of random problem instances that are challenging and widely used in the literature, and by proving that our Boolean relaxation achieves exactness with high probability and nearly optimal sample complexity.
Our contributions are threefold: 1) We propose a novel framework that uses constraints to induce intersection-closed group sparsity. Baldassarre et al. (2013) investigate the projection onto the group sparsity constraints, whereas our framework extends to any convex loss function with the group sparsity constraints. 2) We prove that our framework is tight and achieves exactness with high probability and nearly optimal sample complexity for two ensembles of random problem instances. This result is inspired by Pilanci et al. (2015), but our derivations and proofs are not straightforward extensions (e.g., due to the group structure, we need to analyze more complex feature-group matrices, prove new matrix concentration properties, and carefully choose different regularization parameters). 3) Empirically, we perform extensive experiments on simulated datasets to demonstrate that our framework significantly outperforms state-of-the-art methods when the sample size is small. Furthermore, we show that our framework also outperforms them in cancer drug response prediction.

1.1. RELATED WORKS

Convex programming relaxations and their rounding techniques have been widely used for approximating many combinatorial optimization problems that are computationally intractable (see, e.g., Williamson & Shmoys (2011)). The specific algorithmic technique in this work is inspired by the Boolean relaxation method introduced in Pilanci et al. (2015) for learning sparsity at the individual feature level. However, the additional group structure in our problem raises new algorithmic challenges, and both our Boolean relaxation formulation and its theoretical analysis (e.g., the equivalent condition for exactness) differ from their counterparts in Pilanci et al. (2015). As mentioned before, sparse estimation using structured sparsity-inducing norms has been thoroughly studied for learning structured sparsity under different structural assumptions motivated by various practical scenarios Friedman et al. (2010); Huang et al. (2011); Zhao et al. (2009); Simon et al. (2013); Tibshirani (1996); Bach et al. (2012); Kim & Xing (2012); Liu & Ye (2010); Rapaport et al. (2008); Zheng et al. (2018); Yuan et al. (2011); Jenatton et al. (2011). However, none of these algorithms provides rigorous theoretical techniques, as in this work, to verify whether the algorithm has produced the exact optimal solution. Also, as we will show in the experiments section, our proposed method outperforms these algorithms on both synthetic and real-world datasets. In our experiments, we also compare with the elastic net method Zou & Hastie (2005), which can only control sparsity at the individual feature level. There exists another family of structured sparsity-inducing norms Jacob et al. (2009) that aims to model union-closed families of supports, where the support of the solution is a union of groups.
This differs from our proposed models, in which the support of the solution is the intersection of the complements of some of the groups considered (intersection-closed group sparsity) Jenatton et al. (2009). Another approach is to learn sparse group models by introducing penalty functions for the constraints and applying convex relaxation to them. Bach (2010) investigates the design of norms from submodular set functions. El Halabi & Cevher (2015); Halabi et al. (2018) study how to induce group sparsity using tight convex relaxations of linear matrix inequalities and combinatorial penalties. Note that these works use convex regularizers to induce group sparsity, while we use constraints. El Halabi & Cevher (2015); Halabi et al. (2018) study general equivalent conditions that characterize the tightness of their relaxations, while our theoretical results apply to specific distributions for which their general conditions cannot be easily verified. We use different analytical frameworks, and thus the theoretical results cannot be directly compared.

The organization of this section is as follows. In Section 2.1, we introduce the original problem and its exact Boolean representation. In Section 2.2, we propose the Boolean relaxed program and provide the condition under which the relaxed program is guaranteed to have an integral solution and hence be tight. In Section 2.3, we propose a rounding strategy for the case where the relaxed program does not generate an integral solution.

2.1. SPARSE GROUP MODEL AND ITS FORMULATION VIA BOOLEAN CONSTRAINTS

We consider a learning problem for a collection of $n$ samples $\{(x_i, y_i) \in \mathbb{R}^d \times \mathcal{Y}\}_{i=1}^n$ and define the design matrix as $X \in \mathbb{R}^{n \times d}$, where $x_i^\top$ is the $i$-th row of $X$. This setup is flexible enough to model various problems including binary classification (where the label space $\mathcal{Y} = \{-1, +1\}$) and regression (where the label space $\mathcal{Y} = \mathbb{R}$). For a linear model $x \mapsto w^\top x$, our goal is to learn a sparse weight vector $w \in \mathbb{R}^d$ whose support encodes certain structures reflecting the relationships among the features, which are usually defined by prior knowledge. More formally, we need to solve the following mathematical program.

$$P^* = \min_{w \in \Theta} F(w) := \sum_{i=1}^{n} f(w^\top x_i; y_i) + \frac{1}{2}\rho \|w\|_2^2. \tag{1}$$

Here, the loss function $f(\cdot\,; \cdot)$ measures the prediction error of our linear model, where common choices include the squared loss for least-squares regression, the log loss for logistic regression, and the hinge loss for the support vector machine. The regularization term $\frac{1}{2}\rho\|w\|_2^2$ in (1) makes sure that the objective function is strongly convex and therefore has a unique optimal solution $w^* \in \mathbb{R}^d$. Finally, the constraint set $\Theta$ encodes the sparsity requirements for both individual features and groups of features. We use $g_i$ to denote the set of indices of the features in the $i$-th group, and for any vector $w \in \mathbb{R}^d$, we use $w_{g_i}$ to denote the vector containing all entries of $w$ corresponding to the indices in $g_i$. We further assume that we have $b$ predefined groups, so the cardinality constraint set $\Theta$ can be written as

$$\Theta = \left\{ w \in \mathbb{R}^d \;\middle|\; \|w\|_0 \le k, \ \sum_{j=1}^{b} \mathbb{1}\left[\|w_{g_j}\|_0 > 0\right] \le h \right\}, \tag{2}$$

where $\|\cdot\|_0$ denotes the $\ell_0$ norm and $\mathbb{1}[\cdot]$ denotes the indicator variable that takes the value 1 when the corresponding condition holds and 0 otherwise. The first constraint enforces that the number of contributing features is at most $k$, and the second constraint makes sure that the number of groups containing those selected features is at most $h$.
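As a small illustration of the constraint set in equation 2, the following Python sketch checks whether a given weight vector satisfies both cardinality constraints. The function name `in_theta` and the tolerance argument `tol` are our own illustrative choices and are not part of the formulation above; groups are passed as lists of feature indices and may overlap.

```python
import numpy as np

def in_theta(w, groups, k, h, tol=0.0):
    """Check membership of w in the constraint set Theta of equation 2.

    groups: list of index lists (hypothetical representation of g_1..g_b).
    tol:    entries with |w_i| <= tol are treated as zero (an assumption
            made for numerical convenience, not part of the definition).
    """
    w = np.asarray(w, dtype=float)
    support = np.abs(w) > tol
    # Feature-level constraint: ||w||_0 <= k
    n_features = int(support.sum())
    # Group-level constraint: number of groups touching the support <= h
    n_groups = sum(1 for g in groups if support[list(g)].any())
    return n_features <= k and n_groups <= h
```

Note that with overlapping groups a single non-zero coordinate can activate several groups at once, which is why the group count is computed per group rather than per feature.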
We also remark that the structured sparsity constraints defined by $\Theta$ in equation 2 are very flexible. First, the constraint $\|w\|_0 \le k$ imposes the feature-level sparsity requirement and encompasses the unstructured sparsity model (as investigated in Pilanci et al. (2015)) as a special case. Second, the group-level sparsity constraint $\sum_{j=1}^{b} \mathbb{1}[\|w_{g_j}\|_0 > 0] \le h$ covers the needs for structured sparsity arising from many practical scenarios such as neuroimaging Gramfort & Kowalski (2009); Xi et al. (2009), genomic analysis Rapaport et al. (2008); Kim & Xing (2012), and wavelet-based denoising Zhao et al. (2009); Huang et al. (2011). The groups $\{g_i\}_{i \in \{1,2,\ldots,b\}}$ can be arbitrary sets of features and may model not only non-overlapping structured sparsity, where $g_i \cap g_j = \emptyset$ for all $i \ne j$, but also various overlapping patterns including the contiguous pattern, the block pattern, and the hierarchical pattern as reviewed in Bach et al. (2012); Jenatton et al. (2011). Note that the first constraint in equation 2 can be absorbed into the second one, but doing so gives up sparsity control at the individual level. Note also that the structured sparse group learning problem $P^*$ defined in equation 1 involves only real variables. In the following theorem, we show that the problem can be reformulated as a convex program with additional Boolean variables and constraints, which naturally leads to the Boolean relaxation algorithm in the later sections. Theorem 2.1 (Exact representation with Boolean constraints). Suppose that for each $y \in \mathcal{Y}$, the function $t \mapsto f(t; y)$ is closed and convex. The Legendre-Fenchel conjugate of $f$ is $f^*(s; y) := \sup_{t \in \mathbb{R}} \{st - f(t; y)\}$.
Then for any $\rho > 0$, the structured sparse learning problem $P^*$ in equation 1 can be represented by the following Boolean convex program

$$P^* = \min_{(u,z) \in \Gamma} \underbrace{\max_{v \in \mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}}_{H(u)}, \tag{3}$$

where $D(u) := \mathrm{diag}(u)$ is a diagonal matrix with $u \in \mathbb{R}^d$ on its diagonal, and $\Gamma$ is the constraint set for $u$ and $z$, defined as follows:

$$\Gamma = \left\{ (u, z) \;\middle|\; \sum_{i=1}^{d} u_i \le k, \ \sum_{j=1}^{b} z_j \le h, \ u_i \le z_j \ \forall i \in g_j, \ u \in \{0,1\}^d, \ z \in \{0,1\}^b \right\}.$$

The proof of Theorem 2.1 can be found in the supplementary materials, Section A. In the theorem statement, $u$ is the vector of Boolean indicators for the support at the individual feature level and $z$ is the vector of Boolean indicators for the support at the group level. $H(u)$ in equation 3 is convex in $u$ because it is the maximum of a family of functions that are linear in $u$. However, the whole program is still computationally difficult due to the Boolean constraints $u \in \{0,1\}^d$ and $z \in \{0,1\}^b$. In the next subsection, we relax these Boolean constraints, which leads to a convex program that can be efficiently solved for many popular loss functions $f$.

2.2. CONVEX PROGRAM THROUGH BOOLEAN RELAXATION AND THEORETICAL CONDITIONS FOR EXACTNESS

We apply interval relaxation to both Boolean vector variables $u$ and $z$, and obtain the Boolean relaxation for $P^*$ as follows.

$$P_{BR} = \min_{(u,z) \in \Omega} \max_{v \in \mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}, \tag{4}$$

where

$$\Omega = \left\{ (u, z) \;\middle|\; \sum_{i=1}^{d} u_i \le k, \ \sum_{j=1}^{b} z_j \le h, \ u_i \le z_j \ \forall i \in g_j, \ u \in [0,1]^d, \ z \in [0,1]^b \right\}.$$

$P_{BR}$ is a convex program and can be solved by the sub-gradient-based optimization algorithm Nesterov (2009) if the inner maximization problem can be solved efficiently. In general, $P_{BR}$ can also be converted into a minimax optimization problem and solved by the methods in Lin et al. (2020). We now investigate when $P_{BR}$ achieves the exact solution of $P^*$, under the assumption that the groups are non-overlapping. Note that $P_{BR}$ is a relaxation of $P^*$ as defined in equation 3. The relaxation is exact if and only if the optimal solution of $P_{BR}$ also happens to be integral and therefore feasible in $P^*$. The following theorem (proved in the supplementary materials, Section B.1) characterizes the equivalent condition for exactness.

Theorem 2.2. Suppose that each feature belongs to only one group and the optimal integral solution $(\hat{u}, \hat{z})$ for $P^*$ as defined in equation 3 selects $k$ features and $h$ groups. Then the optimal solution of $P_{BR}$ also recovers $(\hat{u}, \hat{z})$ if and only if there exist non-negative $\lambda_k$ and $\lambda_h$ such that, with

$$v \in \arg\max_{v \in \mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(\hat{u}) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}, \tag{5}$$

the following conditions hold.

1. For each group $i$ such that $\hat{z}_i = 1$, it holds that
$$\forall p \in g_i, \ \hat{u}_p = 1 \Rightarrow (X_p^\top v)^2 > \lambda_k \tag{6}$$
and
$$\forall p \in g_i, \ \hat{u}_p = 0 \Rightarrow (X_p^\top v)^2 \le \lambda_k. \tag{7}$$
2. For each group $i$ such that $\hat{z}_i = 1$, it holds that
$$\sum_{p \in g_i, \ \hat{u}_p = 1} \left( (X_p^\top v)^2 - \lambda_k \right) > \lambda_h. \tag{8}$$
3. For each group $i$ such that $\hat{z}_i = 0$,
$$\sum_{p \in g_i} \max\{ (X_p^\top v)^2 - \lambda_k, \ 0 \} \le \lambda_h. \tag{9}$$

Here, $X_p$ denotes the $p$-th column of the design matrix $X$.

The special case of least-squares regression.
Among all candidate choices of the loss function $f$, the squared loss $f(t; y) = \frac{1}{2}(t - y)^2$ for least-squares regression is the most important and popular one, with many real-world applications Kim et al. (2007); Nguyen & Rocke (2002); Boulesteix & Strimmer (2007); Fort & Lambert-Lacroix (2004), and the corresponding Legendre-Fenchel conjugate becomes $f^*(s; y) = \frac{s^2}{2} + sy$. In this special case of structured sparse learning for least-squares regression, the relaxed convex program $P_{BR}$ takes the following form.

$$L_{BR} = \min_{(u,z) \in \Omega} G(u) := y^\top \left( \frac{1}{\rho} X D(u) X^\top + I \right)^{-1} y. \tag{10}$$

The detailed derivation of equation 10 can be found in the supplementary materials, Section B.2. We let $S$ denote the support of the unique optimal solution $u^*$ to the original Boolean program

$$L^* := \min_{(u,z) \in \Gamma} G(u), \tag{11}$$

and, writing $X_S$ for the submatrix of $X$ formed by the columns indexed by $S$, define the $n \times n$ matrix $M$ by

$$M := \left( I_n + \rho^{-1} X_S X_S^\top \right)^{-1}. \tag{12}$$

Now we are ready to apply Theorem 2.2 to the squared loss function and derive the necessary and sufficient condition for the exactness of $L_{BR}$ (assuming non-overlapping groups), as follows.

Corollary 2.3. Suppose that each feature belongs to only one group and the optimal integral solution $(\hat{u}, \hat{z})$ selects $k$ features and $h$ groups. Then $L_{BR} = L^*$ if and only if there exist non-negative $\lambda_k$ and $\lambda_h$ such that the following conditions hold.

1. For each group $i$ such that $\hat{z}_i = 1$, it holds that
$$\forall p \in g_i, \ \hat{u}_p = 1 \Rightarrow (X_p^\top M y)^2 > \lambda_k \tag{13}$$
and
$$\forall p \in g_i, \ \hat{u}_p = 0 \Rightarrow (X_p^\top M y)^2 \le \lambda_k. \tag{14}$$
2. For each group $i$ such that $\hat{z}_i = 1$, it holds that
$$\sum_{p \in g_i, \ \hat{u}_p = 1} \left( (X_p^\top M y)^2 - \lambda_k \right) > \lambda_h. \tag{15}$$
3. For each group $i$ such that $\hat{z}_i = 0$,
$$\sum_{p \in g_i} \max\{ (X_p^\top M y)^2 - \lambda_k, \ 0 \} \le \lambda_h. \tag{16}$$

The proof of Corollary 2.3 can be found in the supplementary materials, Section B.3. Corollary 2.3 creates an analysis framework for the exactness of the Boolean relaxation $L_{BR}$ in which one only has to construct two scalars $\lambda_k$ and $\lambda_h$ and verify the conditions in equations 13 through 16. In Section 3, we follow this framework to theoretically prove the exactness of $L_{BR}$ for several classes of problem instances that are widely studied in the literature, demonstrating the power of Corollary 2.3.
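To make the verification step of Corollary 2.3 concrete, the following Python sketch computes the scores $(X_p^\top M y)^2$ and checks the three conditions for a candidate pair $(\lambda_k, \lambda_h)$. It is only a sketch under stated assumptions: the function names `exactness_scores` and `check_conditions` are hypothetical, groups are assumed non-overlapping and given as index lists, and $M$ is taken to be the inverse $(I_n + \rho^{-1} X_S X_S^\top)^{-1}$, so that $-My$ is the maximizer of the inner problem in equation 5 for the squared loss. The sketch verifies a given pair of scalars; it does not search for one.

```python
import numpy as np

def exactness_scores(X, y, u_hat, rho):
    """Return the vector of scores (X_p^T M y)^2 for every feature p."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    S = np.flatnonzero(np.asarray(u_hat) > 0.5)      # support of u_hat
    XS = X[:, S]
    A = np.eye(X.shape[0]) + XS @ XS.T / rho         # A = M^{-1} (assumption)
    My = np.linalg.solve(A, y)                       # M y without forming M
    return (X.T @ My) ** 2

def check_conditions(scores, u_hat, z_hat, groups, lam_k, lam_h):
    """Check the three conditions of Corollary 2.3 for given lam_k, lam_h."""
    selected = np.asarray(u_hat) > 0.5
    for j, g in enumerate(groups):
        g = np.asarray(g)
        s, sel = scores[g], selected[g]
        if z_hat[j] > 0.5:
            # Condition 1: selected features above lam_k, the rest at or below
            if not (np.all(s[sel] > lam_k) and np.all(s[~sel] <= lam_k)):
                return False
            # Condition 2: total excess of the selected features exceeds lam_h
            if not (np.sum(s[sel] - lam_k) > lam_h):
                return False
        else:
            # Condition 3: total positive excess of an unselected group <= lam_h
            if not (np.sum(np.maximum(s - lam_k, 0.0)) <= lam_h):
                return False
    return True
```

For example, with $X = I_4$, $y = (3, 2, 0.1, 0.1)$, $\rho = 1$, two groups $\{1,2\}$ and $\{3,4\}$, and the first group fully selected, the scores are $(2.25, 1, 0.01, 0.01)$, and the pair $\lambda_k = 0.5$, $\lambda_h = 1$ satisfies all conditions.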

2.3. RANDOMIZED ROUNDING

When the Boolean relaxation is not exact (i.e., the optimal solution of the Boolean relaxation turns out to be fractional), we describe in this section a rounding method to recover an integral solution. We apply randomized rounding, a state-of-the-art technique for converting fractional solutions to integer solutions with provable approximation guarantees Pilanci et al. (2015), to the solution of the relaxed problem $P_{BR}$. Given the fractional solution $\bar{u} \in [0,1]^d$ and $\bar{z} \in [0,1]^b$, our goal is to recover a feasible Boolean solution $\tilde{u} \in \{0,1\}^d$ and $\tilde{z} \in \{0,1\}^b$. For simplicity of exposition, we show the rounding scheme for the case when each feature belongs to exactly one group. However, the algorithm can be easily generalized to the case of overlapping groups. In our rounding algorithm, we first generate a feasible Boolean solution at the group level, $\tilde{z} \in \{0,1\}^b$. For each group $j$, we independently set $\tilde{z}_j$ according to the following probability distribution:

$$\Pr[\tilde{z}_j = 1] = \bar{z}_j \quad \text{and} \quad \Pr[\tilde{z}_j = 0] = 1 - \bar{z}_j. \tag{17}$$

Once the groups are decided, $\tilde{u}_i$ is set to 0 for each feature $i$ that belongs to a non-selected group. For each selected group $g_j$ and for each feature $i$ that belongs to the group, we independently set $\tilde{u}_i$ according to the following probability distribution:

$$\Pr[\tilde{u}_i = 1] = \frac{\bar{u}_i}{\bar{z}_j} \quad \text{and} \quad \Pr[\tilde{u}_i = 0] = 1 - \frac{\bar{u}_i}{\bar{z}_j}. \tag{18}$$

It is easy to verify that the Boolean solution generated by the method above matches the fractional solution in expectation, i.e., $\mathbb{E}[\tilde{z}] = \bar{z}$ and $\mathbb{E}[\tilde{u}] = \bar{u}$, and furthermore their expected $\ell_0$ norms are given by

$$\mathbb{E}[\|\tilde{z}\|_0] = \sum_{j=1}^{b} \Pr[\tilde{z}_j = 1] = \sum_{j=1}^{b} \bar{z}_j \le h,$$

and

$$\mathbb{E}[\|\tilde{u}\|_0] = \sum_{i=1}^{d} \left( \Pr[\tilde{u}_i = 1, \tilde{z}_j = 1] + \Pr[\tilde{u}_i = 1, \tilde{z}_j = 0] \right) = \sum_{i=1}^{d} \bar{u}_i \le k,$$

where $j$ denotes the group containing feature $i$.
With these expectation bounds in hand, applying standard concentration inequalities, we can show that if we let $G = \max_j |g_j|$ be the size of the largest group, then for any $\delta \in (0, 1/3)$, with probability at least $1 - \exp(-\Omega(h\delta^2)) - \exp(-\Omega(k^2\delta^2/(bG^2)))$, we have that $\|\tilde{z}\|_0 \le (1+\delta)h$ and $\|\tilde{u}\|_0 \le (1+\delta)k$. This means that when the group sizes are relatively small, our randomized rounding scheme produces a nearly optimal solution with high probability. Finally, once we have obtained the integral solution $\tilde{u}$, the weight vector $w$ for the original problem in equation 1 can be computed by $w := \arg\min_{w \in \mathbb{R}^d} F(D(\tilde{u})w)$.

Value guarantees for least-squares regression. For the least-squares loss, we are also able to establish theoretical guarantees for the value (i.e., $H(\tilde{u})$ as defined in equation 3) of our rounding scheme. Without loss of generality, let us assume that the columns of the design matrix $X$ are normalized, i.e., $\|X_p\|_2 \le 1$ for $p \in \{1, 2, \ldots, d\}$, and that $\|y\|_2 = 1$. We have the following theorem, whose proof is in the supplementary materials, Section C.

Theorem 2.4. Let $(\bar{u}, \bar{z})$ be the optimal solution to the relaxed program. Let $r_z$ be the number of fractional entries in $\bar{z}$ and let $r_u$ be the number of fractional entries in $\bar{u}$. Let $(\tilde{u}, \tilde{z})$ be the integral solution returned by our rounding scheme. For any $\delta > 0$, with probability $(1 - \delta)$, it holds that

$$H(\tilde{u}) - P^* \le O\left( \frac{1}{\rho} \left( r_z \log(r_z/\delta)\, G + r_u \log(r_u/\delta) \right) \right).$$
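The two-stage rounding scheme of this section can be sketched in a few lines of Python. This is a minimal sketch for the non-overlapping case, not a reference implementation: the names `randomized_round` and `refit_ridge`, and the array `group_of` mapping each feature to its group index, are our own illustrative assumptions. The refit step is shown only for the squared loss, where minimizing $F$ over the selected support reduces to a ridge regression on the selected columns.

```python
import numpy as np

def randomized_round(u_bar, z_bar, group_of, rng=None):
    """Round a fractional (u_bar, z_bar) to Boolean vectors (u, z).

    group_of[i] is the index of the single group containing feature i
    (hypothetical encoding of non-overlapping groups).
    """
    rng = np.random.default_rng() if rng is None else rng
    u_bar = np.asarray(u_bar, dtype=float)
    z_bar = np.asarray(z_bar, dtype=float)
    group_of = np.asarray(group_of)
    # Stage 1: select group j with probability z_bar[j]
    z = rng.random(z_bar.shape) < z_bar
    # Stage 2: inside a selected group, keep feature i w.p. u_bar[i] / z_bar[j]
    zb = z_bar[group_of]
    ratio = np.divide(u_bar, zb, out=np.zeros_like(u_bar), where=zb > 0)
    ratio = np.clip(ratio, 0.0, 1.0)   # guards against small solver violations
    u = z[group_of] & (rng.random(u_bar.shape) < ratio)
    return u.astype(int), z.astype(int)

def refit_ridge(X, y, u, rho):
    """Squared-loss refit on the rounded support: ridge on selected columns."""
    S = np.flatnonzero(u)
    w = np.zeros(X.shape[1])
    if S.size:
        XS = X[:, S]
        w[S] = np.linalg.solve(XS.T @ XS + rho * np.eye(S.size), XS.T @ y)
    return w
```

By construction every rounded solution satisfies $\tilde{u}_i \le \tilde{z}_j$ for $i \in g_j$, and an already integral input is returned unchanged. The cardinality bounds hold only in expectation (and with high probability up to the factor $1+\delta$), so in practice one may draw several roundings and keep the best feasible one.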

3. THEORETICAL GUARANTEES OF L BR ON ENSEMBLES OF RANDOM INSTANCES

In this section, we apply Corollary 2.3 to prove that our relaxed program is tight and achieves exactness with high probability and nearly optimal sample complexity for two ensembles of random problem instances. We focus on the case of least-squares regression and its corresponding relaxation $L_{BR}$. We introduce two ensembles of random problem instances and theoretically analyze the performance of $L_{BR}$ on them. The first random ensemble has been widely used in the literature to evaluate the $\ell_1$ Sparse Group Lasso algorithms Simon et al. (2013); Friedman et al. (2010). The second random ensemble is designed by us. It is more challenging than the first ensemble because there is more than one "optimal" weight vector $w$ at the feature level, and the algorithm has to identify the one with the most group sparsity. For both ensembles, we show that $L_{BR}$ successfully recovers both the group and feature sparsity with overwhelming probability and almost optimal sample complexity (i.e., the number of observations $n$).

3.1. RANDOM ENSEMBLE I

The first class of random instances can be generated as follows (illustrated in Fig. 1(a)). We first generate the random design matrix $X \in \mathbb{R}^{n \times d}$ with i.i.d. $N(0, 1)$ entries. The $d$ features are divided into $b$ groups, where the size of each group is $d/b$. We construct the regression weight vector $w \in \mathbb{R}^d$ such that the first $h$ groups have non-zero coefficients and the coefficients in the rest of the groups are 0. The number of non-zero coefficients in the $j$-th group (for $j \in \{1, 2, \ldots, h\}$) is $k_j$, and we have that $\sum_{j=1}^{h} k_j = k$, where $k$ is the total number of non-zero coefficients in $w$. For each group $j \in \{1, 2, \ldots, h\}$, we arbitrarily choose $k_j$ features in the group and randomly set each corresponding coefficient in $w$ to $\pm \frac{1}{\sqrt{k}}$. Finally, $y = Xw + \epsilon$, where the noise vector $\epsilon \in \mathbb{R}^n$ has i.i.d. $N(0, \gamma^2)$ entries. The goal is to identify the support of $w$ and the corresponding coefficients. The following theorem provides the theoretical guarantee that, given a sufficient number of observations, $L_{BR}$ recovers the individual and group level sparsity for this random ensemble.

Theorem 3.1. Consider the random instance described above with parameters $(n, d, k, \gamma, b, h)$ and let $y = Xw + \epsilon$ be the observed response vector. Suppose that $\gamma \ge 1$. Let $\rho = n^{1/2+\delta}$ with $\delta \in (0, 1/2)$. With probability at least $1 - d\exp(-\Omega(n^{2\delta}/(\gamma^2 k))) - d\exp(-\Omega(n^{1-2\delta}))$, the relaxed program $L_{BR}$ admits the optimal solution $u^*$ and $z^*$, where $u^*_i = \mathbb{1}[w_i \ne 0]$ and $z^*_j = \mathbb{1}[j \in \{1, 2, \ldots, h\}]$.

We remark that the regularization parameter $\rho$ is set to $n^{1/2+\delta}$ in our theorem, while in contrast, $\rho$ was set to $\sqrt{n}$ in Pilanci et al. (2015) for the sparse learning problem without group constraints. In fact, if we are only looking for $(1 - o(1))$ success probability, we can set $\delta$ to be as large (close to 1/2) as possible. For example, if we set $\delta = 1/2 - \frac{\log\log n}{\log n}$, the success probability is $(1 - o(1))$ as long as $n/(\gamma^2 k \log n) = \omega(1)$, which means that we only need $n = \omega((k/\gamma^2)\log(k/\gamma^2))$ to achieve the $(1 - o(1))$ success probability, which almost matches the information theoretic lower bound Wainwright (2007) up to the logarithmic factor.
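The generation procedure for Random Ensemble I can be sketched as follows. This is an illustrative sketch rather than the exact code used for our experiments: the function name `make_ensemble_one` and its argument names are hypothetical, and where the description above says the $k_j$ features are chosen arbitrarily, the sketch picks them uniformly at random within each group.

```python
import numpy as np

def make_ensemble_one(n, d, b, k_per_group, gamma, rng=None):
    """Draw (X, y, w, groups) from Random Ensemble I.

    k_per_group: list [k_1, ..., k_h]; the first h = len(k_per_group)
                 groups carry the non-zero coefficients.
    """
    rng = np.random.default_rng() if rng is None else rng
    if d % b != 0:
        raise ValueError("d must be a multiple of b")
    size = d // b                          # common group size d/b
    k = int(sum(k_per_group))              # total number of non-zeros
    X = rng.standard_normal((n, d))        # i.i.d. N(0, 1) design
    w = np.zeros(d)
    for j, kj in enumerate(k_per_group):
        # choose k_j positions inside group j (random choice is an assumption)
        idx = j * size + rng.choice(size, size=kj, replace=False)
        w[idx] = rng.choice([-1.0, 1.0], size=kj) / np.sqrt(k)
    y = X @ w + gamma * rng.standard_normal(n)   # noise with variance gamma^2
    groups = [np.arange(j * size, (j + 1) * size) for j in range(b)]
    return X, y, w, groups
```

Since every non-zero coefficient equals $\pm 1/\sqrt{k}$, the generated $w$ always has unit Euclidean norm, which makes the noise level $\gamma$ directly comparable across different values of $k$.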

3.2. RANDOM ENSEMBLE II

We now describe another class of random instances, which is more challenging due to the multiple candidate regression vectors (illustrated in Fig. 2(a)). We first generate a random design matrix $X \in \mathbb{R}^{n \times d}$ with two candidate $k$-sparse cross-fit regression vectors $w^{(1)}, w^{(2)} \in \mathbb{R}^d$, such that both vectors lead to the same expected response (i.e., $Xw^{(1)} = Xw^{(2)}$). The response vector is generated by $y = Xw^{(1)} + \epsilon$, where $\epsilon \in \mathbb{R}^n$ has i.i.d. $N(0, \gamma^2)$ entries. The $d$ feature dimensions are divided into $b$ groups of the same size $d/b$ (for simplicity we assume that $d$ is a multiple of $b$). The groups are arranged so that the non-zero coefficients of $w^{(1)}$ span $h$ groups (where $h \cdot d/b \ge k$), and the non-zero coefficients of $w^{(2)}$ span $\zeta h$ groups ($\zeta > 1$). Given $(X, y)$ (after randomly permuting the indices of the coordinates and the groups), the goal is to identify the support of $w^{(1)}$, since it is sparser at the group level.

We next describe how we generate $w^{(1)}$ and $w^{(2)}$ and design the groups. For any vector $w \in \mathbb{R}^k$ with no zero entries, we let the first $k$ entries of $w^{(1)}$ be filled by $w$ and the rest by 0; we also let the $(k+1)$-th to the $2k$-th entries of $w^{(2)}$ be filled by $w$ and the rest by 0. We let each of the first $h$ groups contain the same number $k/h$ of non-zero coordinates of $w^{(1)}$, and let each of the next $\zeta h$ groups contain the same number $k/(\zeta h)$ of non-zero coordinates of $w^{(2)}$. We finally fill the $b$ groups with the coordinates numbered from $(2k+1)$ to $d$ so that each group has the same size $d/b$.

We next describe how we generate the design matrix $X$. We first generate a random orthogonal matrix $Q$ such that $Qw = w$. This can be done by first fixing an arbitrary orthogonal matrix with $\frac{w}{\|w\|_2}$ as its first column (i.e., letting $P = [\frac{w}{\|w\|_2}, \beta_1, \beta_2, \ldots, \beta_{k-1}]$), generating a random $(k-1) \times (k-1)$ orthogonal matrix $Q'$, and finally letting $Q = P \cdot \mathrm{diag}(1, Q') \cdot P^\top$. We then generate the matrices $X_1 \in \mathbb{R}^{n \times k}$ and $X_3 \in \mathbb{R}^{n \times (d-2k)}$ with i.i.d. $N(0, 1)$ entries. Let $X_2 = X_1 Q$. Since $Q$ is orthogonal, it is easy to see that in the marginal distribution of $X_2$, each entry is also i.i.d. $N(0, 1)$. We finally let $X = [X_1, X_2, X_3]$. One can verify that $Xw^{(1)} = X_1 w$ as well as $Xw^{(2)} = X_1 Q w = X_1 w$. For the theoretical guarantee of $L_{BR}$ on our second random ensemble, we have the following theorem.

Theorem 3.2. Let $X = [X_1, X_2, X_3]$ and $y = Xw^{(1)} + \epsilon$ be a random instance described above with parameters $(n, d, k, \gamma, b, h, \zeta, w)$. Suppose there exists $\xi > 0$ such that $\xi \le |w_i| \le \zeta^{1/4}\xi$ for all $i \in \{1, 2, \ldots, k\}$. Also suppose that $\gamma \ge 1$. Let $\rho = n^{1/2+\delta}$ with $\delta \in (0, 1/2)$. For a large enough constant $\zeta$, with probability at least $1 - d\exp(-\Omega(n^{2\delta}\xi^2/\gamma^2)) - d\exp(-\Omega(n^{1-2\delta}))$, the relaxed program $L_{BR}$ admits the optimal solution $u^*$ and $z^*$, where $u^*_i = \mathbb{1}[w^{(1)}_i \ne 0]$ and $z^*_g = \mathbb{1}[\exists i \in g : w^{(1)}_i \ne 0]$. Here, we use $g$ to denote both the index of a group and the set of the features included in the group.

We first remark that the smallest possible value of $\zeta$ for the theorem to hold can be made arbitrarily close to 1 (but greater than 1). This relaxation would only affect the constant coefficients in the $\Omega(\cdot)$ notations in the success probability bound. Also, similarly to the remark in Section 3.1, we may set $\delta = 1/2 - \frac{\log\log n}{\log n}$. The success probability is $(1 - o(1))$ as long as $n\xi^2/(\gamma^2 \log n) = \omega(1)$. Since $\xi^2$ is usually $\Theta(1/k)$, again, we only need $n = \omega((k/\gamma^2)\log(k/\gamma^2))$ to achieve the $(1 - o(1))$ success probability, almost matching the information theoretic lower bound Wainwright (2007) up to the logarithmic factor.
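The construction of the orthogonal matrix $Q$ with $Qw = w$ and of the design matrix $X = [X_1, X_2, X_3]$ can be sketched as follows. This is a minimal sketch under our own assumptions: the helper names are hypothetical, the basis $P$ is obtained from a QR factorization of a matrix whose first column is $w$ (so its first column is $\pm w/\|w\|_2$, and the sign cancels in $P \cdot \mathrm{diag}(1, Q') \cdot P^\top$), and the random orthogonal block $Q'$ is drawn by the usual sign-corrected QR of a Gaussian matrix. The group assignment and the final random permutation are omitted.

```python
import numpy as np

def random_orthogonal(m, rng):
    """Random m x m orthogonal matrix via sign-corrected QR of a Gaussian."""
    q, r = np.linalg.qr(rng.standard_normal((m, m)))
    return q * np.sign(np.diag(r))         # flip column signs; still orthogonal

def orthogonal_fixing(w, rng):
    """Random orthogonal Q with Q w = w, built as P diag(1, Q') P^T."""
    k = w.size
    A = rng.standard_normal((k, k))
    A[:, 0] = w
    P, _ = np.linalg.qr(A)                 # first column of P is +-w/||w||
    core = np.eye(k)
    core[1:, 1:] = random_orthogonal(k - 1, rng)
    return P @ core @ P.T

def make_design_two(n, d, w, rng=None):
    """Return (X, w1, w2, Q) for Random Ensemble II, without the groups."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    k = w.size
    Q = orthogonal_fixing(w, rng)
    X1 = rng.standard_normal((n, k))
    X3 = rng.standard_normal((n, d - 2 * k))
    X = np.hstack([X1, X1 @ Q, X3])        # X2 = X1 Q
    w1 = np.zeros(d); w1[:k] = w           # first k coordinates
    w2 = np.zeros(d); w2[k:2 * k] = w      # coordinates k+1 .. 2k
    return X, w1, w2, Q
```

A quick sanity check on any draw is that `Q @ Q.T` is the identity, `Q @ w` equals `w`, and `X @ w1` equals `X @ w2` up to floating-point error, which is exactly the property that makes the two candidate vectors indistinguishable from the response alone.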

4. EXPERIMENTS

In this section, we perform extensive experiments to investigate the effectiveness of the proposed sparse group models under the setting of $\ell_2^2$-regularized least-squares regression specified in equation 10. We use both simulated datasets (non-overlapping groups) introduced in Sections 3.1 and 3.2 and a real-world application (overlapping groups) in cancer to evaluate the performance. To assess the performance on simulated data, we compute the recovery accuracy of both individual supports and group supports, which are defined as $A_I(w)$ and $A_G(w)$, respectively:

$$A_I(w) := \frac{|\{i : w_i \ne 0, \ w^{\text{true}}_i \ne 0\}|}{|\{i : w^{\text{true}}_i \ne 0\}|}, \qquad A_G(w) := \frac{|\{j : w_{g_j} \ne 0, \ w^{\text{true}}_{g_j} \ne 0\}|}{|\{j : w^{\text{true}}_{g_j} \ne 0\}|}. \tag{19}$$

Here $w^{\text{true}}_i$ is the $i$-th element of the ground-truth vector $w^{\text{true}}$ and $w^{\text{true}}_{g_j}$ is the weight vector for group $g_j$. We apply the projected Quasi-Newton method Schmidt et al. (2009) to efficiently solve equation 10. The details related to the optimization can be found in the supplementary materials, Section E. If the solution of equation 10 is not integral, we use the rounding scheme proposed in Section 2.3. For the non-overlapping setting, we compare the performance of our method with Sparse Group Lasso (SGL) Simon et al. (2013), Sparse G-group cover (SGCover) El Halabi & Cevher (2015), and SGL$_\infty$ (group level sparsity using the $\ell_\infty$ norm and individual level sparsity using the $\ell_1$ norm); for the overlapping group setting, we compare with SGL-Overlap Yuan et al. (2011) and SGCover El Halabi & Cevher (2015). We also compare our method with Elastic Net (ENet) Zou & Hastie (2005), which is the state-of-the-art method for detecting sparse features at the individual level. All experiments are run on a computer with an 8-core 3.7 GHz Intel CPU and 32 GB of RAM.
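The two recovery accuracies in equation 19 can be computed as in the following Python sketch. The function names are our own, groups are assumed to be given as lists of feature indices, and the ground-truth vector is assumed to have at least one non-zero entry so that the denominators are positive.

```python
import numpy as np

def individual_accuracy(w, w_true):
    """A_I: fraction of truly non-zero features that are also selected in w."""
    w, w_true = np.asarray(w), np.asarray(w_true)
    true_support = w_true != 0
    hits = np.count_nonzero((w != 0) & true_support)
    return hits / np.count_nonzero(true_support)

def group_accuracy(w, w_true, groups):
    """A_G: fraction of truly active groups that are also active in w."""
    w, w_true = np.asarray(w), np.asarray(w_true)
    true_active = [bool(np.any(w_true[list(g)] != 0)) for g in groups]
    hits = sum(
        1 for g, active in zip(groups, true_active)
        if active and np.any(w[list(g)] != 0)
    )
    return hits / sum(true_active)
```

Both quantities are recall-type measures, which is why they are complementary to the false discovery rate only when every method is constrained to select the same number of features and groups.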

4.1. SIMULATION EVALUATION OF RANDOM ENSEMBLE I

We first consider the simulation setting described in Section 3.1, in which we use $d = 1000$, $b = 10$, $k = 50$ ($k_i = 10$, $\forall i \in \{1, 2, \ldots, 5\}$), $h = 5$, and $\gamma = 2.5$ to generate the simulation data, whose signal-to-noise ratio (SNR) is around 4. We evaluate the performance of all the methods on recovering the ground-truth weight vector $w^{\text{true}}$ with $k = 50$ contributing features distributed in $h = 5$ groups. Feature selection with given support sizes $k$ and $h$: We first consider the case when $k$ and $h$ are given and equal to the ground truth for all methods, while all other hyper-parameters are selected by cross-validation. It is hard to make SGL, SGCover, and SGL$_\infty$ select exactly $k$ individual features and $h$ groups of features, so we indirectly control $k$ and $h$ by sweeping the regularization parameters. For cases where these methods do not yield exactly $k$ individual features and $h$ groups of features, we rank their results and take the top $k$ individual features and $h$ groups of features instead. The same procedure is applied to ENet to select $k$ features. In this setting, we do not need to worry about the false discovery rate (FDR), because it is complementary to the accuracy defined in equation 19. As shown in Fig. 1(b) and (c), $A_I$ and $A_G$ of all competing methods converge to 1 with increasing sample size $n$. However, our proposed method converges the fastest among all the methods, indicating its effectiveness in recovering the structured sparsity when the sample size is small. We further conduct similar experiments with larger $\gamma$ and show the results in the supplementary materials, Section G. Based on these results, we can confirm that our proposed method outperforms conventional methods in selecting contributing features at both the individual level and the group level. Estimation of support sizes $k$ and $h$: In real-world scenarios, we might not know the number of contributing features and groups.
Typically, they are selected by cross-validation based on the out-of-sample mean squared error (MSE). Motivated by this practical need, we also investigate whether the competing methods can accurately recover the ground-truth $k$ and $h$ based on the out-of-sample MSE. As shown in Fig. 1(d)-(g), only our proposed method attains its smallest MSE at $h = 5$. In addition, for the number of contributing features $k$, only our method attains its smallest MSE around $k = 50$.

Interpreting the superior performance of our method. All the competing methods are norm-based and rely on the regularization factor in front of the penalty norm to adjust to the support-size (group number and feature number) requirements. However, this connection is not as explicit as in our Boolean relaxation, where we directly require that $\sum_{i=1}^{d} u_i \le k$ and $\sum_{i=1}^{b} z_i \le h$. Together with our randomized rounding scheme, our Boolean relaxation method might be in an advantageous position to utilize prior knowledge of the support sizes and/or recover the support sizes from the out-of-sample MSE.

4.2. SIMULATION EVALUATION OF RANDOM ENSEMBLE II

Next, we consider a more challenging simulated dataset generated by the procedure introduced in Section 3.2. We set $d = 500$, $k = 80$, $h = 10$, $\zeta = 4$, and $\gamma = 0.1$ as described in Section 3.2. As shown in Fig. 2(a), both $w^{(1)}$ and $w^{(2)}$ are "optimal" for the linear regression problem. However, the support of $w^{(1)}$ lies in $h = 10$ groups, whereas the support of $w^{(2)}$ lies in $\zeta h = 40$ groups. The goal is to test whether each method can recover the solution $w^{(1)}$, which is sparser at the group level. We control the parameters of each method so that it selects $k = 80$ features in $h = 10$ groups. As shown in Fig. 2(b) and (c), the recovery accuracies ($A_G$ and $A_I$) of the proposed method rapidly converge to 1 and 0.93, respectively, as more training samples are provided. 
Surprisingly, the recovery accuracy of all the other competing methods converges very slowly. Note that here we do not need to consider the false discovery rate (FDR) because each method selects exactly $k$ features in $h$ groups; therefore, the FDR is complementary to the accuracy. Overall, under the setting of Random Ensemble II, the performance improvement of our method over the state-of-the-art algorithms is significant in terms of both individual and group recovery accuracy.
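The remark that the FDR is complementary to the accuracy when exactly $k$ features are selected can be checked numerically. The sketch below assumes the usual definition of FDR as the fraction of selected features that are not in the true support; the helper `accuracy_and_fdr` and the randomly drawn supports are illustrative only.

```python
import numpy as np

def accuracy_and_fdr(selected, true_support):
    """Accuracy = hits / |true support|; FDR = false selections / |selected|."""
    selected, true_support = set(selected), set(true_support)
    hits = len(selected & true_support)
    accuracy = hits / len(true_support)
    fdr = (len(selected) - hits) / len(selected)
    return accuracy, fdr

# Hypothetical supports of equal size k, as in the fixed-(k, h) experiments.
rng = np.random.default_rng(0)
d, k = 500, 80
true_support = rng.choice(d, size=k, replace=False).tolist()
selected = rng.choice(d, size=k, replace=False).tolist()
acc, fdr = accuracy_and_fdr(selected, true_support)
print(acc + fdr)  # equals 1 up to rounding because both sets have size k
```

The identity holds only because the selected set and the true support have the same size; with a different number of selected features the two quantities carry separate information.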

4.3. CANCER DRUG RESPONSE PREDICTION

We further apply our model and algorithm to a real-world application: predicting drug response. The goal of the task is to find the essential genes and pathways that are responsible for the ineffectiveness of cancer therapy. For this task, we choose the $\ell_2^2$-regularized least-squares regression model to predict the continuous value of drug response. The gene groups are signal transduction pathways collected from the Reactome database Jassal et al. (2020). We only consider pathways that contain more than 10 genes and fewer than 80 genes. We collect 207 pathways (gene groups), in which the average number of genes in each group is 28.9. The gene expression data are extracted from CCLE Barretina et al. (2012). For each drug (machine learning task), we hold out 20% of the samples as the test set and use the remaining samples as the training and validation set. For each competing method, we use standard cross-validation to determine the hyper-parameters based on the out-of-sample mean squared error (MSE) on the validation set. Here we only show the performance for the drug IMATNIB as a representative and put the performance for other drugs in the supplementary materials Section I. Table 1 reports the estimates of $k$ and $h$ and the out-of-sample MSE on the test set for drug IMATNIB with 10 bootstrap samples. We do not compare with SGL$_\infty$ because the SpaSM package Sjöstrand et al. (2018), which we use to solve SGL$_\infty$, cannot handle overlapping groups. We find that our proposed method achieves the smallest out-of-sample MSE and also selects the smallest number of genes and pathways. More importantly, as shown in supplementary materials Table S3, the selected pathways are all well associated with the drug response and functional mechanisms of IMATNIB in different types of tumor cells, as supported by rich literature. We also make predictions for other drugs and show the results in Table S4.

5. CONCLUSION

In this paper, we propose a novel convex framework for learning structured sparsity. We provide theoretical tools to verify the exactness of the solution of the relaxation, and a rounding algorithm to produce the feasible integral solution when the relaxation solution is fractional. For the case of least-squares loss, we perform extensive experiments to demonstrate the effectiveness of the proposed framework.

Supplementary Materials

A PROOF OF THEOREM 2.1

Proof of Theorem 2.1. Let $u_i \in \{0, 1\}$ indicate whether the $i$-th feature is selected and let $u_{g_j}$ be a vector containing $u_i$, $i \in g_j$. We then define $D(u) := \mathrm{diag}(u)$ and $D(u_{g_j}) := \mathrm{diag}(u_{g_j})$. Considering the change of variable $w \to D(u)w$, we find the original problem equation 1 is equivalent to

$$P^* = \min_{\substack{\|D(u)w\|_0 \le k \\ \sum_{j=1}^{b} \mathbb{1}\left(\|D(u_{g_j})w_{g_j}\|_0 > 0\right) \le h}} \ \sum_{i=1}^{n} f(w^\top D(u)x_i; y_i) + \frac{1}{2}\rho\|D(u)w\|_2^2 .$$

We further introduce $z_j \in \{0, 1\}$ to indicate whether the group of features $g_j$ is selected and obtain the following equivalent formulation

$$P^* = \min_{\substack{\|D(u)w\|_0 \le k \\ \mathbb{1}\left(\|D(u_{g_j})w_{g_j}\|_0 > 0\right) \le z_j,\ \forall j \\ \sum_{j=1}^{b} z_j \le h}} \ \sum_{i=1}^{n} f(w^\top D(u)x_i; y_i) + \frac{1}{2}\rho\|D(u)w\|_2^2 . \tag{21}$$

We further split $w$ and $u$ and then equation 21 becomes

$$P^* = \min_{(u,z)\in\Gamma}\ \min_{w\in\mathbb{R}^d}\ \sum_{i=1}^{n} f(w^\top D(u)x_i; y_i) + \frac{1}{2}\rho\|w\|_2^2 , \tag{22}$$

where

$$\Gamma = \left\{ (u, z) \ \middle|\ \sum_{i=1}^{d} u_i \le k,\ \sum_{j=1}^{b} z_j \le h,\ u_i \le z_j,\ \forall i \in g_j,\ u \in \{0,1\}^d,\ z \in \{0,1\}^b \right\}.$$

It is easy to verify that equation 22 achieves the same objective function value as equation 1 at the same unique optimal solution $w^*$. It remains to prove that the inner minimization is equivalent to

$$\min_{w\in\mathbb{R}^d} \sum_{i=1}^{n} f(w^\top D(u)x_i; y_i) + \frac{1}{2}\rho\|w\|_2^2 = \max_{v\in\mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}. \tag{23}$$

Replacing $f$ by its Legendre-Fenchel conjugate $f^*$, we have

$$\min_{w\in\mathbb{R}^d} \max_{v\in\mathbb{R}^n} \ \sum_{i=1}^{n} \left[ w^\top D(u)x_i \cdot v_i - f^*(v_i; y_i) \right] + \frac{1}{2}\rho\|w\|_2^2 .$$

Under the stated assumptions, strong duality holds and therefore the minimum and maximum can be exchanged:

$$\max_{v\in\mathbb{R}^n} \min_{w\in\mathbb{R}^d} \ \sum_{i=1}^{n} \left[ w^\top D(u)x_i \cdot v_i - f^*(v_i; y_i) \right] + \frac{1}{2}\rho\|w\|_2^2 .$$

The objective function is strongly convex with respect to $w$. Hence, setting its gradient in $w$ to zero, we obtain the unique minimizer $w^* = -\frac{1}{\rho}\sum_{i=1}^{n} D(u)x_i v_i$. Substituting $w^*$ (and using $D(u)^2 = D(u)$ for Boolean $u$) yields equation 23.

To prove Theorem 2.2, let us first provide the sufficient and necessary conditions for $P_{BR}$ to have integral solutions in the following lemma.

Lemma B.1. Suppose that each feature belongs to exactly one group. Then suppose that the integral solution $(\hat u, \hat z)$ of equation 3 selects exactly $k$ features and $h$ groups. 
This solution is also the optimal solution of the relaxed program $P_{BR}$ if and only if there exist non-negative $\{\lambda_{u_p \le 1}, \lambda_{u_p \ge 0}\}_{p\in[d]}$, $\{\lambda_{z_i \le 1}, \lambda_{z_i \ge 0}\}_{i\in[b]}$, $\{\lambda_{u_p \le z_i}\}_{\forall p \in g_i}$, $\lambda_k$, and $\lambda_h$, such that

$$v \in \arg\max_{v\in\mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(\hat u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}, \tag{26}$$
$$\lambda_{u_p \le 1} - \lambda_{u_p \ge 0} + \lambda_k = (X_p^\top v)^2 - \lambda_{u_p \le z_i}, \quad \forall p \in g_i; \tag{27}$$
$$\lambda_{z_i \le 1} - \lambda_{z_i \ge 0} + \lambda_h = \sum_{p \in g_i} \lambda_{u_p \le z_i}, \quad \forall i \in [b]; \tag{28}$$
$$\lambda_{u_p \le 1} = 0, \quad \forall p : \hat u_p = 0; \tag{29}$$
$$\lambda_{u_p \ge 0} = 0, \quad \forall p : \hat u_p = 1; \tag{30}$$
$$\lambda_{z_i \le 1} = 0, \quad \forall i : \hat z_i = 0; \tag{31}$$
$$\lambda_{z_i \ge 0} = 0, \quad \forall i : \hat z_i = 1; \tag{32}$$
$$\lambda_{u_p \le z_i} = 0, \quad \forall p \in g_i : \hat u_p < \hat z_i. \tag{33}$$

To prove Lemma B.1, we need to use the following two theorems.

Theorem B.2 (Davis (2020)). Suppose $x$ is a local minimizer of $f : \mathbb{R}^d \to \mathbb{R}$ on a closed convex set $\mathcal{X} \subseteq \mathbb{R}^d$. If $f$ is differentiable at $x$, it holds that

$$-\nabla f(x) \in N_{\mathcal{X}}(x).$$

Theorem B.3 (Davis (2020)). Let $A \in \mathbb{R}^{m\times n}$ and let $\beta \in \mathbb{R}^m$. Consider the polyhedron $Q(A, \beta) = \{x \mid Ax \le \beta\}$. Suppose $x \in Q(A, \beta)$; then the normal cone at $x$ is

$$N_{Q(A,\beta)}(x) = \{A^\top y \mid y \in \mathbb{R}^m \text{ such that } y \ge 0 \text{ and } y^\top(\beta - Ax) = 0\}.$$

Proof of Lemma B.1. $P_{BR}$ is

$$P_{BR} = \min_{(u,z)\in\Omega} \ \underbrace{\max_{v\in\mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \sum_{i=1}^{n} f^*(v_i; y_i) \right\}}_{F(u,z)},$$

where $\Omega = \{(u, z) \mid \sum_j u_j \le k;\ \sum_i z_i \le h;\ u_i \le z_j,\ \forall i \in g_j;\ u \in [0,1]^d;\ z \in [0,1]^b\}$ is the feasible set for $(u, z)$. By Theorem B.2, we know $(\hat u, \hat z)$ is optimal if and only if the following inclusion holds:

$$-\nabla F(\hat u, \hat z) \in N_\Omega(\hat u, \hat z), \tag{36}$$

where $N_\Omega(\hat u, \hat z)$ is the normal cone at $(\hat u, \hat z)$. Regarding the left-hand side of equation 36, by standard calculation, we have that $\partial_{u_i} F(\hat u) = -(X_i^\top v)^2$ ($v$ is defined in equation 26) and $\partial_{z_i} F(\hat z) = 0$. Therefore,

$$-\nabla F(\hat u, \hat z) = \left( (X_1^\top v)^2,\ (X_2^\top v)^2,\ \ldots,\ (X_d^\top v)^2,\ 0,\ 0,\ \ldots,\ 0 \right)^\top. \tag{37}$$

Regarding the right-hand side of equation 36, we obtain the normal cone $N_\Omega(\hat u, \hat z)$ using Theorem B.3. Specifically, the feasible set $\Omega$ is a polyhedron that can be represented by

$$\Omega(A, \beta) = \left\{ \begin{pmatrix} \hat u \\ \hat z \end{pmatrix} \ \middle|\ A \begin{pmatrix} \hat u \\ \hat z \end{pmatrix} \le \beta \right\},$$

where the matrix $A$ and the vector $\beta$ are constructed as follows (the first $d$ columns of $A$ form the feature block and the last $b$ columns form the group block):

$$
\underbrace{\left(\begin{array}{ccc|ccc}
1 & & & & & \\
-1 & & & & & \\
 & \ddots & & & & \\
 & & 1 & & & \\
 & & -1 & & & \\
1 & \cdots & 1 & & & \\ \hline
 & & & 1 & & \\
 & & & -1 & & \\
 & & & & \ddots & \\
 & & & & & 1 \\
 & & & & & -1 \\
 & & & 1 & \cdots & 1 \\ \hline
\multicolumn{6}{c}{A'}
\end{array}\right)}_{A}
\begin{pmatrix} \hat u_1 \\ \hat u_2 \\ \vdots \\ \hat u_d \\ \hat z_1 \\ \hat z_2 \\ \vdots \\ \hat z_b \end{pmatrix}
\le
\underbrace{\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ k \\ 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ h \\ 0 \\ \vdots \\ 0 \end{pmatrix}}_{\beta}.
$$

Here, $A'$ is the feature-group relation matrix where for each feature-group relation $p \in g_i$, there is a row in $A'$ whose $p$-th entry in the feature block is 1, whose $i$-th entry in the group block is $-1$, and whose remaining entries are 0. According to Theorem B.3, we know the normal cone

$$N_\Omega(\hat u, \hat z) = N_{\Omega(A,\beta)}(\hat u, \hat z) = \left\{ A^\top \lambda \ \middle|\ \lambda \in (\mathbb{R}_{\ge 0})^c : \lambda^\top \left( \beta - A \begin{pmatrix} \hat u \\ \hat z \end{pmatrix} \right) = 0 \right\},$$

where $c$ denotes the number of constraints (i.e., the row dimension of $A$). We identify the entries of $\lambda$ as follows: $\lambda_{u_p \le 1}$ denotes the dual parameter corresponding to the constraint $u_p \le 1$ for each feature $p$; $\lambda_{u_p \ge 0}$ denotes the dual parameter corresponding to the constraint $u_p \ge 0$ for each feature $p$; $\lambda_k$ denotes the dual parameter corresponding to the constraint $\sum_j u_j \le k$; $\lambda_{z_i \le 1}$ denotes the dual parameter corresponding to the constraint $z_i \le 1$ for each group $i$; $\lambda_{z_i \ge 0}$ denotes the dual parameter corresponding to the constraint $z_i \ge 0$ for each group $i$; $\lambda_h$ denotes the dual parameter corresponding to the constraint $\sum_i z_i \le h$; $\lambda_{u_p \le z_i}$ denotes the dual parameter corresponding to the constraint $u_p \le z_i$ for each feature $p$ and group $g_i$ such that $p \in g_i$. Finally, we conclude that the equivalent condition of equation 36 is that there exists $\lambda \in (\mathbb{R}_{\ge 0})^c$ such that $-\nabla F(\hat u, \hat z) = A^\top \lambda$ and $\lambda^\top \left( \beta - A \begin{pmatrix} \hat u \\ \hat z \end{pmatrix} \right) = 0$. By equation 37, we obtain equation 27 and equation 28 as the equivalent condition of $-\nabla F(\hat u, \hat z) = A^\top \lambda$. 
We also obtain equation 29, equation 30, equation 31, equation 32, and equation 33 as the equivalent condition of $\lambda^\top \left( \beta - A \begin{pmatrix} \hat u \\ \hat z \end{pmatrix} \right) = 0$.

With Lemma B.1 in hand, we can prove Theorem 2.2 as follows.

Proof of Theorem 2.2. We first prove the sufficient condition. Given $\lambda_k$ and $\lambda_h$, we only need to construct non-negative $\{\lambda_{u_p \le 1}, \lambda_{u_p \ge 0}\}_{p\in[d]}$, $\{\lambda_{z_i \le 1}, \lambda_{z_i \ge 0}\}_{i\in[b]}$, $\{\lambda_{u_p \le z_i}\}_{\forall p \in g_i}$ to satisfy equation 27-equation 33. By Lemma B.1, this will establish the optimality of $(\hat u, \hat z)$ in the relaxed program $P_{BR}$. We first construct $\{\lambda_{u_p \le 1}, \lambda_{u_p \ge 0}\}_{p\in[d]}$ and $\{\lambda_{u_p \le z_i}\}_{\forall p \in g_i}$ as follows.
1. For each group $i$ such that $\hat z_i = 1$, and for each $p \in g_i$ with $\hat u_p = 1$, set $\lambda_{u_p \le 1} = \lambda_{u_p \ge 0} = 0$ and $\lambda_{u_p \le z_i} = (X_p^\top v)^2 - \lambda_k$. For each $p \in g_i$ with $\hat u_p = 0$, set $\lambda_{u_p \ge 0} = \lambda_k - (X_p^\top v)^2$ and $\lambda_{u_p \le 1} = \lambda_{u_p \le z_i} = 0$. By equation 14, one may verify that all constructed values in this step are non-negative.
2. For each group $i$ such that $\hat z_i = 1$, set $\lambda_{z_i \ge 0} = 0$ and $\lambda_{z_i \le 1} = \sum_{p \in g_i, \hat u_p = 1} \left( (X_p^\top v)^2 - \lambda_k \right) - \lambda_h$, which is non-negative due to equation 15.
3. For each group $i$ such that $\hat z_i = 0$, and for each $p \in g_i$ with $(X_p^\top v)^2 > \lambda_k$, set $\lambda_{u_p \le 1} = \lambda_{u_p \ge 0} = 0$ and $\lambda_{u_p \le z_i} = (X_p^\top v)^2 - \lambda_k$. For each $p \in g_i$ with $(X_p^\top v)^2 \le \lambda_k$, set $\lambda_{u_p \ge 0} = \lambda_k - (X_p^\top v)^2$ and $\lambda_{u_p \le 1} = \lambda_{u_p \le z_i} = 0$. Observe that we always have $\lambda_{u_p \le z_i} = \max\{(X_p^\top v)^2 - \lambda_k, 0\}$ for each $p \in g_i$.
4. For each group $i$ such that $\hat z_i = 0$, set $\lambda_{z_i \le 1} = 0$ and $\lambda_{z_i \ge 0} = \lambda_h - \sum_{p \in g_i} \max\{(X_p^\top v)^2 - \lambda_k, 0\}$, which is non-negative due to equation 16.

Finally, it is straightforward to verify that equation 27-equation 33 are satisfied by our constructed $\lambda$, and therefore, by Lemma B.1, we conclude that $(\hat u, \hat z)$ is the optimal solution of the relaxed program $P_{BR}$.

We next prove the necessary condition. Given that $P_{BR}$ and $P^*$ have the same integral solution, we only need to show there exist $\lambda_k$ and $\lambda_h$ that satisfy equation 14-equation 16. By Lemma B.1 and the integrality of $\hat u$, we have 1. 
For each group $i$ such that $\hat z_i = 1$ and $\forall p \in g_i$, $\hat u_p = 1$, we have

$$(X_p^\top v)^2 = \lambda_k + \lambda_{u_p \le 1} + \lambda_{u_p \le z_i} \ \Rightarrow\ (X_p^\top v)^2 > \lambda_k. \tag{39}$$

For each group $i$ such that $\hat z_i = 1$ and $\forall p \in g_i$, $\hat u_p = 0$, we have

$$(X_p^\top v)^2 = \lambda_k - \lambda_{u_p \ge 0} \ \Rightarrow\ (X_p^\top v)^2 \le \lambda_k. \tag{40}$$

2. For each group $i$ such that $\hat z_i = 1$ and all $p \in g_i$ with $\hat u_p = 1$, we apply equation 39 and equation 28 and have

$$\sum_{p \in g_i, \hat u_p = 1} \left( (X_p^\top v)^2 - \lambda_k \right) = \lambda_h + \lambda_{z_i \le 1} \ \Rightarrow\ \sum_{p \in g_i, \hat u_p = 1} \left( (X_p^\top v)^2 - \lambda_k \right) > \lambda_h.$$

3. For each group $i$ such that $\hat z_i = 0$, we apply equation 40 and equation 28 and have

$$\sum_{p \in g_i} \left( (X_p^\top v)^2 - \lambda_k \right) = \lambda_h - \lambda_{z_i \ge 0} - \sum_{p \in g_i} \lambda_{u_p \ge 0} \ \Rightarrow\ \sum_{p \in g_i} \max\{(X_p^\top v)^2 - \lambda_k, 0\} \le \lambda_h - \lambda_{z_i \ge 0} - \sum_{p \in g_i} \lambda_{u_p \ge 0} \ \Rightarrow\ \sum_{p \in g_i} \max\{(X_p^\top v)^2 - \lambda_k, 0\} \le \lambda_h.$$

B.2 DERIVATION OF EQUATION 10

In the case of least-squares regression, the Legendre-Fenchel conjugate of the least-squares loss $f(t; y) = \frac{1}{2}(t - y)^2$ is given by $f^*(s; y) = \frac{s^2}{2} + sy$. Substituting this conjugate function into $H(u)$ in equation 3 in Theorem 2.1, we have

$$H(u) = \max_{v\in\mathbb{R}^n} \left\{ -\frac{1}{2\rho} v^\top X D(u) X^\top v - \frac{1}{2}\|v\|_2^2 - v^\top y \right\}.$$

We can verify that the unique optimal solution of $H(u)$ is

$$v = -\left( \frac{X D(u) X^\top}{\rho} + I \right)^{-1} y.$$

Substituting $v$ back into equation 3 and applying Theorem 2.1 yields the representation

$$L_{BR} = \min_{(u,z)\in\Omega} \ y^\top \left( \frac{1}{\rho} X D(u) X^\top + I \right)^{-1} y.$$

For each feature index $i \in \{1, 2, \ldots, d\}$, by $y = Xw + \epsilon$, we have that

$$e_i^\top X^\top M y = e_i^\top X^\top M X w + e_i^\top X^\top M \epsilon. \tag{51}$$

We first bound $e_i^\top X^\top M X w$ as follows. Applying Lemma D.1 to each feature index $i$ such that $w_i \neq 0$, and via a union bound, we have that there exists a universal constant $c_1 \in (0, 1/3200)$, such that with probability at least $1 - 3k\exp(-c_1 n^{1-2\delta})$, it holds that

$$\forall i : w_i \neq 0, \quad \frac{1}{\rho} e_i^\top X^\top M X w \in [0.9/\sqrt{k},\ 1.1/\sqrt{k}]. \tag{52}$$

Also, applying Lemma D.2 to each feature index $i$ such that $w_i = 0$, and via a union bound, we have that with probability at least $1 - 4(d - k)\exp(-c_1 n/k)$, it holds that

$$\forall i : w_i = 0, \quad \left| \frac{1}{\rho} e_i^\top X^\top M X w \right| \le 0.1/\sqrt{k}. \tag{53}$$

We then bound $e_i^\top X^\top M \epsilon$ as follows. Note that $M \preceq I$. 
Therefore, applying Lemma D.3 to each feature index $i$ (with $\tau = 0.1\rho/\sqrt{k}$), and via a union bound, we have that with probability at least $1 - 2d\exp(-n/8) - 2d\exp(-n^{2\delta}/(400\gamma^2 k))$, it holds that

$$\forall i \in \{1, 2, \ldots, d\}, \quad \left| e_i^\top X^\top M \epsilon \right| \le 0.1\rho/\sqrt{k}. \tag{54}$$

Now we condition on the event that all of equation 52, equation 53, and equation 54 happen. By equation 51, we have that

$$\forall i : w_i \neq 0 : \quad e_i^\top X^\top M y \in [0.8\rho/\sqrt{k},\ 1.2\rho/\sqrt{k}], \tag{55}$$
$$\forall i : w_i = 0 : \quad \left| e_i^\top X^\top M y \right| \in [0,\ 0.2\rho/\sqrt{k}]. \tag{56}$$

We now set $\lambda_k = (0.2\rho/\sqrt{k})^2$ and $\lambda_h = ((0.79\rho/\sqrt{k})^2 - \lambda_k) \cdot k_{\min}$ (where we let $k_{\min} = \min_{j\in\{1,2,\ldots,h\}}\{k_j\}$ be the size of the smallest non-empty group in terms of non-zero coefficients), and verify the conditions in Corollary 2.3 as follows:
1. Fix any group $g$ such that $z^*_g = 1$. For each $i \in g$ such that $u^*_i = 1$, we have that $w_i \neq 0$ and therefore $(e_i^\top X^\top M y)^2 \ge (0.8\rho/\sqrt{k})^2 > \lambda_k$ by equation 55. For each $i \in g$ such that $u^*_i = 0$, we have that $w_i = 0$ and therefore $(e_i^\top X^\top M y)^2 \le (0.2\rho/\sqrt{k})^2 \le \lambda_k$ by equation 56.
2. Fix any group $g$ such that $z^*_g = 1$. By equation 55, we have $\sum_{i \in g : u^*_i = 1} \left( (e_i^\top X^\top M y)^2 - \lambda_k \right) \ge ((0.8\rho/\sqrt{k})^2 - \lambda_k) \cdot k_{\min} > \lambda_h$.
3. Fix any group $g$ such that $z^*_g = 0$. By equation 55 and equation 56, we have $\sum_{i \in g} \max\{(e_i^\top X^\top M y)^2 - \lambda_k, 0\} \le 0 < \lambda_h$.

Finally, the theorem is proved by collecting the failure probabilities of the desired events (equation 52, equation 53, and equation 54).
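The three verification steps above can also be checked numerically for a candidate integral solution. The following is a minimal sketch: it assumes the conditions of Corollary 2.3 take the form used in steps 1-3 (the corollary itself is not restated here), that `scores[i]` holds $(e_i^\top X^\top M y)^2$, and that groups are disjoint lists of feature indices. The function name `check_exactness_conditions` and the toy numbers are hypothetical.

```python
import numpy as np

def check_exactness_conditions(scores, groups, u, z, lam_k, lam_h):
    """Check steps 1-3 for a Boolean solution (u, z) and thresholds (lam_k, lam_h)."""
    scores = np.asarray(scores, dtype=float)
    u = np.asarray(u, dtype=bool)
    for g, z_g in zip(groups, z):
        idx = np.asarray(g)
        s, sel = scores[idx], u[idx]
        if z_g:
            # Step 1: selected features exceed lam_k, unselected ones do not.
            if not (np.all(s[sel] > lam_k) and np.all(s[~sel] <= lam_k)):
                return False
            # Step 2: the total surplus of a selected group exceeds lam_h.
            if not np.sum(s[sel] - lam_k) > lam_h:
                return False
        else:
            # Step 3: the clipped surplus of an unselected group is at most lam_h.
            if not np.sum(np.maximum(s - lam_k, 0.0)) <= lam_h:
                return False
    return True

# Toy example (hypothetical numbers): group 0 is selected, group 1 is not.
groups = [[0, 1], [2, 3]]
u, z = [1, 1, 0, 0], [1, 0]
scores = [1.0, 0.9, 0.01, 0.02]
ok = check_exactness_conditions(scores, groups, u, z, lam_k=0.04, lam_h=0.5)
too_strict = check_exactness_conditions(scores, groups, u, z, lam_k=0.04, lam_h=2.0)
print(ok, too_strict)
```

In the toy example the selected group has surplus 1.82, so the check passes for $\lambda_h = 0.5$ and fails for $\lambda_h = 2.0$, which mirrors how the proof needs $\lambda_h$ to sit between the surplus of unselected and selected groups.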

D.2 PROOF OF THEOREM 3.2

Proof of Theorem 3.2. Let $M = \left( \frac{1}{\rho} X D(u^*) X^\top + I \right)^{-1} = \left( \frac{1}{\rho} X_1 X_1^\top + I \right)^{-1} = \left( \frac{1}{\rho} X_2 X_2^\top + I \right)^{-1}$. For each feature index $i \in \{1, 2, \ldots, d\}$, by $y = Xw^{(1)} + \epsilon$, we have that

$$e_i^\top X^\top M y = e_i^\top X^\top M X w^{(1)} + e_i^\top X^\top M \epsilon. \tag{57}$$

We first bound $e_i^\top X^\top M X w^{(1)}$ as follows. Applying Lemma D.1 to each feature index $i$ such that $w^{(1)}_i \neq 0$, and via a union bound, we have that there exists a universal constant $c'_1 \in (0, 1/3200)$, such that with probability at least $1 - 3k\exp(-c'_1 n^{1-2\delta})$, it holds that

$$\forall i : w^{(1)}_i \neq 0, \quad \frac{1}{\rho} e_i^\top X^\top M X w^{(1)} \in [0.9|w^{(1)}_i|,\ 1.1|w^{(1)}_i|] \subseteq [0.9\xi,\ 1.1\zeta^{1/4}\xi]. \tag{58}$$

Similarly, applying Lemma D.1 to each feature index $i$ such that $w^{(2)}_i \neq 0$, and via a union bound, we have that with probability at least $1 - 3k\exp(-c'_1 n^{1-2\delta})$, it holds that

$$\forall i : w^{(2)}_i \neq 0, \quad \frac{1}{\rho} e_i^\top X^\top M X w^{(1)} = \frac{1}{\rho} e_i^\top X^\top M X w^{(2)} \in [0.9|w^{(2)}_i|,\ 1.1|w^{(2)}_i|] \subseteq [0.9\xi,\ 1.1\zeta^{1/4}\xi]. \tag{59}$$

Finally, applying Lemma D.2 to each feature index $i$ such that both $w^{(1)}_i = 0$ and $w^{(2)}_i = 0$, and via a union bound, we have that with probability at least $1 - 4(d - 2k)\exp(-c'_1 n\xi^2)$, it holds that

$$\forall i : w^{(1)}_i = 0 \wedge w^{(2)}_i = 0, \quad \left| \frac{1}{\rho} e_i^\top X^\top M X w^{(1)} \right| \le 0.1\xi. \tag{60}$$

We then bound $e_i^\top X^\top M \epsilon$ as follows. Note that $M \preceq I$. Therefore, applying Lemma D.3 to each feature index $i$ (with $\tau = 0.1\xi\rho$), and via a union bound, we have that with probability at least $1 - 2d\exp(-n/8) - 2d\exp(-\xi^2 n^{2\delta}/(400\gamma^2))$, it holds that

$$\forall i \in \{1, 2, \ldots, d\}, \quad \left| e_i^\top X^\top M \epsilon \right| \le 0.1\xi\rho. \tag{61}$$

Now we condition on the event that all of equation 58, equation 59, equation 60, and equation 61 happen. By equation 57, we obtain equation 62 and equation 63. We now set $\lambda_k = (0.2\xi\rho)^2$ and $\lambda_h = ((0.79\xi\rho)^2 - \lambda_k) \cdot (k/h)$, and verify the conditions in Corollary 2.3 as follows:
1. Fix any group $g$ such that $z^*_g = 1$. For each $i \in g$ such that $u^*_i = 1$, we have that $w^{(1)}_i \neq 0$ and therefore $(e_i^\top X^\top M y)^2 \ge (0.8\xi\rho)^2 > \lambda_k$ by equation 62. 
For each $i \in g$ such that $u^*_i = 0$, we have that $w^{(1)}_i = w^{(2)}_i = 0$ and therefore $(e_i^\top X^\top M y)^2 \le (0.2\xi\rho)^2 \le \lambda_k$ by equation 63. Finally, the theorem is proved by collecting the failure probabilities of the desired events (equation 58, equation 59, equation 60, and equation 61).

D.3 TECHNICAL LEMMAS

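Lemma D.1. Suppose $k \le n/4$. Let $X \in \mathbb{R}^{n\times k}$ be a matrix with i.i.d. $N(0, 1)$ entries. Let $\rho = n^{1/2+\delta}$ ($\delta \ge 0$), and $M = (I + \frac{1}{\rho}XX^\top)^{-1}$. For any $\epsilon \ge 32n^{\delta-1/2}$, any fixed vector $z \in \mathbb{R}^k$ such that $\|z\|_\infty \le 1$, and any fixed index $i \in \{1, 2, \ldots, k\}$, with probability at least $1 - 3\exp(-n^{1-2\delta}\epsilon^2/2048)$, it holds that

$$\left| e_i^\top \left( \frac{1}{\rho} X^\top M X - I \right) z \right| \le \epsilon,$$

where $e_i$ is the $i$-th (column) basis vector.

Proof. The proof follows similar lines to Part (1) of the proof of Lemma 2 in Pilanci et al. (2015). However, we adopt a different regularization parameter $\rho$. We write $X = UDV^\top$ for the singular value decomposition of $X$. By standard results on the singular values of Gaussian random matrices (e.g., Davidson & Szarek (2001)), we have that

$$\Pr\left[ \forall j \in \{1, 2, \ldots, k\} : \sqrt{n}/4 \le D_{jj} \le 7\sqrt{n}/4 \right] \ge 1 - 2\exp(-n/32). \tag{64}$$

The rest of the proof will be carried out by conditioning on the successful event in equation 64. Note that $\frac{1}{\rho} X^\top M X = V(\rho I + D^2)^{-1}D^2V^\top$, and therefore,

$$\frac{1}{\rho} X^\top M X - I = V\left[ (\rho I + D^2)^{-1}D^2 - I \right]V^\top = V\tilde D V^\top,$$

where we let $\tilde D := \mathrm{diag}\left( \left\{ \frac{D_{jj}^2}{\rho + D_{jj}^2} - 1 \right\}_{j\in\{1,2,\ldots,k\}} \right)$, and have that $\tilde D_{jj} \le 0$ and $|\tilde D_{jj}| \le 16n^{\delta-1/2}$ for all $j \in \{1, 2, \ldots, k\}$ due to the successful event in equation 64. Note that

$$\left| e_i^\top \left( \frac{1}{\rho} X^\top M X - I \right) z \right| = \left| e_i^\top V\tilde D V^\top z \right| = \left| \sum_j V_{ij}\tilde D_{jj} \sum_q V_{qj} z_q \right| \le \left| \sum_j V_{ij}^2 \tilde D_{jj} z_i \right| + \left| \sum_j V_{ij}\tilde D_{jj} \sum_{q : q \neq i} V_{qj} z_q \right|. \tag{65}$$

It is easy to bound the first term in equation 65 by

$$\left| \sum_j V_{ij}^2 \tilde D_{jj} z_i \right| \le 16n^{\delta-1/2}. \tag{66}$$

Let $\tilde V$ be the $(k-1) \times k$ matrix obtained by removing the $i$-th row from $V$, and let $\tilde z$ be the $(k-1)$-dimensional vector obtained by removing the $i$-th entry of $z$. We can rewrite the second term in equation 65 as

$$\sum_j V_{ij}\tilde D_{jj} \sum_{q : q \neq i} V_{qj} z_q = e_i^\top V\tilde D\tilde V^\top \tilde z. \tag{67}$$

Observe that even when conditioned on $D$ (and therefore $\tilde D$), $e_i^\top V\tilde D\tilde V^\top$ is a $(k-1)$-dimensional vector pointing in a uniformly random direction, and its 2-norm is at most $16n^{\delta-1/2}$. On the other hand, $\tilde z$ is a fixed $(k-1)$-dimensional vector with $\|\tilde z\|_2 \le \sqrt{k-1}$. 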
Therefore, by the standard spherical concentration inequality, we have that

$$\Pr\left[ \left| e_i^\top V\tilde D\tilde V^\top \tilde z \right| \le \epsilon - 16n^{\delta-1/2} \right] \ge 1 - \exp\left( -n(\epsilon - 16n^{\delta-1/2})^2/512 \right) \ge 1 - \exp\left( -n^{1-2\delta}\epsilon^2/2048 \right). \tag{68}$$

Combining equation 65, equation 66, equation 67, and equation 68, and collecting the probabilities, we prove the desired result.

Lemma D.2. Suppose $k \le n/4$. Let $X \in \mathbb{R}^{n\times k}$ be a matrix with i.i.d. $N(0, 1)$ entries. Let $u \in \mathbb{R}^n$ be a column vector with i.i.d. $N(0, 1)$ entries. Let $\rho = n^{1/2+\delta}$ ($\delta \ge 0$), and $M = (I + \frac{1}{\rho}XX^\top)^{-1}$. For any $\epsilon \in (0, 1)$ and any fixed vector $z \in \mathbb{R}^k$ such that $\|z\|_2 \le 1$, with probability at least $1 - 4\exp(-n\epsilon^2/32)$, it holds that $\left| \frac{1}{\rho} u^\top M X z \right| \le \epsilon$. 
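The spectral identity used in the proof of Lemma D.1, namely that $\frac{1}{\rho}X^\top M X - I$ has eigenvalues $\frac{D_{jj}^2}{\rho + D_{jj}^2} - 1$, can be sanity-checked numerically. The sketch below uses illustrative sizes ($n = 400$, $k = 50$, $\delta = 0.1$) chosen here for speed, not taken from the paper, and only verifies the deterministic identity and the sign of the eigenvalues, not the probabilistic statement of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, delta = 400, 50, 0.1            # illustrative sizes with k <= n/4
rho = n ** (0.5 + delta)
X = rng.standard_normal((n, k))

# M = (I + X X^T / rho)^{-1}, formed densely since n is small here.
M = np.linalg.inv(np.eye(n) + X @ X.T / rho)
E = X.T @ M @ X / rho - np.eye(k)     # the matrix (1/rho) X^T M X - I

sing = np.linalg.svd(X, compute_uv=False)                # singular values D_jj
predicted = np.sort(sing**2 / (rho + sing**2) - 1.0)     # D_jj^2/(rho + D_jj^2) - 1
actual = np.sort(np.linalg.eigvalsh((E + E.T) / 2))      # symmetrize against round-off
sigma_min = sing.min()

print(np.max(np.abs(actual - predicted)))   # close to machine precision
print(np.abs(actual).max(), rho / sigma_min**2)
```

Every eigenvalue equals $-\rho/(\rho + D_{jj}^2)$, so it is non-positive and bounded in magnitude by $\rho/D_{jj}^2$; on the event of equation 64 this gives the $16n^{\delta-1/2}$ bound used in the proof.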



Figure 1: (a) Illustration of the data generation process for Random Ensemble I. (b) $A_I$ as $n$ increases. (c) $A_G$ as $n$ increases. Results are averaged over 100 datasets, and the error bars indicate 95% confidence intervals. (d)-(g) Estimation of $k$ and $h$ from the out-of-sample MSE for different methods. The black vertical line indicates the true $k$. The results are averaged over 10 datasets, and the error bars indicate the standard deviation.

Figure 2: (a) Illustration of the data generation process for Random Ensemble II. (b) $A_I$ as $n$ increases. (c) $A_G$ as $n$ increases. Results are averaged over 100 datasets, and the error bars indicate 95% confidence intervals.

$D(u)x_i v_i$. Substituting $w^*$ yields equation 23.

B PROOFS AND DERIVATIONS FOR THE CONVEX FORMULATION

B.1 PROOF OF THEOREM 2.2

PROOF OF COROLLARY 2.3

Proof of Corollary 2.3. Under the least-squares regression setting, we apply Theorem 2.2 and, based on equation 5, we know $v = -\left( \frac{X D(\hat u) X^\top}{\rho} + I \right)^{-1} y$. We further define $M := (I_n + \rho^{-1} X_S X_S^\top)^{-1}$, where $S$ is the support set indicated by $\hat u$, and then $v = -My$. Substituting $v = -My$ in Theorem 2.2 proves the corollary.

Combining equation 46, equation 47, equation 48, equation 49, and equation 50, we have that with probability at least (1 -δ), it holds that H(u) -P * ≤ O r z log(r z /δ)G + r u log(r u /δ

e ⊤ i X ⊤ M Xy ∈ [0, 0.2ξρ].

2. Fix any group $g$ such that $z^*_g = 1$. By equation 62, we have $\sum_{i \in g : u^*_i = 1} \left( (e_i^\top X^\top M y)^2 - \lambda_k \right) \ge ((0.8\xi\rho)^2 - \lambda_k) \cdot (k/h) > \lambda_h$.
3. Fix any group $g$ such that $z^*_g = 0$. By equation 62 and equation 63, we have $\sum_{i \in g} \max\{(e_i^\top X^\top M y)^2 - \lambda_k, 0\} \le ((1.44\xi\rho)^2\sqrt{\zeta} - \lambda_k) \cdot (k/(\zeta h)) \le ((1.44\xi\rho)^2 - \lambda_k) \cdot (k/(\sqrt{\zeta} h)) = 2.0336(\xi\rho)^2 \cdot (k/h)/\sqrt{\zeta} < \lambda_h$, where the last inequality holds for a large enough constant $\zeta > 1$.

Result comparison for IMATNIB. For each drug, we create a separate machine learning task to predict the drug response from the expression value of each gene for different tumor samples. In total, we include 1,225 tumor samples and use the gene expression value of 2,369 genes. We focus on the signal transduction pathways which are mined and collected by the Reactome database Jassal et al.

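), for each $t \ge 0$, it holds that $\Pr[\forall j \in \{1, 2, \ldots, k\} : \sqrt{n} - \sqrt{k} - t \le D_{jj} \le \sqrt{n} + \sqrt{k} + t] \ge 1 - 2\exp(-t^2/2)$. In particular, if we set $t = \sqrt{n}/4$ and use $k \le n/4$, we have that $\Pr[\forall j \in \{1, 2, \ldots, k\} : \sqrt{n}/4 \le D_{jj} \le 7\sqrt{n}/4] \ge 1 - 2\exp(-n/32)$.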

Result comparison for three other drugs.

C PROOF OF THEOREM 2.4

Proof of Theorem 2.4. For least-squares loss f (t, y) = 1 2 (t -y) 2 , we have thatSince the optimal value if non-negative, the optimal dual parameter v ∈ R n must satisfy that ∥v∥ 2 ≤ 2∥y∥ 2 ≤ 2. Note that H(u) -P * ≤ H(u) -H(ū) = H(u) -H(ũ) + H(ũ) -H(ū), (46) where we set ũi = ūi /z j if the feature i belongs to group j and z j = 1, otherwise we set ũi = 0. For H(u) -H(ũ), we have thatwhere σ max (•) denotes the maximum singular value of the matrix. Similarly, for H(ũ) -H(ū), we have thatFor X(D(ũ) -D(ū))X ⊤ , we rewrite it asNote that by our assumption, the operator norm of i∈gj z j × ūi zj -ūi X i X ⊤ i is at most |g j | ≤ G and the mean of the random matrix is 0. By the Ahlswede-Winter matrix concentration bound Ahlswede & Winter (2002) ; Oliveira et al. (2010) , we have that) where r z is the number of fractional entries in z. The standard matrix concentration bound (e.g., matrix Hoeffding) states that for n-dimensional random matrices X 1 , X 2 , . . . , X M , we have that, where σ upper bounds the operator norms of X 1 , X 2 , . . . , X M almost surely. In the context of Eq. ( 49), we have that σ = G and we set α = √ r z Gt. Using the matrix concentration inequality, we upper bound the Left-Hand-Side of Eq. ( 49) by 2n exp(-r z t 2 /8M ) ≤ 2n exp(-t 2 /8) (since r z ≥ M ). Note that this bound is already good enough since the only difference from the Right-Hand-Side of Eq. ( 49) is the factor 2n instead of r z . These factors would go into the logarithmic factor in the final error bound of Theorem 2.4 so they do not make much difference.To improve the factor 2n to r z in Eq. ( 49), we need to show that there exists r z dimensional subspace whose basis vector denoted by the columns ofsurely (where X ′ 1 , . . . , X ′ M are r z -dimensional matrices and their operator norms are also bounded by G). The construction of Q and X ′ 1 , . . . 
, $X'_M$ is possible because each matrix associated with a fractional variable $\bar z_j$ is low rank and there are only $r_z$ fractional $\bar z_j$'s. We then apply the above matrix concentration bound to $X'_1 + \ldots + X'_M$ to derive Eq. (49). For $X(D(u) - D(\tilde u))X^\top$, we have that Again, by our assumption, the operator norm of $(u_i - \tilde u_i)X_iX_i^\top$ is at most 1, and the mean of the random matrix is 0 (even when conditioned on $z$). Therefore, by the Ahlswede-Winter matrix concentration bound Ahlswede & Winter (2002); Oliveira et al. (2010), we have that ) where $r_u$ is the number of fractional entries in $\bar u$.

Proof. The following proof is based on the standard calculation, which also appeared in Part (2) of the proof of Lemma 2 in Pilanci et al. (2015). Write $X = UDV^\top$ for the singular value decomposition of $X$. Again, we have equation 64 and will condition on the successful event in equation 64 for the rest of the proof. Note that $\frac{1}{\rho}MX = U(\rho I + D^2)^{-1}DV^\top$, and therefore $\left\| \frac{1}{\rho}MXz \right\|_2 \le \max_j \frac{D_{jj}}{\rho + D_{jj}^2} \le \max_j \frac{1}{D_{jj}} \le 4/\sqrt{n}$ by the successful event in equation 64. Thus, $\frac{1}{\rho}u^\top MXz$ is a centered Gaussian with standard deviation at most $4/\sqrt{n}$, and the probability that $\left| \frac{1}{\rho}u^\top MXz \right| > \epsilon$ is at most $2\exp(-\epsilon^2 n/32)$ by the standard Gaussian tail bound. The lemma is proved by collecting the failure probabilities.

Lemma D.3. Let $u \in \mathbb{R}^n$ be a column vector with i.i.d. $N(0, 1)$ entries. Let $M \in \mathbb{R}^{n\times n}$ be any PSD matrix (which might depend on $u$) such that its eigenvalues are at most 1. Let $\epsilon \in \mathbb{R}^n$ be an independent column noise vector with i.i.d. $N(0, \gamma^2)$ entries. For any $\tau > 0$, with probability at least $1 - 2\exp(-n/8) - 2\exp(-\tau^2/(4\gamma^2 n))$, it holds that $|u^\top M\epsilon| \le \tau$.

Proof. By standard $\chi^2$ concentration results, we have that with probability at least $1 - 2\exp(-n/8)$, it holds that $\|u\|_2^2 \le 2n$. The rest of the proof will be carried out conditioning on this event. Note that $\|Mu\|_2 \le \|u\|_2$. Therefore, $u^\top M\epsilon \sim N(0, \gamma^2\|Mu\|_2^2)$ with variance at most $\gamma^2\|u\|_2^2 \le 2n\gamma^2$. By the standard Gaussian tail bound, we have that $\Pr[|u^\top M\epsilon| > \tau] \le 2\exp(-\tau^2/(4\gamma^2 n))$. The lemma is proved by collecting the failure probabilities.

E OPTIMIZATION

We use the projected Quasi-Newton (PQN) method to solve the optimization problem $L_{BR}$ defined in equation 10. The details of PQN are given in Algorithm 1 of Schmidt et al. (2009), to which we refer interested readers. To apply PQN, we need the gradient of the objective function in equation 10. The partial gradient of $G(u)$ with respect to $u_i$, $\partial G(u)/\partial u_i$, can be written in closed form. Computing such a gradient requires the solution of a rank-$\|u\|_0$ linear system of size $n$, which can be calculated in time $O(\|u\|_0^3) + O(nd)$ via the QR decomposition. When the sparsity level $k$ is relatively small, such computation is not expensive. We also need to compute the projection onto the constraint set $\Omega$ in PQN, where $\Omega$ is defined in Section 2.2. The projection onto the relaxed constraint set $\Omega$ can be efficiently obtained by a commercial solver (we use Gurobi Gurobi Optimization, LLC (2022)).

F NUMERICAL VERIFICATION OF COROLLARY 2.3

Corollary 2.3 states that, in the least-squares regression setting and under certain conditions, the solution of the relaxed program is integral and consistent with the optimal solution of the problem with Boolean constraints. In this section, we numerically verify the equivalence established in Corollary 2.3. For each $n$ and SNR, we randomly generate 10 datasets. We run our relaxed program on these 10 datasets and count the number of solutions that are the same as the integral solutions of the program with Boolean constraints. As shown in Table S1, the proposed equivalence can be achieved when the sample size and SNR are large.

Published as a conference paper at ICLR 2023
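For the gradient needed by PQN in Section E, the following is a minimal sketch, assuming $G(u) = y^\top \left( \frac{1}{\rho} X D(u) X^\top + I \right)^{-1} y$ as in equation 10. The gradient expression $\partial G/\partial u_i = -\frac{1}{\rho}(X_i^\top M y)^2$ is derived here from the matrix-inverse derivative identity and is not quoted from the paper; the sketch also uses a dense $n \times n$ solve for simplicity instead of the low-rank QR approach described above. The function name and problem sizes are hypothetical.

```python
import numpy as np

def objective_and_gradient(u, X, y, rho):
    """Return G(u) = y^T (X D(u) X^T / rho + I)^{-1} y and its gradient in u."""
    n = X.shape[0]
    K = (X * u) @ X.T / rho + np.eye(n)   # X D(u) X^T / rho + I (u scales the columns)
    My = np.linalg.solve(K, y)            # M y without forming M explicitly
    value = float(y @ My)
    grad = -(X.T @ My) ** 2 / rho         # dG/du_i = -(X_i^T M y)^2 / rho
    return value, grad

# Finite-difference check on a small random instance (illustrative sizes).
rng = np.random.default_rng(0)
n, d, rho = 20, 8, 2.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
u = rng.uniform(0.1, 0.9, size=d)         # a fractional point of the relaxation

value, grad = objective_and_gradient(u, X, y, rho)
step = 1e-6
fd = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = step
    fd[i] = (objective_and_gradient(u + e, X, y, rho)[0]
             - objective_and_gradient(u - e, X, y, rho)[0]) / (2 * step)
print(np.max(np.abs(grad - fd)))
```

Every partial derivative is non-positive, which matches the intuition that increasing any $u_i$ can only decrease the objective; the cardinality constraints in $\Omega$ are what prevent all entries from being driven to 1.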

G ADDITIONAL EXPERIMENTS FOR RANDOM ENSEMBLE I

We compare our proposed method with other state-of-the-art methods on simulation data generated from Random Ensemble I with different γs. We first set γ = 3.5 and compare the performance of the competing methods on recovering the individual and group supports. Note that when γ = 3.5, the signal to noise ratio is around 3. We compare the competing methods on 10 different data sets generated from Random Ensemble I with γ = 3.5. As illustrated in Fig. S1 (a) and (b), our proposed method still outperforms all the competing methods in terms of both A I and A G with the increasing number of samples.

H EXPERIMENTS FOR SYNTHETIC DATA SATISFYING MUTUAL INCOHERENCE CONDITION

In Random Ensemble I, we generate $x_i \sim N(0, I)$ as i.i.d. features. In this experiment, we aim to evaluate the performance of the competing methods in the presence of correlation between features. We follow the data generation procedure of Random Ensemble I, where we set $d = 1000$, $b = 10$, $k = 50$ ($k_i = 10$, $\forall i \in \{1, 2, \ldots, 5\}$), $h = 5$, and $\gamma = 0.1$. The only difference is that we generate $x_i \sim N(0, \Sigma)$, where $\Sigma$ is the Toeplitz covariance matrix $\Sigma_{ij} = p^{|i-j|}$. Such matrices satisfy the mutual incoherence condition, which is required for $\ell_1$-regularized estimators to be statistically consistent. We consider $p = 0.2$ and $p = 0.7$ for $\Sigma$ to evaluate the competing methods' performance under different correlations. The performance is shown in Fig. S2. We find that when $p = 0.2$, support recovery (both at the individual level, Fig. S2(a), and at the group level, Fig. S2(b)) can be easily achieved by most of the models, and our model outperforms the other competing methods. However, when $p = 0.7$, all models have difficulty accurately recovering the support at the individual level (Fig. S2(c)), which makes sense because the correlation between features is high, so finding the correct features becomes more challenging. Our model still outperforms the other competing methods at both the individual level (Fig. S2(c)) and the group level (Fig. S2(d)).
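The correlated design described above can be generated in a few lines. This is a minimal sketch, assuming only the covariance $\Sigma_{ij} = p^{|i-j|}$ stated in the text; the function name `sample_toeplitz_features`, the seed, and the small sizes in the usage example are illustrative and are not the paper's settings.

```python
import numpy as np

def sample_toeplitz_features(n, d, p, seed=0):
    """Draw n feature vectors x_i ~ N(0, Sigma) with Sigma_ij = p^{|i-j|}."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    sigma = p ** np.abs(idx[:, None] - idx[None, :])    # Toeplitz covariance
    X = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return X, sigma

# Small usage example (the experiments in the text use d = 1000).
X, sigma = sample_toeplitz_features(n=200, d=20, p=0.7)
print(X.shape, sigma[0, :4])
```

With $p = 0.7$ neighbouring features have correlation 0.7 and the correlation decays geometrically with index distance, which is what makes individual-level support recovery harder in this setting.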

I ADDITIONAL EXPERIMENTS FOR CANCER DRUG RESPONSE PREDICTION

We first show Table S3, which lists the pathways and genes identified by the proposed method for drug IMATNIB and the corresponding studies that support the findings. We further find the targeted pathways and genes for three other drugs: GEFITINIB, BEXAROTENE, and BOSUTINIB. In Table S4, we show the number of targeted pathways and genes identified by each competing method and the out-of-sample MSE. As shown, our method achieves the smallest out-of-sample MSE and the fewest pathways and genes.

J TIME COMPARISON FOR RANDOM ENSEMBLE I

We compare the running time of the experiments on Random Ensemble I when the sample size is 1,000 in Table S2.

K IMPLEMENTATION DETAILS

For Section 4.1, subsection "Feature selection with given support sizes k and h", and Section 4.2, we select the parameters as follows. For our method, because $k$ and $h$ are given, the only remaining parameter is $\rho$ in equation 1, which we select by 5-fold cross-validation in terms of MSE. For the other methods, which cannot directly control $k$ and $h$, we sweep their parameters until they yield the desired $k$ and $h$. For the real-world application, we select the parameters in terms of out-of-sample MSE.
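The 5-fold selection of $\rho$ can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes the support is already fixed, so that the fit on each fold reduces to ridge regression with penalty $\frac{1}{2}\rho\|w\|_2^2$, whereas the full method would re-solve the relaxation on every fold. The helper names, the grid of $\rho$ values, and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, rho):
    """Minimizer of 0.5*||Xw - y||^2 + 0.5*rho*||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + rho * np.eye(d), X.T @ y)

def select_rho_by_cv(X, y, rho_grid, n_folds=5, seed=0):
    """Return the rho in rho_grid with the smallest mean validation MSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mean_mse = []
    for rho in rho_grid:
        errs = []
        for f in folds:
            train = np.setdiff1d(np.arange(len(y)), f)   # indices not in this fold
            w = ridge_fit(X[train], y[train], rho)
            errs.append(np.mean((X[f] @ w - y[f]) ** 2))
        mean_mse.append(float(np.mean(errs)))
    return rho_grid[int(np.argmin(mean_mse))], mean_mse

# Synthetic usage example: 100 samples restricted to 5 selected features.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
rho_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_rho, mean_mse = select_rho_by_cv(X, y, rho_grid)
print(best_rho)
```

The same loop structure would apply when $k$ and $h$ are unknown, with the grid extended over candidate support sizes and the out-of-sample MSE used as the selection criterion.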

L CODE AVAILABILITY

The code for the proposed method can be found here: https://anonymous.4open.science/r/L0GL-F107/Readme

Table S3: Pathway: RAS processing. Genes: ZDHHC9, GOLGA7, BCL2L1, ABHD17B. Supporting literature: Chung et al. (2006); Braun & Shannon (2008).

