A SIMPLE SPARSE DENOISING LAYER FOR ROBUST DEEP LEARNING

Abstract

Deep models have achieved great success in many applications. However, vanilla deep models are not well designed to withstand input perturbations. In this work, we take an initial step toward designing a simple robust layer as a lightweight plug-in for vanilla deep models. To this end, we first propose a fast sparse coding and dictionary learning algorithm for the sparse coding problem with an exact k-sparse constraint or l0-norm regularization. Our method admits a closed-form approximation for the sparse coding phase by taking advantage of a novel structured dictionary. With this handy approximation, we propose a simple sparse denoising layer (SDL) as a lightweight robust plug-in. Extensive experiments on both classification and reinforcement learning tasks demonstrate the effectiveness of our methods.

1. INTRODUCTION

Deep neural networks have achieved great success in many applications, including computer vision, reinforcement learning (RL), and natural language processing. However, vanilla deep models are not robust to noise perturbations of the input: even a small perturbation of the input data can dramatically harm prediction performance (Goodfellow et al., 2015). To address this issue, there are three main strategies: data augmentation based learning methods (Zheng et al., 2016; Ratner et al., 2017; Madry et al., 2018; Cubuk et al., 2020), loss functions/regularization techniques (Elsayed et al., 2018; Zhang et al., 2019), and the design of network architectures robust to noisy input perturbations. Su et al. (2018) empirically investigated 18 deep classification models and found that model architecture is a more critical factor for robustness than model size. Most recently, Guo et al. (2020) employed a neural architecture search (NAS) method to investigate robust architectures. However, NAS-based methods are still very computationally expensive, and the resulting models cannot easily be adopted as plug-ins for other vanilla deep models. A handy robust plug-in for backbone models thus remains in high demand. In this work, we take an initial step toward designing a simple robust layer as a lightweight plug-in for vanilla deep models. To achieve this goal, we first propose a novel fast sparse coding and dictionary learning algorithm. Our algorithm has a closed-form approximation for the sparse coding phase, which is cheap to compute compared with the iterative methods in the literature. The closed-form update is handy for situations that require fast computation, especially in deep learning. Based on this, we design a very simple sparse denoising layer (SDL) for deep models. Our SDL is very flexible, and it enables end-to-end training.
Our SDL can be used as a lightweight plug-in for many modern deep architectures (e.g., ResNet and DenseNet for classification, and deep PPO models for RL). Our contributions are summarized as follows: • We propose simple sparse coding and dictionary learning algorithms for both the k-sparse constrained sparse coding problem and the l0-norm regularized problem. Our algorithms have a simple closed-form approximation for the sparse coding phase. • We introduce a simple sparse denoising layer (SDL) based on our handy update. Our SDL involves only simple operations, making it a fast plug-in layer for end-to-end training. • Extensive experiments on both classification tasks and reinforcement learning tasks show the effectiveness of our SDL.

2. RELATED WORKS

Sparse Coding and Dictionary Learning: Sparse coding and dictionary learning are widely studied in computer vision and image processing. One popular related method is K-SVD (Elad & Aharon, 2006; Rubinstein et al., 2008), which jointly learns an over-complete dictionary and the sparse representations by minimizing an l0-norm regularized reconstruction problem. Specifically, K-SVD alternates between a sparse coding phase and a dictionary updating phase, both based on heuristic greedy methods. Despite its good performance, K-SVD is very computationally demanding. Moreover, as pointed out by Bao et al. (2013), both the sparse coding phase and the dictionary updating phase of K-SVD use greedy approaches that lack rigorous theoretical guarantees on optimality and convergence. Bao et al. (2013) proposed to learn an orthogonal dictionary instead of an over-complete one; the idea is to concatenate the free parameters with predefined filters to form an orthogonal dictionary. This trick reduces the time complexity compared with K-SVD, but the algorithm relies on the predefined filters. Furthermore, the alternating descent method relies heavily on SVD, which is not easy to extend to deep models. In contrast, our method learns a structured over-complete dictionary, which takes a simple form as a layer for deep learning. Recently, some works (Venkatakrishnan et al., 2013) employed deep neural networks to approximate the alternating direction method of multipliers (ADMM) or other proximal algorithms for image denoising tasks. In (Wei et al., 2020), reinforcement learning is used to learn the hyperparameters of these deep iterative models. However, this kind of method itself requires training a complex deep model; it is therefore computationally expensive, and too heavy or inflexible to serve as a plug-in layer for backbone models in tasks other than image denoising, e.g., reinforcement learning and multi-class classification.
Robust Deep Learning: In the literature on robust deep learning, several robust losses have been studied. To achieve better generalization ability, Elsayed et al. (2018) proposed a loss function that imposes a large margin on any chosen layer of a deep network. Barron (2019) proposed a general loss with a shape parameter that covers several robust losses as special cases. For problems with noisy input perturbations, several data augmentation based algorithms and regularization techniques have been proposed (Zheng et al., 2016; Ratner et al., 2017; Cubuk et al., 2020; Elsayed et al., 2018; Zhang et al., 2019). However, the network architecture remains less explored as a means of addressing robustness to input perturbations. Guo et al. (2020) employed NAS methods to search for robust architectures. However, search-based methods are very computationally expensive, and the resulting architectures cannot easily be used as plug-ins for other popular networks. In contrast, our SDL is based on a closed-form sparse coding step and can be used as a handy plug-in for many backbone models.

3. FAST SPARSE CODING AND DICTIONARY LEARNING

In this section, we present our fast sparse coding and dictionary learning algorithms for the k-sparse problem and the l0-norm regularized problem in Section 3.1 and Section 3.2, respectively. Both algorithms follow the alternating descent optimization framework.

3.1. K-SPARSE CODING

We first introduce the optimization problem for sparse coding with a k-sparse constraint. Mathematically, we aim to optimize the following objective:

min_{Y,D} ||X - DY||_F^2
subject to ||y_i||_0 <= k, for all i in {1, ..., N},   (1)
           mu(D) <= lambda, ||d_j||_2 = 1, for all j in {1, ..., M},

where D in R^{d x M} is the dictionary, d_j denotes the j-th column of D, y_i denotes the i-th column of Y in R^{M x N}, and mu(.) denotes the mutual coherence, defined as

mu(D) = max_{i != j} |d_i^T d_j| / (||d_i||_2 ||d_j||_2).   (2)

The optimization problem (1) is discrete and non-convex, which makes it very difficult to optimize. To alleviate this, we employ a structured dictionary D = R^T B. We require that R^T R = R R^T = I_d and B B^T = I_d, and that each column vector of B has a constant l2-norm, i.e., ||b_i||_2 = c. The benefit of the structured dictionary is that it enables a fast update algorithm with a closed-form approximation for the sparse coding phase.
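For reference, the mutual coherence in Eq. (2) is straightforward to compute directly. The following is a minimal NumPy sketch (the random dictionary here is purely illustrative, not the structured dictionary introduced below):

```python
import numpy as np

def mutual_coherence(D):
    """Mutual coherence mu(D) of Eq. (2): the largest absolute cosine
    similarity between two distinct columns of the dictionary D."""
    # Normalize each column to unit l2-norm.
    G = D / np.linalg.norm(D, axis=0, keepdims=True)
    gram = np.abs(G.T @ G)        # |cosine| similarities between columns
    np.fill_diagonal(gram, 0.0)   # exclude the diagonal (i == j)
    return gram.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))  # a random 8 x 16 dictionary, for illustration
mu = mutual_coherence(D)
assert 0.0 <= mu <= 1.0           # coherence is an absolute cosine, so in [0, 1]
```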

3.1.1. CONSTRUCTION OF STRUCTURED MATRIX B

Now, we show how to design a structured matrix B that satisfies these requirements. We construct B by concatenating the real and imaginary parts of rows of a discrete Fourier matrix. Proofs of the following results on the properties of B can be found in the Appendix. Without loss of generality, assume d = 2m and M = 2n. Let F in C^{n x n} be the n x n discrete Fourier matrix, whose (k, j)-th entry is F_{k,j} = e^{2*pi*i*k*j/n}, where i = sqrt(-1). Let Lambda = {k_1, k_2, ..., k_m} be a subset of {1, ..., n-1}. The structured matrix B is constructed as in Eq. (4):

B = (1/sqrt(n)) [ Re F_Lambda   -Im F_Lambda ;  Im F_Lambda   Re F_Lambda ] in R^{d x M},   (4)

where Re and Im denote the real and imaginary parts of a complex number, and F_Lambda in Eq. (5) is the matrix formed by the m rows of F indexed by Lambda:

F_Lambda = [ e^{2*pi*i*k_1*1/n} ... e^{2*pi*i*k_1*n/n} ; ... ; e^{2*pi*i*k_m*1/n} ... e^{2*pi*i*k_m*n/n} ] in C^{m x n}.   (5)

Proposition 1. Suppose d = 2m and M = 2n. Construct B as in Eq. (4). Then B B^T = I_d and ||b_j||_2 = sqrt(m/n) for all j in {1, ..., M}.

Proposition 1 shows that the structured construction B satisfies the orthogonality constraint and the constant-norm constraint. It remains to construct B so as to achieve a small mutual coherence. To this end, we can leverage the coordinate descent method in (Lyu, 2017) to construct the index set Lambda. For a prime n such that m divides n - 1, i.e., m | (n - 1), we can instead use a closed-form construction. Let g denote a primitive root modulo n. We construct the index set Lambda = {k_1, k_2, ..., k_m} as

Lambda = {g^0, g^{(n-1)/m}, g^{2(n-1)/m}, ..., g^{(m-1)(n-1)/m}} mod n.   (6)

The resulting structured matrix B has a bounded mutual coherence, as shown in Theorem 1.

Theorem 1. Suppose d = 2m, M = 2n, and n is a prime such that m | (n - 1). Construct B as in Eq. (4) with index set Lambda as in Eq. (6). Let mu(B) := max_{i != j} |b_i^T b_j| / (||b_i||_2 ||b_j||_2). Then mu(B) <= sqrt(n)/m.

Remark: The bound on the mutual coherence in Theorem 1 is non-trivial when n < m^2.
For the case n >= m^2, we can use the coordinate descent method in (Lyu, 2017) to minimize the mutual coherence. We now show that the structured dictionary D = R^T B satisfies the constant-norm constraint and has a bounded mutual coherence. The results are summarized in Corollary 1.

Corollary 1. Let D = R^T B with R^T R = R R^T = I_d. Construct B as in Eq. (4) with index set Lambda as in Eq. (6). Then mu(D) = mu(B) <= sqrt(n)/m and ||d_j||_2 = ||b_j||_2 = sqrt(m/n) for all j in {1, ..., M}.

Corollary 1 shows that, for any orthogonal matrix R, each column vector of the structured dictionary D has a constant l2-norm. Moreover, the mutual coherence stays fixed at mu(D) = mu(B). Thus, given a fixed matrix B, we only need to learn the matrix R for dictionary learning, without undermining the low mutual coherence property.
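The construction above can be checked numerically. The sketch below builds B from Eq. (4) with the subgroup index set of Eq. (6) for a small example (m = 3, n = 7, taking g = 3 as a primitive root modulo 7) and verifies Proposition 1 and the coherence bound of Theorem 1:

```python
import numpy as np

def build_B(m, n, Lambda):
    """Construct the structured matrix B of Eq. (4) from the m rows of the
    n x n DFT matrix indexed by Lambda."""
    k = np.asarray(Lambda).reshape(-1, 1)       # selected row frequencies
    j = np.arange(n).reshape(1, -1)
    F_Lambda = np.exp(2j * np.pi * k * j / n)   # m x n complex block (Eq. 5)
    Re, Im = F_Lambda.real, F_Lambda.imag
    return np.block([[Re, -Im], [Im, Re]]) / np.sqrt(n)  # (2m) x (2n)

m, n = 3, 7   # n prime with m | (n - 1), so d = 6, M = 14
g = 3         # a primitive root modulo 7
Lambda = [pow(g, t * (n - 1) // m, n) for t in range(m)]  # Eq. (6): {1, 2, 4}
B = build_B(m, n, Lambda)

# Proposition 1: B B^T = I_d and every column has norm sqrt(m/n).
assert np.allclose(B @ B.T, np.eye(2 * m), atol=1e-10)
assert np.allclose(np.linalg.norm(B, axis=0), np.sqrt(m / n))

# Theorem 1: mutual coherence mu(B) <= sqrt(n)/m.
G = np.abs(B.T @ B)
np.fill_diagonal(G, 0.0)
assert G.max() / (m / n) <= np.sqrt(n) / m + 1e-10
```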

3.1.2. JOINT OPTIMIZATION FOR DICTIONARY LEARNING AND SPARSE CODING

With the structured matrix B, we can jointly optimize R and Y in problem (7):

min_{Y,R} ||X - R^T B Y||_F^2
subject to ||y_i||_0 <= k, for all i in {1, ..., N},   (7)
           R^T R = R R^T = I_d.

This problem can be solved by the alternating descent method. For a fixed R, we show that the sparse representation Y has a closed-form approximation thanks to the structured dictionary. For fixed sparse codes Y, the dictionary parameter R has a closed-form solution.

Fix R, optimize Y: The constraints on Y are column-separable, i.e., ||y_i||_0 <= k, and the objective (8) is also decomposable:

||X - R^T B Y||_F^2 = sum_{i=1}^N ||x_i - R^T B y_i||_2^2.   (8)

It therefore suffices to optimize the sparse code y_i in R^M for each point x_i in R^d separately. Without loss of generality, for any input x in R^d, we aim to find the optimal sparse code y in R^M with ||y||_0 <= k. Since R^T R = R R^T = I_d and B B^T = I_d, we have

||x - R^T B y||_2^2 = ||Rx - R R^T B y||_2^2 = ||Rx - B y||_2^2 = ||B B^T Rx - B y||_2^2 = ||B(B^T Rx - y)||_2^2 = ||B(h - y)||_2^2,   (9)

where h = B^T Rx is a dense code.

Case 1: When m = n (the columns of B are orthogonal), we can rewrite Eq. (9) as a sum:

||B(h - y)||_2^2 = sum_{j=1}^M (h_j - y_j)^2 ||b_j||_2^2.   (10)

Case 2: When m < n, the right-hand side of Eq. (10) is an error-bounded approximation. Let z = h - y. We have

| ||Bz||_2^2 - sum_{j=1}^M z_j^2 ||b_j||_2^2 | = | sum_{i=1}^M sum_{j != i} z_i z_j b_i^T b_j |   (11)
<= sum_{i=1}^M sum_{j != i} |z_i z_j| ||b_i||_2 ||b_j||_2 mu(B)   (12)
= sum_{i=1}^M sum_{j != i} |z_i z_j| * (m/n) * mu(B).   (13)

It is worth noting that the error bound is small when the mutual coherence mu(B) is small. When we employ the structured matrix of Theorem 1, it follows that

| ||Bz||_2^2 - sum_{j=1}^M z_j^2 ||b_j||_2^2 | <= sum_{i=1}^M sum_{j != i} |z_i z_j| * (m/n) * min(sqrt(n)/m, 1)   (14)
= C sum_{i=1}^M sum_{j != i} |z_i z_j|   (15)
= C sum_{i=1}^M sum_{j != i} |h_i - y_i| |h_j - y_j|,   (16)

where C = min(1/sqrt(n), m/n). In Eq. (14), we use mu(B) <= sqrt(n)/m from Theorem 1 (together with mu(B) <= 1, which always holds). Given the sparsity constraint ||y||_0 <= k, the error bound is minimized by setting every non-zero term to y_j = h_j, so that |y_j - h_j| = 0.
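The chain of equalities in Eq. (9) holds exactly (the approximation only enters in Case 2), which is easy to confirm numerically. In the sketch below, B is any matrix with orthonormal rows, standing in for the structured construction of Section 3.1.1:

```python
import numpy as np

# Numerical check of Eq. (9): with orthogonal R and B B^T = I_d,
# ||x - R^T B y||^2 equals ||B (h - y)||^2 for the dense code h = B^T R x.
rng = np.random.default_rng(3)
d, M = 4, 8
R = np.linalg.qr(rng.standard_normal((d, d)))[0]    # an orthogonal R
B = np.linalg.qr(rng.standard_normal((M, d)))[0].T  # rows orthonormal: B B^T = I_d
x = rng.standard_normal(d)
y = rng.standard_normal(M)                          # an arbitrary code

h = B.T @ (R @ x)                                   # dense code
lhs = np.sum((x - R.T @ (B @ y)) ** 2)
rhs = np.sum((B @ (h - y)) ** 2)
assert np.isclose(lhs, rhs)                         # Eq. (9) holds exactly
```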
Let S denote the index set of the non-zero elements y_j of y. The problem is then to find the index set S that minimizes

sum_{i=1}^M sum_{j != i} |h_i - y_i| |h_j - y_j| = sum_{i in S^c} sum_{j in S^c, j != i} |h_i| |h_j|,   (17)

where S^c denotes the complement of S. We can see that Eq. (17) is minimized when S consists of the indices of the k largest (in absolute value) elements of h.
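The resulting sparse coding phase is thus a single pass over the dense code h: keep the k largest-magnitude entries and zero out the rest. A minimal sketch:

```python
import numpy as np

def k_sparse_code(h, k):
    """Closed-form k-sparse coding phase: keep the k entries of the dense
    code h with the largest absolute value, zero out the rest."""
    y = np.zeros_like(h)
    keep = np.argsort(np.abs(h))[-k:]   # indices of the k largest |h_j|
    y[keep] = h[keep]
    return y

h = np.array([0.1, -2.0, 0.5, 1.5, -0.2])
y = k_sparse_code(h, k=2)
# The two largest-magnitude entries (-2.0 and 1.5) survive.
assert np.array_equal(y, np.array([0.0, -2.0, 0.0, 1.5, 0.0]))
```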

We now consider the approximation term sum_{j=1}^M z_j^2 ||b_j||_2^2. Since ||b_j||_2^2 = m/n, it follows that

sum_{j=1}^M z_j^2 ||b_j||_2^2 = (m/n) sum_{j=1}^M (h_j - y_j)^2.   (18)

Because each term (h_j - y_j)^2 >= 0 is minimized when y_j = h_j, Eq. (18) under the sparsity constraint is minimized by setting every non-zero entry to y_j = h_j; otherwise, a non-zero entry could be reset to y_j = h_j to further reduce the term (h_j - y_j)^2 to zero. The problem is thus to find the index set of non-zero entries that minimizes Eq. (19):

sum_{j=1}^M (h_j - y_j)^2 = sum_{j=1}^M h_j^2 - sum_{i in S, |S| <= k} h_i^2,   (19)

where S := {j | y_j != 0}. We can see that Eq. (19) is minimized when S consists of the indices of the k largest (in absolute value) elements of h.

Remark: Both the approximation sum_{j=1}^M z_j^2 ||b_j||_2^2 and the error bound are minimized by the same solution.

Fix Y, optimize R: For a fixed Y, we have

||X - R^T B Y||_F^2 = ||X||_F^2 + ||B Y||_F^2 - 2 tr(R^T B Y X^T).   (20)

This is the nearest orthogonal matrix problem, which has a closed-form solution as shown in (Schönemann, 1966; Gong et al., 2012). Let B Y X^T = U Gamma V^T be a singular value decomposition (SVD), where U and V are orthogonal matrices. Then Eq. (20) is minimized by

R = U V^T.   (21)

3.2. l0-NORM REGULARIZATION

We employ the same structured dictionary D = R^T B as in Section 3.1. The optimization problem with l0-norm regularization is

min_{Y,R} ||X - R^T B Y||_F^2 + lambda ||Y||_0 subject to R^T R = R R^T = I_d.   (22)

Figure 1: Illustration of the SDL Plug-in

This problem can be solved by the alternating descent method. For a fixed R, we show that Y has a closed-form approximation thanks to the structured dictionary. For fixed sparse codes Y, the dictionary parameter R also has a closed-form solution.

Fix R, optimize Y: The objective can be rewritten as Eq. (23):

||X - R^T B Y||_F^2 + lambda ||Y||_0 = sum_{i=1}^N ( ||x_i - R^T B y_i||_2^2 + lambda ||y_i||_0 ).   (23)

It suffices to optimize y_i for each point x_i separately. Without loss of generality, for any input x in R^d, we aim to find an optimal sparse code y in R^M.
Since R^T R = R R^T = I_d and B B^T = I_d, when m = n, following the derivation in Section 3.1.2, we have

||x - R^T B y||_2^2 + lambda ||y||_0 = ||B(h - y)||_2^2 + lambda ||y||_0,   (24)

where h = B^T Rx is a dense code. Noting that ||b_j||_2^2 = m/n, together with Eq. (24) it follows that

||B(h - y)||_2^2 + lambda ||y||_0 = (m/n) sum_{j=1}^M [ (h_j - y_j)^2 + (n*lambda/m) 1[y_j != 0] ],   (25)

where 1[.] is an indicator function, equal to 1 if its argument is true and 0 otherwise. This problem is separable in the variables y_j, and each term is minimized by setting

y_j = h_j if h_j^2 >= n*lambda/m, and y_j = 0 otherwise.   (26)

Fix Y, update R: For a fixed Y, minimizing the objective leads to the same nearest orthogonal matrix problem as in Section 3.1.2. Let B Y X^T = U Gamma V^T be an SVD, where U and V are orthogonal matrices. Then the reconstruction problem is minimized by R = U V^T.

Remark: Problems with other separable regularization terms can be solved in a similar way; the key difference is how the sparse code y is obtained. For example, for l1-norm regularized problems, y is obtained by soft thresholding, i.e., y_j = sign(h_j) max(0, |h_j| - n*lambda/(2m)).
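One full alternating step for the l0-regularized problem can be sketched as follows. For brevity, B here is a random matrix with orthonormal rows rather than the structured construction of Section 3.1.1, and the toy sizes are illustrative only:

```python
import numpy as np

def hard_threshold(H, lam, m, n):
    """l0 sparse coding phase (Eq. 26): keep h_j only if h_j^2 >= n*lam/m."""
    return np.where(H ** 2 >= n * lam / m, H, 0.0)

def update_R(B, Y, X):
    """Dictionary phase: nearest-orthogonal-matrix solution R = U V^T from
    the SVD of B Y X^T."""
    U, _, Vt = np.linalg.svd(B @ Y @ X.T)
    return U @ Vt

rng = np.random.default_rng(1)
m, n = 3, 7
d, M, N, lam = 2 * m, 2 * n, 20, 0.05
B = np.linalg.qr(rng.standard_normal((M, d)))[0].T  # stand-in with B B^T = I_d
X = rng.standard_normal((d, N))
R = np.eye(d)

H = B.T @ (R @ X)                 # dense codes h = B^T R x, one per column
Y = hard_threshold(H, lam, m, n)  # closed-form sparse coding phase
R = update_R(B, Y, X)             # closed-form dictionary update

assert np.allclose(R.T @ R, np.eye(d), atol=1e-8)   # R stays orthogonal
```

The SVD step is the Procrustes solution of Eq. (20)-(21): among all orthogonal matrices, R = U V^T maximizes tr(R^T B Y X^T).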

4. SPARSE DENOISING LAYER

One benefit of our fast sparse coding algorithm is that it yields a simple closed-form reconstruction, which can be used as a plug-in layer for deep neural networks. Specifically, given an orthogonal matrix R and an input vector x, the optimal reconstruction of our method can be expressed as

x_hat = R^T B f(B^T R x),   (27)

where f(.) is a non-linear mapping function. For the k-sparse constrained problem, f(.) is a k-max pooling function (w.r.t. the absolute value), as in Eq. (28):

f(h_j) = h_j if |h_j| is one of the k highest values of |h| in R^M, and 0 otherwise.   (28)

For the l0-norm regularized problem, f(.) is a hard thresholding function, as in Eq. (29):

f(h_j) = h_j if h_j^2 >= n*lambda/m, and 0 otherwise.   (29)

For the l1-norm regularized problem, f(.) is a soft thresholding function, as in Eq. (30):

f(h_j) = sign(h_j) * max(0, |h_j| - n*lambda/(2m)),   (30)

where sign(.) denotes the sign function. The reconstruction in Eq. (27) can be used as a simple plug-in layer for deep networks; we name it the sparse denoising layer (SDL). It is worth noting that only the orthogonal matrix R needs to be learned; the structured matrix B is constructed as in Section 3.1.1 and kept fixed. The orthogonal matrix R can be parameterized by the exponential mapping or the Cayley mapping (Helfrich et al., 2018) of a skew-symmetric matrix. In this work, we employ the Cayley mapping to enable gradient updates with standard deep learning tools. Specifically, the orthogonal matrix R is obtained by the Cayley mapping of a skew-symmetric matrix as

R = (I + W)(I - W)^{-1},   (31)

where W is a skew-symmetric matrix, i.e., W = -W^T in R^{d x d}. For a skew-symmetric matrix W, only the upper triangular part (excluding the main diagonal) contains free parameters. Thus, the number of free parameters of SDL is d(d-1)/2, which is much smaller than the number of parameters of backbone deep networks. To train a network with an SDL, we add a reconstruction loss term as a regularizer.
The optimization problem is defined as

min_{W in R^{d x d}, Theta} L(X; W, Theta) + beta ||Z - Z_hat||_F^2,   (32)

where W is the skew-symmetric matrix parameter of SDL, Theta is the parameter of the backbone network, and Z_hat is the reconstruction of the latent representation Z via SDL (Eq. (27)). An illustration of the SDL plug-in is shown in Figure 1. When SDL is used as the first layer, Z = X and Z_hat = X_hat; in this case, Z_hat is the reconstruction of the input data X. It is worth noting that the input and output of SDL have the same shape. Thus, SDL can be used as a plug-in for any backbone model without changing the input/output shapes of the layers in the backbone network. With the simple SDL plug-in, backbone models can be trained from scratch.
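The SDL forward pass of Eq. (27), with R parameterized by the Cayley mapping of Eq. (31), can be sketched in a few lines. This is a NumPy illustration with toy sizes, not the paper's training code; any row-orthonormal B stands in for the structured matrix of Section 3.1.1:

```python
import numpy as np

def cayley(W):
    """Orthogonal R from a skew-symmetric W via the Cayley mapping (Eq. 31)."""
    d = W.shape[0]
    return (np.eye(d) + W) @ np.linalg.inv(np.eye(d) - W)

def sdl_forward(x, W, B, k):
    """SDL forward pass (Eq. 27): x_hat = R^T B f(B^T R x), with f the
    k-max (absolute value) pooling of Eq. (28)."""
    R = cayley(W)
    h = B.T @ (R @ x)                    # dense code
    y = np.zeros_like(h)
    keep = np.argsort(np.abs(h))[-k:]    # k-max pooling on |h|
    y[keep] = h[keep]
    return R.T @ (B @ y)                 # reconstruction, same shape as x

# Toy sizes (illustrative assumptions, not the paper's settings): d = 4, M = 8.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
W = A - A.T                              # skew-symmetric free parameter
B = np.linalg.qr(rng.standard_normal((8, 4)))[0].T  # B B^T = I_4
x = rng.standard_normal(4)
x_hat = sdl_forward(x, W, B, k=3)
assert x_hat.shape == x.shape            # SDL preserves the input shape
```

In a deep learning framework, only the d(d-1)/2 free entries of W would be trainable, with R recomputed by the Cayley mapping at each forward pass.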

5. EXPERIMENTS

We evaluate the performance of our SDL on both classification tasks and RL tasks. For classification, we employ DenseNet-100 (Huang et al., 2017) and ResNet-34 (He et al., 2016) as backbones. For RL tasks, we employ deep PPO models (Schulman et al., 2017) as the backbone. For all tasks, we test the performance of backbone models with and without our SDL when adding Gaussian noise or Laplace noise. In all experiments, we plug SDL in as the first layer of the deep model. We set the standard deviation of the input noise to {0, 0.1, 0.2, 0.3}, respectively (the input noise is added after input normalization). We keep all hyperparameters of the backbone models the same; the only difference is whether SDL is plugged in. The parameter beta for the reconstruction loss is fixed at beta = 100 in all experiments.

Classification Tasks: We test SDL on the CIFAR10 and CIFAR100 datasets. We construct the structured matrix B in R^{12 x 14} by Eq. (4). In this setting, the orthogonal matrix R corresponds to the convolution parameter of a Conv2d(.) with kernel size 2 x 2. We set the sparsity parameter of our k-sparse SDL to k = 3 in all classification experiments. The average test accuracy over five independent runs on CIFAR10 and CIFAR100 with Gaussian noise is shown in Fig. 3 and Fig. 4, respectively. We observe that models with SDL obtain performance similar to the vanilla models on clean inputs. As the variance of the input noise increases, models with SDL outperform the vanilla models by an increasingly large margin. The experimental results with Laplace noise are presented in Fig. 9 and Fig. 10 in the supplementary material, and show trends similar to the Gaussian noise cases. We further test the performance of SDL under the fast gradient sign method (FGSM) attack (Goodfellow et al., 2015), with the perturbation parameter epsilon set to 8/256. Experimental results are shown in Fig. 2.
We can observe that adding the SDL plug-in improves the adversarial robustness of the backbone models. More experiments on the tiny-ImageNet dataset can be found in Appendix G. The return of one episode is the sum of rewards over all steps of the episode. We present the average return over five independent runs on the KungFuMaster and Tennis games with Gaussian noise and Laplace noise in Fig. 5 and Fig. 6, respectively. Results on the Seaquest game are shown in Fig. 16 in the supplement due to space limitations. We can see that models with SDL achieve a competitive average return in the clean cases. Moreover, models with SDL obtain a higher average return than vanilla models when the input state is perturbed with noise.

6. CONCLUSION

We proposed fast sparse coding algorithms for both the k-sparse problem and the l0-norm regularized problem. Our algorithms have a simple closed-form update. Based on this handy closed form, we proposed a sparse denoising layer (SDL) as a lightweight plug-in for backbone models against noisy input perturbations. Experiments on both ResNet/DenseNet classification models and deep PPO RL models showed the effectiveness of our SDL against noisy input perturbations and adversarial perturbations.

Proof of Theorem 1. Let c_i in C^{m x 1} be the i-th column of F_Lambda in C^{m x n} in Eq. (5), and let b_i in R^{2m x 1} be the i-th column of B in R^{2m x 2n} in Eq. (4). For 1 <= i, j <= n, i != j, we know that

(1/n) sum_{i=1}^m [ sin^2(2*pi*k_i*j/n) + cos^2(2*pi*k_i*j/n) ] = m/n,
b_i^T b_{i+n} = 0,
b_{i+n}^T b_{j+n} = b_i^T b_j = Re(c_i^* c_j),
b_{i+n}^T b_j = -b_i^T b_{j+n} = Im(c_i^* c_j),

where * denotes the complex conjugate, and Re(.) and Im(.) denote the real and imaginary parts of the input complex number. It follows that

mu(B) = max_{1 <= k, r <= 2n, k != r} |b_k^T b_r| / (||b_k||_2 ||b_r||_2) <= max_{1 <= i, j <= n, i != j} |c_i^* c_j| / m = mu(F_Lambda).

From the definition of F_Lambda in Eq. (5), we know that

mu(F_Lambda) = max_{1 <= i, j <= n, i != j} (1/m) | sum_{z in Lambda} e^{2*pi*i*z(j-i)/n} |   (44)
= max_{1 <= k <= n-1} (1/m) | sum_{z in Lambda} e^{2*pi*i*z*k/n} |.   (45)

Because Lambda = {g^0, g^{(n-1)/m}, g^{2(n-1)/m}, ..., g^{(m-1)(n-1)/m}} mod n, we know that Lambda is a subgroup of the multiplicative group {g^0, g^1, ..., g^{n-2}} mod n. From Bourgain et al. (2006), we know that

max_{1 <= k <= n-1} | sum_{z in Lambda} e^{2*pi*i*z*k/n} | <= sqrt(n).

Finally, we conclude that mu(B) <= mu(F_Lambda) <= sqrt(n)/m.

C PROOF OF COROLLARY 1

Proof. Since R^T R = R R^T = I_d and D = R^T B, we know that ||d_j||_2 = ||b_j||_2. From Proposition 1, we know that ||b_j||_2 = sqrt(m/n) for all j in {1, ..., M}. It follows that ||d_j||_2 = ||b_j||_2 = sqrt(m/n) for all j in {1, ..., M}.

D EMPIRICAL CONVERGENCE OF OBJECTIVE FUNCTIONS

We test our fast dictionary learning algorithms on the Lena image with 12 x 12 image patches. We present the empirical convergence results of our fast algorithms in Figure 7. The objective converges within fifty iterations.
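The convergence behaviour can also be reproduced on synthetic data. The toy run below (sizes and data are illustrative, not the Lena setup) uses the square case m = n with an orthogonal B, where the sparse coding step is exact, so each alternating step provably does not increase the objective:

```python
import numpy as np

# Alternating k-sparse coding and orthogonal-dictionary updates on synthetic
# data. In the square case (B orthogonal), both phases are exact minimizers,
# so the objective is guaranteed to be non-increasing.
rng = np.random.default_rng(4)
d, N, k = 6, 40, 2
B = np.linalg.qr(rng.standard_normal((d, d)))[0]   # orthogonal stand-in for B
X = rng.standard_normal((d, N))
R = np.eye(d)
objs = []
for _ in range(20):
    H = B.T @ (R @ X)                              # dense codes, per column
    Y = np.zeros_like(H)
    idx = np.argsort(np.abs(H), axis=0)[-k:]       # k largest |h_j| per column
    np.put_along_axis(Y, idx, np.take_along_axis(H, idx, axis=0), axis=0)
    U, _, Vt = np.linalg.svd(B @ Y @ X.T)          # nearest orthogonal matrix
    R = U @ Vt
    objs.append(np.sum((X - R.T @ (B @ Y)) ** 2))  # k-sparse objective (7)

# Alternating exact minimization: the objective never increases.
assert all(objs[t + 1] <= objs[t] + 1e-9 for t in range(len(objs) - 1))
```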



https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/



Figure 2: Mean test accuracy ± std over 5 independent runs on CIFAR10/CIFAR100 dataset under FGSM adversarial attack for Densenet and Resnet with or without SDL

Figure 3: Mean test accuracy ± std over 5 independent runs on CIFAR10 dataset with Gaussian noise for Densenet and Resnet with or without SDL

Figure 5: Average Return ± std over 5 independent runs on KungfuMaster and Tennis game with Gaussian noise with or without SDL

Thus, we have ||b_j||_2 = sqrt(m/n) for j in {1, ..., M}.

B PROOF OF THEOREM 1

From the definition of the mutual coherence mu(.), we know it is rotation invariant. Since D = R^T B with R^T R = R R^T = I_d, we know mu(D) = mu(B). From Theorem 1, we have mu(B) <= sqrt(n)/m. Thus, we obtain mu(D) = mu(B) <= sqrt(n)/m.

Figure 7: Decreasing of the objective functions

Figure 8: Demo of denoised results of our fast sparse coding algorithm

The numbers of parameters of SDL, DnCNN (Zhang et al., 2017), and PnP (Wei et al., 2020) are shown in Table 1. SDL has far fewer parameters and a simpler structure than DnCNN and PnP, and it can serve as a lightweight plug-in for other tasks, e.g., RL.

RL Tasks: We test the deep PPO model with SDL on the Atari games KungFuMaster, Tennis, and Seaquest. The deep PPO model concatenates four frames as the input state. The size of the input

A PROOF OF PROPOSITION 1

Proof. Let c_i in C^{1 x n} be the i-th row of F_Lambda in C^{m x n} in Eq. (5), and let v_i in R^{1 x 2n} be the i-th row of B in R^{2m x 2n} in Eq. (4). For 1 <= i, j <= m, we know that

v_i v_j^T = v_{i+m} v_{j+m}^T = (1/n) Re(c_i c_j^*),
v_{i+m} v_j^T = -v_i v_{j+m}^T = (1/n) Im(c_i c_j^*),

where * denotes the complex conjugate, and Re(.) and Im(.) denote the real and imaginary parts of the input complex number. For a discrete Fourier matrix F, the rows indexed by distinct k_i, k_j in Lambda satisfy c_i c_j^* = sum_{t=1}^n e^{2*pi*i(k_i - k_j)t/n}. When i = j, c_i c_i^* = n; when i != j, c_i c_j^* = 0. Thus, we have B B^T = I_d. The l2-norm of the column vectors of B is given as

