DEEP GATED CANONICAL CORRELATION ANALYSIS

Abstract

Canonical Correlation Analysis (CCA) models can extract informative correlated representations from multimodal unlabelled data. Despite their success, CCA models may break down if the number of variables exceeds the number of samples. We propose Deep Gated-CCA, a method for learning correlated representations based on a sparse subset of variables from two observed modalities. The proposed procedure learns two non-linear transformations and simultaneously gates the input variables to identify a subset of the most correlated variables. The non-linear transformations are learned by training two neural networks to maximize a shared correlation loss defined on their outputs. Gating is obtained by adding an approximate ℓ0 regularization term applied to the input variables. This approximation relies on a recently proposed continuous Gaussian-based relaxation of Bernoulli variables, which act as gates. We demonstrate the efficacy of the method on several synthetic and real examples. Most notably, the method outperforms other linear and non-linear CCA models.

1. INTRODUCTION

Canonical Correlation Analysis (CCA) (Hotelling, 1936; Thompson, 2005) is a classic statistical method for finding maximally correlated linear transformations of two modalities (or views). Consider centered modalities X ∈ R^{D_x × N} and Y ∈ R^{D_y × N} with N samples and D_x and D_y features respectively. CCA seeks canonical vectors a_i ∈ R^{D_x} and b_i ∈ R^{D_y} such that the projections u_i = a_i^T X and v_i = b_i^T Y, i = 1, ..., d, maximize the sample correlations between u_i and v_i, where the u_i (and likewise the v_i) form an orthonormal basis, i.e.

a_i, b_i = argmax Corr(u_i, v_i)  subject to  ⟨u_i, u_j⟩ = δ_{i,j}, ⟨v_i, v_j⟩ = δ_{i,j}, i, j = 1, ..., d.    (1)

While CCA enjoys a closed-form solution via a generalized eigenpair problem, it is restricted to the linear transformations A = [a_1, ..., a_d] and B = [b_1, ..., b_d]. To identify non-linear relations between input variables, several extensions of CCA have been proposed. Kernel methods such as Kernel CCA (Bach & Jordan, 2002), Non-parametric CCA (Michaeli et al., 2016) and Multi-view Diffusion Maps (Lindenbaum et al., 2020) learn the non-linear relations in reproducing kernel Hilbert spaces. These methods have several shortcomings: they are limited to a pre-designed kernel, they require O(N^2) computations for training, and they have poor interpolation and extrapolation capabilities. To overcome these limitations, Andrew et al. (2013) proposed Deep CCA, which learns parametric non-linear transformations of the input modalities X and Y. The transformations are learned by training two neural networks to maximize the total correlation between their outputs. Linear and non-linear canonical correlation models have been widely used for unsupervised and semi-supervised learning. When d is set to a dimension satisfying d < D_x, D_y, these models find reduced-dimensional representations that are useful for clustering, classification, or manifold learning in many applications.
Examples include biology (Pimentel et al., 2018), neuroscience (Al-Shargie et al., 2017), medicine (Zhang et al., 2017), and engineering (Chen et al., 2017). One key limitation of these models is that they typically require more samples than features, i.e. N > D_x, D_y. If there are more variables than samples, estimation based on the closed-form solution of the CCA problem (Eq. 1) breaks down (Suo et al., 2017). Moreover, in high-dimensional data some of the variables are often uninformative and should be omitted from the transformations. For these reasons, there has been growing interest in sparse CCA models. Sparse CCA (SCCA) (Waaijenborg et al., 2008; Hardoon & Shawe-Taylor, 2011; Suo et al., 2017) uses an ℓ1 penalty to encourage sparsity of the canonical vectors a_i and b_i. This not only removes the degeneracy inherent to N < D_x, D_y, but can also improve interpretability and performance. One caveat of this approach is its high computational complexity, which can be reduced by replacing the orthonormality constraints on u_i and v_i with orthonormality constraints on a_i and b_i. This procedure, known as simplified-SCCA (Parkhomenko et al., 2009; Witten et al., 2009), enjoys a closed-form solution. There has been limited work on extending these models to sparse non-linear CCA. Two kernel-based extensions exist: two-stage kernel CCA (TSKCCA) by Yoshida et al. (2017) and SCCA based on the Hilbert-Schmidt Independence Criterion (SCCA-HSIC) by Uurtio et al. (2018). However, these models suffer from the same limitations as KCCA and do not scale to the high-dimensional regime. This paper presents a sparse CCA model that can be optimized using standard deep learning methodologies. The method combines the differentiable loss of DCCA (Andrew et al., 2013) with an approximate ℓ0 regularization term designed to sparsify the input variables of both X and Y.
Our regularization relies on a recently proposed Gaussian-based continuous relaxation of Bernoulli random variables, termed gates (Yamada et al., 2020). The gates are applied to the input features to sparsify X and Y. The gate parameters are trained jointly via stochastic gradient descent to maximize the correlation between the representations of X and Y, while simultaneously selecting only the subsets of the most correlated input features. We apply the proposed method to synthetic data and demonstrate that it improves the estimation of the canonical vectors compared with SCCA models. We then use the method to identify informative variables in multichannel noisy seismic data and show its advantage over other CCA models.

1.1. BACKGROUND

1.2. DEEP CCA

Andrew et al. (2013) present a deep neural network that learns correlated representations. Their Deep Canonical Correlation Analysis (DCCA) extracts two non-linear transformations of X and Y with maximal correlation, by training two neural networks with a joint loss that maximizes the total correlation of the networks' outputs. The parameters of the networks are learned by applying stochastic gradient descent to the objective

θ*_X, θ*_Y = argmax_{θ_X, θ_Y} Corr(f(X; θ_X), g(Y; θ_Y)),

where θ_X and θ_Y are the trainable parameters, and f(X), g(Y) ∈ R^d are the desired correlated representations.

1.3. SPARSE CCA

Several authors have proposed solutions for recovering sparse canonical vectors. The key advantages of sparse vectors are that they enable identifying correlated representations even in the regime of N < D_x, D_y, and they allow unsupervised feature selection. Following the formulation by Suo et al. (2017), SCCA can be described using the regularized objective

a, b = argmin −Cov(a^T X, b^T Y) + τ_1 ‖a‖_1 + τ_2 ‖b‖_1,  subject to  ‖a^T X‖_2 ≤ 1, ‖b^T Y‖_2 ≤ 1,

where τ_1 and τ_2 are regularization parameters controlling the sparsity of the canonical vectors a and b. Note that the relaxed inequality constraints on a^T X and b^T Y make the problem bi-convex; however, if ‖a^T X‖_2 < 1 or ‖b^T Y‖_2 < 1, then the covariance in the objective is no longer equal to the correlation.

1.4. STOCHASTIC GATES

In the last few years, several methods have been proposed for incorporating discrete random variables into gradient-based optimization. Towards this goal, continuous relaxations of discrete random variables (Maddison et al., 2016; Jang et al., 2017) have been proposed. Such relaxations have been used in several applications, for example model compression (Louizos et al., 2017), feature selection, and defining discrete activations (Jang et al., 2016). We focus on a Gaussian-based relaxation of Bernoulli variables, termed Stochastic Gates (STG) (Yamada et al., 2020), originally proposed for supervised feature selection. We denote the STG random vector by z ∈ [0, 1]^D, where each entry is defined as

z[i] = max(0, min(1, µ[i] + ε[i])),    (3)

where µ[i] is a trainable parameter for entry i and the injected noise ε[i] is drawn from N(0, σ^2), with σ fixed throughout training. This approximation can be viewed as a clipped, mean-shifted Gaussian random variable. In Fig. 1 we illustrate the generation of the transformed random variable z[i] for µ[i] = 0.5, which represents a "fair" relaxed Bernoulli variable.
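As a concrete and purely illustrative sketch of this relaxation, the snippet below samples a gate via the hard-sigmoid clipping of Eq. 3 and evaluates the closed-form probability that the gate is open, P(z > 0) = 1/2 − 1/2·erf(−µ/(√2 σ)); the function names are ours, not from the STG reference implementation.

```python
import numpy as np
from scipy.special import erf

def sample_gate(mu, sigma=0.5, size=1, seed=None):
    """z = max(0, min(1, mu + eps)) with eps ~ N(0, sigma^2) (Eq. 3)."""
    rng = np.random.default_rng(seed)
    return np.clip(mu + rng.normal(0.0, sigma, size=size), 0.0, 1.0)

def prob_open(mu, sigma=0.5):
    """Closed-form P(z > 0) = 1/2 - 1/2 * erf(-mu / (sqrt(2) * sigma))."""
    return 0.5 - 0.5 * erf(-mu / (np.sqrt(2.0) * sigma))

# mu = 0.5 gives the "fair" relaxed Bernoulli: mass symmetric around 0.5.
z = sample_gate(0.5, size=100_000, seed=0)
empirical_open = np.mean(z > 0)  # close to prob_open(0.5)
```

Shifting µ above 1 or below 0 pushes the gate to be deterministically open or closed, which is what the regularizer below exploits.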

2. DEEP GATED CCA

2.1. MODEL

It is appealing to combine ideas from sparse CCA with the rich differentiable model of Deep CCA. However, a straightforward ℓ1 regularization of the input layer of a neural network does not work in practice because it makes the learning procedure unstable; this was observed in the supervised setting by Li et al. (2016) and Feng & Simon (2017). The instability occurs because the objective is not differentiable everywhere. To overcome this limitation, we use the STG random variables (see Eq. 3), multiplying them element-wise with the features of X and Y. Then, by penalizing active gates with a regularization term E‖z‖_0, we induce sparsity in the input variables. We formulate sparse non-linear CCA by regularizing a deep neural network with a correlation term: we introduce two random STG vectors into the input layers of two neural networks that are trained in tandem to maximize the total correlation. Denoting the random gating vectors by z_x and z_y for views X and Y respectively, the Deep Gated CCA (DG-CCA) loss is defined as

L(θ, µ) = E_{z_x, z_y} [ −Corr(f(z_x ⊙ X; θ_X), g(z_y ⊙ Y; θ_Y)) + λ_x ‖z_x‖_0 + λ_y ‖z_y‖_0 ],    (4)

where θ = (θ_X, θ_Y) and µ = (µ_X, µ_Y) are the model parameters, ⊙ denotes element-wise multiplication of the gates with each sample's features, and λ_x, λ_y are regularization parameters that control the sparsity of the input variables. The vectors z_x and z_y are random STG vectors with elements defined as in Eq. 3. Fig. 2 highlights the proposed architecture. Each observed modality is first passed through the gates. The outputs of the gates are then used as inputs to a view-specific neural sub-net. Finally, the shared loss in Eq. 4 is minimized by optimizing the parameters of the gates and the neural sub-nets.
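A minimal NumPy sketch of one stochastic forward pass under this loss is given below, with linear maps standing in for the sub-networks f and g and a one-dimensional output so that the correlation term reduces to a sample correlation; all names, shapes, and values are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
D, N, sigma = 20, 256, 0.5
lam_x = lam_y = 0.1
X, Y = rng.normal(size=(D, N)), rng.normal(size=(D, N))
mu_x, mu_y = np.full(D, 0.5), np.full(D, 0.5)      # gate means ("fair" init)
w_x, w_y = rng.normal(size=D), rng.normal(size=D)  # stand-ins for f and g

def sample_gates(mu):
    """One Monte Carlo draw of the STG vector (Eq. 3)."""
    return np.clip(mu + rng.normal(0.0, sigma, size=mu.shape), 0.0, 1.0)

def expected_l0(mu):
    """Differentiable surrogate for E||z||_0, the sum of P(z[i] > 0)."""
    return np.sum(0.5 - 0.5 * erf(-mu / (np.sqrt(2.0) * sigma)))

z_x, z_y = sample_gates(mu_x), sample_gates(mu_y)
u = w_x @ (z_x[:, None] * X)   # f applied to the element-wise gated features
v = w_y @ (z_y[:, None] * Y)
corr = np.corrcoef(u, v)[0, 1]
loss = -corr + lam_x * expected_l0(mu_x) + lam_y * expected_l0(mu_y)
```

In the actual model the scalar correlation is replaced by the d-dimensional total correlation, and gradients flow to µ through both the sampled gates and the closed-form penalty.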

2.2. ALGORITHM DETAILS

We now detail the procedure used in DG-CCA for minimizing the loss L(θ, µ) (Eq. 4). The regularization is a parametric expectation and can therefore be expressed in closed form as

E‖z_x‖_0 = Σ_{i=1}^{D_x} P(z_x[i] > 0) = Σ_{i=1}^{D_x} [ 1/2 − 1/2 erf(−µ_x[i] / (√2 σ)) ],

where erf(·) is the Gaussian error function; E‖z_y‖_0 is defined similarly. Denoting the centered output representations of X and Y by Ψ_X, Ψ_Y ∈ R^{d×N} respectively, the empirical cross-covariance matrix between these representations is Σ_XY = (1/(N−1)) Ψ_X Ψ_Y^T. Using similar notation, we express the regularized empirical covariance matrices of X and Y as Σ_X = (1/(N−1)) Ψ_X Ψ_X^T + γI and Σ_Y = (1/(N−1)) Ψ_Y Ψ_Y^T + γI, where the term γI (γ > 0) is added to ensure stable inversion of Σ_X and Σ_Y. The total correlation in Eq. 4 can then be expressed using the trace of Σ_Y^{−1/2} Σ_YX Σ_X^{−1} Σ_XY Σ_Y^{−1/2}. To learn the gate parameters µ and the representation parameters θ, we apply stochastic gradient descent to L(θ, µ). Specifically, we use Monte Carlo sampling to estimate the correlation term in Eq. 4; this is repeated for each batch, using one Monte Carlo sample per batch as suggested by Louizos et al. (2017) and Yamada et al. (2020), which worked well in our experiments. After training we remove the stochastic part of the gates and use only the variables i_x ∈ {1, ..., D_x} and i_y ∈ {1, ..., D_y} such that z_x[i_x] > 0 and z_y[i_y] > 0.
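For concreteness, a possible NumPy implementation of the total-correlation term is sketched below. Following DCCA, it computes the sum of singular values of T = Σ_X^{−1/2} Σ_XY Σ_Y^{−1/2}, i.e. the sum of canonical correlations between the two representations; the value of γ and all names are illustrative.

```python
import numpy as np

def total_correlation(Psi_x, Psi_y, gamma=1e-4):
    """Sum of canonical correlations between two d x N representations.

    Builds the regularized covariance matrices, forms
    T = Sigma_x^{-1/2} Sigma_xy Sigma_y^{-1/2}, and returns the sum of
    the singular values of T (the DCCA total-correlation objective).
    """
    d, N = Psi_x.shape
    Psi_x = Psi_x - Psi_x.mean(axis=1, keepdims=True)
    Psi_y = Psi_y - Psi_y.mean(axis=1, keepdims=True)
    S_xy = Psi_x @ Psi_y.T / (N - 1)
    S_x = Psi_x @ Psi_x.T / (N - 1) + gamma * np.eye(d)
    S_y = Psi_y @ Psi_y.T / (N - 1) + gamma * np.eye(d)

    def inv_sqrt(S):
        # Inverse matrix square root via the eigendecomposition of an SPD matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(S_x) @ S_xy @ inv_sqrt(S_y)
    return np.linalg.svd(T, compute_uv=False).sum()
```

For identical representations this value approaches d (up to the γ regularization), while for independent representations it concentrates near zero.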

3. RESULTS

In the following section we detail the evaluation of the proposed approach using synthetic and real datasets. We start with two linear examples, demonstrating the performance of DG-CCA when N < D_x, D_y. Then, we use noisy images from MNIST and seismic data measured over two channels to demonstrate that DG-CCA finds meaningful representations of data even in a noisy regime. For a full description of the training procedure as well as the baseline methods, we refer the reader to the Appendix.

3.1. SYNTHETIC EXAMPLE

We start with a simple linear model also studied by Suo et al. (2017). Consider data generated from the distribution

(X; Y) ~ N( (0; 0), [Σ_X, Σ_XY; Σ_YX, Σ_Y] ),

with ρ = 0.9. The indices of the active elements of the canonical vectors are chosen randomly, with values equal to 1/√5. In this setting, based on Proposition 1 in (Suo et al., 2017), the canonical vectors a and b that maximize the objective in Eq. 1 are φ and η respectively. Using this model we generate 400 samples and estimate the canonical vectors with CCA and DG-CCA. In Fig. 3 we present a regularization path of the proposed scheme: we apply DG-CCA to the data described above using various values of λ = λ_x = λ_y, and present the expected number of active gates along with the empirical correlation between the extracted representations, ρ̂ = Corr(φ̂^T X, η̂^T Y). As evident from the left panel, there is a wide range of λ values for which DG-CCA converges to the true number of coefficients (10) and the correct correlation value (0.9). Next, we present the values of φ, the DG-CCA estimate φ̂ (using λ = 30) of the canonical vector φ, and the CCA-based estimate â. The CCA solution is wrong and not sparse, while the DG-CCA solution correctly identifies the support of φ. Finally, we evaluate the estimation error of φ using E_φ = 2(1 − |φ^T φ̂|); E_η is defined similarly. In Table 1 we present the estimated correlation along with the estimation errors of φ and η (averaged over 100 simulations). As baselines we present the results reported by Suo et al. (2017) (mod-SCCA), comparing the performance to PMA (Witten et al., 2009) and SCCA (Chen et al., 2013).
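A data generator in the spirit of this experiment can be sketched as follows; it assumes identity marginal covariances and five active entries of value 1/√5 per canonical vector, which are our own assumptions for illustration rather than the exact setup of Suo et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(7)
D_x = D_y = 100
rho, k, N = 0.9, 5, 400

# Sparse unit-norm canonical vectors with k active entries of 1/sqrt(k).
phi = np.zeros(D_x)
eta = np.zeros(D_y)
phi[rng.choice(D_x, k, replace=False)] = 1.0 / np.sqrt(k)
eta[rng.choice(D_y, k, replace=False)] = 1.0 / np.sqrt(k)

# Joint covariance: identity marginals, cross-covariance rho * phi eta^T.
Sigma = np.block([
    [np.eye(D_x),               rho * np.outer(phi, eta)],
    [rho * np.outer(eta, phi),  np.eye(D_y)],
])
Z = rng.multivariate_normal(np.zeros(D_x + D_y), Sigma, size=N).T
X, Y = Z[:D_x], Z[D_x:]

# The projections phi^T X and eta^T Y have population correlation rho = 0.9.
r_hat = np.corrcoef(phi @ X, eta @ Y)[0, 1]
```

The joint covariance is positive definite since its eigenvalues are 1 ± ρ with ‖φ‖ = ‖η‖ = 1, so sampling is well defined.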

3.2. MULTI VIEW SPINNING PUPPETS

As an illustrative example we use a dataset collected by Lederman & Talmon (2018) for multi-view learning. The authors generated two videos capturing the rotations of three desk puppets. One camera captures two puppets, while the other captures another two, with one puppet shared across cameras. A snapshot from both cameras appears in the top row of Fig. 4. All puppets are placed on spinning devices that rotate them at different frequencies. In both videos there is a shared parameter, namely the rotation of the common Bulldog puppet. Even though the Bulldog is captured from slightly different angles, we attempt to use CCA to identify a linear transformation that projects the two Bulldogs into a common embedding. We use a subset of the spinning puppets dataset, with 400 images from each camera. Each image has 240 × 320 = 76800 pixels (using a gray-scale version of the colored image); therefore there are more features than samples, and a direct application of CCA would fail. We apply the proposed scheme using λ_x = λ_y = 50, a linear activation, and embedding dimension d = 2. DG-CCA converges to an embedding with a total correlation of 1.99 using 372 and 403 pixels from views X and Y respectively. The active gates are presented in the bottom row of Fig. 4. In Fig. 5 we present the coupled two-dimensional embeddings of both videos. Both embeddings are highly correlated with the angular orientation of the Bulldog. Note that adjacent images in the embedding are not necessarily adjacent in the original ambient space; this is because the Bunny and Yoda puppets are gated out and do not affect the embedding.

3.3. NOISY MNIST

The dataset consists of two corrupted views of MNIST digits: a noisy MNIST view and a background MNIST view (samples shown in Fig. 6). Multi-view processing of the two noisy views can generate an informative representation of the noisy MNIST data. In the following we focus on unsupervised embedding of each noisy MNIST view into a correlated 10-dimensional space. By minimizing the correlation-based loss, DG-CCA learns which per-view pixels are relevant and informative in the sense of correlation maximization.
In the bottom-right corner of Fig. 6 we present the locations of the active gates. DG-CCA selects features (pixels) within an oval-like shape in the center of each view, thus capturing the digit information and reducing the influence of the noise. To measure the class separation in the learned embedding, we apply k-means clustering to the stacked embedding of the two views. We run k-means (with k = 10) using 20 random initializations and keep the run with the smallest sum of squared distances from the centroids. Given the cluster assignment, k-means clustering accuracy (KM ACC) and mutual information (MI) are measured using the true labels. Additionally, we train a linear SVM (LSVM) model on our training and validation sets; LSVM classification accuracy (LSVM ACC) is measured on the remaining test set. The performance of DG-CCA compared with several baselines appears in Table 2. In the appendix we provide all implementation details and an experiment demonstrating the performance for various values of λ = λ_x = λ_y.
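The k-means clustering accuracy requires matching cluster indices to class labels; one standard way to compute it (a sketch, not necessarily the exact procedure used here) is via a Hungarian matching on the confusion matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, clusters):
    """Accuracy under the best one-to-one matching of cluster ids to labels."""
    k = int(max(labels.max(), clusters.max())) + 1
    C = np.zeros((k, k), dtype=int)
    for t, p in zip(labels, clusters):
        C[t, p] += 1                         # confusion counts
    rows, cols = linear_sum_assignment(-C)   # Hungarian matching, maximized
    return C[rows, cols].sum() / len(labels)
```

For example, labels [0, 0, 1, 1] with cluster assignment [1, 1, 0, 0] score a perfect 1.0 because the matching relabels the clusters.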

3.4. SEISMIC EVENT CLASSIFICATION

Next, we evaluate the method using a dataset studied by Lindenbaum et al. (2018). The data consists of 1609 seismic events; here we focus on 537 explosions, which are categorized into 3 quarries. The events occurred between the years 2004-2015 in the southern regions of Israel and Jordan. Each event is recorded using two directional channels facing east (E) and north (N); these comprise the coupled views for the correlation analysis. Following the analysis by Lindenbaum et al. (2018), the input features are sonogram representations of the seismic signal. Sonograms are time-frequency representations with bins equally tempered on a logarithmic scale. Each sonogram z ∈ R^1157 consists of 89 time bins and 13 frequency bins. Examples of sonograms from both channels appear in the top row of Fig. 7. We create the noisy seismic data by adding sonograms computed from vehicle-noise recordings [1]. Examples of noisy sonograms appear in the middle row of Fig. 7. We hold out 20% of the data as a validation set, train DG-CCA to embed the data in 3 dimensions using several values of λ = λ_x = λ_y, and use the model that attains the maximal correlation on the validation set. In Table 2 we present the MI, k-means, and SVM accuracies computed based on the DG-CCA embedding, and compare the performance with several other baselines. The proposed scheme improves performance in all 3 metrics while identifying subsets of 71 and 68 features from channels E and N respectively. The active gates are presented in the bottom row of Fig. 7.
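For intuition only, a rough sketch of a sonogram-like feature is given below: a spectrogram aggregated into log-spaced frequency bands and resampled to a fixed 89 × 13 grid, yielding a vector in R^1157. The windowing, sampling rate, and aggregation choices are our own assumptions and not the exact pipeline of Lindenbaum et al. (2018).

```python
import numpy as np
from scipy.signal import spectrogram

def sonogram_sketch(x, fs=40.0, n_time=89, n_freq=13):
    """Spectrogram aggregated into log-spaced bands on an n_time x n_freq grid."""
    f, t, S = spectrogram(x, fs=fs, nperseg=128, noverlap=64)
    edges = np.geomspace(f[1], f[-1] + 1e-9, n_freq + 1)  # log-spaced band edges
    bands = np.stack([S[(f >= lo) & (f < hi)].sum(axis=0)
                      for lo, hi in zip(edges[:-1], edges[1:])])
    idx = np.linspace(0, bands.shape[1] - 1, n_time).round().astype(int)
    return np.log1p(bands[:, idx]).T.reshape(-1)          # vector in R^(89 * 13)
```

The fixed grid makes every event comparable as a 1157-dimensional feature vector, matching the input dimension used above.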

4. CONCLUSION

In this paper we presented a method for learning sparse non-linear transformations that maximize the canonical correlations between two modalities. Our method is realized by gating the input layers of two neural networks that are trained to maximize the total correlation of their outputs. Input variables are gated using a regularization term that encourages sparsity. This allows us to learn informative representations even when the number of variables far exceeds the number of samples. Finally, we demonstrated that the method outperforms existing methods for linear and non-linear canonical correlation analysis. Empirically, we have observed that smaller values of σ translate to improved convergence for DG-CCA; specifically, we used σ = 0.25, which worked well in our experiments. Studying the effect of σ is an open question that we aim to pursue in future work.

A.2 ADDITIONAL EXPERIMENTAL DETAILS

In the following sections we provide additional experimental details required for reproduction of the experiments provided in the main text.

A.2.1 SYNTHETIC EXAMPLE

For the linear model we use a learning rate of 0.005 with 10,000 epochs. The values of λ_x and λ_y are both set to 30; these values were obtained using a cross-validation procedure. We run the method 100 times with different realizations of X and Y. Importantly, following Suo et al. (2017), we report the average errors for the estimation of the canonical vectors; the median values are one order of magnitude better, specifically E_φ = 0.0017 and E_η = 0.0020.

A.2.2 NOISY MNIST

In this subsection we provide additional details regarding the noisy MNIST experiment. In Fig. 8 we present the performance as a function of the number of active gates (pixels), controlled by λ_x = λ_y. The MI score and the k-means and SVM accuracies were computed based on the DG-CCA embedding, trained with a learning rate of 0.01. The number of epochs (∼4000) was tuned by early stopping using a random validation set of size 10000. To learn the 10-dimensional correlated embedding, we use the same architecture as suggested by Wang et al. (2015), consisting of three hidden layers with 1000 neurons each; the embedding dimension was selected based on the number of classes in MNIST. This architecture is used for both DCCA and DG-CCA. Note that for DG-CCA, small values of the regularization parameters λ_x and λ_y increase the number of selected features and degrade performance. This is due to the fact that as more features are selected, more noise is introduced into the extracted representation (of size 10). It is interesting to note that k-means was more robust to the introduced noise than the LSVM. The regularization parameters λ_x and λ_y balance the correlation loss against the amount of sparsification performed by the gates; these hyper-parameters are tuned on the validation set by maximizing the total correlation value. We compare DG-CCA to CCA (Chaudhuri et al., 2009) [2], KCCA (Bach & Jordan, 2002) [3], NCCA (Michaeli et al., 2016) [4] and DCCA (Andrew et al., 2013) [5]. For all methods we use an embedding of dimension 10, and evaluate performance with k-means using 20 random initializations and with an LSVM trained on the training samples and tested on the remaining samples (split defined in the main text). In this experiment we attempted to train SCCA-HSIC (Uurtio et al., 2018) [6] for over two days, but it did not converge.
Furthermore, we believe that the performance of the kernel methods is degraded by the high level of noise in the input images.

A.2.3 SEISMIC EVENT CLASSIFICATION

Using the seismic data, we compare the performance of DG-CCA with linear and non-linear activations. In this example, we use a learning rate of 0.01 with 2000 epochs. The numbers of neurons in the hidden layers are 300, 200, 100, 50, and 40, with a Tanh activation. The embedding dimension (d = 3) was selected based on the number of classes in the data. Parameters are optimized manually to maximize the correlation on a validation set. In Fig. 9 we present the SVM accuracy for different levels of sparsity; the reported number of features is the average over both modalities, and SVM performance is evaluated using 5-fold cross-validation. We compare DG-CCA to CCA (Chaudhuri et al., 2009), SCCA (Suo et al., 2017), SCCA-HSIC (Uurtio et al., 2018), KCCA (Bach & Jordan, 2002), NCCA (Michaeli et al., 2016) and DCCA (Andrew et al., 2013). For all methods we use an embedding of dimension 3, and evaluate performance with k-means using 20 random initializations and with a linear SVM using 5-fold cross-validation. For the kernel methods we constructed a kernel using k = 5, 10, ..., 50 nearest neighbors and selected the value which maximized performance in terms of total correlation.



Footnotes:
[1] https://bigsoundbank.com/search?q=car
[2] https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html
[3] https://gist.github.com/yuyay/16ce4914683da30f87d0
[4] https://tomer.net.technion.ac.il/files/2017/08/NCCAcode_v3.zip
[5] https://github.com/adrianna1211/DeepCCA_tensorflow
[6] https://github.com/aalto-ics-kepaco/scca-hsic



Figure 1: From left to right: the pdf of the injected Gaussian noise ε, the hard-sigmoid function (defined in Eq. 3), and the pdf of the relaxed Bernoulli variable for µ = 0.5, corresponding to a "fair" Bernoulli variable. The trainable parameter µ can shift the mass of z towards 0 or 1. Here we refer to one element of the random vector and omit the index i.

Figure 2: The proposed architecture. Data from the two views is propagated through stochastic gates. The gates' outputs are fed into two neural sub-nets with a shared loss. The shared loss is computed on the output representations of both sub-nets (with dimension d = 3 in this example), and combines a total correlation term with a differentiable regularization term which induces sparsity in the input variables.

Figure 3: Left: Regularization path of DG-CCA on data generated from the linear model. The left y-axis (green) represents the expected number of active gates after training. The right y-axis represents the empirical correlation between the estimated representations, i.e. ρ̂ = Corr(φ̂^T X, η̂^T Y), where φ̂ and η̂ are the estimated canonical vectors. Dashed lines indicate the correct number of active coefficients (green) and the true correlation ρ (blue). Note that for small values of λ = λ_x = λ_y the model selects more variables than needed and attains a higher correlation value; this is similar to the over-fitting phenomenon that CCA suffers from. Right: True canonical vector φ along with the estimated vectors using DG-CCA (φ̂) and CCA (â).

Figure 4: Top: two samples from the spinning puppets videos. Arrows indicate the spinning direction of each puppet. Bottom: the converged active gates for each video. There are 372 and 403 active gates for the left and right videos respectively.

Figure 5: The two DG-CCA embeddings of the Yoda+Bulldog video (left) and the Bulldog+Bunny video (right). We overlay each embedding with 5 images corresponding to 5 points in the embedding space. The resulting embeddings are correlated with the angular rotation of the Bulldog, which is the common rotating puppet in this experiment.

Figure 6: Images from noisy MNIST (left) and corresponding images from background MNIST (right). In the bottom right of both panels we present the active gates (white values within a green frame). There are 277 and 258 active gates for views I and II respectively.

Figure 7: Top: Clean sample sonograms of an explosion from the E and N channels (left and right respectively). Arrows highlight the primary (P) and secondary (S) waves caused by the explosion. Middle: Noisy sonograms generated by adding sonograms of vehicle recordings. Bottom: The active gates for both channels. Note that the gates are active at time-frequency bins which correspond to the P and S waves (see top-left panel).

Figure 8: k-means and SVM classification accuracy (left) and mutual information score (right) vs. the number of selected features.

Figure 9: Classification accuracy on the noisy seismic data. Performance is evaluated using a linear SVM on the 3-dimensional embedding, comparing DG-CCA at different levels of sparsity with linear and non-linear (Tanh) activations.

Table 1: Evaluating the estimation quality of the canonical vectors φ and η.

Table 2: Performance evaluation on the noisy MNIST and seismic datasets.

A APPENDIX

A.1 GATES INITIALIZATION

The Gaussian-based stochastic gates suggested by Yamada et al. (2020) are based on trainable parameters µ and a constant parameter σ, which control the mean and standard deviation of the injected noise respectively. Yamada et al. (2020) initialize all values of µ to 0.5, in which case the gates approximate "fair" Bernoulli variables. This is a reasonable choice when no prior knowledge about the solution is available; however, we can utilize the closed-form solution of the CCA problem to derive a smarter initialization for the gate parameters. Specifically, given the empirical cross-covariance matrix C_XY = XY^T / (N − 1), we denote the thresholded covariance matrix by S_XY, with values defined as

S_XY[i, j] = C_XY[i, j] if |C_XY[i, j]| ≥ δ, and 0 otherwise,

where δ is a threshold selected based on the estimated sparsity of X and Y. Specifically, if we assume that r percent of the values should be zeroed, then δ is set to the r-th percentile of |C_XY|. We then compute the leading singular vectors u and v of S_XY, and further threshold the absolute values of these vectors (using the same percentile used for S_XY). The initial values of the gate parameters are then defined by µ_X = ū + 0.5 and µ_Y = v̄ + 0.5, where ū and v̄ are the thresholded versions of the absolute values of the singular vectors. The standard deviation of the injected noise σ was set to 0.5 by Yamada et al. (2020); they selected this value as it maximized the gradient of the regularization term at initialization.
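The initialization above can be sketched as follows; the percentile handling and the use of the first singular vector pair are our reading of the procedure, so treat this as illustrative.

```python
import numpy as np

def init_gate_means(X, Y, r=90):
    """Gate-mean initialization from the leading singular vectors of a
    hard-thresholded empirical cross-covariance (illustrative sketch)."""
    N = X.shape[1]
    C = X @ Y.T / (N - 1)                         # empirical C_XY
    delta = np.percentile(np.abs(C), r)           # r-th percentile of |C_XY|
    S = np.where(np.abs(C) >= delta, C, 0.0)      # thresholded matrix S_XY
    U, _, Vt = np.linalg.svd(S)
    u_bar, v_bar = np.abs(U[:, 0]), np.abs(Vt[0])
    u_bar = np.where(u_bar >= np.percentile(u_bar, r), u_bar, 0.0)
    v_bar = np.where(v_bar >= np.percentile(v_bar, r), v_bar, 0.0)
    return u_bar + 0.5, v_bar + 0.5               # mu_X, mu_Y
```

Entries surviving the threshold start with µ > 0.5 (gates biased open), while all remaining gates start at the "fair" value 0.5.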

