ON THE IMPORTANCE OF CONTRASTIVE LOSS IN MULTIMODAL LEARNING

Abstract

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved great success in multimodal learning, where the model tries to minimize the distance between the representations of different views (e.g., an image and its caption) of the same data point, while keeping the representations of different data points away from each other. However, from a theoretical perspective, it is unclear how contrastive learning can efficiently learn to align the representations from different views, especially when the data is not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we reveal a stagewise behavior of the learning process: in the first stage, the model aligns the feature representations using positive pairs, and the condition number grows in this stage; in the second stage, the model reduces the condition number of the learned representations using negative pairs.

Under review as a conference paper at ICLR 2023

In this paper, we make preliminary theoretical steps towards answering the fundamental question of the importance of contrastive loss in multimodal learning. We assume the data from the two modules are of the form $x_A = A z_A + A_\xi \xi_A$ and $x_B = B z_B + B_\xi \xi_B$, respectively, where $z_A, z_B$ are the hidden signals, $A, B$ are linear transformations from the signal to the observation, and $A_\xi \xi_A, B_\xi \xi_B$ are the noises. Similar linear models have also been used in previous works (Tian et al. (2021); Wen & Li (2021)) in the context of single-modal learning ($A = B$). A positive pair of data shares the same signal $z_A = z_B$, but has different noises $\xi_A, \xi_B$ and transformations $A, B$. The goal is to learn features $f_A, f_B$ that align positive pairs while keeping the representations of negative pairs away from each other.
Under this setting, we make the following contributions:

1. We consider the challenging (but more practical) setting where the features in $A$ and $B$ are inhomogeneous, that is, the condition number of $A$ and $B$ can be $\omega(1)$. Prior works (Jing et al. (2022); Tian et al. (2021); Wen & Li (2021)) only consider cases where $A$ and $B$ are exactly column-orthonormal matrices, even in the simpler single-modal setting ($A = B$).
2. We consider feature learners $f_A, f_B$ with normalization, meaning that $f_A, f_B$ are always normalized to have (expected) norm one during training. Output normalization plays a critical role in the practical success of contrastive learning and is also employed in CLIP, but it is rarely considered in theory due to the additional complexity of the division by the norm.
3. We analyze the learning process of stochastic gradient descent from random initialization. We prove that contrastive learning converges efficiently to a nearly optimal solution, which indeed aligns the feature representations $f_A$ and $f_B$.
4. We demonstrate the importance of negative pairs by comparing with training only over the positive pairs: we prove that although the latter can also learn to align $f_A$ and $f_B$, the features learned by contrastive learning with negative pairs are much more uniform, meaning that $f_A, f_B$ can recover all the singular vectors of $A$ and $B$ and normalize them. Without negative pairs, on the other hand, the learned representation is close to a rank-one solution, meaning that $f_A, f_B$ will only focus on the top singular direction of $A$ and $B$.
5. We perform simulations and more practical experiments to further support our theory.

Multimodal learning: Despite the empirical success of multimodal learning, there are very few theoretical results on this topic. The work most related to ours is Huang et al. (2021), in which the authors show that, in certain cases, multimodal methods can provably perform better than single-modal models.
However, the authors consider neither contrastive pairs nor the training dynamics.

Contrastive/non-contrastive learning theory: Another, much richer line of research concerns contrastive and non-contrastive methods in the context of single-modal self-supervised learning. Starting from Arora et al. (2019), many recent works have provided various explanations of why the representations learned with contrastive learning are useful in downstream tasks (Chuang et al. (2020); Tosh et al. (2021); Nozawa & Sato (2021); Wang et al. (2022a); HaoChen et al. (2021); Lee et al. (2021); Wang & Isola (2020)). These works mostly focus on the generalization aspect of the problem and do not consider training. Among them, Wang & Isola (2020) also study the problem using the notions of alignment and uniformity, and demonstrate that balanced representations benefit downstream tasks; however, they do not provide guarantees on training. Another related line of research is non-contrastive learning, where the necessity of negative examples is questioned. In this line of research, the optimization problem does get considered, as non-contrastive losses have trivial collapsed solutions. Tian et al. (2021) show that, under certain conditions, non-contrastive learning methods can learn non-collapsed solutions. Jing et al. (2022) show that, even with negative examples, contrastive learning can still suffer from another type of collapse, where the learned representations only span a low-dimensional subspace of the embedding space. In Pokle et al. (2022), the authors show that non-contrastive losses have many non-collapsed bad minima that the training algorithm does not avoid. Another related work that takes optimization into consideration is Wen & Li (2021), in which the authors analyze the training dynamics of contrastive learning and show that, with random augmentation, neural networks can learn features that are suppressed by noise when no augmentations are used.

Namely, $\Gamma_{\mathrm{Align}}$ is the accuracy of classifying whether the input pair $(x_A, x_B)$ is a positive pair.
We say that the learned representations are aligned if $\Gamma_{\mathrm{Align}} \approx 1$. Note that the notion of alignment introduced here is stronger than merely matching the positive pairs, which can be achieved by simply mapping all inputs to one single embedding; in that case, $\Gamma_{\mathrm{Align}}$ will be 0 (or 0.5 if we choose to break ties randomly instead of using strict inequality).

Footnotes:
1. See Section D for discussions on the sample complexity.
2. We use $f_A^+$ as a shorthand for $f_A(x_A^+)$, and similarly for the other combinations of $A$, $B$ and $\pm$.
3. Note that, in the non-contrastive case, changing $\tau_t$ only changes the learning rate.

1. INTRODUCTION

One of the exceptional abilities of humans is to associate data from different modalities (such as text and images). For example, when we hear the words "white dog", we can immediately align them with the image of a dog with white fur. When we hear the loud sound of an engine, we can imagine an expensive sports car passing nearby. Recently, in machine learning, multimodal learning methods, which train the model to align the data from different modules, have become an increasingly popular research direction, especially in deep learning (He & Peng (2017); Stroud et al. (2020); Radford et al. (2021); Ramesh et al. (2021); Xu et al. (2021); Jia et al. (2021); Wang et al. (2022b)). Among them, the recent work CLIP (Radford et al. (2021)) shows remarkable results on aligning the features of text and images. The contrastive-learning-based method CLIP empirically outperforms many existing non-contrastive approaches (Grill et al. (2020); Chen & He (2021); He et al. (2020)). The major difference between the contrastive approach and other approaches is that the contrastive loss not only requires the learned representations from the same pair of data (i.e., positive pairs) to be positively aligned, but also requires the data from different pairs (i.e., negative pairs) to be as negatively aligned as possible. In the paper, the authors also identify the contrastive loss as the part most critical to the success of CLIP. Despite the empirical success of this contrastive-learning-based method, from a theoretical perspective, the most fundamental questions are still largely open: in particular, how do contrastive pairs help in this new multimodal learning approach? How can the non-convex contrastive loss be efficiently minimized to learn features from both modules? Unlike the prior theoretical works on contrastive learning, which mostly focus on extracting features from one module (e.g., Arora et al. (2019); Jing et al. (2022); Pokle et al. (2022); Tian et al.
(2021); Wen & Li (2021)), one main technical challenge of analyzing contrastive learning in a multimodal setting is to show how the model can be trained to align the feature representations $f_A, f_B$ from modules $A$ and $B$, respectively. Due to the existence of negative pairs, which emphasize negative correlations of $f_A$ and $f_B$, it is unclear whether the model still has an incentive to align the features from different modules.

As noted in the related work discussion, Wen & Li (2021) show that, with random augmentation, neural networks can learn features that are suppressed by noise when no augmentations are used. Though these works do consider the optimization problem, they focus on the case where the features are uniform, and only Wen & Li (2021) consider output normalization. We compare our results with the most relevant works in the next paragraph.

Comparison with Jing et al. (2022); Tian et al. (2021); Wen & Li (2021): The dimensional collapse problem reported in Jing et al. (2022) is not a real issue in our setting, since, in our case, the best the model can do is to recover the latent vector $z$, up to some rotation. As a result, it is natural for the learned representations to span only a low-dimensional subspace of the embedding space $\mathbb{R}^m$. Here, the point of choosing $m \gg d$ is to make the optimization dynamics more regular, which is a common strategy in the study of over-parameterized neural networks. The main difference between our work and the analyses in Tian et al. (2021) and Wen & Li (2021) is that we do not assume the inputs are isotropic. In our setting, the condition number can be as large as $\Theta(\log d)$. When the condition number is 1, one can imagine that, thanks to the symmetry, all directions will be learned simultaneously; therefore, we do not need negative examples to prevent collapse (Tian et al. (2021)), or the negative examples do not play an important role in the analysis (Wen & Li (2021)).
On the other hand, when the condition number is larger than 1, we do need to use the negative examples to shrink the condition number, which corresponds to the second stage of our analysis.

3. PROBLEM SETUP

Similar to previous theoretical works (e.g., Lee et al. (2021); Wen & Li (2021)), we consider a linear data-generating model. Formally, we assume that the contrastive pairs $(x_A^+, x_B^-)$ are constructed as
$$x_A^+ = A z^+ + A_\xi \xi_A^+, \qquad x_B^- = B z^- + B_\xi \xi_B^-,$$
where $z^\pm, \xi_A^+, \xi_B^-$ are independent random variables following the uniform distributions over $\{\pm 1/\sqrt{d}\}^r$, $\{\pm 1/\sqrt{d}\}^{d-r}$ and $\{\pm 1/\sqrt{d}\}^{d-r}$, respectively, and $A, B \in \mathbb{R}^{d \times r}$, $A_\xi, B_\xi \in \mathbb{R}^{d \times (d-r)}$ are matrices with $A^\top A = B^\top B = \mathrm{diag}(\sigma^2)$ and $A_\xi^\top A_\xi = B_\xi^\top B_\xi = \sigma_\xi^2 I_{d-r}$ for some $\sigma \in \mathbb{R}^r_+$ and $\sigma_\xi \in \mathbb{R}_+$. In words, we first sample the latent vectors $z^\pm \in \mathbb{R}^r$ independently and encode them with $A, B$ to form the signal part of the input. Then, we sample the latent noises $\xi_A^+, \xi_B^- \in \mathbb{R}^{d-r}$ and encode them with $A_\xi, B_\xi$ to form the noise part of the input. Finally, we add the signal and noise parts together to obtain $(x_A^+, x_B^-)$. To generate a positive pair $(x_A^+, x_B^+)$, we use the same latent vector. That is, $x_A^+ = A z^+ + A_\xi \xi_A^+$ and $x_B^+ = B z^+ + B_\xi \xi_B^+$. Note that the latent noises here are still independent. We use $\sigma^2_{\max}$ and $\sigma^2_{\min}$ to denote the maximum and minimum of $\sigma_1^2, \ldots, \sigma_r^2, \sigma_\xi^2$, respectively. We assume that $(\sigma^2_{\max}/\sigma^2_{\min}) \max\{1, (d-r)\sigma_\xi^2/(r \sigma^2_{\min})\} \le c \log d$ for some small constant $c > 0$. Our results can easily be generalized to settings where the dimensions of $x_A$ and $x_B$ are not the same, since one can simply pad zeros at the end of each column of $A$ and $B$. One way of interpreting this model is to view each coordinate of the latent vector $z$ as an indicator for the presence/absence of a certain object, and the corresponding columns in $A$ and $B$ as the visual and word embeddings of this object, respectively. Now, we describe our learner model. We consider (normalized) linear feature learners.
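To make the data model concrete, the sampling process just described can be sketched in a few lines of Python (a minimal sketch: the QR-based construction of $A$ and $A_\xi$ is our own way of satisfying $A^\top A = \mathrm{diag}(\sigma^2)$ and $A_\xi^\top A_\xi = \sigma_\xi^2 I$; the paper does not prescribe a particular construction):

```python
import numpy as np

def make_factors(d, r, sigma, sigma_xi, rng):
    """Build A in R^{d x r} with A^T A = diag(sigma^2) and A_xi in
    R^{d x (d-r)} with A_xi^T A_xi = sigma_xi^2 * I, by rescaling the
    columns of a random orthogonal matrix (our own construction)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = Q[:, :r] * sigma          # column p scaled by sigma[p]
    A_xi = Q[:, r:] * sigma_xi
    return A, A_xi

def sample_positive_pair(A, A_xi, B, B_xi, d, r, rng):
    """A positive pair shares the latent z; the latent noises stay independent."""
    z = rng.choice([-1.0, 1.0], size=r) / np.sqrt(d)
    xi_A = rng.choice([-1.0, 1.0], size=d - r) / np.sqrt(d)
    xi_B = rng.choice([-1.0, 1.0], size=d - r) / np.sqrt(d)
    return A @ z + A_xi @ xi_A, B @ z + B_xi @ xi_B
```

Sampling independent latent vectors for the two modules instead yields a negative pair.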
Define
$$f_A(x_A) := \frac{W_A^\top x_A}{\sqrt{\mathbb{E}_{x_A} \|W_A^\top x_A\|^2}}, \qquad f_B(x_B) := \frac{W_B^\top x_B}{\sqrt{\mathbb{E}_{x_B} \|W_B^\top x_B\|^2}},$$
where $W_A, W_B \in \mathbb{R}^{d \times m}$ are the trainable parameters. In words, we first map the inputs $(x_A, x_B)$ into the embedding space $\mathbb{R}^m$ using $W_A$ and $W_B$, and then apply batch normalization to the outputs. By saying the learned representations are aligned, we mean that $f_A$ and $f_B$ are close for positive pairs and far away from each other for negative pairs. Meanwhile, we say the learned representations are balanced if changing a small fraction of the coordinates of $z$ does not change the representation dramatically. See Section 4 for formal definitions. One can easily verify that, in the population case, we have $\mathbb{E}_{x_A} \|W_A^\top x_A\|^2 = \|W_A^\top A\|_F^2/d + \|W_A^\top A_\xi\|_F^2/d$. For notational simplicity, we write $K_A = W_A^\top A$, $K_B = W_B^\top B$, $K_{A,\xi} = W_A^\top A_\xi$, and $K_{B,\xi} = W_B^\top B_\xi$. These are the matrices that directly map latent vectors to their final representations. We also define $N_A^2 = (\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2)/d$ and $N_B^2 = (\|K_B\|_F^2 + \|K_{B,\xi}\|_F^2)/d$. Then our model can be rewritten as
$$f_A(x_A) = \frac{K_A z_A + K_{A,\xi} \xi_A}{N_A}, \qquad f_B(x_B) = \frac{K_B z_B + K_{B,\xi} \xi_B}{N_B}.$$
We initialize each entry of $W_A$ and $W_B$ using i.i.d. Gaussians $\mathcal{N}(0, 1/m)$. This scaling ensures the norm of the outputs at initialization does not blow up as $m \to \infty$. We train our model using gradient descent over the following contrastive loss $L$. First, we define
$$S_A(x_A^+, x_B^+) = \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+) + K\, \mathbb{E}_{x_B^-} \exp(\tau_t^2 f_A^+ \cdot f_B^-)},$$
$$S_B(x_A^+, x_B^+) = \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+) + K\, \mathbb{E}_{x_A^-} \exp(\tau_t^2 f_A^- \cdot f_B^+)},$$
where $K$ is a positive constant controlling the strength of negative samples and $\tau_t \in (0, 1]$ is the inverse temperature. Then, we define the contrastive loss as $L := L_A + L_B := -\mathbb{E} \log S_A(x_A^+, x_B^+) - \mathbb{E} \log S_B(x_A^+, x_B^+)$. By the non-contrastive loss, we mean $\bar{L} := -\mathbb{E} \langle f_A^+, f_B^+ \rangle$.
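A batch-level numerical sketch of this loss may help. Here the population expectations over negatives are replaced by averages over the other pairs in the batch, and the function names are ours, not the paper's:

```python
import numpy as np

def features(W, X, N):
    """Normalized linear features f(x) = W^T x / N, where N is the
    (population or batch) root-mean-square norm of W^T x."""
    return (X @ W) / N

def contrastive_loss(FA, FB, tau2, K):
    """Empirical version of L = L_A + L_B.  Row i of FA and FB is a
    positive pair; the other rows in the batch serve as negatives."""
    pos = np.exp(tau2 * np.sum(FA * FB, axis=1))      # exp(tau^2 f_A^+ . f_B^+)
    cross = np.exp(tau2 * FA @ FB.T)                  # all pairwise similarities
    n = FA.shape[0]
    neg_B = (cross.sum(axis=1) - pos) / max(n - 1, 1) # batch average over x_B^-
    neg_A = (cross.sum(axis=0) - pos) / max(n - 1, 1) # batch average over x_A^-
    SA = pos / (pos + K * neg_B)
    SB = pos / (pos + K * neg_A)
    return -np.mean(np.log(SA)) - np.mean(np.log(SB))
```

At $\tau_t^2 = 0$ every similarity is 1 and the loss degenerates to the constant $2\log(1+K)$, which is one way to see that a small inverse temperature mutes the effect of negatives.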
Formally, the training algorithm is defined as follows.

Algorithm 3.1 (Training algorithm). Let $\mathcal{L}$ be $L$ in the contrastive case and $\bar{L}$ in the non-contrastive case. At each step, we first sample a batch of positive/negative pairs $\{(x_A^+, x_B^+, x_A^-, x_B^-)\}_{i=1}^N$, use them to compute the empirical version of $\mathcal{L}$, and update the weight matrices using $W_A \leftarrow W_A - \tau_t^{-2} \eta \nabla_{W_A} \mathcal{L}$ and $W_B \leftarrow W_B - \tau_t^{-2} \eta \nabla_{W_B} \mathcal{L}$. In the non-contrastive case, we always use $\tau_t = 1$, and we repeat the above update until gradient descent converges to an approximate stationary point. In the contrastive case, we first use a small $\tau_t = 1/\mathrm{poly}(d)$, run the process for $T_1 = \mathrm{poly}(d)$ iterations, then switch to $\tau_t = 1$ and run the process for another $T_2 - T_1 = \mathrm{poly}(d)$ iterations.

4. MAIN RESULTS

In words, our results say that though both contrastive and non-contrastive methods can align the representations, the representations learned by contrastive methods are more balanced. First, we need to define what "alignment" and "balance" mean here. We still use $f_A$ and $f_B$ to denote the embeddings, but our definitions here will be architecture-agnostic. After some general discussion, we also discuss these definitions in our specific setting.

Definition 4.1 (Alignment). We define the alignment score as $\Gamma_{\mathrm{Align}} := \frac{1}{2} \mathbb{E}_{x_A^\pm, x_B^\pm}[\cdots]$.

Definition 4.2 (Balance). We define the balance score as $\Gamma_{\mathrm{Balance}} := \|\Sigma_f\|_F^2 / \|\Sigma_f\|_2^2$, where $\Sigma_f := \mathbb{E}_{x_A} f_A f_A^\top$. Namely, $\Gamma_{\mathrm{Balance}}$ is the stable rank of the covariance matrix of the output embeddings. We say that the learned representations are balanced if $\Gamma_{\mathrm{Balance}} \ge \alpha r$ for some $\alpha \approx 1$.

We make some short remarks on Definition 4.2. First, we only use $f_A$ in this definition because, if the embeddings are well-aligned, $f_A f_A^\top$ and $f_B f_B^\top$ should be approximately the same. Second, the stable rank is usually used as an alternative to the actual rank because it is less sensitive to small singular values (Rudelson & Vershynin (2007)). To see why this makes sense, note that $\|\Sigma_f\|_F^2 / \|\Sigma_f\|_2^2 = \sum_{k=1}^m \kappa_k^2 / \max_{k \in [m]} \kappa_k^2$, where $\kappa_1, \ldots, \kappa_m$ are the singular values of $\Sigma_f$. If all nonzero $\kappa_k$ are the same, then this recovers the actual rank. The intuition behind the use of rank is that the more independent latent variables the model learns, the higher the rank of the representations needs to be. In Jing et al. (2022), the authors also use rank to measure the degree of "dimensional collapse". Unlike their argument, here we only require $\Gamma_{\mathrm{Balance}}$ to be at least $\alpha r$, instead of $\alpha m$, for the representations to be called balanced, because even if we can recover the underlying latent vectors, the rank is still at most $r$. Hence, it does not make much sense to expect the learned representations to span the entire embedding space.
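In code, Definition 4.2 amounts to computing the stable rank of the output covariance; estimating $\Sigma_f$ from a batch of embeddings is our own simplification of the population quantity:

```python
import numpy as np

def balance_score(F):
    """Stable rank of the empirical covariance Sigma_f = E[f f^T],
    estimated from a batch of embeddings (one embedding per row of F)."""
    Sigma = F.T @ F / F.shape[0]
    s = np.linalg.svd(Sigma, compute_uv=False)  # singular values, descending
    return (s ** 2).sum() / s[0] ** 2
```

Isotropic embeddings give a score near the ambient rank, while embeddings concentrated along one direction give a score near 1.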
Finally, note that a sufficient condition for $\Gamma_{\mathrm{Balance}} \ge \alpha r$ is that the ratio of the largest to the $r$-th largest singular value is at most $1/\sqrt{\alpha}$. In other words, after excluding those singular values that should be 0, the condition number is approximately 1.

Why are balanced representations important? In our setting, one simple example of aligned but unbalanced representations is $W_A^\top = \mathrm{diag}(\nu) A^{-1}$ and $W_B^\top = \mathrm{diag}(\nu) B^{-1}$ with $\nu_1 = 1$ and $\nu_2 = \cdots = \nu_r = \exp(-d)$. This model maps inputs whose latent vector is $z$ to $\mathrm{diag}(\nu) z$, up to some normalization, for both modules, whence it has $\Gamma_{\mathrm{Align}} = 1$. Meanwhile, one can verify that this model has $\Gamma_{\mathrm{Balance}}/r \le 2/r \approx 0$. The problem with this model is that it overly emphasizes $z_1$ and is sensitive to small changes in $z_1$. This model can be made balanced by replacing $\mathrm{diag}(\nu)$ with $I_r$, in which case the model directly recovers the latent vector $z$. As a result, all changes in $z$ will be reflected in the final representation in a faithful way. Similar notions have also been studied in Wang & Isola (2020) under the name "uniformity" in the context of self-supervised learning, and they also report that balanced representations lead to better performance in downstream tasks, though, unlike our result, they do not provide guarantees on training. Still, this further suggests that learning balanced representations is a reasonable and important goal.

With these two definitions, we can now state our main results.

Theorem 4.3. Suppose that the network width $m = \mathrm{poly}(d)$ is sufficiently large, the learning rate $\eta = 1/\mathrm{poly}(d)$ is sufficiently small, and we generate sufficiently, but still polynomially, many samples at each step to compute the loss. Suppose that, for some small constant $c > 0$,
$$\frac{\sigma^2_{\max}}{\sigma^2_{\min}} \max\Big\{1, \frac{(d-r)\sigma_\xi^2}{r \sigma^2_{\min}}\Big\} \le c \log d.$$
(a) Non-contrastive loss.
There exists a $\sigma \in \mathbb{R}^r$ satisfying the above assumptions such that, after $\mathrm{poly}(d)$ iterations, Algorithm 3.1 will converge to an approximate stationary point at which the learned representations are aligned but not balanced, that is, $\Gamma_{\mathrm{Align}} \approx 1$ but $\Gamma_{\mathrm{Balance}}/r \approx 0$.

(b) Contrastive loss. There exist $\tau_0^2 = 1/\mathrm{poly}(d)$ and $T_1 = \mathrm{poly}(d)$ such that, for any valid $\sigma$, Algorithm 3.1 will reach a point after $\mathrm{poly}(d)$ iterations at which the learned representations are aligned and balanced, that is, $\Gamma_{\mathrm{Align}} \approx 1$ and $\Gamma_{\mathrm{Balance}}/r \approx 1$.

We close this section with sufficient conditions for the learned representations to be aligned and balanced in our setting. First, in our specific setting, the most natural way for a model to achieve $\Gamma_{\mathrm{Align}} \approx 1$ is to learn $\|f_A - f_B\| \approx \|K_0(z_A - z_B)\|$ for some $K_0 \in \mathbb{R}^{m \times r}$ with $K_0^\top K_0 \succ 0$. In this case, the distance between positive pairs is always close to 0 while the distance between negative pairs is positive. Recall that the output of our model is $f_A = (K_A z_A + K_{A,\xi} \xi_A)/N_A$ and $f_B = (K_B z_B + K_{B,\xi} \xi_B)/N_B$. Hence, we have the following sufficient condition.

Lemma 4.4 (Sufficient condition for aligned representations). Suppose that $\|K_A\|_F \gg \|K_{A,\xi}\|_F$, $\|K_B\|_F \gg \|K_{B,\xi}\|_F$, $K_A \approx K_B$, and the smallest singular value of $K_0^\top K_0$ (with $K_0 := K_A/N_A$) is at least $1/\mathrm{poly}(d)$. Then the learned representations are aligned.

The first two conditions imply that the signal parts dominate the outputted representations, the third condition gives the existence of $K_0$, and the final condition makes sure that the smallest singular value of $K_0^\top K_0$ is still at least $1/\mathrm{poly}(d)$ after normalization. Now we consider the balance of the learned representations. Suppose that our representations are already aligned, in the sense of Lemma 4.4, so that $\|K_A\|_F \gg \|K_{A,\xi}\|_F$, $\|K_B\|_F \gg \|K_{B,\xi}\|_F$, and $K_A \approx K_B$. Then we have
$$\Sigma_f = \mathbb{E}_{x_A} f_A f_A^\top \approx \frac{\mathbb{E}_z\, K_A z z^\top K_A^\top}{\|K_A\|_F^2 / d} = \frac{K_A K_A^\top}{\|K_A\|_F^2}.$$
Recall the relationship between $\Gamma_{\mathrm{Balance}}$ and the effective condition number of $\Sigma_f$. We have the following sufficient condition.

Lemma 4.5 (Sufficient condition for balanced representations). Suppose that our model is aligned, in the sense of Lemma 4.4.
If the condition number of $K_A^\top K_A$ is close to 1, then the learned representations are balanced.

Note that, since the columns of $A$ are not orthonormal, the condition number of $K_A^\top K_A$ is not close to 1, even at initialization. In other words, a condition-number-reducing stage is necessary for the model to learn balanced representations.

5. TRAINING DYNAMICS AND PROOF OUTLINE

Our proof is based on characterizing the dynamics of gradient descent over the contrastive loss. We first choose a small inverse temperature $\tau_t^2 = 1/\mathrm{poly}(d)$ and run gradient descent for $\mathrm{poly}(d)$ iterations. This stage is called Stage 1. Then, we set $\tau_t^2 = 1$ and run gradient descent for another $\mathrm{poly}(d)$ iterations. This stage is called Stage 2. The use of different $\tau_t$ is mainly for technical purposes, since this gives a cleaner separation between stages. We observe similar stage-wise behavior when using a uniform $\tau_t^2$. See Figure 1 for simulation results. We also report results for the non-contrastive loss there.

5.1. THE TRAINING DYNAMICS

Instead of tracking the parameters $W_A$ and $W_B$ directly, we will track $K_A, K_{A,\xi}, K_B, K_{B,\xi}$, the matrices that directly map latent signals and noises to the final representations. For the case with contrastive pairs, one can show that the dynamics of $K_A$ are governed by the following equation:
$$\frac{d}{dt} K_A = \mathbb{E}_{x_A^+, x_B^+}\Big[\big(2 - S_A(x_A^+, x_B^+) - S_B(x_A^+, x_B^+)\big)\Big(\frac{f_B^+ (z^+)^\top}{N_A} - \langle f_A^+, f_B^+ \rangle \frac{K_A}{N_A^2 d}\Big)\Big] \mathrm{diag}(\sigma^2)$$
$$- K\, \mathbb{E}_{x_A^+, x_B^\pm}\Big[S_A(x_A^+, x_B^+) \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^-)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)}\Big(\frac{f_B^- (z^+)^\top}{N_A} - \langle f_A^+, f_B^- \rangle \frac{K_A}{N_A^2 d}\Big)\Big] \mathrm{diag}(\sigma^2)$$
$$- K\, \mathbb{E}_{x_A^\pm, x_B^+}\Big[S_B(x_A^+, x_B^+) \frac{\exp(\tau_t^2 f_A^- \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)}\Big(\frac{f_B^+ (z^-)^\top}{N_A} - \langle f_A^-, f_B^+ \rangle \frac{K_A}{N_A^2 d}\Big)\Big] \mathrm{diag}(\sigma^2). \quad (5)$$
See Lemma A.3 for the calculation. The first term comes from the positive pairs, and the second and third terms come from the negative pairs. Within each term, the second part comes from the normalization. The equations for the other $K$-matrices can be derived similarly. We rescale the gradients by $1/\tau_t^2$ so that $\frac{d}{dt} K_A$ does not shrink with $\tau_t^2$. For the non-contrastive case, the equation is
$$\frac{d}{dt} K_A = \mathbb{E}_{x_A^+, x_B^+}\Big[\frac{f_B^+ (z^+)^\top}{N_A} - \langle f_A^+, f_B^+ \rangle \frac{K_A}{N_A^2 d}\Big] \mathrm{diag}(\sigma^2). \quad (6)$$
Note that the RHS resembles the first term of (5). This is not a coincidence: we will establish an approximate equivalence between the non-contrastive approach and the contrastive approach with a small inverse temperature $\tau_t^2$ (cf. Section 5.3 and Lemma B.9).

5.2. THE INFINITE-WIDTH DYNAMICS

The overall proof strategy is to first characterize the dynamics of the infinite-width limit, which is much simpler compared to (5), and then control the error introduced by discretizing the infinite-width network using polynomially many neurons. This discretization is one of the main technical challenges of the proof: in general, in order to track the infinite-width dynamics, an exponentially large network is needed (Mei et al. (2018)). The basic idea is to use a Taylor expansion around the infinite-width trajectory to factor out the first-order error terms and show that either they drive the process towards the infinite-width trajectory, or the error growth they introduce is slower than the convergence rate. Here, for ease of presentation, we will focus on the noiseless infinite-width dynamics and, in particular, the evolution of the condition number. Recall that we use i.i.d. Gaussians to initialize the entries of $W_A$ and $W_B$. Hence, in the $m \to \infty$ limit, different columns of $K_A$ and $K_B$ are orthogonal to each other, at least at initialization. Moreover, one can verify that, thanks to the symmetry, this holds throughout the entire training procedure. Meanwhile, by symmetry, the corresponding quantities in modules $A$ and $B$ are always the same. Namely, when $m \to \infty$, we have $K_A^\top K_A = K_B^\top K_B = \mathrm{diag}(\kappa^2)$ and $K_A^\top K_B = \mathrm{diag}(\bar\kappa^2)$ for some $\kappa, \bar\kappa \in \mathbb{R}^r$. As a result, in order to characterize the dynamics of $K_A, K_B$, it suffices to look at $\kappa^2$ and $\bar\kappa^2$. One can show that, in this noiseless infinite-width limit, we have (cf. Lemma A.7)
$$\frac{d}{dt} \kappa_p^2 = 4(1 - S)\Big(\frac{\bar\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2} \frac{\kappa_p^2}{\|\kappa\|^2}\Big) \sigma_p^2 - 4(1 - S)\Big(\frac{\bar\kappa_p^2}{\|\kappa\|^2} T_p - \frac{\kappa_p^2}{\|\kappa\|^2} \bar{T}\Big) \sigma_p^2,$$
$$\frac{d}{dt} \bar\kappa_p^2 = 4(1 - S)\Big(\frac{\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2} \frac{\bar\kappa_p^2}{\|\kappa\|^2}\Big) \sigma_p^2 - 4(1 - S)\Big(\frac{\kappa_p^2}{\|\kappa\|^2} T_p - \frac{\bar\kappa_p^2}{\|\kappa\|^2} \bar{T}\Big) \sigma_p^2, \quad (7)$$
where $S$ is a $\Theta(1)$ quantity depending on $\kappa$ and $\bar\kappa$, $T_p = \tanh(\tau_t^2 \bar\kappa_p^2 / \|\kappa\|^2)$, and $\bar{T} = \sum_{k=1}^r (\kappa_k^2 / \|\kappa\|^2) T_k$.
The first term comes from the first term of (5), and the second term from the second and third terms of (5). When $\bar\kappa_p^2 \approx \kappa_p^2$ for all $p \in [r]$, the model is aligned. When all $\kappa_p^2$ are roughly the same, the model is balanced.
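The two-stage behavior can also be observed by numerically integrating these dynamics. The sketch below makes simplifying assumptions of our own: $S$ is frozen at a constant, and the initial values and step counts are illustrative:

```python
import numpy as np

def simulate(kappa2, kbar2, sigma2, tau2_schedule, dt=0.01, S=0.5):
    """Euler integration of the noiseless infinite-width dynamics (7).
    kappa2 tracks the diagonal of K_A^T K_A (= K_B^T K_B) and kbar2 the
    diagonal of K_A^T K_B.  In the paper S is a Theta(1) quantity
    depending on kappa and kbar; freezing it is our simplification."""
    kappa2, kbar2 = kappa2.astype(float).copy(), kbar2.astype(float).copy()
    for tau2 in tau2_schedule:
        n2 = kappa2.sum()                                   # ||kappa||^2
        Tp = np.tanh(tau2 * kbar2 / n2)
        Tbar = np.sum(kappa2 / n2 * Tp)
        pos_k = (kbar2 / n2 - kbar2.sum() / n2 * kappa2 / n2) * sigma2
        neg_k = (kbar2 / n2 * Tp - kappa2 / n2 * Tbar) * sigma2
        pos_b = (kappa2 / n2 - kbar2.sum() / n2 * kbar2 / n2) * sigma2
        neg_b = (kappa2 / n2 * Tp - kbar2 / n2 * Tbar) * sigma2
        kappa2 = kappa2 + dt * 4 * (1 - S) * (pos_k - neg_k)
        kbar2 = kbar2 + dt * 4 * (1 - S) * (pos_b - neg_b)
    return kappa2, kbar2
```

Running a small inverse temperature first and then switching to $\tau_t^2 = 1$ reproduces the qualitative picture: the positive-pair term aligns $\bar\kappa^2$ with $\kappa^2$, and only after the switch does the negative-pair term push the $\kappa_p^2$ towards a common value.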

5.3. STAGE 1

In this subsection, we describe the dynamics of gradient descent in Stage 1 and explain how we control the growth of the errors and the condition number. For ease of presentation, we will mostly use the infinite-width dynamics (7) instead of the finite-width ones (5).

Equivalence of Stage 1 and non-contrastive methods: Recall that we use a small $\tau_t^2$ in Stage 1, so $T_p \approx 0$ for all $p \in [r]$, whence $\bar{T}$ is also approximately 0. As a result, the second terms of (7) are approximately 0. In other words, only the positive pairs matter. Meanwhile, one can check that, in the infinite-width limit, (6) corresponds to the first term of (7), up to some multiplicative factor. This gives the equivalence of the dynamics of Stage 1 and the non-contrastive method. See Lemma B.9 for a more formal proof in the finite-width setting.

Now, we consider the contrastive loss. The main result of Stage 1 is as follows.

Lemma 5.1 (Informal version of Lemma B.1). Under the assumptions of Theorem 4.3, the finite-width dynamics closely track the infinite-width ones throughout Stage 1, which takes at most $\mathrm{poly}(d)$ iterations, and, at the end of Stage 1, we have, for any $p, q \in [r]$ and $s \in [d - r]$,
$$\bar\kappa_p^2 / \kappa_p^2 \approx 1, \qquad \|[K_{A,\xi}]_s\|^2 / \kappa_p^2 \approx 0, \qquad \kappa_p^2 / \kappa_q^2 \le O(\sqrt{d}). \quad (8)$$
Moreover, there exists a $\sigma \in \mathbb{R}^r$ such that $\max_{p,q} \kappa_p^2 / \kappa_q^2 = \Omega(\sqrt{d})$ at the end of Stage 1.

In words, (8) says that, at the end of Stage 1, $K_A \approx K_B$ in the relative sense, the noise-signal ratio is small, and the condition number is bounded by $O(\sqrt{d})$. By Lemma 4.4, the first two conditions of (8) imply that the learned representations are aligned. The proof of this lemma can be found in Section B. Basically, we couple the convergence of $\bar\kappa_p^2 / \kappa_p^2$ and the noise-signal ratio with the growth of the discretization error and the condition number. The main tool we use is the following nonlinear version of Gronwall's lemma.

Lemma 5.2. Let $A_t$ be a positive process.
Let $X_t$ and $Y_t$ be processes satisfying
$$\dot{X}_t \le -A_t X_t, \qquad \dot{Y}_t \le \beta A_t X_t Y_t,$$
with $X_0, Y_0, \beta$ positive. Then, for any $T \ge 0$, we have $Y_T \le Y_0 \exp(\beta X_0)$.

Here, $X_t$ represents the progress we have made and $Y_t$ the error we wish to control. In our case, $X_t$ is the maximum of $1 - \bar\kappa_p^2 / \kappa_p^2$ and the noise-signal ratio, and $Y_t$ is the discretization error and the condition number. This lemma says that if the error growth rate decreases as we make more progress, then, by coupling these two processes, we can make sure the error does not blow up.
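A quick numerical check of Lemma 5.2 is easy to run: replace the inequalities by equalities, pick a hypothetical time-varying rate $A_t$, and verify that the bound $Y_T \le Y_0 \exp(\beta X_0)$ holds along the discretized trajectory:

```python
import numpy as np

def gronwall_demo(beta=2.0, X0=1.5, Y0=1.0, T=10.0, n=20000):
    """Euler simulation of dX/dt = -A_t X, dY/dt = beta A_t X Y with a
    hypothetical rate A_t = 1 + sin(t)^2.  Lemma 5.2 predicts the
    terminal error satisfies Y_T <= Y_0 * exp(beta * X_0)."""
    dt = T / n
    X, Y = X0, Y0
    for i in range(n):
        A = 1.0 + np.sin(i * dt) ** 2
        X, Y = X - dt * A * X, Y + dt * beta * A * X * Y
    return Y
```

The progress variable $X_t$ decays to 0, so the error $Y_t$ grows only while progress is still being made, which is exactly the coupling the lemma exploits.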

5.4. STAGE 2

In Stage 2, $\tau_t^2$ is no longer $o(1)$, and now the second term of (7) comes into play. We show that, in this stage, the model will reduce the condition number of $K_A$ to approximately 1. By Lemma 4.5, this implies that the learned representations are balanced. Formally, we have the following lemma. The proof can be found in Section C.

Lemma 5.3 (Informal version of Lemma C.1). Under the assumptions of Theorem 4.3, the finite-width dynamics still closely track the infinite-width ones throughout Stage 2, which again takes at most $\mathrm{poly}(d)$ iterations, and, at the end of Stage 2, we have, for any $p, q \in [r]$ and $s \in [d - r]$,
$$\bar\kappa_p^2 / \kappa_p^2 \approx 1, \qquad \|[K_{A,\xi}]_s\|^2 / \kappa_p^2 \approx 0, \qquad \kappa_p^2 / \kappa_q^2 \approx 1.$$
The way to control $\bar\kappa_p^2 / \kappa_p^2$ and the noise-signal ratio is similar to Stage 1. For the condition number, note that when $\tau_t^2 = 1$ and $\bar\kappa \approx \kappa$, the first term of (7) becomes close to 0, while the second term becomes $-4(1 - S) \frac{\kappa_p^2}{\|\kappa\|^2}(T_p - \bar{T}) \sigma_p^2$. Since $T_p = \tanh(\tau_t^2 \bar\kappa_p^2 / \|\kappa\|^2)$ is positively correlated with $\bar\kappa_p^2 / \|\kappa\|^2$ and $\bar{T}$ is a weighted average of these $T_p$'s, this term pushes the $\bar\kappa_p^2 / \|\kappa\|^2$ towards their average and, consequently, reduces the condition number. To obtain a satisfactory convergence rate, it suffices to consider the ratio $\kappa_p^2 / \kappa_q^2$ directly. See Appendix C.5 for details. The discretization error is handled in Appendix C.2.

Besides the simulation results reported in Figure 1, we also conduct experiments on the MSCOCO-2014 dataset (Lin et al., 2014) using more practical models. See Figure 2 for the results. For the text part, we use a pre-trained RoBERTa (Liu et al., 2019), followed by a 3-layer fully-connected network with batch norm between layers. For the image part, we use a pre-trained ResNet101 (He et al., 2015), followed by the same layers. In both parts, the width of the fully-connected layers and the output dimension are 768. We freeze the pre-trained parts of the model and only train the fully-connected parts.

6. EXPERIMENTAL RESULTS

We measure the quality of the learned representations using zero-shot performance on the MSCOCO-2014 validation set. Unlike common image classification datasets, images in the MSCOCO dataset usually have multiple labels, each corresponding to an object that appears in the image, and there are 80 categories in total. We regard a prediction as correct if it matches one of the labels. The zero-shot classification is done in the same way as in Radford et al. (2021): namely, we compute the image embedding and the embeddings of the prompts "This is a [LABEL NAME]", and use the prompt with the highest correlation with the image embedding as the prediction.
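The zero-shot procedure can be sketched as follows; `text_encoder` stands in for the frozen RoBERTa-plus-MLP text tower described above, and the function names are ours:

```python
import numpy as np

def zero_shot_predict(image_emb, label_names, text_encoder):
    """Zero-shot prediction in the style of Radford et al. (2021): embed
    the prompt 'This is a [LABEL NAME]' for each category and return the
    label whose prompt embedding correlates most with the image embedding."""
    prompts = [f"This is a {name}" for name in label_names]
    T = np.stack([text_encoder(p) for p in prompts])   # (n_labels, dim)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb)
    scores = T @ v                                     # cosine similarities
    return label_names[int(np.argmax(scores))]
```

For the multi-label MSCOCO setting, the prediction counts as correct if the returned label matches any of the image's labels.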

7. CONCLUSION AND DISCUSSION

In this work, we study the role of contrastive pairs in multimodal learning and show that contrastive pairs are important for the model to learn representations that are both aligned and balanced. Our work extends previous results in several directions. First, we consider the more complicated multimodal learning problem. Second, our data-generating model is inhomogeneous, and we show that, in this case, non-contrastive methods can collapse to an approximately rank-one solution while contrastive methods can learn all features. We also include output normalization in our analysis, a technique that is widely used in practice but still under-studied in theory. However, despite the complexity of the analysis, our model is still linear, which is very different from the models used in practice. Also, for the results on non-contrastive methods, we do not consider more advanced training techniques such as those of Grill et al. (2020) and Chen & He (2021). We leave the analysis of these more practical techniques for future work.

A GRADIENT CALCULATION

In this section, we compute the gradients and the equations governing some related quantities. We first consider the general finite-width case, and then the infinite-width case, in which we have simple formulas for many quantities of interest.

A.1 THE FINITE-WIDTH CASE

The results in this subsection follow from mostly brute-force calculations. First, we prove the following auxiliary lemma.

Lemma A.1. For any $x_A^+$ and $x_B^-$, we have
$$
\nabla_{W_A} \exp\big(\tau_t^2 f_A^+ \cdot f_B^-\big)
= \frac{\tau_t^2 \exp\big(\tau_t^2 f_A^+ \cdot f_B^-\big)}{N_A}
\left( x_A^+ (f_B^-)^\top
- \big\langle f_A^+, f_B^- \big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{d N_A} \right).
$$

Then we compute the gradients.

Lemma A.2. We have
$$
\begin{aligned}
\nabla_{W_A} L ={}& -\tau_t^2\, \mathbb{E}_{x_A^+, x_B^+}\!\left[ \big(2 - S_A(x_A^+, x_B^+) - S_B(x_A^+, x_B^+)\big) \left( \frac{x_A^+ (f_B^+)^\top}{N_A} - \big\langle f_A^+, f_B^+ \big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{N_A^2 d} \right) \right] \\
&+ \frac{K \tau_t^2}{N_A}\, \mathbb{E}_{x_A^+, x_B^\pm}\!\left[ S_A(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^-)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( x_A^+ (f_B^-)^\top - \big\langle f_A^+, f_B^- \big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \\
&+ \frac{K \tau_t^2}{N_A}\, \mathbb{E}_{x_A^\pm, x_B^+}\!\left[ S_B(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^- \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( x_A^- (f_B^+)^\top - \big\langle f_A^-, f_B^+ \big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right].
\end{aligned}
$$
The formula for $\nabla_{W_B} L$ can be obtained by interchanging the roles of $A$ and $B$.

Instead of tracking $W_A$ and $W_B$ directly, we track $K_A$, $K_B$, $K_{A,\xi}$ and $K_{B,\xi}$. Their dynamics are governed by the following equations, which are a direct corollary of Lemma A.2.

Lemma A.3.
We have
$$
\begin{aligned}
\frac{d}{dt} K_A ={}& \tau_t^2\, \mathbb{E}_{x_A^+, x_B^+}\!\left[ \big(2 - S_A(x_A^+, x_B^+) - S_B(x_A^+, x_B^+)\big) \left( \frac{(K_B z^+ + K_{B,\xi}\xi_B)(z^+)^\top}{N_A N_B} - \big\langle f_A^+, f_B^+ \big\rangle \frac{K_A}{N_A^2 d} \right) \right] \operatorname{diag}(\sigma^2) \\
&- K\tau_t^2\, \mathbb{E}_{x_A^+, x_B^\pm}\!\left[ S_A(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^-)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( \frac{f_B^-(z^+)^\top}{N_A} - \big\langle f_A^+, f_B^- \big\rangle \frac{K_A}{N_A^2 d} \right) \right] \operatorname{diag}(\sigma^2) \\
&- K\tau_t^2\, \mathbb{E}_{x_A^\pm, x_B^+}\!\left[ S_B(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^- \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( \frac{f_B^+(z^-)^\top}{N_A} - \big\langle f_A^-, f_B^+ \big\rangle \frac{K_A}{N_A^2 d} \right) \right] \operatorname{diag}(\sigma^2),
\end{aligned}
$$
and
$$
\begin{aligned}
\frac{d}{dt} K_{A,\xi} ={}& \frac{\tau_t^2}{N_A}\, \mathbb{E}_{x_A^+, x_B^+}\!\left[ \big(2 - S_A(x_A^+, x_B^+) - S_B(x_A^+, x_B^+)\big) \left( f_B^+(\xi_A^+)^\top - \big\langle f_A^+, f_B^+ \big\rangle \frac{K_{A,\xi}}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \sigma_\xi^2 \\
&- \frac{K\tau_t^2}{N_A}\, \mathbb{E}_{x_A^+, x_B^\pm}\!\left[ S_A(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^+ \cdot f_B^-)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( f_B^-(\xi_A^+)^\top - \big\langle f_A^+, f_B^- \big\rangle \frac{K_{A,\xi}}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \sigma_\xi^2 \\
&- \frac{K\tau_t^2}{N_A}\, \mathbb{E}_{x_A^\pm, x_B^+}\!\left[ S_B(x_A^+, x_B^+)\, \frac{\exp(\tau_t^2 f_A^- \cdot f_B^+)}{\exp(\tau_t^2 f_A^+ \cdot f_B^+)} \left( f_B^+(\xi_A^-)^\top - \big\langle f_A^-, f_B^+ \big\rangle \frac{K_{A,\xi}}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \sigma_\xi^2 .
\end{aligned}
$$
We can rewrite the above result as follows.

Corollary A.4.
Define
$$
\begin{aligned}
Q_0 :={}& -\mathbb{E}_{x_A^+, x_B^+}\!\left[ \big(2 - S_A(x_A^+,x_B^+) - S_B(x_A^+,x_B^+)\big)\, \big\langle f_A^+, f_B^+ \big\rangle \right] \\
&+ K\, \mathbb{E}_{x_A^+, x_B^\pm}\!\left[ S_A(x_A^+,x_B^+)\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \big\langle f_A^+, f_B^- \big\rangle \right]
+ K\, \mathbb{E}_{x_A^\pm, x_B^+}\!\left[ S_B(x_A^+,x_B^+)\, \frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \big\langle f_A^-, f_B^+ \big\rangle \right],
\end{aligned}
$$
and
$$
Q_1 := \mathbb{E}\!\left[ \big(2-S_A-S_B\big)\, z^+(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_A\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, z^-(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\, \frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, z^+(z^-)^\top d \right],
$$
$$
Q_{1,\xi_B} := \mathbb{E}\!\left[ \big(2-S_A-S_B\big)\, \xi_B^+(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_A\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_B^-(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\, \frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_B^+(z^-)^\top d \right],
$$
$$
Q_{1,\xi_A} := \mathbb{E}\!\left[ \big(2-S_A-S_B\big)\, \xi_A^+(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_A\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_A^+(z^-)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\, \frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_A^-(z^+)^\top d \right],
$$
$$
Q_2 := \mathbb{E}\!\left[ \big(2-S_A-S_B\big)\, \xi_B^+(\xi_A^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_A\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_B^-(\xi_A^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\, \frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\, \xi_B^+(\xi_A^-)^\top d \right].
$$
We have
$$
\begin{aligned}
\frac{d}{dt} K_A &= \frac{K_B}{N_A N_B d} Q_1 \operatorname{diag}(\sigma^2) + \frac{K_{B,\xi}}{N_A N_B d} Q_{1,\xi_B} \operatorname{diag}(\sigma^2) + \frac{K_A}{N_A^2 d} Q_0 \operatorname{diag}(\sigma^2), \\
\frac{d}{dt} K_B &= \frac{K_A}{N_A N_B d} Q_1^\top \operatorname{diag}(\sigma^2) + \frac{K_{A,\xi}}{N_A N_B d} Q_{1,\xi_A} \operatorname{diag}(\sigma^2) + \frac{K_B}{N_B^2 d} Q_0 \operatorname{diag}(\sigma^2), \\
\frac{d}{dt} K_{A,\xi} &= \frac{K_B}{N_A N_B d} Q_{1,\xi_A}^\top \sigma_\xi^2 + \frac{K_{B,\xi}}{N_A N_B d} Q_2 \sigma_\xi^2 + \frac{K_{A,\xi}}{N_A^2 d} Q_0 \sigma_\xi^2, \\
\frac{d}{dt} K_{B,\xi} &= \frac{K_A}{N_A N_B d} Q_{1,\xi_B}^\top \sigma_\xi^2 + \frac{K_{A,\xi}}{N_A N_B d} Q_2^\top \sigma_\xi^2 + \frac{K_{B,\xi}}{N_B^2 d} Q_0 \sigma_\xi^2.
\end{aligned}
$$
We are interested in each column of $K_A$ and $K_B$, whose dynamics are given by the next lemma, which also decomposes the dynamics into radial and tangential directions.

Lemma A.5.
For any $p \in [r]$ and $q \in [d-r]$ (writing $\overline{v} := v/\|v\|$ for the normalization of a nonzero vector $v$), we have
$$
\begin{aligned}
\frac{d}{dt}\|[K_A]_p\|^2 &= \frac{2\langle [K_A]_p, [K_B Q_1]_p\rangle}{N_A N_B d}\sigma_p^2 + \frac{2\langle [K_A]_p, [K_{B,\xi} Q_{1,\xi_B}]_p\rangle}{N_A N_B d}\sigma_p^2 + \frac{2\|[K_A]_p\|^2}{N_A^2 d} Q_0 \sigma_p^2, \\
\frac{d}{dt}\|[K_B]_p\|^2 &= \frac{2\langle [K_B]_p, [K_A Q_1^\top]_p\rangle}{N_A N_B d}\sigma_p^2 + \frac{2\langle [K_B]_p, [K_{A,\xi} Q_{1,\xi_A}]_p\rangle}{N_A N_B d}\sigma_p^2 + \frac{2\|[K_B]_p\|^2}{N_B^2 d} Q_0 \sigma_p^2, \\
\frac{d}{dt}\|[K_{A,\xi}]_q\|^2 &= \frac{2\langle [K_{A,\xi}]_q, [K_B Q_{1,\xi_A}^\top]_q\rangle}{N_A N_B d}\sigma_\xi^2 + \frac{2\langle [K_{A,\xi}]_q, [K_{B,\xi} Q_2]_q\rangle}{N_A N_B d}\sigma_\xi^2 + \frac{2\|[K_{A,\xi}]_q\|^2}{N_A^2 d} Q_0 \sigma_\xi^2, \\
\frac{d}{dt}\|[K_{B,\xi}]_q\|^2 &= \frac{2\langle [K_{B,\xi}]_q, [K_A Q_{1,\xi_B}^\top]_q\rangle}{N_A N_B d}\sigma_\xi^2 + \frac{2\langle [K_{B,\xi}]_q, [K_{A,\xi} Q_2^\top]_q\rangle}{N_A N_B d}\sigma_\xi^2 + \frac{2\|[K_{B,\xi}]_q\|^2}{N_B^2 d} Q_0 \sigma_\xi^2,
\end{aligned}
$$
and
$$
\begin{aligned}
\frac{d}{dt}\overline{[K_A]}_p &= \Big(I - \overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\left( \frac{[K_B Q_1]_p}{\|[K_A]_p\|} + \frac{[K_{B,\xi} Q_{1,\xi_B}]_p}{\|[K_A]_p\|} \right) \frac{\sigma_p^2}{N_A N_B d}, \\
\frac{d}{dt}\overline{[K_B]}_p &= \Big(I - \overline{[K_B]}_p\overline{[K_B]}_p^\top\Big)\left( \frac{[K_A Q_1^\top]_p}{\|[K_B]_p\|} + \frac{[K_{A,\xi} Q_{1,\xi_A}]_p}{\|[K_B]_p\|} \right) \frac{\sigma_p^2}{N_A N_B d}, \\
\frac{d}{dt}\overline{[K_{A,\xi}]}_q &= \Big(I - \overline{[K_{A,\xi}]}_q\overline{[K_{A,\xi}]}_q^\top\Big)\left( \frac{[K_B Q_{1,\xi_A}^\top]_q}{\|[K_{A,\xi}]_q\|} + \frac{[K_{B,\xi} Q_2]_q}{\|[K_{A,\xi}]_q\|} \right) \frac{\sigma_\xi^2}{N_A N_B d}, \\
\frac{d}{dt}\overline{[K_{B,\xi}]}_q &= \Big(I - \overline{[K_{B,\xi}]}_q\overline{[K_{B,\xi}]}_q^\top\Big)\left( \frac{[K_A Q_{1,\xi_B}^\top]_q}{\|[K_{B,\xi}]_q\|} + \frac{[K_{A,\xi} Q_2^\top]_q}{\|[K_{B,\xi}]_q\|} \right) \frac{\sigma_\xi^2}{N_A N_B d}.
\end{aligned}
$$
Finally, as a simple corollary of Lemma A.1, we have the following result on the gradients of the non-contrastive loss.

Lemma A.6. For the non-contrastive loss (4), we have
$$
\nabla_{W_A} L = -\,\mathbb{E}\!\left[ \frac{x_A^+ (f_B^+)^\top}{N_A} - \big\langle f_A^+, f_B^+\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{N_A^2 d} \right].
$$
As a corollary, we have, in this case,
$$
\frac{d}{dt} K_A = \mathbb{E}\!\left[ \frac{f_B^+(z^+)^\top}{N_A} - \big\langle f_A^+, f_B^+\big\rangle \frac{K_A}{N_A^2 d} \right]\operatorname{diag}(\sigma^2), \qquad
\frac{d}{dt} K_{A,\xi} = \mathbb{E}\!\left[ \frac{f_B^+(\xi_A^+)^\top}{N_A} - \big\langle f_A^+, f_B^+\big\rangle \frac{K_{A,\xi}}{N_A^2 d} \right]\sigma_\xi^2.
$$
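Before turning to the proofs, the population objective whose gradients appear above can be sketched in a finite-sample form. This is a minimal sketch: the function names and the choice of $K$ explicit negatives per side are ours, and the empirical mean over negatives stands in for the population expectation in the text.

```python
import numpy as np

def similarity_scores(fA, fB, fA_negs, fB_negs, tau2):
    """Finite-sample versions of S_A and S_B for one positive pair (fA, fB),
    each with K negative features:
      S_A = exp(tau2 fA.fB) / (exp(tau2 fA.fB) + K * mean_j exp(tau2 fA.fB_j^-)),
    and symmetrically for S_B with negatives on the A side."""
    pos = np.exp(tau2 * fA @ fB)
    K = len(fB_negs)
    SA = pos / (pos + K * np.mean(np.exp(tau2 * fB_negs @ fA)))
    SB = pos / (pos + K * np.mean(np.exp(tau2 * fA_negs @ fB)))
    return SA, SB

def contrastive_loss(fA, fB, fA_negs, fB_negs, tau2):
    """Symmetric contrastive loss L = L_A + L_B = -log S_A - log S_B."""
    SA, SB = similarity_scores(fA, fB, fA_negs, fB_negs, tau2)
    return -np.log(SA) - np.log(SB)
```

At temperature $\tau_t^2 = 0$ all similarities collapse and $S_A = S_B = 1/(1+K)$, so the loss equals $2\log(1+K)$, which is a convenient sanity check.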

OMITTED PROOFS OF THIS SUBSECTION

Proof of Lemma A.1. We compute
$$
\nabla_{W_A} \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)
= \tau_t^2 \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)\, \nabla_{W_A} \big\langle f_A^+, f_B^-\big\rangle
= \tau_t^2 \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)\, \nabla_{W_A} \frac{\big\langle W_A^\top x_A^+, f_B^-\big\rangle}{\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}/\sqrt{d}},
$$
which equals
$$
\tau_t^2 \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)\, \frac{\nabla_{W_A}\big\langle W_A^\top x_A^+, f_B^-\big\rangle}{\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}/\sqrt{d}}
- \tau_t^2 \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)\, \big\langle f_A^+, f_B^-\big\rangle\, \frac{\nabla_{W_A}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}}{\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}}.
$$
For the first term, we have $\nabla_{W_A}\langle W_A^\top x_A^+, f_B^-\rangle = x_A^+ (f_B^-)^\top$. For the second term, we have
$$
\frac{\nabla_{W_A}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}}{\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}}
= \frac{\nabla_{W_A}\|K_A\|_F^2 + \nabla_{W_A}\|K_{A,\xi}\|_F^2}{2\big(\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2\big)}
= \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}.
$$
Hence,
$$
\nabla_{W_A} \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)
= \frac{\tau_t^2 \exp\big(\tau_t^2 f_A^+\cdot f_B^-\big)}{\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}/\sqrt{d}}
\left( x_A^+ (f_B^-)^\top - \big\langle f_A^+, f_B^-\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right).
$$

Proof of Lemma A.2. We compute
$$
\nabla_{W_A} L_A = -\,\mathbb{E}\!\left[ \frac{\nabla_{W_A} S_A(x_A^+, x_B^+)}{S_A(x_A^+, x_B^+)} \right]
= -\,\mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \right]
+ \mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_B^-}\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_B^-}\exp(\tau_t^2 f_A^+\cdot f_B^-)} \right].
$$
By Lemma A.1, the first term is
$$
-\,\mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \right]
= -\frac{\tau_t^2}{N_A}\, \mathbb{E}_{x_A^+, x_B^+}\!\left[ x_A^+(f_B^+)^\top - \big\langle f_A^+, f_B^+\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right],
$$
and the second term is
$$
\begin{aligned}
&\mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_B^-}\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_B^-}\exp(\tau_t^2 f_A^+\cdot f_B^-)} \right] \\
&\quad= \frac{\tau_t^2}{N_A}\, \mathbb{E}\!\left[ S_A(x_A^+,x_B^+)\left( x_A^+(f_B^+)^\top - \big\langle f_A^+, f_B^+\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \\
&\qquad+ \frac{K\tau_t^2}{N_A}\, \mathbb{E}\!\left[ S_A(x_A^+,x_B^+)\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \left( x_A^+(f_B^-)^\top - \big\langle f_A^+, f_B^-\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right].
\end{aligned}
$$
Thus,
$$
\begin{aligned}
\nabla_{W_A} L_A ={}& -\frac{\tau_t^2}{N_A}\, \mathbb{E}_{x_A^+, x_B^+}\!\left[ \big(1 - S_A(x_A^+,x_B^+)\big)\left( x_A^+(f_B^+)^\top - \big\langle f_A^+, f_B^+\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right] \\
&+ \frac{K\tau_t^2}{N_A}\, \mathbb{E}\!\left[ S_A(x_A^+,x_B^+)\, \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \left( x_A^+(f_B^-)^\top - \big\langle f_A^+, f_B^-\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2}} \right) \right].
\end{aligned}
$$
Then, for $\nabla_{W_A} L_B$, we compute
$$
\nabla_{W_A} L_B = -\,\mathbb{E}\!\left[ \frac{\nabla_{W_A} S_B(x_A^+, x_B^+)}{S_B(x_A^+, x_B^+)} \right]
= -\,\mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \right]
+ \mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_A^-}\nabla_{W_A}\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+) + K\,\mathbb{E}_{x_A^-}\exp(\tau_t^2 f_A^-\cdot f_B^+)} \right].
$$
Again, by Lemma A.1, the first term is
$$
-\,\mathbb{E}\!\left[ \frac{\nabla_{W_A}\exp(\tau_t^2 f_A^+\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)} \right] = -\frac{\tau_t^2}{N_A}\,\mathbb{E}\!\left[ x_A^+(f_B^+)^\top - \big\langle f_A^+, f_B^+\big\rangle \frac{A K_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2+\|K_{A,\xi}\|_F^2}} \right],
$$
and the second term is
$$
\begin{aligned}
&\frac{\tau_t^2}{N_A}\,\mathbb{E}\!\left[ S_B(x_A^+,x_B^+)\left( x_A^+(f_B^+)^\top - \big\langle f_A^+,f_B^+\big\rangle \frac{AK_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2+\|K_{A,\xi}\|_F^2}} \right)\right] \\
&\quad + \frac{K\tau_t^2}{N_A}\,\mathbb{E}\!\left[ S_B(x_A^+,x_B^+)\,\frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\left( x_A^-(f_B^+)^\top - \big\langle f_A^-,f_B^+\big\rangle \frac{AK_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2+\|K_{A,\xi}\|_F^2}} \right)\right].
\end{aligned}
$$
Thus,
$$
\begin{aligned}
\nabla_{W_A} L_B ={}& -\frac{\tau_t^2}{N_A}\,\mathbb{E}\!\left[ \big(1 - S_B(x_A^+,x_B^+)\big)\left( x_A^+(f_B^+)^\top - \big\langle f_A^+,f_B^+\big\rangle \frac{AK_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2+\|K_{A,\xi}\|_F^2}} \right)\right] \\
&+ \frac{K\tau_t^2}{N_A}\,\mathbb{E}\!\left[ S_B(x_A^+,x_B^+)\,\frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\left( x_A^-(f_B^+)^\top - \big\langle f_A^-,f_B^+\big\rangle \frac{AK_A^\top + A_\xi K_{A,\xi}^\top}{\sqrt{d}\sqrt{\|K_A\|_F^2+\|K_{A,\xi}\|_F^2}} \right)\right].
\end{aligned}
$$
Combining these formulas completes the proof.
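The closed form in Lemma A.1 can also be verified numerically against central finite differences. The following is a small self-contained check; the dimensions and the random instances are arbitrary choices of ours, with $K_A = W_A^\top A$, $K_{A,\xi} = W_A^\top A_\xi$ and $N_A = \sqrt{(\|K_A\|_F^2 + \|K_{A,\xi}\|_F^2)/d}$, and with $f_B^-$ taken as a fixed vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, m, r, d = 5, 4, 2, 6  # observation dim, width, signal dim, total latent dim

A = rng.standard_normal((d_obs, r))
A_xi = rng.standard_normal((d_obs, d - r))
W_A = rng.standard_normal((d_obs, m))
x_pos = rng.standard_normal(d_obs)   # plays the role of x_A^+
f_B_neg = rng.standard_normal(m)     # plays the role of f_B^-
tau2 = 0.3                           # plays the role of tau_t^2

def g(W):
    """g(W) = exp(tau2 * <f_A^+, f_B^->) with f_A^+ = W^T x_A^+ / N_A."""
    K, K_xi = W.T @ A, W.T @ A_xi
    N = np.sqrt((np.sum(K ** 2) + np.sum(K_xi ** 2)) / d)
    return np.exp(tau2 * (W.T @ x_pos / N) @ f_B_neg)

def grad_closed(W):
    """The closed-form gradient from Lemma A.1."""
    K, K_xi = W.T @ A, W.T @ A_xi
    N = np.sqrt((np.sum(K ** 2) + np.sum(K_xi ** 2)) / d)
    f_A = W.T @ x_pos / N
    return (tau2 * np.exp(tau2 * f_A @ f_B_neg) / N) * (
        np.outer(x_pos, f_B_neg)
        - (f_A @ f_B_neg) * (A @ K.T + A_xi @ K_xi.T) / (d * N)
    )

def grad_numeric(W, eps=1e-6):
    """Central finite differences, entry by entry."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (g(W + E) - g(W - E)) / (2 * eps)
    return G
```

The two gradients agree to within finite-difference accuracy, which is a useful guard against sign or normalization mistakes when reproducing the calculation.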

A.2 THE INFINITE-WIDTH CASE

Now we consider the noiseless infinite-width dynamics. The results of this subsection are not used in the proofs; they mainly serve to give intuition about how the dynamics look. As discussed in the main text, in the noiseless infinite-width case it suffices to track $\kappa_p^2 = \|[K_A]_p\|^2$ and $\bar\kappa_p^2 = \langle [K_A]_p, [K_B]_p\rangle$.

Lemma A.7. In the noiseless infinite-width case, we have
$$
\begin{aligned}
\frac{d}{dt}\kappa_p^2 &= 4(1-S)\left( \frac{\bar\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{\kappa_p^2}{\|\kappa\|^2} \right)\sigma_p^2 - 4(1-S)\left( \frac{\bar\kappa_p^2}{\|\kappa\|^2} T_p - \frac{\kappa_p^2}{\|\kappa\|^2} \bar T \right)\sigma_p^2, \\
\frac{d}{dt}\bar\kappa_p^2 &= 4(1-S)\left( \frac{\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{\bar\kappa_p^2}{\|\kappa\|^2} \right)\sigma_p^2 - 4(1-S)\left( \frac{\kappa_p^2}{\|\kappa\|^2} T_p - \frac{\bar\kappa_p^2}{\|\kappa\|^2} \bar T \right)\sigma_p^2.
\end{aligned}
$$

Proof. First, note that
$$
f_A^+\cdot f_B^- = \frac{\big\langle z^+, \operatorname{diag}(\bar\kappa^2)\, z^-\big\rangle}{\|\kappa\|^2/d}.
$$
This implies that (a) $f_A^+\cdot f_B^+$ does not depend on the actual value of $z^+$, and (b) if we flip the signs of $z_p^\pm$ simultaneously, the value of $f_A^+\cdot f_B^-$ remains unchanged. To compute $S_A$, we then need to take the expectation over $z^-$. We compute
$$
\mathbb{E}_{z^-}\exp\big(\tau_t^2 f_A^+\cdot f_B^-\big) = \prod_{k=1}^r \mathbb{E}_{z_k^-}\exp\!\left( \tau_t^2 \frac{\bar\kappa_k^2 z_k^+ z_k^-}{\|\kappa\|^2/d} \right) = \prod_{k=1}^r \cosh\!\left( \frac{\tau_t^2\bar\kappa_k^2}{\|\kappa\|^2} \right) =: Z_c.
$$
Again, this does not depend on the actual value of $z^+$. One can carry out a similar calculation for $S_B$; consequently, $S_A \equiv S_B \equiv S$ for some $S$ that depends on $\kappa$ and $\bar\kappa$ but not on $z^+$. Then, we can rewrite (5) as
$$
\begin{aligned}
\frac{d}{dt}K_A ={}& 2(1-S)\,\mathbb{E}_{x_A^+,x_B^+}\!\left[ \frac{f_B^+(z^+)^\top}{N_A} - \big\langle f_A^+,f_B^+\big\rangle \frac{K_A}{N_A^2 d} \right]\operatorname{diag}(\sigma^2)
- 2KS\,\mathbb{E}_{x_A^+,x_B^\pm}\!\left[ \frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\left( \frac{f_B^-(z^+)^\top}{N_A} - \big\langle f_A^+,f_B^-\big\rangle \frac{K_A}{N_A^2 d} \right)\right]\operatorname{diag}(\sigma^2) \\
={}& 2(1-S)\left( \frac{K_B}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{K_A}{\|\kappa\|^2} \right)\operatorname{diag}(\sigma^2) \\
&- \frac{2KS}{\exp\big(\tau_t^2\|\bar\kappa\|^2/\|\kappa\|^2\big)}\,\mathbb{E}_{x_A^+,x_B^\pm}\!\left[ \exp\!\left( \tau_t^2\frac{\langle z^+, \operatorname{diag}(\bar\kappa^2) z^-\rangle}{\|\kappa\|^2/d} \right)\left( \frac{K_B z^-(z^+)^\top}{\|\kappa\|^2/d} - \frac{\langle z^+, \operatorname{diag}(\bar\kappa^2) z^-\rangle}{\|\kappa\|^2/d}\,\frac{K_A}{\|\kappa\|^2} \right)\right]\operatorname{diag}(\sigma^2).
\end{aligned}
$$
Note that
$$
\mathbb{E}_{z^\pm}\!\left[ \exp\!\left( \tau_t^2\frac{\langle z^+, \operatorname{diag}(\bar\kappa^2) z^-\rangle}{\|\kappa\|^2/d} \right) z_p^- z_q^+\, d \right]
= \mathbb{1}\{p=q\}\, \sinh\!\left( \frac{\tau_t^2\bar\kappa_p^2}{\|\kappa\|^2} \right)\prod_{k\neq p}\cosh\!\left( \frac{\tau_t^2\bar\kappa_k^2}{\|\kappa\|^2} \right)
= \mathbb{1}\{p=q\}\, Z_c T_p,
$$
where $T_p := \tanh\big(\tau_t^2\bar\kappa_p^2/\|\kappa\|^2\big)$.
Meanwhile, note that
$$
\mathbb{E}_{x_A^+,x_B^\pm}\!\left[ \exp\!\left( \tau_t^2\frac{\langle z^+,\operatorname{diag}(\bar\kappa^2)z^-\rangle}{\|\kappa\|^2/d} \right)\frac{\langle z^+,\operatorname{diag}(\bar\kappa^2)z^-\rangle}{\|\kappa\|^2/d} \right]
= \sum_{k=1}^r \frac{\bar\kappa_k^2}{\|\kappa\|^2/d}\,\mathbb{E}_{z^\pm}\!\left[ \exp\!\left( \tau_t^2\frac{\langle z^+,\operatorname{diag}(\bar\kappa^2)z^-\rangle}{\|\kappa\|^2/d} \right) z_k^+z_k^- \right]
= Z_c \sum_{k=1}^r \frac{\bar\kappa_k^2}{\|\kappa\|^2}T_k =: Z_c\bar T.
$$
Thus,
$$
\begin{aligned}
\frac{d}{dt}K_A &= 2(1-S)\left( \frac{K_B}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{K_A}{\|\kappa\|^2} \right)\operatorname{diag}(\sigma^2) - \frac{2KSZ_c}{\exp\big(\tau_t^2\|\bar\kappa\|^2/\|\kappa\|^2\big)}\left( \frac{K_B}{\|\kappa\|^2}\operatorname{diag}\big([T_k]_{k\in[r]}\big) - \frac{K_A}{\|\kappa\|^2}\bar T \right)\operatorname{diag}(\sigma^2) \\
&= 2(1-S)\left( \frac{K_B}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{K_A}{\|\kappa\|^2} \right)\operatorname{diag}(\sigma^2) - 2(1-S)\left( \frac{K_B}{\|\kappa\|^2}\operatorname{diag}\big([T_k]_{k\in[r]}\big) - \frac{K_A}{\|\kappa\|^2}\bar T \right)\operatorname{diag}(\sigma^2),
\end{aligned}
$$
where the last identity uses $KSZ_c = (1-S)\exp\big(\tau_t^2\|\bar\kappa\|^2/\|\kappa\|^2\big)$. As a corollary,
$$
\frac{d}{dt}[K_A]_p = 2(1-S)\left( \frac{[K_B]_p}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{[K_A]_p}{\|\kappa\|^2} \right)\sigma_p^2 - 2(1-S)\left( \frac{[K_B]_p}{\|\kappa\|^2}T_p - \frac{[K_A]_p}{\|\kappa\|^2}\bar T \right)\sigma_p^2.
$$
Hence,
$$
\frac{d}{dt}\kappa_p^2 = 2\left\langle [K_A]_p, \frac{d}{dt}[K_A]_p \right\rangle
= 4(1-S)\left( \frac{\bar\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{\kappa_p^2}{\|\kappa\|^2} \right)\sigma_p^2 - 4(1-S)\left( \frac{\bar\kappa_p^2}{\|\kappa\|^2}T_p - \frac{\kappa_p^2}{\|\kappa\|^2}\bar T \right)\sigma_p^2.
$$
By symmetry, for $K_B$, we have
$$
\frac{d}{dt}[K_B]_p = 2(1-S)\left( \frac{[K_A]_p}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{[K_B]_p}{\|\kappa\|^2} \right)\sigma_p^2 - 2(1-S)\left( \frac{[K_A]_p}{\|\kappa\|^2}T_p - \frac{[K_B]_p}{\|\kappa\|^2}\bar T \right)\sigma_p^2.
$$
Then, we can compute
$$
\frac{d}{dt}\bar\kappa_p^2 = 4(1-S)\left( \frac{\kappa_p^2}{\|\kappa\|^2} - \frac{\|\bar\kappa\|^2}{\|\kappa\|^2}\frac{\bar\kappa_p^2}{\|\kappa\|^2} \right)\sigma_p^2 - 4(1-S)\left( \frac{\kappa_p^2}{\|\kappa\|^2}T_p - \frac{\bar\kappa_p^2}{\|\kappa\|^2}\bar T \right)\sigma_p^2.
$$
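The cosh/sinh expectation used in this proof can be checked by exact enumeration over Rademacher signs. Below is a small self-contained check, where $s_k$ stands for $z_k^+ z_k^- d \in \{\pm 1\}$ and the parameter values are arbitrary choices of ours.

```python
import itertools
import numpy as np

def expectation_bruteforce(kbar2, norm2, tau2, p):
    """E[ exp(tau2 * sum_k kbar2_k s_k / norm2) * s_p ] over iid uniform
    signs s in {-1, +1}^r, computed by exact enumeration."""
    r = len(kbar2)
    total = 0.0
    for s in itertools.product([-1.0, 1.0], repeat=r):
        total += np.exp(tau2 * np.dot(kbar2, s) / norm2) * s[p]
    return total / 2 ** r

def expectation_closedform(kbar2, norm2, tau2, p):
    """Z_c * T_p with Z_c = prod_k cosh(tau2 * kbar2_k / norm2) and
    T_p = tanh(tau2 * kbar2_p / norm2), as in the proof of Lemma A.7."""
    a = tau2 * np.asarray(kbar2) / norm2
    return np.prod(np.cosh(a)) * np.tanh(a[p])
```

Since each sign appears independently, the expectation factorizes into a $\sinh$ for coordinate $p$ and $\cosh$ factors for the rest, which is exactly the $Z_c T_p$ identity.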

B STAGE 1

In this section, we analyze the first stage of training. We formalize the main results of Stage 1 below.

Lemma B.1 (Stage 1). Under the assumptions of Theorem 4.3, we can choose a sufficiently (polynomially) large $m$ and a sufficiently (inverse-polynomially) small $\tau_t^2$, possibly depending on the $\delta$'s that appear in this lemma, so that the following statement holds. Let $T_1$ be the earliest time at which all of the following hold (where $\overline{v} := v/\|v\|$ denotes normalization):
$$
\big\langle \overline{[K_A]}_p, \overline{[K_B]}_p \big\rangle \ge 1 - \delta_-, \quad \forall p\in[r];
\qquad
\frac{\|[K_{A,\xi}]_q\|}{\|[K_A]_p\|} \le \delta_{N/S}, \quad \forall p\in[r],\ q\in[d-r],
$$
where $\delta_-, \delta_{N/S} \in 1/\operatorname{poly}(d)$ are two given parameters. We have $T_1 \le \operatorname{poly}(d)$. Moreover, at any time $t\in[0,T_1]$, we have $\kappa_0 := \max_{p,q\in[r]} \|[K_A]_p\|/\|[K_A]_q\| \le O(\sqrt{d})$ and
$$
\begin{aligned}
&\|[K_A]_p\|^2 = (1\pm\delta_{A/B})\|[K_B]_p\|^2, \qquad \|[K_{A,\xi}]_q\|^2 = (1\pm\delta_{A/B})\|[K_{B,\xi}]_q\|^2, && \forall p\in[r],\ q\in[d-r], \\
&\left|1 - \frac{\|[K_{A,\xi}]_p\|}{\|[K_{A,\xi}]_q\|}\right| \le \delta_{\xi,\kappa_0}, && \forall p,q\in[d-r], \\
&\max\Big\{ \big|\big\langle \overline{[K_C]}_p, \overline{[K_D]}_q \big\rangle\big| : C,D\in\{A,B\} \Big\} \le \delta_{AB,\perp}, && \forall p\neq q\in[r], \\
&\max\Big\{ \big|\big\langle \overline{[K_C]}_p, \overline{[K_{D,\xi}]}_q \big\rangle\big|,\ \big|\big\langle \overline{[K_{A,\xi}]}_s, \overline{[K_{B,\xi}]}_q \big\rangle\big| : C,D\in\{A,B\} \Big\} \le \delta_{\xi,\perp}, && \forall p\in[r],\ q,s\in[d-r],
\end{aligned}
\tag{10}
$$
where $\delta_{A/B}, \delta_{\xi,\kappa_0}, \delta_{AB,\perp}, \delta_{\xi,\perp} \in 1/\operatorname{poly}(d)$ are given parameters.

Basically, the conditions in (10) mean that the norms of the corresponding columns are roughly the same, and that the columns of all these matrices are approximately orthogonal to each other. Both are true in the infinite-width limit, and by a standard concentration argument one can make all these errors arbitrarily (inverse-polynomially) small, at least at initialization. Note that, as a simple corollary of (10), we have $N_A = N_B(1\pm\delta_{A/B})$. For notational simplicity, we also define
$$
\rho_- = \max_{p\in[r]}\Big(1 - \big\langle \overline{[K_A]}_p, \overline{[K_B]}_p \big\rangle\Big),
\qquad
\rho_{N/S} = \max_{p\in[r],\,q\in[d-r]} \frac{\|[K_{A,\xi}]_q\|}{\|[K_A]_p\|}.
$$
Characterizing the dynamics in Stage 1 is relatively straightforward: we will see in Section B.1 that all the $Q$-matrices have simple forms. The main tool we use to control the condition number and the discretization error is the following nonlinear version of Gronwall's lemma.

Lemma B.2.
Let $A_t$ be a positive process, and let $X_t$ and $Y_t$ satisfy
$$
\dot X_t \le -A_t X_t, \qquad \dot Y_t \le \alpha A_t X_t Y_t,
$$
with $X_0$, $Y_0$, $\alpha$ positive. Then, for any $T \ge 0$, we have $Y_T \le Y_0\exp(\alpha X_0)$.

Remark. Here, $X_t$ represents the progress we have made and $Y_t$ the error. In our case, $X_t$ is the maximum of $1 - \langle \overline{[K_A]}_p, \overline{[K_B]}_p\rangle$ and the noise-signal ratio, and $Y_t$ is the discretization error. This lemma says that, if the error growth rate depends on the progress, then by coupling these two processes we can make sure the error does not blow up. The point of this lemma is that, with the coupling, we do not need a very tight estimate of the convergence time or of the error growth rate. ♣

Proof. The solution of the corresponding equality ODE system is given by
$$
X_T = X_0\exp\!\left(-\int_0^T A_t\,dt\right), \qquad
Y_T = Y_0\exp\!\left( \alpha X_0 \int_0^T A_t \exp\!\left(-\int_0^t A_s\,ds\right) dt \right).
$$
Note that
$$
\int_0^T A_t\exp\!\left(-\int_0^t A_s\,ds\right) dt = -\int_0^T d\exp\!\left(-\int_0^t A_s\,ds\right) = 1 - \exp\!\left(-\int_0^T A_t\,dt\right).
$$
Hence,
$$
Y_T = Y_0\exp\!\left( \alpha X_0\left( 1 - \exp\!\left(-\int_0^T A_t\,dt\right)\right)\right) \le Y_0\exp(\alpha X_0).
$$

The organization of this section is as follows. In Section B.1, we derive estimates for the $Q$-matrices; in Section B.2, we analyze the convergence rate and the condition number; in Section B.3, we control the discretization error. Since $\tau_t^2$ is small, we have $\exp(\tau_t^2 f_A^+\cdot f_B^-) = 1 \pm O(\tau_t^2)$. With this approximation, one can derive the following estimates for $S_A$ and $S_B$.

Lemma B.3 (Estimates for $S$). In Stage 1, we have
$$
S_A = \frac{1}{1+K} \pm O_z(\tau_t^2) \pm O\big(\tau_t^2\, d\,\delta_{\xi,\perp}\,\rho_{N/S}\big),
$$
and the same bound holds for $S_B$. The proofs of this and the following lemmas are deferred to the end of this subsection. Note that we derive a slightly finer estimate for the noise-related part: the additional $\rho_{N/S}$ factor will later be used to cancel terms like $\|[K_A]_p\|/\|[K_{A,\xi}]_q\|$, at the cost of a $\kappa_0$ factor. We emphasize that $O_z$ does not depend on the noises, so that later we can argue $\mathbb{E}_\xi\big[O_z(\tau_t^2)\,\xi\big] = 0$.

With this lemma, we now derive estimates for $Q_1$, $Q_{1,\xi}$, $Q_2$ and $Q_0$, respectively.

Lemma B.4 (Estimates for $Q_1$). In Stage 1, we have
$$
Q_1 = \frac{2K}{1+K}\, I_d \pm O(d\tau_t^2).
$$

Lemma B.5 (Estimates for $Q_{1,\xi}$ and $Q_2$). In Stage 1, we have
$$
\max\big\{ \|Q_{1,\xi_A}\|_F,\ \|Q_{1,\xi_B}\|_F,\ \|Q_2\|_F \big\} = \pm O\big(\tau_t^2\, d^2\, \rho_{N/S}\,\delta_{\xi,\perp}\big).
$$
The above two lemmas say that, in Stage 1, $Q_1$ is approximately diagonal, while $Q_{1,\xi}$ and $Q_2$ can more or less be ignored.

Lemma B.6 (Estimates for $Q_0$). In Stage 1, we have
$$
Q_0 = -\frac{2K}{1+K}\frac{\langle K_A, K_B\rangle}{N_A N_B d} \pm O(d\tau_t^2).
$$
The proof of Lemma B.6 is essentially the same as that of Lemma B.4, so we omit it. With these three lemmas, we can now simplify Lemma A.5 as follows.

Corollary B.7. In Stage 1, for any $p\in[r]$ and $q\in[d-r]$, we have
$$
\begin{aligned}
\frac{d}{dt}\|[K_A]_p\|^2 &= \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\left( \big\langle \overline{[K_A]}_p, \overline{[K_B]}_p\big\rangle\,\|[K_A]_p\|^2 - \frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_A]_p\|^2 \right) \pm O\!\left( \frac{\sigma_p^2}{N_AN_Bd}\big(\delta_{A/B} + \tau_t^2 d\big)\kappa_p^2 \right), \\
\frac{d}{dt}\|[K_{A,\xi}]_q\|^2 &= -\frac{4K}{1+K}\frac{\sigma_\xi^2}{N_AN_Bd}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_{A,\xi}]_q\|^2 \pm O\!\left( \frac{\sigma_\xi^2}{N_AN_Bd}\big(d\tau_t^2 + \delta_{A/B}\big)\|[K_{A,\xi}]_q\|^2 \right).
\end{aligned}
$$
The formulas for $K_B$ and $K_{B,\xi}$ can be obtained by interchanging the roles of $A$ and $B$. Basically, the two parts of the first term of $\frac{d}{dt}\|[K_A]_p\|^2$ correspond to the signal and the normalization, respectively. For $\frac{d}{dt}\|[K_{A,\xi}]_q\|^2$, the signal term is $0$ because the noises in $x_A^+$ and $x_B^+$ are independent and have mean $0$.

Corollary B.8. In Stage 1, we have
$$
\begin{aligned}
\frac{d}{dt}\overline{[K_A]}_p &= \frac{2K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big( I - \overline{[K_A]}_p\overline{[K_A]}_p^\top \Big)\overline{[K_B]}_p \pm O\!\left( \frac{\sigma_p^2}{N_AN_Bd}\big( \tau_t^2 d^2\kappa_0 + d\,\delta_{A/B} \big)\right), \\
\frac{d}{dt}\overline{[K_{A,\xi}]}_q &= \pm O\!\left( \frac{\sigma_\xi^2}{N_AN_Bd}\,\tau_t^2 d^3 \kappa_0\,\delta_{\xi,\perp} \right).
\end{aligned}
$$
Interchanging the roles of $A$ and $B$ gives the formulas for $K_B$ and $K_{B,\xi}$. Note that, after normalization, the normalization terms cancel with each other, and the signal terms drive $\overline{[K_A]}_p$ and $\overline{[K_B]}_p$ toward each other. Moreover, without the normalization terms, the different $\overline{[K_A]}_p$ are approximately independent of each other. This allows us to maintain the orthogonality between different columns. The above results also imply the following lemma.

Lemma B.9. The dynamics of the non-contrastive method and the Stage 1 dynamics are equivalent, up to a multiplicative constant and some higher-order terms.
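The coupling in Lemma B.2 can be illustrated numerically: Euler-integrating the equality case of the ODE system for an arbitrary positive process $A_t$, the error $Y_t$ indeed stays below $Y_0\exp(\alpha X_0)$ regardless of the horizon. This is a minimal sketch; the choice of $A_t$ and all parameter values are ours.

```python
import math

def simulate_gronwall(alpha, X0, Y0, T, steps):
    """Euler-integrate Xdot = -A_t X, Ydot = alpha A_t X Y (the equality
    case of Lemma B.2) for the sample positive process A_t = 1 + sin(t)^2,
    and return (X_T, Y_T)."""
    h = T / steps
    X, Y = X0, Y0
    for n in range(steps):
        A = 1.0 + math.sin(n * h) ** 2  # any positive process works here
        # update X and Y simultaneously using the current values
        X, Y = X - h * A * X, Y + h * alpha * A * X * Y
    return X, Y
```

In line with the lemma, the bound $Y_0\exp(\alpha X_0)$ does not depend on $T$: a longer horizon only moves $Y_T$ closer to the limit $Y_0\exp\big(\alpha X_0(1-e^{-\int_0^T A_t\,dt})\big)$.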

OMITTED PROOFS OF THIS SUBSECTION

Proof of Lemma B.3. First, we write
$$
f_A^+\cdot f_B^- = \frac{\langle K_Az^+, K_Bz^-\rangle}{N_AN_B} + \frac{\langle K_Az^+, K_{B,\xi}\xi_B^-\rangle}{N_AN_B} + \frac{\langle K_{A,\xi}\xi_A^+, K_Bz^-\rangle}{N_AN_B} + \frac{\langle K_{A,\xi}\xi_A^+, K_{B,\xi}\xi_B^-\rangle}{N_AN_B}.
$$
For the second term, we write
$$
\frac{\langle K_Az^+, K_{B,\xi}\xi_B^-\rangle}{N_AN_B} = \sum_{i\in[r],\,j\in[d-r]} \frac{\|[K_A]_i\|^2}{N_AN_Bd}\,\frac{\|[K_{B,\xi}]_j\|}{\|[K_A]_i\|}\,\big\langle \overline{[K_A]}_i, \overline{[K_{B,\xi}]}_j\big\rangle\, z_i^+\xi_{B,j}^-\, d.
$$
For each summand, we have $\|[K_{B,\xi}]_j\|/\|[K_A]_i\| \le O(\rho_{N/S})$, $\langle \overline{[K_A]}_i, \overline{[K_{B,\xi}]}_j\rangle = \pm O(\delta_{\xi,\perp})$, and $|z_i^+\xi_{B,j}^-\, d| \le 1$. Hence,
$$
\frac{\langle K_Az^+, K_{B,\xi}\xi_B^-\rangle}{N_AN_B} = \pm O\big(d\,\delta_{\xi,\perp}\,\rho_{N/S}\big).
$$
The same is also true for $\langle K_{A,\xi}\xi_A^+, K_Bz^-\rangle/(N_AN_B)$ and the last term. Therefore,
$$
f_A^+\cdot f_B^- = \frac{\langle K_Az^+, K_Bz^-\rangle}{N_AN_B} \pm O\big(d\,\delta_{\xi,\perp}\,\rho_{N/S}\big).
$$
Then, we compute
$$
\exp\big(\tau_t^2 f_A^+\cdot f_B^-\big) = \exp\!\left(\tau_t^2\frac{\langle K_Az^+,K_Bz^-\rangle}{N_AN_B}\right)\Big(1 \pm O\big(\tau_t^2 d\,\delta_{\xi,\perp}\,\rho_{N/S}\big)\Big) = 1 \pm O_z(\tau_t^2) \pm O\big(\tau_t^2 d\,\delta_{\xi,\perp}\,\rho_{N/S}\big).
$$
Thus,
$$
S_A = \frac{1}{1+K} \pm O_z(\tau_t^2) \pm O\big(\tau_t^2 d\,\delta_{\xi,\perp}\,\rho_{N/S}\big).
$$
The proof for $S_B$ is essentially the same.

Proof of Lemma B.4. Recall that
$$
Q_1 := \mathbb{E}\big[(2-S_A-S_B)\,z^+(z^+)^\top d\big] - K\,\mathbb{E}\!\left[ S_A\,\frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\,z^-(z^+)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\,\frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\,z^+(z^-)^\top d \right].
$$
Since $S_A = (1+K)^{-1} \pm O(\tau_t^2)$ and $S_B = (1+K)^{-1} \pm O(\tau_t^2)$, we have
$$
2 - S_A - S_B = 2 - \frac{2}{1+K} \pm O(\tau_t^2) = \frac{2K}{1+K} \pm O(\tau_t^2).
$$
As a result,
$$
Q_1 = \frac{2K}{1+K}\,\mathbb{E}\big[z^+(z^+)^\top d\big] - \frac{K}{1+K}\,\mathbb{E}\big[z^-(z^+)^\top d\big] - \frac{K}{1+K}\,\mathbb{E}\big[z^+(z^-)^\top d\big] \pm O(d\tau_t^2).
$$
Note that $\mathbb{E}\big[z^-(z^+)^\top\big] = 0$ and $\mathbb{E}\big[z^+(z^+)^\top\big] = I_d/d$. Therefore,
$$
Q_1 = \frac{2K}{1+K}\, I_d \pm O(d\tau_t^2).
$$

Proof of Lemma B.5. Recall that
$$
Q_{1,\xi_A} := \mathbb{E}\big[(2-S_A-S_B)\,\xi_A^+(z^+)^\top d\big] - K\,\mathbb{E}\!\left[ S_A\,\frac{\exp(\tau_t^2 f_A^+\cdot f_B^-)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\,\xi_A^+(z^-)^\top d \right] - K\,\mathbb{E}\!\left[ S_B\,\frac{\exp(\tau_t^2 f_A^-\cdot f_B^+)}{\exp(\tau_t^2 f_A^+\cdot f_B^+)}\,\xi_A^-(z^+)^\top d \right].
$$
Note that if some quantity $X$ does not depend on $\xi$, then $\mathbb{E}[X\xi] = 0$.
Hence, by Lemma B.3, we have $Q_{1,\xi_A} = \pm O\big(\tau_t^2 d^2\rho_{N/S}\,\delta_{\xi,\perp}\big)$. The proofs for $Q_{1,\xi_B}$ and $Q_2$ are essentially the same.

Proof of Corollary B.7. Recall that
$$
\frac{d}{dt}\|[K_A]_p\|^2 = \frac{2\langle[K_A]_p,[K_BQ_1]_p\rangle}{N_AN_Bd}\sigma_p^2 + \frac{2\langle[K_A]_p,[K_{B,\xi}Q_{1,\xi_B}]_p\rangle}{N_AN_Bd}\sigma_p^2 + \frac{2\|[K_A]_p\|^2}{N_A^2d}Q_0\sigma_p^2 =: T_1 + T_2 + T_3.
$$
We now estimate these terms one by one. For $T_1$, we have
$$
\langle[K_A]_p,[K_BQ_1]_p\rangle = \langle[K_A]_p,[K_B]_p\rangle[Q_1]_{p,p} + \sum_{k\neq p}\langle[K_A]_p,[K_B]_k\rangle[Q_1]_{k,p}
= \frac{2K}{1+K}\langle[K_A]_p,[K_B]_p\rangle \pm O\big(\tau_t^2d\kappa_p^2\big) \pm O\big(\tau_t^2d\kappa_p^2\kappa_0\delta_{AB,\perp}\big)
= \frac{2K}{1+K}\langle[K_A]_p,[K_B]_p\rangle \pm O\big(\tau_t^2d\kappa_p^2\big).
$$
Hence,
$$
T_1 = \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\langle[K_A]_p,[K_B]_p\rangle \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\tau_t^2d\kappa_p^2\right).
$$
For $T_2$, we compute
$$
T_2 = \frac{2\sigma_p^2}{N_AN_Bd}\sum_{k=1}^{d-r}\langle[K_A]_p,[K_{B,\xi}]_k\rangle[Q_{1,\xi_B}]_{k,p}
= \pm O(1)\,\frac{\sigma_p^2}{N_AN_Bd}\sum_{k=1}^{d-r}\|[K_A]_p\|^2\,\frac{\|[K_{B,\xi}]_k\|}{\|[K_A]_p\|}\,\big\langle\overline{[K_A]}_p,\overline{[K_{B,\xi}]}_k\big\rangle\,[Q_{1,\xi_B}]_{k,p}
= \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\tau_t^2d^3\rho_{N/S}^2\delta_{\xi,\perp}^2\kappa_p^2\right).
$$
Finally, for $T_3$, we compute
$$
T_3 = \frac{2\|[K_A]_p\|^2}{N_A^2d}\left(-\frac{2K}{1+K}\frac{\langle K_A,K_B\rangle}{N_AN_Bd} \pm O(d\tau_t^2)\right)\sigma_p^2
= -\frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\big(1\pm\delta_{A/B}\big)\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_A]_p\|^2 \pm O\!\left(\frac{\kappa_p^2}{N_AN_Bd}\tau_t^2d\sigma_p^2\right)
= -\frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_A]_p\|^2 \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\delta_{A/B}+\tau_t^2d\big)\kappa_p^2\right).
$$
Combining these, we obtain
$$
\frac{d}{dt}\|[K_A]_p\|^2 = \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\left(\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\,\|[K_A]_p\|^2 - \frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_A]_p\|^2\right) \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\delta_{A/B}+\tau_t^2d\big)\kappa_p^2\right).
$$
Interchanging the roles of $A$ and $B$ gives the formula for $\frac{d}{dt}\|[K_B]_p\|^2$. Now we consider $K_{A,\xi}$.
We write
$$
\frac{d}{dt}\|[K_{A,\xi}]_q\|^2 = \frac{2\langle[K_{A,\xi}]_q,[K_BQ_{1,\xi_A}^\top]_q\rangle}{N_AN_Bd}\sigma_\xi^2 + \frac{2\langle[K_{A,\xi}]_q,[K_{B,\xi}Q_2]_q\rangle}{N_AN_Bd}\sigma_\xi^2 + \frac{2\|[K_{A,\xi}]_q\|^2}{N_A^2d}Q_0\sigma_\xi^2 =: T_1 + T_2 + T_3.
$$
For $T_1$, we compute
$$
T_1 = \frac{2\sigma_\xi^2}{N_AN_Bd}\sum_{k=1}^{r}\langle[K_{A,\xi}]_q,[K_B]_k\rangle[Q_{1,\xi_A}]_{q,k}
= \frac{2\sigma_\xi^2}{N_AN_Bd}\sum_{k=1}^{r}\|[K_{A,\xi}]_q\|^2\,\frac{\|[K_B]_k\|}{\|[K_{A,\xi}]_q\|}\,\big\langle\overline{[K_{A,\xi}]}_q,\overline{[K_B]}_k\big\rangle\,[Q_{1,\xi_A}]_{q,k}
= \pm O(1)\,\frac{2\sigma_\xi^2}{N_AN_Bd}\|[K_{A,\xi}]_q\|^2\sum_{k=1}^{r}\frac{\kappa_0}{\rho_{N/S}}\,\delta_{\xi,\perp}\,\tau_t^2d^2\rho_{N/S}\delta_{\xi,\perp}
= \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\tau_t^2d^3\kappa_0\delta_{\xi,\perp}^2\,\|[K_{A,\xi}]_q\|^2\right).
$$
Similarly, one can show that this bound also holds for $T_2$. Finally, for $T_3$, by Lemma B.6, we have
$$
T_3 = \frac{2\|[K_{A,\xi}]_q\|^2}{N_AN_Bd}\big(1\pm\delta_{A/B}\big)\left(-\frac{2K}{1+K}\frac{\langle K_A,K_B\rangle}{N_AN_Bd} \pm O(d\tau_t^2)\right)\sigma_\xi^2
= -\frac{4K}{1+K}\frac{\sigma_\xi^2}{N_AN_Bd}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_{A,\xi}]_q\|^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\|[K_{A,\xi}]_q\|^2\right).
$$
Combining these, we obtain
$$
\frac{d}{dt}\|[K_{A,\xi}]_q\|^2 = -\frac{4K}{1+K}\frac{\sigma_\xi^2}{N_AN_Bd}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|[K_{A,\xi}]_q\|^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\|[K_{A,\xi}]_q\|^2\right).
$$
Interchanging the roles of $A$ and $B$ gives the formula for $K_{B,\xi}$.

Proof of Corollary B.8. We write
$$
\frac{d}{dt}\overline{[K_A]}_p = \Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\frac{[K_BQ_1]_p}{\|[K_A]_p\|}\frac{\sigma_p^2}{N_AN_Bd} + \Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\frac{[K_{B,\xi}Q_{1,\xi_B}]_p}{\|[K_A]_p\|}\frac{\sigma_p^2}{N_AN_Bd} =: T_1 + T_2.
$$
For $T_1$, we have
$$
T_1 = \sum_{k=1}^r\Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_B]}_k\,\frac{\|[K_B]_k\|\,[Q_1]_{k,p}}{\|[K_A]_p\|}\frac{\sigma_p^2}{N_AN_Bd}
= \Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_B]}_p\,\frac{\|[K_B]_p\|\,[Q_1]_{p,p}}{\|[K_A]_p\|}\frac{\sigma_p^2}{N_AN_Bd} + \sum_{k\neq p}\Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_B]}_k\,\frac{\|[K_B]_k\|\,[Q_1]_{k,p}}{\|[K_A]_p\|}\frac{\sigma_p^2}{N_AN_Bd}.
$$
For the first term, by Lemma B.4, we have
$$
\frac{\|[K_B]_p\|\,[Q_1]_{p,p}}{\|[K_A]_p\|} = \big(1\pm\delta_{A/B}\big)\left(\frac{2K}{1+K}\pm O(d\tau_t^2)\right) = \frac{2K}{1+K} \pm O\big(d\tau_t^2+\delta_{A/B}\big).
$$
Also by Lemma B.4, for each summand in the second term, we have
$$
\frac{\|[K_B]_k\|\,[Q_1]_{k,p}}{\|[K_A]_p\|} = \pm O\big(\tau_t^2d\kappa_0\big).
$$
Therefore,
$$
T_1 = \frac{2K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_B]}_p \pm \frac{\sigma_p^2}{N_AN_Bd}\sum_{k=1}^r O\big(\tau_t^2d\kappa_0+\delta_{A/B}\big)\,\overline{[K_B]}_k.
$$
For $T_2$, by Lemma B.5, we have
$$
T_2 = \frac{\sigma_p^2}{N_AN_Bd}\sum_{k=1}^{d-r}\Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_{B,\xi}]}_k\,\frac{\|[K_{B,\xi}]_k\|\,[Q_{1,\xi_B}]_{k,p}}{\|[K_A]_p\|}
= \frac{\sigma_p^2}{N_AN_Bd}\sum_{k=1}^{d-r}O\big(\tau_t^2d^2\rho_{N/S}^2\delta_{\xi,\perp}\big)\,\overline{[K_{B,\xi}]}_k.
$$
Combining these, we obtain
$$
\frac{d}{dt}\overline{[K_A]}_p = \frac{2K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(I-\overline{[K_A]}_p\overline{[K_A]}_p^\top\Big)\overline{[K_B]}_p \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\tau_t^2d^2\kappa_0+d\,\delta_{A/B}\big)\right).
$$
Now we consider $K_{A,\xi}$. Again, we write
$$
\frac{d}{dt}\overline{[K_{A,\xi}]}_q = \Big(I-\overline{[K_{A,\xi}]}_q\overline{[K_{A,\xi}]}_q^\top\Big)\frac{[K_BQ_{1,\xi_A}^\top]_q}{\|[K_{A,\xi}]_q\|}\frac{\sigma_\xi^2}{N_AN_Bd} + \Big(I-\overline{[K_{A,\xi}]}_q\overline{[K_{A,\xi}]}_q^\top\Big)\frac{[K_{B,\xi}Q_2]_q}{\|[K_{A,\xi}]_q\|}\frac{\sigma_\xi^2}{N_AN_Bd} =: T_1 + T_2.
$$
For the first term, by Lemma B.5, we have
$$
T_1 = \sum_{k=1}^r\Big(I-\overline{[K_{A,\xi}]}_q\overline{[K_{A,\xi}]}_q^\top\Big)\overline{[K_B]}_k\,\frac{\|[K_B]_k\|\,[Q_{1,\xi_A}]_{q,k}}{\|[K_{A,\xi}]_q\|}\frac{\sigma_\xi^2}{N_AN_Bd}
= \pm\sum_{k=1}^r O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\frac{\kappa_0}{\rho_{N/S}}\,\tau_t^2d^2\rho_{N/S}\delta_{\xi,\perp}\right)
= \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\tau_t^2d^3\kappa_0\delta_{\xi,\perp}\right).
$$
The same bound also holds for $T_2$; in fact, one can obtain a slightly sharper bound for it because the factor $\|[K_B]_k\|/\|[K_{A,\xi}]_q\|$ no longer appears. Combining these, we obtain
$$
\frac{d}{dt}\overline{[K_{A,\xi}]}_q = \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\tau_t^2d^3\kappa_0\delta_{\xi,\perp}\right).
$$

Proof of Lemma B.9. The proofs of Corollary B.8 and Corollary B.7, mutatis mutandis, yield
$$
\begin{aligned}
\frac{d}{dt}K_A &= \frac{2K}{1+K}\frac{K_B}{N_AN_Bd}\operatorname{diag}(\sigma^2) - \frac{2K}{1+K}\frac{K_A}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\operatorname{diag}(\sigma^2) \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\|K_A\|_F, \\
\frac{d}{dt}K_{A,\xi} &= -\frac{2K}{1+K}\frac{K_{A,\xi}}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\sigma_\xi^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_A^2d}\,d\tau_t^2\right)\|K_{A,\xi}\|_F.
\end{aligned}
$$
Recall from Lemma A.6 that, in the non-contrastive case, we have
$$
\frac{d}{dt}K_A = \mathbb{E}\!\left[\frac{f_B^+(z^+)^\top}{N_A} - \big\langle f_A^+,f_B^+\big\rangle\frac{K_A}{N_A^2d}\right]\operatorname{diag}(\sigma^2)
= \frac{K_B}{N_AN_Bd}\operatorname{diag}(\sigma^2) - \frac{\langle K_A,K_B\rangle}{N_AN_Bd}\frac{K_A}{N_A^2d}\operatorname{diag}(\sigma^2),
\qquad
\frac{d}{dt}K_{A,\xi} = -\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\frac{K_{A,\xi}}{N_A^2d}\sigma_\xi^2.
$$
These are exactly the same, except for the $2K/(1+K)$ factor and some higher-order error terms.

B.2 CONVERGENCE RATE AND THE CONDITION NUMBER

In this subsection, we estimate the rates at which $\rho_-$ and $\rho_{N/S}$ converge to $0$, and the growth rate of the condition number. The basic idea is to use the estimates derived in Corollary B.7 and Corollary B.8 to approximate the finite-width dynamics by the infinite-width ones.

Lemma B.10 (Convergence rate of $\rho_-$). In Stage 1, we have, for any $p\in[r]$,
$$
\frac{d}{dt}\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle = \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(1+\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big)\Big(1-\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big) \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\tau_t^2d^2\kappa_0+\delta_{A/B}\big)\right).
$$
Note that $1+\langle\overline{[K_A]}_p,\overline{[K_B]}_p\rangle = \Theta(1)$. Hence, $\langle\overline{[K_A]}_p,\overline{[K_B]}_p\rangle$ converges to $1$ at a linear rate. Now we consider the noise-signal ratio. For technical reasons, instead of characterizing the dynamics of $\rho_{N/S}$, we consider
$$
\tilde\rho_{N/S} := \frac{\|K_{A,\xi}\|_F}{\|K_A+K_B\|_F}.
$$
Note that we always have
$$
\tilde\rho_{N/S}^2 \ge \Theta(1)\,\frac{\|K_{A,\xi}\|_F^2}{\|K_A\|_F^2} \ge \Theta(1)\,\frac{(d-r)\|[K_{A,\xi}]_q\|^2}{\kappa_0^2\, r\,\|[K_A]_p\|^2}.
$$
In other words, $\rho_{N/S}$ can be bounded as $\rho_{N/S} \le O\!\big(\sqrt{d/r}\,\kappa_0\big)\,\tilde\rho_{N/S}$.

Lemma B.11 (Convergence rate of $\rho_{N/S}$). In Stage 1, we have
$$
\frac{d}{dt}\tilde\rho_{N/S}^2 \le -\frac{4K}{1+K}\frac{\sigma_{\min}^2}{N_AN_Bd}\tilde\rho_{N/S}^2 + O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\tilde\rho_{N/S}^2.
$$
This lemma says that the noise-signal ratio decreases exponentially fast.

Lemma B.12 (Growth rate of the condition number). Define $\rho_{p/q} := \|[K_A]_p\|^2/\|[K_A]_q\|^2$. In Stage 1, we have
$$
\dot\rho_{p/q} \le \frac{16K}{1+K}\frac{\sigma_{\max}^2}{N_AN_Bd}\Big(\rho_- + \min\big\{\tilde\rho_{N/S},1\big\}\Big)\rho_{p/q} \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(\delta_{A/B}+\tau_t^2d\big)\right)\rho_{p/q}.
$$
Note that the error growth slows down as $\rho_-$ and $\tilde\rho_{N/S}$ decrease. This allows us to use Lemma B.2 (cf. Section B.4).

OMITTED PROOFS OF THIS SUBSECTION

Proof of Lemma B.10. By Corollary B.8, we have
$$
\left\langle \frac{d}{dt}\overline{[K_A]}_p,\ \overline{[K_B]}_p \right\rangle
= \frac{2K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(1-\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle^2\Big) \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\tau_t^2d^2\kappa_0+\delta_{A/B}\big)\right)
= \frac{2K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(1+\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big)\Big(1-\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big) \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\tau_t^2d^2\kappa_0+\delta_{A/B}\big)\right).
$$
Hence, by symmetry, we have
$$
\frac{d}{dt}\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle = \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\Big(1+\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big)\Big(1-\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle\Big) \pm O\!\left(\frac{\sigma_p^2}{N_AN_Bd}\big(\tau_t^2d^2\kappa_0+\delta_{A/B}\big)\right).
$$

Proof of Lemma B.11. Similar to the proofs of Corollary B.7 and Corollary B.8, we have
$$
\begin{aligned}
\frac{d}{dt}K_A &= \frac{K_B}{N_AN_Bd}Q_1\operatorname{diag}(\sigma^2) + \frac{K_{B,\xi}}{N_AN_Bd}Q_{1,\xi_B}\operatorname{diag}(\sigma^2) + \frac{K_A}{N_A^2d}Q_0\operatorname{diag}(\sigma^2) \\
&= \frac{K_B}{N_AN_Bd}\left(\frac{2K}{1+K}I_d\pm O(d\tau_t^2)\right)\operatorname{diag}(\sigma^2) \pm \frac{K_{B,\xi}}{N_AN_Bd}\,O\big(\tau_t^2d^2\rho_{N/S}\delta_{\xi,\perp}\big)\operatorname{diag}(\sigma^2) + \frac{K_A}{N_A^2d}\left(-\frac{2K}{1+K}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\pm O(d\tau_t^2)\right)\operatorname{diag}(\sigma^2) \\
&= \frac{2K}{1+K}\frac{K_B}{N_AN_Bd}\operatorname{diag}(\sigma^2) - \frac{2K}{1+K}\frac{K_A}{N_AN_Bd}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\operatorname{diag}(\sigma^2) \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\|K_A\|_F.
\end{aligned}
$$
Define $\mathcal{K} := K_A + K_B$. Then, by symmetry, we have
$$
\frac{d}{dt}\mathcal{K} = \frac{2K}{1+K}\frac{1}{N_AN_Bd}\left(1-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right)\mathcal{K}\operatorname{diag}(\sigma^2) \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\|\mathcal{K}\|_F.
$$
Hence,
$$
\frac{d}{dt}\|\mathcal{K}\|_F^2 = \frac{4K}{1+K}\frac{1}{N_AN_Bd}\left(1-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right)\big\langle\mathcal{K},\,\mathcal{K}\operatorname{diag}(\sigma^2)\big\rangle \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\|\mathcal{K}\|_F^2
\ge \frac{4K}{1+K}\frac{\sigma_{\min}^2}{N_AN_Bd}\left(1-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right)\|\mathcal{K}\|_F^2 - O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\|\mathcal{K}\|_F^2.
$$
Similarly, we also have
$$
\begin{aligned}
\frac{d}{dt}K_{A,\xi} &= \frac{K_B}{N_AN_Bd}Q_{1,\xi_A}^\top\sigma_\xi^2 + \frac{K_{B,\xi}}{N_AN_Bd}Q_2\sigma_\xi^2 + \frac{K_{A,\xi}}{N_A^2d}Q_0\sigma_\xi^2 \\
&= \pm\frac{K_B}{N_AN_Bd}\,O\big(\tau_t^2d^2\rho_{N/S}\delta_{\xi,\perp}\big)\sigma_\xi^2 \pm \frac{K_{B,\xi}}{N_AN_Bd}\,O\big(\tau_t^2d^2\rho_{N/S}\delta_{\xi,\perp}\big)\sigma_\xi^2 + \frac{K_{A,\xi}}{N_A^2d}\left(-\frac{2K}{1+K}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\pm O(d\tau_t^2)\right)\sigma_\xi^2 \\
&= -\frac{2K}{1+K}\frac{K_{A,\xi}}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\sigma_\xi^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_A^2d}\,d\tau_t^2\right)\|K_{A,\xi}\|_F \pm O\!\left(\frac{\sigma_\xi^2}{N_AN_Bd}\,\tau_t^2d^2\delta_{\xi,\perp}\sqrt{\frac{d}{r}}\,\kappa_0\right)\|K_{A,\xi}\|_F \\
&= -\frac{2K}{1+K}\frac{K_{A,\xi}}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\sigma_\xi^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_A^2d}\,d\tau_t^2\right)\|K_{A,\xi}\|_F.
\end{aligned}
$$
Therefore,
$$
\frac{d}{dt}\|K_{A,\xi}\|_F^2 = -\frac{4K}{1+K}\frac{\sigma_\xi^2}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\|K_{A,\xi}\|_F^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_A^2d}\,d\tau_t^2\right)\|K_{A,\xi}\|_F^2.
$$
Then, writing $\mathcal{K} = K_A+K_B$ as above, we compute
$$
\begin{aligned}
\frac{d}{dt}\tilde\rho_{N/S}^2 &= \frac{\frac{d}{dt}\|K_{A,\xi}\|_F^2}{\|\mathcal{K}\|_F^2} - \tilde\rho_{N/S}^2\,\frac{\frac{d}{dt}\|\mathcal{K}\|_F^2}{\|\mathcal{K}\|_F^2} \\
&\le -\frac{4K}{1+K}\frac{\sigma_\xi^2}{N_A^2d}\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\tilde\rho_{N/S}^2 \pm O\!\left(\frac{\sigma_\xi^2}{N_A^2d}\,d\tau_t^2\right)\tilde\rho_{N/S}^2 - \tilde\rho_{N/S}^2\left(\frac{4K}{1+K}\frac{\sigma_{\min}^2}{N_AN_Bd}\left(1-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right) - O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\right) \\
&\le -\frac{4K}{1+K}\frac{\sigma_{\min}^2}{N_AN_Bd}\tilde\rho_{N/S}^2 + O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(d\tau_t^2+\delta_{A/B}\big)\right)\tilde\rho_{N/S}^2.
\end{aligned}
$$

Proof of Lemma B.12. Recall that $\rho_{p/q} = \|[K_A]_p\|^2/\|[K_A]_q\|^2$. By Corollary B.7, we have
$$
\begin{aligned}
\dot\rho_{p/q} &= \frac{\frac{d}{dt}\|[K_A]_p\|^2}{\|[K_A]_q\|^2} - \rho_{p/q}\,\frac{\frac{d}{dt}\|[K_A]_q\|^2}{\|[K_A]_q\|^2} \\
&= \frac{4K}{1+K}\frac{\sigma_p^2}{N_AN_Bd}\left(\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right)\rho_{p/q}
- \frac{4K}{1+K}\frac{\sigma_q^2}{N_AN_Bd}\left(\big\langle\overline{[K_A]}_q,\overline{[K_B]}_q\big\rangle-\frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right)\rho_{p/q}
\pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(\delta_{A/B}+\tau_t^2d\big)\right)\rho_{p/q}.
\end{aligned}
$$
Now we bound $\langle\overline{[K_A]}_p,\overline{[K_B]}_p\rangle - \langle K_A,K_B\rangle/(N_AN_Bd)$. Clearly, this term is bounded by $2$. Meanwhile, by definition, we have $\langle\overline{[K_A]}_p,\overline{[K_B]}_p\rangle = 1\pm\rho_-$. For the second term, we have
$$
\frac{\langle K_A,K_B\rangle}{N_AN_Bd} = \sum_{k=1}^r\big\langle\overline{[K_A]}_k,\overline{[K_B]}_k\big\rangle\,\frac{\|[K_A]_k\|\|[K_B]_k\|}{\|K_A\|_F\|K_B\|_F}\,\frac{\|K_A\|_F\|K_B\|_F}{N_AN_Bd}
= \sum_{k=1}^r(1\pm\rho_-)\Big(1\pm\min\{\tilde\rho_{N/S},1\}\Big)\frac{\kappa_k^2}{\|\kappa\|^2}\big(1\pm\delta_{A/B}\big)
= \Big(1\pm\rho_-\pm\min\{\tilde\rho_{N/S},1\}\Big)\big(1\pm\delta_{A/B}\big).
$$
Combining these, we obtain
$$
\left|\big\langle\overline{[K_A]}_p,\overline{[K_B]}_p\big\rangle - \frac{\langle K_A,K_B\rangle}{N_AN_Bd}\right| \le 2\rho_- + 2\min\{\tilde\rho_{N/S},1\} \pm \delta_{A/B}.
$$
The same is also true for $q$. Thus,
$$
\dot\rho_{p/q} \le \frac{16K}{1+K}\frac{\sigma_{\max}^2}{N_AN_Bd}\Big(\rho_-+\min\{\tilde\rho_{N/S},1\}\Big)\rho_{p/q} \pm O\!\left(\frac{\sigma_{\max}^2}{N_AN_Bd}\big(\delta_{A/B}+\tau_t^2d\big)\right)\rho_{p/q}.
$$

B.3 CONTROLLING THE DISCRETIZATION ERROR

In this subsection, we estimate the growth rate of the errors described in (10). As in the previous subsections, the proofs are deferred to the end of this subsection. First, we consider the relative difference between $\|[K_A]_p\|_2$ and $\|[K_B]_p\|_2$. Instead of directly controlling the difference, we define
$$\rho_{A/B,p} := \frac{\|[K_A]_p\|_2}{\|[K_B]_p\|_2} \quad\text{and}\quad \rho_{B/A,p} := \frac{\|[K_B]_p\|_2}{\|[K_A]_p\|_2}.$$
Note that $\rho_{A/B,p} + \rho_{B/A,p} \ge 2$, with equality attained iff $\|[K_A]_p\|_2 = \|[K_B]_p\|_2$. Meanwhile, at initialization, this quantity can be made arbitrarily close to $2$. Hence, it suffices to control the growth of this quantity. The reason we consider the sum $\rho_{A/B,p} + \rho_{B/A,p}$ is to leverage the symmetry between $A$ and $B$. Similarly, we define
$$\rho_{A/B,\xi,q} := \frac{\|[K_{A,\xi}]_q\|_2}{\|[K_{B,\xi}]_q\|_2} \quad\text{and}\quad \rho_{B/A,\xi,q} := \frac{\|[K_{B,\xi}]_q\|_2}{\|[K_{A,\xi}]_q\|_2},$$
and analyze $\rho_{A/B,\xi,q} + \rho_{B/A,\xi,q}$.

Lemma B.13 (Difference of diagonal terms). In Stage 1, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\left(\rho_{A/B,p} + \rho_{B/A,p}\right) \le O\!\left(\frac{\sigma_p^2}{N_A N_B d}\right)\left(\delta_{A/B}^2 + \tau_t^2 d\right), \qquad \frac{\mathrm{d}}{\mathrm{d}t}\left(\rho_{A/B,\xi,q} + \rho_{B/A,\xi,q}\right) \le O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\delta_{A/B}^2 + \tau_t^2 d\right).$$
Note that the RHS consists of higher-order terms, which implies that the relative difference of the norms barely grows. Now we consider the condition number of $K_{A,\xi}$. Unlike for $K_A$, the $\sigma$'s of the noise coordinates are all equal. Hence, we have the following simple bound on the growth rate of $\delta_{\xi,\kappa_0}$.

Lemma B.14 (Condition number of $K_{A,\xi}$). Define $\rho_{\xi,p/q} := \|[K_{A,\xi}]_p\|_2 / \|[K_{A,\xi}]_q\|_2$. In Stage 1, we have
$$\dot{\rho}_{\xi,p/q} \le O\!\left(\frac{\sigma_\xi^2}{N_A N_B d}\right)\left(d\tau_t^2 + \delta_{A/B}\right).$$
Again, the RHS is a higher-order term that can be made arbitrarily small by choosing a small $\tau_t^2$ and maintaining a small $\delta_{A/B}$. Then, we consider the orthogonality conditions.
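The inequality $\rho_{A/B,p} + \rho_{B/A,p} \ge 2$ is just AM-GM applied to $r + 1/r$, where $r$ is the ratio of the two norms, with equality iff the norms coincide. A minimal numeric sanity check (the norm values below are arbitrary):

```python
import math

def ratio_sum(a, b):
    # rho_{A/B} + rho_{B/A} for two positive norms a, b
    return a / b + b / a

# equality holds exactly when the norms agree
assert ratio_sum(3.0, 3.0) == 2.0
# strictly above 2 otherwise, and close to 2 when the norms nearly match
assert ratio_sum(1.0, 1.01) > 2.0
assert ratio_sum(1.0, 1.01) - 2.0 < 1e-3
# AM-GM: r + 1/r - 2 = (sqrt(r) - 1/sqrt(r))^2 >= 0 for any ratio r > 0
for r in [0.1, 0.5, 2.0, 10.0]:
    assert abs((r + 1 / r - 2) - (math.sqrt(r) - 1 / math.sqrt(r)) ** 2) < 1e-12
```

Since $r + 1/r - 2 = (\sqrt r - 1/\sqrt r)^2$, the excess of the sum over $2$ measures exactly the squared mismatch of the norms, which is why controlling its growth controls $\delta_{A/B}$.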

Lemma B.15 (Orthogonality between signals). For any $p \ne q$, define
$$\delta_{\perp,p,q} := \langle [K_A]_p, [K_B]_q\rangle^2 + \langle [K_B]_p, [K_A]_q\rangle^2 + \langle [K_A]_p, [K_A]_q\rangle^2 + \langle [K_B]_p, [K_B]_q\rangle^2.$$
In Stage 1, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\delta_{\perp,p,q} \le O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\rho_-\,\delta_{\perp,p,q} + O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tau_t^2 d^2 \kappa_0 + d\,\delta_{A/B}\right).$$
Recall that $\rho_-$ converges to $0$ at a sufficiently fast rate. As a result, by Lemma B.2, $\delta_{\perp,p,q}$ will not blow up. Meanwhile, since $\delta_{\perp,p,q}$ can be made arbitrarily small at initialization, it remains small at the end of Stage 1. Finally, we consider the orthogonality conditions between the signals and the noises, and between the noises themselves. The proof follows the same spirit.

Lemma B.16 (Orthogonality between signals and noises). For any $p \in [r]$ and $q \in [d-r]$, define
$$\delta_{\perp,\xi_A,p,q} := \langle [K_A]_p, [K_{A,\xi}]_q\rangle^2 + \langle [K_B]_p, [K_{A,\xi}]_q\rangle^2,$$
and define $\delta_{\perp,\xi_B,p,q}$ similarly. In Stage 1, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\delta_{\perp,\xi_A,p,q} \le O\!\left(\frac{\sigma_p^2}{N_A N_B d}\right)\rho_-\,\delta_{\perp,\xi_A,p,q} \pm O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tau_t^2 d^2 \kappa_0 + d\,\delta_{A/B}\right).$$
For any $q, s \in [d-r]$, define
$$\delta_{\perp,\xi,q,s} := \langle [K_{A,\xi}]_q, [K_{B,\xi}]_s\rangle^2 + \langle [K_{B,\xi}]_q, [K_{A,\xi}]_s\rangle^2.$$
In Stage 1, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\delta_{\perp,\xi,q,s} \le O\!\left(\frac{\sigma_\xi^2}{N_A N_B d}\,\tau_t^2 d^3 \kappa_0\,\delta_{\xi,\perp}\right)\delta_{\perp,\xi,q,s}.$$

OMITTED PROOFS OF THIS SUBSECTION

Proof of Lemma B.13 . Note that we cannot directly use Corollary B.7 as the error term contains δ A/B , the quantity we wish to control. However, by the proof of it, we still have d dt ∥[K A ] p ∥ 2 = 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ - 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d ∥[K A ] p ∥ 2 N 2 A d ± O σ 2 p N A N B d τ 2 t dκ 2 p . Interchange the roles of A and B and we obtain d dt ∥[K B ] p ∥ 2 = 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ - 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d ∥[K B ] p ∥ 2 N 2 B d ± O σ 2 p N A N B d τ 2 t dκ 2 p . For notational simplicity, define ρ A/B,p = ∥[K A ] p ∥ 2 / ∥[K B ] p ∥ 2 . Then, we compute ρA/B,p = d dt ∥[K A ] p ∥ 2 ∥[K B ] p ∥ 2 -ρ A/B,p d dt ∥[K B ] p ∥ 2 ∥[K B ] p ∥ 2 = 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ ∥[K A ] p ∥ 2 ρ A/B,p - 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d ρ A/B,p N 2 A d - 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ ∥[K B ] p ∥ 2 ρ A/B,p + 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d ρ A/B,p N 2 B d ± O σ 2 p N A N B d τ 2 t d = 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ 1 ∥[K A ] p ∥ 2 - 1 ∥[K B ] p ∥ 2 ρ A/B,p - 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d 1 N 2 A d - 1 N 2 B d ρ A/B,p ± O σ 2 p N A N B d τ 2 t d . Then, by symmetry, we have d dt ρ A/B,p + ρ B/A,p = 4K 1 + K σ 2 p N A N B d ⟨[K A ] p , [K B ] p ⟩ 1 ∥[K A ] p ∥ 2 - 1 ∥[K B ] p ∥ 2 ρ A/B,p -ρ B/A,p - 4K 1 + K ⟨K A , K B ⟩ σ 2 p N A N B d 1 N 2 A d - 1 N 2 B d ρ A/B,p -ρ B/A,p ± O σ 2 p N A N B d τ 2 t d ≤ O σ 2 p N A N B d δ 2 A/B + τ 2 t d . The above proof, mutatis mutandis, yields the result for ρ A/B,ξ,q + ρ B/A,ξ,q . Proof of Lemma B.14. By Corollary B.7, we have ρξ,p/q = d dt ∥[K A,ξ ] p ∥ 2 ∥[K A,ξ ] q ∥ 2 -ρ ξ,p/q d dt ∥[K A,ξ ] q ∥ 2 ∥[K A,ξ ] q ∥ 2 = - 4K 1 + K σ 2 ξ N A N B d ⟨K A , K B ⟩ N A N B d ρ ξ,p/q ± O σ 2 ξ N A N B d dτ 2 t + δ A/B -ρ ξ,p/q - 4K 1 + K σ 2 ξ N A N B d ⟨K A , K B ⟩ N A N B d ± O σ 2 ξ N A N B d dτ 2 t + δ A/B = ±O σ 2 ξ N A N B d dτ 2 t + δ A/B . Proof of Lemma B.15. 
By Corollary B.8, we have d dt [K A ] p , [K B ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K B ] q -[K A ] p , [K B ] q [K A ] p , [K B ] p ± O σ 2 p N A N B d τ 2 t d 2 κ 0 + d δ A/B = 2K 1 + K σ 2 p N A N B d [K B ] p , [K B ] q -[K A ] p , [K B ] q + 2K 1 + K σ 2 p N A N B d [K A ] p , [K B ] q 1 -[K A ] p , [K B ] p ± O σ 2 p N A N B d τ 2 t d 2 κ 0 + d δ A/B . Interchange the roles of p, q and A, B and we obtain [K A ] p , d dt [K B ] q = 2K 1 + K σ 2 q N A N B d [K A ] p , [K A ] q -[K A ] p , [K B ] q + 2K 1 + K σ 2 q N A N B d [K A ] p , [K B ] q 1 -[K A ] q , [K B ] q ± O σ 2 q N A N B d τ 2 t d 2 κ 0 + d δ A/B . Therefore, d dt [K A ] p , [K B ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K B ] q -[K A ] p , [K B ] q + 2K 1 + K σ 2 q N A N B d [K A ] p , [K A ] q -[K A ] p , [K B ] q ± O σ 2 max N A N B d ρ -[K A ] p , [K B ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Interchange the roles of p and q and we obtain d dt [K B ] p , [K A ] q = 2K 1 + K σ 2 q N A N B d [K B ] p , [K B ] q -[K B ] p , [K A ] q + 2K 1 + K σ 2 p N A N B d [K A ] p , [K A ] q -[K B ] p , [K A ] q ± O σ 2 max N A N B d ρ -[K B ] p , [K A ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Similarly, we compute d dt [K A ] p , [K A ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A ] q -[K A ] p , [K A ] q [K A ] p , [K B ] p ± O σ 2 p N A N B d τ 2 t d 2 κ 0 + d δ A/B = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A ] q -[K A ] p , [K A ] q + 2K 1 + K σ 2 p N A N B d [K A ] p , [K A ] q 1 -[K A ] p , [K B ] p ± O σ 2 p N A N B d τ 2 t d 2 κ 0 + d δ A/B = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A ] q -[K A ] p , [K A ] q ± O σ 2 max N A N B d ρ -[K A ] p , [K A ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Then, by interchanging the roles of p and q, we obtain [K A ] p , d dt [K A ] q = 2K 1 + K σ 2 q N A N B d [K A ] p , [K B ] q -[K A ] p , [K A ] q ± O σ 2 max N A N B d ρ -[K A ] p , [K A ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . 
Add them together and we get d dt [K A ] p , [K A ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A ] q -[K A ] p , [K A ] q + 2K 1 + K σ 2 q N A N B d [K A ] p , [K B ] q -[K A ] p , [K A ] q ± O σ 2 max N A N B d ρ -[K A ] p , [K A ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Interchange the roles of p and q and we obtain d dt [K B ] p , [K B ] q = 2K 1 + K σ 2 p N A N B d [K A ] p , [K B ] q -[K B ] p , [K B ] q + 2K 1 + K σ 2 q N A N B d [K B ] p , [K A ] q -[K B ] p , [K B ] q ± O σ 2 max N A N B d ρ -[K B ] p , [K B ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . For notational simplicity, define Z 1 = [K A ] p , [K B ] q , Z 2 = [K B ] p , [K A ] q , Z 3 = [K A ] p , [K A ] q , and Z 4 = [K B ] p , [K B ] q . Also define G p = 2K 1+K σ 2 p N A N B d . Then, we can summarize the above equations as d dt    Z 1 Z 2 Z 3 Z 4    =    -G p -G q 0 G q G p 0 -G p -G q G p G q G q G p -G p -G q 0 G p G q 0 -G p -G q       Z 1 Z 2 Z 3 Z 4    ± O σ 2 max N A N B d ρ -    Z 1 Z 2 Z 3 Z 4    ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Note that the eigenvalues of the first matrix is -2G p -2G q , -2G p , -2G q and 0. Namely, it is negative semi-definite. Thus, d dt ∥Z∥ 2 ≤ O σ 2 max N A N B d ρ -∥Z∥ 2 + O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Proof of Lemma B.16. By Corollary B.8, d dt [K A ] p , [K A,ξ ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A,ξ ] q -[K A ] p , [K A,ξ ] q [K A ] p , [K B ] p ± O σ 2 p N A N B d τ 2 t d 2 κ 0 + d δ A/B , and [K A ] p , d dt [K A,ξ ] q = ±O σ 2 ξ N A N B d τ 2 t d 3 κ 0 δ ξ,⊥ . Therefore, d dt [K A ] p , [K A,ξ ] q = 2K 1 + K σ 2 p N A N B d [K B ] p , [K A,ξ ] q -[K A ] p , [K A,ξ ] q ± O σ 2 p N A N B d ρ -[K A ] p , [K A,ξ ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . Similarly, we also have d dt [K B ] p , [K A,ξ ] q = 2K 1 + K σ 2 p N A N B d [K A ] p , [K A,ξ ] q -[K B ] p , [K A,ξ ] q ± O σ 2 p N A N B d ρ -[K B ] p , [K A,ξ ] q ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . 
Thus, d dt [K A ] p , [K A,ξ ] q 2 + [K B ] p , [K A,ξ ] q 2 = - 4K 1 + K σ 2 p N A N B d [K A ] p , [K A,ξ ] q 2 -[K B ] p , [K A,ξ ] q 2 2 ± O σ 2 p N A N B d ρ - [K A ] p , [K A,ξ ] q 2 + [K B ] p , [K A,ξ ] q 2 ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B ≤ O σ 2 p N A N B d ρ - [K A ] p , [K A,ξ ] q 2 + [K B ] p , [K A,ξ ] q 2 ± O σ 2 max N A N B d τ 2 t d 2 κ 0 + d δ A/B . For the orthogonality between noises, by Corollary B.8, we have [K A,ξ ] q , d dt [K A,ξ ] s = ±O σ 2 ξ N A N B d τ 2 t d 3 κ 0 δ ξ,⊥ Clear that this bound holds for all other combinations. Thus, d dt δ⊥,ξ,p,q ≤ O σ 2 ξ N A N B d τ 2 t d 3 κ 0 δ ξ,⊥ δ⊥,ξ,p,q .
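The key step in the proof of Lemma B.15 is that the $4\times 4$ system governing $(Z_1, Z_2, Z_3, Z_4)$ has eigenvalues $-2G_p - 2G_q$, $-2G_p$, $-2G_q$, and $0$, hence is negative semi-definite, so the error vector $Z$ cannot grow through the linear part. The eigenvectors are the four sign patterns below; a pure-Python check with arbitrary positive rates $G_p, G_q$:

```python
def matvec(M, v):
    # multiply a 4x4 matrix (list of rows) by a vector
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

Gp, Gq = 2.0, 3.0  # arbitrary positive rates; any Gp, Gq > 0 works
M = [
    [-Gp - Gq, 0.0,      Gq,       Gp],
    [0.0,      -Gp - Gq, Gp,       Gq],
    [Gq,       Gp,       -Gp - Gq, 0.0],
    [Gp,       Gq,       0.0,      -Gp - Gq],
]

# (eigenvector, claimed eigenvalue) pairs
pairs = [
    ([1, 1, 1, 1],   0.0),               # each row sums to zero
    ([1, -1, 1, -1], -2 * Gp),
    ([1, -1, -1, 1], -2 * Gq),
    ([1, 1, -1, -1], -2 * Gp - 2 * Gq),
]
for v, lam in pairs:
    Mv = matvec(M, v)
    assert all(abs(Mv[i] - lam * v[i]) < 1e-12 for i in range(4))
```

All four eigenvalues are nonpositive for any positive $G_p, G_q$, which is exactly the negative semi-definiteness used to bound $\frac{\mathrm{d}}{\mathrm{d}t}\|Z\|^2$.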

B.4 PROOF OF THE MAIN LEMMA OF STAGE 1

Proof of Lemma B.1. First, we recap the estimates derived in the previous subsections and introduce some notation. By Lemma B.10 and Lemma B.11, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\rho_- \le -\Omega(1)\,\frac{\sigma_{\min}^2}{N_A N_B d}\,\rho_- + O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tau_t^2 d^2 \kappa_0 + \delta_{A/B}\right),$$
$$\frac{\mathrm{d}}{\mathrm{d}t}\rho_{N/S} \le -\Omega(1)\,\frac{\sigma_{\min}^2}{N_A N_B d}\,\rho_{N/S} + O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(d\tau_t^2 + d\,\delta_{A/B}\right).$$
Define $\rho := \max\{\rho_-, \rho_{N/S}\}$ to be the indicator of the progress we have made. We have
$$\frac{\mathrm{d}}{\mathrm{d}t}\rho \le -\Omega(1)\,\frac{\sigma_{\min}^2}{N_A N_B d}\,\rho + O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tau_t^2 d^2 \kappa_0 + d\,\delta_{A/B}\right). \tag{11}$$
Let $\tilde\kappa_0 := \max_{p,q} \|[K_A]_p\|_2 / \|[K_A]_q\|_2$ be the condition number at time $t$. By Lemma B.12, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\tilde\kappa_0 \le O(1)\,\frac{\sigma_{\max}^2}{N_A N_B d}\,\rho\,\tilde\kappa_0.$$
Now we consider the discretization errors. Define
$$\tilde\delta_{A/B} := \max_{p\in[r],\,q\in[d-r]} \left\{\rho_{A/B,p} + \rho_{B/A,p} - 2,\ \rho_{A/B,\xi,q} + \rho_{B/A,\xi,q} - 2\right\}.$$
Note that at time $t$, the first condition of (10) holds with $\delta_{A/B}$ replaced by $O(\tilde\delta_{A/B})$. Meanwhile, by Lemma B.13, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\tilde\delta_{A/B} \le O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tilde\delta_{A/B}^2 + \tau_t^2 d\right). \tag{13}$$
Let $\tilde\delta_{\xi,\kappa_0}(t)$ be the smallest number such that the second condition of (10) holds at time $t$. By Lemma B.14, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\tilde\delta_{\xi,\kappa_0} \le O\!\left(\frac{\sigma_\xi^2}{N_A N_B d}\right)\left(d\tau_t^2 + \delta_{A/B}\right). \tag{14}$$
Then, define
$$\tilde\delta_\perp := \max\left\{\delta_{\perp,p,q},\ \delta_{\perp,\xi_A,p,k},\ \delta_{\perp,\xi_B,p,k},\ \delta_{\perp,\xi,k,l} : p \ne q \in [d],\ k, l \in [d-r]\right\}.$$
Clearly, the last two conditions of (10) hold at time $t$ when $\delta_{AB,\perp}$ and $\delta_{\xi,\perp}$ are replaced by $\tilde\delta_\perp(t)$. By Lemma B.15 and Lemma B.16, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\tilde\delta_\perp \le O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\rho\,\tilde\delta_\perp + O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\left(\tau_t^2 d^2 \kappa_0 + d\,\delta_{A/B}\right). \tag{15}$$
Now, we are ready to show that the errors do not blow up in Stage 1. Note that all these $\delta$'s can be made arbitrarily (inverse-polynomially) small by choosing a sufficiently large $m$. First, we consider $\tilde\delta_{A/B}$. The dependence of the RHS of (13) on $\tilde\delta_{A/B}$ is quadratic, so by making the initial value of $\tilde\delta_{A/B}$ small enough, the RHS can be made dominated by the $\tau_t^2$-related term. Hence,
$$\tilde\delta_{A/B} \le O\!\left(\frac{\sigma_{\max}^2}{N_A N_B d}\right)\tau_t^2 d\, T_1.$$
As we will see later, $T_1 = \mathrm{poly}(d)$.
Therefore, by choosing a sufficiently small $\tau_t^2$, we can make $\tilde\delta_{A/B}$ remain small throughout Stage 1. Then, we consider $\tilde\delta_{\xi,\kappa_0}$. As we have argued earlier, the RHS of (14) can be made arbitrarily small, so that $\tilde\delta_{\xi,\kappa_0}$ remains small in Stage 1. Now, we consider the condition number $\tilde\kappa_0$ and $\tilde\delta_\perp$. For $\tilde\delta_\perp$, by our previous argument, the second term of the RHS of (15) can be merged into the first term by choosing a sufficiently large $m$ and a sufficiently small $\tau_t^2$. The same is also true for (11). Hence, for these quantities, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\rho \le -\Omega(1)\,\frac{\sigma_{\min}^2}{N_A N_B d}\,\rho, \qquad \frac{\mathrm{d}}{\mathrm{d}t}\tilde\kappa_0 \le O(1)\,\frac{\sigma_{\max}^2}{N_A N_B d}\,\rho\,\tilde\kappa_0, \qquad \frac{\mathrm{d}}{\mathrm{d}t}\tilde\delta_\perp \le O(1)\,\frac{\sigma_{\max}^2}{N_A N_B d}\,\rho\,\tilde\delta_\perp.$$
Hence, by Lemma B.2, we have
$$\tilde\kappa_0 \le \tilde\kappa_0(0)\,\exp\!\left(O\!\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\,\rho(0)\right)\right), \qquad \tilde\delta_\perp \le \tilde\delta_\perp(0)\,\exp\!\left(O\!\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\,\rho(0)\right)\right).$$
Note that $\rho_-(0) = O(1)$ and $\rho_{N/S}(0) \le \frac{(d-r)\sigma_\xi^2}{r\sigma_{\min}^2}$. Therefore,
$$\exp\!\left(O\!\left(\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\,\rho(0)\right)\right) \le \exp\!\left(O(1)\,\frac{\sigma_{\max}^2}{\sigma_{\min}^2}\max\left\{1, \frac{(d-r)\sigma_\xi^2}{r\sigma_{\min}^2}\right\}\right) \le \exp\!\left(\frac{1}{2}\log d\right) = \sqrt{d}.$$
In other words, both $\tilde\kappa_0$ and $\tilde\delta_\perp$ can grow by a factor of at most $\sqrt{d}$. Finally, we derive an upper bound on $T_1$ to complete the proof. Similar to the argument for the condition number, one can show that $\|[K_A]_p\|_2$ can grow by a factor of at most $\sqrt{d}$ in Stage 1. As a result, $1/(N_A N_B d)$ is lower bounded by some $1/\mathrm{poly}(d)$. Thus, by (11), $T_1 \le \mathrm{poly}(d)$.
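The growth bounds on the condition number and the orthogonality error follow the pattern of Lemma B.2: if $\dot\rho \le -\lambda\rho$ while $\dot\kappa \le c\rho\kappa$, then $\int_0^\infty \rho \le \rho(0)/\lambda$, so $\kappa$ can grow in total by a factor of at most $\exp(c\rho(0)/\lambda)$. A forward-Euler sketch of the extremal case (the rates and initial values are arbitrary):

```python
import math

lam, c = 1.0, 0.7        # decay rate of rho and coupling constant (arbitrary)
rho, kappa = 0.5, 1.0    # initial values (arbitrary)
cap = kappa * math.exp(c * rho / lam)   # claimed cap: kappa(0) * exp(c * rho(0) / lam)

dt = 1e-4
for _ in range(200_000):  # integrate to t = 20, well past the decay of rho
    kappa += dt * c * rho * kappa       # d/dt kappa = c * rho * kappa (extremal case)
    rho += dt * (-lam * rho)            # d/dt rho = -lam * rho

assert kappa <= cap + 1e-6
# the cap is essentially attained when both differential inequalities are tight
assert kappa >= cap * 0.99
```

The same integral-of-$\rho$ argument is what turns the $\rho(0) \le O(\log d)\cdot\sigma_{\min}^2/\sigma_{\max}^2$ bound into the $\sqrt d$ growth cap above.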

B.5 NEGATIVE RESULTS

Lemma B.17. There exists a $\sigma \in \mathbb{R}^r$ satisfying the assumptions of Theorem 4.3 such that, at the end of Stage 1, the condition number of $K_A$ is $d^{\Omega(1)}$.

Proof. We choose $d = r$ and $\sigma_1^2 = c\log d$, $\sigma_2^2 = \cdots = \sigma_d^2 = 1$. Clearly this satisfies the conditions of Theorem 4.3. Note that it suffices to consider the infinite-width case since, as we have proved earlier, the finite-width trajectory tracks the infinite-width one. By Lemma B.10, we have
$$\frac{\mathrm{d}}{\mathrm{d}t}\langle [K_A]_p, [K_B]_p\rangle \approx \frac{4K}{1+K}\,\frac{\sigma_p^2}{N_A N_B d}\left(1 + \langle [K_A]_p, [K_B]_p\rangle\right)\left(1 - \langle [K_A]_p, [K_B]_p\rangle\right).$$
By the proof of Lemma B.12, we have
$$\dot\rho_{p/q} \approx \frac{4K}{1+K}\,\frac{\sigma_p^2}{N_A N_B d}\left(\langle [K_A]_p, [K_B]_p\rangle - \frac{\langle K_A, K_B\rangle}{N_A N_B d}\right)\rho_{p/q} - \frac{4K}{1+K}\,\frac{\sigma_q^2}{N_A N_B d}\left(\langle [K_A]_q, [K_B]_q\rangle - \frac{\langle K_A, K_B\rangle}{N_A N_B d}\right)\rho_{p/q}.$$
Note that, in the infinite-width case, we have $\langle [K_A]_1, [K_B]_1\rangle \ge \langle [K_A]_2, [K_B]_2\rangle = \cdots = \langle [K_A]_d, [K_B]_d\rangle$. Therefore, $\langle [K_A]_1, [K_B]_1\rangle - \frac{\langle K_A, K_B\rangle}{N_A N_B d} \ge 0$ and $\langle [K_A]_q, [K_B]_q\rangle - \frac{\langle K_A, K_B\rangle}{N_A N_B d} \le 0$ for any $q \ge 2$. Hence,
$$\dot\rho_{1/2} \ge \frac{4K}{1+K}\,\frac{\sigma_1^2}{N_A N_B d}\left(\langle [K_A]_1, [K_B]_1\rangle - \frac{\langle K_A, K_B\rangle}{N_A N_B d}\right)\rho_{1/2} \ge \frac{4K}{1+K}\,\frac{\sigma_1^2}{N_A N_B d}\left(\left(1 - \frac{\kappa_1^2}{\|\kappa\|^2}\right)\langle [K_A]_1, [K_B]_1\rangle - \frac{(d-1)\kappa_1^2}{\|\kappa\|^2}\langle [K_A]_2, [K_B]_2\rangle\right)\rho_{1/2}.$$
For notational simplicity, define $X_1 = 1 - \langle [K_A]_1, [K_B]_1\rangle$, $X_2 = 1 - \langle [K_A]_2, [K_B]_2\rangle$, and $A = \frac{4K}{1+K}\,\frac{1}{N_A N_B d}$. Then we have $\dot X_1 \le -\sigma_1^2 A X_1$ and $\dot X_2 \ge -2A X_2$. First, by Gronwall's lemma, we have
$$X_1(T) \le \exp\!\left(-\sigma_1^2\int_0^T A\right) \quad\text{and}\quad X_2(T) \ge \exp\!\left(-2\int_0^T A\right) \ge X_1^{2/\sigma_1^2}(T).$$
As a result, when $X_1$ reaches $1/d$, we still have $X_2 = \Omega(1)$. Let $T_1$ be the time at which $X_1$ reaches $1/d$ and $T_2$ the time at which $X_2(T_2) = X_2(T_1)/2$. On $[T_1, T_2]$, we have $\dot\rho_{1/2} \ge \Omega(1)\,\sigma_1^2 A\,\rho_{1/2}$. By Gronwall's lemma, for $X_2$ to halve, we need $\exp\!\left(-2\int_{T_1}^{T_2} A\right) = 1/2$. Hence,
$$\rho_{1/2}(T_2) \ge \rho_{1/2}(T_1)\exp\!\left(\Omega(1)\,\sigma_1^2\int_{T_1}^{T_2} A\right) \ge 2^{\Omega(\sigma_1^2)} = d^{\Omega(1)}.$$
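The separation argument in Lemma B.17 uses only the closed forms $X_1(T) = \exp(-\sigma_1^2\int_0^T A)$ and $X_2(T) = \exp(-2\int_0^T A)$, which give $X_2 = X_1^{2/\sigma_1^2}$: with $\sigma_1^2 = c\log d$, by the time $X_1$ has shrunk to $1/d$, $X_2$ has only moved by a constant factor. A numeric sketch (the choices of $c$ and $d$ here are arbitrary):

```python
import math

d = 10_000
c = 4.0
sigma1_sq = c * math.log(d)   # sigma_1^2 = c log d, the spiked direction

# X1(T) = exp(-sigma1^2 * I), X2(T) = exp(-2 * I), with I = integral of A over [0, T].
# Choose I so that X1 just reaches 1/d:
I = math.log(d) / sigma1_sq   # = 1 / c
X1 = math.exp(-sigma1_sq * I)
X2 = math.exp(-2.0 * I)

assert abs(X1 - 1.0 / d) < 1e-12
# the coupling X2 = X1^(2 / sigma1^2) from Gronwall's lemma
assert abs(X2 - X1 ** (2.0 / sigma1_sq)) < 1e-12
# X2 = exp(-2 / c) stays Omega(1), independent of d
assert X2 > 0.5
```

This is exactly why the spiked direction can finish aligning while the flat directions have barely moved, forcing a $d^{\Omega(1)}$ condition number at the end of Stage 1.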

C STAGE 2

In this section, we show that, throughout Stage 2, the discretization error and the noise-signal ratio remain small and that, at the end of Stage 2, the condition number is close to 1. Formally, we prove the following.

Lemma C.1 (Stage 2). Suppose that at the beginning of Stage 2, we have $\kappa_0 \le \sqrt{d}$ and all errors mentioned in (16) are sufficiently small. Let $c_{\mathrm{Target}} > 1$ be a constant and let $T_1$ be the earliest time such that
$$\frac{\|[K_A]_p\|_2}{\|[K_A]_q\|_2} \le c_{\mathrm{Target}}, \quad \forall p, q \in [r].$$
We have $T_1 \le \mathrm{poly}(d)$. Moreover, throughout Stage 2, we have
$$\max\left\{1 - \langle [K_A]_p, [K_B]_p\rangle,\ \left|1 - \frac{\|[K_A]_p\|_2}{\|[K_B]_p\|_2}\right|\right\} \le \delta_- \quad \forall p \in [r],$$
$$\max\left\{\left|1 - \frac{\|[K_{A,\xi}]_p\|_2}{\|[K_{B,\xi}]_p\|_2}\right|,\ \left|1 - \frac{\|[K_{A,\xi}]_p\|_2}{\|[K_{A,\xi}]_q\|_2}\right|\right\} \le \delta_- \quad \forall p, q \in [d-r],$$
$$\max\left\{\frac{\|[K_{C,\xi}]_q\|}{\|[K_D]_p\|} : C, D \in \{A, B\}\right\} \le \delta_{N/S} \quad \forall p \in [r],\ q \in [d-r],$$
$$\max\left\{\left|\langle [K_C]_p, [K_D]_q\rangle\right| : C, D \in \{A, B\}\right\} \le \delta_{AB,\perp} \quad \forall p \ne q \in [r],$$
$$\max\left\{\left|\langle [K_C]_p, [K_{D,\xi}]_q\rangle\right|,\ \left|\langle [K_{A,\xi}]_s, [K_{B,\xi}]_q\rangle\right| : C, D \in \{A, B\}\right\} \le \delta_{\xi,\perp} \quad \forall p \in [r],\ q, s \in [d-r],$$
where the $\delta$'s are some small $1/\mathrm{poly}(d)$ values.

The rest of this section is organized as follows. We derive estimates for the $Q$-matrices in Section C.1. In Section C.2, we maintain the last two conditions of (16). In Section C.3, we handle the first two conditions of (16). In Section C.4, we deal with the noise-signal ratio. We estimate the convergence rate in Section C.5. Finally, we prove Lemma C.1 in Section C.6.

C.1 ESTIMATIONS FOR Q

As in Stage 1, we estimate the Q-matrices in this subsection. The analysis here will be more complicated than the one in Section B.1 since now τ 2 t is no longer close to 0, and we cannot simply approximate S A and S B with (1 + K) -1 . However, the idea is still fairly straightforward. We split all terms into the infinite-width part and the discretization error part. Then we Taylor expand the corresponding function around the infinite-width part to factor out the first-order error terms. Then, we evaluate and simplify these first-order terms with E z -and E z ± . First, we need the following lemma which gives closed-form formulas for some expectations we will encounter later. Lemma C.2. Define ⟨z + , z -⟩ κ2 := r k=1 κ2 k z + k z - k and T p := tanh κ2 p /d N A N B , p ∈ [r]. For any p ̸ = q ∈ [r], we have E z -exp ⟨z + , z -⟩ κ2 N A N B = r k=1 cosh κ2 k /d N A N B =: Z c , E z -exp ⟨z + , z -⟩ κ2 N A N B z - p = Z c T p z + p , E z - exp ⟨z + , z -⟩ κ2 N A N B z - p z - q = Z c T p T q z + p z + q . Then, we derive estimations for S A and S B . There are two types of errors we need to consider. The first one comes from the noises and the second one from the non-diagonalness of K ⊤ A K B . Similar to Lemma B.4, we deal with them separately. The next lemma handles the first type of error. Lemma C.3 (Estimations for S). Define E +,-:= exp ⟨K A z + , K B z -⟩ N A N B , Ẽ+,-:= exp ⟨z + , z -⟩ κ2 N A N B , δ +,ξ-:= K A z + , K B,ξ ξ - B N A N B , δ ξ+,T + := K A,ξ ξ + A , K B diag([T k ] k∈[r] )z + N A N B , Z A,c := E z - E +,-, Z B,c := E z - E -,+ , SA := E +,+ E +,+ + KZ A,c , SB := E +,+ E +,+ + KZ B,c , S := Ẽ+,+ Ẽ+,+ + KZ c . 
In Stage 2, we have S A = SA + S(1 -S) (δ +,ξ+ + δ ξ+,+ -δ ξ+,T + ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S , S B = SB + S(1 -S) (δ +,ξ+ + δ ξ+,+ -δ T +,ξ+ ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S , and S A (x + A , x + B ) exp(f + A • f - B ) exp(f + A • f + B ) = SA E +,- E +,+ + S Ẽ+,- Ẽ+,+ δ +,ξ-+ δ ξ+,--S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ ξ+,T + ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S , S B (x + A , x + B ) exp(f - A • f + B ) exp(f + A • f + B ) = SB E -,+ E +,+ + S Ẽ-,+ Ẽ+,+ 1 + δ ξ-,+ + δ -,ξ+ -S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ T +,ξ+ ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Then, we consider the error comes from the non-diagonalness of K ⊤ A K B . Lemma C.4 (Further estimations for S). Define δ+,-= i̸ =j ⟨[K A ] i , [K B ] j ⟩ N A N B z + i z - j , δ+,T + = i̸ =j ⟨[K A ] i , [K B ] j ⟩ N A N B z + i T j z + j , Ẽ0 := Ẽ+,+ = Ẽ-,-= exp ∥κ∥ 2 N A N B . In Stage 2, we have SA (z + ) = S 1 + (1 -S)( δ+,+ -δ+,T + ) ± O d 2 δ 2 AB,⊥ , SB (z + ) = S 1 + (1 -S)( δ+,+ -δT +,+ ) ± O d 2 δ 2 AB,⊥ , SA E +,- E +,+ = S Ẽ+,- Ẽ0 1 -S δ+,+ -(1 -S) δ+,T + + δ+,-± O d 2 δ 2 AB,⊥ , SB E -,+ E +,+ = S Ẽ-,+ Ẽ0 1 -S δ+,+ -(1 -S) δT +,+ + δ-,+ ± O d 2 δ 2 AB,⊥ . With the above two lemmas in hand, we can now derive estimations for the Q-matrices. Lemma C.5 (Estimations for Q 1 ). Define K AB = K ⊤ A K B and K BA = K ⊤ B K A . In Stage 2, for any p ̸ = q ∈ [r], we have [Q 1 ] p,p = 2(1 -S)(1 -T p ) ± O d 2 δ 2 AB,⊥ , [Q 1 ] p,q = -S(1 -S) [K AB ] p,q + [K BA ] q,p N A N B d (2 -T p -T q ) -(1 -S) [K AB ] p,q N A N B d S (2T p T q -T p -T q ) -(1 -S) [K AB ] q,p N A N B d 2 -S(T p + T q ) -(1 -S)(T 2 p + T 2 q ) ± O d 2 δ 2 AB,⊥ . In particular, we have |[Q 1 ] p,q | ≤ O κ 0 δ AB,⊥ d ≤ O δ AB,⊥ √ d . Note that the diagonal term has a zero-order term, i.e., it is not proportional to some error. That is the signal term. On the other hand, all off-diagonal terms depend on [K AB ] p,q (p ̸ = q). Recall that the dynamics of K A and K B can be described using these Q-matrices. 
Therefore, for any p ̸ = q, we can have equations of form d dt [K AB ] p,q ≈ G([K AB ] p,q , [K AB ] q,p ), where G is some complicated matrix. By carefully analyzing G, we can then derive bounds for the off-diagonal entries using Gronwall's lemma. Similar things also hold for Q 1,ξ and Q 2 . The difference here is that for these two matrices, we do not have signal terms. Lemma C.6 (Estimations for Q 1,ξ ). In Stage 2, for any p ∈ [d -r] and q ∈ [r], we have [Q 1,ξ A ] p,q = -(1 -S) 1 + S + (1 -S)T 2 q ⟨[K B ] q , [K A,ξ ] p ⟩ N A N B d ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S , [Q 1,ξ B ] p,q = -(1 -S) 1 + S + (1 -S)T 2 q ⟨[K A ] q , [K B,ξ ] p ⟩ N A N B d ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . In particular, we have max {[Q 1,ξ A ] p,q , [Q 1,ξ B ] p,q } ≤ O δ N/S δ ξ,⊥ . Lemma C.7 (Estimations for Q 2 ). In Stage 2, for any p, q ∈ [d -r], we have [Q 2 ] p,q = ±O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Lemma C.8 (Estimations for Q 0 ). In Stage 2, we have Q 0 = - r k=1 κ 2 k ∥κ∥ 2 [Q 1 ] k,k ± O d 2 δ 2 AB,⊥ + δ -+ κ 0 dδ N/S δ ξ,⊥ . Finally, we use these estimations to simplify the formulas for the norms. We do not consider the tangent movement here since the situation is trickier there, and we will handle them in later subsections. Corollary C.9 (Dynamics of the norms). In Stage 2, we have d dt ∥[K A ] p ∥ 2 = 2σ 2 p ∥[K A ] p ∥ ∥[K B ] p ∥ N A N B d [K A ] p , [K B ] p [Q 1 ] p,p + 2 ∥[K A ] p ∥ 2 N 2 A d Q 0 σ 2 p ± O σ 2 p κ 2 p N A N B d κ 0 dδ 2 AB,⊥ d dt ∥[K A,ξ ] q ∥ 2 = 2 ∥[K A,ξ ] q ∥ 2 N 2 A d Q 0 σ 2 ξ ± O σ 2 ξ ∥[K A,ξ ] q ∥ 2 N A N B d dδ 2 ξ,⊥ . The formulas for ∥[K B ] p ∥ and ∥[K B,ξ ] q ∥ can be obtained by interchanging the roles of A and B.
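The identities in Lemma C.2 factor into one-dimensional expectations over Rademacher-type coordinates $z_k^- \in \{\pm 1/\sqrt d\}$, each contributing a $\cosh$ factor, with an inserted $z_p^-$ tilting the $p$-th factor by $T_p z_p^+$. An exact-enumeration check on a small hypothetical instance (the values of $r$, $d$, $N_A N_B$, and $\tilde\kappa_k^2$ below are made up):

```python
import itertools
import math

# hypothetical small instance: r coordinates, z uniform on {+-1/sqrt(d)}^r
r, d, NANB = 3, 5, 2.0
kappa_sq = [1.3, 0.7, 2.1]  # \tilde{kappa}_k^2, arbitrary positive values
zp = [1 / math.sqrt(d), -1 / math.sqrt(d), 1 / math.sqrt(d)]  # a fixed z^+

def weight(zm):
    # exp(<z^+, z^->_{kappa^2} / (N_A N_B))
    return math.exp(sum(kappa_sq[k] * zp[k] * zm[k] for k in range(r)) / NANB)

grid = list(itertools.product([1 / math.sqrt(d), -1 / math.sqrt(d)], repeat=r))

Zc = math.prod(math.cosh(k2 / d / NANB) for k2 in kappa_sq)
T = [math.tanh(k2 / d / NANB) for k2 in kappa_sq]

# E_{z^-} exp(<z^+, z^->_{kappa^2} / (N_A N_B)) = Z_c
mean = sum(weight(zm) for zm in grid) / len(grid)
assert abs(mean - Zc) < 1e-12

# E_{z^-} [exp(...) z^-_p] = Z_c * T_p * z^+_p, for each p
for p in range(r):
    mean_p = sum(weight(zm) * zm[p] for zm in grid) / len(grid)
    assert abs(mean_p - Zc * T[p] * zp[p]) < 1e-12
```

The third identity of the lemma follows the same way, with two coordinates tilted instead of one.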

OMITTED PROOFS OF THIS SUBSECTION

Proof of Lemma C.2. First, we compute E z -exp ⟨z + , z -⟩ κ2 N A N B = r k=1 E z - k exp κ2 k z + k z - k N A N B = r k=1 1 2 exp κ2 k /d N A N B + exp - κ2 k /d N A N B = r k=1 cosh κ2 k /d N A N B . Similarly, we also have E z -exp ⟨z + , z -⟩ κ2 N A N B z - p = E z - p exp κ2 k z + p z - p N A N B z - p k̸ =p E z - k exp κ2 k z + k z - k N A N B . Each factor in k̸ =p is still cosh κ2 k /d N A N B . For the first term, we have E z - p exp κ2 k z + p z - p N A N B z - p = 1 2 √ d E z - p exp κ2 k z + p / √ d N A N B - 1 2 √ d E z - p exp -κ 2 k z + p / √ d N A N B z - p = 1 √ d sinh κ2 k z + p / √ d N A N B = sgn z + o √ d sinh κ2 k /d N A N B = sinh κ2 k /d N A N B z + p . Therefore, E z - exp ⟨z + , z -⟩ κ2 N A N B z - p = z + p sinh κ2 k /d N A N B k̸ =p cosh κ2 k /d N A N B = Z c T p z + p . The above calculation, mutatis mutandis, also yields the last identity. Proof of Lemma C.3. First, we write f + A , f - B = K A z + + K A,ξ ξ + A , K B z -+ K B,ξ ξ - B N A N B = ⟨K A z + , K B z -⟩ N A N B + K A z + , K B,ξ ξ - B N A N B + K A,ξ ξ + A , K B z - N A N B ± O d 2 δ ξ,⊥ δ 2 N/S . Also note that the middle terms are O(d 2 δ ξ,⊥ δ N/S ). Then, we compute exp f + A • f - B = exp 1 + K A z + , K B,ξ ξ - B N A N B + K A,ξ ξ + A , K B z - N A N B ± O d 2 δ ξ,⊥ δ 2 N/S . Similar results also hold for other combinations of ±. With the notations defined in this lemma, we can write these results as exp f + A • f - B = E +,-(1 + δ +,ξ-+ δ ξ+,-) ± O d 2 δ ξ,⊥ δ 2 N/S , exp f - A • f + B = E -,+ (1 + δ -,ξ+ + δ ξ-,+ ) ± O d 2 δ ξ,⊥ δ 2 N/S , exp f + A • f + B = E +,+ (1 + δ +,ξ+ + δ ξ+,+ ) ± O d 2 δ ξ,⊥ δ 2 N/S . To compute S A and S B , we then need to take expectations over the negative examples. Note that by taking expectation over ξ - B , the term E +,-δ +,ξ-becomes 0. Therefore, we have E x - B exp f + A • f - B = E x - B {E +,-(1 + δ ξ+,-)} ± O d 2 δ ξ,⊥ δ 2 N/S . Unfortunately, the same argument does not apply to δ ξ+,-since both E +,-and δ ξ+,-depend on z -. 
However, it is still possible to further simplify the expression. First, we write E z -{E +,-δ ξ+,-} = E z - Ẽ+,-(1 ± O(dδ AB,⊥ )) δ ξ+,-= E z - Ẽ+,-δ ξ+,-±O(d 3 δ AB,⊥ δ ξ,⊥ δ N/S ). Recall Lemma C.2. Then, we compute E z - Ẽ+,-δ ξ+,-= E z - Ẽ+,- K A,ξ ξ + A , K B z - N A N B = K A,ξ ξ + A , K B E z -Ẽ+,-z - N A N B = Z c K A,ξ ξ + A , K B diag([T k ] k∈[r] )z + N A N B = Z c δ ξ+,T + . Hence, E x - B exp f + A • f - B = E x - B {E +,-} + E x - B {E +,-δ ξ+,-} ± O d 2 δ ξ,⊥ δ 2 N/S = Z A,c + Z c δ ξ+,T + ± O(d 3 δ AB,⊥ δ ξ,⊥ δ N/S ) ± O d 2 δ ξ,⊥ δ 2 N/S = Z A,c + Z c δ ξ+,T + ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Similarly, we also have E x - A exp f + A • f - B = Z B,c + Z c δ T +,ξ+ ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Recall that exp f + A • f + B = E +,+ (1 + δ +,ξ+ + δ ξ+,+ ) ± O d 2 δ ξ,⊥ δ 2 N/S . Hence, we have S A (x + A , x + B ) = E +,+ (1 + δ +,ξ+ + δ ξ+,+ ) E +,+ (1 + δ +,ξ+ + δ ξ+,+ ) + KZ A,c + KZ c δ ξ+,T + ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S = SA 1 -SA (δ +,ξ+ + δ ξ+,+ ) -(1 -SA )δ ξ+,T + + SA (δ +,ξ+ + δ ξ+,+ ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S = SA + S(1 -S) (δ +,ξ+ + δ ξ+,+ -δ ξ+,T + ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Similarly, we also have S B (x + A , x + B ) = SB + S(1 -S) (δ +,ξ+ + δ ξ+,+ -δ T +,ξ+ ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Then, we compute S A (x + A , x + B ) exp(f + A • f - B ) exp(f + A • f + B ) = SA + S(1 -S) (δ +,ξ+ + δ ξ+,+ -δ ξ+,T + ) E +,-(1 + δ +,ξ-+ δ ξ+,-) E +,+ (1 + δ +,ξ+ + δ ξ+,+ ) ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S = SA E +,- E +,+ 1 + δ +,ξ-+ δ ξ+,--S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ ξ+,T + ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S = SA E +,- E +,+ + S Ẽ+,- Ẽ δ +,ξ-+ δ ξ+,--S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ ξ+,T + ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S , S B (x + A , x + B ) exp(f - A • f + B ) exp(f + A • f + B ) = SB E -,+ E +,+ + S Ẽ-,+ Ẽ 1 + δ ξ-,+ + δ -,ξ+ -S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ T +,ξ+ ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Proof of Lemma C.4. 
We write ⟨K A z + , K B z -⟩ N A N B = r k=1 ⟨[K A ] k , [K B ] k ⟩ N A N B z + k z - k + i̸ =j ⟨[K A ] i , [K B ] j ⟩ N A N B z + i z - j =: I +,-+ δ+,-. Note that, as a special case, we have I +,+ = I -,-= I 0 . In other words, I +,+ and I -,-do not depend on the actual value of z ± . Also note that I +,-is bounded by O(dδ AB,⊥ ). Then, we compute . Take expectation over z -and we obtain E z -E +,-= E z -exp(I +,-) + E z -exp(I +,-) δ+,-± O d 2 δ 2 AB,⊥ . By Lemma C.2, the first term is Z c . For the second term, we compute E z -exp(I +,-) δ+,-= i̸ =j [K AB ] i,j N A N B E z -exp(I +,-)z - j z + i = Z c i̸ =j [K AB ] i,j N A N B T j z + j z + i = Z c δ+,T + , where the second line again comes from Lemma C.2. Hence, we have E z -E +,-= Z c + Z c δ+,T + ± O d 2 δ 2 AB,⊥ . Then, for SA , we have SA (z + ) = E +,+ E +,+ + K E z -E +,- = exp(I 0 ) 1 + δ+,+ exp(I 0 ) 1 + δ+,+ + KZ c + KZ c δ+,T + ± O d 2 δ 2 AB,⊥ = S 1 -S δ+,+ -(1 -S) δ+,T + + S δ+,+ ± O d 2 δ 2 AB,⊥ = S 1 + (1 -S)( δ+,+ -δ+,T + ) ± O d 2 δ 2 AB,⊥ . Similarly, we also have SB (z + ) = S 1 + (1 -S)( δ+,+ -δT +,+ ) ± O d 2 δ 2 AB,⊥ . Then, we compute SA E +,- E +,+ = S 1 + (1 -S)( δ+,+ -δ+,T + ) exp(I +,-) 1 + δ+,- exp(I +,-) 1 + δ+,+ ± O d 2 δ 2 AB,⊥ = S exp(I +,-) exp(I 0 ) 1 -S δ+,+ -(1 -S) δ+,T + + δ+,-± O d 2 δ 2 AB,⊥ . Similarly, we also have SB E -,+ E +,+ = S exp(I -,+ ) exp(I 0 ) 1 -S δ+,+ -(1 -S) δT +,+ + δ-,+ ± O d 2 δ 2 AB,⊥ . Proof of Lemma C.5. Recall that Q 1 := E 2 -S A (x + A , x + B ) -S B (x + A , x + B ) z + (z + ) ⊤ d -K E S A (x + A , x + B ) exp(f + A • f - B ) exp(f + A • f + B ) z -(z + ) ⊤ d -K E S B (x + A , x + B ) exp(f - A • f + B ) exp(f + A • f + B ) z + (z -) ⊤ d . Note that there is no ξ here other than the ones in the coefficient. As a result, all terms contain δ +,ξ+ and alike are 0. 
Hence, we have Q 1 := E 2 -SA (x + A , x + B ) -SB (x + A , x + B ) z + (z + ) ⊤ d -K E SA E +,- E +,+ z -(z + ) ⊤ d -K E SB E -,+ E +,+ z + (z -) ⊤ d =: T 1 (Q 1 ) + T 2 (Q 1 ) + T 3 (Q 1 ). Now we estimate each of these three terms separately. We also deal with the diagonal and offdiagonal terms separately. By Lemma C.4, we have T 1 ([Q 1 ] p,p ) = E 2 -SA (x + A , x + B ) -SB (x + A , x + B ) = E 2 -S 1 + (1 -S)( δ+,+ -δ+,T + ) -S 1 + (1 -S)( δ+,+ -δT +,+ ) ± O d 2 δ 2 AB,⊥ = 2(1 -S) ± O d 2 δ 2 AB,⊥ . Note that we use the fact that all these δ's have mean 0. Also by Lemma C.4, we have T 2 ([Q 1 ] p,p ) = -K E S Ẽ+,- Ẽ0 1 -S δ+,+ -(1 -S) δ+,T + + δ+,- z - p z + p d ± O d 2 δ 2 AB,⊥ = - SK Ẽ0 E Ẽ+,-z - p z + p d - SK Ẽ0 E Ẽ+,--S δ+,+ -(1 -S) δ+,T + + δ+,-z - p z + p d ± O d 2 δ 2 AB,⊥ . Note that δ+,+ , δ+,and δ+,T + only contain cross terms of form z + i z ± j with i ̸ = j. Hence, the second term is 0. Meanwhile, by Lemma C.2, the first term is - SK Ẽ0 E Ẽ+,-z - p z + p d = - SKZ c T p Ẽ0 = -(1 -S)T p . As a result, T 2 ([Q 1 ] p,p ) = -(1 -S)T p ± O d 2 δ 2 AB,⊥ . Similarly, one can show that T 3 ([Q 1 ] p,p ) = -(1 -S)T p ± O d 2 δ 2 AB,⊥ also holds. Thus, [Q 1 ] p,p = 2(1 -S)(1 -T p ) ± O d 2 δ 2 AB,⊥ . Now, we consider the off-diagonal terms. For notational simplicity, we define K AB = K ⊤ A K B . For any p ̸ = q, we compute T 1 ([Q 1 ] p,q ) = -E SA (x + A , x + B ) + SB (x + A , x + B ) z + p z + q d = -S(1 -S) E 2 δ+,+ -δ+,T + -δT +,T z + p z + q d ± O d 2 δ 2 AB,⊥ = -S(1 -S) i̸ =j [K AB ] i,j N A N B (2 -T i -T j ) E z + i z + j z + p z + q d ± O d 2 δ 2 AB,⊥ . Clear that the summand is nonzero only if (i, j) = (p, q) or (i, j) = (q, p). Hence, T 1 ([Q 1 ] p,q ) = -S(1 -S) [K AB ] p,q + [K AB ] q,p N A N B d (2 -T p -T q ) ± O d 2 δ 2 AB,⊥ . 
Then, for T 2 , we compute T 2 ([Q 1 ] p,q ) = -K E S A (x + A , x + B ) exp(f + A • f - B ) exp(f + A • f + B ) z - p z + q d = -K E S Ẽ+,- Ẽ0 1 -S δ+,+ -(1 -S) δ+,T + + δ+,- z - p z + q d ± O d 2 δ 2 AB,⊥ = -K E S Ẽ+,- Ẽ0 -S δ+,+ -(1 -S) δ+,T + + δ+,- z - p z + q d ± O d 2 δ 2 AB,⊥ = -K i̸ =j [K AB ] i,j N A N B E S Ẽ+,- Ẽ0 -Sz + i z + j -(1 -S)z + i z + j T j + z + i z - j z - p z + q d ± O d 2 δ 2 AB,⊥ . Again, the summand is nonzero only if (i, j) = (p, q) or (i, j) = (q, p). By Lemma C.2, we have T 2 ([Q 1 ] p,q ) = - SK Ẽ0 [K AB ] p,q N A N B -S -(1 -S)T q E exp(I +,-)z + p z - p + E exp(I +,-)z + p z + q z - p z - q d - SK Ẽ0 [K AB ] q,p N A N B -S -(1 -S)T p E exp(I +,-)z + p z - p + d -1 E {exp(I +,-)} ± O d 2 δ 2 AB,⊥ = - SKZ c Ẽ0 [K AB ] p,q N A N B d -S -(1 -S)T q T p + T p T q - SKZ c Ẽ0 [K AB ] q,p N A N B d -S -(1 -S)T p T p + 1 ± O d 2 δ 2 AB,⊥ = -(1 -S) [K AB ] p,q N A N B d ST p (T q -1) -(1 -S) [K AB ] q,p N A N B d -ST p -(1 -S)T 2 p + 1 ± O d 2 δ 2 AB,⊥ . We now estimate each of these three terms. Again, the strategy is to leverage the symmetry of the distribution of ξ to argue that some part of the expectation is 0. For T 1 , we have T 1 ([Q 2 ] p,q ) = -S(1 -S) E (2δ +,ξ+ + 2δ ξ+,+ -δ ξ+,T + -δ T +,ξ+ ) [ξ + B ] p [ξ + A ] q d ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Note that none of these δ's depends on both ξ + A and ξ + B . Therefore, the first term is 0. Similarly, for T 2 , we have T 2 ([Q 2 ] p,q ) = - K S Ẽ E Ẽ+,-δ +,ξ-+ δ ξ+,--S (δ +,ξ+ + δ ξ+,+ ) -(1 -S)δ ξ+,T + [ξ - B ] p [ξ + A ] q d ± O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S = ±O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . The same is also true for T 3 . Thus, [Q 2 ] p,q = ±O d 3 δ AB,⊥ + δ N/S δ ξ,⊥ δ N/S . Proof of Lemma C.8. 
Recall from Corollary A.4 that Q 0 is defined as Q 0 := -E x + A ,x + B 2 -S A (x + A , x + B ) -S B (x + A , x + B ) f + A , f + B + K E x + A ,x ± B S A (x + A , x + B ) exp(τ 2 t f + A • f - B ) exp(τ 2 t f + A • f + B ) f + A , f - B + K E x ± A ,x + B S B (x + A , x + B ) exp(τ 2 t f - A • f + B ) exp(τ 2 t f + A • f + B ) f - A , f + B . We write f + A , f - B = ⟨K A z + , K B z -⟩ N A N B ± O κ 0 dδ N/S δ ξ,⊥ = i,j∈[r] [K AB ] i,j N A N B z + i z - j ± O κ 0 dδ N/S δ ξ,⊥ . Therefore, we have Q 0 = - i,j∈[r] [K AB ] i,j N A N B [Q 1 ] i,j ± O κ 0 dδ N/S δ ξ,⊥ = - r k=1 κ 2 k ∥κ∥ 2 [Q 1 ] k,k ± O d 2 δ 2 AB,⊥ + δ -+ κ 0 dδ N/S δ ξ,⊥ . Proof of Corollary C.9. Recall from Lemma A.5 that d dt ∥[K A ] p ∥ 2 = 2 ⟨[K A ] p , [K B Q 1 ] p ⟩ N A N B d σ 2 p + 2 ⟨[K A ] p , [K B,ξ Q 1,ξ B ] p ⟩ N A N B d σ 2 p + 2 ∥[K A ] p ∥ 2 N 2 A d Q 0 σ 2 p =: 3 i=1 T i d dt ∥[K A ] p ∥ 2 . By Lemma C.5, we have T 1 d dt ∥[K A ] p ∥ 2 = 2 r k=1 ∥[K A ] p ∥ ∥[K B ] k ∥ N A N B d [K A ] p , [K B ] k [Q 1 ] k,p σ 2 p = 2σ 2 p ∥[K A ] p ∥ ∥[K B ] p ∥ N A N B d [K A ] p , [K B ] p [Q 1 ] p,p ± O σ 2 p κ 2 p N A N B d κ 0 dδ 2 AB,⊥ . By Lemma C.6, we have T 2 d dt ∥[K A ] p ∥ 2 = 2 d-r k=1 ∥[K A ] p ∥ ∥[K B,ξ ] k ∥ N A N B d [K A ] p , [K B,ξ ] k [Q 1,ξ B ] k,p σ 2 p = O σ 2 p κ 2 p N A N B d dδ 2 N/S δ 2 ξ,⊥ . Combine these together and we obtain d dt ∥[K A ] p ∥ 2 = 2σ 2 p ∥[K A ] p ∥ ∥[K B ] p ∥ N A N B d [K A ] p , [K B ] p [Q 1 ] p,p + 2 ∥[K A ] p ∥ 2 N 2 A d Q 0 σ 2 p ± O σ 2 p κ 2 p N A N B d κ 0 dδ 2 AB,⊥ . To get the formula for d dt ∥[K B ] p ∥ 2 , it suffices to interchange the roles of A and B. Similarly, for the noises, we write d dt ∥[K A,ξ ] q ∥ 2 = 2 [K A,ξ ] q , [K B Q ⊤ 1,ξ A ] q N A N B d σ 2 ξ + 2 ⟨[K A,ξ ] q , [K B,ξ Q 2 ] q ⟩ N A N B d σ 2 ξ + 2 ∥[K A,ξ ] q ∥ 2 F N 2 A d Q 0 σ 2 ξ = 3 i=1 T i d dt ∥[K A,ξ ] q ∥ 2 . 
By Lemma C.6, we have T 1 d dt ∥[K A,ξ ] q ∥ 2 = 2 r k=1 ∥[K A,ξ ] q ∥ ∥[K B ] k ∥ N A N B d [K A,ξ ] q , [K B ] k [Q 1,ξ A ] q,k σ 2 ξ = O σ 2 ξ ∥[K A,ξ ] q ∥ 2 N A N B d dδ 2 ξ,⊥ . By Lemma C.7, we have T 2 d dt ∥[K A,ξ ] q ∥ 2 = 2 d-r k=1 ∥[K A,ξ ] q ∥ ∥[K B,ξ ] k ∥ N A N B d [K A,ξ ] q , [K B,ξ ] k [Q 2 ] k,q σ 2 ξ = O σ 2 ξ ∥[K A,ξ ] q ∥ 2 N A N B d d 4 δ AB,⊥ + δ N/S δ 2 ξ,⊥ δ N/S . Combine these together and we get d dt ∥[K A,ξ ] q ∥ 2 = 2 ∥[K A,ξ ] q ∥ 2 N 2 A d Q 0 σ 2 ξ ± O σ 2 ξ ∥[K A,ξ ] q ∥ 2 N A N B d dδ 2 ξ,⊥ .

C.2 MAINTAINING THE ORTHOGONALITY

In this subsection, we control the growth of δ AB,⊥ and δ ξ,⊥ . First, we derive the equations that govern the evolution of the off-diagonal terms. For the first term, we have [K A ] p ⊤ I -[K B ] q [K B ] q ⊤ [K A Q ⊤ 1 ] q ∥[K B ] q ∥ = r k=1 [K A ] p , [K A ] k -[K A ] p , [K B ] q [K B ] q , [K A ] k ∥[K A ] k ∥ [Q 1 ] q,k ∥[K B ] q ∥ . When k / ∈ {p, q}, we have [K A ] p , [K A ] k ≤ O(δ AB,⊥ ), [K B ] q , [K A ] k ≤ O(δ AB,⊥ ), and ∥[K A ] k ∥ / ∥[K B ] q ∥ ≤ κ 0 . Meanwhile, by Lemma C.5, we also have |[Q 1 ] q,k | ≤ O(δ AB,⊥ ). Hence, k / ∈{p,q} [K A ] p , [K A ] k -[K A ] p , [K B ] q [K B ] q , [K A ] k ∥[K A ] k ∥ [Q 1 ] q,k ∥[K B ] q ∥ ≤ O dκ 0 δ 2 AB,⊥ . Therefore, [K A ] p , d dt [K B ] q σ 2 q N A N B d -1 = 1 -[K A ] p , [K B ] q [K B ] q , [K A ] p ∥[K A ] p ∥ [Q 1 ] q,p ∥[K B ] q ∥ + [K A ] p , [K A ] q -[K A ] p , [K B ] q [K B ] q , [K A ] q ∥[K A ] q ∥ [Q 1 ] q,q ∥[K B ] q ∥ ± O dκ 0 δ 2 AB,⊥ ± O dδ 2 ξ,⊥ δ 2 N/S . For the first term, we have 1 -[K A ] p , [K B ] q [K B ] q , [K A ] p ∥[K A ] p ∥ [Q 1 ] q,p ∥[K B ] q ∥ = 1 ± δ 2 AB,⊥ κ p [Q 1 ] q,p κ q 1 ± δ - = κ p [Q 1 ] q,p κ q ± O κ 0 δ AB,⊥ δ -± O κ 0 δ 3 AB,⊥ . For the second term, we have [K A ] p , [K A ] q -[K A ] p , [K B ] q [K B ] q , [K A ] q ∥[K A ] q ∥ [Q 1 ] q,q ∥[K B ] q ∥ = [K A ] p , [K A ] q -[K A ] p , [K B ] q (1 ± δ -) [Q 1 ] q,q 1 ± δ - = [K A ] p , [K A ] q -[K A ] p , [K B ] q [Q 1 ] q,q ± O δ ⊥ δ -. Thus, [K A ] p , d dt [K B ] q σ 2 q N A N B d -1 = κ p [Q 1 ] q,p κ q ± O κ 0 δ AB,⊥ δ -± O κ 0 δ 3 AB,⊥ + [K A ] p , [K A ] q -[K A ] p , [K B ] q [Q 1 ] q,q ± O δ ⊥ δ - ± O dκ 0 δ 2 AB,⊥ ± O dδ 2 ξ,⊥ δ 2 N/S = κ p [Q 1 ] q,p κ q + [K A ] p , [K A ] q -[K A ] p , [K B ] q [Q 1 ] q,q ± O dκ 0 δ AB,⊥ δ -+ δ AB,⊥ . 
Hence, [K A ] p , d dt [K B ] q = κ p [Q 1 ] q,p κ q σ 2 q N A N B d + [K A ] p , [K A ] q -[K A ] p , [K B ] q [Q 1 ] q,q σ 2 q N A N B d ± O σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ Interchange the roles of A, B, p, q, replace Q 1 with Q ⊤ 1 , and we obtain d dt [K A ] p , [K B ] q = κ q [Q 1 ] q,p κ p σ 2 p N A N B d + [K B ] p , [K B ] q -[K A ] p , [K B ] q [Q 1 ] p,p σ 2 p N A N B d ± O σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ Combine these together and we complete the proof. Proof of Lemma C.11. Similar to the proof of the previous lemma, we compute d dt [K A ] p , [K A ] q = [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B Q 1 ] p ∥[K A ] p ∥ + [K B,ξ Q 1,ξ B ] p ∥[K A ] p ∥ σ 2 p N A N B d = [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B Q 1 ] p ∥[K A ] p ∥ σ 2 p N A N B d ± O σ 2 max N A N B d dδ 2 ξ,⊥ δ 2 N/S . Again, we have [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B Q 1 ] p ∥[K A ] p ∥ = [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B ] p ∥[K B ] p ∥ [Q 1 ] p,p ∥[K A ] p ∥ + [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B ] q ∥[K B ] q ∥ [Q 1 ] q,p ∥[K A ] p ∥ + k / ∈{p,q} [K A ] q ⊤ I -[K A ] p [K A ] p ⊤ [K B ] k ∥[K B ] k ∥ [Q 1 ] k,p ∥[K A ] p ∥ = κ q [Q 1 ] q,p κ p + [K B ] p , [K A ] q -[K A ] p , [K A ] q [Q 1 ] p,p ± dκ 0 δ AB,⊥ δ -+ δ AB,⊥ . Therefore, d dt [K A ] p , [K A ] q = [Q 1 ] q,p κ q /κ p σ 2 p N A N B d + [K B ] p , [K A ] q -[K A ] p , [K A ] q [Q 1 ] p,p σ 2 p N A N B d ± σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ . Interchange the roles of p, q and we obtain [K A ] p , d dt [K A ] q = [Q 1 ] p,q κ p /κ q σ 2 q N A N B d + [K B ] q , [K A ] p -[K A ] p , [K A ] q [Q 1 ] q,q σ 2 q N A N B d ± σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ . 
Combine these together and we get d dt [K A ] p , [K A ] q = [Q 1 ] q,p κ q /κ p σ 2 p N A N B d + [Q 1 ] p,q κ p /κ q σ 2 q N A N B d -[K A ] p , [K A ] q [Q 1 ] p,p σ 2 p N A N B d + [Q 1 ] q,q σ 2 q N A N B d + [K A ] p , [K B ] q [Q 1 ] q,q σ 2 q N A N B d + [K B ] p , [K A ] q [Q 1 ] p,p σ 2 p N A N B d ± σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ . Interchange the roles of A, B, replace Q 1 with Q ⊤ 1 , and we obtain the formula for d dt [K B ] p , [K B ] q .

Proof of Lemma C.12. First, we consider the [Q 1 ] p,q -related terms. By Lemma C.5, we have [Q 1 ] p,q = -S(1 -S) [K AB ] p,q + [K BA ] q,p N A N B d (2 -T p -T q ) -(1 -S) [K AB ] p,q N A N B d S (2T p T q -T p -T q ) -(1 -S) [K AB ] q,p N A N B d 2 -S(T p + T q ) -(1 -S)(T 2 p + T 2 q ) ± O d 2 δ 2 AB,⊥ = O κ p κ q δ AB,⊥ N A N B d . Hence, [Q 1 ] p,q κ q /κ p σ 2 p N A N B d = O κ 2 q N A N B d σ 2 p N A N B d δ AB,⊥ = O 1 √ d σ 2 max N A N B d δ AB,⊥ . The same bound also holds for the other [Q 1 ] p,q -related terms. The important point is that we gain an additional 1/ √ d factor.

Now, we are ready to prove the result. For notational simplicity, define Z 1 = [K A ] p , [K B ] q , Z 2 = [K B ] p , [K A ] q , Z 3 = [K A ] p , [K A ] q , and Z 4 = [K B ] p , [K B ] q . Also define G p := [Q 1 ] p,p σ 2 p /(N A N B d) and similarly for G q . Then, we can write the results of Lemma C.10 and Lemma C.11 as

d dt (Z 1 , Z 2 , Z 3 , Z 4 ) ⊤ = M (Z 1 , Z 2 , Z 3 , Z 4 ) ⊤ ± O 1 √ d σ 2 max N A N B d δ AB,⊥ ± O σ 2 max N A N B d dκ 0 δ AB,⊥ δ -+ δ AB,⊥ ,

where M has rows (-G p -G q , 0, G q , G p ), (0, -G p -G q , G p , G q ), (G q , G p , -G p -G q , 0) and (G p , G q , 0, -G p -G q ). The eigenvalues of M are -2G p , -2G q , -2G p -2G q and 0. For the first three eigenvalues, note that G p = [Q 1 ] p,p σ 2 p N A N B d ≥ 1 √ d σ 2 max N A N B d . Hence, the signals will dominate the noises, in particular the first term on the second line, and push ∥Z∥ toward 0.
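The spectrum claimed for this 4×4 coupling matrix can be verified numerically. A minimal NumPy sketch; the values of G_p and G_q are arbitrary positive placeholders standing in for [Q 1 ] p,p σ 2 p /(N A N B d) and its q-counterpart:

```python
import numpy as np

def coupling_matrix(Gp, Gq):
    """The 4x4 matrix governing d/dt (Z1, Z2, Z3, Z4) in the proof of Lemma C.12."""
    return np.array([
        [-Gp - Gq, 0.0,       Gq,        Gp],
        [0.0,      -Gp - Gq,  Gp,        Gq],
        [Gq,       Gp,        -Gp - Gq,  0.0],
        [Gp,       Gq,        0.0,       -Gp - Gq],
    ])

Gp, Gq = 0.7, 0.3                      # placeholder rates
M = coupling_matrix(Gp, Gq)
# Eigenvalues are -2(Gp+Gq), -2Gp, -2Gq and 0, as stated.
eig = np.linalg.eigvalsh(M)            # ascending order
assert np.allclose(eig, sorted([-2 * (Gp + Gq), -2 * Gp, -2 * Gq, 0.0]))
# The kernel is spanned by (1, 1, 1, 1) -- the eigen-pair treated separately.
assert np.allclose(M @ np.ones(4), 0.0)
```

The matrix is symmetric, so the three negative eigenvalues contract every direction orthogonal to (1, 1, 1, 1).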
Now we consider the eigen-pair (0, (1, 1, 1, 1)), for which we will use the actual Lemma C.15. In Stage 2, we have  d dt [K A ] p , [K B ] p = Ω σ 2 p [Q 1 ] p,p N A N B d 1 -[K A ] p , [K B ] p ± O σ 2 max N A N B d d dt [K A ] p , [K B ] p = [K B ] p ⊤ I -[K A ] p [K A ] p ⊤ [K B Q 1 ] p ∥[K A ] p ∥ σ 2 p N A N B d + [K B ] p ⊤ I -[K A ] p [K A ] p ⊤ [K B,ξ Q 1,ξ B ] p ∥[K A ] p ∥ σ 2 p N A N B d =: T 1 d dt [K A ] p , [K B ] p + T 2 d dt [K A ] p , [K B ] p . For T 1 , we compute T 1 d dt [K A ] p , [K B ] p = [K B ] p ⊤ I -[K A ] p [K A ] p ⊤ [K B ] p ∥[K B ] p ∥ [Q 1 ] p,p ∥[K A ] p ∥ σ 2 p N A N B d + k̸ =p [K B ] p ⊤ I -[K A ] p [K A ] p ⊤ [K B ] k ∥[K B ] k ∥ [Q 1 ] k,p ∥[K A ] p ∥ σ 2 p N A N B d = 1 -[K A ] p , [K B ] p 2 ∥[K B ] p ∥ [Q 1 ] p,p ∥[K A ] p ∥ σ 2 p N A N B d ± O σ 2 max N A N B d dκ 0 δ 2 AB,⊥ . For T 2 , we compute T 2 d dt [K A ] p , [K B ] p = d-r k=1 [K B ] p ⊤ I -[K A ] p [K A ] p ⊤ [K B,ξ ] k [Q 1,ξ B ] k,p ∥[K A ] p ∥ σ 2 p N A N B d = ±O σ 2 max N A N B d dδ 2 N/S δ 2 ξ,⊥ . Therefore, d dt [K A ] p , [K B ] p = 1 -[K A ] p , [K B ] p 2 ∥[K B ] p ∥ [Q 1 ] p,p ∥[K A ] p ∥ σ 2 p N A N B d ±O σ 2 max N A N B d dκ 0 δ 2 AB,⊥ .
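Dropping the error terms, the diagonal alignment a(t) = ⟨[K A ] p , [K B ] p ⟩ therefore follows a logistic-type ODE ȧ = G(1 - a²) with G > 0, whose solution is a tanh curve converging to 1. A hedged numerical sketch; G is a placeholder rate, not a quantity computed from the model:

```python
import numpy as np

# Euler-integrate da/dt = G * (1 - a^2), the leading-order alignment dynamics.
G, dt, steps = 1.0, 1e-3, 8000
a0 = 0.1                     # imperfect alignment at the start of Stage 2
a = a0
for _ in range(steps):
    a += dt * G * (1.0 - a ** 2)
# Closed form: a(t) = tanh(G t + artanh(a0)), which tends to 1.
t = steps * dt
assert abs(a - np.tanh(G * t + np.arctanh(a0))) < 1e-2
assert a > 0.99
```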

C.5 ESTIMATING THE CONVERGENCE RATE

In this subsection, we estimate how fast the condition number will become close to 1. Lemma C.19. Suppose that ∥[K A ] p ∥ is the largest and ∥[K A ] q ∥ the smallest among all {∥[K A ] k ∥} k∈[r] . In Stage 2, we have d dt ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ≤ - 4(1 -S)σ 2 min N A N B d (T p -T q ) ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ± O σ 2 p N A N B d d dt ∥[K A ] p ∥ 2 = 2σ 2 p N A N B d [Q 1 ] p,p ∥[K A ] p ∥ 2 + 2σ 2 p N 2 A d Q 0 ∥[K A ] p ∥ 2 ± O σ 2 p κ 2 p N A N B d κ 0 dδ 2 AB,⊥ + δ - = 4(1 -S)σ 2 p N A N B d (1 -T p ) ∥[K A ] p ∥ 2 + 4(1 -S)σ 2 p N 2 A d r k=1 κ 2 k ∥κ∥ 2 (1 -T p ) ∥[K A ] p ∥ ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 = d dt ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 - ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 d dt ∥[K A ] q ∥ 2 ∥[K A ] q ∥ 2 = - 4(1 -S) N A N B d σ 2 p T p -T -σ 2 q T q -T ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ± O σ 2 p N A N B d κ 2 0 d 2 δ 2 AB,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ . Since ∥[K A ] p ∥ is the largest, ∥[K A ] q ∥ is the smallest, T p is positively correlated with ∥[K A ] p ∥, and T is a weighted average of the T p , we have σ 2 p T p -T -σ 2 q T q -T ≥ σ 2 q (T p -T q ) . Thus, d dt ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ≤ - 4(1 -S)σ 2 min N A N B d (T p -T q ) ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ± O σ 2 p N A N B d d 2 δ 2 AB,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ κ 2 0 .

Proof of Corollary C.20. Recall that T p = tanh κ 2 p N A N B d = tanh ∥[K A ] p ∥ 2 ∥K A ∥ 2 F 1 ± O(δ -+ δ N/S ) . Since κ 0 ≤ √ d, we have ∥[K A ] p ∥ 2 / ∥K A ∥ 2 F ≤ 1/2. Note that tanh ′ (z) = 1 -tanh 2 (z) = Ω(1) for any z ≤ 1. Therefore, T p -T q ≥ Ω ∥[K A ] p ∥ 2 -∥[K A ] q ∥ 2 ∥K A ∥ 2 F ± O(δ -+ δ N/S ) ≥ Ω 1 d . Then, by Lemma C.19, we have d dt ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ≤ - 4(1 -S)σ 2 min N A N B d (T p -T q ) ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 ± O σ 2 p N A N B d d 2 δ 2 AB,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ κ 2 0 ≤ -Ω σ 2 min N A N B d ∥[K A ] p ∥ 2 ∥[K A ] q ∥ 2 . By  ∥[K A,ξ ] q ∥ 2 ∥[K A ] p ∥ 2 ≤ O σ 2 ξ N A N B d dδ 2 ξ,⊥ + d 2 δ 2 AB,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ ∥[K A,ξ ] q ∥ 2 ∥[K A ] p ∥ 2 .
Note that on the RHS of these equations, the only terms whose order may potentially be smaller than or equal to that of the LHS are the δ - -related terms. However, by (17), we can make sure that δ - is at most δ 1.5 AB,⊥ . As a result, the orders of the RHS are all greater than the orders of the LHS, which implies that these errors can at most double within poly(d) time if they are sufficiently small at the beginning of Stage 2.
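The key analytic fact behind the proof of Corollary C.20 above is that tanh has a derivative bounded below by a constant on the relevant range of arguments; a quick numerical check:

```python
import numpy as np

# T_p = tanh(kappa_p^2 / (N_A N_B d)) has argument in [0, 1] here, and
# tanh'(z) = 1 - tanh(z)^2 is decreasing, so its minimum over [0, 1] is
# 1 - tanh(1)^2, roughly 0.42 -- an absolute constant, giving the Omega(1)
# lower bound used to convert the norm gap into a gap T_p - T_q.
z = np.linspace(0.0, 1.0, 10001)
deriv = 1.0 - np.tanh(z) ** 2
assert deriv.min() > 0.41
assert np.argmin(deriv) == len(z) - 1   # minimum attained at the endpoint z = 1
```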

D FROM GRADIENT FLOW TO GRADIENT DESCENT

Converting the above gradient flow argument to a gradient descent one is standard. All our estimations can tolerate an inverse-polynomially large error. Since all the quantities of interest here are polynomially and inverse-polynomially bounded, at each step of gradient descent, one can always make the GF-to-GD discretization error sufficiently (inverse-polynomially) small by choosing a sufficiently (inverse-polynomially) small learning rate and generating sufficiently (polynomially) many samples. Since the times needed for Stage 1 and Stage 2 are both polynomial, this also implies a polynomial sample complexity.
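As an illustration of the discretization argument (on a toy loss, not the paper's actual dynamics), consider gradient flow versus gradient descent on L(w) = w²/2: the endpoint error decays roughly linearly in the learning rate, so an inverse-polynomially small rate yields an inverse-polynomially small error.

```python
import numpy as np

def gd_endpoint(w0, eta, T):
    """Run gradient descent on L(w) = w^2 / 2 for time horizon T with step eta."""
    w = w0
    for _ in range(round(T / eta)):
        w -= eta * w              # gradient of w^2 / 2 is w
    return w

w0, T = 1.0, 2.0
flow = w0 * np.exp(-T)            # gradient flow endpoint: w(T) = w0 * e^{-T}
err_coarse = abs(gd_endpoint(w0, 0.1, T) - flow)
err_fine = abs(gd_endpoint(w0, 0.001, T) - flow)
# The discretization error shrinks roughly linearly with the step size.
assert err_fine < err_coarse / 10
assert err_fine < 1e-3
```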



To make the proof cleaner, we write it in terms of gradient flow over population loss. See Section D for discussions on the discretization of gradient flow. By Lemma B.1, this condition indeed holds.



and the condition number of K A is upper bounded by poly(d), then the learned representations are aligned.

Figure 1: Simulation results. The first row reports the accuracies of different approaches on the problem of classifying positive/negative pairs, and the second row reports the largest r singular values. From left to right, the columns correspond to the non-contrastive method, the contrastive method with τ t ≡ 1 throughout the entire process, and the contrastive method with τ t switching from 0.01 to 1 at the vertical dashed line. One can make several observations here. (a) All methods quickly attain near-100% accuracy. (b) Only the contrastive methods reduce the condition number to approximately 1. (c) Even when τ t ≡ 1, we still observe the stage-wise behavior, where the model first aligns the representations in Stage 1 and then balances the representations in Stage 2.
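The two quantities tracked in the figure can be computed with standard linear algebra; a minimal sketch, where `K` and `r` are placeholders for a learned signal matrix and the signal dimension (not the experiment's actual code):

```python
import numpy as np

def top_singular_values(K, r):
    """Largest r singular values of K, normalized so the largest is 1."""
    s = np.linalg.svd(K, compute_uv=False)[:r]
    return s / s[0]

def condition_number(K, r):
    """Ratio of the largest to the r-th largest singular value."""
    s = np.linalg.svd(K, compute_uv=False)[:r]
    return s[0] / s[r - 1]

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((8, 8)))[0][:, :4]   # orthonormal columns
K_balanced = U @ U.T                                       # singular values 1, 1, 1, 1, 0, ...
K_unbalanced = U @ np.diag([4.0, 2.0, 1.0, 0.5]) @ U.T
assert abs(condition_number(K_balanced, 4) - 1.0) < 1e-8
assert abs(condition_number(K_unbalanced, 4) - 8.0) < 1e-8
assert np.allclose(top_singular_values(K_unbalanced, 4), [1.0, 0.5, 0.25, 0.125])
```

A condition number of approximately 1, as in the contrastive columns of the figure, corresponds to a balanced spectrum on the top-r block.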

Figure 2: Results of the MSCOCO experiments. The top-row plots report the training loss, alignment scores, and balance scores during training, respectively. The bottom-row plots show the downstream accuracies and the largest 25 singular values of Σ f at different epochs, normalized so that the largest one has value 1. One can see that the alignment score quickly reaches near 100%, and the balance score, as well as the downstream accuracy, increases gradually during training, which matches our theoretical analysis.

(a) K A ≈ K B after Stage 1. (b) The noise-signal ratio is small after Stage 1. (c) The condition number is O( √ d) in Stage 1. (d) The trajectory is still close to the infinite-width one in Stage 1.

Ẽ +,- = exp(I +,- ) 1 + δ +,- ± O d 2 δ 2 AB,⊥ and Ẽ +,+ = exp(I 0 ) 1 + δ +,+ ± O d 2 δ 2 AB,⊥

+ δ -+ d 2 δ N/S δ ξ,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ ,

In Section B.1, we derive estimations for the Q-matrices defined in Corollary A.4 and use them to simplify the equations governing the training dynamics. In Section B.2, we estimate the rate at which 1 - [K A ] p , [K B ] p and the noise-signal ratio converge to 0, and the growth rate of the condition number. In Section B.3, we estimate the growth rate of the discretization error. Then, in Section B.4, we prove Lemma B.1, the main lemma of Stage 1. Finally, we prove the negative result for non-contrastive learning in Section B.5.

B.1 ESTIMATIONS FOR Q AND THE DYNAMICS

Thanks to Lemma A.5, in order to analyze the dynamics, it suffices to estimate the Q-matrices. In this subsection, we derive estimations for them and use these estimations to simplify the equations in Lemma A.5. Recall that, in Stage 1, τ 2 t is small. Hence, exp(τ 2 t
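Concretely, when τ t ² is small, exp(τ t ² x) can be replaced by its first-order expansion 1 + τ t ² x at the cost of a second-order error; a quick numerical check, with a placeholder value of τ t ²:

```python
import numpy as np

tau2 = 0.01                           # placeholder small tau_t^2
x = np.linspace(-1.0, 1.0, 201)       # similarity scores of order 1
err = np.abs(np.exp(tau2 * x) - (1.0 + tau2 * x))
# |exp(s) - (1 + s)| <= s^2 for |s| <= 1, so the error here is O(tau_t^4).
assert err.max() <= tau2 ** 2
```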

dκ 0 δ 2 AB,⊥ .

Lemma C.16. Define ρ A/B,p := ∥[K A ] p ∥ 2 / ∥[K B ] p ∥ 2 and ρ B/A,p := ∥[K B ] p ∥ 2 / ∥[K A ] p ∥ 2 . In Stage 2, we have d dt ρ A/B,p + ρ B/A,p = 2[Q 1 ] p,p σ 2 p N A N B d [K A ] p , [K B ] p 2 -ρ A/B,p -ρ B/A,p . For any p, q ∈ [d -r], define ρ ξ,A/A,p/q := ∥[K A,ξ ] p ∥ 2 / ∥[K A,ξ ] q ∥ 2 and ρ ξ,A/B,p/q := ∥[K A,ξ ] p ∥ 2 / ∥[K B,ξ ] q ∥ 2 . In Stage 2, we have

d 2 δ 2 AB,⊥ + δ -+ d 2 δ N/S δ ξ,⊥ κ 2 0 .

Corollary C.20 (Convergence rate). Suppose that ∥[K A ] p ∥ is the largest and ∥[K A ] q ∥ the smallest among all {∥[K A ] k ∥} k∈[r] . For any constant c > 1, it takes at most poly(d) amount of time for ∥[K A ] p ∥ 2 / ∥[K A ] q ∥ 2 to become smaller than c.

OMITTED PROOF OF THIS SUBSECTION

Proof of Lemma C.19. By Corollary C.9, Lemma C.8 and Lemma C.5, we have

By the proof of Lemma C.19, the largest ∥[K A ] p ∥ 2 is non-increasing. Hence, N A N B d is upper bounded by some poly(d). Thus, it takes at most poly(d) time for ∥[K A ] p ∥ 2 / ∥[K A ] q ∥ 2 to become smaller than c.

C.6 PROOF OF THE MAIN LEMMA OF STAGE 2

Proof of Lemma C.1. The polynomial bound on the convergence time has been proved in Corollary C.20. For the errors, recall from Lemma C.12, Lemma C.13 and Lemma C.14 that

Recall from Lemma C.15, Lemma C.16, and Lemma C.17 that

d dt [K A ] p , [K B ] p = Ω σ 2 p [Q 1 ] p,p N A N B d 1 -[K A ] p , [K B ] p ± O σ


Similarly, for T 3 , we have

Combine these together and we obtain

Proof of Lemma C.6. We write

We will use the fact that if some quantity X does not depend on ξ, then E {Xξ} = 0 to simplify these terms. For T 1 , by Lemma C.3, we have

Hence, we can further rewrite T 1 as

Note that the summand is nonzero only if i = q and j = p. Thus,

Similarly, we also have

Then, we write

If j ̸ = p, it is clear that the summand is 0. If i ̸ = q, then by flipping the sign of z ± q simultaneously, we flip the sign of Ẽ +,- z + i z + q . Therefore, when i ̸ = q, the summand is also 0. Thus,

Similarly, for T 3 , we compute

Thus,

Combine these together, rearrange terms, and we obtain

To obtain the formula for Q 1,ξ A , it suffices to interchange the roles of A and B.

Proof of Lemma C.7. Recall that

Lemma C.10. In Stage 2, for any p ̸ = q ∈ [r], we have

Lemma C.11. In Stage 2, for any p ̸ = q ∈ [r], we have

Now, we are ready to control the off-diagonal terms. The proof is similar to the one of Lemma B.15. To provide intuition, we first consider an idealized case: we assume here that the noise-signal ratio is 0 and K A = K B , and explain how to control the off-diagonal terms for p ̸ = q ∈ [r] under these assumptions. In this case, we have

and

Note that the coefficient is negative. As a result, [K A ] p , [K B ] q will move toward 0. When K A = K B is not exactly true, the situation is trickier, because we need to deal with the middle four terms of the RHS. The idea is to view all these off-diagonal errors as a whole and show that their sum is non-increasing, up to some higher-order terms.

Lemma C.12 (Orthogonality between signals). For any p ̸ = q, define δ ⊥,p,q . In Stage 2, we have

Note that the LHS is of order δ 2 AB,⊥ and, on the RHS, the only term that can potentially have the same order is the δ - δ ⊥,p,q -related term. We will show later that δ - = o(δ AB,⊥ ), whence this is also a higher-order term. Then, we consider the orthogonality between signals and noises.

Lemma C.13 (Orthogonality between signals and noises).
For any p ∈ [r] and q ∈ [d -r], in Stage 2, we have

Finally, we deal with the orthogonality between the noises.

Lemma C.14 (Orthogonality between noises). In Stage 2, for any p, q ∈ [d -r], we have

OMITTED PROOF OF THIS SUBSECTION

Proof of Lemma C.10. Recall that

Then, we write

For the second term, by Lemma C.6, we have

Recall the form of [Q 1 ] p,q we obtained in Lemma C.5. We have

By Lemma C.5, we have

Then, we write

Therefore,

Note that the coefficient of the first term is negative. Combine this with the previous bound for ∥Z∥, and we complete the proof.

Proof of Lemma C.13. Recall that

Then, we compute

and, by Lemma C.6, we have

Combine these together and we get

Then, we compute [K A ] p , d dt [K A,ξ ] q . We write

For T 1 , when k = p, by Lemma C.6, we have

When k ̸ = p, we have

Now we consider T 2 . By Lemma C.7, we have

Therefore,

Thus,

Similarly, one can show that

For notational simplicity, define

Then, we can write

The first matrix is negative semi-definite, whence

Since the roles of K A,ξ and K B,ξ are interchangeable, the same bound also holds for

Proof of Lemma C.14. We write

Then, we compute

Note that the first part of each summand is bounded by O(δ ξ,⊥ ). For the second part, by Lemma C.6, we also have

Now we consider T 2 . By Lemma C.7, we have

Combine these together and we obtain

Similarly, one can derive the same bound for d dt [K A,ξ ] p , [K B,ξ ] q and complete the proof.
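The mechanism used repeatedly in these proofs is that if dZ/dt = MZ with M symmetric negative semi-definite (up to error terms), then ∥Z∥ is non-increasing, since d/dt ∥Z∥² = 2 Z⊤MZ ≤ 0. A toy numerical check with a placeholder M, not the actual matrix from the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
M = -A @ A.T                     # symmetric negative semi-definite
Z = rng.standard_normal(4)
dt = 1e-3
norms = []
for _ in range(2000):
    norms.append(float(np.linalg.norm(Z)))
    Z = Z + dt * (M @ Z)         # Euler step of dZ/dt = M Z
# ||Z|| never increases along the flow.
assert all(b <= a + 1e-9 for a, b in zip(norms, norms[1:]))
```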

C.3 MAINTAINING K

In this subsection, we show that

The proofs are similar to the corresponding ones in Stage 1. Then, by symmetry, we have

Proof of Lemma C.16. Similar to Stage 1, we define

By Lemma A.5, we have

Then, by Lemma C.5 and Lemma C.6, we have

Thus,

Interchange the roles of A, B and we get

Hence,

Proof of Lemma C.17. For notational simplicity, we drop the subscript ξ in the proof. Recall from Corollary C.9 that

Hence, for any p, q ∈ [d -r], we have

Similarly, we have

By symmetry, we also have

Hence,

C.4 CONTROLLING THE NOISE-SIGNAL RATIO

In this subsection, we show that the noise-signal ratio remains small throughout Stage 2.

Lemma C.18. Let ∥[K A ] p ∥ be the smallest among all {∥[K A ] k ∥} k∈[r] . For any q ∈ [r], in Stage 2, we have

Proof. Recall from Corollary C.9 that

Since the condition number of K A is bounded by √ d, it suffices to consider the smallest ∥[K A ] p ∥, for which we have

where the last line comes from Lemma C.8. By Lemma B.4, [Q 1 ] p,p is negatively correlated with κ 2 p . As a result, we have

For K A,ξ , we simply have

Thus,

