UNDERSTANDING THE ROLE OF NONLINEARITY IN TRAINING DYNAMICS OF CONTRASTIVE LEARNING

Abstract

While the empirical success of self-supervised learning (SSL) heavily relies on deep nonlinear models, existing theoretical work on SSL still focuses on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one- and two-layer nonlinear networks with homogeneous activation h(x) = h′(x)x. We make two major theoretical discoveries. First, the presence of nonlinearity can lead to many local optima even in the 1-layer setting, each corresponding to certain patterns in the data distribution, whereas with linear activation only one major pattern can be learned. This suggests that models with many parameters can be regarded as a brute-force way to find the local optima induced by nonlinearity. Second, in the 2-layer case, linear activation is proven incapable of specializing weights into diverse patterns, demonstrating the importance of nonlinearity. In addition, for the 2-layer setting, we also discover global modulation: local patterns that are discriminative from the perspective of global-level patterns are prioritized for learning, further characterizing the learning process. Simulations verify our theoretical findings.

1. INTRODUCTION

Over the last few years, deep models have demonstrated impressive empirical performance in many disciplines, not only in supervised learning but also in the recent self-supervised learning (SSL) setting, in which models are trained with a surrogate loss (e.g., predictive (Devlin et al., 2018; He et al., 2021), contrastive (Chen et al., 2020; Caron et al., 2020; He et al., 2020) or non-contrastive loss (Grill et al., 2020; Chen & He, 2020)) and the learned representation is then used for downstream tasks. From the theoretical perspective, understanding the role of nonlinearity in deep neural networks is a critical part of understanding how modern deep models work. Currently, most works focus on linear variants of deep models (Jacot et al., 2018; Arora et al., 2019a; Kawaguchi, 2016; Jing et al., 2022; Tian et al., 2021; Wang et al., 2021). When nonlinearity is involved, deep models are often treated as richer families of black-box functions than linear ones (Arora et al., 2019b; HaoChen et al., 2021). The role played by nonlinearity has also been studied, mostly in terms of model expressibility (Gühring et al., 2020; Raghu et al., 2017; Lu et al., 2017), in which specific weights are shown to fit the complicated structure of the data well, regardless of the training algorithm. However, many questions remain open: if model capacity is the key, why do traditional models like k-NN (Fix & Hodges, 1951) or kernel SVM (Cortes & Vapnik, 1995) not achieve comparable empirical performance, even though theoretically they can also fit arbitrary functions (Hammer & Gersmann, 2003; Devroye et al., 1994)? Moreover, while traditional ML theory suggests carefully controlling model capacity to avoid overfitting, large neural models often generalize well in practice (Brown et al., 2020; Chowdhery et al., 2022). In this paper, we study the critical role of nonlinearity in the training dynamics of contrastive learning (CL).
Specifically, by extending the recent α-CL framework (Tian, 2022) and linking it to kernels (Paulsen & Raghupathi, 2016), we show that even in 1-layer nonlinear networks, nonlinearity plays a critical role by creating many local optima. As a result, the more nonlinear nodes a 1-layer network has with different initializations, the more local optima are likely to be collected as learned patterns in the trained weights, and the richer the resulting representation becomes. Moreover, popular loss functions like InfoNCE tend to have more local optima than quadratic ones. In contrast, in the linear setting, contrastive learning becomes PCA under certain conditions (Tian, 2022), and only the most salient pattern (i.e., the maximal eigenvector of the data covariance matrix) is learned while less salient ones are lost, regardless of the number of hidden nodes. Based on this finding, we extend our analysis to the 2-layer ReLU setting with non-overlapping receptive fields. In this setting, we prove a fundamental limitation of linear networks: the gradients of multiple weights within the same receptive field are always co-linear, preventing diverse pattern learning. Finally, we also characterize the interaction between layers in a 2-layer network: while many patterns exist in each receptive field, those contributing to global patterns are prioritized by the training dynamics. This global modulation changes the eigenstructure of the low-level covariance matrix so that relevant patterns are learned with higher probability. In summary, through the lens of training dynamics, we discover unique roles played by nonlinearity that linear activation cannot fulfill: (1) nonlinearity creates many local optima for different patterns of the data, and (2) nonlinearity enables weight specialization into diverse patterns. In addition, we also discover a mechanism by which global patterns prioritize which local patterns to learn, shedding light on the role played by network depth.
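The linear-setting claim above can be checked numerically. The following is a minimal NumPy sketch (our own construction, not the paper's exact dynamics): with linear activation and constant pairwise importance, gradient ascent on E(W) = ½ tr(W C W⊤) with row-normalized weights amounts to power iteration, so every hidden node converges to the same maximal eigenvector of the data covariance, no matter how many hidden nodes are used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data whose covariance has two "patterns" of different saliency.
N, d = 512, 2
X = rng.multivariate_normal(np.zeros(d), np.diag([3.0, 1.0]), size=N)

# 1-layer linear model f(x) = Wx with K hidden nodes, rows kept unit-norm.
K = 8
W = rng.standard_normal((K, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# C stands in for the contrastive covariance under constant alpha
# (quadratic loss) and negligible augmentation noise.
C = X.T @ X / N
for _ in range(200):
    W += 0.1 * (W @ C)   # dE/dW = W C for E(W) = 0.5 tr(W C W^T)
    W /= np.linalg.norm(W, axis=1, keepdims=True)

# Every row converges (up to sign) to the same maximal eigenvector:
# no diverse patterns are learned in the linear case.
evals, evecs = np.linalg.eigh(C)
top = evecs[:, -1]
print(np.abs(W @ top))   # all entries close to 1
```

With a nonlinear homogeneous activation, by contrast, the paper argues the landscape contains many local optima, so differently initialized nodes can settle on different patterns.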
Preliminary experiments on simulated data verify our findings. Different from Saunshi et al. (2022) and Ji et al. (2021), which analyze the feature learning process in linear models of CL, we focus on the critical role played by nonlinearity. Our analysis is also more general than Li & Yuan (2017), which focuses on a 1-layer ReLU network with symmetric weight structure trained on sparse linear models. Along the line of studying the dynamics of contrastive learning, Jing et al. (2022) analyze dimensional collapse in 1- and 2-layer linear networks. Tian (2022) proves that such collapse happens in linear networks of any depth and further analyzes ReLU scenarios, but under strong assumptions (e.g., one-hot positive input). Our work uses much more relaxed assumptions and performs an in-depth analysis for homogeneous activations.

2. PROBLEM SETUP

Notation. In this section, we introduce our problem setup of contrastive learning. Let x₀ ∼ p_D(·) be a sample drawn from the dataset, and x ∼ p_aug(·|x₀) be an augmented view of the sample x₀. Here both x₀ and x are random variables. Let f = f(x; θ) be the output of a deep neural network that maps the input x into some representation space, with parameters θ to be optimized. Given a batch of size N, x₀[i] represents the i-th sample (i.e., instantiation) of the corresponding random variable, and x[i] and x[i′] are two of its augmented views. Here x[·] has 2N samples, with 1 ≤ i ≤ N and N + 1 ≤ i′ ≤ 2N. Contrastive learning (CL) aims to learn the parameters θ so that the representations f are distinct from each other: we want to maximize the squared distance d²_ij := ∥f[i] − f[j]∥²₂/2 between samples i ≠ j and minimize d²_i := ∥f[i] − f[i′]∥²₂/2 between the two views x[i] and x[i′] of the same sample x₀[i]. Many objectives in contrastive learning have been proposed to combine these two goals into one. For example, InfoNCE (Oord et al., 2018) minimizes the following (here τ is the temperature):

    L_nce := −τ Σ_{i=1}^N log [ exp(−d²_i/τ) / (ϵ exp(−d²_i/τ) + Σ_{j≠i} exp(−d²_ij/τ)) ]

In this paper, we follow α-CL (Tian, 2022), a general CL framework that covers a broad family of existing CL losses. α-CL maximizes an energy function E_α(θ) using gradient ascent: θ_{t+1} = θ_t + η ∇_θ E_{sg(α(θ_t))}(θ), where η is the learning rate, sg(·) is the stop-gradient operator, the energy function is E_α(θ) := ½ tr C_α[f, f], and C_α[·, ·] is the contrastive covariance (Tian, 2022; Jing et al., 2022)¹:

    C_α[a, b] := (1/2N²) Σ_{i,j=1}^N α_ij (a[i] − a[j])(b[i] − b[j])⊤ − (1/2N) Σ_{i=1}^N (a[i] − a[i′])(b[i] − b[i′])⊤    (3)

One important quantity is the pairwise importance α(θ) = [α_ij(θ)]_{i,j=1}^N, which are N² weights on the pairs of N samples in a batch.
Intuitively, these weights make the training focus more on hard negative pairs, i.e., distinct sample pairs that are similar in the representation space but are supposed to be separated. Many existing CL losses (InfoNCE, triplet loss, etc.)



¹ Compared to Tian (2022), our C_α definition has an additional constant term 1/(2N) to simplify the notation.



Related works. Many works analyze networks at initialization (Hayou et al., 2019; Roberts et al., 2021) and thus avoid the complicated training dynamics. Previous works that analyze training dynamics (Wilson et al., 1997; Li & Yuan, 2017; Tian et al., 2019; Tian, 2017; Allen-Zhu & Li, 2020) mostly focus on supervised learning.

are special cases of α-CL (Tian, 2022), obtained by choosing different α(θ): e.g., the quadratic loss corresponds to α_ij := const, and InfoNCE (with ϵ = 0) corresponds to α_ij := exp(−d²_ij/τ) / Σ_{j≠i} exp(−d²_ij/τ). For brevity, C_α[x] := C_α[x, x]. For the energy function E_α(θ) := ½ tr C_α[f(x; θ)], in this work we mainly study its landscape, i.e., the existence of local optima, their local properties, and overall
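The quantities above can be written out concretely. Below is a short NumPy sketch (function names are ours) that computes the InfoNCE pairwise importance α_ij (with ϵ = 0) and the contrastive covariance of Eq. 3 for a toy batch of representations:

```python
import numpy as np

def infonce_alpha(f, tau=0.5):
    """alpha_ij = exp(-d_ij^2/tau) / sum_{j != i} exp(-d_ij^2/tau)."""
    d2 = ((f[:, None, :] - f[None, :, :]) ** 2).sum(-1) / 2   # d_ij^2
    w = np.exp(-d2 / tau)
    np.fill_diagonal(w, 0.0)            # exclude the j == i term
    return w / w.sum(axis=1, keepdims=True)

def contrastive_covariance(f, f_prime, alpha):
    """C_alpha of Eq. 3: f, f_prime are (N, d) views, alpha is (N, N)."""
    N, d = f.shape
    diff = f[:, None, :] - f[None, :, :]                      # f[i] - f[j]
    pair = np.einsum('ij,ijk,ijl->kl', alpha, diff, diff) / (2 * N**2)
    pos_diff = f - f_prime                                    # f[i] - f[i']
    pos = pos_diff.T @ pos_diff / (2 * N)
    return pair - pos

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 3))                       # N = 8 samples, d = 3
f_prime = f + 0.01 * rng.standard_normal(f.shape)     # second augmented views

alpha = infonce_alpha(f)
C = contrastive_covariance(f, f_prime, alpha)
energy = 0.5 * np.trace(C)    # E_alpha(theta) = 0.5 tr C_alpha[f, f]
print(C.shape, energy)
```

Note how the α weights are largest for pairs whose representations are closest (hard negatives), so gradient ascent on the energy pushes exactly those pairs apart while pulling the two views of each sample together.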

