SIMILARITY AND GENERALIZATION: FROM NOISE TO CORRUPTION

Anonymous

Abstract

Contrastive learning aims to extract distinctive features from data by finding an embedding representation in which similar samples are close to each other and different ones are far apart. We study how neural networks generalize in similarity learning in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and the online/offline correspondence. We focus on the simplest contrastive learning representative: Siamese Neural Networks (SNNs). We introduce two representative noise sources that can act on SNNs: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric but preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the density of pairs in the dataset crucially affects generalization. Training SNNs on sparse datasets affected by the same amount of PLN or SLN yields the same performance. In contrast, on dense datasets, PLN cases generalize worse than SLN ones in the overparameterized region. Indeed, in this regime, the similarity violation induced by PLN becomes macroscopic, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise in all the scenarios considered.

1. INTRODUCTION

In recent years, several works have studied generalization in neural networks (NNs) and the connection between the classical underparameterized regime, where the number of training samples is larger than the number of parameters in the model, and that of deep learning, where the opposite is usually the norm. Indeed, the empirical success of overparameterized NNs challenges conventional wisdom in classical statistical learning, as it is widely known among practitioners that larger models (with more parameters) often achieve better generalization: Szegedy et al. (2015); Huang et al. (2019); Radford et al. (2019). Two frameworks adopted to study generalization in regression or classification tasks are Double Descent (DD) and the online/offline learning correspondence, which we describe in the following. DD, introduced by Belkin et al. (2019), connects "classical" and "modern" machine learning by observing that once the model complexity is large enough to interpolate the dataset (i.e., when the training error reaches zero), the test error decreases again, reducing the generalization gap. This pattern has been observed empirically for several models and datasets, ranging from linear models, as in Loog et al. (2020), to modern DNNs, as in Spigler et al. (2019); Nakkiran et al. (2020a). The online/offline learning correspondence of Nakkiran et al. (2021), instead, studies the relationship between online optimization and offline generalization. The conjecture, empirically verified on supervised image classification, states that generalization in an offline setting can be effectively reduced to an optimization problem in the infinite-data limit. This means that online and offline test errors coincide if the NN is trained for a fixed number of weight updates.
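The fixed-update-budget comparison above can be sketched on a toy problem. The following is a minimal illustration, not the authors' protocol: a linear model trained by SGD for the same number of weight updates, once with fresh samples at every step (online, the infinite-data limit) and once by cycling a fixed finite dataset (offline). All dimensions, learning rate, and dataset sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_updates, lr = 10, 2000, 0.05
w_true = rng.normal(size=d)

def sample(n):
    # Toy regression task: Gaussian inputs, linear targets plus small noise.
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def sgd(n_updates, get_batch):
    # Run a fixed number of weight updates, regardless of data source.
    w = np.zeros(d)
    for t in range(n_updates):
        X, y = get_batch(t)
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Online: a fresh minibatch at every step (infinite-data limit).
w_online = sgd(n_updates, lambda t: sample(32))

# Offline: a fixed finite dataset, reused for the same update budget.
X_off, y_off = sample(256)
def offline_batch(t):
    idx = rng.choice(len(y_off), size=32)
    return X_off[idx], y_off[idx]
w_offline = sgd(n_updates, offline_batch)

# Compare test errors after the same number of weight updates.
X_te, y_te = sample(10_000)
err = lambda w: float(np.mean((X_te @ w - y_te) ** 2))
print(err(w_online), err(w_offline))
```

In this benign setting the two test errors end up close to the noise floor; the conjecture of Nakkiran et al. (2021) concerns when and how far such a correspondence extends to deep networks.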
This setup aims to connect under- and overparameterized models: the infinite-data limit (online) sits in the underparameterized region (number of samples > number of parameters), while the finite-data case (offline) corresponds to the overparameterized regime (number of samples < number of parameters). Here, we test whether this correspondence also holds for similarity tasks. DD and the online/offline correspondence are two complementary approaches that look at different generalization properties: while DD studies how the network adjusts to an increasing number of parameters, online/offline training compares network performance by varying the dataset size while fixing the number of weight updates. Although these approaches have mainly been applied to classification and regression, if they stem from fundamental properties of DNNs, they should also hold for other tasks, such as similarity learning. There are key differences between similar/different discrimination and classification: for similarity learning, the relations among features are crucial, but not necessarily the features themselves. For this reason, it is not possible a priori to predict whether the DD behavior and the online/offline learning correspondence will also occur for similarity problems. To take the first steps towards understanding how DNNs generalize in similarity learning, we export both frameworks to the simplest contrastive learning representative, Siamese Neural Networks (SNNs), from Bromley et al. (1994); Chopra et al. (2005). A Siamese architecture is made of two identical networks sharing weights and biases that are updated simultaneously during supervised training. The two networks are joined by a final layer, which computes the distance between the branch outputs. SNNs are trained on pairs of data labeled as similar or different; the task of a successfully trained network is to decide whether the two samples in a pair belong to the same class.
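The shared-weight structure described above can be made concrete with a minimal sketch. This is not the paper's architecture, only an illustration: a single-layer embedding applied by both branches, a Euclidean distance head, and a margin-based contrastive loss in the style of Chopra et al. (2005). The embedding dimension and margin are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, emb_dim, margin = 8, 4, 1.0
W = rng.normal(scale=0.5, size=(emb_dim, in_dim))  # shared by both branches

def embed(x):
    # Both branches apply the *same* weights: one network, two inputs.
    return np.tanh(W @ x)

def contrastive_loss(x1, x2, similar, m=margin):
    # Distance layer on top of the twin branches.
    dist = np.linalg.norm(embed(x1) - embed(x2))
    # Pull similar pairs together; push different pairs beyond the margin.
    return dist ** 2 if similar else max(0.0, m - dist) ** 2

x = rng.normal(size=in_dim)
print(contrastive_loss(x, x, similar=True))   # identical inputs -> zero loss
print(contrastive_loss(x, -x, similar=False))
```

Because the weights are shared, any gradient step moves both branches identically, which is what makes the learned embedding a single function of one input rather than two separate maps.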
Studying DD and the online/offline correspondence in SNNs, and comparing the results with those found in classification problems, requires identifying which properties of the training set most influence similarity learning. We identify two crucial sources of variability: (i) the effect of noisy data in SNNs, and (ii) the density of pairs in the training set. Noise is crucial for understanding generalization, as it appears in every real-world dataset and may compromise model performance. While DD has also been studied in the presence of noise,¹ very little (if any) attention has been devoted to noise in the online/offline setting. By construction, SNNs can be affected by more complex types of noise than classification problems, since they use pairwise relations that define a similarity graph. To probe how SNNs react to different noise sources, we introduce two representative examples with distinctive properties: Single Label Noise (SLN) and Pair Label Noise (PLN), which we describe extensively in Sec. 2 and illustrate in the top panel of Fig. 1. As we will show, SLN breaks the balance between similar and different pairs but preserves similarity relations. PLN, instead, acts symmetrically on pair labels but breaks transitivity and, thus, similarity. Furthermore, we show that similarity learning is strongly influenced by the density of pairs in the training set. In particular, pairs created from populations with different levels of similarity graph density/image diversity, i.e., the average number of different images appearing in a set of pairs, give rise to very different learning models. We discuss sparse and dense connections in detail in Sec. 2. Our results show that:
• DD clearly appears in SNNs, regardless of the noise level, a phenomenon rarely found in classification problems in the absence of noise.
• DD is exacerbated by noise (in line with Nakkiran et al. (2020a)), and its shape is affected by the density of the pair training set.
While SNNs trained on sparse datasets show similar DD curves in the presence of SLN and PLN, these become quite distinct when the similarity relations in the training set are dense. Specifically, the interpolation threshold in the presence of PLN requires more parameters, and the test error remains higher in the overparameterized region. An example of this behavior is shown in the bottom right plot of Fig. 1.
• We show that the poor performance of PLN derives from its similarity-breaking nature, which manifests when input data are highly connected. We show that the interpolation threshold (training error = 0) cannot be reached in this scenario, and we derive an analytic formula for the asymptotic training error in the deeply overparameterized regime. We call this phenomenon Density-Induced Break of Similarity (DIBS).
• We test the correspondence between offline generalization and online optimization for similarity learning, studying how the architecture and the presence of noisy labels impact these two regimes differently. We find that the conjecture only holds for clean data.
• In the presence of label noise, the online/offline correspondence breaks down for all the training settings considered. In particular, the effect of label noise is notably more pronounced in the offline case.
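The distinction between the two noise models can be sketched concretely. In the hypothetical construction below (an illustration, not the paper's exact procedure), SLN corrupts single-sample class labels *before* pair labels are derived, so the resulting pair labels remain consistent with some class assignment and transitivity is preserved; PLN flips pair labels *after* they are derived, which can create intransitive triples (i similar to j, j similar to k, but i different from k). The dataset sizes and flip probability are arbitrary assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, n_classes, p = 12, 3, 0.3
classes = rng.integers(n_classes, size=n)   # ground-truth sample classes

def pair_labels(cls):
    # A pair is "similar" (1) iff the two samples share a class label.
    return {(i, j): int(cls[i] == cls[j]) for i, j in combinations(range(n), 2)}

# SLN: corrupt single-sample labels, then derive pair labels.
# Pair labels stay consistent with the noisy classes, so transitivity holds.
noisy_cls = classes.copy()
flip = rng.random(n) < p
noisy_cls[flip] = rng.integers(n_classes, size=int(flip.sum()))
sln = pair_labels(noisy_cls)

# PLN: derive pair labels from clean classes, then flip each independently.
pln = pair_labels(classes)
for key in pln:
    if rng.random() < p:
        pln[key] = 1 - pln[key]

def intransitive_triples(labels):
    # A triple violates transitivity iff exactly two of its edges are "similar".
    bad = 0
    for i, j, k in combinations(range(n), 3):
        s = labels[(i, j)] + labels[(j, k)] + labels[(i, k)]
        bad += (s == 2)
    return bad

print(intransitive_triples(sln), intransitive_triples(pln))  # SLN is always 0
```

Since SLN pair labels come from equality of (noisy) class labels, "similar" remains an equivalence relation, so no triple can have exactly two similar edges; PLN has no such constraint, and in a dense pair set the violations accumulate, which is the mechanism behind DIBS.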



¹ Notably, it is known that the DD curve is exacerbated in the presence of random label noise in supervised classification (see, e.g., Nakkiran et al. (2020a)).

