SIMILARITY AND GENERALIZATION: FROM NOISE TO CORRUPTION

Anonymous

Abstract

Contrastive learning aims to extract distinctive features from data by finding an embedding representation in which similar samples are close to each other and different ones are far apart. We study how neural networks generalize in similarity learning in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and the online/offline correspondence. We focus on the simplest contrastive-learning representative: Siamese Neural Networks (SNNs). We introduce two representative noise sources that can act on SNNs: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric but preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the density of pairs in the dataset crucially affects generalization. Training SNNs on sparse datasets affected by the same amount of PLN or SLN yields the same performance. By contrast, on dense datasets, PLN cases generalize worse than SLN ones in the overparameterized region. Indeed, in this regime, the similarity violation induced by PLN becomes macroscopic, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise in all the scenarios considered.

1. INTRODUCTION

In recent years, several works have studied generalization in neural networks (NNs) and the connection between the classical underparameterized regime, where the number of training samples is larger than the number of parameters in the model, and that of deep learning, where the opposite is usually the norm. Indeed, the empirical success of overparameterized NNs challenges conventional wisdom in classical statistical learning, as it is widely known among practitioners that larger models (with more parameters) often obtain better generalization: Szegedy et al. (2015); Huang et al. (2019); Radford et al. (2019). Two frameworks adopted to study generalization in regression or classification tasks are Double Descent (DD) and the online/offline learning correspondence, which we describe in the following. DD, introduced by Belkin et al. (2019), connects "classical" and "modern" machine learning by observing that once the model complexity is large enough to interpolate the dataset (i.e., when the training error reaches zero), the test error decreases again, reducing the generalization gap. This pattern has been empirically observed for several models and datasets, ranging from linear models, as in Loog et al. (2020), to modern DNNs, as in Spigler et al. (2019); Nakkiran et al. (2020a). Instead, the online/offline learning correspondence of Nakkiran et al. (2021) studies the relationship between online optimization and offline generalization. The conjecture, empirically verified on supervised image classification, states that generalization in an offline setting can be effectively reduced to an optimization problem in the infinite-data limit. This means that online and offline test errors coincide if the NN is trained for a fixed number of weight updates. This setup aims to find a connection between under- and overparameterized models: the infinite-data limit (online) sits in the underparameterized region (number of samples > number of parameters), while the finite-data case (offline) corresponds to the overparameterized regime (number of samples < number of parameters). Here, we test whether this correspondence also holds for similarity tasks. DD and online/offline correspondence are two complementary approaches that look at different generalization properties: while DD studies how the
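The distinction between the two noise models defined in the abstract can be made concrete with a toy sketch (function names and the dictionary-of-pairs representation below are our own illustration, not the paper's code): SLN perturbs the underlying item labels before pairs are formed, so the induced pair labels still derive from one consistent labeling and similarity remains an equivalence relation; PLN flips pair labels independently, which is symmetric per pair but can violate transitivity (e.g., A~B and B~C while A and C are marked dissimilar).

```python
import random

def make_pairs(labels):
    """Pair label = 1 if the two items share a class label, else 0."""
    n = len(labels)
    return {(i, j): int(labels[i] == labels[j])
            for i in range(n) for j in range(i + 1, n)}

def single_label_noise(labels, p, num_classes, rng):
    """SLN: flip each item's class label with probability p, then rebuild
    the pairs. The pair labels still come from a single consistent item
    labeling, so the similarity relation stays transitive."""
    noisy = [rng.choice([c for c in range(num_classes) if c != y])
             if rng.random() < p else y
             for y in labels]
    return make_pairs(noisy)

def pair_label_noise(pairs, p, rng):
    """PLN: flip each pair label independently with probability p.
    Symmetric across pairs, but transitivity can break."""
    return {k: (1 - v if rng.random() < p else v) for k, v in pairs.items()}

rng = random.Random(0)
labels = [0, 0, 1, 1, 2]
clean = make_pairs(labels)
noisy_pln = pair_label_noise(clean, 0.2, rng)
noisy_sln = single_label_noise(labels, 0.2, num_classes=3, rng=rng)
```

Note that a single SLN flip changes every pair containing the affected item (hence its asymmetric, correlated effect), whereas each PLN flip touches exactly one pair.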

