UNBIASED SUPERVISED CONTRASTIVE LEARNING

Abstract

Many datasets are biased, i.e., they contain easy-to-learn features that are highly correlated with the target class only in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become a very relevant research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that allows us to clarify why recent contrastive losses (InfoNCE, SupCon, etc.) can fail when dealing with biased data. Based on that, we derive a novel formulation of the supervised contrastive loss (ϵ-SupInfoNCE), providing more accurate control of the minimal distance between positive and negative samples. Furthermore, thanks to our theoretical framework, we also propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL with ϵ-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases "in the wild".

1. INTRODUCTION

Deep learning models have become the predominant tool for learning representations suited for a variety of tasks. Arguably, the most common setup for training deep neural networks in supervised classification tasks consists of minimizing the cross-entropy loss. Cross-entropy drives the model towards learning the correct label distribution for a given sample. However, many works have shown that this loss can be affected by biases in the data (Alvi et al., 2018; Kim et al., 2019; Nam et al., 2020; Sagawa et al., 2019; Tartaglione et al., 2021; Torralba et al., 2011) or suffer from noise and corruption in the labels (Elsayed et al., 2018; Graf et al., 2021). In fact, in recent years it has become increasingly evident that neural networks tend to rely on simple patterns in the data (Geirhos et al., 2019; Li et al., 2021). As deep neural networks grow in size and complexity, guaranteeing that they do not learn spurious elements of the training set is becoming a pressing issue. It is indeed a known fact that most commonly used datasets are biased (Torralba et al., 2011) and that this affects the learned models (Tommasi et al., 2017). In particular, when the biases correlate very well with the target task, it is hard to obtain predictions that are independent of the biases. This can happen, e.g., in the presence of selection biases in the data. Furthermore, if the bias is easy to learn (e.g., a simple pattern or color), we will most likely obtain a biased model, whose predictions mainly rely on these spurious attributes rather than on the true, generalizable, and discriminative features. Learning fair and robust representations of the underlying samples, especially when dealing with highly biased data, is the main objective of this work. Contrastive learning has recently gained attention for this purpose, showing superior robustness to cross-entropy (Graf et al., 2021).
For this reason, in this work, we adopt a metric learning approach for supervised representation learning. Based on it, we provide a unified framework to analyze and compare existing formulations of contrastive losses, such as the InfoNCE loss (Chen et al., 2020; Oord et al., 2019), the InfoL1O loss (Poole et al., 2019), and the SupCon loss (Khosla et al., 2020).

Figure 1: With ϵ-SupInfoNCE (a), we aim at increasing the minimal margin ϵ between the distance $d^+$ of a positive sample $x^+$ from an anchor $x$ and the distance $d^-$ of the closest negative sample $x^-$. By increasing the margin, we can achieve a better separation between positive and negative samples. We show two different scenarios, without margin (b) and with margin (c). Filling colors of the data points represent different biases. We observe that, without imposing a margin, biased clusters containing both positive and negative samples might appear (b). This issue can be mitigated by increasing the ϵ-margin (c).

Furthermore, we also propose a new supervised contrastive loss that can be seen as the simplest extension of the InfoNCE loss (Chen et al., 2020; Oord et al., 2019) to a supervised setting with multiple positives. Using the proposed metric learning approach, we can reformulate each loss as a set of contrastive, and surprisingly sometimes even non-contrastive, conditions. We show that the widely used SupCon loss is not a "straightforward" extension of the InfoNCE loss, since it actually contains a set of "latent" non-contrastive constraints. Our analysis results in an in-depth understanding of the different loss functions, fully explaining their behavior from a metric point of view. Furthermore, by leveraging the proposed metric learning approach, we explore the issue of biased learning. We outline the limitations of the studied contrastive loss functions when dealing with biased data, even when the loss on the training set is apparently minimized.
By analyzing such cases, we provide a more formal characterization of bias. This eventually allows us to derive a new set of regularization constraints for debiasing that is general and can be added to any contrastive or non-contrastive loss. Our contributions are summarized below:
1. We introduce a simple but powerful theoretical framework for supervised representation learning, from which we derive different contrastive loss functions. We show how existing contrastive losses can be expressed within our framework, providing a uniform understanding of the different formulations. We derive a generalized form of the SupCon loss (ϵ-SupCon), propose a novel loss, ϵ-SupInfoNCE, and demonstrate its effectiveness empirically.
2. Thanks to the proposed metric learning approach, which is based on the distances among representations, we provide a more formal definition of bias. This allows us to derive a new set of effective debiasing regularization constraints, which we call FairKL. We also analyze, theoretically and empirically, the debiasing power of the different contrastive losses, comparing ϵ-SupInfoNCE and SupCon.

2. RELATED WORKS

In the contrastive learning literature, one can also find methods such as triplet losses (Chopra et al., 2005; Hermans et al., 2017) and approaches based on distance metrics (Schroff et al., 2015; Weinberger et al., 2006). The latter are the most relevant for this work, as we propose a metric learning approach for supervised representation learning.

Debiasing. Addressing the issue of biased data and how it affects generalization in neural networks has been the subject of numerous works. Some approaches in this direction include the use of different data sources in order to mitigate biases (Gupta et al., 2018) and data clean-up by means of a GAN (Sattigeri et al., 2018; Xu et al., 2018). However, these methods share some major limitations due to the complexity of working directly on the data. In the debiasing literature, we most often find approaches based on ensembling or adversarial setups, and regularization terms that aim at obtaining an unbiased model from biased data. A typical adversarial approach is BlindEye (Alvi et al., 2018), which employs an explicit bias classifier, trained on the same representation space as the target classifier, in an adversarial way, forcing the encoder to extract unbiased representations; this is similar to Xie et al. (2017). Kim et al. (2019) use adversarial learning and gradient inversion to reach the same goal. Wang et al. (2019b) adopt an adversarial approach to remove unwanted features from intermediate representations of a neural network. All of these works share the limitations of adversarial training, which is well known for its potential training instability. Other ensembling approaches can be found in Clark et al. (2019) and Wang et al. (2020), where feature independence among the different models is promoted. Bahng et al. (2020) propose ReBias, which promotes independence between biased and unbiased representations with a min-max optimization.
In Lee et al. (2021), disentanglement between bias features and target features is maximized in order to perform augmentation in the latent space. Nam et al. (2020) propose LfF, where a bias-capturing model is trained with a focus on easier (bias-aligned) samples, while a debiased network is trained by giving more importance to the samples that the bias-capturing model struggles to discriminate. Another approach is proposed in Wang et al. (2019a) with HEX, where a neural-network-based gray-level co-occurrence matrix (Haralick et al., 1973; Lam, 1996) is employed for learning representations invariant to some bias. However, all these methods require training additional models, which in practice can be resource- and time-consuming. Obtaining representations that are robust and/or invariant to some secondary attribute can also be achieved by applying constraints and regularization to the model. Using regularization terms for debiasing has gained traction also due to their typically lower complexity compared to methods such as ensembling. For example, recent works attempt to discourage the learning of certain features with the aim of data privacy (Barbano et al., 2021; Song et al., 2017) or fairness (Beutel et al., 2019). Sagawa et al. (2019) propose Group-DRO, which aims at improving the model performance on the worst group, defined based on prior knowledge of the bias distribution. In RUBi (Cadene et al., 2019), logit re-weighting is used to promote the independence of the predictions from the bias features. Tartaglione et al. (2021) propose EnD, a regularization term that aims at bringing representations of positive samples with different biases closer together, and at pulling apart representations of negative samples sharing the same bias attributes. A similar method is presented in Hong & Yang (2021), where a contrastive formulation is employed to reach a similar goal.
Our method belongs to this latter class of approaches, as it consists of a regularization term which can be optimized during training.

3. CONTRASTIVE LEARNING: AN ϵ-MARGIN POINT OF VIEW

Let $x \in \mathcal{X}$ be an original sample (i.e., the anchor), $x^+_i$ a similar (positive) sample, $x^-_j$ a dissimilar (negative) sample, and $P$ and $N$ the number of positive and negative samples, respectively. Contrastive learning methods look for a parametric mapping function $f : \mathcal{X} \to \mathbb{S}^{d-1}$ that maps "semantically" similar samples close together in the representation space (a $(d-1)$-sphere) and dissimilar samples far away from each other. Once pre-trained, $f$ is fixed and its representation is evaluated on a downstream task, such as classification, through linear evaluation on a test set. In general, positive samples $x^+_i$ can be defined in different ways depending on the problem: as transformations of $x$ (unsupervised setting), as samples belonging to the same class as $x$ (supervised), or as samples with image attributes similar to those of $x$ (weakly supervised). The definition of negative samples $x^-_j$ varies accordingly. Here, we focus on the supervised case, i.e., samples belonging to the same/different class, but the proposed framework could easily be applied to the other cases. We define $s(f(a), f(b))$ as a similarity measure (e.g., cosine similarity) between the representations of two samples $a$ and $b$. Please note that, since $\|f(a)\|_2 = \|f(b)\|_2 = 1$, using the cosine similarity is equivalent to using the squared L2 distance $d(f(a), f(b)) = \|f(a) - f(b)\|_2^2$. Similarly to Chopra et al. (2005); Hadsell et al. (2006); Schroff et al. (2015); Sohn (2016); Wang et al. (2014; 2019c); Weinberger et al.
(2006); Yu & Tao (2019), we propose to use a metric learning approach which allows us to better formalize recent contrastive losses, such as InfoNCE (Chen et al., 2020; Oord et al., 2019), InfoL1O (Poole et al., 2019), and SupCon (Khosla et al., 2020), and to derive new losses that better approximate the mutual information and can take data biases into account. From an ϵ-margin metric learning point of view, probably the simplest contrastive learning formulation looks for a mapping function $f$ such that the following ϵ-condition is always satisfied:

$$\underbrace{d(f(x), f(x^+))}_{d^+} - \underbrace{d(f(x), f(x^-_j))}_{d^-_j} < -\epsilon \iff \underbrace{s(f(x), f(x^-_j))}_{s^-_j} - \underbrace{s(f(x), f(x^+))}_{s^+} \leq -\epsilon \quad \forall j \qquad (1)$$

where ϵ ≥ 0 is a margin between positive and negative samples and we consider, for now, a single positive sample.
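As a small sanity check of the setup above, the following plain-Python sketch (with hypothetical helper names of our own) verifies that for unit-norm representations the squared L2 distance and the cosine similarity are equivalent up to an affine transformation ($\|a-b\|_2^2 = 2 - 2\,s(a,b)$), and tests the ϵ-condition of Eq. 1 on toy vectors:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Project a vector onto the unit sphere, as f does in the paper."""
    n = l2_norm(v)
    return [x / n for x in v]

def cos_sim(a, b):
    """Cosine similarity; a and b are assumed to be unit-norm."""
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# On the unit sphere, ||a - b||^2 = 2 - 2 * s(a, b), so ranking by
# distance or by similarity is equivalent.
a = normalize([1.0, 2.0, 3.0])
b = normalize([-2.0, 0.5, 1.0])
assert abs(sq_dist(a, b) - (2 - 2 * cos_sim(a, b))) < 1e-9

def eps_condition_holds(anchor, pos, negs, eps=0.1):
    """Check the epsilon-condition of Eq. 1: s_j^- - s^+ <= -eps for all j."""
    s_pos = cos_sim(anchor, pos)
    return all(cos_sim(anchor, n) - s_pos <= -eps for n in negs)
```

The `eps_condition_holds` helper mirrors the right-hand side of Eq. 1: an encoder satisfying it separates every negative from the positive by at least ϵ in similarity.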

Derivation of InfoNCE

The constraint of Eq. 1 can be transformed into an optimization problem using, as is common in contrastive learning, the max operator and its smooth approximation LogSumExp (full derivation in Appendix A.1.1):

$$s^-_j - s^+ \leq -\epsilon \;\;\forall j \;\Rightarrow\; \arg\min_f \max(-\epsilon, \{s^-_j - s^+\}_{j=1,\dots,N}) \approx \arg\min_f \underbrace{-\log \frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-InfoNCE}} \qquad (2)$$

Here, we can notice that when ϵ = 0 we retrieve the InfoNCE loss, also known as the N-pair loss (Sohn, 2016), whereas when ϵ → ∞ we obtain the InfoL1O loss. It has been shown in Poole et al. (2019) that these two losses are a lower and an upper bound of the mutual information $I(X^+, X)$, respectively:

$$\underbrace{\log \frac{\exp s^+}{\exp s^+ + \sum_j \exp s^-_j}}_{\text{InfoNCE}} \leq I(X^+, X) \leq \underbrace{\log \frac{\exp s^+}{\sum_j \exp s^-_j}}_{\text{InfoL1O}} \qquad (3)$$

By using a value of ϵ ∈ [0, ∞), one might find a tighter approximation of $I(X^+, X)$, since the term $\exp(s^+ - \epsilon)$ in the denominator monotonically decreases as ϵ increases.

Proposed supervised loss (ϵ-SupInfoNCE). The inclusion of multiple positive samples $s^+_i$ can lead to different formulations; some of them can be found in Appendix A.1.2. Here, considering a supervised setting, we propose the following one, which we call ϵ-SupInfoNCE:

$$s^-_j - s^+_i \leq -\epsilon \;\;\forall i,j \;\Rightarrow\; \sum_i \max(-\epsilon, \{s^-_j - s^+_i\}_{j=1,\dots,N}) \approx \underbrace{-\sum_i \log \frac{\exp(s^+_i)}{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-SupInfoNCE}} \qquad (4)$$

Please note that this loss could also be used in other settings, e.g., an unsupervised one, where positive samples could be defined as transformations of the anchor. Furthermore, even here, the ϵ value can be adjusted in the loss function in order to increase the ϵ-margin. This time, contrarily to what happens with Eq. 2 and InfoNCE, if we consider ϵ = 0 we do not obtain the SupCon loss.

Derivation of ϵ-SupCon (generalized SupCon). It is interesting to notice that Eq. 4 is similar to $L^{sup}_{out}$, one of the two SupCon losses proposed in Khosla et al. (2020), but they differ by a sum over the positive samples in the denominator.
The $L^{sup}_{out}$ loss, presented as the "most straightforward way to generalize" the InfoNCE loss, actually contains another, non-contrastive constraint on the positive samples: $s^+_t - s^+_i \leq 0 \;\forall i, t$. Fulfilling this condition alone would force all positive samples to collapse to a single point in the representation space; however, it does not take into account negative samples, which is why we call it a non-contrastive condition. Considering both contrastive and non-contrastive conditions, we obtain:

$$s^-_j - s^+_i \leq -\epsilon \;\;\forall i,j \quad \text{and} \quad s^+_t - s^+_i \leq 0 \;\;\forall i, t \neq i$$
$$\frac{1}{P}\sum_i \max(0, \{s^-_j - s^+_i + \epsilon\}_j, \{s^+_t - s^+_i\}_{t \neq i}) \approx \underbrace{\epsilon - \frac{1}{P}\sum_i \log \frac{\exp(s^+_i)}{\sum_t \exp(s^+_t - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-SupCon}} \qquad (5)$$

When ϵ = 0 we retrieve exactly $L^{sup}_{out}$. The second loss proposed in Khosla et al. (2020), called $L^{sup}_{in}$, minimizes a different contrastive problem, which imposes a less strict condition and probably explains why this loss did not work well in practice (Khosla et al., 2020):

$$\max_j(s^-_j) < \max_i(s^+_i) \approx \log\Big(\sum_j \exp(s^-_j)\Big) - \log\Big(\sum_i \exp(s^+_i)\Big) < 0 \qquad (6)$$
$$\arg\min_f \max\Big(0, \max_j(s^-_j) - \max_i(s^+_i)\Big) \approx \underbrace{-\log \frac{\sum_i \exp(s^+_i)}{\sum_t \exp(s^+_t) + \sum_j \exp(s^-_j)}}_{L^{sup}_{in}} \qquad (7)$$

It is easy to see that, differently from Eq. 4 and $L^{sup}_{out}$, this condition is fulfilled as soon as just one positive sample is more similar to the anchor than all negative samples. Similarly, another contrastive condition that should be avoided is $\sum_j s(f(x), f(x^-_j)) - \sum_i s(f(x), f(x^+_i)) < -\epsilon$, since one would need only one (or a few) negative samples far away from the anchor in the representation space (i.e., orthogonal) to fulfill it.
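The losses in Eqs. 2, 4, and 5 reduce to a few lines of arithmetic once the similarities are given. Below is a minimal pure-Python sketch (ours, not the paper's official implementation) of ϵ-SupInfoNCE and ϵ-SupCon; with ϵ = 0 the former recovers InfoNCE for a single positive, and the latter recovers $L^{sup}_{out}$:

```python
import math

def eps_sup_infonce(pos_sims, neg_sims, eps=0.25):
    """epsilon-SupInfoNCE (Eq. 4):
    -sum_i log( exp(s_i^+) / (exp(s_i^+ - eps) + sum_j exp(s_j^-)) )."""
    neg_term = sum(math.exp(s) for s in neg_sims)
    return sum(-math.log(math.exp(sp) / (math.exp(sp - eps) + neg_term))
               for sp in pos_sims)

def eps_supcon(pos_sims, neg_sims, eps=0.25):
    """epsilon-SupCon (Eq. 5): note that *all* positives appear in the
    denominator, reflecting the extra non-contrastive constraints."""
    P = len(pos_sims)
    denom = (sum(math.exp(sp - eps) for sp in pos_sims)
             + sum(math.exp(sn) for sn in neg_sims))
    return eps - sum(math.log(math.exp(sp) / denom) for sp in pos_sims) / P

def infonce(pos_sim, neg_sims):
    """Standard InfoNCE with a single positive sample."""
    neg_term = sum(math.exp(s) for s in neg_sims)
    return -math.log(math.exp(pos_sim) / (math.exp(pos_sim) + neg_term))

def supcon_out(pos_sims, neg_sims):
    """L_out^sup of Khosla et al. (2020)."""
    P = len(pos_sims)
    denom = sum(math.exp(s) for s in pos_sims + neg_sims)
    return -sum(math.log(math.exp(sp) / denom) for sp in pos_sims) / P

# eps = 0 sanity checks: one positive -> InfoNCE; eps-SupCon -> L_out^sup.
assert abs(eps_sup_infonce([0.8], [0.1, -0.3], eps=0.0)
           - infonce(0.8, [0.1, -0.3])) < 1e-9
assert abs(eps_supcon([0.9, 0.7], [0.1, -0.2], eps=0.0)
           - supcon_out([0.9, 0.7], [0.1, -0.2])) < 1e-9
```

The similarity values here are arbitrary toy numbers; in practice they would be cosine similarities between normalized embeddings, typically divided by a temperature.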

3.1. FAILURE CASE OF INFONCE: THE ISSUE OF BIASES

Satisfying the ϵ-condition (Eq. 1) can generally guarantee good downstream performance; however, it does not take into account the presence of biases (e.g., selection biases). A model could therefore base its decisions on certain visual features, i.e., the bias, that are correlated with the target downstream task but do not actually characterize it. This means that a model relying on such bias features would probably perform worse if transferred to a different dataset (e.g., with different acquisition settings or image quality). Specifically, in contrastive learning, this can lead to settings where we are still able to minimize any InfoNCE-based loss (e.g., SupCon or ϵ-SupInfoNCE), but with degraded classification performance (Fig. 1b). To tackle this issue, in this work, we propose the FairKL regularization technique, a set of debiasing constraints that prevent the use of the bias features within the proposed metric learning approach. In order to give a more in-depth explanation of the ϵ-InfoNCE failure case, we employ the notion of bias-aligned and bias-conflicting samples, as in Nam et al. (2020). In our context, a bias-aligned sample shares the same bias attribute as the anchor, while a bias-conflicting sample does not. In this work, we assume that the bias attributes are either known a priori or that they can be estimated using a bias-capturing model, as in Hong & Yang (2021).

Characterization of bias. Given an anchor $x$, if the bias is "strong" and easy to learn, a positive bias-aligned sample $x^{+,b}$ will probably be closer to the anchor $x$ in the representation space than a positive bias-conflicting sample (of course, the same reasoning applies to the negative samples). This is why, even when the ϵ-condition is satisfied and ϵ-SupInfoNCE is minimized, we could still be able to distinguish between bias-aligned and bias-conflicting samples. Hence, we say that there is a bias if we can identify an ordering on the learned representations, such as:


$$\underbrace{d(f(x), f(x^{+,b}_i))}_{d^{+,b}_i} < \underbrace{d(f(x), f(x^{+,b'}_k))}_{d^{+,b'}_k} \leq \underbrace{d(f(x), f(x^{-,b}_t))}_{d^{-,b}_t} - \epsilon < \underbrace{d(f(x), f(x^{-,b'}_j))}_{d^{-,b'}_j} - \epsilon \quad \forall i, k, t, j \qquad (8)$$

This represents the worst-case scenario, where the ordering is total (i.e., $\forall i, k, t, j$). Of course, there can also be cases in which the bias is not as strong and the ordering is only partial.

FairKL regularization for debiasing. Ideally, we would enforce the conditions $d^{+,b'}_k - d^{+,b}_i = 0 \;\forall i, k$ and $d^{-,b'}_t - d^{-,b}_j = 0 \;\forall t, j$, meaning that every positive (resp. negative) bias-conflicting sample should have the same distance from the anchor as any positive (resp. negative) bias-aligned sample. However, in practice this condition is very strict, as it would enforce a uniform distance among all positive (resp. negative) samples. A more relaxed condition would instead force the distributions of distances, $\{d^{\cdot,b'}_k\}$ and $\{d^{\cdot,b}_i\}$, to be similar. Here, we propose two new debiasing constraints for both positive and negative samples, using either the first moment (mean) of the distributions or the first two moments (mean and variance). Using only the averages of the distributions, we obtain:

$$\frac{1}{P_a}\sum_i d^{+,b}_i - \frac{1}{P_c}\sum_k d^{+,b'}_k = 0 \iff \frac{1}{P_c}\sum_k |s^{+,b'}_k| - \frac{1}{P_a}\sum_i |s^{+,b}_i| = 0 \qquad (9)$$

where $P_a$ and $P_c$ are the numbers of positive bias-aligned and bias-conflicting samples, respectively.
Denoting the first moments of the distance distributions with $\mu^{+,b} = \frac{1}{P_a}\sum_i d^{+,b}_i$ and $\mu^{+,b'} = \frac{1}{P_c}\sum_k d^{+,b'}_k$, and the second moments with $\sigma^2_{+,b} = \frac{1}{P_a}\sum_i (d^{+,b}_i - \mu^{+,b})^2$ and $\sigma^2_{+,b'} = \frac{1}{P_c}\sum_k (d^{+,b'}_k - \mu^{+,b'})^2$, and making the hypothesis that the distance distributions are normal, we can define a new set of debiasing constraints using, for example, the Kullback-Leibler divergence:

$$D_{KL}(\{d^{+,b}_i\} \,\|\, \{d^{+,b'}_k\}) = \frac{1}{2}\left[\frac{\sigma^2_{+,b} + (\mu^{+,b} - \mu^{+,b'})^2}{\sigma^2_{+,b'}} - \log\frac{\sigma^2_{+,b}}{\sigma^2_{+,b'}} - 1\right] = 0 \qquad (10)$$

In practice, one could also use another distribution, such as the log-normal, the Jeffreys divergence ($D_{KL}(p\|q) + D_{KL}(q\|p)$), or a simplified version, such as the difference of the two statistics (e.g., $(\mu^{+,b} - \mu^{+,b'})^2 + (\sigma_{+,b} - \sigma_{+,b'})^2$). The proposed debiasing constraints can be easily added to any contrastive loss using the method of Lagrange multipliers, as a regularization term $R_{FairKL} = D_{KL}(\{d^{+,b}_i\} \,\|\, \{d^{+,b'}_k\})$. Thus, the final loss function that we propose to minimize is:

$$L = \underbrace{-\alpha \sum_i \log \frac{\exp(s^+_i)}{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-SupInfoNCE}} + \lambda R_{FairKL} \qquad (11)$$

where α and λ are positive hyperparameters.
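A minimal plain-Python sketch of the FairKL regularizer of Eq. 10 (the function names are ours, not the paper's), assuming the two distance distributions are summarized by their empirical mean and variance and compared with the closed-form Gaussian KL divergence:

```python
import math

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence N(mu_p, var_p) || N(mu_q, var_q)."""
    return 0.5 * ((var_p + (mu_p - mu_q) ** 2) / var_q
                  - math.log(var_p / var_q) - 1.0)

def fairkl(dists_aligned, dists_conflicting):
    """FairKL regularizer (Eq. 10): KL between the (assumed Gaussian)
    distributions of anchor distances for bias-aligned vs
    bias-conflicting samples."""
    def moments(d):
        mu = sum(d) / len(d)
        var = sum((x - mu) ** 2 for x in d) / len(d)
        return mu, var
    mu_b, var_b = moments(dists_aligned)
    mu_c, var_c = moments(dists_conflicting)
    return gaussian_kl(mu_b, var_b, mu_c, var_c)

# Identical distance distributions give zero penalty ...
assert fairkl([0.2, 0.4, 0.6], [0.2, 0.4, 0.6]) < 1e-9
# ... while a bias-induced gap (conflicting samples pushed away) is penalized.
assert fairkl([0.2, 0.4, 0.6], [0.9, 1.1, 1.3]) > 0.0
```

In the full loss of Eq. 11, this quantity would be weighted by λ and added to ϵ-SupInfoNCE; the analogous term can be computed for the negative samples.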

3.1.1. COMPARISON WITH OTHER DEBIASING METHODS

SupCon. It is interesting to notice that the non-contrastive conditions in Eq. 5, $s^+_t - s^+_i \leq 0 \;\forall i, t \neq i$, are all fulfilled only when $s^+_i = s^+_t \;\forall i, t \neq i$. This means that one tries to align all positive samples, regardless of their bias $b$, to a single point in the representation space. In other terms, at the optimal solution, one would also fulfill the following conditions:

$$s^{+,b}_i = s^{+,b}_t, \quad s^{+,b'}_i = s^{+,b'}_t, \quad s^{+,b}_i = s^{+,b'}_t, \quad s^{+,b'}_i = s^{+,b}_t \quad \forall i, t \neq i \qquad (12)$$

Realistically, this could lead to suboptimal solutions: we argue that the optimization process would mainly focus on the easier task, namely aligning the bias-aligned samples, neglecting the bias-conflicting ones. In highly biased settings, this could lead to worse performance than ϵ-SupInfoNCE. More empirical results supporting this hypothesis are presented in Appendix C.2.

EnD. The constraint of Eq. 9 is very similar to what was recently proposed in Tartaglione et al. (2021) with EnD. However, EnD lacks the additional constraint on the standard deviation of the distances, which is given by Eq. 10. An intuitive comparison can be found in Figs. 3 and 4 of the Appendix: the constraints imposed by EnD (first moments only) can be fulfilled even if the bias features still affect the ordering of the positive samples, whereas the constraints on the second moments used in the proposed method can remove this effect. An analytical comparison can be found in Appendix A.3.

BiasCon. In Hong & Yang (2021), the authors propose the BiasCon loss, which is similar to SupCon but only aligns positive bias-conflicting samples. It looks for an encoder $f$ that fulfills:

$$s^-_j - s^{+,b'}_i \leq -\epsilon \;\;\forall i,j \quad \text{and} \quad s^{+,b}_p - s^{+,b'}_i \leq 0 \;\;\forall i,p \quad \text{and} \quad s^{+,b'}_t - s^{+,b'}_i \leq 0 \;\;\forall i, t \neq i \qquad (13)$$

The problem here is that the negative samples are separated only from the positive bias-conflicting samples, ignoring the positive bias-aligned ones.
This is probably why the authors proposed to combine this loss with a standard cross-entropy loss.
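Returning to the EnD comparison above, the difference between matching only first moments (as in EnD) and matching both moments (as in FairKL) can be made concrete with a toy example; the numbers below are illustrative only:

```python
# Positive-sample distances from the anchor for the two bias groups.
aligned = [0.39, 0.40, 0.41]      # bias-aligned positives, tightly clustered
conflicting = [0.10, 0.40, 0.70]  # bias-conflicting positives, spread out

def mean(d):
    return sum(d) / len(d)

def var(d):
    m = mean(d)
    return sum((x - m) ** 2 for x in d) / len(d)

# A first-moment-only penalty vanishes: the two groups have equal means ...
assert abs(mean(aligned) - mean(conflicting)) < 1e-9
# ... yet their variances differ, so the bias can still induce an ordering
# among the positives; FairKL's second-moment term remains nonzero here.
assert var(aligned) < var(conflicting)
```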

4. EXPERIMENTS

In this section, we describe the experiments performed to validate the proposed losses. We perform two sets of experiments. First, we benchmark our framework, presented in Sec. 3, on standard vision datasets: CIFAR-10 (Krizhevsky et al., a), CIFAR-100 (Krizhevsky et al., b), and ImageNet-100 (Deng et al., 2009). Then, we analyze biased settings, employing Biased-MNIST (Bahng et al., 2020), Corrupted-CIFAR10 (Hendrycks & Dietterich, 2019), bFFHQ (Lee et al., 2021), 9-Class ImageNet (Ilyas et al., 2019), and ImageNet-A (Hendrycks et al., 2021). The code can be found at https://github.com/EIDOSLAB/unbiased-contrastive-learning.

4.1. EXPERIMENTS ON GENERIC VISION DATASETS

We conduct an empirical analysis of the ϵ-SupCon and ϵ-SupInfoNCE losses on standard vision datasets to evaluate the different formulations and to assess the impact of the ϵ parameter. We compare our results with baseline implementations, including cross-entropy (CE) and SupCon.

Experimental details. For CIFAR-10 and CIFAR-100, we use the original setup from SupCon (Khosla et al., 2020), employing a ResNet-50, a large batch size (1024), a learning rate of 0.5, a temperature of 0.1, and multiview augmentation. Additional experimental details (including ImageNet-100) and the different hyperparameter configurations are provided in Sec. B of the Appendix.

Results. First, we compare our proposed ϵ-SupInfoNCE loss with the ϵ-SupCon loss derived in Sec. 3. As reported in Tab. 1, ϵ-SupInfoNCE performs better than ϵ-SupCon: we conjecture that the absence of the non-contrastive term of Eq. 5 leads to increased robustness, as will also be shown in Sec. 4.2. For this reason, we focus on ϵ-SupInfoNCE. Further comparisons with different values of ϵ can be found in Sec. C.1, showing that SupCon ≤ ϵ-SupCon ≤ ϵ-SupInfoNCE in terms of accuracy. Results on generic computer vision datasets are presented in Tab. 2, in terms of top-1 accuracy. We report the performance for the best value of ϵ; the complete results can be found in Sec. C.1. The results are averaged across 3 trials for every configuration, and we also report the standard deviation. We obtain significant improvements with respect to all baselines and, most importantly, SupCon, on all benchmarks: CIFAR-10 (+0.5%), CIFAR-100 (+0.63%), and ImageNet-100 (+1.31%).

4.2. EXPERIMENTS ON BIASED DATASETS

Next, we analyze how our proposed loss performs in biased learning settings. We employ five datasets, ranging from synthetic data to real facial images: Biased-MNIST, Corrupted-CIFAR10, bFFHQ, and 9-Class ImageNet along with ImageNet-A. The detailed setup and experimental details are provided in Appendix B.

Biased-MNIST is a biased version of MNIST (Deng, 2012) proposed in Bahng et al. (2020). A color bias is injected into the dataset by colorizing the image background with ten predefined colors associated with the ten different digits. Given an image, the background is colored with the predefined color for that class with probability ρ, and with any one of the other colors with probability 1 − ρ. Higher values of ρ lead to more biased data. In this work, we explore different values of ρ: 0.999, 0.997, 0.995, and 0.99. An unbiased test set is built with ρ = 0.1. We compare with a cross-entropy baseline and with other debiasing techniques, namely EnD (Tartaglione et al., 2021), LfF (Nam et al., 2020), and BiasCon and BiasBal (Hong & Yang, 2021).

Analysis of ϵ-SupInfoNCE and ϵ-SupCon. First, we evaluate the ϵ-SupCon and ϵ-SupInfoNCE losses alone, without our debiasing regularization term. Fig. 2 shows the accuracy on the unbiased test set for the different values of ρ: for ρ ≤ 0.997, ϵ-SupInfoNCE and ϵ-SupCon are comparable, while for ρ = 0.999 the gap is significantly larger, which could be due to the additional non-contrastive condition of SupCon. Baseline results of a cross-entropy model (CE) are reported in Tab. 3. Both losses result in higher accuracy than cross-entropy; the generally higher robustness of contrastive formulations is also confirmed by the related literature (Khosla et al., 2020). Interestingly, in the most biased setting (ρ = 0.999), we observe that ϵ-SupInfoNCE obtains higher accuracy than ϵ-SupCon.
Our conjecture is that the non-contrastive term of SupCon in Eq. 5 ($s^+_t - s^+_i \leq 0 \;\forall i, t$) can lead, in highly biased settings, to more biased representations, as the bias-aligned samples will be especially predominant among the positives. For this reason, we focus on ϵ-SupInfoNCE in the remainder of this work.

Debiasing with FairKL. Next, we apply our regularization technique FairKL jointly with ϵ-SupInfoNCE and compare it with the other debiasing methods. The results are shown in Tab. 3. Our technique achieves the best results in all experiments, with large gaps in accuracy, especially in the most difficult settings (higher ρ). For completeness, we also evaluate the debiasing power of FairKL with different losses, i.e., CE and ϵ-SupCon. With FairKL, we obtain better results than most of the other baselines with either CE, ϵ-SupCon, or ϵ-SupInfoNCE; the latter achieves the best performance, confirming the results observed in Sec. 4.1.

Corrupted CIFAR-10 is built from the CIFAR-10 dataset by correlating each class with a certain texture (brightness, frost, etc.), following the protocol proposed in Hendrycks & Dietterich (2019). Similarly to Biased-MNIST, the dataset is provided with five different levels of the ratio between bias-conflicting and bias-aligned samples. The results are shown in Tab. 4. Notably, we obtain the best results in the most difficult scenario, where the amount of bias-conflicting samples is the lowest. For the other settings, we obtain results comparable with the state of the art.

bFFHQ is proposed by Lee et al. (2021) and contains facial images. The dataset is constructed in such a way that most of the females are young (age range 10-29), while most of the males are older (age range 40-59). The ratio between bias-conflicting and bias-aligned samples provided for this dataset is 0.5. The results are shown in Tab. 4, where our technique outperforms all other methods.
9-Class ImageNet and ImageNet-A. We also test our method on the more complex and realistic 9-Class ImageNet dataset (Ilyas et al., 2019). This dataset is a subset of ImageNet, which is known to contain textural biases (Geirhos et al., 2019); it aggregates 42 of the original classes into 9 macro-categories. Following Hong & Yang (2021), we train a BagNet18 (Brendel & Bethge, 2019) as the bias-capturing model, which we then use to compute a bias score for the training samples, to be applied within our regularization term. More details and the experimental setup can be found in Sec. B.2.4. We evaluate the accuracy on the (biased) test set along with the unbiased accuracy (UNB), computed with the texture labels assigned in Brendel & Bethge (2019). We also report accuracy results on the ImageNet-A (IN-A) dataset, which contains bias-conflicting samples (Hendrycks et al., 2021). Results are shown in Tab. 5. On the biased test set, the results are comparable with SoftCon, while on the harder unbiased and ImageNet-A sets we achieve state-of-the-art results.
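The bias-capturing step above ultimately partitions the training samples into the two groups used by the regularizer. The sketch below is only one plausible discretization: the paper computes a continuous bias score, while the helper name and the correct-prediction criterion here are our assumptions, not the authors' exact procedure:

```python
def split_by_bias(labels, bias_preds):
    """Partition sample indices into bias-aligned / bias-conflicting groups.

    A sample is treated as bias-aligned when the bias-capturing model
    (e.g., a texture-biased BagNet) already predicts its class correctly,
    i.e., the bias features alone suffice; otherwise it is bias-conflicting.
    """
    aligned, conflicting = [], []
    for idx, (y, y_bias) in enumerate(zip(labels, bias_preds)):
        (aligned if y_bias == y else conflicting).append(idx)
    return aligned, conflicting

# Toy usage: sample 1 is the only one the bias model gets wrong.
aligned, conflicting = split_by_bias([0, 1, 2, 1], [0, 2, 2, 1])
```

The two index lists would then select the distance distributions $\{d^{\cdot,b}_i\}$ and $\{d^{\cdot,b'}_k\}$ fed to FairKL.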

5. CONCLUSIONS

In this work, we propose a metric-learning-based framework for supervised representation learning. We propose a new loss, called ϵ-SupInfoNCE, that is based on the definition of the ϵ-margin, i.e., the minimal margin between positive and negative samples. By adjusting this value, we are able to find a tighter approximation of the mutual information and achieve better results compared to the standard cross-entropy and SupCon losses. Then, we tackle the problem of learning unbiased representations when the training data contains strong biases, which represents a failure case for InfoNCE-like losses. We propose FairKL, a debiasing regularization term derived from our framework, with which we enforce equality between the distributions of distances of bias-conflicting and bias-aligned samples. This, together with the increase of the ϵ-margin, allows us to reach state-of-the-art performance in the most extreme cases of bias on different datasets, comprising both synthetic data and real-world images.

A THEORETICAL RESULTS

A.1 COMPLETE DERIVATIONS FOR SECTION 3

In this section, we present the complete analytical derivations for the equations found in Sec. 3. All of the presented derivations are based on the smooth approximation of the max with the LogSumExp (LSE) operator:

$$\max(x_1, x_2, \dots, x_n) \approx \log\Big(\sum_i \exp(x_i)\Big) \qquad (14)$$

A.1.1 COMPUTATION OF ϵ-INFONCE (2)

We consider Eq. 2 and show that:

$$\arg\min_f \max(-\epsilon, \{s^-_j - s^+\}_{j=1,\dots,N}) \approx \arg\min_f \underbrace{-\log \frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-InfoNCE}}$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\max(-\epsilon, \{s^-_j - s^+\}_{j=1,\dots,N}) &\approx \log\Big(\exp(-\epsilon) + \sum_j \exp(s^-_j - s^+)\Big) \\
&= \log\Big(\exp(-\epsilon) + \exp(-s^+)\sum_j \exp(s^-_j)\Big) \\
&= \log\Big(\exp(-s^+)\Big(\exp(s^+ - \epsilon) + \sum_j \exp(s^-_j)\Big)\Big) \\
&= \log\exp(-s^+) + \log\Big(\exp(s^+ - \epsilon) + \sum_j \exp(s^-_j)\Big) \\
&= -\log \frac{\exp(s^+)}{\exp(s^+ - \epsilon) + \sum_j \exp(s^-_j)}
\end{aligned}$$

A.1.2 MULTIPLE-POSITIVE EXTENSION

Extending Eq. 2 to multiple positives can be done in different ways. Here, we list four possible choices. Empirically, we found that solution c) gave the best results and is the most convenient to implement for efficiency reasons.

a) $\max(-\epsilon, \{s^-_j - s^+_i\}^{i=1,\dots,P}_{j=1,\dots,N}) = -\log \dfrac{\exp(\sum_i s^+_i)}{\exp(\sum_i s^+_i - \epsilon) + \big(\sum_j \exp(s^-_j)\big)\big(\sum_i \exp(\sum_{t \neq i} s^+_t)\big)}$

b) $\sum_j \max(-\epsilon, \{s^-_j - s^+_i\}_{i=1,\dots,P}) = -\sum_j \log \dfrac{\exp(\sum_i s^+_i)}{\exp(\sum_i s^+_i - \epsilon) + \exp(s^-_j)\sum_i \exp(\sum_{t \neq i} s^+_t)}$

c) $\sum_i \max(-\epsilon, \{s^-_j - s^+_i\}_{j=1,\dots,N}) = -\sum_i \log \dfrac{\exp(s^+_i)}{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j)}$

d) $\sum_i \sum_j \max(-\epsilon, s^-_j - s^+_i) = -\sum_i \sum_j \log \dfrac{\exp(s^+_i)}{\exp(s^+_i - \epsilon) + \exp(s^-_j)}$

A.1.3 COMPUTATION OF ϵ-SUPINFONCE (4)

The computations are very similar to those of ϵ-InfoNCE above.
We obtain:

$$\arg\min_f \sum_i \max(-\epsilon, \{s^-_j - s^+_i\}_{j=1,\ldots,N}) \approx \arg\min_f \sum_i \log\Big(\exp(-\epsilon) + \sum_j \exp(s^-_j - s^+_i)\Big)$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\sum_i \max(-\epsilon, \{s^-_j - s^+_i\}_{j=1,\ldots,N})
&\approx \sum_i \log\Big(\exp(-\epsilon) + \sum_j \exp(s^-_j - s^+_i)\Big) \\
&= \sum_i \log \frac{\exp(s^+_i)\exp(-\epsilon) + \sum_j \exp(s^-_j)}{\exp(s^+_i)} \\
&= \sum_i \log \frac{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j)}{\exp(s^+_i)} \\
&= \underbrace{-\sum_i \log \frac{\exp(s^+_i)}{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-SupInfoNCE}}
\end{aligned}$$

A.1.4 COMPUTATIONS OF ϵ-SUPCON (5)

We extend Eq. 4 by adding the non-contrastive conditions $s^-_j - s^+_i \le -\epsilon \;\; \forall i,j$ and $s^+_t - s^+_i \le 0 \;\; \forall i, t \neq i$, and we show:

$$\frac{1}{P}\sum_i \max(0, \{s^-_j - s^+_i + \epsilon\}_j, \{s^+_t - s^+_i\}_{t \neq i}) \approx \epsilon \underbrace{-\frac{1}{P}\sum_i \log \frac{\exp(s^+_i)}{\sum_t \exp(s^+_t - \epsilon) + \sum_j \exp(s^-_j)}}_{\epsilon\text{-SupCon}}$$

Starting from the left-hand side, we have:

$$\begin{aligned}
\frac{1}{P}\sum_i \max(0, \{s^-_j - s^+_i + \epsilon\}_{j=1,\ldots,N}, \{s^+_t - s^+_i\}_{t \neq i})
&\approx \frac{1}{P}\sum_i \log\Big(1 + \sum_j \exp(s^-_j - s^+_i + \epsilon) + \sum_{t \neq i} \exp(s^+_t - s^+_i)\Big) \\
&= \frac{1}{P}\sum_i \log \frac{\exp(s^+_i - \epsilon) + \sum_j \exp(s^-_j) + \sum_{t \neq i} \exp(s^+_t - \epsilon)}{\exp(s^+_i - \epsilon)} \\
&= -\frac{1}{P}\sum_i \log \frac{\exp(s^+_i - \epsilon)}{\sum_t \exp(s^+_t - \epsilon) + \sum_j \exp(s^-_j)} \\
&= \epsilon - \frac{1}{P}\sum_i \log \frac{\exp(s^+_i)}{\sum_t \exp(s^+_t - \epsilon) + \sum_j \exp(s^-_j)}
\end{aligned}$$

A.1.5 COMPUTATIONS OF $L^{sup}_{in}$ (7)

Here we show that:

$$\max_j(s^-_j) < \max_i(s^+_i) \;\approx\; \underbrace{-\log \frac{\sum_i \exp(s^+_i)}{\sum_t \exp(s^+_t) + \sum_j \exp(s^-_j)}}_{L^{sup}_{in}}$$

Starting from the left-hand side, and given that:

$$\max_j(s^-_j) < \max_i(s^+_i) \;\approx\; \log\Big(\sum_j \exp(s^-_j)\Big) - \log\Big(\sum_i \exp(s^+_i)\Big) < 0$$

we have:

$$\begin{aligned}
\max\Big(0, \log\sum_j \exp(s^-_j) - \log\sum_i \exp(s^+_i)\Big)
&\approx \log\Big(1 + \exp\Big(\log\sum_j \exp(s^-_j) - \log\sum_i \exp(s^+_i)\Big)\Big) \\
&= \log\Big(1 + \frac{\sum_j \exp(s^-_j)}{\sum_i \exp(s^+_i)}\Big) \\
&= \log \frac{\sum_i \exp(s^+_i) + \sum_j \exp(s^-_j)}{\sum_i \exp(s^+_i)} \\
&= \underbrace{-\log \frac{\sum_i \exp(s^+_i)}{\sum_t \exp(s^+_t) + \sum_j \exp(s^-_j)}}_{L^{sup}_{in}}
\end{aligned}$$

A.1.6 COMPUTATIONS OF EQ. 17-A

$$\arg\min_f \max(-\epsilon, \{s^-_j - s^+_i\}_{i=1,\ldots,P}^{\,j=1,\ldots,N}) \approx \arg\min_f \log\Big(\exp(-\epsilon) + \sum_i \sum_j \exp(s^-_j - s^+_i)\Big)$$

We have:

$$\begin{aligned}
L &= \log\Big(\exp(-\epsilon) + \sum_i \sum_j \exp(s^-_j - s^+_i)\Big) \\
&= \log\Big(\exp(-\epsilon) + \Big(\sum_j \exp(s^-_j)\Big)\sum_i \exp(-s^+_i)\Big) \\
&= \log \frac{\exp\big(\sum_i s^+_i - \epsilon\big) + \big(\sum_j \exp(s^-_j)\big)\sum_i \exp\big(\sum_{t \neq i} s^+_t\big)}{\exp\big(\sum_i s^+_i\big)} \\
&= -\log \frac{\exp\big(\sum_i s^+_i\big)}{\exp\big(\sum_i s^+_i - \epsilon\big) + \big(\sum_j \exp(s^-_j)\big)\big(\sum_i \exp\big(\sum_{t \neq i} s^+_t\big)\big)}
\end{aligned}$$

where, in the third step, we multiplied and divided by $\exp(\sum_i s^+_i)$, using $\exp(-s^+_i)\exp(\sum_t s^+_t) = \exp(\sum_{t \neq i} s^+_t)$.
A.2 BOUNDNESS OF THE ϵ-MARGIN

In this section, we give insights on how an optimal value of ϵ can be estimated. First of all, it is easy to show that ϵ is bounded and cannot grow to infinity. Given that $||f(x)||_2 = 1$, the similarities lie in $[-1, 1]$, and thus $\max(s^+ - s^-_j) = 2$: this is the case in which the two samples are aligned at opposite poles of the hypersphere. We can conclude that, in general, ϵ will be less than 2.

Figure 3: When considering only Eq. 9-a (average of distances) or Eq. 9-b (average of similarities), that is, when using EnD (Tartaglione et al., 2021), we may obtain a suboptimal configuration such as (a), where we can still (partially) order the distances of the positive samples from the anchor based on the bias features. We can see that the conditions in Eq. 9 are fulfilled, namely, the averages of the distances of bias-aligned and bias-conflicting samples from the anchor are the same ($\mu^{+,b} = \mu^{+,b'} = 2$). This is only partially mitigated when using a margin ϵ > 0 (b). However, the standard deviations of the distances of bias-aligned and bias-conflicting samples in (a) and (b) are different ($\sigma^{+,b} = 0$, while $\sigma^{+,b'} = 1$); this can be computed using the distances d reported in the figure. If we also consider the conditions on the standard deviations of the distances, as proposed in FairKL (Eq. 10), the ordering is removed and thus also the effect of the bias (c). In (c), we show the case in which both mean and standard deviation of the distributions match (in a simplified case with σ = 0). A simulated example is shown in Fig. 4.
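The situation of Fig. 3 can be checked numerically: matching only the means, as EnD's first-order constraints do, leaves the two distance distributions distinct, which the KL divergence used by FairKL detects. Below is a small sketch using the closed-form KL between one-dimensional Gaussians, under the Gaussian assumption on the distances made in Fig. 4 (the function name is ours).

```python
import numpy as np

def gaussian_kl(mu1, sd1, mu2, sd2):
    """Closed-form KL( N(mu1, sd1^2) || N(mu2, sd2^2) ) between 1-D Gaussians."""
    return np.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2.0 * sd2**2) - 0.5

# Equal means (mu = 2) but different stds, as in Fig. 3(a): the KL is far from 0,
# so the bias-induced ordering is still detectable despite matched averages.
# (We use a small sd > 0 for the degenerate sigma = 0 case to keep the KL finite.)
assert gaussian_kl(2.0, 1e-3, 2.0, 1.0) > 1.0

# Matched mean *and* std: the KL vanishes, as required by FairKL (Eq. 10).
assert abs(gaussian_kl(2.0, 1.0, 2.0, 1.0)) < 1e-12
```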
If we also take into account the temperature τ, the bound becomes ϵ ≤ 2/τ. This bound always holds; however, a stricter one can be found by considering the geometric properties of the latent space. For example, Graf et al. (2021) show that when the SupCon loss converges to its minimum value, the representations of the different classes are aligned on a regular simplex. This property could be used to compute a more precise upper bound on the ϵ-margin, depending on the number of classes in the dataset. We leave further analysis on this matter as future work.
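The extreme case behind the bound can be verified with two unit vectors at opposite poles of the hypersphere (a toy check; variable names are ours):

```python
import numpy as np

anchor   = np.array([1.0, 0.0])   # normalized representation of the anchor
positive = np.array([1.0, 0.0])   # aligned with the anchor: s+ = +1
negative = np.array([-1.0, 0.0])  # opposite pole:           s- = -1

s_pos = float(anchor @ positive)
s_neg = float(anchor @ negative)

# Cosine similarities on the unit hypersphere lie in [-1, 1],
# hence s+ - s- never exceeds 2, bounding the useful range of epsilon.
assert s_pos - s_neg == 2.0
```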

A.3 THEORETICAL COMPARISON WITH END

Here, we present a more detailed theoretical analysis of EnD (Tartaglione et al., 2021), and we show that the EnD regularization term can be equivalent to the conditions in Eq. 9:

$$\text{a)}\quad \frac{1}{P_c}\sum_k |s^{+,b'}_k| - \frac{1}{P_a}\sum_i |s^{+,b}_i| = 0 \qquad \text{b)}\quad \frac{1}{N_c}\sum_t |s^{-,b'}_t| - \frac{1}{N_a}\sum_j |s^{-,b}_j| = 0$$

which can be turned into a minimization term R, using the method of Lagrange multipliers:

$$R = -\lambda_1 \Big(\frac{1}{P_c}\sum_k |s^{+,b'}_k| - \frac{1}{P_a}\sum_i |s^{+,b}_i|\Big) - \lambda_2 \Big(\frac{1}{N_c}\sum_t |s^{-,b'}_t| - \frac{1}{N_a}\sum_j |s^{-,b}_j|\Big)$$

Now, if we assume $\lambda_1 = \lambda_2 = 1$, we can re-arrange the terms, obtaining:

$$R = \underbrace{\Big(\frac{1}{P_a}\sum_i |s^{+,b}_i| + \frac{1}{N_a}\sum_j |s^{-,b}_j|\Big)}_{R^{\perp}} - \underbrace{\Big(\frac{1}{P_c}\sum_k |s^{+,b'}_k| + \frac{1}{N_c}\sum_t |s^{-,b'}_t|\Big)}_{R^{\parallel}}$$
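The re-arranged regularizer $R = R^{\perp} - R^{\parallel}$ (with $\lambda_1 = \lambda_2 = 1$) can be sketched compactly over arrays of similarities to a fixed anchor. Function and argument names are ours, not from EnD:

```python
import numpy as np

def end_regularizer(s_pos_b, s_neg_b, s_pos_bc, s_neg_bc):
    """R = R_perp - R_par with lambda_1 = lambda_2 = 1 (Sec. A.3 sketch).

    s_pos_b / s_neg_b:   similarities of bias-aligned positives / negatives
    s_pos_bc / s_neg_bc: similarities of bias-conflicting positives / negatives
    """
    # R_perp: decorrelate samples sharing the anchor's bias attribute
    r_perp = np.mean(np.abs(s_pos_b)) + np.mean(np.abs(s_neg_b))
    # R_par: entangle samples with a different bias attribute
    r_par = np.mean(np.abs(s_pos_bc)) + np.mean(np.abs(s_neg_bc))
    return r_perp - r_par

# A debiased configuration (same-bias decorrelated, cross-bias correlated)
# scores lower than a biased one, so minimizing R pushes toward debiasing.
assert end_regularizer([0.1, 0.0], [0.1], [0.9], [0.8]) \
     < end_regularizer([0.9, 0.8], [0.7], [0.1], [0.0])
```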

B.2 BIASED DATASETS

When employing our debiasing term, we find that scaling the ϵ-SupInfoNCE loss by a small factor α (≤ 1) while keeping λ close to 1 is more stable than using values of λ ≫ 1 (as done in Tartaglione et al. (2021)) and tends to produce better results. For biased datasets, we do not make use of the projection head used in Khosla et al. (2020); Chen et al. (2020). For this reason, we also avoid the aggressive augmentation usually employed by contrastive methods (more on this in Sec. C.3). Furthermore, as also done by Hong & Yang (2021), we experimented with adding a small contribution of the cross-entropy loss to train the model end-to-end; however, we did not find any benefit in doing so, compared to training a linear classifier separately.
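The weighting scheme above amounts to a simple weighted sum of the two terms. A sketch under our naming conventions (`sup_loss` and `fairkl_loss` stand in for the ϵ-SupInfoNCE and FairKL values; neither name is from the paper):

```python
def total_loss(sup_loss, fairkl_loss, alpha=0.1, lam=1.0):
    """Total objective sketch: a small alpha (<= 1) scales the target loss,
    while lambda stays close to 1 for the FairKL debiasing term."""
    return alpha * sup_loss + lam * fairkl_loss

# With alpha = 0.1 and lambda = 1.0 the debiasing term dominates the objective.
assert abs(total_loss(2.0, 3.0) - 3.2) < 1e-12
```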

B.2.1 BIASED-MNIST

We employ the SimpleConvNet architecture proposed by Bahng et al. (2020), consisting of four convolutional layers with 7 × 7 kernels. We use the Adam optimizer with a learning rate of 0.001, a weight decay of 10^-5, and a batch size of 256. We train for 80 epochs and decay the learning rate by a factor of 0.1 at 1/3 and 2/3 of the training (epochs 26 and 53).

Hyperparameters configuration

The hyperparameters for Tab. 3 are reported in Tab. 6.

B.2.2 CORRUPTED CIFAR-10

For this dataset, we employ the ResNet-18 architecture. We use the Adam optimizer, with an initial learning rate of 0.001 and a weight decay of 0.0001. The other Adam parameters are the PyTorch defaults (β1 = 0.9, β2 = 0.999, ϵ = 10^-8). We train for 200 epochs with a batch size of 256, decaying the learning rate with a cosine annealing policy.

Hyperparameters configuration: Table 7 shows the hyperparameters for the results reported in Tab. 4 of the main paper.

Table 7: Corrupted CIFAR-10 hyperparameters

Ratio (%)   0.5    1.0    2.0    5.0
α           0.1    0.1    0.1    0.1
λ           1.0    1.0    1.0    1.0
ϵ           0.1    0.25   0.5    0.25

B.2.3 BFFHQ

Following Lee et al. (2021), we use the ResNet-18 architecture. We use the Adam optimizer with an initial learning rate of 0.0001 and train for 100 epochs. For this experiment, we set α = 0.1, ϵ = 0.25, and λ = 1.5. Differently from Lee et al. (2021), we use a batch size of 256 (vs. 64), as contrastive losses benefit from larger batch sizes (Chen et al., 2020; Khosla et al., 2020). Additionally, we use a weight decay of 10^-4 rather than 0. These changes alone do not provide advantages on the debiasing task: we obtain an accuracy of 54.8% without FairKL, which is in line with the 56.87% reported for the vanilla model.
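Settings like those of Table 7 lend themselves to a simple lookup keyed by the ratio of bias-conflicting samples; a minimal sketch (the dictionary and function names are ours):

```python
# Table 7 hyperparameters for Corrupted CIFAR-10, keyed by the ratio (%)
# of bias-conflicting samples in the training set.
CORRUPTED_CIFAR10_HPARAMS = {
    0.5: {"alpha": 0.1, "lambda": 1.0, "eps": 0.10},
    1.0: {"alpha": 0.1, "lambda": 1.0, "eps": 0.25},
    2.0: {"alpha": 0.1, "lambda": 1.0, "eps": 0.50},
    5.0: {"alpha": 0.1, "lambda": 1.0, "eps": 0.25},
}

def hparams_for(ratio):
    """Return the {alpha, lambda, eps} configuration for a corruption ratio."""
    return CORRUPTED_CIFAR10_HPARAMS[ratio]

assert hparams_for(2.0)["eps"] == 0.50
```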

B.2.4 9-CLASS IMAGENET AND IMAGENET-A

Our proposed method can also be applied when no prior label about the bias attributes is given, with only a slight change in formulation; this is the case, for example, for ImageNet. Similarly to other works (Nam et al., 2020; Hong & Yang, 2021), we exploit a bias-capturing model for this purpose.

FairKL with bias-capturing model

Using the bias-capturing model, the bias-conflicting mean is computed as $\hat{\mu}^{+,b'} = \frac{1}{N}\sum_i d^+_i (1 - \hat{b}^+_i)$, where N is the batch size and $\hat{b}^+_i$ is the bias similarity score.

Setup: We pretrain the bias-capturing model, a BagNet18 (Brendel & Bethge, 2019), for 120 epochs. For the main model, a ResNet-18, we use the Adam optimizer with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, a weight decay of 0.0001, and a cosine decay of the learning rate. We use a batch size of 256 and train for 200 epochs. As augmentations, we employ random resized crop, random flip and, as done in Hong & Yang (2021), random color jitter and random grayscale (p = 0.2). We use ϵ = 0.5 and λ = 1. Given the higher complexity of this dataset, we employ α = 0.5.
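The weighted FairKL statistics used with the bias-capturing model can be sketched in a few lines (NumPy, unbatched; function and argument names are ours):

```python
import numpy as np

def weighted_bias_means(d_pos, b_hat):
    """Continuous-score FairKL statistics:
    mu_aligned  = (1/N) sum_i d+_i * b_hat+_i
    mu_conflict = (1/N) sum_i d+_i * (1 - b_hat+_i)

    d_pos: distances of the positives from the anchor, shape (N,)
    b_hat: bias similarities in [0, 1] from the bias-capturing model, shape (N,)
    """
    d_pos, b_hat = np.asarray(d_pos, dtype=float), np.asarray(b_hat, dtype=float)
    return np.mean(d_pos * b_hat), np.mean(d_pos * (1.0 - b_hat))

# With hard scores (b_hat in {0, 1}) each sample contributes to exactly one mean
# (up to the 1/N vs. 1/P_a normalization of the discrete-label case).
mu_a, mu_c = weighted_bias_means([1.0, 1.0], [1.0, 0.0])
assert (mu_a, mu_c) == (0.5, 0.5)
```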

C ADDITIONAL EXPERIMENTS

In this section, we present some additional experiments we conducted, for a more in depth analysis of our proposed framework.

C.1 COMPLETE RESULTS FOR COMMON VISION DATASETS

In Table 8 we report the results on CIFAR-10, CIFAR-100, and ImageNet-100 for different values of ϵ. In Table 9, the full comparison between ϵ-SupCon and ϵ-SupInfoNCE on ImageNet-100 is presented. Our proposed ϵ-SupInfoNCE outperforms SupCon on all datasets for all values of ϵ, reaching the best results. Furthermore, on ImageNet-100, we observe that the lowest accuracy obtained by ϵ-SupInfoNCE (83.02%) is still higher than the best accuracy obtained by ϵ-SupCon (82.83%) on the same dataset, even though ϵ-SupCon always improves over SupCon. In terms of accuracy, the results in Tab. 9 show that SupCon ≤ ϵ-SupCon ≤ ϵ-SupInfoNCE.



The same reasoning can be applied to negative samples (omitted for brevity). Due to computing resource constraints, we were not able to evaluate on ImageNet-1k.



We denote bias-aligned samples with $x^{\bullet,b}$ and bias-conflicting samples with $x^{\bullet,b'}$.

Figure 2: Comparison of ϵ-SupCon and ϵ-SupInfoNCE on Biased-MNIST. It is noticeable that for ρ ≤ 0.997, ϵ-SupInfoNCE and ϵ-SupCon are comparable, while for ρ = 0.999 the gap is significantly larger: this could be due to the additional non-contrastive condition of SupCon.

To use a continuous score rather than a discrete bias label, we compute the similarity of the bias features, $\hat{b}^+_i = s(g(x), g(x^+_i))$, where $g(\cdot)$ is the bias-capturing BagNet18 model. The bias similarity $\hat{b}^+_i$ is used to obtain a weighted sample similarity: $\hat{s}^{+,b}_i = s^+_i \hat{b}^+_i$ for bias-aligned samples, and $\hat{s}^{+,b'}_i = s^+_i (1 - \hat{b}^+_i)$ for bias-conflicting samples. By doing so, for example, the terms $\mu^{+,b} = \frac{1}{P_a}\sum_i d^{+,b}_i$ and $\mu^{+,b'} = \frac{1}{P_c}\sum_k d^{+,b'}_k$ become $\hat{\mu}^{+,b} = \frac{1}{N}\sum_i d^+_i \hat{b}^+_i$ and $\hat{\mu}^{+,b'} = \frac{1}{N}\sum_i d^+_i (1 - \hat{b}^+_i)$, where N is the batch size.

Figure 5: (first and second columns) Distribution of positive bias-aligned similarities $s^{+,b}$: ϵ-SupCon tends to produce a much tighter distribution, with similarities close to 1. (third and fourth columns) Distribution of positive bias-conflicting similarities $s^{+,b'}$: ϵ-SupInfoNCE, even if marginally, is able to increase the number of similar bias-conflicting samples. ϵ-SupCon focuses more on bias-aligned samples, resulting in more biased representations. With ϵ-SupInfoNCE, this behavior is less pronounced, as the lack of the non-contrastive condition leads the model to be less focused on bias-aligned samples and more focused on the bias-conflicting ones. We hypothesize that this is the reason ϵ-SupInfoNCE appears to obtain better results than ϵ-SupCon on more biased datasets.

Comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100.

Accuracy on vision datasets. SimCLR and Max-Margin results from Khosla et al. (2020). Results denoted with * are (re)implemented with mixed precision due to memory constraints.

Top-1 accuracy (%) on Biased-MNIST. Reference results from Hong & Yang (2021). Results denoted with * are re-implemented without color jittering and bias-conflicting oversampling.

Top-1 accuracy (%) on Corrupted CIFAR-10 with different corruption ratios (%) and on bFFHQ. Reference results are taken from Lee et al. (2021).



Table 8: Complete results for common vision datasets, for different values of ϵ, in terms of top-1 accuracy (%). With every value of ϵ we obtain better results than SupCon (and CE) on the same dataset.

…41±0.19   75.85±0.07   76.04±0.01   75.99±0.06
ImageNet-100   82.1±0.59   81.99±0.08   83.25±0.39   83.02±0.41   83.3±0.06

C.2 ANALYSIS OF ϵ-SUPCON FOR DEBIASING


Figure 4: Toy example with simulated data to better explain the suboptimal solution of Fig. 3. We make the hypothesis that the distributions of the distances follow a Gaussian distribution. In blue and in orange are shown the bias-aligned and the bias-conflicting samples, respectively; the green sample represents the anchor. On the left, data points are sampled from two normal distributions with the same mean but different std: the two distributions do not match. This shows that, even if the first-order constraints of EnD are fulfilled, there might still be an effect of the bias. On the contrary, on the right, the two distributions have almost the same statistics (both average and std) and the KL divergence is almost 0; in that case, the bias effect is basically removed.

The two terms we obtain are equivalent to the disentangling term R⊥ and to the entangling term R∥ of EnD: R⊥ tries to decorrelate all of the samples which share the same bias attribute, while R∥ tries to maximize the correlation of samples which belong to the same class but have different bias attributes. However, some practical differences between the two formulations remain in how the different terms are weighted.

B EXPERIMENTAL SETUP

All of our experiments were run using PyTorch 1.10.0, on a cluster with 4 NVIDIA V100 GPUs and a cluster with 8 NVIDIA A40 GPUs. For consistency, when training with contrastive losses we use a temperature value τ = 0.1 across all of our experiments.

B.1 GENERIC VISION DATASETS

B.1.1 CIFAR-10 AND CIFAR-100

We use the original setup from SupCon (Khosla et al., 2020) for CIFAR-10 and CIFAR-100, employing a ResNet-50, a large batch size (1024), a learning rate of 0.5, a temperature of 0.1, and multiview augmentation. We use SGD as optimizer with a momentum of 0.9, and train for 1000 epochs. The learning rate is decayed with a cosine policy, with a warmup from 0.01 over the first 10 epochs.

B.1.2 IMAGENET-100

For ImageNet-100, we employ the ResNet-50 architecture (He et al., 2015). We use SGD as optimizer with a weight decay of 10^-4, a momentum of 0.9, and an initial learning rate of 0.1. We train for 100 epochs with a batch size of 512, and decay the learning rate by a factor of 0.1 every 30 epochs.

Table 9: Complete comparison of ϵ-SupInfoNCE and ϵ-SupCon on ImageNet-100 in terms of top-1 accuracy (%). The results of ϵ-SupInfoNCE are higher than any result of ϵ-SupCon.

Loss            ϵ = 0.1       ϵ = 0.25      ϵ = 0.5
ϵ-SupInfoNCE    83.25±0.39    83.02±0.41    83.3±0.06
ϵ-SupCon        82.83±0.11    82.54±0.09    82.77±0.14

We perform a more in-depth analysis of the debiasing capabilities of ϵ-SupInfoNCE and ϵ-SupCon. In Sec. 4.2 of the main text, we hypothesize that the non-contrastive condition of Eq. 20 might actually be the reason for the loss of accuracy of ϵ-SupCon when compared to ϵ-InfoNCE, as shown in the analysis on Biased-MNIST in Fig. 2 of the main text. In this section, we provide more empirical insights supporting this hypothesis. We plot the similarity of bias-aligned samples ($s^{+,b}$) and bias-conflicting samples ($s^{+,b'}$) during training, to understand how they are affected. Fig. 5 shows the bivariate histogram of the similarities obtained with the two loss functions, at different training epochs and values of ϵ, on Biased-MNIST, with a training ρ of 0.999.

Focusing on the bias-aligned samples (first two columns), we observe that, in both cases, most values are close to 1. However, the presence of the non-contrastive condition of Eq. 20 produces a much tighter distribution for ϵ-SupCon than for ϵ-SupInfoNCE: with ϵ-SupInfoNCE we obtain significantly more bias-aligned samples with a similarity smaller than 1, which is especially evident in the last training epochs. More interestingly, focusing on the bias-conflicting similarities (last two columns), we can also notice how the distribution of similarities of bias-conflicting samples for ϵ-SupCon tends to concentrate around 0. This means that bias-conflicting samples have dissimilar representations even though they are positives and should be mapped to the same point in the representation space: the effect of the bias is still quite important and has not been discarded. On the other hand, with ϵ-SupInfoNCE we obtain a much more spread-out distribution, especially as the number of training epochs increases, meaning that a higher number of bias-conflicting samples reach a greater similarity in the representation space, leading to more robust representations. Overall, ϵ-SupCon focuses more on bias-aligned samples, as most of them have a similarity close to 1, whereas most of the bias-conflicting samples have a similarity close to 0. With our proposed loss ϵ-SupInfoNCE, this behavior is less pronounced, as the lack of the non-contrastive condition leads the model to be less focused on bias-aligned samples. This could explain why ϵ-SupInfoNCE can perform better than ϵ-SupCon in highly biased settings.
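Bivariate histograms like those of Fig. 5 can be computed in a few lines. A sketch with `numpy.histogram2d` (the per-anchor pairing of aligned/conflicting similarities is our simplifying assumption, and plotting is omitted):

```python
import numpy as np

def similarity_histogram(s_aligned, s_conflict, bins=20):
    """2-D histogram of (bias-aligned, bias-conflicting) positive similarities,
    both restricted to the cosine-similarity range [-1, 1]."""
    hist, _, _ = np.histogram2d(s_aligned, s_conflict,
                                bins=bins, range=[[-1, 1], [-1, 1]])
    return hist

# A strongly biased model: aligned similarities pile up near 1 and conflicting
# ones near 0, so all the mass concentrates in a single cell of the histogram.
hist = similarity_histogram(np.full(100, 0.95), np.full(100, 0.05))
assert hist.sum() == 100 and hist.max() == 100
```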

C.3 TRAINING WITH A PROJECTION HEAD

In Tab. 10 we show the results on Corrupted CIFAR-10 with and without a projection head. When employing a projection head, the loss term and the regularization are applied on the projected and original space, respectively, and the final classification is performed in the original latent space before the projection head. We conjecture that the loss in accuracy is likely due to: i) the absence of the aggressive augmentation typically used for generating multiple views in contrastive setups, whose effect is probably attenuated by the projection head; ii) the fact that minimizing ϵ-SupInfoNCE and the FairKL term on the same latent space, rather than on two different ones, could be more beneficial for the optimization process.

We also perform an ablation study of our debiasing regularization on Corrupted CIFAR-10 and on bFFHQ. We test two variants of the regularization term:

1. only the conditions on the mean of the representations μ+ and μ− (Eq. 9-a and Eq. 9-b), similarly to Tartaglione et al. (2021), but with the differences in formulation of Sec. A.3;
2. the full FairKL debiasing term of Eq. 9-c and Eq. 9-d.

The results are shown in Tab. 11. As can be easily observed, employing the full regularization constraint consistently results in better accuracy.

Finally, we conduct an analysis of the importance and stability of the weights α and λ of Eq. 11. We perform multiple experiments selecting α ∈ {0.01, 0.1, 1.0}. For simplicity, we fix ϵ = 0, and we report the accuracy on the Biased-MNIST test set. The results are shown in Tab. 12. There seems to be a correlation between the value of α and the strength of the bias: for stronger biases, it is better to give more importance to the regularization term than to the target loss function. Additionally, we find that α also depends on the complexity of the dataset: for example, on Corrupted CIFAR-10 and bFFHQ we use α = 0.1, while for 9-Class ImageNet we use α = 0.5.

