RETHINKING POSITIVE SAMPLING FOR CONTRASTIVE LEARNING WITH KERNEL

Anonymous authors
Paper under double-blind review

Abstract

Data augmentation is a crucial component in unsupervised contrastive learning (CL). It determines how positive samples are defined and, ultimately, the quality of the representation. Even though efficient augmentations have been found for ImageNet, CL still underperforms compared to supervised methods, and finding good augmentations remains an open problem in other applications, such as medical imaging, or in datasets with easy-to-learn but irrelevant imaging features. In this work, we propose a new way to define positive samples using kernel theory, along with a novel loss called decoupled uniformity. We integrate prior information, learnt from generative models viewed as feature extractors or given as auxiliary attributes, into contrastive learning, making it less dependent on data augmentation. We draw a connection between contrastive learning and conditional mean embedding theory to derive tight bounds on the downstream classification loss. In the unsupervised setting, we empirically demonstrate that CL can benefit from generative models, such as VAEs and GANs, to rely less on data augmentations. We validate our framework on vision and medical datasets including CIFAR10, CIFAR100, STL10, ImageNet100, CheXpert and a brain MRI dataset. In the weakly supervised setting, we demonstrate that our formulation provides state-of-the-art results.

1. INTRODUCTION

Figure 1: Illustration of the proposed method. Each point is an original image x. Two points are connected if they can be transformed into the same augmented image using a distribution of augmentations A. Colors represent (unknown) semantic classes and light disks represent the support of augmentations A(·|x) for each sample x. From an incomplete augmentation graph (1), where intra-class samples are not connected (e.g. augmentations are insufficient or not adapted), we reconnect them using a kernel defined on prior information (either learnt with a generative model, viewed as feature extractor, or given as auxiliary attributes). The extended augmentation graph (3) is the union of the (incomplete) augmentation graph (1) and the kernel graph (2). In (2), the gray disk indicates the set of points x' that are close to the anchor (blue star) in the kernel space.

Contrastive Learning (CL) (44; 3; 4; 7; 10) is a paradigm designed for representation learning which has been applied to unsupervised (10; 13), weakly supervised (55; 20) and supervised problems (37). It gained popularity in recent years by achieving impressive results in the unsupervised setting on standard vision datasets (e.g. ImageNet), where it almost matches the performance of its supervised counterpart (10; 29). The objective in CL is to increase the similarity in the representation space between positive samples (semantically close), while decreasing the similarity between negative samples (semantically distinct). Despite its simple formulation, it requires the definition of a similarity function (which can be seen as an energy term (42)) and of a rule to decide whether a sample should be considered positive or negative. Similarity functions, such as the Euclidean scalar product (e.g. InfoNCE (44)), take as input the latent representations of an encoder f ∈ F, such as a CNN (11) or a Transformer (9) for vision datasets.
In supervised learning (37), positives are simply images belonging to the same class while negatives are images belonging to different classes. In unsupervised learning (10), since labels are unknown, positives are usually defined as transformed versions (views) of the same original image (a.k.a. the anchor) and negatives are the transformed versions of all other images. As a result, the augmentation distribution A used to sample both positives and negatives is crucial (10) and it conditions the quality of the learnt representation. The most-used augmentations for visual representations involve aggressive crop and color distortion. Cropping induces representations with high occlusion invariance (46), while color distortion may prevent the encoder f from taking a shortcut (10) when aligning positive sample representations and falling into the simplicity bias (51). Nevertheless, learning a representation that mainly relies on augmentations comes at a cost: both crop and color distortion induce strong biases in the final representation (46). Specifically, dominant objects inside images can prevent the model from learning features of smaller objects (12) (which is not apparent in object-centric datasets such as ImageNet), and a few irrelevant, easy-to-learn features shared among views are sufficient to collapse the representation (12) (a.k.a. feature suppression). Finding the right augmentations in other visual domains, such as medical imaging, remains an open challenge (20), since we need to find transformations that preserve semantic anatomical structures (e.g. discriminative between pathological and healthy) while removing unwanted noise. If the augmentations are too weak or inadequate to remove irrelevant signal w.r.t. a discrimination task, then how can we define positive samples?
In our work, we propose to integrate prior information, learnt from generative models or given as auxiliary attributes, into contrastive learning, to make it less dependent on data augmentation. Using the theoretical understanding of CL through the augmentation graph, we make the connection with kernel theory and introduce a novel loss with theoretical guarantees on downstream performance. Prior information is integrated into the proposed contrastive loss using a kernel. In the unsupervised setting, we leverage pre-trained generative models, such as GANs (24) and VAEs (38), to learn a prior representation of the data. We provide a solution to the feature suppression issue in CL (12) and also demonstrate SOTA results with weaker augmentations on visual benchmarks. In visual domains where data augmentations are not adapted to the downstream task (e.g. medical imaging), we show that we can improve CL, alleviating the need to find efficient augmentations. In the weakly supervised setting, we instead use auxiliary/prior information, such as image attributes (e.g. bird color or size), and we show better performance than previous conditional formulations based on these attributes (55). In summary, we make the following contributions:
1. We propose a new framework for contrastive learning allowing the integration of prior information, learnt from generative models or given as auxiliary attributes, into the positive sampling.
2. We derive theoretical bounds on the downstream classification risk that rely on weaker assumptions for data augmentations than previous works on CL.
3. We empirically show that our framework can benefit from the latest advances in generative models to learn a better representation while relying on fewer augmentations.
4. We show that we achieve SOTA results in the unsupervised and weakly supervised settings.

2. RELATED WORKS

In a weakly supervised setting, recent studies (20; 55) have shown that positive samples can be defined conditionally on an auxiliary attribute in order to improve the final representation, in particular for medical imaging (20). From an information bottleneck perspective, these approaches essentially compress the representation to be predictive of the auxiliary attributes. This might harm the performance of the model when these attributes are too noisy to accurately approximate the true semantic labels for a given downstream task. In an unsupervised setting, recent approaches (22; 65; 66; 43) used the encoder f, learnt during optimization, to extend the positive sampling procedure to views of different instances (i.e. distinct from the anchor) that are close to the anchor in the latent space. In order to avoid representation collapse, multiple instances of the same sample (2), a support set (22), a momentum encoder (43) or another small network (65) can be used to select the positive samples. In clustering approaches (43; 8), distinct instances with close semantics are attracted in the latent space using prototypes. These prototypes can be estimated through K-means (43) or the Sinkhorn-Knopp algorithm (8). All these methods rely on the past representation of a network to improve the current one. They require strong augmentations and they essentially assume that the closest points in the representation space belong to the same latent class, in order to better select the positives. This inductive bias is still poorly understood from a theoretical point of view (50) and may depend on the visual domain. For medical imaging, ImageNet self-supervised pre-training was beneficial for all subsequent tasks (2). Our work also relates to generative models for learning representations. VAEs (38) learn the data distribution by mapping each input to a Gaussian distribution from which we can easily sample to reconstruct the original image.
GANs (24), instead, sample directly from a Gaussian distribution to generate images that are classified by a discriminator in a min-max game. The discriminator representation can then be used as a feature extractor (48). Other models (ALI (21), BiGAN (18) and BigBiGAN (19)) simultaneously learn a generator and an encoder that can be used directly for representation learning (63). All these models do not require particular augmentations to model the data distribution, but they generally perform worse than recent discriminative approaches (64; 11) for representation learning. A first connection between generative models and contrastive learning has emerged very recently (36). In (36), the authors study the feasibility of learning effective visual representations using only generated samples, and not real ones, with a contrastive loss. Their empirical analysis is complementary to our work. Here, we leverage the representation capacity of generative models, rather than their generative power, to learn a prior representation of the data.

3. CONTRASTIVE LEARNING WITH DECOUPLED UNIFORMITY

Problem setup. The general problem in contrastive learning is to learn a data representation using an encoder f ∈ F : X̄ → S^{d-1} that is pre-trained with a set of n original samples (x_i)_{i∈[1..n]} ∈ X, sampled from the data distribution p(x). These samples are transformed to generate positive samples (i.e., semantically similar to x) in X̄, the space of augmented images, using a distribution of augmentations A(·|x). Concretely, for each x_i, we can sample views of x_i using x̄ ∼ A(·|x_i) (e.g., by applying color jittering, flip or crop with a given probability). The distributions A(·|x) and p(x) induce a marginal distribution p(x̄) = E_{p(x)} A(x̄|x) over X̄. Given an anchor x_i, all views x̄ ∼ A(·|x_j) from different samples x_{j≠i} are considered as negatives. Once pre-trained, the encoder f is fixed and its representation f(X̄) is evaluated through linear evaluation on a classification task using a labeled dataset D = {(x_i, y_i)} ∈ X × Y, where Y = [1..K], with K the number of classes.

Linear evaluation. To evaluate the representation of f on a classification task, we train a linear classifier g(x) = Wf(x) (f is fixed) that minimizes the multi-class classification error.

Objective. The popular InfoNCE loss (45; 44), often used in CL, imposes 1) alignment between positives and 2) uniformity between the views (x̄ i.i.d. ∼ A(·|x)) of all instances x (57), two properties that correlate well with downstream performance. However, by imposing uniformity between all views, we essentially try to both attract (alignment) and repel (uniformity) positive samples, and therefore we cannot achieve perfect alignment and uniformity, as noted in (57). Moreover, InfoNCE was originally designed for only two views (i.e., one positive pair) and its extension to multiple views is not straightforward (40). Previous works have proposed a solution to either the first (54) or the second (60) issue.
Here, we propose a modified version of the uniformity loss presented in (57) that solves both issues since it: i) decouples positives from negatives, similarly to (60), and ii) is generalizable to multiple views, as in (54). We introduce the Decoupled Uniformity loss as:

L^d_unif(f) = log E_{p(x)p(x')} exp(-||μ_x - μ_{x'}||²)

where μ_x = E_{A(x̄|x)} f(x̄) is called the centroid of the views of x. This loss essentially repels distinct centroids μ_x through an average pairwise Gaussian potential. Interestingly, it implicitly optimizes alignment between positives through the maximization of ||μ_x||, so we do not need to explicitly add an alignment term. It can be shown (see Appendix B) that minimizing this loss yields a representation space where the sum of similarities between views of the same sample is greater than the sum of similarities between views of different samples. We will study its main properties hereafter and we will see that, contrary to other contrastive losses, prior information can be added during the estimation of these centroids using a kernel. First, we define a measure of the risk on a downstream task.

Supervised risk. While previous analyses (58; 1) generally used the mean cross-entropy loss (as it has a closer analytic form to InfoNCE), we use a supervised loss closer to decoupled uniformity with the same guarantees as the mean cross-entropy loss (see Appendix C.1). Notably, the geometry of the representation space at the optimum is the same as for cross-entropy and SupCon (37), and we can theoretically achieve perfect linear classification.

Definition 3.1. (Downstream supervised loss) For a given downstream task D = X × Y, we define the classification loss as: L_sup(f) = log E_{y,y' ∼ p(y)p(y')} exp(-||μ_y - μ_{y'}||²), where μ_y = E_{p(x|y)} μ_x. This loss depends on the centroids μ_x rather than f(x). Empirically, it has been shown (23) that performing feature averaging gives better performance on the downstream task.
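As an illustrative sketch (our own numpy code, not the authors' implementation), the finite-sample version of this loss averages the view embeddings of each sample into a centroid and takes the log of the average pairwise Gaussian potential between distinct centroids:

```python
import numpy as np

def decoupled_uniformity(views):
    """Empirical Decoupled Uniformity loss (illustrative sketch).

    views: array of shape (n, V, d), the V augmented-view embeddings
    for each of the n samples in a batch.
    """
    n = views.shape[0]
    mu = views.mean(axis=1)  # centroids mu_x, shape (n, d)
    # pairwise squared distances ||mu_i - mu_j||^2
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    off = ~np.eye(n, dtype=bool)  # exclude i == j terms
    return np.log(np.exp(-sq[off]).mean())
```

Minimizing this quantity pushes centroids apart (uniformity) while the averaging over views implicitly encourages alignment, as discussed above.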

3.1. GEOMETRICAL ANALYSIS OF DECOUPLED UNIFORMITY

Definition 3.2. (Finite-sample estimator) For n samples (x_i)_{i∈[1..n]} i.i.d. ∼ p(x), the (biased) empirical estimator of L^d_unif(f) is: L̂^d_unif(f) = log (1/(n(n-1))) Σ_{i≠j} exp(-||μ_{x_i} - μ_{x_j}||²). It converges in law to L^d_unif(f) at rate O(n^{-1/2}).

Theorem 1. (Geometry of the optimum) For n ≤ d+1, any minimizer f* of L̂^d_unif achieves perfect alignment, i.e. for all x̄, x̄' ∼ A(·|x_i), f*(x̄) = f*(x̄') for all i ∈ [1..n], and the centroids (μ_{x_i}) are uniformly distributed over the hypersphere, forming a regular simplex. Proof in Appendix E.2.

Theorem 1 gives a complete geometrical characterization when the batch size n used during training is not too large compared to the representation space dimension d. By removing the coupling between positives and negatives, we see that Decoupled Uniformity can realize both perfect alignment and uniformity, contrary to InfoNCE (57). Most recent theories about CL (58; 28) make the hypothesis that samples from the same semantic class have overlapping augmented views in order to provide guarantees on the downstream task when optimizing InfoNCE (10) or the Spectral Contrastive loss (28). This assumption, known as the intra-class connectivity hypothesis, is very strong and relies only on the augmentation distribution A. In particular, augmentations should not be "too weak", so that all intra-class samples are connected, and at the same time not "too strong", to prevent connections between inter-class samples and thus preserve the semantic information. Here, we prove that we can relax this hypothesis if we can provide a kernel (viewed as a similarity function between original samples x) that is "good enough" to relate intra-class samples not connected by the augmentations (see Fig. 1). In practice, we show that generative models (viewed as feature extractors) or auxiliary information can define such a kernel. We first recall the definition of the augmentation graph (58) and the intra-class connectivity hypothesis before presenting our main theorems. For simplicity, we assume that the set of images X is finite (similarly to (58; 28)). Our bounds and theoretical guarantees never depend on the cardinality |X|.

INTRA-CLASS CONNECTIVITY HYPOTHESIS.
Definition 3.3. (Augmentation graph (28; 58)) Given a set of original images X, we define the augmentation graph G_A(V, E) for an augmentation distribution A through 1) a set of vertices V = X and 2) a set of edges E such that (x, x') = e ∈ E if the two original images x, x' can be transformed into the same augmented image through A, i.e. supp A(·|x) ∩ supp A(·|x') ≠ ∅.

Previous analyses in CL make the hypothesis that there exists an optimal (accessible) augmentation module A* that fulfills:

Assumption 1. (Intra-class connectivity (58)) For a given downstream classification task D = X × Y, ∀y ∈ Y, the augmentation subgraph G_y ⊂ G_{A*} containing images only from class y is connected.

Under this hypothesis, the Decoupled Uniformity loss can also tightly bound the downstream supervised risk, and for a bigger class of encoders than prior work (not restricted to L-smooth functions (58)).

Definition 3.4. (Weak-aligned encoder) An encoder f ∈ F is ϵ'-weak aligned (ϵ' ≥ 0) on A if: ||f(x̄) - f(x̄')|| ≤ ϵ' for all x ∈ X and all x̄, x̄' i.i.d. ∼ A(·|x).

Theorem 2. (Guarantees with A*) Given an optimal augmentation module A*, for any ϵ-weak aligned encoder f ∈ F we obtain: L^d_unif(f) ≤ L_sup(f) ≤ L^d_unif(f) + 8Dϵ, where D is the maximum diameter of the intra-class graphs G_y (y ∈ Y). Proof in Appendix E.5.

Contrary to previous work (58), this theorem does not require L-smoothness of f ∈ F (a strong assumption) and provides a tighter lower bound. In practice, the diameter D can be controlled by a small constant in some cases (e.g., 4 in (58)), but it remains specific to the dataset at hand. Furthermore, we observe (see Appendix A.1) that f realizes alignment with small error ϵ during optimization of L̂^d_unif(f) for augmentations close to the sweet spot A* (54) on CIFAR-10 and CIFAR-100. In the next section, we study the case when A* is not accessible or very hard to find.
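As an illustration, intra-class connectivity of a finite (augmentation or extended) graph can be checked with a simple union-find sweep over its intra-class edges. This is an illustrative helper of our own, not part of the paper:

```python
from collections import defaultdict

def class_connected(n, edges, labels):
    """Check intra-class connectivity (Assumption 1): for each class y,
    the subgraph restricted to vertices of class y must be connected.

    n: number of vertices; edges: list of (i, j) pairs; labels: class per vertex.
    """
    parent = list(range(n))

    def find(i):
        # find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # union only edges whose endpoints share a class
    for i, j in edges:
        if labels[i] == labels[j]:
            parent[find(i)] = find(j)

    roots = defaultdict(set)
    for i in range(n):
        roots[labels[i]].add(find(i))
    # connected iff each class collapses to a single component
    return all(len(r) == 1 for r in roots.values())
```

On the extended graph of Assumption 2 below, the same check can be run on the union of augmentation edges and kernel edges.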

KERNEL

Having access to optimal augmentations is a strong assumption and, for many real-world applications (e.g. medical imaging (20)), they may not be accessible. If we only have weak augmentations (e.g., supp A(·|x) ⊊ supp A*(·|x) for some x), then some intra-class points might not be connected and we would need to reconnect them to ensure good downstream accuracy (see Theorem 7 in Appendix C.2). Augmentations are intuitive and they have been hand-crafted for decades using human perception (e.g., a rotated chair remains a chair and a gray-scale dog is still a dog). However, we may know other prior information about objects that is difficult to transfer through invariance to augmentations (e.g., chairs should have 4 legs). This prior information can either be given as image attributes (e.g., age or sex of a person, color of a bird, etc.) or, in an unsupervised setting, directly learnt through a generative model (e.g., GAN or VAE). Now, we ask: how can we integrate this information inside a contrastive framework to reconnect intra-class images that are actually disconnected in G_A? We rely on conditional mean embedding theory and use a kernel defined on the prior representation/information. This allows us to estimate a better configuration of the centroids in the representation space with respect to the downstream task and, ultimately, to provide theoretical guarantees on the classification risk.

ϵ-KERNEL GRAPH.

Definition 3.5. (RKHS on X) We define the RKHS (H_X, K_X) on X associated with a kernel K_X.

Example. If we work with large natural images, assuming that we know a prior z(x) about our images (e.g., given by a generative model), we can compute K_X using z through K_X(x, x') = K(z(x), z(x')), where K is a standard kernel (e.g., Gaussian or Cosine). To link kernel theory with the previous augmentation graph, we need to define a kernel graph that connects images with high similarity in the kernel space.

Definition 3.6. (ϵ-Kernel graph) Let ϵ > 0.
We define the ϵ-kernel graph G^ϵ_{K_X}(V, E_{K_X}) for the kernel K_X on X through 1) a set of vertices V = X and 2) a set of edges E_{K_X} such that e ∈ E_{K_X} between x, x' ∈ X iff max(K_X(x, x), K_X(x', x')) - K_X(x, x') ≤ ϵ.

The condition max(K_X(x, x), K_X(x', x')) - K_X(x, x') ≤ ϵ implies that d_{K_X}(x, x') ≤ 2ϵ, where d_{K_X}(x, x') = K_X(x, x) + K_X(x', x') - 2K_X(x, x') is the squared kernel distance. For kernels with constant norm (e.g., the standard Gaussian, Cosine or Laplacian kernels), it is in fact an equivalence. Intuitively, it means that we connect two original points in the kernel graph if they have a small distance in the kernel space. We now give our main assumption to derive a better estimator of the centroid μ_x in the insufficient augmentation regime.

Assumption 2. (Extended intra-class connectivity) For a given task D = X × Y, the extended graph G = G_A ∪ G^ϵ_{K_X} = (V, E ∪ E_{K_X}) (the union of the augmentation graph and the ϵ-kernel graph) is class-connected for all y ∈ Y.

This assumption is notably weaker than Assumption 1 w.r.t. the augmentation distribution A. Here, we do not need to find the optimal distribution of augmentations A*, as long as we have a kernel K_X such that points disconnected in the augmentation graph are connected in the ϵ-kernel graph. If K_X is not well adapted to the dataset (i.e. it gives very low values for intra-class points), then ϵ needs to be large to reconnect these points and, as shown in Appendix A.2, the classification error will be high. In practice, this means that we need to tune the hyper-parameter of the kernel (e.g., σ for an RBF kernel) so that all intra-class points are reconnected with a small ϵ.
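As a sketch (illustrative helpers of our own, assuming an RBF kernel computed on prior features z(x)), the edge set of the ϵ-kernel graph of Definition 3.6 can be built as:

```python
import numpy as np

def rbf_kernel(Z, sigma=1.0):
    """K(x, x') = exp(-||z(x) - z(x')||^2 / (2 sigma^2)) on prior features Z (n, p)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def eps_kernel_graph(K, eps):
    """Edges (i, j), i < j, with max(K_ii, K_jj) - K_ij <= eps (Definition 3.6)."""
    n = K.shape[0]
    d = np.diag(K)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if max(d[i], d[j]) - K[i, j] <= eps]
```

For a constant-norm kernel such as this RBF, the edge condition is equivalent to a small kernel distance, so tuning σ (and ϵ) controls how aggressively prior-similar points are reconnected.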

CONDITIONAL MEAN EMBEDDING.

The Decoupled Uniformity loss includes no kernel in its raw form: it only depends on the centroids μ_x = E_{A(x̄|x)} f(x̄). Here, we show that another consistent estimator of these centroids can be defined using the previous kernel K_X. To show it, we fix an encoder f ∈ F and require the following technical assumption in order to apply conditional mean embedding theory (52; 39).

Assumption 3. (Expressivity of K_X) The (unique) RKHS (H_f, K_f) defined on X̄ with kernel K_f = ⟨f(·), f(·)⟩_{R^d} fulfills: ∀g ∈ H_f, E_{A(x̄|·)} g(x̄) ∈ H_X.

Theorem 3. (Centroid estimation) Let (x_i, x̄_i)_{i∈[1..n]} i.i.d. ∼ A(x̄, x). Under Assumption 3, a consistent estimator of the centroid is: ∀x ∈ X, μ̂_x = Σ_{i=1}^n α_i(x) f(x̄_i), where α_i(x) = Σ_{j=1}^n [(K_n + nλI_n)^{-1}]_{ij} K_X(x_j, x) and K_n = [K_X(x_i, x_j)]_{i,j∈[1..n]}. It converges to μ_x in ℓ_2 norm at rate O(n^{-1/4}) for λ = O(n^{-1/2}). Proof in Appendix E.6.

Intuition. This theorem says that we can use the representations of images close to an anchor x, according to our prior information, to accurately estimate μ_x. Consequently, if the prior is "good enough" to connect intra-class images disconnected in the augmentation graph (i.e. fulfills Assumption 2), then this estimator allows us to tightly control the classification risk. From this theorem, we naturally derive the empirical Kernel Decoupled Uniformity loss using the previous estimator.

Definition 3.7. (Empirical Kernel Decoupled Uniformity loss) Let (x_i, x̄_i)_{i∈[1..n]} i.i.d. ∼ A(x̄, x). Let μ̂_{x_j} = Σ_{i=1}^n α_{ij} f(x̄_i) with α_{ij} = [(K_n + λnI_n)^{-1} K_n]_{ij}, λ = O(n^{-1/2}) a regularization constant and K_n = [K_X(x_i, x_j)]_{i,j∈[1..n]}. We define the empirical Kernel Decoupled Uniformity loss as:

L̂^d_unif(f) := log (1/(n(n-1))) Σ_{i≠j} exp(-||μ̂_{x_i} - μ̂_{x_j}||²)

Extension to multi-view. If we have V views (x̄^{(v)}_i)_{v∈[1..V]} for each x_i, we can easily extend the previous estimator with μ̂_{x_j} = (1/V) Σ_{v=1}^V μ̂^{(v)}_{x_j}, where μ̂^{(v)}_{x_j} = Σ_{i=1}^n α_{ij} f(x̄^{(v)}_i).
The added computational cost is roughly O(n³) (to compute the inverse of an n × n matrix), but it remains negligible compared to the back-propagation time using classical stochastic gradient descent. Importantly, gradients are not back-propagated through the coefficients α_{ij}.
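The estimator of Theorem 3 and the loss of Definition 3.7 can be sketched in a few lines of numpy (illustrative helper names of our own, not the authors' code; in a training loop the α coefficients would additionally be treated as constants w.r.t. the gradient):

```python
import numpy as np

def kernel_centroids(F, K, lam):
    """Kernel estimate of centroids (Theorem 3 / Definition 3.7 sketch).

    F: (n, d) embeddings f(x_bar_i), one view per sample.
    K: (n, n) prior kernel matrix K_n = K_X(x_i, x_j).
    lam: regularization constant, lambda = O(n**-0.5).
    Returns mu_hat (n, d) with mu_hat_{x_j} = sum_i alpha_{ij} f(x_bar_i).
    """
    n = F.shape[0]
    # alpha = (K_n + n*lam*I)^{-1} K_n, solved rather than inverted explicitly
    alpha = np.linalg.solve(K + n * lam * np.eye(n), K)
    return alpha.T @ F

def kernel_decoupled_uniformity(F, K, lam):
    """Empirical Kernel Decoupled Uniformity loss on kernel-estimated centroids."""
    mu = kernel_centroids(F, K, lam)
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    off = ~np.eye(F.shape[0], dtype=bool)
    return np.log(np.exp(-sq[off]).mean())
```

With K = I (an uninformative prior) and λ → 0, the estimated centroids reduce to the raw embeddings, recovering the kernel-free behavior.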

A TIGHT BOUND ON THE CLASSIFICATION LOSS WITH WEAKER ASSUMPTIONS.

We show here that L̂^d_unif(f) tightly bounds the supervised classification risk for well-aligned encoders f ∈ F.

Theorem 4. Assume Assumptions 2 and 3 hold for a reproducing kernel K_X and augmentation distribution A. Let (x_i, x̄_i)_{i∈[1..n]} i.i.d. ∼ A(x̄, x). For any ϵ'-weak aligned encoder f ∈ F:

L̂^d_unif(f) - O(n^{-1/4}) ≤ L_sup(f) ≤ L̂^d_unif(f) + 4D(2ϵ' + β_n(K_X)ϵ) + O(n^{-1/4})    (4)

where β_n(K_X) = (λ_min(K_n)/√n + √n λ)^{-1} = O(1) for λ = O(n^{-1/2}), K_n = [K_X(x_i, x_j)]_{i,j∈[1..n]}, and D is the maximal diameter of the sub-graphs G_y ⊂ G, y ∈ Y. We denote by λ_min(K_n) > 0 the minimal eigenvalue of K_n. Proof in Appendix E.7.

Interpretation. Theorem 4 gives tight bounds on the classification loss L_sup(f) under weaker assumptions than current work (1; 58; 28). We do not require perfect alignment of f ∈ F or L-smoothness, and we have no class-collision term (even if the extended augmentation graph may contain edges between inter-class samples), contrary to (1). Also, the estimation error does not depend on the number of views (which is low in practice), as was the case in previous formulations (58; 1; 28), but rather on the batch size n and the eigenvalues of the kernel matrix (controlling the variance of the centroid estimator (27)). Contrary to CCLK (55), we do not condition our representation on weak attributes but rather provide a better estimation of the conditional mean embedding conditioned on the original image. Eventually, our loss remains in an unconditional contrastive framework driven by the augmentations A and the prior K_X on input images. Theorem 2 becomes a special case with ϵ = 0 and A = A* (i.e. the augmentation graph is class-connected, a stronger assumption than Assumption 2). In Appendix A.2, we provide empirical evidence that better kernel quality (measured by k-NN accuracy in the kernel graph) improves downstream accuracy, as expected from the theorem. It also provides a new way to select a good kernel a priori.

4. EXPERIMENTS

Here, we study several problems where Kernel Decoupled Uniformity outperforms current contrastive SOTA models. In unsupervised learning, we show that we can leverage generative-model representations to outperform current self-supervised models when the augmentations are insufficient to remove irrelevant signals from images. In the weakly supervised setting, we demonstrate the superiority of our unconditional formulation when noisy auxiliary attributes are available.

Evading feature suppression with a VAE. Previous investigations (12) have shown that a few easy-to-learn irrelevant features not removed by augmentations can prevent the model from learning all semantic features inside images. We propose here a first solution to this issue.

RandBits dataset (12). We build a RandBits dataset based on CIFAR-10. For each image, we add a random integer sampled in [0, 2^k - 1], where k is a controllable number of bits. To make it easy to learn, we take its binary representation and repeat it to define k channels that are added to the original RGB channels. Importantly, these channels are not altered by augmentations, so they are shared across views. We train a ResNet18 on this dataset with standard SimCLR augmentations (10) and varying k. For Kernel Decoupled Uniformity, we use a β-VAE representation (ResNet18 backbone, β = 1, also trained on RandBits) to define K_VAE(x, x') = K(μ(x), μ(x')), where μ(·) is the mean of the Gaussian distribution of x in the VAE latent space and K is a standard RBF kernel. Table 1 shows the linear evaluation accuracy computed on a fixed encoder trained with various contrastive (SimCLR, Decoupled Uniformity and Kernel Decoupled Uniformity) and non-contrastive (BYOL and β-VAE) methods. As noted previously (12), β-VAE is the only method insensitive to the number of added bits, but its representation quality remains low compared to other discriminative approaches. All contrastive approaches fail for k ≥ 10 bits.
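The RandBits construction can be sketched as follows (an illustrative numpy helper of our own, assuming uint8 images in channels-last layout):

```python
import numpy as np

def add_randbits(images, k, rng):
    """Append k constant bit-channels to each image (RandBits-style sketch).

    images: (N, H, W, 3) uint8 images (e.g., CIFAR-10).
    Each image receives an integer drawn from [0, 2**k - 1]; its binary
    representation fills k extra constant channels.
    """
    n, h, w, _ = images.shape
    ints = rng.integers(0, 2 ** k, size=n)
    # (N, k) matrix of bits, scaled to the uint8 range
    bits = ((ints[:, None] >> np.arange(k)) & 1).astype(np.uint8) * 255
    # repeat each bit over the full spatial grid
    planes = np.broadcast_to(bits[:, None, None, :], (n, h, w, k))
    return np.concatenate([images, planes], axis=-1)
```

Because the appended planes are spatially constant within each image, crop, flip and color distortion on the RGB channels leave them identical across views, which is exactly what makes them an easy shortcut.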
This can be explained by noticing that, as the number of bits k increases, the number of edges between intra-class images in the augmentation graph G_A decreases. For k bits, on average N/2^k images share the same random bits (N = 50000 is the dataset size), and only these images can be connected in G_A. For k = 20 bits, fewer than one image on average shares the same bits, which means that the images are almost all disconnected; this explains why standard contrastive approaches fail. The same trend is observed for non-contrastive approaches (e.g. BYOL), with an even faster degradation in performance than SimCLR. Interestingly, encouraging a disentangled representation by imposing β > 1 in β-VAE does not help. Only our K_VAE Decoupled Uniformity loss obtains good scores, regardless of the number of bits.

BigBiGAN as prior. We show in Table 2 that recent advances in generative modeling improve the representations of contrastive models with our approach. Due to our limited computational resources, we study ImageNet100 (54) (a 100-class subset of ImageNet used in the literature (54; 15; 57)) and we leverage the BigBiGAN representation (19) as prior. In particular, we use BigBiGAN pre-trained on ImageNet to define a kernel K_GAN(x, x') = K(z(x), z(x')) (with K an RBF kernel and z(·) BigBiGAN's encoder). We demonstrate SOTA representations with this prior compared to all other contrastive and non-contrastive approaches. We use a ResNet50 trained for 400 epochs. Note that in our implementation we do not use data augmentation at test time (i.e. during linear evaluation).

Towards weaker augmentations. Color distortion (including color jittering and gray-scale) and crop are the two most important augmentations for SimCLR and other contrastive models to ensure a good representation on ImageNet (10). Whether they are best suited for other datasets (e.g. medical imaging (20) or multi-object images (12)) is still an open question.
Here, we ask: can generative models remove the need for such strong augmentations? We use standard benchmarking datasets (CIFAR-10, CIFAR-100 and STL-10) and we study the case where augmentations are too weak to connect all intra-class points. We compare to the baseline where all augmentations are used. We use a trained VAE to define K_VAE as before, and a trained DCGAN (48) to define K_GAN(x, x') := K(z(x), z(x')), where z(·) denotes the output of the penultimate layer of the discriminator. In Table 3, we observe that our contrastive framework with the DCGAN representation as prior approaches the performance of self-supervised models while applying only crop and flip augmentations. Additionally, when removing almost all augmentations (crop and color distortion), we approach the performance of the prior representations of the generative models. This is expected by our theory, since the augmentation graph is then almost fully disconnected and we only rely on the prior to reconnect the points. This experiment shows that our method is less sensitive than all other SOTA self-supervised methods to the choice of the "optimal" augmentations, which is relevant in applications where they are not known a priori, or are hard to find.

Weakly supervised learning on natural images. In Table 4, we suppose that we have access to image attributes that correlate with the true semantic labels (e.g. bird color/size for bird classification). We use three datasets: CUB-200-2011 (59), ImageNet100 (54) and UTZappos (62), following (55). CUB-200-2011 contains 11788 images of 200 bird species with 312 binary attributes available (encoding size, color, etc.). UTZappos contains 50025 images of shoes from several brands, sub-categorized into 21 groups that we use as downstream classification labels. It comes with seven attributes.
Finally, for ImageNet100 we follow (55) and use the pre-trained CLIP (47) model (trained on (text, image) pairs) to extract 512-d features considered as prior information. We compare our method with SOTA Siamese models (SimCLR and BYOL) and with CCLK, a conditional contrastive model that defines positive samples only according to the conditioning attributes. The proposed method outperforms all other models on the three datasets.

Filling the gap for medical imaging. Data augmentations on natural images have been hand-crafted over decades. However, we argue they are not adapted to other visual domains such as medical imaging (20). We study 1) bipolar disorder detection (BD), a challenging binary classification task, on the brain MRI dataset BIOBD (32), and 2) chest radiography interpretation, a 5-class classification task on CheXpert (35). BIOBD contains 356 healthy controls (HC) and 306 patients with BD. We use BHB (20) as a large pre-training dataset containing 10k 3D images of healthy subjects. For CheXpert, we use the GLoRIA (34) representation, a multi-modal approach trained with (medical report, image) pairs, to extract 2048-d features as weak annotations. We show that our approach improves contrastive models in both the unsupervised (BD) and weakly supervised (CheXpert) settings for medical imaging.

5. CONCLUSION

In this work, we have shown that we can integrate prior information into CL to improve the final representation. In particular, we drew connections between kernel theory and CL to build our theoretical framework. We demonstrated tight bounds on downstream classification performance under weaker assumptions than previous works. Empirically, we showed that generative models provide a good prior when augmentations are too weak or insufficient to remove easy-to-learn noisy features. We also showed applications in medical imaging, in both unsupervised and weakly supervised settings, where our method outperforms all other models. Thanks to our theoretical framework, we hope that CL will benefit from future progress in generative modelling, widening its field of application to challenging tasks such as computer-aided diagnosis.

A MORE EMPIRICAL EVIDENCE

In this section, we provide additional empirical evidence to support several claims and arguments developed in the paper.

A.1 DECOUPLED UNIFORMITY OPTIMIZES ALIGNMENT

We empirically show here that Decoupled Uniformity optimizes alignment, even in the regime where the batch size n > d + 1, with d the dimension of the representation space. We use the CIFAR-10 and CIFAR-100 datasets and optimize Decoupled Uniformity (without kernel) with all SimCLR augmentations, fixing d = 128 and varying the batch size n. We report the alignment metric defined in (57): $\mathcal{L}_{align} = \mathbb{E}_{p(\bar x)A(x|\bar x)A(x'|\bar x)} \|f(x) - f(x')\|^2$.

A.2 MEASURING KERNEL QUALITY AND EMPIRICAL VERIFICATION OF OUR THEORY

In Fig. 3, we vary σ and report the downstream accuracy (measured by linear evaluation) along with the optimal ϵ* required to add 100 intra-class edges to the ϵ-kernel graph obtained with $K_\sigma$. The lower ϵ*, the better the downstream accuracy; this is expected, since the upper bound on the supervised risk in Theorem 4 becomes tighter. It gives a first empirical confirmation that ϵ tightly bounds the supervised risk on the downstream task. A new way to quantify kernel quality. Based on the concept of kernel graph, we measure the quality of a given kernel K using the nearest neighbors of each image (a vertex in the kernel graph). More precisely, K induces a distance $d_K(a, b) = K(a,a) + K(b,b) - 2K(a,b)$ that can be used to define nearest neighbors in its kernel graph. We compute the fraction of these nearest neighbors that belong to the same class. In Fig. 4, we plot the downstream accuracy vs. kernel quality using the 10 nearest neighbors for various kernels K. They are obtained by using the latent space of a VAE trained for an increasing number of epochs (2, 50, 100, 150 and 1000) and by setting $K(\bar x, \bar x') = \mathrm{RBF}_\sigma(\mu(\bar x), \mu(\bar x'))$ as before (with σ = 50 fixed). It shows that this new measure of kernel quality is highly correlated with the final downstream accuracy.
Therefore, it can be used as a tool to compare different kernels a priori (i.e., without training). One limitation of this metric is that it requires access to the labels of the downstream task. Future work will consist in finding unsupervised properties of the kernel graph that correlate well with downstream accuracy (e.g., sparsity, clustering coefficient, etc.).
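The kernel-quality metric above can be sketched in a few lines of NumPy. This is our illustrative implementation under the stated definition ($d_K$ from the kernel, fraction of same-class k nearest neighbors); the toy two-cluster data stands in for "good prior features".

```python
import numpy as np

def kernel_quality(K, labels, k=10):
    """Fraction of each image's k nearest neighbors, under the kernel-induced
    distance d_K(a, b) = K(a,a) + K(b,b) - 2 K(a,b), that share its class."""
    diag = np.diag(K)
    d = diag[:, None] + diag[None, :] - 2.0 * K
    np.fill_diagonal(d, np.inf)            # exclude the anchor itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    return (labels[nn] == labels[:, None]).mean()

# toy check: two well-separated clusters standing in for two classes
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(10, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
sq = np.sum(Z**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / 2.0)  # RBF, sigma=1
quality = kernel_quality(K, y, k=5)  # 1.0 on this separable toy example
```

A well-separated prior yields quality 1.0; a random kernel would hover around the class prior, matching the correlation with downstream accuracy reported in Fig. 4.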

A.3 MULTI-VIEW CONTRASTIVE LEARNING WITH DECOUPLED UNIFORMITY

When the intra-class connectivity hypothesis is fulfilled, we showed that the Decoupled Uniformity loss tightly bounds the classification risk for well-aligned encoders (see Theorem 2). Under that hypothesis, we consider the standard empirical estimator $\mu_{\bar x} \approx \frac{1}{V}\sum_{v=1}^V f(x^{(v)})$ for V views. Using all SimCLR augmentations, we empirically verify that increasing V allows for: 1) a better estimate of $\mu_{\bar x}$, which implies faster convergence, and 2) better SOTA results on both small-scale (CIFAR10, CIFAR100, STL10) and large-scale (ImageNet100) vision datasets. We always use batch size n = 256 for all approaches, with a ResNet18 backbone for CIFAR10, CIFAR100 and STL10 and ResNet50 for ImageNet100. We report the results in Table 7. Table 7: A better approximation of the centroids $\mu_{\bar x}$ (i.e., an increasing number of views) when the augmentation overlap hypothesis is (nearly) fulfilled implies faster convergence. All models are pretrained with batch size n = 256. We use a ResNet18 backbone for CIFAR10, CIFAR100, STL10 and ResNet50 for ImageNet100. We report linear evaluation accuracy (%) for a given number of epochs e.
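The claim that more views give a better centroid estimate is a plain Monte-Carlo averaging effect, which can be checked numerically. The setup below (a fixed "true" centroid plus Gaussian view noise) is our simplified stand-in for augmented-view embeddings, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
true_mu = rng.normal(size=d)  # stand-in for the true centroid mu_x

def centroid_error(V, trials=500):
    """Average error of the V-view estimator mu_hat = mean of V noisy views."""
    views = true_mu + rng.normal(scale=0.5, size=(trials, V, d))
    mu_hat = views.mean(axis=1)
    return np.mean(np.linalg.norm(mu_hat - true_mu, axis=1))

# more views -> smaller estimation error (roughly O(1/sqrt(V)))
err2, err8 = centroid_error(2), centroid_error(8)
```

The error shrinks like $1/\sqrt V$, which is consistent with the faster convergence observed in Table 7 when increasing the number of views.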

A.4 INFLUENCE OF TEMPERATURE AND BATCH SIZE FOR DECOUPLED UNIFORMITY

InfoNCE is known to be sensitive to batch size and temperature. In our theoretical framework, we assumed that $f(\bar x) \in S^{d-1}$, but we can easily extend it to $f(\bar x) \in \sqrt t\, S^{d-1}$, where t > 0 is a hyper-parameter. This amounts to writing $\mathcal{L}^d_{unif}(f) = \log \mathbb{E}_{p(\bar x)p(\bar x')} e^{-t\|\mu_{\bar x}-\mu_{\bar x'}\|^2}$. We show here that Decoupled Uniformity does not require a very large batch size (as is the case for SimCLR) and produces good representations for t ∈ [1, 5]. We cross-validated the hyper-parameter λ on RandBits CIFAR-10 with k = 10 bits, as shown in Table 10. The assumption n ≤ d + 1, where n is the batch size, is crucial for the existence of a regular simplex on the hypersphere $S^{d-1}$. In practice, this condition is not always fulfilled (e.g., SimCLR (10) with d = 128 and n = 4096). Characterizing the optimal solution of $\mathcal{L}^d_{unif}$ for any n > d + 1 is still an open problem (5), but theoretical guarantees can be obtained in the limit n → ∞. Theorem 5. (Asymptotic Optimality) When the number of samples is infinite (n → ∞), for any perfectly aligned encoder f ∈ F that minimizes $\mathcal{L}^d_{unif}$, the centroids $\mu_{\bar x}$ for $\bar x \sim p(\bar x)$ are uniformly distributed on the hypersphere $S^{d-1}$. Proof in Appendix E.2. Empirically, we observe that minimizers f of $\hat{\mathcal{L}}^d_{unif}$ remain well-aligned when n > d + 1 on real-world vision datasets (see Appendix A.1). Decoupled Uniformity thus optimizes two properties that are nicely correlated with downstream classification performance (57), namely alignment and uniformity between centroids. However, as noted in (58; 50), optimizing these two properties is necessary but not sufficient to guarantee good classification accuracy; in fact, the accuracy can be arbitrarily bad even for perfectly aligned and uniform encoders (50).
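For reference, the temperature-scaled Decoupled Uniformity loss above can be sketched in a few lines. This is a minimal NumPy reading of the formula $\log \mathbb{E}\, e^{-t\|\mu_{\bar x}-\mu_{\bar x'}\|^2}$ on a finite batch; the function name and toy input are ours.

```python
import numpy as np

def decoupled_uniformity(views, t=1.0):
    """L = log mean_{i != j} exp(-t * ||mu_i - mu_j||^2), where the centroid
    mu_i is the average of the V view embeddings of image i.
    views: array of shape [n, V, d]."""
    mu = views.mean(axis=1)                    # [n, d] centroids
    sq = np.sum(mu**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * mu @ mu.T
    off = ~np.eye(len(mu), dtype=bool)         # exclude i == j pairs
    return np.log(np.mean(np.exp(-t * d2[off])))

# two images, two identical views each, centroids at e_1 and -e_1:
# ||mu_1 - mu_2||^2 = 4, so with t = 2 the loss is log(exp(-8)) = -8
views = np.array([[[1., 0.], [1., 0.]],
                  [[-1., 0.], [-1., 0.]]])
loss = decoupled_uniformity(views, t=2.0)
```

Larger t sharpens the penalty on close centroid pairs, which is the knob studied in this section.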

B.2 A METRIC LEARNING POINT-OF-VIEW

In this section, we provide a geometric understanding of the Decoupled Uniformity loss from a metric learning point of view. In particular, we consider the Log-Sum-Exp (LSE) operator, often used in CL as an approximation of the maximum. We consider the finite-samples case with n original samples $(\bar x_i)_{i\in[1..n]} \overset{iid}{\sim} p(\bar x)$ and V views $(x_i^{(v)})_{v\in[1..V]} \overset{iid}{\sim} A(\cdot|\bar x_i)$ for each sample $\bar x_i$. With a slight abuse of notation, we set $\mu_i = \frac{1}{V}\sum_{v=1}^V f(x_i^{(v)})$. Then we have:

$$\hat{\mathcal{L}}^d_{unif} = \log \frac{1}{n(n-1)} \sum_{i\neq j} \exp(-\|\mu_i-\mu_j\|^2) = \log \frac{1}{n(n-1)} \sum_{i\neq j} \exp(-s_i^+ - s_j^+ + 2 s_{ij}^-)$$

where $s_i^+ = \|\mu_i\|^2 = \frac{1}{V^2}\sum_{v,v'} s(x_i^{(v)}, x_i^{(v')})$, $s_{ij}^- = \frac{1}{V^2}\sum_{v,v'} s(x_i^{(v)}, x_j^{(v')})$ and $s(\cdot,\cdot) = \langle f(\cdot), f(\cdot)\rangle$ is viewed as a similarity measure. From a metric learning point of view, we shall see that minimizing Eq. 5 is (almost) equivalent to looking for an encoder f such that the sums of similarities between views of the same anchor ($s_i^+$ and $s_j^+$) are higher than the sum of similarities between views of different instances ($s_{ij}^-$):

$$s_i^+ + s_j^+ > 2 s_{ij}^- + \epsilon \quad \forall i \neq j \quad (6)$$

where ϵ is a margin that we suppose "very big" (see hereafter). Indeed, this inequality is equivalent to $-\epsilon > 2s_{ij}^- - s_i^+ - s_j^+$ for all i ≠ j, which can be written as:

$$\arg\min_f \max\big(-\epsilon, \{2s_{ij}^- - s_i^+ - s_j^+\}_{i,j\in[1..n], j\neq i}\big)$$

This can be turned into an optimization problem using the LSE (log-sum-exp) approximation of the max operator:

$$\arg\min_f \log\Big(\exp(-\epsilon) + \sum_{i\neq j} \exp(-s_i^+ - s_j^+ + 2 s_{ij}^-)\Big)$$

Thus, with an infinite margin ($\epsilon \to \infty$), we retrieve exactly our optimization problem with Decoupled Uniformity in Eq. 5 (up to an additive constant depending on n).
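The key algebraic identity behind this section, $-\|\mu_i-\mu_j\|^2 = -s_i^+ - s_j^+ + 2 s_{ij}^-$, can be verified numerically. The random view embeddings below are our toy stand-ins for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, V, d = 4, 3, 8
f = rng.normal(size=(n, V, d))
f /= np.linalg.norm(f, axis=-1, keepdims=True)  # views on the unit sphere

mu = f.mean(axis=1)                             # centroids mu_i
s = np.einsum('ivd,jwd->ijvw', f, f)            # s(x_i^(v), x_j^(w))
s_bar = s.mean(axis=(2, 3))                     # (1/V^2) sum_{v,w} s(...)
s_pos = np.diag(s_bar)                          # s_i^+ = ||mu_i||^2

# largest deviation between the two sides of the identity over all pairs
gap = max(
    abs(-np.sum((mu[i] - mu[j])**2) - (-s_pos[i] - s_pos[j] + 2 * s_bar[i, j]))
    for i in range(n) for j in range(n) if i != j
)
```

The gap is zero up to floating-point error, confirming that the pairwise centroid distance decomposes exactly into the positive and negative similarity sums used in Eq. 6.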

C ADDITIONAL GENERAL GUARANTEES ON DOWNSTREAM CLASSIFICATION C.1 OPTIMAL CONFIGURATION OF SUPERVISED LOSS

In order to derive guarantees on a downstream classification task D when optimizing our unsupervised decoupled uniformity loss, we define a supervised loss that measures the risk on the downstream task. We prove in the next section that the minimizers of this loss have the same geometry as the ones minimizing cross-entropy and SupCon (37): a regular simplex on the hypersphere (25). More formally, we have: Lemma 6. Let D be a downstream task with C classes. We assume that C ≤ d + 1 (i.e., a big enough representation space), that all classes are balanced, and the realizability of an encoder $f^* = \arg\min_{f\in\mathcal F}\mathcal L_{sup}(f)$ with $\mathcal L_{sup}(f) = \log \mathbb E_{y,y'\sim p(y)p(y')} e^{-\|\mu_y-\mu_{y'}\|^2}$ and $\mu_y = \mathbb E_{p(\bar x|y)}\mu_{\bar x}$. Then the optimal centroids $(\mu^*_y)_{y\in\mathcal Y}$ associated with f* make a regular simplex on the hypersphere $S^{d-1}$ and they are perfectly linearly separable, i.e. $\min_{(w_y)_{y\in\mathcal Y}\in\mathbb R^d} \mathbb E_{(x,y)\sim D} 1(w_y\cdot\mu^*_y < 0) = 0$. Proof in the next section. This property notably implies that we can achieve 100% accuracy at the optimum with linear evaluation (taking the linear classifier $g(x) = W^* f^*(x)$ with $W^* = (\mu^*_y)_{y\in\mathcal Y} \in \mathbb R^{C\times d}$).

C.2 GENERAL GUARANTEES OF DECOUPLED UNIFORMITY

In its most general formulation, we tightly bound the previous supervised loss by the Decoupled Uniformity loss $\mathcal L^d_{unif}$, up to a variance term of the centroids $\mu_{\bar x}$ conditionally on the labels: Theorem 7. (Guarantees for a given downstream task) For any f ∈ F and augmentation A, we have:

$$\mathcal L^d_{unif}(f) \le \mathcal L_{sup}(f) \le 2\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y) + \mathcal L^d_{unif}(f) \le 4\,\mathbb E_{p(\bar x|y)p(\bar x'|y)}\|\mu_{\bar x}-\mu_{\bar x'}\| + \mathcal L^d_{unif}(f) \quad (7)$$

where $\mathrm{Var}(\mu^j_{\bar x}\mid y) = \mathbb E_{p(\bar x|y)}\big(\mu^j_{\bar x} - \mathbb E_{p(\bar x'|y)}\mu^j_{\bar x'}\big)^2$, $y = \arg\max_{y'\in\mathcal Y} \mathrm{Var}(\mu^j_{\bar x}\mid y')$ and $\mu^j_{\bar x}$ is the j-th component of $\mu_{\bar x} = \mathbb E_{A(x|\bar x)} f(x)$. Proof in the next section. Intuitively, this means that we achieve good accuracy if all centroids $(\mu_{\bar x})_{\bar x\in\bar{\mathcal X}}$ of samples in the same class are not too far apart. This theorem is very general, since we do not require the intra-class connectivity assumption on A; any A ⊂ A* can be used.

D EXPERIMENTAL DETAILS

Code will be released upon acceptance of the manuscript. We provide a detailed pseudo-code of our algorithm as well as all experimental details to reproduce the experiments run in the manuscript.

D.1 PSEUDO-CODE

Algorithm 1 Pseudo-code of the algorithm
Require: batch of images $(\bar x_1, ..., \bar x_n) \in \bar{\mathcal X}$, augmentation distribution A, temperature t, hyper-parameter λ for centroid estimation
  $K_n \leftarrow (K(\bar x_i, \bar x_j))_{i,j\in[1..n]}$  ▷ Compute the kernel matrix
  $\alpha \leftarrow (K_n + n\lambda I_n)^{-1} K_n$  ▷ Compute weights for centroid estimation
  $x_i^{(1)}, ..., x_i^{(V)} \overset{iid}{\sim} A(\cdot|\bar x_i)$  ▷ Sample V views per image
  $F \leftarrow \big(\frac{1}{V}\sum_{v=1}^V f(x_i^{(v)})\big)_{i\in[1..n]}$  ▷ Compute the averaged image representations
  $\hat\mu \leftarrow \alpha F$  ▷ Centroid estimation
  $\hat{\mathcal L}^d_{unif} \leftarrow \log \frac{1}{n(n-1)}\sum_{i\neq j}\exp(-t\|\hat\mu_i - \hat\mu_j\|^2)$  ▷ Kernel Decoupled Uniformity loss
return $\hat{\mathcal L}^d_{unif}$

D.2 IMPLEMENTATION IN PYTORCH

We provide a PyTorch implementation of the previous pseudo-code in Algorithm 2. It generalizes to an arbitrary number of views and any kernel.

D.3 DATASETS

CheXpert (35). This dataset is composed of 224316 chest radiographs of 65240 patients. Each radiograph comes with 14 medical observations. We use the official training set for our experiments, following (34; 35), and we test the models on the held-out official validation split containing radiographs from 200 patients. For linear evaluation on this dataset, we train 5 linear probes to discriminate 5 pathologies (as binary classification tasks) using only the radiographs with "certain" labels.

D.4 CONTRASTIVE MODELS

Architecture. For all small-scale vision datasets (CIFAR-10 (41), CIFAR-100 (41), STL-10 (16), CUB-200-2011 (56) and UT-Zappos (62)) and CheXpert, we used the official ResNet18 (30) backbone, where we replaced the first 7 × 7 convolutional kernel by a smaller 3 × 3 kernel and removed the first max-pooling layer for CIFAR-10, CIFAR-100 and UT-Zappos. For ImageNet100, we used ResNet50 (30) for stronger baselines, as is common in the literature. For the brain MRI datasets (BHB (20) and BIOBD (32)), we used DenseNet121 (33) as our default backbone encoder, following previous literature on these datasets (20). Following (10), we use the representation space after the last average pooling layer (2048 dimensions) to perform linear evaluation, and a 2-layer MLP projection head with batch normalization between each layer for a final latent space with 128 dimensions. Kernel choice. In all experiments with Kernel Decoupled Uniformity, we used an RBF kernel and cross-validated the hyper-parameter σ within {0.1, 1, 10, 30, 50, 100}. Batch size. We always use a default batch size of 256 for all experiments on vision datasets and 64 for brain MRI datasets (considering the computational cost of 3D images, and since batch size had little impact on performance (20)). Optimization. We use an SGD optimizer on small-scale vision datasets (CIFAR-10, CIFAR-100, STL-10, CUB-200-2011, UT-Zappos) with a base learning rate 0.3 × batch size/256 and a cosine scheduler. For ImageNet100, we use a LARS (61) optimizer with learning rate 0.02 × √batch size and a cosine scheduler. In the Kernel Decoupled Uniformity loss, we set λ = 0.01 × √batch size and t = 2. For SimCLR, we set the temperature to τ = 0.07 for all datasets, following (60). Unless mentioned otherwise, we use 2 views for Decoupled Uniformity (both with and without kernel), so the computational cost remains comparable with standard contrastive models. Training epochs.
By default, we train the models for 200 epochs for all vision datasets, except CUB-200-2011 and UT-Zappos, where we train them for 1000 epochs following (55), and ImageNet100, where we train them for 400 epochs. For the medical brain MRI dataset, we perform pre-training for 50 epochs, as in (20). For CheXpert, we train all models for 400 epochs. Augmentations. We follow (10) to define our full set of data augmentations for vision datasets, including RandomResizedCrop (uniform scale between 0.08 and 1), RandomHorizontalFlip and color distortion (including color jittering and gray-scale). For the medical brain MRI dataset, we use cutout covering 25% of the image in each direction ((1/4)³ of the entire volume), following (20). For CheXpert, we follow (2) and use RandomResizedCrop (uniform scale between 0.08 and 1), RandomHorizontalFlip and RandomRotation (up to 45 degrees); we do not apply color jittering, since we work with gray-scale images.

D.4.1 GENERATIVE MODELS AND GLORIA

Architecture. For the VAE, we use a ResNet18 backbone with a completely symmetric decoder using nearest-neighbor interpolation for up-sampling. For DCGAN, we follow the architecture described in (48). We keep the original image dimensions for CIFAR-10 and CIFAR-100 and resize the images to 64 × 64 for STL-10. For BigBiGAN (19), we use the ResNet50 pre-trained encoder available at https://tfhub.dev/deepmind/bigbigan-resnet50/1 with BN+CReLU features. Training. For the VAE, we use the PyTorch Lightning pre-trained model for STL-10, and we train the VAE on CIFAR-10 and CIFAR-100 for 400 epochs using an initial learning rate of 10⁻³ and an SGD optimizer with a cosine scheduler. For the RandBits experiments, the VAE is trained on RandBits-CIFAR10 with the same setup as for CIFAR-10/100. For DCGAN, we use an Adam optimizer (following (48)) with a base learning rate of 2 × 10⁻⁴. Importantly, all generative models are trained without data augmentation, providing a fair comparison with other methods. GloRIA (34). GloRIA can encode both images and text through 2 different encoders. It is pre-trained on the official training set of CheXpert, as in our experiments. We use only GloRIA's image encoder (a ResNet18 in practice) to obtain weak labels on CheXpert, and we leverage these weak labels with the Kernel Decoupled Uniformity loss. In practice, we use an RBF kernel as in our previous experiments.

D.4.2 LINEAR EVALUATION

For all experiments, we perform linear evaluation by encoding the original training set (without augmentation) and training a logistic regression on these features. We cross-validate an ℓ₂ penalty within {0, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵} and train this linear probe for 300 epochs with an initial learning rate of 0.1, decayed by 0.1 at each plateau.

E PROOFS E.1 ESTIMATION ERROR WITH EMPIRICAL DECOUPLED UNIFORMITY

Property 1. $\hat{\mathcal L}^d_{unif}(f)$ fulfills $|\hat{\mathcal L}^d_{unif}(f) - \mathcal L^d_{unif}(f)| \le O(1/\sqrt n)$, with convergence in law.

$$|\hat{\mathcal L}^d_{unif}(f) - \mathcal L^d_{unif}(f)| \le k\Big|\frac{1}{n(n-1)}\sum_{i\neq j} e^{-\|\mu_{\bar x_i}-\mu_{\bar x_j}\|^2} - \mathbb E_{p(\bar x)p(\bar x')} e^{-\|\mu_{\bar x}-\mu_{\bar x'}\|^2}\Big|$$

For a fixed $\bar x \in \bar{\mathcal X}$, let $g_n(\bar x) = \frac{1}{n}\sum_{i=1}^n e^{-\|\mu_{\bar x}-\mu_{\bar x_i}\|^2}$ and $g(\bar x) = \mathbb E_{p(\bar x')} e^{-\|\mu_{\bar x}-\mu_{\bar x'}\|^2}$. Since $(Z_i)_{i\in[1..n]} = \big(e^{-\|\mu_{\bar x}-\mu_{\bar X_i}\|^2} - g(\bar x)\big)_{i\in[1..n]}$ are iid with bounded support in [-2, 2] and zero mean, the Berry-Esseen theorem gives $|g_n(\bar x) - g(\bar x)| \le O(1/\sqrt n)$. Similarly, $(Z'_i)_{i\in[1..n]} = \big(g_n(\bar X_i) - \mathbb E_{p(\bar x)} g_n(\bar x)\big)_{i\in[1..n]}$ are iid, bounded in [-2, 2] and with zero mean, so $|\frac{1}{n}\sum_{i=1}^n g_n(\bar x_i) - \mathbb E_{p(\bar x)} g_n(\bar x)| \le O(1/\sqrt n)$ by the Berry-Esseen theorem. Then we have:

$$|\hat{\mathcal L}^d_{unif}(f) - \mathcal L^d_{unif}(f)| \le k\Big|\frac{n}{(n-1)n}\sum_{i=1}^n g_n(\bar x_i) - \mathbb E_{p(\bar x)} g(\bar x)\Big| \le 2k\Big|\frac{1}{n}\sum_{i=1}^n g_n(\bar x_i) - \mathbb E_{p(\bar x)} g_n(\bar x) + \mathbb E_{p(\bar x)} g_n(\bar x) - \mathbb E_{p(\bar x)} g(\bar x)\Big| \le O(1/\sqrt n) + O(1/\sqrt n) = O(1/\sqrt n)$$

E.2 OPTIMALITY OF DECOUPLED UNIFORMITY

Theorem 1. (Optimality of Decoupled Uniformity) Given n points $(\bar x_i)_{i\in[1..n]}$ such that n ≤ d + 1, the optimal decoupled uniformity loss is reached when: 1. (Perfect uniformity) All centroids $(\mu_i)_{i\in[1..n]} = (\mu_{\bar x_i})_{i\in[1..n]}$ make a regular simplex on the hypersphere $S^{d-1}$; 2. (Perfect alignment) f is perfectly aligned, i.e. $\forall x, x' \overset{iid}{\sim} A(\cdot|\bar x_i)$, f(x) = f(x').

PROOF. We use Jensen's inequality and basic algebra to show these two properties. By the triangle inequality, $\|\mu_i\| = \|\mathbb E_{x\sim A(\cdot|\bar x_i)} f(x)\| \le \mathbb E\|f(x)\| = 1$, so all $(\mu_i)$ are bounded by 1. Let $\mu = (\mu_i)_{i\in[1..n]}$. We have:

$$\Gamma(\mu) := \sum_{i,j=1}^n \|\mu_i-\mu_j\|^2 = \sum_{i,j}\big(\|\mu_i\|^2 + \|\mu_j\|^2 - 2\mu_i\cdot\mu_j\big) \le \sum_{i,j}(2 - 2\mu_i\cdot\mu_j) = 2n^2 - 2\Big\|\sum_i \mu_i\Big\|^2 \le 2n^2$$

with equality if and only if $\sum_{i=1}^n \mu_i = 0$ and $\forall i\in[1..n], \|\mu_i\| = 1$. By strict convexity of $u\mapsto e^{-u}$, we have:

$$\sum_{i\neq j} \exp(-\|\mu_i-\mu_j\|^2) \ge n(n-1)\exp\Big(-\frac{\Gamma(\mu)}{n(n-1)}\Big) \ge n(n-1)\exp\Big(-\frac{2n}{n-1}\Big)$$

with equality if and only if all pairwise distances $\|\mu_i-\mu_j\|$ are equal (equality case in Jensen's inequality for a strictly convex function), $\sum_{i=1}^n \mu_i = 0$ and $\|\mu_i\| = 1$.
So all centroids must form a regular (n−1)-simplex inscribed in the hypersphere $S^{d-1}$ centered at 0. Finally, since $\|\mu_i\| = 1$, we have equality in Jensen's inequality $\|\mu_i\| = \|\mathbb E_{A(x|\bar x_i)} f(x)\| \le \mathbb E_{A(x|\bar x_i)}\|f(x)\| = 1$. Since $\|\cdot\|$ is strictly convex on the hypersphere, f must be constant on $\mathrm{supp}\, A(\cdot|\bar x_i)$ for all $\bar x_i$, so f is perfectly aligned.

Theorem 5. (Asymptotic Optimality) When the number of samples is infinite (n → ∞), for any perfectly aligned encoder f ∈ F that minimizes $\mathcal L^d_{unif}$, the centroids $\mu_{\bar x}$ for $\bar x \sim p(\bar x)$ are uniformly distributed on the hypersphere $S^{d-1}$.

PROOF. Let f ∈ F be perfectly aligned. Then all centroids $\mu_{\bar x} = f(\bar x)$ lie on the hypersphere $S^{d-1}$ and we are optimizing:

$$\arg\min_f \mathcal L^d_{unif}(f) = \arg\min_f \mathbb E_{\bar x, \bar x' \overset{iid}{\sim} p(\bar x)} e^{-\|f(\bar x)-f(\bar x')\|^2}$$

A direct application of Proposition 1 in (57) shows that the uniform distribution on $S^{d-1}$ is the unique solution of this problem, so all centroids are uniformly distributed on the hypersphere.

E.3 OPTIMALITY OF SUPERVISED LOSS

Lemma 6. Let D be a downstream task with C classes. We assume that C ≤ d + 1 (i.e., a big enough representation space), that all classes are balanced, and the realizability of an encoder $f^* = \arg\min_{f\in\mathcal F}\mathcal L_{sup}(f)$ with $\mathcal L_{sup}(f) = \log\mathbb E_{y,y'\sim p(y)p(y')} e^{-\|\mu_y-\mu_{y'}\|^2}$ and $\mu_y = \mathbb E_{p(\bar x|y)}\mu_{\bar x}$. Then the optimal centroids $(\mu^*_y)_{y\in\mathcal Y}$ associated with f* make a regular simplex on the hypersphere $S^{d-1}$ and they are perfectly linearly separable, i.e. $\min_{(w_y)_{y\in\mathcal Y}\in\mathbb R^d}\mathbb E_{(x,y)\sim D} 1(w_y\cdot\mu^*_y < 0) = 0$.

PROOF. This proof is very similar to that of Theorem 1. We first notice that all "labelled" centroids $\mu_y = \mathbb E_{p(\bar x|y)}\mu_{\bar x}$ are bounded by 1 ($\|\mu_y\| \le \mathbb E_{p(\bar x|y)}\mathbb E_{A(x|\bar x)}\|f(x)\| = 1$, by Jensen's inequality applied twice).
Then, since all classes are balanced, we can rewrite the supervised loss as:

$$\mathcal L_{sup}(f) = \log \frac{1}{C^2}\sum_{y,y'=1}^C e^{-\|\mu_y-\mu_{y'}\|^2}$$

We have:

$$\Gamma_{\mathcal Y}(\mu) := \sum_{y,y'=1}^C \|\mu_y-\mu_{y'}\|^2 = \sum_{y,y'}\big(\|\mu_y\|^2 + \|\mu_{y'}\|^2 - 2\mu_y\cdot\mu_{y'}\big) \le \sum_{y,y'}(2 - 2\mu_y\cdot\mu_{y'}) = 2C^2 - 2\Big\|\sum_y \mu_y\Big\|^2 \le 2C^2$$

with equality if and only if $\sum_{y=1}^C \mu_y = 0$ and $\forall y\in[1..C], \|\mu_y\| = 1$. By strict convexity of $u\mapsto e^{-u}$, we have:

$$\sum_{y\neq y'} \exp(-\|\mu_y-\mu_{y'}\|^2) \ge C(C-1)\exp\Big(-\frac{\Gamma_{\mathcal Y}(\mu)}{C(C-1)}\Big) \ge C(C-1)\exp\Big(-\frac{2C}{C-1}\Big)$$

with equality if and only if all pairwise distances $\|\mu_y-\mu_{y'}\|$ are equal (equality case in Jensen's inequality for a strictly convex function),

E.4 GENERALIZATION BOUNDS FOR DECOUPLED UNIFORMITY

Theorem 7. (Guarantees for a given downstream task) For any f ∈ F and augmentation distribution A, we have:

$$\mathcal L^d_{unif}(f) \le \mathcal L_{sup}(f) \le 2\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y) + \mathcal L^d_{unif}(f) \le 4\,\mathbb E_{p(\bar x|y)p(\bar x'|y)}\|\mu_{\bar x}-\mu_{\bar x'}\| + \mathcal L^d_{unif}(f)$$

where $\mathrm{Var}(\mu^j_{\bar x}\mid y) = \mathbb E_{p(\bar x|y)}\big(\mu^j_{\bar x} - \mathbb E_{p(\bar x'|y)}\mu^j_{\bar x'}\big)^2$ and $\mu^j_{\bar x}$ is the j-th component of $\mu_{\bar x} = \mathbb E_{A(x|\bar x)} f(x)$.

PROOF. Lower bound. To derive the lower bound, we apply Jensen's inequality to the convex function $u\mapsto e^{-u}$:

$$\exp \mathcal L^d_{unif}(f) = \mathbb E_{p(\bar x)p(\bar x')} e^{-\|\mu_{\bar x}-\mu_{\bar x'}\|^2} = \mathbb E_{p(\bar x|y)p(\bar x'|y')p(y)p(y')} e^{-\|\mu_{\bar x}-\mu_{\bar x'}\|^2} \le \mathbb E_{p(y)p(y')} \exp\big(-\mathbb E_{p(\bar x|y)p(\bar x'|y')}\|\mu_{\bar x}-\mu_{\bar x'}\|^2\big)$$

Then, by Jensen's inequality applied to $\|\cdot\|^2$:

$$\mathbb E_{p(\bar x|y)p(\bar x'|y')}\|\mu_{\bar x}-\mu_{\bar x'}\|^2 \overset{(1)}{=} \mathbb E_{p(\bar x|y)}\|\mu_{\bar x}\|^2 + \mathbb E_{p(\bar x'|y')}\|\mu_{\bar x'}\|^2 - 2\mu_y\cdot\mu_{y'} \ge \|\mathbb E_{p(\bar x|y)}\mu_{\bar x}\|^2 + \|\mathbb E_{p(\bar x'|y')}\mu_{\bar x'}\|^2 - 2\mu_y\cdot\mu_{y'} = \|\mu_y-\mu_{y'}\|^2$$

Upper bound. We start by expanding:

$$\|\mu_y-\mu_{y'}\|^2 = \|\mathbb E_{p(\bar x'|y')}\mu_{\bar x'}\|^2 + \|\mathbb E_{p(\bar x|y)}\mu_{\bar x}\|^2 - 2\mathbb E_{p(\bar x|y)p(\bar x'|y')}\mu_{\bar x}\cdot\mu_{\bar x'} = \mathbb E_{p(\bar x|y)}\|\mu_{\bar x}\|^2 + \mathbb E_{p(\bar x'|y')}\|\mu_{\bar x'}\|^2 - \Big(\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y) + \mathrm{Var}(\mu^j_{\bar x'}\mid y')\Big) - 2\mathbb E_{p(\bar x|y)p(\bar x'|y')}\mu_{\bar x}\cdot\mu_{\bar x'} = \mathbb E_{p(\bar x|y)p(\bar x'|y')}\|\mu_{\bar x}-\mu_{\bar x'}\|^2 - \Big(\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y) + \mathrm{Var}(\mu^j_{\bar x'}\mid y')\Big)$$

Variance upper bound. The conditional variance can be further bounded:

$$\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y_m) \le \mathbb E_{p(\bar x|y_m)}\|\mu_{\bar x} - \mathbb E_{p(\bar x'|y_m)}\mu_{\bar x'}\|\big(\|\mu_{\bar x}\| + \|\mathbb E_{p(\bar x|y_m)}\mu_{\bar x}\|\big) \le 2\,\mathbb E_{p(\bar x|y_m)}\|\mu_{\bar x} - \mathbb E_{p(\bar x'|y_m)}\mu_{\bar x'}\| \overset{(3)}{\le} 2\,\mathbb E_{p(\bar x|y_m)p(\bar x'|y_m)}\|\mu_{\bar x}-\mu_{\bar x'}\|$$

For an ϵ-aligned encoder f, this yields:

$$\mathcal L^d_{unif}(f) \le \mathcal L_{sup}(f) \le 8D\epsilon + \mathcal L^d_{unif}(f)$$

where D is the maximum diameter of all intra-class graphs $G_y$ ($y\in\mathcal Y$). Indeed, for $\bar x, \bar x'$ in the same class connected by a path $(\bar x_1, ..., \bar x_{p+1})$ of length p ≤ D:

$$\|\mu_{\bar x}-\mu_{\bar x'}\| = \|\mu_{\bar x_1}-\mu_{\bar x_{p+1}}\| = \Big\|\sum_{i=1}^p (\mu_{\bar x_{i+1}}-\mu_{\bar x_i})\Big\| \le \sum_{i=1}^p \|\mu_{\bar x_{i+1}} - f(x_i) + f(x_i) - \mu_{\bar x_i}\| \le \sum_{i=1}^p \big(\mathbb E_{p(x|\bar x_{i+1})}\|f(x)-f(x_i)\| + \mathbb E_{p(x|\bar x_i)}\|f(x_i)-f(x)\|\big) \le \sum_{i=1}^p (\epsilon + \epsilon) = 2\epsilon p \le 2\epsilon D$$

In the following, we note $f = [f(x_1), ..., f(x_n)]^T$.
An estimator of the conditional mean embedding is: $\forall \bar x \in \bar{\mathcal X}, \hat\mu_{\bar x} = \sum_{i=1}^n \alpha_i(\bar x) f(x_i)$, where $\alpha_i(\bar x) = \sum_{j=1}^n [(\Phi_n^T\Phi_n + \lambda n I_n)^{-1}]_{ij}\,\langle\phi(\bar x_j), \phi(\bar x)\rangle_{\mathcal H_{\bar{\mathcal X}}}$. It converges to $\mu_{\bar x}$ in $\ell_2$ norm at rate $O(n^{-1/4})$ for $\lambda = O(1/\sqrt n)$.

PROOF. Let $m_{\bar x} = \mathbb E_{p(x|\bar x)}\langle f(x), f(\cdot)\rangle \in \mathcal H_{\mathcal X}$ be the conditional mean embedding operator. According to Theorem 6 in (52) and the assumption $\forall g\in\mathcal H_{\mathcal X}, \mathbb E_{p(x|\cdot)} g(x) \in \mathcal H_{\bar{\mathcal X}}$, this operator can be approximated by $\hat m_{\bar x} = \sum_{i=1}^n \alpha_i(\bar x)\langle f(x_i), f(\cdot)\rangle$, with $\alpha_i$ defined previously. This estimator converges in RKHS norm to $m_{\bar x}$ at rate $O((n\lambda)^{-1/2} + \lambda)$. So we need to link $m_{\bar x}, \hat m_{\bar x}$ with $\mu_{\bar x}, \hat\mu_{\bar x}$. We have:

$$\langle m_{\bar x}, \hat m_{\bar x}\rangle_{\mathcal H_{\mathcal X}} = \Big\langle \mathbb E_{p(x|\bar x)}\langle f(x), f(\cdot)\rangle_{\mathbb R^d},\ \sum_{i=1}^n \alpha_i(\bar x)\langle f(x_i), f(\cdot)\rangle_{\mathbb R^d}\Big\rangle_{\mathcal H_{\mathcal X}} \overset{(1)}{=} \sum_{i=1}^n \alpha_i(\bar x)\,\langle \mathbb E_{p(x|\bar x)} f(x), f(x_i)\rangle_{\mathbb R^d} = \langle \mu_{\bar x}, \hat\mu_{\bar x}\rangle_{\mathbb R^d}$$

where (1) holds by the reproducing property of the kernel $K_{\mathcal X}$ in $\mathcal H_{\mathcal X}$. We can similarly obtain:

$$\|m_{\bar x}\|^2_{\mathcal H_{\mathcal X}} \overset{(1)}{=} \langle \mathbb E_{p(x|\bar x)} f(x), \mathbb E_{p(x|\bar x)} f(x)\rangle_{\mathbb R^d} = \|\mu_{\bar x}\|^2$$

again by the reproducing property of $K_{\mathcal X}$, and finally:

$$\|\hat m_{\bar x}\|^2_{\mathcal H_{\mathcal X}} = \sum_{i,j} \alpha_i(\bar x)\alpha_j(\bar x)\,\langle f(x_i), f(x_j)\rangle_{\mathbb R^d} = \|\hat\mu_{\bar x}\|^2_{\mathbb R^d}$$

Pooling these 3 equalities, we have:

$$\|m_{\bar x} - \hat m_{\bar x}\|^2_{\mathcal H_{\mathcal X}} = \|m_{\bar x}\|^2 + \|\hat m_{\bar x}\|^2 - 2\langle m_{\bar x}, \hat m_{\bar x}\rangle = \|\mu_{\bar x}\|^2 + \|\hat\mu_{\bar x}\|^2 - 2\langle\mu_{\bar x}, \hat\mu_{\bar x}\rangle = \|\mu_{\bar x} - \hat\mu_{\bar x}\|^2_{\mathbb R^d}$$

We can conclude, since $\|m_{\bar x} - \hat m_{\bar x}\| \le O(\lambda + (n\lambda)^{-1/2})$.
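The kernel-regression weights at the heart of this estimator (the same $\alpha = (K_n + n\lambda I_n)^{-1} K_n$ used in Algorithm 1) can be sanity-checked numerically. The toy features and matrix sizes below are ours; the only claim tested is the limiting behavior of the weights.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
# prior features and an RBF kernel matrix K_n on the n images
Z = rng.normal(size=(n, 3))
sq = np.sum(Z**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / 2.0)

F = rng.normal(size=(n, d))  # averaged view representations f(x_i)

def centroids(K, F, lam):
    """Kernel estimate mu_hat = alpha @ F with alpha = (K + n*lam*I)^{-1} K,
    the conditional-mean-embedding weights on the training points."""
    n = len(K)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), K)
    return alpha @ F

mu_reg = centroids(K, F, 0.01)
mu_noreg = centroids(K, F, 0.0)
```

With λ → 0 the weight matrix tends to the identity, so each estimated centroid reduces to the image's own averaged representation; with λ > 0, centroids borrow strength from kernel-similar neighbors, which is what reconnects the augmentation graph.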

HYPOTHESIS

Theorem. Assume Assumptions 2 and 3 hold for a reproducing kernel $K_{\bar{\mathcal X}}$ and augmentation distribution A. Let f ∈ F be ϵ′-aligned, and let $(\bar x_i)_{i\in[1..n]}$ be n iid samples drawn from $p(\bar x)$. We have:

$$\mathcal L^d_{unif}(f) \le \mathcal L_{sup}(f) \le \mathcal L^d_{unif}(f) + 4D\big(2\epsilon' + \beta_n(K_{\bar{\mathcal X}})\epsilon\big) + O(n^{-1/4})$$

where $\beta_n(K_{\bar{\mathcal X}}) = \big(\frac{\lambda_{min}(K_n)}{\sqrt n} + \sqrt n\lambda\big)^{-1} = O(1)$ for $\lambda = O(1/\sqrt n)$, $K_n = (K_{\bar{\mathcal X}}(\bar x_i, \bar x_j))_{i,j\in[1..n]}$, D is the maximal diameter of the graphs $G_y$, $y\in\mathcal Y$, and $\lambda_{min}(K_n)$ is the minimal eigenvalue of $K_n$.

Edges in $E_K$. For this bound, we use Theorem 3 to approximate $\mu_{\bar u}$ and then derive a bound from the property of $G^\epsilon_K$. Let $(x_k)_{k\in[1..n]} \sim p(x_k|\bar x_k)$ be n iid samples. By Theorem 3, for all j ∈ J, $\hat\mu_{\bar u_j}$ converges to $\mu_{\bar u_j}$ in $\ell_2$ norm at rate $O(n^{-1/4})$, where $\hat\mu_{\bar u_j} = \sum_{k,l=1}^n \alpha_{k,l}\, K_{\bar{\mathcal X}}(\bar x_l, \bar u_j) f(x_k)$ and $\alpha_{k,l} = [(K_n + n\lambda I_n)^{-1}]_{k,l}$.

PROOF. Let $a, b, c \in \bar{\mathcal X}$. We consider the distance $d(x, y) = K(x,x) + K(y,y) - 2K(x,y)$ (it is a distance since K is a reproducing kernel, so it can be expressed as $K(\cdot,\cdot) = \langle\phi(\cdot), \phi(\cdot)\rangle$). We distinguish two cases. Case 1: we assume $K(a,c) \ge K(b,c)$ and use the triangle inequality below. Using the previous lemma, and because $(\bar u_j, \bar u_{j+1}) \in E_K$, we have:

$$\|C\|_2^2 = \sum_{i=1}^n \big(K(\bar x_i, \bar u_{j+1}) - K(\bar x_i, \bar u_j)\big)^2 \le \sum_{i=1}^n \big(\max(K(\bar u_{j+1}, \bar u_{j+1}), K(\bar u_j, \bar u_j)) - K(\bar u_j, \bar u_{j+1})\big)^2 \le n\epsilon^2$$

To conclude, we prove that $\|A\|_2 \le \|\alpha\|_2$, where $\alpha = (\alpha_{ij})_{i,j\in[1..n]}$. For any $v \in \mathbb R^n$, we have $\|Av\|^2 = \|\sum_{k,j=1}^n \alpha_{k,j} v_j f(x_k)\|^2$. So we can conclude that:

$$\sum_{j\in J}\|\mu_{\bar u_{j+1}} - \mu_{\bar u_j}\| \le \sum_{j\in J} \sqrt n\,\|(K_n + \lambda n I_n)^{-1}\|_2\,\epsilon + O(n^{-1/4}) = |J|\,\|(K_n + n\lambda I_n)^{-1}\|_2\,\sqrt n\,\epsilon + O(n^{-1/4})$$

We set $\beta_n(K_n) = \sqrt n\,\|(K_n + \lambda n I_n)^{-1}\|_2$. To see that $\beta_n(K_n) = \big(\frac{\lambda_{min}(K_n)}{\sqrt n} + \sqrt n\lambda\big)^{-1}$, with $\lambda_{min}(K_n) > 0$ the minimal eigenvalue of $K_n$, we apply the spectral theorem to the symmetric positive-definite kernel matrix $K_n$. Let $0 < \lambda_1 \le \lambda_2 \le ... \le \lambda_n$ be the eigenvalues of $K_n$.
According to the spectral theorem, there exists a unitary matrix U such that $K_n = UDU^T$ with $D = \mathrm{diag}(\lambda_1, ..., \lambda_n)$. So, by definition of the spectral norm:

$$\|(K_n + n\lambda I_n)^{-1}\|_2^2 = \lambda_{max}\big(U(D + n\lambda I_n)^{-1}U^T U(D + \lambda n I_n)^{-1}U^T\big) = \lambda_{max}(U\tilde D U^T) = (\lambda_1 + n\lambda)^{-2}$$

Finally, pooling the inequalities for edges over E and $E_K$, we have:

$$\|\mu_{\bar x} - \mu_{\bar x'}\| \le 2\epsilon'|I| + |J|\,\beta_n(K_n)\,\epsilon + O(n^{-1/4}) \le D\big(2\epsilon' + \beta_n(K_n)\epsilon\big) + O(n^{-1/4})$$

We conclude by plugging this inequality into Theorem 7.

Theorem 4. We assume Assumptions 2 and 3 hold for a reproducing kernel $K_{\bar{\mathcal X}}$ and augmentation distribution A. Let $(x_i, \bar x_i)_{i\in[1..n]} \sim A(x_i|\bar x_i)p(\bar x_i)$ be iid samples. Let $\hat\mu_{\bar x_j} = \sum_{i=1}^n \alpha_{i,j} f(x_i)$ with $\alpha_{i,j} = \big((K_n + n\lambda I_n)^{-1} K_n\big)_{ij}$ and $K_n = [K_{\bar{\mathcal X}}(\bar x_i, \bar x_j)]_{i,j\in[1..n]}$. Then:

$$\hat{\mathcal L}^d_{unif} - O(n^{-1/4}) \le \mathcal L_{sup}(f) \le \hat{\mathcal L}^d_{unif} + 4D\big(2\epsilon' + \beta_n(K_{\bar{\mathcal X}})\epsilon\big) + O(n^{-1/4})$$

PROOF. We just need to prove that, for any f ∈ F, $|\mathcal L^d_{unif}(f) - \hat{\mathcal L}^d_{unif}(f)| \le O(n^{-1/4})$, and we conclude through the previous theorem. We have:



With an abuse of notation, we write p(x) instead of p_X to simplify the presentation, as is common in the literature.
By Jensen's inequality, $\|\mu_{\bar x}\| \le \mathbb E_{A(x|\bar x)}\|f(x)\| = 1$, with equality iff f is constant on $\mathrm{supp}\, A(\cdot|\bar x)$.
https://github.com/PyTorchLightning/pytorch-lightning
The official model is available here: https://github.com/marshuang80/gloria



Figure 2: Alignment metric L align computed on the validation set during optimization of Decoupled Uniformity loss with various batch sizes n and a fixed latent space dimension d = 128. We use 100 positive samples per image to compute L align .

Figure 3: Empirical verification of our theory. The optimal ϵ * to add 100 edges between intra-class images in ϵ-Kernel graph is inversely correlated with the downstream accuracy, as suggested by Theorem 4. We use k = 20 bits and an RBF kernel.

Figure 4: How can we select a good kernel a priori? Downstream accuracy on RandBits CIFAR-10 is highly correlated with kernel quality, measured as the fraction of the 10 nearest neighbors in the kernel graph that belong to the same CIFAR-10 class (on the test set). We provide empirical evidence confirming our theory (Theorem 4 in particular), along with a new way to quantify the quality of a kernel K with respect to a downstream task. We perform experiments on the RandBits dataset (based on CIFAR-10) with k = 20 random bits (almost all points are disconnected in the augmentation graph) and SimCLR augmentations, for a given kernel $K_\sigma$ defined by $K_\sigma(\bar x, \bar x') = \mathrm{RBF}_\sigma(\mu(\bar x), \mu(\bar x'))$, where $\mu(\cdot)$ is the mean of the Gaussian distribution of $\bar x$ in the VAE latent space.


# Implementation. f: encoder (with projection head); K: kernel; aug: augmentation
# x: Tensor of shape [n, *]; lamb: hyper-parameter to estimate centroids
for x in loader:
    alphas = (K(x, x) + n * lamb * torch.eye(n)).inverse() @ K(x, x)
    x = aug(x, n_views)                    # shape=[n * n_views, *]
    z = f(x).view([n, n_views, d])         # shape=[n, n_views, d]
    mu = alphas.detach() @ z.mean(dim=1)   # shape=[n, d]
    off_diag = ~torch.eye(n, dtype=torch.bool)
    sq_dists = torch.cdist(mu, mu) ** 2    # pairwise ||mu_i - mu_j||^2
    loss = torch.log(torch.exp(-t * sq_dists[off_diag]).mean())

PROOF. For any $\bar x \in \bar{\mathcal X}$, since $f(x) \in S^{d-1}$, we have $\|\mu_{\bar x}\| = \|\mathbb E_{A(x|\bar x)} f(x)\| \le \mathbb E_{A(x|\bar x)}\|f(x)\| = 1$. As a result, $e^{-\|\mu_{\bar x}-\mu_{\bar x'}\|^2} \in I \overset{def}{=} [e^{-4}, 1]$ for any $\bar x, \bar x' \in \bar{\mathcal X}$. Since log is k-Lipschitz on I, then:

$\|\mu_i\| = \|\mathbb E_{x\sim A(\cdot|\bar x_i)} f(x)\| \le \mathbb E\|f(x)\| = 1$, since we assume $f(x) \in S^{d-1}$. So all $(\mu_i)$ are bounded by 1. Let $\mu = (\mu_i)_{i\in[1..n]}$.

$\sum_{y=1}^C \mu_y = 0$ and $\|\mu_y\| = 1$. So all centroids must form a regular (C−1)-simplex inscribed in the hypersphere $S^{d-1}$ centered at 0. Furthermore, since $\|\mu_y\| = 1$, we have equality in Jensen's inequality $\|\mu_y\| = \|\mathbb E_{p(\bar x|y)A(x|\bar x)} f(x)\| \le \mathbb E_{p(\bar x|y)A(x|\bar x)}\|f(x)\| = 1$, so f must be perfectly aligned for all samples belonging to the same class: $\forall \bar x, \bar x' \sim p(\cdot|y), f(\bar x) = f(\bar x')$.

follows according to the previous lemma. So we can conclude:

$$\exp \mathcal L^d_{unif}(f) \le \mathbb E_{p(y)p(y')} \exp(-\|\mu_y - \mu_{y'}\|^2) = \exp \mathcal L_{sup}$$

Upper bound. For this bound, we will use the following equality (by definition of the variance):

$$\|\mathbb E_{p(\bar x|y)}\mu_{\bar x}\|^2 = \|\mathbb E_{p(\bar x|y)}\mu_{\bar x}\|^2 - \mathbb E_{p(\bar x|y)}\|\mu_{\bar x}\|^2 + \mathbb E_{p(\bar x|y)}\|\mu_{\bar x}\|^2 = -\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y) + \mathbb E_{p(\bar x|y)}\|\mu_{\bar x}\|^2$$

$$\exp \mathcal L_{sup} = \mathbb E_{p(y)p(y')}\exp(-\|\mu_y-\mu_{y'}\|^2) \le \mathbb E_{p(y)p(y')}\exp\Big(-\mathbb E_{p(\bar x|y)p(\bar x'|y')}\|\mu_{\bar x}-\mu_{\bar x'}\|^2 + 2\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y_m)\Big) = \exp\Big(2\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y_m)\Big)\,\mathbb E_{p(y)p(y')}\exp\big(-\mathbb E_{p(\bar x|y)p(\bar x'|y')}\|\mu_{\bar x}-\mu_{\bar x'}\|^2\big) \le \exp\Big(2\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y_m)\Big)\exp \mathcal L^d_{unif}$$

We set $y_m = \arg\max_{(j,y)\in[1..d]\times\mathcal Y} \mathrm{Var}(\mu^j_{\bar x}\mid y)$ and conclude by taking the log of the previous inequality. Variance upper bound. Starting from the definition of the conditional variance:

$$\sum_{j=1}^d \mathrm{Var}(\mu^j_{\bar x}\mid y_m) = \mathbb E_{p(\bar x|y_m)}\|\mu_{\bar x}\|^2 - \|\mathbb E_{p(\bar x|y_m)}\mu_{\bar x}\|^2 = \mathbb E_{p(\bar x|y_m)}\big(\|\mu_{\bar x}\| - \|\mathbb E_{p(\bar x|y_m)}\mu_{\bar x}\|\big)\big(\|\mu_{\bar x}\| + \|\mathbb E_{p(\bar x|y_m)}\mu_{\bar x}\|\big)$$

PROOF. Let $y \in \mathcal Y$ and $\bar x, \bar x' \sim p(\bar x|y)p(\bar x'|y)$. By Assumption 2, there exists a path of length p ≤ D connecting $\bar x, \bar x'$ in $\tilde G$. So there exist $(\bar u_i)_{i\in[1..p+1]} \in \bar{\mathcal X}$ and $(u_i)_{i\in I} \in \mathcal X$ s.t. $\forall i \in I, u_i \sim A(u_i|\bar u_i) \cap A(u_i|\bar u_{i+1})$, and $\forall j \in J, \max(K(\bar u_j, \bar u_j), K(\bar u_{j+1}, \bar u_{j+1})) - K(\bar u_j, \bar u_{j+1}) \le \epsilon$, with (I, J) a partition of [1..p]. Furthermore, $\bar u_1 = \bar x$ and $\bar u_{p+1} = \bar x'$. As a result, we have:

$$\|\mu_{\bar x}-\mu_{\bar x'}\| = \|\mu_{\bar u_1}-\mu_{\bar u_{p+1}}\| = \Big\|\sum_{i=1}^p (\mu_{\bar u_{i+1}}-\mu_{\bar u_i})\Big\| \le \sum_{i=1}^p \|\mu_{\bar u_{i+1}}-\mu_{\bar u_i}\| = \sum_{i\in I}\|\mu_{\bar u_{i+1}}-\mu_{\bar u_i}\| + \sum_{j\in J}\|\mu_{\bar u_{j+1}}-\mu_{\bar u_j}\|$$

Edges in E. As in the proof of Theorem 2, we use the ϵ′-alignment of f to derive a bound:

$$\sum_{i\in I}\|\mu_{\bar u_{i+1}}-\mu_{\bar u_i}\| = \sum_{i\in I}\|\mu_{\bar u_{i+1}} - f(u_i) + f(u_i) - \mu_{\bar u_i}\| \le \sum_{i\in I}\big(\|\mu_{\bar u_{i+1}}-f(u_i)\| + \|f(u_i)-\mu_{\bar u_i}\|\big) \overset{(1)}{\le} \sum_{i\in I}\big(\mathbb E_{p(u|\bar u_{i+1})}\|f(u)-f(u_i)\| + \mathbb E_{p(u|\bar u_i)}\|f(u_i)-f(u)\|\big) \overset{(2)}{\le} \sum_{i\in I}(\epsilon' + \epsilon') = 2\epsilon'|I|$$

where (1) holds by Jensen's inequality and (2) because f is ϵ′-aligned.

$$\|\mu_{\bar u_{j+1}}-\mu_{\bar u_j}\| = \|\mu_{\bar u_{j+1}} - \hat\mu_{\bar u_{j+1}} + \hat\mu_{\bar u_{j+1}} - \hat\mu_{\bar u_j} + \hat\mu_{\bar u_j} - \mu_{\bar u_j}\| \le \|\mu_{\bar u_{j+1}}-\hat\mu_{\bar u_{j+1}}\| + \|\hat\mu_{\bar u_{j+1}}-\hat\mu_{\bar u_j}\| + \|\hat\mu_{\bar u_j}-\mu_{\bar u_j}\| \overset{(1)}{\le} O(n^{-1/4}) + \|\hat\mu_{\bar u_{j+1}}-\hat\mu_{\bar u_j}\|$$

where (1) holds by Theorem 3. We then need the following lemma to conclude: Lemma. For any $a, b, c \in \bar{\mathcal X}$ and any reproducing kernel K, $\max(K(a,a), K(b,b)) - K(a,b) \ge |K(a,c) - K(b,c)|$.

PROOF. Let $d(a,b) = \|\phi(a) - \phi(b)\|_{\mathcal{H}}$ denote the RKHS distance, so that $d(a,b)^2 = K(a,a) + K(b,b) - 2K(a,b)$. Assume first $K(a,c) \ge K(b,c)$. Applying the triangular inequality:

$d(a,b) + d(a,c) \ge d(b,c) \implies K(a,a) + K(b,b) - 2K(a,b) + K(a,a) + K(c,c) - 2K(a,c) \ge K(b,b) + K(c,c) - 2K(b,c) \implies K(a,a) - K(a,b) \ge K(a,c) - K(b,c) \ge 0$

So $\max(K(a,a), K(b,b)) - K(a,b) \ge |K(a,c) - K(b,c)|$. Now assume $K(b,c) \ge K(a,c)$. We apply the triangular inequality symmetrically:

$d(a,b) + d(b,c) \ge d(a,c) \implies K(b,b) - K(a,b) \ge K(b,c) - K(a,c) \ge 0$

So $\max(K(a,a), K(b,b)) - K(a,b) \ge |K(a,c) - K(b,c)|$, concluding the proof.

Then, by definition of $\hat\mu_{\bar u_j}$:

$\|\hat\mu_{\bar u_{j+1}} - \hat\mu_{\bar u_j}\| = \Big\|\sum_{k,l=1}^n \alpha_{k,l} K(x_l, \bar u_{j+1}) f(x_k) - \sum_{k,l=1}^n \alpha_{k,l} K(x_l, \bar u_j) f(x_k)\Big\| = \|AC\|$

where $A = \big(\sum_{k=1}^n \alpha_{kj} f(x_k)_i\big)_{i,j} \in \mathbb{R}^{d\times n}$ ($f(\cdot)_i$ is the $i$-th component of $f(\cdot)$) and $C = \big(K(x_l, \bar u_{j+1}) - K(x_l, \bar u_j)\big)_l \in \mathbb{R}^{n\times 1}$. So, using the property of the spectral $\ell_2$ norm, we have:

$\|\hat\mu_{\bar u_{j+1}} - \hat\mu_{\bar u_j}\| = \|AC\| \le \|A\|_2 \|C\|_2$

(1) holds by the Cauchy–Schwarz inequality and because $f(\cdot) \in \mathcal{S}^{d-1}$, and (2) holds by definition of the spectral $\ell_2$ norm. So we have, $\forall v \in \mathbb{R}^n$, $\|Av\| \le \|\alpha\|_2 \|v\|$, showing that $\|A\|_2 \le \|\alpha\|_2$.
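As a quick numerical sanity check (illustrative only, not part of the proof), the spectral-norm step $\|AC\| \le \|A\|_2 \|C\|_2$ can be verified with random matrices of the shapes used above:

```python
import numpy as np

# Sanity check of the operator-norm bound ||A C|| <= ||A||_2 ||C||_2, with A
# standing in for the d x n coefficient matrix and C for the n x 1 vector of
# kernel differences from the proof (shapes and values are illustrative).
rng = np.random.default_rng(0)
d, n = 16, 32
A = rng.normal(size=(d, n))
C = rng.normal(size=(n, 1))

lhs = np.linalg.norm(A @ C)
bound = np.linalg.norm(A, ord=2) * np.linalg.norm(C)  # largest singular value of A
assert lhs <= bound + 1e-12
```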

So we can conclude that $\beta_n(K_n) = \big(\frac{\lambda_1}{\sqrt n} + \sqrt n\,\lambda\big)^{-1} = O(1)$ for $\lambda = O\big(\frac{1}{\sqrt n}\big)$.
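The estimator $\hat\mu$ whose consistency this rate controls is a kernel ridge regression; a minimal NumPy sketch, assuming the standard ridge-regularized form $\alpha = (K_n + n\lambda I)^{-1}$ with $\lambda = O(1/\sqrt n)$ as above (function and variable names are ours, for illustration):

```python
import numpy as np

def empirical_cme(K_n, K_cross, F, lam):
    """Ridge-regularized conditional mean embedding estimate (sketch).

    K_n:     (n, n) Gram matrix K(xbar_i, xbar_j) on the training anchors
    K_cross: (n, m) cross Gram matrix K(xbar_i, query_j)
    F:       (n, d) encoder features f(x_i) of the augmented samples
    lam:     ridge parameter; lam = O(1/sqrt(n)) per the bound above
    Returns the (m, d) estimated centroids mu_hat at the query points.
    """
    n = K_n.shape[0]
    alpha = np.linalg.solve(K_n + n * lam * np.eye(n), K_cross)  # (n, m)
    return alpha.T @ F
```

As $\lambda \to 0$ and with queries equal to the anchors, $\alpha$ tends to the identity and the estimate recovers the training features exactly.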

$(x_j)_{j\in[1..n]}$. Then the empirical decoupled uniformity loss $\hat L^d_{unif}(f) = \log \frac{1}{n(n-1)} \sum_{i \ne j} \exp(-\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2)$ verifies, for any $\epsilon'$-weak aligned encoder $f \in \mathcal{F}$:

$\Big|\log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2} - \log \mathbb{E}_{p(x)p(x')} e^{-\|\mu_x - \mu_{x'}\|^2}\Big| \le \Big|\log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2} - \log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\mu_{x_i} - \mu_{x_j}\|^2}\Big| + \Big|\log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\mu_{x_i} - \mu_{x_j}\|^2} - \log \mathbb{E}_{p(x)p(x')} e^{-\|\mu_x - \mu_{x'}\|^2}\Big|$

The second term in the last inequality is bounded by $O(\frac{1}{\sqrt n})$. As for the first term, we use the fact that $\log$ is $k$-Lipschitz continuous on $[e^{-4}, 1]$ and $\exp$ is $k'$-Lipschitz continuous on $[-4, 0]$, so:

$\Big|\log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2} - \log \frac{1}{n(n-1)}\sum_{i\ne j} e^{-\|\mu_{x_i} - \mu_{x_j}\|^2}\Big| \le \frac{k}{n(n-1)}\sum_{i\ne j} \big|e^{-\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2} - e^{-\|\mu_{x_i} - \mu_{x_j}\|^2}\big| \le \frac{kk'}{n(n-1)}\sum_{i\ne j} \big|\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2 - \|\mu_{x_i} - \mu_{x_j}\|^2\big|$

Finally, we conclude using the boundedness of $\hat\mu_x$ and $\mu_x$ by a constant $C$:

$\big|\|\hat\mu_{x_i} - \hat\mu_{x_j}\|^2 - \|\mu_{x_i} - \mu_{x_j}\|^2\big| = (\|\hat\mu_{x_i} - \hat\mu_{x_j}\| + \|\mu_{x_i} - \mu_{x_j}\|)\,\big|\|\hat\mu_{x_i} - \hat\mu_{x_j}\| - \|\mu_{x_i} - \mu_{x_j}\|\big| \le 4C\big|\|\hat\mu_{x_i} - \hat\mu_{x_j}\| - \|\mu_{x_i} - \mu_{x_j}\|\big| \le 4C\|\hat\mu_{x_i} - \hat\mu_{x_j} - (\mu_{x_i} - \mu_{x_j})\| \le 4C(\|\hat\mu_{x_i} - \mu_{x_i}\| + \|\hat\mu_{x_j} - \mu_{x_j}\|) = O\Big(\frac{1}{n^{1/4}}\Big)$
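For concreteness, the empirical decoupled uniformity loss analyzed above can be written in a few lines; a minimal NumPy sketch (ours, not the paper's implementation), assuming centroids are approximated by averaging the encoder outputs of the augmented views and using the temperature $t = 2$ from the experimental section:

```python
import numpy as np

def decoupled_uniformity(z, t=2.0):
    """Empirical decoupled uniformity loss (minimal sketch).

    z: (n, n_views, d) array of encoder outputs for the augmented views.
    The centroid mu_i averages the views of sample i; the loss is the log
    of the average of exp(-t * ||mu_i - mu_j||^2) over pairs i != j.
    """
    mu = z.mean(axis=1)                                    # (n, d) centroids
    sq_dist = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    mask = ~np.eye(len(mu), dtype=bool)                    # exclude i == j
    return np.log(np.exp(-t * sq_dist[mask]).mean())
```

With all centroids equal, the loss attains its maximum value 0; it decreases as centroids spread apart, which is what a minimizer exploits.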

Theorem 1 (Optimality of Decoupled Uniformity). Given $n$ points $(x_i)_{i\in[1..n]}$ such that $n \le d + 1$, any optimal encoder $f^*$ minimizing $L^d_{unif}$

VAE — Decoupled Unif (ours): 82.74 ±0.18, 68.75 ±0.24, 68.42 ±0.51, 68.58 ±0.17

Linear evaluation accuracy (%) on RandBits-CIFAR10 with ResNet18 for 200 epochs. For the VAE, we use a ResNet18 backbone. Once trained, we use its representation to define the kernel $K_{VAE}$ in the kernel decoupled uniformity loss.
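Once the generative model is trained, defining the prior kernel from its representation is straightforward; a minimal NumPy sketch, assuming an RBF kernel on the embeddings, with a median-heuristic bandwidth as an illustrative default (the paper's actual bandwidth choice may differ):

```python
import numpy as np

def rbf_kernel_from_features(z, gamma=None):
    """Build a prior kernel (e.g., K_VAE) from generative-model features.

    z: (n, d) array of VAE (or GAN) encoder representations.
    gamma: RBF bandwidth; defaults to the median heuristic
           1 / (2 * median squared pairwise distance) -- an illustrative choice.
    """
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # (n, n)
    if gamma is None:
        gamma = 1.0 / (2.0 * np.median(sq_dist[sq_dist > 0]))
    return np.exp(-gamma * sq_dist)
```

The resulting Gram matrix is symmetric with unit diagonal and entries in (0, 1], so it directly plays the role of the similarity graph over original samples.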



When the augmentation overlap hypothesis is not fulfilled, generative models can provide a good kernel to connect intra-class points that are not connected by augmentations.

If image attributes are accessible (e.g., bird color or size for CUB200), they can be leveraged as priors in our framework to improve the representation.

AUC scores (%) under linear evaluation for discriminating 5 pathologies on CheXpert. A ResNet18 backbone is trained for 400 epochs (batch size N = 1024) without labels on the official CheXpert training set, and results are reported on the validation set.



Linear evaluation accuracy (%) after training for 400 epochs with batch size n = 256 and varying temperature t in the Decoupled Uniformity loss with SimCLR augmentations. t = 2 gives overall the best results, similarly to the uniformity loss in (57).

Linear evaluation accuracy (%) after training for 200 epochs with batch size n, a ResNet18 backbone and latent dimension d = 128. Decoupled Uniformity is less sensitive to batch size than SimCLR thanks to its decoupling between positives and negatives, similarly to (60).

that $\lambda = \frac{0.01}{\sqrt n}$ yields the best results. We fixed this value for all our experiments in this study.

A.6 KERNEL CHOICE ON RANDBITS EXPERIMENT

In our experiments on RandBits, we used the RBF kernel in Decoupled Uniformity, but other kernels can be considered. Here, we compare our approach with a cosine kernel on RandBits with k = 10 and k = 20 bits. There is no hyper-parameter to tune with the cosine kernel. From Table 11, we see that cosine gives results comparable to RBF for k = 10 bits, but it is not appropriate for k = 20 bits.
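To make the comparison concrete, here is a minimal NumPy sketch (illustrative, not the paper's code) of the cosine kernel, together with a check that, once features are $\ell_2$-normalized, the RBF kernel is a fixed monotone transform of the cosine kernel — which is why cosine has nothing to tune while RBF requires a bandwidth:

```python
import numpy as np

def cosine_kernel(z):
    """Cosine kernel: normalized inner product, no hyper-parameter to tune."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    return zn @ zn.T

# On l2-normalized features, ||z - z'||^2 = 2 - 2 cos(z, z'), so the RBF
# kernel reduces to a fixed monotone transform of the cosine kernel.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 16))
zn = z / np.linalg.norm(z, axis=1, keepdims=True)
gamma = 0.5  # illustrative bandwidth
sq_dist = ((zn[:, None, :] - zn[None, :, :]) ** 2).sum(-1)
rbf = np.exp(-gamma * sq_dist)
assert np.allclose(rbf, np.exp(-gamma * (2 - 2 * cosine_kernel(z))))
```

On unnormalized features the two kernels genuinely differ, and the RBF bandwidth then matters, consistent with the gap observed at k = 20 bits.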

Linear evaluation after training on RandBits-CIFAR10 with ResNet18 for 200 epochs. RBF and cosine kernels are evaluated.

A.7 LARGER PRE-TRAINED GENERATIVE MODEL INDUCES BETTER PRIOR

We argue that pre-training larger generative models on larger datasets (e.g., ImageNet-1K) improves the prior on smaller-scale datasets, and further improves the final representations obtained with our method. We tested this hypothesis on CIFAR-10 with BigBiGAN as prior, compared to DCGAN and to the other approaches without prior.

We evaluate Kernel Decoupled Uniformity with a BigBiGAN pre-trained on ImageNet as prior knowledge, and compare it with a shallow DCGAN pre-trained on CIFAR-10 as prior. We train a ResNet18 on CIFAR-10 and report linear evaluation accuracy. Generative models pre-trained on larger datasets improve the final representation.

CUB200-2011 (56). This dataset is composed of 200 fine-grained bird species, with 5994 training images and 5794 test images rescaled to 224 × 224.

UTZappos (62). This dataset is composed of images of shoes from zappos.com. In order to be comparable with the literature on weakly supervised learning, we follow (55) and split it into 35017 training images and 15008 test images resized to 32 × 32.

This dataset is composed of 10420 3D brain MRI images of size 121 × 145 × 121 with 1.5mm³ spatial resolution. Only healthy subjects are included.

BIOBD (32). This is also a brain MRI dataset, including 662 3D anatomical images used for downstream classification. Each 3D volume has size 121 × 145 × 121. It contains 306 patients with bipolar disorder and 356 healthy controls, and we aim at discriminating patients from controls. It is particularly suited to investigate biomarker discovery inside the brain (31).

(1) follows from the standard inequality $\|a - b\| \ge \big|\|a\| - \|b\|\big|$ (reverse triangle inequality). (2) follows from the boundedness $\|\mu_x\| \le 1$ and Jensen's inequality. (3) is again Jensen's inequality.

(1) follows from Jensen's inequality and the definition of $\mu_x$. (2) follows because $f$ is $\epsilon$-weak aligned.

Theorem 3 (Conditional Mean Embedding estimation). We assume that $\forall g \in \mathcal{H}_{\mathcal{X}}$, $\mathbb{E}_{p(x|\cdot)}\, g(x) \in \mathcal{H}_{\mathcal{X}}$. Let $\{(x_1, \bar x_1), \ldots, (x_n, \bar x_n)\}$ be iid samples from $p(x|\bar x)p(\bar x)$. Let $\Phi_n = [\phi(x_1), \ldots, \phi(x_n)]$ and $\Psi$

