UNSUPERVISED ANOMALY DETECTION FROM SEMANTIC SIMILARITY SCORES

Abstract

In this paper we present SemSAD, a simple and generic framework for detecting examples that lie out-of-distribution (OOD) with respect to a given training set. The approach is based on learning a semantic similarity measure that finds, for a given test example, the semantically closest example in the training set, and then using a discriminator to classify whether the two examples show sufficient semantic dissimilarity for the test example to be rejected as OOD. We outperform previous approaches for anomaly, novelty, or out-of-distribution detection in the visual domain by a large margin. In particular, we obtain AUROC values close to one for the challenging task of detecting examples from CIFAR-10 as out-of-distribution given CIFAR-100 as in-distribution, without making use of label information.

1. INTRODUCTION

Anomaly detection or novelty detection aims at identifying patterns in data that are significantly different from what is expected. This problem is inherently a binary classification problem that classifies examples either as in-distribution or out-of-distribution (OOD), given a sufficiently large sample from the in-distribution (the training set). A natural approach to OOD detection is to learn a density model from the training data and use the likelihood it assigns to test examples. However, in practice this approach frequently fails for high-dimensional data (Nalisnick et al. (2019)), where it has been shown that deep generative models can assign higher likelihood to OOD examples than to in-distribution examples. This surprising result is likely a consequence of how existing deep generative models generalise. For example, Variational Autoencoders (Kingma & Welling (2014)) generalise by superposition of examples, which is a consequence of the stochastic nature of the posterior that can map different examples to the same point in latent space. As superposition is an averaging process that reduces the information content, it can be expected that examples of lower complexity than the training examples map to high-likelihood regions in latent space. Note that it is possible for a datapoint to have high likelihood under a distribution yet be nearly impossible to sample, a property related to the asymptotic equipartition property in information theory (Cover & Thomas (2001)). For autoregressive generative models, such as PixelCNN (van den Oord et al. (2016)), it has been shown that the pixel-by-pixel generation process is strongly determined by the local surroundings of pixels (Chen et al. (2018)); the fact that nearby pixels of training examples frequently share the same colour can explain why monochromatic images are assigned a high likelihood (Nalisnick et al. (2019)).
Local pixel correlations also seem to be responsible for the failure of generative models based on Normalising Flows to assign correct likelihood values to OOD examples (Schirrmeister et al. (2020)). As a consequence, most current OOD detection approaches make use of a score function s(x) to classify test examples as in-distribution or OOD. If the examples of the training set are labelled, a simple score is given by s(x) = max_y p(y|x), with p(y|x) the softmax probability for predicting class labels y ∈ {1, .., K} (Hendrycks & Gimpel (2017)). If s(x) is below a threshold, the test example is classified as OOD. Labelled data allow learning representations that are associated with the semantic information shared by the examples in the training set, which can be used for OOD detection. However, the approach suffers from the problem that the scores for in-distribution examples can be widely distributed across the interval of possible score values, s(x) ∈ [1/K, 1], especially if the number of labels is low and the classification task is hard, which strongly increases the false-positive rate. Consequently, better performance was found for approaches that use labelled data to learn a higher-dimensional representation that encodes semantic information (Lee et al. (2018b)). In this representation space the in-distribution occupies just a small volume, and a random feature vector would most likely be classified as OOD. Another simplification arises if the OOD detection problem is supervised, with some OOD examples labelled as such and contributing to the training set. In this case the OOD detection problem boils down to an unbalanced classification problem (Chalapathy & Chawla (2019)). In general, OOD detection benefits from separating the factors of variation of the in-distribution into relevant (e.g. object identity) and irrelevant (e.g. compression artefacts) using prior knowledge, where the relevant factors are typically those that carry salient semantic information. In line with the arguments put forward by Ahmed & Courville (2020), this separation helps an OOD model to systematically generalise, e.g. by specifying whether we are allowed to re-colour or add noise to images for data augmentation. Generalisation over the training set is necessary, as learning under insufficient inductive bias would result in misclassification of examples from an in-distribution test set as OOD. Labelled data provide this additional information, as relevant factors can be defined as those that help the classification task, with the limitation that there might be more factors involved in characterising the in-distribution than those needed to predict the labels.

In this work, we introduce a general framework for OOD detection that does not require label information. Our framework can be widely applied to OOD detection tasks, including visual, audio, and textual data, with the only limitation that transformations that conserve the semantics of training examples must be known a priori, such as geometric transformations for images, proximity of time intervals for audio recordings (van den Oord et al. (2018)), or randomly masking a small fraction of words in a sentence or paragraph (Devlin et al. (2019)). For visual data we show new state-of-the-art OOD classification accuracies for standard benchmark datasets, surpassing even the accuracies of methods that include labels as additional information. The key contributions of this work are a new OOD detection framework that is applicable in the absence of labelled in-distribution data or OOD examples labelled as such, and a strong improvement in OOD detection for challenging tasks in the visual domain.

2. RELATED WORK

The work that is most related to ours is Geometric-Transformation Classification (GEOM), proposed by Golan & El-Yaniv (2018) and improved by Bergman & Hoshen (2020), which belongs to the class of self-supervised learning approaches (Hendrycks et al. (2019b)).
The central idea of GEOM is to construct an auxiliary in-distribution classification task by transforming each image of the training set with one of 72 different combinations of geometric transformations of fixed strength, such as rotation, reflection, and translation. The task is to predict which of the 72 transformations has been applied, given a transformed image. GEOM assigns a high OOD score to examples with high prediction uncertainty. The relevant features learned by this task are salient geometrical features, such as the typical orientation of an object. Our approach differs from GEOM in that we define the relevant features as those that are invariant under geometric and other transformations, such as cropping and colour jitter, whose strength is chosen to be moderate so as not to change the semantics of the images in the training set.

3. METHOD

An intuitive approach for OOD detection is to learn a representation that densely maps the in-distribution to a small region within a lower-dimensional space (latent space), with the consequence that OOD examples will be found outside this region with high probability. The representation should include the salient semantic information of the training set, to ensure that test examples from the in-distribution are not misclassified as OOD, but disregard irrelevant factors of variation that would prevent a dense mapping. As learning this mapping with Autoencoders is difficult, we split the OOD detection task into finding a semantically dense mapping of the in-distribution onto a d-dimensional unit-hypersphere by contrastive learning, followed by classifying neighbouring examples on the unit-hypersphere as semantically close or distant.
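As an illustration of the second step's ingredients, retrieving the nearest neighbour and the k-nearest-neighbour semantic neighbourhood on the unit-hypersphere reduces to cosine-similarity lookups. A minimal NumPy sketch (the feature matrix is a hypothetical encoder output with one unit-norm row per training example, not the authors' implementation):

```python
import numpy as np

def nearest_neighbour(h_train, h_test):
    # x_next = argmax_x h(x)^T h(x_test): on the unit-hypersphere the dot
    # product equals the cosine similarity
    return int(np.argmax(h_train @ h_test))

def semantic_neighbourhood(h_train, i, k=4):
    # S(x): the k nearest training neighbours of example i (excluding i itself)
    sims = h_train @ h_train[i]
    order = np.argsort(-sims)
    return [int(j) for j in order if j != i][:k]
```

In practice the feature matrix would be the output of the trained encoder h(x) evaluated on the training set.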

3.1. LEARNING SEMANTIC SIMILARITY

A contrastive objective can be used to align feature vectors h(x) ∈ R^d that are semantically similar, while at the same time distributing the examples of the training set almost uniformly over the unit-hypersphere (Wang & Isola (2020); Chen et al. (2020)). This representation allows identifying, for any test example, the semantically closest example from the training set. The mapping h(x) = f(x)/||f(x)|| can be learned by training a deep neural network f(x) to minimise the contrastive loss

L[h] = -E_{(x,x')~T_h(x,x')} [ log ( e^{h(x)^T h(x')/τ} / E_{x_neg~T_h} e^{h(x)^T h(x_neg)/τ} ) ],   (1)

where τ denotes a temperature parameter. Here, each positive pair (x, x') is the result of sampling from a distribution of transformations T_h(x, x') that conserve the semantics between x and x', with T_h(x') the marginal of T_h(x, x'). For datasets used to benchmark object recognition tasks, samples (x, x') ~ T_h(x, x') can be generated by picking a single example from the training set and independently applying random transformations, such as geometric transformations, colour distortions, or cropping (Appendix D). Negative pairs can be generated by applying random transformations to different training examples. We emphasise that the types of transformations and their strengths essentially define the semantics we want to encode and thus determine whether, for example, the image of a black swan is classified as OOD for an in-distribution that contains only white swans. The design of transformations that capture the underlying semantics of the training dataset requires either higher-level understanding of the data or extensive sampling of different combinations of transformations with evaluation on an in-distribution validation set.
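For illustration, the contrastive loss above can be approximated with in-batch negatives, where the other examples of a minibatch stand in for samples from the marginal T_h. A minimal NumPy sketch under that assumption (not the authors' implementation):

```python
import numpy as np

def normalize(f):
    # h(x) = f(x)/||f(x)||: project feature vectors onto the unit-hypersphere
    return f / np.linalg.norm(f, axis=1, keepdims=True)

def contrastive_loss(f_a, f_b, tau=1.0):
    """InfoNCE-style loss for a batch of positive pairs (row i of f_a with
    row i of f_b); all other rows serve as in-batch negatives."""
    h_a, h_b = normalize(f_a), normalize(f_b)
    sim = h_a @ h_b.T / tau                    # pairwise similarities / temperature
    # row-wise log-softmax; the diagonal holds the positive pair
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

With well-aligned positives and near-orthogonal negatives the loss approaches zero; mismatched pairs give a large loss.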

3.2. LEARNING SEMANTIC DIFFERENCES

As the encoder h(x) maps any example x onto the unit-hypersphere, OOD examples can align with feature vectors of training examples without being semantically similar (Fig. 1) and would therefore not be detected as OOD. We therefore train a score function s(x, x') to detect semantic differences between nearby examples on the unit-hypersphere, which are determined by the readily trained encoder h(x). A statistically meaningful score is given by the log-likelihood ratio between the probability P_pos that a test example x_test and its nearest neighbour in the training set, x_next = argmax_x h^T(x) h(x_test), are semantically close and the probability P_neg that the same two examples are semantically distant:

s*(x_next, x_test) = log [ P_pos(x_next, x_test) / P_neg(x_next, x_test) ].   (2)

For the distribution of positive examples we use P_pos(x, x') = (1 - z) T_p(x, x') + z 1_{x' ∈ S(x)} / |S(x)|, with z a Bernoulli-distributed random variable of mean µ. Here S(x) denotes the semantic neighbourhood of x, defined by the k-nearest neighbours of x using the cosine similarity h^T(x) h(x') as semantic similarity measure (Fig. 2). The transformations T_p are similar to T_h but of reduced strength, to ensure that the relevant factors of variation are conserved (Fig. 3 and Appendix D). For negative examples we take P_neg(x, x') = T_n(x') T_n(x), with T_n(x) the marginal, which implies that x and x' are almost always derived from two different images of the training set. The negative transformations T_n are allowed to include stronger and more diverse transformations than T_p (Fig. 3 and Appendix D). In principle, harder negative examples can be constructed, for instance by augmenting P_neg with pairs that are independent transformations of the same example (Choi & Chung (2020)). However, we found that this reduces performance (Table 2). As shown in Appendix A, the score s*(x, x') maximises the training objective

J[s; γ] = E_{(x,x')~P_pos} [ log σ(a) ] + γ E_{(x,x')~P_neg} [ log (1 - σ(a)) ],   (3)

with σ(a) = 1/(1 + e^{-a}) and a = s(x, x') - log γ, for any γ > 0 (Appendix A).

We introduced γ because it is notoriously hard to learn ratios of probability densities in high-dimensional spaces, which is a central problem of generative adversarial networks (Azadi et al. (2019)). In general, s(x, x') learned by the objective Eq. 3 can deviate significantly from the optimal, generalising likelihood ratio s*(x, x'). This deviation is most apparent where P_pos(x, x') is close to zero while P_neg(x, x') is non-zero and vice versa, as shown in Fig. 4. In this case the objective can be maximised by any decision boundary that lies in the region between the distributions P_pos(x, x') and P_neg(x, x'). To smooth the score function s(x, x') we sample γ at each iteration of the learning process and thereby effectively average over an ensemble of gradients (Appendix B). Inspired by the lottery ticket hypothesis, according to which training a deep neural network under a constant objective mainly affects the weights of a small subnetwork (Frankle & Carbin (2019)), we can reason that sampling over γ affects the weights of an ensemble of overlapping subnetworks. As a consequence, s(x, x') is the prediction of an ensemble of models, which typically results in higher prediction accuracies, less variance with respect to weight initialisation, and higher robustness to overfitting. The effect of uniformly sampling γ on stabilising the decision boundary, and thus keeping the train/test sets from the in-distribution within the positive score range, is shown in Appendix C. Although only the difference in score values between a test example and the in-distribution test set is relevant for OOD detection, examples from the in-distribution should be sufficiently distant from P_neg(x, x') for optimal performance. It can further be shown that in the extreme case γ → ∞ the score function learns the optimal weights to realise importance sampling for P_pos by sampling from P_neg (Appendix B).
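That the log-likelihood ratio s* maximises the objective J[s; γ] for any γ > 0 can be checked numerically on a toy discrete distribution. A small sketch (taking γ_pos = 1 and γ_neg = γ, so a = s - log γ, as in Appendix A; the three-outcome distributions are illustrative only):

```python
import numpy as np

def J(s, p_pos, p_neg, gamma):
    # J[s; gamma] = E_pos[log sigma(a)] + gamma * E_neg[log(1 - sigma(a))],
    # with a = s - log(gamma)
    a = s - np.log(gamma)
    sig = 1.0 / (1.0 + np.exp(-a))
    return np.sum(p_pos * np.log(sig)) + gamma * np.sum(p_neg * np.log(1.0 - sig))

p_pos = np.array([0.6, 0.3, 0.1])
p_neg = np.array([0.1, 0.3, 0.6])
s_star = np.log(p_pos / p_neg)   # the optimal score: the log-likelihood ratio
```

Perturbing s_star in any direction can only decrease J, for every value of γ, which mirrors the variational argument of Appendix A.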

4.1. TRAINING

Experiments were carried out using either ResNet18 or ResNet34 architectures for both the encoder f(x) and the discriminator s(x, x'). We replaced the ReLU activations of the discriminator with ELU units to reduce the overconfidence induced by ReLU (Meinke & Hein (2020)), which resulted in a strong reduction of unusually large spikes in the discriminator's training loss curve. For contrastive learning, an MLP head was attached to the last hidden layer that projects to a d = 128 dimensional feature vector h(x), whereas for the discriminator the MLP head projects to a scalar output s(x, x'). Note that the ResNets for encoder and discriminator do not share parameters. We train the contrastive loss at batch size 2048 and the discriminator at batch size 128, using the ADAM(AMSGrad) optimiser. We applied random transformations to each example in the training set before presenting them to encoder and discriminator. The transformations consist of combinations of random cropping followed by resizing to the original size, random horizontal flipping, random colour jitter, grey scaling, and Gaussian blurring (Appendix D). For training the encoder h(x), we used the same transformations with the same strength as reported in Chen et al. (2020). We set the temperature parameter to τ = 1, which is the value reported in Winkens et al. (2020) for the same datasets used in this work. Unless otherwise specified, positive pairs for training the discriminator are two independent transformations of a single image from the training set, where the transformation strength is bounded by the strong transformations (Fig. 3 and Appendix D), to make sure that we do not transform out of the in-distribution. As pairs generated from independent transformations of the same image are typically semantically closer than any two semantically close images of the in-distribution test set (Fig. 2), the latter would otherwise be erroneously classified as OOD.
To avoid this misclassification, we augment the transformed positive pairs with a fraction of semantically similar pairs, with pairing partners randomly selected from the semantic neighbourhood. The strength of this augmentation is chosen such that the train/test sets from the in-distribution reside on the positive OOD-score side yet remain inside the sensitive range of the logistic sigmoid function (Fig. 4). Unless otherwise specified, we take a semantic neighbourhood size of 4 and substitute a fraction µ = 1/32 of transformed pairs in a minibatch with semantically similar pairs from the training set. For regularisation, we use weight decay of 10^-6 and uniformly sample γ ~ U(1, 10) at each iteration of the learning process. Negative pairs are constructed by transforming two different examples from the training set, also including 'extreme' transformations and Gaussian blur.
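The substitution of a fraction µ of transformed pairs by semantically similar pairs can be sketched in pure Python. Here the index pairs stand in for actual image pairs; the augmentation pipeline itself is omitted, and the precomputed neighbourhoods are a hypothetical input:

```python
import random

def build_positive_batch(train_ids, neighbours, batch_size=128, mu=1/32, seed=0):
    """Sketch of positive-pair construction for the discriminator.

    With probability 1 - mu a pair is two (independent transformations of the)
    same training image; with probability mu the partner is drawn from the
    semantic neighbourhood S(x_i) instead.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        i = rng.choice(train_ids)
        if rng.random() < mu:               # Bernoulli(mu): semantically similar pair
            j = rng.choice(neighbours[i])   # partner from S(x_i), untransformed
            batch.append((i, j, "neighbour"))
        else:
            batch.append((i, i, "transformed"))  # two transformations of x_i
    return batch
```

With µ = 1/32 roughly 3% of a large batch consists of neighbourhood pairs, matching the fraction stated above.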

4.2. EVALUATION

We evaluate the results using the Area Under the Receiver Operating Characteristic curve (AUROC), which has the advantage of being scale-invariant (it measures how well predictions are ranked rather than their absolute values) and classification-threshold-invariant (it measures how well OOD samples are separated from the in-distribution). However, in any practical OOD detection setting a classification threshold is needed; it can be chosen such that the false-positive rate on an in-distribution test set is close to some value, e.g. α = 0.05. We do not report values for the Area Under the Precision-Recall curve (AUPR), as in this work there is no class imbalance between the OOD test set and the in-distribution test set. As we observed significant shifts of OOD-scores for in-distribution train/test sets between training runs (Appendix C), we suggest for practical applications to carry out a majority vote over 5 independent training runs, where after each run an example is classified as OOD if its OOD-score is significantly lower than the OOD-scores of the in-distribution test set.

Table 1: Out-of-distribution detection performance (% AUROC) for CIFAR-10/SVHN, CIFAR-10/CIFAR-100, and CIFAR-100/CIFAR-10 (in/out distribution). Reported values for SemSAD are lower bounds.

Supervised methods                              C10/SVHN   C10/C100   C100/C10
MSP (Hendrycks & Gimpel, 2017)                  89.9       86.4       77.1
ODIN (Liang et al., 2018)                       96.7       85.8       77.2
Mahalanobis (Lee et al., 2018b)                 99.1       88.2       77.5
Residual flows (Zisselman & Tamar, 2020)        99.1       89.4       77.1
Outlier exposure (Hendrycks et al., 2019a)      98.4       93.3       75.7
Rotation pred. (Hendrycks et al., 2019b)        98.9       90.9       -
Gram matrix (Sastry & Oore, 2020)               99.5       79.0       67.9
Contrastive Aug. (Winkens et al., 2020)         99.5       92.9       78.3
OpenHybrid (Zhang et al., 2020)                 99.8       95.1       85.6

General unsupervised methods
Likelihood Regret (Xiao et al., 2020)           86.6       -          -
SVD-RND (Choi & Chung, 2020)                    96.9       -          -
Hierarchical-AD (Schirrmeister et al., 2020)    99.0       86.8       62.5
SemSAD (Ours)                                   100        99.9       99.9
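AUROC itself is easy to compute from the two score samples via the rank (Mann-Whitney) statistic: it equals the probability that a randomly chosen in-distribution example scores higher than a randomly chosen OOD example. A self-contained sketch:

```python
def auroc(in_scores, ood_scores):
    """AUROC via the Mann-Whitney statistic: P(score_in > score_ood),
    counting ties as 1/2. O(n*m); fine for illustration."""
    wins = 0.0
    for s_in in in_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))
```

Perfect separation gives 1.0, identical score distributions give 0.5, independent of any threshold or score scale.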

5. EXPERIMENTAL RESULTS

In our experiments, we focus on difficult OOD detection problems in the visual domain (Nalisnick et al. (2019)), in particular CIFAR-10/SVHN, CIFAR-10/CIFAR-100, and CIFAR-100/CIFAR-10 (in/out distribution). The main results are summarised in Table 1, where we used an identical setup (e.g. same hyperparameters, transformation strengths, and network size) for all datasets and averaged AUROC values over 5 subsequent runs using ResNet18 and 5 subsequent runs using ResNet34. We compare our unsupervised method (SemSAD) with supervised methods that use label information and/or OOD examples that are labelled as such. Surprisingly, we find that SemSAD outperforms not only all unsupervised methods on these problems but also all supervised methods we are aware of. This result is especially striking as supervised methods typically outperform unsupervised ones, since semantic representations learned from label information are usually an advantage. Our interpretation is that learning a representation for the semantic information shared between pairs of examples allows identifying a larger set of relevant features characterising the in-distribution than can be learned from the large number of examples that merely share the same label. The more relevant features are identified, the more tightly the in-distribution can be characterised, which helps OOD detection. The performance gain of our method is strong for all OOD detection problems considered in this work, but most apparent for CIFAR-100 as in-distribution and OOD examples from CIFAR-10, with an increase in state-of-the-art AUROC for unsupervised methods from 0.625 to 0.999. Note that the classes of CIFAR-10 and CIFAR-100 are mutually exclusive, and thus CIFAR-10 can be used as OOD test set. We carried out further experiments to see the effects of hyperparameter values and of the transformations used for training the discriminator (Table 2).
As expected, if we destroy semantic information by using extreme transformations for generating positive (in-distribution) pairs, the performance is significantly reduced, whereas Gaussian blur on negative examples has a positive effect. The experiments further show that the semantic neighbourhood size should be taken small enough to ensure that the pairs generated from semantic neighbourhoods and used for training s(x, x') are semantically close. In general, we observed better performance when broadening the distribution of negative pairs, e.g. by augmenting examples with Gaussian blur and using extreme transformations.

6. CONCLUSION

In this work we proposed SemSAD, a new OOD detection approach that is based on scoring semantically similar pairs of examples from the in-distribution. We showed that our method outperforms supervised and unsupervised methods on challenging OOD detection tasks in the visual domain. The definition of semantic similarity within our approach requires identifying transformations that are applicable to individual examples and are orthogonal to the salient semantic factors of the in-distribution. Although semantic similarity can be broadly defined as "everything that is not noise", high predictive power can be expected if the semantic similarity score captures the higher-order features that are specific to the in-distribution. In practice, there are problems where the definition of semantic similarity is challenging. For example, if genome sequence data is the input and the effect on the phenotype of the organism is the true underlying semantics, it is unclear what transformations of the genome sequence would leave the phenotype unchanged. In contrast, for the important problem of protein folding, such transformations can be inferred from multiple sequence alignments of sequences that likely conserve the function of a protein.

B APPENDIX

We show that optimising the objective J[s; γ_pos, γ_neg] by gradient ascent, with γ_pos, γ_neg > 0 randomly sampled in each optimisation step, amounts to averaging over an ensemble of gradients. A gradient-based optimisation method in its simplest form updates the parameters θ, which determine the function s(x, x'), by the rule θ ← θ + α ∇_θ J[s; γ], with α the learning rate and

∇_θ J[s; γ] = γ_pos E_{(x,x')~P_pos} [ (1 - σ(a)) ∇_θ s(x, x') ] - γ_neg E_{(x,x')~P_neg} [ σ(a) ∇_θ s(x, x') ]   (11)

            = E_{(x,x')~P_pos} [ γ_pos / (1 + (γ_pos/γ_neg) e^{s(x,x')}) ∇_θ s(x, x') ] - E_{(x,x')~P_neg} [ γ_neg / (1 + (γ_neg/γ_pos) e^{-s(x,x')}) ∇_θ s(x, x') ].   (12)

This result shows that, for given s(x, x'), random values of γ_pos, γ_neg > 0 weight the expected gradients of the positive and the negative examples differently. As a consequence, ∇_θ J[s; γ] takes a different direction at each parameter update, even for a fixed (mini-)batch and fixed initial conditions.
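The reweighting of positive and negative gradients can be checked numerically: the sigmoid form γ_pos (1 - σ(a)) and the closed form γ_pos / (1 + (γ_pos/γ_neg) e^s) above agree for any γ_pos, γ_neg > 0, and likewise for the negative weight. A small sketch:

```python
import math

def weight_pos(s, g_pos, g_neg):
    # gradient weight on positive examples: g_pos * (1 - sigma(a)),
    # with a = s + log(g_pos / g_neg)
    a = s + math.log(g_pos / g_neg)
    return g_pos * (1.0 - 1.0 / (1.0 + math.exp(-a)))

def weight_pos_closed(s, g_pos, g_neg):
    # closed form from the derivation above
    return g_pos / (1.0 + (g_pos / g_neg) * math.exp(s))

def weight_neg(s, g_pos, g_neg):
    # gradient weight on negative examples: g_neg * sigma(a)
    a = s + math.log(g_pos / g_neg)
    return g_neg / (1.0 + math.exp(-a))

def weight_neg_closed(s, g_pos, g_neg):
    return g_neg / (1.0 + (g_neg / g_pos) * math.exp(-s))
```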

If we consider the following limiting cases for s ≈ s*,

lim_{γ_pos → 1, γ_neg → ∞} ∇_θ J[s; γ] = E_{(x,x')~P_pos} [ ∇_θ s(x, x') ] - E_{(x,x')~P_neg} [ e^{s(x,x')} ∇_θ s(x, x') ]   (13)

                                       ≈ E_{(x,x')~P_pos} [ ∇_θ s(x, x') ] - E_{(x,x')~P_neg} [ (P_pos(x, x') / P_neg(x, x')) ∇_θ s(x, x') ]   (14)

and

lim_{γ_pos → ∞, γ_neg → 1} ∇_θ J[s; γ] = E_{(x,x')~P_pos} [ e^{-s(x,x')} ∇_θ s(x, x') ] - E_{(x,x')~P_neg} [ ∇_θ s(x, x') ]   (15)

                                       ≈ E_{(x,x')~P_pos} [ (P_neg(x, x') / P_pos(x, x')) ∇_θ s(x, x') ] - E_{(x,x')~P_neg} [ ∇_θ s(x, x') ],   (16)

we see that the objective learns the optimal importance weights of importance sampling. As in this work P_pos(x, x') = 0 for some negative examples, the case γ_pos → ∞, γ_neg → 1 should not be applied, which is why we set γ_pos = 1 and sample γ_neg uniformly from [1, N], with N > 1.

C APPENDIX

D APPENDIX

D.1 TRAINING CONTRASTIVE ENCODER

In order to train the contrastive encoder, we use a modified version of ResNet18 with a 128-dimensional projection head to make it suitable for the CIFAR-10 and CIFAR-100 datasets with their relatively small image size. In particular, we remove the max-pooling layer and substitute the first 7 × 7 convolutional layer of stride 2 with a 3 × 3 convolutional layer with padding and stride 1. For optimisation, we use the Adam optimiser with a learning rate of 3 · 10^-4 and weight decay of 10^-6. The network is trained for 1500 epochs at batch size 2048.

D.2 TRAINING DISCRIMINATOR

To train the discriminator, we use ResNet18/34 with the same modifications as for the contrastive encoder. In addition, all ReLU activation functions are replaced with ELU and the projection head maps to a scalar value. Note that encoder and discriminator do not share parameters. The discriminator is trained with an initial learning rate of 5 · 10^-5 using the AMSGrad optimiser with weight decay of 10^-6 and a batch size of 128 samples per iteration. The learning rate is multiplied by 0.2 and 0.1 after 200 and 500 epochs, respectively.
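The learning-rate schedule can be written as a plain milestone function, a minimal sketch assuming each factor is applied once the corresponding epoch is reached:

```python
def lr_at_epoch(epoch, base_lr=5e-5, milestones=((200, 0.2), (500, 0.1))):
    # start from the initial learning rate and multiply by 0.2 after epoch 200
    # and by 0.1 after epoch 500
    lr = base_lr
    for milestone, factor in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr
```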

D.3 GEOMETRIC TRANSFORMATIONS

As data augmentation to train the contrastive encoder, we use the same transformations as in Chen et al. (2020), including randomly chosen transformations from the set {Cropping, Horizontal Flip, Color Jitter, GrayScale, Gaussian Blurring}. PyTorch snippets for the encoder transformations can be found in Table 4. To train the discriminator we use two sets of transformations, one for positive samples and one for negative ones. The main intuition in shaping the set of transformations for positive samples is to keep them in-distribution with respect to the original training samples. For the special case of cropping we define three categories: weak, strong, and extreme. Table 5 shows their cropping scales according to the PyTorch standard. To make positive pairs we randomly apply both the weak and strong cropping ranges, random horizontal flipping, and color jittering to the same image from the training set; for negative pairs we apply all of the weak, strong, and extreme cropping ranges, horizontal flipping, color jittering, and Gaussian blurring to two randomly selected images. The details and PyTorch snippets for the positive and negative transformations can be found in Tables 7 and 6, respectively. Note that for augmentation with semantically similar pairs from the training set no transformation is applied to the positive pairs.
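The split into weak, strong, and extreme cropping can be sketched as follows. The numeric scale ranges here are hypothetical placeholders for the actual values in Table 5, following the torchvision RandomResizedCrop convention of "scale" as a fraction of the original image area:

```python
import random

# Hypothetical crop-scale ranges standing in for Table 5 (illustrative only)
CROP_SCALES = {"weak": (0.7, 1.0), "strong": (0.3, 0.7), "extreme": (0.05, 0.3)}

def sample_crop_scale(levels, seed=None):
    """Pick a crop level uniformly from `levels`, then a scale within its range.

    Positive pairs draw from ("weak", "strong"); negative pairs additionally
    allow "extreme" crops, as described above.
    """
    rng = random.Random(seed)
    level = rng.choice(levels)
    lo, hi = CROP_SCALES[level]
    return level, rng.uniform(lo, hi)
```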



Figure 1: Illustration of mapping examples from the in-distribution onto a unit-hypersphere. In this representation, feature vectors from the in-distribution are semantically similar if they approximately align and semantically diverse if they are separated by a large angle. If OOD examples are mapped onto the unit-hypersphere, they can align with training examples without being semantically similar. A discriminator trained to classify pairs of feature vectors from the training set as semantically close or distant, with sharing or not sharing the same semantic neighbourhood as target, can then be used for detecting OOD examples.

Figure 2: Semantic Neighbourhoods for examples from CIFAR-10/100.

Figure 3: Transformation strengths used in training P pos (weak to strong) and P neg (weak to extreme)

Figure 4: Distributions over the OOD detection score s(x, x'), trained on CIFAR-100 pos/neg pairs (P_pos in blue; P_neg in red) as described in Section 4.1 and applied to semantic nearest-neighbour pairs from the test sets of SVHN and CIFAR-10 (out-distributions), in comparison to semantic nearest-neighbour pairs of the CIFAR-100 test/train sets (in-distributions).

Figure 5: Random shifts of in-distribution test/train sets (green) for different training runs as an indicator of the instability of the decision boundary. Shown are results for 5 independent training runs for γ ~ U(1, 10) (left column) and for γ = 1 (right column), using the same setup as used to compute Table 1 but with ResNet18.

To generate the positive pairs for training, we first find, for each example in the training set, the 4 examples with the highest cosine similarity score among 10k random examples, from which one is randomly selected with equal chance. During training, µ = 1/32 = 3.125% of each batch consists of semantically similar pairs. For γ, a value is uniformly chosen from U(1, 10) at each iteration. The hyperparameters and their default values are shown in Table 3.

The key contributions of this work are:
• We propose a new OOD detection framework that is applicable in the absence of labelled in-distribution data or OOD examples that are labelled as such.
• We show that our approach strongly improves OOD detection for challenging tasks in the visual domain.

Out-of-distribution detection performance (% AUROC). Reported values for SemSAD are lower bounds.

Table 2: Average over AUROC values from 5 independent training runs for CIFAR-100/CIFAR-10 (in/out distribution) for different setups. The lowest AUROC value among the 5 runs is shown in brackets. Reported AUROC values are lower bounds. We applied Gaussian blurring to negative samples (blur), extreme transformations to positive samples (extreme transf.), used correlated negative pairs P_neg(x, x') derived from extreme transformations of the same image (correlated neg), and varied the fraction of semantically similar pairs (µ) per minibatch, the sampling range for γ, and the semantic neighbourhood size (N). AUROC is computed for the CIFAR-100/10 test sets with 10k examples.

The performance of our method is tightly connected to the ability of the encoder h(x) and discriminator s(x, x') to extract, or be sensitive to, features that allow generalising over the training set and are thus specific to the in-distribution. These generalisable features are orthogonal to the features that change under the transformations we use in training. The transformations used in this work, e.g. cropping, horizontal flip, and colour jitter, are generic in the sense that they are designed to conserve the semantics of images depicting physical objects. In general, the type of transformations used must match the generalisable features of the in-distribution. For example, if all examples of the in-distribution have a specific horizontal orientation, then horizontal flip must be excluded from the transformations.



Table 3: Hyperparameters for the discriminator.

ACKNOWLEDGMENTS

A APPENDIX

The objective

J[s; γ] = γ_pos E_{(x,x')~P_pos} [ log σ(a) ] + γ_neg E_{(x,x')~P_neg} [ log (1 - σ(a)) ],   (4)

with σ(a) = 1/(1 + e^{-a}), a = s(x, x') + log(γ_pos/γ_neg), and γ = (γ_pos, γ_neg), satisfies J[s*; γ] ≥ J[s; γ] for all γ_pos, γ_neg > 0, where s* is given by Eq. 10, under the condition that P_pos and P_neg have the same support, i.e. where P_pos is non-zero also P_neg is non-zero and vice versa. To prove this assertion we use variational calculus (see e.g. C. Bishop, Pattern Recognition and Machine Learning, Appendix D) to compute the functional derivative δJ/δs, defined via the integral over an arbitrary test function η(x, x'), where we use that dσ(a)/ds = σ(a)(1 - σ(a)). The optimum follows from δJ/δs|_{s=s*} = 0, which results in

γ_pos P_pos(x, x') / ( γ_pos P_pos(x, x') + γ_neg P_neg(x, x') ) = 1 / ( 1 + e^{-s*(x,x') - log(γ_pos/γ_neg)} )   (9)

⇒ s*(x, x') = log ( P_pos(x, x') / P_neg(x, x') )   for all γ_pos, γ_neg > 0.   (10)

Note that J[s; γ] is not bounded from below, so the optimum is a maximum.

