FEW-SHOT ANOMALY DETECTION ON INDUSTRIAL IMAGES THROUGH CONTRASTIVE FINE-TUNING

Anonymous

Abstract

Detecting abnormal products through imagery data is essential to quality control in manufacturing. Existing approaches to anomaly detection (AD) often rely on a substantial amount of anomaly-free samples to train representation and density models. However, large anomaly-free datasets may not always be available before the inference stage; this calls for an anomaly detection framework that works with only a handful of normal samples, a.k.a. few-shot anomaly detection (FSAD). We propose two techniques to address the challenges in FSAD. First, we employ a model pretrained on a large source dataset to initialize model weights. To ameliorate the covariate shift between source and target domains, we adopt contrastive training on the few-shot target domain data. Second, to encourage learning representations suited to downstream AD, we further incorporate cross-instance pairs to tighten the normal-sample cluster and improve the separation between normal and synthesized negative samples. Extensive evaluations on six few-shot anomaly detection benchmarks demonstrate the effectiveness of the proposed method.

1. INTRODUCTION

Industrial defect detection is an important real-world use case for visual anomaly detection methods. In this setting, anomaly detection models typically have to be trained with only defect-free (normal) images, as defects rarely occur on functioning production lines. Anomaly detection methods for this one-class classification setting typically assume that normal images are available in abundance, even though this may not always be the case. For example, in certain applications such as semiconductor manufacturing, where image acquisition requires 3D scans using specialized equipment (Pahwa et al., 2021), acquiring defect-free images is time-consuming and costly. Flexible manufacturing systems also require rapid adaptation to changes in the type and quantity of products to be manufactured (Shivanand, 2006). As a result, large numbers of defect-free images may not be available for new products, or in the initial stages of bootstrapping a visual inspection system. Although anomaly detection in general is a well-studied topic (Chandola et al., 2009; Pang et al., 2021b), anomaly detection on images with only few normal and no abnormal images, or few-shot anomaly detection (FSAD), has only recently begun to receive attention from the community (Sheynin et al., 2021; Huang et al., 2022). In their pioneering work, Sheynin et al. (2021) developed a generative adversarial model to distinguish transformed image patches from generated ones. However, such adversarial models may be tricky to tune (Kodali et al., 2017), and the method requires multiple transformations of test samples at inference time, resulting in additional computation overhead. The more recent work of Huang et al. (2022) learns a common model over multiple classes of normal images using a feature registration proxy task, but their method requires a training set with normal images from multiple known classes, which is a more restrictive setting.
In this work, we develop a simple yet effective method for few-shot anomaly detection. We achieve this by synergistically combining transfer learning from a pretrained model with representation learning on the few-shot normal data. Finetuning from a backbone network pretrained on a large source domain dataset, e.g. ImageNet (Russakovsky et al., 2015), allows reusing good low-level feature extractors and provides a better initialization of network parameters (Kornblith et al., 2019). We believe finetuning from pretrained weights is particularly valuable for few-shot anomaly detection, where there is not enough training data to learn good representations from scratch. However, as pointed out by existing work (Xu et al., 2022; Li et al., 2021b), directly reusing the pretrained weights may not fully unleash the power of finetuning, likely for two reasons. First, when the source domain has a different data distribution from the target domain, the covariate shift (Wang & Deng, 2018) causes performance degradation. Second, anomaly detection requires a feature representation that separates normal samples from abnormal ones, and the representations learned from ImageNet pretraining tasks, mostly semantic image classification, are not necessarily optimal for anomaly detection. To ameliorate the covariate shift between source and target domain data, we first propose contrastive training to adapt pretrained model weights to the target data distribution for downstream anomaly detection. Given initial model weights, we optimize a contrastive loss defined on all available few-shot normal examples so that the pretrained low-level features are adjusted towards the target data distribution. We further encourage the learnt feature representations to suit the downstream anomaly detection task by encouraging normal samples to form a cluster in feature space.
To achieve this, we introduce a cross-instance positive pair loss that randomly samples two normal samples and encourages their feature embeddings to be close. Note that this differs from standard contrastive training, where closeness is encouraged between a sample and its augmented version rather than across two different normal samples. Finally, when prior knowledge of the anomalies is available, e.g. when we are able to synthesize negative examples (Li et al., 2021a), we further introduce an additional negative pair loss to encourage better separation between normal and synthesized anomalous examples. We empirically reveal that the choice of negative sample synthesis is crucial to the success of FSAD and should be exercised only when concrete prior knowledge of the anomalies is available. We summarize the contributions of this work as follows:
• We approach anomaly detection for industrial defect inspection from a transfer learning perspective. We propose contrastive training on few-shot normal samples in the target domain to alleviate the distribution shift between source and target domains.
• We further introduce a cross-instance positive pair loss to encourage normal samples to form a tight cluster in the embedding space for better density-based anomaly detection.
• When prior knowledge of negative samples is available, a negative pair loss is further incorporated to allow better separation between normal and synthesized negative samples.
• We demonstrate superior performance on 4 real-world industrial defect identification datasets and 2 synthetic corruption identification datasets.

2. RELATED WORK

Anomaly Detection: Traditional anomaly detection (AD) methods include PCA, cluster analysis (Kim & Scott, 2012) and one-class classification (Schölkopf et al., 2001). With the advent of deep learning, representation learning is employed to avoid manual feature engineering and kernel construction. This has led to novel anomaly detection methods based on generative adversarial networks (GANs) (Perera et al., 2019; Schlegl et al., 2017) and autoencoders (Bergmann et al., 2019a). Among them, AnoGAN (Schlegl et al., 2017) learns the manifold of normal samples, relying on the fact that anomalous samples cannot be perfectly projected onto the normal manifold by a generator learned solely from normal samples. However, it requires expensive optimization to detect abnormal samples, and training GANs is prone to well-known challenges including instability and mode collapse. Among the autoencoder-based approaches, Bergmann et al. (2019a) adopted the SSIM metric as the similarity measure between input and reconstructed images. Recently, an effective line of work approaches AD through representation learning and formulates AD as detecting outliers in the learned representation space (Ruff et al., 2018; Golan & El-Yaniv, 2018; Sohn et al., 2021). Among these works, Deep SVDD (Ruff et al., 2018) learns a feature embedding that groups normal samples close to a cluster center. Follow-up works develop self-supervised pretraining methods that learn representations suitable for separating abnormal samples from normal ones by optimizing a proxy task (Golan & El-Yaniv, 2018; Sohn et al., 2021; Li et al., 2021a). Anomaly detection is then implemented by fitting a density model on the learnt representations of normal training samples. These approaches prevail on many anomaly detection benchmarks and are computationally efficient.
Nevertheless, representation learning requires a substantial amount of training data, which may not be readily available in certain industrial environments. Few-Shot Anomaly Detection (FSAD) aims to enable detecting anomalous samples with only a few normal samples as training data, and is an emerging topic in anomaly detection. We first distinguish FSAD from the semi-supervised anomaly detection setting (Ruff et al., 2019), where a limited number of labeled anomalies are available for training, as the latter is sometimes also referred to as few-shot anomaly detection in the literature (Pang et al., 2021a). The work of Huang et al. (2022) addresses a different few-shot setting that requires normal samples to be provided with semantic labels. When normal data comprises multiple semantic classes, embedding all normal samples into a single cluster may result in a failure to detect anomalies occurring between semantic classes; learning multiple prototypes was proposed to tackle this issue. In comparison, our method adapts pretrained weights to target data using only a few normal training samples; unlike some of these other works, no additional data is required during the representation learning phase. This enables our method to be applied in a broader set of industrial anomaly detection scenarios. Contrastive Learning: Pretraining feature representations by contrasting augmented samples of the same identity has demonstrated promising results. SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) employ an N-pair loss (Sohn, 2016) to encourage two augmentations of the same instance (positive pair) to be close in the feature space and other instances (negative pairs) to stay far apart. Because negative pairs require a large batch size for training, BYOL (Grill et al., 2020) introduced an exponential moving average model that avoids collapsed predictions without negative pairs.
Apart from representation pretraining, contrastive learning has recently been demonstrated to be effective for label-efficient finetuning (Liu et al., 2021; Xu et al., 2022; Chen et al., 2022; Li et al., 2021b). When source and target domain data distributions are subject to covariate shift, contrastive training on the target data in an unsupervised fashion can potentially alleviate the domain shift (Xu et al., 2022; Li et al., 2021b). In this work, we demonstrate that contrastive training on the target domain data plays an important role in learning a good representation for downstream anomaly detection.

3. METHODOLOGY

In this work, we assume a model pretrained on a large external image collection (e.g. ImageNet) is available. We refer to this external data as the source domain. Anomaly detection on industrial data is the task to be solved and is referred to as the target domain. We first describe contrastive training for adapting a pretrained model to the target domain distribution. We then introduce the cross-instance positive pair loss to encourage normal samples to form a cluster in the feature space. When prior knowledge on how to synthesize negative samples is available, we further introduce negative pairs to encourage better separation of normal and abnormal samples in the feature space. An overview of the proposed contrastive adaptation framework is shown in Fig. 1. Lastly, we describe how to build the density-based anomaly detection model on the learnt representations.

3.1. CONTRASTIVE TRAINING FOR ADAPTATION

We first denote the few-shot training examples from the target domain as $\mathcal{D}_T = \{X_i\}_{i=1,\dots,N_T}$. The parameters of a backbone network are denoted as $\Theta$, and $z = f(X;\Theta)$ encodes the input $X$ into the feature space. Contrastive training updates the model parameters $\Theta$ by optimizing a contrastive loss in an unsupervised manner, as in Eq. 1. In this work, we adopt BYOL (Grill et al., 2020) for contrastive learning due to its smaller memory requirements.

$$\mathcal{L}_{Con} = -\frac{1}{N_T}\sum_{X_i\in\mathcal{D}_T}\frac{q(g(z_i))^\top g(\hat{z}_i)}{\|q(g(z_i))\|\cdot\|g(\hat{z}_i)\|} \quad (1)$$

To learn effective representations, contrastive training contrasts two random augmentations of the same input image, denoted as $t(X)$. The encoder network outputs the representation embedding for each augmented input as $z = f(t(X);\Theta)$.
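Concretely, the BYOL-style objective in Eq. 1 reduces to a negative cosine similarity between the predictor output on the online view and the (stop-gradient) target projection. Below is a minimal NumPy sketch of just this loss computation, ignoring the networks, autograd, and the EMA update; the function name is ours:

```python
import numpy as np

def byol_negative_cosine(p, z_target):
    """Eq. 1 as a batch loss: p plays the role of q(g(z_i)) from the online
    branch, z_target the role of g(z_hat_i) from the EMA target branch
    (treated as a constant, i.e. no gradient flows through it).
    Returns the mean negative cosine similarity over the batch."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))
```

For identical views the loss attains its minimum of -1; in practice the two arguments come from two different random augmentations of the same image.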

3.2. CROSS-INSTANCE POSITIVE PAIR LOSS

The contrastive training objective encourages adaptation to the target distribution, but it does not guarantee that the learned representation is suitable for downstream density-based anomaly detection. Since anomaly detection inference is often implemented by fitting a multivariate Gaussian distribution to the normal samples in the feature space, normal samples should ideally be embedded close to each other. Inspired by the success of one-class classification (Ruff et al., 2018), we propose to encourage normal samples to form a tight cluster in the feature space. Specifically, we treat a pair of randomly selected normal samples as a positive pair; the representations of each positive pair are pulled closer by maximizing their cosine similarity, as in Eq. 2, where $p$ is a random permutation of the list $\{1,\dots,N_T\}$ and $\bar{\Theta}$ denotes the exponential moving average (target) parameters.

$$\mathcal{L}_{PP} = -\frac{1}{2N_T}\sum_{i}\sum_{j=p(i)}\left[\frac{f(t(X_i);\Theta)^\top f(t(X_j);\bar{\Theta})}{\|f(t(X_i);\Theta)\|\cdot\|f(t(X_j);\bar{\Theta})\|}+\frac{f(t(X_j);\Theta)^\top f(t(X_i);\bar{\Theta})}{\|f(t(X_j);\Theta)\|\cdot\|f(t(X_i);\bar{\Theta})\|}\right] \quad (2)$$

Compared with the alternative of maintaining a fixed cluster center as proposed in (Ruff et al., 2018), the cross-instance positive pair loss has two advantages. First, we do not need to fix the cluster center at the start of training; this avoids over-regularizing the representation embedding, as the cluster center may drift during training. Second, we compute the cosine similarity between the online view and the target view, where the latter does not backpropagate gradients; this avoids collapse to a trivial solution (e.g. all-zero weights) (Ruff et al., 2018). We note that the cross-instance positive pair loss is computed on features taken directly from the backbone network: since the backbone output features are used for anomaly detection, the loss should be optimized in that feature space.
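A NumPy sketch of Eq. 2, under our reading that the second factor in each term uses the stop-gradient target network; both branches are passed in as precomputed feature matrices, and all names are ours:

```python
import numpy as np

def cross_instance_pp_loss(z_online, z_target, perm):
    """Eq. 2: pair sample i with its permuted partner j = perm[i] and
    maximize the symmetric cosine similarity (minimize its negative).
    z_online: backbone features through which gradients flow;
    z_target: stop-gradient (EMA) backbone features;
    perm: a random permutation of range(N_T)."""
    zo = z_online / np.linalg.norm(z_online, axis=1, keepdims=True)
    zt = z_target / np.linalg.norm(z_target, axis=1, keepdims=True)
    cos_ij = np.sum(zo * zt[perm], axis=1)  # f(t(X_i)) vs f(t(X_{p(i)}))
    cos_ji = np.sum(zo[perm] * zt, axis=1)  # the symmetric direction
    return -np.sum(cos_ij + cos_ji) / (2 * len(perm))
```

With the identity permutation and identical online/target features, every pair has cosine similarity 1 and the loss reaches its minimum of -1.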

3.3. INCORPORATING NEGATIVE PAIR LOSS

Synthesizing negative examples has been demonstrated to be successful for pretraining representations for anomaly detection. Well-calibrated synthesis approaches have even achieved state-of-the-art performance on datasets where the synthesized samples match the real anomalies well (Li et al., 2021a). In this work, we propose to incorporate additional synthetic negative examples when prior knowledge is available. Specifically, denoting negative sample synthesis as $t_n(X)$, we encourage better separation between normal and abnormal samples by minimizing the cosine similarity between the original image embedding and the negative embedding:

$$\mathcal{L}_{NP} = \frac{1}{N_T}\sum_{i}\frac{f(t(X_i);\Theta)^\top f(t_n(X_i);\Theta)}{\|f(t(X_i);\Theta)\|\cdot\|f(t_n(X_i);\Theta)\|} \quad (3)$$

It is worth noting that the negative contrasting is also carried out directly on the backbone output features, so that the constraint is applied to the feature representations used for detection. A related design was presented in (Ruff et al., 2019) for semi-supervised anomaly detection, which minimizes the reciprocal of the distance between annotated anomalies and the normal-sample cluster center. Again, minimizing the cosine similarity is compatible with the contrastive training objective and the cross-instance positive pair loss, with no risk of a trivial solution. The final training loss combines the three terms as $\mathcal{L}_{all} = \mathcal{L}_{Con} + \lambda_{PP}\mathcal{L}_{PP} + \lambda_{NP}\mathcal{L}_{NP}$.
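Under the same conventions, the negative pair loss of Eq. 3 and the combined objective can be sketched as follows; the default weights 0.8 and 0.6 mirror the settings reported in Sec. 4.3, and the function names are ours:

```python
import numpy as np

def negative_pair_loss(z, z_neg):
    """Eq. 3: minimize the cosine similarity between the embedding of a
    normal view t(X_i) and that of its synthesized negative t_n(X_i).
    Note the positive sign: this quantity is minimized directly."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    nn = z_neg / np.linalg.norm(z_neg, axis=1, keepdims=True)
    return np.mean(np.sum(zn * nn, axis=1))

def total_loss(l_con, l_pp, l_np, lam_pp=0.8, lam_np=0.6):
    """L_all = L_Con + lambda_PP * L_PP + lambda_NP * L_NP."""
    return l_con + lam_pp * l_pp + lam_np * l_np
```

Identical normal and negative embeddings give the worst-case value 1, while perfectly opposed embeddings give the minimum of -1.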

3.4. DENSITY-BASED ANOMALY DETECTION

To perform anomaly detection with the learnt representations, we follow the density-based approach of (Li et al., 2021a) and fit a multivariate Gaussian distribution to the few-shot normal samples. Note that the learnt feature representations must be L2-normalized before density estimation and inference: during representation learning we optimize cosine similarities, which are agnostic to the magnitude of the feature representations. Moreover, to increase the amount of data for fitting the Gaussian, we augment the few-shot normal samples $N_A$ times, i.e. $\mathcal{D}_{TA} = \underbrace{\mathcal{D}_T \cup \mathcal{D}_T \cup \cdots \cup \mathcal{D}_T}_{N_A \text{ times}}$, each copy receiving an independent random augmentation. Formally, the mean $\mu$ and covariance $\Sigma$ are obtained through maximum likelihood estimation:

$$\mu = \frac{1}{|\mathcal{D}_{TA}|}\sum_{X_i\in\mathcal{D}_{TA}}\frac{f(t(X_i))}{\|f(t(X_i))\|},\qquad \Sigma = \frac{1}{|\mathcal{D}_{TA}|}\sum_{X_i\in\mathcal{D}_{TA}}\left(\frac{f(t(X_i))}{\|f(t(X_i))\|}-\mu\right)\left(\frac{f(t(X_i))}{\|f(t(X_i))\|}-\mu\right)^\top \quad (4)$$

The anomaly score is then given by the Mahalanobis distance as in Eq. 5, and test samples are ranked by this score for anomaly detection.

$$d_{AS}(X) = \left(\frac{f(X)}{\|f(X)\|}-\mu\right)^\top \Sigma^{-1} \left(\frac{f(X)}{\|f(X)\|}-\mu\right) \quad (5)$$
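The density model of Eqs. 4-5 amounts to a Gaussian fit on L2-normalized features followed by Mahalanobis scoring. A self-contained NumPy sketch follows; the small ridge term is our addition to keep $\Sigma$ invertible at few-shot sample sizes, not part of the method as stated:

```python
import numpy as np

def fit_normal_density(feats, ridge=1e-6):
    """Eq. 4: MLE mean and covariance of L2-normalized features of D_TA.
    feats: (n_samples, dim) backbone features of augmented normal samples.
    Returns the mean and the inverse covariance (with a tiny ridge)."""
    z = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mu = z.mean(axis=0)
    diff = z - mu
    cov = diff.T @ diff / len(z) + ridge * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x_feat, mu, cov_inv):
    """Eq. 5: Mahalanobis distance of an L2-normalized test feature to the
    fitted normal cluster; larger means more anomalous."""
    d = x_feat / np.linalg.norm(x_feat) - mu
    return float(d @ cov_inv @ d)
```

A feature close to the normal cluster receives a much lower score than one pointing in an unrelated direction, which is what the AUROC ranking in the experiments relies on.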

4. EXPERIMENTS

We evaluate the performance of our method on four industrial defect identification datasets and two datasets with synthetic common corruptions. We benchmarked against state-of-the-art anomaly detection methods and achieved very competitive performance. Finally, we carry out ablation studies on individual components and provide further insights into the negative pair loss.

4.1. DATASETS

We provide an overview of the datasets used in the experiments. The MVTec dataset (Bergmann et al., 2019b) contains 15 object categories, including 10 non-texture categories and 5 texture categories. Each category contains 60-300 normal samples for training and 30-400 normal and defective samples for testing. We follow the few-shot settings from (Sheynin et al., 2021). The Magnetic Tile dataset (Huang et al., 2020) consists of magnetic tile surface images with six label types, namely "Blowhole", "Crack", "Fray", "Break", "Uneven", and "Free" (no defects). There are 1,344 images in total, of which 952 are defect-free.

4.3. EXPERIMENT DETAILS

Training Details: For all experiments, we use the ResNet18 (He et al., 2016) backbone for feature extraction. For all competing methods, we initialize backbone weights with ImageNet pretrained weights. We set the weight of the cross-instance positive pair loss to 0.8 and the weight of the negative pair loss to 0.6. We use the Adam optimizer (Kingma & Ba, 2015) for all experiments, with the learning rate initialized to 3 × 10⁻⁴, β₁ = 0.9 and β₂ = 0.99. We fix the batch size to 64, creating 64 pairs for contrastive training. To generate cross-instance pairs, we randomly permute the 64 images so that each image is paired with a randomly permuted one, yielding 64 positive pairs. Similarly, we pair each image with its negatively augmented version to create another 64 negative pairs. For density model fitting, we use N_A = 10. The area under the ROC curve (AUROC) is used to assess performance, with anomalies treated as the positive class. Data Augmentation: For the negative pair loss, we synthesize negative examples t_n(X) with CutPaste augmentations (Li et al., 2021a) for the MVTec dataset. For all industrial datasets, regular data augmentation t(X) comprises affine transformations and color manipulations (e.g. blurring and grayscaling). For CIFAR10/100-C, regular augmentation includes only affine transformations, while negative augmentation comprises blurring and random perturbation of image brightness and contrast.
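For illustration, a CutPaste-style negative augmentation t_n(X) can be sketched as cutting a random rectangular patch and pasting it at another location. The patch size and the uniform location sampling below are our illustrative choices, not the exact settings of Li et al. (2021a):

```python
import numpy as np

def cutpaste(img, rng, patch_frac=0.15):
    """Cut a random patch from img (H x W [x C]) and paste it at a random
    other location, producing a locally inconsistent 'negative' image.
    patch_frac controls the patch side length relative to the image."""
    out = img.copy()
    h, w = img.shape[:2]
    ph, pw = max(1, int(h * patch_frac)), max(1, int(w * patch_frac))
    ys, xs = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)  # source
    yd, xd = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)  # destination
    out[yd:yd + ph, xd:xd + pw] = img[ys:ys + ph, xs:xs + pw]
    return out
```

The result keeps the image's global statistics while introducing a local irregularity, which is the kind of defect CutPaste is designed to mimic.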

4.4. FEW-SHOT ANOMALY DETECTION ON INDUSTRIAL DATASETS

In this section, we explore identifying defects in industrial images. We first evaluate few-shot anomaly detection performance on the MVTec dataset, with results in Tab. 1. We make the following observations. First, without any specific prior knowledge of the anomalies, our method (Ours (w/o np)) outperforms all competing methods by a clear margin. Furthermore, with prior knowledge of the potential anomalies, our method (Ours (w/ np)) still outperforms CutPaste with the same negative augmentations in the 2-shot and 5-shot settings, and is only slightly behind CutPaste in the 10-shot case. Both observations suggest the effectiveness of contrastive adaptation and the cross-instance positive pair loss. We further observe that CutPaste exhibits a significant lead on leather, wood and toothbrush images. We attribute this to the fact that these categories contain many anomalies that can be synthesized by the CutPaste and scar augmentations: the "cut" defect for the "leather" category, the "scratch" defect for "wood", and scar-like defects for "toothbrush". In contrast, our method relies more on adaptation from the pretrained model and learning from few-shot normal samples. As a result, its performance is generally better on more diverse types of objects: in the 2-shot setting, our method outperforms CutPaste on 8 of the 15 categories while CutPaste wins on 4 of 15.

4.5. DETECTING SYNTHETIC NOISE CORRUPTIONS

In this section, we evaluate detecting synthetic noise corruptions as anomalies. The synthesized corruptions mimic noise patterns commonly seen in industrial environments. We benchmark on CIFAR10-C and CIFAR100-C with 10-shot training samples for this purpose. As with MVTec, we adapt ImageNet pretrained weights to each individual semantic class; for CIFAR100-C, we use the 20 superclasses as semantic classes for simplicity. We present the anomaly detection results for each semantic category of CIFAR10-C in Tab. 3. We first observe that, on average, our methods, both with and without prior knowledge, lead the competing methods by a clear margin. The closest competitor, CutPaste, is 4% lower than Ours (w/ np), in contrast to its extraordinary performance on MVTec. We attribute this to the fact that the corruptions in CIFAR10-C are diverse and may not be easily synthesized by augmentation methods specifically tailored to the MVTec dataset. We further benchmark on CIFAR100-C against DeepSVDD and CutPaste and draw similar conclusions: our method is still stronger than CutPaste by a clear margin, owing to the mismatch between the negative samples synthesized by CutPaste and the corruptions in the dataset.

We present the ablation study results in Tab. 5 and make the following observations. First, as expected, reusing ImageNet pretrained weights for downstream anomaly detection yields a significant improvement in performance (52.54% → 67.32%), suggesting the importance of a good representation for anomaly detection. Adapting the pretrained model to the target distribution through contrastive training further improves results by roughly 2% on average (67.32% → 70.05%).
To encourage a feature embedding suitable for density-based anomaly detection, we further incorporate the positive pair loss, which again yields an additional improvement of roughly 3% (70.05% → 73.68%). As an alternative, one could encourage all normal samples' features to embed close to a fixed cluster center (F.C.) following Ruff et al. (2018). However, because the cluster center must be fixed during the first forward pass, this imposes too strong a constraint on representation learning and yields inferior results (68.21%). Finally, combining the negative pair loss gives a final boost to 74.56%. We also hypothesized that L2 normalization of the feature representation is necessary, and the ablation study validates this: removing L2 normalization from the anomaly inference features drops performance from 74.56% to 72.11%, indicating the normalization is essential to fitting a better density model for distance-based anomaly detection. As discussed, incorporating negative examples during adaptation is not always beneficial; the advantage hinges on whether prior knowledge of the abnormal sample distribution is available. To verify this point, we evaluate incorporating the negative pair loss on 3 industrial datasets: SemiCon, AITEX and Magnetic Tile. The SemiCon dataset features anomalies that are quite different from MVTec, while AITEX and Magnetic Tile are relatively similar to the texture categories of MVTec. We choose CutPaste (Li et al., 2021a) as the negative augmentation for these 3 datasets. The results in Tab. 6 demonstrate that when the negative augmentation is substantially different from the real anomalies, e.g. the void circles in the SemiCon dataset, incorporating the negative pair loss with an inappropriate augmentation harms performance (78.87% → 62.97%). On the contrary, when anomalies can be simulated, even imperfectly, e.g. on the Magnetic Tile dataset, incorporating the negative pair loss further improves performance.
Overall, we conclude that incorporating negative pair loss is only helpful when prior knowledge on the potential anomalies is concrete and anomalies can be simulated through negative augmentation. Our model without negative pair loss is suitable for tasks without prior knowledge or when generating negative augmentation is difficult. 

5. CONCLUSION

Industrial defect inspection requires the capability of anomaly detection with very limited normal samples for training. To meet this demand, we proposed a few-shot anomaly detection approach that adapts models pretrained on large external image collections to few-shot normal samples from the target task. We achieve this adaptation by optimizing a contrastive loss and a cross-instance positive pair loss. When prior knowledge of possible anomalies is available, we further incorporate a negative pair loss to separate normal sample embeddings from synthesized negative samples. We extensively evaluated the proposed method on four real industrial defect detection datasets and two synthetic datasets mimicking realistic corruptions. Our method achieved state-of-the-art performance on all datasets when only a handful of normal samples are available. Finally, we showed that the benefit of using synthetic negative samples is task-dependent, and they should only be considered when accurate prior knowledge is available.

A APPENDIX

A.1 DISCUSSIONS ON CONTRASTIVE TRAINING HELPING ADAPTATION

Recent works have demonstrated that contrastive training helps adapt model parameters to the target domain distribution (Liu et al., 2021; Xu et al., 2022; Chen et al., 2022; Li et al., 2021b). We argue that contrastive training on target domain data can alleviate the negative impact of covariate shift. For simplicity, we denote the source domain dataset as $\mathcal{D}_S$ (e.g. the ImageNet dataset) and the target domain dataset as $\mathcal{D}_T$ (e.g. an anomaly detection dataset). The objective of supervised training on the source domain can be seen as minimizing the following cross-entropy loss, where $h(\cdot)$ is the classifier on the source domain and $f(\cdot)$ is the backbone network to be transferred:

$$\Phi_S^*, \Theta_S^* = \arg\min_{\Phi,\Theta}\frac{1}{|\mathcal{D}_S|}\sum_{X_i,y_i\in\mathcal{D}_S}\mathcal{L}_{CE}\big(h(f(X_i;\Theta);\Phi), y_i\big) \quad (6)$$

Covariate shift between the source and target datasets indicates a distributional misalignment, i.e. $p_S(X)\neq p_T(X)$, which is readily manifested by the difference in content between source and target domain data. It is therefore reasonable to believe that a backbone network optimized for the source domain is not optimal for the target domain distribution. To ease the negative impact of covariate shift, we introduce contrastive training on the target domain by optimizing an unsupervised contrastive loss with model parameters initialized from the source domain, as in Eq. 7 (the two arguments denote two independent augmentation draws):

$$\Theta_T^* = \arg\min_{\Theta}\frac{1}{|\mathcal{D}_T|}\sum_{X_i\in\mathcal{D}_T}\mathcal{L}_{Con}\big(f(t(X_i);\Theta), f(t(X_i);\Theta)\big), \quad \text{s.t. } \Theta_T^0 = \Theta_S^* \quad (7)$$

By minimizing the contrastive loss, the network captures key features of the target domain that discriminate non-identical instances, and we empirically demonstrate this to be effective for adapting a source model to a target domain for downstream anomaly detection. We further provide another perspective on the effectiveness of contrastive learning.
When the augmentations are chosen to mimic commonly seen variations within normal samples, contrasting two augmented images forces the network to produce similar representations regardless of the augmentation. In other words, contrastive training teaches the network features that are invariant to the common variations in appearance and pose encountered in industrial imaging environments. Such invariance helps bring normal samples closer in the feature space, thus benefiting downstream anomaly detection.

A.2 FURTHER ANALYSIS

In this section, we provide a qualitative examination of the learned representations and discuss when incorporating negative samples should be employed.

A.2.1 QUALITATIVE EXAMINATION OF REPRESENTATION LEARNING:

The benefit of incorporating each of the three proposed losses for adaptation to target-domain anomaly detection is validated through empirical experiments on multiple datasets. In this section, we provide qualitative insight into the advantage of incorporating these losses through t-SNE visualization (Van der Maaten & Hinton, 2008) of test data representations. Specifically, we randomly select 1,500 testing samples from the CIFAR10-C dataset for visualization. The feature points are projected into 2D space and visualized in Fig. 3.



Figure 1: Illustration of adapting a source domain pretrained model by combining the contrastive training loss (green arrows), cross-instance positive pair loss (red arrows) and negative pair loss (blue arrows) for few-shot anomaly detection. Dashed arrows indicate no gradient backpropagation.

The representations are projected to a lower dimension through a projection head $g(\cdot)$. The cosine similarity is then calculated between the predictor's output $q(g(z))$ on the online view and the projector's output $g(\hat{z})$ on the target view. To avoid a trivial solution, e.g. an encoder giving constant outputs, the target view is the output of an exponential moving average model, i.e. $\hat{z} = f(X;\bar{\Theta})$ with $\bar{\Theta}_t = \beta\bar{\Theta}_{t-1} + (1-\beta)\Theta_t$, where $\beta$ is a moving-average hyperparameter. When a source domain model $\Theta_S$ is available, contrastive training on the target domain is initialized from it, i.e. $\Theta_0 = \Theta_S$, so that the low-level feature extractors can be reused. Contrastive training thus serves to adapt pretrained network weights to the few-shot target domain training samples. A discussion of why contrastive training helps can be found in Appendix A.1.

Figure 2: Examples of industrial image data used for anomaly detection.

4.2. COMPETING METHODS

We compare against multiple anomaly detection methods in the experiments. We first benchmark the vanilla autoencoder (AE) employed in (Bergmann et al., 2019a), which is trained to reconstruct input images; the difference between the input and its reconstruction serves as the anomaly score. VAE (Kingma & Welling, 2014) constrains the latent variables to be Gaussian; multiple samples are drawn in the latent space and decoded to image space to measure the anomaly score (An & Cho, 2015). DeepSVDD (Ruff et al., 2018) trains the network by forcing normal training samples to embed close to a cluster center; the learned cluster center later serves as the prototype for anomaly detection, with the distance to it used as the anomaly score. CutPaste (Li et al., 2021a) introduced the CutPaste augmentation to synthesize negative examples for pretraining the feature representation network. It is worth noting that CutPaste mimics the real anomalies that appear in the MVTec dataset, so its effectiveness may not generalize to other types of anomalies. RotNet (Golan & El-Yaniv, 2018) proposed a self-supervised pretraining approach that predicts the augmentation (rotation) applied. This approach is most effective for anomaly detection on natural semantic images, and its advantage may disappear on industrial images where object pose and background are naturally more diverse. CSI (Tack et al., 2020) proposed to contrast distribution-shifted augmented images with original images to increase the gap between normal and abnormal samples; this is similar to keeping only the negative pair loss proposed in this work. CFLOW-AD (Gudovskiy et al., 2022) adopted a conditional normalizing flow model for fast anomaly localization; we adapt CFLOW-AD to training on the few-shot anomaly detection task.
TDG (Sheynin et al., 2021) proposed to employ a generative model for few-shot anomaly detection by classifying image patches as either fake or as one of a list of predefined augmentations. DifferNet (Rudolph et al., 2021) estimates density through a normalizing flow with a few supporting training samples. These two approaches are compared on the MVTec dataset. Ours (w/o np) optimizes the combination of the contrastive loss and the cross-instance positive pair loss. Assuming prior knowledge of the anomalies is available, Ours (w/ np) additionally incorporates the negative pair loss and optimizes the combination of all three losses. Among these methods, CutPaste and Ours (w/ np) build on prior knowledge of the anomalies, while the other methods make no explicit assumptions about what the anomalies look like.
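The CutPaste augmentation discussed above can be sketched as follows. This is a simplified illustration under our own assumptions (square-ish patch, no rotation or color jitter), not the exact augmentation of Li et al. (2021a):

```python
import numpy as np

def cutpaste(image, rng, patch_frac=0.15):
    """Toy CutPaste-style augmentation: copy a random rectangular patch
    and paste it at a different random location to synthesize a defect."""
    h, w = image.shape[:2]
    ph, pw = max(1, int(h * patch_frac)), max(1, int(w * patch_frac))
    # Source and destination top-left corners.
    sy, sx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    dy, dx = rng.integers(0, h - ph + 1), rng.integers(0, w - pw + 1)
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64))   # stand-in for a grayscale industrial image
orig = img.copy()
aug = cutpaste(img, rng)     # synthetic "defective" sample
```

The synthesized sample only looks defect-like if real defects resemble misplaced patches, which is precisely why CutPaste's advantage on MVTec may not transfer to other defect types.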

ABLATION STUDY

In this section, we investigate the effectiveness of the individual components using the CIFAR10-C dataset. In particular, we demonstrate the importance of contrastive training on the few-shot target domain normal samples (Contrast. Train), of incorporating the cross-instance positive pair loss (Positive Pair Loss), and of incorporating the negative pair loss (Negative Pair Loss). We further evaluate incorporating L2 normalization (L2 Norm) when fitting the anomaly detection density model and at inference.
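As one plausible reading of the L2 Norm ablation, the pipeline can be sketched as: L2-normalize the embeddings, fit a Gaussian density on the few-shot normal embeddings, and score test samples by Mahalanobis distance. The Gaussian/Mahalanobis choice and all names here are illustrative assumptions, not the paper's exact density model:

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def fit_gaussian(train_embed):
    """Fit a Gaussian density on (normalized) normal-sample embeddings."""
    mu = train_embed.mean(axis=0)
    cov = np.cov(train_embed, rowvar=False) + 1e-6 * np.eye(train_embed.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(z, mu, cov_inv):
    """Mahalanobis distance to the normal cluster; larger = more anomalous."""
    d = z - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

rng = np.random.default_rng(1)
train = l2_normalize(rng.normal(0.0, 0.1, (10, 4)) + 1.0)  # few-shot normal embeddings
mu, cov_inv = fit_gaussian(train)
normal_test = l2_normalize(rng.normal(0.0, 0.1, (5, 4)) + 1.0)
abnormal_test = l2_normalize(rng.normal(0.0, 0.1, (5, 4)) - 1.0)
scores_n = anomaly_score(normal_test, mu, cov_inv)
scores_a = anomaly_score(abnormal_test, mu, cov_inv)  # much larger than scores_n
```

Normalizing before fitting makes the density model sensitive to the direction of the embedding rather than its magnitude, which is the effect the L2 Norm ablation isolates.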

3. The feature embedding with ImageNet pretrained weights only, (a) w/o Adaptation, shows a substantial overlap between normal and abnormal samples. When contrastive training is applied, (b) w/ Contrastive Loss, we observe a clear separation between normal and abnormal samples. When the additional positive pair loss, (c) w/ PP Loss, and negative pair loss, (d) w/ NP Loss, are incorporated, the normal samples are grouped into an even tighter cluster with a larger distinction between normal and abnormal samples.

Figure 3: t-SNE visualization of selected anomaly detection testing samples from the CIFAR10-C dataset. Blue and red indicate normal and abnormal samples, respectively.

Figure 4: Samples of applying CutPaste to the industrial dataset SemiCon. The first row shows the original samples, and the second row the corresponding samples generated by the CutPaste augmentation. The third row shows a few samples of real defective images.

Figure 5: Illustration of the AITEX (upper) and Magnetic Tile datasets. For each dataset, the first row shows normal examples and the second row defective examples.

Figure 6: Comparing different ablated models through anomaly score distributions.

to create 2/5/10-shot anomaly detection protocols. The AITEX dataset (Silvestre-Blanes et al., 2019) is dedicated to detecting defects in textile fabric. It consists of 140 normal images and 105 defective images with corresponding defect masks for localization. The original image resolution is 4096×256 pixels and the defects occupy a very small percentage of pixels. To allow for
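Since the 4096×256 AITEX strips are far wider than a typical network input, a natural preprocessing is to slice each strip into square patches. The 256-pixel patch size below is an illustrative assumption (the sentence describing the paper's exact cropping is truncated above):

```python
import numpy as np

def slice_patches(image, patch=256):
    """Slice a wide strip image into non-overlapping square patches along
    its width. Patch size is an illustrative choice, not from the paper."""
    h, w = image.shape[:2]
    return [image[:, x:x + patch] for x in range(0, w - patch + 1, patch)]

strip = np.zeros((256, 4096))    # AITEX-sized image: 4096x256 pixels
patches = slice_patches(strip)   # 4096 / 256 = 16 patches of shape (256, 256)
```

Slicing also mitigates the tiny-defect problem: a defect that covers a negligible fraction of the full strip occupies a much larger fraction of its patch.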

Table 1: Few-shot anomaly detection on the MVTec dataset. Per-category AUROC is reported for all competing methods. All numbers are in %. The results of DifferNet

effective for realistic industrial image defect identification tasks. Second, methods demonstrating strong performance on the MVTec dataset may not generalize to other types of defects. For example, while CutPaste is one of the best performing methods on MVTec, its performance on SemiCon and AITEX is much worse than that of more traditional approaches. One potential reason for this poor performance is that the synthesized negative samples used by CutPaste are not representative of the defects in the SemiCon and AITEX datasets; a more detailed analysis can be found in the Appendix.

Table 2: Few-shot defect identification results on three additional industrial image datasets. AUROC is reported as the evaluation metric. All numbers are in %.

10-shot anomaly detection on the CIFAR10-C dataset. All numbers are reported as AUROC in %.
Ours (w/o np)  71.68  64.68  75.74  74.38  76.55  74.69  81.40  79.10  60.02  74.53  72.27
Ours (w/ np)   72.23  67.66  78.41  77.45  75.33  75.20  75.35  75.88  72.20  75.85  74.56

Anomaly detection on the CIFAR100-C dataset. Per-super-class performance is reported. All numbers are reported as AUROC in %.
Sup.Cls. 0  Sup.Cls. 1  Sup.Cls. 2  Sup.Cls. 3  Sup.Cls. 4  Sup.Cls. 5  Sup.Cls. 6  Sup.Cls. 7  Sup.Cls. 8  Sup.Cls. 9

Ablation study on CIFAR10-C 10-shot FSAD. CIPP stands for cross-instance positive pair.

Effect of the negative pair loss with CutPaste augmentation on industrial image datasets.


We further evaluate defect identification performance on another three industrial datasets, namely SemiCon, AITEX and MagneticTile, with results in Tab. 2. We draw the following observations from the results. First, without any prior knowledge, our model achieves state-of-the-art performance under lower budgets of available training samples (5 and 10 shots) on all three datasets; it is only slightly behind DeepSVDD at 50-shot on AITEX. This suggests adapting pretrained models to

A.6 EVALUATION ON ADDITIONAL BACKBONE

We evaluate the effectiveness of the proposed representation learning approach with a stronger backbone network. Specifically, we evaluate ResNet101 on the SemiCon dataset in a few-shot anomaly detection setting. The results in Tab. 7 reveal that contrastive fine-tuning and the cross-instance positive pair loss are also effective with a stronger backbone. We note that the original anomaly detection datasets are already highly imbalanced; for example, the AITEX test set consists of 600 normal and 100 abnormal samples. Since the ratio between normal and abnormal samples does not affect model training, we further implemented a controlled experiment in which we manually decrease the number of anomalies in the SemiCon dataset. Specifically, we fix the number of normal samples to 500 and randomly subsample 5, 10, 20 and 50 abnormal samples for evaluation. The results in Tab. 8 demonstrate the superiority of the proposed method over existing competing methods.
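The imbalance experiment above works because AUROC is insensitive to class ratios. As a reference, a minimal rank-based AUROC (the Mann-Whitney U statistic); this is an illustrative implementation, not the paper's evaluation code:

```python
import numpy as np

def auroc(scores_normal, scores_abnormal):
    """AUROC via the rank-sum (Mann-Whitney U) statistic: the probability
    that a random abnormal sample scores higher than a random normal one."""
    scores = np.concatenate([scores_normal, scores_abnormal])
    labels = np.concatenate([np.zeros(len(scores_normal)), np.ones(len(scores_abnormal))])
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average ranks over tied scores.
    for s in np.unique(scores):
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Perfectly separated scores give AUROC = 1.0, regardless of class ratio.
print(auroc(np.array([0.1, 0.2, 0.3]), np.array([0.8, 0.9])))  # -> 1.0
```

Because the statistic only depends on pairwise orderings, subsampling abnormal samples (as in the Tab. 8 experiment) changes the variance of the estimate but not its expected value.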

