PUSHING THE LIMITS OF FEW-SHOT ANOMALY DETECTION IN INDUSTRY VISION: GRAPHCORE

Abstract

In few-shot anomaly detection (FSAD), efficient visual features play an essential role in memory bank (M)-based methods. However, these methods do not account for the relationship between a visual feature and its rotated counterparts, which drastically limits anomaly detection performance. To push the limits, we reveal that the rotation-invariant property of features has a significant impact on industrial FSAD. Specifically, we introduce graph representation into FSAD and propose a novel visual isometric invariant feature (VIIF) as the anomaly measurement feature. VIIFs robustly improve the anomaly discriminating ability and substantially reduce the number of redundant features stored in M. In addition, we propose GraphCore, a novel model built on VIIFs that enables fast unsupervised FSAD training and improves anomaly detection performance. A comprehensive evaluation comparing GraphCore with other SOTA anomaly detection models under our proposed few-shot setting shows that GraphCore increases average AUC by 5.8%, 4.1%, 3.4%, and 1.6% on MVTec AD and by 25.5%, 22.0%, 16.9%, and 14.1% on MPDD for the 1, 2, 4, and 8-shot cases, respectively.

1. INTRODUCTION

With the rapid development of deep vision detection technology in artificial intelligence, detecting anomalies/defects on the surfaces of industrial products has received unprecedented attention. Changeover in manufacturing refers to converting a line or machine from processing one product to another. Because the equipment has not been fully fine-tuned after the production line restarts, changeover frequently results in unsatisfactory anomaly detection (AD) performance. How to train models of industrial products rapidly in the changeover scenario while still ensuring accurate anomaly detection is therefore a critical issue in the actual production process. The current state of industrial AD is as follows. (1) In terms of detection accuracy, the performance of state-of-the-art (SOTA) AD models degrades dramatically during changeover. Current mainstream work uses a considerable amount of training data as input, as shown in Fig. 1(a). However, this makes data collection challenging, even for unsupervised learning. As a result, many approaches based on few-shot learning have been proposed at the price of accuracy. For instance, Huang et al. (2022) employ meta-learning, as shown in Fig. 1(b). Due to its complicated setting, however, it cannot flexibly migrate to the new product during changeover, and detection accuracy cannot be guaranteed. (2) In terms of training speed, when a large amount of data is used for training, progress on new goods is slowed on the actual production line. Vanilla unsupervised AD requires collecting a large amount of data, and even though meta-learning works in the few-shot regime, as shown in Fig. 1(b), it still needs to train on a massive amount of previously collected data. In our setting (c), there is no requirement to aggregate training categories in advance.
The proposed model, the vision isometric invariant GNN, can quickly obtain invariant features from a few normal samples, and its accuracy outperforms models trained in a meta-learning context. We argue that AD of industrial products requires only a small quantity of data to achieve performance comparable to a large amount of data; that is, a small set of images can contain sufficient information to represent many more. Because industrial products are manufactured with high stability (no evident shape distortion or color cast), the captured images lack the diversity of natural images, and variation arises mainly from shooting angle or rotation. It is therefore essential to extract rotation-invariant structural features. Graph neural networks (GNNs) can robustly extract non-serialized structural features (Han et al. (2022); Bruna et al. (2013); Hamilton et al. (2017); Xu et al. (2018)) and integrate global information better and faster (Wang et al. (2020); Li et al. (2020)), so they are better suited than convolutional neural networks (CNNs) to the problem of extracting rotation-invariant features. For this reason, the core idea of the proposed GraphCore method is to use visual isometric invariant features (VIIFs) as the anomaly measurement features. Among methods that use a memory bank (M) as the AD paradigm, PatchCore (Roth et al. (2022)) uses ResNet (He et al. (2016)) as the feature extractor. However, since features obtained by CNNs are not rotation invariant (Dieleman et al. (2016)), a large number of redundant features are stored in M. Note that these redundant features may come from multiple rotated versions of the same patch structure. A huge quantity of training data is hence required to ensure high accuracy on the test set.
To avoid these redundant features, we propose VIIFs, which not only produce more robust visual features but also dramatically shrink M and accelerate detection. Based on these considerations, the goal of our work is to handle the cold start of the production line during changeover. As shown in Fig. 1(c), we develop a new FSAD method, called GraphCore, that employs a small number of normal samples to achieve fast training and competitive AD accuracy on the new product. On the one hand, by using only a small amount of data, we can train rapidly and accelerate anomaly inference. On the other hand, because we train directly on new product samples, no adaptation or migration of anomalies from the old product to the new product is needed.
Contributions. In summary, the main contributions of this work are as follows:
• We present a feature-augmented method for FSAD that investigates the properties of visual features generated by CNNs.
• We propose a novel anomaly detection model, GraphCore, which adds the new VIIF to the memory bank-based AD paradigm and can drastically reduce the number of redundant visual features.
• Experimental results show that the proposed VIIFs are effective and significantly enhance FSAD performance on MVTec AD and MPDD.
Related Work. Few-shot anomaly detection (FSAD) is an attractive research topic, yet only a few papers are devoted to industrial image FSAD. Some works (Liznerski et al. (2020); Pang et al. (2021); Ding et al. (2022)) experiment with few-shot abnormal images in the test set, which contradicts our assumption that no abnormal images exist. Others (Wu et al. (2021); Huang et al. (2022)) conduct experiments in a meta-learning setting. This configuration has the disadvantage of requiring a large number of base-class images and cannot address the shortage of data under cold-start conditions in industrial applications.
PatchCore (Roth et al. (2022)), SPADE (Cohen & Hoshen (2020)), and PaDiM (Defard et al. (2021)) investigated AD performance on MVTec AD in a few-shot setting. However, these approaches are not designed for changeover-based few-shot settings, so their performance cannot satisfy the requirements of manufacturing changeover. In this work, we propose a feature augmentation method for FSAD that can rapidly complete the training of anomaly detection models with a small quantity of data and meet manufacturing changeover requirements.

2. APPROACH

Motivation. In realistic industrial image datasets (Bergmann et al. (2019); Jezek et al. (2021)), the images in certain categories are extremely similar. Most of them can be converted into one another with simple data augmentation, as with the metal nut (Fig. 2) and the screw (Fig. 6). For instance, rotation augmentation can effectively produce a new screw dataset. Consequently, when faced with the challenges stated in the problem setting, our natural inclination is to acquire additional data through data augmentation. The feature memory bank (Fig. 4) can then store more useful features.

2.1. AUGMENTATION+PATCHCORE

To validate our insight, we adapt PatchCore (Roth et al. (2022)) to our model and denote augmentation (rotation) with PatchCore as Aug.(R). The architecture is depicted in detail in Fig. 2. Before extracting features from the ImageNet pre-trained model, we augment the data (e.g., by rotation). In the training phase, the aim is to build up a memory bank that stores the neighborhood-aware features of all normal samples. The feature memory construction method is shown in Algorithm 1; its core loop is greedy coreset selection: for i ∈ [0, · · · , l − 1], pick m_i ← arg max_{m ∈ M − M_C} min_{n ∈ M_C} ∥ψ(m) − ψ(n)∥_2 and update M_C ← M_C ∪ {m_i}; after the loop, set M ← M_C. We use ResNet18 (He et al. (2016)) as the default feature extraction model. Conceptually, coreset sampling (Sener & Savarese (2018)) for the memory bank balances the size of the memory bank against anomaly detection performance, and the size of the memory bank has a considerable impact on inference speed. In Section 3.3, we discuss the effect of the sampling rate in detail. In the testing phase, given the normal patch feature bank M, the image-level anomaly score s for a test image x_test is computed as the maximum score s* between the test image's patch features P(x_test) and their respective nearest neighbours m* in M; the test image is predicted as anomalous if at least one patch is anomalous, and pixel-level anomaly segmentation is computed from the score of each patch feature. From Table 2 and Table 3, we can easily observe that Aug.(R) greatly outperforms the SOTA models under the proposed few-shot setting.
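The training-time coreset selection and test-time scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions: the random linear projection ψ used inside Algorithm 1 is omitted (raw feature distances are used instead), and the function names `greedy_coreset` and `anomaly_score` are ours, not from the paper's code.

```python
import numpy as np

def greedy_coreset(M, l, seed=0):
    """Greedy k-center subsampling: repeatedly pick the patch feature
    farthest from the current coreset M_C until |M_C| = l."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(M)))]           # start from a random feature
    # distance from every feature to its nearest coreset member so far
    dists = np.linalg.norm(M - M[idx[0]], axis=1)
    for _ in range(l - 1):
        i = int(np.argmax(dists))               # farthest-first selection
        idx.append(i)
        dists = np.minimum(dists, np.linalg.norm(M - M[i], axis=1))
    return M[idx]

def anomaly_score(test_patches, memory):
    """Image-level score s*: the largest nearest-neighbour distance
    over all patch features of the test image."""
    d = np.linalg.norm(test_patches[:, None, :] - memory[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

# Toy example: 200 normal patch features of dimension 16, compressed to l=20.
M = np.random.default_rng(1).normal(size=(200, 16))
bank = greedy_coreset(M, l=20)
s = anomaly_score(np.random.default_rng(2).normal(size=(49, 16)), bank)
```

Because each selected feature's distance to the coreset drops to zero, the arg max never re-picks it, so the loop returns exactly l distinct bank entries.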

2.2. VISION ISOMETRIC INVARIANT FEATURE

In Section 2.1, we heuristically demonstrated that Augmentation+PatchCore outperforms SOTA models in the proposed few-shot anomaly detection setting. Essentially, the data augmentation method directly incorporates the features of normal samples into the memory bank. In other words, Augmentation+PatchCore improves the probability of locating a matching subset feature, allowing the anomaly score of the test image to be calculated with greater precision. This raises the question of whether such invariant representational features can be extracted from a small number of normal samples and added directly to the feature memory bank. As shown in Fig. 3, we propose a new model for feature extraction: the vision isometric invariant graph neural network (VIIG). Motivated by Section 2, the model extracts the visual isometric invariant feature (VIIF) from each patch of a normal sample. As previously stated, the majority of industrial visual anomaly detection datasets are transformable via rotation, translation, and flipping. Thus, the isomorphism property of GNNs suits industrial visual anomaly detection excellently.
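To make the invariance intuition concrete, the following toy sketch (ours, not from the paper) shows why order-invariant aggregation helps: rotating an image by 90° merely permutes its patch grid, so any readout that ignores patch ordering is unchanged, whereas a flattened, position-sensitive readout is not. The sketch assumes the per-patch features themselves are unaffected by the rotation, which holds only approximately for a real extractor.

```python
import numpy as np

def rot90_patch_permutation(n):
    """Index permutation that a 90-degree image rotation induces
    on an n x n patch grid."""
    grid = np.arange(n * n).reshape(n, n)
    return np.rot90(grid).reshape(-1)

rng = np.random.default_rng(0)
n, d = 4, 8
F = rng.normal(size=(n * n, d))          # patch features of the original image
perm = rot90_patch_permutation(n)
F_rot = F[perm]                          # patch features after rotating the image

# A permutation-invariant readout (element-wise max over all patches)
# yields identical features for both orientations ...
assert np.allclose(F.max(axis=0), F_rot.max(axis=0))

# ... whereas an order-sensitive readout (flattening) does not.
assert not np.allclose(F.reshape(-1), F_rot.reshape(-1))
```

Graph aggregation over nearest-neighbour structure behaves like the first readout: it depends on which features are nearby, not on where each patch sits in the raster order.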

2.3. GRAPH REPRESENTATION OF IMAGE

Fig. 4 shows the feature extraction process of GraphCore. Specifically, a normal sample image of size H × W × 3 is evenly divided into N patches, and each patch is transformed into a feature vector f_i ∈ R^D. We thus have the features F = [f_1, f_2, · · · , f_N], where D is the feature dimension and i = 1, 2, · · · , N. We view these features as unordered nodes V = {v_1, v_2, · · · , v_N}. For each node v_i, we find its K nearest neighbours N(v_i) and add an edge e_ij directed from v_j to v_i for every v_j ∈ N(v_i). Hence, the patches of a normal sample can be represented as a graph G = (V, E), where E denotes all the edges of G.
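The graph construction above can be sketched as follows. This is a minimal NumPy illustration assuming Euclidean distances between patch features; the helper name `build_knn_graph` is ours.

```python
import numpy as np

def build_knn_graph(F, k):
    """Build the directed k-NN graph over patch features F (shape N x D):
    for each node v_i, add an edge from each of its k nearest
    neighbours v_j to v_i."""
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)  # N x N distances
    np.fill_diagonal(d, np.inf)                 # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]         # N x k neighbour indices
    edges = [(int(j), int(i)) for i in range(len(F)) for j in nbrs[i]]
    return nbrs, edges

# Toy example: N = 9 patches with D = 4 features each, K = 3 neighbours.
F = np.random.default_rng(0).normal(size=(9, 4))
nbrs, edges = build_knn_graph(F, k=3)
```

Each image thus yields exactly N·K directed edges, and the graph depends only on feature proximity, not on the patches' positions in the image grid.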

2.4. GRAPH FEATURE PROCESSING

Fig. 4 shows the architecture of the proposed vision isometric invariant GNN. Specifically, we base the feature extraction on GCN (Kipf & Welling (2017)) and aggregate features for each node by exchanging information with its neighbour nodes. The feature extraction operates as follows: G′ = F(G, W) = Update(Aggregate(G, W_aggregate), W_update), where W_aggregate and W_update denote the weights of the aggregation and update operations; both can be optimized in an end-to-end manner. The aggregation operation for each node is calculated from the features of its neighbouring nodes: f′_i = h(f_i, g(f_i, N(f_i), W_aggregate), W_update), where h is the node feature update function, g is the node feature aggregation function, and N(f_i) denotes the set of neighbour nodes of f_i at the current layer. Specifically, we employ max-relative graph convolution (Li et al. (2019)) as our operator, so g and h are defined as: g(·) = f″_i = max({f_i − f_j | j ∈ N(v_i)}), h(·) = f′_i = f″_i W_update. Here, g(·) is a max-pooling vertex feature aggregator that aggregates the difference in features between node v_i and all of its neighbours, and h(·) is an MLP layer with batch normalization and ReLU activation.

2.5. GRAPHCORE ARCHITECTURE

Fig. 5 shows the whole architecture of GraphCore. In the training phase, the most significant difference between GraphCore and Augmentation+PatchCore is the feature memory bank construction algorithm. The feature construction algorithm is the same as for the Aug.(R) memory bank in Algorithm 1, except that we use the vision isometric invariant GNN as the feature extractor P, without data augmentation. In the testing phase, the computation of the anomaly score s* for GraphCore is highly similar to that of Augmentation+PatchCore; the only difference is the feature extraction method for each normal patch sample. The architecture details of GraphCore are shown in Table 21.
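The max-relative aggregation and update of Section 2.4 can be sketched as one NumPy step. This is a simplified illustration: batch normalization and the ReLU activation of h(·) are omitted, the update is a single linear map, and the function name `max_relative_gconv` is ours.

```python
import numpy as np

def max_relative_gconv(F, nbrs, W_update):
    """One max-relative graph convolution step:
    g: f''_i = max over j in N(v_i) of (f_i - f_j), element-wise
    h: f'_i  = f''_i @ W_update  (MLP update; BN/ReLU omitted here)."""
    diffs = F[:, None, :] - F[nbrs]     # N x k x D differences f_i - f_j
    g = diffs.max(axis=1)               # element-wise max over the k neighbours
    return g @ W_update                 # linear update step

# Toy example: N = 9 nodes, k = 3 neighbours, D = 4 features.
rng = np.random.default_rng(0)
N, k, D = 9, 3, 4
F = rng.normal(size=(N, D))
nbrs = rng.integers(0, N, size=(N, k))  # stand-in neighbour indices
out = max_relative_gconv(F, nbrs, rng.normal(size=(D, D)))
```

Because the max is taken over an unordered neighbour set, the output for each node does not change if the neighbours are listed in a different order, which is the source of the invariance the method relies on.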

2.6. A UNIFIED VIEW OF AUGMENTATION+PATCHCORE AND GRAPHCORE

Fig. 6 depicts a unified view of Augmentation+PatchCore and GraphCore. Augmentation+PatchCore inspires GraphCore to obtain the isometric invariant features directly. GraphCore can therefore improve the probability of locating a matching feature subset, allowing the anomaly score of a test image to be calculated more precisely and rapidly. Table 1 shows the differences between PatchCore, Augmentation+PatchCore, and GraphCore in terms of architectural details.

3.1. EXPERIMENTAL SETUP

We compare with SOTA methods (Roth et al. (2022); Huang et al. (2022)) using their official source code under our few-shot setting. PatchCore-1 is the result of our reimplementation with a 1% sampling rate, PatchCore-10 and PatchCore-25 are the results at 10% and 25% sampling rates, respectively, and RegAD-L is the RegAD experiment run under our few-shot setting.

3.2. COMPARISON WITH THE SOTA METHODS

Comparative results on MVTec AD and MPDD are shown in Table 2; the performance of RegAD under the meta-learning setting is also listed. Compared with SOTA models, GraphCore improves average AUC by 5.8%, 4.1%, 3.4%, and 1.6% on MVTec AD and by 25.5%, 22.0%, 16.9%, and 14.1% on MPDD for the 1, 2, 4, and 8-shot cases, respectively. From Fig. 7, it can be easily observed that GraphCore significantly outperforms the SOTA approaches at both the image and pixel level from 1-shot to 8-shot. The performance of GraphCore and Augmentation+PatchCore surpasses the other methods when only a few samples are used for training. Because RegAD only reports detailed per-category results for 2-shot and above, we show the detailed 2-shot results in the main text; the 1-shot, 4-shot, and 8-shot results are in the appendix. As shown in Table 3, GraphCore outperforms all other baseline methods in 12 of the 15 categories at the image level and in 11 of the 15 categories at the pixel level on MVTec AD. Moreover, the results in Table 4 show that GraphCore outperforms all other baselines in 5 of the 6 categories at the image level and in all categories at the pixel level on MPDD.

3.3. ABLATION STUDIES

Sampling Rate. As demonstrated in Fig. 8, our technique improves significantly as the sampling rate increases from 0.0001 to 0.001, after which further increases in the sampling rate have a flattening effect on the performance gain. In other words, once the sampling rate is large enough, the performance of GraphCore becomes insensitive to it. Nearest Neighbour. In Fig. 8, green represents the performance of GraphCore with a 9-nearest-neighbour search, and blue represents a 3-nearest-neighbour search. Increasing the number of neighbours from 3 to 9 greatly improves pixel-level performance when the sampling rate is low, but does not enhance image-level performance. As the sampling rate increases, the gain from additional neighbours at the pixel level approaches zero. Augmentation Methods. Fig. 9 demonstrates that the performance of PatchCore on MVTec AD and MPDD is relatively weak, whereas Aug.(R) performs considerably better, heuristically demonstrating that our rotation-based feature augmentation is significantly effective. Moreover, GraphCore outperforms Aug.(R) by a large margin, confirming our assumption that GraphCore can extract the isometric invariant features from industrial anomaly detection images.

3.4. VISUALIZATION

Fig. 10 shows the visualization results obtained by our method on MVTec AD and MPDD with a sampling rate of 0.01 in the 1-shot setting. Each column represents a different item type, and the four rows, from top to bottom, are the detection image, anomaly score map, anomaly map on the detection image, and ground truth. The results show that our method produces satisfactory anomaly localization on various objects, indicating strong generalization ability even in the 1-shot case.

4. CONCLUSION

In this study, we introduce a new approach, GraphCore, for industrial few-shot visual anomaly detection. First, by investigating the CNN-generated feature space, we present a simple pipeline, Augmentation+PatchCore, for obtaining rotation-invariant features; this simple baseline already significantly improves anomaly detection performance. We further propose GraphCore to capture the isometric invariant features of normal industrial samples. It outperforms the SOTA models by a large margin using only a few normal samples (≤ 8) for training. The majority of industrial anomaly detection datasets possess isomorphism, a property ideally suited to GraphCore. We will continue to push the limits of industrial few-shot anomaly detection in the future.

6.1. DATASET

MVTec AD is the most popular dataset for industrial image anomaly detection (Bergmann et al. (2019)). It consists of 15 categories of items, with a total of 3629 normal images as the training set and a collection of 1725 normal and abnormal images as the test set. All images have a resolution between 700×700 and 1024×1024 pixels. MPDD is a more challenging AD dataset containing 6 classes of metal parts (Jezek et al. (2021)); its images are taken at different spatial orientations and distances and under non-uniform backgrounds. The statistical results in Table 22 and Table 23 clearly demonstrate the effectiveness of GraphCore, especially in terms of memory bank size and inference speed. The statistical results presented in Tables 24 and 25 demonstrate that the rotation method outperforms the other augmentation techniques. We believe this indicates that the majority of industrial anomaly image datasets can be augmented by rotation. In the future, we believe there will be more complex and realistic industrial anomaly image datasets that cannot be handled by rotation alone.



Figure 1: Different from (a) vanilla unsupervised AD and (b) few-shot unsupervised AD in meta learning. As input training samples, our setting (c) only utilizes a small number of normal samples. For our setting (c), there is no requirement to aggregate training categories in advance. The proposed model, vision isometric invariant GNN, can fast obtain the invariant feature within a few normal samples, and its accuracy outperforms models trained in a meta-learning context.

Fig. 1(c) outlines the formal definition of the problem setting for the proposed FSAD. During training, only a set of n normal samples from a specific category is given, where n ≤ 8. At test time, given a normal or abnormal sample from the target category, the anomaly detection model should predict whether or not the image is anomalous and localize the anomaly region if the prediction is anomalous. Challenges. For the FSAD setting proposed in Fig. 1(c), we attempt to detect anomalies in the test sample using only a small number of normal images as the training dataset. The key challenges are: (1) each category's training dataset contains only normal samples, i.e., no annotations at the image or pixel level; and (2) few normal training samples are available: in our proposed setting, at most 8.

Figure 2: Augmentation+PatchCore Architecture.

Figure 3: Convolution feature VS vision isometric invariant feature.

Figure 4: Vision isometric invariant GNN pipeline.

Figure 5: Vision isometric invariant GNN for FSAD.

Figure 7: GraphCore VS Augmentation+PatchCore VS RegAD on various numbers of shot (K).

Figure 8: Ablation results on sampling rates and the number of N nearest neighbors.

Figure 9: GraphCore vs Augmentation+PatchCore vs PatchCore on various number of shot (K).

Figure 10: Visualization results of the proposed method on MVTec AD and MPDD. The first row denotes the training example in the 1-shot setting. The second row is test samples (abnormal). The third row is the heatmap on test samples. The fourth row is the anomaly mask (ground truth).

Algorithm 1 inputs: ImageNet pre-trained model ϕ, all normal samples X_N, data augmentation operator α, patch feature extractor P, memory size target l, random linear projection ψ.

Table 1: Unified view of the three methods.



Table 3: FSAD results on MVTec AD. The number of shots K is 2 and the sampling ratio is 0.01; x|y denotes image AUROC | pixel AUROC. The results for PaDiM, PatchCore-10, and PatchCore-25 are reported from Roth et al. (2022). The results for RegAD are reported from Huang et al. (2022). The best-performing method is in bold.

Table 4: FSAD results on MPDD. The number of shots K is 2 and the sampling ratio is 0.01; x|y denotes image AUROC | pixel AUROC. The results for PaDiM, PatchCore-10, and PatchCore-25 are reported from Roth et al. (2022). The results for RegAD are reported from Huang et al. (2022). The best-performing method is in bold.

The training set contains 888 normal images, and the test set contains 176 normal images and 282 abnormal images. The resolution of all images is 1024×1024 pixels.

Results of anomaly detection. Setting: New Few-shot Setting, K (number of shots)=1, Dataset: MVTec, Sampling Ratio: 0.01, Metrics: Image AUROC. The number of shots for RegAD is 2. The data for PaDiM, PatchCore-10, and PatchCore-25 are from Roth et al. (2022).

Setting: Ours Few-shot Setting, K (number of shot)=1, Dataset: MVTec, Sampling Ratio: 0.01, Metrics: Pixel AUROC. The number of shot for RegAD is 2. The data for PaDiM and PatchCore-10, PatchCore-25 are from Roth et al. (2022).

Setting: Ours Few-shot Setting, K (number of shot)=2, Dataset: MVTec, Sampling Ratio: 0.01, Metrics: Image AUROC. The data for PaDiM and PatchCore-10, PatchCore-25 are from Roth et al. (2022).

Setting: Ours Few-shot Setting, K (number of shot)=2, Dataset: MVTec, Sampling Ratio: 0.01, Metrics: Pixel AUROC. The data for PaDiM and PatchCore-10, PatchCore-25 are from Roth et al. (2022).



Setting: New Few-shot Setting, K (number of shot)=8, Dataset: MVTec, Sampling Ratio: 0.01, Metrics: Pixel AUROC

Setting: New Few-shot Setting, K (number of shot)=2, Dataset: MPDD, Sampling Ratio: 0.01, Metrics: Pixel AUROC

Setting: New Few-shot Setting, K (number of shot)=8, Dataset: MPDD, Sampling Ratio: 0.01, Metrics: Image AUROC

Setting: New Few-shot Setting, K (number of shot)=8, Dataset: MPDD, Sampling Ratio: 0.01, Metrics: Pixel AUROC

The architecture details of GraphCore (pooling and MLP) are given in Table 21, where D represents the feature dimension, K represents the number of neighbours in GraphCore, and H × W represents the size of the input image. We adapt the GCN into the pyramid architecture of Wang et al. (2021b). The number of training epochs is 300, the optimizer is AdamW (Loshchilov & Hutter), the batch size is 128, the initial learning rate is 0.005 with a cosine schedule, the number of warmup epochs is 50, the weight decay is 0.05, and the loss function is cross entropy.

Ablation study for memory bank size and inference speed with respect to 1 shot

Ablation study for memory bank size and inference speed with respect to 2 shot

Ablation study with respect to Dataset: MVTec 2D, sampling rate: 0.01, Metrics: image-level AUROC, number of shots is 1.

Ablation study with respect to Dataset: MVTec 2D, sampling rate: 0.01, Metrics: image-level AUROC, number of shots is 2.

5. ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China under Grant No. 62122035, 62206122, and 61972188. Y. Jin is supported by an Alexander von Humboldt Professorship for AI endowed by the German Federal Ministry of Education and Research.

