diff --git a/conclusionandoutlook.typ b/conclusionandoutlook.typ
index 11a3cd2..2b2154d 100644
--- a/conclusionandoutlook.typ
+++ b/conclusionandoutlook.typ
@@ -1,14 +1,14 @@
 = Conclusion and Outlook
 == Conclusion
 In conclusion one can say that Few-Shot learning is not the best choice for anomaly detection tasks.
-It is hugely outperformed by state of the art algorithms like Patchcore or EfficientAD.
+It is substantially outperformed by state-of-the-art algorithms such as PatchCore or EfficientAD.
 The only benefit of Few-Shot learning is that it can be used in environments where only a limited number of good samples are available.
 But this should not be the case in most scenarios.
-Most of the time plenty of good samples are available and in this case Patchcore or EfficientAD should perform great.
+Most of the time plenty of good samples are available, and in this case PatchCore or EfficientAD should perform well.
-The only case where Few-Shot learning could be used is in a scenarios where one wants to detect the anomaly class itself.
+The only case where Few-Shot learning could be used is in a scenario where one wants to detect the anomaly class itself.
-Patchcore and EfficientAD can only detect if an anomaly is present or not but not what type of anomaly it actually is.
-So chaining a Few-Shot learner after Patchcore or EfficientAD could be a good idea to use the best of both worlds.
+PatchCore and EfficientAD can only detect whether an anomaly is present, but not what type of anomaly it actually is.
+So chaining a Few-Shot learner after PatchCore or EfficientAD could be a good idea to use the best of both worlds.
 In most of the tests P>M>F performed the best.
 But also the simple ResNet50 method performed better than expected in most cases and can be considered if the computational resources are limited and if a simple architecture is enough.
@@ -19,4 +19,4 @@ There might be a lack of research in the area where the classes to detect are ve
 and when building a few-shot learning algorithm tailored specifically for very similar classes this could boost the performance by a large margin.
-It might be interesting to test the SOT method (see @SOT) with a ResNet50 feature extractor similar as proposed in this thesis but with SOT for embedding comparison.
+It might be interesting to test the SOT method (see @SOT) with a ResNet50 feature extractor similar to the one proposed in this thesis, but with SOT for embedding comparison.
-Moreover, TRIDENT (see @TRIDENT) could achive promising results in a anomaly detection scenario.
+Moreover, TRIDENT (see @TRIDENT) could achieve promising results in an anomaly detection scenario.
diff --git a/experimentalresults.typ b/experimentalresults.typ
index 1efc1e2..9672afc 100644
--- a/experimentalresults.typ
+++ b/experimentalresults.typ
@@ -5,16 +5,16 @@
 == Is Few-Shot learning a suitable fit for anomaly detection?
 _Should Few-Shot learning be used for anomaly detection tasks?
-How does it compare to well established algorithms such as Patchcore or EfficientAD?_
+How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
 @comparison2waybottle shows the performance of the 2-way classification (anomaly or not) on the bottle class and @comparison2waycable the same on the cable class.
 The performance values are the same as in @experiments but just merged together into one graph.
-As a reference Patchcore reaches an AUROC score of 99.6% and EfficientAD reaches 99.8% averaged over all classes provided by the MVTec AD dataset.
+As a reference, PatchCore reaches an AUROC score of 99.6% and EfficientAD reaches 99.8% averaged over all classes provided by the MVTec AD dataset.
 Both are trained with samples from the 'good' class only.
 So there is a clear performance gap between Few-Shot learning and the state of the art anomaly detection algorithms.
-In the @comparison2way Patchcore and EfficientAD are not included as they aren't directly compareable in the same fashion.
+In the @comparison2way PatchCore and EfficientAD are not included as they aren't directly comparable in the same fashion.
-That means if the goal is just to detect anomalies, Few-Shot learning is not the best choice, and Patchcore or EfficientAD should be used.
+That means if the goal is just to detect anomalies, Few-Shot learning is not the best choice, and PatchCore or EfficientAD should be used.
 #subpar.grid(
   figure(image("rsc/comparison-2way-bottle.png"), caption: [
diff --git a/implementation.typ b/implementation.typ
index dafdc26..d1fa8a5 100644
--- a/implementation.typ
+++ b/implementation.typ
@@ -17,27 +17,27 @@ For all of the three methods we test the following use-cases:
-- Inbalanced 2 Way classification (5,10,15,30 good shots, 5 bad shots)
+- Imbalanced 2 Way classification (5,10,15,30 good shots, 5 bad shots)
-  - Similar to the 2 way classification but with an inbalanced number of good shots.
+  - Similar to the 2 way classification but with an imbalanced number of good shots.
-- Inbalanced target class prediction (5,10,15,30 good shots, 5 bad shots)#todo[Avoid bullet points and write flow text?]
+- Imbalanced target class prediction (5,10,15,30 good shots, 5 bad shots)#todo[Avoid bullet points and write flow text?]
-  - Detect only the faulty classes without the good classed with an inbalanced number of shots.
+  - Detect only the faulty classes without the good ones, but with an imbalanced number of shots.
-All those experiments were conducted on the MVTEC AD dataset on the bottle and cable classes.
+All those experiments were conducted on the MVTec AD dataset on the bottle and cable classes.
 == Experiment Setup
-All the experiments were done on the bottle and cable classes of the MVTEC AD dataset.
+All the experiments were done on the bottle and cable classes of the MVTec AD dataset.
-The correspoinding number of shots were randomly selected from the dataset.
+The corresponding number of shots were randomly selected from the dataset.
 The rest of the images was used to test the model and measure the accuracy.
 #todo[Maybe add real number of samples per classes]
 == ResNet50
 === Approach
-The simplest approach is to use a pre-trained ResNet50 model as a feature extractor.
+The simplest approach is to use a pretrained ResNet50 model as a feature extractor.
 From both the support and query set the features are extracted to get a downprojected representation of the images.
 After downprojection the support set embeddings are compared to the query set embeddings.
 To predict the class of a query, the class with the smallest distance to the support embedding is chosen.
 If there are more than one support embedding within the same class the mean of those embeddings is used (class center).
-This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just Use a Library of Pre-trained Feature
+This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just use a Library of Pre-trained Feature
 Extractors and a Simple Classifier_ @chowdhury2021fewshotimageclassificationjust but just with a simple distance metric instead of a neural net.
-In this bachelor thesis a pre-trained ResNet50 (IMAGENET1K_V2) pytorch model was used.
+In this bachelor thesis a pretrained ResNet50 (IMAGENET1K_V2) PyTorch model was used.
-It is pretrained on the imagenet dataset and has 50 residual layers.
+It is pretrained on the ImageNet dataset and has 50 residual layers.
 To get the embeddings the last layer of the model was removed and the output of the second last layer was used as embedding output.
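+
+To make the approach more concrete, the sketch below shows how such a class-center classifier could look in PyTorch; it is illustrative only, and the helper names, preprocessing and batching are assumptions rather than the exact code used in this thesis.
+```python
+import torch
+from torchvision.models import resnet50, ResNet50_Weights
+
+# Pretrained backbone; replacing the final fully connected layer with an identity
+# yields the pooled 2048-dimensional output of the second-to-last layer.
+weights = ResNet50_Weights.IMAGENET1K_V2
+backbone = resnet50(weights=weights)
+backbone.fc = torch.nn.Identity()
+backbone.eval()
+preprocess = weights.transforms()
+
+@torch.no_grad()
+def embed(images):
+    # images: list of PIL images -> (N, 2048) embedding matrix
+    return backbone(torch.stack([preprocess(img) for img in images]))
+
+@torch.no_grad()
+def classify(support_images, support_labels, query_images):
+    z_support, z_query = embed(support_images), embed(query_images)
+    classes = sorted(set(support_labels))
+    # Class centers: mean embedding of the support samples of each class.
+    centers = torch.stack([
+        z_support[torch.tensor([l == c for l in support_labels])].mean(dim=0)
+        for c in classes
+    ])
+    # Each query is assigned to the class center with the smallest euclidean distance.
+    return [classes[i] for i in torch.cdist(z_query, centers).argmin(dim=1).tolist()]
+```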
@@ -95,7 +95,7 @@ The class with the smallest distance is chosen as the predicted class.
-This method performed better than expected with such a simple method.
+For such a simple approach, this method performed better than expected.
 As in @resnet50bottleperfa with a normal 5 shot / 4 way classification the model achieved an accuracy of 75%.
-When detecting if there occured an anomaly or not only the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
+When detecting only whether an anomaly occurred or not, the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
-Interestintly the model performed slightly better with fewer shots in this case.
+Interestingly, the model performed slightly better with fewer shots in this case.
 Moreover in @resnet50bottleperfa, the detection of the anomaly class only (3 way) shows a similar pattern as the normal 4 way classification.
 The more shots the better the performance and it peaks at around 88% accuracy with 5 shots.
@@ -137,7 +137,7 @@ but this is expected as the cable class consists of 8 faulty classes.
 == P>M>F
 === Approach
 For P>M>F, I used the pretrained model weights from the original paper.
-As backbone feature extractor a DINO model is used, which is pre-trained by facebook.
+As backbone feature extractor a DINO model is used, which is pretrained by Facebook.
 This is a vision transformer with a patch size of 16 and 12 attention heads learned in a self-supervised fashion.
-This feature extractor was meta-trained with 10 public image dasets #footnote[ImageNet-1k, Omniglot, FGVC- Aircraft, CUB-200-2011, Describable Textures, QuickDraw,
+This feature extractor was meta-trained with 10 public image datasets #footnote[ImageNet-1k, Omniglot, FGVC-Aircraft, CUB-200-2011, Describable Textures, QuickDraw,
 FGVCx Fungi, VGG Flower, Traffic Signs and MSCOCO~@pmfpaper] of diverse domains
 by the authors of the original paper.~@pmfpaper
 Finally, this model is fine-tuned with the support set of every test iteration.
-Every time the support set changes, we need to finetune the model again.
+Every time the support set changes, we need to fine-tune the model again.
 In a real world scenario this should not be the case because the support set is fixed and only the query set changes.
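+
+The following sketch illustrates what this per-episode fine-tuning step could look like; it is a simplified illustration under stated assumptions (the original method fine-tunes on augmented copies of the support set and uses its own hyperparameters, both of which are omitted here).
+```python
+import torch
+import torch.nn.functional as F
+
+def finetune_on_support(backbone, support_x, support_y, steps=20, lr=1e-5):
+    # Adapt the backbone to the current support set with a prototypical loss.
+    backbone.train()
+    opt = torch.optim.Adam(backbone.parameters(), lr=lr)
+    classes = support_y.unique()
+    targets = torch.stack([(classes == y).nonzero().squeeze() for y in support_y])
+    for _ in range(steps):
+        z = F.normalize(backbone(support_x), dim=1)
+        protos = F.normalize(torch.stack([z[support_y == c].mean(0) for c in classes]), dim=1)
+        logits = z @ protos.t() / 0.1   # cosine logits with an assumed temperature
+        loss = F.cross_entropy(logits, targets)
+        opt.zero_grad()
+        loss.backward()
+        opt.step()
+    backbone.eval()
+    return backbone
+```
+In a production setting this step would only have to run once, since the support set stays fixed and only the queries change.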
 === Results
diff --git a/introduction.typ b/introduction.typ
index 498a8fd..431d43c 100644
--- a/introduction.typ
+++ b/introduction.typ
@@ -5,7 +5,7 @@
 Anomaly detection is of essential importance, especially in the industrial and automotive field.
 Lots of assembly lines need visual inspection to find errors often with the help of camera systems.
 Machine learning helped the field to advance a lot in the past.
-Most of the time the error rate is sub $.1%$ and therefore plenty of good data and almost no faulty data is available.
+Most of the time the error rate is below $0.1%$ and therefore plenty of good data and almost no faulty data is available.
 So the train data is heavily unbalanced.~#cite()
 PatchCore and EfficientAD are state of the art algorithms trained only on good data and then detect anomalies within unseen (but similar) data.
@@ -20,7 +20,7 @@ Moreover, few-shot learning might be able not only to detect anomalies but also
 === Is Few-Shot learning a suitable fit for anomaly detection?
 _Should Few-Shot learning be used for anomaly detection tasks?
-How does it compare to well established algorithms such as Patchcore or EfficientAD?_
+How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
 === How does disbalancing the Shot number affect performance?
 _Does giving the Few-Shot learner more good than bad samples improve the model performance?_
@@ -38,7 +38,7 @@ How does it compare to PatchCore and EfficientAD?_
 This thesis is structured to provide a comprehensive exploration of Few-Shot Learning in anomaly detection.
 @sectionmaterialandmethods introduces the datasets and methodologies used in this research.
 The MVTec AD dataset is discussed in detail as the primary source for benchmarking, along with an overview of the Few-Shot Learning paradigm.
-The section elaborates on the three selected methods—ResNet50, P>M>F, and CAML—while also touching upon well established anomaly detection algorithms such as Pachcore and EfficientAD.
+The section elaborates on the three selected methods—ResNet50, P>M>F, and CAML—while also touching upon well-established anomaly detection algorithms such as PatchCore and EfficientAD.
 @sectionimplementation focuses on the practical realization of the methods described in the previous chapter.
 It outlines the experimental setup, including the use of Jupyter Notebook for prototyping and testing, and provides a detailed account of how each method was implemented and evaluated.
diff --git a/materialandmethods.typ b/materialandmethods.typ
index e253c0a..b783c28 100644
--- a/materialandmethods.typ
+++ b/materialandmethods.typ
@@ -55,7 +55,7 @@ More defect classes are already an indication that a classification task might b
   Cut outer insulation
 ]), ,
 figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
-  Mising cable defect
+  Missing cable defect
 ]), ,
 figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
   Poke insulation defect
@@ -142,7 +142,7 @@ $
 === Euclidean Distance
 The euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
 It just calculates the square root of the sum of the squared differences of the coordinates.
-the euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
+The euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
 @analysisrudin
 $
 $
-=== Patchcore
+=== PatchCore
 // https://arxiv.org/pdf/2106.08265
 PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
 It operates on the principle that an image is anomalous if any of its patches is anomalous.
 The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times.
 #cite()
 #todo[Absatz umformulieren und vereinfachen]
-The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
+The PatchCore framework leverages a pretrained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
 By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pre-trained on ImageNet.
 To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution.
 #cite()
@@ -172,13 +172,13 @@ If any patch exhibits a significant deviation, the corresponding image is flagge
 For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite()
-Patchcore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
+PatchCore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
 A great advantage of this method is the coreset subsampling reducing the memory bank size significantly.
 This lowers computational costs while maintaining detection accuracy.~#cite()
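+
+As a rough illustration of this patch-based scoring idea, the sketch below shows a heavily simplified variant; it is not the official implementation (greedy coreset selection is replaced by random subsampling and the score re-weighting step is omitted).
+```python
+import torch
+
+def build_memory_bank(good_patch_features, keep_ratio=0.1):
+    # good_patch_features: (N, d) patch embeddings from the 'good' training images.
+    # PatchCore uses greedy coreset selection; random subsampling stands in here.
+    keep = max(1, int(len(good_patch_features) * keep_ratio))
+    idx = torch.randperm(len(good_patch_features))[:keep]
+    return good_patch_features[idx]
+
+def anomaly_score(test_patch_features, memory_bank):
+    # Distance of every test patch to its nearest neighbour in the memory bank.
+    nn_dist = torch.cdist(test_patch_features, memory_bank).min(dim=1).values
+    # The image is as anomalous as its most anomalous patch; the per-patch
+    # distances can be reshaped and upsampled into a localization map.
+    return nn_dist.max().item(), nn_dist
+```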
 #figure(
   image("rsc/patchcore_overview.png", width: 80%),
-  caption: [Architecture of Patchcore. #cite()],
+  caption: [Architecture of PatchCore. #cite()],
 )
 === EfficientAD
@@ -186,13 +186,13 @@
 #todo[reference to image below]
 The detection of anomalies is achieved through a student-teacher framework.
-The teacher network is a PDN and pre-trained on normal (good) images and the student network is trained to predict the teachers output.
-An anomalie is identified when the student failes to replicate the teachers output.
+The teacher network is a PDN and pretrained on normal (good) images and the student network is trained to predict the teacher's output.
+An anomaly is identified when the student fails to replicate the teacher's output.
-This works because of the abscence of anomalies in the training data and the student network has never seen an anomaly while training.
+This works because of the absence of anomalies in the training data and the student network has never seen an anomaly while training.
-A special loss function helps the student network not to generalize too broadly and inadequatly learn to predict anomalous features.~#cite()
+A special loss function helps the student network not to generalize too broadly and inadvertently learn to predict anomalous features.~#cite()
@@ -200,7 +200,7 @@ Additionally to this structural anomaly detection, EfficientAD can also address
 This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite()
 By comparing the outputs of the autoencoder and the student logical anomalies are effectively detected.
-This is a challenge that Patchcore does not directly address.~#cite()
+This is a challenge that PatchCore does not directly address.~#cite()
 #todo[maybe add key advantages such as low computational cost and high performance]
@@ -227,7 +227,7 @@ Convolutional layers capture features like edges, textures or shapes.
 Pooling layers sample down the feature maps created by the convolutional layers.
-This helps reducing the computational complexity of the overall network and help with overfitting.
+This helps reduce the computational complexity of the overall network and helps with overfitting.
 Common pooling layers include average- and max pooling.
-Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
+Finally, after some convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
 @cnnarchitecture shows a typical binary classification task.~#cite()
 #figure(
@@ -263,11 +263,11 @@ There are well established methods for pretraining which can be used such as DIN
 #cite()
 *Meta-training:*
-The second stage in the pipline as in @pmfarchitecture is the meta-training.
-Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
+The second stage in the pipeline as in @pmfarchitecture is the meta-training.
+Here a prototypical network (ProtoNet) is used to refine the pretrained backbone.
 ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
 Have a look at @prototypefewshot for a visualisation of its architecture.
 The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
 The probability of a query image $x$ belonging to a class $k$ is given by the $exp$ of the distance of the sample to the class center divided by the sum of all distances:
 $
@@ -276,7 +276,7 @@ $
 As a distance metric $d$ a cosine similarity is used.
 See @cosinesimilarity for the formula.
 $c_k$, the prototype of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$ and $N_k$ is just the number of samples of class $k$.
-The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite()
+The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite()
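+
+Read as code, this classification rule could look like the sketch below; the shapes and the temperature-free softmax are assumptions for illustration, not the exact P>M>F implementation.
+```python
+import torch
+import torch.nn.functional as F
+
+def protonet_probabilities(backbone, support_x, support_y, query_x):
+    z_s = backbone(support_x)                  # (N, m) support embeddings f(x_i)
+    z_q = backbone(query_x)                    # (Q, m) query embeddings f(x)
+    classes = support_y.unique()
+    # c_k: mean embedding of the N_k support samples of class k
+    prototypes = torch.stack([z_s[support_y == c].mean(dim=0) for c in classes])
+    # d(f(x), c_k): cosine similarity between each query and each prototype
+    d = F.cosine_similarity(z_q.unsqueeze(1), prototypes.unsqueeze(0), dim=2)
+    return F.softmax(d, dim=1)                 # p(y = k | x) for every query image
+```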
 *Fine-tuning:*
-If a novel task is drawn from an unseen domain the model may fail to generalize because of a significant fail in the distribution.
+If a novel task is drawn from an unseen domain the model may fail to generalize because of a significant shift in the distribution.
@@ -293,29 +293,29 @@ During this step, the entire model is fine-tuned to the new domain.~#cite()
+The query image is then assigned to the class with the closest prototype.~#cite()
 *Performance:*
 P>M>F performs well across several few-shot learning benchmarks.
-The combination of pre-training on large dataset and meta-trainng with episodic tasks helps the model to generalize well.
-The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite()
+The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
+The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.~#cite()
 *Limitations and Scalability:*
 This method has some limitations.
-It relies on domains with large external datasets and it requires substantial computational resources to create pre-trained models.
-Fine-tuning is effective but might be slow and not work well on devices with limited computationsl resources.
+It relies on domains with large external datasets and it requires substantial computational resources to create pretrained models.
+Fine-tuning is effective but might be slow and not work well on devices with limited computational resources.
 Future research could focus on exploring faster and more efficient methods for fine-tuning models.
 #cite()
 === CAML
 // https://arxiv.org/pdf/2310.10971v2
 CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
-It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
+It consists of three different components: a frozen pretrained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
 This is a universal meta-learning approach.
 That means no fine-tuning or meta-training is applied for specific domains.~#cite()
 *Architecture:*
-CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
+CAML first encodes the query and support set images using the frozen pretrained feature extractor as shown in @camlarchitecture.
 This step brings the images into a low dimensional space where similar images are encoded into similar embeddings.
 The class labels are encoded with the ELMES class encoder.
 Since the class of the query image is unknown in this stage a special learnable "unknown token" is added to the encoder.
@@ -343,14 +343,14 @@ Afterwards it is passed through a simple MLP network to predict the class of the
 ~#cite()
 *Large-Scale Pre-Training:*
-CAML is pre-trained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
+CAML is pretrained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
 Those datasets span over different domains and help to detect any new visual concept during inference.
 Only the non-causal sequence model is trained and the weights of the image encoder and ELMES encoder are kept frozen.
 ~#cite()
 *Inference:*
 During inference, CAML processes the following:
-- Encodes the support set images and labels with the pre-trained feature and class encoders.
+- Encodes the support set images and labels with the pretrained feature and class encoders.
 - Concatenates these encodings into a sequence alongside the query image embedding.
 - Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
 - Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite()
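+
+A toy sketch of this inference flow is shown below; the module and parameter names (`image_encoder`, `elmes`, `seq_model`, `head`, `unknown_token`) are hypothetical stand-ins for the components described above, not the reference implementation.
+```python
+import torch
+
+def caml_predict(image_encoder, elmes, seq_model, head,
+                 support_x, support_y, query_x, unknown_token):
+    with torch.no_grad():
+        z_s = image_encoder(support_x)               # (N, d) frozen support embeddings
+        z_q = image_encoder(query_x)                 # (1, d) frozen query embedding
+    label_codes = elmes[support_y]                   # (N, e) fixed ELMES label encodings
+    unknown = unknown_token.expand(z_q.size(0), -1)  # learnable "unknown" label code
+    support_tokens = torch.cat([z_s, label_codes], dim=1)
+    query_token = torch.cat([z_q, unknown], dim=1)
+    # One joint sequence: the query token followed by all support tokens.
+    seq = torch.cat([query_token, support_tokens], dim=0).unsqueeze(0)
+    out = seq_model(seq)                             # non-causal (bidirectional) sequence model
+    return head(out[0, 0])                           # classify the transformed query token
+```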
@@ -365,7 +365,7 @@ It performes competitively against P>M>F in 8 benchmarks even though P>M>F was m
 CAML does great in generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX) and low-resolution tasks (e.g., CIFAR-fs).
-Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
+Its use of frozen pretrained feature extractors is key to avoiding overfitting and enabling robust performance.
 ~#cite()
 #todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
@@ -383,17 +383,17 @@ Either they performed worse on benchmarks compared to the used methods or they w
 // https://arxiv.org/pdf/2211.16191v2
 // https://arxiv.org/abs/2211.16191v2
-SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
-It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
+SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pretrained vision-language models like CLIP.
+It focuses on generating better visual features for specific tasks while still using the general knowledge from the pretrained model.
 Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
-This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
+This process is supported by knowledge distillation, where detailed information from the pretrained model guides the learning of the new visual features.
 Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite()
 One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
-The use of pre-trained knowledge helps reduce the need for large datasets.
-However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
-If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
-This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
+The use of pretrained knowledge helps reduce the need for large datasets.
+However, a disadvantage is that it depends heavily on the quality and capabilities of the pretrained model.
+If the pretrained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
+This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pretrained models.
 Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite()
 === TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification)
@@ -414,7 +414,7 @@ Its ability to isolate critical features while droping irellevant context aligns
 // https://arxiv.org/pdf/2204.03065v1
 // https://arxiv.org/abs/2204.03065v1
-The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tsks like matching, grouping or classification by re-embedding feature representations.
+The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
 This transform processes features as a set instead of using them individually.
 This creates context-aware representations.
 SOT can catch direct as well as indirect similarities between features which makes it suitable for tasks like few-shot learning or clustering.~#cite()
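+
+A compact sketch of what such a set-level re-embedding could look like is given below; the cosine cost, the diagonal handling and the plain Sinkhorn normalization are assumptions based on my reading of the SOT idea, not the reference code.
+```python
+import torch
+
+def sot_transform(features, n_iters=10, reg=0.1):
+    # Re-embed every feature by its transport weights to all other features
+    # in the set, so each new representation is context aware.
+    v = torch.nn.functional.normalize(features, dim=1)
+    cost = 1.0 - v @ v.t()            # pairwise cosine distances within the set
+    cost.fill_diagonal_(1e3)          # discourage trivial self-matching
+    plan = torch.exp(-cost / reg)     # entropic kernel
+    for _ in range(n_iters):          # Sinkhorn-style row/column normalization
+        plan = plan / plan.sum(dim=1, keepdim=True)
+        plan = plan / plan.sum(dim=0, keepdim=True)
+    return plan                       # row i is the re-embedded feature i
+```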