fix lots of typos

lukas-heiligenbrunner 2025-02-02 12:59:15 +01:00
parent 94fe252741
commit cf6f4f96ac
5 changed files with 53 additions and 53 deletions

View File

@ -1,14 +1,14 @@
= Conclusion and Outlook <sectionconclusionandoutlook>
== Conclusion
In conclusion, one can say that Few-Shot learning is not the best choice for anomaly detection tasks.
It is hugely outperformed by state-of-the-art algorithms such as PatchCore or EfficientAD.
The only benefit of Few-Shot learning is that it can be used in environments where only a limited number of good samples are available.
But this should not be the case in most scenarios.
Most of the time, plenty of good samples are available, and in this case PatchCore or EfficientAD should perform well.
The only case where Few-Shot learning could be used is in scenarios where one wants to detect the anomaly class itself.
PatchCore and EfficientAD can only detect whether an anomaly is present, but not what type of anomaly it actually is.
So chaining a Few-Shot learner after PatchCore or EfficientAD could be a good idea to combine the best of both worlds.
In most of the tests, P>M>F performed best.
The simple ResNet50 method also performed better than expected in most cases and can be considered if the computational resources are limited and a simple architecture is enough.
@ -19,4 +19,4 @@ There might be a lack of research in the area where the classes to detect are ve
and when building a few-shot learning algorithm tailored specifically for very similar classes, this could boost the performance by a large margin.
It might be interesting to test the SOT method (see @SOT) with a ResNet50 feature extractor similar to the one proposed in this thesis, but with SOT for the embedding comparison.
Moreover, TRIDENT (see @TRIDENT) could achieve promising results in an anomaly detection scenario.

View File

@ -5,16 +5,16 @@
== Is Few-Shot learning a suitable fit for anomaly detection? <expresults2way>
_Should Few-Shot learning be used for anomaly detection tasks?
How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
@comparison2waybottle shows the performance of the 2-way classification (anomaly or not) on the bottle class and @comparison2waycable the same on the cable class.
The performance values are the same as in @experiments but merged into one graph.
As a reference, PatchCore reaches an AUROC score of 99.6% and EfficientAD reaches 99.8%, averaged over all classes provided by the MVTec AD dataset.
Both are trained with samples from the 'good' class only.
So there is a clear performance gap between Few-Shot learning and the state-of-the-art anomaly detection algorithms.
In @comparison2way, PatchCore and EfficientAD are not included as they are not directly comparable in the same fashion.
That means if the goal is just to detect anomalies, Few-Shot learning is not the best choice, and PatchCore or EfficientAD should be used.
#subpar.grid(
figure(image("rsc/comparison-2way-bottle.png"), caption: [

View File

@ -17,27 +17,27 @@ For all of the three methods we test the following use-cases:
- Imbalanced 2-way classification (5, 10, 15, 30 good shots, 5 bad shots)
- Similar to the 2-way classification but with an imbalanced number of good shots.
- Imbalanced target class prediction (5, 10, 15, 30 good shots, 5 bad shots)#todo[Avoid bullet points and write flow text?]
- Detect only the faulty classes without the good ones, but with an imbalanced number of shots.
All those experiments were conducted on the MVTec AD dataset on the bottle and cable classes.
== Experiment Setup
All the experiments were done on the bottle and cable classes of the MVTec AD dataset.
The corresponding number of shots was randomly selected from the dataset.
The rest of the images were used to test the model and measure the accuracy.
#todo[Maybe add real number of samples per class]
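To illustrate this setup, the support/query split could look like the following sketch (`good_images` and the helper name are placeholders for illustration, not the actual experiment code):

```python
import random

def split_shots(images: list, n_shots: int, seed: int = 0):
    # Randomly pick n_shots support samples; the rest becomes the query/test set.
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    return shuffled[:n_shots], shuffled[n_shots:]

# e.g. 5 good shots for the support set, remaining good images for testing
support_good, query_good = split_shots(good_images, 5)
```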
== ResNet50 <resnet50impl>
=== Approach
The simplest approach is to use a pretrained ResNet50 model as a feature extractor.
From both the support and the query set, the features are extracted to get a downprojected representation of the images.
After downprojection, the support set embeddings are compared to the query set embeddings.
To predict the class of a query, the class with the smallest distance to the support embedding is chosen.
If there is more than one support embedding within the same class, the mean of those embeddings is used (class center).
This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just Use a Library of Pre-trained Feature
Extractors and a Simple Classifier_ @chowdhury2021fewshotimageclassificationjust, but with a simple distance metric instead of a neural net.
In this bachelor thesis, a pretrained ResNet50 (IMAGENET1K_V2) PyTorch model was used.
It is pretrained on the ImageNet dataset and has 50 residual layers.
To get the embeddings, the last layer of the model was removed and the output of the second-to-last layer was used as the embedding output.
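A minimal sketch of this approach with torchvision's pretrained weights is shown below; it illustrates the idea (using the Euclidean distance here, one of the metrics discussed later) and is not the exact code used in the experiments:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()  # drop the last layer; use the second-to-last output
model.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(pil_images):
    batch = torch.stack([preprocess(img) for img in pil_images])
    return model(batch)  # (N, 2048) embeddings

def classify(support_emb, support_labels, query_emb):
    classes = support_labels.unique()
    # Class centers: mean of all support embeddings of the same class
    centers = torch.stack([support_emb[support_labels == k].mean(0) for k in classes])
    dists = torch.cdist(query_emb, centers)  # pairwise distances (Q, C)
    return classes[dists.argmin(dim=1)]      # smallest distance wins
```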
@ -95,7 +95,7 @@ The class with the smallest distance is chosen as the predicted class.
This method performed better than expected for such a simple approach.
As in @resnet50bottleperfa, with a normal 5-shot / 4-way classification the model achieved an accuracy of 75%.
When only detecting whether an anomaly occurred or not, the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
Interestingly, the model performed slightly better with fewer shots in this case.
Moreover, in @resnet50bottleperfa, the detection of the anomaly class only (3-way) shows a similar pattern to the normal 4-way classification.
The more shots, the better the performance; it peaks at around 88% accuracy with 5 shots.
@ -137,7 +137,7 @@ but this is expected as the cable class consists of 8 faulty classes.
== P>M>F
=== Approach
For P>M>F, I used the pretrained model weights from the original paper.
As the backbone feature extractor, a DINO model is used, which is pretrained by Facebook.
This is a vision transformer with a patch size of 16 and 12 attention heads learned in a self-supervised fashion.
This feature extractor was meta-trained with 10 public image datasets #footnote[ImageNet-1k, Omniglot, FGVC-
Aircraft, CUB-200-2011, Describable Textures, QuickDraw,
@ -145,7 +145,7 @@ FGVCx Fungi, VGG Flower, Traffic Signs and MSCOCO~@pmfpaper]
of diverse domains by the authors of the original paper.~@pmfpaper
Finally, this model is fine-tuned with the support set of every test iteration.
Every time the support set changes, we need to fine-tune the model again.
In a real-world scenario this should not be the case, because the support set is fixed and only the query set changes.
=== Results

View File

@ -5,7 +5,7 @@
Anomaly detection is of essential importance, especially in the industrial and automotive field.
Lots of assembly lines need visual inspection to find errors, often with the help of camera systems.
Machine learning has helped the field advance a lot in the past.
Most of the time, the error rate is below $0.1%$, and therefore plenty of good data and almost no faulty data is available.
So the training data is heavily unbalanced.~#cite(<parnami2022learningexamplessummaryapproaches>)
PatchCore and EfficientAD are state-of-the-art algorithms trained only on good data that then detect anomalies within unseen (but similar) data.
@ -20,7 +20,7 @@ Moreover, few-shot learning might be able not only to detect anomalies but also
=== Is Few-Shot learning a suitable fit for anomaly detection?
_Should Few-Shot learning be used for anomaly detection tasks?
How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
=== How does an imbalanced shot number affect performance?
_Does giving the Few-Shot learner more good than bad samples improve the model performance?_
@ -38,7 +38,7 @@ How does it compare to PatchCore and EfficientAD?_
This thesis is structured to provide a comprehensive exploration of Few-Shot Learning in anomaly detection.
@sectionmaterialandmethods introduces the datasets and methodologies used in this research.
The MVTec AD dataset is discussed in detail as the primary source for benchmarking, along with an overview of the Few-Shot Learning paradigm.
The section elaborates on the three selected methods (ResNet50, P>M>F, and CAML) while also touching upon well-established anomaly detection algorithms such as PatchCore and EfficientAD.
@sectionimplementation focuses on the practical realization of the methods described in the previous chapter.
It outlines the experimental setup, including the use of Jupyter Notebook for prototyping and testing, and provides a detailed account of how each method was implemented and evaluated.

View File

@ -55,7 +55,7 @@ More defect classes are already an indication that a classification task might b
Cut outer insulation
]), <e>,
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
Missing cable defect
]), <e>,
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
Poke insulation defect
@ -142,7 +142,7 @@ $ <cosinesimilarity>
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
It just calculates the square root of the sum of the squared differences of the coordinates.
The Euclidean distance can also be represented as the L2 norm (Euclidean norm) of the difference of the two vectors.
@analysisrudin
$
@ -150,14 +150,14 @@ $
$ <euclideannorm>
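As a small worked example, both formulations give the same value; a sketch in PyTorch:

```python
import torch

a = torch.tensor([1.0, 2.0, 2.0])
b = torch.tensor([1.0, 0.0, 1.0])

# Square root of the sum of squared coordinate differences
d1 = torch.sqrt(((a - b) ** 2).sum())        # sqrt(0 + 4 + 1) ~ 2.236
# Equivalently, the L2 norm of the difference vector
d2 = torch.linalg.vector_norm(a - b, ord=2)
assert torch.isclose(d1, d2)
```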
=== PatchCore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
It operates on the principle that an image is anomalous if any of its patches is anomalous.
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
#todo[Rephrase and simplify paragraph]
The PatchCore framework leverages a pretrained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pretrained on ImageNet.
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
@ -172,13 +172,13 @@ If any patch exhibits a significant deviation, the corresponding image is flagge
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
PatchCore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
A great advantage of this method is the coreset subsampling, which reduces the memory bank size significantly.
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
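The nearest-neighbour scoring described above can be sketched as follows; this is a heavy simplification that omits the coreset construction and re-weighting of the original method:

```python
import torch

def patchcore_image_score(patch_feats: torch.Tensor, memory_bank: torch.Tensor):
    # patch_feats: (P, D) features of all patches of one test image
    # memory_bank: (M, D) coreset-subsampled features from good training patches
    dists = torch.cdist(patch_feats, memory_bank)  # (P, M) pairwise distances
    patch_scores = dists.min(dim=1).values         # distance to nearest normal patch
    # An image is as anomalous as its most anomalous patch
    return patch_scores.max(), patch_scores        # image score + per-patch map
```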
#figure(
image("rsc/patchcore_overview.png", width: 80%),
caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
) <patchcoreoverview>
=== EfficientAD
@ -186,13 +186,13 @@ This lowers computational costs while maintaining detection accuracy.~#cite(<pat
EfficientAD is another state-of-the-art method for anomaly detection.
It focuses on maintaining performance as well as high computational efficiency.
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
In comparison to PatchCore, which relies on a deeper, more computationally heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
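Structurally, such a four-convolution, two-pooling patch descriptor could be sketched like this; the channel widths and kernel sizes here are placeholders, not the exact PDN configuration from the paper:

```python
import torch.nn as nn

# Structural sketch only: four conv layers, two pooling layers
pdn_sketch = nn.Sequential(
    nn.Conv2d(3, 128, kernel_size=4), nn.ReLU(),
    nn.AvgPool2d(2, stride=2),
    nn.Conv2d(128, 256, kernel_size=4), nn.ReLU(),
    nn.AvgPool2d(2, stride=2),
    nn.Conv2d(256, 256, kernel_size=3), nn.ReLU(),
    nn.Conv2d(256, 384, kernel_size=4),  # patch-level feature output
)
```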
#todo[reference to image below]
The detection of anomalies is achieved through a student-teacher framework.
The teacher network is a PDN pretrained on normal (good) images, and the student network is trained to predict the teacher's output.
An anomaly is identified when the student fails to replicate the teacher's output.
This works because of the absence of anomalies in the training data; the student network has never seen an anomaly during training.
A special loss function helps the student network not to generalize too broadly and inadvertently learn to predict anomalous features.~#cite(<efficientADpaper>)
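Conceptually, the student-teacher anomaly score can be sketched as below; this simplifies away the paper's full loss and normalization details:

```python
import torch

@torch.no_grad()
def st_anomaly_map(teacher, student, image):
    # The teacher is pretrained on good images; the student mimics it.
    t_feat = teacher(image)  # (1, C, H, W) patch features
    s_feat = student(image)
    # Large squared error = student fails to replicate the teacher = anomaly
    return ((t_feat - s_feat) ** 2).mean(dim=1)  # per-location anomaly map
```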
@ -200,7 +200,7 @@ Additionally to this structural anomaly detection, EfficientAD can also address
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
This is a challenge that PatchCore does not directly address.~#cite(<efficientADpaper>)
#todo[maybe add key advantages such as low computational cost and high performance]
@ -227,7 +227,7 @@ Convolutional layers capture features like edges, textures or shapes.
Pooling layers downsample the feature maps created by the convolutional layers.
This helps reduce the computational complexity of the overall network and helps to prevent overfitting.
Common pooling layers include average and max pooling.
Finally, after some convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)
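A minimal PyTorch sketch of this typical pipeline (convolutions extract features, pooling downsamples, a fully connected head classifies); all sizes are illustrative:

```python
import torch.nn as nn

binary_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # edges, textures
    nn.MaxPool2d(2),                                         # downsample feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # higher-level shapes
    nn.MaxPool2d(2),
    nn.Flatten(),                                            # flatten final feature map
    nn.Linear(32 * 56 * 56, 2),                              # 224x224 input, 2 classes
)
```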
#figure(
@ -263,11 +263,11 @@ There are well established methods for pretraining which can be used such as DIN
#cite(<pmfpaper>)
*Meta-training:*
The second stage in the pipeline, as shown in @pmfarchitecture, is meta-training.
Here, a prototypical network (ProtoNet) is used to refine the pretrained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
Have a look at @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to a class $k$ is given by the $exp$ of the distance of the sample to the class center $c_k$, divided by the sum of the same expression over all classes:
$
@ -276,7 +276,7 @@ $
As the distance metric $d$, the cosine similarity is used. See @cosinesimilarity for the formula.
$c_k$, the prototype of a class, is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is just the number of samples of class $k$.
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite(<pmfpaper>)
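Sketched in code, the class probability described above (prototypes as class means, cosine similarity as $d$, normalized over all classes, i.e. a softmax; any temperature scaling is omitted) looks roughly like this:

```python
import torch
import torch.nn.functional as F

def protonet_probs(support_emb, support_labels, query_emb):
    classes = support_labels.unique()
    # Prototypes c_k: mean embedding of each class
    protos = torch.stack([support_emb[support_labels == k].mean(0) for k in classes])
    # Cosine similarity between every query embedding and every prototype
    sims = F.cosine_similarity(query_emb.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    # exp(...) normalized by the sum over all classes, i.e. a softmax
    return F.softmax(sims, dim=-1)  # (Q, K) class probabilities
```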
*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the distribution.
@ -293,29 +293,29 @@ During this step, the entire model is fine-tuned to the new domain.~#cite(<pmfpa
*Inference:*
During inference, the support set is used to calculate the class prototypes.
For a query image, the feature extractor extracts its embedding in the lower-dimensional space and compares it to the pre-computed prototypes.
The query image is then assigned to the class with the closest prototype.~#cite(<pmfpaper>)
*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.~#cite(<pmfpaper>)
*Limitations and Scalability:*
This method has some limitations.
It relies on domains with large external datasets and it requires substantial computational resources to create pretrained models.
Fine-tuning is effective but might be slow and not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)
=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
It consists of three different components: a frozen pretrained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
This is a universal meta-learning approach.
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
*Architecture:*
CAML first encodes the query and support set images using the frozen pretrained feature extractor as shown in @camlarchitecture.
This step brings the images into a low-dimensional space where similar images are encoded into similar embeddings.
The class labels are encoded with the ELMES class encoder.
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is added to the encoder.
@ -343,14 +343,14 @@ Afterwards it is passed through a simple MLP network to predict the class of the
~#cite(<caml_paper>)
*Large-Scale Pre-Training:*
CAML is pretrained on a huge number of images from the ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
Those datasets span different domains and help to detect new visual concepts during inference.
Only the non-causal sequence model is trained, and the weights of the image encoder and ELMES encoder are kept frozen.
~#cite(<caml_paper>)
*Inference:*
During inference, CAML processes the following (see the sketch after this list):
- Encodes the support set images and labels with the pretrained feature and class encoders.
- Concatenates these encodings into a sequence alongside the query image embedding.
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
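A purely illustrative sketch of this wiring; all component names, shapes, and the exact concatenation order are assumptions, the real details are in @caml_paper:

```python
import torch

@torch.no_grad()
def caml_predict(image_encoder, class_encoder, unknown_token, seq_model, mlp,
                 support_imgs, support_labels, query_img):
    s_img = image_encoder(support_imgs)        # frozen pretrained image encoder
    s_cls = class_encoder(support_labels)      # fixed ELMES class encoding
    q_img = image_encoder(query_img.unsqueeze(0))
    q_cls = unknown_token.unsqueeze(0)         # learnable "unknown" class token
    # Concatenate image and class encodings, then form one sequence (query last)
    seq = torch.cat([torch.cat([s_img, s_cls], dim=-1),
                     torch.cat([q_img, q_cls], dim=-1)], dim=0)
    out = seq_model(seq.unsqueeze(0))          # non-causal sequence model
    return mlp(out[:, -1])                     # classify transformed query embedding
```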
@ -365,7 +365,7 @@ It performes competitively against P>M>F in 8 benchmarks even though P>M>F was m
CAML performs well in generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
and low-resolution tasks (e.g., CIFAR-fs).
Its use of frozen pretrained feature extractors is key to avoiding overfitting and enabling robust performance.
~#cite(<caml_paper>)
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
@ -383,17 +383,17 @@ Either they performed worse on benchmarks compared to the used methods or they w
// https://arxiv.org/pdf/2211.16191v2
// https://arxiv.org/abs/2211.16191v2
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pretrained vision-language models like CLIP.
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pretrained model.
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
This process is supported by knowledge distillation, where detailed information from the pretrained model guides the learning of the new visual features.
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
The use of pretrained knowledge helps reduce the need for large datasets.
However, a disadvantage is that it depends heavily on the quality and capabilities of the pretrained model.
If the pretrained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pretrained models.
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
@ -414,7 +414,7 @@ Its ability to isolate critical features while droping irellevant context aligns
// https://arxiv.org/pdf/2204.03065v1
// https://arxiv.org/abs/2204.03065v1
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
This transform processes features as a set instead of using them individually.
This creates context-aware representations.
SOT can capture direct as well as indirect similarities between features, which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)