fix lots of typos
All checks were successful
Build Typst document / build_typst_documents (push) Successful in 1m9s
This commit is contained in:
parent
94fe252741
commit
cf6f4f96ac
@ -1,14 +1,14 @@
|
|||||||
= Conclusion and Outlook <sectionconclusionandoutlook>
|
= Conclusion and Outlook <sectionconclusionandoutlook>
|
||||||
== Conclusion
|
== Conclusion
|
||||||
In conclusion, Few-Shot learning is not the best choice for anomaly detection tasks.
|
In conclusion, Few-Shot learning is not the best choice for anomaly detection tasks.
|
||||||
It is hugely outperformed by state of the art algorithms like Patchcore or EfficientAD.
|
It is hugely outperformed by state-of-the-art algorithms like PatchCore or EfficientAD.
|
||||||
The only benefit of Few-Shot learning is that it can be used in environments where only a limited number of good samples are available.
|
The only benefit of Few-Shot learning is that it can be used in environments where only a limited number of good samples are available.
|
||||||
But this should not be the case in most scenarios.
|
But this should not be the case in most scenarios.
|
||||||
Most of the time plenty of good samples are available and in this case Patchcore or EfficientAD should perform great.
|
Most of the time, plenty of good samples are available, and in that case PatchCore or EfficientAD should perform well.
|
||||||
|
|
||||||
The only case where Few-Shot learning could be used is in a scenario where one wants to detect the anomaly class itself.
|
The only case where Few-Shot learning could be used is in a scenario where one wants to detect the anomaly class itself.
|
||||||
Patchcore and EfficientAD can only detect if an anomaly is present or not but not what type of anomaly it actually is.
|
PatchCore and EfficientAD can only detect whether an anomaly is present, but not what type of anomaly it actually is.
|
||||||
So chaining a Few-Shot learner after Patchcore or EfficientAD could be a good idea to use the best of both worlds.
|
So chaining a Few-Shot learner after PatchCore or EfficientAD could be a good idea to combine the best of both worlds.
|
||||||
|
|
||||||
In most of the tests P>M>F performed the best.
|
In most of the tests P>M>F performed the best.
|
||||||
The simple ResNet50 method also performed better than expected in most cases and can be considered if computational resources are limited and a simple architecture is sufficient.
|
The simple ResNet50 method also performed better than expected in most cases and can be considered if computational resources are limited and a simple architecture is sufficient.
|
||||||
@ -19,4 +19,4 @@ There might be a lack of research in the area where the classes to detect are ve
|
|||||||
and when building a few-shot learning algorithm tailored specifically for very similar classes, this could boost the performance by a large margin.
|
and when building a few-shot learning algorithm tailored specifically for very similar classes, this could boost the performance by a large margin.
|
||||||
|
|
||||||
It might be interesting to test the SOT method (see @SOT) with a ResNet50 feature extractor similar to the one proposed in this thesis, but with SOT for embedding comparison.
|
It might be interesting to test the SOT method (see @SOT) with a ResNet50 feature extractor similar to the one proposed in this thesis, but with SOT for embedding comparison.
|
||||||
Moreover, TRIDENT (see @TRIDENT) could achive promising results in a anomaly detection scenario.
|
Moreover, TRIDENT (see @TRIDENT) could achieve promising results in an anomaly detection scenario.
|
||||||
|
@ -5,16 +5,16 @@
|
|||||||
|
|
||||||
== Is Few-Shot learning a suitable fit for anomaly detection? <expresults2way>
|
== Is Few-Shot learning a suitable fit for anomaly detection? <expresults2way>
|
||||||
_Should Few-Shot learning be used for anomaly detection tasks?
|
_Should Few-Shot learning be used for anomaly detection tasks?
|
||||||
How does it compare to well established algorithms such as Patchcore or EfficientAD?_
|
How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
|
||||||
|
|
||||||
@comparison2waybottle shows the performance of the 2-way classification (anomaly or not) on the bottle class and @comparison2waycable the same on the cable class.
|
@comparison2waybottle shows the performance of the 2-way classification (anomaly or not) on the bottle class and @comparison2waycable the same on the cable class.
|
||||||
The performance values are the same as in @experiments but just merged together into one graph.
|
The performance values are the same as in @experiments but just merged together into one graph.
|
||||||
As a reference Patchcore reaches an AUROC score of 99.6% and EfficientAD reaches 99.8% averaged over all classes provided by the MVTec AD dataset.
|
As a reference, PatchCore reaches an AUROC score of 99.6% and EfficientAD reaches 99.8%, averaged over all classes provided by the MVTec AD dataset.
|
||||||
Both are trained with samples from the 'good' class only.
|
Both are trained with samples from the 'good' class only.
|
||||||
So there is a clear performance gap between Few-Shot learning and the state-of-the-art anomaly detection algorithms.
|
So there is a clear performance gap between Few-Shot learning and the state-of-the-art anomaly detection algorithms.
|
||||||
In the @comparison2way Patchcore and EfficientAD are not included as they aren't directly compareable in the same fashion.
|
In @comparison2way, PatchCore and EfficientAD are not included as they aren't directly comparable in the same fashion.
|
||||||
|
|
||||||
That means if the goal is just to detect anomalies, Few-Shot learning is not the best choice, and Patchcore or EfficientAD should be used.
|
That means if the goal is just to detect anomalies, Few-Shot learning is not the best choice, and PatchCore or EfficientAD should be used.
|
||||||
|
|
||||||
#subpar.grid(
|
#subpar.grid(
|
||||||
figure(image("rsc/comparison-2way-bottle.png"), caption: [
|
figure(image("rsc/comparison-2way-bottle.png"), caption: [
|
||||||
|
@ -17,27 +17,27 @@ For all of the three methods we test the following use-cases:
|
|||||||
- Imbalanced 2-way classification (5, 10, 15, 30 good shots, 5 bad shots)
|
- Imbalanced 2-way classification (5, 10, 15, 30 good shots, 5 bad shots)
|
||||||
- Similar to the 2-way classification but with an imbalanced number of good shots.
|
- Similar to the 2-way classification but with an imbalanced number of good shots.
|
||||||
- Imbalanced target class prediction (5, 10, 15, 30 good shots, 5 bad shots)#todo[Avoid bullet points and write flow text?]
|
- Imbalanced target class prediction (5, 10, 15, 30 good shots, 5 bad shots)#todo[Avoid bullet points and write flow text?]
|
||||||
- Detect only the faulty classes without the good classed with an inbalanced number of shots.
|
- Detect only the faulty classes without the good ones, but with an imbalanced number of shots.
|
||||||
|
|
||||||
All those experiments were conducted on the MVTec AD dataset on the bottle and cable classes.
|
All those experiments were conducted on the MVTec AD dataset on the bottle and cable classes.
|
||||||
|
|
||||||
== Experiment Setup
|
== Experiment Setup
|
||||||
All the experiments were done on the bottle and cable classes of the MVTec AD dataset.
|
All the experiments were done on the bottle and cable classes of the MVTec AD dataset.
|
||||||
The correspoinding number of shots were randomly selected from the dataset.
|
The corresponding number of shots were randomly selected from the dataset.
|
||||||
The rest of the images were used to test the model and measure the accuracy.
|
The rest of the images were used to test the model and measure the accuracy.
|
||||||
#todo[Maybe add real number of samples per classes]
|
#todo[Maybe add real number of samples per classes]
|
||||||
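To make this setup concrete, the following is a minimal sketch of how such a random support/query split can be drawn; the directory layout and helper names are assumptions for illustration, not the exact thesis code.

```python
# Minimal sketch (assumed MVTec AD folder layout): randomly pick the support
# shots for one class folder and keep the remaining images as the query set.
import random
from pathlib import Path

def split_shots(class_dir: str, n_shots: int, seed: int = 0):
    """Return (support, query) image paths for one class folder."""
    images = sorted(Path(class_dir).glob("*.png"))
    random.Random(seed).shuffle(images)
    return images[:n_shots], images[n_shots:]

# e.g. 5 good shots and 5 shots of one defect type (paths are placeholders)
good_support, good_query = split_shots("mvtec/bottle/train/good", n_shots=5)
bad_support, bad_query = split_shots("mvtec/bottle/test/broken_large", n_shots=5)
```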
|
|
||||||
== ResNet50 <resnet50impl>
|
== ResNet50 <resnet50impl>
|
||||||
=== Approach
|
=== Approach
|
||||||
The simplest approach is to use a pre-trained ResNet50 model as a feature extractor.
|
The simplest approach is to use a pretrained ResNet50 model as a feature extractor.
|
||||||
From both the support and query sets, features are extracted to get a downprojected representation of the images.
|
From both the support and query sets, features are extracted to get a downprojected representation of the images.
|
||||||
After downprojection, the support set embeddings are compared to the query set embeddings.
|
After downprojection, the support set embeddings are compared to the query set embeddings.
|
||||||
To predict the class of a query, the class with the smallest distance to the support embedding is chosen.
|
To predict the class of a query, the class with the smallest distance to the support embedding is chosen.
|
||||||
If there is more than one support embedding within the same class, the mean of those embeddings is used (class center).
|
If there is more than one support embedding within the same class, the mean of those embeddings is used (class center).
|
||||||
This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just Use a Library of Pre-trained Feature
|
This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just Use a Library of Pre-trained Feature
|
||||||
Extractors and a Simple Classifier_ @chowdhury2021fewshotimageclassificationjust but just with a simple distance metric instead of a neural net.
|
Extractors and a Simple Classifier_ @chowdhury2021fewshotimageclassificationjust but just with a simple distance metric instead of a neural net.
|
||||||
|
|
||||||
In this bachelor thesis a pre-trained ResNet50 (IMAGENET1K_V2) pytorch model was used.
|
In this bachelor thesis, a pretrained ResNet50 (IMAGENET1K_V2) PyTorch model was used.
|
||||||
It is pretrained on the ImageNet dataset and has 50 residual layers.
|
It is pretrained on the ImageNet dataset and has 50 residual layers.
|
||||||
|
|
||||||
To get the embeddings, the last layer of the model was removed and the output of the second-to-last layer was used as the embedding output.
|
To get the embeddings, the last layer of the model was removed and the output of the second-to-last layer was used as the embedding output.
|
||||||
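A minimal sketch of this pipeline is shown below (illustrative PyTorch code; the helper names are assumptions and image preprocessing is omitted, so this is not the exact thesis implementation):

```python
# Sketch: frozen ResNet50 as feature extractor, nearest class center as classifier.
import torch
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()   # drop the classification head, keep 2048-d embeddings
backbone.eval()

@torch.no_grad()
def predict(support_x, support_y, query_x):
    emb_s, emb_q = backbone(support_x), backbone(query_x)   # (S, 2048), (Q, 2048)
    classes = support_y.unique()
    # class centers: mean support embedding per class
    centers = torch.stack([emb_s[support_y == c].mean(dim=0) for c in classes])
    # assign each query to the class with the smallest distance to a center
    dists = torch.cdist(emb_q, centers)                      # (Q, C)
    return classes[dists.argmin(dim=1)]
```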
@ -95,7 +95,7 @@ The class with the smallest distance is chosen as the predicted class.
|
|||||||
This approach performed better than expected for such a simple method.
|
This approach performed better than expected for such a simple method.
|
||||||
As shown in @resnet50bottleperfa, with a normal 5-shot / 4-way classification the model achieved an accuracy of 75%.
|
As shown in @resnet50bottleperfa, with a normal 5-shot / 4-way classification the model achieved an accuracy of 75%.
|
||||||
When only detecting whether an anomaly occurred or not, the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
|
When only detecting whether an anomaly occurred or not, the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
|
||||||
Interestintly the model performed slightly better with fewer shots in this case.
|
Interestingly, the model performed slightly better with fewer shots in this case.
|
||||||
Moreover, in @resnet50bottleperfa the detection of the anomaly class only (3-way) shows a similar pattern to the normal 4-way classification.
|
Moreover, in @resnet50bottleperfa the detection of the anomaly class only (3-way) shows a similar pattern to the normal 4-way classification.
|
||||||
The more shots, the better the performance; it peaks at around 88% accuracy with 5 shots.
|
The more shots, the better the performance; it peaks at around 88% accuracy with 5 shots.
|
||||||
|
|
||||||
@ -137,7 +137,7 @@ but this is expected as the cable class consists of 8 faulty classes.
|
|||||||
== P>M>F
|
== P>M>F
|
||||||
=== Approach
|
=== Approach
|
||||||
For P>M>F, I used the pretrained model weights from the original paper.
|
For P>M>F, I used the pretrained model weights from the original paper.
|
||||||
As backbone feature extractor a DINO model is used, which is pre-trained by facebook.
|
As a backbone feature extractor, a DINO model is used, which is pretrained by Facebook.
|
||||||
This is a vision transformer with a patch size of 16 and 12 attention heads learned in a self-supervised fashion.
|
This is a vision transformer with a patch size of 16 and 12 attention heads learned in a self-supervised fashion.
|
||||||
This feature extractor was meta-trained with 10 public image datasets #footnote[ImageNet-1k, Omniglot, FGVC-
|
This feature extractor was meta-trained with 10 public image datasets #footnote[ImageNet-1k, Omniglot, FGVC-
|
||||||
Aircraft, CUB-200-2011, Describable Textures, QuickDraw,
|
Aircraft, CUB-200-2011, Describable Textures, QuickDraw,
|
||||||
@ -145,7 +145,7 @@ FGVCx Fungi, VGG Flower, Traffic Signs and MSCOCO~@pmfpaper]
|
|||||||
of diverse domains by the authors of the original paper.~@pmfpaper
|
of diverse domains by the authors of the original paper.~@pmfpaper
|
||||||
|
|
||||||
Finally, this model is fine-tuned with the support set of every test iteration.
|
Finally, this model is fine-tuned with the support set of every test iteration.
|
||||||
Every time the support set changes, we need to finetune the model again.
|
Every time the support set changes, we need to fine-tune the model again.
|
||||||
In a real-world scenario this should not be the case, because the support set is fixed and only the query set changes.
|
In a real-world scenario this should not be the case, because the support set is fixed and only the query set changes.
|
||||||
|
|
||||||
=== Results
|
=== Results
|
||||||
|
@ -5,7 +5,7 @@
|
|||||||
Anomaly detection is of essential importance, especially in the industrial and automotive field.
|
Anomaly detection is of essential importance, especially in the industrial and automotive field.
|
||||||
Lots of assembly lines need visual inspection to find errors, often with the help of camera systems.
|
Lots of assembly lines need visual inspection to find errors, often with the help of camera systems.
|
||||||
Machine learning helped the field to advance a lot in the past.
|
Machine learning helped the field to advance a lot in the past.
|
||||||
Most of the time the error rate is sub $.1%$ and therefore plenty of good data and almost no faulty data is available.
|
Most of the time the error rate is below $0.1%$, and therefore plenty of good data and almost no faulty data is available.
|
||||||
So the training data is heavily unbalanced.~#cite(<parnami2022learningexamplessummaryapproaches>)
|
So the training data is heavily unbalanced.~#cite(<parnami2022learningexamplessummaryapproaches>)
|
||||||
|
|
||||||
PatchCore and EfficientAD are state-of-the-art algorithms that are trained only on good data and then detect anomalies within unseen (but similar) data.
|
PatchCore and EfficientAD are state-of-the-art algorithms that are trained only on good data and then detect anomalies within unseen (but similar) data.
|
||||||
@ -20,7 +20,7 @@ Moreover, few-shot learning might be able not only to detect anomalies but also
|
|||||||
|
|
||||||
=== Is Few-Shot learning a suitable fit for anomaly detection?
|
=== Is Few-Shot learning a suitable fit for anomaly detection?
|
||||||
_Should Few-Shot learning be used for anomaly detection tasks?
|
_Should Few-Shot learning be used for anomaly detection tasks?
|
||||||
How does it compare to well established algorithms such as Patchcore or EfficientAD?_
|
How does it compare to well-established algorithms such as PatchCore or EfficientAD?_
|
||||||
|
|
||||||
=== How does imbalancing the shot number affect performance?
|
=== How does imbalancing the shot number affect performance?
|
||||||
_Does giving the Few-Shot learner more good than bad samples improve the model performance?_
|
_Does giving the Few-Shot learner more good than bad samples improve the model performance?_
|
||||||
@ -38,7 +38,7 @@ How does it compare to PatchCore and EfficientAD?_
|
|||||||
This thesis is structured to provide a comprehensive exploration of Few-Shot Learning in anomaly detection.
|
This thesis is structured to provide a comprehensive exploration of Few-Shot Learning in anomaly detection.
|
||||||
@sectionmaterialandmethods introduces the datasets and methodologies used in this research.
|
@sectionmaterialandmethods introduces the datasets and methodologies used in this research.
|
||||||
The MVTec AD dataset is discussed in detail as the primary source for benchmarking, along with an overview of the Few-Shot Learning paradigm.
|
The MVTec AD dataset is discussed in detail as the primary source for benchmarking, along with an overview of the Few-Shot Learning paradigm.
|
||||||
The section elaborates on the three selected methods—ResNet50, P>M>F, and CAML—while also touching upon well established anomaly detection algorithms such as Pachcore and EfficientAD.
|
The section elaborates on the three selected methods—ResNet50, P>M>F, and CAML—while also touching upon well-established anomaly detection algorithms such as PatchCore and EfficientAD.
|
||||||
|
|
||||||
@sectionimplementation focuses on the practical realization of the methods described in the previous chapter.
|
@sectionimplementation focuses on the practical realization of the methods described in the previous chapter.
|
||||||
It outlines the experimental setup, including the use of Jupyter Notebook for prototyping and testing, and provides a detailed account of how each method was implemented and evaluated.
|
It outlines the experimental setup, including the use of Jupyter Notebook for prototyping and testing, and provides a detailed account of how each method was implemented and evaluated.
|
||||||
|
@ -55,7 +55,7 @@ More defect classes are already an indication that a classification task might b
|
|||||||
Cut outer insulation
|
Cut outer insulation
|
||||||
]), <e>,
|
]), <e>,
|
||||||
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
|
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
|
||||||
Mising cable defect
|
Missing cable defect
|
||||||
]), <e>,
|
]), <e>,
|
||||||
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
|
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
|
||||||
Poke insulation defect
|
Poke insulation defect
|
||||||
@ -142,7 +142,7 @@ $ <cosinesimilarity>
|
|||||||
=== Euclidean Distance
|
=== Euclidean Distance
|
||||||
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
|
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
|
||||||
It just calculates the square root of the sum of the squared differences of the coordinates.
|
It just calculates the square root of the sum of the squared differences of the coordinates.
|
||||||
the euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
|
The Euclidean distance can also be represented as the L2 norm (Euclidean norm) of the difference of the two vectors.
|
||||||
@analysisrudin
|
@analysisrudin
|
||||||
|
|
||||||
$
|
$
|
||||||
@ -150,14 +150,14 @@ $
|
|||||||
$ <euclideannorm>
|
$ <euclideannorm>
|
||||||
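For reference, both measures are a one-liner on embedding vectors; the sketch below (assuming 1-D PyTorch tensors) mirrors @cosinesimilarity and @euclideannorm.

```python
# Sketch: the two measures used to compare support and query embeddings.
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # <a, b> / (||a|| * ||b||)
    return torch.dot(a, b) / (a.norm() * b.norm())

def euclidean_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # L2 norm of the difference vector
    return (a - b).norm(p=2)
```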
|
|
||||||
|
|
||||||
=== Patchcore
|
=== PatchCore
|
||||||
// https://arxiv.org/pdf/2106.08265
|
// https://arxiv.org/pdf/2106.08265
|
||||||
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
|
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
|
||||||
It operates on the principle that an image is anomalous if any of its patches is anomalous.
|
It operates on the principle that an image is anomalous if any of its patches is anomalous.
|
||||||
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
|
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
|
||||||
#todo[Rephrase and simplify this paragraph]
|
#todo[Rephrase and simplify this paragraph]
|
||||||
|
|
||||||
The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
|
The PatchCore framework leverages a pretrained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
|
||||||
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pretrained on ImageNet.
|
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pretrained on ImageNet.
|
||||||
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
|
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
|
||||||
|
|
||||||
@ -172,13 +172,13 @@ If any patch exhibits a significant deviation, the corresponding image is flagge
|
|||||||
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
|
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
|
||||||
|
|
||||||
|
|
||||||
Patchcore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
|
PatchCore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
|
||||||
A great advantage of this method is the coreset subsampling, which reduces the memory bank size significantly.
|
A great advantage of this method is the coreset subsampling, which reduces the memory bank size significantly.
|
||||||
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
|
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
|
||||||
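The patch-level scoring principle can be illustrated with a short sketch; this is a simplification under assumed tensor shapes and omits the coreset subsampling and score re-weighting of the actual method.

```python
# Simplified sketch of PatchCore-style scoring (not the reference implementation):
# an image is anomalous if any of its patch features lies far from the memory bank.
import torch

def image_anomaly_score(patch_features: torch.Tensor,  # (P, D) patches of one test image
                        memory_bank: torch.Tensor      # (M, D) features of good patches
                        ) -> torch.Tensor:
    # distance of every patch to its nearest neighbour in the memory bank
    nn_dist = torch.cdist(patch_features, memory_bank).min(dim=1).values  # (P,)
    # patch scores can be upsampled into a segmentation map;
    # the image-level score is the maximum patch score
    return nn_dist.max()
```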
|
|
||||||
#figure(
|
#figure(
|
||||||
image("rsc/patchcore_overview.png", width: 80%),
|
image("rsc/patchcore_overview.png", width: 80%),
|
||||||
caption: [Architecture of Patchcore. #cite(<patchcorepaper>)],
|
caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
|
||||||
) <patchcoreoverview>
|
) <patchcoreoverview>
|
||||||
|
|
||||||
=== EfficientAD
|
=== EfficientAD
|
||||||
@ -186,13 +186,13 @@ This lowers computational costs while maintaining detection accuracy.~#cite(<pat
|
|||||||
EfficientAD is another state-of-the-art method for anomaly detection.
|
EfficientAD is another state-of-the-art method for anomaly detection.
|
||||||
It focuses on maintaining performance as well as high computational efficiency.
|
It focuses on maintaining performance as well as high computational efficiency.
|
||||||
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
|
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
|
||||||
In comparison to Patchcore, which relies on a deeper, more computationaly heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
|
In comparison to PatchCore, which relies on a deeper, more computationally heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
|
||||||
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
|
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
|
||||||
#todo[reference to image below]
|
#todo[reference to image below]
|
||||||
|
|
||||||
The detection of anomalies is achieved through a student-teacher framework.
|
The detection of anomalies is achieved through a student-teacher framework.
|
||||||
The teacher network is a PDN and pre-trained on normal (good) images and the student network is trained to predict the teachers output.
|
The teacher network is a PDN pretrained on normal (good) images, and the student network is trained to predict the teacher's output.
|
||||||
An anomalie is identified when the student failes to replicate the teachers output.
|
An anomaly is identified when the student fails to replicate the teacher's output.
|
||||||
This works because of the absence of anomalies in the training data; the student network has never seen an anomaly during training.
|
This works because of the absence of anomalies in the training data; the student network has never seen an anomaly during training.
|
||||||
A special loss function keeps the student network from generalizing too broadly and thereby learning to predict anomalous features as well.~#cite(<efficientADpaper>)
|
A special loss function keeps the student network from generalizing too broadly and thereby learning to predict anomalous features as well.~#cite(<efficientADpaper>)
|
||||||
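Conceptually, the scoring boils down to the disagreement between the two networks, as in the following sketch (assumed module names and shapes, not the original code):

```python
# Conceptual sketch: student-teacher anomaly scoring.
import torch

def anomaly_map(teacher: torch.nn.Module, student: torch.nn.Module,
                image: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        t_feat = teacher(image)   # (1, C, H, W) patch descriptors of the pretrained PDN
        s_feat = student(image)   # (1, C, H, W) student prediction of the teacher output
    # per-location squared error, averaged over channels; high values mark anomalies
    return ((t_feat - s_feat) ** 2).mean(dim=1)   # (1, H, W)
```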
|
|
||||||
@ -200,7 +200,7 @@ Additionally to this structural anomaly detection, EfficientAD can also address
|
|||||||
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
|
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
|
||||||
|
|
||||||
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
|
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
|
||||||
This is a challenge that Patchcore does not directly address.~#cite(<efficientADpaper>)
|
This is a challenge that PatchCore does not directly address.~#cite(<efficientADpaper>)
|
||||||
#todo[maybe add key advantages such as low computational cost and high performance]
|
#todo[maybe add key advantages such as low computational cost and high performance]
|
||||||
|
|
||||||
|
|
||||||
@ -227,7 +227,7 @@ Convolutional layers capture features like edges, textures or shapes.
|
|||||||
Pooling layers sample down the feature maps created by the convolutional layers.
|
Pooling layers sample down the feature maps created by the convolutional layers.
|
||||||
This helps to reduce the computational complexity of the overall network and helps against overfitting.
|
This helps to reduce the computational complexity of the overall network and helps against overfitting.
|
||||||
Common pooling layers include average- and max pooling.
|
Common pooling layers include average- and max pooling.
|
||||||
Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
|
Finally, after some convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
|
||||||
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)
|
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)
|
||||||
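As a small illustration of this layer ordering, a toy binary classifier could look as follows (an example for this section only, not one of the networks used in this thesis):

```python
# Toy CNN: convolution -> pooling -> convolution -> pooling -> flatten -> MLP head.
import torch.nn as nn

toy_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # edges, textures
    nn.MaxPool2d(2),                                         # downsample feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # higher-level shapes
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 64), nn.ReLU(),                  # fully connected layers
    nn.Linear(64, 2),                                        # two output classes
)
# expects 3x224x224 inputs: 224 -> 112 -> 56 after the two pooling layers
```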
|
|
||||||
#figure(
|
#figure(
|
||||||
@ -263,11 +263,11 @@ There are well established methods for pretraining which can be used such as DIN
|
|||||||
#cite(<pmfpaper>)
|
#cite(<pmfpaper>)
|
||||||
|
|
||||||
*Meta-training:*
|
*Meta-training:*
|
||||||
The second stage in the pipline as in @pmfarchitecture is the meta-training.
|
The second stage in the pipeline as in @pmfarchitecture is the meta-training.
|
||||||
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
|
Here a prototypical network (ProtoNet) is used to refine the pretrained backbone.
|
||||||
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
|
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
|
||||||
See @prototypefewshot for a visualisation of its architecture.
|
See @prototypefewshot for a visualisation of its architecture.
|
||||||
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
|
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
|
||||||
The probability of a query image $x$ belonging to a class $k$ is given by the $exp$ of the negative distance of the sample to the class center, divided by the sum of these values over all classes:
|
The probability of a query image $x$ belonging to a class $k$ is given by the $exp$ of the negative distance of the sample to the class center, divided by the sum of these values over all classes:
|
||||||
|
|
||||||
$
|
$
|
||||||
@ -276,7 +276,7 @@ $
|
|||||||
|
|
||||||
As a distance metric $d$ a cosine similarity is used. See @cosinesimilarity for the formula.
|
As a distance metric $d$ a cosine similarity is used. See @cosinesimilarity for the formula.
|
||||||
$c_k$, the prototype of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$ and $N_k$ is just the number of samples of class $k$.
|
$c_k$, the prototype of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$ and $N_k$ is just the number of samples of class $k$.
|
||||||
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
|
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite(<pmfpaper>)
|
||||||
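Written as code, the classification rule looks roughly as follows (a minimal sketch with assumed tensor shapes, using the cosine distance as in P>M>F):

```python
# Sketch: ProtoNet-style class probabilities from a softmax over negative distances.
import torch
import torch.nn.functional as F

def protonet_probs(query_emb, support_emb, support_y, num_classes):
    # prototypes c_k: mean embedding of each class
    protos = torch.stack([support_emb[support_y == k].mean(0) for k in range(num_classes)])
    # cosine distance d = 1 - cosine similarity, computed query-vs-prototype
    d = 1 - F.cosine_similarity(query_emb.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    # p(y = k | x) = exp(-d_k) / sum_k' exp(-d_k')
    return F.softmax(-d, dim=-1)    # (num_queries, num_classes)
```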
|
|
||||||
*Fine-tuning:*
|
*Fine-tuning:*
|
||||||
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the distribution.
|
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the distribution.
|
||||||
@ -293,29 +293,29 @@ During this step, the entire model is fine-tuned to the new domain.~#cite(<pmfpa
|
|||||||
*Inference:*
|
*Inference:*
|
||||||
During inference the support set is used to calculate the class prototypes.
|
During inference the support set is used to calculate the class prototypes.
|
||||||
For a query image, the feature extractor computes its embedding in the lower-dimensional space, which is then compared to the pre-computed prototypes.
|
For a query image, the feature extractor computes its embedding in the lower-dimensional space, which is then compared to the pre-computed prototypes.
|
||||||
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
|
The query image is then assigned to the class with the closest prototype.~#cite(<pmfpaper>)
|
||||||
|
|
||||||
*Performance:*
|
*Performance:*
|
||||||
P>M>F performs well across several few-shot learning benchmarks.
|
P>M>F performs well across several few-shot learning benchmarks.
|
||||||
The combination of pre-training on large dataset and meta-trainng with episodic tasks helps the model to generalize well.
|
The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
|
||||||
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
|
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.~#cite(<pmfpaper>)
|
||||||
|
|
||||||
*Limitations and Scalability:*
|
*Limitations and Scalability:*
|
||||||
This method has some limitations.
|
This method has some limitations.
|
||||||
It relies on domains with large external datasets and it requires substantial computational resources to create pre-trained models.
|
It relies on domains with large external datasets and it requires substantial computational resources to create pretrained models.
|
||||||
Fine-tuning is effective but might be slow and not work well on devices with limited computationsl resources.
|
Fine-tuning is effective but might be slow and not work well on devices with limited computational resources.
|
||||||
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
|
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
|
||||||
#cite(<pmfpaper>)
|
#cite(<pmfpaper>)
|
||||||
|
|
||||||
=== CAML <CAML>
|
=== CAML <CAML>
|
||||||
// https://arxiv.org/pdf/2310.10971v2
|
// https://arxiv.org/pdf/2310.10971v2
|
||||||
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
|
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
|
||||||
It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
|
It consists of three different components: a frozen pretrained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
|
||||||
This is a universal meta-learning approach.
|
This is a universal meta-learning approach.
|
||||||
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
|
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
|
||||||
|
|
||||||
*Architecture:*
|
*Architecture:*
|
||||||
CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
|
CAML first encodes the query and support set images using the frozen pretrained feature extractor as shown in @camlarchitecture.
|
||||||
This step brings the images into a low-dimensional space where similar images are encoded into similar embeddings.
|
This step brings the images into a low-dimensional space where similar images are encoded into similar embeddings.
|
||||||
The class labels are encoded with the ELMES class encoder.
|
The class labels are encoded with the ELMES class encoder.
|
||||||
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is added to the encoder.
|
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is added to the encoder.
|
||||||
@ -343,14 +343,14 @@ Afterwards it is passed through a simple MLP network to predict the class of the
|
|||||||
~#cite(<caml_paper>)
|
~#cite(<caml_paper>)
|
||||||
|
|
||||||
*Large-Scale Pre-Training:*
|
*Large-Scale Pre-Training:*
|
||||||
CAML is pre-trained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
|
CAML is pretrained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
|
||||||
Those datasets span different domains and help to detect any new visual concept during inference.
|
Those datasets span different domains and help to detect any new visual concept during inference.
|
||||||
Only the non-causal sequence model is trained and the weights of the image encoder and ELMES encoder are kept frozen.
|
Only the non-causal sequence model is trained and the weights of the image encoder and ELMES encoder are kept frozen.
|
||||||
~#cite(<caml_paper>)
|
~#cite(<caml_paper>)
|
||||||
|
|
||||||
*Inference:*
|
*Inference:*
|
||||||
During inference, CAML processes the following (a simplified sketch is given after this list):
|
During inference, CAML processes the following (a simplified sketch is given after this list):
|
||||||
- Encodes the support set images and labels with the pre-trained feature and class encoders.
|
- Encodes the support set images and labels with the pretrained feature and class encoders.
|
||||||
- Concatenates these encodings into a sequence alongside the query image embedding.
|
- Concatenates these encodings into a sequence alongside the query image embedding.
|
||||||
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
|
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
|
||||||
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
|
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
|
||||||
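A heavily simplified sketch of this inference path is shown below; the module and parameter names are assumptions for illustration, and the real model uses an ELMES label encoder and a transformer-based non-causal sequence model.

```python
# Simplified sketch of CAML-style inference (not the original implementation).
import torch

def caml_predict(image_encoder, class_encoder, seq_model, mlp_head,
                 support_x, support_y, query_x, unknown_token):
    with torch.no_grad():                                # image encoder stays frozen
        s_img = image_encoder(support_x)                 # (S, D) support embeddings
        q_img = image_encoder(query_x)                   # (1, D) query embedding
    s_lbl = class_encoder(support_y)                     # (S, L) label encodings
    s_tok = torch.cat([s_img, s_lbl], dim=-1)            # joint image+label tokens
    q_tok = torch.cat([q_img, unknown_token], dim=-1)    # query gets the "unknown" label token
    seq = torch.cat([s_tok, q_tok], dim=0).unsqueeze(0)  # (1, S+1, D+L) sequence
    out = seq_model(seq)                                 # non-causal: every token attends to all
    return mlp_head(out[0, -1])                          # classify the transformed query token
```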
@ -365,7 +365,7 @@ It performes competitively against P>M>F in 8 benchmarks even though P>M>F was m
|
|||||||
|
|
||||||
CAML performs well in terms of generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
|
CAML performs well in terms of generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
|
||||||
and low-resolution tasks (e.g., CIFAR-fs).
|
and low-resolution tasks (e.g., CIFAR-fs).
|
||||||
Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
|
Its use of frozen pretrained feature extractors is key to avoiding overfitting and enabling robust performance.
|
||||||
~#cite(<caml_paper>)
|
~#cite(<caml_paper>)
|
||||||
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
|
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
|
||||||
|
|
||||||
@ -383,17 +383,17 @@ Either they performed worse on benchmarks compared to the used methods or they w
|
|||||||
// https://arxiv.org/pdf/2211.16191v2
|
// https://arxiv.org/pdf/2211.16191v2
|
||||||
// https://arxiv.org/abs/2211.16191v2
|
// https://arxiv.org/abs/2211.16191v2
|
||||||
|
|
||||||
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
|
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pretrained vision-language models like CLIP.
|
||||||
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
|
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pretrained model.
|
||||||
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
|
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
|
||||||
This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
|
This process is supported by knowledge distillation, where detailed information from the pretrained model guides the learning of the new visual features.
|
||||||
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
||||||
|
|
||||||
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
|
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
|
||||||
The use of pre-trained knowledge helps reduce the need for large datasets.
|
The use of pretrained knowledge helps reduce the need for large datasets.
|
||||||
However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
|
However, a disadvantage is that it depends heavily on the quality and capabilities of the pretrained model.
|
||||||
If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
|
If the pretrained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
|
||||||
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
|
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pretrained models.
|
||||||
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
||||||
|
|
||||||
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
|
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
|
||||||
@ -414,7 +414,7 @@ Its ability to isolate critical features while droping irellevant context aligns
|
|||||||
// https://arxiv.org/pdf/2204.03065v1
|
// https://arxiv.org/pdf/2204.03065v1
|
||||||
// https://arxiv.org/abs/2204.03065v1
|
// https://arxiv.org/abs/2204.03065v1
|
||||||
|
|
||||||
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tsks like matching, grouping or classification by re-embedding feature representations.
|
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
|
||||||
This transform processes features as a set instead of using them individually.
|
This transform processes features as a set instead of using them individually.
|
||||||
This creates context-aware representations.
|
This creates context-aware representations.
|
||||||
SOT can catch direct as well as indirect similarities between features which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
|
SOT can catch direct as well as indirect similarities between features which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
|
||||||