add several sources and fix some errors in text

2025-01-20 11:18:32 +01:00
parent 8f28a8c387
commit 8a4b33e67a
5 changed files with 50 additions and 17 deletions
--- a/implementation.typ
+++ b/implementation.typ
@ -31,8 +31,8 @@ The rest of the images was used to test the model and measure the accuracy.
 === Approach
 The simplest approach is to use a pre-trained ResNet50 model as a feature extractor.
 From both the support and query set the features are extracted to get a downprojected representation of the images.
-The support set embeddings are compared to the query set embeddings.
-To predict the class of a query the class with the smallest distance to the support embedding is chosen.
+After downprojection the support set embeddings are compared to the query set embeddings.
+To predict the class of a query, the class with the smallest distance to the support embedding is chosen.
 If there is more than one support embedding within the same class, the mean of those embeddings is used (class center).
 This approach is similar to a prototypical network @snell2017prototypicalnetworksfewshotlearning and the work of _Just Use a Library of Pre-trained Feature
 Extractors and a Simple Classifier_ @chowdhury2021fewshotimageclassificationjust but just with a simple distance metric instead of a neural net.
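The class-center classification described in this hunk can be sketched as follows. This is a minimal illustration, not the code from the repository: the function names `class_centers` and `predict` are hypothetical, the embeddings are assumed to be already extracted (e.g. by the pre-trained ResNet50) and stored as NumPy arrays, and the Euclidean distance is an assumption, since the text only speaks of the "smallest distance".

```python
import numpy as np


def class_centers(support_emb, support_labels):
    """Average the support embeddings of each class into one class center.

    support_emb:    array of shape (n_support, dim)
    support_labels: array of shape (n_support,)
    """
    classes = np.unique(support_labels)
    centers = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, centers


def predict(query_emb, classes, centers):
    """Assign each query embedding to the class with the nearest center."""
    # Pairwise Euclidean distances, shape (n_query, n_classes).
    # Assumption: the text does not name the distance metric.
    diffs = query_emb[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return classes[np.argmin(dists, axis=1)]


# Toy data standing in for extracted features (hypothetical values).
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
queries = np.array([[1.0, 1.0], [9.0, 9.0]])

classes, centers = class_centers(support, labels)
predictions = predict(queries, classes, centers)
```

With a single support embedding per class the class center is just that embedding, so the same code covers the 1-shot case.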
@ -94,13 +94,13 @@ The class with the smallest distance is chosen as the predicted class.
 === Results <resnet50perf>
 This method performed better than expected with such a simple method.
 As in @resnet50bottleperfa with a normal 5 shot / 4 way classification the model achieved an accuracy of 75%.
-When detecting only if there occured an anomaly or not the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
+When only detecting whether an anomaly occurred or not, the performance is significantly better and peaks at 81% with 5 shots / 2 ways.
 Interestingly, the model performed slightly better with fewer shots in this case.
 Moreover in @resnet50bottleperfa, the detection of the anomaly class only (3 way) shows a similar pattern as the normal 4 way classification.
 The more shots the better the performance and it peaks at around 88% accuracy with 5 shots.

 In @resnet50bottleperfb the model was tested with imbalanced class distributions.
-With [5,10,15,30] good shots and 5 bad shots the model performed worse than with balanced classes.
+With {5, 10, 15, 30} good shots and 5 bad shots the model performed worse than with balanced classes.
 The more good shots the worse the performance.
 The only exception is the faulty or not detection (2 way) where the model peaked at 15 good shots with 83% accuracy.

@ -136,13 +136,13 @@ but this is expected as the cable class consists of 8 faulty classes.

 == P>M>F
 === Approach
-For P>M>F the pretrained model weights from the original paper were used.
+For P>M>F I used the pretrained model weights from the original paper.
 As the backbone feature extractor a DINO model is used, which was pre-trained by Facebook.
 This is a vision transformer with a patch size of 16 and 12 attention heads learned in a self-supervised fashion.
 This feature extractor was meta-trained with 10 public image datasets #footnote[ImageNet-1k, Omniglot, FGVC-Aircraft,
 CUB-200-2011, Describable Textures, QuickDraw,
-FGVCx Fungi, VGG Flower, Traffic Signs and MSCOCO~#cite(<pmfpaper>)]
- of diverse domains  by the authors of the original paper.#cite(<pmfpaper>)
+FGVCx Fungi, VGG Flower, Traffic Signs and MSCOCO~@pmfpaper]
+ of diverse domains  by the authors of the original paper.~@pmfpaper

 Finally, this model is finetuned with the support set of every test iteration.
 Every time the support set changes we need to finetune the model again.
@ -182,7 +182,7 @@ So it is clearly a bad idea to add more good shots to the support set.

 == CAML
 === Approach
-For the CAML implementation the pretrained model weights from the original paper were used.
+For the CAML implementation I used the pretrained model weights from the original paper.
 The non-causal sequence model (transformer) is pretrained with every class having the same number of shots.
 This brings the limitation that it can only process default few-shot learning tasks in the n-way k-shot fashion,
 since it expects the input sequence to be distributed with the same number of shots per class.
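The constraint of equal shots per class can be checked before an episode is passed to such a model. The helpers below are a hypothetical sketch and not part of the CAML code; the names `shots_per_class` and `is_balanced` and the label values are assumptions made for illustration.

```python
from collections import Counter


def shots_per_class(support_labels):
    """Count how many support samples (shots) each class has."""
    return dict(Counter(support_labels))


def is_balanced(support_labels):
    """Return True if every class has the same number of shots (n-way k-shot)."""
    counts = set(Counter(support_labels).values())
    return len(counts) == 1


# A 2-way 5-shot episode is accepted.
balanced = ["good"] * 5 + ["bad"] * 5
# An imbalanced episode, e.g. 30 good shots and 5 bad shots, is rejected.
imbalanced = ["good"] * 30 + ["bad"] * 5

balanced_ok = is_balanced(balanced)
imbalanced_ok = is_balanced(imbalanced)
```

A check like this makes explicit why the imbalanced test cases have to be skipped for this method rather than silently truncated or padded.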
@ -190,7 +190,7 @@ This is the reason why for this method the two imbalanced test cases couldn't be

 As a feature extractor a ViT-B/16 model was used, which is a Vision Transformer with a patch size of 16.
 This feature extractor was already pretrained when used by the authors of the original paper.
-For the non-causal sequence model a transformer model was used
+In this case for the non-causal sequence model a transformer model was used.
 It consists of 24 layers with 16 attention heads, a hidden dimension of 1024 and an output MLP size of 4096.
 This transformer was trained on a huge number of images as described in @CAML.

@ -198,7 +198,8 @@ This transformer was trained on a huge number of images as described in @CAML.
 The results were not as good as expected.
 This might be caused by the fact that the model was not fine-tuned for any industrial dataset domain.
 The model was trained on a large number of general purpose images and is not fine-tuned at all.
-It might not handle very similar images well.
+Moreover, it was not fine-tuned on the support set similar to the P>M>F method, which could have a huge impact on performance.
+It might also not handle very similar images well.

 Compared to the other two methods, CAML performed poorly in almost all experiments.
 The normal few-shot classification reached only 40% accuracy in @camlperfa at best.