fix lots of typos
All checks were successful
Build Typst document / build_typst_documents (push) Successful in 1m9s
@@ -55,7 +55,7 @@ More defect classes are already an indication that a classification task might b
Cut outer insulation
]), <e>,
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
- Mising cable defect
+ Missing cable defect
]), <e>,
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
Poke insulation defect
@@ -142,7 +142,7 @@ $ <cosinesimilarity>
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
It calculates the square root of the sum of the squared differences of the coordinates.
- the euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
+ The Euclidean distance can also be represented as the L2 norm (Euclidean norm) of the difference of the two vectors.
@analysisrudin

$
@@ -150,14 +150,14 @@ $
$ <euclideannorm>

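To make the two measures concrete, here is a minimal NumPy sketch (an illustrative aside; the function names are chosen freely) that computes the cosine similarity from @cosinesimilarity and the Euclidean distance as the L2 norm of the difference vector:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Euclidean distance: the L2 norm of the difference vector,
    # i.e. the square root of the sum of squared coordinate differences.
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])
print(cosine_similarity(a, b), euclidean_distance(a, b))
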
- === Patchcore
+ === PatchCore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
It operates on the principle that an image is anomalous if any of its patches is anomalous.
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
#todo[Rephrase and simplify this paragraph]

- The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
+ The PatchCore framework leverages a pretrained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pre-trained on ImageNet.
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)

@@ -172,13 +172,13 @@ If any patch exhibits a significant deviation, the corresponding image is flagge
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)

- Patchcore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
+ PatchCore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
A great advantage of this method is the coreset subsampling, which significantly reduces the memory bank size.
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)

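The memory-bank idea and the coreset subsampling can be sketched in a few lines of NumPy. This is a simplified illustration of the principle, not the reference implementation; a greedy k-center selection stands in for the paper's coreset subsampling, and all sizes are placeholders:

import numpy as np

def greedy_coreset(features: np.ndarray, n_select: int) -> np.ndarray:
    # Greedy k-center selection: repeatedly pick the feature that is farthest
    # from the already selected coreset.
    selected = [0]
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(n_select - 1):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return features[selected]

def patch_scores(test_patches: np.ndarray, memory_bank: np.ndarray) -> np.ndarray:
    # Anomaly score of a patch = distance to its nearest neighbour in the memory bank.
    d = np.linalg.norm(test_patches[:, None, :] - memory_bank[None, :, :], axis=-1)
    return d.min(axis=1)

# Toy example: 1000 "normal" patch features, subsampled to a coreset of 100.
rng = np.random.default_rng(0)
normal_patches = rng.normal(size=(1000, 64))
memory_bank = greedy_coreset(normal_patches, n_select=100)

test_patches = rng.normal(size=(196, 64))         # patches of one test image
scores = patch_scores(test_patches, memory_bank)  # per-patch anomaly scores
image_score = scores.max()                        # an image is as anomalous as its worst patch
print(image_score)

In PatchCore itself, these per-patch scores are then spatially aligned and upsampled to obtain the pixel-level anomaly map described above.
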
#figure(
image("rsc/patchcore_overview.png", width: 80%),
- caption: [Architecture of Patchcore. #cite(<patchcorepaper>)],
+ caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
) <patchcoreoverview>

=== EfficientAD
@@ -186,13 +186,13 @@ This lowers computational costs while maintaining detection accuracy.~#cite(<pat

EfficientAD is another state-of-the-art method for anomaly detection.
It focuses on maintaining performance as well as high computational efficiency.
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
- In comparison to Patchcore, which relies on a deeper, more computationaly heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
+ In comparison to PatchCore, which relies on a deeper, more computationally heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
#todo[reference to image below]

The detection of anomalies is achieved through a student-teacher framework.
- The teacher network is a PDN and pre-trained on normal (good) images and the student network is trained to predict the teachers output.
- An anomalie is identified when the student failes to replicate the teachers output.
+ The teacher network is a PDN pretrained on normal (good) images, and the student network is trained to predict the teacher's output.
+ An anomaly is identified when the student fails to replicate the teacher's output.
This works because of the absence of anomalies in the training data: the student network has never seen an anomaly during training.
A special loss function keeps the student network from generalizing too broadly and learning to predict anomalous features as well.~#cite(<efficientADpaper>)

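The following PyTorch sketch illustrates the two ideas described above: a small PDN-style extractor with four convolutional and two pooling layers, and the student-teacher discrepancy as the anomaly signal. Layer and kernel sizes are illustrative guesses, not the exact configuration from the paper:

import torch
import torch.nn as nn

class PDN(nn.Module):
    # Patch description network in the spirit of EfficientAD:
    # only four convolutional layers and two pooling layers.
    def __init__(self, out_channels: int = 384):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=4), nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(128, 256, kernel_size=4), nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv2d(256, out_channels, kernel_size=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

teacher = PDN().eval()   # assumed to be pretrained on normal images
student = PDN()          # trained to predict the teacher's output

images = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    t = teacher(images)
s = student(images)

# Training objective: make the student match the teacher on normal data.
loss = ((t - s) ** 2).mean()

# At test time, the per-location squared difference serves as a structural anomaly map.
anomaly_map = ((t - s) ** 2).mean(dim=1)
print(loss.item(), anomaly_map.shape)
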
@@ -200,7 +200,7 @@ Additionally to this structural anomaly detection, EfficientAD can also address
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)

By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
- This is a challenge that Patchcore does not directly address.~#cite(<efficientADpaper>)
+ This is a challenge that PatchCore does not directly address.~#cite(<efficientADpaper>)
#todo[maybe add key advantages such as low computational cost and high performance]

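The logical-anomaly branch can be pictured the same way: an autoencoder reproduces the teacher's features, its discrepancy to the student yields a second map, and both maps are combined. The sketch below only illustrates the combination step with stand-in tensors; the map normalisation used in the paper is omitted:

import torch

# Illustrative only: stand-in feature maps from the teacher, the student, and an
# autoencoder that was trained to reproduce the teacher's features.
t = torch.randn(1, 384, 56, 56)   # teacher features
s = torch.randn(1, 384, 56, 56)   # student features
ae = torch.randn(1, 384, 56, 56)  # autoencoder features

structural_map = ((t - s) ** 2).mean(dim=1)   # teacher vs. student -> structural anomalies
logical_map = ((ae - s) ** 2).mean(dim=1)     # autoencoder vs. student -> logical anomalies
combined_map = 0.5 * structural_map + 0.5 * logical_map   # simple average of the two maps
image_score = combined_map.amax()             # image-level anomaly score
print(image_score.item())
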
@@ -227,7 +227,7 @@ Convolutional layers capture features like edges, textures or shapes.
Pooling layers downsample the feature maps created by the convolutional layers.
This helps reduce the computational complexity of the overall network and helps to reduce overfitting.
Common pooling layers include average and max pooling.
- Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
+ Finally, after some convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)

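As a concrete illustration of this layer structure, a minimal PyTorch sketch of such a network for a binary classification task (all hyperparameters are arbitrary):

import torch
import torch.nn as nn

# Minimal CNN: convolution + pooling blocks extract features,
# the flattened feature map is classified by fully connected layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 64), nn.ReLU(),
    nn.Linear(64, 2),                 # two output classes (binary classification)
)

x = torch.randn(8, 3, 64, 64)         # a batch of 8 RGB images
logits = model(x)
print(logits.shape)                   # torch.Size([8, 2])
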
#figure(
@@ -263,11 +263,11 @@ There are well established methods for pretraining which can be used such as DIN
#cite(<pmfpaper>)

*Meta-training:*
- The second stage in the pipline as in @pmfarchitecture is the meta-training.
- Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
+ The second stage in the pipeline, as shown in @pmfarchitecture, is the meta-training.
+ Here a prototypical network (ProtoNet) is used to refine the pretrained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
See @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to class $k$ is given by the $exp$ of its distance to the class prototype $c_k$, divided by the sum of the $exp$ of its distances to all class prototypes:

$
@@ -276,7 +276,7 @@ $

As a distance metric $d$, the cosine similarity is used. See @cosinesimilarity for the formula.
$c_k$, the prototype of class $k$, is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of samples of class $k$.
- The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
+ The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite(<pmfpaper>)
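The prototype construction and the nearest-class-centroid decision can be sketched as follows; the backbone $f$ is replaced by precomputed embeddings, and any temperature scaling is omitted:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prototypes(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    # c_k = mean of the embeddings of all support samples with label k.
    return {k: embeddings[labels == k].mean(axis=0) for k in np.unique(labels)}

def classify(query: np.ndarray, protos: dict):
    # p(y = k | x) is the softmax over the cosine similarities d(f(x), c_k).
    classes = sorted(protos)
    sims = np.array([cosine(query, protos[k]) for k in classes])
    probs = np.exp(sims) / np.exp(sims).sum()
    return classes[int(np.argmax(probs))], probs

# Toy 2-way 3-shot episode with 5-dimensional embeddings.
rng = np.random.default_rng(0)
support = np.vstack([rng.normal(0, 1, (3, 5)), rng.normal(3, 1, (3, 5))])
labels = np.array([0, 0, 0, 1, 1, 1])
query = rng.normal(3, 1, 5)

protos = prototypes(support, labels)
pred, probs = classify(query, protos)
print(pred, probs)

During inference (described further below), the same prototype computation is applied to the support set and the query is assigned to the closest prototype.
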

*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the distribution.
@@ -293,29 +293,29 @@ During this step, the entire model is fine-tuned to the new domain.~#cite(<pmfpa
*Inference:*
During inference the support set is used to calculate the class prototypes.
For a query image, the feature extractor extracts its embedding in a lower-dimensional space and compares it to the pre-computed prototypes.
- The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
+ The query image is then assigned to the class with the closest prototype.~#cite(<pmfpaper>)

*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
- The combination of pre-training on large dataset and meta-trainng with episodic tasks helps the model to generalize well.
- The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
+ The combination of pre-training on a large dataset and meta-training with episodic tasks helps the model to generalize well.
+ The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.~#cite(<pmfpaper>)

*Limitations and Scalability:*
This method has some limitations.
- It relies on domains with large external datasets and it requires substantial computational resources to create pre-trained models.
- Fine-tuning is effective but might be slow and not work well on devices with limited computationsl resources.
+ It relies on domains with large external datasets and it requires substantial computational resources to create pretrained models.
+ Fine-tuning is effective but might be slow and not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)

=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
- It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
+ It consists of three different components: a frozen pretrained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder, and a non-causal sequence model.
This is a universal meta-learning approach.
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)

*Architecture:*
- CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
+ CAML first encodes the query and support set images using the frozen pretrained feature extractor, as shown in @camlarchitecture.
This step brings the images into a low-dimensional space where similar images are encoded into similar embeddings.
The class labels are encoded with the ELMES class encoder.
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is added to the encoder.
@@ -343,14 +343,14 @@ Afterwards it is passed through a simple MLP network to predict the class of the
~#cite(<caml_paper>)

*Large-Scale Pre-Training:*
- CAML is pre-trained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
+ CAML is pretrained on a large number of images from the ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
Those datasets span different domains and help the model handle new visual concepts during inference.
Only the non-causal sequence model is trained; the weights of the image encoder and the ELMES encoder are kept frozen.
~#cite(<caml_paper>)

*Inference:*
During inference, CAML processes the following (see the sketch after this list):
- - Encodes the support set images and labels with the pre-trained feature and class encoders.
+ - Encodes the support set images and labels with the pretrained feature and class encoders.
- Concatenates these encodings into a sequence alongside the query image embedding.
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
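A schematic PyTorch sketch of this sequence construction: a standard TransformerEncoder stands in for the non-causal sequence model, random tensors stand in for the frozen image-encoder outputs, and an identity matrix stands in for the ELMES class encoding, so all shapes and dimensions are made up for illustration:

import torch
import torch.nn as nn

d_model = 64                     # joint embedding size (illustrative)
n_support, n_classes = 5, 5      # 5-way 1-shot episode

# Stand-ins for the frozen components: image embeddings and class-label encodings.
support_img = torch.randn(n_support, d_model // 2)   # frozen image-encoder outputs
query_img = torch.randn(1, d_model // 2)
label_enc = torch.eye(n_classes, d_model // 2)        # stand-in for the ELMES class encoding
unknown_token = torch.zeros(1, d_model // 2)          # the learnable "unknown" class token

# Concatenate image and class encodings, then stack support and query into one sequence.
support_tokens = torch.cat([support_img, label_enc], dim=-1)
query_token = torch.cat([query_img, unknown_token], dim=-1)
sequence = torch.cat([support_tokens, query_token], dim=0).unsqueeze(0)   # (1, 6, d_model)

# Non-causal sequence model: every token attends to every other token.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
classifier = nn.Linear(d_model, n_classes)   # linear head standing in for CAML's MLP classifier

out = encoder(sequence)
logits = classifier(out[:, -1])   # the prediction is read off the transformed query token
print(logits.shape)               # torch.Size([1, 5])
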
@@ -365,7 +365,7 @@ It performes competitively against P>M>F in 8 benchmarks even though P>M>F was m

CAML generalizes well and offers efficient inference, but faces limitations in specialized domains (e.g., ChestX)
and low-resolution tasks (e.g., CIFAR-fs).
- Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
+ Its use of frozen pretrained feature extractors is key to avoiding overfitting and enabling robust performance.
~#cite(<caml_paper>)
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]

@@ -383,17 +383,17 @@ Either they performed worse on benchmarks compared to the used methods or they w
- // https://arxiv.org/pdf/2211.16191v2
+ // https://arxiv.org/abs/2211.16191v2

- SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
- It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
+ SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pretrained vision-language models like CLIP.
+ It focuses on generating better visual features for specific tasks while still using the general knowledge from the pretrained model.
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
- This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
+ This process is supported by knowledge distillation, where detailed information from the pretrained model guides the learning of the new visual features.
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
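To make the adapter idea concrete, a rough PyTorch sketch (not the actual SgVA-CLIP objective): a small visual adapter sits on top of frozen CLIP-style image features and is trained with a classification term on the few labeled samples plus a distillation term that keeps the adapted features close to the frozen ones; the loss weighting is arbitrary:

import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_classes = 512, 5

# Stand-ins for frozen, pretrained CLIP-style features of a few labeled images.
frozen_feats = torch.randn(20, feat_dim)      # image features from the frozen encoder
labels = torch.randint(0, n_classes, (20,))

adapter = nn.Sequential(                      # small task-specific visual adapter
    nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
)
classifier = nn.Linear(feat_dim, n_classes)

adapted = adapter(frozen_feats)
ce_loss = F.cross_entropy(classifier(adapted), labels)   # fit the few labeled samples
distill_loss = F.mse_loss(adapted, frozen_feats)          # stay close to the pretrained knowledge
loss = ce_loss + 0.1 * distill_loss                        # weighting chosen arbitrarily
print(loss.item())
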

One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
- The use of pre-trained knowledge helps reduce the need for large datasets.
- However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
- If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
- This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
+ The use of pretrained knowledge helps reduce the need for large datasets.
+ However, a disadvantage is that it depends heavily on the quality and capabilities of the pretrained model.
+ If the pretrained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
+ This can be a problem for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pretrained models.
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)

=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
@@ -414,7 +414,7 @@ Its ability to isolate critical features while droping irellevant context aligns
- // https://arxiv.org/pdf/2204.03065v1
+ // https://arxiv.org/abs/2204.03065v1

- The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tsks like matching, grouping or classification by re-embedding feature representations.
+ The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping, or classification by re-embedding feature representations.
This transform processes features as a set instead of using them individually.
This creates context-aware representations.
SOT captures direct as well as indirect similarities between features, which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
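The core computation can be pictured as follows (an illustrative sketch, not the reference implementation): pairwise cosine similarities of the feature set are turned into an approximately doubly-stochastic matrix by Sinkhorn normalisation, and each row of that matrix is taken as the re-embedded representation of the corresponding feature; details such as the treatment of the self-similarity diagonal differ in the actual method:

import numpy as np

def sot_reembed(features: np.ndarray, n_iter: int = 20, temperature: float = 0.1) -> np.ndarray:
    # Pairwise cosine similarities of the whole feature set.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)      # ignore trivial self-similarity
    k = np.exp(sim / temperature)       # similarity -> transport kernel
    # Sinkhorn iterations: alternately normalise rows and columns
    # until the matrix is approximately doubly stochastic.
    for _ in range(n_iter):
        k = k / k.sum(axis=1, keepdims=True)
        k = k / k.sum(axis=0, keepdims=True)
    return k                             # row i is the re-embedding of feature i

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 32))     # a set of 10 feature vectors
embeddings = sot_reembed(features)
print(embeddings.shape)                  # (10, 10): each feature re-embedded w.r.t. the whole set

Each row then serves as a context-aware representation that reflects both direct and indirect similarities within the set.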