fix lots of typos

2025-02-02 12:59:15 +01:00
parent 94fe252741
commit cf6f4f96ac
5 changed files with 53 additions and 53 deletions


@ -55,7 +55,7 @@ More defect classes are already an indication that a classification task might b
Cut outer insulation
]), <e>,
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
Mising cable defect
Missing cable defect
]), <e>,
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
Poke insulation defect
@ -142,7 +142,7 @@ $ <cosinesimilarity>
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
It is computed as the square root of the sum of the squared differences of the coordinates.
the euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
The euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
@analysisrudin
$
@ -150,14 +150,14 @@ $
$ <euclideannorm>
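
To make the two metrics concrete, the following minimal NumPy sketch computes the cosine similarity from @cosinesimilarity and the Euclidean distance from @euclideannorm; it is purely illustrative and not part of any of the referenced implementations.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the two vectors, normalized by their L2 norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # L2 norm of the difference vector.
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])
print(cosine_similarity(a, b), euclidean_distance(a, b))
```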
=== Patchcore
=== PatchCore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
It operates on the principle that an image is anomalous if any of its patches is anomalous.
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
#todo[Rephrase and simplify this paragraph]
The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
The PatchCore framework leverages a pretrained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in the bias associated with high-level features pretrained on ImageNet.
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
@ -172,13 +172,13 @@ If any patch exhibits a significant deviation, the corresponding image is flagge
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
Patchcore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
PatchCore reaches a 99.6% AUROC on the MVTec AD dataset when detecting anomalies.
A great advantage of this method is its coreset subsampling, which reduces the size of the memory bank significantly.
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
#figure(
image("rsc/patchcore_overview.png", width: 80%),
caption: [Architecture of Patchcore. #cite(<patchcorepaper>)],
caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
) <patchcoreoverview>
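
The scoring step described above can be sketched as follows; this is a simplified stand-in (the actual method additionally re-weights the image-level score and upsamples the patch scores into a segmentation map), and the feature extraction and coreset selection are assumed to have happened elsewhere.

```python
import numpy as np
from scipy.spatial.distance import cdist

def patch_scores(test_patches: np.ndarray, memory_bank: np.ndarray) -> np.ndarray:
    """test_patches: (n, d) patch features of one test image;
    memory_bank: (m, d) coreset-subsampled patch features of normal images."""
    # Distance of every test patch to its nearest neighbour in the memory bank.
    return cdist(test_patches, memory_bank).min(axis=1)

scores = patch_scores(np.random.rand(196, 128), np.random.rand(500, 128))
image_score = scores.max()  # an image is anomalous if its most anomalous patch is
```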
=== EfficientAD
@ -186,13 +186,13 @@ This lowers computational costs while maintaining detection accuracy.~#cite(<pat
EfficientAD is another state-of-the-art method for anomaly detection.
It focuses on achieving strong detection performance while remaining computationally efficient.
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
In comparison to Patchcore, which relies on a deeper, more computationaly heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
In comparison to PatchCore, which relies on a deeper, more computationaly heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
#todo[reference to image below]
The detection of anomalies is achieved through a student-teacher framework.
The teacher network is a PDN and pre-trained on normal (good) images and the student network is trained to predict the teachers output.
An anomalie is identified when the student failes to replicate the teachers output.
The teacher network is a PDN and pretrained on normal (good) images and the student network is trained to predict the teachers output.
An anomaly is identified when the student fails to replicate the teachers output.
This works because anomalies are absent from the training data, so the student network has never seen an anomaly during training.
A special loss function keeps the student network from generalizing too broadly and thus from learning to predict anomalous features.~#cite(<efficientADpaper>)
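
The student-teacher idea can be illustrated with the simplified PyTorch sketch below; `teacher` and `student` are generic feature extractors here, not the actual PDN, and the anomaly map is simply the squared difference between their outputs, which stays small on normal images the student has learned to imitate.

```python
import torch

@torch.no_grad()
def structural_anomaly_map(teacher: torch.nn.Module,
                           student: torch.nn.Module,
                           image: torch.Tensor) -> torch.Tensor:
    # Both networks map an image batch to feature maps of shape (B, C, H, W).
    t_feat = teacher(image)
    s_feat = student(image)
    # A large squared error means the student cannot replicate the teacher -> anomaly.
    return ((t_feat - s_feat) ** 2).mean(dim=1)  # (B, H, W) anomaly map
```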
@ -200,7 +200,7 @@ Additionally to this structural anomaly detection, EfficientAD can also address
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
This is a challenge that Patchcore does not directly address.~#cite(<efficientADpaper>)
This is a challenge that PatchCore does not directly address.~#cite(<efficientADpaper>)
#todo[maybe add key advantages such as low computational cost and high performance]
@ -227,7 +227,7 @@ Convolutional layers capture features like edges, textures or shapes.
Pooling layers sample down the feature maps created by the convolutional layers.
This helps reduce the computational complexity of the overall network and helps prevent overfitting.
Common pooling layers include average- and max pooling.
Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
Finally, after some convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)
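
A minimal PyTorch sketch of such a pipeline, i.e. convolutional and pooling layers followed by a flattened fully connected classification head, is shown below; the layer sizes are arbitrary and chosen only for illustration.

```python
import torch.nn as nn

binary_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # edges, textures
    nn.MaxPool2d(2),                                         # downsample feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # higher-level shapes
    nn.MaxPool2d(2),
    nn.Flatten(),                                            # flatten the feature maps
    nn.Linear(32 * 56 * 56, 2),                              # binary classification head
)
# expects 3x224x224 inputs: 224 -> 112 -> 56 after the two pooling layers
```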
#figure(
@ -263,11 +263,11 @@ There are well established methods for pretraining which can be used such as DIN
#cite(<pmfpaper>)
*Meta-training:*
The second stage in the pipline as in @pmfarchitecture is the meta-training.
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
The second stage in the pipeline as in @pmfarchitecture is the meta-training.
Here a prototypical network (ProtoNet) is used to refine the pretrained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
See @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
The ProtoNet only requires a backbone $f$ to map images to a m-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to class $k$ is given by the exponential of the distance of the sample to the class center of $k$, divided by the sum of the exponentials of the distances to all class centers:
$
@ -276,7 +276,7 @@ $
As the distance metric $d$, the cosine similarity is used. See @cosinesimilarity for the formula.
$c_k$, the prototype of class $k$, is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of samples of class $k$.
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite(<pmfpaper>)
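
A minimal PyTorch sketch of this nearest-centroid step, assuming the backbone embeddings have already been computed (illustrative only, not the P>M>F reference code):

```python
import torch
import torch.nn.functional as F

def protonet_probs(support_feats, support_labels, query_feats, num_classes):
    """support_feats: (n_support, m), query_feats: (n_query, m), labels in [0, num_classes)."""
    # c_k = mean embedding of all support samples of class k
    prototypes = torch.stack([support_feats[support_labels == k].mean(dim=0)
                              for k in range(num_classes)])           # (num_classes, m)
    # d(x, c_k) = cosine similarity between the query embedding and each prototype
    sims = F.cosine_similarity(query_feats.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    # softmax over the class similarities gives p(y = k | x)
    return sims.softmax(dim=-1)                                       # (n_query, num_classes)
```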
*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the distribution.
@ -293,29 +293,29 @@ During this step, the entire model is fine-tuned to the new domain.~#cite(<pmfpa
*Inference:*
During inference, the support set is used to calculate the class prototypes.
For a query image, the feature extractor computes its embedding in the lower-dimensional space, which is then compared to the pre-computed prototypes.
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
The query image is then assigned to the class with the closest prototype.~#cite(<pmfpaper>)
*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
The combination of pre-training on large dataset and meta-trainng with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
The combination of pre-training on large dataset and meta-training with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.~#cite(<pmfpaper>)
*Limitations and Scalability:*
This method has some limitations.
It relies on domains with large external datasets and it requires substantial computational resources to create pre-trained models.
Fine-tuning is effective but might be slow and not work well on devices with limited computationsl resources.
It relies on domains with large external datasets and it requires substantial computational resources to create pretrained models.
Fine-tuning is effective but might be slow and not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)
=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
It consists of three different components: a frozen pretrained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
This is a universal meta-learning approach.
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
*Architecture:*
CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
CAML first encodes the query and support set images using the frozen pretrained feature extractor as shown in @camlarchitecture.
This step maps the images into a low-dimensional space where similar images are encoded into similar embeddings.
The class labels are encoded with the ELMES class encoder.
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is added to the encoder.
@ -343,14 +343,14 @@ Afterwards it is passed through a simple MLP network to predict the class of the
~#cite(<caml_paper>)
*Large-Scale Pre-Training:*
CAML is pre-trained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
CAML is pretrained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
These datasets span different domains and help the model recognize new visual concepts during inference.
Only the non-causal sequence model is trained and the weights of the image encoder and ELMES encoder are kept frozen.
~#cite(<caml_paper>)
*Inference:*
During inference, CAML processes the following:
- Encodes the support set images and labels with the pre-trained feature and class encoders.
- Encodes the support set images and labels with the pretrained feature and class encoders.
- Concatenates these encodings into a sequence alongside the query image embedding.
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
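
These inference steps can be summarized in the following hedged sketch, where `image_encoder`, `elmes_encode`, `sequence_model`, `mlp`, and `unknown_token` are placeholder components standing in for the respective parts of the architecture, not the authors' actual API.

```python
import torch

def caml_predict(query_img, support_imgs, support_labels,
                 image_encoder, elmes_encode, sequence_model, mlp, unknown_token):
    # 1) The frozen pretrained encoder embeds query and support images.
    q_emb = image_encoder(query_img.unsqueeze(0))          # (1, d_img)
    s_emb = image_encoder(support_imgs)                    # (n, d_img)
    # 2) ELMES encodes the support labels; the query gets the learnable "unknown" token.
    s_cls = elmes_encode(support_labels)                   # (n, d_cls)
    q_cls = unknown_token.unsqueeze(0)                     # (1, d_cls)
    # 3) Concatenate image and class encodings and form one sequence.
    seq = torch.cat([torch.cat([q_emb, q_cls], dim=-1),
                     torch.cat([s_emb, s_cls], dim=-1)], dim=0)
    # 4) The non-causal sequence model lets query and support representations interact.
    out = sequence_model(seq.unsqueeze(0)).squeeze(0)      # (n + 1, d)
    # 5) The transformed query embedding is classified by an MLP.
    return mlp(out[0])                                     # class logits
```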
@ -365,7 +365,7 @@ It performes competitively against P>M>F in 8 benchmarks even though P>M>F was m
CAML generalizes well and is efficient at inference, but it faces limitations in specialized domains (e.g., ChestX)
and in low-resolution tasks (e.g., CIFAR-fs).
Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
Its use of frozen pretrained feature extractors is key to avoiding overfitting and enabling robust performance.
~#cite(<caml_paper>)
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
@ -383,17 +383,17 @@ Either they performed worse on benchmarks compared to the used methods or they w
// https://arxiv.org/pdf/2211.16191v2
// https://arxiv.org/abs/2211.16191v2
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pretrained vision-language models like CLIP.
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pretrained model.
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
This process is supported by knowledge distillation, where detailed information from the pretrained model guides the learning of the new visual features.
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
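
A loose, simplified sketch of this general idea is given below: a residual adapter on top of frozen CLIP image features, trained with a CLIP-style contrastive/classification term plus a distillation-style regularizer. Names, layer sizes, and loss terms are illustrative assumptions and do not reproduce the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Simplified adapter on top of frozen CLIP image features (illustrative only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # Residual adaptation keeps the adapted features close to the CLIP space.
        return clip_feats + self.net(clip_feats)

def adapter_loss(adapted, clip_feats, text_feats, labels, tau=0.07):
    # Contrastive/classification term: adapted image features vs. class text features.
    logits = F.normalize(adapted, dim=-1) @ F.normalize(text_feats, dim=-1).t() / tau
    cls_loss = F.cross_entropy(logits, labels)
    # Distillation-style term: keep the adapted features close to the pretrained CLIP features.
    distill_loss = F.mse_loss(adapted, clip_feats)
    return cls_loss + distill_loss
```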
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
The use of pre-trained knowledge helps reduce the need for large datasets.
However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
The use of pretrained knowledge helps reduce the need for large datasets.
However, a disadvantage is that it depends heavily on the quality and capabilities of the pretrained model.
If the pretrained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pretrained models.
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
@ -414,7 +414,7 @@ Its ability to isolate critical features while dropping irrelevant context aligns
// https://arxiv.org/pdf/2204.03065v1
// https://arxiv.org/abs/2204.03065v1
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tsks like matching, grouping or classification by re-embedding feature representations.
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
This transform processes the features as a set rather than individually.
This creates context-aware representations.
SOT captures direct as well as indirect similarities between features, which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
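
A rough sketch of the core idea follows: each feature is re-embedded as its row of a Sinkhorn-normalized pairwise-similarity matrix over the whole set. Details such as the handling of the diagonal and the exact normalization differ in the original paper; this is only an illustration of the principle.

```python
import numpy as np

def sot_like_transform(feats: np.ndarray, n_iters: int = 20, reg: float = 0.1) -> np.ndarray:
    """feats: (n, d) feature set; returns (n, n) context-aware re-embeddings."""
    # Cosine similarities between all pairs of features in the set.
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)       # discourage trivial self-matching
    kernel = np.exp(sims / reg)           # entropy-regularized affinities
    # Sinkhorn iterations: alternately normalize rows and columns
    # to approach a doubly stochastic matrix.
    for _ in range(n_iters):
        kernel /= kernel.sum(axis=1, keepdims=True)
        kernel /= kernel.sum(axis=0, keepdims=True)
    return kernel

new_embeddings = sot_like_transform(np.random.rand(30, 64))
```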