#import "@preview/subpar:0.1.1"
#import "utils.typ": todo
#import "@preview/equate:0.2.1": equate
= Material and Methods <sectionmaterialandmethods>
== Material
=== MVTec AD
MVTec AD is a dataset for benchmarking anomaly detection methods with a focus on industrial inspection.
It contains 5354 high-resolution images divided into fifteen different object and texture categories.
Each category comprises a set of defect-free training images and a test set of images with various kinds of defects as well as images without defects.
#figure(
image("rsc/mvtec/dataset_overview_large.png", width: 80%),
caption: [Example images from the MVTec AD dataset. #cite(<datasetsampleimg>)],
) <datasetoverview>
In this bachelor thesis only two categories are used: "Bottle" and "Cable".
The bottle category contains three defect classes: _broken_large_, _broken_small_ and _contamination_.
#subpar.grid(
figure(image("rsc/mvtec/bottle/broken_large_example.png"), caption: [
Broken large defect
]), <a>,
figure(image("rsc/mvtec/bottle/broken_small_example.png"), caption: [
Broken small defect
]), <b>,
figure(image("rsc/mvtec/bottle/contamination_example.png"), caption: [
Contamination defect
]), <c>,
columns: (1fr, 1fr, 1fr),
caption: [Defect classes of the bottle category],
label: <full>,
)
The cable category has considerably more defect classes: _bent_wire_, _cable_swap_, _combined_, _cut_inner_insulation_,
_cut_outer_insulation_, _missing_cable_, _missing_wire_ and _poke_insulation_.
This larger number of defect classes already indicates that classification might be more difficult for the cable category.
#subpar.grid(
figure(image("rsc/mvtec/cable/bent_wire_example.png"), caption: [
Bent wire defect
]), <a>,
figure(image("rsc/mvtec/cable/cable_swap_example.png"), caption: [
Cable swap defect
]), <b>,
figure(image("rsc/mvtec/cable/combined_example.png"), caption: [
Combined defect
]), <c>,
figure(image("rsc/mvtec/cable/cut_inner_insulation_example.png"), caption: [
Cut inner insulation
]), <d>,
figure(image("rsc/mvtec/cable/cut_outer_insulation_example.png"), caption: [
Cut outer insulation
]), <e>,
figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
Missing cable defect
]), <f>,
figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
Poke insulation defect
]), <g>,
figure(image("rsc/mvtec/cable/missing_wire_example.png"), caption: [
Missing wire defect
]), <h>,
columns: (1fr, 1fr, 1fr, 1fr),
caption: [Defect classes of the cable category],
label: <full>,
)
== Methods
=== Few-Shot Learning
Few-shot learning is a subfield of machine learning which aims to train a classification model with just a few samples or no samples at all.
In contrast to traditional supervised learning, where a huge amount of labeled data is required to generalize well to unseen data,
only 1-10 samples per class (so-called shots) are available here.
The model is therefore prone to overfitting to the few training samples, which means that they should represent the whole sample distribution as well as possible.~#cite(<parnami2022learningexamplessummaryapproaches>)
A few-shot learning task typically consists of a support set and a query set.
The support set contains the training data and the query set contains the evaluation data for real-world evaluation.
A common way to describe a few-shot learning problem is the n-way k-shot notation.
For example, a task with 3 target classes and 5 training samples per class is a 3-way 5-shot few-shot classification problem.~@snell2017prototypicalnetworksfewshotlearning @patchcorepaper
A classical example of how such a model might work is a prototypical network.
These models learn a representation of each class in a reduced dimensionality and classify new examples based on proximity to these representations in an embedding space.~@snell2017prototypicalnetworksfewshotlearning
#figure(
image("rsc/prototype_fewshot_v3.png", width: 60%),
caption: [Prototypical network for a 3-way 5-shot task. #cite(<snell2017prototypicalnetworksfewshotlearning>)],
) <prototypefewshot>
The first and simplest method of this bachelor thesis uses a ResNet50 to calculate those embeddings and groups the shots of each class by calculating the class center.
This is basically a simple prototypical network.
See @resnet50impl.~@chowdhury2021fewshotimageclassificationjust
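
The class-center idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the embeddings are made-up 2-dimensional vectors instead of ResNet50 outputs, and the helper names `class_prototypes` and `classify` are hypothetical and not taken from the thesis code.

```python
import math

def class_prototypes(embeddings, labels):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Toy 2-way 2-shot support set with 2-dimensional embeddings (made-up values).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["good", "good", "defect", "defect"]
prototypes = class_prototypes(support, support_labels)
prediction = classify([5.0, 3.0], prototypes)
```

In this toy setting the query lies next to the center of the defect shots and is therefore assigned to that class.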
=== Generalisation from few samples
Generalizing from so few samples is an especially hard task.
In typical supervised learning the model sees thousands or millions of samples of the corresponding domain during learning.
This helps the model to learn the underlying patterns and to generalize well to unseen data.
In few-shot learning the model has to generalize from just a few samples.#todo[Source?]#todo[Write more about. eg. class distributions]
=== Softmax
#todo[Maybe remove this section]
The Softmax function @softmax #cite(<liang2017soft>) converts a vector of $n$ real numbers into a probability distribution.
It is a generalization of the Sigmoid function and is often used as an activation layer in neural networks.
$
sigma(bold(z))_j = (e^(z_j)) / (sum_(k=1)^n e^(z_k)) "for" j in {1,...,n}
$ <softmax>
The softmax function closely resembles the Boltzmann distribution, which was first introduced in the 19th century #cite(<Boltzmann>).
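
A minimal sketch of the softmax function in plain Python may help to make the formula concrete. Subtracting the maximum before exponentiating is a common numerical safeguard and is an addition of this sketch; it does not change the result.

```python
import math

def softmax(z):
    """Map a list of real numbers to a probability distribution."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

The returned values are positive, sum to one, and keep the order of the inputs.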
=== Cross Entropy Loss
#todo[Maybe remove this section]
Cross Entropy Loss is a well-established loss function in machine learning.
@crelformal #cite(<crossentropy>) shows the general formal definition of the Cross Entropy Loss.
@crelbinary is the special case of the general Cross Entropy Loss for binary classification tasks.
$
H(p,q) &= -sum_(x in cal(X)) p(x) log q(x) #<crelformal>\
H(p,q) &= -(p log(q) + (1-p) log(1-q)) #<crelbinary>\
cal(L)(p,q) &= -1/cal(B) sum_(i=1)^(cal(B)) (p_i log(q_i) + (1-p_i) log(1-q_i)) #<crelbatched>
$ <crel>
Equation~$cal(L)(p,q)$ @crelbatched #cite(<handsonaiI>) is the Binary Cross Entropy Loss for a batch of size $cal(B)$ and is used for model training in this practical work.
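
The batched binary cross entropy can be sketched as follows. This is an illustrative plain Python version, not the training code of this work; the clamping constant `eps` is an assumption added to keep the logarithm finite.

```python
import math

def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross entropy over a batch of (target, prediction) pairs."""
    total = 0.0
    for p, q in zip(targets, predictions):
        q = min(max(q, eps), 1.0 - eps)  # clamp to keep log() finite
        total += p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return -total / len(targets)

# Two samples with an uninformative prediction of 0.5 each.
loss = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
```

A confident correct prediction yields a small loss, while a confident wrong prediction is penalized heavily.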
=== Cosine Similarity
Several distance and similarity measures are commonly used to compare two vectors.
A popular one is the Cosine Similarity (@cosinesimilarity).
It measures the cosine of the angle between two vectors.
The Cosine Similarity is especially useful when the magnitude of the vectors is not important.
$
cos(theta) &:= (A dot B) / (||A|| dot ||B||)\
&= (sum_(i=1)^n A_i B_i)/ (sqrt(sum_(i=1)^n A_i^2) dot sqrt(sum_(i=1)^n B_i^2))
$ <cosinesimilarity>
#todo[Source?]
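
As a small illustration, the cosine similarity can be computed in plain Python as follows. The example vectors are made up; the first pair points in the same direction with different magnitudes, the second pair is orthogonal.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

The first value is one up to rounding and the second is zero, independent of the vector lengths.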
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is a simpler way to measure the distance between two points in a vector space.
It is the square root of the sum of the squared differences of the coordinates.
The Euclidean distance can also be expressed as the L2 norm (Euclidean norm) of the difference of the two vectors.
$
cal(d)(A,B) = ||A-B|| := sqrt(sum_(i=1)^n (A_i - B_i)^2)
$ <euclideannorm>
#todo[Source?]
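
A direct plain Python translation of the formula might look like this. The vectors are made-up values chosen so that the result is easy to check by hand.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

distance = euclidean_distance([1.0, 2.0], [4.0, 6.0])
scaled = euclidean_distance([1.0, 2.0], [2.0, 4.0])
```

Unlike the cosine similarity, this measure is sensitive to magnitude: the second pair points in the same direction but still has a non-zero distance.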
=== PatchCore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
It operates on the principle that an image is anomalous if any of its patches is anomalous.
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
#todo[Rephrase and simplify this paragraph]
The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pre-trained on ImageNet.
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
A crucial component of PatchCore is its memory bank, which stores patch-level features derived from the training dataset.
This memory bank represents the nominal distribution of features against which test patches are compared.
To ensure computational efficiency and scalability, PatchCore employs a coreset reduction technique to condense the memory bank by selecting the most representative patch features.
This optimization reduces both storage requirements and inference times while maintaining the integrity of the feature space. #cite(<patchcorepaper>)
#todo[reference to image below]
During inference, PatchCore computes anomaly scores by measuring the distance between patch features from test images and their nearest neighbors in the memory bank.
If any patch exhibits a significant deviation, the corresponding image is flagged as anomalous.
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
PatchCore reaches an image-level AUROC of up to 99.6% on the MVTec AD dataset when detecting anomalies.
A great advantage of this method is the coreset subsampling reducing the memory bank size significantly.
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
#figure(
image("rsc/patchcore_overview.png", width: 80%),
caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
) <patchcoreoverview>
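
The interplay of memory bank, coreset reduction and nearest-neighbour scoring can be sketched in plain Python. This is a strongly simplified illustration under assumptions: the patch features are made-up 2-dimensional vectors instead of CNN features, the coreset is chosen with a basic farthest-point selection, and the image score is simply the largest nearest-neighbour distance without any further re-weighting. The function names are hypothetical.

```python
import math

def greedy_coreset(features, size):
    """Farthest-point (k-center greedy) selection of representative features."""
    selected = [0]  # start from the first feature (an arbitrary choice)
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < size:
        idx = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(idx)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[idx]))
    return [features[i] for i in selected]

def image_anomaly_score(patch_features, memory_bank):
    """Largest nearest-neighbour distance over all patches of one image."""
    return max(min(math.dist(p, m) for m in memory_bank) for p in patch_features)

# Made-up 2-dimensional patch features from defect-free training images.
nominal = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [0.0, 1.0]]
memory_bank = greedy_coreset(nominal, size=3)

normal_score = image_anomaly_score([[0.05, 0.0], [1.0, 1.0]], memory_bank)
anomalous_score = image_anomaly_score([[0.05, 0.0], [3.0, 3.0]], memory_bank)
```

A single patch far away from the memory bank is enough to raise the score of the whole image, which mirrors the principle that an image is anomalous if any of its patches is anomalous.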
=== EfficientAD
// https://arxiv.org/pdf/2303.14535
EfficientAD is another state-of-the-art method for anomaly detection.
It focuses on maintaining detection performance while achieving high computational efficiency.
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
In comparison to PatchCore, which relies on a deeper, more computationally heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
#todo[reference to image below]
The detection of anomalies is achieved through a student-teacher framework.
The teacher network is a pre-trained PDN, and the student network is trained on normal (good) images to predict the teacher's output.
An anomaly is identified when the student fails to replicate the teacher's output.
This works because the training data contains no anomalies, so the student network has never seen an anomaly during training.
A special loss function keeps the student network from generalizing too broadly and thereby learning to predict the teacher's output on anomalous features as well.~#cite(<efficientADpaper>)
In addition to this structural anomaly detection, EfficientAD can also address logical anomalies, such as violations of spatial or contextual constraints (e.g. wrongly arranged objects).
This is done by integrating an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
This is a challenge that PatchCore does not directly address.~#cite(<efficientADpaper>)
#todo[maybe add key advantages such as low computational cost and high performance]
#figure(
image("rsc/efficientad_overview.png", width: 80%),
caption: [Architecture of EfficientAD. #cite(<efficientADpaper>)],
) <efficientadoverview>
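
The student-teacher comparison can be illustrated with a short plain Python sketch. It assumes that both networks have already produced feature maps of the same shape and that the anomaly score of a position is the squared difference averaged over the channels; the feature values are made up and the function name `anomaly_map` is hypothetical.

```python
def anomaly_map(teacher_out, student_out):
    """Channel-wise mean squared difference for every spatial position."""
    return [[sum((t - s) ** 2 for t, s in zip(t_vec, s_vec)) / len(t_vec)
             for t_vec, s_vec in zip(t_row, s_row)]
            for t_row, s_row in zip(teacher_out, student_out)]

# Made-up 2x2 feature maps with 2 channels per position.
teacher = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [0.0, 0.0]]]
# The student matches the teacher everywhere except at the bottom-right position.
student = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [2.0, 2.0]]]

scores = anomaly_map(teacher, student)
image_score = max(max(row) for row in scores)
```

Only the position where the student fails to reproduce the teacher receives a non-zero score, which is what localizes the anomaly.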
=== Jupyter Notebook
A Jupyter notebook is a shareable document which combines code and its output, text and visualizations.
The notebook along with the editor provides an environment for fast prototyping and data analysis.
It is widely used in the data science, mathematics and machine learning community.~#cite(<jupyter>)
In the context of this bachelor thesis it was used to test and evaluate the three few-shot learning methods and to compare them.
Furthermore, Matplotlib was used to create the comparison plots.
=== CNN
Convolutional neural networks (CNNs) are model architectures that are especially well suited for processing images, speech and audio signals.
A CNN typically consists of convolutional layers, pooling layers and fully connected layers.
A convolutional layer is a set of learnable kernels (filters).
Each filter performs a convolution operation by sliding a window over the image.
At each position the dot product between the filter and the underlying image region is computed, and together these values form a feature map.
Convolutional layers capture features like edges, textures or shapes.
Pooling layers downsample the feature maps created by the convolutional layers.
This reduces the computational complexity of the overall network and helps against overfitting.
Common pooling layers include average and max pooling.
Finally, after several convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
@cnnarchitecture shows a typical CNN architecture for a binary classification task.~#cite(<cnnintro>)
#figure(
image("rsc/cnn_architecture.png", width: 80%),
caption: [Architecture of a convolutional neural network. #cite(<cnnarchitectureimg>)],
) <cnnarchitecture>
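
The two core operations, convolution and pooling, can be sketched in plain Python. The example is a toy illustration with a hand-picked edge filter instead of learned kernels, without padding, stride or multiple channels.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (the 'convolution' used in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a square window."""
    return [[max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# 5x5 toy image with a vertical edge between column 1 and column 2.
image = [[0, 0, 1, 1, 1]] * 5
edge_kernel = [[-1, 1], [-1, 1]]  # responds to left-to-right intensity increases
feature_map = conv2d(image, edge_kernel)
pooled = max_pool2d(feature_map)
```

The feature map is non-zero only where the window covers the edge, and the pooled map keeps this response at a quarter of the resolution.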
=== ResNet
Residual neural networks (ResNets) are a special type of neural network architecture.
They are especially well suited for very deep models and have been used in many state-of-the-art computer vision tasks.
The main idea behind ResNet is the skip connection.
A skip connection directly connects a layer to a later layer that is not its immediate successor.
This mitigates the vanishing gradient problem and helps with the training of very deep networks.
ResNet has proven to be very successful in many computer vision tasks and is used in this practical work for the classification task.
There are several different ResNet architectures; the most common are ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. #cite(<resnet>)
For this bachelor thesis the ResNet-50 architecture was used to compute the embeddings for the few-shot learning methods.
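
The effect of a skip connection can be shown with a tiny plain Python sketch. It is an assumption-laden simplification: the block uses two small fully connected layers instead of convolutions and omits batch normalization, and all names are hypothetical.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x) with F(x) = W2 relu(W1 x)."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
# With all weights at zero the block simply passes its input through.
identity_output = residual_block(x, zero, zero)
```

Because the input is added back to the output, the block only has to learn a correction to the identity, and the gradient can flow through the addition unchanged.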
=== P$>$M$>$F
// https://arxiv.org/pdf/2204.07305
P>M>F (Pre-training > Meta-training > Fine-tuning) is a three-stage pipeline designed for few-shot learning.
It focuses on simplicity but still achieves competitive performance.
The three stages convert a general feature extractor into a task-specific model through successive optimization steps.
#cite(<pmfpaper>)
*Pre-training:*
The first stage in @pmfarchitecture initializes the backbone feature extractor.
This can be, for instance, a ResNet or ViT which is trained with self-supervised techniques.
The backbone is trained on large-scale datasets from a general domain, such as ImageNet.
This step optimizes for robust feature extraction and builds a foundation model.
There are well-established methods for pre-training which can be used, such as DINO (self-supervised consistency), CLIP (image-text alignment) or BERT (for text data).
#cite(<pmfpaper>)
*Meta-training:*
The second stage of the pipeline in @pmfarchitecture is the meta-training.
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
See @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ to map images to an $m$-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to a class $k$ is given by a softmax over the negative distances between the embedded sample and the class centers.~#cite(<pmfpaper>)
$
p(y=k|x) = exp(-d(f(x), c_k)) / (sum_(k') exp(-d(f(x), c_k')))
$
As the distance metric $d$ the cosine distance is used, which is based on the cosine similarity in @cosinesimilarity.
The prototype $c_k$ of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of samples of class $k$.
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
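
The class probabilities of the ProtoNet can be sketched in plain Python. The sketch assumes that the cosine distance is defined as one minus the cosine similarity and uses made-up 2-dimensional prototypes instead of backbone embeddings; the function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def class_probabilities(query, prototypes):
    """Softmax over negative distances between the query and each prototype."""
    scores = {k: math.exp(-cosine_distance(query, c)) for k, c in prototypes.items()}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

prototypes = {"good": [1.0, 0.0], "defect": [0.0, 1.0]}
probs = class_probabilities([0.9, 0.1], prototypes)
```

The query embedding points almost in the direction of the first prototype, so that class receives the higher probability.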
*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the data distribution.
To overcome this, the model is optionally fine-tuned on the support set for a few gradient steps.
Data augmentation is used to generate a pseudo query set from the support set.
The class prototypes are calculated from the support set and compared against the model's predictions for the pseudo query set.
With the resulting loss the whole model is fine-tuned to the new domain.~#cite(<pmfpaper>)
#figure(
image("rsc/pmfarchitecture.png", width: 100%),
caption: [Architecture of P>M>F. #cite(<pmfpaper>)],
) <pmfarchitecture>
*Inference:*
During inference the support set is used to calculate the class prototypes.
For a query image the feature extractor computes its embedding in a lower-dimensional space, which is then compared to the pre-computed prototypes.
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
*Limitations and Scalability:*
This method has some limitations.
It relies on domains with large external datasets, and creating pre-trained models from them requires substantial computational resources.
Fine-tuning is effective but might be slow and might not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)
=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
It consists of three components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
It is a universal meta-learning approach.
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
*Architecture:*
CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
This step brings the images into a low-dimensional space where similar images are encoded into similar embeddings.
The class labels are encoded with the ELMES class encoder.
Since the class of the query image is unknown at this stage, a special learnable "unknown token" is used in place of its class embedding.
This embedding is learned during pre-training.
Afterwards each image embedding is concatenated with the corresponding class embedding.
~#cite(<caml_paper>)
#todo[Add more references to the architecture image below]
*ELMES Encoder:*
The ELMES (Equal Length and Maximally Equiangular Set) encoder encodes the class labels as vectors of equal length.
The encoder is a bijective mapping between the labels and a set of vectors that have equal length and are maximally equiangular.
#todo[Describe what equiangular and bijective means]
This is similar to one-hot encoding but has some advantages.
This encoder maximizes the algorithm's ability to distinguish between different classes.
~#cite(<caml_paper>)
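
To give an intuition for equal length and equal angles, the following plain Python sketch builds one standard set of vectors with both properties by centering and normalizing one-hot vectors. Whether CAML constructs its class encoder in exactly this way is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def simplex_label_vectors(num_classes):
    """Unit vectors with identical pairwise angles, one per class."""
    vectors = []
    for k in range(num_classes):
        centered = [(1.0 if i == k else 0.0) - 1.0 / num_classes
                    for i in range(num_classes)]
        norm = math.sqrt(sum(v * v for v in centered))
        vectors.append([v / norm for v in centered])
    return vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

labels = simplex_label_vectors(4)
pairwise = [dot(labels[i], labels[j]) for i in range(4) for j in range(i + 1, 4)]
```

For four classes every vector has length one and every pair has the same cosine of minus one third, so no class is closer to any other class than to the rest.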
*Non-causal sequence model:*
The sequence of concatenated image and class embeddings is then fed into a non-causal sequence model.
This might be, for instance, a transformer encoder.
This step conditions the input sequence consisting of the query and support set embeddings.
Visual features from the query and support set can be compared to each other to extract specific information such as content or textures.
This can then be used to predict the class of the query image.
From the output of the sequence model the element at the same position as the query is selected.
Afterwards it is passed through a simple MLP network to predict the class of the query image.
~#cite(<caml_paper>)
*Large-Scale Pre-Training:*
CAML is pre-trained on a huge number of images from the ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
Those datasets span different domains and help the model to recognize new visual concepts during inference.
Only the non-causal sequence model is trained; the weights of the image encoder and the ELMES encoder are kept frozen.
~#cite(<caml_paper>)
*Inference:*
During inference, CAML performs the following steps:
- Encodes the support set images and labels with the pre-trained feature and class encoders.
- Concatenates these encodings into a sequence alongside the query image embedding.
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
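
The first two steps, building the input sequence, can be sketched in plain Python. The sketch stops before the sequence model and the MLP. All values are made up, the unknown token is a fixed placeholder instead of a learned embedding, and placing the query first is an arbitrary choice of this illustration.

```python
def build_sequence(query_embedding, support_embeddings, support_label_vectors, unknown_token):
    """Concatenate image and label vectors; the query gets the unknown token."""
    sequence = [query_embedding + unknown_token]  # query first (arbitrary choice here)
    for image_vec, label_vec in zip(support_embeddings, support_label_vectors):
        sequence.append(image_vec + label_vec)
    return sequence

# Made-up 2-dimensional image embeddings and 2-dimensional label vectors.
support_embeddings = [[0.9, 0.1], [0.2, 0.8]]
support_label_vectors = [[1.0, 0.0], [0.0, 1.0]]
unknown_token = [0.0, 0.0]  # learned in the real model, fixed here for illustration
sequence = build_sequence([0.7, 0.3], support_embeddings, support_label_vectors, unknown_token)
```

Every element of the sequence has the same dimensionality, which allows the sequence model to relate the query to all support samples at once.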
*Performance:*
CAML achieves state-of-the-art performance in universal meta-learning across 11 few-shot classification benchmarks,
including generic object recognition (e.g., MiniImageNet), fine-grained classification (e.g., CUB, Aircraft),
and cross-domain tasks (e.g., Pascal+Paintings).
It outperformed or matched existing models in 14 of 22 evaluation settings.
It performs competitively against P>M>F in 8 benchmarks even though P>M>F was meta-trained on the same domain.
~#cite(<caml_paper>)
CAML excels in generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
and low-resolution tasks (e.g., CIFAR-fs).
Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
~#cite(<caml_paper>)
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
#figure(
image("rsc/caml_architecture.png", width: 100%),
caption: [Architecture of CAML. #cite(<caml_paper>)],
) <camlarchitecture>
== Alternative Methods
There are several alternative methods for few-shot learning as well as for anomaly detection which are not used in this bachelor thesis.
Either they performed worse on benchmarks than the methods used, or they were released after my initial literature research.
=== SgVA-CLIP (Semantic-guided Visual Adapting CLIP)
// https://arxiv.org/pdf/2211.16191v2
// https://arxiv.org/abs/2211.16191v2
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
The use of pre-trained knowledge helps reduce the need for large datasets.
However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
This might rule it out for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
// https://arxiv.org/pdf/2208.10559v1
// https://arxiv.org/abs/2208.10559v1
TRIDENT, a variational inference network, is a few-shot learning approach which decouples the image representation into semantic and label-specific latent variables.
Semantic attributes contain context or stylistic information, while label-specific attributes focus on the characteristics crucial for classification.
By decoupling these parts, TRIDENT enhances the network's ability to generalize effectively to unseen data.~#cite(<singh2022transductivedecoupledvariationalinference>)
To further improve the discriminative performance of the model, it incorporates a transductive feature extraction module named AttFEX (Attention-based Feature Extraction).
This feature extractor dynamically aligns features from both the support and the query set, promoting task-specific embeddings.~#cite(<singh2022transductivedecoupledvariationalinference>)
This model is specifically designed for few-shot classification tasks but might also work well for anomaly detection.
Its ability to isolate critical features while dropping irrelevant context aligns with the requirements of anomaly detection.
=== SOT (Self-Optimal-Transport Feature Transform) <SOT>
// https://arxiv.org/pdf/2204.03065v1
// https://arxiv.org/abs/2204.03065v1
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
This transform processes features as a set instead of using them individually.
This creates context-aware representations.
SOT can capture direct as well as indirect similarities between features, which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
SOT uses a transport plan matrix derived from optimal transport theory to redefine feature relations.
This includes calculating pairwise similarities (e.g. cosine similarities) between features and solving a min-cost max-flow problem to find an optimal match between features.
This results in a doubly stochastic matrix where each row represents the re-embedding of the corresponding feature in context with the others.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
The transform has no learnable parameters, which makes it easy to integrate into existing machine learning pipelines.
It is differentiable, which allows for end-to-end training, for example to (re-)train the hosting network so that it adapts to SOT.
SOT is permutation-equivariant, which means that reordering the input features only reorders the output features in the same way.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
The improvements of SOT over traditional feature transforms depend on the backbone network used and on the task.
In most reported cases it outperforms state-of-the-art methods, and it could be used as a drop-in replacement for existing feature transforms.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
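
How a similarity matrix can be turned into a doubly stochastic matrix is sketched below in plain Python. The sketch uses Sinkhorn iterations, i.e. alternating row and column normalization, as an assumed stand-in for the optimal transport solver, and it leaves out details such as the special treatment of self-similarities on the diagonal. The similarity values, the `temperature` parameter and the function name are made up for illustration.

```python
import math

def sinkhorn(similarity, temperature=1.0, iterations=50):
    """Scale exp(similarity / temperature) until rows and columns sum to one."""
    n = len(similarity)
    plan = [[math.exp(similarity[i][j] / temperature) for j in range(n)]
            for i in range(n)]
    for _ in range(iterations):
        plan = [[v / sum(row) for v in row] for row in plan]  # normalise rows
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(n)]
        plan = [[plan[i][j] / col_sums[j] for j in range(n)]
                for i in range(n)]  # normalise columns
    return plan

# Made-up cosine similarities between three features.
similarity = [[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
plan = sinkhorn(similarity)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

Each row of the resulting matrix can be read as the new representation of one feature relative to all others, with similar features receiving more mass than dissimilar ones.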
// anomaly detect
=== GLASS (Global and Local Anomaly co-Synthesis Strategy)
// https://arxiv.org/pdf/2407.09359v1
// https://arxiv.org/abs/2407.09359v1
GLASS (Global and Local Anomaly co-Synthesis Strategy) is an anomaly detection method for industrial applications.
It is a unified network which combines two different strategies for synthesizing anomalies.
The first one is Global Anomaly Synthesis (GAS), which operates on the feature level.
It uses Gaussian noise, guided by gradient ascent and constrained by truncated projection, to generate anomalies close to the distribution of the normal features.
This helps the detection of weak defects.
The second strategy is Local Anomaly Synthesis (LAS), which operates on the image level.
This strategy overlays textures onto normal images using masks derived from noise patterns.
LAS creates strong anomalies which are further away from the normal sample distribution.
This adds diversity to the synthesized anomalies.~#cite(<chen2024unifiedanomalysynthesisstrategy>)
GLASS combines GAS and LAS to improve anomaly detection and localization by synthesizing anomalies near and far from the normal distribution.
Experiments show that GLASS is very effective and in some cases outperforms state-of-the-art methods such as PatchCore on the MVTec AD dataset.~#cite(<chen2024unifiedanomalysynthesisstrategy>)
//=== HETMM (Hard-normal Example-aware Template Mutual Matching)
// https://arxiv.org/pdf/2303.16191v5
// https://arxiv.org/abs/2303.16191v5