#import "@preview/subpar:0.1.1"
#import "utils.typ": todo
#import "@preview/equate:0.2.1": equate

= Material and Methods <sectionmaterialandmethods>
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
== Material
|
|
|
|
|
|
|
|
=== MVTec AD
|
|
|
|
MVTec AD is a dataset for benchmarking anomaly detection methods with a focus on industrial inspection.
|
2024-11-11 14:30:21 +01:00
|
|
|
It contains 5354 high-resolution images divided into fifteen different object and texture categories.
|
2024-10-28 12:43:59 +01:00
|
|
|
Each category comprises a set of defect-free training images and a test set of images with various kinds of defects as well as images without defects.
|
|
|
|
|
2024-10-28 16:02:53 +01:00
|
|
|
#figure(
  image("rsc/mvtec/dataset_overview_large.png", width: 80%),
  caption: [Example images from the MVTec AD dataset. #cite(<datasetsampleimg>)],
) <datasetoverview>
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-11-11 14:30:21 +01:00
|
|
|
In this bachelor thesis only two of these categories are used: "Bottle" and "Cable".
|
|
|
|
|
2025-01-15 07:03:10 +01:00
|
|
|
The bottle category contains three defect classes: _broken_large_, _broken_small_ and _contamination_.
|
2024-11-11 14:30:21 +01:00
|
|
|
#subpar.grid(
  figure(image("rsc/mvtec/bottle/broken_large_example.png"), caption: [
    Broken large defect
  ]), <a>,
  figure(image("rsc/mvtec/bottle/broken_small_example.png"), caption: [
    Broken small defect
  ]), <b>,
  figure(image("rsc/mvtec/bottle/contamination_example.png"), caption: [
    Contamination defect
  ]), <c>,
  columns: (1fr, 1fr, 1fr),
  caption: [Defect classes of the bottle category],
  label: <full>,
)
|
|
|
|
|
2025-01-15 07:03:10 +01:00
|
|
|
The cable category, in contrast, has eight defect classes: _bent_wire_, _cable_swap_, _combined_, _cut_inner_insulation_,
|
|
|
|
_cut_outer_insulation_, _missing_cable_, _missing_wire_, _poke_insulation_.
|
2024-11-29 16:18:04 +01:00
|
|
|
The larger number of defect classes already indicates that a classification task might be more difficult for the cable category.
|
2024-11-11 14:30:21 +01:00
|
|
|
|
|
|
|
#subpar.grid(
  figure(image("rsc/mvtec/cable/bent_wire_example.png"), caption: [
    Bent wire defect
  ]), <a>,
  figure(image("rsc/mvtec/cable/cable_swap_example.png"), caption: [
    Cable swap defect
  ]), <b>,
  figure(image("rsc/mvtec/cable/combined_example.png"), caption: [
    Combined defect
  ]), <c>,
  figure(image("rsc/mvtec/cable/cut_inner_insulation_example.png"), caption: [
    Cut inner insulation defect
  ]), <d>,
  figure(image("rsc/mvtec/cable/cut_outer_insulation_example.png"), caption: [
    Cut outer insulation defect
  ]), <e>,
  figure(image("rsc/mvtec/cable/missing_cable_example.png"), caption: [
    Missing cable defect
  ]), <e>,
  figure(image("rsc/mvtec/cable/poke_insulation_example.png"), caption: [
    Poke insulation defect
  ]), <f>,
  figure(image("rsc/mvtec/cable/missing_wire_example.png"), caption: [
    Missing wire defect
  ]), <g>,
  columns: (1fr, 1fr, 1fr, 1fr),
  caption: [Defect classes of the cable category],
  label: <full>,
)
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
== Methods
|
|
|
|
|
|
|
|
=== Few-Shot Learning
|
|
|
|
Few-shot learning is a subfield of machine learning that aims to train a classification model with only a few samples or, in the extreme case, none at all.
|
2025-01-15 07:03:10 +01:00
|
|
|
In contrast to traditional supervised learning, where a huge amount of labeled data is required to generalize well to unseen data,
|
|
|
|
few-shot learning typically provides only 1 to 10 samples per class (so-called shots).
|
|
|
|
The model is therefore prone to overfitting to the few training samples, so these samples should represent the whole sample distribution as well as possible.~#cite(<parnami2022learningexamplessummaryapproaches>)
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
Typically, a few-shot learning task consists of a support set and a query set.
|
|
|
|
The support set contains the training data, and the query set contains the data used for evaluation.
|
|
|
|
A common way to formulate a few-shot learning problem is the n-way k-shot notation.
|
2025-01-15 07:03:10 +01:00
|
|
|
For example, a task with 3 target classes and 5 training samples per class is a 3-way 5-shot few-shot classification problem.~@snell2017prototypicalnetworksfewshotlearning @patchcorepaper
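
The n-way k-shot notation can be illustrated by sampling a single episode from a labeled dataset. The following is a minimal sketch, not the sampling code of this thesis: the function name `sample_episode`, the parameter `n_query` and the toy label list are assumptions made for illustration.

```python
import random


def sample_episode(labels, n_way, k_shot, n_query, seed=0):
    """Return (support, query) lists of (index, label) pairs for one episode."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Only classes with enough samples can take part in the episode.
    eligible = sorted(c for c, idx in by_class.items() if len(idx) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]   # k shots per class
        query += [(i, c) for i in picked[k_shot:]]     # held-out evaluation samples
    return support, query


# Hypothetical label list: 20 images for each of four classes.
labels = ["good"] * 20 + ["broken_large"] * 20 + ["broken_small"] * 20 + ["contamination"] * 20
support, query = sample_episode(labels, n_way=3, k_shot=5, n_query=2)
print(len(support), len(query))  # 15 6
```

A 3-way 5-shot episode therefore always contains 15 support samples, independent of the size of the underlying dataset.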
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
A classical example of how such a model might work is a prototypical network.
|
2025-01-15 07:03:10 +01:00
|
|
|
These models learn a representation of each class in a reduced dimensionality and classify new examples based on proximity to these representations in an embedding space.~@snell2017prototypicalnetworksfewshotlearning
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-10-28 16:25:02 +01:00
|
|
|
#figure(
  image("rsc/prototype_fewshot_v3.png", width: 60%),
  caption: [Prototypical network for a 3-way 5-shot task. #cite(<snell2017prototypicalnetworksfewshotlearning>)],
) <prototypefewshot>
|
|
|
|
|
2025-01-15 07:03:10 +01:00
|
|
|
The first and simplest method of this bachelor thesis uses a ResNet50 to calculate these embeddings and groups the shots of each class by computing the class center.
|
|
|
|
This is basically a simple prototypical network.
|
|
|
|
See @resnet50impl.~@chowdhury2021fewshotimageclassificationjust
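
The class-center idea can be sketched in a few lines. The example below assumes that the embeddings have already been computed by a backbone and uses hand-picked two-dimensional vectors instead; the helper names `class_centers` and `predict` are hypothetical, and the Euclidean distance is chosen only for illustration.

```python
import math


def class_centers(embeddings, labels):
    """Average the support embeddings of each class into one center."""
    sums, counts = {}, {}
    for vector, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centers, query):
    """Assign the query embedding to the class with the closest center."""
    return min(centers, key=lambda label: math.dist(centers[label], query))


# Toy embeddings standing in for backbone outputs.
support = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = ["good", "good", "defect", "defect"]
centers = class_centers(support, support_labels)
print(predict(centers, [0.9, 0.9]))  # defect
```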
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-10-28 16:02:53 +01:00
|
|
|
=== Generalisation from few samples
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-11-01 23:22:03 +01:00
|
|
|
An especially hard task is to generalize from such few samples.
|
|
|
|
In typical supervised learning the model sees thousands or millions of samples of the corresponding domain during learning.
|
|
|
|
This helps the model to learn the underlying patterns and to generalize well to unseen data.
|
2025-01-20 11:18:32 +01:00
|
|
|
In few-shot learning the model has to generalize from just a few samples.#todo[Write more about. eg. class distributions]
|
|
|
|
@Goodfellow-et-al-2016
|
2024-11-01 23:22:03 +01:00
|
|
|
|
2025-01-03 15:25:32 +01:00
|
|
|
=== Softmax
|
|
|
|
#todo[Maybe remove this section]
|
|
|
|
The softmax function (@softmax) #cite(<liang2017soft>) converts the entries of a real-valued vector into a probability distribution.
|
|
|
|
It is a generalization of the sigmoid function and is often used as an activation layer in neural networks.
|
|
|
|
|
|
|
|
$
sigma(bold(z))_j = (e^(z_j)) / (sum_(k=1)^K e^(z_k)) quad "for" j in {1, ..., K}
$ <softmax>
|
|
|
|
|
|
|
|
The softmax function is closely related to the Boltzmann distribution and was first introduced in the 19th century #cite(<Boltzmann>).
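
@softmax can be evaluated directly. The sketch below subtracts the largest entry before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large inputs; the input values are arbitrary examples.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of real numbers."""
    shift = max(z)  # subtracting a constant does not change the result
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [value / total for value in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest input receives the largest probability
print(sum(probs))  # entries sum to 1 (up to rounding)
```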
|
|
|
|
|
|
|
|
|
|
|
|
=== Cross Entropy Loss
|
|
|
|
#todo[Maybe remove this section]
|
|
|
|
Cross Entropy Loss is a well established loss function in machine learning.
|
|
|
|
@crelformal #cite(<crossentropy>) shows the formal general definition of the Cross Entropy Loss.
|
|
|
|
@crelbinary is the special case of the general cross-entropy loss for binary classification tasks.
|
|
|
|
|
|
|
|
$
H(p,q) &= -sum_(x in cal(X)) p(x) log q(x) #<crelformal>\
H(p,q) &= -(p log(q) + (1-p) log(1-q)) #<crelbinary>\
cal(L)(p,q) &= -1/cal(B) sum_(i=1)^(cal(B)) (p_i log(q_i) + (1-p_i) log(1-q_i)) #<crelbatched>
$ <crel>
|
|
|
|
|
|
|
|
@crelbatched #cite(<handsonaiI>) defines the binary cross-entropy loss $cal(L)(p,q)$ for a batch of size $cal(B)$, which is used for model training in this practical work.
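
As a worked example of @crelbatched, the following sketch computes the binary cross-entropy loss for a small batch. The clipping constant `eps` is an assumption added to avoid evaluating the logarithm at zero; it is not part of the formal definition.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy over a batch of (target, prediction) pairs."""
    total = 0.0
    for p, q in zip(targets, predictions):
        q = min(max(q, eps), 1.0 - eps)  # clip to keep log() finite
        total += p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return -total / len(targets)


# Batch of three samples: true labels and predicted probabilities.
loss = binary_cross_entropy([1.0, 0.0, 1.0], [0.9, 0.2, 0.6])
print(loss)
```

Confident correct predictions contribute little to the loss, while the uncertain third prediction (0.6 for a positive sample) contributes the most.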
|
|
|
|
|
|
|
|
=== Cosine Similarity
|
|
|
|
To measure the distance between two vectors some common distance measures are used.
|
|
|
|
A popular one is the cosine similarity (@cosinesimilarity).
|
|
|
|
It measures the cosine of the angle between two vectors.
|
|
|
|
The Cosine Similarity is especially useful when the magnitude of the vectors is not important.
|
2025-01-20 11:18:32 +01:00
|
|
|
@dataminingbook@analysisrudin
|
2025-01-03 15:25:32 +01:00
|
|
|
|
|
|
|
$
cos(theta) &:= (A dot B) / (||A|| dot ||B||)\
&= (sum_(i=1)^n A_i B_i)/ (sqrt(sum_(i=1)^n A_i^2) dot sqrt(sum_(i=1)^n B_i^2))
$ <cosinesimilarity>
|
|
|
|
|
|
|
|
=== Euclidean Distance
|
|
|
|
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
|
|
|
|
It just calculates the square root of the sum of the squared differences of the coordinates.
|
|
|
|
The Euclidean distance can also be represented as the L2 norm (Euclidean norm) of the difference of the two vectors.
|
2025-01-20 11:18:32 +01:00
|
|
|
@analysisrudin
|
2025-01-03 15:25:32 +01:00
|
|
|
|
|
|
|
$
cal(d)(A,B) = ||A-B|| := sqrt(sum_(i=1)^n (A_i - B_i)^2)
$ <euclideannorm>
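
The difference between @cosinesimilarity and @euclideannorm becomes visible for two vectors that point in the same direction but differ in magnitude. The sketch below uses arbitrary example vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """L2 norm of the difference of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude
print(cosine_similarity(a, b))   # close to 1: the angle between a and b is zero
print(euclidean_distance(a, b))  # sqrt(14): the points are clearly apart
```

The cosine similarity ignores the scaling, whereas the Euclidean distance does not.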
|
2025-01-20 11:18:32 +01:00
|
|
|
|
2025-01-03 15:25:32 +01:00
|
|
|
|
2024-10-28 16:02:53 +01:00
|
|
|
=== Patchcore
|
2024-12-19 15:24:36 +01:00
|
|
|
// https://arxiv.org/pdf/2106.08265
|
2024-12-09 16:20:48 +01:00
|
|
|
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
|
|
|
|
It operates on the principle that an image is anomalous if any of its patches is anomalous.
|
|
|
|
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
|
2024-12-20 11:52:51 +01:00
|
|
|
#todo[Rephrase and simplify this paragraph]
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-12-09 16:20:48 +01:00
|
|
|
The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
|
|
|
|
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pre-trained on ImageNet.
|
|
|
|
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
|
|
|
|
|
|
|
|
A crucial component of PatchCore is its memory bank, which stores patch-level features derived from the training dataset.
|
|
|
|
This memory bank represents the nominal distribution of features against which test patches are compared.
|
|
|
|
To ensure computational efficiency and scalability, PatchCore employs a coreset reduction technique to condense the memory bank by selecting the most representative patch features.
|
|
|
|
This optimization reduces both storage requirements and inference times while maintaining the integrity of the feature space. #cite(<patchcorepaper>)
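
The idea behind coreset reduction can be sketched with a greedy farthest-point selection. This is a simplified illustration on toy two-dimensional features, not the implementation from the PatchCore paper; the function name `greedy_coreset` and the choice of the first feature as starting point are assumptions.

```python
import math


def greedy_coreset(features, size):
    """Greedily pick the feature that is farthest from everything selected so far."""
    selected = [0]  # assumption: start from the first feature
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(size, len(features)):
        farthest = max(range(len(features)), key=min_dist.__getitem__)
        if min_dist[farthest] == 0.0:
            break  # every remaining feature duplicates a selected one
        selected.append(farthest)
        # Update each feature's distance to its nearest selected feature.
        min_dist = [min(d, math.dist(f, features[farthest])) for d, f in zip(min_dist, features)]
    return selected


memory_bank = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 5.0]]
print(greedy_coreset(memory_bank, 3))  # [0, 3, 4]
```

The three selected features cover all three clusters of the toy memory bank, while near-duplicates such as the second feature are dropped.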
|
2024-12-20 11:52:51 +01:00
|
|
|
#todo[reference to image below]
|
2024-12-09 16:20:48 +01:00
|
|
|
|
|
|
|
During inference, PatchCore computes anomaly scores by measuring the distance between patch features from test images and their nearest neighbors in the memory bank.
|
|
|
|
If any patch exhibits a significant deviation, the corresponding image is flagged as anomalous.
|
2024-12-19 15:24:36 +01:00
|
|
|
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
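
The scoring step can be sketched as a nearest-neighbor search. The example below is a simplification under stated assumptions: it uses toy two-dimensional patch features, a plain Euclidean distance, and takes the maximum patch score as the image score without any further re-weighting. The function names are hypothetical.

```python
import math


def patch_scores(test_patches, memory_bank):
    """Distance of each test patch to its nearest neighbor in the memory bank."""
    return [min(math.dist(p, m) for m in memory_bank) for p in test_patches]


def image_score(test_patches, memory_bank):
    """An image is as anomalous as its most anomalous patch."""
    return max(patch_scores(test_patches, memory_bank))


memory_bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # nominal patch features
normal_image = [[0.1, 0.0], [0.9, 0.1]]
anomalous_image = normal_image + [[3.0, 3.0]]       # one patch far from all nominal features

print(image_score(normal_image, memory_bank))
print(image_score(anomalous_image, memory_bank))
```

A single deviating patch is enough to raise the image score, which mirrors the principle that an image is anomalous if any of its patches is anomalous.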
|
2024-12-09 16:20:48 +01:00
|
|
|
|
|
|
|
|
|
|
|
PatchCore reaches an AUROC of 99.6% on the MVTec AD dataset when detecting anomalies.
|
|
|
|
A major advantage of this method is the coreset subsampling, which reduces the memory bank size significantly.
|
2024-12-19 15:24:36 +01:00
|
|
|
This lowers computational costs while maintaining detection accuracy.~#cite(<patchcorepaper>)
|
|
|
|
|
2024-12-09 16:20:48 +01:00
|
|
|
#figure(
  image("rsc/patchcore_overview.png", width: 80%),
  caption: [Architecture of PatchCore. #cite(<patchcorepaper>)],
) <patchcoreoverview>
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2024-12-09 16:20:48 +01:00
|
|
|
=== EfficientAD
|
2024-10-28 12:43:59 +01:00
|
|
|
// https://arxiv.org/pdf/2303.14535
|
2024-12-19 15:24:36 +01:00
|
|
|
EfficientAD is another state-of-the-art method for anomaly detection.
|
|
|
|
It focuses on maintaining detection performance while achieving high computational efficiency.
|
|
|
|
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
|
|
|
|
In comparison to PatchCore, which relies on a deeper, more computationally heavy WideResNet-101 network, the PDN uses only four convolutional layers and two pooling layers.
|
|
|
|
This results in reduced latency while retaining the ability to generate patch-level features.~#cite(<efficientADpaper>)
|
2024-12-20 11:52:51 +01:00
|
|
|
#todo[reference to image below]
|
2024-12-19 15:24:36 +01:00
|
|
|
|
|
|
|
The detection of anomalies is achieved through a student-teacher framework.
|
|
|
|
The teacher network is a pre-trained PDN, and the student network is trained on normal (good) images to predict the teacher's output.
|
|
|
|
An anomaly is identified when the student fails to replicate the teacher's output.
|
|
|
|
This works because the training data contains no anomalies, so the student network has never seen an anomaly during training.
|
|
|
|
A special loss function prevents the student network from generalizing too broadly and thereby learning to predict anomalous features as well.~#cite(<efficientADpaper>)
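
The student-teacher comparison can be sketched as a per-position difference of two feature maps. The example assumes that both networks output a feature map of the same shape (height, width, channels) and uses the mean squared difference as anomaly score; the function name `anomaly_map` and the toy values are assumptions for illustration.

```python
def anomaly_map(teacher_features, student_features):
    """Per-position mean squared difference between teacher and student outputs."""
    result = []
    for t_row, s_row in zip(teacher_features, student_features):
        result.append([
            sum((t - s) ** 2 for t, s in zip(t_vec, s_vec)) / len(t_vec)
            for t_vec, s_vec in zip(t_row, s_row)
        ])
    return result


# 2x2 feature maps with two channels each; the student fails at the bottom-right position.
teacher = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [2.0, 2.0]]]
student = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.0, 0.0]]]
scores = anomaly_map(teacher, student)
print(scores)  # [[0.0, 0.0], [0.0, 4.0]]
```

Positions where the student reproduces the teacher receive a score of zero, while the position with the mismatch stands out in the resulting map.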
|
|
|
|
|
|
|
|
In addition to this structural anomaly detection, EfficientAD can also address logical anomalies, such as violations of spatial or contextual constraints (e.g. wrongly arranged objects).
|
|
|
|
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
|
|
|
|
|
|
|
|
By comparing the outputs of the autoencoder and the student, logical anomalies are effectively detected.
|
|
|
|
This is a challenge that Patchcore does not directly address.~#cite(<efficientADpaper>)
|
2024-12-20 11:52:51 +01:00
|
|
|
#todo[maybe add key advantages such as low computational cost and high performance]
|
2024-12-19 15:24:36 +01:00
|
|
|
|
|
|
|
|
|
|
|
#figure(
  image("rsc/efficientad_overview.png", width: 80%),
  caption: [Architecture of EfficientAD. #cite(<efficientADpaper>)],
) <efficientadoverview>
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
=== Jupyter Notebook
|
|
|
|
|
|
|
|
A Jupyter notebook is a shareable document which combines code and its output, text and visualizations.
|
|
|
|
The notebook, along with the editor, provides an environment for fast prototyping and data analysis.
|
2025-01-10 13:14:09 +01:00
|
|
|
It is widely used in the data science, mathematics and machine learning community.~#cite(<jupyter>)
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2025-01-10 13:14:09 +01:00
|
|
|
In the context of this bachelor thesis it was used to test and evaluate the three few-shot learning methods and to compare them.
|
2025-01-13 15:09:43 +01:00
|
|
|
Furthermore, Matplotlib was used to create the comparison plots.
|
2024-10-28 12:43:59 +01:00
|
|
|
=== CNN
|
|
|
|
Convolutional neural networks are especially good model architectures for processing images, speech and audio signals.
|
|
|
|
A CNN typically consists of convolutional layers, pooling layers and fully connected layers.
|
|
|
|
Convolutional layers are a set of learnable kernels (filters).
|
|
|
|
Each filter performs a convolution operation by sliding a window over every pixel of the image.
|
|
|
|
At each position, the dot product between the filter and the underlying image patch yields one entry of the feature map.
|
|
|
|
Convolutional layers capture features like edges, textures or shapes.
|
|
|
|
Pooling layers downsample the feature maps created by the convolutional layers.
|
|
|
|
This reduces the computational complexity of the overall network and helps to counteract overfitting.
|
|
|
|
Common pooling layers include average- and max pooling.
|
|
|
|
Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
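
The convolution and pooling operations described above can be written out for a small example. The sketch below uses a hand-crafted vertical edge filter instead of a learned kernel, no padding and a stride of one; these choices are assumptions made to keep the example short.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a dark left part and a bright right part.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = conv2d(image, vertical_edge)
pooled = max_pool2d(feature_map)
print(feature_map)  # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)       # [[2, 0], [2, 0]]
```

The feature map is non-zero only where the edge is located, and the pooled map keeps this information at half the resolution.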
|
2024-12-19 15:24:36 +01:00
|
|
|
@cnnarchitecture shows a typical binary classification task.~#cite(<cnnintro>)
|
2024-10-28 12:43:59 +01:00
|
|
|
|
|
|
|
#figure(
  image("rsc/cnn_architecture.png", width: 80%),
  caption: [Architecture of a convolutional neural network. #cite(<cnnarchitectureimg>)],
) <cnnarchitecture>
|
|
|
|
|
|
|
|
=== ResNet
|
|
|
|
|
|
|
|
Residual neural networks are a special type of neural network architecture.
|
|
|
|
They are especially good for deep learning and have been used in many state-of-the-art computer vision tasks.
|
|
|
|
The main idea behind ResNet is the skip connection.
|
|
|
|
The skip connection is a direct connection from one layer to another layer which is not the next layer.
|
|
|
|
This helps to avoid the vanishing gradient problem and helps with the training of very deep networks.
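
A skip connection adds the input of a block to its output. The sketch below shows this for a block with two small linear layers; the layer sizes, the weights and the function names are assumptions for illustration and do not correspond to an actual ResNet layer.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Matrix-vector product without bias."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(x + F(x)) with F(x) = W2 * relu(W1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([a + b for a, b in zip(x, fx)])  # skip connection: add the input


x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zero, zero))  # [1.0, 2.0]
```

Even with all weights set to zero the input passes through the block unchanged, which illustrates why information and gradients can bypass the intermediate layers.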
|
|
|
|
ResNet has proven to be very successful in many computer vision tasks and is used in this practical work for the classification task.
|
|
|
|
There are several different ResNet architectures, the most common are ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. #cite(<resnet>)
|
|
|
|
|
2024-11-01 23:22:03 +01:00
|
|
|
For this bachelor thesis the ResNet-50 architecture was used to compute the corresponding embeddings for the few-shot learning methods.
|
|
|
|
|
2024-12-21 18:42:59 +01:00
|
|
|
=== P$>$M$>$F
|
2024-12-31 12:23:53 +01:00
|
|
|
// https://arxiv.org/pdf/2204.07305
|
2025-01-03 15:25:32 +01:00
|
|
|
P>M>F (Pre-training > Meta-training > Fine-tuning) is a three-stage pipeline designed for few-shot learning.
|
|
|
|
It focuses on simplicity but still achieves competitive performance.
|
|
|
|
The three stages convert a general feature extractor into a task-specific model through fine-tuned optimization.
|
|
|
|
#cite(<pmfpaper>)
|
|
|
|
|
|
|
|
*Pre-training:*
|
|
|
|
The first stage in @pmfarchitecture initializes the backbone feature extractor.
|
|
|
|
This can be, for instance, a ResNet or a ViT, and it is learned with self-supervised techniques.
|
|
|
|
This backbone is trained on large-scale datasets from a general domain, such as ImageNet.
|
|
|
|
This step optimizes for robust feature extractions and builds a foundation model.
|
|
|
|
There are well-established methods for pre-training, such as DINO (self-supervised consistency), CLIP (image-text alignment) or BERT (for text data).
|
|
|
|
#cite(<pmfpaper>)
|
|
|
|
|
|
|
|
*Meta-training:*
|
|
|
|
The second stage of the pipeline in @pmfarchitecture is the meta-training.
|
|
|
|
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
|
|
|
|
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
|
|
|
|
See @prototypefewshot for a visualisation of its architecture.
|
|
|
|
The ProtoNet only requires a backbone $f$ to map images to an m-dimensional vector space: $f: cal(X) -> RR^m$.
|
|
|
|
The probability of a query image $x$ belonging to a class $k$ is given by a softmax over the negative distances between the embedding of the sample and the class centers:
|
|
|
|
|
|
|
|
$
p(y=k|x) = (exp(-d(f(x), c_k))) / (sum_(k') exp(-d(f(x), c_(k')))) #cite(<pmfpaper>)
$
|
|
|
|
|
|
|
|
As the distance metric $d$, a cosine distance is used, which is derived from the cosine similarity. See @cosinesimilarity for the formula.
|
|
|
|
The prototype $c_k$ of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of samples of class $k$.
|
|
|
|
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.~#cite(<pmfpaper>)
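
The ProtoNet classification rule can be sketched as follows. The example assumes that the cosine distance is defined as one minus the cosine similarity and omits any temperature scaling; the embeddings are toy values standing in for the output of the backbone $f$, and all function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Assumed definition: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def prototypes(support):
    """support maps each class name to the list of its embedded shots."""
    return {
        label: [sum(values) / len(shots) for values in zip(*shots)]
        for label, shots in support.items()
    }


def class_probabilities(query, protos):
    """Softmax over the negative distances to all class prototypes."""
    logits = {label: -cosine_distance(query, c) for label, c in protos.items()}
    shift = max(logits.values())
    weights = {label: math.exp(v - shift) for label, v in logits.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


support = {"good": [[1.0, 0.0], [0.9, 0.1]], "contamination": [[0.0, 1.0], [0.1, 0.9]]}
probs = class_probabilities([1.0, 0.2], prototypes(support))
print(max(probs, key=probs.get))  # good
```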
|
|
|
|
|
|
|
|
*Fine-tuning:*
|
|
|
|
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the data distribution.
|
|
|
|
To overcome this, the model is optionally fine-tuned on the support set with a few gradient steps.
|
|
|
|
Data augmentation is used to generate a pseudo query set.
|
|
|
|
With the support set the class prototypes are calculated and compared against the model's predictions for the pseudo query set.
|
|
|
|
With the resulting loss, the whole model is fine-tuned to the new domain.~#cite(<pmfpaper>)
|
2024-10-28 12:43:59 +01:00
|
|
|
|
2025-01-03 15:25:32 +01:00
|
|
|
#figure(
  image("rsc/pmfarchitecture.png", width: 100%),
  caption: [Architecture of P>M>F. #cite(<pmfpaper>)],
) <pmfarchitecture>
|
|
|
|
|
|
|
|
*Inference:*
|
|
|
|
During inference the support set is used to calculate the class prototypes.
|
|
|
|
For a query image, the feature extractor computes its embedding in a lower-dimensional space and compares it to the pre-computed prototypes.
|
|
|
|
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
|
|
|
|
|
|
|
|
*Performance:*
|
|
|
|
P>M>F performs well across several few-shot learning benchmarks.
|
|
|
|
The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
|
|
|
|
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
|
|
|
|
|
|
|
|
*Limitations and Scalability:*
|
|
|
|
This method has some limitations.
|
|
|
|
It relies on domains with large external datasets, which require substantial computational resources to create pre-trained models.
|
|
|
|
Fine-tuning is effective but might be slow and might not work well on devices with limited computational resources.
|
|
|
|
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
|
|
|
|
#cite(<pmfpaper>)
|
2024-12-31 12:23:53 +01:00
|
|
|
|
|
|
|
=== CAML <CAML>
|
2024-12-19 16:53:50 +01:00
|
|
|
// https://arxiv.org/pdf/2310.10971v2
|
|
|
|
CAML (Context-Aware Meta-Learning) is one of the state-of-the-art methods for few-shot learning.
|
2024-12-21 18:42:59 +01:00
|
|
|
It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
|
2024-12-30 11:24:29 +01:00
|
|
|
This is a universal meta-learning approach.
|
|
|
|
That means no fine-tuning or meta-training is applied for specific domains.~#cite(<caml_paper>)
|
2024-12-21 18:42:59 +01:00
|
|
|
|
2024-12-30 11:24:29 +01:00
|
|
|
*Architecture:*
|
|
|
|
CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
|
2024-12-21 18:42:59 +01:00
|
|
|
This step brings the images into a low dimensional space where similar images are encoded into similar embeddings.
|
|
|
|
The class labels are encoded with the ELMES class encoder.
|
2024-12-30 10:32:03 +01:00
|
|
|
Since the class of the query image is unknown in this stage a special learnable "unknown token" is added to the encoder.
|
2024-12-21 18:42:59 +01:00
|
|
|
This embedding is learned during pre-training.
|
|
|
|
Afterwards each image embedding is concatenated with the corresponding class embedding.
|
2024-12-30 11:24:29 +01:00
|
|
|
~#cite(<caml_paper>)
|
|
|
|
#todo[Add more references to the architecture image below]
|
2024-12-21 18:42:59 +01:00
|
|
|
|
2024-12-30 11:24:29 +01:00
|
|
|
*ELMES Encoder:*
|
|
|
|
The ELMES (Equal Length and Maximally Equiangular Set) encoder encodes the class labels to vectors of equal length.
|
2024-12-21 18:42:59 +01:00
|
|
|
The encoder is a bijective mapping between the labels and set of vectors that are equal length and maximally equiangular.
|
|
|
|
#todo[Describe what equiangular and bijective means]
|
|
|
|
This is similar to one-hot encoding but has some advantages.
|
2024-12-30 10:32:03 +01:00
|
|
|
This encoder maximizes the algorithm's ability to distinguish between different classes.
|
2024-12-30 11:24:29 +01:00
|
|
|
~#cite(<caml_paper>)
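
To make the two properties concrete: equal length means that all class vectors have the same norm, and equiangular means that every pair of distinct class vectors encloses the same angle. One standard construction with both properties is the regular simplex, sketched below. Whether this matches the exact vectors used by CAML is an assumption; the function name `simplex_vectors` is hypothetical.

```python
import math


def simplex_vectors(num_classes):
    """Unit vectors whose pairwise dot product is -1 / (num_classes - 1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Centered one-hot vectors, rescaled to unit length.
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)] for i in range(k)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


vectors = simplex_vectors(4)
print([round(dot(v, v), 6) for v in vectors])  # all norms are equal
print(round(dot(vectors[0], vectors[1]), 6))   # same value for every distinct pair
```

In contrast to one-hot vectors, whose pairwise dot product is zero, these vectors have a negative pairwise dot product and are therefore spread further apart.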
|
2024-12-21 18:42:59 +01:00
|
|
|
|
|
|
|
*Non-causal sequence model:*
|
2024-12-30 10:32:03 +01:00
|
|
|
The sequence of concatenated image and class embeddings is then fed into a non-causal sequence model.
|
|
|
|
This might be, for instance, a transformer encoder.
|
|
|
|
This step conditions the input sequence consisting of the query and support set embeddings.
|
|
|
|
Visual features from the query and support set can be compared to each other to determine specific information such as content or textures.
|
|
|
|
This can then be used to predict the class of the query image.
|
|
|
|
From the output of the sequence model the element at the same position as the query is selected.
|
|
|
|
Afterwards it is passed through a simple MLP network to predict the class of the query image.
|
2024-12-30 11:24:29 +01:00
|
|
|
~#cite(<caml_paper>)
|
2024-12-21 18:42:59 +01:00
|
|
|
|
|
|
|
*Large-Scale Pre-Training:*
|
2024-12-30 11:24:29 +01:00
|
|
|
CAML is pre-trained on a huge number of images from ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
|
|
|
|
Those datasets span over different domains and help to detect any new visual concept during inference.
|
2024-12-31 12:23:53 +01:00
|
|
|
Only the non-causal sequence model is trained and the weights of the image encoder and ELMES encoder are kept frozen.
|
2024-12-30 11:24:29 +01:00
|
|
|
~#cite(<caml_paper>)
|
2024-12-21 18:42:59 +01:00
|
|
|
|
2024-12-30 11:24:29 +01:00
|
|
|
*Inference:*
|
|
|
|
During inference, CAML performs the following steps:
|
|
|
|
- Encodes the support set images and labels with the pre-trained feature and class encoders.
|
|
|
|
- Concatenates these encodings into a sequence alongside the query image embedding.
|
|
|
|
- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
|
|
|
|
- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite(<caml_paper>)
|
|
|
|
|
2024-12-31 12:23:53 +01:00
|
|
|
*Performance:*
|
2024-12-30 11:24:29 +01:00
|
|
|
CAML achieves state-of-the-art performance in universal meta-learning across 11 few-shot classification benchmarks,
|
|
|
|
including generic object recognition (e.g., MiniImageNet), fine-grained classification (e.g., CUB, Aircraft),
|
|
|
|
and cross-domain tasks (e.g., Pascal+Paintings).
|
|
|
|
It outperformed or matched existing models in 14 of 22 evaluation settings.
|
|
|
|
It performs competitively against P>M>F on 8 benchmarks even though P>M>F was meta-trained on the same domain.
|
|
|
|
~#cite(<caml_paper>)
|
|
|
|
|
|
|
|
CAML excels in generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
|
|
|
|
and low-resolution tasks (e.g., CIFAR-fs).
|
|
|
|
Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
|
|
|
|
~#cite(<caml_paper>)
|
|
|
|
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
|
2024-12-20 11:52:51 +01:00
|
|
|
|
|
|
|
#figure(
  image("rsc/caml_architecture.png", width: 100%),
  caption: [Architecture of CAML. #cite(<caml_paper>)],
) <camlarchitecture>
|
2024-12-19 16:53:50 +01:00
|
|
|
|
2024-11-04 12:26:00 +01:00
|
|
|
== Alternative Methods
|
|
|
|
|
2025-01-14 19:22:15 +01:00
|
|
|
There are several alternative methods to few-shot learning as well as to anomaly detection which are not used in this bachelor thesis.
|
2025-01-13 22:36:44 +01:00
|
|
|
Either they performed worse on benchmarks compared to the used methods or they were released after my initial literature research.
|
2025-01-08 07:45:16 +00:00
|
|
|
|
2025-01-13 22:36:44 +01:00
|
|
|
=== SgVA-CLIP (Semantic-guided Visual Adapting CLIP)
|
2025-01-08 07:45:16 +00:00
|
|
|
// https://arxiv.org/pdf/2211.16191v2
|
|
|
|
// https://arxiv.org/abs/2211.16191v2
|
|
|
|
|
2025-01-13 22:36:44 +01:00
|
|
|
SgVA-CLIP (Semantic-guided Visual Adapting CLIP) is a framework that improves few-shot learning by adapting pre-trained vision-language models like CLIP.
|
|
|
|
It focuses on generating better visual features for specific tasks while still using the general knowledge from the pre-trained model.
|
|
|
|
Instead of only aligning images and text, SgVA-CLIP includes a special visual adapting layer that makes the visual features more discriminative for the given task.
|
|
|
|
This process is supported by knowledge distillation, where detailed information from the pre-trained model guides the learning of the new visual features.
|
|
|
|
Additionally, the model uses contrastive losses to further refine both the visual and textual representations.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
|
|
|
|
|
|
|
One advantage of SgVA-CLIP is that it can work well with very few labeled samples, making it suitable for applications like anomaly detection.
|
|
|
|
The use of pre-trained knowledge helps reduce the need for large datasets.
|
|
|
|
However, a disadvantage is that it depends heavily on the quality and capabilities of the pre-trained model.
|
|
|
|
If the pre-trained model lacks relevant information for the task, SgVA-CLIP might struggle to adapt.
|
|
|
|
This might be a no-go for anomaly detection tasks because the images in such tasks are often very task-specific and not covered by general pre-trained models.
|
|
|
|
Also, fine-tuning the model can require considerable computational resources, which might be a limitation in some cases.~#cite(<peng2023sgvaclipsemanticguidedvisualadapting>)
|
|
|
|
|
2025-01-14 20:05:11 +01:00
|
|
|
=== TRIDENT (Transductive Decoupled Variational Inference for Few-Shot Classification) <TRIDENT>
|
2025-01-08 07:45:16 +00:00
|
|
|
// https://arxiv.org/pdf/2208.10559v1
|
|
|
|
// https://arxiv.org/abs/2208.10559v1
|
|
|
|
|
2025-01-14 19:22:15 +01:00
|
|
|
TRIDENT, a variational inference network, is a few-shot learning approach which decouples image representation into semantic and label-specific latent variables.
|
|
|
|
Semantic attributes contain context or stylistic information, while label-specific attributes focus on the characteristics crucial for classification.
|
|
|
|
By decoupling these parts, TRIDENT enhances the network's ability to generalize effectively to unseen data.~#cite(<singh2022transductivedecoupledvariationalinference>)
|
|
|
|
|
|
|
|
To further improve the discriminative performance of the model, it incorporates a transductive feature extraction module named AttFEX (Attention-based Feature Extraction).
|
|
|
|
This feature extractor dynamically aligns features from both the support and the query set, promoting task-specific embeddings.~#cite(<singh2022transductivedecoupledvariationalinference>)
|
|
|
|
|
|
|
|
This model is specifically designed for few-shot classification tasks but might also work well for anomaly detection.
|
|
|
|
Its ability to isolate critical features while dropping irrelevant context aligns with the requirements of anomaly detection.
|
|
|
|
|
2025-01-14 20:05:11 +01:00
|
|
|
=== SOT (Self-Optimal-Transport Feature Transform) <SOT>
|
2025-01-08 07:45:16 +00:00
|
|
|
// https://arxiv.org/pdf/2204.03065v1
|
|
|
|
// https://arxiv.org/abs/2204.03065v1
|
|
|
|
|
2025-01-14 19:22:15 +01:00
|
|
|
The Self-Optimal-Transport (SOT) Feature Transform is designed to enhance feature sets for tasks like matching, grouping or classification by re-embedding feature representations.
|
|
|
|
This transform processes features as a set instead of using them individually.
|
|
|
|
This creates context-aware representations.
|
|
|
|
SOT can catch direct as well as indirect similarities between features which makes it suitable for tasks like few-shot learning or clustering.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
|
|
|
|
|
|
|
|
SOT uses a transport plan matrix derived from optimal transport theory to redefine feature relations.
|
|
|
|
This includes calculating pairwise similarities (e.g. cosine similarities) between features and solving a min-cost max-flow problem to find an optimal match between features.
|
|
|
|
This results in a doubly stochastic matrix where each row represents the re-embedding of the corresponding feature in context with others.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
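
A doubly stochastic matrix is a non-negative matrix in which every row and every column sums to one. The sketch below produces such a matrix from pairwise similarities with Sinkhorn iterations, a common approximate solver for entropy-regularized optimal transport. It is swapped in here for illustration and is not claimed to be the exact solver of SOT; the parameters `temperature` and `iterations` and the toy similarity values are assumptions, and any special treatment of the diagonal is omitted.

```python
import math


def sinkhorn(similarity, temperature=1.0, iterations=50):
    """Alternately normalize rows and columns of a positive kernel matrix."""
    n = len(similarity)
    # Positive kernel: a larger similarity yields a larger weight.
    matrix = [[math.exp(similarity[i][j] / temperature) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        for row in matrix:  # normalize rows
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):  # normalize columns
            total = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= total
    return matrix


similarity = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
plan = sinkhorn(similarity)
print([sum(row) for row in plan])                              # row sums close to 1
print([sum(plan[i][j] for i in range(3)) for j in range(3)])   # column sums close to 1
```

Each row of the resulting matrix describes one feature relative to all others, which is the sense in which the re-embedding is context-aware.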
|
|
|
|
|
|
|
|
The transform is parameterless, which makes it easy to integrate into existing machine-learning pipelines.
|
|
|
|
It is differentiable, which allows for end-to-end training, for example to (re-)train the hosting network to adapt to SOT.
|
|
|
|
SOT is permutation-equivariant, which means that reordering the input features reorders the output features in the same way.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
|
|
|
|
|
|
|
|
The improvements of SOT over traditional feature transforms depend on the backbone network used and on the task.
|
|
|
|
But in most cases it outperforms state-of-the-art methods and could be used as a drop-in replacement for existing feature transforms.~#cite(<shalam2022selfoptimaltransportfeaturetransform>)
|
|
|
|
|
2025-01-08 07:45:16 +00:00
|
|
|
// anomaly detect
|
2025-01-14 19:22:15 +01:00
|
|
|
=== GLASS (Global and Local Anomaly co-Synthesis Strategy)
|
2025-01-08 07:45:16 +00:00
|
|
|
// https://arxiv.org/pdf/2407.09359v1
|
|
|
|
// https://arxiv.org/abs/2407.09359v1
|
|
|
|
|
2025-01-14 19:22:15 +01:00
|
|
|
GLASS (Global and Local Anomaly co-Synthesis Strategy) is an anomaly detection method for industrial applications.
|
|
|
|
It is a unified network which uses two different strategies to detect anomalies which are then combined.
|
|
|
|
The first one is Global Anomaly Synthesis (GAS), which operates on the feature level.
|
|
|
|
It uses Gaussian noise, guided by gradient ascent and constrained by truncated projection, to generate anomalies close to the distribution of the normal features.
|
|
|
|
This helps the detection of weak defects.
|
|
|
|
The second strategy is Local Anomaly Synthesis (LAS), which operates on the image level.
|
|
|
|
This strategy overlays textures onto normal images using masks derived from noise patterns.
|
|
|
|
LAS creates strong anomalies which are further away from the normal sample distribution.
|
|
|
|
This adds diversity to the synthesized anomalies.~#cite(<chen2024unifiedanomalysynthesisstrategy>)
|
|
|
|
|
|
|
|
GLASS combines GAS and LAS to improve anomaly detection and localization by synthesizing anomalies near and far from the normal distribution.
|
|
|
|
Experiments show that GLASS is very effective and, in some cases, outperforms state-of-the-art methods such as PatchCore on the MVTec AD dataset.~#cite(<chen2024unifiedanomalysynthesisstrategy>)
|
|
|
|
|
|
|
|
//=== HETMM (Hard-normal Example-aware Template Mutual Matching)
|
2025-01-08 07:45:16 +00:00
|
|
|
// https://arxiv.org/pdf/2303.16191v5
|
|
|
|
// https://arxiv.org/abs/2303.16191v5
|