The first and easiest method of this bachelor thesis uses a simple ResNet to calucalte those embeddings and is basically a simple prototypical netowrk.
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
It operates on the principle that an image is anomalous if any of its patches is anomalous.
The method achieves state-of-the-art performance on benchmarks like MVTec AD with high accuracy, low computational cost, and competitive inference times. #cite(<patchcorepaper>)
The PatchCore framework leverages a pre-trained convolutional neural network (e.g., WideResNet50) to extract mid-level features from image patches.
By focusing on intermediate layers, PatchCore balances the retention of localized information with a reduction in bias associated with high-level features pre-trained on ImageNet.
To enhance robustness to spatial variations, the method aggregates features from local neighborhoods using adaptive pooling, which increases the receptive field without sacrificing spatial resolution. #cite(<patchcorepaper>)
A crucial component of PatchCore is its memory bank, which stores patch-level features derived from the training dataset.
This memory bank represents the nominal distribution of features against which test patches are compared.
To ensure computational efficiency and scalability, PatchCore employs a coreset reduction technique to condense the memory bank by selecting the most representative patch features.
This optimization reduces both storage requirements and inference times while maintaining the integrity of the feature space. #cite(<patchcorepaper>)
During inference, PatchCore computes anomaly scores by measuring the distance between patch features from test images and their nearest neighbors in the memory bank.
If any patch exhibits a significant deviation, the corresponding image is flagged as anomalous.
For localization, the anomaly scores of individual patches are spatially aligned and upsampled to generate segmentation maps, providing pixel-level insights into the anomalous regions.~#cite(<patchcorepaper>)
EfficientAD is another state of the art method for anomaly detection.
It focuses on maintining performance as well as high computational efficiency.
At its core, EfficientAD uses a lightweight feature extractor, the Patch Description Network (PDN), which processes images in less than a millisecond on modern hardware.
In comparison to Patchcore which relies on a deeper, more computationaly heavy WideResNet-101 network, the PDN uses only four convulutional layers and two pooling layers.
This results in reduced latency while retains the ability to generate patch-level features.~#cite(<efficientADpaper>)
The detection of anomalies is achieved through a student-teacher framework.
The teacher network is a PDN and pre-trained on normal (good) images and the student network is trained to predict the teachers output.
An anomalie is identified when the student failes to replicate the teachers output.
This works because of the abscence of anomalies in the training data and the student network has never seen an anomaly while training.
A special loss function helps the student network not to generalize too broadly and inadequatly learn to predict anomalous features.~#cite(<efficientADpaper>)
Additionally to this structural anomaly detection EfficientAD can also address logical anomalies, such as violations in spartial or contextual constraints (eg. object wrong arrangments).
This is done by the integration of an autoencoder trained to replicate the teacher's features.~#cite(<efficientADpaper>)
By comparing the outputs of the autoencdoer and the student logical anomalies are effectively detected.
This is a challenge that Patchcore does not directly address.~#cite(<efficientADpaper>)
Convolutional neural networks are especially good model architectures for processing images, speech and audio signals.
A CNN typically consists of Convolutional layers, pooling layers and fully connected layers.
Convolutional layers are a set of learnable kernels (filters).
Each filter performs a convolution operation by sliding a window over every pixel of the image.
On each pixel a dot product creates a feature map.
Convolutional layers capture features like edges, textures or shapes.
Pooling layers sample down the feature maps created by the convolutional layers.
This helps reducing the computational complexity of the overall network and help with overfitting.
Common pooling layers include average- and max pooling.
Finally, after some convolution layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task.
It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
*Architecture:* CAML first encodes the query and support set images using the fronzen pre-trained feature extractor as shown in @camlarchitecture.
This step brings the images into a low dimensional space where similar images are encoded into similar embeddings.
The class labels are encoded with the ELMES class encoder.
Since the class of the query image is unknown in this stage we add a special learnable "unknown token" to the encoder.
This embedding is learned during pre-training.
Afterwards each image embedding is concatenated with the corresponding class embedding.
#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
*ELMES Encoder:* The ELMES (Equal Length and Maximally Equiangular Set) encoder encodes the class labels to vectors of equal length.
The encoder is a bijective mapping between the labels and set of vectors that are equal length and maximally equiangular.
#todo[Describe what equiangular and bijective means]
Similar to one-hot encoding but with some advantages.
Equation~$cal(L)(p,q)$~\eqref{eq:crelbinarybatch}\cite{handsonaiI} is the Binary Cross Entropy Loss for a batch of size $cal(B)$ and used for model training in this Practical Work.