diff --git a/materialandmethods.typ b/materialandmethods.typ
index 80482f2..822f99c 100644
--- a/materialandmethods.typ
+++ b/materialandmethods.typ
@@ -202,21 +202,26 @@ Todo
 // https://arxiv.org/pdf/2310.10971v2
 CAML (Context aware meta learning) is one of the state-of-the-art methods for few-shot learning.
 It consists of three different components: a frozen pre-trained image encoder, a fixed Equal Length and Maximally Equiangular Set (ELMES) class encoder and a non-causal sequence model.
+This is a universal meta-learning approach.
+That means no fine-tuning or meta-training is applied for specific domains.~#cite()
 
-*Architecture:* CAML first encodes the query and support set images using the fozen pre-trained feature extractor as shown in @camlarchitecture.
+*Architecture:*
+CAML first encodes the query and support set images using the frozen pre-trained feature extractor as shown in @camlarchitecture.
 This step brings the images into a low dimensional space where similar images are encoded into similar embeddings.
 The class labels are encoded with the ELMES class encoder.
 Since the class of the query image is unknown in this stage a special learnable "unknown token" is added to the encoder.
 This embedding is learned during pre-training.
 Afterwards each image embedding is concatenated with the corresponding class embedding.
+~#cite()
+#todo[Add more references to the architecture image below]
 
-#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
-
-*ELMES Encoder:* The ELMES (Equal Length and Maximally Equiangular Set) encoder encodes the class labels to vectors of equal length.
+*ELMES Encoder:*
+The ELMES (Equal Length and Maximally Equiangular Set) encoder encodes the class labels to vectors of equal length.
 The encoder is a bijective mapping between the labels and set of vectors that are equal length and maximally equiangular.
 #todo[Describe what equiangular and bijective means]
 Similar to one-hot encoding but with some advantages.
 This encoder maximizes the algorithms ability to distinguish between different classes.
+~#cite()
 
 *Non-causal sequence model:*
 The sequence created by the ELMES encoder is then fed into a non-causal sequence model.
@@ -226,14 +231,37 @@ Visual features from query and support set can be compared to each other to dete
 This can then be used to predict the class of the query image.
 From the output of the sequence model the element at the same position as the query is selected.
 Afterwards it is passed through a simple MLP network to predict the class of the query image.
+~#cite()
 
 *Large-Scale Pre-Training:*
-#todo[Desc. what this is]
+CAML is pre-trained on a large number of images from the ImageNet-1k, Fungi, MSCOCO, and WikiArt datasets.
+These datasets span different domains, which helps the model detect new visual concepts during inference.
+Only the non-causal sequence model is trained; the image encoder and the ELMES encoder stay frozen.
+~#cite()
 
 *Theoretical Analysis:*
 #todo[Mybe not that important?]
 
+*Inference:*
+During inference, CAML performs the following steps:
+- Encodes the support set images and labels with the pre-trained feature and class encoders.
+- Concatenates these encodings into a sequence alongside the query image embedding.
+- Passes the sequence through the non-causal sequence model, enabling dynamic interaction between query and support set representations.
+- Extracts the transformed query embedding and classifies it using a Multi-Layer Perceptron (MLP).~#cite()
+
 *Results:*
+CAML achieves state-of-the-art performance in universal meta-learning across 11 few-shot classification benchmarks,
+including generic object recognition (e.g., MiniImageNet), fine-grained classification (e.g., CUB, Aircraft),
+and cross-domain tasks (e.g., Pascal+Paintings).
+It outperformed or matched existing models in 14 of 22 evaluation settings.
+It performs competitively against P>M>F in 8 benchmarks even though P>M>F was meta-trained on the same domain.
+~#cite()
+
+CAML performs well in generalization and inference efficiency but faces limitations in specialized domains (e.g., ChestX)
+and low-resolution tasks (e.g., CIFAR-fs).
+Its use of frozen pre-trained feature extractors is key to avoiding overfitting and enabling robust performance.
+~#cite()
+#todo[We should add stuff here why we have a max amount of shots bc. of pretrained model]
 
 #figure(
   image("rsc/caml_architecture.png", width: 100%),