add pmf material section

lukas-heiligenbrunner 2025-01-03 15:25:32 +01:00
parent 882c6f54bb
commit 2690a3d0f2
2 changed files with 104 additions and 48 deletions


@@ -99,6 +99,54 @@ In typical supervised learning the model sees thousands or millions of samples o
This helps the model to learn the underlying patterns and to generalize well to unseen data.
In few-shot learning the model has to generalize from just a few samples.
=== Softmax
#todo[Maybe remove this section]
The Softmax function @softmax #cite(<liang2017soft>) converts a vector of $n$ real numbers into a probability distribution.
It is a generalization of the Sigmoid function and is often used as an activation layer in neural networks.
$
sigma(bold(z))_j = (e^(z_j)) / (sum_(k=1)^n e^(z_k)) "for" j in {1,...,n}
$ <softmax>
The softmax function is closely related to the Boltzmann distribution and was first introduced in the 19th century #cite(<Boltzmann>).
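A minimal NumPy sketch of @softmax; the max-subtraction is a standard numerical-stability trick added here and is not part of the formula:
```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtracting the maximum does not change the result but avoids overflow.
    e = np.exp(z - z.max())
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```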
=== Cross Entropy Loss
#todo[Maybe remove this section]
Cross Entropy Loss is a well-established loss function in machine learning.
@crelformal #cite(<crossentropy>) shows the formal, general definition of the Cross Entropy Loss.
@crelbinary is the special case of the general Cross Entropy Loss for binary classification tasks.
$
H(p,q) &= -sum_(x in cal(X)) p(x) log q(x) #<crelformal>\
H(p,q) &= -(p log(q) + (1-p) log(1-q)) #<crelbinary>\
cal(L)(p,q) &= -1/cal(B) sum_(i=1)^(cal(B)) (p_i log(q_i) + (1-p_i) log(1-q_i)) #<crelbatched>
$ <crel>
Equation~$cal(L)(p,q)$ @crelbatched #cite(<handsonaiI>) is the Binary Cross Entropy Loss for a batch of size $cal(B)$ and is used for model training in this Practical Work.
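A small NumPy sketch of the batched loss in @crelbatched; the clipping constant `eps` is an implementation detail added here to avoid $log(0)$:
```python
import numpy as np

def binary_cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    # p: targets in {0, 1}, q: predicted probabilities, both of length B (the batch size).
    q = np.clip(q, eps, 1.0 - eps)
    return float(-np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))

print(binary_cross_entropy(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.6])))  # ~0.28
```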
=== Cosine Similarity
To measure the similarity or distance between two vectors, several common measures are used.
A popular one is the Cosine Similarity (@cosinesimilarity).
It measures the cosine of the angle between two vectors.
The Cosine Similarity is especially useful when the magnitude of the vectors is not important.
$
cos(theta) &:= (A dot B) / (||A|| dot ||B||)\
&= (sum_(i=1)^n A_i B_i)/ (sqrt(sum_(i=1)^n A_i^2) dot sqrt(sum_(i=1)^n B_i^2))
$ <cosinesimilarity>
#todo[Source?]
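A direct NumPy sketch of @cosinesimilarity:
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = <a, b> / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors of different magnitude still have similarity 1.
print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0
```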
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is another common method to measure the distance between two points in a vector space.
It is the square root of the sum of the squared differences of the coordinates.
The Euclidean distance can also be expressed as the L2 norm (Euclidean norm) of the difference of the two vectors.
$
cal(d)(A,B) = ||A-B|| := sqrt(sum_(i=1)^n (A_i - B_i)^2)
$ <euclideannorm>
#todo[Source?]
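The same in a short NumPy sketch of @euclideannorm:
```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # d(A, B) = ||A - B||_2
    return float(np.linalg.norm(a - b))

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
```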
=== Patchcore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
@@ -198,8 +246,63 @@ For this bachelor thesis the ResNet-50 architecture was used to predict the corre
=== P$>$M$>$F
// https://arxiv.org/pdf/2204.07305
P>M>F (Pre-training > Meta-training > Fine-tuning) is a three-stage pipeline designed for few-shot learning.
It focuses on simplicity but still achieves competitive performance.
The three stages successively adapt a general feature extractor into a task-specific model.
#cite(<pmfpaper>)
*Pre-training:*
The first stage in @pmfarchitecture initializes the backbone feature extractor.
This can be, for instance, a ResNet or a ViT, and it is learned with self-supervised techniques.
The backbone is trained on large-scale datasets from a general domain, such as ImageNet.
This step optimizes for robust feature extraction and builds a foundation model.
There are well-established methods for pre-training that can be used, such as DINO (self-supervised consistency), CLIP (image-text alignment) or BERT (for text data).
#cite(<pmfpaper>)
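As an illustration, a pre-trained backbone can be loaded and stripped of its classification head so that it acts as a feature extractor $f: cal(X) -> RR^m$. The sketch below uses a supervised ImageNet ResNet-50 from torchvision as a stand-in; the paper favours self-supervised backbones such as DINO:
```python
import torch
import torchvision

# ImageNet-pretrained ResNet-50 as a stand-in backbone; the classifier head is
# dropped so the network maps images to 2048-dimensional feature vectors.
weights = torchvision.models.ResNet50_Weights.IMAGENET1K_V2
backbone = torchvision.models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))  # shape: (1, 2048)
```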
*Meta-training:*
The second stage of the pipeline shown in @pmfarchitecture is meta-training.
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest class centroid classification.
See @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ that maps images to an $m$-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to class $k$ is given by a softmax over the negative distances of its embedding to the class centroids:
$
p(y=k|x) = exp(-d(f(x), c_k)) / (sum_(k') exp(-d(f(x), c_(k'))))#cite(<pmfpaper>)
$
As the distance metric $d$, a distance based on the Cosine Similarity is used; see @cosinesimilarity for the formula.
The prototype $c_k$ of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of support samples of class $k$.
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
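A minimal PyTorch sketch of this nearest-centroid classification; the cosine-based distance and the tensor shapes are assumptions for illustration and may differ from the reference implementation:
```python
import torch
import torch.nn.functional as F

def prototypes(support_feat: torch.Tensor, support_y: torch.Tensor, n_classes: int) -> torch.Tensor:
    # c_k = mean of the support embeddings of class k; support_feat has shape (N, m).
    return torch.stack([support_feat[support_y == k].mean(dim=0) for k in range(n_classes)])

def protonet_probs(query_feat: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    # p(y=k|x) = softmax over negative distances, here with d = 1 - cosine similarity.
    q = F.normalize(query_feat, dim=-1)   # (Q, m)
    c = F.normalize(protos, dim=-1)       # (K, m)
    d = 1.0 - q @ c.t()                   # (Q, K) pairwise cosine distances
    return F.softmax(-d, dim=-1)
```
The same computation is reused during fine-tuning and at inference time below.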
*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the data distribution.
To overcome this, the model is optionally fine-tuned on the support set for a few gradient steps.
Data augmentation is used to generate a pseudo query set.
With the support set the class prototypes are calculated and compared against the model's predictions for the pseudo query set.
With the loss of this step the whole model is fine-tuned to the new domain.~#cite(<pmfpaper>)
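A rough sketch of this optional fine-tuning loop; `backbone`, `augment`, `support_x`, `support_y` and `n_classes` are assumed placeholders, and the helpers come from the sketch above:
```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

for step in range(20):                         # only a few gradient steps
    pseudo_query_x = augment(support_x)        # pseudo query set; labels stay support_y
    protos = prototypes(backbone(support_x), support_y, n_classes)
    probs = protonet_probs(backbone(pseudo_query_x), protos)
    loss = F.nll_loss(probs.log(), support_y)  # cross entropy on the pseudo query set
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```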
#figure(
image("rsc/pmfarchitecture.png", width: 100%),
caption: [Architecture of P>M>F. #cite(<pmfpaper>)],
) <pmfarchitecture>
*Inference:*
During inference the support set is used to calculate the class prototypes.
For a query image, the feature extractor computes its embedding in the lower-dimensional space and compares it to the pre-computed prototypes.
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
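Expressed with the helpers from the meta-training sketch, inference reduces to a few lines (again an illustrative assumption, not the paper's code):
```python
# Build prototypes from the support set once, then assign each query
# to the class with the closest prototype (highest probability).
protos = prototypes(backbone(support_x), support_y, n_classes)
predictions = protonet_probs(backbone(query_x), protos).argmax(dim=-1)
```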
*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
The combination of pre-training on large datasets and meta-training with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)
*Limitations and Scalability:*
This method has some limitations.
It relies on large external datasets, which require substantial computational resources to create pre-trained models.
Fine-tuning is effective but might be slow and may not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)
=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2
@@ -268,53 +371,6 @@ Its use of frozen pre-trained feature extractors is key to avoiding overfitting
caption: [Architecture of CAML. #cite(<caml_paper>)],
) <camlarchitecture>
=== Softmax
#todo[Maybe remove this section]
The Softmax function @softmax #cite(<liang2017soft>) converts $n$ numbers of a vector into a probability distribution.
Its a generalization of the Sigmoid function and often used as an Activation Layer in neural networks.
$
sigma(bold(z))_j = (e^(z_j)) / (sum_(k=1)^k e^(z_k)) "for" j:={1,...,k}
$ <softmax>
The softmax function has high similarities with the Boltzmann distribution and was first introduced in the 19th century #cite(<Boltzmann>).
=== Cross Entropy Loss
#todo[Maybe remove this section]
Cross Entropy Loss is a well established loss function in machine learning.
@crelformal #cite(<crossentropy>) shows the formal general definition of the Cross Entropy Loss.
And @crelbinary is the special case of the general Cross Entropy Loss for binary classification tasks.
$
H(p,q) &= -sum_(x in cal(X)) p(x) log q(x) #<crelformal>\
H(p,q) &= -(p log(q) + (1-p) log(1-q)) #<crelbinary>\
cal(L)(p,q) &= -1/N sum_(i=1)^(cal(B)) (p_i log(q_i) + (1-p_i) log(1-q_i)) #<crelbatched>
$ <crel>
Equation~$cal(L)(p,q)$ @crelbatched #cite(<handsonaiI>) is the Binary Cross Entropy Loss for a batch of size $cal(B)$ and used for model training in this Practical Work.
=== Cosine Similarity
To measure the distance between two vectors some common distance measures are used.
One popular of them is the Cosine Similarity (@cosinesimilarity).
It measures the cosine of the angle between two vectors.
The Cosine Similarity is especially useful when the magnitude of the vectors is not important.
$
cos(theta) &:= (A dot B) / (||A|| dot ||B||)\
&= (sum_(i=1)^n A_i B_i)/ (sqrt(sum_(i=1)^n A_i^2) dot sqrt(sum_(i=1)^n B_i^2))
$ <cosinesimilarity>
#todo[Source?]
=== Euclidean Distance
The euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
It just calculates the square root of the sum of the squared differences of the coordinates.
the euclidean distance can also be represented as the L2 norm (euclidean norm) of the difference of the two vectors.
$
cal(d)(A,B) = ||A-B|| := sqrt(sum_(i=1)^n (A_i - B_i)^2)
$ <euclideannorm>
#todo[Source?]
== Alternative Methods
There are several alternative methods to few-shot learning which are not used in this bachelor thesis.

BIN  rsc/pmfarchitecture.png (new file, 117 KiB; binary content not shown)