add pmf material section
		@@ -99,6 +99,54 @@ In typical supervised learning the model sees thousands or millions of samples o
This helps the model to learn the underlying patterns and to generalize well to unseen data.
In few-shot learning the model has to generalize from just a few samples.

=== Softmax
#todo[Maybe remove this section]
The Softmax function @softmax #cite(<liang2017soft>) converts the $n$ entries of a vector into a probability distribution.
It is a generalization of the Sigmoid function and is often used as an activation layer in neural networks.

$
sigma(bold(z))_j = (e^(z_j)) / (sum_(k=1)^n e^(z_k)) "for" j in {1,...,n}
$ <softmax>

The softmax function is closely related to the Boltzmann distribution, which was first introduced in the 19th century #cite(<Boltzmann>).
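For illustration, @softmax can be computed in a few lines of NumPy; this sketch is not part of the thesis code, and the max-subtraction is only added for numerical stability:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    # Subtracting the maximum leaves the result unchanged (softmax is
    # invariant to adding a constant to all entries) but avoids overflow in exp().
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Example: three logits turned into a probability distribution.
print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099]
```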
=== Cross Entropy Loss
#todo[Maybe remove this section]
Cross Entropy Loss is a well-established loss function in machine learning.
@crelformal #cite(<crossentropy>) shows the formal general definition of the Cross Entropy Loss, and @crelbinary is its special case for binary classification tasks.

$
H(p,q) &= -sum_(x in cal(X)) p(x) log q(x) #<crelformal>\
H(p,q) &= -(p log(q) + (1-p) log(1-q)) #<crelbinary>\
cal(L)(p,q) &= -1/cal(B) sum_(i=1)^(cal(B)) (p_i log(q_i) + (1-p_i) log(1-q_i)) #<crelbatched>
$ <crel>

Equation~$cal(L)(p,q)$ @crelbatched #cite(<handsonaiI>) is the Binary Cross Entropy Loss for a batch of size $cal(B)$ and is used for model training in this Practical Work.
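To make @crelbatched concrete, the following sketch computes the batched Binary Cross Entropy Loss with NumPy; the clamping constant `eps` is an implementation detail added here for numerical stability and is not part of the formula:

```python
import numpy as np

def binary_cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    # p: ground-truth labels in {0, 1}, q: predicted probabilities in (0, 1)
    q = np.clip(q, eps, 1.0 - eps)            # avoid log(0)
    losses = -(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
    return float(losses.mean())               # average over the batch

labels = np.array([1.0, 0.0, 1.0, 0.0])
preds = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(labels, preds))    # approx. 0.30
```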
=== Cosine Similarity
To measure the distance between two vectors, several common distance measures are used.
One popular choice is the Cosine Similarity (@cosinesimilarity).
It measures the cosine of the angle between two vectors.
The Cosine Similarity is especially useful when the magnitude of the vectors is not important.

$
  cos(theta) &:= (A dot B) / (||A|| dot ||B||)\
  &= (sum_(i=1)^n  A_i B_i)/ (sqrt(sum_(i=1)^n A_i^2) dot sqrt(sum_(i=1)^n B_i^2))
$ <cosinesimilarity>

#todo[Source?]
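A minimal sketch of @cosinesimilarity, assuming NumPy purely for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0 (same direction)
```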
=== Euclidean Distance
The Euclidean distance (@euclideannorm) is a simpler method to measure the distance between two points in a vector space.
It calculates the square root of the sum of the squared differences of the coordinates.
The Euclidean distance can also be expressed as the L2 norm (Euclidean norm) of the difference of the two vectors.

$
  cal(d)(A,B) = ||A-B|| := sqrt(sum_(i=1)^n (A_i - B_i)^2)
$ <euclideannorm>
#todo[Source?]
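Analogously, a small sketch of @euclideannorm:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # d(A, B) = ||A - B||_2
    return float(np.linalg.norm(a - b))

print(euclidean_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
```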
=== Patchcore
// https://arxiv.org/pdf/2106.08265
PatchCore is an advanced method designed for cold-start anomaly detection and localization, primarily focused on industrial image data.
@@ -198,8 +246,63 @@ For this bachelor thesis the ResNet-50 architecture was used to predict the corre
=== P$>$M$>$F
// https://arxiv.org/pdf/2204.07305
P>M>F (Pre-training > Meta-training > Fine-tuning) is a three-stage pipeline designed for few-shot learning.
It focuses on simplicity but still achieves competitive performance.
The three stages convert a general feature extractor into a task-specific model through successive optimization.
#cite(<pmfpaper>)

*Pre-training:*
The first stage in @pmfarchitecture initializes the backbone feature extractor.
This can be, for instance, a ResNet or a ViT and is learned with self-supervised techniques.
The backbone is trained on a large-scale dataset from a general domain such as ImageNet.
This step optimizes for robust feature extraction and builds a foundation model.
There are well-established methods for pre-training that can be used, such as DINO (self-supervised consistency), CLIP (image-text alignment) or BERT (for text data).
#cite(<pmfpaper>)
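As a rough illustration (not taken from the P>M>F reference implementation), such a self-supervised backbone could be obtained in PyTorch via the publicly released DINO weights; the exact hub entry point is an assumption based on the DINO repository:

```python
import torch

# Load a ViT-S/16 backbone with self-supervised DINO weights (assumed
# torch.hub entry point of the facebookresearch/dino repository).
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
backbone.eval()

# The backbone acts as f: X -> R^m, mapping images to m-dimensional features.
with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)  # dummy batch instead of real data
    features = backbone(images)
print(features.shape)  # e.g. torch.Size([4, 384]) for ViT-S/16
```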
*Meta-training:*
The second stage in the pipeline, as shown in @pmfarchitecture, is meta-training.
Here a prototypical network (ProtoNet) is used to refine the pre-trained backbone.
ProtoNet constructs class centroids for each episode and then performs nearest-class-centroid classification.
See @prototypefewshot for a visualisation of its architecture.
The ProtoNet only requires a backbone $f$ that maps images to an $m$-dimensional vector space: $f: cal(X) -> RR^m$.
The probability of a query image $x$ belonging to a class $k$ is given by the exponential of the negative distance between the sample and the class centroid, normalized over the distances to all class centroids:

$
  p(y=k|x) = exp(-d(f(x), c_k)) / (sum_(k') exp(-d(f(x), c_(k'))))#cite(<pmfpaper>)
$

As the distance metric $d$ the cosine similarity is used, see @cosinesimilarity for the formula.
The prototype $c_k$ of a class is defined as $c_k = 1/N_k sum_(i:y_i=k) f(x_i)$, where $N_k$ is the number of support samples of class $k$.
The meta-training process is dataset-agnostic, allowing for flexible adaptation to various few-shot classification scenarios.#cite(<pmfpaper>)
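The nearest-centroid classification described above can be sketched in a few lines of PyTorch; tensor shapes, variable names and the omitted temperature scaling are simplifications and not taken from the reference implementation:

```python
import torch
import torch.nn.functional as F

def protonet_probabilities(support_feats, support_labels, query_feats, num_classes):
    # Class prototypes c_k: mean embedding of the support samples of each class.
    prototypes = torch.stack([
        support_feats[support_labels == k].mean(dim=0) for k in range(num_classes)
    ])
    # Cosine similarity between each query embedding and each prototype;
    # the negative similarity plays the role of the distance d above.
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return sims.softmax(dim=-1)  # p(y = k | x) for every query image

# Toy 2-way episode with 8-dimensional embeddings.
support_feats = torch.randn(10, 8)
support_labels = torch.tensor([0] * 5 + [1] * 5)
query_feats = torch.randn(3, 8)
print(protonet_probabilities(support_feats, support_labels, query_feats, 2).shape)  # [3, 2]
```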
*Fine-tuning:*
If a novel task is drawn from an unseen domain, the model may fail to generalize because of a significant shift in the data distribution.
To overcome this, the model is optionally fine-tuned on the support set for a few gradient steps.
Data augmentation is used to generate a pseudo query set.
The class prototypes are calculated from the support set and compared against the model's predictions for the pseudo query set.
The resulting loss is used to fine-tune the whole model to the new domain.~#cite(<pmfpaper>)
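A condensed sketch of this optional fine-tuning loop, reusing the hypothetical `protonet_probabilities` helper from above (the `augment` function, the learning rate and the number of steps are illustrative placeholders, not the values used in the paper):

```python
import torch
import torch.nn.functional as F

def fine_tune(backbone, support_imgs, support_labels, augment, num_classes,
              steps: int = 50, lr: float = 1e-4):
    optimizer = torch.optim.Adam(backbone.parameters(), lr=lr)
    for _ in range(steps):
        pseudo_query = augment(support_imgs)          # augmented support copies
        support_feats = backbone(support_imgs)
        query_feats = backbone(pseudo_query)
        probs = protonet_probabilities(support_feats, support_labels,
                                       query_feats, num_classes)
        # Each pseudo query keeps the label of the support image it stems from.
        loss = F.nll_loss(torch.log(probs + 1e-12), support_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return backbone
```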
#figure(
  image("rsc/pmfarchitecture.png", width: 100%),
  caption: [Architecture of P>M>F. #cite(<pmfpaper>)],
) <pmfarchitecture>

*Inference:*
During inference the support set is used to calculate the class prototypes.
For a query image the feature extractor computes its embedding in the lower-dimensional space and compares it to the pre-computed prototypes.
The query image is then assigned to the class with the closest prototype.#cite(<pmfpaper>)
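Continuing the hypothetical helpers from the sketches above, inference then reduces to one embedding pass and an argmax over the class probabilities:

```python
# support_imgs/support_labels define the episode, query_imgs are to be classified.
with torch.no_grad():
    support_feats = backbone(support_imgs)
    query_feats = backbone(query_imgs)
    probs = protonet_probabilities(support_feats, support_labels,
                                   query_feats, num_classes)
predictions = probs.argmax(dim=-1)  # class with the closest prototype
```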
*Performance:*
P>M>F performs well across several few-shot learning benchmarks.
The combination of pre-training on a large dataset and meta-training with episodic tasks helps the model to generalize well.
The inclusion of fine-tuning enhances adaptability to unseen domains, ensuring robust and efficient learning.#cite(<pmfpaper>)

*Limitations and Scalability:*
This method has some limitations.
It relies on large external datasets, which require substantial computational resources to create pre-trained models.
Fine-tuning is effective but can be slow and may not work well on devices with limited computational resources.
Future research could focus on exploring faster and more efficient methods for fine-tuning models.
#cite(<pmfpaper>)

=== CAML <CAML>
// https://arxiv.org/pdf/2310.10971v2

@@ -268,53 +371,6 @@ Its use of frozen pre-trained feature extractors is key to avoiding overfitting
  caption: [Architecture of CAML. #cite(<caml_paper>)],
) <camlarchitecture>
(removed here: the old === Softmax, === Cross Entropy Loss, === Cosine Similarity and === Euclidean Distance subsections, which this commit moves earlier in the document)
== Alternative Methods

There are several alternative methods for few-shot learning which are not used in this bachelor thesis.
BIN rsc/pmfarchitecture.png (new file, 117 KiB): binary file not shown