add balanced stuff and code snippet
Binary files changed (not shown): rsc/AUC_balanced__4_24.png, new file, 52 KiB; a second image modified, 79 KiB before, 32 KiB after.
@@ -96,11 +96,14 @@ The previous process was improved by balancing the classes to give the oracle fo
The idea is that the low-certainty samples might always belong to one class and thus lead to an imbalanced learning process.
The sample selection was modified as described in~\ref{par:furtherimprovements}.
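To make this failure mode concrete, here is a small illustrative sketch (toy data; the column names `sample`, `pseudolabel`, and `score` follow the listings in the implementation section, but the data is made up) of how plain low-certainty selection can draw a single-class batch:

```python
import pandas as pd

batch_size = 4

# toy pool: the least certain samples (lowest score) all share one pseudolabel
df = pd.DataFrame({
    "sample":      ["a", "b", "c", "d", "e", "f"],
    "pseudolabel": [0,   0,   0,   0,   1,   1],
    "score":       [0.1, 0.2, 0.3, 0.4, 0.8, 0.9],
})

# plain low-certainty selection: take the batch_size lowest-scoring samples
picked = df.sort_values(by=["score"])[:batch_size]

# every picked sample carries pseudolabel 0 -- the imbalance described above
print(picked["pseudolabel"].tolist())  # [0, 0, 0, 0]
```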
\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{../rsc/AUC_balanced__4_24}
    \caption{AUC over the active-learning iterations for balanced and unbalanced low-certainty sampling, $\mathcal{B}=4$, $\mathcal{S}=24$}
    \label{fig:balancedauc}
\end{figure}

Unfortunately, balancing did not improve the convergence speed: it makes no noticeable difference compared to not balancing, and mostly performs even slightly worse.
This might be because the uncertainty-sampling process already balances the draws fairly well on its own.
Figure~\ref{fig:balancedauc} shows the AUC curves with a batch size $\mathcal{B}=4$ and a sample size $\mathcal{S}=24$ for both balanced and unbalanced low-certainty sampling.
The results look similar for the other batch sizes and sample sizes.
@@ -4,7 +4,7 @@

To get accurate performance measures, the active-learning process was first implemented in a Jupyter notebook.
This helps to choose which of the methods performs best and which one to use in the final Dagster pipeline.
A straightforward machine-learning pipeline was implemented with the help of PyTorch and a ResNet-18.

\begin{lstlisting}[language=Python, caption=Certainty sampling process of selected metric]
df = df.sort_values(by=['score'])
@@ -34,9 +34,32 @@ match predict_mode:
\end{lstlisting}

Moreover, the dataset was manually imported and preprocessed with random augmentations.
After each loop iteration the Area Under the Curve (AUC) was calculated over the validation set to obtain a performance measure.
All AUC values were visualized in a line plot; see~\ref{sec:experimental-results} for the results.
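The per-iteration AUC can be computed from the validation scores with the Mann-Whitney rank statistic. The following is a minimal NumPy sketch of that computation (the notebook itself presumably calls a library routine; ties in the scores are ignored here):

```python
import numpy as np

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic.

    A sketch, not a library-grade implementation: ties in y_score are ignored.
    """
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# one such value per loop iteration is appended to a history list
# and plotted as a line at the end
print(auc(np.array([0, 0, 1, 1]), np.array([0.1, 0.2, 0.8, 0.9])))  # 1.0
```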
\subsection{Balanced sample selection}

To avoid the model learning only from one class, the sample-selection process was balanced as mentioned in~\ref{par:furtherimprovements}:
the pool is sorted by predicted class first, and then the $\mathcal{B}/2$ least certain samples are selected per class.
This should help to balance the sample-selection process.

\begin{lstlisting}[language=Python, caption=Certainty sampling process with class balancing]
# sort by pseudolabel
df.sort_values(by=['pseudolabel'], inplace=True)
# sort each half of the pool by pseudolabel + score
df[:int(sample_size/2)] = df[:int(sample_size/2)].sort_values(by=['pseudolabel', 'score'])
df[int(sample_size/2):] = df[int(sample_size/2):].sort_values(by=['pseudolabel', 'score'])

halfbatchsize = int(batch_size/2)
train_samples = pd.concat(
    [df[:halfbatchsize], df[int(sample_size/2):int(sample_size/2)+halfbatchsize]]
)["sample"].values.tolist()
unlabeled_samples += pd.concat(
    [df[halfbatchsize:int(sample_size/2)], df[int(sample_size/2)+halfbatchsize:]]
)["sample"].values.tolist()
\end{lstlisting}
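The same per-class pick can be expressed more directly with a groupby. The following is an equivalent sketch on toy data (column names as in the listing above; the data itself is made up for illustration):

```python
import pandas as pd

batch_size = 4

df = pd.DataFrame({
    "sample":      ["a", "b", "c", "d", "e", "f"],
    "pseudolabel": [0,   0,   0,   1,   1,   1],
    "score":       [0.9, 0.2, 0.5, 0.3, 0.8, 0.1],
})

# batch_size/2 least certain samples per predicted class
picked = (
    df.groupby("pseudolabel", group_keys=False)
      .apply(lambda g: g.nsmallest(batch_size // 2, "score"))
)
train_samples = picked["sample"].tolist()
# everything not picked stays in the unlabeled pool
unlabeled_samples = df.loc[~df.index.isin(picked.index), "sample"].tolist()

print(train_samples)  # two samples from each pseudolabel
```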
\subsection{Dagster with Label-Studio}\label{subsec:dagster-with-label-studio}

The main goal is to implement an active learning loop with the help of Dagster and Label-Studio.
@@ -45,7 +68,6 @@ This helps building reusable building blocks and to keep the code clean.

Most of the Python routines implemented in section~\ref{subsec:jupyter} were reused here and just slightly modified to fit the Dagster pipeline.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{../rsc/dagster/assets}
@@ -21,11 +21,9 @@ The sample-selection metric might select samples just from one class by chance.
Does balancing this distribution help the model performance?

\subsection{Outline}\label{subsec:outline}

In section~\ref{sec:material-and-methods} we talk about the general methods and materials used.
First the problem is modeled mathematically in~\ref{subsubsec:mathematicalmodeling} and then implemented and benchmarked in a Jupyter notebook in~\ref{subsubsec:jupyternb}.
Section~\ref{sec:implementation} gives deeper insights into the implementation, with some code snippets for the interested reader.
The experimental results in~\ref{sec:experimental-results} are presented with figures illustrating the performance of active learning across different sample sizes and batch sizes.
The conclusion~\ref{subsec:conclusion} provides an overview of the findings, highlighting the benefits of active learning.
Additionally, the outlook section~\ref{subsec:outlook} suggests avenues for future research which are not covered in this work.