add balanced stuff and code snippet

lukas-heilgenbrunner 2024-05-17 11:19:14 +02:00
parent 2ff58491b0
commit 79d04ccef3
5 changed files with 39 additions and 16 deletions

BIN rsc/AUC_balanced__4_24.png (new file, 52 KiB, binary file not shown)

BIN (binary image changed: 79 KiB before, 32 KiB after, file not shown)

View File

@@ -96,11 +96,14 @@ The previous process was improved by balancing the classes to give the oracle fo
The idea is that the low-certainty samples might all belong to one class by chance and thus lead to an imbalanced learning process.
The sample selection was modified as described in~\ref{par:furtherimprovements}.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{../rsc/AUC_balanced__4_24}
\caption{AUC for balanced vs.\ unbalanced low-certainty sampling with $\mathcal{B}=4$ and $\mathcal{S}=24$}
\label{fig:balancedauc}
\end{figure}
Unfortunately, it did not improve the convergence speed; balancing seems to make no difference compared to not balancing and mostly performs even slightly worse.
This might be because the uncertainty-sampling process already balances the draws reasonably well by itself.
Figure~\ref{fig:balancedauc} shows the AUC curve with a batch size of $\mathcal{B}=4$ and a sample size of $\mathcal{S}=24$ for both balanced and unbalanced low-certainty sampling.
The results look similar for the other batch and sample sizes.

View File

@@ -4,7 +4,7 @@
To get accurate performance measures, the active-learning process was implemented in a Jupyter notebook first.
This helps to choose which of the methods performs best and which one to use in the final Dagster pipeline.
A straightforward machine-learning pipeline was implemented with the help of PyTorch and ResNet-18.
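A minimal sketch of such a setup, assuming the torchvision ResNet-18 with its final layer replaced by a two-class head (the optimizer and loss are assumptions, not the exact training configuration used):
\begin{lstlisting}[language=Python, caption=Hypothetical ResNet-18 pipeline setup]
import torch
import torch.nn as nn
from torchvision.models import resnet18

# ResNet-18 backbone with a two-class head (binary classification assumed)
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)
# assumed training configuration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
\end{lstlisting}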
\begin{lstlisting}[language=Python, caption=Certainty sampling process of selected metric]
df = df.sort_values(by=['score'])
@@ -34,9 +34,32 @@ match predict_mode:
\end{lstlisting}
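The metric that fills the \texttt{score} column is truncated in the diff above; the following is a minimal sketch of one plausible choice, assuming certainty is measured as the maximum softmax probability (the metric actually applied is selected via \texttt{predict\_mode} and may differ):
\begin{lstlisting}[language=Python, caption=Hypothetical certainty score via maximum softmax probability]
import torch
import torch.nn.functional as F

def certainty_score(logits: torch.Tensor) -> torch.Tensor:
    # a low maximum class probability means the model is uncertain,
    # so these samples are handed to the oracle first
    probs = F.softmax(logits, dim=1)
    return probs.max(dim=1).values
\end{lstlisting}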
Moreover, the dataset was manually imported and preprocessed with random augmentations.
After each loop iteration the Area Under the Curve (AUC) was calculated over the validation set to get a performance measure.
All those AUC values were visualized in a line plot; see~\ref{sec:experimental-results} for the results.
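A minimal sketch of this evaluation step, assuming binary classification and scikit-learn's \texttt{roc\_auc\_score}; \texttt{model}, \texttt{val\_loader} and \texttt{auc\_history} are placeholder names:
\begin{lstlisting}[language=Python, caption=Hypothetical AUC evaluation after each loop iteration]
import torch
from sklearn.metrics import roc_auc_score

model.eval()
scores, labels = [], []
with torch.no_grad():
    for x, y in val_loader:  # assumed validation DataLoader
        # probability of the positive class
        scores += torch.softmax(model(x), dim=1)[:, 1].tolist()
        labels += y.tolist()
# one AUC value per active-learning iteration, later drawn as a line plot
auc_history.append(roc_auc_score(labels, scores))
\end{lstlisting}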
\subsection{Balanced sample selection}
To keep the model from learning only from one class, the sample-selection process was balanced as mentioned in~\ref{par:furtherimprovements}.
The idea is simple: sort by predicted class first, then select the $\mathcal{B}/2$ least certain samples per class.
This should balance the sample-selection process.
\begin{lstlisting}[language=Python, caption=Certainty sampling process with class balancing]
import pandas as pd  # assumed to be imported once at the top of the notebook

# sort by pseudo-label so that one class ends up in each half of the pool
df.sort_values(by=['pseudolabel'], inplace=True)
df.reset_index(drop=True, inplace=True)
# sort each half by pseudo-label + score; .values avoids pandas
# realigning the rows on their index, which would undo the sort
df[:int(sample_size/2)] = df[:int(sample_size/2)].sort_values(by=['pseudolabel', 'score']).values
df[int(sample_size/2):] = df[int(sample_size/2):].sort_values(by=['pseudolabel', 'score']).values
halfbatchsize = int(batch_size/2)
# take the B/2 least certain samples of each class for labeling
train_samples = pd.concat(
    [df[:halfbatchsize], df[int(sample_size/2):int(sample_size/2)+halfbatchsize]]
)["sample"].values.tolist()
# the remaining samples go back to the unlabeled pool
unlabeled_samples += pd.concat(
    [df[halfbatchsize:int(sample_size/2)], df[int(sample_size/2)+halfbatchsize:]]
)["sample"].values.tolist()
\end{lstlisting}
\subsection{Dagster with Label-Studio}\label{subsec:dagster-with-label-studio}
The main goal is to implement an active learning loop with the help of Dagster and Label-Studio.
@@ -45,7 +68,6 @@ This helps building reusable building blocks and to keep the code clean.
Most of the Python routines implemented in section~\ref{subsec:jupyter} were reused here and just slightly modified to fit the Dagster pipeline, as sketched below.
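As a minimal sketch of this reuse, a routine becomes a Dagster software-defined asset roughly as follows; the asset names and bodies are hypothetical, not the actual pipeline definitions:
\begin{lstlisting}[language=Python, caption=Hypothetical Dagster asset wrapping a notebook routine]
from dagster import Definitions, asset

@asset
def unlabeled_pool() -> list[str]:
    # hypothetical stand-in for the manual dataset import
    return [f"sample_{i}.png" for i in range(100)]

@asset
def selected_samples(unlabeled_pool: list[str]) -> list[str]:
    # here the certainty-sampling routine from the notebook would
    # pick the B least certain samples (fixed slice as placeholder)
    return unlabeled_pool[:4]

defs = Definitions(assets=[unlabeled_pool, selected_samples])
\end{lstlisting}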
\begin{figure}
\centering
\includegraphics[width=\linewidth]{../rsc/dagster/assets}

View File

@@ -21,11 +21,9 @@ The sample-selection metric might select samples just from one class by chance.
Does balancing this distribution help the model performance?
\subsection{Outline}\label{subsec:outline}
In section~\ref{sec:material-and-methods} we talk about the general methods and materials used.
First the problem is modeled mathematically in~\ref{subsubsec:mathematicalmodeling} and then implemented and benchmarked in a Jupyter notebook~\ref{subsubsec:jupyternb}.
Section~\ref{sec:implementation} gives deeper insights into the implementation for the interested reader, with some code snippets.
The experimental results~\ref{sec:experimental-results} are presented with figures illustrating the performance of active learning across different sample and batch sizes.
The conclusion~\ref{subsec:conclusion} provides an overview of the findings, highlighting the benefits of active learning.
Additionally, the outlook section~\ref{subsec:outlook} suggests avenues for future research which are not covered in this work.