diff --git a/rsc/muffin_chiauaua_poster.jpg b/rsc/muffin_chiauaua_poster.jpg
new file mode 100644
index 0000000..dfe3452
Binary files /dev/null and b/rsc/muffin_chiauaua_poster.jpg differ
diff --git a/src/experimentalresults.tex b/src/experimentalresults.tex
index 5e07c8f..7838a85 100644
--- a/src/experimentalresults.tex
+++ b/src/experimentalresults.tex
@@ -2,9 +2,21 @@
 
 \subsection{Does Active-Learning benefit the learning process?}\label{subsec:does-active-learning-benefit-the-learning-process?}
 
-With the test setup described in section~\ref{sec:implementation} a test series was performed.
+A test series was performed inside a Jupyter notebook.
+The active learning loop starts with an untrained ResNet-18 model and a random selection of samples.
+The muffin and chihuahua dataset was used for this binary classification task.
+The dataset is split into a training set and a test set, which contain $\sim4750$ and $\sim1250$ images, respectively.
+(see~\ref{subsec:material-and-methods} for more information)
+
+CrossEntropyLoss was used as the loss function, together with the Adam optimizer at a learning rate of $0.0001$.
+
+$\mathcal{B}$ samples are selected from the $\mathcal{S}$ samples and labeled by an oracle.
+Here the oracle simply labels the samples with the correct class, because the dataset is synthetic and the labels are known.
+No human annotator was used, because manual labeling is very time-consuming and the goal is to benchmark the active learning process itself.
+Afterwards, the model is trained with these labeled samples, and the loop starts again by selecting $\mathcal{B}$ samples from the $\mathcal{S}$ drawn samples.
+
 Several different batch sizes $\mathcal{B} = \left\{ 2,4,6,8 \right\}$ and sample sizes $\mathcal{S} = \left\{ 2\mathcal{B}_i,4\mathcal{B}_i,5\mathcal{B}_i,10\mathcal{B}_i \right\}$
-dependent on the selected batch size were selected.
+dependent on the selected batch size were used.
 We define the baseline (passive learning) AUC curve as the supervised learning process without any active learning.
 The following graphs are only a subselection of the test series which give the most insights.
 
diff --git a/src/materialandmethods.tex b/src/materialandmethods.tex
index bec66be..773ef30 100644
--- a/src/materialandmethods.tex
+++ b/src/materialandmethods.tex
@@ -2,6 +2,24 @@
 
 \subsection{Material}\label{subsec:material}
 
+\subsubsection{Muffin vs chihuahua}
+Muffin vs chihuahua is a free dataset available on Kaggle.
+It consists of $\sim6000$ images of the two classes muffins and chihuahuas.
+The source data is scraped from Google Images and is split into a training and a test set.
+The training set contains $\sim4750$ images and the test set $\sim1250$ images; overall, the two classes are almost balanced.
+This is expected to be a relatively hard classification task because the eyes of chihuahuas and chocolate parts of muffins look very similar.
+It is used in this practical work as a binary classification task to evaluate the performance of active learning.\cite{muffinsvschiuahuakaggle}
+
+\begin{figure}
+    \centering
+    \includegraphics[width=0.5\linewidth]{../rsc/muffin_chiauaua_poster}
+    \caption{Sample images from the dataset. \cite{muffinsvschiuahuakaggle_poster}}
+    \label{fig:roc-example}
+\end{figure}
+
+
+\subsection{Methods}\label{subsec:methods}
+
 \subsubsection{Dagster}
 Dagster is an open-source data orchestrator for machine learning, analytics, and ETL workflows.
 It lets you define pipelines in terms of the data flow between reusable, logical components.
@@ -36,14 +54,6 @@ It is widely used in the data science, mathematics and machine learning communit
 In the case of this practical work it can be used to test and evaluate the active learning loop before implementing it in a Dagster pipeline.
 \cite{jupyter}
 
-\subsubsection{Muffin vs chihuahua}
-Muffin vs chihuahua is a free dataset available on Kaggle.
-It consists of $\sim6000$ images of muffins and chihuahuas.
-This is expected to be a relatively hard classification task because the eyes of chihuahuas and chocolate parts of muffins look very similar.
-It is used in this practical work for a binary classification task to evaluate the performance of active learning.
-\cite{muffinsvschiuahuakaggle}
-
-\subsection{Methods}\label{subsec:methods}
 
 \subsubsection{Active-Learning}
 Active learning is a subfield of supervised learning.
diff --git a/src/sources.bib b/src/sources.bib
index 4865f21..0202e26 100644
--- a/src/sources.bib
+++ b/src/sources.bib
@@ -82,6 +82,14 @@ and Sardinha, Alberto",
     note = "[Online; accessed 12-April-2024]"
 }
 
+@misc{muffinsvschiuahuakaggle_poster,
+    author = {},
+    title = {{Muffin vs Chihuahua Kaggle Dataset Poster Image}},
+    howpublished = "\url{https://i.postimg.cc/2SXNWP7f/muffin-meme2.jpg}",
+    year = {2024},
+    note = "[Online; accessed 12-April-2024]"
+}
+
 @INCOLLECTION{RubensRecSysHB2010,
     author = {Neil Rubens and Dain Kaplan and Masashi Sugiyama},
     title = {Active Learning in Recommender Systems},