Describe the muffin dataset in more detail

Experimental results: describe the setup in more detail
2024-06-04 17:37:46 +02:00
parent 5d6e8177da
commit ef23935c93
4 changed files with 40 additions and 10 deletions
@@ -2,9 +2,21 @@
 \subsection{Does Active-Learning benefit the learning process?}\label{subsec:does-active-learning-benefit-the-learning-process?}
-With the test setup described in section~\ref{sec:implementation} a test series was performed.
+A test series was performed inside a Jupyter notebook.
 The active learning loop starts with an untrained ResNet-18 model and a random selection of samples.
 The muffin and chihuahua dataset was used for this binary classification task.
 The dataset is split into a training and a test set, which contain $\sim4750$ training and $\sim1250$ test images, respectively.
 (See~\ref{subsec:material-and-methods} for more information.)
 CrossEntropyLoss was used as the loss function, together with the Adam optimizer with a learning rate of $0.0001$.
 $\mathcal{B}$ samples are selected from the $\mathcal{S}$ samples and labeled by an oracle.
 Here, the oracle simply labels the samples with the correct class, because the dataset is synthetic and the labels are known.
 No human annotator was used, because manual labeling is very time-consuming and the goal is to benchmark the active learning process itself.
 Afterwards, the model is trained with these labeled samples and the loop starts again by predicting $\mathcal{B}$ samples from the $\mathcal{S}$ drawn samples.
 Several different batch sizes $\mathcal{B} = \left\{ 2,4,6,8 \right\}$ and sample sizes $\mathcal{S} = \left\{ 2\mathcal{B}_i,4\mathcal{B}_i,5\mathcal{B}_i,10\mathcal{B}_i \right\}$
-dependent on the selected batch size were selected.
+dependent on the selected batch size were used.
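 For concreteness, the definition above expands to the following sample sizes per batch size.
 The values are derived purely from $\mathcal{S} = \left\{ 2\mathcal{B}_i,4\mathcal{B}_i,5\mathcal{B}_i,10\mathcal{B}_i \right\}$; the table layout is only an illustration and makes no statement about which combinations are shown in the graphs.
 \[
 \begin{array}{c|cccc}
 \mathcal{B}_i & 2\mathcal{B}_i & 4\mathcal{B}_i & 5\mathcal{B}_i & 10\mathcal{B}_i \\
 \hline
 2 & 4 & 8 & 10 & 20 \\
 4 & 8 & 16 & 20 & 40 \\
 6 & 12 & 24 & 30 & 60 \\
 8 & 16 & 32 & 40 & 80
 \end{array}
 \]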
 We define the baseline (passive learning) AUC curve as the supervised learning process without any active learning.
 The following graphs show only a subset of the test series, namely the runs that give the most insight.
@@ -2,6 +2,24 @@
 \subsection{Material}\label{subsec:material}
 \subsubsection{Muffin vs chihuahua}
 Muffin vs chihuahua is a free dataset available on Kaggle.
 It consists of $\sim6000$ images of the two classes muffins and chihuahuas.
 The source data is scraped from Google Images and is split into a training and a test set.
 The training set contains $\sim4750$ images and the test set $\sim1250$ images; overall, the two classes are almost balanced.
 This is expected to be a relatively hard classification task because the eyes of chihuahuas and chocolate parts of muffins look very similar.
 It is used in this practical work as a binary classification task to evaluate the performance of active learning.\cite{muffinsvschiuahuakaggle}
 \begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{../rsc/muffin_chiauaua_poster}
    \caption{Sample images from the dataset. \cite{muffinsvschiuahuakaggle_poster}}
    \label{fig:roc-example}
 \end{figure}
 \subsection{Methods}\label{subsec:methods}
 \subsubsection{Dagster}
 Dagster is an open-source data orchestrator for machine learning, analytics, and ETL workflows.
 It lets you define pipelines in terms of the data flow between reusable, logical components.
@@ -36,14 +54,6 @@ It is widely used in the data science, mathematics and machine learning communit
 In the case of this practical work it can be used to test and evaluate the active learning loop before implementing it in a Dagster pipeline. \cite{jupyter}
 \subsubsection{Muffin vs chihuahua}
 Muffin vs chihuahua is a free dataset available on Kaggle.
 It consists of $\sim6000$ images of muffins and chihuahuas.
 This is expected to be a relatively hard classification task because the eyes of chihuahuas and chocolate parts of muffins look very similar.
 It is used in this practical work for a binary classification task to evaluate the performance of active learning.
 \cite{muffinsvschiuahuakaggle}
 \subsection{Methods}\label{subsec:methods}
 \subsubsection{Active-Learning}
 Active learning is a subfield of supervised learning.
@@ -82,6 +82,14 @@ and Sardinha, Alberto",
    note = "[Online; accessed 12-April-2024]"
 }
@misc{muffinsvschiuahuakaggle_poster,
    author = {},
    title = {{Muffin vs Chihuahua Kaggle Dataset Poster Image}},
    howpublished = "\url{https://i.postimg.cc/2SXNWP7f/muffin-meme2.jpg}",
    year = {2024},
    note = "[Online; accessed 12-April-2024]"
 }
@INCOLLECTION{RubensRecSysHB2010,
 author = {Neil Rubens and Dain Kaplan and Masashi Sugiyama},
 title = {Active Learning in Recommender Systems},