\section{Material and Methods}\label{sec:material-and-methods}

\subsection{Material}\label{subsec:material}

\subsubsection{Dagster}
Dagster is an open-source data orchestrator for machine learning, analytics, and ETL workflows. It lets you define pipelines in terms of the data flow between reusable, logical components, which makes it possible to build scalable and reliable data workflows. The most important building blocks in Dagster are assets, jobs and ops. An asset is an object in persistent storage together with a description, in code, of how to update that object. Whenever persistent storage is required, e.g.\ for storing a model, metadata or configurations, an asset should be used. Assets can be combined into an asset graph to model the dependencies of the data flow. Jobs are the main triggers of a pipeline and can be started from the web UI, by fixed schedules or by sensors reacting to changes. To perform the actual work in code, an asset consists of a graph of ops. An op is a function that performs a single task and can be used to split the code into reusable components. Dagster also provides a well-built web interface to monitor jobs and pipelines.~\cite{dagster}

\subsubsection{Label Studio}
Label Studio is a data labeling tool that can be used to label image, text, audio and video data, which makes it an excellent choice for incorporating human feedback into an active learning loop. It provides a wide range of annotation interfaces and can be extended with custom ones. Arbitrary data can be passed to the labeling frontend through labeling tasks in the form of JSON files. Label Studio is open source and free to use. It integrates seamlessly with active learning pipelines by allowing custom machine-learning backends to be defined, and it is designed for scalability and can easily be deployed on cloud infrastructure using Kubernetes or Helm.~\cite{labelstudio}

\subsubsection{Jupyter Notebook}\label{subsubsec:jupyternb}
A Jupyter notebook is a shareable document that combines code, its output, text and visualizations. Together with its editor, it provides an environment for fast prototyping and data analysis. It is widely used in the data science, mathematics and machine learning communities. In this practical work it is used to test and evaluate the active learning loop before implementing it in a Dagster pipeline.~\cite{jupyter}

\subsubsection{Muffin vs chihuahua}
Muffin vs chihuahua is a free dataset available on Kaggle. It consists of $\sim6000$ images of muffins and chihuahuas. The classification task is expected to be relatively hard because the eyes of chihuahuas and the chocolate chips of muffins look very similar. In this practical work it is used as a binary classification task to evaluate the performance of active learning.~\cite{muffinsvschiuahuakaggle}

\subsection{Methods}\label{subsec:methods}

\subsubsection{Active Learning}
Active learning is a subfield of supervised learning. The key idea is that if the algorithm is allowed to choose the data it learns from, it can perform better with less data. A supervised classifier requires hundreds or even thousands of labeled samples to perform well, and those samples must be labeled manually by an oracle (a human expert)~\cite{RubensRecSysHB2010}. This clearly creates a huge bottleneck in the training procedure. Active learning aims to overcome this bottleneck by selecting only the most informative samples to be labeled~\cite{settles.tr09}. The active learning process can be modeled as a loop, as shown in Figure~\ref{fig:active-learning-workflow}.
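To make this loop concrete before it is formalized in Section~\ref{subsubsec:mathematicalmodeling}, the following minimal Python sketch mirrors the workflow of Figure~\ref{fig:active-learning-workflow}. The \texttt{model}, \texttt{oracle} and \texttt{unlabeled\_pool} objects and their methods are purely illustrative placeholders and not part of the actual implementation.

\begin{verbatim}
def active_learning_loop(model, oracle, unlabeled_pool,
                         sample_size, query_size, iterations):
    """Minimal sketch of the basic active learning loop (illustrative only)."""
    labeled = []
    for _ in range(iterations):
        # 1. Model inference on `sample_size` randomly drawn unlabeled samples
        candidates = unlabeled_pool.draw(sample_size)
        probs = [model.predict_proba(x) for x in candidates]  # P(class 1)

        # 2. Keep the `query_size` most uncertain samples
        #    (predicted probability closest to 0.5)
        ranked = sorted(zip(probs, candidates), key=lambda pc: abs(pc[0] - 0.5))
        query = [x for _, x in ranked[:query_size]]

        # 3. The oracle (human annotator) labels the selected samples
        labeled.extend((x, oracle.label(x)) for x in query)
        unlabeled_pool.remove(query)

        # 4. Retrain the model on all labeled samples so far, then repeat
        model.fit(labeled)
    return model
\end{verbatim}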
\begin{figure}
    \centering
    \begin{tikzpicture}[node distance=2cm]
        \node (start) [startstop] {Start};
        \node (pro1) [process, above of=start, xshift=4cm] {Uncertain Samples};
        \node (pro2) [io, right of=start, xshift= 2cm] {Model Inference};
        \node (io2) [io, right of=pro1, xshift= 2cm] {Oracle Labeling};
        \node (pro3) [process, below of=io2] {Labeled train samples};
        \node (io3) [io, below of=pro3] {Model Training};
        \node (pro4) [process, below of=pro2, xshift=0cm] {Unlabeled train samples};
        \draw [arrow] (start) -- (pro2);
        \draw [arrow] (pro2) -- (pro1);
        \draw [arrow] (pro1) -- (io2);
        \draw [arrow] (io2) -- (pro3);
        \draw [arrow] (pro3) -- (io3);
        \draw [arrow] (io3) -- (pro4);
        \draw [arrow] (pro4) -- (pro2);
    \end{tikzpicture}
    \caption{Basic active learning workflow}
    \label{fig:active-learning-workflow}
\end{figure}

The active learning loop starts with model inference on $\mathcal{S}$ unlabeled samples. The $\mathcal{B}$ most uncertain of these samples are selected and given to the oracle\footnote{Human annotator} for labeling. The newly labeled samples are then used to train the model. The loop starts again with the updated model and draws new samples from the unlabeled sample set $\mathcal{X}_U$.

\subsubsection{Semi-Supervised Learning}
In traditional supervised learning we have a labeled dataset, where each datapoint is associated with a corresponding target label, and the goal is to fit a model that predicts the labels from the datapoints. In traditional unsupervised learning there are also datapoints but no labels are known; the goal is to find patterns or structures in the data, for example for clustering or dimensionality reduction. Combining the two techniques yields semi-supervised learning: some of the labels are known, but for most of the data only the raw datapoints are available. The basic idea is that the unlabeled data can significantly improve the model performance when used in combination with the labeled data.~\cite{Xu_2022_CVPR}

\subsubsection{ROC and AUC}
A receiver operating characteristic (ROC) curve can be used to measure the performance of a classifier on a binary classification task. Accuracy alone does not reveal much about the balance of the predictions: a classifier may produce many true positives but hardly any true negatives and still achieve a good accuracy. The ROC curve addresses this problem by plotting the true-positive rate against the false-positive rate. The closer the curve approaches the upper-left corner, the better the classifier.

\begin{figure}
    \centering
    \includegraphics[width=\linewidth/2]{../rsc/Roc_curve.svg}
    \caption{ROC curve comparison of two classifiers.~\cite{ROCWikipedia}}
    \label{fig:roc-example}
\end{figure}

Furthermore, the area under this curve (AUC) is a useful metric to measure the performance of a binary classifier.~\cite{suptechniques}

\subsubsection{ResNet}
Residual neural networks are a special type of neural network architecture. They make it possible to train very deep networks and have been used in many state-of-the-art computer vision models. The main idea behind ResNet is the skip connection: a direct connection from one layer to a later layer that bypasses the layers in between. This helps to avoid the vanishing gradient problem and thereby eases the training of very deep networks. ResNet has proven to be very successful in many computer vision tasks and is used in this practical work for the classification task.
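To illustrate the skip connection, the following PyTorch-style sketch shows a simplified residual block. It is only a minimal illustration of adding the block input back to the transformed output and is not the exact block of the ResNet-18 implementation used in this practical work.

\begin{verbatim}
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: add the unmodified input to the transformed output
        return self.relu(out + x)
\end{verbatim}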
There are several different ResNet architectures, the most common being ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152~\cite{resnet}. Since the dataset is relatively small and the two-class classification task is comparatively easy, the ResNet-18 architecture is used in this practical work.

\subsubsection{CNN}
Convolutional neural networks are especially well suited for processing images, speech and audio signals. A CNN typically consists of convolutional layers, pooling layers and fully connected layers. A convolutional layer is a set of learnable kernels (filters). Each filter performs a convolution operation by sliding a window over the image; at every position the dot product between the filter and the underlying image patch produces one entry of the resulting feature map. Convolutional layers capture features like edges, textures or shapes. Pooling layers downsample the feature maps created by the convolutional layers. This reduces the computational complexity of the overall network and helps against overfitting. Common pooling layers include average and max pooling. Finally, after several convolutional layers the feature map is flattened and passed to a network of fully connected layers to perform a classification or regression task. Figure~\ref{fig:cnn-architecture} shows a typical architecture for a binary classification task.~\cite{cnnintro}

\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{../rsc/cnn_architecture}
    \caption{Architecture of a convolutional neural network.~\cite{cnnarchitectureimg}}
    \label{fig:cnn-architecture}
\end{figure}

\subsubsection{Softmax}
The Softmax function converts a vector of $K$ real numbers into a probability distribution. It is a generalization of the sigmoid function and is often used as an activation layer in neural networks.
\begin{equation}\label{eq:softmax}
    \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \quad \text{for } j \in \{1,\dots,K\}
\end{equation}
The Softmax function is closely related to the Boltzmann distribution and was first introduced in the 19$^{\textrm{th}}$ century~\cite{Boltzmann}.

\subsubsection{Cross Entropy Loss}
Cross entropy loss is a well-established loss function in machine learning. Equation~\eqref{eq:crelformal} shows its formal, general definition, and~\eqref{eq:crelbinary} is the special case for binary classification tasks.
\begin{align}
    H(p,q) &= -\sum_{x\in\mathcal{X}} p(x)\, \log q(x)\label{eq:crelformal}\\
    H(p,q) &= - (p \log q + (1-p) \log(1-q))\label{eq:crelbinary}\\
    \mathcal{L}(p,q) &= - \frac{1}{\mathcal{B}} \sum_{i=1}^{\mathcal{B}} \big(p_i \log q_i + (1-p_i) \log(1-q_i)\big)\label{eq:crelbinarybatch}
\end{align}
$\mathcal{L}(p,q)$~\eqref{eq:crelbinarybatch} is the binary cross entropy loss for a batch of size $\mathcal{B}$ and is used for model training in this practical work.~\cite{crossentropy}

\subsubsection{Mathematical Modeling of the Problem}\label{subsubsec:mathematicalmodeling}
Here the task is modeled as a mathematical problem to get a better understanding of how it is solved. The model is defined as $g(\pmb{x};\pmb{w})$, where $\pmb{w}$ are the model weights and $\pmb{x} \in \mathcal{X}$ the input samples. We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$, where $\mathcal{B} < \mathcal{S}$. In every active learning loop iteration we draw $\mathcal{S}$ random samples~\eqref{eq:batchdef} from the total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
\begin{equation}\label{eq:batchdef}
    \pmb{x} \coloneqq (\pmb{x}_1,\dots,\pmb{x}_\mathcal{S}) \sim \mathcal{X}_U
\end{equation}
The model with the weights of the current loop iteration then produces pseudo predictions.
\begin{equation}\label{eq:equation2}
    z = g(\pmb{x};\pmb{w})
\end{equation}
These predictions may take any numerical value and have to be transformed into a proper probability distribution that sums up to 1. The Softmax function has exactly this effect: $\sum^{K}_{k=1}\sigma(z)_k=1$. Since we have a two-class problem, the Softmax yields two values, the probabilities of the sample belonging to each of the two classes. We want to calculate the distance to the class center: the farther a prediction is from the center, the more certain it is; vice versa, the closer the prediction is to the center, the more uncertain it is. Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$. That means taking the absolute difference between the prediction and the class center yields the certainty of the sample~\eqref{eq:certainty}~\cite{activelearning}.
\begin{align}\label{eq:certainty}
    S(z) = | 0.5 - \sigma(\mathbf{z})_0| \quad \text{or} \quad S(z) = \max \sigma(\mathbf{z}) - 0.5
\end{align}
With the help of this metric the pseudo predictions can be sorted by the score $S(z)$. We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot} as a short form for taking the subset of the $n$ smallest or largest elements of a set.
\begin{equation}\label{eq:minnot}
    \text{min}_n(S) \coloneqq a \subseteq S \quad \text{where } a \text{ contains the } n \text{ smallest elements of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
    \text{max}_n(S) \coloneqq a \subseteq S \quad \text{where } a \text{ contains the } n \text{ largest elements of } S
\end{equation}
This notation helps to define which subset of samples is given to the user for labeling. There are different ways in which this subset can be chosen. In this practical work we perform the obvious experiments with low-certainty first~\ref{par:low-certainty-first} and high-certainty first~\ref{par:high-certainty-first}. Furthermore, two mixtures of them are evaluated: half low- and half high-certainty samples, and only the middle section of the sorted certainty scores.

\paragraph{Low certainty first}\label{par:low-certainty-first}
We take the samples with the lowest certainty score first and give them to the user for labeling. This is the most intuitive way to do active learning and might also be the most beneficial.
\begin{equation}
    \mathcal{X}_t = \text{min}_\mathcal{B}(S(z))
\end{equation}

\paragraph{High certainty first}\label{par:high-certainty-first}
We take the samples with the highest certainty score first and give them to the user for labeling. The idea behind this is that the model is already very certain about these predictions and the user only has to confirm them. This might help to ignore samples that are irrelevant for the model.
\begin{equation}
    \mathcal{X}_t = \text{max}_\mathcal{B}(S(z))
\end{equation}

\paragraph{Low and high certainty first}
We fill half of the batch size $\mathcal{B}$ with low-certainty samples and the other half with high-certainty samples, in order to benefit from both.
\begin{equation}
    \mathcal{X}_t = \text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}

\paragraph{Mid certainty first}
We take the middle section of the sorted certainty scores, to close the gap and include a fourth variation in the experiments.
This is expected to perform the worst but might still be better than random sampling in some cases.
\begin{equation}
    \mathcal{X}_t = S(z) \setminus (\text{min}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)))
\end{equation}

\paragraph{Model training}
With $\mathcal{X}_t$ we have now defined the samples we want to label, and the user starts labeling them. After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated. The loop starts again with the new model and draws new unlabeled samples from $\mathcal{X}_U$ as in~\eqref{eq:batchdef}.

\paragraph{Further improvement by class balancing}\label{par:furtherimprovements}
An intuitive improvement step might be the balancing of the class predictions. The samples selected for $\mathcal{X}_t$ in the active learning step above might all come from one class, which is bad for the learning process because the model might overfit to one class if always the same class is selected. Since the true label is unknown during the sample selection process, we cannot simply sort by the true label and balance the samples. The simplest solution is to use the model's predicted class and balance the selection by taking half of the samples from one predicted class and the other half from the other. Afterwards, the selected scoring metric from above, e.g.\ uncertainty sampling, is applied to the balanced selection. This process is shown mathematically for low-certainty sampling in~\eqref{eq:balancedlowcertainty}; a code sketch follows below.
\begin{equation}\label{eq:balancedlowcertainty}
    \mathcal{X}_t = \text{min}_{\mathcal{B}/2}\left(\left\{\alpha \in S(z) : \alpha_0 < 0.5\right\}\right) \cup \text{min}_{\mathcal{B}/2}\left(\left\{\alpha \in S(z) : \alpha_1 < 0.5\right\}\right)
\end{equation}
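The following Python sketch shows one possible implementation of this balanced low-certainty selection~\eqref{eq:balancedlowcertainty}. It assumes that the Softmax outputs of the $\mathcal{S}$ drawn samples are available as a NumPy array of shape $(\mathcal{S}, 2)$; the function name and this interface are illustrative assumptions and not the actual pipeline code.

\begin{verbatim}
import numpy as np

def balanced_low_certainty_selection(probs: np.ndarray,
                                     batch_size: int) -> np.ndarray:
    """Return the indices of `batch_size` samples, balanced over the
    model's predicted classes and taking the lowest certainty first."""
    certainty = np.abs(0.5 - probs[:, 0])    # certainty score S(z)
    predicted = probs.argmax(axis=1)          # model's predicted class per sample

    selected = []
    for cls in (0, 1):                        # balance over the two predicted classes
        idx = np.where(predicted == cls)[0]
        # sort this class's samples by ascending certainty (most uncertain first)
        order = idx[np.argsort(certainty[idx])]
        selected.extend(order[: batch_size // 2])
    return np.array(selected)

# Hypothetical usage with Softmax outputs of the sampled unlabeled data:
# probs = softmax(g(x, w))           # shape (S, 2)
# query_indices = balanced_low_certainty_selection(probs, batch_size)
\end{verbatim}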