The basic idea is that unlabeled data can significantly improve model performance when used in combination with labeled data~\cite{Xu_2022_CVPR}.
Convolutional neural networks (CNNs) are especially well-suited model architectures for processing images, speech, and audio signals.
A CNN typically consists of convolutional layers, pooling layers, and fully connected layers.
A convolutional layer consists of a set of learnable kernels (filters).
Each filter performs a convolution operation by sliding a window over the input image.
At every position, the dot product between the filter and the underlying image patch yields one entry of the resulting feature map.
In this way, convolutional layers capture features such as edges, textures, or shapes.
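To illustrate this sliding-window dot product, the following minimal NumPy sketch computes a single feature map for one filter; the image size and the example filter are chosen purely for illustration.
\begin{verbatim}
import numpy as np

def conv2d_single(image, kernel):
    # Valid 2D convolution (technically cross-correlation, as in most
    # deep learning frameworks) of one image with one filter, stride 1.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product of the filter with the patch under the window
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(28, 28)           # e.g. a grayscale input image
edge_filter = np.array([[1., 0., -1.],   # simple vertical edge detector
                        [1., 0., -1.],
                        [1., 0., -1.]])
print(conv2d_single(image, edge_filter).shape)  # (26, 26)
\end{verbatim}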
Pooling layers downsample the feature maps created by the convolutional layers.
This reduces the computational complexity of the overall network and helps to mitigate overfitting.
Common pooling operations include average and max pooling.
Finally, after several convolutional layers, the feature maps are flattened and passed to a stack of fully connected layers that performs the classification or regression task.
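To make this layer structure concrete, the following PyTorch sketch stacks convolutional, pooling, and fully connected layers; the layer sizes, the $28 \times 28$ single-channel input, and the one-unit output head are illustrative assumptions and not the exact architecture used in this PW.
\begin{verbatim}
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # Minimal CNN: two conv/pool blocks followed by fully connected layers.
    def __init__(self, num_outputs: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # flatten the feature maps
            nn.Linear(32 * 7 * 7, 64),       # assumes 28x28 input images
            nn.ReLU(),
            nn.Linear(64, num_outputs),      # classification head
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(8, 1, 28, 28))  # batch of 8 grayscale 28x28 images
print(logits.shape)                        # torch.Size([8, 1])
\end{verbatim}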
The softmax function is closely related to the Boltzmann distribution and dates back to the 19$^{\textrm{th}}$ century~\cite{Boltzmann}.
$\mathcal{L}(p,q)$ in~\eqref{eq:crelbinarybatch} is the binary cross-entropy loss for a batch of size $\mathcal{B}$ and is used for model training in this PW.
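As a sanity check of this loss, a direct NumPy implementation for a batch of size $\mathcal{B}$ could look as follows; the clipping constant and the toy labels are illustrative, and the exact formulation remains the one given in~\eqref{eq:crelbinarybatch}.
\begin{verbatim}
import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    # Mean binary cross entropy over a batch.
    # p: true labels in {0, 1}, shape (B,)
    # q: predicted probabilities, shape (B,)
    q = np.clip(q, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

labels = np.array([1., 0., 1., 1.])
probs  = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_cross_entropy(labels, probs))  # ~0.44
\end{verbatim}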
In this section the task is formulated as a mathematical problem to give a better understanding of how it is solved.
The model is defined as $g(\pmb{x};\pmb{w})$, where $\pmb{w}$ are the model weights and $\pmb{x}$ is an input sample from the set of all samples $\mathcal{X}$.
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$, where $\mathcal{B} < \mathcal{S}$.
In every iteration of the active learning loop we draw $\mathcal{S}$ random samples~\eqref{eq:batchdef} from the total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
With the help of this metric, the pseudo-predictions can be sorted by their score $S(z)$.
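This drawing and sorting step can be sketched as follows; the placeholder certainty score (distance of the predicted probability from the decision boundary) only stands in for the actual metric $S(z)$ defined above, and the pool size is arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def certainty_score(probs):
    # Placeholder for the score S(z): here the distance of the
    # predicted probability from the decision boundary 0.5.
    return np.abs(probs - 0.5)

unlabeled_indices = np.arange(1000)      # toy unlabeled pool X_U
sample_size = 100                        # hyperparameter S
candidates = rng.choice(unlabeled_indices, size=sample_size, replace=False)
probs = rng.random(sample_size)          # stand-in for predictions of g(x; w)
order = np.argsort(certainty_score(probs))  # ascending: least certain first
sorted_candidates = candidates[order]
\end{verbatim}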
To have a short form for taking the $n$ smallest or largest elements of a set, we define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot}, respectively.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subseteq S \text{ where } a \text{ are the } n \text{ smallest elements of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subseteq S \text{ where } a \text{ are the } n \text{ largest elements of } S
\end{equation}
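In code these two operators correspond to a partial sort; the following NumPy sketch mirrors the notation above, returning indices instead of values because that is more convenient for selecting samples.
\begin{verbatim}
import numpy as np

def min_n(scores, n):
    # Indices of the n smallest scores (min_n in the notation above).
    return np.argsort(scores)[:n]

def max_n(scores, n):
    # Indices of the n largest scores (max_n in the notation above).
    return np.argsort(scores)[-n:]

scores = np.array([0.7, 0.1, 0.9, 0.3, 0.5])
print(scores[min_n(scores, 2)])  # [0.1 0.3]
print(scores[max_n(scores, 2)])  # [0.7 0.9]
\end{verbatim}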
This notation helps to define which subsets of samples are given to the user for labeling.
There are different ways in which this subset can be chosen.
In this PW we experiment with high-certainty first~\ref{subsec:high-certainty-first} and low-certainty first~\ref{subsec:low-certainty-first}.
Furthermore, we evaluate two mixtures of these strategies: half high- and half low-certainty samples, and only the middle section of the sorted certainty scores.
\paragraph{Low certainty first}
We take the samples with the lowest certainty scores first and give them to the user for labeling.
This is the most intuitive way to do active learning and might also be the most beneficial.
\begin{equation}
\mathcal{X}_t = \text{min}_\mathcal{B}(S(z))
\end{equation}
\paragraph{High certainty first}
We take the samples with the highest certainty scores first and give them to the user for labeling.
The idea behind this is that the model is already very certain about its prediction and the user only has to confirm it.
This might help in ignoring labels that are irrelevant for the model.
\begin{equation}
\mathcal{X}_t =\text{max}_\mathcal{B}(S(z))
\end{equation}
\paragraph{Low and high certainty first}
We fill half of the batch size $\mathcal{B}$ with low-certainty samples and the other half with high-certainty samples.
This way we benefit from both low- and high-certainty samples.
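In the notation defined above, this selection can be expressed as
\begin{equation}
\mathcal{X}_t = \text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}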
With $\mathcal{X}_t$ we have now defined the samples we want to label, and the user starts labeling these samples.
After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated.
The loop then starts again with the updated model and draws new unlabeled samples from $\mathcal{X}_U$ as in~\eqref{eq:batchdef}.
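One iteration of this loop can be sketched as follows; the helpers \texttt{score\_fn}, \texttt{select\_fn}, \texttt{user\_label}, and \texttt{train\_on}, as well as the \texttt{predict} interface, are illustrative placeholders for the model, score, and selection strategy defined above.
\begin{verbatim}
import numpy as np

def active_learning_step(model, X_unlabeled, sample_size, batch_size,
                         score_fn, select_fn, user_label, train_on, rng):
    # One iteration of the active learning loop described above.
    # 1. draw S random candidates from the unlabeled pool X_U
    candidates = rng.choice(len(X_unlabeled), size=sample_size, replace=False)
    # 2. score the model predictions on the candidates
    probs = model.predict(X_unlabeled[candidates])
    scores = score_fn(probs)
    # 3. pick the batch X_t according to the chosen strategy (e.g. min_B)
    picked = candidates[select_fn(scores, batch_size)]
    # 4. the user labels X_t and the weights w are updated by training
    X_t, y_t = X_unlabeled[picked], user_label(picked)
    train_on(model, X_t, y_t)
    return picked  # indices to remove from the unlabeled pool
\end{verbatim}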
\paragraph{Further improvement by class balancing}
An intuitive improvement step is to balance the selection over the predicted classes.
The samples selected in the active learning step above, $\mathcal{X}_t$, might all belong to a single class.
This is bad for the learning process because the model might overfit to one class if the same class is selected repeatedly.
Since the true label is unknown during the sample selection process, we cannot simply sort by the true label and balance the samples.
The simplest solution is to use the model's predicted class and balance the selection by taking half of the samples from one predicted class and the other half from the other.
Afterwards, the chosen scoring metric from above is applied to the balanced selection to perform uncertainty sampling or a similar strategy.
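For a binary task this balancing step can be sketched as follows; the fifty-fifty split over the predicted classes and the low-certainty selection follow the description above, while the variable names are illustrative.
\begin{verbatim}
import numpy as np

def balanced_low_certainty(candidates, probs, scores, batch_size):
    # Select B samples, half per predicted class, lowest certainty first.
    pred_class = (probs >= 0.5).astype(int)   # model's predicted class
    picked = []
    for cls in (0, 1):
        in_class = np.where(pred_class == cls)[0]
        # sort this class by certainty score and take the B/2 least certain
        order = in_class[np.argsort(scores[in_class])]
        picked.extend(order[: batch_size // 2])
    # if one class has fewer than B/2 candidates, the batch is simply smaller
    return candidates[np.array(picked, dtype=int)]
\end{verbatim}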
This process can be expressed mathematically for low-certainty sampling as in~\eqref{eq:balancedlowcertainty}.