A Jupyter notebook is a shareable document that combines code, its output, text, and visualizations.
Together with its editor, the notebook provides an environment for fast prototyping and data analysis.
It is widely used in the data science, mathematics, and machine learning communities.
In this practical work it can be used to test and evaluate the active learning loop before implementing it in a Dagster pipeline~\cite{jupyter}.
The basic idea is that unlabeled data, used in combination with labeled data, can significantly improve model performance~\cite{Xu_2022_CVPR}.
Residual neural networks (ResNets) are a special type of neural network architecture.
They are particularly well suited for training deep models and have been used in many state-of-the-art computer vision systems.
The main idea behind ResNet is the skip connection.
A skip connection is a direct connection from one layer to a later layer, bypassing one or more layers in between.
This mitigates the vanishing gradient problem and makes it possible to train very deep networks.
ResNet has proven very successful in many computer vision tasks and is used for the classification task in this practical work.
There are several ResNet architectures; the most common are ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152~\cite{resnet}.
Since the dataset is relatively small and the two-class classification task is comparatively easy, the ResNet-18 architecture is used in this practical work.
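As an illustration, a minimal sketch of how ResNet-18 could be set up for the binary task, assuming PyTorch and torchvision (the text above does not prescribe a framework):
\begin{verbatim}
import torch.nn as nn
from torchvision import models

# Sketch only: assumes PyTorch/torchvision, which is not prescribed above.
model = models.resnet18()  # plain ResNet-18 backbone, randomly initialized

# Replace the final fully connected layer with a single logit
# for the two-class task (to be used with a binary cross-entropy loss).
model.fc = nn.Linear(model.fc.in_features, 1)
\end{verbatim}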
Convolutional neural networks (CNNs) are especially well suited for processing images, speech, and audio signals.
A CNN typically consists of convolutional layers, pooling layers, and fully connected layers.
A convolutional layer holds a set of learnable kernels (filters).
Each filter slides over the input and computes a dot product between the filter weights and the local image patch at every position, producing a feature map.
Convolutional layers capture features such as edges, textures, or shapes.
Pooling layers downsample the feature maps produced by the convolutional layers.
This reduces the computational complexity of the network and helps to prevent overfitting.
Common pooling operations include average and max pooling.
Finally, after several convolutional layers, the feature maps are flattened and passed to a stack of fully connected layers that perform the classification or regression task.
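As a small illustration of this layer structure, a minimal sketch assuming PyTorch and $64\times64$ RGB inputs (both illustrative choices, not taken from the text):
\begin{verbatim}
import torch.nn as nn

# Minimal CNN sketch: convolution -> pooling -> fully connected.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learnable filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample feature maps
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # flatten the feature maps
    nn.Linear(32 * 16 * 16, 1),                  # fully connected classifier head
)
\end{verbatim}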
The softmax function is closely related to the Boltzmann distribution, which was first introduced in the 19$^{\textrm{th}}$ century~\cite{Boltzmann}.
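For a vector of logits $\pmb{z} = (z_1, \dots, z_K)$, the softmax maps each entry to a probability:
\begin{equation}
\sigma(\pmb{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.
\end{equation}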
$\mathcal{L}(p,q)$~\eqref{eq:crelbinarybatch} is the binary cross-entropy loss for a batch of size $\mathcal{B}$ and is used for model training in this practical work.
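As a brief usage sketch, assuming a PyTorch setup (not prescribed by the text), the batch loss can be computed from raw logits as follows:
\begin{verbatim}
import torch
import torch.nn as nn

# Sketch only: assumes PyTorch; tensor shapes are illustrative.
logits  = torch.randn(8, 1)                    # raw model outputs, batch size B = 8
targets = torch.randint(0, 2, (8, 1)).float()  # binary labels

criterion = nn.BCEWithLogitsLoss()             # binary cross entropy on logits
loss = criterion(logits, targets)              # averaged over the batch by default
\end{verbatim}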
Here the task is formulated as a mathematical problem to give a better understanding of how it is solved.
The model is defined as $g(\pmb{x};\pmb{w})$, where $\pmb{w}$ are the model weights and $\pmb{x} \in \mathcal{X}$ is an input sample.
We define two hyperparameters: the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$, where $\mathcal{B} < \mathcal{S}$.
In every iteration of the active learning loop we draw $\mathcal{S}$ random samples~\eqref{eq:batchdef} from the unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
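A possible implementation of this sampling step, assuming the unlabeled pool is held as a Python list (names are illustrative):
\begin{verbatim}
import random

def draw_candidates(unlabeled_pool, sample_size):
    """Draw S random samples from the unlabeled pool X_U (sketch only)."""
    return random.sample(unlabeled_pool, k=sample_size)
\end{verbatim}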
With the help of this metric, the pseudo-predictions can be sorted by their score $S(z)$.
We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot} as a shorthand for selecting the $n$ smallest or largest elements of a set.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subseteq S \text{, where } a \text{ contains the } n \text{ smallest elements of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subseteq S \text{, where } a \text{ contains the } n \text{ largest elements of } S
\end{equation}
This notation helps to define which subset of samples to present to the user for labeling.
There are different ways in which this subset can be chosen.
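In code, $\text{min}_n(S)$ and $\text{max}_n(S)$ simply correspond to taking the $n$ smallest or largest scores, for example:
\begin{verbatim}
def min_n(scores, n):
    """The n smallest elements of the score set (cf. the min notation above)."""
    return sorted(scores)[:n]

def max_n(scores, n):
    """The n largest elements of the score set (cf. the max notation above)."""
    return sorted(scores, reverse=True)[:n]
\end{verbatim}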
We denote the samples selected for labeling by $\mathcal{X}_t$; the user then labels these samples.
After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated.
The loop then starts again: the updated model draws new unlabeled samples from $\mathcal{X}_U$ as in~\eqref{eq:batchdef}.
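Putting the steps together, one possible skeleton of the loop looks as follows; the helper names (\texttt{draw\_candidates}, \texttt{score}, \texttt{request\_labels}, \texttt{train}) are illustrative placeholders, not part of the described pipeline:
\begin{verbatim}
def active_learning_loop(model, unlabeled_pool, sample_size, batch_size):
    """Sketch of the active learning loop described above."""
    while unlabeled_pool:
        # 1. Draw S random candidates from the unlabeled pool X_U.
        candidates = draw_candidates(unlabeled_pool, sample_size)

        # 2. Score the pseudo-predictions and pick the B least certain samples.
        scored = sorted(candidates, key=lambda x: score(model, x))
        to_label = scored[:batch_size]

        # 3. The user labels the selected samples X_t.
        labeled_batch = request_labels(to_label)

        # 4. Update the model weights w with the newly labeled samples.
        model = train(model, labeled_batch)

        # 5. Remove the labeled samples from the pool and start over.
        for x in to_label:
            unlabeled_pool.remove(x)
    return model
\end{verbatim}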
\paragraph{Further improvement by class balancing}
An intuitive improvement is to balance the selection across the predicted classes.
The samples in $\mathcal{X}_t$ selected by the active learning step above might all belong to the same class.
This harms the learning process because the model might overfit to a single class if the same class is selected repeatedly.
Since the true labels are unknown during sample selection, we cannot simply sort by the true label to balance the samples.
The simplest solution is to use the model's predicted class and to balance the selection by taking half of the samples from one predicted class and half from the other.
Afterwards, the chosen scoring metric from above, for example uncertainty sampling, is applied to each of the two balanced subsets.
This process is shown mathematically for low-certainty sampling in~\eqref{eq:balancedlowcertainty}.
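A code-level sketch of this class-balanced low-certainty selection, again with illustrative helper names (\texttt{predict\_class}, \texttt{score}):
\begin{verbatim}
def balanced_low_certainty_selection(model, candidates, batch_size):
    """Pick B/2 low-certainty samples per predicted class (sketch only)."""
    # Split the candidates by the model's predicted class.
    class_0 = [x for x in candidates if predict_class(model, x) == 0]
    class_1 = [x for x in candidates if predict_class(model, x) == 1]

    # Take the B/2 least certain samples of each predicted class.
    half = batch_size // 2
    least_certain = lambda xs: sorted(xs, key=lambda x: score(model, x))[:half]
    return least_certain(class_0) + least_certain(class_1)
\end{verbatim}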