The basic idea is that unlabeled data can significantly improve model performance when used in combination with labeled data~\cite{Xu_2022_CVPR}.
Convolutional neural networks (CNNs) are especially well-suited model architectures for processing images, speech, and audio signals.
A CNN typically consists of convolutional layers, pooling layers, and fully connected layers.
A convolutional layer consists of a set of learnable kernels (filters).
Each filter performs a convolution operation by sliding a window over the input image.
At every position, the dot product between the filter and the underlying image patch yields one entry of the resulting feature map.
In this way, convolutional layers capture features such as edges, textures, or shapes.
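To illustrate this sliding-window dot product, the following minimal NumPy sketch computes a single feature map for one filter; the image size and the example filter are chosen purely for illustration.
\begin{verbatim}
import numpy as np

def conv2d_single(image, kernel):
    # Valid 2D convolution (technically cross-correlation, as in most
    # deep learning frameworks) of one image with one filter, stride 1.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product of the filter with the patch under the window
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(28, 28)           # e.g. a grayscale input image
edge_filter = np.array([[1., 0., -1.],   # simple vertical edge detector
                        [1., 0., -1.],
                        [1., 0., -1.]])
print(conv2d_single(image, edge_filter).shape)  # (26, 26)
\end{verbatim}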
Pooling layers downsample the feature maps created by the convolutional layers.
This reduces the computational complexity of the overall network and helps to mitigate overfitting.
Common pooling operations include average and max pooling.
Finally, after several convolutional layers, the feature maps are flattened and passed to a stack of fully connected layers that performs the classification or regression task.
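To make this layer structure concrete, the following PyTorch sketch stacks convolutional, pooling, and fully connected layers; the layer sizes, the $28 \times 28$ single-channel input, and the one-unit output head are illustrative assumptions and not the exact architecture used in this PW.
\begin{verbatim}
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # Minimal CNN: two conv/pool blocks followed by fully connected layers.
    def __init__(self, num_outputs: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # learnable filters
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # flatten the feature maps
            nn.Linear(32 * 7 * 7, 64),       # assumes 28x28 input images
            nn.ReLU(),
            nn.Linear(64, num_outputs),      # classification head
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(8, 1, 28, 28))  # batch of 8 grayscale 28x28 images
print(logits.shape)                        # torch.Size([8, 1])
\end{verbatim}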
The softmax function is closely related to the Boltzmann distribution and dates back to the 19$^{\textrm{th}}$ century~\cite{Boltzmann}.
$\mathcal{L}(p,q)$ in~\eqref{eq:crelbinarybatch} is the binary cross-entropy loss for a batch of size $\mathcal{B}$ and is used for model training in this PW.
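As a sanity check of this loss, a direct NumPy implementation for a batch of size $\mathcal{B}$ could look as follows; the clipping constant and the toy labels are illustrative, and the exact formulation remains the one given in~\eqref{eq:crelbinarybatch}.
\begin{verbatim}
import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    # Mean binary cross entropy over a batch.
    # p: true labels in {0, 1}, shape (B,)
    # q: predicted probabilities, shape (B,)
    q = np.clip(q, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

labels = np.array([1., 0., 1., 1.])
probs  = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_cross_entropy(labels, probs))  # ~0.44
\end{verbatim}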
In this section the task is formulated as a mathematical problem to give a better understanding of how it is solved.
The model is defined as $g(\pmb{x};\pmb{w})$, where $\pmb{w}$ are the model weights and $\pmb{x}$ is an input sample from the set of all samples $\mathcal{X}$.
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$, where $\mathcal{B} < \mathcal{S}$.
In every iteration of the active learning loop we draw $\mathcal{S}$ random samples~\eqref{eq:batchdef} from the total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
With the help of this metric, the pseudo-predictions can be sorted by their score $S(z)$.
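This drawing and sorting step can be sketched as follows; the placeholder certainty score (distance of the predicted probability from the decision boundary) only stands in for the actual metric $S(z)$ defined above, and the pool size is arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def certainty_score(probs):
    # Placeholder for the score S(z): here the distance of the
    # predicted probability from the decision boundary 0.5.
    return np.abs(probs - 0.5)

unlabeled_indices = np.arange(1000)      # toy unlabeled pool X_U
sample_size = 100                        # hyperparameter S
candidates = rng.choice(unlabeled_indices, size=sample_size, replace=False)
probs = rng.random(sample_size)          # stand-in for predictions of g(x; w)
order = np.argsort(certainty_score(probs))  # ascending: least certain first
sorted_candidates = candidates[order]
\end{verbatim}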
To have a short form for taking the $n$ smallest or largest elements of a set, we define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot}, respectively.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subseteq S \text{ where } a \text{ are the } n \text{ smallest elements of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subseteq S \text{ where } a \text{ are the } n \text{ largest elements of } S
\end{equation}
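In code these two operators correspond to a partial sort; the following NumPy sketch mirrors the notation above, returning indices instead of values because that is more convenient for selecting samples.
\begin{verbatim}
import numpy as np

def min_n(scores, n):
    # Indices of the n smallest scores (min_n in the notation above).
    return np.argsort(scores)[:n]

def max_n(scores, n):
    # Indices of the n largest scores (max_n in the notation above).
    return np.argsort(scores)[-n:]

scores = np.array([0.7, 0.1, 0.9, 0.3, 0.5])
print(scores[min_n(scores, 2)])  # [0.1 0.3]
print(scores[max_n(scores, 2)])  # [0.7 0.9]
\end{verbatim}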
This notation helps to define which subsets of samples are given to the user for labeling.
There are different ways in which this subset can be chosen.
In this PW we experiment with high-certainty first~\ref{subsec:high-certainty-first} and low-certainty first~\ref{subsec:low-certainty-first}.
Furthermore, we evaluate two mixtures of these strategies: half high- and half low-certainty samples, and only the middle section of the sorted certainty scores.
\paragraph{Low certainty first}
We take the samples with the lowest certainty scores first and give them to the user for labeling.
This is the most intuitive way to do active learning and might also be the most beneficial.
\begin{equation}
\mathcal{X}_t = \text{min}_\mathcal{B}(S(z))
\end{equation}
\paragraph{High certainty first}
We take the samples with the highest certainty scores first and give them to the user for labeling.
The idea behind this is that the model is already very certain about its prediction and the user only has to confirm it.
This might help in ignoring labels that are irrelevant for the model.
\begin{equation}
\mathcal{X}_t =\text{max}_\mathcal{B}(S(z))
\end{equation}
\paragraph{Low and high certainty first}
We fill half of the batch size $\mathcal{B}$ with low-certainty samples and the other half with high-certainty samples.
This way we benefit from both low- and high-certainty samples.
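In the notation defined above, this selection can be expressed as
\begin{equation}
\mathcal{X}_t = \text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}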
With $\mathcal{X}_t$ we have now defined the samples we want to label, and the user starts labeling these samples.
After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated.
The loop then starts again with the updated model and draws new unlabeled samples from $\mathcal{X}_U$ as in~\eqref{eq:batchdef}.
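One iteration of this loop can be sketched as follows; the helpers \texttt{score\_fn}, \texttt{select\_fn}, \texttt{user\_label}, and \texttt{train\_on}, as well as the \texttt{predict} interface, are illustrative placeholders for the model, score, and selection strategy defined above.
\begin{verbatim}
import numpy as np

def active_learning_step(model, X_unlabeled, sample_size, batch_size,
                         score_fn, select_fn, user_label, train_on, rng):
    # One iteration of the active learning loop described above.
    # 1. draw S random candidates from the unlabeled pool X_U
    candidates = rng.choice(len(X_unlabeled), size=sample_size, replace=False)
    # 2. score the model predictions on the candidates
    probs = model.predict(X_unlabeled[candidates])
    scores = score_fn(probs)
    # 3. pick the batch X_t according to the chosen strategy (e.g. min_B)
    picked = candidates[select_fn(scores, batch_size)]
    # 4. the user labels X_t and the weights w are updated by training
    X_t, y_t = X_unlabeled[picked], user_label(picked)
    train_on(model, X_t, y_t)
    return picked  # indices to remove from the unlabeled pool
\end{verbatim}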
\paragraph{Further improvement by class balancing}
An intuitive improvement step is to balance the selection over the predicted classes.
The samples selected in the active learning step above, $\mathcal{X}_t$, might all belong to a single class.
This is bad for the learning process because the model might overfit to one class if the same class is selected repeatedly.
Since the true label is unknown during the sample selection process, we cannot simply sort by the true label and balance the samples.
The simplest solution is to use the model's predicted class and balance the selection by taking half of the samples from one predicted class and the other half from the other.
Afterwards, the chosen scoring metric from above is applied to the balanced selection to perform uncertainty sampling or a similar strategy.
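For a binary task this balancing step can be sketched as follows; the fifty-fifty split over the predicted classes and the low-certainty selection follow the description above, while the variable names are illustrative.
\begin{verbatim}
import numpy as np

def balanced_low_certainty(candidates, probs, scores, batch_size):
    # Select B samples, half per predicted class, lowest certainty first.
    pred_class = (probs >= 0.5).astype(int)   # model's predicted class
    picked = []
    for cls in (0, 1):
        in_class = np.where(pred_class == cls)[0]
        # sort this class by certainty score and take the B/2 least certain
        order = in_class[np.argsort(scores[in_class])]
        picked.extend(order[: batch_size // 2])
    # if one class has fewer than B/2 candidates, the batch is simply smaller
    return candidates[np.array(picked, dtype=int)]
\end{verbatim}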
This process can be expressed mathematically for low-certainty sampling as in~\eqref{eq:balancedlowcertainty}.