move math stuff to methods

add some sources
lukas-heiligenbrunner 2024-05-06 16:06:47 +02:00
parent 74ea31cf9c
commit 74ed28a377
5 changed files with 204 additions and 116 deletions

View File

@ -4,7 +4,10 @@
Active learning can hugely benefit the learning process when applied correctly.
The lower the batch size $\mathcal{B}$ the more improvement one can expect.
However, the higher the sampling space $\mathcal{S}$, the higher the gains, but the more computational effort is required.
Another possible drawback is that reducing the uncertainty might not always be the best choice.
If a system becomes certain about samples, this does not necessarily improve the accuracy, since it can simply be certain about the wrong thing. \cite{RubensRecSysHB2010}
\subsection{Outlook}\label{subsec:outlook}

View File

@ -20,104 +20,5 @@ The sample-selection metric might select samples just from one class by chance.
Does balancing this distribution help the model performance?
\subsection{Outline}\label{subsec:outline}

View File

@ -12,6 +12,20 @@
\usepackage{xcolor}
\usepackage{subfig}
\usepackage[inline]{enumitem}
\usepackage{color}
\usepackage{tikz}
\usetikzlibrary{shapes.geometric, arrows}
\tikzstyle{startstop} = [rectangle, rounded corners, minimum width=3cm, minimum height=1cm,text centered, draw=black, fill=red!30]
\tikzstyle{io} = [rectangle, rounded corners,minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!30]
\tikzstyle{process} = [rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=orange!30]
\tikzstyle{decision} = [diamond, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!30]
\tikzstyle{arrow} = [thick,->,>=stealth]
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
@ -40,12 +54,6 @@
%\lstset{basicstyle=\ttfamily, keywordstyle=\bfseries}
\usepackage{subfig}
\usepackage[inline]{enumitem}
\usepackage{color}
\if\ieee1
\settopmatter{printacmref=false} % Removes citation information below abstract
\renewcommand\footnotetextcopyrightpermission[1]{} % removes footnote with conference information in first column
@ -69,7 +77,7 @@
%%
%% The "title" command has an optional parameter,
%% allowing the author to define a "short title" to be used in page headers.
\title{Minimize labeling effort of binary classification tasks with active learning}
\title{Minimize labeling effort of binary classification tasks with active learning\\ Practical Work for AI}
%%
%% The "author" command and its associated commands are used to define
@ -137,7 +145,7 @@
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format}
\bibliography{sources}
\bibliography{../src/sources}
%%
%% If your work has an appendix, this is the place to put it.

View File

@ -15,23 +15,57 @@ Jobs are the main triggers of a pipeline and can be triggered by the Web UI, fix
To perform real tasks in code, an Asset consists of a graph of Ops.
An Op is a function that performs a task and can be used to split the code into reusable components.
Dagster has a well-built web interface to monitor jobs and pipelines. \cite{dagster}
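To illustrate these concepts, the following is a minimal sketch of how Ops can be composed into a Job (the Op names and logic are illustrative and not taken from this project):
\begin{lstlisting}[language=Python]
# Minimal Dagster sketch; op names and logic are illustrative only.
from dagster import op, job

@op
def load_samples():
    return ["img_001.png", "img_002.png"]

@op
def pseudo_label(samples):
    return {s: "unlabeled" for s in samples}

@job
def labeling_pipeline():
    pseudo_label(load_samples())

if __name__ == "__main__":
    labeling_pipeline.execute_in_process()
\end{lstlisting}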
\subsubsection{Label-Studio}
\subsubsection{Pytorch}
\subsubsection{NVTec}
\subsubsection{Imagenet}
\subsubsection{Muffin vs chihuahua}
Muffin vs chihuahua is a free dataset available on Kaggle.
It consists of $\sim6000$ images of muffins and chihuahuas.
This is expected to be a relatively hard classification task because the eyes of chihuahuas and chocolate parts of muffins look very similar.
It is used in this practical work for a binary classification task to evaluate the performance of active learning.
\cite{muffinsvschiuahuakaggle}
\subsection{Methods}\label{subsec:methods}
\subsubsection{Active-Learning}
Active learning is a subfield of supervised learning.
The key idea is that if the algorithm is allowed to choose the data it learns from, it can perform better with less data.
A supervised classifier requires hundreds or even thousands of labeled samples to perform well.
Those labeled samples must be manually labeled by an oracle (human expert).
Clearly this results in a huge bottleneck for the training procedure.
Active learning aims to overcome this bottleneck by selecting the most informative samples to be labeled~\cite{RubensRecSysHB2010, settles.tr09}.
\begin{figure}
\centering
\begin{tikzpicture}[node distance=2cm]
\node (start) [startstop] {Start};
\node (pro1) [process, above of=start, xshift=4cm] {Uncertain Samples};
\node (pro2) [io, right of=start, xshift= 2cm] {Model Inference};
\node (io2) [io, right of=pro1, xshift= 2cm] {Oracle Labeling};
\node (pro3) [process, below of=io2] {Labeled train samples};
\node (io3) [io, below of=pro3] {Model Training};
\node (pro4) [process, below of=pro2, xshift=0cm] {Unlabeled train samples};
\draw [arrow] (start) -- (pro2);
\draw [arrow] (pro2) -- (pro1);
\draw [arrow] (pro1) -- (io2);
\draw [arrow] (io2) -- (pro3);
\draw [arrow] (pro3) -- (io3);
\draw [arrow] (io3) -- (pro4);
\draw [arrow] (pro4) -- (pro2);
\end{tikzpicture}
\caption{Basic active-learning workflow}\label{fig:active-learning-workflow}
\end{figure}
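To make the workflow in the figure concrete, the loop can be sketched in Python as follows (\texttt{model}, \texttt{oracle} and their methods are placeholders, not the actual implementation used in this PW):
\begin{lstlisting}[language=Python]
# Schematic active-learning loop; model/oracle are placeholders.
import random

def active_learning_loop(model, unlabeled, oracle, B=10, S=100, iterations=5):
    labeled = []
    for _ in range(iterations):
        pool = random.sample(unlabeled, min(S, len(unlabeled)))  # draw S candidates
        pool.sort(key=model.certainty)     # rank by certainty score S(z)
        batch = pool[:B]                   # take the B most uncertain samples
        labeled += [(x, oracle(x)) for x in batch]  # oracle labels them
        for x in batch:
            unlabeled.remove(x)
        model.fit(labeled)                 # retrain on all labeled data so far
    return model
\end{lstlisting}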
\subsubsection{Semi-Supervised learning}
In traditional supervised learning we have a labeled dataset.
Each datapoint is associated with a corresponding target label.
@ -106,3 +140,109 @@ And~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss
$\mathcal{L}(p,q)$~\eqref{eq:crelbinarybatch} is the Binary Cross Entropy Loss for a batch of size $\mathcal{B}$ and used for model training in this PW.
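As a small illustration, the batch loss can be evaluated with PyTorch as follows (the label and prediction values are made up):
\begin{lstlisting}[language=Python]
# Sketch: binary cross entropy over a batch; values are made up.
import torch
import torch.nn.functional as F

p = torch.tensor([1., 0., 1., 1.])      # true labels
q = torch.tensor([0.9, 0.2, 0.7, 0.6])  # predicted probabilities
loss = F.binary_cross_entropy(q, p)     # mean over the batch
print(loss.item())
\end{lstlisting}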
\subsubsection{Adam}
\subsubsection{Mathematical modeling of problem}
Here the task is modeled as a mathematical problem to get a better understanding of how it is solved.
The model is defined as $g(\pmb{x};\pmb{w})$, where $\pmb{w}$ are the model weights and $\mathcal{X}$ is the set of input samples.
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$ where $\mathcal{B} < \mathcal{S}$.
In every active learning loop iteration we sample $\mathcal{S}$ random samples~\eqref{eq:batchdef} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
\begin{equation}
\label{eq:batchdef}
\pmb{x} \coloneqq (\pmb{x}_1,\dots,\pmb{x}_\mathcal{S}) \sim \mathcal{X}_U
\end{equation}
The model with the weights of the current loop iteration produces pseudo-predictions for these samples.
\begin{equation}\label{eq:equation2}
z = g(\pmb{x};\pmb{w})
\end{equation}
Those predictions might have any numerical value and have to be mapped to a proper probability distribution which sums up to 1.
The Softmax function has exactly this effect: for our two classes, $\sigma(\pmb{z})_0 + \sigma(\pmb{z})_1 = 1$.
Since we have a two-class problem, the Softmax results in two values: the probabilities of the sample belonging to either class.
We then calculate the distance of the prediction to the class center: the farther away a prediction is from the center, the more certain it is.
Vice versa, the closer a prediction is to the center, the more uncertain it is.
Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$.
That means taking the absolute value of the prediction minus the class center results in the certainty of the sample~\eqref{eq:certainty}.
\begin{align}
\label{eq:certainty}
S(z) = | 0.5 - \sigma(\pmb{z})_0| \quad \text{or} \quad S(z) = \max_i \sigma(\pmb{z})_i - 0.5
\end{align}
\cite{activelearning}
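To make the scoring concrete, a minimal PyTorch sketch could look as follows (the logit values are made up):
\begin{lstlisting}[language=Python]
# Sketch: certainty score from two-class logits; values are made up.
import torch
import torch.nn.functional as F

z = torch.tensor([[2.3, -1.1],     # a confident sample
                  [0.1,  0.05]])   # an uncertain sample
probs = F.softmax(z, dim=1)              # rows sum to 1
certainty = (probs[:, 0] - 0.5).abs()    # S(z) = |0.5 - sigma(z)_0|
# equivalently: probs.max(dim=1).values - 0.5
print(certainty)   # large for the first sample, close to 0 for the second
\end{lstlisting}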
With the help of this metric the pseudo-predictions can be sorted by their score $S(z)$.
We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot} as shorthand for taking the $n$ smallest or largest elements of a set.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ smallest numbers of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ largest numbers of } S
\end{equation}
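For example, for $S = \{0.1, 0.7, 0.2, 0.9\}$ this gives $\text{min}_2(S) = \{0.1, 0.2\}$ and $\text{max}_2(S) = \{0.7, 0.9\}$.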
This notation helps to define which subsets of samples are given to the user for labeling.
There are different ways in which this subset can be chosen.
In this PW we run the obvious experiments with Low-Certainty first and High-Certainty first selection.
Furthermore, we consider two mixtures of them: half low- and half high-certainty samples, and only the middle section of the sorted certainty scores.
\paragraph{Low certainty first}
We take the samples with the lowest certainty score first and give them to the user for labeling.
This is the most intuitive way to do active learning and might also be the most beneficial.
\begin{equation}
\mathcal{X}_t = \text{min}_\mathcal{B}(S(z))
\end{equation}
\paragraph{High certainty first}
We take the samples with the highest certainty score first and give them to the user for labeling.
The idea behind this is that the model is already very certain about the prediction and the user only has to confirm it.
This might help to ignore labels which are irrelevant for the model.
\begin{equation}
\mathcal{X}_t =\text{max}_\mathcal{B}(S(z))
\end{equation}
\paragraph{Low and High certain first}
We take half of the batch size $\mathcal{B}$ as low-certainty samples and the other half as high-certainty samples.
This way we benefit from both low- and high-certainty samples.
\begin{equation}
\mathcal{X}_t =\text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}
\paragraph{Mid certain first}
We take the middle section of the certainty scores to close the gap and include a fourth variation in the experiments.
This is expected to perform the worst but might still be better than random sampling in some cases.
\begin{equation}
\mathcal{X}_t =S(z) \setminus (\text{min}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)))
\end{equation}
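The four selection strategies can be summarized in a short NumPy sketch operating on the certainty scores (returning sample indices; this is an illustration, not the actual implementation of this PW):
\begin{lstlisting}[language=Python]
# Sketch of the four subset-selection strategies on certainty scores.
import numpy as np

def select(scores, B, strategy="low"):
    order = np.argsort(scores)   # indices sorted by ascending certainty
    S = len(scores)
    if strategy == "low":        # least certain first
        return order[:B]
    if strategy == "high":       # most certain first
        return order[-B:]
    if strategy == "low_high":   # half least, half most certain
        return np.concatenate([order[:B // 2], order[-(B - B // 2):]])
    if strategy == "mid":        # middle section of the ranking
        start = (S - B) // 2
        return order[start:start + B]
    raise ValueError(strategy)
\end{lstlisting}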
\paragraph{Model training}
Now that we have defined the samples we want to label with $\mathcal{X}_t$, the user starts labeling these samples.
After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated.
The loop starts again with the new model and draws new unlabeled samples from $\mathcal{X}_U$ as in~\eqref{eq:batchdef}.
\paragraph{Further improvement by class balancing}
An intuitive improvement step might be the balancing of the class predictions.
The samples selected in the active learning step above, $\mathcal{X}_t$, might all stem from one class.
This is bad for the learning process because the model might overfit to one class if always the same class is selected.
Since the true label is unknown during the sample selection process, we cannot simply sort by the true label and balance the samples.
The simplest solution is to use the model's predicted class and balance the selection by taking half of the samples from one predicted class and the other half from the other.
Afterwards, the selected scoring metric from above (e.g.\ uncertainty sampling) is applied to the balanced selection.
This process can be shown mathematically for low certainty sampling as in~\eqref{eq:balancedlowcertainty}.
\begin{equation}\label{eq:balancedlowcertainty}
\mathcal{X}_t = \text{min}_{\mathcal{B}/2}(\left\{S(\pmb{z}) : \sigma(\pmb{z})_0 < 0.5\right\}) \cup \text{min}_{\mathcal{B}/2}(\left\{S(\pmb{z}) : \sigma(\pmb{z})_1 < 0.5\right\})
\end{equation}
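A sketch of this balanced low-certainty selection, assuming the softmax outputs are stored as an array of shape $(\mathcal{S}, 2)$ (the function name is illustrative):
\begin{lstlisting}[language=Python]
# Sketch: class-balanced low-certainty selection; names are illustrative.
import numpy as np

def balanced_low_certainty(probs, B):
    certainty = np.abs(probs[:, 0] - 0.5)   # S(z) per sample
    pred = probs.argmax(axis=1)             # predicted class per sample
    picked = []
    for cls in (0, 1):                      # B/2 least certain per predicted class
        idx = np.where(pred == cls)[0]
        idx = idx[np.argsort(certainty[idx])]
        picked.extend(idx[:B // 2].tolist())
    return picked
\end{lstlisting}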

View File

@ -48,4 +48,40 @@ and Sardinha, Alberto",
editor = {Hasenöhrl, FriedrichEditor},
year = {2012},
pages = {4996},
collection = {Cambridge Library Collection - Physical  Sciences}, key = {value},}
@misc{dagster,
title = {{Dagster getting started}},
howpublished = {\url{https://docs.dagster.io/getting-started}},
year = {2024},
note = {[Online; accessed 12-April-2024]}
}
@misc{muffinsvschiuahuakaggle,
title = {{Muffin vs Chihuahua Kaggle Dataset}},
howpublished = {\url{https://www.kaggle.com/datasets/samuelcortinhas/muffin-vs-chihuahua-image-classification/data}},
year = {2024},
note = {[Online; accessed 12-April-2024]}
}
@INCOLLECTION{RubensRecSysHB2010,
author = {Neil Rubens and Dain Kaplan and Masashi Sugiyama},
title = {Active Learning in Recommender Systems},
booktitle = {Recommender Systems Handbook},
publisher = {Springer},
year = {2011},
editor = {P.B. Kantor and F. Ricci and L. Rokach and B. Shapira},
pages = {735-767},
doi = {10.1007/978-0-387-85820-3_23}
}
@techreport{settles.tr09,
Author = {Burr Settles},
Institution = {University of Wisconsin--Madison},
Number = {1648},
Title = {Active Learning Literature Survey},
Type = {Computer Sciences Technical Report},
Year = {2009},
}