move most stuff to outline section

add cross entropy loss infos
add text to 4 different methods
lukas-heiligenbrunner 2024-04-24 12:14:44 +02:00
parent fb8a50639f
commit ba105985b3
6 changed files with 1323 additions and 83 deletions


@ -2,4 +2,6 @@
\subsection{Conclusion}\label{subsec:conclusion}
\subsection{Outlook}\label{subsec:outlook}
Results might differ for multi-class classification and segmentation tasks.


@ -1,74 +1 @@
\section{Implementation}\label{sec:implementation}
The model is defined as $g(\pmb{x};\pmb{w})$ where $\pmb{w}$ are the model weights and $\mathcal{X}$ the input samples.
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$ where $\mathcal{B} < \mathcal{S}$.
In every active learning loop iteration we sample $\mathcal{S}$ random samples~\eqref{eq:batchdef} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
\begin{equation}
\label{eq:batchdef}
\pmb{x} \coloneqq (\pmb{x}_0,\dots,\pmb{x}_\mathcal{S}) \sim \mathcal{X}_U
\end{equation}
The model with the weights of the current loop iteration produces pseudo predictions for these samples.
\begin{equation}\label{eq:equation2}
z = g(\pmb{x};\pmb{w})
\end{equation}
Those predictions can take any numerical value and have to be mapped to a proper probability distribution which sums to 1.
The softmax function has exactly this effect: $\sum_{j}\sigma(z)_j=1$, where $j$ runs over the classes.
Since we have a two-class problem, the softmax yields two values: the probabilities of the sample belonging to either class.
We want to measure the distance of a prediction to the class center: the farther a prediction lies from the center, the more certain it is.
Vice versa, the closer the prediction is to the center, the more uncertain it is.
Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$.
That means the absolute difference between the prediction and the class center gives the certainty of the sample~\eqref{eq:certainty}.
\begin{align}
\label{eq:certainty}
S(z) = | 0.5 - \sigma(\mathbf{z})_0| \; \textit{or} \; \arg\max_j \sigma(\mathbf{z})
\end{align}
\cite{activelearning}
With the help of this metric the pseudo predictions can be sorted by the score $S(z)$.
We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot} respectively as a short notation for taking the $n$ smallest or largest elements of a set.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ smallest numbers of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ largest numbers of } S
\end{equation}
This notation helps to define which subset of samples to present to the user for labeling.
There are different ways in which this subset can be chosen.
In this PW we run the obvious experiments Low-Certainty first~\ref{subsec:low-certainty-first} and High-Certainty first~\ref{subsec:high-certainty-first}.
Furthermore, we evaluate two mixtures of them: half low- and half high-certainty samples, and only the middle section of the sorted certainty scores.
\subsection{Low certainty first}\label{subsec:low-certainty-first}
We take the samples with the lowest certainty score first and give them to the user for labeling.
\begin{equation}
\text{min}_\mathcal{B}(S(z))
\end{equation}
\subsection{High certainty first}\label{subsec:high-certainty-first}
We take the samples with the highest certainty score first and give them to the user for labeling.
\begin{equation}
\text{max}_\mathcal{B}(S(z))
\end{equation}
\subsection{Low and High certain first}
We fill half of the batch size $\mathcal{B}$ with low-certainty samples and the other half with high-certainty samples.
\begin{equation}
\text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}
\subsection{Mid certain first}
\begin{equation}
S(z) \setminus (\text{min}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)))
\end{equation}


@ -18,4 +18,91 @@ Is combining Dagster with Label-Studio a good match for building scalable and re
\subsubsection{Does balancing the learning samples improve performance?}
The sample-selection metric might select samples just from one class by chance.
Does balancing this distribution help the model performance?
\subsection{Outline}\label{subsec:outline}
The model is defined as $g(\pmb{x};\pmb{w})$ where $\pmb{w}$ are the model weights and $\mathcal{X}$ the input samples.
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$ where $\mathcal{B} < \mathcal{S}$.
In every active learning loop iteration we sample $\mathcal{S}$ random samples~\eqref{eq:batchdef} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
\begin{equation}
\label{eq:batchdef}
\pmb{x} \coloneqq (\pmb{x}_0,\dots,\pmb{x}_\mathcal{S}) \sim \mathcal{X}_U
\end{equation}
The model with the weights of the current loop iteration produces pseudo predictions for these samples.
\begin{equation}\label{eq:equation2}
z = g(\pmb{x};\pmb{w})
\end{equation}
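As a minimal sketch of this sampling and pseudo-prediction step (the names \texttt{pseudo\_predict}, \texttt{model} and \texttt{unlabeled\_pool} are purely illustrative and not part of the actual pipeline), this could look as follows in PyTorch-style Python:
\begin{verbatim}
import torch

def pseudo_predict(model, unlabeled_pool, S):
    """Draw S random samples from the unlabeled pool X_U and
    compute their pseudo predictions z = g(x; w)."""
    idx = torch.randperm(len(unlabeled_pool))[:S]   # x ~ X_U
    x = unlabeled_pool[idx]
    model.eval()
    with torch.no_grad():
        z = model(x)                                # raw logits, shape (S, 2)
    return idx, z
\end{verbatim}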
Those predictions can take any numerical value and have to be mapped to a proper probability distribution which sums to 1.
The softmax function has exactly this effect: $\sum_{j}\sigma(z)_j=1$, where $j$ runs over the classes.
Since we have a two-class problem, the softmax yields two values: the probabilities of the sample belonging to either class.
We want to measure the distance of a prediction to the class center: the farther a prediction lies from the center, the more certain it is.
Vice versa, the closer the prediction is to the center, the more uncertain it is.
Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$.
That means the absolute difference between the prediction and the class center gives the certainty of the sample~\eqref{eq:certainty}.
\begin{align}
\label{eq:certainty}
S(z) = | 0.5 - \sigma(\mathbf{z})_0| \; \textit{or} \; S(z) = \max \sigma(\mathbf{z}) - 0.5
\end{align}
\cite{activelearning}
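Note that for two classes both variants of~\eqref{eq:certainty} are equivalent, since $\sigma(\mathbf{z})_0 + \sigma(\mathbf{z})_1 = 1$.
For example, a confident prediction $\sigma(\mathbf{z}) = (0.9, 0.1)$ yields $S(z) = |0.5 - 0.9| = 0.9 - 0.5 = 0.4$, whereas a maximally uncertain prediction $\sigma(\mathbf{z}) = (0.5, 0.5)$ yields $S(z) = 0$.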
With the help of this metric the pseudo predictions can be sorted by the score $S(z)$.
We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in~\eqref{eq:minnot} and~\eqref{eq:maxnot} respectively as a short notation for taking the $n$ smallest or largest elements of a set.
\begin{equation}\label{eq:minnot}
\text{min}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ smallest numbers of } S
\end{equation}
\begin{equation}\label{eq:maxnot}
\text{max}_n(S) \coloneqq a \subset S \mid \text{where } a \text{ are the } n \text{ largest numbers of } S
\end{equation}
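For example, for $S = \{0.05, 0.1, 0.3, 0.4\}$ we get $\text{min}_2(S) = \{0.05, 0.1\}$ and $\text{max}_2(S) = \{0.3, 0.4\}$.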
This notation helps to define which subset of samples to present to the user for labeling.
There are different ways in which this subset can be chosen.
In this PW we run the obvious experiments Low-Certainty first~\ref{subsec:low-certainty-first} and High-Certainty first~\ref{subsec:high-certainty-first}.
Furthermore, we evaluate two mixtures of them: half low- and half high-certainty samples, and only the middle section of the sorted certainty scores.
\subsubsection{Low certainty first}\label{subsec:low-certainty-first}
We take the samples with the lowest certainty score first and give them to the user for labeling.
This is the most intuitive way to do active learning and might also be the most beneficial.
\begin{equation}
\mathcal{X}_t = \text{min}_\mathcal{B}(S(z))
\end{equation}
\subsubsection{High certainty first}\label{subsec:high-certainty-first}
We take the samples with the highest certainty score first and give them to the user for labeling.
The idea behind this is that the model is already very certain about these predictions and the user only has to confirm them.
This might help to ignore labels which are irrelevant for the model.
\begin{equation}
\mathcal{X}_t =\text{max}_\mathcal{B}(S(z))
\end{equation}
\subsubsection{Low and High certain first}
We fill half of the batch size $\mathcal{B}$ with low-certainty samples and the other half with high-certainty samples.
This aims to benefit from both low- and high-certainty samples.
\begin{equation}
\mathcal{X}_t = \text{min}_{\mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{B}/2}(S(z))
\end{equation}
\subsubsection{Mid certain first}
We take the middle section of the sorted certainty scores.
This closes the gap and includes the fourth variation in the experiments.
It is expected to perform worst but might still be better than random sampling in some cases.
\begin{equation}
\mathcal{X}_t =S(z) \setminus (\text{min}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)) \cup \text{max}_{\mathcal{S}/2 - \mathcal{B}/2}(S(z)))
\end{equation}
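The following is a minimal NumPy sketch of the four selection variants above.
It is illustrative only; \texttt{probs} stands for the softmax outputs $\sigma(z)$ of the $\mathcal{S}$ drawn samples, and all function and variable names are assumptions, not the actual implementation:
\begin{verbatim}
import numpy as np

def certainty_scores(probs):
    """S(z) = |0.5 - sigma(z)_0| for two-class softmax outputs, shape (S, 2)."""
    return np.abs(0.5 - probs[:, 0])

def select_for_labeling(scores, B, strategy):
    """Return the indices of the B samples handed to the annotator."""
    order = np.argsort(scores)                  # ascending certainty
    if strategy == "low":                       # low certainty first
        return order[:B]
    if strategy == "high":                      # high certainty first
        return order[-B:]
    if strategy == "low_high":                  # half low, half high certainty
        return np.concatenate([order[:B // 2], order[-(B - B // 2):]])
    if strategy == "mid":                       # middle of the sorted scores
        start = (len(scores) - B) // 2
        return order[start:start + B]
    raise ValueError("unknown strategy: " + strategy)
\end{verbatim}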
\subsubsection{Model training}
We have now defined the samples $\mathcal{X}_t$ we want to label, and the user labels these samples.
After labeling, the model $g(\pmb{x};\pmb{w})$ is trained on the newly labeled samples $\mathcal{X}_t$ and the weights $\pmb{w}$ are updated.
The loop then starts again with the updated model and draws new unlabeled samples from $\mathcal{X}_U$.
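A hypothetical sketch of this training step in PyTorch-style Python (names are illustrative, the actual pipeline is orchestrated with Dagster and Label-Studio; the two-class cross-entropy call stands in for the binary cross entropy described in the methods section):
\begin{verbatim}
import torch
import torch.nn.functional as F

def train_on_labeled_batch(model, optimizer, x_t, labels, epochs=1):
    """Update the weights w with the freshly labeled samples X_t."""
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(x_t)                      # g(x; w)
        loss = F.cross_entropy(logits, labels)   # two-class cross entropy
        loss.backward()
        optimizer.step()                         # update w
    return loss.item()
\end{verbatim}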

src/llncs.cls: new file, 1189 lines (diff suppressed because it is too large)


@ -1,12 +1,22 @@
\def\ieee{1}
\if\ieee1
\documentclass[sigconf]{acmart}
\else
\documentclass{llncs}
\fi
\usepackage{amsmath}
\usepackage{mathtools}
\usepackage{hyperref}
\usepackage[inline]{enumitem}
\if\ieee1
\settopmatter{printacmref=false} % Removes citation information below abstract
\renewcommand\footnotetextcopyrightpermission[1]{} % removes footnote with conference information in first column
\pagestyle{plain} % removes running headers
\fi
%%
%% \BibTeX command to typeset BibTeX logo in the docs
@ -14,7 +24,9 @@
\providecommand\BibTeX{{%
\normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}
\if\ieee1
\acmConference{Minimize labeling effort of Binary classification Tasks with Active learning}{2023}{Linz}
\fi
%%
%% end of the preamble, start of the body of the document source.
@ -32,14 +44,18 @@
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Lukas Heiligenbrunner}
\if\ieee1
\email{k12104785@students.jku.at}
\affiliation{%
\institution{Johannes Kepler University Linz}
\city{Linz}
\state{Upperaustria}
\country{Austria}
\postcode{4020}
\institution{Johannes Kepler University Linz}
\city{Linz}
\state{Upperaustria}
\country{Austria}
\postcode{4020}
}
\fi
%%
%% By default, the full list of authors will be used in the page
@ -47,11 +63,15 @@
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{Lukas Heilgenbrunner}
% \renewcommand{\shortauthors}{Lukas Heilgenbrunner}
%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\if\ieee0
\maketitle
\fi
\begin{abstract}
Active learning might result in faster model convergence, and thus fewer labeled samples would be required. This method might be beneficial in areas where labeling datasets is demanding and reducing computational effort is not the main objective.
\end{abstract}
@ -59,7 +79,9 @@
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\if\ieee1
\keywords{neural networks, ResNET, pseudo-labeling, active-learning}
\fi
%\received{20 February 2007}
%\received[revised]{12 March 2009}
@ -68,7 +90,9 @@
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\if\ieee1
\maketitle
\fi
\input{introduction}
\input{materialandmethods}
\input{implementation}


@ -24,7 +24,7 @@ Moreover, it can be used for clustering or downprojection.
Those two techniques combined yield semi-supervised learning.
Some of the labels are known, but for most of the data we have only the raw datapoints.
The basic idea is that the unlabeled data can significantly improve the model performance when used in combination with the labeled data.
The basic idea is that the unlabeled data can significantly improve the model performance when used in combination with the labeled data.\cite{Xu_2022_CVPR}
\subsubsection{ROC and AUC}
@ -74,5 +74,16 @@ Its a generalization of the Sigmoid function and often used as an Activation Lay
The softmax function is closely related to the Boltzmann distribution and was first introduced in the 19$^{\textrm{th}}$ century~\cite{Boltzmann}.
\subsubsection{Cross Entropy Loss}
% todo maybe remove this
Cross Entropy Loss is a well-established loss function in machine learning.
Equation~\eqref{eq:crelformal} shows its formal general definition,
and~\eqref{eq:crelbinary} is the special case for binary classification tasks.
\begin{align}
H(p,q) &= -\sum_{x\in\mathcal{X}} p(x)\, \log q(x)\label{eq:crelformal}\\
H(p,q) &= - (p \log q + (1-p) \log(1-q))\label{eq:crelbinary}\\
\mathcal{L}(p,q) &= - \frac{1}{\mathcal{B}} \sum_{i=1}^{\mathcal{B}} (p_i \log q_i + (1-p_i) \log(1-q_i))\label{eq:crelbinarybatch}
\end{align}
$\mathcal{L}(p,q)$~\eqref{eq:crelbinarybatch} is the Binary Cross Entropy Loss averaged over a batch of size $\mathcal{B}$ and is used for model training in this PW.
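As a sketch, \eqref{eq:crelbinarybatch} can be written directly in NumPy (illustrative only; in practice a framework implementation such as \texttt{torch.nn.BCELoss} would be used):
\begin{verbatim}
import numpy as np

def binary_cross_entropy(p, q, eps=1e-12):
    """Mean binary cross entropy over a batch.

    p: ground-truth labels in {0, 1}, shape (B,)
    q: predicted probabilities in (0, 1), shape (B,)
    """
    q = np.clip(q, eps, 1.0 - eps)   # clamp for numerical stability
    return -np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))
\end{verbatim}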
\subsubsection{Adam}