add some citations to crossentropy

lukas-heilgenbrunner 2024-06-05 13:42:26 +02:00
parent 9d2534deba
commit 32b98841b3
2 changed files with 17 additions and 10 deletions


@@ -115,6 +115,7 @@ When using the accuracy as the performance metric it doesn't reveal much about t
There might be many true-positives and rarely any true-negatives and the accuracy is still good.
The ROC curve helps with this problem and visualizes the true-positives and false-positives on a line plot.
The closer the curve approaches the upper-left or bottom-right corner, the better the classifier performs.
Figure~\ref{fig:roc-example} shows an example of a ROC curve with differently performing classifiers.
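As a brief, hypothetical illustration (not part of the original text), the ROC points for a set of ground-truth labels and prediction scores could be computed roughly as follows, assuming scikit-learn is available; the arrays hold made-up example values.

import numpy as np
from sklearn.metrics import auc, roc_curve

# made-up ground-truth labels and predicted scores for eight samples
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6])

# false-positive rate, true-positive rate and the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve; closer to 1 means a better classifier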
\begin{figure}
\centering
@@ -160,7 +161,7 @@ Figure~\ref{fig:cnn-architecture} shows a typical binary classification task.
\subsubsection{Softmax}
The Softmax function~\eqref{eq:softmax}\cite{liang2017soft} converts $n$ numbers of a vector into a probability distribution.
It is a generalization of the Sigmoid function and is often used as an activation layer in neural networks.
\begin{equation}\label{eq:softmax}
\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \quad \text{for } j \in \{1,\dots,K\}
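As an illustrative check (added here, not in the original text): for a two-class logit vector $\mathbf{z}=(2,0)$ the Softmax yields $\sigma(\mathbf{z}) = \left(\frac{e^{2}}{e^{2}+e^{0}}, \frac{e^{0}}{e^{2}+e^{0}}\right) \approx (0.88, 0.12)$, which sums to one as required of a probability distribution.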
@@ -171,8 +172,8 @@ The softmax function has high similarities with the Boltzmann distribution and w
\subsubsection{Cross Entropy Loss}
Cross Entropy Loss is a well-established loss function in machine learning.
Equation~\eqref{eq:crelformal}\cite{crossentropy} shows the formal general definition of the Cross Entropy Loss.
Equation~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss for binary classification tasks.
\begin{align}
H(p,q) &= -\sum_{x\in\mathcal{X}} p(x)\, \log q(x)\label{eq:crelformal}\\
@@ -180,7 +181,7 @@ And~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss
\mathcal{L}(p,q) &= - \frac{1}{\mathcal{B}} \sum_{i=1}^{\mathcal{B}} (p_i \log q_i + (1-p_i) \log(1-q_i))\label{eq:crelbinarybatch}
\end{align}
$\mathcal{L}(p,q)$ in equation~\eqref{eq:crelbinarybatch}~\cite{handsonaiI} is the Binary Cross Entropy Loss for a batch of size $\mathcal{B}$ and is used for model training in this Practical Work.
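A minimal sketch of equation~\eqref{eq:crelbinarybatch} in plain NumPy may help to see the batch average in practice; the arrays below are made-up values, not the training code of this work.

import numpy as np

# made-up ground-truth labels p and predicted probabilities q for a batch of size B = 4
p = np.array([1.0, 0.0, 1.0, 1.0])
q = np.array([0.9, 0.2, 0.7, 0.6])

# binary cross entropy averaged over the batch, as in the equation above
bce = -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))
print(bce)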
\subsubsection{Mathematical modeling of problem}\label{subsubsec:mathematicalmodeling}
@@ -188,7 +189,7 @@ Here the task is modeled as a mathematical problem to get a better understanding
The model is defined as $g(\pmb{x};\pmb{w})$ where $\pmb{w}$ are the model weights and $\mathcal{X}$ the input samples.\cite{suptechniques}
We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$ where $\mathcal{B} < \mathcal{S}$.
In every active learning loop iteration we sample $\mathcal{S}$ random samples as in equation~\eqref{eq:batchdef}\cite{suptechniques} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
\begin{equation}
\label{eq:batchdef}
@@ -203,20 +204,18 @@ z = g(\pmb{x};\pmb{w})
\end{equation}
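To make the two steps above concrete, a rough sketch of drawing $\mathcal{S}$ random unlabeled samples and computing the pseudo predictions $z = g(\pmb{x};\pmb{w})$ could look as follows; the function and variable names are placeholders, not the actual code of this work.

import numpy as np

def pseudo_predict(g, x_unlabeled, sample_size):
    # draw S random samples from the unlabeled pool X_U (cf. eq. batchdef)
    idx = np.random.choice(len(x_unlabeled), size=sample_size, replace=False)
    x_sample = x_unlabeled[idx]
    # raw, unnormalized model outputs z = g(x; w) for the drawn samples
    z = g(x_sample)
    return idx, z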
Those predictions might have any numerical value and have to be squeezed into a proper distribution which sums up to 1.
The Softmax function has exactly this effect: $\sum^\mathcal{S}_{i=1}\sigma(z)_i=1$.\cite{handsonaiI}
Since we have a two-class problem, the Softmax results in two values: the probabilities of how certain each class is a match.
We want to calculate the distance to the class center: the farther away a prediction is from the center, the more certain it is.
Vice versa, the closer a prediction is to the center, the more uncertain it is.
Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$.
That means taking the absolute value of the prediction minus the class center results in the certainty of the sample~\eqref{eq:certainty}.\cite{activelearning}
\begin{align}
\label{eq:certainty}
S(z) = | 0.5 - \sigma(\mathbf{z})_0| \; \textit{or} \; S(z) = \max \sigma(\mathbf{z}) - 0.5
\end{align}
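Continuing the illustrative Softmax example from above (added for clarity, not part of the original text): a confident prediction $\sigma(\mathbf{z}) \approx (0.88, 0.12)$ gives $S(z) = |0.5 - 0.88| = 0.38$, while a near-even prediction $(0.52, 0.48)$ gives $S(z) = 0.02$, i.e. a highly uncertain sample.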
With the help of this metric the pseudo predictions can be sorted by the score $S(z)$.
We define $\text{min}_n(S)$ and $\text{max}_n(S)$ in equations~\ref{eq:minnot} and~\ref{eq:maxnot} respectively as a short form for taking the subset of the $n$ minimum or maximum elements of a set.
@@ -230,7 +229,7 @@ We define $\text{min}_n(S)$ and $\text{max}_n(S)$ respectively in equation~\ref{
This notation helps to define which subsets of samples to give the user for labeling.
There are different ways in which this subset can be chosen.
In this Practical Work we perform the obvious experiments with Low-Certainty first~\cite{certainty-based-al} in paragraph~\ref{par:low-certainty-first} and High-Certainty first in paragraph~\ref{par:high-certainty-first}.
Furthermore, we test the two mixtures between them: half high- and half low-certainty samples, and only the middle section of the sorted certainty scores.
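The different selection strategies could be sketched roughly as below; this is an illustrative NumPy snippet with placeholder names, not the implementation used in this work.

import numpy as np

def select_subset(scores, n, strategy="low"):
    order = np.argsort(scores)  # indices sorted by ascending certainty S(z)
    if strategy == "low":       # low-certainty first, i.e. min_n(S)
        return order[:n]
    if strategy == "high":      # high-certainty first, i.e. max_n(S)
        return order[-n:]
    if strategy == "mixed":     # half low- and half high-certainty samples
        return np.concatenate([order[: n // 2], order[-(n - n // 2):]])
    # otherwise: the middle section of the sorted certainty scores
    start = (len(order) - n) // 2
    return order[start:start + n]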
\paragraph{Low certainty first}\label{par:low-certainty-first}


@@ -134,6 +134,14 @@ doi = {10.1007/978-0-387-85820-3_23}
publisher={Johannes Kepler Universität Linz}
}
@misc{handsonaiI,
author = {Andreas Schörgenhumer and Bernhard Schäfl and Michael Widrich},
title = {Lecture notes in Hands On AI I, Unit 4 \& 5},
month = {October},
year = {2021},
publisher={Johannes Kepler Universität Linz}
}
@online{ROCWikipedia,
author = "Wikimedia Commons",
title = "Receiver operating characteristic",