add some citations to crossentropy
This commit is contained in:
parent 9d2534deba
commit 32b98841b3
@@ -115,6 +115,7 @@ When using the accuracy as the performance metric it doesn't reveal much about t
 There might be many true-positives and rarely any true-negatives, and the accuracy is still good.
 The ROC curve helps with this problem and visualizes the true-positives and false-positives on a line plot.
 The closer the curve gets to the upper-left or bottom-right corner, the better the classifier performs.
+Figure~\ref{fig:roc-example} shows an example of a ROC curve with differently performing classifiers.

 \begin{figure}
 \centering
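As a concrete illustration of how such a curve is produced, here is a minimal sketch using scikit-learn and matplotlib; the label and score arrays are made-up placeholders, not data from this work.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# made-up ground-truth labels and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.30])

fpr, tpr, _ = roc_curve(y_true, y_score)   # false-/true-positive rate per threshold
auc = roc_auc_score(y_true, y_score)       # summarizes the curve as a single number

plt.plot(fpr, tpr, label=f"classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "--", label="random guessing")
plt.xlabel("false-positive rate")
plt.ylabel("true-positive rate")
plt.legend()
plt.show()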
@@ -160,7 +161,7 @@ Figure~\ref{fig:cnn-architecture} shows a typical binary classification task.

 \subsubsection{Softmax}

-The Softmax function~\ref{eq:softmax}\cite{liang2017soft} converts $n$ numbers of a vector into a probability distribution.
+The Softmax function~\eqref{eq:softmax}\cite{liang2017soft} converts the $n$ entries of a vector into a probability distribution.
 It is a generalization of the Sigmoid function and is often used as an activation layer in neural networks.
 \begin{equation}\label{eq:softmax}
 \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \quad \text{for } j \in \{1,\dots,K\}
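To make the formula in eq:softmax concrete, a minimal NumPy sketch of the same computation follows; the shift by the maximum is only a numerical-stability trick and cancels out of the ratio.

import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), computed over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)   # stability shift, does not change the result
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])              # arbitrary example vector
probs = softmax(logits)
print(probs, probs.sum())                       # the probabilities sum to 1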
@@ -171,8 +172,8 @@ The softmax function has high similarities with the Boltzmann distribution and w

 \subsubsection{Cross Entropy Loss}
 Cross Entropy Loss is a well-established loss function in machine learning.
-\eqref{eq:crelformal} shows the formal general definition of the Cross Entropy Loss.
-And~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss for binary classification tasks.
+Equation~\eqref{eq:crelformal}\cite{crossentropy} shows the formal general definition of the Cross Entropy Loss.
+Equation~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss for binary classification tasks.

 \begin{align}
 H(p,q) &= -\sum_{x\in\mathcal{X}} p(x)\, \log q(x)\label{eq:crelformal}\\
@@ -180,7 +181,7 @@ And~\eqref{eq:crelbinary} is the special case of the general Cross Entropy Loss
 \mathcal{L}(p,q) &= - \frac{1}{\mathcal{B}} \sum_{i=1}^{\mathcal{B}} (p_i \log q_i + (1-p_i) \log(1-q_i))\label{eq:crelbinarybatch}
 \end{align}

-$\mathcal{L}(p,q)$~\eqref{eq:crelbinarybatch} is the Binary Cross Entropy Loss for a batch of size $\mathcal{B}$ and used for model training in this PW.\cite{crossentropy}
+$\mathcal{L}(p,q)$ in equation~\eqref{eq:crelbinarybatch}\cite{handsonaiI} is the Binary Cross Entropy Loss for a batch of size $\mathcal{B}$ and is used for model training in this Practical Work.

 \subsubsection{Mathematical modeling of problem}\label{subsubsec:mathematicalmodeling}

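For readers who prefer code, a small NumPy sketch of both definitions follows: cross_entropy mirrors the general form eq:crelformal and binary_cross_entropy the batched binary form eq:crelbinarybatch with mean reduction over the batch; all input arrays are made up for illustration. If the model is trained with PyTorch, torch.nn.BCELoss computes the same batched loss.

import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x) for two discrete distributions."""
    return float(-np.sum(p * np.log(q)))

def binary_cross_entropy(p, q):
    """Mean binary cross entropy over a batch: p are targets in {0, 1}, q predicted probabilities."""
    eps = 1e-12                                 # guard against log(0)
    q = np.clip(q, eps, 1.0 - eps)
    return float(-np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))

print(cross_entropy(np.array([1.0, 0.0, 0.0]), np.array([0.7, 0.2, 0.1])))   # = -log(0.7)
print(binary_cross_entropy(np.array([1.0, 0.0, 1.0, 0.0]),
                           np.array([0.9, 0.2, 0.6, 0.1])))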
@@ -188,7 +189,7 @@ Here the task is modeled as a mathematical problem to get a better understanding

 The model is defined as $g(\pmb{x};\pmb{w})$ where $\pmb{w}$ are the model weights and $\mathcal{X}$ the input samples.\cite{suptechniques}
 We define two hyperparameters, the batch size $\mathcal{B}$ and the sample size $\mathcal{S}$, where $\mathcal{B} < \mathcal{S}$.
-In every active learning loop iteration we sample $\mathcal{S}$ random samples~\eqref{eq:batchdef}\cite{suptechniques} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.
+In every active learning loop iteration we sample $\mathcal{S}$ random samples as in equation~\eqref{eq:batchdef}\cite{suptechniques} from our total unlabeled sample set $\mathcal{X}_U \subset \mathcal{X}$.

 \begin{equation}
 \label{eq:batchdef}
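A sketch of this sampling step, under the assumption that the unlabeled pool is addressed by integer indices; the pool size, the value of S and the model call are placeholders.

import numpy as np

rng = np.random.default_rng(seed=0)

S = 512                                  # sample size per active-learning iteration (placeholder)
unlabeled_indices = np.arange(10_000)    # stands in for the unlabeled pool X_U

# draw S random samples without replacement, as in eq:batchdef
sampled = rng.choice(unlabeled_indices, size=S, replace=False)

# the pseudo-predictions z = g(x; w) would then be computed only for these samples,
# e.g. z = model(x_unlabeled[sampled]) with the trained model (hypothetical call)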
@@ -203,20 +204,18 @@ z = g(\pmb{x};\pmb{w})
 \end{equation}

 Those predictions might have any numerical value and have to be squeezed into a proper distribution that sums up to 1.
-The Softmax function has exactly this effect: $\sum^\mathcal{S}_{i=1}\sigma(z)_i=1$.
+The Softmax function has exactly this effect: $\sum^\mathcal{S}_{i=1}\sigma(z)_i=1$.\cite{handsonaiI}
 Since we have a two-class problem, the Softmax yields two values: the probabilities of how certain each class is a match.
 We want to calculate the distance to the class center: the farther away a prediction is from the center, the more certain it is.
 Vice versa, the closer a prediction is to the center, the more uncertain it is.
 Labels $0$ and $1$ result in a class center of $\frac{0+1}{2}=\frac{1}{2}$.
-That means taking the absolute value of the prediction minus the class center results in the certainty of the sample~\eqref{eq:certainty}.
+That means taking the absolute value of the prediction minus the class center results in the certainty of the sample~\eqref{eq:certainty}.\cite{activelearning}

 \begin{align}
 \label{eq:certainty}
 S(z) = | 0.5 - \sigma(\mathbf{z})_0| \; \textit{or} \; S(z) = \max \sigma(\mathbf{z}) - 0.5
 \end{align}

-\cite{activelearning}
-
 With the help of this metric, the pseudo-predictions can be sorted by the score $S(z)$.
 In equations~\ref{eq:minnot} and~\ref{eq:maxnot} we define $\text{min}_n(S)$ and $\text{max}_n(S)$ as a short form for taking the $n$ smallest or largest values of a set.
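The certainty score of eq:certainty and the min_n/max_n selection translate directly into a few lines of NumPy; the softmax outputs below are made up and n is an arbitrary budget.

import numpy as np

def certainty_score(probs):
    """S(z) = |0.5 - sigma(z)_0|; for two classes this equals max(sigma(z)) - 0.5."""
    return np.abs(0.5 - probs[:, 0])

# made-up softmax outputs for six pseudo-predictions (each row sums to 1)
probs = np.array([[0.51, 0.49],
                  [0.97, 0.03],
                  [0.60, 0.40],
                  [0.05, 0.95],
                  [0.45, 0.55],
                  [0.80, 0.20]])

scores = certainty_score(probs)
order = np.argsort(scores)     # ascending: most uncertain samples first
n = 2
min_n = order[:n]              # min_n(S): the n most uncertain samples
max_n = order[-n:]             # max_n(S): the n most certain samples
print(scores, min_n, max_n)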
@@ -230,7 +229,7 @@ We define $\text{min}_n(S)$ and $\text{max}_n(S)$ respectively in equation~\ref{

 This notation helps to define which subsets of samples to give the user for labeling.
 There are different ways in which this subset can be chosen.
-In this PW we do the obvious experiments with High-Certainty first in paragraph~\ref{par:low-certainty-first}, Low-Certainty\cite{certainty-based-al} first in paragraph~\ref{par:high-certainty-first}.
+In this PW we run the obvious experiments with High-Certainty first in paragraph~\ref{par:high-certainty-first} and Low-Certainty first~\cite{certainty-based-al} in paragraph~\ref{par:low-certainty-first}.
 Furthermore, we test the two mixtures between them: half high- and half low-certainty samples, and only the middle section of the sorted certainty scores.

 \paragraph{Low certainty first}\label{par:low-certainty-first}
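The four strategies compared in the following paragraphs can then be written as simple slices of the certainty-sorted indices; order and n are placeholders in the same spirit as the sketch above.

import numpy as np

# indices of the sampled pool, sorted by certainty score S(z) in ascending order (made-up scores)
order = np.argsort(np.array([0.01, 0.47, 0.10, 0.45, 0.05, 0.30]))
n = 2                                                    # labeling budget per iteration

low_certainty_first  = order[:n]                         # the n most uncertain samples
high_certainty_first = order[-n:]                        # the n most certain samples
half_and_half = np.concatenate((order[:n // 2], order[-(n - n // 2):]))
mid = len(order) // 2
middle_section = order[mid - n // 2 : mid - n // 2 + n]  # n samples around the median certainty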
@@ -134,6 +134,14 @@ doi = {10.1007/978-0-387-85820-3_23}
 publisher = {Johannes Kepler Universität Linz}
 }

+@misc{handsonaiI,
+author = {Andreas Schörgenhumer and Bernhard Schäfl and Michael Widrich},
+title = {Lecture notes in Hands On AI I, Unit 4 \& 5},
+month = {October},
+year = {2021},
+publisher = {Johannes Kepler Universität Linz}
+}
+
 @online{ROCWikipedia,
 author = "Wikimedia Commons",
 title = "Receiver operating characteristic",