\usepackage[inline]{enumitem}

\settopmatter{printacmref=false} % Removes citation information below abstract
\renewcommand\footnotetextcopyrightpermission[1]{} % removes footnote with conference information in first column
\pagestyle{plain} % removes running headers

%%
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{%
\providecommand\BibTeX{{%
\normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}

\acmConference{Cross-Model Pseudo-Labeling}{2023}{Linz}

%%
%% end of the preamble, start of the body of the document source.
%% other information printed in the page headers. This command allows
%% the author to define a more concise list
%% of authors' names for this purpose.
\renewcommand{\shortauthors}{Lukas Heilgenbrunner}

%%
%% The abstract is a short summary of the work to be presented in the
Labeling datasets is commonly seen as an expensive task and is best avoided whenever possible.
That is why there is a machine-learning field called semi-supervised learning.
The general approach is to train a model that predicts pseudo-labels, which can then be used to train the main model.

The goal of this paper is video action recognition.
The given videos are approximately 10 seconds long and should be classified.
In this paper datasets with 400 and 101 different classes are used.
The proposed approach is tested with 1\% and 10\% of all data points labeled.
The choice of model depends on the exact use case; in this case a 3D-ResNet50 and a 3D-ResNet18 are used.

\section{Semi-Supervised learning}\label{sec:semi-supervised-learning}
Each datapoint is associated with a corresponding target label.
The goal is to fit a model to predict the labels from datapoints.

In traditional unsupervised learning no labels are known.
The goal is to find patterns or structures in the data.
Moreover, it can be used for clustering or downprojection.

Those two techniques combined yield semi-supervised learning.
Then both the known labels and the predicted ones are used side by side to train the model.
The labeled samples guide the learning process and the unlabeled samples gain additional information.

Not every pseudo prediction is kept to train the model further.
A confidence threshold is defined to evaluate how ``confident'' the model is about its prediction.
The prediction is dropped if the model is not confident enough.
The quantity and quality of the obtained labels are crucial and have a significant impact on the overall accuracy.
This means improving the pseudo-label framework as much as possible is essential.

FixMatch comes with some major limitations.
It relies on a single model for generating pseudo-labels, which can introduce errors and uncertainty in the labels.
Incorrect pseudo-labels may affect the learning process negatively.
Furthermore, FixMatch uses a comparably small model for label prediction, which has a limited capacity.
This can negatively affect the learning process as well.
There is no measure defined of how certain the model is about its prediction.
Such a measure would improve overall performance by filtering noisy and unsure predictions.
Cross-Model Pseudo-Labeling tries to address all of those limitations.

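As an illustrative sketch (not code from the paper), the confidence-threshold filtering described above can be expressed in a few lines of Python; the softmax outputs and the threshold value of 0.95 are assumptions for the example:

```python
def filter_pseudo_labels(probs, tau=0.95):
    """Keep only predictions whose largest class probability reaches tau.

    probs: list of per-sample softmax outputs (lists of class probabilities).
    Returns (kept_indices, pseudo_labels) for the confident samples only.
    """
    kept, labels = [], []
    for i, p in enumerate(probs):
        confidence = max(p)                     # model's confidence for sample i
        if confidence >= tau:                   # confidence-based masking
            kept.append(i)
            labels.append(p.index(confidence))  # hard pseudo-label (argmax)
    return kept, labels

# Example: three unlabeled samples, two classes
probs = [[0.97, 0.03],   # confident -> kept, pseudo-label 0
         [0.60, 0.40],   # unsure    -> dropped
         [0.05, 0.95]]   # confident -> kept, pseudo-label 1
idx, labels = filter_pseudo_labels(probs)
```

Only the two confident samples survive the filter; the unsure middle sample contributes nothing to further training.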
\subsection{Math of FixMatch}\label{subsec:math-of-fixmatch}
Equation~\ref{eq:fixmatch} defines the loss function that trains the model.
The sum over a batch of size $B_u$ takes the average loss of this batch and should be familiar.
The input data is augmented in two different ways.
At first there is a weak augmentation $\mathcal{T}_{\text{weak}}(\cdot)$ which only applies basic transformations such as filtering and blurring.
Moreover, there is the strong augmentation $\mathcal{T}_{\text{strong}}(\cdot)$ which applies cropouts and random augmentations.

\begin{equation}
\label{eq:fixmatch}
\mathcal{L}_u = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbbm{1}(\max(p_i) \geq \tau) \mathcal{H}(\hat{y}_i,F(\mathcal{T}_{\text{strong}}(u_i)))
\end{equation}

The indicator function $\mathbbm{1}(\cdot)$ applies a principle called ``confidence-based masking''.
It retains a label only if its largest probability is above a threshold $\tau$.
Here $p_i \coloneqq F(\mathcal{T}_{\text{weak}}(u_i))$ is a model evaluation with a weakly augmented input.
The second part $\mathcal{H}(\cdot, \cdot)$ is a standard cross-entropy loss function which takes two inputs: $\hat{y}_i$, the obtained pseudo-label, and $F(\mathcal{T}_{\text{strong}}(u_i))$, a model evaluation with strong augmentation.
The indicator function evaluates to $0$ if the pseudo prediction is not confident, and the current loss evaluation is dropped.
Otherwise it evaluates to $1$ and the sample contributes to training the model.

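The following minimal Python sketch mirrors the masked loss; it is an illustration under assumptions (hard pseudo-labels, plain lists standing in for the softmax outputs of the weak and strong views), not the authors' implementation:

```python
import math

def fixmatch_unlabeled_loss(p_weak, p_strong, tau=0.95):
    """Masked cross-entropy over an unlabeled batch.

    p_weak[i]  : softmax of F(T_weak(u_i))   -> source of the pseudo-label
    p_strong[i]: softmax of F(T_strong(u_i)) -> prediction being trained
    """
    total = 0.0
    for pw, ps in zip(p_weak, p_strong):
        if max(pw) >= tau:                  # indicator 1(max p_i >= tau)
            pseudo = pw.index(max(pw))      # hard pseudo-label hat{y}_i
            total += -math.log(ps[pseudo])  # cross-entropy H(hat{y}_i, .)
    return total / len(p_weak)              # average over batch size B_u

p_weak = [[0.98, 0.02], [0.55, 0.45]]       # second sample is not confident
p_strong = [[0.90, 0.10], [0.50, 0.50]]
loss = fixmatch_unlabeled_loss(p_weak, p_strong)  # only sample 0 contributes
```

Note that the division is by the full batch size $B_u$, not by the number of confident samples, exactly as in the equation.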
\section{Cross-Model Pseudo-Labeling}\label{sec:cross-model-pseudo-labeling}
The approach newly introduced in this paper is called Cross-Model Pseudo-Labeling (CMPL)~\cite{Xu_2022_CVPR}.
Figure~\ref{fig:cmpl-structure} visualizes the structure of CMPL\@.
We define two different models, a smaller auxiliary model and a larger main model.
The SG label means stop gradient.
The pseudo-labels predicted by each model are fed into the loss function of the opposite model.
The two models train each other.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{../presentation/rsc/structure}
\caption{Architecture of Cross-Model Pseudo-Labeling}
\label{fig:cmpl-structure}
\end{figure}

\subsection{Math of CMPL}\label{subsec:math}
The loss function of CMPL is similar to the one explained above.
However, we have to distinguish between the supervised loss, where the labels are known, and the unsupervised loss, where no labels are known.

Equations~\ref{eq:cmpl-losses1} and~\ref{eq:cmpl-losses2} are normal cross-entropy loss functions generated with the supervised labels of the two separate models.

\begin{align}
\label{eq:cmpl-losses1}
\mathcal{L}_s^F &= \frac{1}{B_l} \sum_{i=1}^{B_l} \mathcal{H}(y_i, F(x_i))\\
\label{eq:cmpl-losses2}
\mathcal{L}_s^A &= \frac{1}{B_l} \sum_{i=1}^{B_l} \mathcal{H}(y_i, A(x_i))
\end{align}

Equations~\ref{eq:cmpl-loss3} and~\ref{eq:cmpl-loss4} are the unsupervised losses.
They are very similar to FixMatch, but it is important to note that the confidence-based masking is applied to the opposite model.

\begin{align}
\label{eq:cmpl-loss3}
\mathcal{L}_u^F &= \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbbm{1}(\max(p_i^A) \geq \tau) \mathcal{H}(\hat{y}_i^A, F(\mathcal{T}_{\text{strong}}(u_i)))\\
\label{eq:cmpl-loss4}
\mathcal{L}_u^A &= \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbbm{1}(\max(p_i^F) \geq \tau) \mathcal{H}(\hat{y}_i^F, A(\mathcal{T}_{\text{strong}}(u_i)))
\end{align}

The loss is regulated by a hyperparameter $\lambda$ to weight the importance of the unsupervised losses.

\begin{equation}
\mathcal{L} = (\mathcal{L}_s^F + \mathcal{L}_s^A) + \lambda(\mathcal{L}_u^F + \mathcal{L}_u^A)
\end{equation}

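The cross-supervision can be sketched in plain Python; this is a hedged illustration (function names, array shapes and the toy numbers are assumptions, not the authors' code) of how each model's confident weak-view predictions supervise the other model's strong-view predictions:

```python
import math

def masked_ce(p_teacher, p_student, tau=0.95):
    """Cross-entropy of the student's strong-view prediction against the
    teacher's pseudo-label, kept only where the teacher is confident."""
    total = 0.0
    for pt, ps in zip(p_teacher, p_student):
        if max(pt) >= tau:                          # mask from the *teacher*
            total += -math.log(ps[pt.index(max(pt))])
    return total / len(p_teacher)

def cmpl_losses(pF_weak, pA_weak, pF_strong, pA_strong, tau=0.95):
    # Cross-supervision: each model is trained on the *other* model's labels.
    loss_u_F = masked_ce(pA_weak, pF_strong, tau)   # A's pseudo-labels train F
    loss_u_A = masked_ce(pF_weak, pA_strong, tau)   # F's pseudo-labels train A
    return loss_u_F, loss_u_A

def total_loss(ls_F, ls_A, lu_F, lu_A, lam=1.0):
    # Supervised terms plus lambda-weighted unsupervised terms.
    return (ls_F + ls_A) + lam * (lu_F + lu_A)

# Toy batch of one sample: A is confident, F is not.
lu_F, lu_A = cmpl_losses(pF_weak=[[0.60, 0.40]], pA_weak=[[0.99, 0.01]],
                         pF_strong=[[0.50, 0.50]], pA_strong=[[0.80, 0.20]])
```

Because only the auxiliary model is confident here, only the primary model receives an unsupervised training signal in this step.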
\section{Architecture}\label{sec:Architecture}
The used model architectures depend highly on the task to be performed.
In this case the task is video action recognition.
A 3D-ResNet50 was chosen for the main model and a smaller 3D-ResNet18 for the auxiliary model.

\section{Performance}\label{sec:performance}

In Figure~\ref{fig:results} a performance comparison is shown between training with only the supervised samples and several different pseudo-label frameworks.
One can clearly see that the performance gain with the new CMPL framework is significant.
For evaluation the Kinetics-400 and UCF-101 datasets are used.
As backbone models a 3D-ResNet18 and a 3D-ResNet50 are used.
Even when only 1\% of true labels are known for the UCF-101 dataset, 25.1\% of the labels could be predicted correctly.

\begin{figure}[h]
\centering
\includegraphics[width=\linewidth]{../presentation/rsc/results}
\caption{Performance comparisons between CMPL, FixMatch and supervised learning only}
\label{fig:results}
\end{figure}