diff --git a/summary/main.tex b/summary/main.tex
index 62a46bc..1a5620e 100644
--- a/summary/main.tex
+++ b/summary/main.tex
@@ -5,34 +5,17 @@
 \usepackage[inline]{enumitem}
+\settopmatter{printacmref=false} % Removes citation information below abstract
+\renewcommand\footnotetextcopyrightpermission[1]{} % removes footnote with conference information in first column
+\pagestyle{plain} % removes running headers
+
 %%
 %% \BibTeX command to typeset BibTeX logo in the docs
 \AtBeginDocument{%
   \providecommand\BibTeX{{%
     \normalfont B\kern-0.5em{\scshape i\kern-0.25em b}\kern-0.8em\TeX}}}
 
-%% Rights management information. This information is sent to you
-%% when you complete the rights form. These commands have SAMPLE
-%% values in them; it is your responsibility as an author to replace
-%% the commands and values with those provided to you when you
-%% complete the rights form.
-\setcopyright{acmcopyright}
-\copyrightyear{2018}
-\acmYear{2018}
-\acmDOI{XXXXXXX.XXXXXXX}
-
-%% These commands are for a PROCEEDINGS abstract or paper.
-\acmConference[Conference acronym 'XX]{Make sure to enter the correct
-  conference title from your rights confirmation emai}{June 03--05,
-  2018}{Woodstock, NY}
-%
-% Uncomment \acmBooktitle if th title of the proceedings is different
-% from ``Proceedings of ...''!
-%
-%\acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection,
-% June 03--05, 2018, Woodstock, NY}
-\acmPrice{15.00}
-\acmISBN{978-1-4503-XXXX-X/18/06}
+\acmConference{Cross-Model Pseudo-Labeling}{2023}{Linz}
 
 %%
 %% end of the preamble, start of the body of the document source.
@@ -65,7 +48,7 @@
 %% other information printed in the page headers. This command allows
 %% the author to define a more concise list
 %% of authors' names for this purpose.
-\renewcommand{\shortauthors}{Trovato and Tobin, et al.}
+\renewcommand{\shortauthors}{Lukas Heilgenbrunner}
 
 %%
 %% The abstract is a short summary of the work to be presented in the
@@ -98,10 +81,10 @@
-Labeling datasets is commonly seen as an expensive task and wants to be avoided
-Thats why there is a machine-learning field called Semi-Supervised learning.
+Labeling datasets is commonly seen as an expensive task and is therefore avoided where possible.
+That's why there is a machine-learning field called semi-supervised learning.
 The general approach is to train a model that predicts Pseudo-Labels which then can be used to train the main model.
-The goal of this paper is a video action recognition.
-Given are approximately 10 seconds long videos which should be classified.
+The goal of this paper is video action recognition.
+Given are videos of approximately 10 seconds in length which should be classified.
 In this paper datasets with 400 and 101 different classes are used.
-The papers approach is tested with 1\% and 10\% of known labels of all data points.
-The used model depends on the exact usecase but in this case a 3D-ResNet50 and 3D-ResNet18 are used.
+The proposed approach is tested with 1\% and 10\% of known labels of all data points.
+The used model depends on the exact use case, but in this case a 3D-ResNet50 and a 3D-ResNet18 are used.
 
 \section{Semi-Supervised learning}\label{sec:semi-supervised-learning}
@@ -110,7 +93,7 @@
 Each datapoint is associated with a corresponding target label.
 The goal is to fit a model to predict the labels from datapoints.
 In traditional unsupervised learning no labels are known.
-The goal is to find patterns and structures in the data.
-Moreover, it can be used for clustering or downprojection.
+The goal is to find patterns or structures in the data.
+Moreover, it can be used for clustering or down-projection.
 
 Those two techniques combined yield semi-supervised learning.
@@ -125,14 +108,23 @@ Then both, the known labels and the predicted ones are used side by side to trai
 The labeled samples guide the learning process and the unlabeled samples gain additional information.
 Not every pseudo prediction is kept to train the model further.
-A confidence threshold is defined to evaluate how `confident` the model is of its prediction.
+A confidence threshold is defined to evaluate how `confident' the model is about its prediction.
-The prediction is dropped if the model is too less confident.
-The quantity and quality of the obtained labels is crucial and they have an significant impact on the overall accuracy.
-This means improving the pseudo-label framework as much as possible is important.
+The prediction is dropped if the model is not confident enough.
+The quantity and quality of the obtained labels are crucial, as they have a significant impact on the overall accuracy.
+This means improving the pseudo-label framework as much as possible is essential.
+
+FixMatch comes with some major limitations.
+It relies on a single model for generating pseudo-labels which can introduce errors and uncertainty in the labels.
+Incorrect pseudo-labels may affect the learning process negatively.
+Furthermore, FixMatch uses a comparably small model for label prediction which has a limited capacity.
+This can negatively affect the learning process as well.
+There is no measure defined of how certain the model is about its prediction.
+Such a measure improves overall performance by filtering noisy and uncertain predictions.
+Cross-Model Pseudo-Labeling tries to address all of those limitations.
 
 \subsection{Math of FixMatch}\label{subsec:math-of-fixmatch}
 Equation~\ref{eq:fixmatch} defines the loss-function that trains the model.
-The sum over a batch size $B_u$ takes the average loss of this batch and should be straight forward.
+The sum over the unlabeled batch of size $B_u$ takes the average loss of this batch.
 The input data is augmented in two different ways.
-At first there is a weak augmentation $\mathcal{T}_{\text{weak}}(\cdot)$ which only applies basic transformation such as filtering and bluring.
-Moreover, there is the strong augmentation $\mathcal{T}_{\text{strong}}(\cdot)$ which does cropouts and random augmentations.
+At first there is a weak augmentation $\mathcal{T}_{\text{weak}}(\cdot)$ which only applies basic transformations such as filtering and blurring.
+Moreover, there is the strong augmentation $\mathcal{T}_{\text{strong}}(\cdot)$ which applies cutouts and random augmentations.
 
@@ -142,17 +134,17 @@ Moreover, there is the strong augmentation $\mathcal{T}_{\text{strong}}(\cdot)$
     \mathcal{L}_u = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathbbm{1}(\max(p_i) \geq \tau) \mathcal{H}(\hat{y}_i,F(\mathcal{T}_{\text{strong}}(u_i)))
 \end{equation}
 
-The interesting part is the indicator function $\mathbbm{1}(\cdot)$ which applies a principle called `confidence-based masking`.
+The indicator function $\mathbbm{1}(\cdot)$ applies a principle called `confidence-based masking'.
 It retains a label only if its largest probability is above a threshold $\tau$.
-Where $p_i \coloneqq F(\mathcal{T}_{\text{weak}}(u_i))$ is a model evaluation with a weakly augmented input.
+Here $p_i \coloneqq F(\mathcal{T}_{\text{weak}}(u_i))$ is the model evaluation of a weakly augmented input.
 The second part $\mathcal{H}(\cdot, \cdot)$ is a standard Cross-entropy loss function which takes two inputs, the predicted and the true label.
-$\hat{y}_i$, the obtained pseudo-label and $F(\mathcal{T}_{\text{strong}}(u_i))$, a model evaluation with strong augmentation.
-The indicator function evaluates in $0$ if the pseudo prediction is not confident and the current loss evaluation will be dropped.
-Otherwise it will be kept and trains the model further.
+These are $\hat{y}_i$, the obtained pseudo-label, and $F(\mathcal{T}_{\text{strong}}(u_i))$, a model evaluation with strong augmentation.
+The indicator function evaluates to $0$ if the pseudo prediction is not confident, and the current loss term is dropped.
+Otherwise it evaluates to $1$, the term is kept, and it trains the model further.
 
 \section{Cross-Model Pseudo-Labeling}\label{sec:cross-model-pseudo-labeling}
-The newly invented approach of this paper is called Cross-Model Pseudo-Labeling (CMPL)\cite{Xu_2022_CVPR}.
-In Figure~\ref{fig:cmpl-structure} one can see its structure.
+The approach newly introduced in this paper is called Cross-Model Pseudo-Labeling (CMPL)~\cite{Xu_2022_CVPR}.
+Figure~\ref{fig:cmpl-structure} visualizes the structure of CMPL\@.
 We define two different models, a smaller auxiliary model and a larger model.
 The SG label means stop gradient.
-The loss function evaluations are fed into the opposite model as loss.
+The pseudo-label predictions of each model are fed into the loss function of the opposite model.
@@ -162,13 +154,13 @@ The two models train each other.
 \begin{figure}[h]
     \centering
     \includegraphics[width=\linewidth]{../presentation/rsc/structure}
-    \caption{Model structures of Cross-Model Pseudo-Labeling}
+    \caption{Architecture of Cross-Model Pseudo-Labeling}
     \label{fig:cmpl-structure}
 \end{figure}
 
 \subsection{Math of CMPL}\label{subsec:math}
-The loss function of CMPL is similar to that one explaind above.
-But we have to differ from the loss generated from the supervised samples with the label known and the unsupervised loss where no labels are knonw.
+The loss function of CMPL is similar to the one explained above.
+But we have to distinguish between the loss generated from the supervised samples, where the labels are known, and the unsupervised loss, where no labels are known.
 
-The two equations~\ref{eq:cmpl-losses1} and~\ref{eq:cmpl-losses2} are normal Cross-Entropy loss functions generated with the supervised labels of the two seperate models.
+The two equations~\ref{eq:cmpl-losses1} and~\ref{eq:cmpl-losses2} are normal Cross-Entropy loss functions generated with the supervised labels of the two separate models.
 
@@ -181,7 +173,7 @@
 \end{align}
 
 Equation~\ref{eq:cmpl-loss3} and~\ref{eq:cmpl-loss4} are the unsupervised losses.
-They are very similar to FastMatch, but
+They are very similar to FixMatch, but it is important to note that the confidence-based masking is applied by the respective opposite model.
 
 \begin{align}
     \label{eq:cmpl-loss3}
@@ -198,18 +190,23 @@ The loss is regulated by an hyperparamter $\lambda$ to enhance the importance of
     \mathcal{L} = (\mathcal{L}_s^F + \mathcal{L}_s^A) + \lambda(\mathcal{L}_u^F + \mathcal{L}_u^A)
 \end{equation}
 
+\section{Architecture}\label{sec:Architecture}
+The used model architectures depend heavily on the task to be performed.
+In this case the task is video action recognition.
+A 3D-ResNet50 was chosen for the main model and a smaller 3D-ResNet18 for the auxiliary model.
+
 \section{Performance}\label{sec:performance}
-In figure~\ref{fig:results} a performance comparison is shown between just using the supervised samples for training against some different pseudo label frameworks.
+In Figure~\ref{fig:results} a performance comparison is shown between using only the supervised samples for training and several different pseudo-label frameworks.
 One can clearly see that the performance gain with the new CMPL framework is quite significant.
 For evaluation the Kinetics-400 and UCF-101 datasets are used.
-And as a backbone model a 3D-ResNet18 and 3D-ResNet50 are used.
+As backbone models a 3D-ResNet18 and a 3D-ResNet50 are used.
+Even when only 1\% of the true labels of the UCF-101 dataset are known, 25.1\% of the labels could be predicted correctly.
 
 \begin{figure}[h]
     \centering
     \includegraphics[width=\linewidth]{../presentation/rsc/results}
     \caption{Performance comparisons between CMPL, FixMatch and supervised learning only}
-    \Description{A woman and a girl in white dresses sit in an open car.}
     \label{fig:results}
 \end{figure}
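For reference alongside the patch: the confidence-based masking of Equation~\ref{eq:fixmatch} and the combined objective of the CMPL loss can be illustrated with a minimal Python sketch. This is a toy illustration with hypothetical function names and hand-picked probabilities, not the paper's implementation; in CMPL the `weak' probabilities fed to each loss would come from the respective opposite model.

```python
import math

def cross_entropy(pred_probs, label):
    # H(y_hat, p): negative log-likelihood of the pseudo-label under the prediction
    return -math.log(pred_probs[label])

def masked_unsupervised_loss(weak_probs, strong_probs, tau):
    """FixMatch-style unsupervised loss over one batch of size B_u.

    weak_probs[i]  -- class probabilities p_i from the weakly augmented input
    strong_probs[i] -- class probabilities from the strongly augmented input
    tau            -- confidence threshold
    """
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence >= tau:                       # indicator 1(max(p_i) >= tau)
            pseudo_label = p_weak.index(confidence)  # y_hat_i = argmax of p_i
            total += cross_entropy(p_strong, pseudo_label)
    return total / len(weak_probs)                  # average over the batch B_u

def cmpl_loss(ls_f, ls_a, lu_f, lu_a, lam):
    # L = (L_s^F + L_s^A) + lambda * (L_u^F + L_u^A)
    return (ls_f + ls_a) + lam * (lu_f + lu_a)
```

With `tau = 0.7`, a sample whose weak prediction peaks at 0.6 contributes nothing, while a sample peaking at 0.9 contributes the cross-entropy of the strong prediction against that pseudo-label; `cmpl_loss` then weights the two unsupervised terms by the hyperparameter $\lambda$.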