BEYOND TRADITIONAL TRANSFER LEARNING: CO-FINETUNING FOR ACTION LOCALISATION

Abstract

Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large "upstream" datasets for classification, as such labels are easy to collect, and then finetuned on "downstream" tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning: simultaneously training a single model on multiple "upstream" and "downstream" tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend our approach to multiple "upstream" datasets to further improve performance. In particular, co-finetuning significantly improves the performance on rare classes in our downstream task, as it has a regularising effect, and enables the network to learn feature representations that transfer between different datasets. Finally, we show that by co-finetuning with public video classification datasets, we achieve significant improvements for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works that develop intricate models.

1. INTRODUCTION

The computer vision community has made impressive progress in video classification with deep learning, first with Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Carreira & Zisserman, 2017; Feichtenhofer et al., 2019), and more recently with transformers (Vaswani et al., 2017; Arnab et al., 2021a; Fan et al., 2021). However, progress in other, more challenging video understanding tasks, such as spatio-temporal action localisation (Pan et al., 2021; Tang et al., 2020; Zhao et al., 2022), has lagged behind significantly in comparison. One major reason is the scarcity of data with the fine-grained annotations that such complex tasks require. To cope with this challenge, the de facto approach adopted by the state of the art is transfer learning (popularised by Girshick et al. (2014)). In the conventional setting, a model is first pre-trained on a large "upstream" dataset, which is typically labelled with classification annotations as they are less expensive to collect. The model is then "finetuned" on a smaller dataset, often for a different task, where fewer labelled examples are available (Mensink et al., 2022). The intuition is that a model pre-trained on an auxiliary, "upstream" dataset learns generalisable features, and therefore its parameters do not need to be significantly updated during finetuning. For video understanding, the most common "upstream" dataset is Kinetics (Kay et al., 2017), as demonstrated by the fact that the majority of recent work addressing spatio-temporal action localisation pretrains on it (Zhao et al., 2022; Pan et al., 2021; Wu et al., 2019). Similarly, for image-level tasks, ImageNet (Deng et al., 2009) is the most common "upstream" dataset. Our objective in this paper is to train more accurate models for spatio-temporal action detection, and we do so by proposing an alternative training strategy of co-finetuning.
Instead of using the additional classification data in a separate pre-training phase, we simultaneously train for both classification and detection tasks. Intuitively, the additional co-finetuning datasets act as a regulariser during training, benefiting in particular the rare classes in the target dataset, which the network could otherwise overfit to. Moreover, discriminative features learned from classification datasets may also transfer to the detection dataset, even though the target task and labels are different. Our thorough experimental analyses confirm these intuitions. As shown in Fig. 1, our co-finetuning strategy improves spatio-temporal action localisation results for the vast majority of the action classes on the AVA dataset, improving substantially on the rarer classes with few labelled examples. In particular, co-finetuning performs better than traditional transfer learning when using the same total amount of data. Moreover, co-finetuning easily accommodates additional "upstream" classification datasets to improve results even further.
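As a rough sketch (not the authors' implementation), the training strategy above can be pictured as drawing each training batch from one of several datasets, with a shared backbone optimised on whichever task that batch belongs to. The dataset names and sizes below are hypothetical placeholders, and sampling proportional to dataset size is one plausible mixing rule among several.

```python
import random

# Hypothetical dataset sizes (number of labelled examples); the names
# echo the datasets discussed in the text but the counts are made up.
DATASETS = {"kinetics": 240_000, "moments": 790_000, "ava": 210_000}

def sampling_weights(sizes):
    """Probability of drawing a batch from each dataset, here chosen
    proportional to dataset size (one possible mixing heuristic)."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}

def cofinetune_schedule(sizes, num_steps, seed=0):
    """Return the sequence of dataset names to draw a batch from at each
    training step, so all tasks are optimised simultaneously rather than
    in separate pre-training and finetuning phases."""
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[n] for n in names]
    return [rng.choices(names, weights=weights)[0] for _ in range(num_steps)]
```

In a full implementation, each step would run the shared backbone on the sampled batch and apply the matching task head and loss (classification or detection), so gradients from all datasets flow into the same parameters throughout training.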
Our approach is thus in stark contrast to previous works on spatio-temporal action detection which develop complex architectures to model long-range relationships using graph neural networks (Arnab et al., 2021b; Wang & Gupta, 2018; Sun et al., 2018; Baradel et al., 2018; Zhang et al., 2019), external memories (Wu et al., 2019; Pan et al., 2021; Tang et al., 2020) and additional object detection proposals (Wang & Gupta, 2018; Wu & Krahenbuhl, 2021; Tang et al., 2020; Zhang et al., 2019). Instead, we use a simple detection architecture and modify only the training strategy to achieve higher accuracies, outperforming prior work. We conduct thorough ablation analyses to validate our method, and uncover additional findings: for example, although Kinetics (Kay et al., 2017) is the most common upstream pretraining dataset for video tasks, used by all previous work addressing spatio-temporal action detection on AVA, Moments in Time (Monfort et al., 2019) is actually better. Finally, a by-product of our strategy of co-finetuning with classification tasks is that our network can simultaneously perform these tasks, and is competitive with the state of the art on these datasets too.

2. RELATED WORK

We first discuss transfer and multi-task learning, to which our proposed co-finetuning approach is related. We then review action detection models, as this is our final task of interest. Transfer and multi-task learning. The predominant paradigm for training a deep neural network is to pre-train it on a large "upstream" dataset with many annotations, and then to finetune it on the final "downstream" dataset of interest. This approach was notably employed by R-CNN (Girshick et al., 2014), which leveraged image classification models pre-trained on ImageNet (Deng et al., 2009) for object detection on the markedly smaller Pascal VOC dataset (Everingham et al., 2015). Since then, this strategy has become ubiquitous in training deep neural networks (which require large amounts of data to fit) across a wide range of vision tasks (Mensink et al., 2022). The training strategy of Girshick et al. (2014) is a form of "inductive transfer learning" based on the taxonomy of Pan & Yang (2009), and we simply refer to it as "traditional transfer learning" in this paper.



Figure 1: Improvements achieved by our co-finetuning strategy on the AVA dataset (Gu et al., 2018). AVA is a long-tailed dataset, and we attain significant improvements, particularly on the rare classes (shown by the "Tail" and "Mid" subsets) in the dataset. Our improvements are also consistent across the three action types defined in the AVA dataset. We split the 60 class labels from AVA into "Head" classes (> 10,000 ground truth instances), "Tail" classes (< 1,000 instances), and the remaining ones into "Mid" classes (detailed in Sec. 4.2).
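The head/mid/tail partition described in the caption can be sketched as a simple threshold rule over per-class instance counts. This is an illustration of the stated thresholds only; the class names below are hypothetical and the actual AVA counts come from the dataset's ground-truth annotations.

```python
def split_classes(instance_counts, head_min=10_000, tail_max=1_000):
    """Partition class labels into head / mid / tail subsets by their
    number of ground-truth instances, following the thresholds in the
    text: > 10,000 instances -> head, < 1,000 -> tail, otherwise mid."""
    head = [c for c, n in instance_counts.items() if n > head_min]
    tail = [c for c, n in instance_counts.items() if n < tail_max]
    mid = [c for c, n in instance_counts.items() if tail_max <= n <= head_min]
    return head, mid, tail
```

Given the long-tailed label distribution of AVA, most of the 60 classes fall into the mid and tail buckets, which is where the text reports the largest gains from co-finetuning.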

