BEYOND TRADITIONAL TRANSFER LEARNING: CO-FINETUNING FOR ACTION LOCALISATION

Abstract

Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained on large "upstream" datasets for classification, as such labels are easy to collect, and then finetuned on "downstream" tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning: simultaneously training a single model on multiple "upstream" and "downstream" tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend our approach to multiple "upstream" datasets to further improve performance. In particular, co-finetuning significantly improves the performance on rare classes in our downstream task, as it has a regularising effect, and enables the network to learn feature representations that transfer between different datasets. Finally, we show that by co-finetuning with public video classification datasets, we achieve significant improvements for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works which develop intricate models.

1. INTRODUCTION

The computer vision community has made impressive progress in video classification with deep learning, first with Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Carreira & Zisserman, 2017; Feichtenhofer et al., 2019), and more recently with transformers (Vaswani et al., 2017; Arnab et al., 2021a; Fan et al., 2021). However, progress in other, more challenging video understanding tasks, such as spatio-temporal action localisation (Pan et al., 2021; Tang et al., 2020; Zhao et al., 2022), has lagged behind significantly in comparison. One major reason is the scarcity of data with the fine-grained annotations that such complex tasks require. To cope with this challenge, the de facto approach adopted by the state-of-the-art is transfer learning (popularised by Girshick et al. (2014)). In the conventional setting, a model is first pretrained on a large "upstream" dataset, which is typically labelled with classification annotations as they are less expensive to collect. The model is then "finetuned" on a smaller dataset, often for a different task, where fewer labelled examples are available (Mensink et al., 2022). The intuition is that a model pretrained on an auxiliary, "upstream" dataset learns generalisable features, and therefore its parameters do not need to be significantly updated during finetuning. For video understanding, the most common "upstream" dataset is Kinetics (Kay et al., 2017), as demonstrated by the fact that the majority of recent works addressing the task of spatio-temporal action localisation pretrain on it (Zhao et al., 2022; Pan et al., 2021; Wu et al., 2019). Similarly, for image-level tasks, ImageNet (Deng et al., 2009) is the most common "upstream" dataset. Our objective in this paper is to train more accurate models for spatio-temporal action detection, and we do so by proposing an alternative training strategy of co-finetuning.
Instead of using the additional classification data in a separate pretraining phase, we simultaneously train for both classification and detection tasks. Intuitively, the additional co-finetuning datasets can act as a regulariser during training, benefiting in particular the rare classes in the target dataset, to which the network could otherwise overfit. Moreover, discriminative features learned on classification datasets may also transfer to the detection dataset, even though the target task and labels are different.
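The core mechanics of this idea can be sketched in a few lines: each training step draws data from every task, each task has its own lightweight head, and the summed losses backpropagate jointly into a shared backbone. The sketch below is a minimal toy illustration of that update structure only, not the paper's actual architecture or datasets: the linear "backbone", the dataset names, sizes, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a large "upstream" classification set and a smaller
# "downstream" set (names, shapes, and class counts are illustrative).
D, H_DIM = 8, 8
data = {
    "upstream":   (rng.normal(size=(64, D)), rng.integers(0, 4, 64), 4),
    "downstream": (rng.normal(size=(16, D)), rng.integers(0, 3, 16), 3),
}

# One shared backbone, plus one task-specific linear head per dataset.
W = rng.normal(scale=0.1, size=(D, H_DIM))
heads = {t: rng.normal(scale=0.1, size=(H_DIM, c))
         for t, (_, _, c) in data.items()}

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def co_finetune_step(lr=0.05):
    """One co-finetuning step: every task contributes a loss in the SAME
    update, so the shared backbone receives gradients from all tasks."""
    global W
    total_loss, dW = 0.0, np.zeros_like(W)
    for task, (x, y, _) in data.items():
        feats = x @ W                              # shared representation
        probs = softmax(feats @ heads[task])       # per-task prediction
        n = len(y)
        total_loss += -np.log(probs[np.arange(n), y]).mean()
        dlogits = probs.copy()
        dlogits[np.arange(n), y] -= 1.0            # d(xent)/d(logits)
        dlogits /= n
        dW += x.T @ (dlogits @ heads[task].T)      # accumulate backbone grad
        heads[task] -= lr * feats.T @ dlogits      # per-task head update
    W -= lr * dW                                   # joint backbone update
    return total_loss

losses = [co_finetune_step() for _ in range(100)]
```

Because the backbone gradient `dW` is accumulated across tasks before the update, gradients from the larger classification set act on the same parameters that serve the smaller detection-style set, which is the source of the regularising effect described above. In practice one would sample a minibatch per task each step rather than use the full datasets as done here for determinism.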

