REVEALING SINGLE FRAME BIAS FOR VIDEO-AND-LANGUAGE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs of using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Code and models will be released upon acceptance.

1. INTRODUCTION

Video and language are the two primary signals that constitute much of the world we perceive every day: we observe our surrounding environment with our eyes in the form of continuous visual input (video), and communicate with others via language. Intuitively, this leads one to assume that training an effective video-and-language model should require multiple video frames as input. Standard methods Zhu & Yang (2020); Xu et al. (2021); Li et al. (2020a); Luo et al. (2021) in this area typically use multiple densely sampled frames for training. Recent work Lei et al. (2021) proposes sparse sampling for video-and-language understanding, claiming that a few sparsely sampled clips are sufficient for learning due to the high redundancy in videos. This technique has been shown Lei et al. (2021); Zellers et al. (2021) to be successful on various video-language benchmarks Jang et al. (2017); Xu et al. (2016); Anne Hendricks et al. (2017); Krishna et al. (2017a); Xu et al. (2017); Yu et al. (2018); Lei et al. (2018). However, as demonstrated in Bain et al. (2021); Luo et al. (2021); Lei et al. (2021), training with fewer frames (e.g., a single frame) leads to significantly worse performance than training with multiple frames.

In contrast, in this work we show that with proper modeling, single-frame models can achieve competitive performance, in turn revealing a "static appearance bias" in popular video-and-language datasets. We start by building a standard image-language model, with a vision encoder and a language encoder for image and text encoding, followed by a multi-modal encoder with cross-attention for cross-modal fusion. We pre-train the model on large-scale image-text and video-text datasets Chen et al. (2015); Krishna et al. (2017b); Ordonez et al. (2011); Sharma et al. (2018); Changpinyo et al. (2021); Bain et al. (2021). For fine-tuning, we randomly sample a single frame per video for training, and ensemble multiple uniformly sampled frames per video to make a video-level prediction at inference. Single-frame predictions are often noisy and inaccurate, as they are made from the incomplete information in a single frame without any context (see examples in Figure 5). Due to this issue, single-frame training typically performs significantly worse than multi-frame training Lei et al. (2021); Bain et al. (2021); Luo et al. (2021). Previous work Hendrycks et al. (2019) suggests that pre-training improves model robustness in the face of label corruption for image recognition. Inspired by this, we hypothesize that large-scale pre-training helps mitigate the noise from single-frame training. Our analyses in Section 5 support this hypothesis, showing that as we increase pre-training data size, the performance of our single-frame model improves drastically and its gap with a similarly trained multi-frame model is largely eliminated. Beyond training, these noisy single-frame predictions also render simple late fusion (e.g., mean pooling in ClipBERT Lei et al. (2021)) less effective at inference time. To deal with this issue, we propose an early fusion strategy, which takes all frames as model inputs to directly make a more informative video-level prediction.

Our analyses show that this early fusion ensemble method outperforms late fusion strategies and delivers consistently improved performance as more frames are used. We compare our approach with existing methods on six datasets across two video-language tasks: text-to-video retrieval (MSRVTT Xu et al. (2016), DiDeMo Anne Hendricks et al. (2017), and ActivityNet Captions Krishna et al. (2017a)) and video question answering (MSRVTT-QA Xu et al. (2017), ActivityNet-QA Yu et al. (2019), and MSRVTT-MC Yu et al. (2018)). Results show that our approach achieves competitive (and mostly better) performance than existing methods that use more training frames and more pre-training data, setting a new state of the art on multiple tasks. This conclusion holds from the short 15-second videos in MSRVTT to the 180-second videos in ActivityNet, demonstrating the effectiveness of our single-frame approach in diverse scenarios. More importantly, this strong single-frame performance reveals that current evaluations are biased towards still objects, scenes, etc., while temporal dynamics appear negligible, even though they should be important for "true" video-language understanding. To address this issue, we next propose two new tasks designed to test a model's true temporal modeling ability. Based on the videos and annotations from the fine-grained action recognition dataset Something-Something v2 (SSv2) Goyal et al. (2017a), we create two text-to-video retrieval tasks: one that uses SSv2's action templates as text queries, e.g., "Throwing [something] in the air and catching it", and another that uses its annotated labels as text queries, e.g., "Throwing keys in the air and catching it". See examples in Figure 2. The template task removes the objects and keeps only the actions, enabling an evaluation that focuses almost solely on temporal modeling. The label task, on the other hand, contains both actions and objects, requiring an understanding of both still objects and their motion. Lastly, we present several baselines on these new tasks and show that temporal modeling is essential for achieving high scores.

In summary, our contributions are three-fold: (i) We explore single-frame training for video-and-language tasks. While simple, our approach achieves state-of-the-art performance on a range of datasets, covering both text-to-video retrieval and video question answering. Importantly, this result reveals a surprising static appearance bias in these existing datasets. (ii) We conduct careful analyses showing that large-scale pre-training and a proper multi-frame ensemble strategy at inference are key to making single-frame trained models successful. (iii) We propose two new tasks specifically designed to test a model's fine-grained temporal modeling ability. These two new tasks complement existing benchmarks for a more comprehensive evaluation.
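To make the inference-time ensemble concrete, the two strategies can be sketched as follows. This is an illustrative toy, not our implementation: `toy_model` is a hypothetical stand-in for the multi-modal encoder, and the nonlinearity is there only to show that scoring all frames jointly (early fusion) is not the same as averaging per-frame scores (late fusion).

```python
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int) -> list:
    """Uniformly sample frame indices across the video (used at inference)."""
    step = num_video_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def toy_model(frames: np.ndarray, text: np.ndarray) -> float:
    """Hypothetical stand-in for the multi-modal encoder: scores a
    (num_frames, dim) frame stack against a (dim,) text feature."""
    pooled = frames.mean(axis=0)          # aggregate visual evidence
    return float(np.tanh(pooled @ text))  # nonlinear scoring head

def late_fusion(frames: np.ndarray, text: np.ndarray) -> float:
    """Late fusion: score each frame independently, then mean-pool the
    per-frame scores (ClipBERT-style mean pooling)."""
    return float(np.mean([toy_model(f[None, :], text) for f in frames]))

def early_fusion(frames: np.ndarray, text: np.ndarray) -> float:
    """Early fusion: feed all frames to the model at once, so a single,
    more informative video-level prediction is made."""
    return toy_model(frames, text)
```

Because the scoring head is nonlinear, the two strategies disagree in general: with frames `[[2, 0], [0, 0]]` and text `[1, 0]`, late fusion gives tanh(2)/2 ≈ 0.48 while early fusion gives tanh(1) ≈ 0.76.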

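The two proposed SSv2-based retrieval tasks differ only in how the text query is built from an annotation. A minimal sketch, assuming a hypothetical annotation record with `template` and `objects` fields (not SSv2's actual schema):

```python
def template_query(annotation: dict) -> str:
    """Template task: use the action template itself, with objects masked out,
    so retrieval must rely almost entirely on temporal/action cues."""
    return annotation["template"]

def label_query(annotation: dict) -> str:
    """Label task: fill each object placeholder in the template, so the query
    names both the action and the objects involved."""
    query = annotation["template"]
    for obj in annotation["objects"]:
        query = query.replace("[something]", obj, 1)
    return query

# Hypothetical annotation record for illustration.
ann = {"template": "Throwing [something] in the air and catching it",
       "objects": ["keys"]}
```

Here `template_query(ann)` keeps the placeholder form, while `label_query(ann)` yields "Throwing keys in the air and catching it".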
2. RELATED WORK

Vision and Language. Vision-and-language learning considers the problem of learning from both visual and textual signals. Depending on the visual input type, methods in this area can be roughly categorized into two groups: those using images Anderson et al. (2018); Tan & Bansal (2019); Lu et al. (2019); Chen et al. (2020); Li et al. (2019; 2020b; 2021b; 2022); Radford et al. (2021) and those using videos Anne Hendricks et al. (2017); Sun et al. (2019); Zhu & Yang (2020); Xu et al. (2021); Li et al. (2020a); Lei et al. (2021); Zellers et al. (2021); Bain et al. (2021); Lin et al. (2021). Standard video-and-language methods Zhu & Yang (2020); Xu et al. (2021); Li et al. (2020a); Lei et al. (2021); Zellers et al. (2021); Luo et al. (2021) are typically trained with multiple video frames. This multi-frame training strategy has been the norm and is shown to work well across various datasets Xu et al. (2016); Anne Hendricks et al. (2017); Krishna et al. (2017a); Jang et al. (2017); Xu et al. (2017); Lei et al. (2018; 2020). Unlike previous work that uses multiple frames for training, we explore single-frame training (i.e., similar to training an image-text model) and show that it achieves strong performance on existing video-text benchmarks. Concurrent work Buch et al. (2022) proposes a new module, an atemporal probe, for selecting the best single frame as input to a trained image-text model during inference; in contrast, we utilize multiple uniformly sampled frames and study more effective ways of ensembling information across frames.

