REVEALING SINGLE FRAME BIAS FOR VIDEO-AND-LANGUAGE LEARNING
Anonymous authors
Paper under double-blind review

Abstract

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if so, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Code and models will be released upon acceptance.

1. INTRODUCTION

Video and language are the two primary signals that constitute much of the world we perceive every day: we observe our surrounding environment with our eyes in the form of continuous visual input (video), and communicate with others via language. Intuitively, this leads one to assume that training an effective video-and-language model should require multiple video frames as input. Standard methods Zhu & Yang (2020); Xu et al. (2021); Li et al. (2020a); Luo et al. (2021) in this area typically use multiple densely sampled frames for training. Recent work Lei et al. (2021) proposes sparse sampling for video-and-language understanding, claiming that a few sparsely sampled clips are sufficient for learning due to the high redundancy in videos. This technique has been shown Lei et al. (2021); Zellers et al. (2021) to be successful on various video-language benchmarks Jang et al. (2017); Xu et al. (2016); Anne Hendricks et al. (2017); Krishna et al. (2017a); Xu et al. (2017); Yu et al. (2018); Lei et al. (2018). However, as demonstrated in Bain et al. (2021); Luo et al. (2021); Lei et al. (2021), training with fewer frames (e.g., a single frame) leads to significantly worse performance compared to their multi-frame counterparts. In contrast, in this work, we show that with proper modeling, single-frame models can achieve competitive performance, hence also revealing a "static appearance bias" in popular video-and-language datasets.

We start by building a standard image-language model, with a vision encoder and a language encoder for image and text encoding, followed by a multi-modal encoder with cross-attention for cross-modal fusion. We pre-train the model on large-scale image-text and video-text datasets Chen et al. (2015); Krishna et al. (2017b); Ordonez et al. (2011); Sharma et al. (2018); Changpinyo et al. (2021); Bain et al. (2021). For fine-tuning, we randomly sample a single frame for training, and ensemble multiple uniformly sampled frames per video to make a video-level prediction at inference. Single-frame predictions are often noisy and inaccurate, as they are made from incomplete information from single frames without any context (see examples in Figure 5). Due to this issue, single-frame training typically performs significantly worse than multi-frame training Lei et al. (2021); Bain et al. (2021); Luo et al. (2021). Previous work Hendrycks et al. (2019) suggests that pre-training improves model robustness in the face of label corruption for image recognition. Inspired by this, we hypothesize that large-scale pre-training helps mitigate noise from single-frame training.
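The three-component design described above (a vision encoder, a language encoder, and a multi-modal encoder that fuses the two streams with cross-attention) can be expressed as a simple composition. The sketch below is purely structural: the encoders are stand-in callables, and all names are illustrative assumptions rather than identifiers from the paper's implementation.

```python
# Structural sketch (not the authors' code) of the described image-language
# model. Each component is an injected callable so the wiring is explicit;
# in practice these would be transformer encoders, with the multi-modal
# fuser applying cross-attention between visual and textual features.
from typing import Callable, List

Vec = List[float]

class ImageLanguageModel:
    def __init__(self,
                 vision_encoder: Callable[[object], Vec],
                 language_encoder: Callable[[str], Vec],
                 multimodal_fuser: Callable[[Vec, Vec], Vec]):
        self.vision_encoder = vision_encoder
        self.language_encoder = language_encoder
        self.multimodal_fuser = multimodal_fuser

    def forward(self, frame: object, text: str) -> Vec:
        visual = self.vision_encoder(frame)        # frame -> visual features
        textual = self.language_encoder(text)      # text  -> token features
        return self.multimodal_fuser(visual, textual)  # cross-modal fusion
```

Because the model consumes a single frame, the same forward pass serves both image-text pre-training and single-frame video fine-tuning.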

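The fine-tuning recipe (random single-frame sampling at training time, uniform multi-frame ensembling at inference) can be sketched as follows. The function names, the segment-midpoint sampling, and the mean-pooling aggregation are illustrative assumptions, not the paper's exact implementation; mean pooling is just one plausible ensemble choice.

```python
import random

def sample_training_frame(num_frames: int, rng: random.Random) -> int:
    """Training: pick one frame index uniformly at random."""
    return rng.randrange(num_frames)

def uniform_frame_indices(num_frames: int, num_samples: int) -> list:
    """Inference: pick `num_samples` indices spread evenly over the video,
    taking the midpoint of each equal-length segment (an assumption)."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def ensemble_prediction(per_frame_scores: list) -> list:
    """Mean-pool per-frame class scores into one video-level score
    (one plausible frame-ensemble strategy)."""
    n = len(per_frame_scores)
    num_classes = len(per_frame_scores[0])
    return [sum(scores[c] for scores in per_frame_scores) / n
            for c in range(num_classes)]
```

For example, a 100-frame video with four inference samples yields frame indices [12, 37, 62, 87]; each sampled frame is scored independently by the single-frame model, and the scores are averaged into the video-level prediction.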