PARAMETER EFFICIENT MULTIMODAL TRANSFORMERS FOR VIDEO REPRESENTATION LEARNING

Abstract

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits the model's ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the parameters of the Transformers by up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on instance similarity measured in the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.

1. INTRODUCTION

Learning multimodal representations from unlabeled videos has received considerable attention (Baltrušaitis et al., 2018). Audio-visual learning is of particular interest due to the abundance of videos with natural audio-visual co-occurrence (Owens & Efros, 2018; Owens et al., 2018; Arandjelovic & Zisserman, 2018; Ephrat et al., 2018; Gao & Grauman, 2019; Alwassel et al., 2019). However, existing approaches learn localized representations from short videos (hundreds of milliseconds to just under a few seconds), capturing only short-term dependencies in the data. While this is useful for certain applications, e.g., source separation (Ephrat et al., 2018) and atomic action recognition (Gu et al., 2018), learning representations that capture long-term dependencies is equally important, e.g., for activity recognition (Kay et al., 2017; Carreira et al., 2019; Sigurdsson et al., 2016). Unfortunately, processing long videos requires large memory resources, and capturing long-term dependencies is a long-standing problem (Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Vaswani et al., 2017). In language understanding, strong progress has been made in large-scale learning of contextualized language representations using Transformers (Vaswani et al., 2017; Howard & Ruder, 2018; Peters et al., 2018; Radford et al., 2018; 2019; Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019). Riding on the success of Transformers, several recent works have extended them to the multimodal setting by adding a vision module to the Transformer framework (Sun et al., 2019b; Lu et al., 2019). However, these models are typically not trained end-to-end; they rely on a language-pretrained BERT (Devlin et al., 2019), which is kept fixed throughout, and train only the visual components. While the pretrained BERT helps accelerate convergence and provides a reliable extra supervision signal to the visual module, keeping it fixed limits the model's ability to learn cross-modal information in an end-to-end manner.

In this work, we make three key contributions.
First, we propose an end-to-end trainable bidirectional Transformer architecture that learns contextualized audio-visual representations of long videos. Our model, shown in Figure 1, consists of audio/visual CNNs, audio/visual Transformers, and a multimodal Transformer. The CNNs operate on short (e.g., one-second) video clips and are intended to capture short-term dynamics within each modality. The Transformer layers operate on long video sequences (e.g., 30 seconds), capturing long-term dynamics. To enable end-to-end training, we propose a novel parameter reduction technique that shares parts of the weight parameters across Transformers and across layers within each Transformer. We show that this results in up to 97% parameter reduction, enabling end-to-end training of our model with minimal performance degradation. To the best of our knowledge, our work is the first to report end-to-end trained multimodal Transformers, and the first to apply Transformers to audio-visual representation learning.

The quality of negative samples is crucial in contrastive learning, which is part of our learning objective. As our second contribution, we propose a content-aware negative sampling strategy that favors negatives sufficiently similar to a positive instance. Our approach measures similarity by reusing the CNN embeddings obtained during model training, and thus does not introduce extra parameters to learn. We show that this improves performance over standard sampling strategies.

Our third contribution is a systematic evaluation of different modality fusion strategies. Existing works on multimodal BERT (all using vision-and-language data) typically apply one fusion strategy without thoroughly comparing it with alternatives, e.g., some works perform early fusion (Sun et al., 2019b; Su et al., 2020) while others perform mid-level fusion (Lu et al., 2019; Tan & Bansal, 2019). As a result, it is unclear how different fusion methods affect the final performance.
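The layer-wise sharing with low-rank modality adaptation can be illustrated with a minimal numpy sketch (the shapes, rank, and additive composition below are illustrative assumptions, not the paper's exact decomposition): a single full-rank weight is shared across all layers and both modalities, and each (layer, modality) pair owns only a small low-rank correction.

```python
import numpy as np

d, r = 512, 16              # hidden size and low rank (illustrative values)
n_layers, n_modalities = 12, 2

rng = np.random.default_rng(0)

# One full-rank weight shared by every layer and modality.
W_shared = rng.standard_normal((d, d))

# Per-(layer, modality) low-rank factors: U @ V has rank <= r.
U = rng.standard_normal((n_layers, n_modalities, d, r))
V = rng.standard_normal((n_layers, n_modalities, r, d))

def layer_weight(layer: int, modality: int) -> np.ndarray:
    """Compose the effective weight for one layer of one modality."""
    return W_shared + U[layer, modality] @ V[layer, modality]

# Parameter count: shared matrix plus all low-rank factors...
shared_params = d * d + n_layers * n_modalities * 2 * d * r
# ...versus fully independent matrices per layer and modality.
full_params = n_layers * n_modalities * d * d
print(f"reduction: {1 - shared_params / full_params:.1%}")
```

With these toy sizes the script prints `reduction: 89.6%`; smaller ranks or sharing the low-rank factors across layers push the reduction further, toward the 97% figure reported for the full model.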
In this work, we compare three fusion strategies (early, mid, late) and show the superiority of mid-level fusion. To demonstrate our approach, we pretrain our model on long (30-second) video clips from Kinetics-700 (Carreira et al., 2019) and finetune it on various video classification tasks. One benefit of our architecture's modular design is flexibility: once pretrained, we can use any of the subnetworks for downstream tasks depending on the modalities involved (audio-only, visual-only, audio-visual) and video lengths (short and long). To show this, we evaluate our model on UCF101 (Soomro et al., 2012) and ESC-50 (Piczak, 2015) for short-term visual/audio classification, and on Charades (Sigurdsson et al., 2016) and Kinetics-Sounds (Arandjelovic & Zisserman, 2017) for long-term audio-visual action recognition.
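The content-aware negative sampling described above can be sketched in numpy (a simplified version; the function name, the cosine similarity, and the softmax weighting are assumptions for illustration): negatives are drawn with probability increasing in their similarity to the positive instance in the CNN embedding space, so the model sees harder negatives more often.

```python
import numpy as np

def sample_negatives(anchor: np.ndarray, candidates: np.ndarray,
                     k: int, rng: np.random.Generator) -> np.ndarray:
    """Draw k negative indices, favoring candidates similar to the anchor.

    anchor:     (d,)   CNN embedding of the positive instance.
    candidates: (n, d) CNN embeddings of candidate negatives.
    """
    # Cosine similarity between the anchor and each candidate.
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ a
    # Softmax over similarities: harder (more similar) negatives are
    # sampled more often, but every candidate keeps some probability.
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return rng.choice(len(candidates), size=k, replace=False, p=probs)

rng = np.random.default_rng(0)
anchor = rng.standard_normal(128)
candidates = rng.standard_normal((100, 128))
neg_idx = sample_negatives(anchor, candidates, k=8, rng=rng)
```

Because the CNN embeddings are computed during training anyway, this strategy adds no learnable parameters, only the similarity computation and the weighted draw.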

2. APPROACH

Figure 1 shows an overview of the proposed model architecture. The input to our model is a sequence of visual clips v_1:T and the corresponding sequence of audio streams a_1:T. For example, each v_t can be a one-second visual clip and a_t the corresponding one-second audio segment.

Figure 1: (Left) Our model consists of CNNs encoding short-term dynamics of each modality and Transformers encoding long-term dynamics of audio-visual information from videos. (Right) To alleviate excessive memory requirements, we propose an efficient parameter sharing scheme based on matrix decomposition with low-rank approximation, which allows us to train our model end-to-end.
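The data flow in Figure 1 can be sketched end-to-end with stand-in components (the shapes, the toy `transformer` function, and the pre-computed CNN embeddings are all illustrative assumptions, not the actual architecture):

```python
import numpy as np

T, d = 30, 512  # 30 one-second segments, illustrative embedding size
rng = np.random.default_rng(0)

def transformer(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a Transformer: mixes in global (long-term) context."""
    return tokens + tokens.mean(axis=0, keepdims=True)

# CNN outputs: one short-term embedding per one-second clip/segment.
v = rng.standard_normal((T, d))  # visual CNN embeddings
a = rng.standard_normal((T, d))  # audio CNN embeddings

# Modality-specific Transformers contextualize each stream over time...
v_ctx = transformer(v)
a_ctx = transformer(a)

# ...then the multimodal Transformer attends over the concatenated
# streams (mid-level fusion, the best-performing strategy in this work).
av = transformer(np.concatenate([v_ctx, a_ctx], axis=0))  # (2T, d)
```

Early fusion would concatenate the two streams before any Transformer layer, and late fusion only after both streams are fully processed; the sketch places the fusion point mid-way, matching the configuration the paper finds best.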

