JOINTLY LEARNING VISUAL AND AUDITORY SPEECH REPRESENTATIONS FROM RAW DATA

Abstract

We present RAVEn, a self-supervised multi-modal approach for jointly learning visual and auditory speech representations. Our pre-training objective involves encoding masked inputs and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models are available at https://github.com/ahaliassos/raven.

* Work done at Meta AI.

Masking. We employ masking to encourage the students to take context into account when solving the task. Given a grayscale video x^v ∈ R^{T×H×W} with resolution (H, W) and T frames, and an audio sample x^a ∈ R^N of length N, we sample each video frame with probability 0.2 as a starting mask index; if a frame is selected, the three consecutive frames beginning there are zeroed out (ablation in Section 4.4). A similar mask is applied to the auditory input, except that it is enlarged by a factor of 16,000/25 = 640 (since the audio is sampled at 16,000 Hz and the video at 25 fps).

Encoders. The masked video and audio, x^v and x^a respectively, are fed to their corresponding student encoders f_e^v and f_e^a, yielding features f_e^v(x^v) ∈ R^{T×D} and f_e^a(x^a) ∈ R^{T×D}, where D is the feature dimensionality.
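The masking scheme described above can be sketched as follows. This is a minimal numpy illustration under our own naming and array-shape conventions, not the released implementation:

```python
import numpy as np

def mask_video_and_audio(video, audio, p=0.2, span=3, scale=640, rng=None):
    """Zero out random temporal spans of a video (T, H, W) and the
    corresponding enlarged spans of its audio waveform (N,).

    Each frame is independently chosen as a mask start with probability p;
    a chosen start zeroes `span` consecutive frames. The audio mask covers
    the same region scaled by `scale` = 16,000 Hz / 25 fps = 640.
    """
    if rng is None:
        rng = np.random.default_rng()
    T = video.shape[0]
    starts = rng.random(T) < p                 # per-frame start indicators
    frame_mask = np.zeros(T, dtype=bool)
    for t in np.flatnonzero(starts):
        frame_mask[t:t + span] = True          # spans may overlap / clip at T
    video = video.copy()
    video[frame_mask] = 0.0
    audio = audio.copy()
    audio_mask = np.repeat(frame_mask, scale)[:audio.shape[0]]
    audio[audio_mask] = 0.0
    return video, audio, frame_mask
```

Note that the audio mask is simply the frame mask repeated 640 times per entry, so the zeroed regions of the two modalities stay temporally aligned.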

1. INTRODUCTION

The sound of someone articulating words coincides with the sight of movements in and around their mouth. Both a recording of a speech waveform and a corresponding silent video of mouth motion provide rich, but not identical, information on which words were uttered. Despite the difficulty of interpreting lip movements compared with an audio waveform, the task of visual speech recognition (VSR; also known as lipreading) has important applications, ranging from recognising utterances in a noisy environment (Ma et al., 2021b; Afouras et al., 2018a; Martinez et al., 2020; Makino et al., 2019) and aiding people suffering from aphonia (an inability to speak), to transcribing archival silent films and detecting DeepFake videos (Haliassos et al., 2021). Auditory (also known as automatic) speech recognition (ASR) and VSR benefit greatly from the combination of high-capacity neural networks and large datasets. Rapid advances in modern hardware are enabling the use of ever-growing, data-hungry networks, but the effort required for transcription hinders the scaling of labelled data along with the models. One way to leverage unlabelled videos for VSR is to use an external ASR model for pseudo-labelling (Afouras et al., 2020; Ma et al., 2022). However, this requires a large amount of labelled data to train a strong ASR model in the first place, and supervised VSR training with long sequences often poses optimisation problems, requiring costly curriculum learning strategies (Chung et al., 2017; Ma et al., 2022) or pre-training the feature extractor with isolated words (Afouras et al., 2018a; Ma et al., 2021b). A solution is to first learn, in a self-supervised way, general representations from large corpora of unlabelled data, and then fine-tune them on smaller labelled datasets (Mohamed et al., 2022).
The fine-grained correspondence between the (synchronised) visual and auditory modalities provides a natural source of self-supervision, and can produce highly semantic representations invariant to noise not shared between the modalities. However, approaches leveraging this correspondence either (1) only work for word-level samples rather than continuous speech (Chung & Zisserman, 2016; Chung et al., 2019; 2020); (2) use handcrafted features (e.g., spectrograms or MFCCs) as their inputs or targets (Ma et al., 2021a; Shi et al., 2022), which contain inductive biases that may influence the learned representations; (3) use multi-stage pre-training procedures (Ma et al., 2021a; Shi et al., 2022; Pan et al., 2022); and/or (4) use separate pre-training strategies for VSR and ASR (Shi et al., 2022), complicating the process of obtaining representations suitable for both tasks.

In this paper, we present a single-stage self-supervised approach that jointly learns visual and auditory speech representations from raw video and audio only. We dub our approach RAVEn (Raw Audio-Visual Speech Encoders). It involves a pair of student-teacher networks for each modality, whereby the students encode temporally-masked inputs and, through the use of lightweight Transformer-based predictors, regress outputs of momentum-based teachers (Grill et al., 2020; Caron et al., 2021) that are presented with unmasked inputs. Further, given that audio contains more information relevant to speech than video, we propose a learning strategy that accounts for the expected difference in the quality of targets between the modalities. Namely, while the audio student predicts outputs from both the video and audio teachers (cross- and within-modal learning), the video student predicts only auditory targets (cross-modal learning). As we show, this setup leads to better downstream performance for both VSR and ASR than other strategies.
We conduct experiments with models and datasets of different sizes. We find that, when fine-tuning our pre-trained models for VSR and ASR with only 30 hours of labelled data from LRS3 (Afouras et al., 2018b), RAVEn surpasses recent self-supervised methods by a large margin in most settings. Coupling pre-training with self-training reaches 23.8% WER for VSR on LRS3, even outperforming a method trained on 3,000× more transcribed hours (Serdyuk et al., 2021). At the same time, we are better than or on par with the recent AV-HuBERT method (Shi et al., 2022) on ASR, without using a task-dependent pre-training strategy or handcrafted features. Using the full 433-hour LRS3 dataset for fine-tuning pushes the results even further, achieving 23.1% / 1.4% WER for VSR / ASR, respectively. Similarly strong performance is observed on the LRS2 dataset (Appendix B).

2. RELATED WORK

Masked prediction. The pre-training task of predicting missing content given masked inputs has proven successful in various domains, such as natural language processing (Devlin et al., 2019; Radford et al., 2018; 2019; Brown et al., 2020), image recognition (He et al., 2021; Xie et al., 2022; Bao et al., 2021), and speech recognition (Baevski et al., 2020; Hsu et al., 2021; Shi et al., 2022). An important aspect of masked prediction is the nature of the targets. Some works (He et al., 2021; Xie et al., 2022) use pixels as targets; others use pre-trained tokenisers (Bao et al., 2021) or modality-specific handcrafted features (Wei et al., 2021; Hsu et al., 2021; Shi et al., 2022). Our method, in contrast, generates targets from raw video and audio using momentum encoders.

Self-distillation for unsupervised representation learning. RAVEn is partly inspired by the success of self-distillation in self-supervised learning with visual data (Grill et al., 2020; Caron et al., 2021); such works target invariance w.r.t. image-specific augmentations. In contrast, RAVEn does not rely on domain-specific augmentations but rather on a combination of cross-modal learning and masked prediction to drive representation learning for visual and auditory speech signals. data2vec (Baevski et al., 2022) combines masked prediction with a momentum encoder but, aside from being uni-modal, differs methodologically in multiple ways. For example, it applies ad-hoc normalisation and averaging techniques to the targets to prevent representation collapse, while our targets are simply the outputs of the encoders, which are regressed via Transformer-based heads.

Self-supervised audiovisual learning.
Audiovisual correspondence has been used to learn global representations for action recognition through the use of clustering (Alwassel et al., 2020; Asano et al., 2020), contrastive learning (Arandjelovic & Zisserman, 2017; 2018; Korbar et al., 2018; Patrick et al., 2020; Morgado et al., 2021; Ma et al., 2021c), or representation matching (Recasens et al., 2021). Cross-modal learning has also found uses in biometric matching (Nagrani et al., 2018a; b), emotion recognition (Shukla et al., 2021), and DeepFake detection (Haliassos et al., 2022). We employ cross- and within-modal losses to learn temporally-varying speech representations.

Figure 1: RAVEn overview. Given masked video and audio, students predict outputs of unmasked momentum teachers, via shallow Transformer predictors that intake mask tokens. The audio student predicts outputs from both audio and video teachers; the video student predicts only audio targets. Cross-modal losses are applied on all features; the within-modal loss is computed only on masked features. Only the student encoders are fine-tuned for VSR/ASR. Frames blurred for anonymity.

Self-supervised audiovisual learning for speech recognition. Earlier audiovisual self-supervised works (Chung & Zisserman, 2016; Chung et al., 2019; 2020) tended to focus on word-level lipreading. Recently, some attention has been paid to the more realistic task of continuous speech recognition (Ma et al., 2021a; Shi et al., 2022; Sheng et al., 2021; Pan et al., 2022; Ma et al., 2021c). Sheng et al. (2021) and Ma et al. (2021c) use contrastive learning and apply their methods to VSR. Ma et al. (2021a) predict speech features using an external PASE+ encoder (Ravanelli et al., 2020). Pan et al. (2022) transfer pre-trained visual and auditory encoders, which were separately trained via contrastive losses. AV-HuBERT (Shi et al., 2022) predicts iteratively-refined cluster assignments from masked audiovisual inputs and achieves impressive VSR and ASR performance.
It employs multiple stages of alternating between offline clustering and cluster-assignment prediction, and relies on handcrafted audio features (MFCCs) for cluster initialisation, which was shown to be crucial to performance (Shi et al., 2022). We demonstrate that it is possible to jointly learn effective representations for VSR and ASR in a single stage, simply from raw video and audio.

3.1. PRE-TRAINING

Our architecture consists of a pair of student-teacher networks per modality (see Figure 1). The students intake masked inputs and predict targets formed by teachers receiving unmasked inputs. Both student encoders f_e^v and f_e^a consist of a modality-specific convolutional feature extractor followed by a temporal encoder, as in related VSR/ASR works (Ma et al., 2022; Baevski et al., 2020). The video feature extractor is a 2D ResNet18 (He et al., 2016) with a 3D convolutional stem (Petridis et al., 2018), outputting an embedding per frame. On the audio side, we use a 1D ResNet18 which produces features at 25 fps, to match the video sampling rate (see Appendix A.5). The temporal encoder for each modality is a Transformer (Vaswani et al., 2017) (without a classification head) with hidden size D, the feature dimensionality. We use relative positional embeddings (Dai et al., 2019), and, following Chen et al. (2021), we replace layernorm (Ba et al., 2016) with batchnorm (Ioffe & Szegedy, 2015) before each multilayer perceptron (MLP) block (see Appendix C.1 for a comparison).

Predictors. The students contain Transformer predictors, which regress targets given (1) the encoder outputs corresponding to the unmasked portions of the inputs and (2) mask tokens associated with the masked portions. Note that mask tokens are applied to the predictors rather than the encoders (He et al., 2021). This reduces the discrepancy between pre-training and fine-tuning: in both cases, the encoders do not see mask tokens. A predictor that takes representations corresponding to modality m1 ∈ {v, a} and predicts targets associated with modality m2 ∈ {v, a} is denoted as f_p^{m1→m2}. For ease, the application of mask tokens to the encoder outputs is absorbed in the notation.
Unlike other works which output global representations (one embedding per sample) and thus use MLPs as predictors (Grill et al., 2020; Chen et al., 2021), we use Transformers, which allow modelling temporal dynamics; we found this greatly improves results. The predictors can be lightweight: indeed, two-block Transformers with hidden size 512 work optimally in our experiments.

Targets. The targets are simply the outputs of momentum-based teachers g^v and g^a (Grill et al., 2020; Caron et al., 2021), which are given the unmasked video or audio as input, in order to force the students to predict the missing information. Each teacher is architecturally identical to its student encoder counterpart. Denoting the parameters of the student encoders and teachers as s_m and t_m, respectively, at each iteration the following update is performed:

t_m ← µ t_m + (1 − µ) s_m,    (1)

where m ∈ {v, a} specifies the modality and µ is a momentum parameter following a cosine schedule from 0.999 to 1. A high value of µ leads to slowly-varying teachers and stable targets. The use of momentum-based teachers obviates the need for handcrafted targets or multi-stage training.

Prediction tasks. The auditory modality contains more information relevant to speech than the visual one: mouth motion is inherently more ambiguous than a speech waveform due to the presence of homophemes (Chung et al., 2017). We propose a loss structure which reflects this asymmetry between the modalities. The audio student predicts the targets from both the video and audio teachers, thus benefiting from the ability of cross-modal learning to induce semantic representations, while at the same time being encouraged to retain information from the auditory input that is absent from the visual one. As a result, two predictors are associated with the audio student, one for each target type. On the other hand, the video student only predicts the auditory targets, which are inevitably of higher quality.
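The momentum update t_m ← µ t_m + (1 − µ) s_m and the cosine schedule that drives µ from 0.999 to 1 can be sketched as follows (function names are ours, for illustration):

```python
import math

def momentum_schedule(step, total_steps, mu_start=0.999, mu_end=1.0):
    """Cosine increase of the teacher momentum from mu_start to mu_end."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return mu_end - (mu_end - mu_start) * cos_factor

def update_teacher(teacher_params, student_params, mu):
    """EMA update t <- mu * t + (1 - mu) * s, applied parameter-wise.

    Scalars stand in for parameter tensors; in a real framework this loop
    would run over the two networks' parameter lists without gradients.
    """
    return [mu * t + (1.0 - mu) * s
            for t, s in zip(teacher_params, student_params)]
```

At step 0 the momentum equals 0.999 (fast-adapting teacher); by the final step it reaches 1, freezing the teacher.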
Losses. The loss function is the negative cosine similarity (Grill et al., 2020), denoted as sim. Due to the temporal alignment of the inputs, the cosine similarity is applied between pairs of corresponding features and then summed across the time dimension. For the within-modal task (audio-to-audio prediction), the loss is applied only on targets corresponding to masked portions of the input (Devlin et al., 2019). For the cross-modal tasks, the loss is applied on all targets, which we found to work better. Note that predicting the unmasked portions in cross-modal learning is a non-trivial task (unlike in within-modal learning) and can bolster representation learning. Denoting the set of mask token indices for audio as M_a, the audio-to-audio prediction loss and the cross-modal losses can be respectively expressed as

L^{a→a} = − Σ_{n∈M_a} sim( f_p^{a→a}(f_e^a(x^a))_n , sg(g^a(x^a)_n) ),

L^{m1→m2} = − Σ_n sim( f_p^{m1→m2}(f_e^{m1}(x^{m1}))_n , sg(g^{m2}(x^{m2})_n) ),

where m1, m2 ∈ {v, a}, m1 ≠ m2, and sg denotes the "stop-gradient" operation, which indicates that no gradient is passed back to the teacher networks.

Objectives. At each iteration, the objectives for the video and audio students are, respectively,

L^v = L^{v→a},    L^a = L^{a→v} + L^{a→a}.

The teachers are updated via Equation 1.
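A per-frame version of the negative cosine similarity loss can be sketched as follows (numpy, names ours). The stop-gradient has no numpy analogue, so the teacher output is simply treated as a constant here:

```python
import numpy as np

def neg_cosine_loss(pred, target, mask_idx=None, eps=1e-8):
    """Negative cosine similarity between per-frame predictions and targets,
    summed over time.

    pred, target: arrays of shape (T, D). If mask_idx is given, only those
    (masked) time steps contribute, as in the within-modal loss; for the
    cross-modal losses, all time steps contribute.
    """
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    sim = (p * t).sum(-1)          # (T,) cosine similarity per frame
    if mask_idx is not None:
        sim = sim[mask_idx]
    return -sim.sum()
```

Identical prediction and target give a loss of −T (or −|M_a| with masking), the minimum attainable value; orthogonal features give 0.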

3.2. FINE-TUNING

For fine-tuning, we keep the pre-trained student encoders and discard the rest. We append a linear layer and a Transformer decoder for joint CTC/attention decoding (Watanabe et al., 2017), as in Ma et al. (2021b). Following Ma et al. (2021b), we set the CTC weight to 0.1. We use SentencePiece (Kudo & Richardson, 2018) subword units with a vocabulary size of 1,000 as our targets.

Self-training. Combining pre-training with self-training tends to improve results over using either strategy in isolation (Xu et al., 2021; Shi et al., 2022). To that end, we first fine-tune our pre-trained audio encoder on the labelled data, and then use the resulting model to pseudo-label the unlabelled data. The pre-trained video and audio models are then fine-tuned using both the labels and pseudo-labels.

Datasets. We pre-train on LRS3 and, optionally, its combination with an English-only version of VoxCeleb2 (Chung et al., 2018) curated by Shi et al. (2022), which we refer to as LRS3+Vox2-en. The former features 433 hours of footage and the latter 1,759 hours. For fine-tuning, we use the full LRS3 with the labels as our high-resource labelled data setting, as well as a 30-hour subset (the "trainval" partition) as our low-resource setting. We present results for the LRS2 dataset (Chung et al., 2017) in Appendix B.
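For intuition, joint CTC/attention decoding (Watanabe et al., 2017) scores each hypothesis by interpolating the log-probabilities of the two branches. With the CTC weight of 0.1 used here, the combination reduces to the following simplified sketch (real decoders apply this score inside beam search, together with an optional language-model term):

```python
def joint_score(ctc_logp, att_logp, ctc_weight=0.1):
    """Interpolate CTC and attention-decoder log-probabilities of a
    hypothesis, as in joint CTC/attention decoding."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp
```

For example, a hypothesis with CTC log-probability −10.0 and attention log-probability −2.0 scores 0.1·(−10.0) + 0.9·(−2.0) = −2.8; the low CTC weight means the attention decoder dominates, while CTC still penalises misaligned hypotheses.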

4.2. LOW-RESOURCE LABELLED DATA SETTING

We pre-train our models on LRS3 and/or LRS3+Vox2-en and then fine-tune them on the 30-hour LRS3 subset to evaluate performance when labels are scarce. We report results in Table 1. Compared with training from scratch, RAVEn pre-training leads to dramatic performance improvements in all configurations. Notably, increasing the model size hurts VSR performance when training from scratch, but improves it when using pre-training. Our Base variant outperforms all related methods on VSR. It surpasses the Base AV-HuBERT model by 4.8% and 5.9% WER when using LRS3 and LRS3+Vox2-en, respectively, despite having roughly half the number of parameters. The Large model provides significant boosts over the Base model (32.5% vs 40.2% WER) when using LRS3+Vox2-en for pre-training, keeping the amount of labelled data fixed. Self-training further improves WER by 7.7%, indicating its complementarity with RAVEn pre-training. Finally, using a language model (see Appendix A.4 for details) leads to a WER of 23.8%, better than a method (Serdyuk et al., 2021) trained on 90,000 hours of non-public data. On ASR, RAVEn significantly outperforms the audio-only HuBERT (Hsu et al., 2021) model, and in all cases is better than or on par with AV-HuBERT. Our best ASR model without self-training achieves 2.7% WER vs AV-HuBERT's 2.9% WER, despite AV-HuBERT using a different pre-training strategy for VSR than for ASR. For example, using the same pre-training hyperparameters for both tasks increases AV-HuBERT's WER for ASR from 3.8% to 4.6% with the Base model (Shi et al., 2022). In contrast, the video and audio encoders we fine-tune are the result of a single pre-training phase, in which they were jointly learned. All in all, our results suggest that transcribed data, which are costly to obtain, can be largely substituted with raw unlabelled audiovisual data.

4.3. HIGH-RESOURCE LABELLED DATA SETTING

Table 2 reports results when fine-tuning on the full 433 hours of LRS3. Despite the increase in labelled data, training the model from scratch still leads to poor performance, as in the low-resource setting. This is likely related to the long utterances in the LRS3 dataset, which may cause optimisation difficulties (Ma et al., 2022). A potential remedy is to employ curriculum learning, as proposed by Ma et al. (2022), by training the network in multiple stages with increasingly longer sequences. We observe that this strategy indeed reduces WER from 87.3% to 39.8% for the Base model. Even so, pre-training with the same data used for fine-tuning leads to a better WER of 39.1%, without requiring a curriculum learning procedure. This suggests that self-supervised pre-training can aid optimisation. The impact of pre-training is even more pronounced with the Large model. RAVEn outperforms AV-HuBERT under all configurations on VSR. Our best result is 23.1% WER, achieved using self-training and a language model. We note that, unlike methods that use external ASR models (Afouras et al., 2020; Ma et al., 2022) for pseudo-labelling, we do not require extra data for the self-training phase. We are on par with the state-of-the-art for ASR in the high-resource setting, achieving a WER of 1.4% with the Large model. This is despite using raw audio as input, rather than the spectrograms used by Shi et al. (2022). We notice that additionally including self-training and a language model does not reduce the WER further, suggesting that ASR performance may have saturated in this setting.

4.4. PRE-TRAINING ABLATIONS

Ablations are performed with our Base model in the low-resource setting with LRS3 pre-training, using the validation set from Shi et al. (2022) (as there is no official development set). For more ablations, see Appendix C.

Prediction tasks. We study in Table 3 the impact of cross- and within-modal prediction on the learned representations. Figure 2 illustrates the various prediction tasks we consider. We observe the following. The within-modal variant performs decently for ASR but poorly for VSR, which can be explained by the assumption that audio contains more speech information than video. The cross-modal variant performs better than within-modal for both modalities. Since lip movements and the corresponding waveform are correlated in terms of lexical content, the representations resulting from cross-modal prediction are expected to capture rich semantic information. At the same time, they are likely to be, to a large extent, invariant to factors unshared between the two modalities, such as visual or auditory noise, benefiting generalisation. Combining cross- with within-modal learning hurts VSR relative to only cross-modal prediction. However, cross-modal with within-modal learning only for audio achieves the best results. We hypothesise that predicting the auditory targets forces the auditory stream to keep relevant information absent from the visual stream. We note that removing the video-to-video prediction (from row 3 to row 5 in Table 3) also improves the audio student, as the quality of the visual targets improves. Finally, cross-modal with within-modal learning only for video does not perform well, validating that the asymmetry only works one way.

Figure 2: Prediction tasks. We consider different choices for our prediction tasks, combining cross- and within-modal losses. We find that applying both cross- and within-modal losses for the audio student, and only a cross-modal loss for the video student, works best.

Masking sampling. Table 4a studies the effect of different masking strengths by varying the mask length and the probability that a given index is chosen as the mask start. We observe that a probability of 0.2 and a length of three video frames (corresponding to 1,920 audio samples) works well. Less masking allows the student to focus less on context and degrades both VSR and ASR. Interestingly, although more masking hurts VSR, it helps ASR, suggesting that an asymmetric masking strategy w.r.t. the two modalities may further improve results. We leave this exploration to future work. Our masking is less aggressive than what was found to be optimal in the related self-supervised image and action recognition literature (where 75% or even 90% of the input is masked) (He et al., 2021; Tong et al., 2022). We hypothesise that mouth movements are fast-varying and do not contain much temporal redundancy, and thus such strong masking makes our pretext task overly difficult.

Table 5: Predictor capacity ablations under the LRS3 low-resource setting using our Base model.

Mask token position. We investigate in Table 4b the optimal position of the mask tokens.
It is clear that better performance is achieved when applying them in the predictors rather than the encoders. Forgoing the use of mask tokens in the encoders removes the input discrepancy between pre-training and fine-tuning, since no mask tokens are used during fine-tuning.

Mask loss. It is common to apply the loss only on masked inputs for within-modal losses (Hsu et al., 2021; He et al., 2021), since predicting targets corresponding to unmasked inputs may be trivial. However, this is not the case for cross-modal prediction, where the targets are not related to the inputs in an obvious way. Indeed, Table 4c shows that applying the loss on both masked and unmasked inputs outperforms applying it only on masked inputs.

Predictor design. Table 5a shows the effect of different predictor types. We note that using no predictors at all leads to representation collapse, where all outputs are constant (Grill et al., 2020). We compare linear layers and 2-layer MLPs (applied independently to each encoder output) with Transformers of varying capacity. Following Grill et al. (2020), the MLPs have a hidden dimension of 4096 with batchnorm followed by ReLU (Nair & Hinton, 2010) after the hidden layer. We find that a Transformer predictor works significantly better, even when the number of parameters in the (1-block) Transformer and the 2-layer MLP is similar. Interestingly, the ability of the predictors to model temporal dependencies seems to be crucial to our representation learning phase. A two-block Transformer works optimally (Table 5a). A shallower Transformer likely results in representations too specialised to the pretext task; a deeper one might place much of the burden on the predictors, leaving the encoder representations too general. Similar conclusions can be drawn regarding the Transformer width (see Table 5b).
Overall, the lightweight predictor design means that using three predictors (one on the video student side and two on the audio) has relatively little effect on the total computational cost.

5. CONCLUSION

We presented RAVEn, a single-stage method that jointly learns visual and auditory speech representations entirely from raw data. As future work, it would be interesting to examine the effect of sharing weights between the visual and auditory encoders as a way of reducing the memory requirements for pre-training. We would also like to apply RAVEn to other tasks related to speech. We hope our study inspires future research extending beyond speech recognition.

ETHICS STATEMENT

Despite numerous positive applications, our method can also be misused. For example, lipreading technologies can be employed for surveillance, compromising the public's privacy and trust. This problem is likely to be exacerbated as the quality of CCTV cameras improves over time. Appropriate government regulations will need to be put in place to limit such concerns. Another issue relates to potential biases embedded in the datasets used. Such biases may relate to gender, age, or ethnic background. A specific group being under-represented in the data would likely result in reduced model performance on samples belonging to that group. Making sure that the models are trained on balanced data, or using other bias-reduction techniques, can address such issues. Our work used datasets that were made public for research purposes, i.e., LRS2, LRS3, and VoxCeleb2. In particular, we used the cropped-face .mp4 videos available on the official dataset webpages, which comply with the following licenses: the Creative Commons BY-NC-ND 4.0 license, the Creative Commons Attribution 4.0 International License, the TED terms of use, and the BBC's terms of use.

REPRODUCIBILITY

To ensure reproducibility, we provide as many implementation details as possible in the main paper, as well as tables showing the hyperparameter values in the appendix. Moreover, we plan on making the code and pre-trained models publicly available.

Table 8a: Visual feature extractor.

stage   filters                             output size
conv1   5 × 7 × 7, 64, stride 1 × 2 × 2     T × 44 × 44
pool1   max, 1 × 3 × 3, stride 1 × 2 × 2    T × 22 × 22
res1    [3 × 3, 64; 3 × 3, 64] × 2          T × 22 × 22
res2    [3 × 3, 128; 3 × 3, 128] × 2        T × 11 × 11
res3    [3 × 3, 256; 3 × 3, 256] × 2        T × 6 × 6
res4    [3 × 3, 512; 3 × 3, 512] × 2        T × 3 × 3
pool2   global spatial average pool         T × 1 × 1

A.5 FEATURE EXTRACTORS

The details of the visual and auditory convolutional feature extractors are provided in Tables 8a and 8b, respectively.
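The spatial sizes in Table 8a can be verified with the standard convolution output-size formula. The kernels and strides below follow the table; the paddings (3 for conv1, 1 elsewhere) are the usual ResNet-18 choices and are our assumption, since the table omits them:

```python
def conv_out(n, k, s, p):
    """1D output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Spatial trace through the visual front-end for an 88 x 88 mouth crop
# (the temporal dimension T is untouched, as all temporal strides are 1).
c1 = conv_out(88, 7, 2, 3)   # conv1: 5x7x7, spatial stride 2 -> 44
p1 = conv_out(c1, 3, 2, 1)   # pool1: max 1x3x3, spatial stride 2 -> 22
r2 = conv_out(p1, 3, 2, 1)   # res2 first block, stride 2 -> 11 (res1 keeps 22)
r3 = conv_out(r2, 3, 2, 1)   # res3 -> 6
r4 = conv_out(r3, 3, 2, 1)   # res4 -> 3
```

The trace reproduces the 44 → 22 → 22 → 11 → 6 → 3 progression of the table, followed by global average pooling to T × 1 × 1.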

B LRS2 EXPERIMENTS

We report results on the test set of the LRS2 dataset (Chung et al., 2017) in Table 9. After pre-training on the LRS3 or LRS3+Vox2-en datasets, we fine-tune on the "pre-training" and "training" sets of LRS2. We observe similar trends as in the LRS3 experiments (Table 2): performance benefits from larger models, larger unlabelled datasets, and self-training. We significantly outperform all other methods, including one (Pan et al., 2022) pre-trained on 60,000 hours of audio data.

C MORE ABLATIONS

C.1 MORE PRE-TRAINING ABLATIONS

Momentum parameter. Table 10a shows the effect of varying the momentum parameter for updating the teacher networks. We show results at three coarse levels: 0 (the teacher is a copy of the student at each iteration), 0.999 (with a cosine schedule to 1), and 1 (the teacher does not get updated during training). We see that using a momentum value of 0 leads to representation collapse. At the other extreme, a value of 1 does not allow the teacher targets to improve during training, leading to poor representations. A slowly-evolving momentum encoder is most effective.

Pre-MLP normalisation. As mentioned in the main text, we use batchnorm before each MLP module rather than layernorm, a choice inspired by Chen et al. (2021).

Video-to-speech synthesis. We use the model of Mira et al. (2022), the current state-of-the-art method, to perform video-to-speech on the LRS3 test set after training on LRS3+Vox2-en. We study the effect of initialising the video encoder with the weights from our Large pre-trained model. We use the same hyperparameters as Mira et al. (2022), except that we use a learning rate of 2e-4 and train only for 30 epochs (rather than 150). We train the "from scratch" baseline with a learning rate of 7e-4 for 50 epochs, as training longer resulted in overfitting. Table 14 reports performance based on PESQ (Rix et al., 2001), which measures the perceptual quality of the generated samples, as well as STOI and ESTOI (Taal et al., 2011), which measure intelligibility.



Table 1: LRS3 low-resource setting. We report results on the test set when fine-tuning on 30 hours of LRS3 with different model sizes and numbers of unlabelled data hours (Unlab hrs). LM denotes whether or not a language model was used during decoding. We also provide baselines for training our models from scratch (without our pre-training), and results from fully-supervised methods trained on more labelled data hours (Lab hrs) for reference.


Table 3: Prediction task ablations under the LRS3 low-resource setting using our Base model. V → A means that the video student predicts the audio teacher representations. Each prediction loss is associated with a separate predictor.

Table 4: Masking ablations under the LRS3 low-resource setting using our Base model.
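As a sanity check on the default masking configuration from Table 4 (start probability 0.2, span of three frames), the expected fraction of masked frames, ignoring boundary effects, is 1 − (1 − 0.2)³ = 0.488, since a frame is masked iff it or either of its two predecessors is chosen as a start. A quick simulation confirms this:

```python
import numpy as np

p, span = 0.2, 3
analytic = 1 - (1 - p) ** span        # probability a given frame is masked

# Monte-Carlo check of the boundary-free approximation.
rng = np.random.default_rng(0)
T, trials = 1000, 200
covered = 0.0
for _ in range(trials):
    starts = rng.random(T) < p
    mask = np.zeros(T, dtype=bool)
    for t in np.flatnonzero(starts):
        mask[t:t + span] = True       # spans overlap and clip at T
    covered += mask.mean()
mc_estimate = covered / trials        # ~0.488, up to small boundary effects
```

So roughly half of the video frames (and, via the ×640 enlargement, half of the waveform) are zeroed out under the default setting, far below the 75–90% ratios used in masked image modelling.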

RAVEn employs momentum encoders to generate targets, which, given masked inputs, are predicted by Transformer encoders and predictors. Especially salient to the quality of the representations are appropriate masking, lightweight Transformer predictors, and an asymmetric loss structure w.r.t. the visual and auditory modalities. RAVEn achieves strong performance under many settings without requiring multiple pre-training stages, handcrafted audio targets, or separate pre-training strategies for VSR and ASR.

a) Visual feature extractor.

Feature extractors. We provide details on the convolutional architectures used as feature extractors for the visual and auditory modalities. Note that the output sizes of the two networks match.

Language model. We use a 16-block Transformer-based language model, as proposed by Irie et al. (2019). The hidden size / MLP size / number of attention heads are 512 / 2,048 / 8. The language model is trained on the combination of the following datasets: LibriSpeech (Panayotov et al., 2015), LRS2/3, TED-LIUM 3 (Hernandez et al., 2018), VoxForge, and Common Voice.

Table 10b shows that our method still works with layernorm but lags behind batchnorm. This is an interesting observation worthy of future exploration. A preliminary hypothesis is that batchnorm may improve the conditioning of the networks at initialisation, leading to better targets at the beginning of training (Richemond et al., 2020).

LRS2. We report results on the test set with different model sizes and numbers of unlabelled data hours (Unlab hours). Lab hours denotes the number of labelled hours, and LM denotes whether or not a language model was used during decoding.

Momentum parameter. It is important for the momentum encoders to evolve slowly during training. † Follows a cosine schedule with 0.999 as the starting value. ‡ Denotes representation collapse.

More pre-training ablations under the LRS3 low-resource setting using our Base model.

Audio-only vs audio-visual fine-tuning on the LRS3 test set. Including the visual modality during fine-tuning has very little influence in clean conditions.

Video-to-speech synthesis on LRS3 test set. Using RAVEn pre-training for the video encoder provides significant boosts in performance.

Table 14 reports performance based on PESQ (Rix et al., 2001), which measures the perceptual quality of the generated samples, as well as STOI and ESTOI (Taal et al., 2011), which measure their intelligibility. The results show that RAVEn pre-training yields significant improvements over training from scratch across all metrics. In fact, the performance boosts due to RAVEn pre-training (0.04 / 0.03 / 0.06 for PESQ / STOI / ESTOI) are larger than those observed by Mira et al. (2022) (0.01 / 0.02 / 0.04) when increasing the training set size by around a factor of 4 (from LRS3 to LRS3+Vox2-en).

ACKNOWLEDGEMENTS

Only non-Meta co-authors downloaded, accessed, and used the LRS2 dataset. Only non-Meta authors conducted any of the dataset pre-processing (no dataset pre-processing took place on Meta's servers or facilities).

APPENDIX

A DATASET / IMPLEMENTATION DETAILS

A.1 DATASETS

LRS3. LRS3 (Afouras et al., 2018b) is the largest publicly available transcribed audio-visual dataset for continuous speech recognition. It consists of around 430 hours of spoken sentences from TED talks, with a vocabulary of more than 50,000 words uttered by thousands of speakers. The test set contains around 1 hour of utterances, with speakers disjoint from those in the training set.

LRS2. The 223-hour LRS2 dataset (Chung et al., 2017), collected from BBC programmes, is the second-largest publicly available transcribed audio-visual dataset for continuous speech recognition. Like LRS3, it features an unconstrained vocabulary and thousands of diverse speakers.

VoxCeleb2. VoxCeleb2 (Chung et al., 2018) is a non-transcribed dataset of YouTube-downloaded videos. It consists of around 2,500 hours of utterances from over 6,000 speakers. Since VoxCeleb2 is multi-lingual, as mentioned in the main text, we use an English-only version curated by Shi et al. (2022), amounting to 1,759 hours.

A.2 DATASET PRE-PROCESSING

We follow common practices in the literature for dataset pre-processing (Ma et al., 2022; Shi et al., 2022; Martinez et al., 2020). We crop a 96 × 96 region centred around the mouth and convert it to grayscale. Raw audio is used without any pre-processing or normalisation. Utterances longer than 24 seconds are split into smaller constituents.
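For illustration, the crop-and-grayscale step might look as follows. This is a sketch under the assumption that the mouth centre is already known from upstream landmark detection (which the pipeline performs but we do not show); the function name and signature are our own.

```python
import numpy as np

def crop_mouth(frame_rgb, center, size=96):
    """Crop a size x size patch centred at `center` (y, x) and convert to grayscale.

    frame_rgb: H x W x 3 uint8 array (one video frame).
    The crop window is clamped so it never leaves the frame.
    """
    h, w, _ = frame_rgb.shape
    cy, cx = center
    y0 = min(max(cy - size // 2, 0), h - size)
    x0 = min(max(cx - size // 2, 0), w - size)
    patch = frame_rgb[y0:y0 + size, x0:x0 + size].astype(np.float32)
    # ITU-R BT.601 luma weights for RGB -> grayscale conversion
    return patch @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
```

Applying this per frame yields the T × 96 × 96 grayscale clips fed to the visual feature extractor.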

A.3 PRE-TRAINING

Table 6 provides the default settings for pre-training. We use the AdamW (Loshchilov & Hutter, 2019) optimiser with linear learning rate warmup (Goyal et al., 2017) and a cosine decay schedule (Loshchilov & Hutter, 2017). During training, we apply random spatial cropping of size 88 × 88 followed by horizontal flipping with probability 0.5. These augmentations are applied in a time-consistent manner across the video clips to maintain temporal coherence. We do not use any augmentations for the raw audio. We also use stochastic depth (Huang et al., 2016) and LayerScale (Touvron et al., 2021) for regularisation.
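The warmup-plus-cosine schedule can be sketched as follows; this is an illustrative helper, not the paper's code, and the hyperparameter values in the comments are placeholders rather than the settings of Table 6.

```python
import math

def lr_at(step, warmup_steps, total_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

# e.g. lr_at(step, warmup_steps=10_000, total_steps=300_000, peak_lr=3e-4)
```

The rate rises linearly to its peak at the end of warmup and then follows a half-cosine down to the minimum at the final step.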

A.4 FINE-TUNING / DECODING

The protocol for fine-tuning is similar to that for pre-training, with some exceptions (see Table 7). We use a higher learning rate for the decoder than for the pre-trained encoder, since the decoder is randomly initialised. We also employ layer-wise learning rate decay (Clark et al., 2020), which we found to reduce overfitting. In addition to the augmentations from the pre-training stage, we apply time masking (Ma et al., 2022) to both video and audio clips. Specifically, for each second of the sample, we apply zero-masking with a duration sampled uniformly between 0 and 0.4 seconds.

CTC/attention decoding. We use joint CTC/attention decoding to map the input sequence x of length T to a target sequence y = (y_1, ..., y_L) of length L. The CTC loss during fine-tuning is given by

L_ctc = -log Σ_{A ∈ A_{x,y}} Π_{t=1}^{T} p(a_t | x),

where A_{x,y} is the set of valid alignments, A is one such alignment, and a_t is the token of A at time-step t. The attention loss, computed using the Transformer decoder outputs, can be expressed as

L_att = -Σ_{l=1}^{L} log p(y_l | y_{<l}, x).

The final loss during fine-tuning is given by L = α L_ctc + (1 - α) L_att, where α = 0.1.

We use the ESPnet framework (Watanabe et al., 2018) for decoding, with a beam size of 40. The final score used to choose the most likely sequence is S = λ S_ctc + (1 - λ) S_att + β S_LM, where S_ctc and S_att denote the scores from the CTC and attention branches, respectively, and λ = 0.1. S_LM is an optional score from the language model, incorporated through shallow fusion (Watanabe et al., 2017). When using a language model, β is chosen from {0.1, 0.2, 0.3, 0.4} using the validation set. Inference on one A100 GPU (without batching) takes around 3 seconds to decode 10 seconds of footage for Base and around 5 seconds for Large.

(a) Learning rates. Both encoder and decoder use the same LR in these rows:

Encoder LR  Decoder LR  VSR WER (%)  ASR WER (%)
1e-3        1e-3        36.1         13.1
3e-3        3e-3        34.0         12.2
5e-3        5e-3        …            …

(c) Tokenisation. Using subword (SentencePiece) units as targets is better than using characters. A language model (LM) also helps.

Table 11: Fine-tuning ablations under the LRS3 low-resource setting using our Base model.
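For concreteness, the joint scoring rule S = λ S_ctc + (1 - λ) S_att + β S_LM can be sketched as follows. This is an illustrative helper operating on per-hypothesis log-scores; the function names are our own, not ESPnet's API.

```python
def joint_score(s_ctc, s_att, s_lm=0.0, lam=0.1, beta=0.3):
    """Combined log-score for one beam hypothesis.

    lam = 0.1 follows the paper; beta is tuned on the validation set
    (from {0.1, 0.2, 0.3, 0.4}) when a language model is used.
    """
    return lam * s_ctc + (1 - lam) * s_att + beta * s_lm

def best_hypothesis(beam, lam=0.1, beta=0.3):
    """Pick the highest-scoring hypothesis from (tokens, s_ctc, s_att, s_lm) tuples."""
    return max(beam, key=lambda h: joint_score(h[1], h[2], h[3], lam, beta))[0]
```

In the full beam search, S_ctc and S_att are accumulated incrementally as each hypothesis is extended; the final selection over the beam reduces to the comparison above.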

C.2 FINE-TUNING ABLATIONS

Learning rates. Table 11a studies the effect of using different learning rates for the encoder and decoder. We find it beneficial for VSR to use a higher learning rate for the decoder than for the encoder, likely because the encoder is pre-trained, whereas the decoder is randomly initialised and thus requires a larger learning rate to find an effective local optimum.

Learning rate decay. Table 11b shows the influence of decaying the encoder's learning rate with decreasing depth. The learning rate at block b is given by r_b = r_B · d^(B-b), where B is the index of the last block and d is the learning rate decay (Clark et al., 2020). A decay of less than 1 works well for both VSR and ASR, suggesting that during fine-tuning it is useful to employ larger learning rates for the deeper layers, which are more task-specific.

Tokenisation. In Table 11c, we compare characters with SentencePiece (Kudo & Richardson, 2018) subword units (vocabulary size of 1,000) as our target tokens. We find that subword units lead to superior results. This may be explained by the language priors embedded in subword units, which facilitate speech recognition.
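The layer-wise decay rule r_b = r_B · d^(B-b) amounts to the following (illustrative sketch; the function name is our own):

```python
def layerwise_lrs(base_lr, num_blocks, decay):
    """Per-block learning rates: block b gets base_lr * decay**(B - b), B = num_blocks - 1.

    With decay < 1, earlier (more generic) blocks get smaller rates and the
    last block keeps the full base_lr (Clark et al., 2020).
    """
    B = num_blocks - 1
    return [base_lr * decay ** (B - b) for b in range(num_blocks)]
```

For example, with three blocks and d = 0.5, the rates form a geometric ladder ending at the base rate for the deepest block.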

C.3 AUDIO-VISUAL FINE-TUNING

It is possible to use the learned auditory and visual representations for audio-visual speech recognition. To that end, following Ma et al. (2021b), we concatenate the outputs of the two encoders, feed the resulting embeddings to a 2-layer MLP module with hidden size 4,096 and batchnorm, and then fine-tune for speech recognition, as described in Section 3.2. We initialise the video and audio encoders with weights obtained from uni-modal fine-tuning and randomly initialise the rest. We fine-tune for 30 epochs with the hyperparameters used for audio-only fine-tuning, shown in Table 7. We find that including the visual modality has very little influence when the audio is clean (i.e., not noisy), consistent with findings by Ma et al. (2021b). Investigating the impact of audio-visual training and testing on audio corrupted with various noise types is outside the scope of this work, but is an interesting direction for future work.
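A minimal sketch of the fusion step follows; the weights are placeholders, batchnorm is omitted for brevity, and the function name is our own rather than the paper's code.

```python
import numpy as np

def fuse_av(video_feats, audio_feats, w1, b1, w2, b2):
    """Concatenate per-frame video (T x Dv) and audio (T x Da) features and
    pass them through a 2-layer MLP (hidden size 4096 in the paper)."""
    x = np.concatenate([video_feats, audio_feats], axis=-1)  # T x (Dv + Da)
    h = np.maximum(x @ w1 + b1, 0.0)                         # ReLU hidden layer
    return h @ w2 + b2                                       # T x D_out
```

The fused T × D_out sequence then plays the role of a single-modality encoder output for the downstream recognition head.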

D ANALYSIS OF TRANSCRIPTION ERRORS

We provide examples of transcription errors in Table 13. We consider the Base VSR and ASR models fine-tuned in the low-resource setting (47.0% and 4.7% WER, respectively) and the Large VSR and ASR models fine-tuned in the high-resource setting with self-training and a language model (23.1% and 1.4% WER, respectively). Although the weakest model sometimes makes surprising errors (e.g., "They won the game" → "Because I wasn't happy"), most often the errors involve words that are phonetically similar (e.g., "were taken" → "we're taking", "embedded" → "in better", "won" → "want"). As expected, the quality of the transcriptions is higher for ASR than VSR, and it improves as we increase the model size and the number of unlabelled data points.

E RAVEN FOR VIDEO-TO-SPEECH SYNTHESIS

We evaluate here whether RAVEn pre-training can benefit tasks other than speech recognition by considering video-to-speech synthesis, which aims to output the speech waveform given the corresponding silent video of lip movements. We follow the protocol of SVTS (Mira et al., 2022), the current state-of-the-art method, to perform video-to-speech on the LRS3 test set after training on LRS3+Vox2-en. We study the effect of initialising the video encoder with the weights from our Large pre-trained model. We use the same hyperparameters as Mira et al. (2022), except that we use a learning rate of 2e-4 and train for only 30 epochs (rather than 150). We train the "from scratch" baseline with a learning rate of 7e-4 for 50 epochs, as training longer resulted in overfitting.

