SLTUNET: A SIMPLE UNIFIED MODEL FOR SIGN LANGUAGE TRANSLATION

Abstract

Despite recent successes with neural models for sign language translation (SLT), translation quality still lags behind that of spoken languages because of data scarcity and the modality gap between sign video and text. To address both problems, we investigate strategies for cross-modality representation sharing for SLT. We propose SLTUNET, a simple unified neural model designed to support multiple SLT-related tasks jointly, such as sign-to-gloss, gloss-to-text and sign-to-text translation. Jointly modeling different tasks endows SLTUNET with the capability to explore cross-task relatedness, which could help narrow the modality gap. In addition, it allows us to leverage knowledge from external resources, such as the abundant parallel data used for spoken-language machine translation (MT). We show in experiments that SLTUNET achieves competitive and even state-of-the-art performance on PHOENIX-2014T and CSL-Daily when augmented with MT data and equipped with a set of optimization techniques. We further use the DGS Corpus for end-to-end SLT for the first time. It covers broader domains with a significantly larger vocabulary, which makes it more challenging and, we argue, allows a more realistic assessment of the current state of SLT than the former two datasets. Still, SLTUNET obtains improved results on the DGS Corpus. Code is available at https://github.com/bzhangGo/sltunet.

1. INTRODUCTION

The rapid development of neural networks opens the path towards the ambitious goal of universal translation, which would allow converting information between any languages regardless of data modality (text, audio or video) (Zhang, 2022). While translation for spoken languages (in text and speech) has gained wide attention (Aharoni et al., 2019; Inaguma et al., 2019; Jia et al., 2019), the study of sign language translation (SLT), the task of translating sign language videos into spoken language text, still lags behind despite its significance in facilitating communication between Deaf communities and spoken language communities (Camgoz et al., 2018; Yin et al., 2021). SLT presents unique challenges: it demands both video understanding and sequence generation. Unlike spoken language, sign language is expressed through hand gestures, body movements and facial expressions, and the visual signal varies greatly across signers, creating a tough modality gap for its translation into text. The lack of supervised training data further hinders us from developing neural SLT models of high complexity due to the danger of overfitting. Addressing these challenges requires us to develop inductive biases (e.g., novel model architectures and training objectives) that enable knowledge transfer and induce universal representations for SLT. In the literature, a promising approach is to design unified models that can support, and be optimized via, multiple tasks with data from different modalities.
Such modeling could offer implicit regularization and facilitate the cross-task and cross-modality transfer learning that helps narrow the modality gap and improve the model's generalization, as in unified vision-language modeling (Jaegle et al., 2022; Bao et al., 2022; Kaiser et al., 2017), unified speech-text modeling (Zheng et al., 2021; Tang et al., 2022; Bapna et al., 2022), multilingual modeling (Devlin et al., 2019; Zhang et al., 2020; Xue et al., 2021), and general data modeling (Liang et al., 2022; Baevski et al., 2022). In SLT, different annotations can be paired into different tasks, including the sign-to-gloss (Sign2Gloss), sign-to-text (Sign2Text), gloss-to-text (Gloss2Text) and text-to-gloss (Text2Gloss) tasks.

In this paper, we propose a simple unified model for SLT, namely SLTUNET. As shown in Figure 1, SLTUNET follows the encoder-decoder paradigm (Bahdanau et al., 2015) with Transformer (Vaswani et al., 2017) as its backbone and supports multiple vision/language-to-language generation tasks. It uses shared modules to encourage knowledge transfer and adopts separate visual/textual modules to avoid task or modality interference (Liang et al., 2022). Thanks to its unified schema, SLTUNET allows us to leverage external data resources from other related tasks, such as machine translation. This partially alleviates the data scarcity issue and opens up the possibility of exploring relatively larger models for SLT. We further examine and develop a set of optimization techniques to ensure the trainability of SLTUNET.

We conducted extensive experiments on two popular benchmarks, PHOENIX-2014T (Camgoz et al., 2018) and CSL-Daily (Zhou et al., 2021), for German and Chinese Sign Language, respectively. Following previous evaluation protocols (Camgoz et al., 2018), we test SLTUNET on several SLT-related tasks but with a single trained model.
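To make the unified layout concrete, the following is a minimal, hypothetical PyTorch sketch of such an encoder-decoder: modality-specific input modules (a visual projection and a text embedding) feed a shared Transformer encoder, and a single autoregressive decoder is shared across all tasks. All class names, dimensions and hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UnifiedSLTModel(nn.Module):
    """Illustrative sketch: separate visual/textual input modules,
    a shared encoder, and a decoder shared across tasks."""

    def __init__(self, d_model=64, vocab_size=100, n_tasks=2, video_dim=32):
        super().__init__()
        self.visual_proj = nn.Linear(video_dim, d_model)     # visual module
        self.text_embed = nn.Embedding(vocab_size, d_model)  # textual module
        self.task_embed = nn.Embedding(n_tasks, d_model)     # task tags
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.shared_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video=None, tokens=None, task_id=None, target=None):
        # Route through the modality-specific module first.
        if video is not None:                    # Sign2Gloss / Sign2Text
            feats = self.visual_proj(video)
        else:                                    # Gloss2Text / MT
            feats = self.text_embed(tokens)
        # Prepend the task-tag embedding to guide generation.
        tag = self.task_embed(task_id).unsqueeze(1)          # (B, 1, D)
        memory = self.shared_encoder(torch.cat([tag, feats], dim=1))
        dec = self.shared_decoder(self.text_embed(target), memory)
        return self.out(dec)                     # logits over the vocabulary

model = UnifiedSLTModel()
video = torch.randn(2, 10, 32)                   # (batch, frames, video_dim)
logits = model(video=video, task_id=torch.tensor([0, 0]),
               target=torch.randint(0, 100, (2, 5)))
```

The same `shared_encoder` and `shared_decoder` serve every task; only the input module and the task tag change, which is what enables mixing SLT batches with external MT batches during training.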
Results show that SLTUNET achieves competitive and even state-of-the-art performance, surpassing strong baselines that adopt pretrained language models. We note that PHOENIX-2014T and CSL-Daily, while offering a valuable testbed for SLT, are limited in various aspects: they feature a small number of signers and limited linguistic variety with a small vocabulary. As a more challenging, larger-scale SLT dataset, we propose to use the Public DGS Corpus (Hanke et al., 2020a), which covers broader domains and a more open vocabulary, and gives a more realistic view of the current capability of SLT. We also take care to follow best practices regarding preprocessing and evaluation (Müller et al., 2022). We find that the challenging nature of the DGS Corpus results in generally low SLT performance, but we still observe some quality gains with SLTUNET. Our contributions are summarized below:



Figure 1: Overview of the proposed SLTUNET and the tasks we explored. SLTUNET adopts separate encoders to capture modality-specific (visual and textual) characteristics, followed by a shared encoder to induce universal features. It employs an autoregressive decoder shared across tasks for generation. SLTUNET optimizes the whole model via the maximum likelihood estimation (MLE) objective and optionally the connectionist temporal classification (CTC) objective, and uses Transformer as its backbone. It supports multiple tasks, such as sign-to-gloss (Sign2Gloss), sign-to-text (Sign2Text), gloss-to-text (Gloss2Text), text-to-gloss (Text2Gloss) and machine translation. Inspired by multilingual NMT, we regard the embedding of the corresponding task tag ([2gls] or [2txt]) as the task information to guide generation, and prepend it to the input feature sequence. α is a hyperparameter; blocks in colour (except gray) indicate trainable parameters; note that Text2Gloss hurts SLT in our experiments and is not involved in the final joint objective.
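The joint objective described in the caption, an MLE (cross-entropy) loss on the decoder output plus an α-weighted CTC loss on the encoder output, can be sketched as follows. This is a minimal, hypothetical illustration using random tensors in place of real model outputs; the shapes, the α value and the blank index are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T_enc, T_dec, V = 2, 12, 5, 30   # batch, encoder/decoder lengths, vocab
alpha = 0.3                          # CTC weighting hyperparameter (illustrative)

# MLE objective: cross-entropy between decoder logits and reference tokens.
dec_logits = torch.randn(B, T_dec, V)             # stand-in decoder output
targets = torch.randint(1, V, (B, T_dec))         # stand-in reference text
mle_loss = F.cross_entropy(dec_logits.reshape(-1, V), targets.reshape(-1))

# Optional CTC objective on the encoder output against gloss labels.
enc_logits = torch.randn(T_enc, B, V)             # CTC expects (T, B, V)
gloss = torch.randint(1, V, (B, 4))               # stand-in gloss labels (blank = 0)
ctc_loss = F.ctc_loss(enc_logits.log_softmax(-1), gloss,
                      input_lengths=torch.full((B,), T_enc, dtype=torch.long),
                      target_lengths=torch.full((B,), 4, dtype=torch.long),
                      blank=0)

# Joint objective: MLE plus alpha-weighted CTC.
loss = mle_loss + alpha * ctc_loss
```

Because the CTC term is monotonic and alignment-free, it supervises the shared encoder with gloss order while the MLE term trains the decoder for free-form text generation.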

