BAYES RISK CTC: CONTROLLABLE CTC ALIGNMENT IN SEQUENCE-TO-SEQUENCE TASKS

Abstract

Sequence-to-sequence (seq2seq) tasks transcribe an input sequence into a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in many seq2seq tasks. Besides predicting the target sequence, CTC also predicts, as a by-product, the alignment: the most probable input-length sequence that specifies a hard aligning relationship between the input and target units. Because the CTC formulation treats all potential aligning sequences (called paths) equally, it is uncertain which path will become the most probable one and thus the predicted alignment. In addition, the alignment predicted by vanilla CTC is usually observed to drift from its reference and rarely provides practical functionality. The motivation of this work is therefore to make the CTC alignment prediction controllable and to equip CTC with extra functionalities. To this end, we propose the Bayes risk CTC (BRCTC) criterion, in which a customizable Bayes risk function enforces the desired characteristics of the predicted alignment. With the risk function, BRCTC is a general framework for adopting customizable preferences over the paths so as to concentrate the posterior on a particular subset of the paths. In applications, we explore one preference that yields models with down-sampling ability and reduced inference cost. Using BRCTC with another preference, for early emissions, we obtain an improved performance-latency trade-off for online models. Experimentally, BRCTC, together with a trimming approach, reduces the inference cost of offline models by up to 47% without performance degradation; BRCTC also cuts the overall latency of online systems to a previously unseen level.¹



Figure 1: (a) An intuitive explanation of CTC paths; ∅ is the blank symbol. Each path suggests a hard alignment between the input and the target. (b) Posterior of an offline vanilla CTC ASR system. Different colors denote different units. The predicted alignment drifts away from its reference², but the predicted non-blank token sequence is correct. (c) Posterior of a BRCTC ASR system (ours) that adopts the method in Section 3.3. All non-blank spikes are squeezed toward earlier time stamps.

1 Introduction

Sequence-to-sequence (seq2seq) tasks have attracted broad interest and achieved great progress in many applications over the past few decades. Connectionist Temporal Classification (CTC) (Graves et al., 2006) is a fundamental criterion for seq2seq tasks. The CTC criterion was initially proposed for automatic speech recognition (ASR), but its usage has been extended to many other tasks such as machine translation (MT) (Qian et al., 2021; Gu & Kong, 2020; Huang et al., 2022), speech translation (ST) (Yan et al., 2022; Chuang et al., 2021; Liu et al., 2020), sign language translation (Wang et al., 2018; Guo et al., 2019; Camgoz et al., 2020), optical character recognition (OCR) (Graves & Schmidhuber, 2008), lip reading (Assael et al., 2017), hand gesture detection (Molchanov et al., 2016) and even robot control (Shiarlis et al., 2018). Research on CTC is of wide interest, as many advanced systems for seq2seq tasks are based on CTC (Yao et al., 2021), its extensions (Graves, 2012; Sak et al., 2017; Higuchi et al., 2020; Qian et al., 2021) and its hybrid with attention-based architectures (Watanabe et al., 2017; Yan et al., 2022).

In CTC, each input unit is explicitly aligned to either a target unit or a blank symbol. During training, all of these potential aligning sequences (called paths) are enumerated, and the sum of their posteriors is maximized, which is equivalent to maximizing the posterior of the target sequence. Fig. 1.a illustrates the paths in CTC. Besides predicting the target sequence, another functionality of CTC is to predict the input-target alignment. Unlike attention-based methods (Chan et al., 2016; Vaswani et al., 2017), which predict the aligning relationship softly through attention weights, CTC predicts a hard alignment. Usually, there is one path whose posterior is much larger than the others (Zeyer et al., 2021), so this dominant path is considered the predicted hard alignment between the input and the target sequences. In CTC implementations, a unit-level classification over all possible target units is performed for each input unit to obtain the posterior of each path. Fig. 1.b demonstrates the dominant posterior of the predicted alignment by plotting the unit-level posteriors.

Since predicting any path yields the correct target sequence, vanilla CTC is designed to treat all paths equally. However, this equal treatment leaves it uncertain which path will be selected as the predicted alignment. Moreover, both our experiments (see Appendix K) and the literature (Sak et al., 2015) show that the predicted alignment disagrees with its reference (see Fig. 1.b), which limits its usage in real applications. The motivation of this work is therefore to control the CTC alignment prediction, making it predictable and functional. Specifically, instead of pursuing accurate alignment prediction, e.g., for CTC segmentation (Kürzinger et al., 2020), this work intentionally selects a path with customizable characteristics as the predicted alignment.

This paper proposes a novel Bayes risk CTC (BRCTC) criterion to make CTC alignment prediction controllable. To express our preference for paths with the desired characteristics, a Bayes risk function is adopted to weigh all paths during training.
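The notions of path, target posterior and dominant path used above can be made concrete with a small brute-force sketch. This is an illustration only, not the forward-backward algorithm used in practice: it enumerates every frame-level sequence, which is feasible only for toy sizes. The function names and the toy posteriors are assumptions introduced here, not taken from the paper or from any library.

```python
import itertools

BLANK = "∅"  # blank symbol, as in Figure 1(a)


def collapse(path):
    """Map a frame-level path to a target: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)


def enumerate_paths(target, num_frames, vocab):
    """All length-`num_frames` paths that collapse to `target` (brute force)."""
    symbols = [BLANK] + list(vocab)
    return [
        path
        for path in itertools.product(symbols, repeat=num_frames)
        if collapse(path) == tuple(target)
    ]


def path_posterior(path, frame_posteriors):
    """Posterior of one path: product of its unit-level posteriors."""
    prob = 1.0
    for t, sym in enumerate(path):
        prob *= frame_posteriors[t][sym]
    return prob


def ctc_target_posterior(target, frame_posteriors, vocab):
    """Vanilla CTC: every path counts equally, so the posteriors are summed."""
    paths = enumerate_paths(target, len(frame_posteriors), vocab)
    return sum(path_posterior(p, frame_posteriors) for p in paths)


def dominant_path(target, frame_posteriors, vocab):
    """The most probable path, i.e. the predicted hard alignment."""
    paths = enumerate_paths(target, len(frame_posteriors), vocab)
    return max(paths, key=lambda p: path_posterior(p, frame_posteriors))


# Toy unit-level posteriors for 3 input frames (hypothetical numbers).
vocab = ["a", "b"]
posteriors = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.2, "b": 0.2},
    {BLANK: 0.1, "a": 0.1, "b": 0.8},
]
print(enumerate_paths("ab", 3, vocab))
print(ctc_target_posterior("ab", posteriors, vocab))
print(dominant_path("ab", posteriors, vocab))
```

Nothing in the summed objective distinguishes one valid path from another, which is why the identity of the dominant path is left to the training dynamics.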
More specifically, the forward-backward algorithm of the original CTC is revised in a divide-and-conquer manner: the paths are first divided into several mutually exclusive groups according to a customizable property, and the path groups with a more preferred property receive larger risk values during training. Like vanilla CTC, BRCTC preserves the model's transcription ability, as it considers the posteriors of all paths during training. In addition, the alignment predicted by BRCTC acquires the desired characteristics through the risk function. Note that both how the paths are grouped and what risk value is assigned to each path group are customizable, so the exact functionality can be tailored to specific applications, such as offline and online scenarios.

In applications, BRCTC provides novel solutions to two key problems of seq2seq tasks. For offline systems, BRCTC helps to down-sample the intermediate hidden representations, so that the mismatch between the input and target lengths is alleviated and the inference cost is significantly reduced. For online systems such as streaming ASR, BRCTC provides a better trade-off between transcription quality and latency. Besides ASR, the proposed BRCTC criterion can also be generalized to other seq2seq tasks such as MT and ST. Experimentally, BRCTC can be combined with a trimming approach to achieve up to 47% inference cost reduction for offline systems without degradation in transcription performance; it can also build online systems with an extremely low overall latency that can hardly be achieved by vanilla CTC.

Our main contributions are as follows:

(1) Bayes risk CTC (BRCTC), an extension of CTC, is proposed as a customizable approach to achieve controllable CTC alignment prediction. To the best of our knowledge, this is among the first works to achieve alignment control for CTC-based models without external information.

(2) With various intentional designs of the risk function, BRCTC significantly reduces the inference cost (by up to 47% relative) and the overall latency (to 302 ms, up to 30% relative) for offline and online models, respectively.

(3) Strong experimental evidence is provided to show that high-quality CTC target predictions can be obtained from CTC / BRCTC posteriors that do not necessarily encode accurate alignment predictions.

¹ Code release: https://github.com/espnet/espnet.
² The reference alignment is obtained from a deep neural network-hidden Markov model (DNN-HMM) system.
* denotes corresponding authors.
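The divide-and-conquer idea described in the introduction (group the paths by a property, then weigh each group by a risk value) can be sketched by brute force on a toy example. Everything specific here is an assumption made for illustration and is not the paper's formulation: the grouping property (the frame index of the last non-blank emission), the exponential risk function and its parameter `lam`, and all function names. A practical implementation would use a revised forward-backward algorithm rather than enumeration.

```python
import itertools
import math
from collections import defaultdict

BLANK = "∅"


def collapse(path):
    """Merge repeated symbols, then drop blanks (the usual CTC mapping)."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)


def path_posterior(path, frame_posteriors):
    """Product of the unit-level posteriors along one path."""
    prob = 1.0
    for t, sym in enumerate(path):
        prob *= frame_posteriors[t][sym]
    return prob


def last_non_blank_frame(path):
    """Hypothetical grouping property: index of the last non-blank frame."""
    return max(t for t, sym in enumerate(path) if sym != BLANK)


def grouped_path_mass(target, frame_posteriors, vocab, group_fn):
    """Split the valid paths into exclusive groups and sum each group's posterior."""
    symbols = [BLANK] + list(vocab)
    mass = defaultdict(float)
    for path in itertools.product(symbols, repeat=len(frame_posteriors)):
        if collapse(path) == tuple(target):
            mass[group_fn(path)] += path_posterior(path, frame_posteriors)
    return dict(mass)


def early_emission_risk(lam):
    """Hypothetical risk: larger value for groups that finish emitting earlier."""
    return lambda t: math.exp(-lam * t)


def risk_weighted_objective(group_mass, risk_fn):
    """Sum of group posteriors, each weighed by the risk value of its group."""
    return sum(risk_fn(group) * mass for group, mass in group_mass.items())


# Toy unit-level posteriors for 3 input frames (hypothetical numbers).
vocab = ["a", "b"]
posteriors = [
    {BLANK: 0.1, "a": 0.8, "b": 0.1},
    {BLANK: 0.6, "a": 0.2, "b": 0.2},
    {BLANK: 0.1, "a": 0.1, "b": 0.8},
]
mass = grouped_path_mass("ab", posteriors, vocab, last_non_blank_frame)
print(mass)
print(risk_weighted_objective(mass, early_emission_risk(lam=0.0)))
print(risk_weighted_objective(mass, early_emission_risk(lam=1.0)))
```

With `lam=0.0` every group receives the same weight and the objective reduces to the vanilla CTC target posterior; with `lam > 0` the groups that finish emitting earlier contribute more, so maximizing the objective would favor moving posterior mass toward them.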

