CASR: GENERATING COMPLEX SEQUENCES WITH AUTOREGRESSIVE SELF-BOOST REFINEMENT

Abstract

There are sequence generation tasks where the best order to generate the target sequence is not left-to-right. Examples include a solution to a Sudoku puzzle, structured code such as an s-expression, and even a logical natural-language answer in which the analysis may be generated after the decision. We define the target sequences of those tasks as complex sequences. A complex sequence must be constructed through multiple logical steps and has dependencies among its parts (e.g., decisions depend on analyses). It is a great challenge for classic left-to-right autoregressive generation systems to generate complex sequences. Current approaches improve one-pass left-to-right generation on NLG tasks by generating different heuristic intermediate sequences in multiple stages. However, for complex sequences, the heuristic rules used to break them down may hurt performance and introduce additional exposure bias. To tackle these challenges, we propose a PLM-friendly autoregressive self-boost refinement framework, CASR. During training, CASR feeds in the predictions generated by the model itself at the previous refinement step (instead of those produced by heuristic rules). To find an optimal design, we also discuss model architecture, parameter efficiency, and initialization strategy. By evaluating CASR on Sudoku, WebQSP, MTOP and KVRET through controlled experiments and empirical studies, we find that CASR produces high-quality outputs. CASR also improves Accuracy on Sudoku (70.93% → 97.28%) and achieves state-of-the-art performance on KVRET with Micro F1 score (67.88% → 70.00%).



Introduction

Sequence models are widely used in tasks related to natural, domain-specific, and programming languages, e.g., question answering (Pandya & Bhatt, 2021), neural machine translation (Yang et al., 2020), speech recognition (Malik et al., 2021), automatic data analysis (Zhou et al., 2020), drug discovery (Kim et al., 2021), document summarization (Ma et al., 2020), code search and generation (Lee et al., 2021), etc. To achieve better performance on these tasks, recent works often adopt autoregressive (AR) models (Wu et al., 2016), especially those with a one-pass L2R (left-to-right) token-by-token generation / decoding order. Many SOTA-performance generative PLMs (pre-trained language models) are one-pass L2R models, such as GPT (Radford et al., 2018), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020).

However, for many sequence generation tasks, beyond the left-side dependencies there are right-side dependencies in the answer sequence, which together lead to multi-hop dependency chains, making left-to-right not the best order for generation. We call these tasks complex tasks, and the answer sequences of these tasks complex sequences. Complex tasks (see more details in §2.2; they include Sudoku (PARK), WebQSP (Yih et al., 2016), MTOP (Li et al., 2021), KVRET (Eric et al., 2017), etc.) require a better generation mechanism than one-pass L2R generation, since complex sequences are usually long, difficult, structured, or logical, and should be constructed through multiple logical steps. Human beings solve a complex problem with respect to its intrinsic order. For example, the order to write hierarchical answers (such as s-expression or SQL code) is usually bottom-up or top-down, following the dependencies between components, as discussed by Sun et al. (2020). The order to give an NL response is analyses first, then decisions, as discussed in CoT (Wei et al., 2022).
The order to solve a puzzle (such as the example 4x4 Sudoku game in Figure 2) is usually from easy parts to hard parts, because the hard parts become easier once the easy parts are correctly solved. (This is also verified in §5.1, where our CASR model learns to solve easy parts before hard ones.) Clearly, people give answers to different tasks in various orders, with respect to all kinds of dependencies.

Mimicking human behavior, some existing works design specific intermediate sequences to address this dependency-order challenge. E.g., templates (Hua & Wang, 2020) or heuristic rules (Zhang et al., 2018; Tan et al., 2021) are applied in autoregressive NL generation, allowing models to generate some parts (intermediate sequences) of an answer before the others via iterative refinement (rather than one-pass decoding). However, 1) it is hard to design the best heuristic order and easy to miss intrinsic dependencies for some tasks, and expert knowledge or manual effort is needed to design specific heuristic orders for each different task. Besides, 2) when a teacher-forcing strategy is used to train all refinement iterations in parallel, additional exposure bias occurs.

In this paper, we propose CASR (Generating Complex Sequences with Autoregressive Self-Boost Refinement), a framework that 1) decides the intermediate sequences of complex answers for different tasks in a data-driven way, and 2) avoids additional exposure bias. As shown in Figure 1 and discussed in §3, in CASR we design a model architecture (§3.2) that takes in not only the original input X but also the previous prediction Ŷ_{t-1}, in both training and inference. A special process (§3.1) is designed to train a refinement model M_t for each step t = 0, 1, ..., T−1. To enhance performance on downstream tasks, CASR models can be initialized with pre-trained language models (§3.3) such as T5 (Raffel et al., 2020), and even trained in a "Continue" way (§3.4) by initializing M_t
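The self-boost training and inference loops described above can be sketched as follows. This is a minimal sketch with a hypothetical model API (`fit`, `generate`); in the paper, each M_t is a seq2seq PLM such as T5.

```python
# Sketch of CASR's self-boost refinement (hypothetical `fit`/`generate` API).
# At step t, model M_t conditions on the original input X *and* the model's
# own previous-step prediction Y_hat_{t-1}, rather than on a heuristically
# constructed intermediate sequence.

def train_casr(models, train_data, T):
    """models: T refinement models M_0..M_{T-1}; train_data: list of (x, y)."""
    prev_preds = ["" for _ in train_data]  # M_0 sees an empty prediction
    for t in range(T):
        # Train M_t to map (X, Y_hat_{t-1}) -> Y, where Y_hat_{t-1} is the
        # model's own previous-step output (not a teacher-forced heuristic),
        # which avoids the additional exposure bias discussed above.
        pairs = [((x, prev_preds[i]), y) for i, (x, y) in enumerate(train_data)]
        models[t].fit(pairs)
        # Produce this step's predictions to feed the next refinement step.
        prev_preds = [models[t].generate(x, p)
                      for (x, _), p in zip(train_data, prev_preds)]
    return models

def infer_casr(models, x):
    y_hat = ""                   # start from an empty prediction
    for m in models:             # iterative refinement at inference time
        y_hat = m.generate(x, y_hat)
    return y_hat
```

The key design choice this sketch highlights is that the previous-step input comes from the model itself in both training and inference, so the training-time and test-time input distributions for each M_t match.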



Figure 1: The Overview of the CASR Framework. X, Y and Ŷ denote the input, ground truth and prediction, respectively. The blue arrows show how the previous-step prediction Ŷ_{t-1} is iteratively added back to the input to generate the refined output Ŷ_t.

Different from non-autoregressive (NAR) models (Gu et al., 2017), which assume independence among output tokens, L2R models assume the conditional probability in the form of P(Y|X) = ∏_i P(y_i | X, y_<i), which better captures the left-side dependencies that exist in most generation tasks. More variations of generation models are discussed in §2.1.

(a) A 9x9 Sudoku example. White cells denote blanks, and the green numbers in them denote the ground truth. (b) The solving order is from "a" to "g", rather than row-by-row.

Figure 2: Examples of Sudoku.
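The left-to-right factorization P(Y|X) = ∏_i P(y_i | X, y_<i) can be illustrated with a toy scorer. Here `cond_prob` is a hypothetical stand-in for a model's next-token distribution, not part of the paper's implementation.

```python
import math

# Toy illustration of the L2R chain-rule factorization: the probability of a
# whole sequence is the product of per-token conditionals, each conditioned
# on the input X and the already-generated prefix y_<i.

def sequence_log_prob(cond_prob, x, y):
    """Score a target sequence y token-by-token under an L2R model."""
    total = 0.0
    for i, tok in enumerate(y):
        total += math.log(cond_prob(x, y[:i], tok))  # log P(y_i | X, y_<i)
    return total
```

Because each conditional sees only the left context, right-side dependencies of the kind described above cannot influence earlier tokens, which is precisely the limitation CASR's iterative refinement addresses.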

