CASR: GENERATING COMPLEX SEQUENCES WITH AUTOREGRESSIVE SELF-BOOST REFINEMENT

Abstract

There are sequence generation tasks where the best order to generate the target sequence is not left-to-right. Examples include an answer to a Sudoku game, structured code such as an s-expression, and even a logical natural language answer where the analysis may be generated after the decision. We define the target sequences of those tasks as complex sequences. A complex sequence should be constructed with multiple logical steps, and has dependencies among its parts (e.g. decisions depend on analyses). It is a great challenge for the classic left-to-right autoregressive generation system to generate complex sequences. Current approaches improve one-pass left-to-right generation on NLG tasks by generating different heuristic intermediate sequences in multiple stages. However, for complex sequences, the heuristic rules used to break them down may hurt performance and introduce additional exposure bias. To tackle these challenges, we propose a PLM-friendly autoregressive self-boost refinement framework, CASR. When training, CASR inputs the predictions generated by the model itself at the previous refinement step (instead of those produced by heuristic rules). To find an optimal design, we also discuss model architecture, parameter efficiency and initialization strategy. By evaluating CASR on Sudoku, WebQSP, MTOP and KVRET through controlled experiments and empirical studies, we find that CASR produces high-quality outputs. CASR also improves Accuracy on Sudoku (70.93% → 97.28%) and achieves state-of-the-art performance on KVRET with Micro F1 score (67.88% → 70.00%).



Figure 1: The Overview of CASR Framework. X, Y and $\hat{Y}$ denote the input, ground truth and prediction, respectively. The blue arrows show how we iteratively add the previous-step prediction $\hat{Y}^{t-1}$ back to the input for generating the refined output $\hat{Y}^{t}$.

1. INTRODUCTION

Sequence generation models are widely used in tasks related to natural, domain-specific and programming languages, e.g., question answering (Pandya & Bhatt, 2021), neural machine translation (Yang et al., 2020), speech recognition (Malik et al., 2021), automatic data analysis (Zhou et al., 2020), drug discovery (Kim et al., 2021), document summarization (Ma et al., 2020), code search and generation (Lee et al., 2021), etc. To achieve better performance on these tasks, recent works often adopt autoregressive (AR) models (Wu et al., 2016), especially the ones with one-pass L2R (left-to-right) token-by-token generation / decoding order. Many SOTA-performance generative PLMs (pre-trained language models) are one-pass L2R models, such as GPT (Radford et al., 2018), T5 (Raffel et al., 2020), Bart (Lewis et al., 2020), etc. Different from non-autoregressive (NAR) models (Gu et al., 2017), which assume independence among tokens, L2R models assume the conditional probability in the form of $P(Y \mid X) = \prod_i P(y_i \mid X, y_{<i})$, which better captures the left-side dependencies that exist in most generation tasks. More variations of generation models are discussed in §2.1.

Figure 2 (a): A 9x9 Sudoku Example. White cells denote blanks, and the green numbers in them denote the ground truth.
Figure 2 (b): A 4x4 Sudoku Example. The game solving order of a human is from "a" to "g", rather than row-by-row. The grid, with the solving-order letter attached to each of those cells, is:
3d 1  4b 2
4  2c 3  1a
2g 4  1  3
1  3f 2e 4

However, for many sequence generation tasks, beyond the left-side dependencies, there are right-side dependencies in the answer sequence, which together lead to multi-hop dependency chains, making left-to-right not the best order for generation.
We call these tasks complex tasks, and the answer sequences of these tasks complex sequences. Complex tasks (see more details in §2.2), including Sudoku (PARK), WebQSP (Yih et al., 2016), MTOP (Li et al., 2021), and KVRET (Eric et al., 2017), require a better generation mechanism than one-pass L2R generation, since complex sequences are usually long, difficult, structured, or logical, and should be constructed with multiple logical steps. Human beings solve a complex problem with respect to its intrinsic order. For example, the order to write hierarchical answers (such as s-Expression or SQL code) is usually bottom-up or top-down following the dependencies between components, as discussed by Sun et al. (2020). The order to give an NL response is first analyses, then decisions, as discussed by CoT (Wei et al., 2022). The order to solve a puzzle (such as the example 4x4 Sudoku game in Figure 2) is usually from easy parts to hard parts, because the hard parts become easier when the easy parts are correctly solved. (That is also verified in §5.1, where our CASR model learns to solve easy parts before hard ones.) People thus give answers to different tasks in various orders with respect to all kinds of dependencies. Mimicking human behavior, some existing works design specific intermediate sequences to solve the dependency order challenge. E.g., templates (Hua & Wang, 2020) or heuristic rules (Zhang et al., 2018; Tan et al., 2021) are applied in autoregressive NL generation, allowing models to generate some parts (intermediate sequences) before the other parts in an answer via iterative refinement (rather than one-pass decoding). However, 1) it is hard to design the best heuristic order and easy to miss intrinsic dependencies for some tasks, and expert knowledge or manual effort is needed to design specific heuristic orders for each task.
Besides, 2) when using the teacher forcing strategy to train all refinement iterations in parallel, additional exposure bias occurs. In this paper, we propose the CASR (Generating Complex Sequences with Autoregressive Self-Boost Refinement) framework to: 1) decide intermediate sequences of complex answers for different tasks in a data-driven way, and 2) avoid additional exposure bias. As shown in Figure 1 and discussed in §3, in CASR we design a model architecture (§3.2) that takes in not only the original input X, but also the previous prediction $\hat{Y}^{t-1}$ in both training and inference. A special process (§3.1) is designed to train refinement models $M_t$ for each step t = 0, 1, ..., T-1. To enhance the performance on downstream tasks, CASR models can be initialized with pre-trained language models (§3.3) such as T5 (Raffel et al., 2020), and even trained in a "Continue" way (§3.4) by initializing $M_t$ with $M_{t-1}$. For CASR models to be more efficient, we also explore parameter-efficient model designs (SepEnc vs. UniEnc in §3.2) and tuning with parallel adapters (§3.3). We evaluate CASR (and several baselines) on WebQSP, MTOP, KVRET, and Sudoku in §4, achieving SOTA performance on KVRET. In detail, CASR improves F1 on WebQSP from 70.81 to 74.81, EM on MTOP from 78.64 to 81.92, Micro F1 on KVRET from 67.40 to 70.00, and accuracy on Sudoku from 70.93 to 97.28. We find that the optimal CASR design (§4.2) is "Fine-tuning" "SepEnc" with the "Continue" strategy. We also conduct empirical studies on complexity (§5.1) and attention maps (§5.2), and visualize cases (§5.3). We find that CASR benefits hard sequences more than easy ones, and that CASR can indeed correct the wrong part of the previous prediction according to other parts it depends on (as shown in Figure 8).
In summary, our major contributions are: First, we point out the challenge of generating complex sequences due to the existence of multi-hop dependency chains, and conduct a comprehensive review of existing iterative refinement methods. Second, we propose an autoregressive self-boost refinement framework, CASR, to decide intermediate sequences of complex answers in a data-driven way. The code of the CASR framework is open sourced at https://github.com/RalphHan/CASR. Third, we conduct experiments and empirical studies on four complex tasks to show that CASR works and to interpret how it works.

Method: Objective
NAT (Gu et al., 2017): $\prod_i P(y_i \mid X)$
INAT (Lee et al., 2018): $\prod_i P(y_i^t \mid X, \hat{Y}^{t-1})$
Levenshtein (Gu et al., 2019): Imitate an expert policy to delete and insert
L2R (Wu et al., 2016): $\prod_i P(y_i \mid X, y_{i-1}, \ldots, y_1)$
XLNet (Yang et al., 2019): $\prod_i P(y_{z_i} \mid X, y_{z_{i-1}}, \ldots, y_{z_1})$
Bidirectional (Zhang et al., 2018): $\prod_i P(y_i \mid X, y_{i+1}, \ldots, y_n) \cdot \prod_i P(y_i \mid X, y_{i-1}, \ldots, y_1, C_{R2L})$
Progressive (Tan et al., 2021): $\prod_i P(y_i^t \mid X, Y^{t-1}, y \ldots$

2.1. SEQUENCE GENERATION METHODS

In recent years, many studies have been made on seq2seq generation. For input X and target Y sequences, the ultimate objective is to maximize $P(Y \mid X)$. As shown in Table 1, different methods formulate the objective differently. These methods can be divided into non-autoregressive (rows 1-3) and autoregressive (rows 4-7) ones. Highly related ones are run as baselines in Table 5. NAT (Gu et al., 2017) assumes that the tokens in sequence Y are mutually independent, thus formulating $P(Y \mid X)$ as $\prod_i P(y_i \mid X)$. To reduce the modeling bias in NAT, iterative refinement is applied on answers in INAT (Lee et al., 2018). Levenshtein Transformer (Gu et al., 2019) further breaks down each refinement step into deletion, insertion, and classification so as to allow the model to edit the generation in a non-autoregressive way. Despite their efficiency, non-autoregressive models perform relatively poorly compared to autoregressive ones (see §4.4). L2R (Wu et al., 2016) generation decodes from left to right, which is also adopted by GPT (Radford et al., 2018), the basic transformer decoder (Vaswani et al., 2017), the VAE decoder (Yu et al., 2020), and the RNN decoder (Xu et al., 2015; Xia et al., 2017). XLNet (Yang et al., 2019) allows the model to generate with any given order Z, but cannot decide the best order by itself. The bidirectional decoder (Zhang et al., 2018) generates backward (R2L) then forward (L2R), which handles at most 2-step logic. It trains an additional R2L decoder from scratch, without leveraging the generation capabilities of PLMs as in CASR (see §3.3). PAIR (Hua & Wang, 2020) is used in controlled text generation with a template (constructed from a set of provided key phrases and placeholders) as input. It fills and refines the placeholders in an autoregressive way. The template, the sequence length and the position of placeholders remain unchanged during refinement.
Similarly, chain of thought prompting (Wei et al., 2022) constructs the intermediate process through manually provided templates. These two methods are not chosen as baselines in §4 because they both require expert designs for each specific task. Progressive generation (Tan et al., 2021) is proposed to remove the requirement of templates and the limitation of fixed length during refinement. It breaks down the vocabulary into multiple stages according to word importance (average of tf-idf). Important words are generated first during early refinement steps, while other words (both important and unimportant) are generated later. Note that in Table 1, $Y^{t-1}$ (row 7) denotes the intermediate sequence produced by the heuristics of progressive generation, but $\hat{Y}^{t-1}$ (row 8) is the previous prediction generated by our CASR model (see §3).

2.2. TASKS WITH COMPLEX ANSWERS

The four complex-answer tasks referred to in §1 are introduced in this section. WebQSP (Yih et al., 2016) is a classic dataset for KBQA (Knowledge Base Question Answering). The input consists of a knowledge graph and an NL query, and the output is an s-Expression which can be executed on the knowledge graph. The SOTA method on WebQSP (F1=83.6%) is ranking with bootstrapping negative samples (Ye et al., 2022). MTOP (Li et al., 2021) is a benchmark for comprehensive multilingual task-oriented semantic parsing. The input consists of a list of API calls and an NL query, and the output is a tree-based TOP Representation that can be executed. KVRET (Eric et al., 2017) is a benchmark for table conversation. The input consists of a table and an NL query, and the output is an NL response corresponding to the dialog. The SOTA method on MTOP (EM=86.78%) and KVRET (Micro F1=67.88%) is multi-task (20+ tasks) prefix tuning with T5-3B as the backbone (Xie et al., 2022).

Sudoku (PARK) is an open dataset on Kaggle. The goal of the game is to fill the blanks correctly under the constraint that no two numbers in the same row, column, or house have the same value. We choose Sudoku as an intuitive toy task for better demonstration of the ideas in this paper. Examples of WebQSP (Figure 5), MTOP (Figure 6) and KVRET (Figure 7) are shown in Appendix §A. The number of samples in each split of the four tasks is shown in Table 2.

3. CASR FRAMEWORK

As shown in Figure 1 and introduced in §1 and §2.1, the key idea of the CASR framework is to take the previous-step prediction $\hat{Y}^{t-1}$ as part of the current-step (t) input for generating a refined output $\hat{Y}^{t}$. The iterative inference process in CASR follows this idea. As shown in Algorithm 1, the prediction at castep t is $\hat{Y}^{t}$ (0 ≤ t < T), generated by the corresponding CASR model $M_t$, where the input to $M_0$ is the original input X, and the input to $M_t$ (t > 0) is $(X, \hat{Y}^{t-1})$. Following the notation in §2.1, X, Y, and $\hat{Y}$ denote the input, the ground truth and the prediction, respectively. The iterative refinement process in CASR takes at most T casteps (CASR steps). The undetermined models $M_t$ in the CASR inference process lead to more questions: 1) How to design a training process and objectives that match the inference process? 2) How to design a CASR model architecture that leverages existing PLMs and takes the extra $\hat{Y}^{t-1}$ as input? 3) When leveraging a large PLM, can we train and save the CASR models $M_t$ with fewer parameters? 4) Can the training and inference process be made more effective by exploiting the relationships among the CASR models $M_t$ of each castep? In the following, we discuss these problems one by one.
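The inference loop described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the `RefineModel` callable interface and the toy models are assumptions made for the example, and the sketch returns the prediction of every castep instead of selecting a single best one.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical model interface: (X, previous prediction or None) -> new prediction.
RefineModel = Callable[[str, Optional[str]], str]


def casr_inference(x: str, models: Sequence[RefineModel]) -> List[str]:
    """Run one refinement pass per model and return every castep's prediction."""
    predictions: List[str] = []
    prev: Optional[str] = None       # M_0 only sees the original input X
    for model in models:             # models[t] plays the role of M_t
        prev = model(x, prev)        # M_t (t > 0) sees (X, previous prediction)
        predictions.append(prev)
    return predictions


# Toy stand-ins: the first step upper-cases X, later steps append a marker.
def toy_model(x: str, prev: Optional[str]) -> str:
    return x.upper() if prev is None else prev + "!"


outputs = casr_inference("ab", [toy_model, toy_model, toy_model])
```

In this toy run each step consumes the output of the step before it, which is the only property of the framework the sketch is meant to show.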

3.1. TRAINING PROCESS

As formulated in Algorithm 2, the training process is also iterative. At the 0-th castep, we train the model $M_0$ normally on the train set with the objective of maximizing $P(Y \mid X)$ (as discussed in §2.1). When the training is done, we use $M_0$ to generate the prediction $\hat{Y}^{0}$ for each X in the train and dev sets with beam search (Ney et al., 1987). At the t-th castep (0 < t < T), we train $M_t$ to refine the last prediction $\hat{Y}^{t-1}$. The objective becomes maximizing $P(Y \mid X, \hat{Y}^{t-1})$ on the train set. After training, we generate $\hat{Y}^{t}$ on the train and dev sets to compose the samples $(X, Y, \hat{Y}^{t})$ for the next castep t + 1. After the training process, T versions of the CASR model $M_t$ (0 ≤ t < T) are available for the inference process. Because the distribution of $\hat{Y}^{t-1}$ could change with t, and different casteps may learn different refinement patterns, by default all T versions are saved. In the following, we discuss more details of the algorithms regarding the $M_t$ model architecture (§3.2), tuning (§3.3), and initialization (§3.4). More discussions on result selection are in Appendix §B.

Figure 3 panels: SepEnc+Fine-tune, UniEnc+Fine-tune, SepEnc+Adapter, UniEnc+Adapter. AE and AD denote adapters of encoder and decoder. Blue and gray segments are the parameters to be trained and frozen, respectively.

Figure 4: Initialization Strategies (legend: Train and Predict). The arrows begin at the source of the initial weights of $M_t$.

3.2. MODEL ARCHITECTURE

The next problem after the training process is how to design a model architecture which can take $(X, \hat{Y}^{t-1})$ as input and generate $\hat{Y}^{t}$. Starting from any model based on a Transformer encoder-decoder (Vaswani et al., 2017), such as T5 or an untrained Transformer, in CASR we provide two modification approaches to take in $\hat{Y}^{t-1}$. As shown in Figure 3, SepEnc has two encoders to encode X and $\hat{Y}^{t-1}$ separately (encode before concat), and UniEnc adopts only one encoder for both inputs (concat before encode). The output sequence length of both SepEnc and UniEnc is $\mathrm{len}(X) + \mathrm{len}(\hat{Y}^{t-1})$. Let H denote the encoder output. For SepEnc, $H = \mathrm{Concat}(\mathrm{Encoder}_1(X), \mathrm{Encoder}_2(\hat{Y}^{t-1}))$, and for UniEnc, $H = \mathrm{Encoder}(\mathrm{Concat}(X, \hat{Y}^{t-1}))$. Both approaches have their own advantages: SepEnc naturally handles the distribution differences between X and $\hat{Y}^{t-1}$, while UniEnc requires fewer model parameters. UniEnc forces one encoder to handle two kinds of sequences with potentially different types of contents and lengths. As a tradeoff, SepEnc doubles the size of the encoding part. The decoder generates answers from H, just like the classic transformer decoder (Vaswani et al., 2017).
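The two ways of building H can be illustrated with a small sketch. The toy encoder below is a hypothetical stand-in that maps each token id to a two-dimensional vector without any contextualization, so it only demonstrates the order of operations and the resulting sequence length, not the behavior of a real Transformer encoder.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[int]], List[Vector]]


def make_toy_encoder(tag: float) -> Encoder:
    """Hypothetical encoder stand-in: one vector [token_id, tag] per token."""
    return lambda ids: [[float(i), tag] for i in ids]


def sep_enc(x: List[int], y_prev: List[int], enc1: Encoder, enc2: Encoder) -> List[Vector]:
    """SepEnc: encode X and the previous prediction separately, then concatenate."""
    return enc1(x) + enc2(y_prev)


def uni_enc(x: List[int], y_prev: List[int], enc: Encoder) -> List[Vector]:
    """UniEnc: concatenate X and the previous prediction, then encode once."""
    return enc(x + y_prev)


x_ids, y_prev_ids = [1, 2, 3], [4, 5]
h_sep = sep_enc(x_ids, y_prev_ids, make_toy_encoder(1.0), make_toy_encoder(2.0))
h_uni = uni_enc(x_ids, y_prev_ids, make_toy_encoder(1.0))
```

Both variants yield a sequence of length len(X) + len(previous prediction); they differ in whether the two segments pass through separate parameters (the `tag` value here) or a shared encoder.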

3.3. PARAMETER EFFICIENCY

Both the SepEnc and UniEnc approaches in §3.2 can take a PLM (pre-trained language model, such as T5) as the starting point for model parameter initialization (more details in §3.4). In this way, CASR can leverage the existing knowledge of the PLM to enhance performance on downstream tasks (e.g., the ones in §2.2). However, it is costly to fine-tune all the parameters of a large PLM and to save all the parameters of all T CASR models $M_t$. Thus, besides the fine-tuning approach, in CASR we also try parameter-efficient tuning. As shown in Figure 3, parallel adapters (He et al., 2021) can be added to every encoder and decoder. With adapters, the parameters of the PLM are frozen, and only the parameters of the adapters are trained. In other words, one only needs to save the adapter parameters (rather than the whole model) T times.

3.4. INITIALIZATION STRATEGY

In Algorithm 2, at the beginning of each castep, we initialize the parameter weights of $M_t$ with a strategy chosen from {Restart, Continue}. Illustrations of each strategy are shown in Figure 4. "Restart" means initializing each $M_t$ with the same PLM weights (or from scratch). For the SepEnc architecture, both $\mathrm{Encoder}_1$ and $\mathrm{Encoder}_2$ can be initialized with the parameter weights of the PLM encoder. "Continue" means initializing with the best checkpoint from the previous castep. In this way, $M_t$ inherits the knowledge from $M_{t-1}$ to avoid the potential cold start issue at castep t. In the "Restart" strategy, when comparing a late castep with an early castep, the improvement is brought only by refinement, rather than by the extra continuous training steps (which exist in the "Continue" strategy). Note that we design "Restart" mainly to decouple the effects of refinement from those of training steps, which is necessary for the controlled experiments in §4.
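A sketch of the iterative training process with the two initialization strategies is given below. It is an illustration under stated assumptions: `new_model`, `fit` and `predict` are hypothetical callables standing in for PLM loading, training with checkpoint selection, and beam-search decoding, and the dev set is omitted for brevity.

```python
import copy
from typing import Any, Callable, List, Optional, Tuple

Sample = Tuple[str, str, Optional[str]]  # (X, Y, previous prediction)


def casr_train(
    pairs: List[Tuple[str, str]],
    T: int,
    new_model: Callable[[], Any],                        # PLM weights or scratch
    fit: Callable[[Any, List[Sample]], Any],             # returns the trained model
    predict: Callable[[Any, str, Optional[str]], str],   # decoding stand-in
    strategy: str = "Continue",
) -> List[Any]:
    """Train M_0 .. M_{T-1}, feeding each step the previous step's predictions."""
    if strategy not in ("Restart", "Continue"):
        raise ValueError(f"unknown strategy: {strategy}")
    models: List[Any] = []
    prev_preds: List[Optional[str]] = [None] * len(pairs)
    for _ in range(T):
        if strategy == "Continue" and models:
            model = copy.deepcopy(models[-1])   # start from M_{t-1}
        else:
            model = new_model()                 # same initial weights at every castep
        samples = [(x, y, p) for (x, y), p in zip(pairs, prev_preds)]
        model = fit(model, samples)
        # Predictions on the training inputs become part of the next castep's input.
        prev_preds = [predict(model, x, p) for x, _, p in samples]
        models.append(model)
    return models


# Toy stand-ins: a "model" is a dict that counts how many times it was fitted.
toy_new = lambda: {"fits": 0}
toy_fit = lambda m, samples: {"fits": m["fits"] + 1}
toy_predict = lambda m, x, p: f"{x}:{m['fits']}"

continued = casr_train([("x", "y")], 3, toy_new, toy_fit, toy_predict, "Continue")
restarted = casr_train([("x", "y")], 3, toy_new, toy_fit, toy_predict, "Restart")
```

With the toy counters, the "Continue" models accumulate training across casteps while the "Restart" models do not, which mirrors why "Restart" isolates the effect of refinement from that of extra training steps.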

4. EXPERIMENTS

In this section, we run controlled experiments to evaluate the CASR designs (§3) on WebQSP (Yih et al., 2016), MTOP (Li et al., 2021), KVRET (Eric et al., 2017) and Sudoku (PARK) (§2.2), and compare CASR with closely related baselines (§2.1).

4.1. EXPERIMENT SETUP

In this section, we explain how we design the controlled experiments and reproduce the closely related baselines. For more details about hyperparameters and resource consumption (number of parameters, training/inference time, etc.), please see Appendix §C. The three row blocks A0 ∼ E0, A1 ∼ E1 and A2 ∼ E2 correspond to the $M_0$, $M_1$ and $M_2$ models at castep t = 0, 1, 2, respectively. By the definitions in §3, rows A0 ∼ D0 share the same result because initialization and architecture choices do not influence $M_0$ models. In Table 5, we run some closely related baselines as introduced in §2.1. We run INAT and Levenshtein with the official code provided by fairseq, where the self-attention layers of the encoder and the decoder are initialized with BERT, leaving the cross-attention layers of the decoder trained from scratch. To decouple the effect of PLMs and make a fairer comparison between NAR and AR methods, we add row Xc, where the decoder of CASR is not initialized with T5 but from scratch (random weights), marked as "/Dec". In row L2, "CASR-L" denotes changing the backbone of row B2 from T5-base to T5-large. For fair comparisons, the original backbones in the Bidirectional Decoder (GRU) and Progressive Generation (Bart) implementations are changed to T5-base (same as CASR), and T is set to 3 (same as CASR) for Progressive Generation.

4.2. BEST CASR DESIGNS

We achieve SOTA performance on Micro F1 of KVRET, which is 70.00. The previous SOTA is 67.88 (Xie et al., 2022) based on T5-3B, and we beat it with the smaller T5-base. For the parameter initialization choices, the "Continue" strategy brings better performance than "Restart". This is observed by comparing the A rows with the B rows and the C rows with the D rows in Table 3, and by comparing the "Restart" and "Continue" columns in Table 4. For the model architectures, by comparing the A rows with the C rows and the B rows with the D rows, we find that there is no great performance gap between SepEnc and UniEnc, and that SepEnc is slightly better than UniEnc when combined with "Continue". For parameter efficiency, by comparing the E rows with the A rows, we find that adapter-tuning performs worse than fine-tuning. This drop is the expected cost of freezing the PLM parameters for efficiency.

4.3. ANALYSIS ON SELF-BOOST REFINEMENT

As introduced in §3.4, we design "Restart" to decouple the effects of refinement from those of fitting the training set for longer than at t = 0. The variations with the "Restart" strategy (A, C and E in Table 3, "Restart" column in Table 4) demonstrate how self-boost refinement can improve the quality of generated sequences. Here the improvement is brought only by refinement, since the CASR models at all casteps are initialized with the same PLM/Scratch version and take the same number of training steps. In Table 4, comparing "Continue" with "Continue w/o $\hat{Y}^{t-1}$", we find that feeding the model's own previous prediction $\hat{Y}^{t-1}$ back to it is indeed helpful, improving the accuracy from 92.09% to 97.28%. Interestingly, the performance gap between casteps 0 and 1 is much larger than the gap between casteps t and t + 1 (t > 0). One possible explanation is that later refinements have reached the upper bound. As we can see in Table 4, Sudoku requires more refinement steps (5 in our case) for the gap to converge. This suggests that more logical steps are required and more dependencies exist in the generated Sudoku answers.

4.4. COMPARISON WITH BASELINES

In the NAR block of Table 5, we find that the results of NAR methods are not on par with those of AR methods. With a case study (Figure 10), we find that NAR methods learn syntax poorly. Comparing rows Xb and Xc, we find that the weakness of NAR methods does not come from the initialization of the decoder. Comparing rows Xd, Xe, and B0, we find that AR refinement methods using heuristic intermediate sequences are indeed harmful (worse than directly fine-tuning) for complex sequence generation. Comparing rows B2 and L2, we find that a larger backbone (T5-base → T5-large) is overall helpful, although not on KVRET (UnifiedSKG (Xie et al., 2022) observes the same phenomenon that T5-large is worse than T5-base on KVRET). In summary, the following conclusions can be drawn for CASR:
• Self-boost refinement leads to better outputs than vanilla AR (one-pass L2R) generation, NAR refinement generation, and heuristic-rule-based AR refinement generation.
• In general, the best combination is "Fine-tuning" "SepEnc" with the "Continue" strategy.

5. EMPIRICAL STUDIES

In this section, we conduct empirical studies to interpret how CASR works. In §5.1, we verify that CASR brings more improvement on more complex sequences. We also find that the CASR model may implicitly learn dependencies, because it solves easy parts first during refinement. In §5.2, the cross-attention map between $\hat{Y}^{t}$ and $\hat{Y}^{t-1}$ reflects the informativeness and correctness of $\hat{Y}^{t-1}$. Meanwhile, from a microscopic perspective, each row in the cross-attention map indicates the basis for changing a component, where the most attended parts can be considered the cause of the change. In §5.3, through case studies we show the intermediate steps generated by CASR models.

5.1. ANALYSIS ON COMPLEXITY

CASR helps more on problems with more complexity. For WebQSP, MTOP, and KVRET, we assume that the complexity of a problem (X, Y) is positively correlated with the length of its answer. Thus, we divide each test dataset into "Short", "Middle" and "Long" equal splits according to len(Y), the length of the ground truth answer. As we can see in Table 6, CASR brings more improvement to the problems with "Long" answers than to those with "Middle" and "Short" ones. Taking Sudoku as an example, we find that CASR solves easy blanks before difficult blanks during refinement. In late casteps, the generation benefits more from refinement by gradually handling complex dependencies. This is shown in Table 7: as castep t increases, the average difficulty of the remaining blanks increases. Here we formulate the difficulty of a blank cell as $\mathrm{Difficulty} = (r-1) \cdot (c-1) \cdot (h-1)$, where r, c, and h denote the number of blanks in the same row, column, and house as the blank. We also define the correct-solved castep for a blank: if a blank is correctly filled at castep t and remains the same for the rest of the casteps, then the blank is solved at castep t, and t is the correct-solved castep for the blank. Otherwise, by default the correct-solved castep is set to T. The "Ratio" column in Table 7 is the percentage of blanks that are correctly solved at castep t.
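The two definitions above can be written down directly. The following is a small sketch under the assumption that the per-castep predictions for a single blank are available as a list; the function names are illustrative and not taken from the released code.

```python
from typing import List


def blank_difficulty(r: int, c: int, h: int) -> int:
    """Difficulty = (r - 1) * (c - 1) * (h - 1).

    r, c and h count the blanks (including the cell itself) in the
    cell's row, column and house.
    """
    return (r - 1) * (c - 1) * (h - 1)


def correct_solved_castep(cell_preds: List[int], truth: int, T: int) -> int:
    """Earliest castep t from which the prediction stays correct, else T.

    cell_preds[t] is the value predicted for this blank at castep t.
    """
    for t in range(len(cell_preds)):
        if all(p == truth for p in cell_preds[t:]):
            return t
    return T


# Values from the worked example in Appendix D: r=5, c=6, h=4.
example_difficulty = blank_difficulty(5, 6, 4)
```

Note that a blank which is correct at an early castep, then changed, then corrected again is counted at the later castep, since the definition requires the value to remain the same for the rest of the casteps.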

5.2. ANALYSIS ON ATTENTION

To measure to what extent $\hat{Y}^{t}$ attends to X or $\hat{Y}^{t-1}$ in cross-attention, we define the density. The cross-attention map, A, is a tensor that records how $\hat{Y}^{t}$ attends to H. Therefore, the size of A is [num-layers, num-heads, len($\hat{Y}^{t}$), len(X)+len($\hat{Y}^{t-1}$)]. We define the density to X and to $\hat{Y}^{t-1}$ as $D_X$ = A[..., :len(X)].mean() and $D_{\hat{Y}^{t-1}}$ = A[..., len(X):].mean(), which are the average attention weights over all layers, heads, and tokens, for X and for $\hat{Y}^{t-1}$. Similar to Table 6, we sort the samples according to their $D_{\hat{Y}^{t-1}}$, and divide a test dataset into sparse, middle, and dense splits, each holding 1/3 of the total test set. We evaluate the metrics of $\hat{Y}^{t-1}$ on the three splits, and find that the performance of $\hat{Y}^{t-1}$ increases with $D_{\hat{Y}^{t-1}}$. Besides, we compute the average $D_X$ and $D_{\hat{Y}^{t-1}}$ over all samples in the test set, and find that $D_{\hat{Y}^{t-1}}$ grows with t. Therefore, $\hat{Y}^{t-1}$ becomes more and more informative and is thus attended to more by $\hat{Y}^{t}$. Please refer to Table 10 and Table 11 in Appendix §D for more details. Meanwhile, we draw the top-5 positions in the cross-attention map (filtering out non-blank cells and the cell itself) for each changed cell in Figure 8. As we can see, a corrected cell usually attends to cells in the same row, column, or house.
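The density definition and the three-way split can be sketched with NumPy as follows. The toy attention map is synthetic, and `split_into_thirds` is an illustrative helper name; how the attention tensors are extracted from a trained model is not covered here.

```python
import numpy as np


def attention_density(A: np.ndarray, len_x: int) -> tuple:
    """Return (D_X, D_prev) for a cross-attention map A.

    A has shape [num_layers, num_heads, len(Y_t), len(X) + len(Y_prev)].
    """
    d_x = float(A[..., :len_x].mean())      # average weight on positions of X
    d_prev = float(A[..., len_x:].mean())   # average weight on the previous prediction
    return d_x, d_prev


def split_into_thirds(scores):
    """Sort sample indices by score and cut them into sparse/middle/dense parts."""
    order = np.argsort(np.asarray(scores))
    sparse, middle, dense = (part.tolist() for part in np.array_split(order, 3))
    return sparse, middle, dense


# Synthetic map: 1 layer, 2 heads, 3 target tokens, len(X) = 2, len(Y_prev) = 2.
A = np.empty((1, 2, 3, 4))
A[..., :2] = 0.1   # each target token puts 0.1 on every X position
A[..., 2:] = 0.4   # and 0.4 on every previous-prediction position
d_x, d_prev = attention_density(A, len_x=2)
groups = split_into_thirds([0.3, 0.1, 0.2, 0.6, 0.5, 0.4])
```

Because both densities are means over positions rather than sums, they are comparable across inputs of different lengths, which is what allows samples to be ranked by the density to the previous prediction.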

5.3. CASE STUDIES

As examples, the intermediate CASR predictions of each castep are shown in Figure 9 of Appendix §D. We find that as castep t increases, there are more green (keyword) and fewer red (mistaken) parts, which means that $\hat{Y}^{t}$ becomes closer to Y. In the WebQSP case, the wrong part (marked red) of $\hat{Y}^{0}$ is deleted by later casteps. In the MTOP case, we find that CASR correctly generates the right part ("[SL:RECIPES_INCLUDED_INGREDIENT dairy]") of the answer before the left part ("[SL:METHOD_RECIPES recipe]"), which one-pass L2R decoding cannot achieve. For Sudoku, an interesting phenomenon occurs. Although the wrong predictions (marked red) become fewer as the castep grows, some previously correct predictions are modified to wrong ones, and then modified back. This can be considered a kind of exploration. We also present the predictions from other experiments (baselines and CASR-L, as discussed in Table 5) in Figure 10, and we find it a challenge for NAR methods to generate complex sequences.

6. CONCLUSION

In this paper, we propose the CASR framework to generate Complex sequences with Autoregressive Self-Boost Refinement, which can decide intermediate sequences of complex answers in a data-driven way with no need for expert knowledge or manual effort. Through controlled experiments, we find that the optimal design in CASR is Fine-tuning SepEnc with the Continue strategy. Also, the interpretability of CASR is enhanced via empirical studies on problem complexity and attention dependencies. CASR sheds more light on sequence generation methods, especially on complex tasks (e.g. automatic data analysis and drug discovery), for future research.

... 4K steps for fine-tuning WebQSP, MTOP and KVRET, respectively. For adapter-tuning, we double the epoch number, which is 4K, 8K, and 8K steps. For the three tasks, we set the batch-size to 128, learning-rate to 2e-5, max-input-length to 1024, max-generation-length to 128, beam-size to 4, and evaluate every 2K steps for checkpoint selection. For Sudoku, we train a 12-layer encoder-decoder transformer from scratch, with d-model=512, ffn-dim=2048, num-heads=8. We set max castep T = 5 and max epoch E = 10K steps. We set the batch-size to 1024, learning-rate to 2e-5, beam-size to 2, and evaluate every 2K steps for checkpoint selection. Table 8 shows the total parameters of each controlled experiment to help the audience understand our methods. In row E, the total number of parameters is T*A+(E+D) rather than T*A+(2E+D), because $\mathrm{Encoder}_1$ and $\mathrm{Encoder}_2$ are initialized with the same PLM and frozen all the time.
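The parameter count for row E can be checked with a few lines of arithmetic. In this sketch A, E and D are read as the adapter, encoder and decoder parameter counts of one model, and the concrete sizes are hypothetical numbers chosen only to make the difference visible; they are not values from Table 8.

```python
def adapter_storage(T: int, A: int, E: int, D: int) -> int:
    """T*A + (E + D): T adapter sets plus one shared frozen encoder and decoder."""
    return T * A + (E + D)


def adapter_storage_two_encoders(T: int, A: int, E: int, D: int) -> int:
    """T*A + (2E + D): the count if both frozen encoder copies were stored."""
    return T * A + (2 * E + D)


# Hypothetical sizes in millions of parameters (not taken from the paper).
T, A, E, D = 3, 2, 100, 120
shared = adapter_storage(T, A, E, D)
duplicated = adapter_storage_two_encoders(T, A, E, D)
```

Since the two encoders are frozen copies of the same PLM encoder, storing one copy is sufficient, and the saving is exactly one encoder's worth of parameters regardless of T.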

C.2 RESOURCE CONSUMPTION

Table 9 shows the training and inference time consumption of CASR and baselines. Note that CASR-L runs slowly because we leverage DeepSpeed to avoid OOM. We train on 4 Tesla V100 GPUs. It takes 10, 22, and 19 hours to fine-tune WebQSP, MTOP and KVRET, respectively, and 12, 25, and 21 hours to adapter-tune them. Adapter-tuning is intrinsically hard, so it takes more training steps to tune fewer parameters compared to fine-tuning. For Sudoku, it takes 32 hours to train from scratch.

D DETAILED RESULTS OF EMPIRICAL STUDIES

In the game table in Figure 2, for the top-left blank (the ground truth is 6 in green), r=5 (the ground truths of the row blanks are 6, 9, 4, 8, and 3 respectively), c=6 (the ground truths of the column blanks are 6, 4, 3, 9, 8, and 2), and h=4 (the ground truths of the house blanks are 6, 9, 7, and 4). Then, the difficulty to fill the top-left blank is (5-1)*(6-1)*(4-1)=60.

Algorithm 1 CASR Inference Process.
Input: max castep T; input X (from test set); well-trained CASR models $M_t$ (0 ≤ t < T)
Output: the best prediction $\hat{Y}$ for X
1: for t in 0, 1, ..



Footnotes:
KVRET result: 70.00 achieves the SOTA for Micro F1 on KVRET, and the previous SOTA is 67.88 (Xie et al., 2022).
fairseq: https://github.com/facebookresearch/fairseq
Difficulty: According to this definition, when a blank is the only blank in a row, column, or house, the difficulty is 0, which makes sense because the blank can be directly solved. The larger r, c, and h are, the more difficult it is to solve the blank.
DeepSpeed: https://github.com/microsoft/DeepSpeed



Figure 2: Examples of Sudoku.

Figure 3: Model Architectures and Parameter Efficiency Choices.

Figure 7: An Example of KVRET.

Figure 8: The Dependencies between Casteps Discovered by CASR. Shown as purple (t=1 attends to t=0) and blue (t=2 attends to t=1) arrows.

Table 1: The Objectives of Sequence Generation Methods.

Table 3: Controlled Experiments of CASR Framework on WebQSP, MTOP, and KVRET.

Table 4: Sudoku Testing Results. The model only predicts on blanks, and we fill the predictions back to the blanks of the input as the combination of X and $\hat{Y}$ at the beginning of each castep t > 0. In other words, we use a special UniEnc to encode the combination, with $H = \mathrm{Encoder}(\mathrm{Combine}(X, \hat{Y}^{t-1}))$. "Restart" and "Continue" correspond to the same choices as C and D in Table 3. "Continue w/o $\hat{Y}^{t-1}$" denotes that the previous predictions are not put back to the blanks of the input at each castep, which is equivalent to training the original model for T times as many steps as usual.

Table 5: Comparisons between CASR and Baselines.

Table 6: Result on Different Answer Length.

Table 8: CASR Variations and Their Number of Parameters.

Table 9: Training and Inference Time Consumption (take WebQSP as an example)

Table 10: Performance of last-step prediction $\hat{Y}^{1}$ when castep t=2, produced by the setting of row B in Table 8, grouped by $D_{\hat{Y}^{1}}$, the density to $\hat{Y}^{1}$.
Sparse 51.88 78.52 72.72 16.70 64.17 56.80
Middle 85.84 87.53 83.17 20.18 75.09 70.36
Dense 85.78 89.08 87.63 20.26 85.80 73.71

Table 11: The average density to input X and previous prediction $\hat{Y}^{t-1}$ of the setting of row B in Table 8. Note that, when castep t=0, the density to the previous prediction is 0, thus not listed.

., T - 1 do
    $\hat{Y}^{t} \leftarrow \arg\max_Y P(Y \mid X, M_t)$ if t = 0, otherwise $\arg\max_Y P(Y \mid X, \hat{Y}^{t-1}, M_t)$

Algorithm 2 CASR Training Process (§3.1).
Input: max castep T; max epoch E; all input X and ground truth Y in train ($S_t$) and dev ($S_d$) sets
Output: CASR models $M_t$ (0 ≤ t < T)
1: for t in 0, 1, ..., T - 1 do
    for e in 0, 1, ..., E - 1 do
    $\hat{Y}^{t} \leftarrow \arg\max_Y P(Y \mid X, M_t)$ if t = 0, otherwise $\arg\max_Y P(Y \mid X, \hat{Y}^{t-1}, M_t)$

ACKNOWLEDGMENTS

This research was partly supported by Shenzhen Stable Supporting Program (WDZC20200820200655001).

ETHICS STATEMENT

Datasets. This work collects public datasets for research purposes. We believe there is no privacy issue, because Sudoku, WebQSP, MTOP, and KVRET are accessible to the public.

Models. This work leverages T5-base and T5-large, which are pre-trained on C4, a clean corpus. Therefore, we can make sure that CASR will not produce discriminatory answers.

Computational Resources. We train on 4 Tesla V100 GPUs, and the training and inference time consumption is shown in Table 9, which is acceptable considering the performance gain.

For WebQSP, MTOP, and KVRET, keywords are highlighted in green, and red denotes wrong predictions. For Sudoku, the white fields denote the blanks to be filled, the red numbers in them denote wrong predictions, and the green ones denote correct predictions. It is a coincidence that the Sudoku prediction for this sample remains the same for t=2,3,4, so we draw them together.

