LEARNING STRUCTURAL EDITS VIA INCREMENTAL TREE TRANSFORMATIONS

Abstract

While most neural generative models generate outputs in a single pass, the human creative process is usually one of iterative building and refinement. Recent work has proposed models of editing processes, but these mostly focus on editing sequential data and/or only model a single editing pass. In this paper, we present a generic model for incremental editing of structured data (i.e., "structural edits"). In particular, we focus on tree-structured data, taking abstract syntax trees of computer programs as our canonical example. Our editor learns to iteratively generate tree edits (e.g., deleting or adding a subtree) and applies them to the partially edited data, so that the entire editing process can be formulated as consecutive, incremental tree transformations. To show the unique benefits of modeling tree edits directly, we further propose a novel edit encoder for learning to represent edits, as well as an imitation learning method that allows the editor to be more robust. We evaluate our proposed editor on two source code edit datasets, where results show that, with the proposed edit encoder, our editor significantly improves accuracy over previous approaches that generate the edited program directly in one pass. Finally, we demonstrate that training our editor to imitate experts and correct its mistakes dynamically can further improve its performance.

1. INTRODUCTION

Iteratively revising existing data for a certain purpose is ubiquitous. For example, researchers repeatedly polish their manuscripts until the writing becomes satisfactory; computer programmers keep editing existing code snippets and fixing bugs until the desired programs are produced. Can we properly model such iterative editing processes with neural generative models?

To answer this question, previous works have examined models for editing sequential data such as natural language sentences. Some example use cases include refining results from a first-pass text generation system (Simard et al., 2007; Xia et al., 2017), editing retrieved text into desired outputs (Gu et al., 2018; Guu et al., 2018), or revising a sequence of source code tokens (Yin et al., 2019; Chen et al., 2019; Yasunaga & Liang, 2020). These examples make a single editing pass by directly generating the edited sequence. In contrast, there are also works on modeling the incremental edits of sequential data, which predict sequential edit operations (e.g., keeping, deleting or adding a token) either in a single pass (Shin et al., 2018; Vu & Haffari, 2018; Malmi et al., 2019; Dong et al., 2019; Stahlberg & Kumar, 2020; Iso et al., 2020) or iteratively (Zhao et al., 2019; Stern et al., 2019; Gu et al., 2019a;b), or modify a sequence in a non-autoregressive way (Lee et al., 2018).

However, much interesting data in the world has strong underlying structure such as trees. For example, a syntactic parse can be naturally represented as a tree to indicate the compositional relations among constituents (e.g., phrases, clauses) in a sentence. A computer program is also inherently a tree defined by the programming language's syntax. When this underlying structure exists, many edits can be expressed much more naturally and concisely as transformations over the underlying trees than as conversions of the tokens themselves. For example, removing a statement from a computer program can be easily accomplished by deleting the corresponding tree branch as opposed to deleting tokens one by one. Despite this fact, work on editing tree-structured data has been much more sparse. In addition, it has focused almost entirely on single-pass modification of structured outputs, as exemplified by Yin et al. (2019); Chakraborty et al. (2020) for computer program editing.

In this work, we are interested in a generic model for incremental editing of structured data ("structural edits"). In particular, we focus on tree-structured data, taking abstract syntax trees of computer programs as our canonical example. We propose a neural editor that runs iteratively. At each step, the editor generates and applies a tree edit (e.g., deleting or adding a subtree) to the partially edited tree, which deterministically transforms the tree into its modified counterpart. Therefore, the entire tree editing process can be formulated as consecutive, incremental tree transformations (Fig. 1).

While recent works (Tarlow et al., 2019; Dinella et al., 2020; Brody et al., 2020) have also examined models that make changes to trees, our work is distinct from them in several respects. First, compared with Dinella et al. (2020), we study a different problem of editing tree-structured data particularly triggered by an edit specification (which implies a certain edit intent such as a code refactoring rule). Second, we model structural edits via incremental tree transformations, while Tarlow et al. (2019) and Brody et al. (2020) predict a complete edit sequence based on the fixed input tree, without applying the edits or performing any tree transformations incrementally. Although Dinella et al. (2020) have explored a similar idea, our proposed tree editor is more general owing to the adoption of the Abstract Syntax Description Language (ASDL; Wang et al., 1997). This offers our editor two properties: being language-agnostic and ensuring grammar validity. In contrast, Dinella et al. (2020) include JavaScript-specific design and employ only ad-hoc grammar checking. Finally, our tree editor supports a comprehensive set of operations such as adding or deleting a tree node and copying a subtree, which can fulfill a broad range of tree editing requirements. These operations are not fully allowed by previous work, e.g., Brody et al. (2020) cannot add (or generate) a new tree node from scratch; Tarlow et al. (2019) and Dinella et al. (2020) do not support subtree copying.

We further propose two modeling and training improvements, specifically enabled by and tailored to our incremental editing formalism. First, we propose a new edit encoder for learning to represent the edits to be performed. Unlike existing edit encoders, which compress tree differences at the token level (Yin et al., 2019; Hoang et al., 2020; Panthaplackel et al., 2020b) or jointly encode the initial and the target tree pairs in their surface forms (Yin et al., 2019), our proposed edit encoder learns the representation by encoding the sequence of gold tree edit actions. Second, we propose a novel imitation learning (Ross et al., 2011) method to train our editor to correct its mistakes dynamically, given that it can modify any part of a tree at any time.

We evaluate our proposed tree editor on two source code edit datasets (Yin et al., 2019). Our experimental results show that, compared with previous approaches that generate the edited program in one pass, our editor can better capture the underlying semantics of the intended edits, which allows it to outperform existing approaches by more than 7% accuracy in a one-shot evaluation setting. With the proposed edit encoder, our editor significantly improves accuracy over existing state-of-the-art methods on both datasets. We also demonstrate that our editor can become more robust by learning to imitate expert demonstrations dynamically. Our source code is available at https://github.com/neulab/incremental_tree_edit.

2. PROBLEM FORMULATION

As stated above, our goal is to create a general-purpose editor for tree-structured data. Specifically, we are interested in editing tree structures defined following an underlying grammar that, for every parent node type, delineates the allowable choices of child nodes. Such syntactic tree structures, like syntax trees of sentences or computer programs, are ubiquitous in fields like natural language processing and software engineering. In this paper we formulate editing such tree structures as revising an input tree C− into an output tree C+ according to an edit specification ∆. As a concrete example, we use editing abstract syntax trees (ASTs) of C# programs, as illustrated in Fig. 1. This figure shows transforming the AST of "x=list.ElementAt(i+1)" (C−) to the AST of "x=list[i+1]" (C+). In this case, the edit specification ∆ could be interpreted as a refactoring rule that uses the bracket operator [·] for accessing elements in a list.¹ In practice, the edit specification is learned

¹ The corresponding Roslyn analyzer in C# can be found at https://github.com/JosefPihrt/Roslynator/blob/master/docs/analyzers/RCS1246.md.
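To make the formulation concrete, the following is a minimal illustrative sketch (not the implementation released with this paper) of how a sequence of tree edits can be applied one at a time to turn a simplified AST of "x=list.ElementAt(i+1)" into that of "x=list[i+1]". The Node class, the three edit types (Delete, Add, CopySubtree), the path-based addressing, and the node labels are all assumptions introduced for illustration; a real editor would predict each edit with a neural model and constrain it with the grammar, whereas here the edit sequence is written by hand.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical, simplified AST node: a label plus ordered children."""
    label: str
    children: list[Node] = field(default_factory=list)

    def render(self) -> str:
        if not self.children:
            return self.label
        return f"{self.label}({', '.join(c.render() for c in self.children)})"


def get(tree: Node, path: tuple) -> Node:
    """Follow a path of child indices from the root."""
    node = tree
    for i in path:
        node = node.children[i]
    return node


# Hypothetical edit types; each addresses a child slot of a parent node.
@dataclass(frozen=True)
class Delete:
    parent: tuple
    index: int


@dataclass(frozen=True)
class Add:
    parent: tuple
    index: int
    label: str


@dataclass(frozen=True)
class CopySubtree:
    parent: tuple
    index: int
    source: tuple  # path of the subtree to copy, in the ORIGINAL input tree


def apply_edit(tree: Node, edit, original: Node) -> Node:
    """Deterministically transform `tree` into a new tree by one edit."""
    new_tree = copy.deepcopy(tree)
    parent = get(new_tree, edit.parent)
    if isinstance(edit, Delete):
        del parent.children[edit.index]
    elif isinstance(edit, Add):
        parent.children.insert(edit.index, Node(edit.label))
    elif isinstance(edit, CopySubtree):
        subtree = copy.deepcopy(get(original, edit.source))
        parent.children.insert(edit.index, subtree)
    else:
        raise TypeError(f"unknown edit: {edit!r}")
    return new_tree


def run_edits(original: Node, edits: list) -> list[Node]:
    """Apply edits incrementally, keeping every intermediate tree."""
    trajectory = [original]
    for edit in edits:
        trajectory.append(apply_edit(trajectory[-1], edit, original))
    return trajectory


# Simplified stand-in for the AST of x=list.ElementAt(i+1).
c_minus = Node("Assign", [
    Node("x"),
    Node("Call", [
        Node("MemberAccess", [Node("list"), Node("ElementAt")]),
        Node("Args", [Node("+", [Node("i"), Node("1")])]),
    ]),
])

edits = [
    Delete(parent=(), index=1),                          # drop the call
    Add(parent=(), index=1, label="ElementAccess"),      # new bracket node
    CopySubtree(parent=(1,), index=0, source=(1, 0, 0)), # reuse `list`
    CopySubtree(parent=(1,), index=1, source=(1, 1, 0)), # reuse `i+1`
]

trajectory = run_edits(c_minus, edits)
for step, tree in enumerate(trajectory):
    print(step, tree.render())
```

Each intermediate tree in the trajectory is the input to the next edit, which is the sense in which the process is incremental. Copying the subtrees for "list" and "i+1" from the input tree avoids regenerating them node by node.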

