N-BREF: A HIGH-FIDELITY DECOMPILER EXPLOITING PROGRAMMING STRUCTURES

Abstract

Binary decompilation is a powerful technique for analyzing and understanding software when source code is unavailable, and it is a critical problem in the computer security domain. With the success of neural machine translation (NMT), recent efforts on neural-based decompilers show promising results compared to traditional approaches. However, several key challenges remain: (i) Prior neural-based decompilers focus on simplified programs without considering sophisticated yet widely-used data types such as pointers; furthermore, many high-level expressions map to the same low-level code ("expression collision"), which incurs critical decompilation performance degradation; (ii) State-of-the-art NMT models (e.g., the transformer and its variants) mainly deal with sequential data; this is inefficient for decompilation, where the input and output are highly structured. In this paper, we propose N-Bref, a new framework for neural decompilers that addresses the two aforementioned challenges with two key design principles: (i) N-Bref designs a structural transformer with three key components for better comprehension of structural data: an assembly encoder, an abstract syntax tree encoder, and a tree decoder, extending transformer models to the context of decompilation. (ii) N-Bref introduces a program generation tool that controls the complexity of code generation and removes expression collisions. Extensive experiments demonstrate that N-Bref outperforms previous neural-based decompilers by margins of 6.1%/8.8% accuracy in data type recovery and source code generation. In particular, N-Bref decompiles human-written Leetcode programs with complex library calls and data types with high accuracy.

1. INTRODUCTION

Decompilation, the process of recovering source code from binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. For example, decompilation is highly valuable in many security and forensics applications (Lin et al. (2010); Lee et al. (2011); Brumley et al. (2011)). Given a binary executable, an ideal decompiler generates the high-level program that preserves both the semantics and the functionality of the source code. However, this process is difficult, as data structures and semantics are largely destroyed or obfuscated during compilation. Inspired by remarkable performance in neural machine translation (NMT) tasks (Liu et al. (2019); Vaswani et al. (2017); Dai et al. (2019); Devlin et al. (2018); Dong & Lapata (2016)), recent works (Fu et al. (2019); Katz et al. (2019)) leverage NMT models for neural-based decompilation and achieve promising performance on small code snippets. To make neural-based decompilation useful in practice, many challenges remain: (C1) Current state-of-the-art neural architectures for machine translation, the transformer (Vaswani et al. (2017)) and its variants (Dai et al. (2019); Devlin et al. (2018); Liu et al. (2019)), focus on sequential data (e.g., language), while neural decompilers deal with data with intrinsic structures (e.g., trees/graphs) and long-range dependencies. (C2) The main decompilation task consists of many sub-tasks (e.g., data type recovery, control/data-flow recovery); training one neural network cannot solve them all. (C3) Practical data types (e.g., pointers) are not modeled, and compiling configurations need to be known beforehand (Fu et al. (2019)). (C4) Due to a lack of unification in terms of library usage, variable types, and/or control-flow complexity, simply crawling public repositories does not yield a suitable dataset.
Source code of different styles can be compiled into identical binary code (i.e., "expression collision" or EC), which causes issues when evaluating decompiled code against the original source code. To the best of our knowledge, no code generation toolkit with configurable code complexity exists. In this paper, we present N-Bref, an end-to-end neural-based decompiler framework that learns to translate low-level code back into high-level source code. For (C1), we design a backbone structural transformer by incorporating inductive Graph Neural Networks (GNNs) (Hamilton et al. (2017)) to represent the low-level code (LLC) as control/data-flow dependency graphs and the source code as an Abstract Syntax Tree (AST). To better model long-range correlations in the structural representations, we add a graph neural network after each of the self-attention layers in the transformer. The AST decoder expands the AST of the source code in a tree fashion to better capture the dependency of each predicted node. Also, we adopt memory augmentation (Cornia et al. (2019)) and new tokenization methods to improve the scalability of our neural networks with the growing size of programs. The backbone network learns to iteratively generate the AST of the source code from the structured representation of the assembly. For (C2) and (C3), we decouple decompilation into two sub-tasks: a data type solver (DT-Solver) and a source code generator (SC-Gen), both using the same backbone structural transformer with different parameters. The output of the data type solver is used as the decoder input of the source code generator. For (C4), we design a dataset generator to generate training data and to test and analyze the performance of different design principles across configurable code complexity. Different from conventional dataset generators (Yang et al. (2011); Intel C++ compiler (2017)) used in programming language studies, our generator produces code styles similar to those written by human programmers, has a unified source code representation that avoids EC, has configurable complexity and data types to facilitate factor analysis, and is specifically designed for learning-based methodologies. Extensive experiments show that, on our new metrics, N-Bref outperforms the transformer baseline and a previous neural-based decompiler (Fu et al. (2019)) by 3.5%/6.1% and 5.5%/8.8% in data type recovery and source code generation tasks, respectively. Furthermore, on 5 human-written Leetcode solutions, N-Bref shows 4.1%/6.0% and 6.0%/9.7% margins over the transformer/previous neural decompiler in data type recovery and source code generation, respectively. We also perform a comprehensive study of the design components of neural-based decompilers across different dataset configurations. In summary, this paper makes the following contributions:
- We construct an end-to-end decompilation system by integrating an LLC encoder, an AST encoder, an AST decoder, and a set of novel embedding methods in a holistic manner. Our new architecture bridges the gap between low-level and high-level code by transforming both of them into a graph space.
- We perform a comprehensive analysis of the influence of each neural-based decompiler design component on the overall program recovery accuracy across different dataset configurations. We corroborate the design performance on various generated benchmarks and Leetcode tasks.
- We boost decompilation performance by decomposing the decompilation process into separate tasks: data type recovery and AST generation. In addition, we present corresponding new metrics to evaluate data type recovery and source code generation.
- We develop the first dataset generation tool for neural-based decompiler development and testing. It randomly generates programs with configurable complexity and data types; it also unifies source code representation to prevent "expression collision".

2. PRELIMINARIES OF DECOMPILERS

Decompilation takes an executable file as input and attempts to create high-level source code that is more semantically meaningful and can be compiled back. Figure 1 shows a low-level code snippet disassembled from a stripped binary and the corresponding high-level program. A commonly used low-level code (LLC) is assembly (ASM). An assembly program is a sequence of instructions that can be executed on a particular processor architecture (e.g., MIPS, x86-64). The first token of each instruction is called an "opcode", which specifies the operation to be performed by the instruction. Many instructions in a program operate on processor registers (a small amount of fast storage in the processor) or immediate values to perform arithmetic operations, such as shifting (e.g., shl, shr) or floating-point multiplication (e.g., mulss). Other instructions include branches (e.g., je, jmp) and data movement. Low-level code is structured at two levels: (1) Each instruction has a certain internal structure, depending on the opcode. For example, in Line 8 of Figure 1(b), the first operand is a floating-point value in memory, and mulss multiplies it with the destination register (xmm0) and stores the result back into xmm0. (2) Connections also exist between instructions: (i) branch instructions (e.g., je, jmp) reveal the 'control flow' of the high-level program; (ii) the register that stores the new value of mulss (Line 8) is consumed later as a source register (Line 9); such data movements reveal the 'data flow' of the program. In this paper, we formulate the low-level instructions as a graph using the instruction structure, control flow, and data flow between nodes, as shown in Figure 1(b). A high-level program can be represented by its equivalent abstract syntax tree (AST) (Baxter et al. (1998)) during code generation (Figure 1(a)).
This representation has many advantages over a sequential representation: (i) adjacent nodes are logically closer in the AST than in a sequential representation, (ii) error propagation in sequential expansion can be alleviated by a tree decoder, and (iii) the AST grammar helps prevent erroneous predictions.
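The three kinds of edges described above can be made concrete with a toy construction. The sketch below is our own illustration, not the paper's implementation: the tuple node ids, the '%'-prefixed register syntax, and the branch-target-as-index convention are simplifying assumptions.

```python
# Toy construction of the assembly graph of Sec. 2: nodes are instructions
# ("ins", i) and operands ("opd", i, j); edges follow (i) control flow from
# branches and fall-through, (ii) opcode-to-operand structure, and
# (iii) register data flow (read-after-write).

BRANCH_OPS = {"je", "jne", "jmp"}

def build_asm_graph(instructions):
    """instructions: list of (opcode, [operands]); registers start with '%',
    branch targets are given as instruction indices."""
    edges = set()
    last_use = {}                       # register -> last operand node seen
    for i, (opcode, operands) in enumerate(instructions):
        op_node = ("ins", i)
        if i + 1 < len(instructions):   # (i) fall-through control flow
            edges.add((op_node, ("ins", i + 1)))
        if opcode in BRANCH_OPS and operands:
            edges.add((op_node, ("ins", int(operands[0]))))
        for j, operand in enumerate(operands):
            opd_node = ("opd", i, j)
            edges.add((op_node, opd_node))      # (ii) opcode -> operand
            if operand.startswith("%"):
                if operand in last_use:         # (iii) data-flow edge
                    edges.add((last_use[operand], opd_node))
                last_use[operand] = opd_node
    return edges

asm = [("mov", ["%rax", "%rbx"]),
       ("mulss", ["%xmm0", "%rax"]),
       ("jmp", ["0"])]
g = build_asm_graph(asm)
```

The data-flow rule here simply links consecutive uses of the same register; a real implementation would distinguish reads from writes when deciding which edges to add.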

3. N-BREF OVERVIEW

In this section, we provide an overview of our design components with an illustrative example; Figure 2 shows an example of the prediction procedure. The Backbone Structural Transformer. Our structural transformer has three components: (1) an LLC encoder, (2) an AST encoder, and (3) an AST decoder (detailed in Sec. 5). The LLC encoder takes as input the low-level instructions converted from binaries using a disassembler. The AST encoder takes as input the previous (partial) AST, and the predictions of the AST decoder are AST nodes, which can be equivalently converted to the high-level program. As mentioned earlier, we formulate the input low-level code as graphs and the high-level code as tree structures. As the AST of the data declarations is very distinct from the rest of the code (Figure 1(a)) in the high-level program, we decompose decompilation into two tasks: the data type solver (DT-Solver) and the source code generator (SC-Gen). Both share the backbone structural transformer. Prediction Procedure. Figure 1 shows an example of the code recovery process of N-Bref. The assembly graph and the high-level AST are the inputs of the LLC encoder and the AST encoder, respectively. The input of the AST decoder is the tree path from the root node to the expansion node. Initially, a single root node is fed into the AST encoder/decoder. Once a new node is generated by the decoder in each step, we update the AST and use it as the AST encoder input in the next prediction step. We expand the AST in a breadth-first (BFS) fashion. The AST contains explicit terminal nodes, which are tokens with no children, such as registers, numerics, variable references, and variable types. Non-terminal nodes (e.g., the binary operator '=') must have children; otherwise there is a syntax error. A branch stops expanding when all its leaf nodes are terminal. Note that during training, we apply 'teacher forcing' by attaching the correct node label to the AST encoder input at each step (see Appendix E for the formal algorithm). Cascading DT-Solver and SC-Gen.
As shown in Figure 1, we divide the AST into two parts: (i) the AST of the data types and (ii) the AST of the main code body. The two parts are generated by the DT-Solver and SC-Gen, respectively. This division allows the network to focus on each task individually and resolves more complicated data types. During testing, the DT-Solver first generates the left part of the AST in Figure 1, and the SC-Gen then continues the expansion from this intermediate result. During training, the initial data type input to the SC-Gen is the golden (ground-truth) data types of the program.
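The breadth-first expansion loop above can be sketched as follows. Here `predict` stands in for the neural AST decoder, and the terminal set and dictionary node format are simplifying assumptions for illustration.

```python
from collections import deque

# Sketch of the breadth-first AST expansion in Sec. 3. The neural AST decoder
# is replaced by a stub `predict` returning (label, n_children); terminal
# tokens (registers, numerics, variable refs/types in the paper) have no
# children, so a branch stops expanding once all its leaves are terminal.

TERMINALS = {"num", "var", "type", "reg"}

def expand_ast(predict):
    root = {"label": "?", "children": []}
    frontier = deque([root])              # BFS frontier of unexpanded nodes
    while frontier:
        node = frontier.popleft()
        label, n_children = predict(node)
        node["label"] = label
        if label not in TERMINALS:        # non-terminals must have children
            for _ in range(n_children):
                child = {"label": "?", "children": []}
                node["children"].append(child)
                frontier.append(child)
    return root

# A scripted "decoder" reproducing the AST of `var = num`:
script = iter([("=", 2), ("var", 0), ("num", 0)])
tree = expand_ast(lambda node: next(script))
```

During training, teacher forcing corresponds to ignoring the stub's output and attaching the golden label at each step instead.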

4. METHODOLOGY: DATA GENERATOR

In this section, we detail the data generator designed in N-Bref. We design the data generator so that it has no expression collision (EC). For example, 'unary' operators are converted to 'binary' operators (e.g., i++ becomes i=i+1) and all 'while' loops are transformed into 'for' loops. Experimentally, we observe that our data generator is free of EC and performance improves. EC hurts performance because (1) the same input assembly can be mapped to multiple equivalent outputs; (2) extra high-level semantics result in extra token dimensions; (3) training under EC is both difficult and slow due to label adjustment at runtime. The generator is configurable with multiple hyper-parameters (Table 1), which makes it easy for N-Bref to accommodate different decompilation tasks. It also allows us to analyze which components in the pipeline improve scalability and performance (see Sec. 6). For each data point in the dataset, we sample b_depth^s, b_size^s, and b_num^s from a uniform distribution between 1 and a user-specified maximal value (Table 1). The number of sampled variables (var_num^s) of a program is related to b_num^s and b_depth^s following a Poisson distribution (equations in Appendix D). The generator also takes the libraries lib_in that are pre-defined by the user as potential generated components. If a function call is sampled, the data generator filters out the variables that do not match its input/output types (line 4 of Figure 1).
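The paper's generator canonicalizes C (i++ becomes i=i+1, while becomes for) so that distinct surface forms never compile to the same binary. As an illustration only, the same idea can be applied to Python source with the stdlib `ast` module; the class and function names below are ours, and `ast.unparse` requires Python 3.9+.

```python
import ast

# Canonicalization sketch: rewrite augmented assignments into plain
# binary-operator assignments so that `i += 1` and `i = i + 1` map to one
# unified representation, removing this particular "expression collision".

class Canonicalize(ast.NodeTransformer):
    def visit_AugAssign(self, node):
        # assumes a simple name target, which is enough for this sketch
        return ast.Assign(
            targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
            value=ast.BinOp(left=ast.Name(id=node.target.id, ctx=ast.Load()),
                            op=node.op, right=node.value))

def canonical(src):
    tree = Canonicalize().visit(ast.parse(src))
    return ast.unparse(ast.fix_missing_locations(tree))

# both spellings now map to the same surface form
assert canonical("i += 1") == canonical("i = i + 1") == "i = i + 1"
```

A full generator would apply a family of such rewrites (loops, unary operators, compound expressions) until the representation is unique.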

5. METHODOLOGY: PIPELINE

Here, we present the details of the backbone structural transformer in N-Bref. LLC Encoder. As shown in Figure 1(b), we first formulate the assembly code as a graph by adding the following edges between nodes: (i) between the 'opcode' node of a branch instruction and all its possible next instructions (control flow); (ii) between an instruction's 'opcode' and its 'operands' (i.e., registers / immediate values); (iii) between the same register node of different instructions (register read-after-write), which indicates data dependency. Different from Shi et al. (2019), we do not add redundant pseudo nodes to indicate node positions, because that method is not scalable to long programs due to the exponential growth of the input with pseudo nodes. Instead, during tokenization we directly concatenate all one-hot meta-features of a node, namely the register / instruction type (var_t / ins_t), the position in the instruction (n_pos), the node id (n_id), and the numerical field (n_num). If the token is a number, we represent it in binary format to help the transformer generalize to unseen numerical values (e.g., 12 is represented as n_num = [0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0], a 16-by-1 vector). This representation greatly reduces the length of the transformer input and makes the decompiler more robust for long programs. The tokenized vector of each node (h_0 = [var_t; ins_t; n_pos; n_block; n_id; n_num; n_zeros]^T, h_0 ∈ R^{d×1}) is fed into an embedding graph neural network (GNN), GraphSAGE (Hamilton et al. (2017)), an inductive framework that can generalize representations to unseen graphs. We pad n_zeros to the input vector h_0 to match the GNN output feature size d. Note that, to better represent the data flow in assembly code, we leverage a character embedding for registers (var_t = [c_1; c_2; c_3]). For instance, if the register is $rax, we break it into 'r', 'a', 'x'.
That is because the naming policy of x86-64 (also MIPS) hardware registers indicates their underlying connections: $eax is the first 32 bits of register $rax, and $ax/$ah is the first 16/8 bits of register $eax. After obtaining the assembly graph V and each node's representation h^0, each node v ∈ V aggregates the feature representations of its sampled neighbours:

h^l_{N(v)} = max({σ(W^l h^l_u + b^l), ∀u ∈ N(v)})   (1)

Here, h^l_u represents the hidden state of a neighbour and N(v) represents the set of neighbours of node v. W^l is a trainable d-by-d matrix, b^l is a bias vector, and σ is the sigmoid activation function. We use element-wise max-pooling (max) as the aggregator to collect the states of the neighbours. The aggregation vector is concatenated with the current state of the node h^l_v as the input to a fully-connected layer to obtain the new state:

h^{l+1}_v = W^{l+1}([h^l_v, h^l_{N(v)}])   (2)

Here, W^{l+1} ∈ R^{d×2d} is the trainable embedding matrix and h^{l+1}_v (a d-by-1 vector) is the output of the current GNN layer. [•, •] is the concatenation operation. AST Encoder. The AST encoder encodes the AST from the AST decoder to guide future tree expansions. In this work, we treat the AST as a graph (V) and embed it using a GNN in the same way as the LLC encoder, following Eq. (1)-(2). The input of the GNN includes the tokenized AST node meta-features (n_feat) and a boolean indicating whether the node is expanded in this step (n_expand). The input vector h_v = [n_expand; n_feat] (v ∈ V) is fed into the GNN, and the output (h_v) is added to the positional encoding result:

h'_v = h_v + W_1 h^depth_v + W_2 h^idx_v   (3)

Here h^depth_v and h^idx_v are one-hot vectors of the node's (v) depth in the tree and its position among its parent's children. W_1 and W_2 are trainable embedding matrices. The output hidden states are fed into our self-attention module (Sec. 5.1).
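Eq. (1)-(2) amount to a max-pooling GraphSAGE layer. A minimal NumPy sketch, with untrained random weights and our own variable names:

```python
import numpy as np

# NumPy sketch of the GraphSAGE-style layer of Eq. (1)-(2): each node
# max-pools a sigmoid-transformed view of its neighbours' states (Eq. 1),
# then concatenates the pooled vector with its own state and applies a
# linear map (Eq. 2). In N-Bref the weights are trained; here they are random.

rng = np.random.default_rng(0)
d = 4
W_l  = rng.standard_normal((d, d))        # W^l in Eq. (1)
b_l  = rng.standard_normal(d)
W_l1 = rng.standard_normal((d, 2 * d))    # W^{l+1} in Eq. (2), d x 2d

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sage_layer(h, neighbours):
    """h: (num_nodes, d) node states; neighbours: dict node -> neighbour ids."""
    h_new = np.zeros_like(h)
    for v in range(h.shape[0]):
        # Eq. (1): element-wise max over transformed neighbour states
        h_nv = np.max(sigmoid(h[neighbours[v]] @ W_l.T + b_l), axis=0)
        # Eq. (2): concatenate own state with the aggregate, then project
        h_new[v] = W_l1 @ np.concatenate([h[v], h_nv])
    return h_new

h0 = rng.standard_normal((3, d))
h1 = sage_layer(h0, {0: [1, 2], 1: [0], 2: [0]})
```

Stacking several such layers (three in N-Bref, per Appendix A) lets information propagate across multi-hop neighbourhoods of the assembly graph.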
At the end of the AST encoder, we integrate the AST encoder output (H_ast) and the LLC encoder output (H_llc) using an additional multi-head attention layer with H_llc as the input K and V, and H_ast as Q (Figure 3). The result (H_ast) is used by the networks downstream. AST Decoder. The AST decoder takes the encoding results from the previous stage as input. The querying node of the AST decoder is represented as the path from the root to itself, using the same method proposed in Zhu et al. (2019); Chen et al. (2018a). This method reduces the length of the input to the decoder. The results from the low-level code encoder H_llc and the AST encoder H_ast are integrated into the decoder using two attention layers following Eq. (4), as shown in Figure 3. The output of the AST decoder is mapped into the output space with dimension d_o×1 using another fully-connected layer, where d_o is the number of possible tokens of the high-level code. After the new prediction is generated, we update the AST for the next time step (Figure 2).

5.1. MEMORY AND GRAPH AUGMENTED SELF-ATTENTION MODULE

Decompilation is hardly a word-by-word translation like natural language. Each individual token in the low-level code or the AST must be interpreted in the context of its surrounding semantics. To capture prior knowledge of programming structures and to emphasize node connections after embedding, we leverage two additional modules in the transformer attention: (i) memory augmentation and (ii) graph augmentation. The formal descriptions are given below. Memory augmentation. We propose a memory-augmented attention layer similar to the method in Cornia et al. (2019). The prior information is trained and does not depend on the input data. The traditional transformer's building block is the self-attention layer, which takes three sets of vectors (queries Q, keys K, and values V) as input. The layer first computes the similarity distribution between Q and K and uses the resulting probabilities to compute a weighted sum of V (equations in Vaswani et al. (2017)). In N-Bref, we add two trainable matrices for each head as an extra input to the transformer for memory augmentation, and the computation is adjusted to:

H = MultiHead(Q, K, V) = Concat(head_1, ..., head_t) W^O   (4)
head_i = Attention(Q', K', V') = softmax(Q' K'^T / √d) V'
where Q' = Q W_qi, K' = [K W_ki, M_ki], V' = [V W_vi, M_vi]

Here, t is the number of parallel attention heads; (W_qi, W_ki, W_vi) are trainable matrices with dimension d × d/t, and W^O has dimension d × d. M_vi, M_ki ∈ R^{d_m×d}, where d_m controls the number of memory slots. Note that we remove the positional embedding of the original transformer in the LLC encoder, as the position information is already integrated into the GNN embedding. Graph augmentation. We propose to emphasize the connections of assembly nodes after the attention layer. The output of multi-head attention layer s can be viewed as a matrix H_s = [h_{s,0}, h_{s,1}, ..., h_{s,N}], H_s ∈ R^{d×N}, where N is the number of nodes in the assembly graph.
We first convert H_s back to a graph structure using the edge information of the assembly code. The new graph with connected hidden states is the input to an additional GNN layer, following Eq. (1)-(2), after each self-attention layer.
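The memory augmentation of Eq. (4) can be sketched for a single head; the single-head simplification and variable names below are ours, while the shapes follow the text (d is the model width, d_m the number of memory slots).

```python
import numpy as np

# NumPy sketch of memory-augmented attention (Eq. 4, one head): trainable
# memory slots M_k, M_v are appended to the projected keys and values, so
# every query can also attend to learned "prior" slots that are independent
# of the input sequence.

rng = np.random.default_rng(1)
d, d_m, n = 8, 4, 5                           # width, memory slots, seq length

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memory_attention(Q, K, V, W_q, W_k, W_v, M_k, M_v):
    Qp = Q @ W_q                              # Q' = Q W_q
    Kp = np.concatenate([K @ W_k, M_k])       # K' = [K W_k, M_k]
    Vp = np.concatenate([V @ W_v, M_v])       # V' = [V W_v, M_v]
    A = softmax(Qp @ Kp.T / np.sqrt(d))       # attends over inputs AND memory
    return A @ Vp

X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
M_k, M_v = rng.standard_normal((d_m, d)), rng.standard_normal((d_m, d))
H = memory_attention(X, X, X, W_q, W_k, W_v, M_k, M_v)
```

The output keeps the input sequence length n: the d_m memory slots only enlarge the set of keys/values each query can attend to, not the number of queries.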

6. EVALUATION

6.1. EXPERIMENTAL SETUP

We assess the performance of N-Bref on various benchmarks generated by our dataset generator at various difficulty levels, and on Leetcode solutions (Problems (2017); details in Appendix B), as shown in Table 2. Each program is compiled into a binary, which is disassembled using the GNU binary utilities to obtain the assembly code. To generate the AST for C code, we implement the parser using the clang compiler's Python bindings. Our dataset generator is built on csmith (Yang et al. (2011)), a code generation tool for compiler debugging. The original tool is not suitable for our purposes, so N-Bref modifies most of the implementation for decompilation tasks. For the neural network settings discussed in Figure 3, we choose [N_1, N_2, N_3, t] = [3, 2, 2, 8] (t is the number of heads used in N-Bref), an embedding dimensionality d = 256, and d_m = 128 memory slots. The training/evaluation is implemented using PyTorch 1.4 and the DGL library (Wang et al. (2019)). Details about the training hyper-parameters and settings are included in Appendix A. Complexity arguments and benchmarks. This section describes the tasks in our evaluation. We randomly generate 25,000 pairs of high-level programs and the corresponding assembly code for each task, used for network training (60%), validation (20%), and evaluation (20%). We mainly focus on tuning b_depth and b_num (see Table 1). Metrics. We evaluate the performance of N-Bref using token accuracy. For the evaluation of SC-Gen, we expand the decompiled AST from the root to match the structure of the golden AST (AST_G). The token accuracy is calculated as:

acc = num(AST = AST_G) / num(AST_G)   (7)

We also show evaluation results using graph edit distance (Sanfeliu & Fu (1983)), without enforcing the match between the decompiled AST and AST_G during graph generation, in Appendix F. The metric is fair for evaluating decompilation tasks because we remove all ECs. Eq. (7) can also evaluate sequence outputs by treating a sequence as a tree.
For the DT-Solver, we have two metrics: (i) macro accuracy (Acc_mac) and (ii) micro accuracy (Acc_mic). Acc_mac treats unsigned and signed variables as the same, because unsigned and signed variables show no difference in assembly code except through hints from type-revealing functions (for example, the return value of strlen() must be an unsigned integer). Note that we do not recover numerical values that are directly assigned to a variable; we replace them with a token 'num'. These values exist in the stack memory or assembly instructions and can be resolved after the AST is correctly recovered using additional analysis methods. As future work, one could leverage a pointer network (See et al., 2017) to directly copy numerical values to the output.
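The metrics above can be sketched as follows. The node alignment is simplified to comparing labels in a shared traversal order, and the sign-stripping rule is our own minimal reading of the macro/micro distinction.

```python
# Sketch of the evaluation metrics of Sec. 6.1: token accuracy (Eq. 7)
# compares decompiled AST node labels against the golden AST labels in the
# same (BFS) order; macro accuracy for the DT-Solver additionally treats
# signed and unsigned variants of a type as equal.

def token_accuracy(pred_tokens, gold_tokens):
    # positions past the end of the prediction count as wrong
    correct = sum(p == g for p, g in zip(pred_tokens, gold_tokens))
    return correct / len(gold_tokens)

def strip_sign(t):
    return t.replace("unsigned ", "").replace("signed ", "")

def macro_type_accuracy(pred_types, gold_types):
    correct = sum(strip_sign(p) == strip_sign(g)
                  for p, g in zip(pred_types, gold_types))
    return correct / len(gold_types)

pred = ["int", "unsigned int", "float*"]
gold = ["int", "int", "float*"]
assert macro_type_accuracy(pred, gold) == 1.0    # sign difference forgiven
assert token_accuracy(pred, gold) == 2 / 3       # exact (micro-style) match
```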

6.2. RESULTS

Performance impact of each design component. To study the effectiveness of potential design principles, we perform a thorough sensitivity study, shown in Figure 4. For SC-Gen, we observe that as code length and complexity grow, each component of N-Bref preserves more performance than the baseline transformer. The transformer with the LLC encoder shows the least performance degradation as complexity and length increase (as shown by the slope). This is because the GNN in N-Bref connects instructions that are distant from each other, alleviating the performance drop. Expanding the source code into a tree structure (AST encoder + decoder) also mitigates the accuracy degradation. For the DT-Solver, increasing b_size improves performance, because short programs do not contain enough semantics to identify variable types. Performance also declines when the program complexity (b_depth) increases; we assume this is because intertwined control flow complicates the analysis of data flow. The traditional decompiler REWARD shows a large performance drop along the b_depth axis, because the dynamic analysis in REWARD follows a single control path and thus performs poorly on complex control flow. We also tested many other design options, but empirically they could not achieve better scalability or performance than N-Bref. Comparison to previous works. N-Bref yields the highest token accuracy across all benchmarks (91.1% on average), as shown in Table 2, in both data type recovery and AST generation. N-Bref achieves 5.5% and 8.8% margins over the transformer baseline and Ins2AST, a previous neural-based program decompiler (Fu et al. (2019)). The encoder in Ins2AST leverages an N-ary Tree-LSTM, which cannot collect information from instructions that are far apart but logically adjacent. Our LLC encoder, on the other hand, leverages GNN embedding and graph augmentation to emphasize node connections.
We do not show the results of traditional decompilers (RetDec (2017); Hex-Rays (2017)) as they do not preserve any semantics and achieve very low token accuracy under Eq. (7) (examples of traditional decompilation results are shown in Fu et al. (2019)). For type recovery, N-Bref also achieves 3.55% / 6.1% / 30.3% average margins over the transformer, Ins2AST, and REWARD, respectively. The traditional decompiler REWARD leverages type-revealing instructions and does not consider other low-level representations. Moreover, REWARD focuses on a single program path executed during dynamic analysis; as such, it cannot handle control flow properly. N-Bref uses static analysis and considers all paths during execution.

Under review as a conference paper at ICLR 2021

For the baseline transformer and N-Bref, we also present Acc_mic in parentheses in Table 2. The gap between Acc_mac and Acc_mic shrinks as programs get longer (from 28.5% to 20.2% for <math.h> using N-Bref). This is because longer programs have more type-revealing instructions/functions, which help the network identify data types ('unsigned/signed') correctly. Also, N-Bref shows higher tolerance to growth in complexity and length compared to other works. For lib_in = <string.h>, the token accuracy drops by 6.1% from the easiest to the hardest setting, compared to 9.5% / 8.2% accuracy drops for the baseline transformer and Ins2AST, respectively. This is because the GNN can gather information from adjacent nodes in large assembly graphs. Also, the AST decoder effectively prevents error propagation through tree expansion as the source code grows larger, unlike sequence generation, where early prediction errors affect later nodes. We also select 5 Leetcode (Problems (2017)) solutions in C as an auxiliary test dataset and train a new model with the complexity of (b_depth = 2, b_size = 4) using the <string.h> library.
The results show that N-Bref is able to recover real benchmarks, achieving 6% / 9.7% margins over the transformer and the previous neural-based decompiler. This means N-Bref generalizes to human-written code, and the complexity of the datasets we generate can cover real-world applications. Ablation Study. Table 3 shows ablation studies of the techniques in N-Bref. Graph augmentation in the LLC and AST encoders (1st column) increases accuracy by 1.1%. Depth and child-index positional encoding improve performance by 0.53%; when replacing our method with the positional encoding of Ahmad et al. (2020), accuracy drops by 0.23%. The 'node representation' refers to the character embedding of assembly registers and the concatenation of meta-features (details in Sec. 5); removing these techniques leads to a 1.8% accuracy drop on average. Memory augmentation helps capture prior knowledge of the code structure, and removing it causes a 1.7% performance drop. Also, splitting the decompilation task into two parts yields a 2.5% improvement in accuracy.

7. RELATED WORK

Data Type Recovery. Traditional approaches such as REWARD (Lin et al. (2010)) leveraged type-revealing operations or function calls as hints to infer variable types. These methods incur accuracy drops when there are not enough type-revealing semantics. N-Bref proposes a learning method that collects more fine-grained information and achieves better accuracy in type inference. For control/data-flow recovery, Fu et al. (2019); Katz et al. (2019) propose neural network methods for decompilation. However, both works are based on sequence-to-sequence neural networks and are tested on simple programs. N-Bref leverages a structural transformer-based model that achieves better results.


Neural Networks for Code Generation. Neural networks have been used for code generation in prior works (Ling et al. (2016); Yin & Neubig (2017); Rabinovich et al. (2017); Yin & Neubig (2018)). These prior efforts differ from the decompilation task, as their inputs are input-output pairs (Chen et al. (2019; 2017)), descriptions of code usage (Zhu et al. (2019)), or other domain-specific languages (Nguyen et al. (2013)). The abstract syntax tree (AST) was used in several recent works (Chen et al. (2018b); Yin & Neubig (2018)), yet most of them leverage Tree-LSTMs (Tai et al. (2015); Dong & Lapata (2018)) or convolutional neural networks (Chen et al. (2018a)). N-Bref demonstrates the effectiveness of the transformer in a decompilation framework. Neural Networks for Binary Analysis. There is a significant body of work on binary analysis using neural networks, such as predicting execution throughput (Mendis et al. (2018)), guiding branch predictions (Shi et al. (2019)), program analysis (Ben-Nun et al. (2018)), and verification (Li et al. (2015)). Most of these works use RNNs to encode binary or assembly code (Mendis et al. (2018); Ben-Nun et al. (2018)). Shin et al. (2015) propose to use an RNN to identify function entry points in binaries. GNNs were used in some of these works to encode memory heap information (Li et al. (2015)) or assembly code (Shi et al. (2019)), but the original representation methods are not scalable, as they add many pseudo nodes, and the node representation is not suitable for the transformer. He et al. (2018); Lacomis et al. (2019) use naive NMT models to predict debug information and to assign meaningful names to variables from binaries, yet they do not leverage structural programming information. Many designs in N-Bref are easy to integrate into various neural-based binary analysis tasks whose input is also low-level code.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we present N-Bref, an end-to-end framework that customizes the design of a neural-based decompiler. N-Bref provides a dataset generator that removes expression collisions and generates random programs with any complexity configuration. N-Bref disentangles decompilation into two parts, data type recovery and AST generation, and incorporates a new architecture that facilitates structural translation tasks by integrating a structural encoder/decoder into the transformer. New embedding/representation techniques further improve accuracy. Experiments show that N-Bref outperforms previous decompilers and the state-of-the-art transformer. Meanwhile, we observe that many challenges remain; for example: (i) reverse engineering binaries that have been optimized or obfuscated is still challenging; (ii) directly recovering numerical values from assembly code requires more effort. We leave these more challenging problem setups as future work.

A TRAINING SETUP AND HYPER-PARAMETERS

We ran our experiments on Amazon EC2 using p3.16xlarge instances, which contain Nvidia Tesla V100 GPUs with 16GB memory each. Hyper-parameters for Lang2logic and Ins2AST are selected using cross-validation with grid search. We present the hyper-parameters used by the different neural networks in Table 4. The number of GNN layers in N-Bref is set to 3. We use an Adam optimizer with β1 = 0.9, β2 = 0.98, the setting used in the original Transformer (Vaswani et al. (2017)). We add label smoothing and a dropout rate of 0.3. Weights of attentive layers are initialized from a uniform distribution, while weights of feed-forward layers are initialized using the same techniques as Vaswani et al. (2017). Other scheduling methods (e.g., learning rate, warm-up steps) are the same as in Cornia et al. (2019) for training N-Bref and the transformer baseline.

P(v) = λ^v e^(-λ) / v!,   v = 0, 1, 2, 3, . . .   (8)

E FORMAL ALGORITHM FOR PREDICTIONS

F PERFORMANCE IN GRAPH EDIT DISTANCE

We test the performance of N-Bref using the graph edit distance (GED), calculated as in Eq. 9. The distance is the minimum number of operations (i.e., node substitutions and node insertions) needed to change our output AST (AST) into the golden AST (AST_G):

GED(AST, AST_G) = min_{e_1, ..., e_k} Σ_{i=1}^{k} Cost(e_i).   (9)

Here, e_i denotes the i-th operation changing AST into AST_G. In our testing, we set Cost(e) = 1. The maximum possible GED between an AST and AST_G is the number of nodes in AST_G. Note that when e_i substitutes a node from a non-terminal to a terminal type, the branch under the original node is automatically deleted. The tree expansion algorithm used to generate the AST is shown above in Appendix E. Table 5 shows the GED of N-Bref and the transformer baseline; N-Bref achieves a 40.4% reduction in graph edit distance on average compared to the traditional transformer.
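As a concrete illustration, the unit-cost edit distance over ASTs described above can be sketched as follows. This is a simplified version that aligns children by position (not the exact matching algorithm used in the paper), and the nested-tuple tree representation is ours:

```python
def tree_size(t):
    """Number of nodes in a tree given as (label, [children])."""
    return 1 + sum(tree_size(c) for c in t[1])

def tree_edit_cost(pred, gold):
    """Unit-cost edits (substitution, insertion, deletion) needed to
    turn the predicted tree into the golden tree, aligning children
    by position."""
    if pred is None:
        return tree_size(gold)   # insert the whole missing branch
    if gold is None:
        return tree_size(pred)   # delete the extra branch
    cost = 0 if pred[0] == gold[0] else 1   # node substitution
    pc, gc = pred[1], gold[1]
    for i in range(max(len(pc), len(gc))):
        cost += tree_edit_cost(pc[i] if i < len(pc) else None,
                               gc[i] if i < len(gc) else None)
    return cost
```

Note that, as in the normalization above, the cost against an empty prediction equals the number of nodes in the golden AST.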

G PERFORMANCE OF N-BREF IN OTHER BINARY ANALYSIS TASKS

N-Bref's structural transformer architecture and low-level code encodings/representations are easy to integrate with various neural-based binary analysis tasks whose input is also low-level code, enabling advances in tasks such as vulnerability detection, crypto algorithm classification, and malware detection. We tried two tasks using N-Bref's encoder and low-level representation methods to analyze binary code: (i) Identifying binary code vulnerabilities (Table 6). We test the performance of N-Bref on vulnerability detection using the Devign dataset, which includes 25872 data points collected from commit differences of the FFmpeg and QEMU repositories. Using the commit ids given in the Devign dataset (Zhou et al. (2019)), we clone the old repository, compile it, and extract the binary of the function from the compiled project. We successfully generate 10302 binaries (out of the 25872 given data points), as many project commits in the dataset fail to compile. (ii) Measuring binary code similarity (Table 7). We test N-Bref on the POJ-104 tasks (Mou et al. (2014)) by compiling them into binaries, and use the same metrics as MISIM (Ye et al. (2020)) to evaluate the performance of N-Bref. On the vulnerability detection task, N-Bref outperforms the transformer baseline on binaries by a 3.0% margin and a BiLSTM-based vulnerability detector (Li et al. (2018)) operating on high-level source code by a 4.08% margin, using the same amount of data. For code similarity measures, N-Bref achieves a 3.85% MAP@R increase over the transformer baseline and shows 5.0%/20.16% better MAP@R than Aroma (Luan et al. (2019)) and NCC (neural code comprehension, Ben-Nun et al. (2018)), code searching frameworks that operate on high-level code. Note that binaries are more abstract and more difficult to analyze than high-level code.

H COMPLETE ASSEMBLY CODE FOR FIGURE 1

    #include <string.h>
    char *foo(float l0, int *l1) {
        char l2[4];
        short l3 = 2;
        float l4 = 0x9p+1;
        if (strchr(l2, l1[0])) {
            l0 = l4 * l0;
        } else {
            l2[l3] = 7;
        }
        return l2;
    }

    foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $48, %rsp
        movss   %xmm0, -36(%rbp)
        movq    %rdi, -48(%rbp)
        movq    %fs:40, %rax
        movq    %rax, -8(%rbp)
        xorl    %eax, %eax
        movw    $2, -18(%rbp)
        movss   .LC0(%rip), %xmm0
        movss   %xmm0, -16(%rbp)
        movq    -48(%rbp), %rax
        movl    (%rax), %edx
        leaq    -12(%rbp), %rax
        movl    %edx, %esi
        movq    %rax, %rdi
        call    strchr@PLT
        testq   %rax, %rax
        je      .L2
        movss   -36(%rbp), %xmm0
        mulss   -16(%rbp), %xmm0
        movss   %xmm0, -36(%rbp)
        jmp     .L3
    .L2:
        movswl  -18(%rbp), %eax
        movb    $7, -12(%rbp,%rax)
    .L3:
        movl    $0, %eax
        movq    -8(%rbp), %rcx
        xorq    %fs:40, %rcx
        je      .L5
    .L5:
        leave

I COMPLETE ASSEMBLY CODE GRAPH FOR FIGURE 1




1 N-Bref is the abbreviation for "neural-based binary reverse engineering framework".
2 Complete assembly code and graph are shown in Appendix H & I.
* result from MISIM (Ye et al. (2020)).



Figure 1: An example of (a) source code to AST conversion. Note that the pseudo 'statement' or 'stmt' nodes are added by the compiler during AST conversion. (b) assembly (x86-64) to graph conversion 2.

Figure 2: An example of prediction procedures in N-Bref pipeline. AST is expanded in a breadth-first manner.

(a)). In summary, b_depth and E_c control the difficulty of control/data flow, while b_size and b_num control the length of the code. For example, the code snippet in Figure 1(a) has a configuration of [E_c, b_depth, b_size, b_num] = [2, 1, 1, 1]. Note that in N-Bref, the program is compiled using gcc with no optimizations. Previous works (Brumley et al. (2013); Lee et al. (2011); Lin et al. (2010)) also disabled optimizations, because many optimizations change variable types and rewrite source code (e.g., loop-invariant optimization), which would make accuracy evaluations unfair.
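For illustration, the complexity configuration above can be captured by a small container. This is a sketch only; the class and field spellings are ours, not N-Bref's actual API:

```python
from dataclasses import dataclass

@dataclass
class GenConfig:
    """Complexity knobs for random program generation.
    E_c and b_depth govern control/data-flow difficulty;
    b_size and b_num govern the length of the generated code."""
    E_c: int
    b_depth: int
    b_size: int
    b_num: int

# the configuration of the code snippet in Figure 1(a)
fig1_config = GenConfig(E_c=2, b_depth=1, b_size=1, b_num=1)
```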

Figure 3: The backbone neural architecture design for DT-Solver and SC-Gen. N1,N2,N3 indicates the number of times to repeat the block in the architectures.

1). We set E_c = 3, b_size = 3, bias = 2 and test lib_in with different complexities: (i) <math.h> and (ii) <string.h>. Function recursion is allowed for code generation. Other than function calls, normal expressions ("+, -, *, /, &, ==, ^", etc.) are also possible operators during code generation. (Code examples in Appendix C.)
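A toy sketch of how such expressions might be generated under a depth bound follows. This is purely illustrative and is not N-Bref's actual generator; the operator list comes from the text above, and the leaf names and probabilities are ours:

```python
import random

OPS = ["+", "-", "*", "/", "&", "==", "^"]

def gen_expr(depth, rng):
    """Recursively build a random binary-expression string,
    stopping at `depth` levels (or earlier, at random)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["l0", "l1", "l2", "3", "7"])  # leaf: variable or constant
    op = rng.choice(OPS)
    return f"({gen_expr(depth - 1, rng)} {op} {gen_expr(depth - 1, rng)})"

expr = gen_expr(3, random.Random(0))
```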

Figure 4: Sensitivity analysis of each design component over dataset complexity for <math.h>. Each sample is trained for 10 epochs for simplicity. 'Trans' refers to the baseline transformer.

Program Decompilation. There has been a long line of research on reverse engineering of binary code (Cifuentes (1994); Emmerik & Waddington (2004); Brumley et al. (2011); Bao et al. (2014); Rosenblum et al. (2008); Yakdan et al. (2016)). Commercial decompilers (Hex-Rays (2017); RetDec (2017)) do not preserve the semantics of the code, so their recovered code is very distinct from the source code (see examples in Fu et al. (2019)). For type recovery, traditional methods (Lee et al. (2011); Lin et al. (

Figure 5: A complete assembly graph of Figure 1. Note that eax is the lower 32 bits of rax, so it has a data dependency with rax in lines 12-13. Also, when an instruction has no destination register (e.g., line 9), the destination is the memory location represented by the address register (e.g., rbp) and offset (e.g., -36 in line 9).

Hyper-parameters in data generation

Accuracy (%) comparison between N-Bref and alternative methods in (a) type recovery and (b) AST generation. Ins2AST is a previous neural-based program decompiler (code sketch generation stage only). REWARD (Lin et al. (2010)) is a traditional framework for type recovery. Lang2logic (Dong & Lapata (2016)) is a sequence-to-tree translator. The baseline is the transformer.

Ablation study of N-Bref on AST generation. '-ensemble' refers to disabling the separation of data

Hyper-parameters chosen for each neural network model.

D EQUATIONS OF POISSON DISTRIBUTION FOR VARIABLE NUMBERS

The variable number (var_num or v) generated for a program follows a Poisson distribution (Eq. 8), where λ = b^s_num + b^s_depth + bias, as discussed in the Evaluation section.
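A minimal way to draw the variable count from this distribution is a pure-Python sampler using Knuth's multiplication method (the function name is ours; the paper does not specify its sampling routine):

```python
import math
import random

def sample_var_num(lam, rng=random):
    """Draw v ~ Poisson(lam) via Knuth's method: multiply uniform
    draws until the product falls below e^(-lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1
```

For example, with λ = b^s_num + b^s_depth + bias = 5, the sampled counts average about 5 variables per program.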

Algorithm 1 Algorithm for N-Bref prediction.
INPUT: Assembly graph G_asm; root node γ; terminal node types T; LLC encoder, AST encoder, and AST decoder (LLC_en, AST_en, AST_de); N-Bref model (Model).
OUTPUT: Complete AST G_ast.
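The breadth-first expansion at the heart of Algorithm 1 can be sketched as below. The neural decoder is stubbed out by a callable; names such as `predict_children` and the toy terminal set are ours:

```python
from collections import deque

TERMINALS = {"const", "var"}  # hypothetical terminal node types T

def expand_ast(root_label, predict_children):
    """Grow an AST breadth-first from the root: each dequeued
    non-terminal node is handed to the model, which returns the
    labels of its children; terminal nodes are not expanded."""
    root = {"label": root_label, "children": []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node["label"] in TERMINALS:
            continue
        for child_label in predict_children(node["label"]):
            child = {"label": child_label, "children": []}
            node["children"].append(child)
            queue.append(child)
    return root

# toy "model": a fixed grammar table instead of the neural decoder
toy_rules = {"stmt": ["assign"], "assign": ["var", "const"]}
ast = expand_ast("stmt", lambda lbl: toy_rules.get(lbl, []))
```

In N-Bref itself, `predict_children` would query the AST decoder conditioned on the assembly graph encoding.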

Comparison between N-Bref and baseline transformer using graph edit distance across datasets.

Performance on code vulnerability detection. * result from our implementation of VulDeePecker (Li et al. (2018)), tested on the Devign dataset.

Performance of code similarity accuracy.

B LEETCODE SOLUTIONS EXAMPLES

We present examples of the tested Leetcode solutions in Figures 1 and 2. The tested tasks include "Isomorphic Strings", "Multiply Strings", "Longest Palindromic Substring", "Implement strStr()", and "ZigZag Conversion". Many easy problems are too short (e.g., "Length of the Last Word") to meaningfully evaluate the performance of N-Bref, and some of them use user-defined functions, which are beyond the scope of N-Bref.

[The solution listings for "Isomorphic Strings" (isIsomorphic) and "Multiply Strings" (multiply), with example inputs/outputs, appear in the corresponding figures; the two-column listings are omitted here.]

C EXAMPLES OF N-BREF GENERATED PROGRAMS

We present the dataset examples in Figures 3 and 4. We define char, short, int, and long as 'int8_t', 'int16_t', 'int32_t', and 'int64_t' to simplify the tokenizing process (64-bit machine). 'uint' refers to the 'unsigned' type.

[Two generated example programs, one using <math.h> and one using <string.h>, appear in the corresponding figures; the two-column listings are omitted here.]

