N-BREF: A HIGH-FIDELITY DECOMPILER EXPLOITING PROGRAMMING STRUCTURES

Abstract

Binary decompilation is a powerful technique for analyzing and understanding software when source code is unavailable, and it is a critical problem in the computer security domain. With the success of neural machine translation (NMT), recent efforts on neural-based decompilers show promising results compared to traditional approaches. However, several key challenges remain: (i) Prior neural-based decompilers focus on simplified programs without considering sophisticated yet widely-used data types such as pointers; furthermore, many high-level expressions map to the same low-level code ("expression collision"), which incurs critical degradation of decompilation performance. (ii) State-of-the-art NMT models (e.g., the transformer and its variants) mainly deal with sequential data; this is inefficient for decompilation, where the input and output data are highly structured. In this paper, we propose N-Bref 1, a new framework for neural decompilers that addresses the two aforementioned challenges with two key design principles: (i) N-Bref designs a structural transformer with three key components for better comprehension of structural data (an assembly encoder, an abstract syntax tree encoder, and a tree decoder), extending transformer models in the context of decompilation. (ii) N-Bref introduces a program generation tool that can control the complexity of code generation and removes expression collisions. Extensive experiments demonstrate that N-Bref outperforms previous neural-based decompilers by a margin of 6.1%/8.8% accuracy in data type recovery and source code generation. In particular, N-Bref decompiles human-written Leetcode programs with complex library calls and data types with high accuracy.

1. INTRODUCTION

Decompilation, the process of recovering source code from a binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. For example, decompilation is highly valuable in many security and forensics applications (Lin et al. (2010); Lee et al. (2011); Brumley et al. (2011)). Given a binary executable, an ideal decompiler generates a high-level program that preserves both the semantics and the functionality of the source code. However, this process is difficult, as data structures and semantics are largely destroyed or obfuscated during compilation. Inspired by the remarkable performance of neural machine translation (NMT) (Liu et al. (2019); Vaswani et al. (2017); Dai et al. (2019); Devlin et al. (2018); Dong & Lapata (2016)), recent works (Fu et al. (2019); Katz et al. (2019)) leverage NMT models for neural-based decompilation and achieve promising performance on small code snippets. To make neural-based decompilation useful in practice, many challenges remain: (C1) Current state-of-the-art neural architectures for machine translation, i.e., the transformer (Vaswani et al. (2017)) and its variants (Dai et al. (2019); Devlin et al. (2018); Liu et al. (2019)), focus on sequential data (e.g., language), while neural decompilers deal with data with intrinsic structure (e.g., trees/graphs) and long-range dependencies. (C2) The main decompilation task consists of many sub-tasks (e.g., data type recovery, control/dataflow recovery); training one neural network cannot solve them all. (C3) Practical data types (e.g., pointers) are not modeled, and compiling configurations need to be known beforehand (Fu et al. (2019)). (C4) Due to a lack of unification in terms of library usage, variable types, and control-flow complexity, simply crawling public repositories does not work well. Source code of different styles can be compiled into identical binary code (i.e., "expression collision" or EC), which causes issues when evaluating decompiled code against the original source code. To the best of our knowledge, no code generation toolkit with configurable code complexity exists.

In this paper, we present N-Bref, an end-to-end neural-based decompiler framework that learns to translate assembly code back into high-level source code. For (C1), we design a backbone structural transformer that incorporates inductive Graph Neural Networks (GNNs) (Hamilton et al. (2017)) to represent the low-level code (LLC) as control/dataflow dependency graphs and the source code as an Abstract Syntax Tree (AST). To better model long-range correlations in the structural representations, we add a graph neural network after each of the self-attention layers in the transformer. The AST decoder expands the AST of the source code in a tree fashion to better capture the dependency of each predicted node. We also adopt memory augmentation (Cornia et al. (2019)) and new tokenization methods to improve the scalability of our neural networks with the growing size of programs. The backbone network learns to iteratively generate the AST of the source code from the structured representation of the assembly. For (C2) and (C3), we decouple decompilation into two sub-tasks: a data type solver (DT-Solver) and a source code generator (SC-Gen), both using the same backbone structural transformer with different parameters. The output of the data type solver is used as the decoder input of source code generation. For (C4), we design a dataset generator to produce training data and to test and analyze the performance of different design principles across configurable code complexity. Different from conventional dataset generators (Yang et al. (2011); IntelC++compiler (2017)) used in programming language studies, our generator produces code styles similar to those written by human programmers, has a unified source code representation that avoids EC, has configurable complexity and data types to facilitate factor analysis, and is specifically designed for learning-based methodologies. Extensive experiments show that, on our new metrics, N-Bref outperforms the transformer baseline and the previous neural-based decompiler (Fu et al. (2019)) by 3.5%/6.1% and 5.5%/8.8% in data type recovery and source code generation tasks, respectively.
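As a rough illustration of the "graph neural network after each self-attention layer" idea, the sketch below interleaves a single-head attention step with a GraphSAGE-style mean aggregation over a toy dependency graph. This is a minimal NumPy sketch, not the paper's actual architecture: the dimensions, random weights, and edge list are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over node embeddings X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V

def graph_layer(H, adj, W):
    """GraphSAGE-style step: mix each node with the mean of its graph neighbors."""
    deg = np.maximum(adj.sum(axis=-1, keepdims=True), 1)
    neigh = (adj @ H) / deg                       # mean over AST/dataflow edges
    return np.tanh(np.concatenate([H, neigh], axis=-1) @ W)

n, d = 5, 8                                       # toy graph: 5 nodes, 8-dim embeddings
X = rng.normal(size=(n, d))
adj = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (1, 3), (3, 4)]:     # invented control/dataflow edges
    adj[u, v] = adj[v, u] = 1

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wg = rng.normal(size=(2 * d, d))

H = self_attention(X, Wq, Wk, Wv)                 # global token mixing
H = graph_layer(H, adj, Wg)                       # structure-aware local mixing
assert H.shape == (n, d)
```

The attention step lets every node attend to every other node, while the graph step re-injects the explicit edge structure that a plain transformer discards.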
Furthermore, on 5 human-written Leetcode solutions, N-Bref shows 4.1%/6.0% and 6.0%/9.7% margins over the transformer/previous neural decompiler in data type recovery and source code generation, respectively. We also perform a comprehensive study of the design components of neural-based decompilers across different dataset configurations. In summary, this paper makes the following contributions:

- We construct an end-to-end decompilation system by integrating an LLC encoder, an AST encoder, an AST decoder, and a set of novel embedding methods in a holistic manner. Our new architecture bridges the gap between low-level and high-level code by transforming both into a graph space.
- We perform a comprehensive analysis of the influence of each neural-based decompiler design component on the overall program recovery accuracy across different dataset configurations. We corroborate the design performance on various generated benchmarks and Leetcode tasks.
- We boost decompilation performance by decomposing the decompilation process into separate tasks: data type recovery and AST generation. In addition, we present corresponding new metrics to evaluate data type recovery and source code generation.
- We develop the first dataset generation tool for neural-based decompiler development and testing. It randomly generates programs with configurable complexity and data types; it also unifies source code representation to prevent "expression collision".
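To make "expression collision" concrete: distinct surface forms such as `x += 1` and `x = x + 1` compile to identical binary code, so a decompiler cannot (and need not) distinguish them, and a unified source representation removes the ambiguity. The toy sketch below uses Python's own `ast` module purely as a stand-in for the paper's C-like representation; the `Canonicalizer` pass is hypothetical and handles only simple augmented assignments.

```python
import ast

class Canonicalizer(ast.NodeTransformer):
    """Rewrite augmented assignments (x += 1) into plain ones (x = x + 1).

    Illustrative only: assumes the target is a simple variable name.
    """
    def visit_AugAssign(self, node):
        return ast.copy_location(
            ast.Assign(
                targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
                value=ast.BinOp(
                    left=ast.Name(id=node.target.id, ctx=ast.Load()),
                    op=node.op,
                    right=node.value,
                ),
            ),
            node,
        )

def canonical_form(src: str) -> str:
    """Parse source and dump a canonicalized AST as a comparable string."""
    tree = Canonicalizer().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return ast.dump(tree)

# Two surface forms that a compiler lowers identically collapse to one AST:
assert canonical_form("x += 1") == canonical_form("x = x + 1")
```

After such a pass, token- or AST-level comparison of decompiler output against the (canonicalized) original source no longer penalizes the model for picking an equivalent surface form.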

2. PRELIMINARIES OF DECOMPILERS

Decompilation takes an executable file as input and attempts to create high-level source code that is more semantically meaningful and can be compiled back. Figure 1 shows a low-level code snippet disassembled from a stripped binary and the corresponding high-level program. A commonly used form of low-level code (LLC) is assembly (ASM). An assembly program is a sequence of instructions that can be executed on a particular processor architecture (e.g., MIPS, x86). The first token of each instruction is called the "opcode", which specifies the operation to be performed by the instruction. Many instructions in a program operate on processor registers (a small amount of fast storage in the processor) or immediate values to perform arithmetic operations, such as shifts (e.g., shl, shr), floating-point multiplication (e.g., mulss), etc. Other instructions include (1)
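A decompiler front-end must first tokenize such instructions into opcode and operands before any further analysis. The following is a minimal, hypothetical sketch (not part of N-Bref) for the common `opcode dst, src` assembly syntax:

```python
def parse_instruction(line: str):
    """Split one assembly line into (opcode, operand list).

    Returns None for blank or comment-only lines. Assumes the common
    'opcode op1, op2' syntax with ';' starting a trailing comment.
    """
    line = line.split(';', 1)[0].strip()          # drop trailing comments
    if not line:
        return None
    opcode, _, rest = line.partition(' ')
    operands = [t.strip() for t in rest.split(',')] if rest.strip() else []
    return opcode, operands

assert parse_instruction("shl eax, 2") == ("shl", ["eax", "2"])
assert parse_instruction("mulss xmm0, xmm1") == ("mulss", ["xmm0", "xmm1"])
assert parse_instruction("ret") == ("ret", [])
```

Such (opcode, operands) tuples are the natural unit from which register/value dependencies, and hence a control/dataflow graph, can be derived.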



N-Bref is the abbreviation for "neural-based binary reverse engineering framework". Complete assembly code and graphs are shown in Appendix H & I.




