N-BREF: A HIGH-FIDELITY DECOMPILER EXPLOITING PROGRAMMING STRUCTURES

Abstract

Binary decompilation is a powerful technique for analyzing and understanding software when source code is unavailable, and it is a critical problem in the computer security domain. With the success of neural machine translation (NMT), recent neural-based decompilers show promising results compared to traditional approaches. However, several key challenges remain: (i) Prior neural-based decompilers focus on simplified programs and do not consider sophisticated yet widely used data types such as pointers; furthermore, many high-level expressions map to the same low-level code (expression collision), which severely degrades decompilation performance; (ii) State-of-the-art NMT models (e.g., the transformer and its variants) mainly deal with sequential data, which is inefficient for decompilation, where the input and output data are highly structured. In this paper, we propose N-Bref¹, a new framework for neural decompilers that addresses these two challenges with two key design principles: (i) N-Bref designs a structural transformer with three key components for better comprehension of structural data (an assembly encoder, an abstract syntax tree encoder, and a tree decoder), extending transformer models in the context of decompilation. (ii) N-Bref introduces a program generation tool that can control the complexity of code generation and removes expression collisions. Extensive experiments demonstrate that N-Bref outperforms previous neural-based decompilers by margins of 6.1% and 8.8% in accuracy on datatype recovery and source code generation, respectively. In particular, N-Bref decompiles human-written Leetcode programs with complex library calls and data types with high accuracy.

1. INTRODUCTION

Decompilation, the process of recovering source code from a binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. For example, decompilation is highly valuable in many security and forensics applications (Lin et al. (2010); Lee et al. (2011); Brumley et al. (2011)). Given a binary executable, an ideal decompiler generates a high-level program that preserves both the semantics and the functionality of the source code. However, this process is difficult because data structures and semantics are largely destroyed or obfuscated during compilation. Inspired by the remarkable performance of neural machine translation (NMT) (Liu et al. (2019); Vaswani et al. (2017); Dai et al. (2019); Devlin et al. (2018); Dong & Lapata (2016)), recent works (Fu et al. (2019); Katz et al. (2019)) leverage NMT models for neural-based decompilation and achieve promising performance on small code snippets. To make neural-based decompilation useful in practice, many challenges remain: (C1) Current state-of-the-art neural architectures for machine translation, namely the transformer (Vaswani et al. (2017)) and its variants (Dai et al. (2019); Devlin et al. (2018); Liu et al. (2019)), focus on sequential data (e.g., language), while neural decompilers deal with data with intrinsic structures (e.g., tree/graph) and long-range dependencies. (C2) The main decompilation task consists of many sub-tasks (e.g., datatype recovery, control/dataflow recovery). Training one neural network cannot solve them all. (C3) Practical data types (e.g., pointers) are not modeled, and compiling configurations need to be known beforehand (Fu et al. (2019)). (C4) Due to a lack of unification in terms of library usage, variable type, and/or control-flow complexity, a simple crawling from public repositories does not

¹ N-Bref is the abbreviation for "neural-based binary reverse engineering framework".

