Computer Laboratory

Technical reports

Decompilation as search

Wei Ming Khoo

November 2013, 119 pages

This technical report is based on a dissertation submitted August 2013 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Hughes Hall.

Abstract

Decompilation is the process of converting programs in a low-level representation, such as machine code, into high-level programs that are human readable, compilable and semantically equivalent. The current de facto approach to decompilation is largely modelled on compiler theory and only focusses on one or two of these desirable goals at a time.

This thesis makes the case that decompilation is more effectively accomplished through search. It is observed that software development is seldom a clean slate process and much software is available in public repositories. To back this claim, evidence is presented from three categories of software development: corporate software development, open source projects and malware creation. Evidence strongly suggests that code reuse is prevalent in all categories.

Two approaches to search-based decompilation are proposed. The first approach borrow inspiration from information retrieval, and constitutes the first contribution of this thesis. It uses instruction mnemonics, control-flow sub-graphs and data constants, which can be quickly extracted from a disassembly, and relies on the popular text search engine CLucene. The time taken to analyse a function is small enough to be practical and the technique achieves an F2 measure of above 83.0% for two benchmarks.

The second approach and contribution of this thesis is perturbation analysis, which is able to differentiate between algorithms implementing the same functionality, e.g. bubblesort versus quicksort, and between different implementations of the same algorithm, e.g. quicksort from Wikipedia versus quicksort from Rosetta code. Test-based indexing (TBI) uses random testing to characterise the input-output behaviour of a function; perturbation-based indexing (PBI) is TBI with additional input-output behaviour obtained through perturbation analysis. TBI/PBI achieves an F2 measure of 88.4% on five benchmarks involving different compilers and compiler options.

To perform perturbation analysis, function prototyping is needed, the standard way comprising liveness and reaching-definitions analysis. However, it is observed that in practice actual prototypes fall into one of a few possible categories, enabling the type system to be simplified considerably. The third and final contribution is an approach to prototype recovery that follows the principle of conformant execution, in the form of inlined data source tracking, to infer arrays, pointer-to-pointers and recursive data structures.

Full text

PDF (1.2 MB)

BibTeX record

@TechReport{UCAM-CL-TR-844,
  author =	 {Khoo, Wei Ming},
  title = 	 {{Decompilation as search}},
  year = 	 2013,
  month = 	 nov,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-844.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-844}
}