Computer Laboratory

Technical reports

Facilitating program parallelisation: a profiling-based approach

Jonathan Mak

March 2011, 120 pages

This technical report is based on a dissertation submitted November 2010 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. John’s College.

Abstract

The advance of multi-core architectures signals the end of universal speed-up of software over time. To continue exploiting hardware developments, effort must be invested in producing software that can be split up to run on multiple cores or processors. Many solutions have been proposed to address this issue, ranging from explicit to implicit parallelism, but consensus has yet to be reached on the best way to tackle such a problem.

In this thesis we propose a profiling-based interactive approach to program parallelisation. Profilers gather dependence information on a program, which is then used to automatically parallelise the program at source-level. The programmer can then examine the resulting parallel program, and using critical path information from the profiler, identify and refactor parallelism bottlenecks to enable further parallelism. We argue that this is an efficient and effective method of parallelising general sequential programs.

Our first contribution is a comprehensive analysis of limits of parallelism in several benchmark programs, performed by constructing Dynamic Dependence Graphs (DDGs) from execution traces. We show that average available parallelism is often high, but realising it would require various changes in compilation, language or computation models. As an example, we show how using a spaghetti stack structure can lead to a doubling of potential parallelism.

The rest of our thesis demonstrates how some of this potential parallelism can be realised under the popular fork-join parallelism model used by Cilk, TBB, OpenMP and others. We present a tool-chain with two main components: Embla 2, which uses DDGs from profiled dependences to estimate the amount of task-level parallelism in programs; and Woolifier, a source-to-source transformer that uses Embla 2’s output to parallelise the programs. Using several case studies, we demonstrate how this tool-chain greatly facilitates program parallelisation by performing an automatic best-effort parallelisation and presenting critical paths in a concise graphical form so that the programmer can quickly locate parallelism bottlenecks, which when refactored can lead to even greater potential parallelism and significant actual speed-ups (up to around 25 on a 32-effective-core machine).

Full text

PDF (1.2 MB)

BibTeX record

@TechReport{UCAM-CL-TR-796,
  author =	 {Mak, Jonathan},
  title = 	 {{Facilitating program parallelisation: a profiling-based
         	   approach}},
  year = 	 2011,
  month = 	 mar,
  url = 	 {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-796.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  number = 	 {UCAM-CL-TR-796}
}