L25 background reading

This page provides some papers for background reading on L25, organised along the lines of the structure of the course.

Many of the papers are hosted by the ACM Digital Library. There is a normally charge for accessing this service, but you can access it for free from the university network. You are not expected to buy any of the papers. If you wish to read the papers elsewhere, then either download them while on the university network, or use Computer Lab VPN to access the ACM Digital Library from any other connection.

The course assessment does not require that you read these papers, but you may find that they are good starting points for exploring areas introduced in the course that you find interesting. You may also find inspiration for a miniproject idea in one of the papers.

This is by no means an exhaustive list of relevant papers. Following citations is often a good way of finding other interesting things to read. This works both backwards (reading papers from the bibliography) and forwards: The ‘Cited By’ tab in a paper’s ACM Digital Library entry often provides links to interesting follow-on work.

Understanding compiler intermediate representations

Fred Chow provides a good overview of the requirements of a good IR [1]. This article also covers some of the history of IR design.

Static Single Assignment (SSA) form was developed by the authors of [2], which provides a good overview of the algorithms used to create it.

The original LLVM paper [3] provides a good overview of the original motivations and design of LLVM. You will note how much the design has changed over time. Contrast this with the [4] Java bytecode specification, which was created for very different goals.

Dynamic dispatch and duck typing

The canonical paper for polymorphic inline caching in Self [5] is still very relevant.

The lectures skip over the gory details of C++ implementation. If you really want to know how it works, then the ABI docs explain it all [6].

Trace-based JITs originate in dynamic recompilation (not dynamic languages at all!), with Dynamo [7] introducing the technique. Their use in dynamic language implementations is much more recent [8], with the first implementation appearing in Adobe Flash (for ActionScript) and then being contributed to Mozilla (Tamarin).

Autovectorisation

The SLP vectorizer in LLVM is based on work from 2000 [9]. A more flexible version of the algorithm [10] (with associated costs in complexity, translating directly to run time) was recently published by researchers in the lab.

There are a lot of papers on polyhedral optimisation in general, but for the specific interaction with LLVM, Tobias Grossner’s diploma thesis [11] provides a lot of detail.

Garbage collection

The best overview of garbage collection techniques and the relationship between various approaches is David F. Bacon’s Unified Theory [12]. Most collectors use a combination of techniques discussed in this paper.

For something a bit more modern, the C4 collector [13] is close to the state of the art for high-end systems. Some more work from IBM shows the state of the art at the embedded end [14]

Bibliography

Fred Chow. Intermediate Representation. Commun. ACM 56, 12 (2013), 57–62. [doi]

BibTeX

@article{Chow:2013:IR:2534706.2534720,
  author = {Chow, Fred},
  title = {Intermediate Representation},
  journal = {Commun. ACM},
  issue_date = {December 2013},
  volume = {56},
  number = {12},
  month = dec,
  year = {2013},
  issn = {0001-0782},
  pages = {57--62},
  numpages = {6},
  url = {http://doi.acm.org/10.1145/2534706.2534720},
  doi = {10.1145/2534706.2534720},
  acmid = {2534720},
  publisher = {ACM},
  address = {New York, NY, USA}
}

Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman and F. Kenneth Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst. 13, 4 (1991), 451–490. [doi]

BibTeX

@article{Cytron:1991:ECS:115372.115320,
  author = {Cytron, Ron and Ferrante, Jeanne and Rosen, Barry K. and Wegman, Mark N. and Zadeck, F. Kenneth},
  title = {Efficiently Computing Static Single Assignment Form and the Control Dependence Graph},
  journal = {ACM Trans. Program. Lang. Syst.},
  issue_date = {Oct. 1991},
  volume = {13},
  number = {4},
  month = oct,
  year = {1991},
  issn = {0164-0925},
  pages = {451--490},
  numpages = {40},
  url = {http://doi.acm.org/10.1145/115372.115320},
  doi = {10.1145/115372.115320},
  acmid = {115320},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {control dependence, control flow graph, def-use chain, dominator, optimizing compilers}
}

Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, IEEE Computer Society (2004), 75–.

BibTeX

@inproceedings{Lattner:2004:LCF:977395.977673,
  author = {Lattner, Chris and Adve, Vikram},
  title = {LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation},
  booktitle = {Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization},
  series = {CGO '04},
  year = {2004},
  isbn = {0-7695-2102-9},
  location = {Palo Alto, California},
  pages = {75--},
  url = {http://dl.acm.org/citation.cfm?id=977395.977673},
  acmid = {977673},
  publisher = {IEEE Computer Society},
  address = {Washington, DC, USA}
}

The Java Virtual Machine Specification. .

BibTeX

@misc{jvm,
  title = {The Java Virtual Machine Specification},
  authors = {Tim Lindholm and Frank Yellin and Gilad Bracha and Alex Buckley},
  url = {http://docs.oracle.com/javase/specs/jvms/se7/html/index.html}
}

Urs Hölzle, Craig Chambers and David Ungar. Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches. 1991. [pdf]
BibTeX
```
@misc{citeulike:1903110,
  author = {Hölzle, Urs and Chambers, Craig and Ungar, David},
  keywords = {dynamictyping, oo, optimization, self},
  posted-at = {2007-11-12 15:33:30},
  priority = {4},
  title = {{Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches}},
  pdf = {http://www.selflanguage.org/_static/published/pics.pdf},
  year = {1991}
}
```
Abstract

: Polymorphic inline caches (PICs) provide a new way to reduce the overhead of polymorphic
message sends by extending inline caches to include more than one cached lookup result per call site. For
a set of typical object-oriented SELF programs, PICs achieve a median speedup of 11%.
As an important side effect, PICs collect type information by recording all of the receiver types actually used
at a given call site. The compiler can exploit this type information to generate better code when...

Itanium C++ ABI. .

BibTeX

@misc{cxxabi,
  title = {Itanium C++ ABI},
  howpublished = {https://mentorembedded.github.io/cxx-abi/abi.html},
  url = {https://mentorembedded.github.io/cxx-abi/abi.html}
}

Vasanth Bala, Evelyn Duesterwald and Sanjeev Banerjia. Dynamo: A Transparent Dynamic Optimization System. Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, ACM (2000), 1–12. [doi]

BibTeX

@inproceedings{Bala:2000:DTD:349299.349303,
  author = {Bala, Vasanth and Duesterwald, Evelyn and Banerjia, Sanjeev},
  title = {Dynamo: A Transparent Dynamic Optimization System},
  booktitle = {Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation},
  series = {PLDI '00},
  year = {2000},
  isbn = {1-58113-199-2},
  location = {Vancouver, British Columbia, Canada},
  pages = {1--12},
  numpages = {12},
  url = {http://doi.acm.org/10.1145/349299.349303},
  doi = {10.1145/349299.349303},
  acmid = {349303},
  publisher = {ACM},
  address = {New York, NY, USA}
}

Andreas Gal, Brendan Eich, Mike Shaver, David Anderson, David Mandelin, Mohammad R. Haghighat, Blake Kaplan, Graydon Hoare, Boris Zbarsky, Jason Orendorff, Jesse Ruderman, Edwin W. Smith, Rick Reitmaier, Michael Bebenita, Mason Chang and Michael Franz. Trace-based Just-in-time Type Specialization for Dynamic Languages. Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM (2009), 465–478. [doi]

BibTeX

@inproceedings{Gal:2009:TJT:1542476.1542528,
  author = {Gal, Andreas and Eich, Brendan and Shaver, Mike and Anderson, David and Mandelin, David and Haghighat, Mohammad R. and Kaplan, Blake and Hoare, Graydon and Zbarsky, Boris and Orendorff, Jason and Ruderman, Jesse and Smith, Edwin W. and Reitmaier, Rick and Bebenita, Michael and Chang, Mason and Franz, Michael},
  title = {Trace-based Just-in-time Type Specialization for Dynamic Languages},
  booktitle = {Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation},
  series = {PLDI '09},
  year = {2009},
  isbn = {978-1-60558-392-1},
  location = {Dublin, Ireland},
  pages = {465--478},
  numpages = {14},
  url = {http://doi.acm.org/10.1145/1542476.1542528},
  doi = {10.1145/1542476.1542528},
  acmid = {1542528},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {dynamically typed languages, trace-based compilation}
}

Samuel Larsen and Saman Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, ACM (2000), 145–156. [doi]

BibTeX

@inproceedings{Larsen:2000:ESL:349299.349320,
  author = {Larsen, Samuel and Amarasinghe, Saman},
  title = {Exploiting Superword Level Parallelism with Multimedia Instruction Sets},
  booktitle = {Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation},
  series = {PLDI '00},
  year = {2000},
  isbn = {1-58113-199-2},
  location = {Vancouver, British Columbia, Canada},
  pages = {145--156},
  numpages = {12},
  url = {http://doi.acm.org/10.1145/349299.349320},
  doi = {10.1145/349299.349320},
  acmid = {349320},
  publisher = {ACM},
  address = {New York, NY, USA}
}

V. Porpodas, A. Magni and T.M. Jones. PSLP: Padded SLP automatic vectorization. Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on, (2015), 190–201. [pdf] [doi]
BibTeX
```
@inproceedings{PSLP,
  author = {Porpodas, V. and Magni, A. and Jones, T.M.},
  booktitle = {Code Generation and Optimization (CGO), 2015 IEEE/ACM International Symposium on},
  title = {PSLP: Padded SLP automatic vectorization},
  year = {2015},
  month = feb,
  pages = {190-201},
  pdf = {http://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-cgo.pdf},
  keywords = {parallel processing;program compilers;support vector machines;PSLP;SIMD vector units;compiler-based automatic vectorization;isomorphic instructions;nonisomorphic instruction sequences;padded SLP automatic vectorization;straight-line scalar code;superword-level parallelism vectorization algorithm;vectorization algorithm;vendors support vector instructions;writing code;Algorithm design and analysis;Assembly;Educational institutions;Parallel processing;Program processors;Registers;Vectors},
  doi = {10.1109/CGO.2015.7054199}
}
```
Abstract

The need to increase performance and power efficiency in modern processors has led to a wide adoption of SIMD vector units. All major vendors support vector instructions and the trend is pushing them to become wider and more powerful. However, writing code that makes efficient use of these units is hard and leads to platform-specific implementations. Compiler-based automatic vectorization is one solution for this problem. In particular the Superword-Level Parallelism (SLP) vectorization algorithm is the primary way to automatically generate vector code starting from straight-line scalar code. SLP is implemented in all major compilers, including GCC and LLVM. SLP relies on finding sequences of isomorphic instructions to pack together into vectors. However, this hinders the applicability of the algorithm as isomorphic code sequences are not common in practice. In this work we propose a solution to overcome this limitation. We introduce Padded SLP (PSLP), a novel vectorization algorithm that can vectorize code containing non-isomorphic instruction sequences. It injects a near-minimal number of redundant instructions into the code to transform non-isomorphic sequences into isomorphic ones. The padded instruction sequence can then be successfully vectorized. Our experiments show that PSLP improves vectorization coverage across a number of kernels and full benchmarks, decreasing execution time by up to 63%.

Tobias Grosser. Enabling Polyhedral Optimizations in LLVM. 2011. [pdf]

BibTeX

@mastersthesis{polly,
  title = {Enabling Polyhedral Optimizations in LLVM},
  school = {Universität Passau},
  address = {Germany},
  month = apr,
  year = {2011},
  author = {Grosser, Tobias},
  pdf = {http://www.grosser.es/publications/grosser-2011--Enabling-Polyhedral-Optimizations-in-LLVM--diplomathesis.pdf}
}

David F. Bacon, Perry Cheng and V. T. Rajan. A Unified Theory of Garbage Collection. Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, ACM (2004), 50–68. [doi]

BibTeX

@inproceedings{Bacon:2004:UTG:1028976.1028982,
  author = {Bacon, David F. and Cheng, Perry and Rajan, V. T.},
  title = {A Unified Theory of Garbage Collection},
  booktitle = {Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications},
  series = {OOPSLA '04},
  year = {2004},
  isbn = {1-58113-831-8},
  location = {Vancouver, BC, Canada},
  pages = {50--68},
  numpages = {19},
  url = {http://doi.acm.org/10.1145/1028976.1028982},
  doi = {10.1145/1028976.1028982},
  acmid = {1028982},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {graph algorithms, mark-and-sweep, reference counting, tracing}
}

Gil Tene, Balaji Iyengar and Michael Wolf. C4: The Continuously Concurrent Compacting Collector. Proceedings of the International Symposium on Memory Management, ACM (2011), 79–88. [doi]

BibTeX

@inproceedings{Tene:2011:CCC:1993478.1993491,
  author = {Tene, Gil and Iyengar, Balaji and Wolf, Michael},
  title = {C4: The Continuously Concurrent Compacting Collector},
  booktitle = {Proceedings of the International Symposium on Memory Management},
  series = {ISMM '11},
  year = {2011},
  isbn = {978-1-4503-0263-0},
  location = {San Jose, California, USA},
  pages = {79--88},
  numpages = {10},
  url = {http://doi.acm.org/10.1145/1993478.1993491},
  doi = {10.1145/1993478.1993491},
  acmid = {1993491},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {concurrent, garbage collection, genera- tional, linux, pauseless, read barrier, virtual memory}
}

David F. Bacon, Perry Cheng and V. T. Rajan. POPL 2003: A Real-time Garbage Collector with Low Overhead and Consistent Utilization. SIGPLAN Not. 48, 4S (2013), 58–71. [doi]

BibTeX

@article{Bacon:2013:PRG:2502508.2502523,
  author = {Bacon, David F. and Cheng, Perry and Rajan, V. T.},
  title = {POPL 2003: A Real-time Garbage Collector with Low Overhead and Consistent Utilization},
  journal = {SIGPLAN Not.},
  issue_date = {April 2013},
  volume = {48},
  number = {4S},
  month = jul,
  year = {2013},
  issn = {0362-1340},
  pages = {58--71},
  numpages = {14},
  url = {http://doi.acm.org/10.1145/2502508.2502523},
  doi = {10.1145/2502508.2502523},
  acmid = {2502523},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {defragmentation, read barrier, real-time scheduling, utilization}
}