Modern Compiler Design

Coursework

The exercises and deadlines are described in this PDF.

Miniproject and Report (80%)

Demonstrate the effectiveness (or lack thereof) of some optimisation technique of your choice. Project ideas must be approved in advance.

Don't forget that you are primarily marked based on the quality of your evaluation (did it give a speedup? On what kinds of workloads? If not, why not?), not on raw performance improvements.

Your project proposal should be at most one page and should explain:

  • What you intend to do.
  • Why you believe that it will be an improvement (e.g. cite papers, give observations about poor performance).
  • How you plan to evaluate your changes.

Some possible example projects are listed below. Feel free to look for missed optimisation opportunities or missing features in other compilers and see if there's something that would interest you.

Inverted dispatch tables for Objective-C

Modify the GNUstep Objective-C runtime to use inverted dispatch tables. Extend clang to emit specialised call sites, one per selector, directly referencing the inverted dispatch tables.

The runtime must be modified to create and update the tables. The call sites should fall back to using the existing message send mechanism if there is no inverted dispatch table.
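As a rough sketch of the idea (all names below are hypothetical, not the real GNUstep runtime API): a conventional dispatch table maps selectors to implementations per class, whereas an inverted table maps classes to implementations per selector, so a call site specialised for one selector can index the table directly with the receiver's class and fall back to the generic path on a miss.

#include <stddef.h>

/* Sketch only: all names below are hypothetical. */
typedef void *(*IMP)(void *receiver, void *selector);

struct inverted_table {
  IMP   *imps;         /* imps[class_id]: implementation of this one
                          selector for each class, or NULL if unknown */
  size_t class_count;
};

extern size_t class_id_of(void *receiver);               /* hypothetical */
extern void  *slow_msg_send(void *receiver, void *sel);  /* hypothetical */

/* The kind of specialised call site clang would emit for one selector;
   the table itself is created and kept up to date by the runtime. */
static struct inverted_table table_for_sel;

void *call_sel(void *receiver, void *sel) {
  size_t cls = class_id_of(receiver);
  if (cls < table_for_sel.class_count && table_for_sel.imps[cls])
    return table_for_sel.imps[cls](receiver, sel);
  return slow_msg_send(receiver, sel);   /* existing message-send path */
}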

Speculating loads vs concurrency

LLVM will speculatively hoist loads outside of conditionals if it can demonstrate that there is no observable side effect within the abstract machine. For example, consider the following:

int g;   /* global variable, potentially shared with other threads */
int foo(int cond) {
  if (cond)
    return g;   /* the load of g is only reached when cond is true */
  return 0;
}

LLVM will hoist the load of g above the conditional. This is a good optimisation in single-threaded code, but in multithreaded code it can cause false sharing, because the cache line holding g is read even on paths that never needed it. Can you come up with a heuristic for disabling this transform that avoids the false-sharing problem without causing a slowdown in single-threaded code?
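The effect of the transform is roughly the following (a hand-written sketch of the result, not LLVM's actual output): g is loaded, and its cache line brought in, even when cond is zero.

int g;
int foo(int cond) {
  int tmp = g;             /* load speculatively hoisted above the branch */
  return cond ? tmp : 0;   /* branch folded into a select                 */
}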

See the LLVMdev thread 'Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)' for more details and a test case.

Extend the Cellular Automata language for parallelism

In the CA language from the assignments, the ordering of writes to the global registers is undefined. Extend the language by providing some form of explicit synchronisation (for example, barriers or mutexes) and an implementation that uses autovectorisation, autoparallelism, or both, yet provides deterministic output.
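The CA language's syntax for this is up to you; as a sketch of the guarantee a barrier buys (written in C with pthreads rather than the CA language), each thread writes into a private slot and the shared register is only updated, in a fixed order, after every thread has reached the barrier, so the result is deterministic regardless of scheduling.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Stand-in for the per-thread part of a CA step. */
static int compute_partial(int id) { return id + 1; }

static pthread_barrier_t barrier;
static int partial[NTHREADS];
static int global_reg;                /* the shared "global register"  */

static void *worker(void *arg) {
  int id = (int)(long)arg;
  partial[id] = compute_partial(id);  /* private slot: no data race    */
  pthread_barrier_wait(&barrier);     /* all partial results are ready */
  if (id == 0)                        /* reduce in one fixed order     */
    for (int i = 0; i < NTHREADS; i++)
      global_reg += partial[i];
  return NULL;
}

int main(void) {
  pthread_t t[NTHREADS];
  pthread_barrier_init(&barrier, NULL, NTHREADS);
  for (long i = 0; i < NTHREADS; i++)
    pthread_create(&t[i], NULL, worker, (void *)i);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(t[i], NULL);
  printf("%d\n", global_reg);
  return 0;
}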

Add hot loop transfer to MysoreScript

MysoreScript only compiles code when a particular function has been invoked a number of times. A hot loop at the global scope or within a function that is only called a few times will never be compiled. Modify the compiler so that it is able to move execution from the interpreter into compiled code during loop execution.
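One possible shape for this (a sketch with hypothetical names, not MysoreScript's actual interpreter structure) is a back-edge counter in the interpreter's loop handling: when the counter crosses a threshold, compile the loop and continue execution in the compiled code from the interpreter's current state.

/* Sketch only: the helpers below are hypothetical. */
extern int  loop_condition_holds(void *interp_state);
extern void interpret_one_iteration(void *interp_state);
extern void (*compile_loop(void *interp_state))(void *);

#define HOT_THRESHOLD 1024

void run_loop(void *state) {
  long back_edges = 0;
  while (loop_condition_holds(state)) {
    interpret_one_iteration(state);
    if (++back_edges == HOT_THRESHOLD) {
      /* The loop is hot: compile it and transfer the interpreter's
         current state into the compiled code, mid-loop.            */
      void (*compiled)(void *) = compile_loop(state);
      compiled(state);
      return;
    }
  }
}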

Successful miniprojects from previous years

The following projects were successful in previous years. They may provide some inspiration for future projects.

Structure reorganisation

For large structures, you can achieve better performance if fields in a structure that are accessed together are close enough that they are in the same cache line. If fields are never accessed together (but there is concurrency), it may be better to add padding such that some are never in the same line. Can you write a pass that will rearrange the layouts of structure types to enforce this? Make sure that you don't modify any types that are passed outside the current module...
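A hand-written illustration of both cases (assuming 64-byte cache lines; a pass would derive the grouping from observed access patterns rather than have it spelled out like this):

struct before {
  long hot_a;        /* accessed together with hot_b in the inner loop */
  char rarely[200];  /* cold data separating the two hot fields        */
  long hot_b;
  long ctr_thread1;  /* written only by thread 1 */
  long ctr_thread2;  /* written only by thread 2 */
};

struct after {
  long hot_a;                     /* hot fields now share a cache line */
  long hot_b;
  char rarely[200];
  long ctr_thread1;
  char pad[64 - sizeof(long)];    /* keep the per-thread counters in   */
  long ctr_thread2;               /* different cache lines             */
};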

This project was completed for Java, where the layout of structures in memory is not defined by the language.

Analyse LLVM's lowering of atomics

Peter Sewell et al. have discovered that LLVM takes a very conservative approach to lowering atomics on ARM, such that an algorithm that relaxes the memory orderings can end up being compiled to code with a stronger model. Implement a set of lockless algorithms and provide a set of test cases, with the generated assembly annotated to indicate where the back end has been overly aggressive in inserting barriers. Analyse the slowdown that these overly aggressive barriers cause.
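A minimal test case in this style might use C11 atomics with explicitly relaxed orderings, compile for ARM, and annotate any dmb barriers in the output that the chosen orderings do not actually require:

#include <stdatomic.h>

/* A relaxed event counter: only atomicity is required, so a precise
   ARM lowering needs no dmb barriers around these operations. */
static atomic_long counter;

void count_event(void) {
  atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

long read_count(void) {
  return atomic_load_explicit(&counter, memory_order_relaxed);
}

Comparing the generated ARM assembly for this against a sequentially consistent version of the same counter makes any unnecessary barriers easy to spot.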

This project was presented at EuroLLVM 2014 under the title 'Efficient code generation for weakly ordered architectures'.

Polyhedral DCE for Polly

Implement polyhedral dead code elimination in Polly. There are some test cases available, and an existing implementation in ppcg can provide inspiration.
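A sketch of the kind of code such a pass removes (not one of the existing test cases): every store in the first loop is overwritten by the second loop before it can be read, so the entire first loop is dead at the granularity of statement instances, even though each individual store looks live in isolation.

void init(int n, double A[n]) {
  for (int i = 0; i < n; i++)
    A[i] = 0.0;          /* dead: overwritten below before any read */

  for (int i = 0; i < n; i++)
    A[i] = i * 2.0;
}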

This project showed no performance improvement but a measurable improvement in code size, and was committed to Polly.

Polly open projects

Tobias Grosser has proposed a few open projects for Polly, the polyhedral optimisation framework for LLVM. These may form the basis of project proposals:

Performance optimization

Compile-time performance is very important in the context of Polly. It would be great to perform an analysis of the compile-time overhead introduced by Polly (we have LNT testers that will help here), and to address the top 3 compile-time issues in Polly.

Support for assumptions

LLVM provides llvm.assume intrinsics. It would be great if we could detect and understand such intrinsics in Polly, exploiting the information they provide to generate better code.

As a test case, we could add assumptions to boost::ublas that communicate relations that need to hold, e.g. matching sizes for the operands of matrix operations. This should allow us to remove certain run-time checks that the delinearization would otherwise generate.
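In C, clang exposes this through __builtin_assume, which is lowered to llvm.assume; a sketch of the kind of assumption meant here (hypothetical code, not boost::ublas itself) tells the compiler that the operand shapes match before the loop nest runs:

void mat_add(int rows_a, int cols_a, const double *A,
             int rows_b, int cols_b, const double *B, double *C) {
  /* Lowered to llvm.assume: the operand shapes match, so Polly could
     drop the run-time checks it would otherwise have to generate.   */
  __builtin_assume(rows_b == rows_a);
  __builtin_assume(cols_b == cols_a);

  for (int i = 0; i < rows_a; i++)
    for (int j = 0; j < cols_a; j++)
      C[i * cols_a + j] = A[i * cols_a + j] + B[i * cols_b + j];
}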

Register tiling with Polly

Polly already gives an almost 10x speedup when optimizing a simple gemm kernel. However, it is still far from the performance that ATLAS achieves. The paper "Terra: A Multi-Stage Language for High-Performance Computing" has shown that LLVM can be used to generate code that achieves performance within 80% of ATLAS kernels. Looking at the code it generates, this appears to be normal loop tiling, with the tile loops being register tiled again. Register tiling is the missing piece in Polly. Adding register tiling to Polly and showing competitive performance would be amazing.
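For reference, register tiling of a gemm kernel means unrolling a small tile of the i and j loops and keeping the corresponding elements of C in scalars (and hence registers) across the whole k loop; hand-written with a 2x2 tile (and assuming n is a multiple of 2), it looks roughly like this:

/* C += A * B for n x n row-major matrices, with a 2x2 register tile:
   four elements of C stay in scalars across the whole k loop. */
void gemm_2x2(int n, const double *A, const double *B, double *C) {
  for (int i = 0; i < n; i += 2)
    for (int j = 0; j < n; j += 2) {
      double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
      double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
      for (int k = 0; k < n; k++) {
        double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
        double b0 = B[k * n + j], b1 = B[k * n + j + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
      }
      C[i * n + j] = c00;       C[i * n + j + 1] = c01;
      C[(i + 1) * n + j] = c10; C[(i + 1) * n + j + 1] = c11;
    }
}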

Removing run-time conditions from Julia code

Julia is an HPC language that is compiled with LLVM. Using Polly to optimize Julia requires only a 10-line patch. One interesting feature of Julia is that, by default, it generates run-time bounds checks for all memory accesses. These checks are both costly and inhibit any kind of loop optimization. It would be great to teach Polly to recognize loop nests containing such run-time checks, optimize them under the assumption that no check ever fails, and generate a single run-time check that verifies this assumption. This would remove all run-time-check overhead from the core computation and make the loop nest amenable to loop optimizations, so large speed-ups are expected.
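Sketched in C rather than Julia, the transformation turns a per-access check into a single up-front check that guards an optimizable fast path, with the original checked loop kept as the slow path:

#include <stdlib.h>

/* Stand-in for Julia's bounds-error path. */
static void bounds_error(void) { abort(); }

/* Before: a bounds check on every access blocks loop optimization. */
double sum_checked(const double *a, long len, long n) {
  double s = 0.0;
  for (long i = 0; i < n; i++) {
    if (i >= len)
      bounds_error();
    s += a[i];
  }
  return s;
}

/* After: one run-time check verifies the assumption; the fast path
   is a plain loop that Polly is free to optimize.                  */
double sum_versioned(const double *a, long len, long n) {
  if (n > len)
    return sum_checked(a, len, n);     /* slow path keeps the checks */
  double s = 0.0;
  for (long i = 0; i < n; i++)
    s += a[i];
  return s;
}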

Derive assumption from LLVM-IR

The LLVM-IR passed to Polly often does not provide sufficient information to assume the absence of certain problematic behaviors. One example is that accesses to multi-dimensional arrays of fixed size are not known to remain within bounds, even though the information about the fixed array bounds is readily available. To allow Polly to generate better code, it would be great to let Polly assume that such memory accesses remain within bounds by generating appropriate run-time conditions. This would enable additional optimizations and the generation of more efficient code.

The polybench kernels will benefit greatly from this: by default the array sizes are known, but the loop bounds are not, which currently forces us to generate code for parameter values that would yield out-of-bounds memory accesses.
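A polybench-style sketch of the situation (not one of the actual polybench sources): the declared array size is fixed, but the loop bounds are parameters, so the accesses are only known to be in bounds under the run-time condition n <= N && m <= N, which is exactly the condition Polly could generate and then assume.

#define N 1024

double A[N][N];

/* In bounds only if n <= N and m <= N; emitting that check at run
   time lets Polly assume in-bounds accesses on the optimized path. */
void scale(int n, int m, double alpha) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      A[i][j] *= alpha;
}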