L25 Miniproject ideas

Detecting wrong memory access patterns for work groups in OpenCL

Project proposed by Anastasia Stulova.

Some ideas are expressed in this paper.

Basically, it’s beneficial to access memory in row-major order by the threads from the same work group. If we could implement an LLVM pass to detect inefficient accesses and perhaps also transform or investigate how to transform them it would be quite useful. Also something from related work here could be to detect memory redundant loads in work group. This occurs in applications like Sobel that operate on sliding window of data across threads in the same work group.

Discuss evaluation strategies with Anastasia.

Improve LLVM’s optimisation of atomics

A student in 2014 improved LLVM’s generation of atomics for ARM by eliding redundant barriers (Slides, Writeup). There is still a lot of scope for improvement within the allowed memory model and recent WG14 publications have enumerated some of the allowable transforms.

This project is likely to be very difficult, but has the potential for some very nice results.

Add GPU support to CellAtom

The cellular automata language from Exercise 3 is designed to be a simplified version of a shader language. It is therefore likely to be amenable to GPU offloading. As with the parallelisation cases, handling updates to globals will require some care, though the barrier operations available within work groups will make it possible to propagate these values between adjacent cells in the same group.

A good evaluation should show how the cost of transferring data to and from the GPU dominates initially and find grid sizes and iteration counts where the GPU solution starts to show a speedup.

Add deoptimisation support to MysoreScript

The current version of MysoreScript treats the JIT’d code as immutable. This is convenient for simplicity, but is not representative of real-world JITs.

This project should add some generic infrastructure for using LLVM patchpoints to allow JIT’d code to perform side exits when optimisation decisions are incorrect. The generic infrastructure should reconstruct an interpreter state and allow the execution to resume in the interpreter from any point where the compiler decides to use a side exit. This should use a combination of the stack map support in LLVM and the anyregcc calling convention, with an interface that allows parts of the compiler to define points at which the function should exit and resume in the interpreter.

Evaluation should consist of (re)doing the polymorphic inline caching and integer optimisation parts of Exercise 2, rewritten to take advantage of the support.

Implement generic abbreviations for FuncDecl in PCH

Project proposed by Anastasia Stulova.

The current problem is that the OpenCL header (included in all OpenCL compilation units) is about 20K lines of code and when compiled to precompiled header occupies 2MB of space. This header is mainly composed of functions declarations.

PCH achieves size reduction via abbreviations but the PCH writer doesn’t abbreviate FuncDecls, mainly because they can have dynamic descriptors i.e. attributes. It might be possible to create generic (or worst case) abbreviation that would fit all FuncDecl and would be more efficient when general way of storing data.

Evaluation of this project should show space savings for the OpenCL header and should determine whether this is a net win for precompiled headers for C/C++ as well.

Accurate garbage collection for MysoreScript

Currently, MysoreScript uses the Boehm garbage collector. This is not the best choice, because it is a conservative collector that does not know anything about the structure of MysoreScript.

MysoreScript is inherently single threaded and so some of the complexity in writing a garbage collector is omitted. This project should add a new garbage collector, using the existing C++ templates for holding values to identify pointers held by the interpreter, and the LLVM statepoint intrinsics to track references in JIT’d code.

Ideally, this would be a copying collector and would allow object allocation to be inlined into the JIT’d code as a simple bump-the-pointer operation, with a call out to the GC only in low-memory conditions, though some care must be taken to ensure that interpreted code correctly handles GC updates.

Evaluation should show better performance for microbenchmarks that allocate large numbers of objects with relatively short lifespans.

Method caching for the GNUstep Objective-C Runtime

The GNUstep Objective-C Runtime uses a dispatch table design very similar to the original NeXT runtime, with a 3-deep tree of pointers. This is very cheap to search in microbenchmarks, but has large RAM and cache usage.

The codebase contains a somewhat bit-rotted low memory profile that provides a dense representation that is more expensive to search. Either resurrect this or define a new structure, and add some caching to either classes or selectors so that the slow path is rarely required.

Evaluation should show both memory and CPU usage for some real-world code and microbenchmarks.

The runtime also provides some mechanism for safe caching. There used to be an LLVM pass that automatically inserted caches for message sends in loops and for message sends to class objects (which are unlikely to change). An alternative, but closely related, project would rewrite these passes and benchmark them.

Inverted dispatch tables for the GNUstep Objective-C Runtime

The GNUstep Objective-C Runtime uses a dispatch table design very similar to the original NeXT runtime, with a 3-deep tree of pointers. This is very cheap to search in microbenchmarks, but has large RAM and cache usage.

One alternative discussed in the course is an inverted dispatch table, where each selector has a (hopefully much smaller) table of the classes that implement the method. Try implementing this in the GNUstep Objective-C runtime and measure the memory savings and the performance overheads.

A good evaluation should also include an analysis of some real-world Objective-C code using the Objective-C reflection APIs to show for the distribution of classes per selector and methods per class.

Extend random testing for OpenCL2.0

Project proposed by Anastasia Stulova.

CLSmith (based on CSmith) is a tool developed at Imperial College for random testing of OpenCL. It could be extend to test OpenCL 2.0 features or a subset of those. As far as I know there is a way to do it by specifying a grammar. Some papers describing the tool and how it can be extended.

OpenCL 2 support in Clang is relatively immature and so a successful outcome for this project would find and report some bugs.