Department of Computer Science and Technology

Parts II, III & ACS Projects

These are some Part II, Part III and ACS project ideas that I've put together. The groupings are only loose and many of the projects could be adapted to suit any course stage. Please contact me for more information. At the end there is a list of successful projects that have been done in the recent past, to help seed ideas.

If you have your own ideas for a project that you'd like to do with me, then it would be great to hear them so please also get in touch. I'm interested in advising anybody wanting to do a project in computer architecture or compilers.

Part II Projects

Also, check out the Computer Architecture Group Part II projects page for other ideas. Some of these are replicated there.

Vector runahead

To address the widening performance gap between CPU cores and main memories, designers have implemented prefetchers into various levels of the cache hierarchy, so as to bring data close to the processor before it is needed, meaning it is available in fast storage at the point of use. There are a wide variety of data prefetchers available, but few that can accurately identify data that is accessed through complex data structures.

An alternative scheme is Runahead execution. Here, when the processor is stalled, it continues speculatively fetching and executing instructions from the future so as to perform their memory accesses, then discards them and re-executes them correctly once the pipeline starts up again. This provides an accurate form of prefetching within the core, rather than as separate logic beside the cache. Until recently, though, Runahead techniques couldn't deal with complex access logic either. However, a new scheme, called Vector Runahead, can effectively prefetch these access patterns, providing significant performance increases for certain workloads. The aim of this project is to implement Vector Runahead in the gem5 simulator to reproduce the impressive results obtained, with more advanced extensions possible too.
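As a flavour of the core idea, here is a minimal toy model (in Python, not gem5) of scalar runahead: when a demand load misses, rather than sitting idle, the core keeps scanning future loads and issues them as prefetches. The trace, cache and window structures are all invented for illustration.

```python
# Toy model of runahead execution (illustrative only): when the
# "pipeline" stalls on a cache miss, keep walking the instruction
# window, computing the addresses of future loads and issuing them
# as prefetches instead of stalling idle.

def runahead_prefetches(trace, cache, window=8):
    """Return the set of addresses prefetched during the first stall.

    trace  -- list of load addresses in program order
    cache  -- set of addresses already cached
    window -- how far runahead may look past the stalled load
    """
    prefetched = set()
    for i, addr in enumerate(trace):
        if addr not in cache:                 # demand miss: pipeline stalls
            # runahead: scan future loads while the miss is outstanding
            for future in trace[i + 1 : i + 1 + window]:
                if future not in cache:
                    prefetched.add(future)    # issue prefetch
            break                             # stop at the first stall for clarity
    return prefetched

cache = {0x100, 0x108}
trace = [0x100, 0x200, 0x300, 0x108, 0x400]
print([hex(a) for a in sorted(runahead_prefetches(trace, cache))])  # -> ['0x300', '0x400']
```

Vector Runahead goes further by vectorising the dependent address computations themselves, so even pointer-chasing chains can be prefetched ahead of the stalled instruction.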

Dynamic binary translator

Let's celebrate Apple's move from x86 to Arm by creating a dynamic binary translator for them. The idea is that it will read in an x86 binary and then either interpret each instruction, or actually emit Arm instructions for each x86 instruction that it then directly executes. It would probably be best to restrict the project to a subset of x86, or a different source ISA if preferred (e.g. AArch32 to AArch64).
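A starting point might look like the sketch below: a Python interpreter for an invented, textual x86-like subset. A real DBT would decode actual machine code and, in the second mode, emit Arm instructions rather than interpreting.

```python
# Minimal interpreter for a toy, textual x86-like subset -- the
# "interpret each instruction" half of the project. Instructions and
# register names are invented stand-ins for a real decoder.

def interpret(program):
    regs = {"eax": 0, "ebx": 0, "ecx": 0}
    for op, dst, src in program:
        # operands may be register names or immediate integers
        val = regs[src] if src in regs else src
        if op == "mov":
            regs[dst] = val
        elif op == "add":
            regs[dst] += val
        else:
            raise ValueError(f"unimplemented opcode: {op}")
    return regs

prog = [("mov", "eax", 5), ("mov", "ebx", 7), ("add", "eax", "ebx")]
print(interpret(prog)["eax"])  # -> 12
```

Restricting the source ISA to a subset like this keeps the decoding tractable while leaving the interesting translation problems intact.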

Interpreter or JIT compiler

In a similar vein, why not create an optimised interpreter or JIT compiler for your favourite source language. This could be an established language, such as C or OCaml, or a minimalist language, such as P'' (or a descendant), or an existing bytecode (Java being the obvious choice, but it could be the CPython bytecode or something else). Or feel free to suggest your own.
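For instance, the interpreter core for a minimal stack-based bytecode (the opcodes here are invented) can be very small; the project would then lie in extending and optimising it, or adding a JIT backend:

```python
# Sketch of a stack-machine bytecode interpreter -- the starting point
# before adding an optimiser or JIT backend. Opcodes are invented.

def run(bytecode):
    stack = []
    for op, *args in bytecode:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack[-1]

# computes (2 + 3) * 4
print(run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]))  # -> 20
```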

Compiler to WebAssembly

WebAssembly is emerging as a rival to JavaScript for delivering performance-critical code on the web. A number of languages already target WebAssembly, but it's interesting to see which others can be used. WebAssembly doesn't currently have a garbage collector, but that would just make it more interesting to compile a subset of Java, ML, or your favourite language down to this format.
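As a taste of the target, here is a sketch that compiles a tiny invented expression AST down to the WebAssembly text format (WAT); a real front end would lower a full language and handle types, locals and memory.

```python
# Sketch: compile a tiny expression AST to WebAssembly text format (WAT).
# The AST shape (nested tuples / ints) is invented for illustration.

def compile_expr(ast):
    if isinstance(ast, int):
        return [f"i32.const {ast}"]
    op, lhs, rhs = ast                      # binary operator node
    opcode = {"+": "i32.add", "*": "i32.mul"}[op]
    # stack machine: evaluate operands, then apply the operator
    return compile_expr(lhs) + compile_expr(rhs) + [opcode]

def compile_func(name, ast):
    body = "\n    ".join(compile_expr(ast))
    return (f"(module\n"
            f"  (func ${name} (result i32)\n    {body})\n"
            f"  (export \"{name}\" (func ${name})))")

print(compile_func("answer", ("+", 40, ("*", 1, 2))))
```

The emitted module can be checked and run with standard tooling such as wat2wasm and any WebAssembly runtime.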

Hardware Transactional Memory

Transactional memory systems have gained popularity since multicores became commonplace because of their ease of use: the programmer can avoid reasoning about locks while the system transparently supports atomic access to data. There are a number of software transactional memory libraries available, and hardware transactional memory has now become mainstream with Intel's release of the TSX extensions in Haswell processors (bugs notwithstanding). However, a significant downside is that no ordering can be enforced on transaction commit, meaning that TSX is a poor fit for techniques such as thread-level speculation. This project would start by implementing a simple version of hardware transactional memory within a simulator, such as gem5. It would then evaluate the performance improvement available either through source-code alterations or automatically in hardware through speculative thread creation.
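The commit-time conflict detection at the heart of such a system can be sketched in a few lines. The following toy Python model (lazy versioning, with invented per-address version counters standing in for hardware read/write sets) shows two transactions racing on one location:

```python
# Toy lazy-versioning transaction: writes are buffered and only made
# visible at commit, after validating that nothing we read has been
# overwritten since. All structures are illustrative, not hardware.

class Memory:
    def __init__(self):
        self.data = {}
        self.version = {}              # per-address commit counters

class Txn:
    def __init__(self, mem):
        self.mem, self.reads, self.writes = mem, {}, {}

    def read(self, addr):
        if addr in self.writes:        # read our own buffered write
            return self.writes[addr]
        self.reads[addr] = self.mem.version.get(addr, 0)
        return self.mem.data.get(addr, 0)

    def write(self, addr, val):
        self.writes[addr] = val        # buffered until commit

    def commit(self):
        # abort if any location we read was committed to since we read it
        for addr, ver in self.reads.items():
            if self.mem.version.get(addr, 0) != ver:
                return False
        for addr, val in self.writes.items():
            self.mem.data[addr] = val
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1
        return True

mem = Memory()
t1, t2 = Txn(mem), Txn(mem)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 1)
print(t1.commit(), t2.commit())        # -> True False (t2 read a stale "x")
```

The thread-level-speculation angle is then about adding the commit ordering that this simple model, like TSX, lacks.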

Part III / ACS Projects

Also, check out the Computer Architecture Group ACS projects page for other ideas. Some of these are replicated there.

KVM-Accelerated Full-System Binary Translation

In order for binary-translating emulators to support full-system emulation, they need to translate virtual addresses to physical addresses on each memory access. QEMU handles this by creating a software direct-mapped TLB and generating code that checks this TLB for each memory-access instruction. This software-TLB lookup requires at least two extra memory accesses for each original memory-access instruction and therefore hurts performance.

The Kernel-based Virtual Machine (KVM) allows userspace applications to run as VM hypervisors. This project will develop a system that lets DBT-ed code execute inside a guest VM, leveraging the CPU's second-level address translation to do the address translation for us. Instead of maintaining a software TLB, a shadow page table would be used, which should provide considerable performance improvements.
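For reference, the software-TLB fast path that this project aims to eliminate looks roughly like the following sketch (parameters and data structures are illustrative, not QEMU's actual code):

```python
# Sketch of QEMU-style software-TLB address translation: a
# direct-mapped table indexed by virtual page number, falling back
# to a (slow) page-table walk on a miss. Sizes are illustrative.

PAGE_BITS = 12
TLB_SIZE = 256                        # entries; a power of two

def translate(vaddr, tlb, page_table):
    vpn = vaddr >> PAGE_BITS
    entry = tlb[vpn % TLB_SIZE]       # extra memory access #1: load the entry
    if entry and entry[0] == vpn:     # extra memory access #2: tag compare
        ppn = entry[1]                # TLB hit: fast path
    else:
        ppn = page_table[vpn]         # slow path: walk the page table
        tlb[vpn % TLB_SIZE] = (vpn, ppn)
    return (ppn << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))

tlb = [None] * TLB_SIZE
page_table = {0x4: 0x9}
print(hex(translate(0x4ABC, tlb, page_table)))  # -> 0x9abc
```

With second-level address translation, the guest's loads and stores simply use guest-virtual addresses directly and the hardware MMU performs both translation steps, removing these per-access lookups from the generated code.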

This project is joint with Gary Guo.

Old Projects

Use-After-Free Security Mitigations - Part III Project

This project created an extension to a memory allocator (jemalloc, but the implementation was largely independent of the specific allocator) to keep freed objects in quarantine, preventing them from being reallocated, until a sweep through memory could confirm that no dangling pointers to the objects remained. At that point they could be given back to the allocator to be recycled, thus removing many use-after-free vulnerabilities from applications.
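The mechanism can be sketched as follows; the allocator interface, sweep and addresses are all simplified stand-ins for the real jemalloc extension:

```python
# Sketch of quarantine-based use-after-free mitigation: free() only
# quarantines a block; a sweep over "memory" (here, a list of live
# pointer values) decides which blocks are safe to recycle.

class QuarantineAllocator:
    def __init__(self):
        self.quarantine = set()        # freed but not yet reusable
        self.free_list = []            # safe to hand back out

    def free(self, addr):
        self.quarantine.add(addr)      # never recycled immediately

    def sweep(self, live_pointers):
        # release only blocks with no remaining (dangling) references
        safe = self.quarantine - set(live_pointers)
        self.free_list.extend(sorted(safe))
        self.quarantine -= safe

alloc = QuarantineAllocator()
alloc.free(0x1000)
alloc.free(0x2000)
alloc.sweep(live_pointers=[0x2000])    # 0x2000 is still referenced
print([hex(a) for a in alloc.free_list],
      [hex(a) for a in alloc.quarantine])  # -> ['0x1000'] ['0x2000']
```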

Profile-Driven Data-Layout Optimisations - ACS Project

This project profiled the memory accesses made by an application, analysed which objects were accessed together and rewrote the binary to identify different clusters of objects. A custom memory allocator then dynamically allocated these objects so that those accessed together were placed near each other in memory, increasing spatial locality. This work was published as a paper at CGO 2020.

Language A to B Compiler - Part II Project

Various projects in the past have written compilers from one language to another. For example, Java to JavaScript, Crystal to JavaScript, Haskell to JVM, ML to JVM, Java to WebAssembly, Haskell to WebAssembly.

An Optimising Compiler - Part II Project

Various projects in the past have created optimising compilers. For example, an optimiser for OCaml's bytecode generator.

Binary Alias Analysis - ACS Project

This project wrote an alias analysis pass for Janus, our binary parallelisation tool.

Vectorisation in a Dynamic Binary Translator - Part III Project

This project wrote a vectorisation pass for Janus, our binary parallelisation tool, so as to take advantage of SIMD operations. It was also incorporated into a paper we wrote for an international workshop.

A Multicore Cache Simulator - Part II Project

This project created a cache simulator based on the DynamoRIO dynamic binary translator, providing an alternative to its own cache simulator, but addressing some of its limitations (such as simulating arbitrary hierarchies and cache coherence).

Garbage Collector For C - Part II Project

This project wrote a garbage collector for C programs, to replace malloc() and free() calls, then compared it to the Boehm garbage collector.

DoAll Parallelisation - Part II Project

The idea of this project was to identify DoAll parallelism within program loops using the LLVM intermediate representation and then split up the loops into multiple threads to execute each iteration concurrently. The project was extended to extract a form of DoAcross parallelism too.
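The transformation can be illustrated with a toy Python version (a real implementation would operate on LLVM IR and spawn native threads): iterations are chunked across a thread pool, which is only safe because DoAll loops have no cross-iteration dependences.

```python
# Sketch of DoAll parallelisation: iterations with no cross-iteration
# dependences are split into chunks, one per thread.

from concurrent.futures import ThreadPoolExecutor

def doall(body, n_iters, n_threads=4):
    chunk = (n_iters + n_threads - 1) // n_threads
    def worker(t):
        # each thread runs a contiguous block of iterations
        for i in range(t * chunk, min((t + 1) * chunk, n_iters)):
            body(i)
    with ThreadPoolExecutor(n_threads) as pool:
        list(pool.map(worker, range(n_threads)))  # wait for all threads

out = [0] * 8
doall(lambda i: out.__setitem__(i, i * i), len(out))
print(out)  # -> [0, 1, 4, 9, 16, 25, 36, 49]
```

DoAcross parallelism relaxes this by allowing cross-iteration dependences, with threads synchronising to forward the dependent values.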

Data Dependence Profiling - Part II Project

This project implemented SD3, a technique for dynamic data-dependence profiling.

Software Error Detection - Part II Project

This project implemented SWIFT, a technique to detect transient errors by adding extra software instructions.

Dynamic Data-Dependence Analysis - Part II Project

Data dependence profiling is time consuming and uses a lot of space if taking program traces. To combat this, this project implemented SD3, an online data dependence analyser. A library implementing SD3 was written and then interfaced with a program automatically by writing a pass within LLVM to insert calls into the library.
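The core dependence check, stripped of SD3's stride compression and windowing, might be sketched like this (the per-iteration read/write sets are invented for illustration):

```python
# Sketch of dynamic data-dependence detection -- the basic check that
# SD3 performs, without its stride-based compression of the address
# sets. Records who last wrote each address and flags later readers.

def carried_dependences(iters):
    """iters: list of (reads, writes) address sets, one per iteration.

    Returns {(writer_iter, reader_iter, addr)} flow dependences.
    """
    deps = set()
    seen_writes = {}                       # addr -> last writing iteration
    for i, (reads, writes) in enumerate(iters):
        for a in reads:
            if a in seen_writes and seen_writes[a] != i:
                deps.add((seen_writes[a], i, a))   # loop-carried flow dep
        for a in writes:
            seen_writes[a] = i
    return deps

# iteration 1 reads what iteration 0 wrote at address 0x10
trace = [({0x20}, {0x10}), ({0x10}, {0x30})]
print(carried_dependences(trace))  # -> {(0, 1, 16)}
```

SD3's contribution is compressing strided address streams so that this bookkeeping stays affordable in time and space.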

Function Call Parallelisation - Part II Project

This project attempted to run functions in their own threads. It used software transactional memory to deal with data dependences and performed the transformation automatically in LLVM.

Ultra-Fine-Grained Parallelisation - ACS Project

Automatic program parallelisation is the process of taking a sequential application and splitting it into threads that can be run independently. One of the challenges of automatic parallelisation is deciding how much code each thread should execute. This project took this towards an extreme, splitting the application into small units of work, with each thread executing many of these tasks.

JavaScript Parallelisation - ACS Project

The main idea of this project was to transform the code generated by a JavaScript engine (we used Google's V8) to take advantage of thread-level parallelism. Essentially, DoAll loops were identified by the programmer (those where all iterations can run in parallel because there are no dependences) and the V8 engine split them across different threads, all running at the same time.