Computer Laboratory

Computer Architecture Group

ACS & Part II/III Project Suggestions

The following project suggestions are starting points for possible Part II, Part III and ACS projects. ACS and Part III projects require more of a research emphasis to be successful, where as Part II projects might focus more on computer engineering and reproducibility of results, though top marks tend to be awarded to more research oriented projects.

Please contact the proposer(s) by email if you are interested in any of the projects below. In addition, some of the projects from previous years may still be suitable and interesting. Please remember, these are just starting points that suggest possible directions for the resarch. You can continue to check here again over the coming weeks for more projects. We would also be happy to consider any project ideas you have too.

Suggestions for 2020/2021


  1. Microarchitectural design of new CHERI domain crossing instructions
    Contact: Jonathan Woodruff


    This project would implement the two experimental instructions described in the CHERI Architecture document in the section titled "Indirect Sentry Capabilities" (see CHERI Architecture v8; upcoming publication). This section describes two domain crossing instruction variants for CHERI which should present interesting tradeoffs in microarchitectural implementation. The first loads a capability pointer from memory directly into the PC, and the second not only does this but also loads a second, adjacent capability pointer into a register. This project would implement examples of these instructions in the Flute (in-order, single-scalar) and/or Toooba (out-of-order, super-scalar) open-source CHERI-RISCV implementations and explore the complexity and cost of adding these unusual instructions to the microarchitecture. There should be some evaluation of performance using micro-benchmarks to demonstrate the value (or lack thereof) of the more sophisticated implementation options.



  2. CHERI Domain Crossing using Simplified Primitives
    Contact: Jonathan Woodruff


    The primary domain crossing mechanism for compartmentalisation in CHERI has relied on sealing with object types. The shortcoming of this approach is that it requires a type field to be allocated in the capability encoding which is necessarily limited in size, and therefore introduces an awkward necessity to manage this limited typespace. A simpler mechanism has recently been added to the CHERI architecture; Sealed Entry Capabilities (or "Sentries"). These simply unseal and jump to a target capability, granting access to any capabilities embedded with the target instructions. Sentries have the convenient property that their object type space scales with the virtual address space, while bearing the disadvantage of tying together objects and code.

    Libcheri is a library used for compartmentalisation which uses the traditional type-based domain crossing instructions. This project would develop an alternative implementation of libcheri based solely on the Sentry mechanism, likely requiring some system of indirection to ensure the integrity of entry tokens distributed to mutually distrusting domains. This project would seek to approach the performance of type-based libcheri for the existing benchmarks to learn where the type-based mechanism might be replaced for this style of compartmentalisation.



  3. More CHERI sercure hardware and software projects


    Follow this link to more CHERI hardware and software project suggestions



  4. Compiler to WebAssembly
    Contact: Timothy Jones


    Wouldn't it be great to be able to write web applications in your favourite language? The idea of this project is to take some source language (for example, C or ML, but it could be anything) and write a compiler that transforms it to WebAssembly. You'll probably want to consider just a subset of the language and there may be optimisations to perform on the way to produce faster / easier to read code as the end result.



  5. Dynamic binary translator
    Contact: Timothy Jones


    Let's celebrate Apple's move from x86 to Arm by creating a dynamic binary translator for them. The idea is that it will read in an x86 binary and then either interpret each instruction, or actually emit Arm instructions for each x86 instruction that it then directly executes. It would probably be best to restrict the project to a subset of x86, or a different source ISA if preferred (e.g. AArch32 to AArch64).



  6. Interpreter or JIT compiler
    Contact: Timothy Jones


    In a similar vein, why not create an optimised interpreter or JIT compiler from your favourite source language. This could be an established language, such as C or OCaml, or a minimalist language, such as P'' (or a descendent), or an existing bytecode (Java being the obvious choice, but it could be the CPython bytecode or something else). Or feel free to suggest your own.



  7. Framework for Targeted Tracing of Software Execution with Debug Symbols
    Contact: Jasmin Jahic


    Reasoning about software quality, potential bugs, and its dynamic behaviour (e.g., execution hot spots) is not an easy task. Existing approaches for doing so predominantly rely on static analysis. However, static analysis is inherently imprecise. The goal of this project is to design and implement a framework for targeted tracing of software execution that considers relating of executed instructions with debug symbols in order to provide more information for the execution trace analysis.

    The basic technology for tracing will be based on DynamoRIO binary instrumentation tool. It will be necessary to enable DynamoRIO to extract debug symbols according to executed instructions. Furthermore, it will be necessary to create an external execution engine of this framework in a language more suitable for doing complex analysis (e.g., Python or Java). The framework should be configurable with options of targeting specific functions and threads for execution tracing, with the possibility of an online configuration (i.e., during the execution). The framework should write the execution trace in a configurable form of JSON files.



  8. GitHub Application for Continuous Engineering
    Contact: Jasmin Jahic


    Continuous software engineering and DevOps are transforming software engineering process in companies. However, from the software architectural point of view, there are no good tools to support this process. Most of the existing tools are monoliths that are not suited for cooperation between team members and frequent updates to models.

    Another issue with the existing tools is that they are far away from code and code development processes. After a while, the gap between the design and implementation becomes inevitable. In order to solve this issue, the idea is to build a set of tools that support architecture processes and integrate it with source code repositories. In the concrete case, we will be using GitHub API. For this purpose, it will be first necessary to choose several challenges related to continuous software engineering, on the software architecture level, and for them implement appropriate support.



  9. Improving Performance of LLVM Interpreter
    Contact: Jasmin Jahic


    LLVM compiler infrastructure is a suitable platform for research projects. One of the most interesting parts of LLVM compiler is its interpreter. It enables a possibility of fine-grained tracing of executed instructions with possibilities for quick quantification of code coverage. However, execution of software with LLVM interpreter introduces a significant overhead. The aim of this project is to identify sources of overhead in LLVM interpreter and implement appropriate changes in order to reduce the overhead to appropriate level. As part of the project, it will be necessary to implement components for writing the traces in JSON trace files.



  10. Support for LLVM interpreter to run multithreaded applications
    Contact: Jasmin Jahic


    LLVM compiler infrastructure is a suitable platform for research projects. One of the most interesting parts of LLVM compiler is its interpreter. It enables a possibility of fine-grained tracing of executed instructions with possibilities for quick quantification of code coverage. However, LLVM interpreter does not support execution of multithreaded software. Furthermore, LLVM interpreter faces issues when executing functions from dynamically and statically linked libraries. The goal of this project is to implement components within the LLVM interpreter and support execution and tracing of multithreaded software.



  11. Timing Model for an ISA Simulation
    Contact: Marno van der Maas


    Instruction-level simulators are convenient to use, easy to understand and widely available, e.g. Spike for RISC-V. However, if performance figures are required, a timing model is needed. Rather than simulate the whole pipeline in detail, one approach is to add a timing model to an ISA simulator, which has the advantage that the timing model doesn’t need to execute the instruction but instead simply deal with classes of instructions. For example, there need be only one simple model to figure out the timing of all R-type arithmetic/logical operations. Similarly, there will be one model for all branches, another for loads/stores, etc. The models could initially be simple: e.g. R-type instructions all take one cycle, but then refined, e.g. taking into account forwarding paths that might introduce load-to-use dependancies on R-type instructions. Similarly, branches might be modelled as always introducing one bubble into the pipeline, through to modelling branch predictors. Memory operations could be timed via a simple statistical hit rate model, or you could model cache hierarchies in various levels of detail.



  12. Graphical System-on-Chip Constructor for Bluespec
    Contact: Jonathan Woodruff


    Several commercial vendors have graphical tools for connecting components of a system on chip. These are generally FPGA-specific, or belong to an expensive commercial ASIC package of tools. One challenge is that the SystemVerilog base language is complex to analyse and difficult to parameterise, so automatic generation of wires and converters can be quite involved. Bluespec SystemVerilog is a recently open-sourced hardware description language which is far more parameterisable than SystemVerilog, and the architecture group has recently developed a general-purpose AXI library in the BlueStuff collection of tools.

    This project would develop a graphical user interface that could generate a system-on-chip fabric to connect Bluespec AXI masters and slaves in an arbitrary arrangement, including busses and width adapters, along with address mappings for each master. In addition to generating Bluespec for the interconnect, this project should also produce Device Tree files to describe the system to be consumed by software using the system. Optionally, the tool may allow editing any of the device tree, graphical description, or the Bluespec source and maintain consistent design across all three.



  13. Graphical Analysis of Bluespec SystemVerilog Designs
    Contact: Jonathan Woodruff


    Bluespec SystemVerilog is a recently open-sourced hardware description language that is used extensively in the Cambridge computer architecture research group for design exploration and implementation. Bluespec SystemVerilog imposes a greater degree of structure on designs than SystemVerilog, typically requiring transactional interfaces and grouping actions into atomic rules. This structure enables greater analysis of the semantics of a design and has been used for formal reasoning about Bluespec designs.

    This project would leverage any existing tools for reasoning about Bluespec designs to construct a graphical representation of an architecture described in Bluespec, displaying all modules hierarchically along with connections between them. Optionally state machines or sets of exclusive rules might be rendered in a useful way. Optionally area and timing from a synthesis of a Bluespec design might be added to the graphical display of the design.



  14. ARM/AArch64/RISC-V backend for Duplo
    Contact: Nandor Licker


    Duplo is a post-link optimiser for OCaml and C, currently targeting only amd64. The framework relies on a Low-Level Intermediate Representation (LLIR) to represent and optimise both C and OCaml, lowering LLIR through the code generator of LLVM. Besides expanding the framework, building a backend for a RISC processor would allow for a comparison between the improvement the LLVM backend brings in for a CISC architecture and the potential speedup on a RISC architecture.



  15. Feedback-directed optimisation for Duplo
    Contact: Nandor Licker


    Feedback-directed optimisation relies on dynamic profile information to improve branch and function layout in a binary to better exploit instruction caches. Profile information gathered using perf and Last Branch Records (LBRs) can be used to identify "hot" and "cold" sections in a program, feeding this information to the compiler to improve layout decisions. The project would either reuse existing tools (such as AutoFDO) or manually process perf files to identify branch weights, passing the information to the LLVM target-specific pipeline for block layout. As an extension, the same information could also be used to improve function placement prior to lowering to LLVM.



  16. Comparison of parallel architectures
    Contact: Peter Rugg


    Modern computing hardware provides parallelism in many forms in the hope that applications can exploit it. Parallelising code is not easy, often requiring a complete rethink of the approach taken to achieve orders-of-magnitude performance gains. This project would take some particular problem and investigate how suited alternative parallel architectures are to solving it. After completing an initial C baseline, parallel implementations can be completed targeting multi-core CPUs, CPU vector instructions, GPUs, and even a custom FPGA implementation. The problem analysed may draw on other aspects of the tripos (bioinformatics, complexity, security, machine-learning/AI) exercising a broad range of skills in addition to architecture. The project could also emphasise evaluation, digging into how and why the implementations perform differently. The architecture group may be able to provide limited access to many-core CPU/GPU servers, as well as FPGA boards.

    Problems investigated in the past: Breaking of Lorenz cipher (17-18), Bioinformatics: sequence alignment (19-20).

    Ideas for future projects: Game AI (Chess? Go?) or Traveling salesman



Older suggestions that might still be useful starting points


  1. Automatic Generation of Synchronisation Primitives for Multithreaded Programs
    Contact: Timothy Jones
    Contact: Jasmin Jahic


    Reasoning about synchronisation mechanisms in multithreaded software is hard. This work aims to create an approach that would free developers from this challenge by generating optimal synchronisation mechanisms for the software under consideration. During this work, it will be necessary to create an approach that reasons about concurrency bugs, potential synchronisation mechanisms, and developer’s synchronisation intentions. It will be necessary to create a conceptual solution for this challenge, an algorithm for generation of synchronisation mechanisms, and a prototype of a tool that demonstrates the benefits of this approach.



  2. KVM-Accelerated Full-System Binary Translation
    Contact: Timothy Jones
    Contact: Gary Guo


    In order for binary-translating emulators to support full-system emulation, they need to translate virtual addresses to physical addresses on each memory access. QEMU handles this by creating a software direct-mapped TLB and generating code that checks this TLB for each memory-access instruction. This software-TLB lookup requires at least two extra memory accesses for each original memory-access instruction and therefore hurts performance.

    The Kernel-based Virtual Machine (KVM) allows userspace applications to run as VM hypervisors. This project will develop a system that lets DBT-ed code execute inside a guest VM, leveraging the CPU's second-level address translation to do the address translation for us. Instead of maintaining a software TLB, a shadow page table would be used, which should provide considerable performance improvements.



  3. RISC-V Implementation in Bluespec
    Contact: Simon Moore
    Contact: David Chisnall


    Note that since this project was suggested we've implemented a number of RISC-V cores in Bluespec including ones extended with CHERI security feactures - see CHERI-RISC-V

    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor which now has a mature design with register forwarding, branch prediction, a MMU, floating point, a dependable cache heirarchy as well as a mature system on chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also attempt or colaborate to develop an LLVM backend for the RISC-V ISA. This project will explore implementation implications of the experimental RISC-V instruction set as well as provide insight into the efficiency of the ISA when running compiled code.

    We have an online web tutor to allow people to quickly learn the Bluespec SystemVerilog hardware description language: http://www-bluespec.cl.cam.ac.uk/



  4. Real FPGA Virtual I/O
    Contact: Simon Moore


    This is a fairly advanced project. VirtIO is used to virtualize devices for virtual machines. It provides an abstraction layer between the guest OS and the virtual machine. Now that we have System-on-Chip (SoC) FPGAs it may be possible to treat the ARM core (running Linux or FreeBSD) as the guest OS and a NIOS core mimicking the virtual machine with shared memory (or a FIFO) between the two. The NIOS could then be replaced with some custom FPGA hardware to consume (or produce) VirtIO. For example, it would be good to have a VirtIO stream to Avalon Stream adaptor, or a VirtIO block device that could scan/search through blocks of data. Such an approach would allow FPGA acceleration while not having to deal with low-level configuration details of a particular device.



  5. Convolutional neural networks on FPGA
    Contact: Jonathan Woodruff (jdw57@cl)


    Develop a convolutional neural network engine on FPGA.. Convolutional neural networks are finding applications in big data learning but are mostly running on standard CPUs or GPUs. This project would design a hardware/software architecture for efficient processing of convolutional neural networks on FPGA using either the BlueVec vector processor or NiosII CPU cores with custom accelerators. BlueVec is an opensource vector processor written in BlueSpec System Verilog for synthesis on FPGA and has been shown to be very efficient for low-precision arithmetic such as that used in convolutional neural networks, and may be a useful starting point for this project.



  6. Using reinforcement learning to achieve backend optimisation for quantum compilers
    Contact: Steven Herbert
    Contact: Timothy Jones
    Contact: Robert Mullins


    With the advent of the first generation of quantum computers (for example Google, Rigetti and IBM in the US; and OQC, Universal quantum and NQIT in the UK), attention has turned to compiling quantum algorithms to run on these machines. The emerging norm is a quantum compiler structure that is, in some important respects closely analogous to that of a classical compiler. In particular, the front-end, optimiser, back-end division has been retained, in which the quantum circuit (typically encoded in openQASM) plays the role of the intermediate representation (IR).

    Just as in the classical case, the backend receives the IR which is in some sense optimised, but in a manner that is not specific to the target hardware. Specifically, quantum circuits consist of two-qubit gates, and implicitly assume complete connectivity - that is, any two pair of qubits can interact (gate together). This, however, contravenes physical constraints, which mean that there will be some limitation on the actual connectivity, for example the qubits may be laid out in a square grid, with each only able to interact with its four neighbours. It is possible to swap qubits, therefore general quantum algorithms can be executed on such machines, but with an execution time cost incurred by the insertion of these swaps. Therefore the role of the quantum compiler backend is to optimally map the IR (which is a quantum circuit assuming complete connectivity) to a target architecture with some known limited connectivity.

    Typically ad-hoc and / or heuristic approaches have been taken to this quantum backend optimisation problem, however we have previously shown that it can be expressed as a reinforcement learning problem, with proof of principle results demonstrating the potential of this technique. The goal of this project is therefore to improve and extend this work.



  7. Use-After-Free Security Mitigations
    Contact: Timothy Jones
    Contact: Sam Ainsworth


    Use-after-free vulnerabilities, where a victim mistakenly reuses data that has been freed and reallocated by an attacker, have plagued software written in low-level languages, such as C and C++, becoming one of the most frequent classes of exploited software bugs. This project aims to mitigate this threat in software at low performance overhead, by extending a state-of-the-art memory allocator to perform efficient checks on programmer-freed memory. This will involve a variety of optimizations based on virtual memory and past system behaviour, to achieve better than state-of-the-art performance in a low-level C allocator.



  8. Binary Alias Analysis
    Contact: Timothy Jones


    One of the (many!) problems with transforming application binaries is that a significant amount of information about the code has been lost during the compilation stages. Analysing and transforming binaries is an important task because it allows us to optimise programs even when we don't have access to the source code. A number of transformations rely on knowing whether two instructions access the same locations in memory, so that we can ensure we maintain application correctness when we perform our optimisations. Recently, a number of analyses have been proposed to identify the memory objects that a program uses, specifically for Intel x86 binaries. The aim of this project is to implement a subset of these analyses for ARM binaries, and develop strategies for coping with facets of the ISA that are not present in x86, like predicated execution. If time permits, this analysis could then be used to implement a code motion transformation to safely move instructions around in the binary.



  9. Ultra-Fine-Grained Parallelisation
    Contact: Timothy Jones


    Automatic program parallelisation is the process of taking a sequential application and splitting it into threads that can be run independently. One of the challenges of automatic parallelisation is defining the amount of code that each thread should execute. This project seeks to take this to the extreme, by splitting up the application into very small units of work, with threads executing many of these tasks each. This could be performed by hand or, ideally, implemented as a pass within an optimising compiler, such as LLVM.



  10. JavaScript Parallelisation
    Contact: Timothy Jones


    This is an ambitious project to evaluate the potential for parallelisation of JavaScript applications. The main idea is to instrument the code generated by a JavaScript engine (e.g. Google's V8) and assess the parallelism available under different scenarios (e.g. across each iteration of loops). We have already developed a basic framework for analysing generic code, but it would need extending and adding to the compiler. To tackle this project, you'd need to be willing to learn about the internals of a dynamic compiler and be confident making changes to incorporate the instrumentation.



  11. Energy-Efficient Caching
    Contact: Timothy Jones


    Processor caches exploit both spatial and temporal locality to reduce the latency of accessing memory. In an ideal cache, when an item of data is brought in it would be free to occupy any position within the cache that it liked, replacing the least recently used value wherever it may be. In reality, fully-associative caches such as these are too costly to implement in hardware. At the other extreme, direct-mapped caches restrict each data item to only one position within the cache, but these suffer from a significant number of conflict misses, when two data items map to the same position. Therefore a compromise is found with some form of set-associativity, usually restricting each data item to a small number of positions.

    Wouldn't it be great to be able to combine the best of both worlds by having fully-associative caches at the cost of direct-mapped hardware? Recent research has pointed to a potential method for achieving this, but as yet nobody has applied the concept to caches. This project aims to be the first to do this. It will perform research into this type of cache hardware, evaluating the trade-offs involved and determining how best to make use of this recent research. The aim is to create a cache that has the low miss rate of a fully-associative cache, but with an energy consumption close to a direct-mapped cache.



  12. Flexible I/O for the lowRISC SoC
    Contact: Robert Mullins


    The lowRISC project (www.lowrisc.org) is aiming to produce a competitive open-source SoC. One important part of this project is the design of an array of simple I/O coprocessors called "Minions". Soft peripheral interfaces can be created by programming these cores. They can also be used to off-load work from the SoC's main processors, e.g. by filtering or preprocessing I/O.

    This project will explore the architecture of the Minions and the thin layer of custom logic that will be placed between the I/O pins and the Minion cores themselves (the I/O shim). The aim will be to develop an implementation that is able to support the widest range of interface types at the lowest cost. This will involve carefully dividing work between the cores and I/O shim, investigating ISA extensions and devising a suitable interface between the minions and I/O shim.

    Slides outlining the lowRISC project can be found here



  13. Exploring Architectural Trade-offs in RISC-V Processors
    Contact: Robert Mullins


    This project will explore complexity, area, power and performance trade-offs for a number of different processor implementations (targeting the RISC-V ISA). Comparisons will be made to public implementations and those created by the student.

    Detailed comparisons will be made using a standard ASIC toolflow.



  14. A Flexible Multipurpose Tagged Memory System
    Contact: Robert Mullins


    The lowRISC project (www.lowrisc.org) is aiming to produce a competitive open-source SoC. We aim to support a simple tagged memory system to provide protection against control-flow hijack attacks. This project would explore the implementation of the tagged memory system and possible other uses for it, e.g. infinite memory watchpoints, garbage collection, accelerating existing debug tools, locks on every word, simple control-flow integrity checks etc.



  15. Optimal Heterogeneous CMP Core Selection
    Contact: Timothy Jones


    As we project into the future, continued increases in transistor counts, coupled with tight processor power constraints, will lead to increased specialisation of cores within a chip multiprocessor (CMP). However, it is still an open question as to what this heterogeneous CMP will look like.

    This project will seek to answer this question by exploring the design space of heterogeneous CMPs. It will use the gem5 simulation infrastructure to run applications on a variety of cores and develop an algorithm to pick the best ones, given constraints such as power or area.



  16. Vectorisation in General Purpose Applications
    Contact: Timothy Jones


    Modern application processors now contain specialised instructions for operating on a vector of data. This is often called single-instruction, multiple data (SIMD) processing, and common forms are the SSE and AVX instructions in x86 processors, or NEON instructions in ARM. Making use of these instructions can provide significant speed ups.

    This project will study the opportunities for vectorisation within general purpose applications, which are traditionally not suited to this kind of processing. It will analyse the loops within each application to determine the inherent vector operations and those that can be exposed through additional compiler transformations. The goal is to expose as many opportunities for vectorisation as possible and, if time allows, implement a vectorisation pass within a compiler to take advantage of these.



  17. RISC-V Implementation in Bluespec
    Contact: Simon Moore Contact: David Chisnall


    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor which now has a mature design with register forwarding, branch prediction, a MMU, floating point, a dependable cache heirarchy as well as a mature system on chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also attempt or colaborate to develop an LLVM backend for the RISC-V ISA. This project will explore implementation implications of the experimental RISC-V instruction set as well as provide insight into the efficiency of the ISA when running compiled code.



  18. A Fast Cache Hierarchy for BERI
    Contact: Simon Moore Contact: Jonathan Woodruff


    The BERI processor, developed at the University of Cambridge, is a 64-bit MIPS processor which is somewhat mature and reliable, but has not so-far been optimised extensively for performance. One of the greatest shortcomings of the current design is cache performance, which only allows a single outstanding transaction.

    This project would implement cache heirarchy for the BERI project that can saturate the bandwidth to DRAM for the Terasic DE4. The student would implement instruction and data L1 caches with a shared L2 cache as well as a traffic generator to test the heirarchy. The caches should be pipelined and allow at least 16 outstanding transactions and should run at a high clockspeed on the Terasic DE4. The caches should be parameterizable for size and possibly for line size and associativity. The traffic generator should be capable of both speed tests and complex patterns to test consistency in the caches. An optional extension would be to exend the caches to support coherency when more than one set of L1 caches is present. The final report should present cache performance with a range of parameters which trade off between area, clock speed, and performance.



  19. Application Scheduling for Heterogeneous Systems
    Contact: Robert Mullins and Timothy Jones


    As energy efficiency becomes the main driver for processor development, heterogeneous systems become attractive, since they allow applications to be scheduled on the cores that best suit their current requirements. Emerging heterogeneous systems include those with close CPU-GPU integration and ARM's big.LITTLE processors.

    The goal of this project is to perform an evaluation of a heterogeneous multicore system using the gem5 simulation environment. It will consider a range of cores to determine the optimal system for a group of multi-threaded and multi-programmed workloads. There should not need to be a significant amount of infrastructure development, since gem5 already includes support for multiple, configurable cores. The results will be an analysis of the types of workloads that benefit from heterogeneity in the processor and how they can be successfully scheduled together.



  20. Speculative Guided Parallelisation of Application Binaries
    Contact: Robert Mullins and Timothy Jones


    With multicore systems now the norm across the computing landscape, and many-core systems on the horizon, it is important for applications to gain performance through parallel execution. However, a significant fraction of existing software is in single-threaded form, and rewriting it to be parallel would be a significant undertaking.

    This project seeks to parallelise applications without needing to alter the program source code. Using dynamic binary instrumentation and rewriting, such as within DynamoRio, it will alter program loops as they execute to allow them to run in parallel. To avoid complicated analysis of each loop, it will employ a form of speculation to catch situations where the code must be executed sequentially. The loops to parallelise will be determined in advance.



  21. Acceleration of the Floyd-Steinberg dithering algorithm
    Contact: Robert Mullins and Timothy Jones


    Applications such as high-speed ink-jet printing need to perform image dithering at Gpixel/s rates. Highly optimised sequential implementations can today only reach ~200Mpixels/sec. This project will explore parallel implementations of the Floyd-Steinberg algorithm, either hand-coded or produced with the aid of an automatic loop parallelisation technique (called HELIX).

    There is scope to extend the project to explore source-to-source transformations that could improve the performance of the HELIX technique.

    [1] PT Metaxas, Optimal parallel error diffusion dithering
    [2] Y Zhang, "Line diffusion: a parallel error diffusion algorithm for digital halftoning"

  22. Scalable Graphics Shader Engine
    Contact: Simon Moore


    Full-system research is becoming practical using FPGAs, and Cambridge is at the forefront with a full CPU and OS stack with a number of peripherals. However modern systems are not complete without an autonomous graphics processing unit with implications for system-on-chip data flow and prioritization, memory allocation and scheduling, and especially security.

    This project would explore the implications of an autonomous graphics processing unit in a system-on-chip architecture. This project would design and build a compact, scalable fragment shader engine in Bluespec SystemVerilog which is able, at least, to apply textures to triangles in a framebuffer. We would recommend an internal 16-bit floating-point pixel format similar to the ARM MALI GPU to save area and improve timing.

    Evaluation could include efficiency and performance as well as novel memory protection or sharing ideas when combined with the Cambridge CHERI 64-bit MIPS processor optionally running FreeBSD.

    [1] ARM, MALI Shader Arithmatic