Computer Laboratory

Computer Architecture Group

Computer Architecture ACS Project Suggestions

Please contact the proposer(s) by email if you are interested in any of the projects below. In addition, some of the projects from previous years may still be suitable and interesting. Please remember that these are just starting points that suggest possible directions for the research. Check back here over the coming weeks for more projects. We would also be happy to consider any project ideas of your own.

  1. Binary Alias Analysis
    Contact: Timothy Jones

    One of the (many!) problems with transforming application binaries is that a significant amount of information about the code has been lost during the compilation stages. Analysing and transforming binaries is an important task because it allows us to optimise programs even when we don't have access to the source code. A number of transformations rely on knowing whether two instructions access the same locations in memory, so that we can ensure we maintain application correctness when we perform our optimisations. Recently, a number of analyses have been proposed to identify the memory objects that a program uses, specifically for Intel x86 binaries. The aim of this project is to implement a subset of these analyses for ARM binaries, and develop strategies for coping with facets of the ISA that are not present in x86, like predicated execution. If time permits, this analysis could then be used to implement a code motion transformation to safely move instructions around in the binary.
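
    As a minimal illustration of the question such an analysis has to answer, the sketch below tests whether two memory accesses may overlap when each is described by a symbolic base, a constant offset and a size. The `MemAccess` type and the `may_alias` function are hypothetical names, and the conservative rule (unknown or differing bases may alias) is an assumption made for illustration, not one of the published analyses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemAccess:
    """A memory access of `size` bytes at `base + offset`.

    `base` names a symbolic base (e.g. the stack pointer on function
    entry); None means the analysis could not determine it.
    """
    base: Optional[str]
    offset: int
    size: int


def may_alias(a: MemAccess, b: MemAccess) -> bool:
    """Conservatively decide whether two accesses may touch the same bytes."""
    # Unknown or differing bases: independence cannot be proven here,
    # so assume the accesses may overlap.
    if a.base is None or b.base is None or a.base != b.base:
        return True
    # Same base: the accesses overlap iff their byte ranges intersect.
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


store = MemAccess("sp", offset=8, size=4)
load_far = MemAccess("sp", offset=12, size=4)
load_near = MemAccess("sp", offset=10, size=4)
print(may_alias(store, load_far))   # False: [8, 12) and [12, 16) are disjoint
print(may_alias(store, load_near))  # True: [8, 12) and [10, 14) overlap
```

    A code motion transformation could only reorder the store and a load when `may_alias` returns False for the pair.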

  2. Ultra-Fine-Grained Parallelisation
    Contact: Timothy Jones

    Automatic program parallelisation is the process of taking a sequential application and splitting it into threads that can be run independently. One of the challenges of automatic parallelisation is defining the amount of code that each thread should execute. This project seeks to take this to the extreme, by splitting up the application into very small units of work, with threads executing many of these tasks each. This could be performed by hand or, ideally, implemented as a pass within an optimising compiler, such as LLVM.
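
    The decomposition described above can be sketched by hand as follows. This is only an illustration of splitting a loop into very small tasks that a few threads work through; the function names and the sum-of-squares workload are hypothetical, and Python threads are used for clarity rather than speed.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(n_items, task_size):
    """Split the index range [0, n_items) into tasks of at most task_size items."""
    return [range(start, min(start + task_size, n_items))
            for start in range(0, n_items, task_size)]


def run_task(data, indices):
    """One small unit of work: sum the squares of a few elements."""
    return sum(data[i] * data[i] for i in indices)


def parallel_sum_of_squares(data, task_size=4, n_threads=4):
    """Execute many tiny tasks on a small pool of threads and combine the results."""
    tasks = make_tasks(len(data), task_size)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each thread executes many of the small tasks in turn.
        partials = pool.map(lambda task: run_task(data, task), tasks)
        return sum(partials)


data = list(range(100))
print(parallel_sum_of_squares(data, task_size=3))
```

    Shrinking `task_size` increases the number of tasks and therefore the scheduling overhead per unit of useful work, which is the trade-off the project would explore.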

  3. Colossus-like code breaking machines
    Contact: Simon Moore

    The Colossus code breaking machine was so pivotal to the World War II code breaking efforts at Bletchley Park that its very existence was kept secret for 50 years. The machine was reconstructed in the 1990s by Tony Sale at The National Museum of Computing from illegally kept photographs and a few other documents. This project is to explore software and/or hardware implementations of Colossus-like machines. For example, Joachim Schueth produced an award-winning program to beat Colossus using a PC (see his Ada code and example input). There are a number of possible subprojects that could be joined into a set to form an interesting researchy project (e.g. 1+2 with 3 as an option would make one good project, or 4+5 with 6 as an option, or some other combination):

    1. Produce an efficient Java/C/? version of the code breaking code based on Joachim's code taking cyphertext (not the radio broadcast Morse code) as input.
    2. Write a parallel version of the code breaking code.
    3. Write a GPU version of the code breaking code.
    4. Research the principal algorithmic approach taken by Colossus and write a simulator which is functionally faithful.
    5. Produce an FPGA implementation of a Colossus which is functionally similar to the original valve version. There is a partial version (when I looked in June 2016) -
    6. Explore the use of more modern computer architecture techniques and the use of a large FPGA to produce a high performance modern Colossus.

  4. Real FPGA Virtual I/O
    Contact: Simon Moore

    This is a fairly advanced project. VirtIO is used to virtualize devices for virtual machines. It provides an abstraction layer between the guest OS and the virtual machine. Now that we have System-on-Chip (SoC) FPGAs it may be possible to treat the ARM core (running Linux or FreeBSD) as the guest OS and a NIOS core mimicking the virtual machine with shared memory (or a FIFO) between the two. The NIOS could then be replaced with some custom FPGA hardware to consume (or produce) VirtIO. For example, it would be good to have a VirtIO stream to Avalon Stream adaptor, or a VirtIO block device that could scan/search through blocks of data. Such an approach would allow FPGA acceleration while not having to deal with low-level configuration details of a particular device.

  5. JavaScript Parallelisation
    Contact: Timothy Jones

    This is an ambitious project to evaluate the potential for parallelisation of JavaScript applications. The main idea is to instrument the code generated by a JavaScript engine (e.g. Google's V8) and assess the parallelism available under different scenarios (e.g. across each iteration of loops). We have already developed a basic framework for analysing generic code, but it would need extending and adding to the compiler. To tackle this project, you'd need to be willing to learn about the internals of a dynamic compiler and be confident making changes to incorporate the instrumentation.

  6. Convolutional neural networks on FPGA
    Contact: Jonathan Woodruff (jdw57@cl)

    Develop a convolutional neural network engine on FPGA. Convolutional neural networks are finding applications in big data learning but mostly run on standard CPUs or GPUs. This project would design a hardware/software architecture for efficient processing of convolutional neural networks on FPGA using either the BlueVec vector processor or NiosII CPU cores with custom accelerators. BlueVec is an open-source vector processor written in Bluespec SystemVerilog for synthesis on FPGA. It has been shown to be very efficient for low-precision arithmetic such as that used in convolutional neural networks, and may be a useful starting point for this project.

Older Project Suggestions

The following is a list of older projects from previous years that may have been attempted already, but could be built upon or provide inspiration for your own ideas.

  1. Energy-Efficient Caching
    Contact: Timothy Jones

    Processor caches exploit both spatial and temporal locality to reduce the latency of accessing memory. In an ideal cache, when an item of data is brought in it would be free to occupy any position within the cache that it liked, replacing the least recently used value wherever it may be. In reality, fully-associative caches such as these are too costly to implement in hardware. At the other extreme, direct-mapped caches restrict each data item to only one position within the cache, but these suffer from a significant number of conflict misses, when two data items map to the same position. Therefore a compromise is found with some form of set-associativity, usually restricting each data item to a small number of positions.
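
    The effect of associativity on conflict misses can be seen with a small simulation. The sketch below is a simplified model written for illustration (LRU replacement, hypothetical function name and parameters), not a description of any particular hardware.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways, line_size=64):
    """Count misses for an LRU cache with `num_lines` lines, `ways` lines per set.

    ways == 1 models a direct-mapped cache; ways == num_lines models a
    fully-associative cache.
    """
    num_sets = num_lines // ways
    # One LRU-ordered dict of resident tags per set (oldest entry first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index, tag = line % num_sets, line // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


# Two lines that map to the same position in a 4-line direct-mapped cache.
trace = [0, 4 * 64] * 8
for ways in (1, 2, 4):
    print(ways, count_misses(trace, num_lines=4, ways=ways))
```

    With this trace the direct-mapped configuration misses on all 16 accesses, because the two lines keep evicting each other, while the 2-way and fully-associative configurations miss only on the first access to each line.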

    Wouldn't it be great to be able to combine the best of both worlds by having fully-associative caches at the cost of direct-mapped hardware? Recent research has pointed to a potential method for achieving this, but as yet nobody has applied the concept to caches. This project aims to be the first to do this. It will perform research into this type of cache hardware, evaluating the trade-offs involved and determining how best to make use of this recent research. The aim is to create a cache that has the low miss rate of a fully-associative cache, but with an energy consumption close to a direct-mapped cache.

  2. Flexible I/O for the lowRISC SoC
    Contact: Robert Mullins

    The lowRISC project is aiming to produce a competitive open-source SoC. One important part of this project is the design of an array of simple I/O coprocessors called "Minions". Soft peripheral interfaces can be created by programming these cores. They can also be used to off-load work from the SoC's main processors, e.g. by filtering or preprocessing I/O.

    This project will explore the architecture of the Minions and the thin layer of custom logic that will be placed between the I/O pins and the Minion cores themselves (the I/O shim). The aim will be to develop an implementation that is able to support the widest range of interface types at the lowest cost. This will involve carefully dividing work between the cores and I/O shim, investigating ISA extensions and devising a suitable interface between the minions and I/O shim.

  3. Exploring Architectural Trade-offs in RISC-V Processors
    Contact: Robert Mullins

    This project will explore complexity, area, power and performance trade-offs for a number of different processor implementations (targeting the RISC-V ISA). Comparisons will be made to public implementations and those created by the student.

    Detailed comparisons will be made using a standard ASIC toolflow.

  4. A Flexible Multipurpose Tagged Memory System
    Contact: Robert Mullins

    The lowRISC project is aiming to produce a competitive open-source SoC. We aim to support a simple tagged memory system to provide protection against control-flow hijack attacks. This project would explore the implementation of the tagged memory system and possible other uses for it, e.g. infinite memory watchpoints, garbage collection, accelerating existing debug tools, locks on every word, simple control-flow integrity checks etc.

  5. Optimal Heterogeneous CMP Core Selection
    Contact: Timothy Jones

    As we project into the future, continued increases in transistor counts, coupled with tight processor power constraints, will lead to increased specialisation of cores within a chip multiprocessor (CMP). However, it is still an open question as to what this heterogeneous CMP will look like.

    This project will seek to answer this question by exploring the design space of heterogeneous CMPs. It will use the gem5 simulation infrastructure to run applications on a variety of cores and develop an algorithm to pick the best ones, given constraints such as power or area.

  6. Vectorisation in General Purpose Applications
    Contact: Timothy Jones

    Modern application processors now contain specialised instructions for operating on a vector of data. This is often called single-instruction, multiple data (SIMD) processing, and common forms are the SSE and AVX instructions in x86 processors, or NEON instructions in ARM. Making use of these instructions can provide significant speed ups.

    This project will study the opportunities for vectorisation within general purpose applications, which are traditionally not suited to this kind of processing. It will analyse the loops within each application to determine the inherent vector operations and those that can be exposed through additional compiler transformations. The goal is to expose as many opportunities for vectorisation as possible and, if time allows, implement a vectorisation pass within a compiler to take advantage of these.

  7. RISC-V Implementation in Bluespec
    Contact: Simon Moore and David Chisnall

    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor, which now has a mature design with register forwarding, branch prediction, an MMU, floating point, a dependable cache hierarchy as well as a mature system on chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also attempt or collaborate to develop an LLVM backend for the RISC-V ISA. This project will explore implementation implications of the experimental RISC-V instruction set as well as provide insight into the efficiency of the ISA when running compiled code.

  8. A Fast Cache Hierarchy for BERI
    Contact: Simon Moore and Jonathan Woodruff

    The BERI processor, developed at the University of Cambridge, is a 64-bit MIPS processor which is somewhat mature and reliable, but has not so far been optimised extensively for performance. One of the greatest shortcomings of the current design is cache performance, since the cache only allows a single outstanding transaction.

    This project would implement a cache hierarchy for the BERI project that can saturate the bandwidth to DRAM for the Terasic DE4. The student would implement instruction and data L1 caches with a shared L2 cache as well as a traffic generator to test the hierarchy. The caches should be pipelined, allow at least 16 outstanding transactions and run at a high clock speed on the Terasic DE4. The caches should be parameterizable for size and possibly for line size and associativity. The traffic generator should be capable of both speed tests and complex patterns to test consistency in the caches. An optional extension would be to extend the caches to support coherency when more than one set of L1 caches is present. The final report should present cache performance with a range of parameters which trade off between area, clock speed, and performance.

  9. Application Scheduling for Heterogeneous Systems
    Contact: Robert Mullins and Timothy Jones

    As energy efficiency becomes the main driver for processor development, heterogeneous systems become attractive, since they allow applications to be scheduled on the cores that best suit their current requirements. Emerging heterogeneous systems include those with close CPU-GPU integration and ARM's big.LITTLE processors.

    The goal of this project is to perform an evaluation of a heterogeneous multicore system using the gem5 simulation environment. It will consider a range of cores to determine the optimal system for a group of multi-threaded and multi-programmed workloads. There should not need to be a significant amount of infrastructure development, since gem5 already includes support for multiple, configurable cores. The results will be an analysis of the types of workloads that benefit from heterogeneity in the processor and how they can be successfully scheduled together.

  10. Speculative Guided Parallelisation of Application Binaries
    Contact: Robert Mullins and Timothy Jones

    With multicore systems now the norm across the computing landscape, and many-core systems on the horizon, it is important for applications to gain performance through parallel execution. However, a significant fraction of existing software is in single-threaded form, and rewriting it to be parallel would be a significant undertaking.

    This project seeks to parallelise applications without needing to alter the program source code. Using dynamic binary instrumentation and rewriting, such as within DynamoRIO, it will alter program loops as they execute to allow them to run in parallel. To avoid complicated analysis of each loop, it will employ a form of speculation to catch situations where the code must be executed sequentially. The loops to parallelise will be determined in advance.
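
    One possible form of such speculation is sketched below at a high level. It is an assumption made for illustration rather than the scheme the project would use: each iteration runs against a snapshot with buffered writes, and the loop falls back to sequential execution if any iteration read a location that an earlier iteration wrote. All class and function names are hypothetical, and the iterations are run one after another here to keep the example deterministic.

```python
class LoggedMemory:
    """View of shared state that buffers writes and records the keys it reads."""

    def __init__(self, snapshot):
        self._snapshot = snapshot
        self.reads, self.writes = set(), {}

    def load(self, key):
        if key in self.writes:           # read our own buffered write
            return self.writes[key]
        self.reads.add(key)
        return self._snapshot[key]

    def store(self, key, value):
        self.writes[key] = value


class DirectMemory:
    """Unbuffered view used for the sequential fallback."""

    def __init__(self, memory):
        self._memory = memory

    def load(self, key):
        return self._memory[key]

    def store(self, key, value):
        self._memory[key] = value


def run_speculatively(body, iterations, memory):
    """Run body(i, mem) for each i; return True if speculation succeeded."""
    iterations = list(iterations)
    snapshot = dict(memory)
    logs = []
    for i in iterations:                 # conceptually, these run in parallel
        log = LoggedMemory(snapshot)
        body(i, log)
        logs.append(log)

    written = set()
    for log in logs:                     # validate in original program order
        if log.reads & written:
            # A later iteration read a stale value: discard the buffered
            # writes and re-execute the whole loop sequentially.
            for i in iterations:
                body(i, DirectMemory(memory))
            return False
        written.update(log.writes)
    for log in logs:                     # commit buffered writes in order
        memory.update(log.writes)
    return True


def double(i, mem):                      # iterations are independent
    mem.store(i, mem.load(i) * 2)


def prefix(i, mem):                      # iteration i depends on iteration i - 1
    mem.store(i, mem.load(i - 1) + 1)


a = {0: 1, 1: 2, 2: 3, 3: 4}
print(run_speculatively(double, range(4), a), a)
b = {0: 0, 1: 0, 2: 0, 3: 0}
print(run_speculatively(prefix, range(1, 4), b), b)
```

    The first loop commits its speculative results, while the second is detected as carrying a dependence between iterations and is re-run in order, so both end with the same state as a sequential execution.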

  11. Acceleration of the Floyd-Steinberg dithering algorithm
    Contact: Robert Mullins and Timothy Jones

    Applications such as high-speed ink-jet printing need to perform image dithering at Gpixel/s rates. Highly optimised sequential implementations can today only reach ~200 Mpixel/s. This project will explore parallel implementations of the Floyd-Steinberg algorithm, either hand-coded or produced with the aid of an automatic loop parallelisation technique (called HELIX).

    There is scope to extend the project to explore source-to-source transformations that could improve the performance of the HELIX technique.
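
    For reference, a straightforward sequential version of the algorithm is sketched below, assuming a greyscale image stored as rows of values in 0..255 and a fixed threshold of 128 (both choices are assumptions for illustration). It makes the difficulty visible: each pixel receives error from its left neighbour and from three pixels in the row above, so pixels cannot simply be processed independently.

```python
def floyd_steinberg(image):
    """Dither a greyscale image (rows of values in 0..255) to black and white.

    Returns a new list of rows containing only 0 and 255.
    """
    height, width = len(image), len(image[0])
    # Work on a float copy so that diffused error can accumulate.
    pixels = [[float(v) for v in row] for row in image]
    for y in range(height):
        for x in range(width):
            old = pixels[y][x]
            new = 255.0 if old >= 128 else 0.0
            pixels[y][x] = new
            err = old - new
            # Distribute the quantisation error to unprocessed neighbours
            # using the standard 7/16, 3/16, 5/16 and 1/16 weights.
            if x + 1 < width:
                pixels[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    pixels[y + 1][x - 1] += err * 3 / 16
                pixels[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    pixels[y + 1][x + 1] += err * 1 / 16
    return [[int(v) for v in row] for row in pixels]


grey = [[64] * 8 for _ in range(8)]
for row in floyd_steinberg(grey):
    print("".join("#" if v else "." for v in row))
```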

    [1] P. T. Metaxas, "Optimal parallel error diffusion dithering"
    [2] Y. Zhang, "Line diffusion: a parallel error diffusion algorithm for digital halftoning"

  12. Scalable Graphics Shader Engine
    Contact: Simon Moore

    Full-system research is becoming practical using FPGAs, and Cambridge is at the forefront with a full CPU and OS stack with a number of peripherals. However, modern systems are not complete without an autonomous graphics processing unit, which has implications for system-on-chip data flow and prioritization, memory allocation and scheduling, and especially security.

    This project would explore the implications of an autonomous graphics processing unit in a system-on-chip architecture. This project would design and build a compact, scalable fragment shader engine in Bluespec SystemVerilog which is able, at least, to apply textures to triangles in a framebuffer. We would recommend an internal 16-bit floating-point pixel format similar to the ARM MALI GPU to save area and improve timing.

    Evaluation could include efficiency and performance as well as novel memory protection or sharing ideas when combined with the Cambridge CHERI 64-bit MIPS processor optionally running FreeBSD.

    [1] ARM, MALI Shader Arithmetic