Computer Laboratory

Computer Architecture Group

Computer Architecture ACS Project Suggestions (2013-2014)

Please contact the proposer(s) by email if you are interested in any of the projects below.

Please check here again over the coming weeks for more projects. We would also be happy to consider any project ideas of your own.

  1. Optimal Heterogeneous CMP Core Selection
    Contact: Timothy Jones

    Looking ahead, continued increases in transistor counts, coupled with tight processor power constraints, will lead to increased specialisation of cores within a chip multiprocessor (CMP). However, it is still an open question what this heterogeneous CMP will look like.

    This project will seek to answer this question by exploring the design space of heterogeneous CMPs. It will use the gem5 simulation infrastructure to run applications on a variety of cores and develop an algorithm to pick the best ones, given constraints such as power or area.
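
    The selection step can be pictured as a constrained optimisation. The following is a minimal sketch only, assuming each candidate core has already been simulated and summarised by a performance score, a power figure, and an area figure. The core names and numbers are invented for illustration, performance is assumed (simplistically) to add up across cores, and exhaustive search is just one possible approach, not the algorithm the project prescribes.

```python
from itertools import combinations_with_replacement
from typing import NamedTuple


class Core(NamedTuple):
    name: str
    perf: float   # hypothetical throughput score taken from simulation
    power: float  # watts (made-up figure)
    area: float   # mm^2 (made-up figure)


def best_mix(cores, n_slots, power_budget, area_budget):
    """Try every multiset of 1..n_slots cores and return the one with the
    highest summed performance that fits both the power and area budgets."""
    best, best_perf = (), 0.0
    for k in range(1, n_slots + 1):
        for mix in combinations_with_replacement(cores, k):
            power = sum(c.power for c in mix)
            area = sum(c.area for c in mix)
            perf = sum(c.perf for c in mix)
            if power <= power_budget and area <= area_budget and perf > best_perf:
                best, best_perf = mix, perf
    return best, best_perf


# Illustrative candidates only; real values would come from gem5 runs.
CANDIDATES = [
    Core("big", perf=4.0, power=3.0, area=6.0),
    Core("medium", perf=2.5, power=1.5, area=3.0),
    Core("little", perf=1.0, power=0.5, area=1.0),
]

mix, perf = best_mix(CANDIDATES, n_slots=4, power_budget=5.0, area_budget=10.0)
print([c.name for c in mix], perf)
```

    Exhaustive search is only practical for a handful of core types and slots; a larger design space would need a heuristic or a proper optimisation method.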

  2. Vectorisation in General Purpose Applications
    Contact: Timothy Jones

    Modern application processors now contain specialised instructions for operating on a vector of data. This is often called single instruction, multiple data (SIMD) processing, and common forms are the SSE and AVX instructions in x86 processors, or NEON instructions in ARM. Making use of these instructions can provide significant speed-ups.

    This project will study the opportunities for vectorisation within general purpose applications, which are traditionally not suited to this kind of processing. It will analyse the loops within each application to determine the inherent vector operations and those that can be exposed through additional compiler transformations. The goal is to expose as many opportunities for vectorisation as possible and, if time allows, implement a vectorisation pass within a compiler to take advantage of these.
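
    The distinction between loops that vectorise directly and loops that need transformation first can be illustrated with a small model. This is a sketch under stated assumptions: Python lists stand in for arrays, a chunk of `VECTOR_WIDTH` elements stands in for one SIMD operation in which every lane reads its inputs before any lane writes, and all function names are hypothetical. A real analysis would work on compiler intermediate code, not on Python.

```python
VECTOR_WIDTH = 4  # assumed lane count, e.g. four 32-bit values in a 128-bit register


def scale_add_scalar(a, b, k):
    """Each iteration touches only a[i] and b[i]: iterations are independent."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = k * a[i] + b[i]
    return out


def scale_add_vector(a, b, k, width=VECTOR_WIDTH):
    """The same loop executed one chunk of `width` lanes at a time."""
    out = [0] * len(a)
    for start in range(0, len(a), width):
        lanes = range(start, min(start + width, len(a)))
        results = [k * a[i] + b[i] for i in lanes]  # all lanes read first
        for i, r in zip(lanes, results):            # then all lanes write
            out[i] = r
    return out


def running_sum_scalar(a):
    """Iteration i reads out[i - 1]: a loop-carried dependence."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + out[i]
    return out


def running_sum_naive_vector(a, width=VECTOR_WIDTH):
    """Chunking the dependent loop naively: lanes read stale out[i - 1]."""
    out = list(a)
    for start in range(1, len(out), width):
        lanes = range(start, min(start + width, len(out)))
        results = [out[i - 1] + out[i] for i in lanes]
        for i, r in zip(lanes, results):
            out[i] = r
    return out


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
print(scale_add_scalar(a, b, 2) == scale_add_vector(a, b, 2))
print(running_sum_scalar(a) == running_sum_naive_vector(a))
```

    The first comparison holds because the iterations are independent; the second fails because the naive chunking breaks the dependence chain, which is the kind of loop that would need an additional transformation before it could be vectorised.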

  3. RISC-V Implementation in Bluespec
    Contact: Simon Moore and David Chisnall

    The University of California, Berkeley is developing the RISC-V open instruction set architecture to promote open source research into computer architecture, but their current implementations are simple, unproven, user-mode designs. At the Cambridge Computer Laboratory, we have been developing the BERI 64-bit MIPS processor, which now has a mature design with register forwarding, branch prediction, an MMU, floating point, and a dependable cache hierarchy, as well as a mature system on chip.

    This project would implement the RISC-V ISA instead of the 64-bit MIPS ISA using the BERI infrastructure. The base project would include user-mode, 32-bit instructions. Optional extensions, of which at least one should be attempted, include floating point instructions, 16-bit instructions, and full system support (which is preliminary in the specification). The resulting processor should be able to run code compiled with riscv-gcc from Berkeley. The student may also develop, or collaborate on developing, an LLVM backend for the RISC-V ISA. This project will explore the implementation implications of the experimental RISC-V instruction set and provide insight into the efficiency of the ISA when running compiled code.

  4. A Fast Cache Hierarchy for BERI
    Contact: Simon Moore and Jonathan Woodruff

    The BERI processor, developed at the University of Cambridge, is a 64-bit MIPS processor which is somewhat mature and reliable, but has not so far been optimised extensively for performance. One of the greatest shortcomings of the current design is cache performance: the cache allows only a single outstanding transaction.

    This project would implement a cache hierarchy for the BERI project that can saturate the bandwidth to DRAM for the Terasic DE4. The student would implement instruction and data L1 caches with a shared L2 cache, as well as a traffic generator to test the hierarchy. The caches should be pipelined, allow at least 16 outstanding transactions, and run at a high clock speed on the Terasic DE4. The caches should be parameterizable for size and possibly for line size and associativity. The traffic generator should be capable of both speed tests and complex patterns to test consistency in the caches. An optional extension would be to extend the caches to support coherency when more than one set of L1 caches is present. The final report should present cache performance with a range of parameters which trade off between area, clock speed, and performance.

Computer Architecture ACS Project Suggestions (2012-2013)

The following projects from previous years may provide inspiration for your own ideas.

  1. Application Scheduling for Heterogeneous Systems
    Contact: Robert Mullins and Timothy Jones

    As energy efficiency becomes the main driver for processor development, heterogeneous systems become attractive, since they allow applications to be scheduled on the cores that best suit their current requirements. Emerging heterogeneous systems include those with close CPU-GPU integration and ARM's big.LITTLE processors.

    The goal of this project is to perform an evaluation of a heterogeneous multicore system using the gem5 simulation environment. It will consider a range of cores to determine the optimal system for a group of multi-threaded and multi-programmed workloads. There should not need to be a significant amount of infrastructure development, since gem5 already includes support for multiple, configurable cores. The results will be an analysis of the types of workloads that benefit from heterogeneity in the processor and how they can be successfully scheduled together.

  2. Speculative Guided Parallelisation of Application Binaries
    Contact: Robert Mullins and Timothy Jones

    With multicore systems now the norm across the computing landscape, and many-core systems on the horizon, it is important for applications to gain performance through parallel execution. However, a significant fraction of existing software is in single-threaded form, and rewriting it to be parallel would be a significant undertaking.

    This project seeks to parallelise applications without needing to alter the program source code. Using dynamic binary instrumentation and rewriting, such as within DynamoRio, it will alter program loops as they execute to allow them to run in parallel. To avoid complicated analysis of each loop, it will employ a form of speculation to catch situations where the code must be executed sequentially. The loops to parallelise will be determined in advance.

  3. Acceleration of the Floyd-Steinberg dithering algorithm
    Contact: Robert Mullins and Timothy Jones

    Applications such as high-speed ink-jet printing need to perform image dithering at Gpixel/s rates. Highly optimised sequential implementations can today only reach ~200 Mpixel/s. This project will explore parallel implementations of the Floyd-Steinberg algorithm, either hand-coded or produced with the aid of an automatic loop parallelisation technique (called HELIX).

    There is scope to extend the project to explore source-to-source transformations that could improve the performance of the HELIX technique.

    [1] P. T. Metaxas, "Optimal parallel error diffusion dithering"
    [2] Y. Zhang, "Line diffusion: a parallel error diffusion algorithm for digital halftoning"
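
    For reference, the sequential Floyd-Steinberg algorithm can be sketched as below. This is a plain, unoptimised illustration, assuming a greyscale image stored as rows of floats in [0, 1] and a fixed threshold of 0.5; the function name is hypothetical and the code is unrelated to HELIX or to any optimised implementation. It mainly shows the data dependences that a parallel version has to respect: each pixel receives error from its left neighbour and from three pixels in the row above.

```python
def floyd_steinberg(image):
    """Dither a greyscale image (rows of floats in [0, 1]) to 0.0/1.0 values.

    Pixels are visited in raster order; the quantisation error of each pixel
    is diffused to four not-yet-visited neighbours with the standard weights
    7/16 (right), 3/16 (below-left), 5/16 (below), and 1/16 (below-right).
    """
    height, width = len(image), len(image[0])
    px = [list(row) for row in image]  # work on a copy, leave the input intact
    for y in range(height):
        for x in range(width):
            old = px[y][x]
            new = 1.0 if old >= 0.5 else 0.0
            px[y][x] = new
            err = old - new
            if x + 1 < width:
                px[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    px[y + 1][x - 1] += err * 3 / 16
                px[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    px[y + 1][x + 1] += err * 1 / 16
    return px


grey = [[0.5] * 8 for _ in range(8)]  # uniform mid-grey test image
dithered = floyd_steinberg(grey)
for row in dithered:
    print("".join("#" if v else "." for v in row))
```

    Because pixel (x, y) depends on (x-1, y) and on (x-1, y-1), (x, y-1), and (x+1, y-1), rows cannot simply be processed independently; any parallel schedule has to honour these dependences or change the diffusion pattern.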

  4. Scalable Graphics Shader Engine
    Contact: Simon Moore

    Full-system research is becoming practical using FPGAs, and Cambridge is at the forefront with a full CPU and OS stack with a number of peripherals. However, modern systems are not complete without an autonomous graphics processing unit, which has implications for system-on-chip data flow and prioritization, memory allocation and scheduling, and especially security.

    This project would explore the implications of an autonomous graphics processing unit in a system-on-chip architecture. It would design and build a compact, scalable fragment shader engine in Bluespec SystemVerilog which is able, at least, to apply textures to triangles in a framebuffer. We would recommend an internal 16-bit floating-point pixel format, similar to that of the ARM Mali GPU, to save area and improve timing.

    Evaluation could include efficiency and performance, as well as novel memory protection or sharing ideas when combined with the Cambridge CHERI 64-bit MIPS processor, optionally running FreeBSD.

    [1] ARM, Mali Shader Arithmetic