Computer Laboratory

Old ACS project suggestions

Project suggestions from the Computer Architecture Group

Note that these projects are aimed at those who might like to go on to undertake Ph.D. research in computer architecture.

Multithreaded ARM style processor in Bluespec

Originator: Simon Moore

Research in the Computer Architecture Group focuses on prototyping multiprocessor systems on FPGAs in order to perform architectural evaluation on non-trivial benchmarks. We are currently completing a 64-bit MIT-style processor with multithreaded extensions. We are also working on a project with Prof Steve Furber in Manchester looking at million-processor systems.

Research question: in order to undertake research on multiprocessors whilst using little hardware, can we time multiplex an ARM processor pipeline so that it emulates N processors running at 1/Nth of the speed (e.g. where N might be from 8 to 1024)?

Suggested approach: following on from the Advanced Computer Design course, design an ARM-style processor in Bluespec SystemVerilog. To simplify the design, instructions from a given context should be executed sequentially, i.e. each context has at most one instruction in flight down the pipeline at any one time. To recover performance, a number of contexts (e.g. 128) are scheduled in a round-robin manner; provided there are more contexts than pipeline stages, the pipeline will remain full. The ARMv3 variant of the instruction set is to be targeted, since it is believed to be out of patent (or nearly out of patent) but still has good compiler support. Initially the simpler single-cycle instructions should be implemented, with more complex instructions (e.g. load and store multiple) added as a possible extension.

Possible extensions include adding a timing model to the ARM emulation so that reported cycle counts reflect the time a conventional ARM processor pipeline would take.

Operating system emulation for FPGA based experimental processors

Originator: Simon Moore

Research in the Computer Architecture Group focuses on prototyping multiprocessor systems on FPGAs in order to perform architectural evaluation on non-trivial benchmarks. We are currently completing a 64-bit MIT-style processor with multithreaded extensions. We are also working on a project with Prof Steve Furber in Manchester looking at million-processor systems.

Research question: can we emulate an operating system (OS) for novel processors prototyped on FPGA, so that we do not need to port an OS to every variant of the processor? Further motivation: one of the problems with undertaking research on novel computer architectures is that you do not really want to implement a full operating system on a novel processor, and yet many benchmarks will only run with operating system support. However, OS support is often only required at the beginning and end of a benchmark, not during the performance-critical middle phase of execution. So, provided some mechanism can be found to supply OS support, it does not matter if that mechanism is slow.

Approach: This project aims to provide operating system stubs that allow many benchmarks to run. This is analogous to the commonly used computer architecture simulation technique of providing operating system stubs which "magically" supply OS support. One approach is to replace the base C library (libc) with one containing only stubs, which in turn cause the simulator to handle the OS calls via the host OS it sits on top of. We need something analogous for our FPGA prototypes.

Proposed infrastructure: For this project a prototype system will be implemented on FPGA using a dual-NIOS (soft core from Altera) system (an example of which can be provided by the proposer of this project). The processors can easily be provided with a set of FIFO communication channels for efficient communication. One processor (the "client") will be designated as the processor being "simulated", with a view to replacing it with something more exotic/experimental (e.g. the Mamba multithreaded processor we are using for research) in the future. The other processor, the "server", will provide operating system support and will locally run the MicroC/OS-II RTOS provided by Altera for the NIOS. The client processor will run a newlib (a cut-down libc) stub library (written by this project) which turns operating system calls into remote procedure calls to the "server" processor. The server processor will then handle the calls locally or will, optionally, talk to a host PC to fulfil the request (e.g. to allow remote file access).

Possible extensions include adding profiling support so that statistics on the OS calls can be gathered, and adding timing effects, e.g. the "simulation"/"client" processor can be delayed by a fixed number of clock cycles so that it is "billed" for time spent in the OS.

Exploring the Limits of Snooping Coherence Mechanisms

Supervisor: Robert Mullins

Originator: Robert Mullins

Industry has made a step-change in microprocessor design strategy by no longer seeking additional performance from individual processor cores. Future performance growth will now come from an exponential increase in the number of cores placed on each die.

The success of this strategy in the long term is threatened by both the complexity and power-consumption of the non-computational components of these chip multiprocessors, namely the on-chip network, the cache-coherency hardware and the off-chip interfaces. This project will focus on the design of the cache-coherency hardware and on-chip networks in particular.

While high-end specialized systems may shift coherence issues into the compiler by restricting the choice of programming language and requiring more extensive application tuning, cache coherency is still important for a wide range of general-purpose systems [1]. Bus-based schemes are an appealing solution due to their simplicity, but typically have limited scalability and high power requirements.

This project will explore the limits of simple cache coherence mechanisms. Various techniques to improve their scalability will be evaluated, including the use of snoop filtering techniques [3,4] and a combination of totally ordered and point-to-point interconnects (see seminars 5 and 7). The limits of any potential gains from a static analysis of the program could also be explored. Results will be based on memory access traces from the PARSEC benchmark suite [2], which have already been generated.

References and Recommended Reading

[1] "Improving the accuracy of snoop filtering using stream registers", V. Salapura, M. Blumrich and A. Gara, MEDEA, 2007.

[2] "The PARSEC benchmark suite: Characterization and architectural implications", C. Bienia, S. Kumar, J. P. Singh and K. Li, Proc. of the 17th Intl. Conf. on Parallel architectures and compilation techniques (PACT), pp 72-81, 2008.

You might also like to take a look at the PARSEC webpage.

[3] "JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers", A. Moshovos, G. Memik, B. Falsafi and A. Choudhary, HPCA, 2001.

[4] "Design and implementation of the Blue Gene/P snoop filter", V. Salapura, M. Blumrich and A. Gara, HPCA, 2008.

Exploring the Costs of Deterministic Multithreading

Supervisor: Robert Mullins

Originator: Robert Mullins

The majority of general-purpose parallel applications are created by describing a number of parallel threads which communicate via a single shared-memory. This concurrency model reflects the hardware organisation of the majority of today's general-purpose chip-multiprocessors.

Programming with threads is particularly challenging as it is not normally possible to guarantee that a program will produce the same results each time it is run. This is true even when the input to the program remains unchanged. Even a small difference in timing at run-time, or differences in the initial state of the system, can perturb execution in a way that can lead to a different output. This inherent non-determinism makes it difficult to test, debug and maintain multithreaded code [1].

This project will explore how deterministic multithreading can be enforced and the performance overheads of imposing and managing such a restriction on the execution of threads. A number of schemes have already been proposed that could provide a starting point for your research [2][3]. An implementation may also choose to exploit the PIN dynamic binary instrumentation tool [4].

References and Recommended Reading

[1] "The Problem with Threads", Edward A. Lee, IEEE Computer, 39(5):33-42, May 2006.

There's also a technical report UCB/EECS-2006-1, 2006.

[2] "Kendo: Efficient Deterministic Multithreading in Software", M. Olszewski, J. Ansel and S. Amarasinghe, ASPLOS, 2009.

[3] "DMP: Deterministic Shared Memory Multiprocessing", J. Devietti, B. Lucia, L. Ceze and M. Oskin. ASPLOS, 2009.

[4] The Pin tool for dynamic instrumentation of programs

Verifying Cache Coherency Protocols

Supervisor: Robert Mullins

Originator: Robert Mullins

This project involves the use of model checking tools to verify complex cache coherency protocols. There is potentially an industrial link with the promise of some support/collaboration. Please come and talk to me if you would be interested in a project in this area.

You will need to have both a good grasp of the theoretical aspects of computer science and have completed the chip-multiprocessor module (or have prior experience of directory-based cache coherency protocols) to tackle this project.