I am a Research Associate in the Computer Architecture Group, supervised by Dr Robert Mullins. During my PhD, I explored possible designs for a massively parallel processing fabric as part of the Loki project. I am now refining the chosen configuration and working towards a physical implementation.
The fabric's design lies somewhere between an FPGA, a GPU and a multicore processor: hundreds or thousands of very simple cores connected by an on-chip network. Efficient direct communication between cores and a configurable memory system enable some interesting use cases, such as building specialised virtual processors out of a group of cores and memory banks. Using this technique, we can emulate a wide range of application-specific architectures, and specialise them further at run-time:
- Multithreading: all cores execute different threads/programs.
- SIMD execution: all cores execute the same code.
- Scalarisation: most cores execute the same code, with a small number reserved to perform tasks that would otherwise be duplicated (e.g. accessing a common memory location or updating loop indices).
- Software pipelining: each core performs a partial computation, then sends its result on to the next core for further processing.
- Superscalar: cores follow the same control flow, but execute different instructions from each basic block, communicating with each other to resolve dependencies.
- (Almost arbitrary compositions of the above: for example, in a software pipeline, each pipeline stage can itself exploit a different form of parallelism.)
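As a rough software analogy (not Loki code), the software-pipelining pattern above can be sketched with one thread per "core", each receiving a value over a channel, applying its partial computation, and forwarding the result to the next stage. The stage functions here are placeholders chosen purely for illustration:

```python
from queue import Queue
from threading import Thread

def stage(func, inbox, outbox):
    """Model one core: receive a value, apply a partial computation,
    and send the result on to the next core for further processing."""
    while True:
        item = inbox.get()
        if item is None:        # sentinel: shut down and tell the next stage
            outbox.put(None)
            return
        outbox.put(func(item))

# Three pipeline stages; on the fabric, each would map to one core,
# and the queues would be the on-chip network channels between them.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
queues = [Queue() for _ in range(len(stages) + 1)]
for f, q_in, q_out in zip(stages, queues, queues[1:]):
    Thread(target=stage, args=(f, q_in, q_out), daemon=True).start()

for value in [1, 2, 3]:
    queues[0].put(value)
queues[0].put(None)

results = []
while (r := queues[-1].get()) is not None:
    results.append(r)
print(results)   # each input x emerges as ((x + 1) * 2) - 3
```

Once the pipeline is full, all stages work concurrently, so throughput is set by the slowest stage rather than the whole computation, which is the property the fabric exploits in hardware.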
I am particularly interested in the potential for simultaneously increasing performance and reducing power consumption by parallelising a program. This is possible because a single core has limited resources available to it, and a compiler may need to generate additional code to get around these limitations. By providing more cores, the program gains access to more registers, more cache bandwidth, and more functional units, and so the compiler may be able to produce better code.
I have recently started looking at how these features might benefit neural network applications. The suspicion is that GPUs, currently the favoured platform for neural network computation, may not be the best target due to their restrictive execution model.
University of Cambridge
15 JJ Thomson Avenue
Cambridge CB3 0FD
E-mail: Daniel.Bates at cl.cam.ac.uk