Robert Mullins

SPEAR: Shapeshifting many-core

ERC Starting grant "SPEAR", ref:306386 (2012-2018)

(The Loki processor)

Contact: Robert.Mullins@cl.cam.ac.uk

This project explores the design space between FPGAs and many-core processors. The aim is to produce flexible and energy-efficient processors suitable for a wide range of compute-intensive tasks.

We have designed and prototyped an architecture which supports large numbers of simple cores and provides low-level control over how these individual cores communicate and collaborate. The memory system is also highly configurable, allowing cores to set up, access and share scratchpads and caches of different sizes. Specialised cache hierarchies can also be configured as required.

The cores, interconnect and on-chip memories provide a sea of resources, not unlike an FPGA, that can be used in many different ways to optimise the execution of software. Performance and power efficiency gains are possible through specialising the mapping of software to hardware.

A 122M transistor test-chip was fabricated in TSMC's 40nm LP process. The chip is a 4x4 array of tiles, each containing 8 cores and 64KB of SRAM. In total the chip contains 128 cores and 1MB of on-chip memory. Its area is less than 5mm x 5mm and average core power consumption is expected to be around 1-2W. We made extensive use of the BaseJump open-source BGA package substrate and motherboards. The test chips are packaged using a 352-pin wirebond BGA package.
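As a quick check of the totals quoted above, the following minimal sketch recomputes the core count and on-chip memory from the per-tile figures. The variable names are illustrative only and are not taken from any Loki tool.

```python
# Illustrative arithmetic only: the names are hypothetical,
# the figures are the ones quoted in the text.
TILE_ROWS, TILE_COLS = 4, 4   # the chip is a 4x4 array of tiles
CORES_PER_TILE = 8
SRAM_KB_PER_TILE = 64

tiles = TILE_ROWS * TILE_COLS
total_cores = tiles * CORES_PER_TILE
total_sram_kb = tiles * SRAM_KB_PER_TILE

print(f"{tiles} tiles, {total_cores} cores, {total_sram_kb} KB of SRAM")
```

The 16 tiles give 128 cores and 1024KB (1MB) of SRAM, matching the totals stated for the chip.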

The cores were designed at Cambridge and implement our own communication-centric ISA. Most instructions are able to indicate that their result should be sent to other cores or memories. The processors are 32-bit, with 32 general-purpose registers and support for decoupled loads. Each contains a tightly-coupled 256-entry local scratchpad and a 64-entry ("L0") instruction packet cache.

An LLVM compiler backend port has been completed for Loki. The project is also supported by a SystemC simulation environment. Work continues on libraries, new languages, intermediate representations and compilation/mapping tools to ease the use of the platform.

A range of applications has been prototyped and explored, including convolutional neural networks exploiting sparse data storage and computation. Direct comparisons to other platforms in terms of performance and power consumption are planned with the help of our test chip.

Latest news

June 2019 : Key features/blocks operating as expected. Currently deriving an accurate power model of the chip.

January 2019 : Core power has been measured at ~2.5W with all 128 cores running a loop (loads/multiplies) at 450MHz (~19mW per core and SRAM bank). We are in the process of collecting more detailed results.

January 2019 : The chip was submitted to a Europractice MPW run in June 2018. Packaged chips were received in December 2018 and the bring-up and test process began in January 2019. Preliminary results indicate that everything is working as expected.
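The per-core figure in the January 2019 power measurement can be reproduced with a short back-of-envelope calculation. The sketch below uses only the approximate numbers quoted in that news item. The energy-per-cycle value is a derived estimate, not a measured result, and the variable names are hypothetical.

```python
# Back-of-envelope estimate from the quoted (approximate) measurement.
# Names are hypothetical; only the three input figures come from the text.
CORE_POWER_W = 2.5      # ~2.5W measured with all cores running a loop
NUM_CORES = 128
CLOCK_HZ = 450e6        # 450MHz

# Power per core (including its share of SRAM), in milliwatts.
per_core_mw = CORE_POWER_W / NUM_CORES * 1e3

# Derived estimate: energy per core per clock cycle, in picojoules.
energy_per_cycle_pj = (CORE_POWER_W / NUM_CORES) / CLOCK_HZ * 1e12

print(f"{per_core_mw:.1f} mW per core, ~{energy_per_cycle_pj:.0f} pJ per core per cycle")
```

This gives roughly 19.5mW per core, in line with the ~19mW quoted above, and an estimated energy in the low tens of picojoules per core per cycle for this particular loop.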

People

Acknowledgements

This work was funded by the European Research Council (grant 306386). The PI would like to thank the ERC for all of their support and guidance throughout the project. Earlier work in this area was funded by EPSRC grant EP/G033110/1 (2009-2013).

We would like to thank everyone in the IMEC team who worked with us, including Mustafa Haluk Cologlu who worked on the physical design, for their high-quality support and attention to detail. We are also indebted to Prof. Michael Taylor and his Bespoke Silicon Group at the University of Washington for making their BaseJump infrastructure public and for kindly providing us with so much good advice. We would also like to thank everyone who helped with the bonding/balling and packaging, BGA socket and PCB work at Quik-Pak, Ironwood Electronics and Sierra Circuits.

Papers

Configurable memory systems for embedded many-core processors
Daniel Bates, Alex Chadwick and Robert Mullins
International Workshop on High Performance Energy Efficient Embedded Systems (HIP3ES 4), January 2016.
Paper

Exploiting Tightly-Coupled Cores
Daniel Bates, Alex Bradbury, Andreas Koltes and Robert Mullins
Journal of Signal Processing Systems, August 2014.
SpringerLink

Spatial computation on a homogeneous, many-core architecture
Daniel Bates, Alex Bradbury, Andreas Koltes and Robert Mullins
PRISM-2, June 2014.
Paper, Slides

Exploiting Tightly-Coupled Cores
Daniel Bates, Alex Bradbury, Andreas Koltes and Robert Mullins
SAMOS XIII, July 2013.
Paper, Slides

PhD Dissertations

Exploiting tightly-coupled cores
Daniel Bates
University of Cambridge Computer Laboratory, July 2013.
Thesis, Tech report