Introduction

Kiwi is a compiler and library and infrastructure for hardware accelerator synthesis and general support for high-performance scientific computing. The output is intended for execution of on FPGA or in custom silicon on ASIC.

We aim to compile a fairly broad subset of the concurrent C# language subject to some restrictions:

For Kiwi 1, the current version, we have the following aims:

Works with the Linux/mono infrastructure but should also work on Windows.
Program can freely instantiate classes but not at run time - a fixed number of instantiation operations must be detectable at compile time.
Array and heap structure sizes must all be statically determinable (i.e. at compile time).
Program can use recursion but the maximum calling depth must be statically determined in Kiwi 1.
Stack and heap must have same shape at each run-time iteration of non-unwound loops. In other words, every allocation made in the outer loop of your algorithm must be matched with an equivalent, manifestly-implicit garbage generation event or explicit obj.Dispose() or Kiwi.Dispose(Object obj) in the same loop.
Program can freely create new threads but creation sites statically determined too.

In Kiwi 2 we will relax the static restrictions and allow the size of data structures in DRAM to be determined at runtime. See
urlhttp://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/kiwic-demos/linkedlists.html

Kiwi 2, planned to be available in the middle of 2017, supports three major compilation modes. These can be mixed in a single design, at a subsystem granularity, with the new incremental compilation support based on IP-XACT.

The Sequencer major mode is `classical HLS'. It will generate a custom datapath made up of RAMs, ALUs and external DRAM connections and folds the program onto this structure using some small number of clock cycles for each iteration of the inner loops.
The Pipelined Accelerator major mode (§15) generates hardware with a low initiation interval (II). When the II is unity we say the component is `fully-pipelined'. A fully-pipelined component will execute the complete computation every clock tick, accepting new argument data every clock cycle, allbeit with some number of clock cycles latency between a particular input appearing at the output.
The SoC Render major mode provides C# access to an IP-XACT-driven wiring generator with support for automatic glue logic insertion. The invoked subsystem is called HPR System Integrator (§39). This can target multi-FPGA designs and provides a clean mechanism to wrap up third-party IP blocks, such as CAMs. (This is now not a separate major mode -- instead it is a separate application program invoked from a Makefile or command line.)

Asymptotic Background Motivation for FPGA Computing

The Von Neumann computer has hit a wall in terms of increasing clock frequency. It is widely accepted that Parallel Computing is the most energy-efficient way forward. The FPGA is intrinsically massively-parallel and can exploit the abundant transistor count of contemporary VLSI. Andre DeHon points out that the Von Neumann architecture no longer addresses the correct problem: he writes "Stored-program processors are about compactness, fitting the computation into the minimum area possible".

`Stored-program processors are about compactness, fitting the computation into the minimum area possible.' -- `Fundamental Underpinnings of Reconfigurable Computing Architectures' by Andre DeHon.

Why is computing on an FPGA becoming a good idea ? Spatio-Parallel processing uses less energy than equivalent temporal processing (ie at higher clock rates) for various reasons. David Greaves gives nine:

Pollack's rule states that energy use in a Von Neumann CPU grows with square of its IPC. But the FPGA with a static schedule moves the out-of-order overheads to compile time.
To clock CMOS at a higher frequency needs a higher voltage, so energy use has quadratic growth with frequency.
Von Neumann SIMD extensions greatly amortise fetch and decode energy, but FPGA does better, supporting precise custom word widths, so no waste at all.
FPGA can implement massively-fused accumulate rather than re-normalising after each summation.
Memory bandwidth: FPGA has always had superb on-chip memory bandwidth but latest generation FPGA exceeds CPU on DRAM bandwidth too.
FPGA using combinational logic uses zero energy re-computing sub-expressions whose support has not changed. And it has no overhead determining whether it has changed.
FPGA has zero conventional instruction fetch and decode energy and its controlling micro-sequencer or predication energy can be close to zero.
Data locality can easily be exploited on FPGA -- operands are held closer to ALUs, giving near-data-processing (but the FPGA overall size is x10 times larger (x100 area) owing to overhead of making it reconfigurable).
The massively-parallel premise of the FPGA is the correct way forward, as indicated by asymptotic limit studies [DeHon].

Download and License

Kiwi has been open source since early 2017 and is downloadable (perhaps on completion of a web form). The download page is http://koo.corpus.cam.ac.uk/kiwic-download.

Warranty

Neither the authors nor their employers warrant that the Kiwi system is correct, usable or noninfringing. It is an academic prototype. We accept no responsibility for direct or indirect loss or consequential loss to the maximum amount allowable in UK law.

Subsections

- - Asymptotic Background Motivation for FPGA Computing
- Download and License
  - Warranty

David Greaves 2019-11-14