Kiwi Scientific Acceleration using FPGA

Overview

Kiwi is open source: the live repo is hosted on bitbucket.org/djg11/bitbucket-hprls2. There are stable snapshots linked on adjacent web pages.

Demo at FPL 2017, September, Ghent.

The Kiwi project aims to make reconfigurable computing technology like Field Programmable Gate Arrays (FPGAs) more accessible to mainstream programmers. FPGAs have a huge potential for quickly performing many interesting computations in parallel but their exploitation by computer programmers is limited by the need to think like a hardware engineer and the need to use hardware description languages rather than conventional programming languages.

Asymptotic Background Motivation for FPGA Computing

The Von Neumann computer has hit a wall in terms of increasing clock frequency. It is widely accepted that Parallel Computing is the most energy-efficient way forward. The FPGA is intrinsically massively-parallel and can exploit the abundant transistor count of contemporary VLSI. Andre DeHon points out that the Von Neumann architecture no longer addresses the relevant problem: he writes "Stored-program processors are about compactness, fitting the computation into the minimum area possible".

"Stored-program processors are about compactness,
fitting the computation into the minimum area possible." -- Andre DeHon

The FPGA alternative:

Pico Computing FPGA-based server blade: Four FPGAs, a PCI-e bus switch and a 3-D DRAM Cube, but no CPU!

Why is computing on an FPGA becoming a good idea ?

Spatio-Parallel processing uses less energy than equivalent temporal processing (ie at higher clock rates) for various reasons. David Greaves gives nine:

Pollack's rule states that energy use in a Von Neumann CPU grows with square of its IPC. But the FPGA with a static schedule moves the out-of-order overheads to compile time.
To clock CMOS at a higher frequency needs a higher voltage, so energy use has quadratic growth with frequency.
Von Neumann SIMD extensions greatly amortise fetch and decode energy, but FPGA does better, supporting precise custom word widths, so no waste at all.
FPGA can implement massively-fused accumulate rather than re-normalising after each summation.
Memory bandwidth: FPGA has always had superb on-chip memory bandwidth but latest generation FPGA exceeds CPU on DRAM bandwidth too.
FPGA using combinational logic uses zero energy re-computing sub-expressions whose support has not changed. And it has no overhead determining whether it has changed.
FPGA has zero conventional instruction fetch and decode energy and its controlling micro-sequencer or predication energy can be close to zero.
Data locality can easily be exploited on FPGA --- operands are held closer to ALUs, giving near-data-processing (but the FPGA overall size is x10 times larger (x100 area) owing to overhead of making it reconfigurable).
The massively-parallel premise of the FPGA is the correct way forward, as indicated by asymptotic limit studies [DeHon].

Scientific Users

It is widely accepted that many problems in scientific computing can be vastly accelerated using either FPGA or GPU execution resources. Also, the FPGA approach in particular leads generally to a significant saving in execution energy. The product of the two gains can be typically one thousand fold.

Kiwi provides acceleration for multi-threaded (parallel) programs provided they can be converted to .NET bytecode. Originating from Microsoft, the .NET (also known as CIL) is a well-engineered, general purpose intermediate code that runs on many platforms, including mono/linux.

Substrate

We use the term substrate to refer to an FPGA board or boards that is loaded with various standard parts of the Kiwi system. The most important substrate facilities are access to DRAM memory, a disk filesystem and a console/debug channel. Basic run/stop/error status output to LEDs via GPIO is also provided.

The substrate is like an operating system on the FPGA. It supports connection to more than one application.

The basic .NET classes for streamreader, streamwriter, textreader and textwriter are provided. Random access using ftell and fseek is also supported.

Very high bandwidth writes to the framestore are an intrinsic feature of FPGA computing. The framestore can be used for high-performance visualisation or just for a progress indicator- e.g. percentage of the job processed.

Design Exploration and Performance Prediction

Find out what speedup you will achieve before investing an afternoon in the projection to FPGA bitstream.

When using FPGA acceleration, as with all high-performance computing, careful design of datastructures is a most important consideration. This requires a good idea of what is stored where and how frequently it must move. Moreover, the key to compiling software to high-performance circuits is to balance the ow of data on the various paths. Given that such information is available inside tools such as Kiwi, the novel aspect in this project is to augment and forward this information up to the user so that they can make informed decisions and rapid experiments at the data structure and algorithm level.

The profile counters in the substrate, when enables, count the number of visits to a carefully-chosen set of basic blocks. This information is compared with the original prediction and a recalibration file is generated that can either be permanently inserted in the program's source code or else kept in the IDE alongside the program.

A design exploration demo using the Performance Predictor is being put together on this link: Kiwi Performance Predictor Demo.

Kiwi USPs - Unique Selling Points

There have been numerous high-level synthesis (HLS) projects in recent decades. Finally HLS has come of age, with all FPGA and EDA vendors offering HLS products. Nearly all of the prior work has used C, C++ or SystemC as the source HLL. (Historical note: Greaves CTOV compiler from 1995 is now owned by Synopsys Greaves-CTOV-1995 via the Tenison EDA sale.)

It is widely accepted that C# and mono/dotnet provide a significant leg up compared with C++ owing to crystal clear semantics, selectively checked overflows, neat higher-order functions and delegates, amenability to compiler optimisations and automated refactoring, garbage management, versioned assemblies and so on. Many of these benefits are most strongly felt with parallel programs. Also, the LINQ/Dryad extension is a clean route for manual invocations of accelerators.

The Kiwi system has the following USPs compared with most/all other HLS tools:

Source language is C# (or other CIL PE-generating HLLs),
The same program runs on mono/dotnet for development as on the FPGA for high-performance execution,
Concurrent (parallel) programs are supported,
Dynamic allocation of objects and manipulation of object pointers is supported,
KiwiC performs automatic mapping of arrays and object heap to appropriate memory subsystems,
Channel communication between separately-compiled components, instead of manual bit banging of wires to implement a protocol,
Floating-Point is supported (custom precision F/P in the future),
Incremental compilation with IP-XACT wrapper generation and import for instantiated hierarchic blocks,
Mix hard and soft coding styles where clock cycle mapping is controlled from the C# file (for manual bit-banging of net-level protocols) or fully automatic (for Scientific Acceleration),
Advanced register colouring taking into account wiring length under 2-D projection and multiplexor fan-in
Generates Verilog RTL output as well as a IP-XACT descriptions and SystemC behavioural and/or SystemC RTL-style models for whole-system modelling.
Some recursive programs are supported (unlimited recursion again in the future),
Compile on linux or Windows with existing substrate templates for Zynq, VC-707 and NetFPGA.

The performance predictor will be a vital component for scientific users, but can perhaps be adapted for embedded hard-real-time control where Kiwi is being used for protocol implementation and deep packet inspection in the Network As A Service project.

Demos

A number of demos and projects have been done using Kiwi in the last five years. Here we are collecting together a number of them and making sure they still compile. All the following will become links in the near future...

`Synthesis of a Parallel Smith-Waterman Sequence Alignment Kernel into FPGA Hardware' PDF. Src code CSharp (Serial Version). Src code CSharp (Parallel Version).
Bowtie Genome Alignment (Burrows-Wheeler Transform) Page Under Construction.
K-Means Interactive Game (FPL 2017 Ghent Demo)Page Under Construction.
Smith-Waterman Genome Alignment: L1 PDF - code is in distro examples folder.
Primes using Sieve method - see distro examples folder and performance predictor demo.
Passing objects between threads: LINK.
DFSIN and other chstone benchmarks (first converted to C#): LINK.
Lemple Ziv Compression - Link missing.
FPL 2016 AES Demo: LINK.
Cuckoo Cache: LINK.
Random Number Tester: LINK (Also in distro examples folder).
Bloom Filters: LINK.
Conway's Game of Life - operating in FPGA framestore with DVI/HDMI output - Link missing.
Linked Lists Example LINK.
Stream Computation Example Data Streaming to FPGA.
L/U Decomposition - Simultaneous Equation with Kiwi.
N-Body Simulation (with video clip) LINK.
Multi-FPGA logic partitioner, instantiator and structural wiring generator:HPR System Integrator .
(Older, simple and small demos.)

VSFG

We have a new technique for compiling heavy control flow that differs from conventional HLS. It is called VSFG. "Exposing ILP in custom hardware with a dataflow compiler IR". This is being prototyped within the Kiwi framework (and in other projects).

Other Current Developments

Although Kiwi is a tool that mainly/essentially works, a lot of further development is envisioned. Apart from bug fixing, the main development/research areas for Kiwi at the moment are:

Even higher performance (mostly through enhanced, profile-directed tuning),
Spatially-aware register colouring,
Accurate performance prediction,
Custom-width floating point,
Better support for C# structs (as opposed to C# classes),
DRAM to cache prefetch,
Better debug support (including preserving vsibility name of C# local variable names),
Easier co-design when part of the application still runs under mono/Windows,
Dataflow (VSFG) execution engine,
Kiwi-2 full support for run-time dynamic storage allocation.

Download

The Kiwi compiler, KiwiC, itself consists of about 22 klocs (thousand lines of code) of F# (FSharp) code that is a front end to the HPR L/S logic synthesis library that is composed of another 60 or so klocs of F#. The code density for F#, like other dialects of ML, is perhaps (conservatively perhaps) 3 times higher than for common imperative languages like C++, Java and C#, so it is a significant project.

Kiwi is open source: download a snapshot from HERE.

Links

First Draft User Manual 80 pages (PDF)

KiwiC Page.

Multi-FPGA logic partitioner, instantiator and structural wiring generator:HPR System Integrator .

Small hardware-oriented demos

Hastlayer's dotnet to FPGA project.

Performance predictor demo: Kiwi Performance Predictor Demo.

Acceleration of graph algorithms using multi-blade Co-Synthesis Kiwi-Axelgraph.