Expert and Hardware-level User Guide

Kiwi Hard-Realtime Pipelined Accelerators

Note: real-time Pipelined Accelerator mode is being implemented 3Q16.

Classical HLS generates a custom datapath and controlling sequencer for an application. The application may run once and exit or be organised as a server that goes busy when given new input data. KiwiC supported only, up until now, that classical way for each thread. We call this `sequencer major HLS mode'.

In `Pipelined Accelerator' major HLS mode, KiwiC will generate a fully-pipelined, fixed-latency stream processor that tends not to have a controlling sequencer, but which instead relies on predicated execution and a little backwards and forwards forwarding along its pipeline.

Like classical HLS mode, a compilation root is identified in the high-level source, but its manifestation in the hardware is different. The loop is implemented by the subtrate instead of the KiwiC-generated RTL. Hence a different subtrate is needed and different techniques are used to connect such components together by HPR System Integrator.

The root designation for a hardware accelerator is a C# static method with arguments and a return value. This is typically the loop body of a C# iteration where the loop construct itself is only used in C# form in WD (workstation development) execution (§3.2).

A pipelined accelerator mode with latency set to zero results in a purely combinational circuit in terms of input to output data path, but it may post writes to registers and RAMs that still need a clock.

The prior Kiwi.Remote() attribute, described in §7.1, enables a given method to be cut out for separate compilation. This was non-rentrant and does not enforce hard real time.

When generating a real-time accelerator, a C# function (method with arguments and return value) is designated by the user as the target root, either using a C# attribute or a command line flag to the KiwiC compiler. The user may also state the maximum processing latency. He will also typically state the reissue frequency, which could be once per clock cycle and whether stalls (flow control) is allowed.

  static int piCombDemo(int arga) // The synthesis target
    // Trival example: probably a combinational design infact.  
    return arga+100;

For a real-time accelerator, multiple `calls' to (or invokations of) the designated function are being evaluated concurrently in the generated hardware. Operations on mutable state, including static RAMs and DRAM are allowed, but care must be taken over the way multiple executions appear to be interleaved, just as care is needed with re-entrant, multithreaded software operating on shared variables. Local variables are private to each invokation.

Although we default to every concurrent run's behaviour being treated in isolation, we support two means for inter-run communication: we can address the arguments and intermediate state of neighbouring (in the time domain) runs and, as mentioned just above, we can read and write mutable state variables that are shared between runs.

Variable-latency leaf cells cannot be instantiated (currently) in accelerator mode where the latency varies by more than the reinitiation interval. Further details need defining, but, for now, we need to avoid off-chip DRAM and KiwiC will request fixed-latency integer dividers (latency equal to the bit width) instead of the more commonly instantiated variable-latency divider.

Pipelined Accelerator Example 1

A simple example is test54 in KiwiC regression suite. Alternative mark up illustrated ... final system under design.

  static readonly uint[] htab4 = { 0x51f4a750, 0x7e416553, 0x1a17a4c3, 0x3a275e96, 
                                   ... many more entries ... 

  // We require a reissue interval of 1 (fully pipelined)
  // We want a maximum latency of 16. 
  [Kiwi.PipelinedAccelerator("accel1", "nostall", 1, "maxlat", 16)]
  static uint Accel1(uint a0)
    uint r0 = a0;
    for (int p=0; p<3; p++) { r0 += htab4[(r0 >> 6) % htab4.Length]; }
    return r0;

We can specify the reissue interval via the C# attribute. In this example, a reissue interval of 1 is specified. This generates fully-pipelined hardware that can be supplied with fresh arguments every clock cycle.

We also specify the maximum result latency as 16. KiwiC will determine its own latency, up to this value, guided by the logic cost settings, and report it in the KiwiC.rpt output file.

The ROM, in the full source code of the example, has 256 entries, and so is implemented as a statically-initialised block RAM on most FPGAs. This has a synchronous access time of one clock cycle. For multiple, concurrent accesses, as required by the reissue interval of 1, the ROM must be mirrored. Owing to loop-carried ROM address dependencies, the minumum implementation latency, by inspection, is 5 cycles.

All loops offered in pipelined accelerator mode must be fully unwindable by KiwiC. This means they must have a hard and obvious upper iteration limit, but they may have data-dependent early exit.

Internally, in our first implementation, the bevelab recipe stage unwinds all loops. This gives a single superstate to the restructure recipe stage which operates in a mode where all holding registers and input operands are replicated as needed in pipeline form and where mirroring of structural resources, such as the ROM in the above example, is used to avoid structural hazards arising not only for multiple use by a single run, as normal, but over different stages in that run that are separated by more than the reissue interval.

Designing General/Reactive Hardware with Kiwi

Kiwi can be used in an RTL-like style for some applications. This is where the user takes more active control over clock cycle mapping than is required or desired by scientific users.

The Kiwi system has a hard pause mode, clock domains and net-level I/O facilities for specifying cycle-accurate hardware. This is needed for bit-bang coding to connecting to existing hardware interfaces like AXI, I2C and LocalLink. Ideally, protocols are supported natively by Kiwi and bit-banging can be avoided.

Input and Output Ports

Input and Output Ports can arise and be defined in a number of ways.

Net-level I/O ports are inferred from static variables in top-most class being compiled. These are suitable for GPIO applications such as simple LED displays and push buttons etc.. The following two examples show input and output port declarations, where the input and output have their width specified by the underlying type and by attribute, respectively.
\begin{quoze}[Kiwi.InputPort(''serin'')]static bool serialin;
[Kiwi.HwWidth(5)] [Kiwi.OutputPort(''data_out'')] static byte out5;
The contents of the string are a friendly name used in output files.

For designers used to the VDHL concept of a bit vector, we also allow arrays of bools to be designated as I/O ports. This can generate more efficient circuits when a lot of bitwise operations are performed on an I/O port.
\begin{quoze}[Kiwi.OutputWordPort(11, 0, ''dvi_d'')]public static int[] dvi_d = ...
...ordPort(11, 0, ''dvi_i'')] public static int[] dvi_i = new int [12];
Although it makes sense to denote bitwise outputs using booleans, this may require castings, so ints are also allowed, but only the least significant bit will be an I/O port in Verilog output forms.

Register Widths and Wrapping

Integer variables of width 1, 8, 16, 32 and 64 bits are native in C# and CIL but hardware designers frequently use other widths. We support declaration of registers with width up to 64 bits that are not a native width using an `HwWidth' attribute. For example, a five-bit register is defined as follows.
\begin{quoze}[Kiwi.HwWidth(5)]static byte fivebits;
When running the generated C# natively as a software program (as opposed to compiling to hardware), the width attribute is ignored and wrapping behaviour is governed by the underlying type, which in the example is a byte. We took this approach, rather than implementing a genuine implementation of specific-precision arithmetic by overloading every operator, as done in OSCI SystemC [1], because it results in much more efficient simulation, i.e. when the C# program is run natively.

Although differences between simulation and synthesis can arise, we expect static analysis in KiwiC to report the vast majority of differences likely to be encountered in practice. Current development of KiwiC is addressing finding the reachable state space, not only so that these warnings can be generated, but also so that efficient output RTL can be generated, such that tests that always hold (or always fail) in the reachable state space are eliminated from the code.

The following code produces a KiwiC compile-time error because the wrapping behaviour in hardware and software is different.

\begin{quoze}[Kiwi.HwWidth(5)]byte fivebits;
void f()
fivebits = (byte)(fivebits + 1);

The cast of the rhs to a byte is needed by normal C# semantics.

Compiling this example gives an error:
KiwiC assign wrap error:
(widthclocks_fivebits{storage=8 }+1)&mask(7..0):
assign wrap condition test rw=8, lw=5, sw=8

The following examples work
// four bit input port
... din;
// six bit local var
[Kiwi.HwWidth(6)] static int j = 0;

A short-cut form for declaring input and output ports

public static int result;
[Kiwi.OutputWordPort(31, 0)]
public static int bitvec_result;

How to write state machines...

Kiwi hardware coding styles: how to code combinational, Mealy and Moore systems in hard-pause mode.

Moore Machines

First compare the Moore machines define by main_pre and main_post:

\begin{quoze}[Kiwi.Input()]int din;
[Kiwi.Output()] int q;
q = 100;
while (true) { Kiwi.Pause(); q -= din; }

each has some initial reset behaviour followed by an indefinite looping behaviour. Their difference is the contents of q on the first tick: main_pre will subtract din on the first tick whereas main_post does not. In both cases, q is a Moore-style output (i.e. dependent on current state but not on current input).

The shortly-to-be-implemented optimisation in bevelab will make a further change: the run-time program counter will disappear entirely for main_post because the loading of q with its initial value will be done as part of the hardware reset. However, main_pre will still use a state machine to implement its different behaviour on the first clock tick.

Mealy and combinational logic:

Coding Mealy-style logic and purely combinational sub-circuits is not currently supported (but will be via pipelined accelerator mode where latency is set to zero cycles). Purely combinational logic could possibly inferred from an unguarded infinite loop, such as main_comb
main_comb() { while (true) q = (din) ? 42:200; }

However, main_comb is not a sanitary program to run under KiwiS since it will hog excessive CPU power.

Mealy-style coding could better be implemented with a new attribute as illustrated in main_mealy where the mel output is a function of both the current state q and current input din.

\begin{quoze}[Kiwi.OutputMealy()]int mel;
main_mealy() { while (true) { q += 1; mel = q+din; Kiwi.Pause(); }

Exploring this further would best be done in conjunction with further development of SystemCsharp to yield a nice overall semantic. TODO perhaps?

State Machines

Explicit state machines can be coded fairly naturally:

q = 1;
...q = 2; break;
case 2: q = 3; break;
case 3: q = 1; break;

and the position of the single Kiwi.Pause() statement before or after the switch statement only alters the reset behaviour, as discussed above.

Implicit state machines can also be used:

q = 1;
Kiwi.Pause(); q = 2;
Kiwi.Pause(); q = 3;
Kiwi.Pause(); q = 1;

Because main_implicit_state_mc is a relatively simple example, the KiwiC compiler can be expected to reuse the initial state as the state entered after the third Pause call, but in general the compiler may not always spot that states can be reused.

Clock Domains

A synchronous subsystem designed with Kiwi requires a master clock and reset input. The allocation of work to clock cycles in the generated hardware is controlled by an unwind budget described in [3] and the user's call to built-in functions such as `Kiwi.Pause'. By default, one clock domain is used and default net names clock and reset are automatically generated. To change the default names, or when more than one clock domain is used, the `ClockDom' attribute is used to mark up a method, giving the clock and reset nets to be used for activity generated by the process loop of that method.
\begin{quoze}[Kiwi.ClockDom(''clknet1'', ''resetnet1'')]
public static void Work1()
{ while(true) { ... } }
A method with one clock domain annotation must not call directly, or indirectly, a method with a differing such annotation.


SystemCSharp follows the design of SystemC using C# instead of C++. Currently there is a very initial version of it in existence. Please see the README.txt in its folder.

SystemCsharp is a library, written in C#, that provides RTL semantics for hardware modelling. In particular, it provides signals that support the evaluate/commit paradigm of synchronous digital logic, where all variables in a clock domain take on their new values, atomically, one the active edge of the relevant clock.

The KiwiC compiler can generate SystemCsharp output by using the -csharp-gen=enable command line flag. The default output name is the default name with the suffix .sysc.cs added. The -cgen-fn=filename flag can be used to change the output filename.

Several of the C++ output flags affect the way that C# is generated but these may be decoupled in the future.

Note that emitting C# or C++ with the standard recipe writes these output files at the same point in the system flow as used for RTL output. Hence a large number of parallel, RTL-style assignments will be used. Using a shorter recipe or with some of the intermediate stages disabled, output closer to the input form can be rendered: for instance, with bevelab turned off assignments will be made in order using a thread instead of an HLS sequencer.

David Greaves 2018-01-22