Kiwi Supported Language Subset Limitations and Style Guide

Kiwi aims to support a very broad subset of the C# language and so be suitable for a wide variety of High-Performance Computing (HPC) applications. However, the user is expected to write in a parallel/concurrent style using threads to exploit the parallelism available in the FPGA hardware. However, conventional high-level synthesis (HLS) benefits should be realised even for a single-threaded program.

This chapter will explain the synthesisable subset of C# supported by KiwiC, but currently much work is needed in this section of the manual ...

In general, for Kiwi 1, all recursion must be to a compile-time determinable depth. The heap and stack must have the same shape at each point of each iteration of every loop this is not unwound at compile time. In other words, dynamic storage allocation is supported in KiwiC, provided it is called only from constructors or once and for all on the main (lasso stems of) threads before they enter an infinite loop. If called inside a non-unwound loop, the heap must be the same shape at each point on each iteration.

KiwiC implements a form of garbage collection called 'autodispose'. This can currently (October 2016) be enabled with -autodispose=enable. It will be turned on by default in the near future when further escape analysis is completed. Currently it disposes of a little too much and when that memory is reused we have a nasty aliasing problem since that store was still live with other data. This will crop up with linked-list and tree examples or where the address of a field in a heap object is taken.

When autodispose fails to free something (or is turned off) you can explicitly free such store with a call to obj.Dispose() or Kiwi.Dispose(Object obj).

WRONG: Dynamic storage regions cannot currently be shared between Kiwi threads. Currently, KiwiC implements different heap spaces for each thread ... really ? If so this needs fixing ... TODO ... maybe they are only different AFTER a fork but resources allocated before Thread.Start are ok.

Floating point is being provided for the standard 32 and 64-bit IEEE precisions, but FPGAs really shine with custom precision floating point so we will add support for that while maintaining bit-accurate compatiblity between the execution environments.

Atomic operations: Kiwi supports the CLR Enter, Exit and Wait calls by mapping them on to the hpr_testandset primitive supported by the rest of the toolchain. Ed: The rest of this paragraph should be in the `internal operation' section. Although RTL target languages, such as Verilog, are highly-concurrent, they do not have native support for mutexes. The bevelab recipe stage correctly supports testandset calls implemented by its own threads, but KiwiC does not use these threads: instead it makes a different HPR virtual machine for each thread and these invoke bevelab once each instead of once and for all with bevelab threads within that invokation. Hence the the testandset primitives dissappear inside bevelab. ... TODO explain further.

General CSharp Language Features and Kiwi Coding Style

Supported Types

Kiwi supports custom integer widths for hardware applications alongside the standard integer widths of dotnet 8, 16, 32 and 64.

Char is a tagged form of the 16-bit signed integer form.

Single and double-precision floating point are supported.

Enumerations are supported with custom code points. MSDN says the approved underyling types for an enum are byte, sbyte, short, ushort, int, uint, long, or ulong, but Kiwi uses a suitable custom width of any number of bits.

Supported Constants

Verilog and SystemC has 8-bit chars but C# and dotnet have 16 bit chars.

Supported Variables

Kiwi supports static, instance, local and formal parameter variables.

Variables may be classes or built-in primitive types and arrays of such variables. An array may contain a class and a class may contain an array, to any nesting depth. Multi-dimensional arrays (as opposed to jagged arrays) are supported with a little syntactic sugar in the C# compiler but mostly via library class code provided in Kiwic.dll.

Structs are also being added.

Signed and unsigned integer and floating point primitive variables are fully supported.

Strings are supported a little, but there is currently no run-time concatenation or creation of new strings, so all such string creation operations must be elaborated at KiwiC compile time.

Supported Operators

All standard arithmetic and logical operators are supported. Some operators, especially floating-point converts and floating-point arithmetic result in components being instantiated from the cvgates.v library. Integer mod, divide and larger multiplies also result in ALU instantiation, unless arguments are constant identity values or powers of two that are easily converted to shifts. Divide and multiply by a constant may result in adders being generated.

Supported Class Featuress

Supported I/O with Kiwi

Kiwi supports a number of forms of I/O:

Data Structures with Kiwi 1/2

To achieve high performance from any computer system the programmer must think about their data structures and have a basic knowledge of cache and DRAM behaviour. Otherwise they will hit memory bandwidth limitations with any algorithm that is not truly CPU bound.

As in most programming languages, C# variables and structures are static or dynamic. Dynamic variables are allocated on the heap or stack. All are converted to static form during compilation using the version 1 Kiwi compiler. Support for truly dynamic variables will perhaps be added in a future release.

Kiwi does not (currently) support taking the address of local variables or static variables in fields (except when pass by reference is being compiled). All pointers and object handles need to refer to heap-allocated items.

It is helpful to define the following two terms for pointer variables. Pointers generally point to dynamic data but their pattern of use falls into two classes. We will call a static pointer one whose value is initially set but which is then not changed. A dynamic pointer is manipulated at run time. Some dynamic pointers range over the value null. (As with all C# variables, such pointers can be declared as static or instance in C# program files -- this is orthogonal to the current discussion.)

Every C# array and object is associated with at least one pointer because all arrays and objects are created using a call to 'new'. Also, some valuetypes become associated with a pointer, either by being passed-by-reference or by application of the ampersand operator in unsafe code. The KiwiC compiler will `subsume' nearly all static pointers in its front end constant propagation and any remaining static pointers will be trimmed by later stages in the KiwiC compiler or in the vendor-specific FPGA /ASIC tools applied to the output RTL.

KiwiC maps data structures to hardware resources in two stages. In the first stage (known as repack §29), every C# form (that did not disappear entirely in the front end) is converted to either scalars of some bit width or 1-D arrays (also known as vectors) of such scalars. In the second stage (known as restructure §30), mapping to physical resource decisions are made as to which vectors and scalars to place in what type of component (flip-flops, unregistered SRAM, registered SRAM, DP SRAM or off-chip in DRAM) and which structural instance thereof to use. The first stage behaviour is influenced mainly by C# programming style. Second stage behaviour is controlled by heuristic rules parametrised by command-line flags and recipe file values.

Data Structures with Kiwi 2/2 - more advanced and opaque temporary write up...


First Stage Processing (repack):

Two-dimensional arrays are a good example to start with. Although there is syntactic sugar in C# for 2-D arrays, with current C# compilers this is just replaced with operations supplied by a library dll. The dotnet runtime and KiwiC support just 1-D arrays called vectors. There are two possible implementations of a 2-D array library: jagged and packed. The packed form subscript is computed using a multiply of the first co-ordinate with the arity of the second co-ordinate and then adding on the second co-ordinate. The jagged form uses a vector of static pointers to vectors that contain the data; the first co-ordinate is the subscript to the pointer vector and the second co-ordinate is the subscript to the selected vector. We use the term jagged to encompass their smooth form where all data vectors are the same length.

KiwiC inlines the subscript computation for a packed array as though the programmer had inlined such an expression in his C# code. Additionally, there is only one vector created. Therefore packed 2-D arrays first become 1-D vectors. However, such vectors are then subject to unpacking in first stage operation. For instance, if all subscripts are constant values, the array is replaced with a set of scalars. Of if the subscripts fall into clearly disjoint regions, the vector is split into multiple, separately-addressed regions. Or if all the subscripts have a common factor or common offset then these are divided and subtracted off respectively. This unpacking into multiple vectors removes structural hazards that would prevent parallelism.

For a jagged array, initially a number of separate vectors are created and a potentially large number of multiplexing expressions (that appear as the ?: construct in Verilog RTL) are created to direct reads to the correct vector. For writes, an equivalent demultiplexor is created to select the correct vector for writing. (The pointer vector is normally static and becomes subsumed, but we will later discuss what happens if the C# code writes to it, making it dynamic.)

Implementation note: if a jagged array is created by allocating a large 1-D array and storing references to offsets in that vector in the pointer array, it is possible to generate a structure that is identical to the packed array. KiwiC happens to detect this pattern and the behaviour would be as per the packed array: however this style of programming is not allowed in safe C#, but could be encountered in unsafe code or other dotnet input form, say, C++.

If we create an array of objects do we expect the fields of the objects to be placed in vectors? Yes, certainly if the object pointers are subsumed.

If we take the parfir example, there's one initialise place where empty flags are written from a non-unwound loop and hence with dynamic subscript, but elsewhere they are updated only with constant subscripts and so should be simple scalar flags.

Kiwi on Loop Unwinding: Loop-carried dependencies in data or control form limit the amount of parallelism that can be achieved with unwinding.

The hard cbg algorithm unwinds all loops without event control. The soft algorithm allocates cycles based on greedy or searching strategies based on complexity and structural hazards. Consider 1: Hoisting of exit condition computation, or hoisting of data dependency computation: this should preferably be applied? So the post-dependent tail of each loop can be forked off

Dynamic Storage Allocation

For statically-allocated object instances, KiwiC packs them into flip-flops, B-RAM or DRAM according to thresholds configured in the recipe or command line. This includes objects and structs allocated on the C# heap before the end of static elaboration.

For dynamically-allocated instances, KiwiC cannot easily tell how much memory may be needed and so defaults to DRAM channel 0 if present. But we can switch manually between B-RAM and DRAM for dynamic storage allocation using C# attributes.

We make the following interesting observation: Once data structures are placed in DRAM there is no real need to have their number statically determined at compile time: instead they can be truely dynamically allocated at run time (DJ Greaves 2015). Indeed, if an application becomes overly dependant on DRAM data then the FPGA advantage will disappear and a Von Neumann (e.g. x86) implemenation may likely have better performance. But, there remains some good FPGA mid ground where a lot of dynamic store is needed but where the access bandwidth required is not excessive.

Kiwi.HeapManager

Physical memories used for dynamic storage require a freespace manager. We can allocate a HeapManager for each physical memory and the user can direct requests to an appropriate instance. Typically there could be one for each separate DRAM bank and one for each separate on-chip B-RAM.

Also, arrays with dynamic sizes ...

Pointer Arithmetic

handleArith pointer arithmetic

Kiwi.ObjectHandler<T>

The object handler provides backdoors to certain unsafe code for pointer arithmetic that are banned even in unsafe C# code. Implementation in CIL assembler would be possible but having hardcoded support in the KiwiC compiler accessed via this object manager is easier.


Garbage Collection

With Kiwi 1, the stack and heap must have same shape at each run-time iteration of non-unwound loops. In other words, every allocation made in the outer loop of the compiled program must be matched with an equivalent dispose or garbage generation event in the same loop.

Where a heap object is allocated inside a loop that is not compile-time, it will potentially consume fresh memory on each iteration. There are two basic senarios associated with such a condition: either the fresh memory is useful, such as when adding items to a linked-list datastructure, or else it is not needed because the previous allocation is no longer live and the same heap space could be simply reused. This second case is fully served by converting to static allocation at compile time.

KiwiC V2 is implementing a more easy to use, run-time storage allocator, but without garbage collection.

KiwiC V1 does not support genuine dynamic storage allocation inside an execution-time loop. Bit it provides two mechanisms to support dynamic to static reduction where dynamic store is not really needed. The first uses an explicit dispose and the second uses an implicit dispose. Either way, when the loop iterates, the active heap has shrunk and KiwiC makes sure to reuse the previously allocated heap record at the allocation site (call to C# new).

See the linked list example ... http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/kiwic-demos/linkedlists.html

KiwiC V1 arrays - Array sizes must all be statically determinable (i.e. at compile time).

System.BitConverter provides a typical use case that involves a lot of temporary byte arrays. The F# compiler also uses a lot of temporary structures and the KiwiC has a chance of compiling F# bytecode by exploiting the implicit disposal approach.

Arrays in .NET do not have a Dispose() method. Instead an array can be disposed of with Kiwi.Dispose<T>(T [] array). This is a nop when running on mono/dotnet.

System.BitConverter returns char arrays when destructing native types and the arrays returned by BitConverter should be explicitly disposed of inside a non-unwound loop if KiwiC is failing to spot an implicit manifest garbage creating event, as reported with the an error like:

System.BitConverter returns char arrays when destructing native types. The arrays returned by BitConverter should be explicitly disposed of inside a non-unwound loop if KiwiC is failing to spot an implicit manifest garbage removal opportunity, as reported with the an error like

KiwiC +++ Error exit: BitConverterTest.exe: constant_fold_meets
 entry_point=5:: Bad form heap pointer for obj_alloc of type
 CT_arr(CTL_net(false, 32, Signed, native), 8) post end of elaboration
 point (or have already allocated a runtime variable sized object ?).
 Unless you are geninuely making a dynamic linked list or tree, this
 can generally be fixed using a manual call to Kiwi.Dispose() in your
 source code at the point where your allocation could be safely
 garbage collected.

Unless you are geninuely making a dynamic linked list or tree, the failed implicit garbage collector can generally be worked around using a manual call to Kiwi.Dispose() in your source code at the point where your allocation could be safely garbage collected.

new

For making trees and lists, see the linked list example ... http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/kiwic-demos/linkedlists.html

... field-arrays and spatial locality

Testing Execution Env: Whether I am running on the Workstation or the FPGA

We need sometimes to achieve different behaviour, for debugging and scaling reasons, in the three execution environments.

  1. For Workstation Development - WD - we can invoke
    Kiwi.inHardware() and find that it returns false
  2. For RTLSIM check that inHardware returns false and that the
    Kiwi.InputBitPort("FPGA")] static bool FPGA; returns false. You should tie this net low in your simulator top-level instantiation.
  3. Otherwise we are in FPGA. The Kiwi substrate for a hardware PCB should tie this net high in the pad ring.

Call the function Kiwi.inHardware() for this purpose. Since this is a compile-time constant, it is useful for removing development and debugging code from the final implementation. KiwiC will ignore code that is inside if (false) { } constructs so write
if (!Kiwi.inHardware()) { ... test/debug code ... }.

[KiwiSystem.Kiwi.HprPrimitiveFunction()]    
public static bool inHardware()
{
  return false; // This is the returned value when running on the workstation.
  // An alternative overriding implementation is hardcoded inside KiwiC and will 
  //return 'true' for FPGA and RTL simulation.
}


Clone

Clone of arrays and objects ....

Delegates and Dynamic Free Variables

Kiwi Dynamic Method Dispatch

Dynamic method dispatch is where which function body that gets called from a callsite is potentially data-dependent. Computed function calls occur with action and function delegates and dynamic object polymorphism.

In C++ there are restrictions that higher-order programming is only possible within a class hierarchy. This arises from the C compatibility issues where the higher-order function passing does not have to manage an object pointer. These issues are neatly wrapped up in C# using delegates. An action delegate has void return type whereas a function delegate returns a value.

Kiwi supports the function and action delegates of C#.

KiwiC partitions dynamically-callable method bodies into equivalence classes and gives each body within a class an integer. These classes typically contain only a very few members each. It then uses constant folding on the entire system control-flow graph as a general optimisation. This may often turn a dynamic dispatch into a static dispatch, hence these integers will not appear in the output hardware unless truly dynamic dispatch is being used, such as in

 Action<int, string> boz_green = delegate(int var1, string var2)
   {	Console.WriteLine("  {1} {0} boz green", var1, var2);
   };
 Action<int, string> boz_red = delegate(int var1, string var2)
   {	Console.WriteLine("  {1} {0} boz red", var1, var2);
   };
 for(int pp=0; pp<3; pp++)
   { Kiwi.Pause(); // Pause makes this loop unwind at run time.
     boz_red(pp+100, "site1");
     boz_green(pp+200, "site2"); 
     var x = boz_red; boz_red = boz_green; boz_green = x; //swap
   }

C# 3.0 onwards supports proper closures. These are implemented inside the C# compiler and compile fine under Kiwi provided the static allocation restrictions are obeyed.

Test55 of the regression suite contains the following demo.

  public static Func<int,int> GetAFunc()
  {
    var myVar = 1;
    Func<int, int> inc = delegate(int var1)
      { myVar = myVar + 1;
        return var1 + myVar;
      };
    return inc;
  }

  [Kiwi.HardwareEntryPoint()] static void Main()
  { var inc = GetAFunc();
    Console.WriteLine(inc(5));
    Console.WriteLine(inc(6));
  }

This compiles and works fine. But, there is a Kiwi 1 resriction that the GetAFunc call must be before the end of static elaboration since this creates the closure that is allocated on the heap.

If no closure is needed, Action and Function delegates suffer from no static allocation restriction.


The ToString() Method

Kiwi implements a basic version of the ToString method. It will give output that is rather dependent on which version of the compiler is being used, but it is better than nothing. Enumerations print as integers.

Accessing Numerical Value of Pointer Variables

IntPtr types.

Clearly, the addresses used on the FPGA have little relationship when run on the mono VM, but it is possible to display class pointer value on the hardware platform. One method is to use the default ToString method on an object handle. This will generate a Kiwi-specific output.

For example

  Console.WriteLine("  Ntest14w line0 : pointer={0}", ha.ToString());
  Console.WriteLine("  Ntest14w line1 : left={0}", ha.left);

Might give:

Ntest14w line0 : pointer=Var(test14w/T401/Main/T401/Main/V_0%$star1$/test14w/
                   dc_cls%30008%4, &(CTL_record(test14w/dc_cls,...)), ..., )
Ntest14w line1 : left=32

Ah - this has printed the variable not its value!

Accessing Simulation Time

The Kiwi.dll library declares a static variable called tnow. During compilation reads of this are replaced with references to the appropriate runtime mechanism for access to the current simulation time. For instance, the following line

   Console.WriteLine("Start compute CRC of result at {0}\n", Kiwi.tnow);
becomes
   $display("Start compute CRC of result at %t\n", $time);
when output as Verilog RTL.

The substrate has a tick counter that is instantiated when tnow is used for FPGA execution and so the RTLSIM code is a now a shim and not a direct call to the non-synthesisable $time infact... TODO fix.


Run-time Status Monitoring, Waypoints and Exception Logging

The following text to be corrected and moved to debugging section of manual please:

The user requires an indication of whether an FPGA card is actively running an application. Nearly all FPGA cards have LED outputs controlled by GPIO pins that are useful for basic status monitoring. It is normal to connect an LED or two to indicate Kiwi activity and/or error, but most status reporting is via the substrate gateway.

Some FPGAs have LCD or VGA framestore outputs that are also relatively easy to use for monitoring and results.

The sequencer index and waypoint for each thread can be remotely monitored via the substrate gateway. This provides ... abend reason register ... logs thread id, waypoint, pc value and abend reason.

Exiting Threads


Null pointer, Array bounds, Overflow, Divide-By-Zero and Similar Run-time Exceptions

The Kiwi substrate gateway will log the thread identifier, waypoint and sequencer index for threads that finish or abort in an abend reason register. The user can reverse-engineer these via the KiwiC report file. An XML variant of that file for import into IDE needs to be provided in the future.

It is possible to get a run-time null pointer exception.

It is possible to get a run-time checked overflow exception.

It is possible to get a run-time divide-by-zero exception.

It is possible to get a run-time array bounds exception.

It is possible to get a run-time exception.

(Floating point exceptions are normally handled with via NaN propagation.)


Normal Thread and Program Exit

For RTLSIM execution of the KiwiC-generated RTL, it is sometimes convenient to have the simulator automatically exit when the program has completed.

When the main thread of Kiwi program exits (return from Main), the generated code may include a Verilog $finish statement if the flag "-kiwic-finish=enable" is supplied on the command line or in the recipe file. The equivalent is generated for C++ output. Otherwise a new implicit state machine state is created with no successors and the thread sits in that state forever. Hanging forever is the default behaviour for forked threads.

For use with a standard execution substrate, having a $finish statement in the generated design makes no sense,

Environment.Exit(1) can also be invoked within C# to cause the same effect as main thread return.

(Pipelined accelerators cannot exit since they have no sequencer (§15 and are permanently ready to compute. )


User-defined C# Exceptions

C# try-except blocks are supported as is exception handling. But no exceptions can currently be caught and all lead to either a compile-time or run-time abend.

In other words, the contents of a C# catch block are ignored in the current KiwiC compiler.

The contents of a C# finally block are executed under Kiwi as normal.


Debug.Assert

System.Diagnostics.Debug.Assert(bool cond) and friends ...

We can raise a run-time assertion problem that is logged in the abend register ...

There is a compile-time variant called - not reached - or something ...

Pause Modes (within Sequencer HLS Mode)

Kiwi supports several major HLS modes, but the default, sequencer major HLS mode, generates a sequencer for each thread. When creating a sequencer, the number of states can be fully automatic, completely manual, or somewhere in between, according to the pause mode setting.

The mapping of logic operations to clock cycles is one of the main tasks automated by high-level synthesis tools, but sometimes manual control is also needed. Control can be needed for compatibility with existing net-level protocols or as a means to move the design along the latency/area Pareto frontier.

KiwiC supports several approaches according to the pause mode selected. Pause modes are listed Table 1. The number of ALUs and RAM ports available also makes a big difference owing to structural hazards. Fewer resources means more clock cycles needed.

The pause mode can, most simply, be set once and for all on the command line with, for examples -bevelab-bevelab-default-pause-mode=soft.

When in soft mode, the bevelab-soft-pause-threshold parameter is one of the main guiding metrics. But it has no effect on regions of the program compiled in hard-pause or other non-soft modes.

Typical values for the soft pause threshold are intended to be in the range 0 to 100, with values of 100 or above leading to potentially very large, massively-parallel designs, and with values around 15 or lower giving a design similar to the `maximal' pause mode.

The Kiwi.cs file defines an enumeration for locally changing the pause mode for the next part of a thread's trajectory.

enum PauseControl
   { autoPauseEnable, hardPauseEnable, softPauseEnable, 
     maximalPauseEnable, blockbPauseEnable };

The idea is that you can change it locally within various parts of a thread's control flow graph by calling Kiwi.PauseControlSet(mode) where the mode is a member of the PauseControl enumeration. Also, this can be passed as an argument to a Kiwi.Pause call to set the mode for just that pause. However, dynamic pause mode changing may not work at the moment ... owing to minor bugs.

For example, you can invoke Kiwi.PauseControlSet(Kiwi.PauseControl.softPauseEnable).


Table 1: Kiwi Pause Modes (within Sequencer Major HLS Mode)
No Name Pauses are inserted at
0 auto ?
1 hard exactly where pause statements are explicitly included
2 soft where needed to meet soft-pause-threshold
3 maximal inserted at every semicolon
4 bblock every basic block boundary


Nearly all net-level hardware protocols are intolerant to clock dilation. In other words, their semantics are defined in terms of the number of clock cycles for which a condition holds. A thread being compiled by KiwiC to a sequencer defaults to bblock or soft pause control, meaning that KiwiC is free to stall the progress of a thread at any point, such as when it needs to use extra clock cycles to overcome structural hazards. These two approaches are incompatible. Therefore, for a region of code where clock cycle allocation is important, KiwiC must be instructed to use hard pause control.

The recipe file kiwic00.rcp sets the following as the default pause mode now

    <option> bevelab-bevelab-default-pause-mode bblock </option>
This is not suitable for net-level interfaces but does lead to quick compile of scientific code which is what we are targeting at the moment.

For compiling net-level input and output, give KiwiC -bevelab-bevelab-default-pause-mode=hard as a command line option to override the recipe.

Maximal and blockb are considered just `debug' modes where pauses are inserted at every semicolon and every basic block boundary respectively.

Unwound Loops

For a thread in hard-pause mode that executes loops with no Pause() calls in them will, KiwiC will attempt to unwind all of the work of that loop and perform it in a single run-time clock cycle. (There are some exceptions to this, such as when there are undecidable name aliases in array operations or structural hazards on RAMs but these are flagged as warnings at compile time and run time hardware monitors can also be generated that flag the error).

TODO: describe the way KiwiC resolves structural hazards or variable-latency if the user has specified hard pause mode. Currently, KiwiC essentially tacitly takes and consumes any further clock cycles it needs to do the work.


\begin{quoze}
main_unwound_leader()
{
q = 100;
for (int d=0; d<16; d++) Co...
...while (true) { Kiwi.Pause(); Console.WriteLine(''q={0}'', q++); }
}
\end{quoze}

The example main_unwound_leader will unwind the first loop at compile time and execute the first 16 print statements in the first clock tick and q will be loaded with 116 on the first clock tick.

More-complex implied state machines


\begin{quoze}
main_complex_state_mc()
{
q = 1;
while(true)
{
Kiwi.Pause()...
...0; v<din; v++) { Kiwi.Pause(); q += v; }
Kiwi.Pause(); q = 1;
}
}
\end{quoze}

The example main_complex_state_mc has a loop with run-time iteration count that is not unwound because it contains a Pause call. This is accepted by KiwiC. However, it could not be compiled without the Pause statement in the inner loop because this loop body is not idempotent. In soft-pause mode the pause call would be automatically added by KiwiC if missing.

Inner loop unwound while outer loop not unwound.


\begin{quoze}
main_inner_unwound()
{
q = 1;
while(true)
{
Kiwi.Pause(); q...
...;
for (int v=0; v<10; v++) { q «= 1; }
Kiwi.Pause(); q = 1;
}
}
\end{quoze}

In main_inner_unwound the inner loop will be unwound at compile time because it has constant bounds and no Pause call in its body. (This unwind will be performed in the bevelab recipe stage, not KiwiC front end.)

Entry Point With Parameters

A top-level entry point with formal parameters, such as


\begin{quoze}[Kiwi.HardwareEntryPoint()]
main_withparam(int x)
{
...
}
\end{quoze}

is currently not allowed in normal sequencer mode, although in future it would be reasonable for these to be treated as additional inputs. This will be relaxed soon.

Top-level arguments are allowed in RPC (§7.1) and Accelerator major HLS modes (§15).

The -root command line flag is an alternative to the HardwareEntryPoint marker. Supplying this information on the command line is compatible with multiple compilation appoaches where a given source file needs to be processed in different ways on different compilation runs.

Generate Loop Unwinding: Code Articulation Point

Figure 2: Front End Control Flow after Unwind: Lasso Diagram.
\begin{figure*}\centerline{\epsfbox{lassoos.eps}}\end{figure*}

The KiwiC front end unwinds certain loops such as those that peform storage allocation and fork threads. The main behavioural elaborate stage of the KiwiC flow also unwinds other loops. Because of the behaviour of the former, the latter operates on a finite-state system and it makes its decisions based on space and performance requirements typical in high-level synthesis flows. Therefore, the loop unwinding performed in the KiwiC front end can be restricted just to loops that perform structural elaboration. These are known as generate loops in Verilog and VHDL. It is a typical Kiwi programming style to spawn threads and allocate arrays and other objects in such loops. Such elaboration that allocates new heap items, in Kiwi 1, must be done in the KiwiC front end since the rest of the HPR recipe deals only with statically-allocated variables.

Since threads both describe compile-time and run-time behaviour a means is needed to distinguish the two forms of loop. The approach adopted is that every thread in the source code is treated as generally having a lasso shape, consisting of code that is executed exactly once before entering any non-unwound, potentially-infinite loop.

The front-end algorithm used selects an articulation point in the control graph of a thread where all loops before this point have been unwound and all code reachable after that point has its control graph preserved in the program output to the next stage. Figure 2 illustrates the general pattern. The articulation point is called the end of static elaboration point. The point selected is the first branch target that is the subject of a conditional branch during an interpreted run of the thread or the entry point to the last basic block encountered that does not contain a Kiwi.Pause() call.

The branch will be conditional either because it depends on some run-time input data or because it is after at least one Kiwi.Pause() call. The semantics of Kiwi.Pause() imply that all code executed after the call are in a new run-time clock cycle. Apparently-conditional branches may be unconditional because of constant folding/propagation during the interpreted run. This is the basis of generate-style loop unwinding in the lasso stem.

Some programming styles require the heap changes shape at run time. A simple example occurs when an array or other object is allocated after the first call to Kiwi.Pause. We have found that programmers quite often write in this style, perhaps not allways intenionally, so it is useful if KiwiC supports it.


\begin{quoze}
main_runtime_malloc()
{
...
Kiwi.Pause();
int [] a = new Int[10];
for (int i=0; i<10; i++) a[i] = i;
while (true) { ... }
}
\end{quoze}

Provided the heap allocator internal state is modelled in the same way as other variables, no further special attention is required. In this fragment the heap values are compile-time constants.


\begin{quoze}
main_runtime_dyn_malloc()
{
...
Kiwi.Pause();
if (e)
{ int ...
...10];
for (int i=0; i<10; i++) a[i] = i;
}
while (true) { ... }
}
\end{quoze}

If the value of `e' in runtime_dyn_malloc is not a compile-time constant, KiwiC cannot compile this since there would be two possible shapes for the heap on the exit for the if statement. A solution is to call a.Dispose() before exit, but KiwiC currently does not support Dispose calls.

There's also the matter of saved thread forks ....

Here the outer loop is non-unwound loop yet has a compile-time constant value on each read if the inner loop is unwound
\begin{quoze}
while(true) // not unwound
{
for (int i=0;i<3;i++) foo[i].bar(f);
...
}
\end{quoze}

Supported Libraries Cross Reference

We have started documenting our libray coverage in this section.

System.Collections.Generic

Currently (August 2016), none of the standard collection types, such as Dictionary, are provided in the distro. Please contribute.

System.Random

For random number generation, for both WD and FP, please use KiwiSystem.Random instead of System.Random.

  KiwiSystem.Random dg = new KiwiSystem.Random();

Console.WriteLine and Console.Write

The Write and WriteLine methods are the standard means for printing to the console in C# and Kiwi. They can also print to open file descriptors. They embody printf like functionality using numbered parameters in braces.

Overloads are provided for used with up to four arguments. Beyond this, the C# compiler allocates a heap array, fills this in and passes it to WriteLine, after which it requires garbage collection. This should provide no problem for Kiwi's algorithm that converts such dynamic use to static use but if there is a problem then please split a large WriteLine into several smaller ones with fewer than five arguments (beyond the format string).

Argument formats supported are

  1. {n} -- display arg $n$ in base 10
  2. {n:x} -- display arg $n$ in base 16

Kiwi will convert console writes to Verilog's $display and $write PLI calls with appropriate munging of the format strings. These will come out during RTL simulation of the generated design. They can also be rendered on the substrate console during FPGA execution.

On important choice is whether this console output is preserved for the FPGA implementation. By default it is, with the argument strings compiled to hardware and copied character by character over the console port.

Sometimes two other behaviours are selectively wanted:

To achieve item 1, do not call Console.Write or Console.WriteLine. Instead call Kiwi.Write or Kiwi.WriteLine.

To achieve item 2, alter the recipe file or add the following command line argument to KiwiC

  -kiwic-fpgaconsole-default=disable

get_ManagedThreadId

- returns an integer representing the current thread identifier (tid).
        int tid = Thread.CurrentThread.ManagedThreadId;
        Console.WriteLine("Receiver process started. Tid={0}", tid);

// OLD      Console.WriteLine("Receiver process started. Tid={0}", System.Threading.ManagedThreadId);


System.BitConverter

System.String.ToCharArray

- convert a string to an array of chars. Chars are 16 bits wide in dotnet. They are tagged shorts and do not behave quite the same as shorts for various output options.

System.IO.Path.Combine

- join a pair of file name paths - OS-specific. FileStream

TextWriter

TextReader

The TestReader ReadLine api is allowed to create garbage under Kiwi provided the outer loop frees or garbages the returned string on every iteration. It must not, for example, store a handle on the returned string in an array.

FileReader

FileWriter

Threading and Concurrency with Kiwi

One novel feature of Kiwi that sets it apart from other HLS systems is its support for concurrency.

Threads can be spawned in the static lasso stem but Kiwi does not support thread creation at runtime.

Kiwi supports Thread.Create() and Thread.Start().

To run a method of the current object on its own thread use code like this:

                                                  
    public static void IProc()
    { 
       while (true) { ... }
    }

   ...
    
   Thread IProcThread = new Thread(new ThreadStart(IProc));     
   IProcThread.Start();

Or use delegates to pass arguments to a spawned thread running a method of perhaps another object:

                                                  
   Thread filterChannel = new Thread(delegate() { ZProc(1, 2, 3); });
   filterChannel.Start();

Exiting threads can be joined with code like this:

                                                  

  ...  missing ...
  Thread.Join(); // not tested currently.

Mutual exclusion is provided with the lock primitive of C#. Its argument must be the object handle of any instance (not a static class).

The Monitor.Wait and Monitor.PulseAll are supported for interprocess events.

                                                  
      lock (this)
         {
            while (!emptyflag) { /* Kiwi.NoUnroll(); */ Monitor.Wait(this); }
            datum = v;
            emptyflag = false;
            Monitor.PulseAll(this);
         }

The NoUnroll directive to KiwiC can decrease compilation time by avoiding unrolling exploration.

Sequential Consistency

KiwiC does not currently support fine-grained store ordering. Where a number of writes are generated in one major cycle (delimited by hard or soft pauses) the writes within that major cycle are freely reordered by the restructure recipe stage to maximimse memory port throughput. However, KiwiC already maintains ordering in PLI and other system calls, so extending this preservation to remotely-visible writes can easily be added in the near future.

Write buffers and copy-back caches may also be instantiated outside the KiwiC-generated code in uncore structures that are part of the substrate for a given FPGA blade. KiwiC has no control over these.

We are writing a paper that explores this space.

C# provides the Thread.MemoryBarrier() call to control memory read and write re-ordering between threads... but in the meantime you have to use Kiwi.Pause() to ensure write ordering.

Volatile Declarations

Variables that are shared between threads may need to be marked as volatile. The normal semantics are that memory fences are inferred from lock block boundaries and other concurrency primitives such as PulseAll. However, if shared variables are used without such fences they should be declared as volatile. Otherwise a process spinning on a change written by another thread may never see it change.

The C# language does not support volatile declarations of some types. You may get an error such as

   //tinytest0.cs(16,26): error CS0677: `tinytest0.shared': A volatile field
             cannot be of the type `ulong'

To overcome this, you can try to use the Kiwi-provided custom volatile attribute instead for now. For instance:

   [Kiwi.Volatile()]
   static ulong shared_var;

This technique will not stop the C# compiler from optimising away a spin on a shared variable, but the C# compiler may not do a lot of optimisation, based on the idea that backend (jitting) runtimes will implement all required optimisations. Ideally KiwiC works out which variables need to be volatile since all threads sharing a variable are compiled to FPGA at once.

Kiwi C# Attributes Cross Reference

The KiwiC compiler understands various .NET assembly language custom attributes that the user has added to the source code. In this section we present the attributes available. These control thinks such as I/O net widths and assertions and to mark up I/O nets and embed assertions that control unwinding.

C# definitions of the attributes can be taken from the file support/Kiwi.cs in the distribution.

The Kiwi attributes can be used by referencing their dll during the C# compiler.

  gmcs /target:library mytest.dll /r:Kiwi.dll

Many attributes are copied into the resulting .dll file by the gmcs compiler. Other code from such libraries is not copied and must be supplied separately to KiwiC. To do this, list the libraries along with the main executable on the KiwiC command line.

WARNING: THE ATTRIBUTE LIST IS CURRENTLY NOT STABLE AND THIS LIST IS NOT COMPLETE. For the most up-to-date listing, see hprls/kiwi/Kiwi.cs.

The C# language provides a mechanism for defining declarative tags, called attributes, that the programmer may place on certain entities in the source code to specify additional information. An attribute is specified by placing the name of the attribute, enclosed in square brackets, in front of the declaration of the entity to which it applies. We present design decisions regarding attributes that allow a C# program to be marked up for synthesis to hardware using the KiwiC compiler that we are developing [2]. This compiler accepts CIL (common intermediate language) output from either the .NET or Mono C# compilers and generates Verilog RTL.


Kiwi.Remote() Attribute

RPC (Remote-Procedure Call) Interface Between Compilations.

Marking up given methods to be remotely callable:

Object-oriented software sends threads between compilation units to perform actions. Synthesisable Verilog and VHDL do not allow threads to be passed between separately compiled circuits: instead, additional I/O ports must be added to each circuit and then wired together at the top level. Accordingly, we mark up methods that are to be called from separate compilations with a remote attribute.

  [Kiwi.Remote(``parallel:four-phase'')]
  public return_type entry_point(int a1, bool a2, ...)
  { ... }

When an implemented or up-called method is marked as `Remote', a protocol is given and KiwiC generates additional I/O terminals on the generated RTL that implement a stub for the call. The currently implemented protocol is asynchronous, using a four-phase handshake and a wide bus that carries all of the arguments in parallel. Another bus, of the reverse direction, conveys the result where non-void. Further protocols can be added to the compiler in future, but we would like to instead lift them so they can be specified with assertions in C# itself.

KiwiC will generate hardware both for the client and the server as separate RTL files. In more-realistic examples, there will be multiple files, with one being the top-level that contains client calls to some of the others which in turn make client calls to others, with the leaf modules in the design hierarchy being servers only.

One can also envision leaf modules in the design hierarchy making upcalls to parents, but this is not currently implemented in Kiwi.

class test10
{
    static int limit = 10;
    static int jvar;

    [ Kiwi.Remote(``client1-port'', ``parallel: four-phase'') ]
    public static int bumper(int delta)
    {
        jvar += delta;
        return jvar;
    }

    [Kiwi.HardwareEntryPoint()]
    public static void Main()
    {
       Console.WriteLine(``Test 10 Limit='' + limit);
       for (jvar=1;jvar<=limit;jvar+=2) 
       {
           Console.Write(jvar + `` ``);
       }  
       Console.WriteLine(`` Test 10 finished.'');
    }
}

See demo on this link

http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/rpc.html


Flag Unreachable Code

Kiwi.NeverReached("This code is not reached under KiwiC compilation.");

This call can be inserted in user code to create a compile-time error if elaborated by KiwiC. If a thread of control that is being expanded by KiwiC encounters this call, it is a compile-time error.

For flagging invalid run-time problems, please use System.Diagnostics.Debug.Assert within Kiwi code.

Hard and Soft Pause (Clock) Control

This section needs joining up with the repeated copy elsewhere in this manual!

Many net-level hardware protocols are intolerant to clock dilation. In other words, their semantics are defined in terms of the number of clock cycles for which a condition holds. A thread being compiled by KiwiC defaults to soft pause control (or other default set in the recipe or command line), meaning that KiwiC is free to stall the progress of a thread at any point, such as when it needs to use extra clock cycles to overcome structural hazards. These two approaches are incompatible. Therefore, for a region of code where clock cycle allocation is important, KiwiC must be instructed to use hard pause control.

The Kiwi.Pause() primitive may be called without an argument, when it will pause according to the current pause control mode of the calling thread. It may also be called with the explicit argument ` soft' or `hard'.

The current pause control mode of the current thread can be updated by calling
`Kiwi.SetPauseControl'.

When a thread calls Kiwi.SetPauseControl(hardPauseControl) its subsequent actions will not be split over runtime clock cycles except at places where that thread makes explicit calls to Kiwi.Pause() or makes a blocking primitive call.

The default schedulling mode for a thread can be restored by making the thread calls
Kiwi.SetPauseControl(autoPauseControl).

Finally, blockb pause control places a clock pause at every basic block and maximal pause control turns every statement into a separately-clocked operation
Kiwi.SetPauseControl(maximalPauseControl).

The Kiwi.Pause() primitive may be called with an argument that is an integer denoting a combination of built-in flags. This enables per-call-site override of the default pause mode.

End Of Static Elaboration Marker - EndOfElaborate

    public static void EndOfElaborate()
    {
        // Every thread compiled by KiwiC has its control flow partitioned
        // between compile time and run time.  The division is the end
        // of elaboration point.
        // Although KiwiC will spot the end of elaboration point for itself,
        // the user can make a manual call to this at the place where they
        // think elaboration should end for confirmation.
        // This will be just before the first Pause in hard-pause mode or 
        // undecidable name alias or sensitivity to a run-time input etc..
    }

Loop NoUnroll Manual Control

Put a call to `Kiwi.NoUnroll(loopvar)' in the body of a loop that is NOT to be unrolled by KiwiC. Pass in the loop control variable.

If there is a `KiwiC.Pause()' in the loop, that's the default anyway, so the addition of a NoUnroll makes no difference.

The number of unwinding steps attempted by the CIL front end can be set with the `-cil-uwind-budget N' command line flag. This is different from the ubudget command line flag used by the FSM/RTL generation phase.

Because a subsume attribute cannot be placed on a local variable in C#, an alternative syntax based on dummy calls to Unroll is provided.


\begin{quoze}
public static void Unroll(int a)
{ // Use these unroll functions...
...ubsumed in total or
// at least in the currently enclosing loop.
}
\end{quoze}

Elaborate/Subsume Manual Control

OLD: Ignore this paragraph from 2015 onwards.

This manual control was used in early versions of KiwiC but has not been needed recently.

KiwiC implements an elaboration decision algorithm. It decides which variables to subsume at compile time and which to elaborate into concrete variables in the output RTL design.

The decisions it made can be examined by grepping for the word `decided' in the obj/h1.log file.

The algorithm sometimes makes the wrong decision. This is being improved on in future releases.

For variables that can take attributes in C# (i.e. not all variables), it can be forced one way or the other by instantiating one of the pair of attributes, Elaborate or Subsume.

For example, to force a variable to be elaborated, use:
\begin{quoze}[Kiwi.Elaborate()]
bool empty = true;
\end{quoze}

Examples of variables that cannot be attributed is the implied index variable used in a foreach loop, or the explicit local defined inside a for loop using the for (int i=...;... ; ...) syntax.

The force of an elab can also be made using the -fecontrol command line option. For instance, one might put -fecontrol 'elab=var1;elab=var2';

Synchronous and/or Asynchronous RAM Mapping

See §8.


Register Widths and Overflow Wrapping

Integer variables of width 1, 8, 16, 32 and 64 bits are native in C# and CIL but hardware designers frequently use other widths. We support declaration of registers with width up to 64 bits that are not a native width using an `HwWidth' attribute. For example, a five-bit register is defined as follows.
\begin{quoze}[Kiwi.HwWidth(5)]static byte fivebits;
\end{quoze}
When running the generated C# natively as a software program (as opposed to compiling to hardware), the width attribute is ignored and wrapping behaviour is governed by the underlying type, which in the example is a byte. We took this approach, rather than implementing a genuine implementation of specific-precision arithmetic by overloading every operator, as done in OSCI SystemC [1], because it results in much more efficient simulation, i.e. when the C# program is run natively.

Although differences between simulation and synthesis can arise, we expect static analysis in KiwiC to report the vast majority of differences likely to be encountered in practice. Current development of KiwiC is addressing finding the reachable state space, not only so that these warnings can be generated, but also so that efficient output RTL can be generated, such that tests that always hold (or always fail) in the reachable state space are eliminated from the code.

The following code produces a KiwiC compile-time error because the wrapping behaviour in hardware and software is different.


\begin{quoze}[Kiwi.HwWidth(5)]byte fivebits;
void f()
{
fivebits = (byte)(fivebits + 1);
}
\end{quoze}

The cast of the rhs to a byte is needed by normal C# semantics.

Compiling this example gives an error:
\begin{quoze}
KiwiC:assignment may wrap differently:
(widthclocks_fivebits{storage=8 }+1)&mask(7..0):
assign wrap condition test rw=8, lw=5, sw=8
\end{quoze}


Net-level Input and Output Ports

Input and Output Ports can arise and be defined in a number of ways.

Net-level I/O ports are inferred from static variables in top-most class being compiled. These are suitable for GPIO applications such as simple LED displays and push buttons etc.. The following three examples show input and output port declarations, where the first two have their input and output have their width specified by the underlying type and the last by an explicit width attribute.
\begin{quoze}[Kiwi.OutputBitPort(''done'')]static bool done;
[Kiwi.InputPort(''...
... [Kiwi.HwWidth(5)] [Kiwi.OutputPort(''data_out'')] static byte out5;
\end{quoze}
KiwiC can create obscure names if these I/O declarations are not in a top-level class. So, the contents of the string are a friendly name used in output files.

For designers used to the VDHL concept of a bit vector, we also allow arrays of bools to be designated as I/O ports. This can generate more efficient circuits when a lot of bitwise operations are performed on an I/O port.
\begin{quoze}[Kiwi.OutputWordPort(11, 0, ''dvi_d'')]public static int[] dvi_d = ...
...ordPort(11, 0, ''dvi_i'')] public static int[] dvi_i = new int [12];
\end{quoze}
Although it makes sense to denote bitwise outputs using booleans, this may require castings, so ints are also allowed, but only the least significant bit will be an I/O port in Verilog output forms.

Currently we are extending the associated Kiwi library so that abstract data types can be used as ports, containing a mixture of data and control wires of various directions. Rather than the final direction attribute being added to each individual net of the port, we expect to instantiate the same abstract datatype on both the master and slave sides of the interface and use a master attribute, such as `forwards' or `reverse', to determine the detailed signal directions for the complete instance.

The following examples work
\begin{quoze}
// four bit input port
[Kiwi.HwWidth(4)]
[Kiwi.InputPort('''')]...
... din;
\par
// six bit local var
[Kiwi.HwWidth(6)] static int j = 0;
\end{quoze}

A short-cut form for declaring input and output ports


\begin{quoze}[Kiwi.OutputIntPort('''')]
public static int result;
\par
[Kiwi.OutputWordPort(31, 0)]
public static int bitvec_result;
\end{quoze}

Wide Net-level Inputs and Outputs

The C# language supports primitive data word lengths up to 64 bits. Sometimes we require net-level I/O busses that are wider than this. This can be achieved by attaching the net-level attribute markups to arrays.

Coding style `lostio'

Note: this style stopped working in about 2010 but is just being made to work again (Dec 2016).


\begin{quoze}
// Wide input and output, net-level I/O.
[Kiwi.InputWordPort(''w...
...or (int p=0; p<widein.Length; p++)
{
wideout[p] = widein[p];
}
}
\end{quoze}

Coding style using structs ... being fixed ...


\begin{quoze}
public class WideWordDemo
{
// Demo of wide input and output word...
...// Falls foul of operating on formals if passed by value?
}
...
}
\end{quoze}

Clock Domains

You do not need to worry about clock domains for general scientific computing: they are only a concern for hardware interfacing to new devices. KiwiC generates synchronous logic. By default the output circuit has one clock domain and requires just one master clock and reset input. The allocation of work to clock cycles in the generated hardware depends on the current `pause mode' and an unwind budget described in [2] and the user's call to built-in functions such as `Kiwi.Pause'.

Terminal names clock and reset are automatically generated for the default clock domain. To change the default names, or when more than one clock domain is used, the `ClockDom' attribute is used to mark up a method, giving the clock and reset nets to be used for activity generated by the process loop of that method.
\begin{quoze}[Kiwi.ClockDom(''clknet1'', ''resetnet1'')]
public static void Work1()
{ while(true) { ... } }
\end{quoze}
A method with one clock domain annotation must not call directly, or indirectly, a method with a differing such annotation.


Remote

Object-oriented software sends threads between compilation units to perform actions. Synthesisable Verilog and VHDL do not allow threads to be passed between separately compiled circuits: instead, additional I/O ports must be added to each circuit and then wired together at the top level. Accordingly, we mark up methods that are to be called from separate compilations with a remote attribute.
\begin{quoze}[Kiwi.Remote(''parallel:four-phase'')]
public return_type entry_point(int a1, bool a2, ...)
{ ... }
\end{quoze}
When an implemented or up-called method is marked as `Remote', a protocol is given (or implied) and KiwiC generates additional I/O terminals on the generated RTL that implement a stub for the call. The currently implemented protocol is synchronous (using the current clock domain - TODO explain how to wire up), using a four-phase handshake and a wide bus that carries all of the arguments in parallel. Another bus, of the reverse direction, conveys the result where non-void. Further protocols can be added to the compiler in future, but we would like to instead lift them so they can be specified with assertions in C# itself. The protocol argument can be omitted from the attribute.

A remote-marked method is either an entry point or a stub for the current compilation. This depends on whether it is called from other hardware entry points (roots).

If it is called, then it is treated as a stub and its body is ignored. Call sites will initiate communication on the external nets. The directions of the external nets is such as to send arguments and receive results (if any).

If it is not called from within the current compilation, then it is treated as a remote-callable entity. The directions of the external nets is such as to receive arguments and return results (if any).

Elaboration Pragmas - Kiwi.KPragma

        public static int KPragma(bool fatalFlag, string cmd_or_message)
        public static int KPragma(bool fatalFlag, string cmd_or_message, int arg0)
        public static int KPragma(bool fatalFlag, string cmd_or_message, int arg0, int arg1)

Kiwi.KPragma with first argument as Boolean true can be used to conditionally abend elaboration. This behaves the same way as System.Diagnostics.Debug.Assert described in §7.14 except that a user-defined error code can be passed in arg0.

With the Bool false, it is used to log user progress messages during elaboration.

Kiwi.KPragma calls present in run-time loops can be emitted at runtime using the Console.WriteLine mechanisms (in the future - current release ignores them beyond elaboration).

Kiwi.KPragma calls with magic string values will be used to instruct the compiler, but no magic words are currently implemented.


Assertions Debug.Assert()

Sometimes it is convenient to generate compile-time errors or warnings. Othertimes we want to flag a run-time abend, as per §2.2.

Typically you might want to direct flow of control differently using the function Kiwi.inHardware() and to abort the compilation if it has gone wrong. Call the function Kiwi.KPragma(true/false, ``my message'') to generate compile time messages. If the first arg holds, the compilation stops, otherwise this serves as a warning message.

You can make use of System.Diagnostics.Debug.Assert within Kiwi code.

In KiwiC 1.0 you have to re-code dynamic arrays with static sizes and this is needed for all on-chip arrays in Kiwi 2.0. The code below originally inspected the fileStream Length attribute and created a dynamic array. But it had to be modified for Kiwi 1.0 use as follows

  int length = (int)fileStream.Length;  // get file length - will be known at runtime only
  System.Console.WriteLine("DNA file length is {0} bytes.", length);
  const int max_length = 1000 * 1000 * 10; // Arrays need to be constant length for Kiwi use.
  System.Diagnostics.Debug.Assert(length <= max_length, "DNA file length exceeds static buffer size");
  buffer = new byte[max_length];            // create buffer to read the file
  int count;                            // actual number of bytes read
  int sum = 0;                          // total number of bytes read
  // read until Read method returns 0 (end of the stream has been reached)
  while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
  {
     sum += count;  // sum is a buffer offset for next reading
  }
  System.Console.WriteLine("All read, length={0}", sum);

The C# compiler may/will ignore the Assert calls unless some flag is passed ...

Assertions - Temporal Logic

Universal assertions about a design can be expressed with a combination of a predicate method (i.e. one that returns a bool) and a temporal logic quantifier embedded in an attribute. For instance, to assert that whenever the following method is called, it will return true, one can put
\begin{quoze}[Kiwi.AssertCTL(''AG'', ''pred1 failed'')]
public bool pred1()
{ return (... ); }
\end{quoze}
where the string AG is a computational tree logic (CTL) universal path quantifier and the second argument is a message that can be printed should the assertion be violated. Although the function `pred1' is not called by any C# code, KiwiC generates an RTL monitor for the condition and Verilog $display statements are executed should the assertion be violated. In order to nest one CTL quantifier in another, the code of the former can simply call the latter's method. Since this is rather cumbersome for the commonly used AX and EX quantifiers that denote behaviour in the next state, an alternative designation is provided by passing the predicate to a function called `Kiwi.next'. A second argument is an optional number of cycles to wait, defaulting to one if not given. Other temporal shorthands are provided by `Kiwi.rose', `Kiwi.fell', `Kiwi.prev', `Kiwi.until' and `Kiwi.wunitl'. These all have the same meaning as in PSL.

We are currently exploring the use of assertions to describe the complete protocol of an I/O port. Such a description, when compiled to a monitor, serves as an interface automaton. To automatically synthesise glue logic between I/O ports, the method of [3] can be used, which implements all non-blocking paths through the product of a pair of such interface automata.


Memories in Kiwi

Arrays allocated by the C# code must be allocated hardware resources. Small arrays are commonly converted directly into Verilog array definitions that compile to on-chip RAMs using today's FPGA tools. There are a number of (adjustable) threshold values that select what sort of RAM to target. Larger arrays are placed off-chip by default. Arrays that are only written at each location precisely once with a constant value for each location are treated as read-only look-up tables (ROMs).

Sometimes there are multiple ports to a given memory space/bank for bandwidth reasons. For instance, on the Xilinx Zynq, it is common to use two high-performance AXI bus connections to the same DRAM bank. In addition, there can be multiple memory controllers each with its own channel. We prefer the term channel to the older term bank since bank now refers to an internal bank within a DRAM chip that can have up to one row open in each bank. Kiwi does not currently support multiple channels.

Terminology summary: we use the following hierarchy of terms to describe the off-chip memory architecture: bit, lane, word, row, col, bank, rank, channel.

Explanation: A word is addressed with a binary address. The row, col, bank and rank are all fields in the address. Ordering between col and bank may vary. Channels potentially have disjoint address spaces. Mapping the channel number into the address would eliminate spatial reuse and simply be an extension of the rank. Within the word there are multiple lanes that are separately writable and each lane has some number of bits. In today's CPUs from Intel and ARM, the lane size is 8 (a byte lane) and the word size is also 8, making it a 64-bit word. On FPGAs, where clock frequencies are lower than DRAM speeds, word sizes of 512 can commonly be used with a correspondingly larger number of lanes.

In this documentation, we use the term `off-chip' to denote resources that are not instantiated by KiwiC and which, instead, are provided by the substrate platform. In reality, the resources might physically be on the same silicon chip as the FPGA programmable logic.

Each array with off-chip status is allocated a base address in one of some number of off-chip memory channels and accessed via one or more off-chip load/store ports.


Table 2: RAM forms supported by FPGAs.
\begin{table}\framebox{\begin{minipage}{12cm}
FPGAs tools support RAMs in four g...
...ransactions destined for a complex memory subsystem.
\end{minipage}}
\end{table}


Overall, these thresholds and attributes map each RAM instance to a specific level in a four-level memory technology hierarchy:

  1. unstructured: no read or write busses are generated (the old default, sea-of-gates, any number of concurrent reads and writes are possible without worry over structural hazard)

  2. combinational read, synchronous write register file (address generated in same cycle as read data consumed)

  3. latency of 1 SSRAM (address generated one clock cycle before read data used)

  4. external memory interface for off-chip ZBT/QBI, DRAM, or cached DRAM.

The number of ports is unlimited for type 1 (register file) and the FPGA tools will typically implement such a register file if the number of operations per clock cycle is more than one. This depends on the number of subscription operators in the generated RTL, the number of different address expressions in use and whether the tools can infer disjointness in their use.

For types 2 through 4, the number of ports is decided by KiwiC and it generates that number of read, write and address busses. By default, KiwiC uses one port per clock domain, but this can be influenced in the future with PortsPerThread and ThreadsPerPort attributes.

In the current version of Kiwi, the res2-loadstore-port-count (formerly called res2-no-dram-ports) recipe setting configures the number of load/store ports available per thread Also, each thread that makes off-chip loads and stores must have its own port since KiwiC does not automatically instantiate the DRAM (HFAST) arbiters: instead the substrate top-level needs to instantiate the arbiters when KiwiC generates more DRAM ports than physically exist on the FPGA.

The three thresholds set in the command line or recipe that distinguish between the four memory types are :

  1. res2-regfile-threshold: the number of locations below which to not instantiate any sort of structural SRAM or register file: instead raw flip-flops are used.

  2. res2-combram-threshold:, the threshold in terms of number of locations at which to start instantiating synchronous, latency=1, structural SRAM,

  3. res2-offchip-threshold: the threshold in terms of number of locations at which to map to an off-chip resource, such as TCM, ZBT or cached DRAM.

In addition to comparing sizes against compilation thresholds, the user can add CSharp attributes to instances to force a given technology choice on a per-RAM basis.

The SynchSRAM(n) attribute indicates that an array is to be mapped to an on-chip RAM type that is not the default for its size. The argument is the number of clock cycles of latency for read.

TODO: describe PortsPerThread and so on... these control multi-port RAMS and how the number of external ports is configured.

Kiwi has a scheduller in its restructure phase that runs at compile time to sequence operations on scarce resources such as complex ALUs and memory resources. Kiwi supposedly implements run-time arbitration for resources that are contended between threads, but the reality is currently different. It follows three policies: 1. For 'on-chip' RAMs like FPGA B-RAM it allocates one port per thread so, with Xilinx and Altera that support up to two ports only two threads can access an 'on-chip' B-RAM. 2. For ALUs it does not share them between threads and starts the ALU budgeting freshly for each thread, just as though the threads had been separately compiled. 3. For `off-chip RAM' like DRAM, it generates one (more are possible via the command line) HFAST port per thread. The user must currently manually instantiate arbiters that mux this collection of ports onto the DRAM banks that are available.

However, Kiwi does not care whether `off-chip' resources are actually off-chip and instead one can use the off-chip technique to multiplex and arbitrate multiple threads onto on-chip resources, such as a large, manually instantiated B-RAM.

On-chip RAM (and ROM) Mirror, Widen and Stripe Directives

To increase memory performance, three techniques are generally available (these techniques may not all be sensible for off-chip RAM resources). All of these increase the number of data bus wires to RAMs, thereby increasing available throughput.

  1. A Kiwi.Mirror(n) directive applied to a C# array instructs KiwiC to make multiple copies of the RAM or ROM. This is most sensible for ROMs since all copies of a RAM must be updated with every write.

  2. A Kiwi.Widen(n) directive applied to a C# array instructs KiwiC to pack $n$ words into a single location. This multiplies the data bus width by this factor. For RAMs, a RAM with laned writes may be needed. This will boost performance where an aligned group of $n$ words is commonly read and written at once.

  3. A Kiwi.Stripe(n) directive applied to a C# array instructs KiwiC to allocate $n$ multiple RAMs or ROMs each of $1/n^{th}$ the size with every $n^{th}$ word placed in each of them.

(In order to pack multiple user arrays into a single RAM on the FPGA, additional directives are needed. Not described here currently.)


ROMs (read-only memories) and Look-Up Tables

Most FGPAs support ROMs. ROM inference is a variation on RAM inference. Combination and registered ROMs are both commonly used, depending on size. KiwiC will deploy ROMs with pipeline latency of 1 when the size in addresses exceeds the size set by res2-combrom-threshold.

ROM inference in KiwiC can be turned off with flag repack-to-rom=disable in which case RAMs are commonly generated and initialised with the ROM contents after the run-time reset. But, when ROMs are present, they are manifest in the generated Verilog RTL as arrays that have their only write operations embodied in Verilog initial statements that install the fixed data.

ROMs can sometimes usefully be mirrored. The Kiwi.Mirror(4) attribute can be applied to individual array instances to mirror them.

  [Kiwi.Mirror(4)]
  static readonly uint[] htab4 = { 0x51f4a750, 0x7e416553, 0x1a17a4c3, 0x3a275e96, 
                                   ... many more entries ... 
                                 };

Or else the command line flag repack-to-rom=4 can be added, which would replicate all ROMs up to a factor of 4, but the additional copies would not be generated if they cannot usefully be used.


Forced Off-chip/Outboard Memory Array Mapping

The OutboardArray attribute forces that an array is to be mapped to a region of external memory instead of being allocated a private array inside the current compilation. Large arrays are placed off chip in this way by default. It is up to the substrate architect what sort of memory to attach to the resulting port: it could range from simple large SRAM bank to multiple DRAM banks with caches.

OLD: The fullest version of this attribute takes two arguments: a bank name and an offset in that bank.

OLD: Pre performance profiling: In general, arrays can be mapped to a specific bank by giving the bank name and leaving out the base address. KiwiC will then allocate the base addresses for each memory to avoid overlaps. If no bank name is given, then a default of 'drambank0' is automatically supplied. Therefore, without using any attributes, all large arrays are mapped into consecutive locations of a memory space called 'drambank0'.


Off-chip load/store ports

KiwiC generates load/store ports to access off-chip memory. (Off-chip means not instantiated by KiwiC, so the addressed resource can be on the same die in reality). With more load/store ports in use, greater memory access bandwidth is available AND greater opportunities for out-of-order memory service exist.

The off-chip port architecture is defined in recipe/command line settings. It is also written as a report file in every KiwiC run. The Off-chip Memory Physical Ports/Banks report looks something like this:

*-----------+----------+--------+--------+-------+-----------*
| Name      | No Words | Awidth | Dwidth | Lanes | LaneWidth |
*-----------+----------+--------+--------+-------+-----------*
| loadstor1 | 4194304  | 22     | 256    | 32    | 8         |
*-----------+----------+--------+--------+-------+-----------*

Total load/store port width = bits per lane * number of lanes.

Default -res2-loadstore-port-count=1 
     Number of LOADSTORE ports for automatic off-chipping of large RAMs.

res2-loadstore-port-lanes 32 LOADSTORE ports - number of write lanes.

res2-loadstore-lane-width 8 LOADSTORE lane width

When the number of lanes is 1 no lane write enables are used and the memory is word addressed always.

A suitable behavioural Verilog fragment to connect to them for simulation test purposes is available as part of the distro in the rams folder.

Typical DRAM controllers run much faster than the FPGA user logic and hence a wide word is presented to the KiwiC-generated code of 256 bits or so.

The user's wanted data width is either rounded up to some integer multiple number of external words, or some fraction of a word where the fraction is rounded up to a bounding power of 2 number of lanes.

The restructure log file will explain, somewhat cryptically, how each DRAM bank is being used with a table that contains interleaved entries covering all the banks (portnames). The lines in this report can be decoded with experience: D16 means sixteen bits wide. AX means an array. etc..

Off-chip Memory Map
*-----------------+-----------+-------+-----------+-----------*
| Resource        | Base      | Width | Length    | Portname  |
*-----------------+-----------+-------+-----------+-----------*
| D8US_AX/CC/SOL  | 0x1312d02 | 32    | 0x989680  | drambank0 |
| D16SS_AX/CC/SOL | 0x0       | 32    | 0x1312d02 | drambank0 |
*-----------------+-----------+-------+-----------+-----------*

Performance generally needs to be enhanced above this baseline by packing data sensibly into DRAM words. Also, support of multiple in-flight requests is preferable for the highest performance.

The KiwiC-generated code should be connected to an externally-provided memory controller that will often also also include some sort of cache.

Three off-chip protocols are supported BVCI, HSIMPLE and HFAST. HFAST is most commonly used. BVCI allows multiple transactions to be in flight. AXI will be added shortly to KiwiC, but there are some AXI components in the support and subtrates library. Including an HFAST to AXI protocol bridge and AXI master and slave shims for the Zynq substrate for CPU interaction and DRAM access.

When we say `off-chip' we simply mean outside the generated hardware circuit - the substrate configuration may put various items on the same Physical chip.

KiwiC will shortly be enhanced to issue prefetch bus cycles on off-chip RAMs. These are appropriate for cached DRAM and sometimes appropriate for uncached off-chip RAMs. They serve no useful function for SRAM (static RAM), whether on-chip or off-chip, owing to its uniform access latency.

HSIMPLE Offchip Interface & Protocol

Low-performance HSIMPLE uses four-phase handshake and only transfers data once every four clock cycles. It is more suitable for connecting to simple peripherals than DRAM. The following nets will require connection to the synthesis output when the DRAM is in use with the default, simple, 4/P HSIMPLE protocol.

    output reg hs_dram0bank_req,
    input hs_dram0bank_ack,
    output reg hs_dram0bank_rwbar,
    output reg [255:0] hs_dram0bank_wdata,
    output reg [21:0] hs_dram0bank_addr,
    input [255:0] hs_dram0bank_rdata,
    output reg [31:0] hs_dram0bank_lanes,

When the number of lanes is one, there are no lane outputs.

HFAST Offchip Interface & Protocol

HFAST1 offers one cycle read latency and back-to-back operations, achieving 100 percent throughput. It is ideal for front-side cache connections where prefetch is not used.

The signature for HFAST is typically as follows (the total width and number of lanes and address bus width are all parameterisable).

    output reg hf1_dram0bank_opreq,
    input hf1_dram0bank_oprdy,              // Any posedge clk with overlap of opreq and opack starts a new request.
    input hf1_dram0bank_ack,                // Ack acknowledges the last request is complete.
    output reg hf1_dram0bank_rwbar,         // 1=read, 0=write on request active clock edge.
    output reg [255:0] hf1_dram0bank_wdata, // For write, data to be written, valid on request active clock edge.
    output reg [21:0] hf1_dram0bank_addr    // Address, valid on request active clock edge.
    input [255:0] hf1_dram0bank_rdata,      // Read result, valid on ack cycle.
    output reg [31:0] hf1_dram0bank_lanes,  // Byte lane qualifiers.

When the number of lanes is 1 no lane write enables are used and the memory is word addressed always.

A DDRAM2 controller is available in the file kiwi/rams/ddr2-models. This can be used for high-level simulations. It instantiates the DDR_DRAM_BANK underneath itself.

A behavioural model of a DDRAM2 is available in the file kiwi/rams/ddr2-models. It has signature:

// (C) 2010-14 DJ Greaves.                                                                                                                           
// Verilog RTL DDR2 behavioural model - fairly high level.                                                                                           
// The SIMM or DIMM (all the chips of the bank) is modelled with one RTL module.                                                                     
module DDR_DRAM_BANK(
  input                           clk,    // DDR Clock - 800 MHz typically. We use one edge only and double the datapath width.       
  input                           reset,  // Active high synchronous reset                                                            
  input                           ddr_ras, // Active low row address strobe                                                           
  input                           ddr_cas, // Active low col address strobe                                                           
  input [log2_internal_banks-1:0] ddr_ibank,// Internal bank select                                                                   
  input                           ddr_rwbar,// On CAS: 1=read, 0=write. On RAS 1=precharge, 0=activate.                               
  input [2*dwidth-1:0]            ddr_wdata, // The wdata and rdata busses are here twice their width in reality owing to DDR.        
  input [awidth-1:0]              ddr_mux_addr, // Multiplexed address bus                                                            
  input [2*dwidth/8-1:0]          ddr_dm,   // Lanes: Separate nets here for +ve and -ve edges instead of combined.                   
  output reg [2*dwidth-1:0]       ddr_rdata); // Read data bus.                                                                       
   parameter log2_dwidth = 5;
   parameter dwidth = (1<<log2_dwidth);         // Word width in bits - we actually have twice this to achieve/simulate double data rate.            
   // FOR DRAM style                                                                                                                                 
   // E.g.   MT41K256M32-125 DDR3 @ 800 MHz/1.25ns RCD-RP-CL=11-11-11 Arch=32M x 32 bits x 8 banks = 8Gb = 1GB.  Row bits=15, Col=10, Bank=3.        
   parameter LOG2_ROW_SIZE = 15;   // Log_2 number of words per RAS                                                                                  
   parameter LOG2_COL_SIZE = 10;   // Log_2 number of words per CAS                                                                                  
   parameter PRECHARGE_LATENCY = 11;
   parameter ACTIVATE_LATENCY = 11;
   parameter CAS_LATENCY = 11;
   parameter log2_internal_banks = 3;
   parameter awidth = LOG2_ROW_SIZE;         // Address width in bits - word addressed.                                                              
   // DRAM burst size - can be dynamically encoded in high-order CAS address. Currently fixed at 32 bytes.  With a 32 bit data bus (64 after doublin\
g for DDR) this requires 4 clocks to transfer the burst.                                                                                             
   parameter burstSize = 4;

HFAST2 is the same as HFAST1 but uses a two-cycle, fully-pipelined read latency.

A simple cache is provided. Its signature is:

module cache256_hf1
  (input clk,
   input                              reset, // synchronous, active high.

   // Front-side interface 
   input                              fs_rwbar,
   output reg [noLanes*laneSize-1:0]  fs_rdata,
   input [noLanes*laneSize-1:0]       fs_wdata,
   input [addrSize-1:0]               fs_wordAddr,
   output                             fs_oprdy,
   input                              fs_opreq,
   output reg                         fs_ack,
   input [noLanes-1:0]                fs_lanes,

   // Back-side interface 
   output reg                         bs_rwbar,
   input [noLanes*laneSize-1:0]      bs_rdata,
   output reg [noLanes*laneSize-1:0] bs_wdata,
   output reg [addrSize-1:0]          bs_wordAddr,
   input                              bs_oprdy,
   output reg                         bs_opreq,
   input                              bs_ack,
   output reg [noLanes-1:0]           bs_lanes
   );

   parameter dram_dwidth = 256;          // 32 byte DRAM burst size or cache line.          
   parameter laneSize = 8;
   parameter noLanes = dram_dwidth / laneSize; // Bytelanes.

The cache must be manually instantiated by the substrate designer.

HFAST arbiters can be instantiated on the front or back side of the cache, so that multiple synthesised load/store ports can share one cache or multiple caches can share one DRAM bank. Sharing would be inconsistent.

The default substrate runs the DRAM and DRAM controller at 800 MHz and the Cache and KiwiC generated code at 133 Mhz which is 1/6th of this.

BVCI Offchip Interface & Protocol

Text missing.


AXI and HFAST-to-AXI mapping

AXI has become the most prevalent SoC and FPGA bus interface standard. AXI supports burst transactions and out-of-order service. Such AXI service discipline is well-suited to a high-performance DRAM bank controller. (Such a bank controller typically has 8 internal banks, all of which can be concurrently open on a DRAM row.)

Today's CPUs use multiple load/store stations per core that are pari passu with that core's ALUs. KiwiC-generated hardware is no different. Each load/store station is busy with at most one scalar load/store request and this can only be served in order.

As with CPUs, there are two techniques that adapt between single-issue load/store stations: multiplexing and caching.

KiwiC load/store stations are served with HFAST interfaces. In the fullness of time, KiwiC will provide automated support for HFAST to AXI adaptation but currently a substrate that manually matches the number of load/store ports is required.

The substrate typically converts the KiwiC-generated HFAST interfaces to AXI or other off-chip protocols not currently supported by KiwiC. The substrate provider writes RTL transactors to convert protocols.

Off-chip address size

KiwiC assumes it can use address zero upwards in the off-chip space. The substrate must offset the address bus to address available SoC regions if this is not the case.

KiwiC accepts a recipe parameter to bound the amount of off-chip memory it can use in its one channel. Where a design attempts to use more memory, a compile-time error is raised.

`res2-loadstore-lane-addr-size' gives the off-chip address bus width in bits. In other words, this is the log2 no of words of memory available in each address space. Providing different limits for different off-chip spaces will be enabled in future. The word size and lane structure is defined with `res2-loadstore-port-lanes' and `res2-loadstore-lane-width' where the first of these is typically 4, 8, 16 or 32 and the second nearly always 8 (ie byte-sized lanes).

B-RAM Inference

B-RAM instantiation is normally automatic in FPGA tools. B-RAMs with an access latency of one clock cycle are normally used although KiwiC can support zero and two cycle reads (but how to access them is not described here! TODO).

A B-RAM is inferred from a structure following one of several paradigms based on all addresses passing through a single register or all read data being passed through a single register. These can be mapped onto the same underlying technology by posting the writes as necessary but the effects of read while writing to the same location differ.

KiwiC generates on-chip RAMs as explicit instances in the generated RTL. It uses 'read before' coding style. The FPGA Vendor 'read after' forms, where newly written data is read out are not explicitly found in the generated RTL: KiwiC will forward the data for itself when needed, either at compile or run time.

// (C) Xilinx 2009.  Single-Port B-RAM with Byte-wide Write Enable: Read-First mode
// Download: ftp://ftp.xilinx.com/pub/documentation/misc/xstug_examples.zip
// File: HDL_Coding_Techniques/rams/bytewrite_ram_1b.v
//
module v_bytewrite_ram_1b #(
   parameter SIZE = 1024,
   parameter ADDR_WIDTH = 10,
   parameter COL_WIDTH = 9,
   parameter NB_COL = 4)
   (
   input clk,
   input [NB_COL-1:0] we,
   input [ADDR_WIDTH-1:0] addr,
   input [NB_COL*COL_WIDTH-1:0] di,
   output reg [NB_COL*COL_WIDTH-1:0] do);

   reg [NB_COL*COL_WIDTH-1:0] RAM [SIZE-1:0];

   always @(posedge clk) begin
     do <= RAM[addr];
     end

   generate
     genvar i;
     for (i = 0; i < NB_COL; i = i+1) begin
       always @(posedge clk) 
       if (we[i]) RAM[addr][(i+1)*COL_WIDTH-1:i*COL_WIDTH] <= di[(i+1)*COL_WIDTH-1:i*COL_WIDTH];
       end
   endgenerate

endmodule

//Single Port Block RAM with registered output Option
// Please note that XST infers distributed RAM or B-RAM based on the size. For small RAMs, you may need to use ram_style constraint to fore the use of B-RAM.

module TWO_CYCLE_READ_BRAM(
  input clk,
  input wen,
  input [6:0] a,
  input [15:0] di,
  output reg [15:0] do);

  reg [15:0] ram [0:127];
  reg [15:0] do0;

  always @(posedge clk) begin 
     if (wen) ram[a] <= di; 
     do0 <= ram[a];
     do <= do0;
  end
endmodule

Style 1:

  always @(posedge clk) begin
      addr_reg <= addr ... ;
      if (wen ...)  data[addr_reg] <= (wdata ...);
      rdata = data[addr_reg]; // Note blocking assign used or 
                              // else the rhs freely used elsewhere.
      end

Style 2:

  always @(posedge clk) begin
      if (wen ...) data[addr] <= (wdata ...);
      rdata_reg <= data[addr]; // No other reads elsewhere
      end

There are also the dual-ported equivalents of these styles, supported by both Xilinx and Altera.

Dual-port and multi-port RAMs

See demo test50.

The FPGA libraries contain dual-port RAMs. These can be used for sharing data between up to two threads. Threads can also shared data via a scalar variables. Kiwi supports any number of threads reading or writing shared scalar variables but there are technology restrictions on shared access to arrays.

Where an array is small enough to instantiated as an FPGA on-chip B-RAM (block RAM), and overrides are not applied, then such a B-RAM will be used. Both Xilinx and Altera provided FPGAs with on-chip, dual-ported B-RAMs with synchronous read latency of one cycle.

If three threads operated on the shared memory, Kiwi would generate an instance of a triple-ported SRAM module but this would not be found in the technology library when then FPGA tools were applied. A hardware designer could implement such a device, but it would probably have to be variable latency (i.e. have handshake wires) and this can be requested with CSharp attributes on the array instance. The preferred/supported design style for when three or more threads share an array is to ensure the underlying memory if 'off-chip' and then each thread will make access to its via its own load/store port (see Kiwi manual).

By default, KiwiC will use one port on an SRAM for each thread that operates on it. However, by setting the PortsPerThread parameter to greater than one then greater access bandwidth per clock cycle for each thread is possible. Note that Xilinx Virtex BRAM supports up to two ports per BRAM in total, so having ports per thread set to two is the maximum sensible value and that may only sensible if there is only one thread making access to the RAM. In the future, several threads in the same clock domain might get to share the physical ports if the compiler can spot they are temporarily disjoint (i.e. never concurrent).

<p>... we need to add a little more explanation or forward reference here please ...

<hr>


Substrate Gateway

The substrate gateway is a hardware/software boundary for use on platforms such as Zynq or others that run embedded linux with a console, network and filesystem. It has an associated protocol for providing operating system access.


Console I/O

This section will explain how to do console I/O via the substrate gateway.


Filesystem Interface

The basic dotnet classes for StreamReader, StreamWriter, TextReader and TextWriter are provided via the substrate gateway. Random access using fseek is also supported.

documentation incomplete ... add KiwiFilesystemStubs.dll to your compilation ... documentation for Zynq use will be added here... Satnam's windows version ... It works fine under RTL_SIM with verilator.

The following nets will require connection to the synthesis output when the Kiwi file system is in use.

For high performance computing applications the filesystem is part of the Kiwi Substrate (alongside the DRAM).

    output reg KiwiFiles_KiwiRemoteStreamServices_perform_op_req,
    input KiwiFiles_KiwiRemoteStreamServices_perform_op_ack,
    input [63:0] KiwiFiles_KiwiRemoteStreamServices_perform_op_return,
    output reg [63:0] KiwiFiles_KiwiRemoteStreamServices_perform_op_a2,
    output reg [31:0] KiwiFiles_KiwiRemoteStreamServices_perform_op_cmd,

A suitable behavioural Verilog fragment to connect to them for simulation test purposes is /kiwi/filesystem/kiwifs_bev.v that provides the basic console and file stat/exists/open/close/read/write calls required by the dotnet Stream and File.IO classes.

The remainder of this part of the user manual is missing, but please check the Bowtie Geneome Sequencer demo for an example of file system use.


Hardware Server

The Server attribute indicates that a method and the methods it calls in turn are to be allocated to a separate RTL module that is instantiated once and shared over all calling threads.

Kiwi Performance Tuning

An HLS system can be set to optimise for

  1. Performance: achieving the best execution time, aiming for maximal clock frequency and minimal number of clock cycles,

  2. Area: using as little area as possible, generally at the expense of many more clock cycles,

  3. Debugibility: renaming and sharing registers as little as possible and providing additional debug and trace resources for interative access.

The main parameters for tuning the Kiwi Area/Performance tradeoff, folding space over time are:

  1. The bevelab-soft-pause-threshold parameter. The nominal range is 0 to 100 with useful values currently being between 5 and 40. A lower value tends towards more clock cycles and possibly less area. Values above 40 may lead to very long KiwiC compile time.

  2. The loop unwind limits alter the amount that a loop is unwound at compile time, leading to parallelism. For instance, the Kiwi.Unroll("COUNT~=4", lvar); attribute added to the C# source code suggests that the loop whose control variable is called `lvar' is unwound by a factor of 4.

  3. Structural Resource Budgets: The restructure phase accepts ten or so recipe settings that limit the maximum number of structural resources, such as floating-point ALUs allocated pre thread. Smaller settings lead to smaller designs that use more clock cycles.

  4. RAM thresholds: Settings such as res2-offchip-threshold alter the amount of block RAM allocated. This is faster than external (off-chip) SRAM or DRAM but uses more FPGA resources.

  5. The setting res2-loadstore-port-lanes alters the number of external memory ports used. These each operate in order, so if you have more of them and mux them externally onto separate resources or an out-of-order bus then you get more parallelism and external RAM bandwidth.

  6. ALU latency: Settings such as fp_fl_dp_div describe the type of divider to generate. For such components you can provide your own implementations, alongside those provided in the Kiwi libraries like cvgates.v, and specifiy whether they are fixed or variable latency, fully-pipelined and what the fixed or expected latency in clocks cycles is.

  7. Register colouring affinty: The kiwic-colour-enable setting alters the amount to which KiwiC reuses registers. With it disabled, the hardware is easier to inspect/debug, but many more registers are generated. An experimental, spatially-aware binder is being added to Kiwi at the moment. This will handle both registers and ALUs and gives a floorplan plot.

Commonly, the system DRAM will run at a hardwired clock frequency, such as 800 MHz. This is too fast for most current FPGA logic, Kiwi-generated or otherwise. An integer divisor of 4 or 5 typically needs to be applied to bring the logic speed below 200 MHz. Getting KiwiC to hit a target clock frequency is a common requirement ... TBC ...

Kiwi Performance Predictor

In 2015 a performance predictor was added to Kiwi so that estimates of run-time performance can be rapidly provided without having to do an FPGA place-and-route or even a complete pre-FPGA RTL simulation. The performance predictor is based on basic block visit ratios stored in a database that is updated with the results from short runs. When the application is edited and recompiled with KiwiC, a new prediction is generated, straightaway, based on the contents of the database generated by previous versions. Short profile runs of the new design can then be run to improve prediction accuracy. Every prediction is reported with confidence limits. The reported confidence is reduced (wider error bars) both by certain design edits and by extrapolating to runs that are much longer than those used for profiling.

Performance prediction is based on accurate knowledge of control flow branching ratios: the percentage of time a conditional branch is taken or not taken. This enables execution counts for each basic block to be estimated. Profile information from previous runs is the default basis for this knowledge. To ensure the information stored in the profile database is robust against program edits, it cannot be indexed by fragile tags such as a basic block number in global syntax-directed enumeration. Instead, performance prediction uses the method names occurring naturally in the application program as timing markers. Every method has a clear entry point as well as potentially several exit points (return statements are numbered in their textual order in the CIL byte code... branches to the exit). With loops that contain no method calls in their bodies, the user must add a method call to a dummy method (null body) and that method should be (preferably?) annotated with a KppMarker attribute. Conditional branches and basic block names are then taken in a syntax-directed way from the code between the named control-flow points and discrepancies in the control flow graph between named points is used to flag warnings and discard profile information no longer usable.

All call strings for a method can either be considered separately or in common. The call string is the concatenation of the call site textual names from the thread or program entry point. If the call strings are considered in common, they are being disregarded and the average over all call strings is used.

These attributes also enable the user to control the way the performance estimation report is presented. They also enable the user to provide a substitute loop or visit count that overrides the stored profile. This provides the basis for extrapolating the run time from a small test or profiling data set to the envisioned real date size that will be processed on the FPGA.

Where the performance predictor cannot find profile information for a branch it assumes a 50/50 division and the number of such assumptions and their effect on the confidence in the result is included in the report.

Profiles for performance prediction can be sourced from various places, including diosim, but RTL simulation is used in the following, step-by-step, example.

  1. Preferably denote several waypoints in the application C# program Kiwi.KppMark().
  2. Generate an RTL design using KiwiC and an RTL testbench using the standard flow for your envionment, but with the following minor changes
  3. Run your RTL simulation. The included material will write out a file file called 'profile.xml' or similar. (You can also get this file from diosim without an external RTL simulator).
  4. Invoke the performance predictor (hpr/kpredict.fs) using ... and you will see
  5. With a suitable Makefile, you can make the web page redisplay automatically after every high-level edit ...


Phase Changes, Way Points and Loop Markers

Hardware itself does not have a start and end time. Instead, performance metrics are always quoted between a START/FINISH pair of named events. A typical program is structured with a time-domain series of internal phases, such as `startup', `load', `compute' and `report'. The performance predictor makes separate predictions for each phase and sums them. The confidence for different phases may be different, typically according to which part of the program was most recently edited. A marker between phases is called a way point. Kiwi.KppMark() dummy calls and/or Kiwi.KppMarker attributes are used to define waypoints. Each way point has a name and all but the last start a phase that optionally also has a name. The entry and exit waypoints should be called START and FINISH respectively. The program's control flow cannot loop around a way point. If a KppMarker is found in a loop body, or a method body where that method is called more than once, the provided labels are code point markers (explained below).

  Kiwi.KppMark("START", "subsequent-phase-name1");
  ...
  Kiwi.KppMark("waypoint-name2", "subsequent-phase-name2");
  ...
  Kiwi.KppMark("waypoint-name3", "subsequent-phase-name3");
  ...
  Kiwi.KppMarker("FINISH");

A waypoint is a special form of code point marker. The use of code point markers adds robustness to the information stored in the profile database against program edits, allowing it to be safely applied to edited programs. The markers provide index points that can be associated with loop heads and other control-flow points, to assist in robustness of the profile for complex method bodies. Basic block names are then named in a syntax-directed way with respect to, and as textual extensions of, the previous and next labelled control point.

KppMark has no innate multi-threaded capabilities and so should generally be set by an application's master/controlling thread, assuming it has one.

An exiting application has precisely one entry point. It has one exit point if other exits are are routed to a singleton exit point. Way points should appear once. Given expected visit ratios for each basic block, the problem is overconstrained and the frequency of visiting each way point and the singleton exit point can be inspected as a confidence indicator: they are all nominally visited once.

Growth Parameter Assertions/Denotations

C# attributes also enable the user to provide a substitute loop or visit count that overrides the stored profile. This provides the basis for extrapolating the run time from a small test or profiling data set to the envisioned real data set size that will be processed on the FPGA . Also, hardware itself does not have a start and end time - it is static/eternal. Instead, performance metrics are always quoted between a start/end pair of named code lables, again specified with C# attributes. Times for various phases within a program, such as `load', `process' and `write out', can also be predicted by inserting appropriate further control-graph delineations with an attribute that denotes a way point.

Debug and Single Step

There is no explict support for hardware debug currently in Kiwi. However, we will add shortly the following features to the generated RTL that can be hooked up to a management CPU via the substrate gateway. They each add hardware overhead but this can be trimmed out mostly by FPGA tools when reporting resources are left disconnected.

  1. Abend reason register - successful halt/array bounds/integer overflow/null pointer run time errors agumented with waypoint and thread.
  2. Thread status and control bit vectors: run enable and status: not started, running and exited.
  3. CPU register debug access ports: additional read/write logic is generated enabling programmed I/O access to every register.

Register colouring, RAM binding with memory maps and ALU binding is reported in the KiwiC report file. Only a static mapping, generated at KiwiC compile time, is used.

Watchpoints are best implemented in the C# source code and recompiled, or else use vendor tools like ChipScope etc..

Spatially-Aware Binder

An experimental, spatially-aware binder is being added to Kiwi at the moment. This will handle both registers and ALUs and gives a floorplan plot.

Generated RTL

Kiwi generates Verilog RTL for synthesis to FPGA by vendor tools. It can also generate SystemC and CSharp but we do not commonly use those flows at the moment and their will be some regressions.

KiwiC will assume the presence of various IP blocks in Verilog. These include RAMs and fixed and floating point ALUs. It will instantiate instances of them.

The libary blocks are generally provided in the following source files:

CV_FP_ARITH_LIB=$(HPRLS)/hpr/cv_fparith.v
CV_INT_ARITH_LIB=$(HPRLS)/hpr/cvgates.v
CVLIBS=$(CV_INT_ARITH_LIB) $(CV_FP_ARITH_LIB)

RAM Library Blocks

Fixed-latency RAMs are provided in the cvgates.v. They have names such as CV_SP_SSRAM_FL1 which denotes a synchronous RAM with fixed read latency of one clock cycle (FL1) and one port (SP). The cvgates implementations are intended to by synthesisable by FPGA tools.

Parameter overrides set the address range and word and lane width.

ALU Library Blocks

These blocks are found in cv_fparith.v

Example: CV_FP_FL5_DP_ADDER - floating point, fixed latency of 5 clock cycles, double precision, 

CV_FP_FL_SP_MULTIPLIER

Key: FLASH=combinational. FLn = fixed latency of $n$ clock cycles, VL variable latency with handshake wires, blocking while busy, DP=double precision, SP=single precision.

Incremental Compilation

Compiling everything monolithically does not scale to large projects. A modular design is always needed for revision control, unit testing and other reasons.

There are several cut points in the design flow where separately-compiled modules can be combined. Numbers 1 and 3 in the following list are relatively obvious, so we discuss only number 2.

  1. KiwiC will accept any number of .dll or .exe files on its command line. These will have been generated, typically, from separate invokation of the C# compiler.

  2. The Kiwi.Remote() attribute described in §7.1 enables a given method to be cut out for separate compilation.

  3. Incremental invokation of FPGA tools is also typically possible, where some RTL files have been seen before and others are new, but is beyond the scope of this document.

  4. (In principle it is possible to load and save VMs to disk (serialised in XML) and so incremental compilation at intermediate points in the opath recipe is a future option.)

Separately-compiled modules will not share hardware resources (such as registers, ALUs or RAMs) between them. Also, each will, in general, have its own (set of) load/store port(s) for access to DRAM.

Examples

There are some examples in the standard distribution, such as primes and cuckoo cache.

A get-started example: 32-bit counter.

Here's how to make a simple synchronous counter that produces a 32-bit net-level output.

using KiwiSystem;                                                                                                                                    class Counter32
{             
  [Kiwi.OutputWordPort("counter")]
  static int counter;

  [Kiwi.HardwareEntryPoint()]
  static int Main2()
  {
    while(true)
    {
      Kiwi.Pause();
      counter = counter + 1;
    }
  }
}



Subsections
David Greaves 2016-12-05