Kiwi Scientific Acceleration Manual
Rough Draft User Manual (KiwiC Version Alpha 2.15n)

©2011-16 DJ Greaves + S Singh
September 6, 2016

Preface

Kiwi was a collaborative project between the University of Cambridge Computer Laboratory and Microsoft Research Limited, headed by David Greaves (UoCCL) and Satnam Singh (MRL). From 2013 onwards, the Kiwi system was further developed at the Computer Laboratory and using a logic synthesis library called HPR-L/S.

Kiwi is developing a methodology for algorithm acceleration using parallel programming and the C# language. Specifically, Kiwi consists of a run-time library for hardware FPGA execution of algorithms expressed within C# and a compiler, KiwiC, that converts .NET bytecode into Verilog RTL for further compilation for FPGA execution. In the future, custom domain-specific front ends that generate .NET bytecode can be used.

The Kiwi technology has many potential uses, but some of note are:

1. Kiwi-HPC: High-performance computing or scientific acceleration.
2. ASIC hard-core generation for standard algorithms that are to be implemented in silicon, such as MPEG compression.
3. Routing logic for software-defined networking.
4. Rapid transaction processing and hardware implementation of automated trading algorithms.

Compared with existing high-level synthesis tools, KiwiC supports a wider subset of standard programming language features. In particular, it supports multi-dimensional arrays, threading, file-server I/O, object management and limited recursion. Release 1 of KiwiC supports static heap
management, where all memory structures are allocated at compile-time and permanently allocated to on-FPGA RAM or external DRAM. Release 2 of KiwiC, which has had some successful tests already, supports arbitrary heap-allocation at run time but does not implement garbage collection.

The Kiwi performance predictor is an important design space exploration tool. It enables HPC users to explore the expected speed up of their application as they modify it, without having to wait for multi-hour FPGA compilations in each development iteration.

Sadly, this manual is currently in a very-rough form.

Contents

1 Download and License 3
   1.1 Open Source ? 3
   1.2 Warranty 3

I Scientific Users’ Guide 4

2 Kiwi Substrate 4
   2.1 Console and LCD stdout I/O and LED GPIO 4
   2.2 Run-time Exception Handler 4
   2.3 DRAM 5
   2.4 Watchpoints and Start/Stop Control 5
   2.5 Framestore 5
   2.6 Profiling 5

II Installation and Easy Get Started 6

3 Get Started (Mono on Linux) 6

III Kiwi Supported Language Subset Limitations and Style Guide 9

4 General CSharp Language Features and Kiwi Coding Style 9
   4.1 Supported Types 9
   4.2 Supported Constants 10
   4.3 Supported Variables 10
   4.4 Supported Operators 10
   4.5 Supported Class Features 10
<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.6</td>
<td>Supported I/O with Kiwi</td>
<td>10</td>
</tr>
<tr>
<td>4.7</td>
<td>Data Structures with Kiwi 1/2</td>
<td>11</td>
</tr>
<tr>
<td>4.8</td>
<td>Data Structures with Kiwi 2/2 - more advanced and opaque temporary write up</td>
<td>11</td>
</tr>
<tr>
<td>4.8.1</td>
<td>First Stage Processing (repack)</td>
<td>11</td>
</tr>
<tr>
<td>4.9</td>
<td>Dynamic Storage Allocation</td>
<td>12</td>
</tr>
<tr>
<td>4.10</td>
<td>Pointer Arithmetic</td>
<td>13</td>
</tr>
<tr>
<td>4.11</td>
<td>Garbage Collection</td>
<td>13</td>
</tr>
<tr>
<td>4.12</td>
<td>Testing Execution Env: Whether I am running on the Workstation or the FPGA</td>
<td>14</td>
</tr>
<tr>
<td>4.13</td>
<td>Clone</td>
<td>14</td>
</tr>
<tr>
<td>4.14</td>
<td>Delegates and Dynamic Free Variables</td>
<td>14</td>
</tr>
<tr>
<td>4.15</td>
<td>The ToString() Method</td>
<td>15</td>
</tr>
<tr>
<td>4.16</td>
<td>Accessing Numerical Value of Pointer Variables</td>
<td>16</td>
</tr>
<tr>
<td>4.17</td>
<td>Accessing Simulation Time</td>
<td>16</td>
</tr>
<tr>
<td>4.18</td>
<td>Run-time Status Monitoring, Waypoints and Exception Logging</td>
<td>16</td>
</tr>
<tr>
<td>4.19</td>
<td>Exiting Threads</td>
<td>17</td>
</tr>
<tr>
<td>4.19.1</td>
<td>Null pointer, Array bounds, Overflow, Divide-By-Zero and Similar Run-time Exceptions</td>
<td>17</td>
</tr>
<tr>
<td>4.19.2</td>
<td>Normal Thread and Program Exit</td>
<td>17</td>
</tr>
<tr>
<td>4.19.3</td>
<td>User-defined C# Exceptions</td>
<td>17</td>
</tr>
<tr>
<td>4.19.4</td>
<td>Debug.Assert</td>
<td>18</td>
</tr>
<tr>
<td>4.20</td>
<td>Pause Modes (within Sequencer HLS Mode)</td>
<td>18</td>
</tr>
<tr>
<td>4.21</td>
<td>Unwound Loops</td>
<td>19</td>
</tr>
<tr>
<td>4.22</td>
<td>More-complex implied state machines</td>
<td>20</td>
</tr>
<tr>
<td>4.23</td>
<td>Inner loop unwound while outer loop not unwound</td>
<td>20</td>
</tr>
<tr>
<td>4.24</td>
<td>Entry Point With Parameters</td>
<td>20</td>
</tr>
<tr>
<td>5</td>
<td>Generate Loop Unwinding: Code Articulation Point</td>
<td>21</td>
</tr>
<tr>
<td>6</td>
<td>Supported Libraries Cross Reference</td>
<td>22</td>
</tr>
<tr>
<td>6.1</td>
<td>System.Collections.Generic</td>
<td>22</td>
</tr>
<tr>
<td>6.2</td>
<td>System.Random</td>
<td>23</td>
</tr>
<tr>
<td>6.3</td>
<td>Console.WriteLine and Console.Write</td>
<td>23</td>
</tr>
<tr>
<td>6.4</td>
<td>get_ManagedThreadId</td>
<td>24</td>
</tr>
<tr>
<td>6.5</td>
<td>System.BitConverter</td>
<td>24</td>
</tr>
<tr>
<td>6.6</td>
<td>System.String.ToCharArray</td>
<td>24</td>
</tr>
<tr>
<td>6.7</td>
<td>System.IO.Path.Combine</td>
<td>24</td>
</tr>
<tr>
<td>6.8</td>
<td>TextWriter</td>
<td>24</td>
</tr>
</tbody>
</table>

Kiwi Scientific Acceleration Manual
Rough Draft User Manual (KiwiC Version Alpha 2.15n)
6.9 TextReader ................................................................. 24
6.10 FileReader ................................................................. 24
6.11 FileWriter ................................................................. 24
6.12 Threading and Concurrency with Kiwi ................................. 24
   6.12.1 Volatile Declarations ............................................. 25

7 Kiwi C# Attributes Cross Reference ........................................ 26
7.1 Kiwi.Remote() Attribute ................................................ 26
7.2 Flag Unreachable Code .................................................. 27
7.3 Hard and Soft Pause (Clock) Control .................................. 28
7.4 End Of Static Elaboration Marker - EndOfElaborate ................. 28
7.5 Loop NoUnroll Manual Control ...................................... 29
7.6 Elaborate/Subsume Manual Control .................................. 29
7.7 Synchronous and/or Asynchronous RAM Mapping .................... 30
7.8 Register Widths and Overflow Wrapping ............................. 30
7.9 Net-level Input and Output Ports .................................... 30
7.10 Clock Domains .......................................................... 31
7.11 Remote ........................................................................ 32
7.12 Elaboration Pragmas - Kiwi.KPragma ................................ 32
7.13 Assertions Debug.Assert() ............................................. 33
7.14 Assertions - Temporal Logic ......................................... 33

8 Memories in Kiwi ............................................................. 34
8.1 On-chip RAM (and ROM) Mirror, Widen and Stripe Directives ...... 36
8.2 ROMs (read-only memories) and Look-Up Tables ................... 37
8.3 Forced Off-chip/Outboard Memory Array Mapping .................. 37
8.4 Off-chip load/store ports ............................................... 38
   8.4.1 HSIMPLE Offchip Interface & Protocol ......................... 39
   8.4.2 HFAST Offchip Interface & Protocol ......................... 39
   8.4.3 BVCI Offchip Interface & Protocol ............................ 41
8.5 AXI and HFAST-to-AXI mapping ..................................... 41
8.6 Off-chip address size .................................................... 42
8.7 B-RAM Inference ........................................................ 42
8.8 Dual-port and multi-port RAMs ....................................... 44

9 Substrate Gateway .......................................................... 44
9.1 Console I/O ...................................................... 44
9.2 Filesystem Interface ........................................ 44
9.3 Hardware Server ............................................. 45

10 Kiwi Performance Tuning ...................................... 46
10.1 Kiwi Performance Predictor ............................... 46
10.2 Phase Changes, Way Points and Loop Markers ........ 47
10.3 Growth Parameter Assertions/Denotations ............. 48
10.4 Debug and Single Step ..................................... 48

11 Spatially-Aware Binder ........................................ 48

12 Generated RTL .................................................. 48
12.1 RAM Library Blocks ....................................... 49
12.2 ALU Library Blocks ....................................... 49

13 Incremental Compilation ...................................... 49

14 Examples ......................................................... 49
14.1 A get-started example: 32-bit counter .................. 49

IV Expert and Hardware-level User Guide ..................... 51

15 Kiwi Hard-Realtime Pipelined Accelerators ............ 51
15.1 Pipelined Accelerator Example 1 ....................... 51

16 Designing General/Reactive Hardware with Kiwi ........ 52
16.1 Input and Output Ports ................................... 52
16.2 Register Widths and Wrapping .......................... 53
16.3 How to write state machines ......................... 54
  16.3.1 Moore Machines ...................................... 54
  16.3.2 Mealy and combinational logic ..................... 54
16.4 State Machines ............................................. 55
16.5 Clock Domains ............................................. 55

17 SystemCSHarp .................................................. 56

Kiwi Scientific Acceleration Manual .......................... 5
Rough Draft User Manual (KiwiC Version Alpha 2.15n) .. 5
IX General Orangepath Facilities
32 FILES AND DIRECTORIES
33 Espresso
34 Cone Refine
35 HPR Command Line Flags
  35.1 Other output formats
  35.2 General Command Line Flags
  35.3 Simulation Control Command Line Flags
36 Diosim Simulator
Introduction

Kiwi is a compiler and library and infrastructure for hardware accelerator synthesis and general support for high-performance scientific computing. The output is intended for execution on FPGA or in custom silicon on ASIC.

We aim to compile a fairly broad subset of the concurrent C# language subject to some restrictions:

For Kiwi 1, the current version, we have the following aims:

- Works with the Linux/mono infrastructure but should also work on Windows.
- Program can freely instantiate classes but not at run time - a fixed number of instantiation operations must be detectable at compile time.
- Array and heap structure sizes must all be statically determinable (i.e. at compile time).
- Program can use recursion but the maximum calling depth must be statically determined in Kiwi 1.
- Stack and heap must have same shape at each run-time iteration of non-unwound loops. In other words, every allocation made in the outer loop of your algorithm must be matched with an equivalent, manifestly-implicit garbage generation event or explicit `obj.Dispose()` or `Kiwi.Dispose(Object obj)` in the same loop.
- Program can freely create new threads but creation sites statically determined too.

In Kiwi 2 we will relax the static restrictions and allow the size of data structures in DRAM to be determined at runtime.

1 Download and License

Kiwi has been given to several academic institutions for early experiments. A usable release is envisioned early 2016 and this will be downloadable on completion of a web form. The web form will be on this url: http://koo.corpus.cam.ac.uk/kiwic-download.

1.1 Open Source?

Is Kiwi open source? Currently the source code has been shared with a few tens of partners. Kiwi will become completely open source at the end of 2016.

1.2 Warranty

Neither the authors nor their employers warrant that the Kiwi system is correct, usable or noninfringing. It is an academic prototype. We accept no responsibility for direct or indirect loss or consequential loss to the maximum amount allowable in UK law.
Part I

Scientific Users’ Guide

2 Kiwi Substrate

We use the term substrate to refer to an FPGA board or set of server blades that is/are loaded with various standard parts of the Kiwi system. The most important substrate facilities are access to DRAM memory, a disk filesystem and a console/debug channel. Basic run/stop/error status output to LEDs via GPIO is also provided.

The substrate is like an operating system on the FPGA. It supports connection to more than one application loaded in FPGA at once (cite farming paper).

2.1 Console and LCD stdout I/O and LED GPIO

2.2 Run-time Exception Handler

Run-time exceptions include integer divide-by-zero and null pointer de-reference, array bounds fail and runtime fail of Debug.Assert(). Floating point overflow is normally handled by returning
IEEE Inf or NaN.

CIL bytecode has overflow trapping versions of the arithmetic operators that raise exceptions. We generate these from C# using checked keyword. Numeric casts can also be out of range, as in (ushort)0x10000 (a CIL conv.ovf.u2 assembly instruction is used.) In the future KiwiC can trap these overflows as run-time errors.

Convert exceptions for casting a value to an illegal value with respect to the target type range, as raised by the conv.ovf CLR instruction, ... please explain.

Array bounds checking can also give a run-time error.

TODO: explain here about a per-clock domain error net generated by KiwiC as part of control wires.

2.3 DRAM

DRAM and Caches are described in §8.4.

2.4 Watchpoints and Start/Stop Control

2.5 Framestore

Having very high bandwidth for writes to the framestore is an intrinsic feature of FPGA computing. The framestore can be part of the compute engine and used for high-performance visualisation. Or it might just be used for a progress indicator - e.g. percentage of the job processed and final output.

2.6 Profiling

Certain basic block visit counts are collected and the results fed back to the performance counters...

Tick counter ... for now.
Part II

Installation and Easy Get Started

Kiwi is currently not as easy to use as it could be. You can find an ‘addin’ for the monodevelop IDE on the following URL but it is currently not very useful and since it is really focussed on Kiwi performance prediction which is immature. Currently it is best if you craft a makefile based on one of the examples.

Monodevelop addin: http://www.cl.cam.ac.uk/users/djg11/kiwi/kiwiaddons/kiwiScientificA

The makefile will compile your application and optionally run the application on your workstation under mono or the Windows equivalent.

The makefile will then invoke the Kiwi compiler to generate a Verilog RTL file and combine this with the provided substrate Verilog files for your FPGA target. Finally it will invoke the FPGA tool suite to give a bitstream file to be loaded to the FPGA.

The means for loading to the FPGA is currently highly-platform specific. Each substrate should have its own user guide.

3 Get Started (Mono on Linux)

Kiwi is normally supplied as a zip file that contains folders called lib, bin, doc and so on. If you want the source for the compiler please ask. An open source release is planned.

Requirements: You need a working dotnet environment (mono or Windows) on your machine.

KiwiC/HPR is currently internally implemented in F# but you just need a C# compiler to use it.

If F# is not locally installed you will need to manually add at least the FSharpCore.dll to the KiwiC/distro/lib folder. We do ship one you can move there. Otherwise you may get ‘type load’ and ‘missing entry point’ errors.

FSharp can be simply obtained with apt-get install fsharp on some machines.

KiwiC uses the Mono.Cecil front end and hence the Mono.Cecil.dll is required, either installed on the machine or copied to the KiwiC/distro/lib folder.

Place the KiwiC distribution somewhere on your filesystem. Let us call that place PREFIX. To run KiwiC on linux you must execute the KiwiC shell script

$ $(PREFIX)/bin/kiwic ... args ...

The shellsript just contains mono $(PREFIX)/lib/kiwic.exe

Windows users can invoke the kiwic.exe executable directly.

The arguments to KiwiC should either be portable assembly files (suffix .dll or .exe) or option flags prefixed with a minus sign. Generally you will supply the current design and KiwiC will automatically load the Kiwi libraries it needs.

Two Kiwi libraries are commonly needed:
1. Kiwi.dll - This defines the Kiwi attributes and other material implemented in C# that should be supplied both to C# compilations and to the KiwiC compiler for both FP and WD.

2. Kiwic.dll - This defines additional or replacement implementations of standard .NET library functions for use by the KiwiC compiler and must nominally be supplied on the KiwiC command line. Generally, this is not needed for the first stage of a compilation when an application program in C# is converted to a .NET binary (.exe or .dll) where that binary is either going to be run on the workstation (mono/windows) or compiled further by KiwiC. It should be automatically found by KiwiC and so does not need to be actually named on any command line.

To enable the same RTL file to be synthesised for FPGA by vendor tools and simulated [RTL-SIM] but to have slightly different behaviour (e.g. w.r.t. BIST self test) it is handy to define an external input to the Kiwi code that you tie low in the RTLSIM testbench but strap high in the FPGA substrate pad ring.

```csharp
[Kiwi.InputBitPort("FPGA")]
```

If you have these libraries in .cs form only, you will need to compile them to .dll form using mcs or similar. You will get some warnings about the ‘unsafe’ code they contain.

You must manually include the reference to Kiwi.dll in the C# compilation step.

For the KiwiC compilation step, KiwiC will automatically search for the above libraries and include them in the compilation and this is equivalent to manually including them on the KiwiC command line.

To disable automatic search or redirect it to specific files, use the command-line flags -kiwic-dll and -kiwi-dll. Set these to the empty string to disable them or set them to a specific location, e.g. -kiwic-dll=/usr/lib/kiwic/mykiwic.dll.

Note that anything specified via the command line can also be specified in an XML `recipe file`, with the command line taking precedence when specified both ways. Kiwi comes with a standard recipe for accelerating scientific computing. You can modify this to get SystemC output or for privately developed flows based on Kiwi.

Kiwi defines the terms WD, RTLSIM and FP to define three execution environments.

1. WD — Rapid development of applications on the workstation with performance prediction.
2. RTLSIM — Verilog simulation (verilator is fastest) in case of KiwiC bugs and for performance calibration when interacting with RTL models of other system components.
3. FPGA — high-performance execution on the FPGA.

The `Kiwi.HardwareEntryPoint` attribute can be attached to one or more static methods in the input program. The control-flow graph beneath such methods is converted to hardware. The command line -root flag is another way of specifying an entry point. KiwiC does not default to using a static Main method.

To obtain Verilog RTL output, KiwiC requires a source file name and access to its libraries. So the most basic Makefile is something like
It might be helpful to pass constant values as arguments to the HardwareEntryPoint but this is not supported. Instead, write a C# shim that takes no arguments and passes constants to a putative entry point.

```
PREFIX=$(HPRLS)/kiwipro/kiwic/distro
KLIBC=$(PREFIX)/kiwipro/support/Kiwic.dll
KLIVO=$(PREFIX)/kiwipro/support/Kiwi.dll
KIWIC=$(PREFIX)/kiwipro/bin/kiwic

all:
gmcs /target:library tiny.cs /r:$(KLIVO)
$(KIWIC) tiny.exe

# Other useful options until recently: -vnl and -root:
   $(KIWIC) tiny.exe -root "tiny;tiny.Main" -vnl tiny.v
```

Given that you have a file called tiny.exe to hand, this should result in a file called tiny.v in your current directory.

To generate tiny.exe one can do the following:

```c
$ cat > tiny.cs
using System;
using KiwiSystem;

class tiny
{
    [Kiwi.HardwareEntryPoint()]
    public static int Main (string []argv)
    {
        Console.WriteLine("Hello World");
        return 1;
    }
}
$ gmcs tiny.cs # or use mcs the mono C# compiler.
```

Should you need it, KiwiC will write a disassembly of the PE file to obj/ast.cil in the current folder, enabled by recipe or command line flag `-kiwic-cil-dump=separately` or `-kiwic-cil-dump=combined`.

If you do not have the Kiwi.dll library to hand (e.g. input from C++ instead of C#) or have other problems putting a HardwareEntryPoint attribute on a method then using the `-root` command line flag is a good idea.

If you do not have the Kiwi.dll library to hand (e.g. input from C++ instead of C#) or have other problems putting a HardwareEntryPoint attribute on a method then using the `-root` command line flag is an alternative.
Part III

Kiwi Supported Language Subset
Limitations and Style Guide

Kiwi aims to support a very broad subset of the C# language and so be suitable for a wide variety of High-Performance Computing (HPC) applications. However, the user is expected to write in a parallel/concurrent style using threads to exploit the parallelism available in the FPGA hardware. However, conventional high-level synthesis (HLS) benefits should be realised even for a single-threaded program.

This chapter will explain the synthesisable subset of C# supported by KiwiC, but currently much work is needed in this section of the manual ...

In general, for Kiwi 1, all recursion must be to a compile-time determinable depth. The heap and stack must have the same shape at each point of each iteration of every loop this is not unwound at compile time. In other words, dynamic storage allocation is supported in KiwiC, provided it is called only from constructors or once and for all on the main (lasso stems of) threads before they enter an infinite loop. If called inside a non-unwound loop, the heap must be the same shape at each point on each iteration. If KiwiC is failing in its escape analysis to determine that store is implicitly reusable you should explicitly free such store with obj.Dispose() or Kiwi.Dispose(Object obj).

Dynamic storage regions cannot currently be shared between Kiwi threads. Currently, KiwiC implements different heap spaces for each thread ... really ? If so this needs fixing ... TODO ... maybe they are only different AFTER a fork but resources allocated before Thread.Start are ok.

Floating point is being provided for the standard 32 and 64 bit IEEE precisions, but FPGAs really shine with custom precision floating point so we will add support for that while maintaining bit-accurate compatibility between the execution environments.

Atomic operations: Kiwi supports the CLR Enter, Exit and Wait calls by mapping them on to the hpr_testandset primitive supported by the rest of the toolchain. Ed: The rest of this paragraph should be in the ‘internal operation’ section. Although RTL target languages, such as Verilog, are highly-concurrent, they do not have native support for mutexes. The bevelab recipe stage correctly supports testandset calls implemented by its own threads, but KiwiC does not use these threads: instead it makes a different HPR virtual machine for each thread and these invoke bevelab once each instead of once and for all with bevelab threads within that invocation. Hence the the testandset primitives disappear inside bevelab. ... TODO explain further.

4 General CSharp Language Features and Kiwi Coding Style

4.1 Supported Types

Kiwi supports custom integer widths for hardware applications alongside the standard integer widths of dotnet 8, 16, 32 and 64.

Char is a tagged form of the 16-bit signed integer form.
Single and double-precision floating point are supported.

Enumerations are supported with custom code points. MSDN says the approved underlying types for an enum are byte, sbyte, short, ushort, int, uint, long, or ulong, but Kiwi uses a suitable custom width of any number of bits.

4.2 Supported Constants

Verilog and SystemC has 8-bit chars but C# and dotnet have 16 bit chars.

4.3 Supported Variables

Kiwi supports static, instance, local and formal parameter variables.

Variables may be classes or built-in primitive types and arrays of such variables. An array may contain a class and a class may contain an array, to any nesting depth. Multi-dimensional arrays (as opposed to jagged arrays) are supported with a little syntactic sugar in the C# compiler but mostly via library class code provided in KiwiC.dll.

Structs are also being added.

Signed and unsigned integer and floating point primitive variables are fully supported.

Strings are supported a little, but there is currently no run-time concatenation or creation of new strings, so all such string creation operations must be elaborated at KiwiC compile time.

4.4 Supported Operators

All standard arithmetic and logical operators are supported. Some operators, especially floating-point converts and floating-point arithmetic result in components being instantiated from the cv-gates.v library. Integer mod, divide and larger multiplies also result in ALU instantiation, unless arguments are constant identity values or powers of two that are easily converted to shifts. Divide and multiply by a constant may result in adders being generated.

4.5 Supported Class Features

4.6 Supported I/O with Kiwi

Kiwi supports a number of forms of I/O:

- Net-level RTL-style I/O through peeking and poking of static variables that are shared with the outside world is the most basic form of I/O. Please see §7.9.

- Methods can also be designated as remotely-callable. Communication between separately-compiled hardware modules is then analogous to method calls between software components. This is explained in §7.1.

- Local console debugging style output. Principally this involves calling +verbConsole.WriteLine()+ and running the KiwiC output on an RTL simulator. On the FPGA blades, output to a logging
file is supported over the network. Also, certain real hardware devices on the substrate such as LEDs, LCD panels and framestores have also been run experimentally.

- Remote Console, Network and Filesystem I/O via the substrate gateway. See §9.2

4.7 Data Structures with Kiwi 1/2

To achieve high performance from any computer system the programmer must think about their data structures and have a basic knowledge of cache and DRAM behaviour. Otherwise they will hit memory bandwidth limitations with any algorithm that is not truly CPU bound.

As in most programming languages, C# variables and structures are static or dynamic. Dynamic variables are allocated on the heap or stack. All are converted to static form during compilation using the version 1 Kiwi compiler. Support for truly dynamic variables will perhaps be added in a future release.

Kiwi does not (currently) support taking the address of local variables or static variables in fields (except when pass by reference is being compiled). All pointers and object handles need to refer to heap-allocated items.

It is helpful to define the following two terms for pointer variables. Pointers generally point to dynamic data but their pattern of use falls into two classes. We will call a static pointer one whose value is initially set but which is then not changed. A dynamic pointer is manipulated at run time. Some dynamic pointers range over the value null. (As with all C# variables, such pointers can be declared as static or instance in C# program files — this is orthogonal to the current discussion.)

Every C# array and object is associated with at least one pointer because all arrays and objects are created using a call to ‘new’. Also, some valuetypes become associated with a pointer, either by being passed-by-reference or by application of the ampersand operator in unsafe code. The KiwiC compiler will ‘subsume’ nearly all static pointers in its front end constant propagation and any remaining static pointers will be trimmed by later stages in the KiwiC compiler or in the vendor-specific FPGA /ASIC tools applied to the output RTL.

KiwiC maps data structures to hardware resources in two stages. In the first stage (known as repack §29), every C# form (that did not disappear entirely in the front end) is converted to either scalars of some bit width or 1-D arrays (also known as vectors) of such scalars. In the second stage (known as restructure §30), mapping to physical resource decisions are made as to which vectors and scalars to place in what type of component (flip-flops, unregistered SRAM, registered SRAM, DP SRAM or off-chip in DRAM) and which structural instance thereof to use. The first stage behaviour is influenced mainly by C# programming style. Second stage behaviour is controlled by heuristic rules parametrised by command-line flags and recipe file values.

4.8 Data Structures with Kiwi 2/2 - more advanced and opaque temporary write up...

4.8.1 First Stage Processing (repack):

Two-dimensional arrays are a good example to start with. Although there is syntactic sugar in C# for 2-D arrays, with current C# compilers this is just replaced with operations supplied by a library dll. The dotnet runtime and KiwiC support just 1-D arrays called vectors. There are two possible implementations of a 2-D array library: jagged and packed. The packed form subscript is computed
using a multiply of the first co-ordinate with the arity of the second co-ordinate and then adding on the second co-ordinate. The jagged form uses a vector of static pointers to vectors that contain the data; the first co-ordinate is the subscript to the pointer vector and the second co-ordinate is the subscript to the selected vector. We use the term jagged to encompass their smooth form where all data vectors are the same length.

KiwiC inlines the subscript computation for a packed array as though the programmer had inlined such an expression in his C# code. Additionally, there is only one vector created. Therefore packed 2-D arrays first become 1-D vectors. However, such vectors are then subject to unpacking in first stage operation. For instance, if all subscripts are constant values, the array is replaced with a set of scalars. Of if the subscripts fall into clearly disjoint regions, the vector is split into multiple, separately-addressed regions. Or if all the subscripts have a common factor or common offset then these are divided and subtracted off respectively. This unpacking into multiple vectors removes structural hazards that would prevent parallelism.

For a jagged array, initially a number of separate vectors are created and a potentially large number of multiplexing expressions (that appear as the ?: construct in Verilog RTL) are created to direct reads to the correct vector. For writes, an equivalent demultiplexor is created to select the correct vector for writing. (The pointer vector is normally static and becomes subsumed, but we will later discuss what happens if the C# code writes to it, making it dynamic.)

Implementation note: if a jagged array is created by allocating a large 1-D array and storing references to offsets in that vector in the pointer array, it is possible to generate a structure that is identical to the packed array. KiwiC happens to detect this pattern and the behaviour would be as per the packed array: however this style of programming is not allowed in safe C#, but could be encountered in unsafe code or other dotnet input form, say, C++.

If we create an array of objects do we expect the fields of the objects to be placed in vectors? Yes, certainly if the object pointers are subsumed.

If we take the parfir example, there’s one initialise place where empty flags are written from a non-unwound loop and hence with dynamic subscript, but elsewhere they are updated only with constant subscripts and so should be simple scalar flags.

Kiwi on Loop Unwinding: Loop-carried dependencies in data or control form limit the amount of parallelism that can be achieved with unwinding.

The hard cbg algorithm unwinds all loops without event control. The soft algorithm allocates cycles based on greedy or searching strategies based on complexity and structural hazards. Consider 1: Hoisting of exit condition computation, or hoisting of data dependency computation: this should preferably be applied? So the post-dependent tail of each loop can be forked off

4.9 Dynamic Storage Allocation

For statically-allocated object instances, KiwiC packs them into flip-flops, B-RAM or DRAM according to thresholds configured in the recipe or command line. This includes objects and structs allocated on the C# heap before the end of static elaboration.

For dynamically-allocated instances, KiwiC cannot easily tell how much memory may be needed and so defaults to DRAM channel 0 if present. But we can switch manually between B-RAM and DRAM for dynamic storage allocation using C# attributes.
We make the following interesting observation: *Once data structures are placed in DRAM there is no real need to have their number statically determined at compile time: instead they can be truly dynamically allocated at run time* (DJ Greaves 2015). Indeed, if an application becomes overly dependant on DRAM data then the FPGA advantage will disappear and a Von Neumann (e.g. x86) implementation may likely have better performance. But, there remains some good FPGA mid ground where a lot of dynamic store is needed but where the access bandwidth required is not excessive.

**Kiwi.HeapManager**

Physical memories used for dynamic storage require a freespace manager. We can allocate a Heap-Manager for each physical memory and the user can direct requests to an appropriate instance. Typically there could be one for each separate DRAM bank and one for each separate on-chip B-RAM. Also, arrays with dynamic sizes ...

### 4.10 Pointer Arithmetic

handleArith pointer arithmetic  
Kiwi. ObjectHandler<T>

The object handler provides backdoors to certain unsafe code for pointer arithmetic that are banned even in unsafe C# code. Implementation in CIL assembler would be possible but having hardcoded support in the KiwiC compiler accessed via this object manager is easier.

### 4.11 Garbage Collection

With Kiwi 1, the stack and heap must have same shape at each run-time iteration of non-unwound loops. In other words, every allocation made in the outer loop of your algorithm must be matched with an equivalent dispose or garbage generation event in the same loop.

KiwiC V2 is implementing a more easy to use, run-time storage allocator, but without garbage collection.

Arrays - Array sizes must all be statically determinable (i.e. at compile time).

Arrays in .NET do not have a Dispose() method. Instead an array can be disposed of with Kiwi.Dispose<T>(T [] array). System.BitConverter returns char arrays when destructing native types and the arrays returned by BitConverter should be explicitly disposed of inside a non-unwound loop if KiwiC is failing to spot an implicit manifest garbage creating event, as reported with the an error like

KiwiC +++ Error exit: BitConverterTest.exe: constant_fold_meets
entry_point=5:: Bad form heap pointer for obj_alloc of type
CT_arr(CTL_net(false, 32, Signed, native), 8) post end of elaboration
point (or have already allocated a runtime variable sized object?).

See the linked list example ... http://www.cl.cam.ac.uk/research/srg/hans/hsrls/orangepath/kiwic-
demos/linkedlists.html
field-arrays and spatial locality

4.12 Testing Execution Env: Whether I am running on the Workstation or the FPGA

We need sometimes to achieve different behaviour, for debugging and scaling reasons, in the three execution environments.

1. For Workstation Development - WD - we can invoke
   `Kiwi.inHardware()` and find that it returns false.

2. For RTLSIM check that inHardware returns false and that the
   `Kiwi.InputBitPort("FPGA")` static bool FPGA; returns false. You should tie this
   net low in your simulator top-level instantiation.

3. Otherwise we are in FPGA. The Kiwi substrate for a hardware PCB should tie this net high in
   the pad ring.

Call the function `Kiwi.inHardware()` for this purpose. Since this is a compile-time constant, it is
useful for removing development and debugging code from the final implementation. KiwiC will
ignore code that is inside if (false) {} constructs so write

```csharp
if (!Kiwi.inHardware()) { ... test/debug code ... }
```

```csharp
public static bool inHardware()
{
    return false; // This is the returned value when running on the workstation.
    // An alternative overriding implementation is hardcoded inside KiwiC and will
    // return 'true' for FPGA and RTL simulation.
}
```

4.13 Clone

Clone of arrays and objects ....

4.14 Delegates and Dynamic Free Variables

Kiwi Dynamic Method Dispatch

Dynamic method dispatch is where which function body that gets called from a callsite is potentially
data-dependent. Computed function calls occur with action and function delegates and dynamic
object polymorphism.

In C++ there are restrictions that higher-order programming is only possible within a class hierarchy.
This arises from the C compatibility issues where the higher-order function passing does not have
to manage an object pointer. These issues are neatly wrapped up in C# using delegates. An action
delegate has void return type whereas a function delegate returns a value.

Kiwi supports the function and action delegates of C#.
KiwiC partitions dynamically-callable method bodies into equivalence classes and gives each body within a class an integer. These classes typically contain only a very few members each. It then uses constant folding on the entire system control-flow graph as a general optimisation. This may often turn a dynamic dispatch into a static dispatch, hence these integers will not appear in the output hardware unless truly dynamic dispatch is being used, such as in

```csharp
Action<int, string> boz_green = delegate(int var1, string var2)
   { Console.WriteLine(" {1} {0} boz green", var1, var2);
   };
Action<int, string> boz_red = delegate(int var1, string var2)
   { Console.WriteLine(" {1} {0} boz red", var1, var2);
   };
for(int pp=0; pp<3; pp++)
   { Kiwi.Pause(); // Pause makes this loop unwind at run time.
      boz_red(pp+100, "site1");
      boz_green(pp+200, "site2");
      var x = boz_red; boz_red = boz_green; boz_green = x; //swap
   }
```

C# 3.0 onwards supports proper closures. These are implemented inside the C# compiler and compile fine under Kiwi provided the static allocation restrictions are obeyed.

Test55 of the regression suite contains the following demo.

```csharp
public static Func<int,int> GetAFunc()
{
   var myVar = 1;
   Func<int, int> inc = delegate(int var1)
   { myVar = myVar + 1;
      return var1 + myVar;
   };
   return inc;
}

[Kiwi.HardwareEntryPoint()] static void Main()
{ var inc = GetAFunc();
   Console.WriteLine(inc(5));
   Console.WriteLine(inc(6));
}
```

This compiles and works fine. But, there is a Kiwi 1 restriction that the GetAFunc call must be before the end of static elaboration since this creates the closure that is allocated on the heap.

If no closure is needed, Action and Function delegates suffer from no static allocation restriction.

### 4.15 The ToString() Method

Kiwi implements a basic version of the ToString method. It will give output that is rather dependent on which version of the compiler is being used, but it is better than nothing. Enumerations print as
integers.

4.16 Accessing Numerical Value of Pointer Variables

IntPtr types.

Clearly, the addresses used on the FPGA have little relationship when run on the mono VM, but it is possible to display class pointer value on the hardware platform. One method is to use the default ToString method on an object handle. This will generate a Kiwi-specific output.

For example

```csharp
Console.WriteLine(" Ntest14w line0 : pointer={0}", ha.ToString());
Console.WriteLine(" Ntest14w line1 : left={0}", ha.left);
```

Might give:

```
Ntest14w line0 : pointer=Var(test14w/T401/Main/T401/Main/V_0%$star1$/test14w/dc_cls%30008%4, &(CTL_record(test14w/dc_cl...
Ntest14w line1 : left=32
```

Ah - this has printed the variable not its value!

4.17 Accessing Simulation Time

The Kiwi.dll library declares a static variable called tnow. During compilation reads of this are replaced with references to the appropriate runtime mechanism for access to the current simulation time. For instance, the following line

```csharp
Console.WriteLine("Start compute CRC of result at {0}\n", Kiwi.tnow);
```

becomes

```
$display("Start compute CRC of result at %t\n", $time);
```

when output as Verilog RTL.

The substrate has a tick counter that is instantiated when tnow is used for FPGA execution and so the RTLSIM code is a now a shim and not a direct call to the non-synthesisable $time infact... TODO fix.

4.18 Run-time Status Monitoring, Waypoints and Exception Logging

The following text to be corrected and moved to debugging section of manual please:

The user requires an indication of whether an FPGA card is actively running an application. Nearly all FPGA cards have LED outputs controlled by GPIO pins that are useful for basic status monitor-
ing. It is normal to connect an LED or two to indicate Kiwi activity and/or error, but most status reporting is via the substrate gateway.

Some FPGAs have LCD or VGA framestore outputs that are also relatively easy to use for monitoring and results.

The sequencer index and waypoint for each thread can be remotely monitored via the substrate gateway. This provides ... abend reason register ... logs thread id, waypoint, pc value and abend reason.

4.19 Exiting Threads

4.19.1 Null pointer, Array bounds, Overflow, Divide-By-Zero and Similar Run-time Exceptions

The Kiwi substrate gateway will log the thread identifier, waypoint and sequencer index for threads that finish or abort in an abend reason register. The user can reverse-engineer these via the KiwiC report file. An XML variant of that file for import into IDE needs to be provided in the future.

It is possible to get a run-time **null pointer** exception.

It is possible to get a run-time **checked overflow** exception.

It is possible to get a run-time **divide-by-zero** exception.

It is possible to get a run-time **array bounds** exception.

It is possible to get a run-time exception.

(Floating point exceptions are normally handled with via **NaN** propagation.)

4.19.2 Normal Thread and Program Exit

For RTLSIM execution of the KiwiC-generated RTL, it is sometimes convenient to have the simulator automatically exit when the program has completed.

When the main thread of Kiwi program exits (return from Main), the generated code may include a Verilog $finish statement if the flag "-kiwic-finish=enable" is supplied on the command line or in the recipe file. The equivalent is generated for C++ output. Otherwise a new implicit state machine state is created with no successors and the thread sits in that state forever. Hanging forever is the default behaviour for forked threads.

For use with a standard execution substrate, having a $finish statement in the generated design makes no sense,

Environment.Exit(1) can also be invoked within C# to cause the same effect as main thread return.

(Pipelined accelerators cannot exit since they have no sequencer (§15).)

4.19.3 User-defined C# Exceptions

C# try-except blocks are supported as is exception handling. But no exceptions can currently be caught and all lead to either a compile-time or run-time abend.
In other words, the contents of a C# `catch` block are ignored in the current KiwiC compiler. The contents of a C# `finally` block are executed under Kiwi as normal.

### 4.19.4 Debug.Assert

```
System.Diagnostics.Debug.Assert(bool cond) and friends...
```

We can raise a run-time assertion problem that is logged in the abend register ...

There is a compile-time variant called - not reached - or something ...

### 4.20 Pause Modes (within Sequencer HLS Mode)

Kiwi supports several major HLS modes, but the default, sequencer major HLS mode, generates a sequencer for each thread. When creating a sequencer, the number of states can be fully automatic, completely manual, or somewhere in between, according to the pause mode setting.

The mapping of logic operations to clock cycles is one of the main tasks automated by high-level synthesis tools, but sometimes manual control is also needed. Control can be needed for compatibility with existing net-level protocols or as a means to move the design along the latency/area Pareto frontier.

KiwiC can take various approaches according to the pause mode selected. Pause modes are listed in Table 1. The number of ALUs and RAM ports available also makes a big difference owing to structural hazards. Fewer resources means more clock cycles needed.

The Kiwi.cs file defines an enumeration for locally changing the pause mode for the next part of a thread’s trajectory.

```
enum PauseControl
{
    autoPauseEnable, hardPauseEnable, softPauseEnable,
    maximalPauseEnable, blockbPauseEnable
};
```

The idea is that you can change it locally within various parts of a thread’s control flow graph by calling `Kiwi.PauseControlSet(mode)` where the mode is a member of the PauseControl enumeration. Also, this can be passed as an argument to a Kiwi.Pause call to set the mode for just that pause. However, dynamic pause mode changing may not work at the moment ... owing to minor bugs.

For example, you can invoke `Kiwi.PauseControlSet(Kiwi.PauseControl.softPauseEnable)`.

The `bevelab-soft-pause-threshold` parameter is one of the main guiding metrics but it has no effect on regions of the program compiled in hard-pause mode.

Nearly all net-level hardware protocols are intolerant to clock dilation. In other words, their semantics are defined in terms of the number of clock cycles for which a condition holds. A thread being compiled by KiwiC to a sequencer defaults to bblock or soft pause control, meaning that KiwiC is free to stall the progress of a thread at any point, such as when it needs to use extra clock cycles to overcome structural hazards. These two approaches are incompatible. Therefore, for a region of code where clock cycle allocation is important, KiwiC must be instructed to use hard pause control.

The recipe file kiwic00.rcp sets the following as the default pause mode now
Pauses are inserted at
0 auto
1 hard exactly where pause statements are explicitly included
2 soft where needed to meet soft-pause-threshold
3 maximal inserted at every semicolon
4 bblock every basic block boundary

Table 1: Kiwi Pause Modes (within Sequencer Major HLS Mode)

This is not suitable for net-level interfaces but does lead to quick compile of scientific code which is what we are targeting at the moment.

For compiling net-level input and output, give KiwiC - bevelab-bevelab-default-pause-mode=hard as a command line option to override the recipe.

Maximal and blockb are considered just ‘debug’ modes where pauses are inserted at every semicolon and every basic block boundary respectively.

4.21 Unwound Loops

For a thread in hard-pause mode that executes loops with no Pause() calls in them will, KiwiC will attempt to unwind all of the work of that loop and perform it in a single run-time clock cycle. (There are some exceptions to this, such as when there are undecidable name aliases in array operations or structural hazards on RAMs but these are flagged as warnings at compile time and run time hardware monitors can also be generated that flag the error).

TODO: describe the way KiwiC resolves structural hazards or variable-latency if the user has specified hard pause mode. Currently, KiwiC essentially tacitly takes and consumes any further clock cycles it needs to do the work.

```csharp
main_unwound_leader()
{
    q = 100;
    for (int d=0; d<16; d++) Console.WriteLine("q={0}", q++);
    while (true) { Kiwi.Pause(); Console.WriteLine("q={0}", q++); }
}
```

The example main_unwound_leader will unwind the first loop at compile time and execute the first 16 print statements in the first clock tick and q will be loaded with 116 on the first clock tick.
4.22 More-complex implied state machines

```c
main_complex_state_mc()
{
    q = 1;
    while(true)
    {
        Kiwi.Pause(); q = 2;
        for (int v=0; v<din; v++) { Kiwi.Pause(); q += v; }
        Kiwi.Pause(); q = 1;
    }
}
```

The example `main_complex_state_mc` has a loop with run-time iteration count that is not unwound because it contains a Pause call. This is accepted by KiwiC. However, it could not be compiled without the Pause statement in the inner loop because this loop body is not idempotent. In soft-pause mode the pause call would be automatically added by KiwiC if missing.

4.23 Inner loop unwound while outer loop not unwound.

```c
main_inner_unwound()
{
    q = 1;
    while(true)
    {
        Kiwi.Pause(); q = 2;
        for (int v=0; v<10; v++) { q <<= 1; }
        Kiwi.Pause(); q = 1;
    }
}
```

In `main_inner_unwound` the inner loop will be unwound at compile time because it has constant bounds and no Pause call in its body. (This unwind will be performed in the bevelab recipe stage, not KiwiC front end.)

4.24 Entry Point With Parameters

A top-level entry point with formal parameters, such as

```c
[Kiwi.HardwareEntryPoint()]
main_withparam(int x)
{
    ...
}
```

is currently not allowed in normal sequencer mode, although in future it would be reasonable for these to be treated as additional inputs. This will be relaxed soon.

Top-level arguments are allowed in RPC (§7.1) and Accelerator major HLS modes (§15).

The `-root` command line flag is an alternative to the HardwareEntryPoint marker. Supplying this information on the command line is compatible with multiple compilation approaches where a given source file needs to be processed in different ways on different compilation runs.
5 Generate Loop Unwinding: Code Articulation Point

The KiwiC front end unwinds certain loops such as those that perform storage allocation and fork threads. The main behavioural elaborate stage of the KiwiC flow also unwinds other loops. Because of the behaviour of the former, the latter operates on a finite-state system and it makes its decisions based on space and performance requirements typical in high-level synthesis flows. Therefore, the loop unwinding performed in the KiwiC front end can be restricted just to loops that perform structural elaboration. These are known as generate loops in Verilog and VHDL. It is a typical Kiwi programming style to spawn threads and allocate arrays and other objects in such loops. Such elaboration that allocates new heap items, in Kiwi 1, must be done in the KiwiC front end since the rest of the HPR recipe deals only with statically-allocated variables.

Since threads both describe compile-time and run-time behaviour a means is needed to distinguish the two forms of loop. The approach adopted is that every thread in the source code is treated as generally having a lasso shape, consisting of code that is executed exactly once before entering any non-unwound, potentially-infinite loop.

The front-end algorithm used selects an articulation point in the control graph of a thread where all loops before this point have been unwound and all code reachable after that point has its control graph preserved in the program output to the next stage. Figure 2 illustrates the general pattern. The articulation point is called the end of static elaboration point. The point selected is the first branch target that is the subject of a conditional branch during an interpreted run of the thread or the entry point to the last basic block encountered that does not contain a Kiwi.Pause() call.

The branch will be conditional either because it depends on some run-time input data or because it is after at least one Kiwi.Pause() call. The semantics of Kiwi.Pause() imply that all code executed after the call are in a new run-time clock cycle. Apparently-conditional branches may be unconditional because of constant folding/propagation during the interpreted run. This is the basis of generate-style loop unwinding in the lasso stem.

Some programming styles require the heap changes shape at run time. A simple example occurs
when an array or other object is allocated after the first call to Kiwi.Pause. We have found that programmers quite often write in this style, perhaps not always intentionally, so it is useful if KiwiC supports it.

```c
main_runtime_malloc()
{
  ...
  Kiwi.Pause();
  int [] a = new Int[10];
  for (int i=0; i<10; i++) a[i] = i;
  while (true) { ... }
}
```

Provided the heap allocator internal state is modelled in the same way as other variables, no further special attention is required. In this fragment the heap values are compile-time constants.

```c
main_runtime_dyn_malloc()
{
  ...
  Kiwi.Pause();
  if (e)
  { int [] a = new Int[10];
    for (int i=0; i<10; i++) a[i] = i;
  }
  while (true) { ... }
}
```

If the value of ‘e’ in runtime_dyn_malloc is not a compile-time constant, KiwiC cannot compile this since there would be two possible shapes for the heap on the exit for the if statement. A solution is to call a.Dispose() before exit, but KiwiC currently does not support Dispose calls.

There’s also the matter of saved thread forks ....

Here the outer loop is non-unwound loop yet has a compile-time constant value on each read if the inner loop is unwound.

```c
while(true) // not unwind
{ 
  for (int i=0;i<3;i++) foo[i].bar(f);
  ... 
}
```

## 6 Supported Libraries Cross Reference

We have started documenting our library coverage in this section.

### 6.1 System.Collections.Generic

Currently (August 2016), none of the standard collection types, such as Dictionary, are provided in the distro. Please contribute.
6.2 System.Random

For random number generation, for both WD and FP, please use KiwiSystem.Random instead of System.Random.

KiwiSystem.Random dg = new KiwiSystem.Random();

6.3 Console.WriteLine and Console.Write

The Write and WriteLine methods are the standard means for printing to the console in C# and Kiwi. They can also print to open file descriptors. They embody printf like functionality using numbered parameters in braces.

Overloads are provided for used with up to four arguments. Beyond this, the C# compiler allocates a heap array, fills this in and passes it to WriteLine, after which it requires garbage collection. This should provide no problem for Kiwi’s algorithm that converts such dynamic use to static use but if there is a problem then please split a large WriteLine into several smaller ones with fewer than five arguments (beyond the format string).

Argument formats supported are

1. \{n\} — display arg \( n \) in base 10
2. \{n:x\} — display arg \( n \) in base 16

Kiwi will convert console writes to Verilog’s $display and $write PLI calls with appropriate munging of the format strings. These will come out during RTL simulation of the generated design. They can also be rendered on the substrate console during FPGA execution.

On important choice is whether this console output is preserved for the FPGA implementation. By default it is, with the argument strings compiled to hardware and copied character by character over the console port.

Sometimes two other behaviours are selectively wanted:

- Additional (quick/debugging) console display that is only converted to Verilog PLI calls. This will display output during an RTL simulation of the FPGA (e.g. using Modelsim) but will be discarded by the vendor FPGA tools that convert KiwiC output to FPGA bit streams.
- To disable all Console.Write and Console.WriteLine output by default from the FPGA console such that these calls behave just like item 1 above.

To achieve item 1, do not call Console.Write or Console.WriteLine. Instead call Kiwi.Write or Kiwi.WriteLine.

To achieve item 2, alter the recipe file or add the following command line argument to KiwiC

-kiwi-fpgaconsole-default=disable
6.4 get_ManagedThreadId
- returns an integer representing the current thread identifier (tid).

```csharp
int tid = Thread.CurrentThread.ManagedThreadId;
Console.WriteLine("Receiver process started. Tid={0}", tid);
```

// OLD
```csharp
Console.WriteLine("Receiver process started. Tid={0}", System.Threading.ManagedThreadId);
```

6.5 System.BitConverter

6.6 System.String.ToCharArray
- convert a string to an array of chars. Chars are 16 bits wide in dotnet. They are tagged shorts and do not behave quite the same as shorts for various output options.

6.7 System.IO.Path.Combine
- join a pair of file name paths - OS-specific. FileStream

6.8 TextWriter

6.9 TextReader
The TestReader ReadLine api is allowed to create garbage under Kiwi provided the outer loop frees or garbages the returned string on every iteration. It must not, for example, store a handle on the returned string in an array.

6.10 FileReader

6.11 FileWriter

6.12 Threading and Concurrency with Kiwi
One novel feature of Kiwi that sets it apart from other HLS systems is its support for concurrency. Threads can be spawned in the static lasso stem but Kiwi does not support thread creation at runtime. Kiwi supports Thread.Create() and Thread.Start(). To run a method of the current object on its own thread use code like this:

```csharp
public static void IProc()
{
    while (true) { ... }
}
```
Thread IProcThread = new Thread(new ThreadStart(IProc));
IProcThread.Start();

Or use delegates to pass arguments to a spawned thread running a method of perhaps another object:

Thread filterChannel = new Thread(delegate() { ZProc(1, 2, 3); });
filterChannel.Start();

Exiting threads can be joined with code like this:

... missing ...
Thread.Join(); // not tested currently.

Mutual exclusion is provided with the lock primitive of C#. Its argument must be the object handle of any instance (not a static class).
The Monitor.Wait and Monitor.PulseAll are supported for interprocess events.

lock (this)
{
    while (!emptyflag) { /* Kiwi.NoUnroll(); */ Monitor.Wait(this); }
datum = v;
    emptyflag = false;
    Monitor.PulseAll(this);
}

The NoUnroll directive to KiwiC can decrease compilation time by avoiding unrolling exploration.

### 6.12.1 Volatile Declarations

Variables that are shared between threads may need to be marked as volatile. The normal semantics are that memory fences are inferred from lock block boundaries and other concurrency primitives such as PulseAll. However, if shared variables are used without such fences they should be declared as volatile. Otherwise a process spinning on a change written by another thread may never see it change.
The C# language does not support volatile declarations of some types. You may get an error such as

//tinytest0.cs(16,26): error CS0677: ‘tinytest0.shared’: A volatile field cannot be of the type ‘ulong’

To overcome this, you can try to use the Kiwi-provided custom volatile attribute instead for now. For instance:
[Kiwi.Volatile()]
static ulong shared_var;

This technique will not stop the C# compiler from optimising away a spin on a shared variable, but the C# compiler may not do a lot of optimisation, based on the idea that backend (jitting) runtimes will implement all required optimisations. Ideally KiwiC works out which variables need to be volatile since all threads sharing a variable are compiled to FPGA at once.

7 Kiwi C# Attributes Cross Reference

The KiwiC compiler understands various .NET assembly language custom attributes that the user has added to the source code. In this section we present the attributes available. These control thinks such as I/O net widths and assertions and to mark up I/O nets and embed assertions that control unwinding.

C# definitions of the attributes can be taken from the file support/Kiwi.cs in the distribution. The Kiwi attributes can be used by referencing their dll during the C# compiler.

```
    gmcs /target:library mytest.dll /r:Kiwi.dll
```

Many attributes are copied into the resulting .dll file by the gmcs compiler. Other code from such libraries is not copied and must be supplied separately to KiwiC. To do this, list the libraries along with the main executable on the KiwiC command line.

**WARNING:** THE ATTRIBUTE LIST IS CURRENTLY NOT STABLE AND THIS LIST IS NOT COMPLETE. For the most up-to-date listing, see hprls/kiwi/Kiwi.cs.

The C# language provides a mechanism for defining declarative tags, called attributes, that the programmer may place on certain entities in the source code to specify additional information. An attribute is specified by placing the name of the attribute, enclosed in square brackets, in front of the declaration of the entity to which it applies. We present design decisions regarding attributes that allow a C# program to be marked up for synthesis to hardware using the KiwiC compiler that we are developing [2]. This compiler accepts CIL (common intermediate language) output from either the .NET or Mono C# compilers and generates Verilog RTL.

7.1 Kiwi.Remote() Attribute

RPC (Remote-Procedure Call) Interface Between Compilations.

Marking up given methods to be remotely callable:

Object-oriented software sends threads between compilation units to perform actions. Synthesizable Verilog and VHDL do not allow threads to be passed between separately compiled circuits: instead, additional I/O ports must be added to each circuit and then wired together at the top level. Accordingly, we mark up methods that are to be called from separate compilations with a remote attribute.

```
[Kiwi.Remote(‘parallel:four-phase’)]
```
public return_type entry_point(int a1, bool a2, ...)  
{ ... }

When an implemented or up-called method is marked as ‘Remote’, a protocol is given and KiwiC generates additional I/O terminals on the generated RTL that implement a stub for the call. The currently implemented protocol is asynchronous, using a four-phase handshake and a wide bus that carries all of the arguments in parallel. Another bus, of the reverse direction, conveys the result where non-void. Further protocols can be added to the compiler in future, but we would like to instead lift them so they can be specified with assertions in C# itself.

KiwiC will generate hardware both for the client and the server as separate RTL files. In more-realistic examples, there will be multiple files, with one being the top-level that contains client calls to some of the others which in turn make client calls to others, with the leaf modules in the design hierarchy being servers only.

One can also envision leaf modules in the design hierarchy making upcalls to parents, but this is not currently implemented in Kiwi.

class test10  
{  
    static int limit = 10;  
    static int jvar;  

    [ Kiwi.Remote(‘‘client1-port’’, ‘‘parallel: four-phase’’) ]  
public static int bumper(int delta)  
{  
    jvar += delta;  
    return jvar;  
}  

[Kiwi.HardwareEntryPoint()]  
public static void Main()  
{  
    Console.WriteLine(‘‘Test 10 Limit=’’ + limit);  
    for (jvar=1;jvar<=limit;jvar+=2)  
    {  
        Console.Write(jvar + ‘‘ ’’);  
    }  
    Console.WriteLine(‘‘ Test 10 finished.’’);  
}  

See demo on this link

http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/rpc.html

7.2 Flag Unreachable Code

Kiwi.NeverReached("This code is not reached under KiwiC compilation.");
This call can be inserted in user code to create a compile-time error if elaborated by KiwiC. If a thread of control that is being expanded by KiwiC encounters this call, it is a compile-time error. For flagging invalid run-time problems, please use System.Diagnostics.Debug.Assert within Kiwi code.

7.3 Hard and Soft Pause (Clock) Control

This section needs joining up with the repeated copy elsewhere in this manual!

Many net-level hardware protocols are intolerant to clock dilation. In other words, their semantics are defined in terms of the number of clock cycles for which a condition holds. A thread being compiled by KiwiC defaults to soft pause control (or other default set in the recipe or command line), meaning that KiwiC is free to stall the progress of a thread at any point, such as when it needs to use extra clock cycles to overcome structural hazards. These two approaches are incompatible. Therefore, for a region of code where clock cycle allocation is important, KiwiC must be instructed to use hard pause control.

The Kiwi.Pause() primitive may be called without an argument, when it will pause according to the current pause control mode of the calling thread. It may also be called with the explicit argument ‘soft’ or ‘hard’.

The current pause control mode of the current thread can be updated by callingKiwi.SetPauseControl.

When a thread calls Kiwi.SetPauseControl(hardPauseControl) its subsequent actions will not be split over runtime clock cycles except at places where that thread makes explicit calls to Kiwi.Pause() or makes a blocking primitive call.

The default scheduling mode for a thread can be restored by making the thread callsKiwi.SetPauseControl(autoPauseControl).

Finally, blockb pause control places a clock pause at every basic block and maximal pause control turns every statement into a separately-clocked operationKiwi.SetPauseControl(maximalPauseControl).

7.4 End Of Static Elaboration Marker - EndOfElaborate

    public static void EndOfElaborate()
    {
        // Every thread compiled by KiwiC has its control flow partitioned
        // between compile time and run time. The division is the end
        // of elaboration point.
        // Although KiwiC will spot the end of elaboration point for itself,
        // the user can make a manual call to this at the place where they
        // think elaboration should end for confirmation.
        // This will be just before the first Pause in hard-pause mode or
        // undecidable name alias or sensitivity to a run-time input etc..
    }
7.5 Loop NoUnroll Manual Control

Put a call to ‘Kiwi.NoUnroll(loopvar)’ in the body of a loop that is NOT to be unrolled by KiwiC. Pass in the loop control variable.

If there is a ‘KiwiC.Pause()’ in the loop, that’s the default anyway, so the addition of a NoUnroll makes no difference.

The number of unwinding steps attempted by the CIL front end can be set with the ‘-cil-uwind-budget N’ command line flag. This is different from the ubudget command line flag used by the FSM/RTL generation phase.

Because a subsume attribute cannot be placed on a local variable in C#, an alternative syntax based on dummy calls to Unroll is provided.

```csharp
public static void Unroll(int a)
{
    // Use these unroll functions to instruct KiwiC to subsume a variable (or variables)
    // during compilation. It should typically be used with loop variables:
    //
    // for (int cpos = 0; cpos < height; cpos++)
    // { Kiwi.Unroll(cpos);
    // ...
    // }

    public static void Unroll(int a, int b)
    {
        // To subsume annotate two variables at once.
    }

    public static void Unroll(int a, int b, int c)
    {
        // To request subsumation of more than three variables note that
        // calling Unroll(v1, v2) is the same as Unroll(v1 + v2). I.e. the
        // support of the expressions passed is flagged to be subsumed in total or
        // at least in the currently enclosing loop.
    }
}
```

7.6 Elaborate/Subsume Manual Control

OLD: Ignore this paragraph from 2015 onwards.

This manual control was used in early versions of KiwiC but has not been needed recently.

KiwiC implements an elaboration decision algorithm. It decides which variables to subsume at compile time and which to elaborate into concrete variables in the output RTL design.

The decisions it made can be examined by grepping for the word ‘decided’ in the obj/h1.log file.

The algorithm sometimes makes the wrong decision. This is being improved in future releases.

For variables that can take attributes in C# (i.e. not all variables), it can be forced one way or the other by instantiating one of the pair of attributes, Elaborate or Subsume.

For example, to force a variable to be elaborated, use:

```csharp
[Kiwi.Elaborate()]
bool empty = true;
```

Examples of variables that cannot be attributed is the implied index variable used in a foreach loop.
or the explicit local defined inside a for loop using the for (int i=...;... ; ...) syntax. The force of an elab can also be made using the -fecontrol command line option. For instance, one might put -fecontrol 'elab=var1;elab=var2';

### 7.7 Synchronous and/or Asynchronous RAM Mapping

See [8]

### 7.8 Register Widths and Overflow Wrapping

Integer variables of width 1, 8, 16, 32 and 64 bits are native in C# and CIL but hardware designers frequently use other widths. We support declaration of registers with width up to 64 bits that are not a native width using an ‘HwWidth’ attribute. For example, a five-bit register is defined as follows.

```csharp
[Kiwi.HwWidth(5)] static byte fivebits;
```

When running the generated C# natively as a software program (as opposed to compiling to hardware), the width attribute is ignored and wrapping behaviour is governed by the underlying type, which in the example is a byte. We took this approach, rather than implementing a genuine implementation of specific-precision arithmetic by overloading every operator, as done in OSCI SystemC [1], because it results in much more efficient simulation, i.e. when the C# program is run natively. Although differences between simulation and synthesis can arise, we expect static analysis in KiwiC to report the vast majority of differences likely to be encountered in practice. Current development of KiwiC is addressing finding the reachable state space, not only so that these warnings can be generated, but also so that efficient output RTL can be generated, such that tests that always hold (or always fail) in the reachable state space are eliminated from the code.

The following code produces a KiwiC compile-time error because the wrapping behaviour in hardware and software is different.

```csharp
[Kiwi.HwWidth(5)] byte fivebits;
void f()
{
    fivebits = (byte)(fivebits + 1);
}
```

The cast of the rhs to a byte is needed by normal C# semantics. Compiling this example gives an error:

KiwiC:assignment may wrap differently:
(widthclocks_fivebits(storage=8 )+1)&mask(7..0):
assign wrap condition test rw=8, lw=5, sw=8

### 7.9 Net-level Input and Output Ports

Input and Output Ports can arise and be defined in a number of ways.
Net-level I/O ports are inferred from static variables in top-most class being compiled. These are suitable for GPIO applications such as simple LED displays and push buttons etc. The following three examples show input and output port declarations, where the first two have their input and output have their width specified by the underlying type and the last by an explicit width attribute.

\[
\begin{align*}
\text{[Kiwi.OutputBitPort("done")]} & \text{ static bool done;} \\
\text{[Kiwi.InputPort("serin")]} & \text{ static bool serialin;} \\
\text{[Kiwi.HwWidth(5)] [Kiwi.OutputPort("data_out")]} & \text{ static byte out5;} \\
\end{align*}
\]

KiwiC can create obscure names if these I/O declarations are not in a top-level class. So, the contents of the string are a friendly name used in output files.

For designers used to the VDHL concept of a bit vector, we also allow arrays of bools to be designated as I/O ports. This can generate more efficient circuits when a lot of bitwise operations are performed on an I/O port.

\[
\begin{align*}
\text{[Kiwi.OutputWordPort(11, 0, "dvi_d")]} & \text{ public static int}[] \text{ dvi_d = new bool [12];} \\
\text{[Kiwi.OutputWordPort(11, 0, "dvi_i")]} & \text{ public static int}[] \text{ dvi_i = new int [12];} \\
\end{align*}
\]

Although it makes sense to denote bitwise outputs using booleans, this may require castings, so ints are also allowed, but only the least significant bit will be an I/O port in Verilog output forms.

Currently we are extending the associated kiwi library so that abstract data types can be used as ports, containing a mixture of data and control wires of various directions. Rather than the final direction attribute being added to each individual net of the port, we expect to instantiate the same abstract datatype on both the master and slave sides of the interface and use a master attribute, such as ‘forwards’ or ‘reverse’, to determine the detailed signal directions for the complete instance.

The following examples work

\[
\begin{align*}
\text{ // four bit input port} \\
\text{[Kiwi.HwWidth(4)] [Kiwi.InputPort("")]} & \text{ static byte din;} \\
\text{ // six bit local var} \\
\text{[Kiwi.HwWidth(6)] static int} & \text{ j = 0;} \\
\end{align*}
\]

A short-cut form for declaring input and output ports

\[
\begin{align*}
\text{[Kiwi.OutputIntPort("")]} & \text{ public static int result;} \\
\text{[Kiwi.OutputWordPort(31, 0)]} & \text{ public static int bitvec_result;} \\
\end{align*}
\]

7.10 Clock Domains

You do not need to worry about clock domains for general scientific computing: they are only a concern for hardware interfacing to new devices. KiwiC generates synchronous logic. By default the output circuit has one clock domain and requires just one master clock and reset input. The allocation of work to clock cycles in the generated hardware depends on the current ‘pause mode’ and an unwind budget described in [2] and the user’s call to built-in functions such as ‘Kiwi.Pause’.
Terminal names `clock` and `reset` are automatically generated for the default clock domain. To change the default names, or when more than one clock domain is used, the ‘`ClockDom`’ attribute is used to mark up a method, giving the clock and reset nets to be used for activity generated by the process loop of that method.

```csharp
[Kiti.ClockDom("clknet1", "resetnet1")]
public static void Work1()
{ while(true) { ... } }
```

A method with one clock domain annotation must not call directly, or indirectly, a method with a differing such annotation.

### 7.11 Remote

Object-oriented software sends threads between compilation units to perform actions. Synthesizable Verilog and VHDL do not allow threads to be passed between separately compiled circuits: instead, additional I/O ports must be added to each circuit and then wired together at the top level. Accordingly, we mark up methods that are to be called from separate compilations with a remote attribute.

```csharp
[Kiti.Remote("parallel:four-phase")]
public return_type entry_point(int a1, bool a2, ...)
{ ... }
```

When an implemented or up-called method is marked as ‘Remote’, a protocol is given (or implied) and KiwiC generates additional I/O terminals on the generated RTL that implement a stub for the call. The currently implemented protocol is synchronous (using the current clock domain - TODO explain how to wire up), using a four-phase handshake and a wide bus that carries all of the arguments in parallel. Another bus, of the reverse direction, conveys the result where non-void. Further protocols can be added to the compiler in future, but we would like to instead lift them so they can be specified with assertions in C# itself. The protocol argument can be omitted from the attribute.

A remote-marked method is either an entry point or a stub for the current compilation. This depends on whether it is called from other hardware entry points (roots).

If it is called, then it is treated as a stub and its body is ignored. Call sites will initiate communication on the external nets. The directions of the external nets is such as to send arguments and receive results (if any).

If it is not called from within the current compilation, then it is treated as a remote-callable entity. The directions of the external nets is such as to receive arguments and return results (if any).

### 7.12 Elaboration Pragmas - Kiwi.KPragma

```csharp
public static int KPragma(bool fatalFlag, string cmd_or_message)
public static int KPragma(bool fatalFlag, string cmd_or_message, int arg0)
public static int KPragma(bool fatalFlag, string cmd_or_message, int arg0,
```
Kiwi.KPragma with first argument as Boolean true can be used to conditionally abend elaboration. This behaves the same way as System.Diagnostics.Debug.Assert described in §7.13 except that a user-defined error code can be passed in arg0.

With the Bool false, it is used to log user progress messages during elaboration.

Kiwi.KPragma calls present in run-time loops can be emitted at runtime using the Console.WriteLine mechanisms (in the future - current release ignores them beyond elaboration).

Kiwi.KPragma calls with magic string values will be used to instruct the compiler, but no magic words are currently implemented.

### 7.13 Assertions `Debug.Assert()`

Sometimes it is convenient to generate compile-time errors or warnings. Other times we want to flag a run-time abend, as per §2.2.

Typically you might want to direct flow of control differently using the function Kiwi.inHardware() and to abort the compilation if it has gone wrong. Call the function Kiwi.KPragma(true/false, ‘my message’) to generate compile time messages. If the first arg holds, the compilation stops, otherwise this serves as a warning message.

You can make use of System.Diagnostics.Debug.Assert within Kiwi code.

In KiwiC 1.0 you have to re-code dynamic arrays with static sizes and this is needed for all on-chip arrays in Kiwi 2.0. The code below originally inspected the fileStream Length attribute and created a dynamic array. But it had to be modified for Kiwi 1.0 use as follows

```csharp
int length = (int)fileStream.Length; // get file length - will be known at runtime
System.Console.WriteLine("DNA file length is {0} bytes.", length);
const int max_length = 1000 * 1000 * 10; // Arrays need to be constant length for System.Diagnostics.Debug.Assert(length <= max_length, "DNA file length exceeds static buffer = new byte[max_length]; // create buffer to read the file
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
{
    sum += count; // sum is a buffer offset for next reading
}
System.Console.WriteLine("All read, length={0}", sum);
```

The C# compiler may/will ignore the Assert calls unless some flag is passed ... 

### 7.14 Assertions - Temporal Logic

Universal assertions about a design can be expressed with a combination of a predicate method (i.e. one that returns a bool) and a temporal logic quantifier embedded in an attribute. For instance, to assert that whenever the following method is called, it will return true, one can put
where the string $\text{AG}$ is a computational tree logic (CTL) universal path quantifier and the second argument is a message that can be printed should the assertion be violated. Although the function 'pred1' is not called by any C# code, KiwiC generates an RTL monitor for the condition and Verilog $\text{display}$ statements are executed should the assertion be violated. In order to nest one CTL quantifier in another, the code of the former can simply call the latter’s method. Since this is rather cumbersome for the commonly used $\text{AX}$ and $\text{EX}$ quantifiers that denote behaviour in the next state, an alternative designation is provided by passing the predicate to a function called ‘Kiwi.next’. A second argument is an optional number of cycles to wait, defaulting to one if not given. Other temporal shorthands are provided by ‘Kiwi.rose’, ‘Kiwi.fell’, ‘Kiwi.prev’, ‘Kiwi.until’ and ‘Kiwi.wuntil’. These all have the same meaning as in PSL.

We are currently exploring the use of assertions to describe the complete protocol of an I/O port. Such a description, when compiled to a monitor, serves as an interface automaton. To automatically synthesise glue logic between I/O ports, the method of [3] can be used, which implements all non-blocking paths through the product of a pair of such interface automata.

8 Memories in Kiwi

Arrays allocated by the C# code must be allocated hardware resources. Small arrays are commonly converted directly into Verilog array definitions that compile to on-chip RAMs using today’s FPGA tools. There are a number of (adjustable) threshold values that select what sort of RAM to target. Larger arrays are placed off-chip by default. Arrays that are only written at each location precisely once with a constant value for each location are treated as read-only look-up tables (ROMs).

Sometimes there are multiple ports to a given memory space/bank for bandwidth reasons. For instance, on the Xilinx Zynq, it is common to use two high-performance AXI bus connections to the same DRAM bank. In addition, there can be multiple memory controllers each with its own channel. We prefer the term channel to the older term bank since bank now refers to an internal bank within a DRAM chip that can have up to one row open in each bank. Kiwi does not currently support multiple channels.

Terminology summary: we use the following hierarchy of terms to describe the off-chip memory architecture: bit, lane, word, row, col, bank, rank, channel.

Explanation: A word is addressed with a binary address. The row, col, bank and rank are all fields in the address. Ordering between col and bank may vary. Channels potentially have disjoint address spaces. Mapping the channel number into the address would eliminate spatial reuse and simply be an extension of the rank. Within the word there are multiple lanes that are separately writable and each lane has some number of bits. In today’s CPUs from Intel and ARM, the lane size is 8 (a byte lane) and the word size is also 8, making it a 64-bit word. On FPGAs, where clock frequencies are lower than DRAM speeds, word sizes of 512 can commonly be used with a correspondingly larger number of lanes.

In this documentation, we use the term ‘off-chip’ to denote resources that are not instantiated by KiwiC and which, instead, are provided by the substrate platform. In reality, the resources might
FPGAs tools support RAMs in four general ways. The four ways provide increasingly better FPGA area use, but become more complex to read and write.

1. **Flip-flop register file**: Each bit of RAM becomes a flip-flop. This does not limit the number of concurrent readers or writers.

2. **Distributed RAM**, also known as **LUT RAM**: The look-up table (LUT) of a typical FPGA is used normally for something like an arbitrary two-output function of five inputs. It is then actually a 32-word RAM of 2-bit words. The can be used as RAM by many FPGAs. It is called distributed, LUT or slice RAM.

3. **Block RAM**: As well as I/O, flip-flops and LUTs, all modern FPGAs also provide block RAMs as a first-class programmable resource. Typically these are dual ported and 18 kilobit in size.

4. **Off-chip RAM - SRAM or DRAM**: Rather than storing data on the FPGA, I/O pins are used to connect to external, standard RAM parts.

  The FPGA tools will generally automatically choose which of the first three forms in the above list to infer for a given RTL array declaration. They take into account the size and use pattern. Important aspects of the use pattern are whether the output is used in the same clock cycle as the address is generated and how many different and concurrent address patterns are used. The fourth RAM form is not automatically generated by FPGA tools, but HLS tools such as KiwiC will deploy it and the FPGA tools will simply see logic that wiggles pins to operate the RAM or generate AXI transactions destined for a complex memory subsystem.

<table>
<thead>
<tr>
<th>RAM Form</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Flip-flop register file</td>
<td>Each bit of RAM becomes a flip-flop. This does not limit the number of concurrent readers or writers.</td>
</tr>
<tr>
<td>Distributed RAM (LUT RAM)</td>
<td>The look-up table (LUT) of a typical FPGA is used normally for something like an arbitrary two-output function of five inputs. It is then actually a 32-word RAM of 2-bit words.</td>
</tr>
<tr>
<td>Block RAM</td>
<td>As well as I/O, flip-flops and LUTs, all modern FPGAs also provide block RAMs as a first-class programmable resource. Typically these are dual ported and 18 kilobit in size.</td>
</tr>
<tr>
<td>Off-chip RAM - SRAM or DRAM</td>
<td>Rather than storing data on the FPGA, I/O pins are used to connect to external, standard RAM parts.</td>
</tr>
</tbody>
</table>

The number of ports is unlimited for type 1 (register file) and the FPGA tools will typically implement such a register file if the number of operations per clock cycle is more than one. This depends...
on the number of subscription operators in the generated RTL, the number of different address expressions in use and whether the tools can infer disjointness in their use.

For types 2 through 4, the number of ports is decided by KiwiC and it generates that number of read, write and address busses. By default, KiwiC uses one port per clock domain, but this can be influenced in the future with PortsPerThread and ThreadsPerPort attributes.

In the current version of Kiwi, the res2-loadstore-port-count (formerly called res2-no-dram-ports) recipe setting configures the number of load/store ports available per thread. Also, each thread that makes off-chip loads and stores must have its own port since KiwiC does not automatically instantiate the DRAM (HFAST) arbiters; instead the substrate top-level needs to instantiate the arbiters when KiwiC generates more DRAM ports than physically exist on the FPGA.

The three thresholds set in the command line or recipe that distinguish between the four memory types are:

1. **res2-regfile-threshold**: the number of locations below which to not instantiate any sort of structural SRAM or register file: instead raw flip-flops are used.

2. **res2-combram-threshold**: the threshold in terms of number of locations at which to start instantiating synchronous, latency=1, structural SRAM.

3. **res2-offchip-threshold**: the threshold in terms of number of locations at which to map to an off-chip resource, such as TCM, ZBT or cached DRAM.

In addition to comparing sizes against compilation thresholds, the user can add CSharp attributes to instances to force a given technology choice on a per-RAM basis.

The SynchSRAM(n) attribute indicates that an array is to be mapped to an on-chip RAM type that is not the default for its size. The argument is the number of clock cycles of latency for read.

**TODO**: describe PortsPerThread and so on... these control multi-port RAMS and how the number of external ports is configured.

Kiwi has a scheduler in its restructure phase that runs at compile time to sequence operations on scarce resources such as complex ALUs and memory resources. Kiwi supposedly implements runtime arbitration for resources that are contended between threads, but the reality is currently different. It follows three policies: 1. For 'on-chip' RAMs like FPGA B-RAM it allocates one port per thread so, with Xilinx and Altera that support up to two ports only two threads can access an 'on-chip' B-RAM. 2. For ALUs it does not share them between threads and starts the ALU budgeting freshly for each thread, just as though the threads had been separately compiled. 3. For 'off-chip' RAM like DRAM, it generates one (more are possible via the command line) HFAST port per thread. The user must currently manually instantiate arbiters that mux this collection of ports onto the DRAM banks that are available.

However, Kiwi does not care whether 'off-chip' resources are actually off-chip and instead one can use the off-chip technique to multiplex and arbitrate multiple threads onto on-chip resources, such as a large, manually instantiated B-RAM.

### 8.1 On-chip RAM (and ROM) Mirror, Widen and Stripe Directives

To increase memory performance, three techniques are generally available (these techniques may not all be sensible for off-chip RAM resources). All of these increase the number of data bus wires...
to RAMs, thereby increasing available throughput.

1. A `Kiwi.Mirror(n)` directive applied to a C# array instructs KiwiC to make multiple copies of the RAM or ROM. This is most sensible for ROMs since all copies of a RAM must be updated with every write.

2. A `Kiwi.Widen(n)` directive applied to a C# array instructs KiwiC to pack $n$ words into a single location. This multiplies the data bus width by this factor. For RAMs, a RAM with laned writes may be needed. This will boost performance where an aligned group of $n$ words is commonly read and written at once.

3. A `Kiwi.Stripe(n)` directive applied to a C# array instructs KiwiC to allocate $n$ multiple RAMs or ROMs each of $1/n$th the size with every $n$th word placed in each of them.

(In order to pack multiple user arrays into a single RAM on the FPGA, additional directives are needed. Not described here currently.)

### 8.2 ROMs (read-only memories) and Look-Up Tables

Most FGPAs support ROMs. ROM inference is a variation on RAM inference. Combination and registered ROMs are both commonly used, depending on size. KiwiC will deploy ROMs with pipeline latency of 1 when the size in addresses exceeds the size set by `res2-combrom-threshold`.

ROM inference in KiwiC can be turned off with flag `repack-to-rom=disable` in which case RAMs are commonly generated and initialised with the ROM contents after the run-time reset. But, when ROMs are present, they are manifest in the generated Verilog RTL as arrays that have their only write operations embodied in Verilog `initial` statements that install the fixed data.

ROMs can sometimes usefully be mirrored. The `Kiwi.Mirror(4)` attribute can be applied to individual array instances to mirror them.

```
[Kiwi.Mirror(4)]
static readonly uint[] htab4 = { 0x51f4a750, 0x7e416553, 0x1a17a4c3, 0x3a275e96,
                               ... many more entries ...
};
```

Or else the command line flag `repack-to-rom=4` can be added, which would replicate all ROMs up to a factor of 4, but the additional copies would not be generated if they cannot usefully be used.

### 8.3 Forced Off-chip/Outboard Memory Array Mapping

The `OutboardArray` attribute forces that an array is to be mapped to a region of external memory instead of being allocated a private array inside the current compilation. Large arrays are placed off chip in this way by default. It is up to the substrate architect what sort of memory to attach to the resulting port: it could range from simple large SRAM bank to multiple DRAM banks with caches.

OLD: The fullest version of this attribute takes two arguments: a bank name and an offset in that bank.
OLD: Pre performance profiling: In general, arrays can be mapped to a specific bank by giving the bank name and leaving out the base address. KiwiC will then allocate the base addresses for each memory to avoid overlaps. If no bank name is given, then a default of 'drambank0' is automatically supplied. Therefore, without using any attributes, all large arrays are mapped into consecutive locations of a memory space called 'drambank0'.

8.4 Off-chip load/store ports

KiwiC generates load/store ports to access off-chip memory. (Off-chip means not instantiated by KiwiC, so the addressed resource can be on the same die in reality). With more load/store ports in use, greater memory access bandwidth is available AND greater opportunities for out-of-order memory service exist.

The off-chip port architecture is defined in recipe/command line settings. It is also written as a report file in every KiwiC run. The Off-chip Memory Physical Ports/Banks report looks something like this:

| Name     | No Words | Awidth | Dwidth | Lanes | LaneWidth |
|-----------+----------+--------+--------+-------+-----------|
| loadstor1 | 4194304  | 22     | 256    | 32    | 8         |

Total load/store port width = bits per lane * number of lanes.

Default -res2-loadstore-port-count=1

Number of LOADSTORE ports for automatic off-chipping of large RAMs.

res2-loadstore-port-lanes 32 LOADSTORE ports - number of write lanes.

res2-loadstore-lane-width 8 LOADSTORE lane width

When the number of lanes is 1 no lane write enables are used and the memory is word addressed always.

A suitable behavioural Verilog fragment to connect to them for simulation test purposes is available as part of the distro in the rams folder.

Typical DRAM controllers run much faster than the FPGA user logic and hence a wide word is presented to the KiwiC-generated code of 256 bits or so.

The user’s wanted data width is either rounded up to some integer multiple number of external words, or some fraction of a word where the fraction is rounded up to a bounding power of 2 number of lanes.

The restructure log file will explain, somewhat cryptically, how each DRAM bank is being used with a table that contains interleaved entries covering all the banks (portnames). The lines in this report can be decoded with experience: D16 means sixteen bits wide. AX means an array. etc..

Off-chip Memory Map
Performance generally needs to be enhanced above this baseline by packing data sensibly into DRAM words. Also, support of multiple in-flight requests is preferable for the highest performance. The KiwiC-generated code should be connected to an externally-provided memory controller that will often also include some sort of cache.

Three off-chip protocols are supported BVCI, HSIMPLE and HFAST. HFAST is most commonly used. BVCI allows multiple transactions to be in flight. AXI will be added shortly to KiwiC, but there are some AXI components in the support and substrates library. Including an HFAST to AXI protocol bridge and AXI master and slave shims for the Zynq substrate for CPU interaction and DRAM access.

When we say ‘off-chip’ we simply mean outside the generated hardware circuit - the substrate configuration may put various items on the same Physical chip.

8.4.1 HSIMPLE Offchip Interface & Protocol

Low-performance HSIMPLE uses four-phase handshake and only transfers data once every four clock cycles. It is more suitable for connecting to simple peripherals than DRAM. The following nets will require connection to the synthesis output when the DRAM is in use with the default, simple, 4/P HSIMPLE protocol.

```
output reg hs_dram0bank_req,
input hs_dram0bank_ack,
output reg hs_dram0bank_rwbar,
output reg [255:0] hs_dram0bank_wdata,
output reg [21:0] hs_dram0bank_addr,
input [255:0] hs_dram0bank_rdata,
output reg [31:0] hs_dram0bank_lanes,
```

When the number of lanes is one, there are no lane outputs.

8.4.2 HFAST Offchip Interface & Protocol

HFAST1 offers one cycle read latency and back-to-back operations, achieving 100 percent throughput. It is ideal for front-side cache connections where prefetch is not used.

The signature for HFAST is typically as follows (the total width and number of lanes and address bus width are all parameterisable).

```
output reg hf1_dram0bank_opreq,
input hf1_dram0bank_oprdy,       // Any posedge clk with overlap of opreq and opard
```
input hf1_dram0bank_ack, // Ack acknowledges the last request is complete.
output reg hf1_dram0bank_rvbar, // 1=read, 0=write on request active clock edge.
output reg [255:0] hf1_dram0bank_wdata, // For write, data to be written, valid on request.
output reg [21:0] hf1_dram0bank_addr // Address, valid on request active clock edge.
input [255:0] hf1_dram0bank_rdata, // Read result, valid on ack cycle.
output reg [31:0] hf1_dram0bank_lanes, // Byte lane qualifiers.

When the number of lanes is 1 no lane write enables are used and the memory is word addressed always.

A DDRAM2 controller is available in the file kiwi/rams/ddr2-models. This can be used for high-level simulations. It instantiates the DDR_DRAM_BANK underneath itself.

A behavioural model of a DDRAM2 is available in the file kiwi/rams/ddr2-models. It has signature:

```verilog
module DDR_DRAM_BANK(
    input clk, // DDR Clock - 800 MHz typically. We use one edge
    input reset, // Active high synchronous reset
    input ddr_ras, // Active low row address strobe
    input ddr_cas, // Active low col address strobe
    input [log2_internal_banks-1:0] ddr_ibank,// Internal bank select
    input ddr_rwbar,// On CAS: 1=read, 0=write. On RAS 1=precharge,
    input [2*dwidth-1:0] ddr_wdata, // The wdata and rdata busses are here twice their
    input [awidth-1:0] ddr_mux_addr, // Multiplexed address bus
    input [2*dwidth/8-1:0] ddr_dm, // Lanes: Separate nets here for +ve and -ve edge
    output reg [2*dwidth-1:0] ddr_rdata); // Read data bus.
```

```verilog
parameter log2_dwidth = 5;
parameter dwidth = (1<<log2_dwidth); // Word width in bits - we actually have twice
// FOR DRAM style
// E.g. MT41K256M32-125 DDR3 @ 800 MHz/1.25ns RCD-RP-CL=11-11-11 Arch=32M x 32 bits x 8
parameter LOG2_ROW_SIZE = 15; // Log_2 number of words per RAS
parameter LOG2_COL_SIZE = 10; // Log_2 number of words per CAS
parameter PRECHARGE_LATENCY = 11;
parameter ACTIVATE_LATENCY = 11;
parameter CAS_LATENCY = 11;
parameter log2_internal_banks = 3;
parameter awidth = LOG2_ROW_SIZE; // Address width in bits - word addressed.
// DRAM burst size - can be dynamically encoded in high-order CAS address. Currently fixed for DDR) this requires 4 clocks to transfer the burst.
parameter burstSize = 4;
```

HFAST2 is the same as HFAST1 but uses a two-cycle, fully-pipelined read latency.

A simple cache is provided. Its signature is:

```verilog
module cache256_hf1
(input clk,
 input reset, // synchronous, active high.

 // Front-side interface
 input fs_rwbar,
```
output reg [noLanes*laneSize-1:0] fs_rdata,
input [noLanes*laneSize-1:0] fs_wdata,
input [addrSize-1:0] fs_wordAddr,
output fs_oprdy,
input fs_opreq,
output reg fs_ack,
input [noLanes-1:0] fs_lanes,

// Back-side interface
output reg bs_rwbar,
input [noLanes*laneSize-1:0] bs_rdata,
output reg [noLanes*laneSize-1:0] bs_wdata,
output reg [addrSize-1:0] bs_wordAddr,
input bs_oprdy,
output reg bs_opreq,
in
put bs_ack,
output reg [noLanes-1:0] bs_lanes
);

parameter dram_dwidth = 256; // 32 byte DRAM burst size or cache line.
parameter laneSize = 8;
parameter noLanes = dram_dwidth / laneSize; // Bytelanes.

The cache must be manually instantiated by the substrate designer.
HFAST arbiters can be instantiated on the front or back side of the cache, so that multiple synthesised load/store
ports can share one cache or multiple caches can share one DRAM bank. Sharing would be inconsistent.
The default substrate runs the DRAM and DRAM controller at 800 MHz and the Cache and KiwiC generated
code at 133 Mhz which is 1/6th of this.

8.4.3 BVCI Offchip Interface & Protocol

Text missing.

8.5 AXI and HFAST-to-AXI mapping

AXI has become the most prevalent SoC and FPGA bus interface standard. AXI supports burst transactions and
out-of-order service. Such AXI service discipline is well-suited to a high-performance DRAM bank controller.
(Such a bank controller typically has 8 internal banks, all of which can be concurrently open on a DRAM row.)

Today’s CPUs use multiple load/store stations per core that are pari passu with that core’s ALUs. KiwiC-generated hardware is no different. Each load/store station is busy with at most one scalar load/store request and this can only be served in order.

As with CPUs, there are two techniques that adapt between single-issue load/store stations: multiplexing and caching.

- Multiplexing multiple single-issue, in-order clients onto a single bus readily generates a traffic load that can be served out of order. In addition, there may be spatial locality between requests that can be aggregated into a burst.
- The front-side of a cache is optimised for random-access, low-latency operations. Since each is served
(nominally) instantly, there is no scope for out-of-order discipline. On the other hand, the back side of a cache creates line fills and writebacks that are burst operations.

KiwiC load/store stations are served with HFAST interfaces. In the fullness of time, KiwiC will provide automated support for HFAST to AXI adaptation but currently a substrate that manually matches the number of load/store ports is required.

The substrate typically converts the KiwiC-generated HFAST interfaces to AXI or other off-chip protocols not currently supported by KiwiC. The substrate provider writes RTL transactors to convert protocols.

### 8.6 Off-chip address size

KiwiC assumes it can use address zero upwards in the off-chip space. The substrate must offset the address bus to address available SoC regions if this is not the case.

KiwiC accepts a recipe parameter to bound the amount of off-chip memory it can use in its one channel. Where a design attempts to use more memory, a compile-time error is raised.

‘res2-loadstore-lane-addr-size’ gives the off-chip address bus width in bits. In other words, this is the log2 no of words of memory available in each address space. Providing different limits for different off-chip spaces will be enabled in future. The word size and lane structure is defined with ‘res2-loadstore-port-lanes’ and ‘res2-loadstore-lane-width’ where the first of these is typically 4, 8, 16 or 32 and the second nearly always 8 (i.e. byte-sized lanes).

### 8.7 B-RAM Inference

B-RAM instantiation is normally automatic in FPGA tools. B-RAMs with an access latency of one clock cycle are normally used although KiwiC can support zero and two cycle reads (but how to access them is not described here! TODO).

A B-RAM is inferred from a structure following one of several paradigms based on all addresses passing through a single register or all read data being passed through a single register. These can be mapped onto the same underlying technology by posting the writes as necessary but the effects of read while writing to the same location differ.

KiwiC generates on-chip RAMs as explicit instances in the generated RTL. It uses ‘read before’ coding style. The FPGA Vendor ‘read after’ forms, where newly written data is read out are not explicitly found in the generated RTL: KiwiC will forward the data for itself when needed, either at compile or run time.

```vhdl
// (C) Xilinx 2009. Single-Port B-RAM with Byte-wide Write Enable: Read-First mode
// File: HDL_Coding_Techniques/rams/bytewrite_ram_1b.v
//
module v_bytewrite_ram_1b #(
    parameter SIZE = 1024,
    parameter ADDR_WIDTH = 10,
    parameter COL_WIDTH = 9,
    parameter NB_COL = 4)
(
    input clk,
    input [NB_COL-1:0] we,
    input [ADDR_WIDTH-1:0] addr,
    input [NB_COL*COL_WIDTH-1:0] di,
```
output reg [NB_COL*COL_WIDTH-1:0] do;
reg [NB_COL*COL_WIDTH-1:0] RAM [SIZE-1:0];
always @(posedge clk) begin
  do <= RAM[addr];
end

generate
  genvar i;
  for (i = 0; i < NB_COL; i = i+1) begin
    always @(posedge clk)
      if (we[i]) RAM[addr][(i+1)*COL_WIDTH-1:i*COL_WIDTH] <= di[(i+1)*COL_WIDTH-1:i*COL_WIDTH];
  end
endgenerate
endmodule

// Single Port Block RAM with registered output Option
// Please note that XST infers distributed RAM or B-RAM based on the size. For small RAMs, you
module TWO_CYCLE_READ_BRAM(
  input clk,
  input wen,
  input [6:0] a,
  input [15:0] di,
  output reg [15:0] do);
  reg [15:0] ram [0:127];
  reg [15:0] do0;
always @(posedge clk) begin
  if (wen) ram[a] <= di;
  do0 <= ram[a];
  do <= do0;
end
endmodule

Style 1:

always @(posedge clk) begin
  addr_reg <= addr ... ;
  if (wen ...) data[addr_reg] <= (wdata ...);
  rdata = data[addr_reg]; // Note blocking assign used or
  // else the rhs freely used elsewhere.
end

Style 2:

always @(posedge clk) begin
  if (wen ...) data[addr] <= (wdata ...);
8.8 Dual-port and multi-port RAMs

See demo test50.

The FPGA libraries contain dual-port RAMs. These can be used for sharing data between up to two threads. Threads can also share data via a scalar variables. Kiwi supports any number of threads reading or writing shared scalar variables but there are technology restrictions on shared access to arrays.

Where an array is small enough to instantiated as an FPGA on-chip B-RAM (block RAM), and overrides are not applied, then such a B-RAM will be used. Both Xilinx and Altera provided FPGAs with on-chip, dual-ported B-RAMs with synchronous read latency of one cycle.

If three threads operated on the shared memory, Kiwi would generate an instance of a triple-ported SRAM module but this would not be found in the technology library when then FPGA tools were applied. A hardware designer could implement such a device, but it would probably have to be variable latency (i.e. have handshake wires) and this can be requested with CSharp attributes on the array instance. The preferred/support design style for when three or more threads share an array is to ensure the underlying memory if ‘off-chip’ and then each thread will make access to its via its own load/store port (see Kiwi manual).

By default, KiwiC will use one port on an SRAM for each thread that operates on it. However, by setting the PortsPerThread parameter to greater than one then greater access bandwidth per clock cycle for each thread is possible. Note that Xilinx Virtex BRAM supports up to two ports per BRAM in total, so having ports per thread set to two is the maximum sensible value and that may only sensible if there is only one thread making access to the RAM. In the future, several threads in the same clock domain might get to share the physical ports if the compiler can spot they are temporarily disjoint (i.e. never concurrent).

We need to add a little more explanation or forward reference here please ...

9 Substrate Gateway

The substrate gateway is a hardware/software boundary for use on platforms such as Zynq or others that run embedded linux with a console, network and filesystem. It has an associated protocol for providing operating system access.

9.1 Console I/O

This section will explain how to do console I/O via the substrate gateway.

9.2 Filesystem Interface

The basic dotnet classes for streamreader, StreamWriter, textreader and textwriter are provided via the substrate gateway. Random access using fseek is also supported.

The following nets will require connection to the synthesis output when the Kiwi file system is in use.

For high performance computing applications the filesystem is part of the Kiwi Substrate (alongside the DRAM).
A suitable behavioural Verilog fragment to connect to them for simulation test purposes is `/kiwi/filesystem/kiwifs_bev.v` that provides the basic console and file stat/exists/open/close/read/write calls required by the dotnet Stream and File.IO classes.

The remainder of this part of the user manual is missing, but please check the Bowtie Genome Sequencer demo for an example of file system use.

### 9.3 Hardware Server

The `Server` attribute indicates that a method and the methods it calls in turn are to be allocated to a separate RTL module that is instantiated once and shared over all calling threads.
10 Kiwi Performance Tuning

An HLS system can be set to optimise for

1. Performance: achieving the best execution time, aiming for maximal clock frequency and minimal number of clock cycles,
2. Area: using as little area as possible, generally at the expense of many more clock cycles,
3. Debugibility: renaming and sharing registers as little as possible and providing additional debug and trace resources for interactive access.

The main parameters for tuning the Kiwi Area/Performance tradeoff, folding space over time are:

1. The `bevelab-soft-pause-threshold` parameter. The nominal range is 0 to 100 with useful values currently being between 5 and 40. A lower value tends towards more clock cycles and possibly less area. Values above 40 may lead to very long KiwiC compile time.
2. The loop unwind limits alter the amount that a loop is unwound at compile time, leading to parallelism. For instance, the `Kiwi.Unroll("COUNT=4", lvar);` attribute added to the C# source code suggests that the loop whose control variable is called ‘lvar’ is unwound by a factor of 4.
3. Structural Resource Budgets: The restructure phase accepts ten or so recipe settings that limit the maximum number of structural resources, such as floating-point ALUs allocated per thread. Smaller settings lead to smaller designs that use more clock cycles.
4. RAM thresholds: Settings such as `res2-offchip-threshold` alter the amount of block RAM allocated. This is faster than external (off-chip) SRAM or DRAM but uses more FPGA resources.
5. The setting `res2-loadstore-port-lanes` alters the number of external memory ports used. These each operate in order, so if you have more of them and mux them externally onto separate resources or an out-of-order bus then you get more parallelism and external RAM bandwidth.
6. ALU latency: Settings such as `fp_fl_dp_div` describe the type of divider to generate. For such components you can provide your own implementations, alongside those provided in the Kiwi libraries like `cvgates.v`, and specify whether they are fixed or variable latency, fully-pipelined and what the fixed or expected latency in clock cycles is.
7. Register colouring affinity: The `kiwic-colour-enable` setting alters the amount to which KiwiC reuses registers. With it disabled, the hardware is easier to inspect/debug, but many more registers are generated. An experimental, spatially-aware binder is being added to Kiwi at the moment. This will handle both registers and ALUs and gives a floorplan plot.

10.1 Kiwi Performance Predictor

In 2015 a performance predictor was added to Kiwi so that estimates of run-time performance can be rapidly provided without having to do an FPGA place-and-route or even a complete pre-FPGA RTL simulation. The performance predictor is based on basic block visit ratios stored in a database that is updated with the results from short runs. When the application is edited and recompiled with KiwiC, a new prediction is generated, straightaway, based on the contents of the database generated by previous versions. Short profile runs of the new design can then be run to improve prediction accuracy. Every prediction is reported with confidence limits. The reported confidence is reduced (wider error bars) both by certain design edits and by extrapolating to runs that are much longer than those used for profiling.

Performance prediction is based on accurate knowledge of control flow branching ratios: the percentage of time a conditional branch is taken or not taken. This enables execution counts for each basic block to be estimated. Profile information from previous runs is the default basis for this knowledge. To ensure the information stored
in the profile database is robust against program edits, it cannot be indexed by fragile tags such as a basic block number in global syntax-directed enumeration. Instead, performance prediction uses the method names occurring naturally in the application program as timing markers. Every method has a clear entry point as well as potentially several exit points (return statements are numbered in their textual order in the CIL byte code... branches to the exit). With loops that contain no method calls in their bodies, the user must add a method call to a dummy method (null body) and that method should be (preferably?) annotated with a KppMarker attribute. Conditional branches and basic block names are then taken in a syntax-directed way from the code between the named control-flow points and discrepancies in the control flow graph between named points is used to flag warnings and discard profile information no longer usable.

All call strings for a method can either be considered separately or in common. The call string is the concatenation of the call site textual names from the thread or program entry point. If the call strings are considered in common, they are being disregarded and the average over all call strings is used.

These attributes also enable the user to control the way the performance estimation report is presented. They also enable the user to provide a substitute loop or visit count that overrides the stored profile. This provides the basis for extrapolating the run time from a small test or profiling data set to the envisioned real date size that will be processed on the FPGA.

Where the performance predictor cannot find profile information for a branch it assumes a 50/50 division and the number of such assumptions and their effect on the confidence in the result is included in the report.

10.2 Phase Changes, Way Points and Loop Markers

Hardware itself does not have a start and end time. Instead, performance metrics are always quoted between a START/FINISH pair of named events. A typical program is structured with a time-domain series of internal phases, such as ‘startup’, ‘load’, ‘compute’ and ‘report’. The performance predictor makes separate predictions for each phase and sums them. The confidence for different phases may be different, typically according to which part of the program was most recently edited. A marker between phases is called a way point. Kiwi.KppMark dummy calls and/or Kiwi.KppMarker attributes are used to define waypoints. Each way point has a name and all but the last start a phase that optionally also has a name. The entry and exit waypoints should be called START and FINISH respectively. The program’s control flow cannot loop around a way point. If a KppMarker is found in a loop body, or a method body where that method is called more than once, the provided labels are code point markers (explained below).

```plaintext
Kiwi.KppMark("START", "subsequent-phase-name1");
...
Kiwi.KppMark("waypoint-name2", "subsequent-phase-name2");
...
Kiwi.KppMark("waypoint-name3", "subsequent-phase-name3");
...
Kiwi.KppMarker("FINISH");
```

A waypoint is a special form of code point marker. The use of code point markers adds robustness to the information stored in the profile database against program edits, allowing it to be safely applied to edited programs. The markers provide index points that can be associated with loop heads and other control-flow points, to assist in robustness of the profile for complex method bodies. Basic block names are then named in a syntax-directed way with respect to, and as textual extensions of, the previous and next labelled control point. KppMark has no innate multi-threaded capabilities and so should generally be set by an application’s master-controlling thread, assuming it has one.
10.3 Growth Parameter Assertions/Denotations

C# attributes also enable the user to provide a substitute loop or visit count that overrides the stored profile. This provides the basis for extrapolating the run time from a small test or profiling data set to the envisioned real data set size that will be processed on the FPGA. Also, hardware itself does not have a start and end time - it is static/eternal. Instead, performance metrics are always quoted between a start/end pair of named code labels, again specified with C# attributes. Times for various phases within a program, such as ‘load’, ‘process’ and ‘write out’, can also be predicted by inserting appropriate further control-graph delineations with an attribute that denotes a way point.

10.4 Debug and Single Step

There is no explicit support for hardware debug currently in Kiwi. However, we will add shortly the following features to the generated RTL that can be hooked up to a management CPU via the substrate gateway. They each add hardware overhead but this can be trimmed out mostly by FPGA tools when reporting resources are left disconnected.

2. Thread status and control bit vectors: run enable and status: not started, running and exited.
3. CPU register debug access ports: additional read/write logic is generated enabling programmed I/O access to every register.

Register colouring, RAM binding with memory maps and ALU binding is reported in the KiwiC report file. Only a static mapping, generated at KiwiC compile time, is used.

Watchpoints are best implemented in the C# source code and recompiled, or else use vendor tools like ChipScope etc..

11 Spatially-Aware Binder

An experimental, spatially-aware binder is being added to Kiwi at the moment. This will handle both registers and ALUs and gives a floorplan plot.

12 Generated RTL

Kiwi generates Verilog RTL for synthesis to FPGA by vendor tools. It can also generate SystemC but we do not use that here.

KiwiC will assume the presence of various IP blocks in Verilog. These include RAMs and fixed and floating point ALUs. It will instantiate instances of them.

The library blocks are generally provided in the following source files:

```
CV_FP_ARITH_LIB=$(HPRLS)/hpr/cv_fparith.v
CV_INT_ARITH_LIB=$(HPRLS)/hpr/cvgates.v
CVLIBS=$(CV_INT_ARITH_LIB) $(CV_FP_ARITH_LIB)
```
12.1 RAM Library Blocks

Fixed-latency RAMs are provided in the cvgates.v. They have names such as CV_SP_SSRAM_FL1 which denotes a synchronous RAM with fixed read latency of one clock cycle (FL1) and one port (SP). The cvgates implementations are intended to be synthesisable by FPGA tools.

Parameter overrides set the address range and word and lane width.

12.2 ALU Library Blocks

These blocks are found in cv_fparith.v

Example: CV_FP_FL5_DP_ADDER - floating point, fixed latency of 5 clock cycles, double precision

CV_FP_FL_SP_MULTIPLIER

Key: FLASH=combinational. FLn = fixed latency of $n$ clock cycles, VL variable latency with

13 Incremental Compilation

Compiling everything monolithically does not scale to large projects. A modular design is always needed for revision control, unit testing and other reasons.

There are several cut points in the design flow where separately-compiled modules can be combined. Numbers 1 and 3 in the following list are relatively obvious, so we discuss only number 2.

1. KiwiC will accept any number of .dll or .exe files on its command line. These will have been generated, typically, from separate invocation of the C# compiler.

2. The Kiwi.Remote() attribute described in §7.1 enables a given method to be cut out for separate compilation.

3. Incremental invocation of FPGA tools is also typically possible, where some RTL files have been seen before and others are new, but is beyond the scope of this document.

4. (In principle it is possible to load and save VMs to disk (serialised in XML) and so incremental compilation at intermediate points in the opath recipe is a future option.)

Separately-compiled modules will not share hardware resources (such as registers, ALUs or RAMs) between them. Also, each will, in general, have its own (set of) load/store port(s) for access to DRAM.

14 Examples

There are some examples in the standard distribution, such as primes and cuckoo cache.

14.1 A get-started example: 32-bit counter.

Here’s how to make a simple synchronous counter that produces a 32-bit net-level output.
using KiwiSystem;
{
    [Kiwi.OutputWordPort("counter")]
    static int counter;

    [Kiwi.HardwareEntryPoint()]
    static int Main2()
    {
        while(true)
        {
            Kiwi.Pause();
            counter = counter + 1;
        }
    }
}
Part IV

Expert and Hardware-level User Guide

15 Kiwi Hard-Realtime Pipelined Accelerators

Note: real-time Pipelined Accelerator mode is being implemented 3Q16.

Classical HLS generates a custom datapath and controlling sequencer for an application. The application may run once and exit or be organised as a server that goes busy when given new input data. KiwiC supported only, up until now, that classical way for each thread. We call this ‘sequencer major HLS mode’.

In the new ‘Pipelined Accelerator’ major HLS mode, KiwiC will generate a fully-pipelined, fixed-latency stream processor that tends not to have a controlling sequencer, but which instead relies on predicated execution and a little backwards and forwards forwarding along its pipeline.

A pipelined accelerator mode with latency set to zero results in a purely combinatorial circuit in terms of input to output data path, but it may post writes to registers and RAMs that still need a clock.

The prior Kiwi.Remote() attribute, described in §7.1, enables a given method to be cut out for separate compilation. This was non-rentrant and does not enforce hard real time.

When generating a real-time accelerator, a C# function (method with arguments and return value) is designated by the user as the target root, either using a C# attribute or a command line flag to the KiwiC compiler. The user may also state the maximum processing latency. He will also typically state the reissue frequency, which could be once per clock cycle and whether stalls (flow control) is allowed.

For a real-time accelerator, multiple ‘calls’ to the designated function are being evaluated concurrently in the generated hardware. Operations on mutable state, including static RAMs and DRAM are allowed, but care must be taken over the way multiple executions appear to be interleaved, just as care is needed with re-entrant, multithreaded software operating on shared variables. Local variables are private to each invocation.

Although we default to every concurrent run’s behaviour being treated in isolation, we support two means for inter-run communication: we can address the arguments and intermediate state of neighbouring (in the time domain) runs and, as mentioned just above, we can read and write mutable state variables that are shared between runs.

A root module for a hardware accelerator is a C# static method with arguments and a return value. Variable-latency leaf cells cannot be instantiated (currently) in accelerator mode where the latency varies by more than the reinitiation interval. Further details need defining, but, for now, we need to avoid off-chip DRAM and KiwiC will request fixed-latency integer dividers (latency equal to the bit width) instead of the more commonly instantiated variable-latency divider.

15.1 Pipelined Accelerator Example 1

A simple example is test54 in KiwiC regression suite.

```csharp
static readonly uint[] htab4 = { 0x51f4a750, 0x7e416553, 0x1a17a4c3, 0x3a275e96,
    ... many more entries ... };

// We require a reissue interval of 1 (fully pipelined)
// We want a maximum latency of 16.
```
We can specify the reissue interval via the C# attribute. In this example, a reissue interval of 1 is specified. This generates fully-pipelined hardware that can be supplied with fresh arguments every clock cycle.

We also specify the maximum result latency as 16. KiwiC will determine its own latency, up to this value, guided by the logic cost settings, and report it in the KiwiC.rpt output file.

The ROM, in the full source code of the example, has 256 entries, and so is implemented as a statically-initialised block RAM on most FPGAs. This has a synchronous access time of one clock cycle. For multiple, concurrent accesses, as required by the reissue interval of 1, the ROM must be mirrored. Owing to loop-carried ROM address dependencies, the minimum implementation latency, by inspection, is 5 cycles.

All loops offered in pipelined accelerator mode must be fully unwindable by KiwiC. This means they must have a hard and obvious upper iteration limit, but they may have data-dependent early exit.

Internally, in our first implementation, the bevelab recipe stage unwinds all loops. This gives a single superstate to the restructure recipe stage which operates in a mode where all holding registers and input operands are replicated as needed in pipeline form and where mirroring of structural resources, such as the ROM in the above example, is used to avoid structural hazards arising not only for multiple use by a single run, as normal, but over different stages in that run that are separated by more than the reissue interval.

16 Designing General/Reactive Hardware with Kiwi

Kiwi can be used in an RTL-like style for some applications. This is where the user takes more active control over clock cycle mapping than is required or desired by scientific users.

The Kiwi system has a hard pause mode, clock domains and net-level I/O facilities for specifying cycle-accurate hardware. This is needed for bit-bang coding to connecting to existing hardware interfaces like AXI, I²C and LocalLink. Ideally, protocols are supported natively by Kiwi and bit-banging can be avoided.

16.1 Input and Output Ports

Input and Output Ports can arise and be defined in a number of ways.

Net-level I/O ports are inferred from static variables in top-most class being compiled. These are suitable for GPIO applications such as simple LED displays and push buttons etc., The following two examples show input and output port declarations, where the input and output have their width specified by the underlying type and by attribute, respectively.

```c
[Kiti.InputPort("serin") static bool serialin;
[Kiti.HwWidth(5)] [Kiti.OutputPort("data_out") static byte out5;
```

The contents of the string are a friendly name used in output files.

For designers used to the VDHL concept of a bit vector, we also allow arrays of bools to be designated as I/O ports. This can generate more efficient circuits when a lot of bitwise operations are performed on an I/O port.
Although it makes sense to denote bitwise outputs using booleans, this may require castings, so ints are also allowed, but only the least significant bit will be an I/O port in Verilog output forms.

16.2 Register Widths and Wrapping

Integer variables of width 1, 8, 16, 32 and 64 bits are native in C# and CIL but hardware designers frequently use other widths. We support declaration of registers with width up to 64 bits that are not a native width using an 'HwWidth' attribute. For example, a five-bit register is defined as follows.

```csharp
[Kiwi.HwWidth(5)] static byte fivebits;
```

When running the generated C# natively as a software program (as opposed to compiling to hardware), the width attribute is ignored and wrapping behaviour is governed by the underlying type, which in the example is a byte. We took this approach, rather than implementing a genuine implementation of specific-precision arithmetic by overloading every operator, as done in OSCI SystemC [1], because it results in much more efficient simulation, i.e. when the C# program is run natively.

Although differences between simulation and synthesis can arise, we expect static analysis in KiwiC to report the vast majority of differences likely to be encountered in practice. Current development of KiwiC is addressing finding the reachable state space, not only so that these warnings can be generated, but also so that efficient output RTL can be generated, such that tests that always hold (or always fail) in the reachable state space are eliminated from the code.

The following code produces a KiwiC compile-time error because the wrapping behaviour in hardware and software is different.

```csharp
[Kiwi.HwWidth(5)] byte fivebits;
void f()
{
    fivebits = (byte)(fivebits + 1);
}
```

The cast of the rhs to a byte is needed by normal C# semantics.

Compiling this example gives an error:

KiwiC assign wrap error:
(widthclocks_fivebits{storage=8 }+1)&mask(7..0):
assign wrap condition test rw=8, lw=5, sw=8

The following examples work

```csharp
[Kiwi.HwWidth(4)] [Kiwi.InputPort("")] static byte din;
[Kiwi.HwWidth(6)] static int j = 0;
```

A short-cut form for declaring input and output ports
16.3 How to write state machines...

Kiwi hardware coding styles: how to code combinational, Mealy and Moore systems in hard-pause mode.

16.3.1 Moore Machines

First compare the Moore machines define by main_pre and main_post:

```java
main_pre()
{
    q = 100;
    while (true) { q -= din; Kiwi.Pause(); }
}

main_post()
{
    q = 100;
    while (true) { Kiwi.Pause(); q -= din; }
}
```

each has some initial reset behaviour followed by an indefinite looping behaviour. Their difference is the contents of q on the first tick: main_pre will subtract din on the first tick whereas main_post does not. In both cases, q is a Moore-style output (i.e. dependent on current state but not on current input).

The shortly-to-be-implemented optimisation in bevelab will make a further change: the run-time program counter will disappear entirely for main_post because the loading of q with its initial value will be done as part of the hardware reset. However, main_pre will still use a state machine to implement its different behaviour on the first clock tick.

16.3.2 Mealy and combinational logic:

Coding Mealy-style logic and purely combinational sub-circuits is not currently supported. Purely combinational logic could possibly inferred from an unguarded infinite loop, such as main_comb

```java
main_comb() { while (true) q = (din) ? 42:200; }
```

However, main_comb is not a sanitary program to run under KiwiS since it will hog excessive CPU power.

Mealy-style coding could better be implemented with a new attribute as illustrated in main_mealy where the mel output is a function of both the current state q and current input din.

```java
main_mealy() { while (true) { q += 1; mel = q+din; Kiwi.Pause(); } }
```

Exploring this further would best be done in conjunction with further development of SystemCsharp to yield a nice overall semantic. TODO perhaps?
16.4 State Machines

Explicit state machines can be coded fairly naturally:

```java
main_explicit_state_mc()
{
    q = 1;
    while(true)
    {
        Kiwi.Pause();
        switch(q)
        {
            case 1: q = 2; break;
            case 2: q = 3; break;
            case 3: q = 1; break;
        }
    }
}
```

and the position of the single Kiwi.Pause() statement before or after the switch statement only alters the reset behaviour, as discussed above.

Implicit state machines can also be used:

```java
main_implicit_state_mc()
{
    q = 1;
    while(true)
    {
        Kiwi.Pause(); q = 2;
        Kiwi.Pause(); q = 3;
        Kiwi.Pause(); q = 1;
    }
}
```

Because main_implicit_state_mc is a relatively simple example, the KiwiC compiler can be expected to reuse the initial state as the state entered after the third Pause call, but in general the compiler may not always spot that states can be reused.

16.5 Clock Domains

A synchronous subsystem designed with kiwi requires a master clock and reset input. The allocation of work to clock cycles in the generated hardware is controlled by an unwind budget described in [2] and the user’s call to built-in functions such as 'Kiwi.Pause'. By default, one clock domain is used and default net names `clock` and `reset` are automatically generated. To change the default names, or when more than one clock domain is used, the `ClockDom` attribute is used to mark up a method, giving the clock and reset nets to be used for activity generated by the process loop of that method.

```java
[Kiwi.ClockDom("clknet1", "resetnet1")]
public static void Work1()
{ while(true) { ... } }
```

A method with one clock domain annotation must not call directly, or indirectly, a method with a differing such annotation.
17 SystemCSharp

SystemCSharp follows the design of SystemC and currently there is a very initial version of it in existence. Please see the README.txt in its folder.

SystemCSharp is a library, written in C#, that provides RTL semantics for hardware modelling. In particular, it provides signals that support the evaluate/commit paradigm of synchronous digital logic, where all variables in a clock domain take on their new values, atomically, on the active edge of the relevant clock.

The KiwiC compiler can generate SystemCSharp output by using the `-csharp-gen=enable` command line flag. The default output name is the default name with the suffix `.sysc.cs` added. The `-csharp=filename` flag can be used to change the output filename.
Kiwi Developers’ Guide and Compiler

18 KiwiC Internal Operation

KiwiC is a compiler for the Kiwi project. It aims to produce an RTL design out of a named sub-program of a C# program.

KiwiC does not currently invoke the C# compiler: instead it reads a CIL portable assembly language file (.exe or .dll) generated by a Microsoft or Mono C# compiler.

Figure 3 shows key components of the main flow through the tool as set up with the provided recipe file (KiwiC00.rcp). The full recipe contains ten or so stages and the obj folder created by running the tool contains the log files and intermediate forms for each stage. Other output flows and formats can be deployed by changing the recipe. The dotted line shows that using the simvnl command line option the internal simulator (Diosim) can be applied to the RTL after it has been round-tripped through Verilog. For debugging, Diosim can be applied to any HPR machine intermediate form, by varying the recipe. (There’s also a shortcut ‘-conerefine=disable
FIRST PASS:
Per method-basis, remove stack
Insert SPILL variables

SECOND PASS
Inlines all methods of a thread

THIRD PASS: Generate HPR DIC per thread.
Design becomes fully statically allocated,
Unwinds loops,
Points-to analysis for each heap object,
Remaining object pointers become enumeration types.

Local Variables at same call depth
of same thread are reused

Heap must be same shape at each point
on each
iteration of non-unwound loops

One machine per thread
using shared variables and mutexes
for communication

Figure 4: The internal flow of the KiwiC front-end recipe stage.
-repack=disable -verilog=disable’ that will cause diosim to run the original VM generated by the KiwiC front end without conversion to hardware). This is needed for the profile-directed feedback.

The .NET executable bytecode is read using the Mono.Cecil front end. Any needed libraries, including Kiwi.dll and Kiwic.dll are also read in. These are combined with some canned (hardwired in the front end) system libraries. The result is a large CIL abstract syntax tree. This can be output for tracing/debugging if desired.

The KiwiC front end (IL elaborate stage) converts the .net AST to the internal form used by the core library, the HPR VM2 machine, which contains imperative code sections and assertions. Code sections can be in series or parallel with each other, using Occam-like SER and PAR blocks.

The front end-stage subsumes a number of variable present in the input source code, including all fixed object pointers.

The VM code emitted by KiwiC front end is a set of parallel ‘dic’ blocks. These are ‘directly indexed code’ arrays of imperative commands and there is one for for each user thread. They are placed in parallel using the PAR construct. Each dic array is indexed by a program counter for that thread. There is no stack or dynamic storage allocation. The statements are: assign, conditional branch, exit and calls to certain built-in functions, including hpr_testandset, hpr_printf and hpr_barrier. The expressions occurring in branch conditions, r.h.s. of assignment and function call arguments still use all of the arithmetic and logic operators found in the IL input form. In addition, limited string handling, including a string concat function are handled, so that console output from the CIL input is preserved as console output in the generated forms (eg. $display in Verilog RTL).

The conversion to FSM is described in §24 and is common to the h2ool flow. Actually, that is the old scheduler and a new one is now being used.

Its output is an HPR machine stylised in that there are no program counters and every statement operates in parallel.

The output forms available include Verilog RTL, which we have used for FPGA layout. The stylised output from the FSM generation stage is readily converted to a list of Verilog blocking assignments.

## 18.1 Background: HPR Library and Orangepath Tool

HPR/Orangepath is a refinement framework designed for synthesis of protocols and interfaces in hardware and software forms.

Orangepath H2 represents a system as an hierarchy of abstract machines. Each machine is a database of declarations, executable code and assertions/goals. The goals are assertions about the system behaviour, input directly, or generated from compilation of temporal logic and data conservation rules into automata. Executable code can pass through the system unchanged, but any undriven internal nodes are provided with driver code that ensures the system meets its goals.

As far as possible, all operations are ‘src-to-src’ like operations on HPR virtual machines and the operations are stored in a standard opath command format to be executed by an orangepath recipe (program of commands)

The library is structured as a number of components that operate on a VM to return another VM. The opath mini-language enables a ‘recipe’ to be run that invokes a sequence of library operations in turn. An opath recipe is held in an XML file and the default file is KiwiC00.rcp.

Loops in the recipe can be user to repeat a step until a property holds.

The opath core provides command line handling so that parameters from the receive and the command line are fed to the components. The opath core also processes a few ‘early args’ that must be at the start of the command line. These enable the recipe file to be specified and the logging level to be set.

The Orangepath library has a number of supported input and output formats.

In this manual, we concentrate almost entirely on the .NET CIL input format and the Verilog RTL output
format.

Skip this section if you do not feel you need to know about the internal workings of KiwiC.

CIL code is the assembly language used by the mono and .NET projects. The HPR tool can read in CIL assembly code when invoked using the KiwiC command. This assembly code is generated by a large number of third party compilers from various input languages. OLD: NO: All dynamic storage allocation must be made by constructors before the entrypoint method is called. There are numerous other limitations on how the CIL code must be structured.

This stage reads CIL bytecode and generates VM code for the HPR virtual machine. The CIL bytecode is parsed to an AST by a bison parser.

A variable is a static or dynamic object field, a top-level method formal, a local variable, or a stack location. For each variable we decide whether to subsume it at the CIL processing stage. If not subsumed, it appears in the abstract VM code that is fed to the next stage (where it may then get subsumed for other reasons). Variables that are subsumed in this way tend to be object and array handles. Such variables must contain compile-time constant values throughout the execution of the output code.

We perform a symbolic execution of each thread at the CIL basic block level and emit VM code for each block. CIL label names that are branch destinations define the basic block boundaries and these appear verbatim in the emitted VM code.

Although CIL bytecode defines a stack machine, no stack operations are emitted from the CIL processing stage. Stack operations within a basic block are symbolically expanded and values on the stack at basic block boundaries are stored and loaded into stack variables on exit and entry to each basic block. The stack variables are frequently subsumed, but can appear in the VM code and hence, from time-to-time, in the output RTL.

A -root command line flag enables the user to select a number of methods or classes for compilation. The argument is a list of hierarchic names, separated by semicolons. Other items present in the CIL input code are ignored, unless called from the root items.

Where a class is selected as the root, its contents are converted to an RTL module with IO terminals consisting of various resets and clocks that are marked up in the CIL with custom attributes (see later, to be written). The constructors of the class are interpreted at compile time and all assignments made by these constructors are interpreted as initial values for the RTL variables. Where the values are not further changed at run time, the variables turn into compile-time constants and disappear from the object code.

Where a class is selected as a root, all of the methods in that class will be compiled as separate entry points and it is not normally appropriate for one to call another: calls should generally be to methods of other classes.

Where a method is given as a root component, its parameters are added to the formal parameter list of the RTL module created. Where the method code has a preamble before entering an infinite loop, the actions of the preamble are treated in the same way as constructors of a class, viz. interpreted at compile-time to give initial or reset values to variables. Where a method exits and returns a non-void value, an extra parameter is added to the RTL module formal parameter list.

The VM code can be processed by the HPR tool in many ways, but of interest here is the 'convert_to rtl' operation that is activated by the '-vnl' command line option. (NB: This is now on by default in the KiwiC00 recipe, disable with -verilog-gen=disable).

KiwiC TimesTable.exe -root 'TimesTable;TimesTable.Main' -vnl TimesTable.v

More than one portable assembly (CIL/PE) file can be given on the command line and KiwiC will aggregate them. The file name of the last file listed will be used to name the compilation outputs by default (in the absence of other command line flags).

(At some point, KiwiC might be extended to also invoke the C# compiler if given a C# file.)
Figure 5: The main flow implemented in the KiwiC tool (same as figure ??).
Part VI

Miscellaneous

19 FAQ and Bugs

Note: Do not use Console.WriteLine or Write with 4 or more arguments since MCS converts these calls to a different style not supported by KiwiC.

Q. I get a postscript file called `nolayout.eps’ what is this?
A. The HPR library contains a constructive placer that writes a floor plan to an eps file. This is used for net-length power analysis on output RTL. It is also being used in the constructive placer to decide how best to colour registers and bind functional units such as ALUs.

Q. Do you have any Xillybus or JetStream (Manchester) demos?
A. No, but we expect these to be contributed soon ...

Q. KiwiC is generating a circuit with too many output terminals to fit in my FPGA. Why is this?
A. You may be directly instantiating the Kiwi-generated RTL as the top-level of your FPGA. This is not a normal design route: you should most likely be using a standard Kiwi substrate for your FPGA and it is the substrate that instantiates the Kiwi code. The problem most likely arises from the Waypoint outputs. These are only for simulation purposes and they can be safely ignored. If they are left disconnected in the component that instantiates the Kiwi-generated RTL the FPGA tools will delete the logic that drives them instead of attempting to route them to the output.

```csharp
output reg [639:0] KppWaypoint0,
output [639:0] KppWaypoint1,
```

You can also use command line flag `-vnl-keep-waypoints=disable` to turn off their rendering.

Q. What IP-XACT support does Kiwi have?
A. None at the moment, but the debug access port that Kiwi should provide in the future will likely have its structure reported in IP-XACT (§10.4).

Q. I tried more ideas for one-liners, such as:

```csharp
exist = Array.IndexOf(LUT, tmp) > -1 ? true : false;
```

but it didn’t work.

A. Since Kiwi imports very little of the standard C# libraries, the `IndexOf` method of the Array class is most likely missing. For 2-D and greater arrays, Kiwi uses an implementation in Kiwi.cs and it is easily possible to add the implementation of Index into those implementations in C# src code form and it should then work. For 1-D arrays, the bulk of the implementation is hardcoded inside KiwiC, but there should be potential to extend the hardcoding with additional C# code and place that, ultimately, in Kiwi.cs as well. Its a matter of knowing what to put in there. I’ll run a little test. It should work.

Q. Why do I get KiwiC error: do not update your formal parameters for now.
A. The message you have now encountered is a result of storing or modifying a formal parameter to a function which is functionality that is sadly currently missing. Just copy your formal into a local var at the start of the function body for now.
void myfun (int fp)
{
    int copied_fp = fp;
    copied_fp += 1; // Do not directly modify your call by value formals for the moment. (P)
}

Q. What does this mean: System.Exception: CV_INT_FL2_MULTIPLIER_S unrecognised gate for presim

A. This is from the built-in simulator, diosim. The design has used a fixed-latency of 2 multiplier (from cvgates.v or elsewhere) but the simulator does not know how to simulate it. Restructure its own simulation model for each component it deploys, but one fix is to not apply diosim to (miss off the -sim=nnnn flag) since the generated RTL should be ok.

Q. How can I get meaningful line numbers in my error messages from KiwiC?

A. Line numbers are hard to track through the C# front end, but errors should be reported on a method name basis. There is a fairly-detailed log file written to the obj/h02_kiwic folder but it is hard to understand. Increasingly you can get a finer cross reference with the source code by embedding waypoints in your source file.

Q. Why are bools using 32 bits, even in arrays?

A. A C# compiler may compiles them this way - CIL has no run-time bool class. It may be best to instantiate your own bit-packed array class with suitable overloads if you want to exploit bit-level storage.

Q. Can I generate a VCD using the builtin simulator, diosim.

A. Yes, use the "-sim=nnnn" argument to set the number of cycles to simulate for and add "-diosim-vcd=myvcd.vcd" to set the output file name. The "-recipe=recipes/simkcode.rcp" command line flag is also useful for just running the KiwiC front end in a software-like simulation.

Q. Why is the reset input not used in the generated RTL?

A. See §35. The reset net is disconnected unless you indeed add

    -vnl-resets=synchronous
    or
    -vnl-resets=asynchronous

or change this XML line in the file /distro/lib/recipes/KiwiC00.rcp

    <defaultsetting> resets none </defaultsetting>

Q. Why does the type of the output result end up as: reg [31:0] FIFO result; instead of reg FIFO result;

A. In Verilog, integers are signed and registers are not. You can alter this by adjusting the definition of result. Recent Verilog standards also allow signed registers to be defined.

Q. I have lots of X uncertain values in my simulation

A. Is the source of X from flip-flops that are not cleared at reset or is it floating inputs? Did you put -vnl-resets=synchronous? You do not need this on all FPGA simulations since FPGA flops are self resetting, but with the associated simulator you may need this.

It is good to trace the pc10nz program counter (or similar name) generated by KiwiC for each thread. This normally starts at zero. You can cross check that with the dot graphviz output or the tables appended to the back of the .v file (also present in the obj/h08_restructure/s00... file).
Q. I thought I would have a go at synthesizing the …
However, the Verilog finish statement gets in the way. Should there really be a finish command in synthesizable Verilog?

A. If the main entry point to the C\# program allows its thread to exit then a finish will be put in the output code. This is indeed not synthesisable. Quite often one wants the program to exit when run native but not when synthesised. The solution to this is to place the main body of the program in a subroutine that is called from the Main method (ie the entry point). The same subroutine is also called from a second method where it is enclosed in an infinite while loop. This second method can then be named as the root to KiwiC and this will avoid a finish statement in the generated code.

Suppressing the default operation on main thread exit statement can be controlled with a command line flag \{-kiwic-finish\}.
\index{-finish}
\index{-kiwic-finish}

```
\begin{verbatim}
-kiwic-finish= [ enable | disable ]
```

Another solution is to mark up the main body subroutine with the Kiwi.Remote() attribute. This places it in an infinite loop, and adds handshaking wires to start and stop its execution.

Another solution is to put an infinite loop in the main entry point (perhaps including a Kiwi.Pause() statement in the loop if there is other complexity to ensure KiwiC spends less time working out that it is infinite).

Q. I get the following strange error message even when I am sure my program is not allocating fresh memory inside the thread lasso loop : Bad form heap pointer for obj_alloc (already allocated a variable sized object ?).

A. Check whether you are allocating local arrays on the stack: if these are just constant lookup tables makes sure you put the keyword const in front to make them statically-allocated.

Q. I get an error like [ERROR] FATAL UNHANDLED EXCEPTION: System.Exception: threadstart//T403/Main/t55:2:
Creating class instance this/uid token=System/Action'2/star1/@/16/SS/TX1/SINT/TX0: Bad form heap pointer for obj_alloc of type System/Action'2/star1/@/16/SS/TX1/SINT/TX0 post end of elaboration point (or have already allocated a runtime variable sized object ?). storemode=STOREMODE,completime_heap, sbrk=/tend:nota, const constant_fold_meets entry_point=0

A. This is a Kiwi 1 restriction - most heap objects need to be allocated before the end of static elaboration. Consider moving the code that allocates the heap object to the class constructor or else to another method that you call earlier. (For allocate-once items, this code migration will become automatic soon.)

Q. Can I use in Kiwi the data type struct?

A. Kiwi aims to support static and dynamic classes well. Structs in C# are slightly odd things and Kiwi has a little support form them that is properly well tested. This is being fixed Sept 2016. Normally you should use classes but if you have a good reason to use structs we can see how well it is currently working.

Q. What string formatting is supported in Console.Write or WriteLine?

A. Up to three arguments are supported. String, integer decimal, integer hex and floating point should all work. String catenation is also supported provided it is done a KiwiC compile time.

Q. I get FPGA or RTL SIM error regarding CV_SP_SSRAFL1 missing.
A. This is a single-ported synchronous static RAM with fixed latency of 1 read cycle. It will most likely be mapped to block RAM by FPGA tools. There are a number of such components that KiwiC instantiates. Please include a Kiwi technology library such as distro/lib/cvgates.v in your back end compile.

Q. Does Kiwi support the keyword 'break'?

A. Yes, all control flow constructs like for/while/continue/break are handled by the C# compiler and just appear as goto's in the CIL dot net code input to KiwiC.

examples - all are standard dot net

{0} - arg 0 in decimal or floating

{1} - arg 1

{2} - arg 2

{1:x} - arg 1 in hex

{1:X} - arg 1 in upper case hex

{1:3} - field width of 3 decimal

{1:03X} - field width of 3, hex with leading zeros

Q. If I instantiate: static ulong[] buffer = new ulong[10], KiwiC will generate registers. In the simulation I noticed that I got, not 10 regs, but 18. I tried also with static ulong[] buffer = new ulong[5] and got 8 regs.

A. A short array of 10 entries is most likely to be mapped to 10 separate registers, especially if you only use constant subscripts. If your subscripts can be determined not to use the whole range or only use multiples of a some constant or fall in disjoint regions you will get other patterns. Quite how it gets allocated depends on the pattern of subscripts you use. The figure 18 you quote is presumably inflation on top of that from other aspects of the design? Kiwi does not replicate and mirror storage at the moment although this could possibly be useful under some circumstances. Ditto 5 to 8. Also, it depends on how many time you assign to buffer and how many different calls to new you make. I assume you have just one assign outside of any loop or re-ntrant code.

Q. I try to instantiate 2 ulong[256] arrays. In the RTL there are two memories, one A_64_US_[255:0][63:0] and one A_64_US_[2047][63:0]. I checked also the verilog file and I noticed that the address of the second array, whenever there is an operation, is multiplied by 8. Is it because of some optimization?

A. The byte address of a u64 array will be a factor of 8 different from the word address. Also If you only used every 8th location in an array, the repack recipe stage might notice this and divide each address by 8 to save space. The addresses on the input to the repack recipe stage are byte addresses. The addresses afterwards should be efficiently packed addresses, which would be /64 if you used only every 8th word owing to both effects acting.

Q. KiwiC seems to be deleting most of my design. Is this correct?

A. The processing stage called conerefine deletes unused parts of the design. A part of the design is unused if it generates no output. Outputs include PLI calls like Console.WriteLine or net-level outputs flagged with kiwi.outputwordport or similar. Adding ~conerefine=disable to the command line suppresses the associated trimming, resulting in a larger RTL or other output file, although occasionally this may lead to elements being present at the code generation stage that cannot be sensibly rendered in the output language.

Q. If I want a net-level I/O bus wider than 128 bits (the size of a ulong), what can I do?
A. There is some support for this that needs documenting, where an array is passed as I/O. The colourbars example illustrates this style, but it is not in the repo and has not been tested for a while. However, having a static C# struct (not a class) as an I/O ought to work. However, C# structs is not mature in KiwiC. We can easily fix a few basic cases now however. See test51.

Q. KiwiC is taking a very long time to compile and then fails. It says it has run out of unwind steps. Why is this?

A. If you are in a soft pause mode, KiwiC will infer Kiwi.Pause() statements where it feels necessary to allocate work to clock cycles. In hard pause mode KiwiC is not free to insert such pauses. If you have an infinite loop without a pause in it, KiwiC will fail to unwind the loop. Check that all control paths (PC trajectories) inside infinite loops have at least one Kiwi.Pause() inside them.

Q. KiwiC is trying to start wine and creating file paths with backslashes in them, even though I am running on Linux. It also reports it is running on NT 5.2 when there is no windows machine anywhere involved.

A. On recent linux systems, on encountering a .exe the shell will start wine and try to open windows and so on. The KiwiC shell scripts enable you to define MONO and you should set this in your environment to ‘mono’ or ‘/usr/bin/mono’. If this still does not fix the problem please set you shell env var MONO_OS OVERRIDE to something beginning with ‘l’ such as linux64 and KiwiC will override the installed path combiner and related options.

+++ checking failed:
Factorial_fac[15:0]:OUTPUT::Unsigned{init=0, io_output=true, HwWidth=16, storage =32} := Factorial_fac*FTFT4FactorialCircuit_V_0: assignment may wrap differently
[Kiwi.OutputWordPort(15, 0)] static uint fac = 1;

A. These are warnings that the generated RTL will behave differently from the dot net versions if overflow occurs in the custom bit width fields.

You defined the output port to be a sixteen bit register but used the ’uint’ dot net valuetype to model it in the dll. You are performing an operation on this field that is sensitive to its width. The warning is that there might be a difference in behaviour if, e.g. you increment this value so that it goes above 56535.

Part VII

Orangepath Synthesis Engines

The Orangepath project supports various internal synthesis engines. The aim is to include SSMG but some more simple engines are also provided. The other engines include the FSM generator, the PSL compiler and the restructurer.

Because all input is converted to the HPR machine and all output is from that internal form it is sensible to use the HPR library for translation purposes without doing any actual synthesis.

A synthesis engine rewrites one HPR machine as another.

20 A* Live Path Interface Synthesiser

The H2 front end tool allows access to the live path interface synthesiser.
The A* version is described on this web page. http://www.cl.cam.ac.uk/djg11/wwwhpr/gpibpage.html

The follow-on to this work is being undertaken by MJ Nam.

21 Transactor Synthesiser

The transactor synthesiser is described on this link
http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/transactors

22 Asynchronous Logic Synthesiser

The H1 tool implements an asynchronous logic synthesiser described on this link.

23 SAT-based Logic Synthesiser

The H1 tool implements a SAT-based logic synthesiser described on this link.
http://www.cl.cam.ac.uk/djg11/wwwhpr/dslogic.html

(This synthesiser is currently not part of the main HPR revision control branch.)

24 Bevelab: Synchronous FSM Synthesiser

Bevelab is an HPR plugin that converts HPR threaded forms to RTL form. Both the input and outputs to this stage typically have the concept of a program counter per thread, but the number of program counter states is greatly reduced. In the output form, many assignments and array writes are made in parallel. A custom data path is generated for each thread and the program counter becomes the internal state of a micro-sequencer that controls that data path. The emitted program counter does not need to be treated differently, then on, from any other scalar register, although the distinction is preserved in the output form for readability, debugging (and perhaps to assist proof tools), and for the Kiwi Performance Predictor that needs to track the control flow graph through the complete toolchain.

(An alternative to bevelab is the VSFG stage (§25) that can achieve greater throughput with heavily-pipelined components in the presence of complex control flow.)

Usually, the input is in DIC form where the DIC contains assignments, conditional gotos, fork/join and leaf calls to HPR library functions. More-advanced imperative control flow constructs, such as while, for, continue, break, call and return need to have been already removed.

The resulting RTL is generally ‘synthesizable’ as defined by language standards for Verilog, VHDL and SystemC. The resulting RTL is generally ‘synthesizable’ as defined by language standards for Verilog, VHDL and SystemC. Although it uses common subexpression sharing, it is hopelessly inefficient since a naive compilation to hardware would instantiate a fresh, flash arithmetic operator at every textual site where an operator occurs. In addition, it will typically be full of structural hazards where RAMs are addressed at multiple locations in one clock cycle, whereas in reality they are limited in number of simultaneous operations by their number of ports. Finally, the RAMs and ALUs are assumed to be combinatorial by this RTL, whereas in reality they are pipelined or variable latency.
Converting to one of the output languages, such as SystemC, is by a subsequent plugin. But the output of bevelab is normally first passed via restructure (that overcomes structural hazards and performs load balancing) to the verilog-gen plugin where it is converted to Verilog RTL syntax.

Both bevelab and restructure can trade execution time against number of resources in parallel use: the time/space fold/unfold. Bevelab is the core component of any ‘C-to-gates’ compiler. It packs a sequential imperative program into a hardware circuit. As well as packing multiple writes into one cycle, it can unwind loops any bounded number of times. Loops that read and write arrays can generate very large multiplexor trees if the array subscripts are incomparable at unwind time, since there are very many possible data bypasses and forwardings needed. Therefore, a packing that minimises the number of multiplexors is normally chosen. A simple greedy algorithm is used by default: as much logic as possible is packed into the first state, defined by the entry point to the thread, subject to four limits:

1. a multiplexing logic depth heuristic limit being reached,
2. a name alias (undetermined array address comparison) being needed,
3. a user-annotated loop unwind limit being reached, and
4. containing an intrinsically pausing operation.

Once the first state is generated, which may contain multiple input conditional branches that become predication within that state, successive micro-sequencer states are generated until closure.

Certain operations are already known to be pausing. One is a user-level explicit pause where the source code contains a call to ‘Kiwi.Pause()’. This is needed for net-level protocols, such as parallel to serial conversion in a UART. Others, such as trying to use results from integer divide, any floating point arithmetic, non-fully-pipelined multiply and reads from RAMs that are known to be registered also generate pauses when their source operands are also generated in the current micro-sequencer state.

Bevelab operates using the heuristics given in Table 3 takes an additional input, from the command line,
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Style</th>
<th>Default</th>
<th>Max</th>
</tr>
</thead>
<tbody>
<tr>
<td>Maximum number of name aliases array read</td>
<td></td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>Maximum number of multiplexors in logic path</td>
<td>loops</td>
<td>10</td>
<td></td>
</tr>
<tr>
<td>Maximum default number of iterations to unwind loops</td>
<td></td>
<td>4</td>
<td></td>
</tr>
</tbody>
</table>

Table 3: Bevelab Heuristic Table.

which is an unwind budget: a number of basic blocks to consider in any loop unwind operation. Where loops are nested or fork in flow of control, the budget is divided over the various ways.

The flag `generate-nondet-monitors` turns on and off the creation of embedded runtime monitors for non-deterministic updates.

The flag `preserve-sequencer` should be supplied to keep the per-thread vestigial sequencer in RTL output structures. This makes the output code more readable but can make it less compact for synthesis, depending on the capabilities of the FPGA tools to do their own minimisation.

The string `-vnl-resets=synchronous` should be passed in to add synchronous resets to the generated sequencer logic. This is the default.

The string `-vnl-resets=asynchronous` should be passed in to add asynchronous resets to the generated sequencer logic.

The string `-vnl-resets=none` should be passed in to suppress reset logic for FPGA targets. FPGA's tend to have built-in, dedicated reset wiring. See §3.

Bevelab has a number of scheduling algorithms (selectable from recipe or command line). Alternatively, bevelab can be replaced with a different opath plugin, such as VSFG or the future one that does more register colouring.

### 24.1 Bevelab: Internal Operation

The central data structure is the pending activation queue, where an activation consists of a program counter name, program counter value and environment mapping variables that have so far been changed to their new (symbolic) values.

The output is a list of finite-state-machine edges that are finally placed inside a single HPR parallel construct. The edges have to forms (g, v, e) (g, fname, [ args]) where the first form assigns e to v when g holds and the second calls function fname when g holds.

Both the pending activation queue and the output list have checkpoint annotations so that edges generated during a failed attempt at a loop unwind can be discarded.

The pending activation list is initialised with the entry points for each thread. Operation removes one activation and symbolically steps it through a basic block of the program code, at which time zero, one or two activations are returned. These are either added to the output list or to the pending activation list. An exit statement terminates the activation and a basic block terminating in a conditional branch returns two activations. A basic block is terminated with a single activation at a blocking native call, such as hp_r_pause. When returned from the symbolic simulator, the activation may be flagged as blocking, in which case it is fed to the output queue. Otherwise, if the unwind budget is not used up the resulting activations are added to the pending queue.

A third queue records successfully processed activations. Activations are discarded and not added to the pending queue if they have already been successfully processed. Checking this requires comparison of symbolic environments. These are kept in a "close to normal form" form so that syntactic equivalence can be used. This list is also subject to rollback.
Operation continues until the pending activation queue is empty. A powerful proof engine for comparing activations would enable this condition to be checked more fully and avoid untermination with a greater number of designs.

25 VSFG - Value State Flow Graph

VSFG is an alternative to the bevelab plugin - it uses distributed dataflow instead of having a centralised micro-sequencer per thread. It is based on the paper ‘A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware’ [4]. It can achieve greater throughput with heavily pipelined components in the presence of complex control flow compared with traditional loop unwinding and static scheduling.

Its implementation within Kiwi is currently experimental (January 2015).

26 PSL Synthesiser

The PSL synthesiser converts PSL temporal assertions into FSM-based runtime monitors.

27 Statechart Synthesiser

The Sys-ML statechart synthesiser is built in to the front end of the H2 tool. It must be built in to other front ends that generate HPR VMs.

28 SSMG Synthesiser

SSMG is the main refinement component that converts assertions to executable logic using goal-directed search. The SSMG synthesiser is described in a separate document and is a complete sub-project with respect to HPR.

29 Repack Processing Step

The repack function is essentially KiwiC-specific. It is therefore described in the KiwiC chapters of this manual (§4.8.1).

30 Restructure Processing Step

Restructuring is need to overcome structural hazards arising when there are insufficient resources for all the required operations to take place in parallel and to generally sequence operations in the time domain. Resources are mainly ALUs and memory ports. Table[1] shows the main parameters that control time/space trade off while restructuring a design. Further parameters relate to the cache size and architecture, DRAM clock speed. The repack phase (§29) generated as many memories as possible. These must now be allocated to the allowed hardware resources, which may mean combining memories to reduce their total number, but taking into account a good balance for port bandwidth. Hardware platforms vary in the number of DRAM banks provided. The number of block RAMs inside an individual FPGA, like the number of ALUs to use, can be varied between one compilation and another.
### Table 4: An Example Structural Resource Guide Table.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Style</th>
<th>Default Max</th>
</tr>
</thead>
<tbody>
<tr>
<td>Max no of integer adders and subtractors per thread</td>
<td>flash</td>
<td>unlimited</td>
</tr>
<tr>
<td>Max no of integer multipliers per thread</td>
<td>one-cycle</td>
<td>5000 bit products</td>
</tr>
<tr>
<td>Max no of integer dividers per thread</td>
<td>vari-latency</td>
<td>5</td>
</tr>
<tr>
<td>Max no of F/P ALUs per thread</td>
<td>fixed latency of 5</td>
<td>5</td>
</tr>
<tr>
<td>Max size register file (bits)</td>
<td></td>
<td>512</td>
</tr>
<tr>
<td>Max size single-port block RAM per thread</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Max no of single-port block RAMs per thread</td>
<td></td>
<td>2</td>
</tr>
<tr>
<td>Max no dual-port block RAMs shared over threads</td>
<td></td>
<td>2</td>
</tr>
<tr>
<td>Max size dual-port block RAMs shared over threads</td>
<td></td>
<td>bits</td>
</tr>
<tr>
<td>No of DRAM front-side cache ports</td>
<td></td>
<td>unlimited</td>
</tr>
<tr>
<td>No of DRAM banks</td>
<td></td>
<td>platform-specific</td>
</tr>
</tbody>
</table>

The restructure phase bounds the number of each type of structural resource generated for each thread. It then generates a static schedule for that thread. Certain subsystems can have variable latency, in which case the static schedule is based on the average execution time, with stalls and holding registers being generated for cases that run respectively slower or faster than nominal. The schedule may also get stalled at execution time owing to dynamic events that cannot be predicted in advance. Typical dynamic events are cache misses, contention for shared resources from other threads and blocking message passing between threads.

The scheduler statically maps memory operations to ports on multi-ported memories. It overcomes all static hazards, ensuring that no attempt to use a resource more than once at a time occurs. It therefore ensures that different operations occur in different cycles, with automatic insertion of holding registers to maintain data values that would not be available when needed.

The five-stage pipeline for FPUs consists of, for an add, the following fully-pipelined steps: 1. unpack bit fields and compare mantissas, 2. shift smaller mantissa, 3. add mantissas, 4. normalise, 5. round and repack.

## Part VIII

### Output Formats

The HPR library contains a number of output code generators. All of these write out a representation of an internal HPR machine. Not all forms of HPR machine can be written out in all output forms, but, where this is not possible, a synthesis engine should be available that can be applied to the internal HPR machine to convert it.

Certain output formats can encode both an RTL/hardware-style and a software/threaded style. For instance, a C-like input file can be rendered out again in threaded C style, or as a list of non-blocking assignments using the SystemC library.

The following output formats may be created:

1. **RTL Form:** The RTL output is written as a Verilog RTL. One module is created that either contains just the RTL portion of the design, or the RTL and instances of each MPU that is executing software parts of the design.

2. **Netlist Form:** The RTL output is compiled to a structural netlist in Verilog that contains nothing but
gate and flip-flop instances.

3. **H2 IMP Form:** The HPR form is output to an IMP file. This has the same syntax as the imperative subset of H2.

4. **SMV form:** The HPR VM is output as an SMV code and the assertions that have not been compiled or refined are output as assertions for SMV to check.

5. **C Form:** The HPR VM is output as C code suitable for third-party compilers. RTL forms may also be output as synthesisable SystemC.

6. **UIA MPU Form:** The IMP imperative language is compiled to IMP assembly language and output as a .s file.

7. **IP XACT form:** The structural components are written out as IP XACT definitions and instances.

8. **S-expression form:** The HPR VM is dumped a lisp S-expression to a file.

9. **UIA Machine Code:** The IMP assembly is compiled to machine code for the UIA microcontroller. This is output as Intel Hex and also as a list of Verilog assignments for initialising a memory with this code.

The net-based output architecture is suitable for direct implementation as a custom SoC (system on chip). H2 defines its own microcontroller and we use the term MPU to denote an H2 microcontroller with an associated firmware ROM. The net-based architecture consists of RTL logic and some number of MPUs. However, by requesting that all output is as C code for a single MPU, the net-based output degenerates to a single file of portable C code.

Additional output files include log files and synthesisable and high-level models of the UISA microprocessor that executes IMP machine machine code.

### 31 Arithmetic and RAM Leaf Cells

The tool will expect the user to provide definitions of various leaf cells with the output from the tool at the input to the RTL synthesis step. A number of suitable definitions are included in cvgates.v and cv_fpgates.v and it may commonly be sufficient just to include these two files in the RTL compilation.

The leaf cell names follow a few conventions:

1. All have a clock and reset input, even if not needed.
2. All have a fail output, even if they cannot fail or will report their error in-band using, for example, NaN.
3. The main outputs is listed before inputs, but associative instantiation is normally used anyway. For divide and mod the numerator is listed before the denominator. For subtractors the lhs is listed first.
4. The naming convention has the letters \( VL \) for a variable-latency component and this has handshake wires. Otherwise \( FL_n \) denotes a fixed-latency of \( n \) clock cycles, fully-pipelined. The tool will schedule an average budget for variable latency components.
5. Parameter overrides, listed in the order output, first input, second input, set the precision of ALU connections and RAM dimensions.

For variable-latency leaf cells, the handshake protocols is as follows:

- Handshake uses a \texttt{req} input and a \texttt{rdy} output.
- New args are latched on a cycle where \texttt{req} is asserted.
- Results are ready in a cycle when \texttt{rdy} is asserted.
- New work may be presented with \texttt{req} during the same cycle that new args are entered.
• Asserting req before the last rdy has been delivered is an error.
• No combinational path between req and rdy allowed inside the ALU.

Note: The above is for on-chip devices instantiated directly by the tool. Off-chip RAM connections use a separate protocol (HSIMPLE, HFAST, AXI, BVCI).

31.1 Fixed-point ALUs

The RTL backend will use built-in RTL operators for adders and substructors. For multipliers and dividers and modulus with non-constant arguments it instantiates specific units, such as CV_INT_VL_DIVIDER_US. Very small multipliers are rendered with the RTL asterisk infix operator and left to the FPGA tools as per the adders/substructors.

Kiwi generally calls out to variable latency dividers and fixed-latency multipliers. It uses an estimate for the variable latency computation time in its schedules. When using a fixed latency it increases the latency requested for larger parameter widths. Whether fixed or variable is indicated in the component kind name. Instantiated components cope with any argument width as specified by parameter overloads.

Kiwi does not currently generate the fixed-point ALU implementations and it may request one that is not in the provided cvgates.v baseline library, in which case the poor user must provide their own implementation. For example, an extreme design might call for a 512 by 1024 fixed latency multiplier with 5 clock cycle latency.

Recipe parameters alter the points at which the library enlarges the provisioned latency.

31.2 Floating-point ALUs

Floating-point ALUs follow the pattern of fixed-points ALUs, except that add and subtract are also always instantiated ALUs and the RTL compiler is not expected to handle them. A different set of recipe parameters control their structure (fixed/variable latency and expected/required latency).

Only 32 and 64 bit, IEEE standard floating point is currently used by default. A future extension will provide for custom width floating point, since this is a very powerful feature of HLS that can save a lot of energy and area. The extension will give the same behaviour on mono WD as on RTLSIM and FPGA.

A core set of floating point ALUs is provide in cv_fpgates.v. These are soft macros that the RTL tools are expected to map to whatever is available in the target FPGA or ASIC library. Specific shims and bindings to assist with Altera and Xilinx are likely to be added to the distro in the near future.

31.3 Floating-point Convertors

There is no budget limit on the number of convertors is currently imposed.

The convertors required normally are

- CV_FP_CVT_FL2_F32_I32 // Integer 32 to float 32 with fixed latency of 2
- CV_FP_CVT_FL2_F32_I64 // Integer 32 to float 32 with fixed latency of 2
- CV_FP_CVT_FL2_F64_I32 // Integer 32 to float 32 with fixed latency of 2
- CV_FP_CVT_FL2_F64_I64 // Integer 32 to float 32 with fixed latency of 2
- CV_FP_CVT_FL2_I32_F32 // Integer 32 from float 32 with fixed latency of 2
- CV_FP_CVT_FL2_I32_F64 // Integer 32 from float 32 with fixed latency of 2
- CV_FP_CVT_FL2_I64_F32 // Integer 32 from float 32 with fixed latency of 2
CV_FP_CVT_FL2_I64_F64 // Integer 32 from float 32 with fixed latency of 2
CV_FP_CVT_FL0_F32_F64 // Float 32 from float 64 (FL=0 implies combinational)
CV_FP_CVT_FL0_F64_F32 // Float 32 from float 64 (FL=0 implies combinational)

31.4 RAM and ROM Leaf Cells

A set of standard static RAM cells is provided in cvgates.v. These are parameterisable in width, length and number of lanes by overrides. They are single and dual ported and of latencies 0, 1 and 2 clock cycles.

Kiwi and other tools built in the HPR library generate instances of these RAMs.

RTL tools are expected to map these to appropriate structures, such as LUT RAM and block RAM on FPGA.

RAM instances are also generated with no write ports and static initialisations using the Verilog initial statements. RTL tools will treat these as ROMs. Unlike RAMs, where the user is expected to manually couple a definition from cvgates.v or elsewhere to their RTL synthesis step input, ROMs are are embedded in the main RTL output files from a run of the tool.

Part IX

General Orangepath Facilities

The Orangepath tool provides facilities for a number of experimental compilers. This part describes the core features, not all of which will be used in every flow.

32 FILES AND DIRECTORIES

When an Orangepath tool is run, it creates a directory in the current directory for temporary files. This is the obj directory. This obj directory contains temporary files used during compilation.

The .plt files are plot files that can be viewed using diogif, either on an X display or converted to .gif files.

The .h2logs file contains a log of the most recent compilation. These are placed in a folder named with the early arg -log-dir-name.

33 Espresso

Espresso is not currently needed for Fsharp implementation of HPR.

The Moscow ML implementation of the Orangepath tool requires espresso to be installed in /usr/local or else the ESPRESSO environment variable to point to the binary. If set to the ASCII string NULL then the optimiser is not used.

The -no-espresso flag can also be used to disable call outs to this optimiser. Internal code may be used instead.
34 Cone Refine

The cone refine optimiser deletes parts of the design that have no observable output. It can be disabled using the flag `-cone-refine disable`.

35 HPR Command Line Flags

The very first args to an HPR/Orangepath tool are the early args that enable the recipe file to be selected and the logging level and location to be set.

The first argument to an HPR/Orangepath tool, such as h2comp or KiwiC, is a source file name. Everything else that follows is an option. Options are now described in turn.

The HPR/LS logger makes an object directory and writes log files to it.
Flag `-verboselevel=n` turns on diversion of log file content to be mirrored on the standard output. 0 is the default and 10 makes everything also come out on the console. Console writes are flushed after each line and this is also a means of viewing the final part of a log that has not been flushed owing to stdio buffering.
Flag `-verbose` turns on a level of console reporting. Certain lines that are written to the obj/log files appear also on the console.
Flag `-verbose2` turns on a further level of console reporting. Certain lines that are written to the obj/log files appear also on the console.
Flag `-recipe fn.xml` sets the file name for the recipe that will be followed.
Flag `-loglevel n` sets the logging level with 100 being the maximum n that results in the most output.
Flag `-give-backtrace` prevents interceptions of HPR backtraces and will therefore give a less processed, raw error output from mono.
The developer mode flag, `-devx`, enables internal messages from the toolchain that are for the benefit of developers of the tool. Setting the environment variable `HPRLS_DEVX=1` performs the same action.
Flag `-root rootname` specifies the root facet for the current run. A number of items can be listed, separated by semicolons. The ones before the last one are scanned for static and initialisation code whereas the last one is treated as an entry point.

In Kiwi, roots may instead or also be specified using the dot net attribute `Kiwi.Hardware`.
When you want only a single thread to be compiled to hardware, either add a Kiwi.Hardware attribute or use a root command line flag. If you have both the result is that two threads are started doing the same operations in parallel. The currently fairly-simplistic implemention of offchip has no locks and is not thread safe, so both threads may do operations on the offchip nets at once.

NOTE: Many of the command line flags listed here have a different command line syntax using the Fsharp version of KiwiC. To get their effect one must currently either make manual edits to the recipe xml file (e.g. kiwici00.rcp) or else simply list them on the command line using the form `--flagname value`.

If the special name `GLOBALS` is specified as a root, then the outermost scope of the assembly, covering items such as the globals found in the C language, is scanned for variable declarations.
Flag `-preserve-sequence` structures output code with an explicit case or switch statement for each finite-state machine.

Synthcontrol `-bevelab-repack-pc=disable` creates sequencer encodings where the PC ranges directly over the h2 line numbers: easier for cross-referencing when debugging. Otherwise it defaults to a packed binary or unary coding depending on `-bevelab-onehot-pc`.
Option \texttt{-array-scalarise} all converts all arrays to register files. Other forms allows names to be specifically listed. See § ??.

\begin{verbatim}
-vnl-resets=none
-vnl-resets=synchronous
-vnl-resets=asynchronous
\end{verbatim}

or change this XML line in the file /distro/lib/recipes/KiwiC00.rcp

\begin{verbatim}
<defaultsetting> resets none </defaultsetting>
\end{verbatim}

When doing RTL simulation of the KiwiC-generated RTL output, one can sometimes encounter a ‘lock up’ where the design makes no further progress. Tracing the ‘pc’ variable in the output code will reveal it is stuck when trying to make a conditional branch whose predicate evaluates to dont-care owing to un-initialised registers or disconnected inputs.

HPR (KiwiC) (by default) does not generate initialisation code to set static variables to their default values (zero for integers and floats and false for booleans). The same goes for RAM contents.

For RAM contents, with KiwiC, the user code must contain an explicit clear operation in a C# loop.

To overcome the problem with uninitialised registers, we can potentially use \texttt{-vnl-resets=synchronous} or \texttt{-vnl-resets=asynchronous}. This will make the RTL simulate properly and overcomes most lockup problems. But we get additional wiring in the output that can repeat the FPGA’s own hardwired or global reset mechanisms.

Clearly the design can be synthesised separately with and without resets. But to avoid the duplication of effort, hence with a common RTL file (one synthesis run only), one must take one of the following five routes, where the first two use a KiwiC compile with the default \texttt{-vnl-resets=none}.

1. use an RTL simulator option that has an option where all registers start as zero instead of X,

2. add a set of additional initial statements to the generated RTL that are ignored for FPGA synthesis (HPR vnl could generate these automatically but does not at the moment),

3. request a reset input to the generated sub-system (using \texttt{-vnl-resets=synchronous}) but tie this off to the inactive state at the FPGA instantiation of that subsystem and expect the FPGA tools to strip it out as redundant logic so that it does not consume FPGA resource.

4. trust the FPGA tools to detect a synchronous reset net as such (by boolean dividing FPGA D-input expressions by it) and map it to the FPGA hardwired reset mechanisms so that it does not consume FPGA resource.

5. use \texttt{-vnl-resets=asynchronous} and trust the FPGA tools to map this to the hardware global reset net.

Note, the vnl output stage always generates subsystems with a reset input but this is (mostly) ignored under the default option of \texttt{-vnl-resets=none}.

See § ??.

\texttt{"-subexps=off"}

The \texttt{subexps} flag turns off sub-expression commoning-up in the backend.

\begin{verbatim}
-vnl-rootmodname name
\end{verbatim}

Use the \texttt{-vnl-rootmodname} flag to set the output module name in Verilog RTL output files.
-vnl-roundtrip name= [ enable | disable ]

Converts generated Verilog back to internal VM form for further processing.
When enabled, generated RTL will be converted back again before (for example) being simulated with diosim. When disabled, the input to the verilog generate (vnl) recipe stage will be passed on unchanged and a typical recipe will then simulate that directly.

"-ifshare=on"
"-ifshare=none"
"-ifshare=simple"

The default ifshare operation is that guards are tally counted and the most frequently used guard expressions are placed outermost in a nested tree of if statements.
The ifshare flag turns off if-block generation in output code. If set to ‘none’ then ever statement has its own ‘if’ statement around it. If it is set to ‘simple’ then minimal processing is performed. The default setting is ‘on’.

"-dpath=on"
"-dpath=none"
"-dpath=simple"

When dpath=on, with the preserve sequencer options for a thread, a separate ‘datapath’ engine is split out per threads and shared over all data operations by that thread.
Synthcontrol cone-refine-keep=a,b,c accepts a comma-separated list of identifiers names as an argument and instructs the cone-refine optimiser/trimmer to retain logic that supports those nets.

-xtor mode specifies the generation of TLM transactors and bus monitors. The mode may be initiator, target or monitor.

-render-root rootname specifies the root facet for output from the the current run. If not specified, the root facet is used. This has effect for interface synthesis where the root module is not actually what is wanted as the output from the current run.

-ubudget n specifies a budget number of basic blocks to loop unwind when generating RTL style outputs.
The -finish=[true false] flag controls what happens when the main thread exits. Supplying this flag causes generated output code to exit to the simulation environment rather than hanging forever. When running under a simulator such as Modelsim or when generating SystemC it is helpful to exit the simulation but certain design compiler and FPGA tools will not accept input code that finishes since there is no gate-level equivalent (no self-destruct gate).
The -restructure flag controls mechanisms for overcoming static hazards and moving on-chip RAMs to off chip. Currently the argument is the name of the protocol for off-chip RAMs, which may be BVCI or HSIMPL.

35.1 Other output formats

The -sysc flag causes the tool to generate SystemC output files.
Header and code files are generated with suffix .cpp and .h. Additional header files are generated for shared interfaces and structures. Generally, to make a design consisting of a number of C++ classes, the tool is run a number of times with different root and sysc command line options.
The -smv flag causes the tool to generate a nuSMV output file.
The –ucode flag causes generation of UIA microprocessor code for the design.

–vnl fn.v specifies to generate a Verilog model and write it to file fn.v.

–gatelib NAME requests that the Verilog output is in gate netlist format instead of RTL. The identifier NAME specifies the cell library and is currently ignored: a default CAMHDL cell library is used.

–gatelib NAME requests that the Verilog output is in gate netlist format. This takes precedence over –vnl that causes RTL output.

### 35.2 General Command Line Flags

- The –version flag gives tool version and help string.
- The –help flag gives tool version and help string.
- The –opentrace flag sets the opentrace level: this alters the debugging output but most debugging is in the h2log file anyway.
- The –rwtrace flag sets the rwtrace level, rather like the –opentrace option.

### 35.3 Simulation Control Command Line Flags

- The HPR L/S library provides a built-in simulator called diosim. This can run on the input forms, the post-generation forms and post output generation forms. By default it runs on the latest form generated. Flags to alter this default will be removed.
- Only the two Verilog output forms, RTL and gatelevel, support conversion back into HPR machine form for post generation simulation.
- –sim n specifies to simulate the system using the built-in HPR event-driven simulator for n cycles. The output is written to t.plt for viewing. The –traces flag provides a list of net patterns to trace in the simulator.
- The –title title flag names the diosim plot title.
- The –sim-rtl flag causes diosim to simulate the results of the generator processor (e.g. compilation to FSM) rather than the input form.
- The –sim-gates flag causes diosim to simulate the results of compilation to gates (–gatelib is used) rather than the input form.
- The –diosim-techno=enable flag causes print statements from the simulator to include ANSI colour escape codes for various highlighting options.
- The –plot plotfile flag causes plot file output of the diosim simulation to a named plot file.
- The plot file can be viewed under x-windows and/or converted to a gif using the diogif program.

### 36 Diosim Simulator

- The Orangepath system contains its own simulator called diosim. Since the target is output from the compiler as portable code to be fed into third-party C and Verilog compilers, it is not strictly necessary to use the Orangepath simulator. However, the simulator provides a self-contained means of evaluating a generated target without using external tools.
- The simulator accepts an hierarchical H2 machine and simulates it.
The simulator will verify all safety assertion rules that contain no temporal logic operators. Other safety and all liveness assertions are ignored.

Non-deterministic choices are made on the basis of a PRBS that the user may seed.

The PRBS is also used for synthetic input generation from plant machines or external inputs. PRBS values used for external inputs are checked against plant safety assertions and rejected if they would violate.

Output is a log and plot file. The plot file is currently in digif format, but a VCD format should be added.

Detailed logging can be found in the obj/log files. If a program prints the string ‘diosim:traceon’ or ‘diosim:traceoff’ the level of logging is changed.

If a program prints ‘diosim:exit’ then diosim will exit a though builtin function hpr_exit() were called.

Old Get Started (MOSCOW ML)

**HPR was implemented in moscow ML but you should now be using the dotnet (fsharp) version. Please now ignore this section.**

These instructions apply to running on koo, but should be understandable enough for self-port to other machines.

The h2tool requires the toolmisc, and h2tool directories from the usr/groups/han CVS hprls tree. Examples and documentation are in the examples and doc directories.

The KiwiC tool also requires the mono directory.

Both the h2tool and the KiwiC tool have separate front-end parser binaries. These must be compiled using the command make in the relevant subdirectory: h2fe or cilfe.

Before use, please set the MOSML and HPRLS shell variables

At the computer lab use:

```
export MOSML=/usr/groups/theory/mosml2.01
```

HPRLS should be set to your copy of the HPR L/S library. DJG uses:

```
export HPRLS=$HOME/d320/hprls
```

ESPRESSO should point to the unix espresso binary or be set to NULL. Espresso is not currently needed for Fsharp implementation of HPR.

If you get this error:

```
Uncaught exception:
Fail: libmunix.so: cannot open shared object file: No such file or directory
while loading C library libmunix.so
```

then the setting of LD_LIBRARY_PATH is not working.

The following env vars should be sufficient for the KiwiC command to be run in any directory:

```
export HPRLS=/usr/local/hprls/hprls for current 'stable' release
```

or

```
export HPRLS=/home/djg11/d320/hprls for djg latest live version
```

also
export MOSML=/usr/local/mosml2.01

Basic invocation, in any directory where the source .il file resides

$(HPRLS)/mono/KiwiC TimesTable.il -root 'TimesTable;TimesTable.Main' -vnl TimesTable.v

See also $HPRLS/mono/README

To compile a .il file from C# using the Kiwi attribute library, the following sequence of commands can be used:

KLIB=$(PREFIX)/support/Kiwi.dll
KIWIC=$(PREFIX)/bin/KiwiC $(KLIB)
ROOT=-root 'DVI4.DVI_Ports;DVI4.Generator.Generate_Synch_Only'

framestore.exe:framestore.cs
   gmcs framestore.cs -r:$(KLIB)

framestore:framestore.exe
   $(KIWIC) framestore.exe $(ROOT) -vnl verilogoutput.v

KiwiC using C++ instead of C#  
Visual C++ and gcc4cil will generate dotnet portable assemblies from C++ code.
Using the gcc4cil compiler you should find a binary called "cil32-gcc" in the <path_to_cross_compiler>/bin  
directory. To create a CIL file use this compiler with the -S option.

Getting gcc4cil.

1. Get Gcc4Cil from the svn-repository that is mentioned on the  
Gcc4Cil website (http://www.mono-project.com/Gcc4cil)  
"svn co svn://gcc.gnu.org/svn/gcc/branches/st/cli"

2. As Gcc4Cil wants to compile files for the Mono-platform, you  
need the Mono-project installed on your system. The easiest way to  
install it is to use "Linux installer for x86" that can be found  
derunder http://www.mono-project.com/Downloads . Installation  
instructions are available under  
http://www.mono-project.com/InstallerInstructions .

3. It may be possible that you need to install the portable .NET  
project. During the manual compilation of gcc4cil I got errors, that  
made me install this project. However I could not find a line in the  
automatic generated Makefile that has a reference to the p.net path  
in my home-dir. If you get the impression that you need it, you can  
find it here: http://www.gnu.org/software/dotgnu/pnet-install.html

4. Because I did not know that there was a automatic script for this, I did a  
<path_to_gcc4cil>/configure using the following options  
--prefix=<where it should be installed to>  
--with-mono=<install_dir_of_mono>  
--with-gmp=<install_dir_of_glib>

I then did a make bootstrap-lean and installed the following libraries because
of compile errors:
- bison-2.3.tar.gz*
- glib-2.12.9.tar.gz
- pkg-config-0.22.tar.gz

I think it is likely that you may want to skip this step, as this step DOES NOT generate a compiler for CIL but for boring x86 code (what I learned after I did this). However I set up paths to the installed libraries in this step, so I mention it. I do not know for sure if all those paths are needed in the end. As it works for me now, I won’t remove them:

```bash
setenv HOST_MONOLIB "/home/petero/mono-1.2.5.1/lib"
setenv HOST_MONOINC "/home/petero/mono-1.2.5.1/include/mono-1.0:/home/petero/mono-1.2.5.1/include/mono-1.0-vc6"
setenv CIL_AS "/home/petero/p.net/lib:/home/petero/p.net/bin"
```

5. In the directory where you put the gcc4cil source code, you can find a shell script called "cil32-crosstool.sh". Execute this and the crosscompiler for C-to-CIL compilation hopefully now gets compiled.

References

Index