Kiwi Developers' Guide and Compiler Internal Operation

**Figure 6:** The main components of the default KiwiC flow using the default recipe (KiwiC00.rcp) in the KiwiC tool.
$\begin{figure}\centerline{\epsfbox{base00.eps}}\end{figure}$

**Figure 7:** The internal flow of the KiwiC front-end recipe stage.
$\begin{figure}\centerline{\epsfbox{kiwife-flow.eps}}\end{figure}$

KiwiC is a compiler for the Kiwi project. It aims to produce an RTL design out of a named sub-program of a C# program.

KiwiC does not currently invoke the C# compiler: instead it reads a CIL portable assembly language file (.exe or .dll) generated by a Microsoft or Mono C# compiler.

Figure 6 shows key components of the main flow through the tool as set up with the provided recipe file (KiwiC00.rcp). The full recipe contains ten or so stages and the obj folder created by running the tool contains the log files and intermediate forms for each stage. Other output flows and formats can be deployed by changing the recipe. The dotted line shows that using the simvnl command line option the internal simulator (Diosim) can be applied to the RTL after it has been round-tripped through Verilog. For debugging, Diosim can be applied to any HPR machine intermediate form, by varying the recipe. (There's also a shortcut `-conerefine=disable -repack=disable -verilogen=disable' that will cause diosim to run the original VM generated by the KiwiC front end without conversion to hardware). This is needed for the profile-directed feedback.

The .NETexecutable bytecode is read using the Mono.Cecil front end. Any needed libraries, including Kiwi.dll and Kiwic.dll are also read in. These are combined with some canned (hardwired in the front end) system libraries. The result is a large CIL abstract syntax tree. This can be output for tracing/debugging if desired (using the kiwic-cil-dump flag).

The KiwiC front end (IL Elaborate stage) converts the .net AST to the internal representation used by the core HPR/LS library. This is the HPR VM2 machine.

The VM code emitted by KiwiC front end is a set of parallel `DIC' blocks. These are `directly indexed code' arrays of imperative commands and there is one for each user thread. They are placed in parallel using the PAR construct. Each DIC array is indexed by a program counter for that thread. There is no stack or dynamic storage allocation. The statements are: assign, conditional branch, exit and calls to certain built-in functions, including hpr_testandset, hpr_printf and hpr_barrier. The expressions occurring in branch conditions, r.h.s. of assignment and function call arguments still use all of the arithmetic and logic operators found in the IL input form. In addition, limited string handling, including a string concat function are handled, so that console output from the CIL input is preserved as console output in the generated forms (eg. $display in Verilog RTL).

Memory disambiguation and partitioning into statically-sized memories and DRAM is done by the repack receipe stage (§29) . The KiwiC front end has labelled every storage operation with a storage class. Repack conglomorates classes that are assigned between and then uses arithmetic pointer analysis rules for alias analysis. Its input is an HPR VM where every variable and array location has a virtual address (hidx) in a so-called wondarray. A wondarray is allocated for every dotnet datatype (except structs). The wondarray contains

words of that datatype but only the words on integer multiples of the datatype's size in bytes are used. The output from repack has had all of these mapped to scalars or to smaller 1-D arrays and each is branded with an identifier. Some input variables to repack have been allocated a reserved `unadressable' hidx which means they are scalar and do not have their address taken. These go through repack without modification and appear as identical scalars in the repack output. In Kiwi use, these correspond to static variables.

The conerefine recipe stage deletes unused parts of the design. A part of the design is unused if it generates no output. Outputs include PLI calls like Console.WriteLine or net-level outputs flagged with Kiwi.OutputWordPort or similar. Object and array handles that are not manipulated actively by the program are removed.

The conversion from imperative code to FSM is performed, normally, by bevelab, described in §24. This allocates work to clock cycles based on the Kiwi.Pause() statements manually embedded by the designer or automatically inserted by the KiwiC front end. The bevelab output is an HPR machine where every statement from every thread nominally operates in parallel -- i.e. pure RTL. However, some PC-like annotations are retained for easily projection (and re-encoding) in FSM form. FSM re-encoding for thread's controller will later typically be done by the FPGA tools to simplify the controller output decode function.

The restructure recipe stage (§30) binds and schedules operations and storage to physical resources. Storage decisions are made as to which vectors and scalars to place in what type of component (flip-flops, unregistered SRAM, registered SRAM, DP SRAM or off-chip in DRAM) and which structural instance thereof to use. ALU's and other primitives are also instantiated and bindings of program operations are made. Owing to the FSM annotations preserved by bevelab, the binder can easily determine which RTL statements are disjoint. Each state in the input FSM potentially becomes multiple, so-called, microstates in the output as structural hazards on memory ports are avoided and pipelined ALU operations are composed. Allocation decisions are based on heuristic rules parametrised by command-line flags and recipe file values, such as the number of floating-point multipliers per thread.

The output forms available include Verilog RTL, which we have used for FPGA layout. The stylised output from the FSM generation stage is readily converted to a list of Verilog non-blocking assignments.

Background: HPR/LS Library (aka Orangepath)

HPR L/S (aka Orangepath) is a library and framework designed for synthesis and simulation of a broad class of computer systems, protocols and interfaces in hardware and software forms. The Orangepath library provides facilities for a number of experimental compilers.

The primary internal representation (IR) is a so-called HPR VM2 virtual machine. The framework consists of a number of plugins that operate on this IR. Hence, in type terms at least, all operations are `src-to-src'. But in practice, certain forms cannot be used in certain places: for instance a VM2 containing RTL code cannot be rendered directly as C++ (it would have to be passed through the bevelab plugin first).

HPR virtual machines and the operations to be applied to them are stored in a standard opath command format to be executed by an Orangepath recipe (program of commands).

A characteristic feature of Orangepath is that plugins can potentially, always be applied in any order and often have inverses. For instance a plugin that outputs RTL is reveresed by a plugin that reads in RTL. A plugin that performs HLS from behavioural code to RTL would be reversed that by a plugin that gives a single-threaded imperative program from a large body of parallel RTL code.

A simulator plugin, called diosim, is able to simulate the IR in any form and, in particular, is able to simulate interactions between parts of the system defined in different styles. For instance it can simulate a pair of CPU cores communicating with each other where one is modelled in RTL and the other as a cycle-callable ISS. Asynchronous I/O and network hardware is also modellable with these primitives.

A so-called recipe, which is an XML file, invokes the plugins in a particular order, supplying parameters to them. The input and output of each recipe stage is a so-called HPR VM2 machine. Loops in the recipe can be user to repeat a step until a property holds. The opath core provides command line handling so that parameters from the recipe and the command line are combined and fed to the plugin components as they are invoked. The opath core also processes a few `early args' that must be at the start of the command line. These enable the recipe file to be specified and the logging level to be set.

The Orangepath library has plugins that support a variety of external input and output formats.

An HPR VM2 machine contains scalar and 1-D array declarations, imperative code sections and assertions.

Values are signed and unsigned integers of any width and floating point of any width is also supported in the framework but library components currently only work for IEEE 32 and 64 bit formats. Enumeration types are also supported, the most important being the boolean type. For all enumerations, an exclusion principle is applied, in that if an expression is known not to be any but one of the values of enumeration, then it must be that one value. Booleans are held differently from other enumerations internally but all expressions on enumerations are only stored in minimised form (using Espresso or otherwise). The library supports a great deal of constant folding and identity operation elimination (such as multiplying by zero or one). It has limited handling for strings and string constants, which are either treated in the same way that they are handled in Verilog, which is as an expression or register of width 8 times the string length in characters, or as a special string handle type (where widtho=-1). But the Kiwi front-end and the repack stage can map a fixed set of strings to an enumeration type of a suitable width with the strings stored only once and indexed by the enumeration.

Expressions are held memoised, and in a normal form, as far as possible, that makes identity checking and common sub-expression reuse easier. This is especially useful to be able to rapdily confirm, as often as possible, index expression equality or inequality, to avoid name alias RaW/WaW dependencies on arrays and loop value forwarding for sequential access patterns.

The imperative code is in any mix of RTL and DIC forms. RTL contains register transfer assignments, partitioned into clock domains, where all assignments in a clock domain run in parallel on the active edge of the clock. There is also a combinational domain that has no clock. The DIC imperative form (directly indexed code) is an array of statements indexed by a program counter, where the main statements are: scalar assignment, 1-D array assignment, library call and conditional branch within the array. Code sections can be in series or parallel with each other, using CSP/Occam-like SER and PAR blocks. Assertions are coded in temporal logic and associated with a clock domain, just like PSL (property specification language). And a dataflow/transport-triggered IR form is being implemented at the moment.

HPR L/S (aka Orangepath) represents a system as an hierarchy of abstract machines in a tree structure. Its aim is to 'seamlessly' model both hardware and software in a common intermediate form that suits easy co-synthesis and co-simulation.

Each machine is a collection of declarations, executable code and assertions/goals. But typically, an individual machine only uses on form of representation.

Other plugins generate machines, read them in from files or other front-end languages, or write them out.

The goals are assertions about the system behaviour, input directly, or generated from compilation of temporal logic and data conservation rules into automata. Executable code can pass through the system unchanged, but any undriven internal nodes are provided with driver code that ensures the system meets its goals.

It also includes some temporal logic for assertions. Software can exist as both machine code/assembler and a high-level, block-structured, AST form.

A VM contains variable declarations, executable code, temporal logic assertions and child machines.

Variables are signed and unsiged integers of various precisions, single and double precision floating point and 1-D arrays of such variables. A small amount of string handling is also provided. All variable are static (no dynamic storage) and must be unique in a single namespace that spans the system. The variables are declared inside a given VM and may be global or local. Global variables may be accessed by code and assertions in any VM and local ones should (not enforced) only be accessed in locally (or in son machines?).

Expressions commonly use the hexp_t form and commands use the hbev_t form. Single-bit variables have hbexp_t form. A library of 'ix_xxx' primitives can be called as functions or procedures from hexp_t, hbexp_t and hbev_t respectively. Expressions are all stored in a memoising heap using weak pointers.

The executable code of a VM has several basic forms (dic, asm, rtl, cmd, fsm). All code and assertions access the variables for read and write (but assertions don't tend to write!) regardless of form.

DIC

DIC - Directly-indexed array: Imperative program (assign/conditional branch/builtin call) stored in an array indexed by a PC.

ASM

RTL and FSM

CMD

Abstract syntax tree of a block-structured imperative program (for/while/break/continue/assign/if etc) or single assigment statement.

Finite-State Machines

FSM - Finite-state machine form - like RTL but the assignments are collated into disjoint sets that are separated by a current-state variable or program counter. Any RTL can be factored out in this way (a Shannon Decomposition) using any set of bits as the program counter. The decomposition is a reversible transform.

CSP/Occam

Message-passing, CSP-like channels are another thing that should perhaps be added as a primitive form in future. They make perfect sense in the overall framework. CSP communication primitives should really be added ... to add channels and complete the picture.

The executable code may be clocked or nonclocked. Fragments may be put in serial or parallel using the SP_par and SP_seq combinators. There are two variants of SP_par, for lockstep and asynchronous composition.

VSDG - a dataflow graph for a single basic block with additional state edges representing memory order constraints. VSFG - an executable form of the VSDG where back edges in the control flow graph are represented using nested graphs.

The library is structured as a number of components that operate on a VM to return another VM. The opath (orangepath) mini-language enables a 'recipe' to be run that invokes a sequence of library operations in turn. An opath recipe is held in an XML file.

Automatic recipes: The overall systems is a pluggable library. Where certain components only accept certain input forms and such a component is specified to be used by a recipe, it is envisioned that automatic invokation of the other components to serve as input adaptors will be triggered. Otherwise it is necessary to manually instantiate additional recipe stages.

In this manual, we concentrate almost entirely on the .NETCIL input format and the Verilog RTL output format.

**Figure 8:** General flow implemented in an early version of the KiwiC tool (same as figure ). This diagram does not clearly show the recipe stages now used. The DRAM restructure is now part of the general binding done by the later restructure phase.
$\begin{figure}\centerline{\epsfbox{orangecilflow.eps}}\end{figure}$

Internal Working of the KiwiC front end recipe stage

The IL Elaborate stage is implemented by the the FSharp files kiwipro/kiwic/src/*.fs. It reads in CIL code and writes out HPR `dic' form code. Internally it converts from CIL to, so-called, kcode, before generating HPR code. The kcode can be rendered to a file for debugging/inspection using the kiwic-kcode-dump flag. The dotnet VM is a stack machine and the dotnet code is stack code. The stack is removed during the conversion to kcode. Kcode is neither stack or register code: all data is instead stored in wondarrays or global static variables.

CIL code is the assembly language used by the mono and .NETprojects. Like other assembly languages, it has an assembler and disassembler for converting between binary and human-readable forms. KiwiC originally read the assembly using a bison parser but now reads the binary using the mono.cecil libraries.

The front end peforms a first pass of every method body that will be needed. This finds the basic block boundaries and the dotnet stack depth at every branch or jump. It gives a symbolic name to every code site where a type is needed. It symbolically executes the code using types without data and ignoring the control flow. Basic blocks that commence or resume with values on the dotnet stack are modified to avoid this situation by defining additional local variables, known as spills, and byprefixing with loads and postfixing with stores. These spill variables are frequently optimised away within the front end, but if they hold data over a Kiwi.Pause() they may appear in the output RTL. All return statements within a method are replaced with a branch to the end of the method. This sets up all the ground work for removing the dot net stack, on the fly, each time the method is called.

A -root command line flag or HardwareEntryPoint attribute enables the user to select a number of methods or classes for compilation. The argument is a list of heriarchic names, separated by semicolons. Other items present in the CIL input code are ignored, unless called from the root items.

Where a class is selected as the root, its contents are converted to an RTL module with IO terminals consisting of various resets and clocks that are marked up in the CIL with custom attributes (see later, to be written). The constructors of the class are interpreted at compile time and all assignments made by these constructors are interpreted as initial values for the RTL variables. Where the values are not further changed at run time, the variables turn into compile-time constants and disappear from the object code.

Where a class is selected as a root, all of the methods in that class will be compiled as separate entry points and it is not normally appropriate for one to call another: calls should generally be to methods of other classes.

Where a method is given as a root component, its parameters are added to the formal parameter list of the RTL module created. Where the method code has a preamble before entering an infinite loop, the actions of the preamble are treated in the same way as constructors of a class, viz. interpreted at compile-time to give initial or reset values to variables. Where a method exits and returns a non-void value, an extra parameter is added to the RTL module formal parameter list.

The VM code can be processed by the HPR tool in many ways, but of interest here is the 'convert_to_rtl' operation that is activated by the '-vnl' command line option. (NB: This is now on by default in the KiwiC00 recipe, disable with -verilog-gen=disable).

More than one portable assembly (CIL/PE) file can be given on the command line and KiwiC will aggregate them. The file name of the last file listed will be used to name the compilation outputs by default (in the absence of other command line flags).

(At some point, KiwiC might be extended to also invoke the C# compiler if given a C# file.)