Compilation from .net CIL Bytecode (second example)

Kiwi Scientific Acceleration: Multi-FPGA Designs, Incremental Compilation and Remote Procedure Call (via IP-XACT)

In this example, the following C# program was compiled to .net CIL bytecode. The program has a second entry point that is callable from a separate compilation unit.

The Need for Separate Compilation

Separate and incremental compilation is needed in all large projects to handle scale, component reuse and revision control, and it is the basis for project management. It can also be a basis for parallelism. Here we use it as a mechanism to trade execution time against silicon space.

Moreover, the default approach of Kiwi is one of 'flattening' where the leaf components are largely replicated at each instantiation within a hierarchic design. KiwiC reuses ALUs and local variable registers in both the spatial and time domains, but tends to generate the largest and fastest circuit it can, subject to ALU instance count limits per thread set in the recipe. For larger designs, to manage complexity, it is desirable to designate subsystems for separate compilation.

Multi-FPGA designs require separate runs of the FPGA logic synthesis tools for each FPGA.

The ability to use separately-compiled components is also needed as a black box import mechanism for third-party IP blocks.

Finally, the controlling FSMs for separately-compiled components, if they have them, are intrinsically separate and their states are re-used for each invocation. This contrasts strongly with the approach for method calls within a single compilation, whose control flow is expanded in-line.

In some design styles, subsystems can also best be placed in a server pool with dynamic load balancing. Design-time manual control sets the number of instances generated. KiwiC will share such server instances in the time domain rather than instantiate as many as it needs (subject to ALU count limits).

Required MetaInfo

The IP-XACT schema provides much of the information needed to import a hardware IP block. Beyond providing the block name and version number, it gives a full description of the net-level interface and any TLM interfaces in higher-level models. The precision of the implemented function is manifested by the bit-widths of the busses.

Hence the SoC Render mode of compilation, illustrated below for the peered instances, is readily supported without extensions. After all, this is the primary use today for IP-XACT.

But further information is needed for replication and scheduling of such blocks in an HLS flow. The additional information needed is:

Instantiation Styles

There are two main module instantiation styles: IP blocks can be instantiated as peers or with hierarchy.

Each instanced block needs to have both a C# implementation and an RTL implementation packaged with an IP-XACT wrapper. The RTL and IP-XACT may have been generated by earlier runs of KiwiC, created by hand, or supplied by a third party. The C# version is required for two reasons: 1. so that the instantiating C# file will compile without a missing class error, and 2. so that the system can be run as a pure dotnet design in the WD (workstation development) environment. Only a stub implementation (null method bodies) is needed for C# compilation to succeed. For the dotnet run, only a high-level behavioural model is needed in the C# when the real implementation comes from elsewhere, such as when it is hardened IP like the NetFPGA CAM.
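As an illustration of the two levels of C# implementation just described, here is a minimal sketch. The class names AluStub and AluModel are hypothetical, the method name f1 is borrowed from the hierarchic skeleton, and the squaring behaviour is an arbitrary placeholder rather than anything defined by Kiwi.

```csharp
using System;

// Hypothetical stub: just enough for an instantiating C# file to compile.
public class AluStub
{
    // Null body. The real function would come from the RTL implementation.
    public int f1(int x) { return 0; }
}

// Hypothetical behavioural model for a dotnet run on the workstation.
// Squaring is an arbitrary placeholder, not the behaviour of any real block.
public class AluModel
{
    public int f1(int x) { return x * x; }
}

public static class StubDemo
{
    public static void Main()
    {
        Console.WriteLine(new AluStub().f1(3));   // stub result carries no meaning
        Console.WriteLine(new AluModel().f1(3));  // placeholder behaviour
    }
}
```

The stub is sufficient when only compilation must succeed. The behavioural model is the one to use when the whole system is exercised as an ordinary dotnet program.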

Peer instancing skeleton example:

  RAM r = new RAM(...);     // Create peer instances
  CPU c = new CPU(...);     // 
  IO  i = new IO(...);      //
  c.axi_m0.bind(r.axi_s0);  // Establish wiring between them.
  c.axi_m1.bind(i.axi_s0);  // bind is provided by SystemCsharp TLM.

Hierarchic instancing skeleton example:

  [Kiwi.Remote(...)] ALU a = new ALU(...);

  int foo(int x, int y) { return x * a.f1(y/121); }

KiwiC will compile each of the above examples and generate its outputs (IP-XACT, RTL and SystemC).

In the peer instancing example, each of the three instantiated components is defined as a class that is itself marked up with the Kiwi.Remote() attribute. In the hierarchic example, the attribute is instead applied to the instance. Also, in the hierarchic example, the ALU instance may actually be placed outside the rendered containing RTL with additional top-level ports provided for wiring it up.

Note that the ALU in the hierarchic example might typically be stateless and hence replicatable. If so, its invocation will be completely on a par with the multiplier and divider instances also needed for method foo. The HLS binder will decide how many instances of it to make and the HLS scheduler will factor in the appropriate fixed pipelining delay or variable delay and handshake nets.

Subsystem Abend Syndrome Routing

Kiwi defines that if any subsystem stops with an abend syndrome code, this must be passed up through parent modules to the substrate wrapper. And all modules must halt at that instant so PC values can be collected.

One example of glue logic inserted by SoC Render is the logic that collects these abend syndromes and PC values from each instantiated module, combines them into a larger abend code and halts the composite when any component abends.

In the peer instancing example, the KiwiC front end will invoke the SoC Render function of the HPR library that underlies Kiwi.

SoC Render is a simple IP-XACT-driven wiring generator with support for:

Apart from syndrome and PC collection, another example, at the moment, is that KiwiC generates HFAST load/store ports but the Zynq platform requires these to be adapted to AXI. This can either be done automatically by SoC Render or by using the IP Integrator GUI within Vivado.

I/O Paradigms

Low-level RTL essentially uses a shared variable paradigm for net-level interconnection between compilation units. The protocol is not manifest in the design and must be kept and checked elsewhere. Kiwi supports that for interconnection with specific hardware standards, normally in Hard Pause mode, but it is not the preferred approach.

Object-oriented software sends threads between compilation units to perform actions. Synthesisable Verilog and VHDL do not allow threads to be passed between separately-compiled circuits: instead, additional I/O ports must be added to each circuit and then wired together at the top level. Accordingly, we mark up methods that are to be called from separate compilations with a remote attribute.

  [Kiwi.Remote(...)]
  public return_type entry_point(int a1, bool a2, ...)
  { ... }

When an implemented or up-called method is marked as `Remote', a protocol is given and KiwiC generates additional I/O terminals on the generated RTL that implement a stub for the call. The currently implemented protocol is asynchronous, using a four-phase handshake and a wide bus that carries all of the arguments in parallel. Another bus, of the reverse direction, conveys the result where non-void. Further protocols can be added to the compiler in future, but we would like to instead lift them so they can be specified with assertions in C# itself.

Protocol selection: Four-phase Handshake

KiwiC can generate several possible protocols for unit interconnection. Here we use HSIMPLE four-phase. (2016 note: AXI streaming is now the preferred protocol.) Reference material on the four-phase handshake protocol is available separately.
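The four-phase (return-to-zero) handshake can be modelled in software roughly as follows. This is a sketch under stated assumptions and not KiwiC output: the class names FourPhasePort, Server and Client are hypothetical, the field names only loosely mirror request, acknowledge, argument and result nets, and the server behaviour (accumulating the argument) is chosen to resemble the bumper method used later in this example.

```csharp
using System;

// Hypothetical software model of the nets of one four-phase port.
public class FourPhasePort
{
    public bool Req, Ack;     // handshake nets
    public int Args, Result;  // argument bus and reverse-direction result bus
}

public class Server
{
    private int total;  // server-side state

    // One evaluation step of the server.
    public void Step(FourPhasePort p)
    {
        if (p.Req && !p.Ack)
        {
            // Phase 2: request seen, so compute, drive the result, raise ack.
            total += p.Args;
            p.Result = total;
            p.Ack = true;
        }
        else if (!p.Req && p.Ack)
        {
            // Phase 4: request withdrawn, so return ack to zero.
            p.Ack = false;
        }
    }
}

public static class Client
{
    // Performs one complete call, stepping the server until each phase completes.
    public static int Call(FourPhasePort p, Server s, int arg)
    {
        p.Args = arg;
        p.Req = true;                // Phase 1: present arguments, raise req.
        while (!p.Ack) s.Step(p);
        int r = p.Result;            // Result is valid while ack is high.
        p.Req = false;               // Phase 3: withdraw req.
        while (p.Ack) s.Step(p);
        return r;
    }
}

public static class HandshakeDemo
{
    public static void Main()
    {
        var port = new FourPhasePort();
        var server = new Server();
        Console.WriteLine(Client.Call(port, server, 5));  // prints 5
        Console.WriteLine(Client.Call(port, server, 3));  // prints 8
    }
}
```

Both nets return to zero between calls, which is why each transaction costs four signalling events.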

Other protocols supported by KiwiC are BVCI and HFAST. HFAST has higher throughput. BVCI supports multiple outstanding requests in order. External bus multiplexors can map a number of in-order HFAST clients onto a single AXI or other out-of-order bus protocol.

C# Source Code: Remote Procedure Call

The C# code has both a 'Main' entry point that is mentioned on the KiwiC command line and a secondary entry point, defined by the Kiwi.Remote attribute, that is compiled by KiwiC so as to be invoked by a separately-compiled hardware component. We can also use this style of interface when calling from software on a soft-core to hardware generated by KiwiC.

// Remote procedure call test (H/W port generation).
using System;

class test10
{
    static int limit = 10;
    static int jvar;

    [ Kiwi.Remote("client1-port", "parallel: four-phase") ]
    public static int bumper(int delta)
    {
        jvar += delta;
        return jvar;
    }

    public static void Main()
    {
        Console.WriteLine("Test 10 Limit=" + limit);
        for (jvar=1;jvar<=limit;jvar+=2)
            Console.Write(jvar + " ");
        Console.WriteLine(" Test 10 finished.");
    }
}

A problem with the above code is a structural hazard that leads to a possible race when the variable jvar is potentially updated twice on the same clock cycle: once by the Main method and once by the remote invoker.

KiwiC detects the race, issues the following cryptic warning and chooses an arbitrary resolution:

VNL.v_1 incompatible assigns: consistency check
  lhs=pcnet105  g=true

Output code

KiwiC will generate hardware both for the client and the server as separate RTL files. In more-realistic examples, there will be multiple files, with one being the top-level that contains client calls to some of the others which in turn make client calls to others, with the leaf modules in the design hierarchy being servers only.

One can also envision leaf modules in the design hierarchy making upcalls to parents, but this is not currently implemented in Kiwi.

For the above example, the generated server RTL looks as follows:

// Verilog output file generated at Thu Sep 18 23:43:12 BST 2008
// KiwiC: HPR Orange IL/.net front end: Version alpha 21: 16-Sep-08
// -root test10 -vnl VNL.v -preserve-sequencer 1 
module VNL(reset, clk, bumper_delta, bumper_ack, bumper_req, bumper_retval);
  input reset;
  input clk;
  input bumper_delta;
  output bumper_ack;
  input bumper_req;
  output bumper_retval;
  reg [1:0] testtest10pc;
  reg [2:0] testtest11pc;
  reg bumper_ack;
  integer test10_jvar;
  integer test10_limit;
   always @(posedge clk) begin //Start HPR test10_test13
          if (reset) testtest10pc <= 0;
          case (testtest10pc)
          0:  begin testtest10pc <= 1;
                 test10_jvar <= 1;
                 $display("%s%d", "Test 10 Limit=", 10);
                 $write("%d%s", 1, " ");
          1:  begin if (9<=test10_jvar) testtest10pc <= 2;
                 if (test10_jvar<9) test10_jvar <= test10_jvar+2;
                 if (9<=test10_jvar) 
                   begin test10_jvar <= test10_jvar+2;
                      $display(" Test 10 finished.");
                       end if (test10_jvar<9) $write("%d%s", test10_jvar+2, " ");
          2:  begin 
                  end endcase
          //End HPR test10_test13

          //Start HPR test10_test12
          if (reset) testtest11pc <= 0;
          case (testtest11pc)
          0: testtest11pc <= 1;
          1: if (!bumper_req) testtest11pc <= 2;
          2:  begin testtest11pc <= 3;
                 bumper_ack <= 0;
          3: if (bumper_req) testtest11pc <= 4;
          4:  begin testtest11pc <= 5;
                 test10_jvar <= test10_jvar+bumper_delta;
          5:  begin 
                  end endcase
          //End HPR test10_test12
          // Start delx test10_test13
  // End delx test10_test13

  // Start delx test10_test12
  // End delx test10_test12


// eof (hprls verilog)

Fixing Races Using a Mutex

To overcome the race condition, each update to the shared variable must be guarded.

We can add explicit mutex variables to the code or use the C# lock primitive.

Here is one of the access sites augmented with explicit synchronisation primitives:

    [ Kiwi.Remote("client1-port", "parallel: four-phase") ]
    public static int bumper(int delta)
    {
        while(!mutex1) hpr_testandset(ref mutex1, 1);
        jvar += delta;
        hpr_testandset(ref mutex1, 0);
        return jvar;
    }

Alternatively, we can use C#'s lock keyword. A lock is often applied to the current object instance, but for a static method we do not have an instance. We create a dummy object just to lock on.

   static Object my_mutex = new Object();

   [ Kiwi.Remote("client1-port", "parallel: four-phase") ]
   public static int bumper(int delta)
   {
      lock(my_mutex) { jvar += delta; }
      return jvar;
   }

Applying the mutex to the for loop requires a little restructuring. It is a most unpleasant example that probably has little real-world use:

   bool bb = true;
       lock(my_mutex) { bb = jvar<=limit; jvar+=2; } 
       if (bb) ...
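One way the restructured loop could be completed is sketched below, as a plain dotnet program. Several things here are assumptions rather than the original code: the class name Test10Locked, the local variable snapshot, the choice to print a value captured inside the critical section, and reading the return value of bumper inside the lock. The Kiwi.Remote attribute is omitted so that the sketch compiles without the Kiwi library.

```csharp
using System;

public class Test10Locked
{
    static readonly object my_mutex = new object();  // dummy object to lock on
    static int limit = 10;
    static int jvar;

    // In the Kiwi flow this method would carry the Kiwi.Remote attribute.
    public static int bumper(int delta)
    {
        // Update and read under the same lock so the returned value is consistent.
        lock (my_mutex) { jvar += delta; return jvar; }
    }

    public static void Main()
    {
        Console.WriteLine("Test 10 Limit=" + limit);
        lock (my_mutex) { jvar = 1; }
        bool bb = true;
        while (bb)
        {
            int snapshot = 0;
            // Test, capture and advance jvar in one critical section.
            lock (my_mutex)
            {
                bb = jvar <= limit;
                snapshot = jvar;
                if (bb) jvar += 2;
            }
            if (bb) Console.Write(snapshot + " ");
        }
        Console.WriteLine(" Test 10 finished.");
    }
}
```

Every access to jvar now sits inside a critical section, so the loop and a concurrent call to bumper cannot update the variable at the same time.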


We have seen how to control KiwiC compilation manually, making calls between compilation units so that a resource can be shared instead of being expanded in line.

When multiple components are assembled with sharing of ports, whether it is the DRAM interface or multiple clients invoking one server, multiplexors and arbiters are required. KiwiC uses a standard set of protocols, so only a finite set of arbiters and multiplexors is needed. Another note will discuss their relative performance and the level of automation available for instantiating them.

