Orangepath/HPR Logic Synthesis Project: Hardware and Embedded Software Synthesis from Executable Specifications.
Compilation from .net CIL Bytecode (second example)

N-body Mini Demo of Kiwi: Synthesis from .net CIL Bytecode

In this example, a simple implementation of the N-body chaotic system was coded in C# and compiled to .net CIL bytecode DLL. The resulting DLL can be run on the workstation or compiled to FPGA using the KiwiC compiler.

The N-body problem typically suffers badly from structural hazards, where each body needs to read the information regarding all the others to update its position. We discuss ways of handling this in the Conclusion.

Source Program

This is the simple N-body test application:

using System;
using KiwiSystem;


// using KiwiFormGraphics;

class nbody
{
  const int bodies = 3;
  static byte d = 0;

  class planet
  {
    int pos_x, pos_y;
    int vel_x, vel_y;

    int divider(int d0) // Approximation to division with numerator 1024 (constant denominators generat faster dividers since KiwiC mutiplies by the reciprocal).
    {
      bool f = d0 < 0;
      int r;
      int d = (f)? -d0: d0;
      if (d < 2) r = 1024*64 / 2;
      else if (d < 3) r = 1024*64 / 3;
      else if (d < 4) r = 1024*64 / 4;
      else if (d < 7) r = 1024*64 / 7;
      else if (d < 12) r = 1024*64 / 12;
      else if (d < 16) r = 1024*64 / 16;
      else if (d < 28) r = 1024*64 / 28;
      else if (d < 32) r = 1024*64 / 32;
      else if (d < 64) r = 1024*64 / 64;
      else if (d < 100) r = 1024*64 / 100;
      else if (d < 200) r = 1024*64 / 200;
      else r = 1024*64 / 400;
      return (f) ? -r: r;
    }

    public planet(int k0)
    {
      if (k0 == 0)
      {
        pos_x = 0;
        pos_y = 0;
      }
      else
      {
        int k = k0 * 351123; // pseudo-random start position.
        pos_x = k % 4096 - 1024;
        pos_y = (k / 4096) % 4096 - 1024;
      }
      vel_x = 0;
      vel_y = 0;
    }

    uint clip(int v)
    {
      if (v < 0) return 0U;
      else if (v>512) return 512U;
      else return (uint)(v);
    }

    int clip1(int v, int max)
    {
      if (v < -max) return -max;
      else if (v>=max) return max-1;
      else return v;
    }


    public void runme(int me)
    {
      KiwiFormGraphics.setget_pixel(clip(256 + pos_x/8), clip(256 + pos_y/8), false, (byte)(me * 6 * 32));
      if (me == 0) return; // Body zero does not move

      pos_x += vel_x / 1024;
      pos_y += vel_y / 1024;

      KiwiFormGraphics.setget_pixel(clip(256 + pos_x/8), clip(256 + pos_y/8), false, (byte)(me * 6 * 32 + d/32));

      pos_x = clip1(pos_x, 256*8);
      pos_y = clip1(pos_y, 256*8);

      int dx = 0, dy = 0;
      for (int i=0; i < bodies; i++) 
       {
         if (i == me) continue;
         // Should really be using quadratic distance rather than two separate distances.
         if (planets[i].pos_x != pos_x) dx += divider((planets[i].pos_x - pos_x));
	 if (planets[i].pos_y != pos_y) dy += divider((planets[i].pos_y - pos_y));
       }
      vel_x += dx;
      vel_y += dy;
    }
  }

  static  planet [] planets = new planet[bodies];

  [Kiwi.HardwareEntryPoint]
  public static int Main()
  {
    Console.WriteLine("Start nbody bodies={0}", bodies);
    KiwiFormGraphics.open(512, 512);
    for (int i=0; i < bodies; i++) planets[i] = new planet(i);

    while(true)
     {
       d += 1;
       Kiwi.Pause();
       for (int i=0;  i < bodies; i++) planets[i].runme(i);
     }
    return 0;
  }
}

The above application was compiled using the C# compiler to generate a .exe file that can be used on workstation or compililed with KiwiC to FPGA.

Graphics Library

For testing on the workstation we need a pixel plot shim to a suitable graphics library. Here we use the Windows Forms API (which also works on mono/linux).

using System;
using System.Windows.Forms;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Text;

// This version of KiwiFormGraphics is for use on the workstation when running without FPGA support.
public static class KiwiFormGraphics
{
  static Graphics g;
  static Pen pen;
  static int Width, Height;
  static bool opened = false;

  static int scale = 2;

  public static void open(int w, int h)
  {
    Console.WriteLine("Opened {0} {1}", w, h);
    Form f = new Form ();
    f.Width = w*scale;
    f.Height = h*scale;
    f.Visible = true;
    g = f.CreateGraphics();
    g.Clear(System.Drawing.Color.Green);
    f.Activate(); 
    pen = new System.Drawing.Pen(Color.Black, 3);
    opened = true;
  } 

  public static int setget_pixel(uint x, uint y, bool mode, int color)
  {
      //Console.WriteLine("Mark {0} {1} {2}", x, y, color);
      if (!opened) open(Width, Height);
      PointF point1 = new PointF(x*scale, y*scale);
      PointF point2 = new PointF(x*scale+1, y*scale+1);

      pen = new System.Drawing.Pen(Color.FromArgb(color, color, color), 1);
      g.DrawLine (pen, point1, point2);      
      return 0;
  }
}

Workstation Output

Running on a workstation we get an animated plot display as shown in the following screen shot and the following video: VIDEO (.mp4).

VGA output still photo

To run on your own workstation you need the following compiled CIL files ZIP FILE

FPGA Compilation

Compilation to a Verilog RTL file using KiwiC generates about 1700 lines of code: nbody.v.

Only the application code .exe file is compiled to hardware, not the Windows Forms API. So a second file which provides only the callable interface is supplied to KiwiC. We call this a shim. This reads as follows:

public class KiwiFormGraphics
{
  static byte [] framestore;

  static int Width, Height;

  static bool opened = false;

  [Kiwi.Remote("server1", "parallel:four-phase")] // Just a stub/shim
  public static byte setget_pixel(uint x, uint y, bool readf, byte wdata)
  {
    if (!opened) open(Width, Height);
    uint addr = (uint)(x * Width + y);
    if (readf) 
      return framestore[addr];
    else 
     { framestore[addr] = wdata;
       Console.WriteLine("Set pixel ({0},{1}) to {2}", x, y, wdata);
       return wdata; 
     }
   }

  public static void open(int w, int h)
  {
    opened = true;
    framestore = new byte [w*h];
  }

}

The entry point in the shim is marked with the following attribute which directs KiwiC to ignore the body of the function and instead generate handshaking wires for connection to a separately-compiled hardware module.

[Kiwi.Remote("server1", "parallel:four-phase")]

So the resultant API to the generated RTL is

module nbody(
  output reg KiwiFormGraphics_setget_pixel_req, 
  input KiwiFormGraphics_setget_pixel_ack, 
  input signed [31:0] KiwiFormGraphics_setget_pixel_return, 
  output reg [7:0] KiwiFormGraphics_setget_pixel_wdata, 
  output reg KiwiFormGraphics_setget_pixel_readf, 
  output reg [31:0] KiwiFormGraphics_setget_pixel_y, 
  output reg [31:0] KiwiFormGraphics_setget_pixel_x, 
  input clk, 
  input reset);

The Framestore

The framestore is fairly boring: it has been implemented using C# and in hand-coded RTL. The hand-coded one is shown in the video clip (it has an additional character generator that puts the phrase KIWI C# HLS in the background. C# version framestore.cs.

Both forms of framestore also require an I2C controller to set up the Chrontel VGA/DVI chip. This was coded as a separate C# design and compiled with KiwiC. The source C# code and the generated RTL are as follows: C# I2C Controller and I2C RTL.

The FPGA toplevel

To stitch all the components together a manual toplevel.v RTL file was created. Automating this is an interesting future project: for instance to automatically multiplex between several writes to the framestore. Currently one would write a C# wrapper program that instantiates each application object.

If you have an ML605 card the following file can be copied to slot 4 of the compact flash card for loading using Xilinx ace: rev4.ace.zip.

FPGA Video

You may need to save these videos on your deskstop before playing them.

Click for video: FPGA RUN (.mp4).

Second video: FPGA RUN (.mp4).


Adding Further Bodies

Further particles/bodies can be added by increasing the compile-time constant bodies. But we have a discontinuity in performance when we have 9 more more bodies.

Arrays allocated in the C# code must be converted into an appropriate memory technology based on their size. KiwiC maps arrays of variables to synchronous RAMs or to flip-flops depending on size, or to off-chip DRAM for large arrays. There are default threshold settings for the transition points and C# custom attributes can be used to override these settings on a per-instance basis.

The important thershold in this example is res2-regfile-threshold, the threshold in terms of number of locations below which to not instantiate any sort of structural SRAM and use flip-flops instead. This has a default value of 8 in the standard KiwiC recpie. (The recipe is a guiding file read in by the compiler.)

Default KiwiC behaviour is to place each field of each object in an array of objects in its own array. Hence concurrent access to all fields in an object within a clock cycle without structural hazard is always possible (e.g. to x and y dimensions and to velocity and position fields).

But each particle needs to read off the positions of all the others. If the position arrays are mapped to flip-flops then FPGA wiring simply serves to distribute the positions. If the position information is in RAMs then KiwiC will introduce sequencing steps to access all the required information over the available RAM ports in some number of cycles. Hence much slower performance results. Since the current particle's position is to be compared against all others, KiwiC needs to be smart enought to read that once and store in holding registers while comparing with all others.

The Xilinx or other FPGA tools also make decisions about whether to use RAMs or flip-flops, but only can only freely decide when there is no timing variation between the alternatives. KiwiC has much more freedom, since it chooses the number of clock cycles to be used (and whether the address is computed the cycle before the data is loaded or stored to the RAM etc.)

Nine Bodies

Three bodies leads to a chaotic system (although only with quadrature gravity etc not included in the above code!). Here we have nine bodies (one higher than the KiwiC default setting).

From the graphical output we can see an obvious difference: with all the extra gravity the particles stay much closer together.

Nine body FPGA ace file : rev3.ace.zip MISSING.

Demo MP4 Video.


A Multi-threaded Implementation

The example above used one thread which operated on each particle in turn. The same thread made the call out to the plotting display.

A simple modification that adds performance (certainly for the FPGA implementation) is to move the plotting work to a separate thread:

      Thread PlotThread = new Thread(new ThreadStart (PlotterTask));
      PlotThread.Start();

A naive split without synchronisation would typically lead to certain configurations being missed with dotted tracks being plotted, so either synchronisation should be used or the the plotter function should draw a line from the last observed point of a particle to its current position. We implement our own line drawing code in C# for the FPGA version, but it would unfair to include this behaviour in a performance comparison with the workstation because the workstation has its own native line-drawing code. However, it is still useful to debug/validate our line drawing code by running it on the workstation.

With a single thread, it is possible for the positions of each particle to be stored in a common array without excessive structural hazards. This is because each call to the plot function invokes a synchronous four-phase handshake and the way the code is structured there is such a call between each update. Also ... needs its own position and that of all of its neighbours

A problem arises with using a thread for each particle when there are a large number of particles.

... work in prog ...

General improvements: The power of C# can be used to avoid the explict call to our special divide approximation: instead all arithmetic should be done on our own type which has its overloaded division operator implemented with our chosen (heuristic) algorithm.

Also, should use foreach to step through the array!


Updated May 2012.               UP.