Kiwi Scientific Acceleration: C# high-level synthesis for FPGA execution.
Linked List Example

Kiwi Scientific Acceleration: Linked Lists Example

High-level synthesis (HLS) in the past has commonly not supported linked lists owing to the dynamic storage allocation requirement.

Kiwi has always supported 'dynamic' object creation using the C# new operator and object pointer manipulations. This enables use of linked lists and other data structures that use pointers, such as trees and meshes. But prior to Kiwi 2.0 the underlying memory pool size for each type of object had to be separately determined at compile time. Kiwi 2 will remove that restriction.

Static allocation in HLS has always been the norm. This is suitable when highly-customised on-FPGA data busses and RAM structures are being allocated. But when large amounts of data are placed in DRAM, the benefit of static allocation disappears and we desire to seamlessly migrate to a more-conventional, processor-like style of hardware.

Contents

Here we explain the various ways of achieving dynamic memory allocation using Kiwi.

Each approach can has pros and cons according to how-well the code is optimised and how easy to use it is.

We also need to discuss how one aproach is selected in preference to another.

Kiwi 1 Baseline
Kiwi 1 Unsafe Extensions
Kiwi Autodipose
Kiwi 2 Genuine Run-time Heap
Avoiding a Manually-coded Allocator
Conclusions

Kiwi 1 Baseline

Kiwi 1 Rule: Provided the calls to 'new' are all done in the elaborated part of each thread, Kiwi is happy to use heap store. Dynamic storage inside the runtime part of each thread (post end of static elaboration) is only allowed if the heap is the same size on each iteration.

(Regarding garbage collection: KiwiC 1 has an automatic mode that works in 'obvious' cases where memery is manifestly never in use on the next iteration of a non-unwound loop and it provides an explicit Kiwi.Dispose() method that the application code must invoke in other cases.)

However, in general the Kiwi 1 Rule for allocation is overly restrictive. The simplest technique to overcome this is for the application programmer to write a custom manual storage allocator that works out of a static pool. This is demonstrated next.

Baseline Example of Kiwi Linked List

In the baseline approach (Kiwi 1), the C# compiler's constructor and newobj calls are used at KiwiC compile time to initialise a free pool list. The user then has his own allocator for use at Kiwi run time. There is no garbage collection but the user can also implement a custom deallocator to return items to the free list and call it explictly.

Here is a small example according to that method, providing a linked list of integers. Clearly trees and more complex structures are a straightforward extension.

class LinkedListOfInts
{
   public int car;
   public LinkedListOfInts cdr;   // This allows a potentially-infinite linked list.

   public LinkedListOfInts(int initial_cell_content) // Constructor used by C# compiler.
   {
     car = initial_cell_content;
   }

   // Static Members - the freelist
   static LinkedListOfInts freelist = null;

   public static void createFreepool(int no) // Compile-time creation of a free list 
   {
      for (int i = 0; i < no; i++)
      {     
	LinkedListOfInts available_cell = new LinkedListOfInts(11*i);
	available_cell.cdr = freelist;
	freelist = available_cell;
	Console.WriteLine("  Created Item {0}", i);
      }
      Console.WriteLine("Test 39r0: Freepool created.");
    }

   static Object locker = new Object();

   public static LinkedListOfInts alloc(int no)  // Run-time allocator
   {
     LinkedListOfInts result = null;
     if (freelist == null) KiwiSystem.KiwiBasicStatusMonitor.RunTimeAbend("no free cells", 1);
     else
     {
       lock (locker) 
       { result = freelist;  freelist = result.cdr; }
       result.car = no;
     }
     return result;
   }

   public void Dispose()
   {
     lock (locker)
     {
       cdr = freelist;
       freelist = this;
     }
   }
}

This pool can then be used, for example, as follows:

class test39r0
{
    [Kiwi.HardwareEntryPoint()]
    public static void Main()
    {
       const int poolSize = 4;
       Console.WriteLine("Test 39r0 Linked List Simple Demo.  Poolsize={0}", poolSize);
       LinkedListOfInts.createFreepool(poolSize);

       // Before the first Pause - create all the cells you will use and put them on the freelist.
       Kiwi.EndOfElaborate();
       Kiwi.Pause();

       // Now we are a customer for our cells - make a simple list
       LinkedListOfInts baser = null;

       for (int i = 0; i < 3; i++)
       {     
 	 LinkedListOfInts cell = LinkedListOfInts.alloc(i*3+50);
	 cell.cdr = baser;
	 baser = cell;
	 Console.WriteLine("  Runtime Alloc Item {0}", i);
       }

       for (LinkedListOfInts p = baser; p != null; p = p.cdr)
	 {
	   Kiwi.Pause();
	   Console.WriteLine("  Readback Item {0}", p.car);
	 }
       Console.WriteLine("Test 39r0 Linked List Simple Demo Finished.");		
    }
}

When we run this under mono we get the expected output.

And running the generated RTL under a Verilog simulator produces the same output, with a minor variation in whitespace tabs.

iverilog vsys.v DUT.v /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cv_fparith.v
./a.out
VCD info: dumpfile vcd.vcd opened for output.
Test 39 Linked List Simple Demo.  Poolsize=4
  Created Item          0
  Created Item          1
  Created Item          2
  Created Item          3
Test 39: Freepool created.
  Runtime Alloc Item           0
  Runtime Alloc Item           1
  Runtime Alloc Item           2
  Readback Item  56
  Readback Item  53
  Readback Item  50
Test 39 Linked List Simple Demo Finished.

The generated RTL is on this link here: DUT.v.

Large Static Pools

This approach uses a static pool. KiwiC allocates static pools to on-chip register files, BRAMs or off-chip memories according to their size. Being static, this size is known at compile time.

When such an example is compiled with a small pool size, the KiwiC compiler uses a different approach compared with the more-typical multi-megabyte data structures. KiwiC places large datastructures in DRAMs and intermediate ones in on-chip block RAM. The above mini-example has only four cells, so you will see explicit scalarisation in the generated code. Thresholds for that sort of behaviour (i.e. what is regarded as small) are adjustable in the recipe/cmdline.

Kiwi 1 With Unsafe Extensions - Polymorphic Static Pool

The above approach is strongly-typed. It uses a dedicated memory pool for each object type. In the C# type hierarchy a certain amount of up and downcasting is allowed, of course, while still remaining strongly-typed, and up-casts will just result give inefficient memory use. (By-the-way, KiwiC will often put different fields from a common object in different on-chip BRAMs, but that is an orthogonal aspect.)

A more memory-efficient approach supports allocation of objects of different types and sizes from a common pool. Efficiency arises from statistical multiplexing gain. But this may result in more complex (more flexible) wiring paths being needed when the pool is on-chip. When off-chip with a single bank we will generally see processor-like datapaths anyway (i.e. general purpose data paths and not custom wiring).

Also, invoking the above allocator for a free pool of a million cells works, but is slow. As in C# normally, the Kiwi compiler allocates a 1-D array of object handles and fills each one with a pointer to a heap object. KiwiC then spots that these pointers are never changed and converts the handle array into what it calls a ROM (read-only memory) and it then inlines operations on that ROM wherever it can, which is always, and the ROM does not appear in the generated code. Nonetheless, KiwiC is very slow to do this (currently).

A potential alternative approach is to use C#'s structure arrays that avoid handles. But C#'s structs are not the same as object records: they are copied as valuetypes, null cannot be used and there are partial update constraints. So these do not supply a fully-effective alternative.

Unsafe Shared Static Pool

We allocate a static pool as before. But we now use a small amount of 'unsafe' C# that does pointer arithmetic and a backdoor 'sizeof' function. Unsafe C# allows us to cast an integer to an object pointer. Mono will not run the resulting code but KiwiC can compile it which is all we need if we use the normal memory manager instead on Mono. This is as efficient as it can be and it also compiles quickly. More complex heap allocator algorithms are easy to implement on this basis, such as Fibonacci buddy heaps.

Even in unsafe C#, we are not allowed to take the size of a class using the C# sizeof function. (However this can be done easily in PE/CIL assembler. The CSharp compiler from VisualStudio supports the '#if IL' form to embed CIL code directly in the C# source file. For use with mono we would need to separately compile the embedded IL, but we can automate that process somewhat by cutting out of the C# source file with a shell script and passing it to ilasm.) To get around all of that fuss however, KiwiC implements a backdoor templated/parametric class that reports the size in bytes of its argument. This is called Kiwi.ObjectHandler.

The following code then is sufficient for large heaps. Yet it is still rather cumbersome to have to use a heap allocator that is not the built-in 'new' operator of C#.

unsafe class LinkedListOfInts
{
   public int car;                // Content Data
   public LinkedListOfInts cdr;   // This allows a potentially-infinite linked list.

   public LinkedListOfInts(int initial_cell_content) // Constructor used by C# compiler.
   {
     car = initial_cell_content;
   }

   static LinkedListOfInts freelist = null;
   static int sbrk = 0;

   const int heapSize = 1000 * 1000 * 100; // 100 megabyte
   static byte [] heapPool = new byte[heapSize];

   static Kiwi.ObjectHandler<LinkedListOfInts> sizer = new Kiwi.ObjectHandler<LinkedListOfInts>();

   static int itemSize()
   {
   #if IL
       // Inline assembler works under Microsoft's C# compiler.
       sizeof LinkedListOfInts
       ret
   #endif
      // Using mcs on linux we use a KiwiC backdoor that can measure the size of an object.
      return sizer.size();
   }

   public static void createFreepool(int no_)
   {
      sbrk = 0;   
      Console.WriteLine("Test 39r1: Freepool created, {0} bytes.", heapSize);
    }

   static Object locker = new Object();

   public static LinkedListOfInts alloc(int no)
   {
     LinkedListOfInts result = null;
     int bytes = itemSize();
     if (sbrk+bytes > heapSize && freelist == null) KiwiSystem.KiwiBasicStatusMonitor.RunTimeAbend("no free cells", 1);
     else if (freelist == null) 
     {
        lock (locker) 
     	  { 
  	    //This compiles but does not run under mono: it generates the "castclass LinkedListOfInts" CIL instruction.
	    //Need to cast via object to avoid C# compiler complaint.  We write out each step here for clarity.
            Object fs = heapPool;
            int ifs = (int)fs;
            Object p = (Object) (sbrk + ifs);
       	    result = (LinkedListOfInts)(p);
	    sbrk += bytes;
	  }
     }
     else
     {
       lock (locker) 
       { result = freelist;  freelist = result.cdr; }
     }
     result.car = no;
     return result;
   }

   public void Dispose()
   {
     lock (locker)
     {
       cdr = freelist;
       freelist = this;
     }
   }
}

Kiwi Autodispose

Kiwi 1 implements a form of garbage collection called 'autodispose'. This can currently (October 2016) be enabled with -autodispose=enable. It will be turned on by default in the near future when further escape analysis is completed.

Currently it disposes of a little too much and when that memory is reused we have a nasty aliasing problem since that store was still live with other data. This will crop up with linked-list and tree examples or where the address of a field in a heap object is taken.

When autodispose fails to free something (or is turned off) you can explicitly free such store with a call to obj.Dispose() or Kiwi.Dispose(Object obj).

Genuine Run-time Heap

KiwiC partitions every thread into two regions as described in the blue box:

Lasso thread shape

Since threads both describe compile-time and run-time behaviour a means is needed to distinguish the two forms of loop. The approach adopted is that every thread in the source code is treated as generally having a lasso shape, consisting of code that is executed exactly once before entering any non-unwound, potentially-infinite loop.
The front-end algorithm used selects an articulation point in the control graph of a thread where all loops before this point have been unwound and all code reachable after that point has its control graph preserved in the program output to the next stage. The articulation point is called the end of static elaboration point. The point selected is the first branch target that is the subject of a conditional branch during an interpreted run of the thread or the entry point to the last basic block encountered that does not contain a Kiwi.Pause() call.
The branch will be conditional either because it depends on some run-time input data or because it is after at least one Kiwi.Pause() call. The semantics of Kiwi.Pause() imply that all code executed after the call are in a new run-time clock cycle. Apparently-conditional branches may be unconditional because of constant folding/propagation during the interpreted run. This is the basis of generate-style loop unwinding in the lasso stem.

KiwiC can readily determine whether a call to new() is in the elaboration lasso stem or post the end of elaboration point (i.e. in the body of runtime process loops).

Any calls to new() after static elaboration are intrinsically run-time allocations. Kiwi 2 has a run-time heap allocator.

Manually-Coded Post-Elaboration Allocator

The Kiwi 2 style is illustrated here. This example returns items of one type only. However, it is easy to add further methods that return other types of object. They will all come from the same run-time bool.

(We do need to describe here how to map these operations to separate off-chip memory banks and/or load-store ports... TODO)

class IntegratedLinkedListOfInts
{
   public int car;                // Content Data
   public IntegratedLinkedListOfInts cdr = null;

   public IntegratedLinkedListOfInts(int initial_cell_content) // Constructor used by C# compiler.
   {
     car = initial_cell_content;
   }

   static IntegratedLinkedListOfInts freelist = null;

   static Object locker = new Object(); // Serves as a mutex.

   public static IntegratedLinkedListOfInts alloc(int no)
   {
     IntegratedLinkedListOfInts result = null;
     if (freelist == null) 
     {
       // In this demo we call the native 'new' function both before and after the end
       // of static elaboration.  
       result = new IntegratedLinkedListOfInts(no);
     }
     else  // But we still need to manage our own disposals manually.
     {
       lock (locker) 
       { result = freelist;  freelist = result.cdr; }
       result.car = no;
     }
     return result;
   }

   public void Dispose()
   {
     lock (locker)
     {
       cdr = freelist;
       freelist = this;
     }
   }
}

Post-Elaboration Allocator - Example Application

class test39r2
{
    [Kiwi.HardwareEntryPoint()]
    public static void Main()
    {
       Console.WriteLine("Test 39r2 Linked Lists in DRAM Demo - with post-elabotate calls to new(). Trees are a simple extension.");

       // Add two cells to the list during static elaborate
       IntegratedLinkedListOfInts cell1 = IntegratedLinkedListOfInts.alloc(42);
       IntegratedLinkedListOfInts cell2 = IntegratedLinkedListOfInts.alloc(44);
       IntegratedLinkedListOfInts baser = cell1;
       cell1.cdr = cell2;

       /*------------------ */ Kiwi.EndOfElaborate(); /*------------------ */

       Kiwi.Pause(); // Wait for a single clock edge - further work will be post elaborate.

       for (int i = 0; i < 4; i++)
       {     
         Kiwi.Pause(); // Pause inside this loop before making further cells.
 	 IntegratedLinkedListOfInts cell = IntegratedLinkedListOfInts.alloc(i*3+200);
	 cell.cdr = baser;
	 baser = cell;
	 Console.WriteLine("  Runtime Alloc Item {0}", i);
       }

       for (IntegratedLinkedListOfInts p = baser; p != null; p = p.cdr)
	 {
	   Kiwi.Pause();
	   Console.WriteLine("  Readback Item {0}", p.car);
	 }
       Console.WriteLine("Test 39r2 Linked List DRAM Demo Finished.");		
    }
}

Post-Elaboration Allocator - Output From Demo

Test 39r2 Linked Lists in DRAM Demo - with post-elaborate calls to new(). Trees are a simple extension of lists.
  Runtime Alloc Item 0
  Runtime Alloc Item 1
  Runtime Alloc Item 2
  Runtime Alloc Item 3
  Readback Item 209
  Readback Item 206
  Readback Item 203
  Readback Item 200
  Readback Item 42
  Readback Item 44
Test 39r2 Linked List DRAM Demo Finished.

Avoiding a Manually-coded Allocator

The downside of the coding style presented above is that it is twice as long as it needs to be owing to the manually-implemented freepool manager. The overhead is less significant in real examples where there is more application logic, but nonetheless, it would be great if KiwiC helped more on the allocate side. We can implement our own dispose and neglect garbage collection for now.

... TBC Feb 2017

Conclusions

... TBC Feb 2017

December 2015. Now being updated Feb 2017... UP.