Orangepath/HPR Logic Synthesis Project: Hardware and Embedded Software Synthesis from Executable Specifications.
Compilation from .net CIL Bytecode (second example)

Kiwi Scientific Acceleration::Shared Memory using FPGA Dual-Port RAMs

Data can commonly be shared between threads using arrays and scalars. Typically, a protocol using scalars signals the readiness and/or location of data in the shared arrays.

For efficient software execution, the Monitor.Pulse() and Monitor.Wait() primitives should be used. Kiwi will mostly ignore them, since concurrent hardware does not dissipate dynamic energy while polling for an input event; this is the fundamental operating principle of hardware.
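As a rough sketch of what the blocking form could look like in ordinary .NET software, the following one-slot mailbox uses Monitor.Wait() and Monitor.PulseAll() instead of polling. The class name PulseMailbox, its methods and the 'Q' quit command are hypothetical and are not part of the Kiwi library or of test50; how KiwiC treats this exact pattern is an assumption and has not been checked here.

```csharp
using System;
using System.Threading;

// Hypothetical one-slot command mailbox that blocks rather than polls.
public static class PulseMailbox
{
    static readonly object gate = new object();
    static char command = 'I';   // 'I' = idle, mirroring command2 in the example on this page

    // Client side: post a command, then sleep until the worker has returned to idle.
    public static void Post(char c)
    {
        lock (gate)
        {
            command = c;
            Monitor.PulseAll(gate);                     // wake a waiting worker
            while (command != 'I') Monitor.Wait(gate);  // releases the lock while asleep
        }
    }

    // Worker side: sleep until a non-idle command is present, then return it.
    public static char Take()
    {
        lock (gate)
        {
            while (command == 'I') Monitor.Wait(gate);
            return command;
        }
    }

    // Worker side: mark the command as serviced and wake the client.
    public static void Done()
    {
        lock (gate)
        {
            command = 'I';
            Monitor.PulseAll(gate);
        }
    }

    // Posts two work commands and a quit command; returns how many work commands were serviced.
    public static int RunDemo()
    {
        int serviced = 0;
        var worker = new Thread(() =>
        {
            while (true)
            {
                char c = Take();
                bool quit = (c == 'Q');   // 'Q' is a made-up quit command for this sketch
                if (!quit) serviced++;
                Done();
                if (quit) break;
            }
        });
        worker.Start();
        Post('S');
        Post('P');
        Post('Q');
        worker.Join();
        return serviced;
    }

    public static void Main()
    {
        Console.WriteLine("serviced {0} commands", RunDemo());
    }
}
```

The waits sit inside while loops so that a wake-up with the wrong state simply goes back to sleep, which is the usual idiom for Monitor.Wait().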

Here we present a simple example of sharing data between two threads in that sort of style.

Source File

The following CSharp file was compiled with mono mcs.

// Kiwi Scientific Acceleration Example - Illustration of inter-thread IPC using shared dual port memory.
// (C) 2016 DJ Greaves, University of Cambridge, Computer Laboratory.

using System;
using System.Text;
using KiwiSystem;
using System.Diagnostics;
using System.Threading;


public class test50
{
  const int problemSize = 30;

  static int [] sharedData = new int [problemSize];

  static volatile bool exiting = false;

  static volatile int sum = 12345678;

  static volatile char command2 = 'x';


  public static void secondProcess()
  {
     while(!exiting)
      {
        while (command2 == 'I' && !exiting) Kiwi.Pause();
	//Console.WriteLine("second process dispatching {0}", command2);
	if (command2 == 'U') // Update data
	{
	   for (int i=0/2; i<problemSize; i++) { Kiwi.Pause(); sharedData[i] = i+sum; }
	}
	else if (command2 == 'S') // Compute Sum
	{
	   sum = 0;
	   for (int i=0; i<problemSize; i++) { Kiwi.Pause(); sum += sharedData[i]; }
	}
 	else if (command2 == 'P') // Print Sum
 	{
             Console.WriteLine("sp: data sum {0}", sum);
  	     Kiwi.Pause();
        }
	else if (command2 == 'D') // Print Data
 	{
    	   //Console.WriteLine("second process Print Data {0}", command2);
	   for (int i=0; i<problemSize; i++) 
	   { 
             Console.WriteLine("sp: Print data: sharedData[{0}] = {1}",  i, sharedData[i]);
  	     Kiwi.Pause();
           }
        }
	//Console.WriteLine("second process dispatched {0}", command2);
	command2 = 'I'; // Set back to idle
      }
   } 


  public static int computeSum()
  {
     int mysum = 0;
     for (int i=0; i<problemSize; i++) { Kiwi.Pause(); mysum += sharedData[i]; }
     return mysum;	
  }

  public static void clearto(int v0)
  {
     int vv = v0;
     for (int i=0; i<problemSize; i++) { sharedData[i] = vv++; Kiwi.Pause(); }
     sharedData[problemSize-1] = 99;
  }

  public static void test50_phase0()
   {
     Console.WriteLine("Kiwi Demo - Test50 phase0 starting.");
     Console.WriteLine("  Test50 Remote Status={0}, sum= {1}", command2, sum);
     clearto(30);
     Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
     command2 = 'D';
     Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();

     for(int iteration = 0; iteration < 3; iteration++)
     {
	Console.WriteLine("  Test50 fancy={0} rs={1} sum={2}.", iteration, command2, sum);
	command2 = 'P';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
	command2 = 'S';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
	command2 = 'P';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
	command2 = 'U';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
	command2 = 'S';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();
	command2 = 'P';
	Kiwi.Pause();     while (command2 != 'I') Kiwi.Pause();

	clearto(40 + iteration);
	Console.WriteLine("   point2 {0} {1}.", command2, sum);
        Kiwi.Pause();
     }
     Console.WriteLine("Finished main process.");
    }

  [Kiwi.HardwareEntryPoint()] public static void Main()
	{
	   Console.WriteLine("Kiwi Demo - Test50 starting.");
           Thread thread1 = new Thread(new ThreadStart(secondProcess));
           thread1.Start();

	   Kiwi.Pause();

	   test50_phase0();
	   exiting = true;
	   Console.WriteLine("Test50 starting join.");
           //thread1.Join();
	   Console.WriteLine("Test50 done.");
	}
}

Generated RTL from KiwiC

The RTL contains an instance of a dual-port RAM containing 30 words of 32 bits:

 ... snip ...
  CV_2P_SSRAM_FL1 #(6'd32, 3'd5, 5'd30, 6'd32) A_SINT_CC_SCALbx10_ARA0(clk, reset,
            A_SINT_CC_SCALbx10_ARA0_RDD0, A_SINT_CC_SCALbx10_ARA0_AD0,
            A_SINT_CC_SCALbx10_ARA0_WEN0, A_SINT_CC_SCALbx10_ARA0_REN0, A_SINT_CC_SCALbx10_ARA0_WRD0,
            A_SINT_CC_SCALbx10_ARA0_RDD1, A_SINT_CC_SCALbx10_ARA0_AD1, 
            A_SINT_CC_SCALbx10_ARA0_WEN1, A_SINT_CC_SCALbx10_ARA0_REN1, A_SINT_CC_SCALbx10_ARA0_WRD1);

// 3 vectors of width 5
// 4 vectors of width 1
// 3 vectors of width 32
// 1 vectors of width 16
// 256 bits in scalar variables
// Total state bits in module = 387 bits.
// 130 continuously assigned (wire/non-state) bits
//   cell CV_2P_SSRAM_FL1 count=1
// Total number of leaf cells = 1
endmodule

Full RTL output file: test50.v (Verilog).

Simulation Test Bench

// Kiwi Scientific Acceleration
// (C) 2010-16 DJ Greaves, University of Cambridge.
`timescale 1ns/1ns

module SIMSYS();
   
   reg clk, reset;
   initial begin reset = 1; clk = 1; # 33 reset = 0; end
   always #5 clk = !clk; // 10ns period for clock = 100 MHz (lowish FPGA clock freq!) 
   initial begin # (100 * 1000 * 1000) $display("Finish HDL simulation on timeout %t.", $time); $finish(); end
   initial begin $dumpfile("vcd.vcd"); $dumpvars(); end
   
   DUT the_dut(.clk(clk), .reset(reset));
   endmodule

Console Output

The generated RTL above was run on the Icarus Verilog simulator.

VCD info: dumpfile vcd.vcd opened for output.
Kiwi Demo - Test50 starting.
Kiwi Demo - Test50 phase0 starting.
  Test50 Remote Status=x, sum= 12345678
sp: Print data: sharedData[0] = 30
sp: Print data: sharedData[1] = 31
sp: Print data: sharedData[2] = 32
sp: Print data: sharedData[3] = 33
sp: Print data: sharedData[4] = 34
sp: Print data: sharedData[5] = 35
sp: Print data: sharedData[6] = 36
sp: Print data: sharedData[7] = 37
sp: Print data: sharedData[8] = 38
sp: Print data: sharedData[9] = 39
sp: Print data: sharedData[10] = 40
sp: Print data: sharedData[11] = 41
sp: Print data: sharedData[12] = 42
sp: Print data: sharedData[13] = 43
sp: Print data: sharedData[14] = 44
sp: Print data: sharedData[15] = 45
sp: Print data: sharedData[16] = 46
sp: Print data: sharedData[17] = 47
sp: Print data: sharedData[18] = 48
sp: Print data: sharedData[19] = 49
sp: Print data: sharedData[20] = 50
sp: Print data: sharedData[21] = 51
sp: Print data: sharedData[22] = 52
sp: Print data: sharedData[23] = 53
sp: Print data: sharedData[24] = 54
sp: Print data: sharedData[25] = 55
sp: Print data: sharedData[26] = 56
sp: Print data: sharedData[27] = 57
sp: Print data: sharedData[28] = 58
sp: Print data: sharedData[29] = 99
  Test50 fancy=0 rs=I sum=12345678.
sp: data sum 12345678
sp: data sum 1375
sp: data sum 41685
   point2 I 41685.
  Test50 fancy=1 rs=I sum=41685.
sp: data sum 41685
sp: data sum 1665
sp: data sum 50385
   point2 I 50385.
  Test50 fancy=2 rs=I sum=50385.
sp: data sum 50385
sp: data sum 1694
sp: data sum 51255
   point2 I 51255.
Finished main process.
Test50 starting join.
Test50 done.

VCD Waveforms

[Figure: timing waveform.]

For comparison, output when the same .exe file is run on Mono

MONO_PATH=/home/djg11/d320/hprls/kiwipro/kiwic/distro/support mono test50.exe
Kiwi Demo - Test50 starting.
Kiwi Demo - Test50 phase0 starting.
  Test50 Remote Status=I, sum= 12345678
sp: Print data: sharedData[0] = 30
sp: Print data: sharedData[1] = 31
sp: Print data: sharedData[2] = 32
sp: Print data: sharedData[3] = 33
sp: Print data: sharedData[4] = 34
sp: Print data: sharedData[5] = 35
sp: Print data: sharedData[6] = 36
sp: Print data: sharedData[7] = 37
sp: Print data: sharedData[8] = 38
sp: Print data: sharedData[9] = 39
sp: Print data: sharedData[10] = 40
sp: Print data: sharedData[11] = 41
sp: Print data: sharedData[12] = 42
sp: Print data: sharedData[13] = 43
sp: Print data: sharedData[14] = 44
sp: Print data: sharedData[15] = 45
sp: Print data: sharedData[16] = 46
sp: Print data: sharedData[17] = 47
sp: Print data: sharedData[18] = 48
sp: Print data: sharedData[19] = 49
sp: Print data: sharedData[20] = 50
sp: Print data: sharedData[21] = 51
sp: Print data: sharedData[22] = 52
sp: Print data: sharedData[23] = 53
sp: Print data: sharedData[24] = 54
sp: Print data: sharedData[25] = 55
sp: Print data: sharedData[26] = 56
sp: Print data: sharedData[27] = 57
sp: Print data: sharedData[28] = 58
sp: Print data: sharedData[29] = 99
  Test50 fancy=0 rs=I sum=12345678.
sp: data sum 12345678
sp: data sum 1375
sp: data sum 41685
   point2 I 41685.
  Test50 fancy=1 rs=I sum=41685.
sp: data sum 41685
sp: data sum 1665
sp: data sum 50385
   point2 I 50385.
  Test50 fancy=2 rs=I sum=50385.
sp: data sum 50385
sp: data sum 1694
sp: data sum 51255
   point2 I 51255.
Finished main process.
Test50 starting join.
Test50 done.

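The output from Mono and Icarus is the same except for the initial Remote Status line, which shows x under Icarus and I under Mono. That line is printed before any handshake with the second thread, so the value seen presumably depends on whether the second thread has already reset command2 to idle. The built-in simulator, accessible by adding '-sim 1000' to the KiwiC command line, also generates the same console output and a similar VCD waveform dump.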
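The printed sums can be cross-checked with a single-threaded model of the same arithmetic. The sketch below is not part of test50; the class Test50Model and its method names are made up for illustration. It replays only the clearto, 'S' and 'U' steps of each iteration and collects the sums that the 'P' commands would print after each 'S'.

```csharp
using System;

// Hypothetical sequential model of the arithmetic performed by test50.
public static class Test50Model
{
    const int ProblemSize = 30;
    static readonly int[] data = new int[ProblemSize];

    // Same effect as clearto(v0): ascending values from v0, last word forced to 99.
    static void ClearTo(int v0)
    {
        for (int i = 0; i < ProblemSize; i++) data[i] = v0 + i;
        data[ProblemSize - 1] = 99;
    }

    // Same effect as command 'S'.
    static int Sum()
    {
        int s = 0;
        for (int i = 0; i < ProblemSize; i++) s += data[i];
        return s;
    }

    // Same effect as command 'U' with the given value of sum.
    static void Update(int sum)
    {
        for (int i = 0; i < ProblemSize; i++) data[i] = i + sum;
    }

    // Returns the sums computed by the two 'S' commands in each of the three iterations.
    public static int[] ExpectedSums()
    {
        var sums = new int[6];
        ClearTo(30);
        for (int iteration = 0; iteration < 3; iteration++)
        {
            int s = Sum();                 // first 'S': 29*v0 + 406 + 99
            sums[2 * iteration] = s;
            Update(s);                     // 'U'
            s = Sum();                     // second 'S': 30*s + 435
            sums[2 * iteration + 1] = s;
            ClearTo(40 + iteration);
        }
        return sums;
    }

    public static void Main()
    {
        // Prints: 1375, 41685, 1665, 50385, 1694, 51255
        Console.WriteLine(string.Join(", ", ExpectedSums()));
    }
}
```

These six values match the "sp: data sum" lines in both logs, apart from the very first printed sum, which is the initial value 12345678.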

Sequential Consistency

KiwiC does not currently support fine-grained store order control. Where a number of writes are generated in one major cycle (delimited by hard or soft pauses), the writes within that major cycle are freely reordered by the restructure recipe stage to maximise memory port throughput. However, KiwiC already maintains ordering in PLI and other system calls, so extending this preservation to remotely-visible writes can easily be added in the near future.

When doing manual, low-level concurrency programming, as on this page, we should insert manual memory fences or else use C#'s synchronised methods.
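For the synchronised-method option, a minimal sketch in ordinary .NET might look as follows. The class SynchronisedArray and its methods are hypothetical. The attribute MethodImplOptions.Synchronized is standard .NET, but whether KiwiC handles it in the way shown is an assumption that has not been verified here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical shared array whose accessors are mutually exclusive.
public static class SynchronisedArray
{
    const int Size = 30;
    static readonly int[] shared = new int[Size];

    // Static synchronised methods all take the same lock (the one on this type),
    // so Fill and Sum can never overlap in time.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public static void Fill(int v)
    {
        for (int i = 0; i < Size; i++) shared[i] = v;
    }

    [MethodImpl(MethodImplOptions.Synchronized)]
    public static int Sum()
    {
        int s = 0;
        for (int i = 0; i < Size; i++) s += shared[i];
        return s;
    }

    // A writer refills the array while the caller sums it; every observed sum
    // should be a whole multiple of Size because no fill is ever seen half done.
    public static bool RunDemo()
    {
        bool consistent = true;
        var writer = new Thread(() => { for (int v = 1; v <= 1000; v++) Fill(v); });
        writer.Start();
        for (int n = 0; n < 1000; n++)
            if (Sum() % Size != 0) consistent = false;
        writer.Join();
        return consistent;
    }

    public static void Main()
    {
        Console.WriteLine("consistent sums: {0}", RunDemo());
    }
}
```

Acquiring and releasing the lock also acts as a memory fence in .NET, which is why this form needs no explicit barrier calls.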

Write buffers and copy-back caches may also be instantiated outside the KiwiC-generated code in uncore structures that are part of the substrate for a given FPGA blade. KiwiC has no control over these.

We are writing a paper that explores this space. C# provides the Thread.MemoryBarrier() call to control memory read and write re-ordering between threads, but in the meantime you have to use Kiwi.Pause() to ensure write ordering.
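As an illustration of the fence-based style in ordinary .NET, the sketch below publishes a data word and then a ready flag, with Thread.MemoryBarrier() between the two writes and around the corresponding reads. The class FencedPublish and its fields are hypothetical. This shows software behaviour only; it does not imply that KiwiC honours the fence, and under KiwiC a Kiwi.Pause() between the two writes would play the ordering role instead.

```csharp
using System;
using System.Threading;

// Hypothetical publish/consume pair ordered with explicit full fences.
public static class FencedPublish
{
    static int payload;   // ordinary data word
    static bool ready;    // ordinary flag; ordering comes from the fences, not from volatile

    static void Producer()
    {
        payload = 42;
        Thread.MemoryBarrier();   // the payload write may not be reordered after the flag write
        ready = true;
    }

    static int Consumer()
    {
        while (true)
        {
            Thread.MemoryBarrier();   // forces a fresh read of the flag on each pass
            if (ready) break;
        }
        Thread.MemoryBarrier();       // the payload read may not be reordered before the flag read
        return payload;
    }

    // Starts the consumer first, then publishes; returns the value the consumer observed.
    public static int RunDemo()
    {
        payload = 0;
        ready = false;
        int seen = -1;
        var consumer = new Thread(() => { seen = Consumer(); });
        consumer.Start();
        Producer();
        consumer.Join();
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine("consumer saw {0}", RunDemo());   // prints: consumer saw 42
    }
}
```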

A good and relevant article on memory modelling is .NET Atomics and Memory Model Semantics by Alex Ronne Petersen.

Conclusion

The FPGA libraries contain dual-port RAMs. These can be used for sharing data between up to two threads. This example also shared data via a scalar variable, called command2. Kiwi supports any number of threads reading or writing shared scalar variables, but there are technology restrictions on shared access to arrays. Where an array is small enough to be instantiated as an FPGA on-chip B-RAM (block RAM), and overrides are not applied, such a B-RAM will be used. Both Xilinx and Altera provide FPGAs with on-chip, dual-ported B-RAMs with a synchronous read latency of one cycle.

If three threads operated on the shared memory, Kiwi would generate an instance of a triple-ported SRAM module, but this would not be found in the technology library when the FPGA tools were applied. A hardware designer could implement such a device, but it would probably have to be variable latency (i.e. have handshake wires), and this can be requested with CSharp attributes on the array instance.

The preferred/supported design style when three or more threads share an array is to ensure the underlying memory is 'off-chip'; each thread then accesses it via its own load/store port. These load/store ports have handshaking that allows structural hazards arising from simultaneous attempted accesses to be serialised in the time domain. Arrays are mapped off-chip if they are above a standing size threshold for block RAM (BRAM) or if they are marked up with a Kiwi C# attribute. Further details are (or should be) in the Kiwi manual.

By default, KiwiC will use one port on an SRAM for each thread that operates on it. However, setting the PortsPerThread parameter to greater than one gives each thread greater access bandwidth per clock cycle. Note that Xilinx Virtex BRAM supports up to two ports per BRAM in total, so two is the maximum sensible value for ports per thread, and even that may only be sensible if just one thread accesses the RAM. In the future, several threads in the same clock domain might get to share the physical ports if the compiler can spot that they are temporally disjoint (i.e. never concurrent).

... we need to add a little more explanation or forward reference here please ...


Updated April 2016