Data can commonly be shared between threads using arrays and scalars. Typically, a protocol using scalars is used to signal the readiness and/or location of data in shared arrays.
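The pattern can be distilled as the following software-only sketch, which uses no Kiwi-specific API: a volatile scalar acts as a command mailbox, and the worker thread signals completion by setting it back to idle. (The class and method names here are illustrative, not part of any Kiwi library.)

```csharp
using System;
using System.Threading;

public class ScalarHandshakeDemo
{
    const int N = 4;
    static int[] sharedData = new int[N];
    static volatile char command = 'I';   // 'I' = idle; anything else = a work request
    static volatile bool exiting = false;

    static void Worker()
    {
        while (!exiting)
        {
            while (command == 'I' && !exiting) Thread.Yield(); // poll the mailbox
            if (command == 'F')                                // 'F' = fill the shared array
                for (int i = 0; i < N; i++) sharedData[i] = i * 10;
            command = 'I';                                     // signal completion
        }
    }

    public static int Run()
    {
        var t = new Thread(Worker);
        t.Start();
        command = 'F';                          // request work
        while (command != 'I') Thread.Yield();  // wait for completion
        exiting = true;
        t.Join();
        int sum = 0;
        foreach (int v in sharedData) sum += v;
        return sum;                             // 0 + 10 + 20 + 30 = 60
    }

    public static void Main() { Console.WriteLine(ScalarHandshakeDemo.Run()); }
}
```

Because `command` is volatile, its writes have release semantics and its reads have acquire semantics under the CLR memory model, so the array contents written before `command = 'I'` are visible to the thread that observes the `'I'`.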
For efficient software execution, the Monitor.Pulse() and Monitor.Wait() primitives should be used. Kiwi will mostly ignore them, since concurrent hardware does not dissipate dynamic energy while polling for an input event; this is a fundamental operating principle of hardware.
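For reference, the conventional software idiom alluded to above looks like this; the consumer blocks inside Monitor.Wait() rather than burning CPU time polling. This is standard System.Threading code, not Kiwi-specific, and the names are illustrative.

```csharp
using System;
using System.Threading;

public class PulseWaitDemo
{
    static readonly object gate = new object();
    static int mailbox = 0;    // 0 = empty
    static int received = 0;

    static void Consumer()
    {
        lock (gate)
        {
            // Guarded wait: re-check the condition after every wake-up.
            while (mailbox == 0) Monitor.Wait(gate);  // sleep, don't spin
            received = mailbox;
        }
    }

    public static int Run()
    {
        var t = new Thread(Consumer);
        t.Start();
        lock (gate)
        {
            mailbox = 42;
            Monitor.Pulse(gate);   // wake the consumer if it is already waiting
        }
        t.Join();
        return received;           // 42
    }

    public static void Main() { Console.WriteLine(PulseWaitDemo.Run()); }
}
```

The guarded `while` loop makes the code correct even if the producer runs before the consumer reaches Wait(): in that case the consumer sees the non-zero mailbox and never waits.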
Here we present a simple example of sharing data between two threads in that sort of style.
The following C# file was compiled with the Mono mcs compiler.
// Kiwi Scientific Acceleration Example - Illustration of inter-thread IPC using shared dual port memory.
// (C) 2016 DJ Greaves, University of Cambridge, Computer Laboratory.
using System;
using System.Text;
using KiwiSystem;
using System.Diagnostics;
using System.Threading;

public class test50
{
    const int problemSize = 30;
    static int [] sharedData = new int [problemSize];
    static volatile bool exiting = false;
    static volatile int sum = 12345678;
    static volatile char command2 = 'x';

    public static void secondProcess()
    {
        while (!exiting)
        {
            while (command2 == 'I' && !exiting) Kiwi.Pause();
            //Console.WriteLine("second process dispatching {0}", command2);
            if (command2 == 'U') // Update data
            {
                for (int i=0; i<problemSize; i++) { Kiwi.Pause(); sharedData[i] = i+sum; }
            }
            else if (command2 == 'S') // Compute Sum
            {
                sum = 0;
                for (int i=0; i<problemSize; i++) { Kiwi.Pause(); sum += sharedData[i]; }
            }
            else if (command2 == 'P') // Print Sum
            {
                Console.WriteLine("sp: data sum {0}", sum);
                Kiwi.Pause();
            }
            else if (command2 == 'D') // Print Data
            {
                //Console.WriteLine("second process Print Data {0}", command2);
                for (int i=0; i<problemSize; i++)
                {
                    Console.WriteLine("sp: Print data: sharedData[{0}] = {1}", i, sharedData[i]);
                    Kiwi.Pause();
                }
            }
            //Console.WriteLine("second process dispatched {0}", command2);
            command2 = 'I'; // Set back to idle
        }
    }

    public static int computeSum()
    {
        int mysum = 0;
        for (int i=0; i<problemSize; i++) { Kiwi.Pause(); mysum += sharedData[i]; }
        return mysum;
    }

    public static void clearto(int v0)
    {
        int vv = v0;
        for (int i=0; i<problemSize; i++) { sharedData[i] = vv++; Kiwi.Pause(); }
        sharedData[problemSize-1] = 99;
    }

    public static void test50_phase0()
    {
        Console.WriteLine("Kiwi Demo - Test50 phase0 starting.");
        Console.WriteLine(" Test50 Remote Status={0}, sum= {1}", command2, sum);
        clearto(30);
        Kiwi.Pause();
        while (command2 != 'I') Kiwi.Pause();
        command2 = 'D'; Kiwi.Pause();
        while (command2 != 'I') Kiwi.Pause();
        for (int iteration = 0; iteration < 3; iteration++)
        {
            Console.WriteLine(" Test50 fancy={0} rs={1} sum={2}.", iteration, command2, sum);
            command2 = 'P'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            command2 = 'S'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            command2 = 'P'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            command2 = 'U'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            command2 = 'S'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            command2 = 'P'; Kiwi.Pause();
            while (command2 != 'I') Kiwi.Pause();
            clearto(40 + iteration);
            Console.WriteLine(" point2 {0} {1}.", command2, sum);
            Kiwi.Pause();
        }
        Console.WriteLine("Finished main process.");
    }

    [Kiwi.HardwareEntryPoint()]
    public static void Main()
    {
        Console.WriteLine("Kiwi Demo - Test50 starting.");
        Thread thread1 = new Thread(new ThreadStart(secondProcess));
        thread1.Start();
        Kiwi.Pause();
        test50_phase0();
        exiting = true;
        Console.WriteLine("Test50 starting join.");
        //thread1.Join();
        Console.WriteLine("Test50 done.");
    }
}
The RTL contains an instance of a dual-port RAM containing 30 words of 32 bits:
... snip ...
CV_2P_SSRAM_FL1 #(6'd32, 3'd5, 5'd30, 6'd32) A_SINT_CC_SCALbx10_ARA0(
    clk, reset,
    A_SINT_CC_SCALbx10_ARA0_RDD0, A_SINT_CC_SCALbx10_ARA0_AD0,
    A_SINT_CC_SCALbx10_ARA0_WEN0, A_SINT_CC_SCALbx10_ARA0_REN0,
    A_SINT_CC_SCALbx10_ARA0_WRD0,
    A_SINT_CC_SCALbx10_ARA0_RDD1, A_SINT_CC_SCALbx10_ARA0_AD1,
    A_SINT_CC_SCALbx10_ARA0_WEN1, A_SINT_CC_SCALbx10_ARA0_REN1,
    A_SINT_CC_SCALbx10_ARA0_WRD1);
// 3 vectors of width 5
// 4 vectors of width 1
// 3 vectors of width 32
// 1 vectors of width 16
// 256 bits in scalar variables
// Total state bits in module = 387 bits.
// 130 continuously assigned (wire/non-state) bits
// cell CV_2P_SSRAM_FL1 count=1
// Total number of leaf cells = 1
endmodule
Full RTL output file: test50.v (Verilog).
A simple testbench instantiates the design under test and provides its clock and reset:

// Kiwi Scientific Acceleration
// (C) 2010-16 DJ Greaves, University of Cambridge.
`timescale 1ns/1ns
module SIMSYS();
   reg clk, reset;
   initial begin reset = 1; clk = 1; # 33 reset = 0; end
   always #5 clk = !clk; // 10ns period for clock = 100 MHz (lowish FPGA clock freq!)
   initial begin
      # (100 * 1000 * 1000)
      $display("Finish HDL simulation on timeout %t.", $time);
      $finish();
   end
   initial begin $dumpfile("vcd.vcd"); $dumpvars(); end
   DUT the_dut(.clk(clk), .reset(reset));
endmodule
The generated RTL is run under this testbench on the Icarus Verilog simulator:
VCD info: dumpfile vcd.vcd opened for output.
Kiwi Demo - Test50 starting.
Kiwi Demo - Test50 phase0 starting.
 Test50 Remote Status=x, sum= 12345678
sp: Print data: sharedData[0] = 30
sp: Print data: sharedData[1] = 31
sp: Print data: sharedData[2] = 32
sp: Print data: sharedData[3] = 33
sp: Print data: sharedData[4] = 34
sp: Print data: sharedData[5] = 35
sp: Print data: sharedData[6] = 36
sp: Print data: sharedData[7] = 37
sp: Print data: sharedData[8] = 38
sp: Print data: sharedData[9] = 39
sp: Print data: sharedData[10] = 40
sp: Print data: sharedData[11] = 41
sp: Print data: sharedData[12] = 42
sp: Print data: sharedData[13] = 43
sp: Print data: sharedData[14] = 44
sp: Print data: sharedData[15] = 45
sp: Print data: sharedData[16] = 46
sp: Print data: sharedData[17] = 47
sp: Print data: sharedData[18] = 48
sp: Print data: sharedData[19] = 49
sp: Print data: sharedData[20] = 50
sp: Print data: sharedData[21] = 51
sp: Print data: sharedData[22] = 52
sp: Print data: sharedData[23] = 53
sp: Print data: sharedData[24] = 54
sp: Print data: sharedData[25] = 55
sp: Print data: sharedData[26] = 56
sp: Print data: sharedData[27] = 57
sp: Print data: sharedData[28] = 58
sp: Print data: sharedData[29] = 99
 Test50 fancy=0 rs=I sum=12345678.
sp: data sum 12345678
sp: data sum 1375
sp: data sum 41685
 point2 I 41685.
 Test50 fancy=1 rs=I sum=41685.
sp: data sum 41685
sp: data sum 1665
sp: data sum 50385
 point2 I 50385.
 Test50 fancy=2 rs=I sum=50385.
sp: data sum 50385
sp: data sum 1694
sp: data sum 51255
 point2 I 51255.
Finished main process.
Test50 starting join.
Test50 done.
MONO_PATH=/home/djg11/d320/hprls/kiwipro/kiwic/distro/support mono test50.exe
Kiwi Demo - Test50 starting.
Kiwi Demo - Test50 phase0 starting.
 Test50 Remote Status=I, sum= 12345678
sp: Print data: sharedData[0] = 30
sp: Print data: sharedData[1] = 31
sp: Print data: sharedData[2] = 32
sp: Print data: sharedData[3] = 33
sp: Print data: sharedData[4] = 34
sp: Print data: sharedData[5] = 35
sp: Print data: sharedData[6] = 36
sp: Print data: sharedData[7] = 37
sp: Print data: sharedData[8] = 38
sp: Print data: sharedData[9] = 39
sp: Print data: sharedData[10] = 40
sp: Print data: sharedData[11] = 41
sp: Print data: sharedData[12] = 42
sp: Print data: sharedData[13] = 43
sp: Print data: sharedData[14] = 44
sp: Print data: sharedData[15] = 45
sp: Print data: sharedData[16] = 46
sp: Print data: sharedData[17] = 47
sp: Print data: sharedData[18] = 48
sp: Print data: sharedData[19] = 49
sp: Print data: sharedData[20] = 50
sp: Print data: sharedData[21] = 51
sp: Print data: sharedData[22] = 52
sp: Print data: sharedData[23] = 53
sp: Print data: sharedData[24] = 54
sp: Print data: sharedData[25] = 55
sp: Print data: sharedData[26] = 56
sp: Print data: sharedData[27] = 57
sp: Print data: sharedData[28] = 58
sp: Print data: sharedData[29] = 99
 Test50 fancy=0 rs=I sum=12345678.
sp: data sum 12345678
sp: data sum 1375
sp: data sum 41685
 point2 I 41685.
 Test50 fancy=1 rs=I sum=41685.
sp: data sum 41685
sp: data sum 1665
sp: data sum 50385
 point2 I 50385.
 Test50 fancy=2 rs=I sum=50385.
sp: data sum 50385
sp: data sum 1694
sp: data sum 51255
 point2 I 51255.
Finished main process.
Test50 starting join.
Test50 done.
The output from Mono and from Icarus is exactly the same. The built-in simulator, accessible by adding '-sim 1000' to the KiwiC command line, also generates the same console output and a similar VCD waveform dump.
KiwiC does not currently support fine-grained store order control. Where a number of writes are generated in one major cycle (delimited by hard or soft pauses), the writes within that major cycle are freely reordered by the restructure recipe stage to maximise memory port throughput. However, KiwiC already maintains ordering in PLI and other system calls, so extending this preservation to remotely-visible writes could easily be added in the near future.
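As a concrete illustration of why this matters, consider publishing a payload word together with a valid flag. If both stores land in the same major cycle they may be reordered, so a remote observer could see the flag before the payload; separating them with a pause forces the order. This is a sketch only: the `Kiwi` class below is a stand-in stub so the code compiles outside the Kiwi toolchain (the real Kiwi.Pause() ends the current major cycle in generated hardware).

```csharp
using System;

// Stand-in stub for the real KiwiSystem library (assumption: the real
// Kiwi.Pause() delimits a major cycle; here it is a no-op for illustration).
static class Kiwi { public static void Pause() { } }

public class WriteOrderDemo
{
    static int[] sharedData = new int[8];
    static volatile bool ready = false;

    public static void Publish(int index, int value)
    {
        sharedData[index] = value;  // store 1: the payload
        // Without an intervening pause, KiwiC may reorder these two stores
        // within one major cycle, so an external observer could see
        // ready == true before the payload has been written.
        Kiwi.Pause();               // end the major cycle: payload committed
        ready = true;               // store 2: the flag, in a later cycle
    }

    public static int Run()
    {
        Publish(3, 99);
        return ready ? sharedData[3] : -1;  // 99
    }

    public static void Main() { Console.WriteLine(WriteOrderDemo.Run()); }
}
```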
When doing such manual, low-level concurrency programming, as on this page, we should insert manual memory fences, or else we should use C#'s synchronized methods.
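In standard C#, the two options look like this; the sketch is illustrative, and nothing here is Kiwi-specific.

```csharp
using System;
using System.Threading;
using System.Runtime.CompilerServices;

public class FenceDemo
{
    static int payload;
    static bool flag;

    // Option 1: an explicit full fence between the remotely-visible stores,
    // so payload is globally visible before flag is raised.
    public static void PublishWithFence(int v)
    {
        payload = v;
        Thread.MemoryBarrier();
        flag = true;
    }

    // Option 2: a synchronized method; the monitor's acquire/release
    // ordering subsumes the explicit fences.
    [MethodImpl(MethodImplOptions.Synchronized)]
    public static int ReadSynchronized()
    {
        return flag ? payload : -1;
    }

    public static int Run()
    {
        PublishWithFence(7);
        return ReadSynchronized();  // 7
    }

    public static void Main() { Console.WriteLine(FenceDemo.Run()); }
}
```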
Write buffers and copy-back caches may also be instantiated outside the KiwiC-generated code in uncore structures that are part of the substrate for a given FPGA blade. KiwiC has no control over these.
We are writing a paper that explores this space. C# provides the Thread.MemoryBarrier() call to control memory read and write re-ordering between threads, but in the meantime you have to use Kiwi.Pause() to ensure write ordering.
A good and relevant article on memory modelling is ".NET Atomics and Memory Model Semantics" by Alex Rønne Petersen.
The FPGA libraries contain dual-port RAMs. These can be used for sharing data between up to two threads. This example also shares data via a scalar variable, called command2. Kiwi supports any number of threads reading or writing shared scalar variables, but there are technology restrictions on shared access to arrays. Where an array is small enough to be instantiated as an FPGA on-chip BRAM (block RAM), and no overrides are applied, such a BRAM will be used. Both Xilinx and Altera provide FPGAs with on-chip, dual-ported BRAMs with a synchronous read latency of one cycle.
If three threads operated on the shared memory, Kiwi would generate an instance of a triple-ported SRAM module, but no such cell would be found in the technology library when the FPGA tools were applied. A hardware designer could implement such a device, but it would probably have to be variable-latency (i.e. have handshake wires), and this can be requested with C# attributes on the array instance.
The preferred/supported design style when three or more threads share an array is to ensure the underlying memory is 'off-chip'; each thread then makes access to it via its own load/store port. These load/store ports have handshaking that allows structural hazards, arising from simultaneous attempted access, to be serialised in the time domain. Arrays are mapped off-chip if they are above a certain size for block RAM (BRAM) implementation or if they are marked up with a Kiwi C# attribute. Further details are in the Kiwi manual.
By default, KiwiC will use one port on an SRAM for each thread that operates on it. However, setting the PortsPerThread parameter to a value greater than one gives each thread greater access bandwidth per clock cycle. Note that Xilinx Virtex BRAM supports up to two ports per BRAM in total, so a ports-per-thread setting of two is the maximum sensible value, and even that is only sensible when a single thread makes access to the RAM. In the future, several threads in the same clock domain might share the physical ports if the compiler can spot that their accesses are temporally disjoint (i.e. never concurrent).
Updated April 2016.