Kiwi CHStone dfsin Demo

The CHStone testsuite contains a number of simple C programs designed to test an HLS system.

The dfsin test computes the sine function using a Taylor series but, rather than using the built-in floating-point support available in modern HLS systems, it includes a complete, synthesisable library of double-precision floating-point routines, called softfloat.c.

We manually converted softfloat.c and dfsin.c to C#.

Convert dfsin.c to C#

There is not a lot of difference between C and C# for straightforward imperative programming: the main thing is that C# supports far fewer implicit type coercions, so explicit casts have to be added at various points:

/*
// CSharp form. - Converted manually to CSharp as a Kiwi experiment.
+--------------------------------------------------------------------------+
| CHStone : a suite of benchmark programs for C-based High-Level Synthesis |
| ======================================================================== |
|                                                                          |
| * Collected and Modified : Y. Hara, H. Tomiyama, S. Honda,               |
|                            H. Takada and K. Ishii                        |
|                            Nagoya University, Japan                      |
|                                                                          |
| * Remark :                                                               |
|    1. This source code is modified to unify the formats of the benchmark |
|       programs in CHStone.                                               |
|    2. Test vectors are added for CHStone.                                |
|    3. If "main_result" is 0 at the end of the program, the program is    |
|       correctly executed.                                                |
|    4. Please follow the copyright of each benchmark program.             |
+--------------------------------------------------------------------------+
*/
/*
 * Copyright (C) 2008
 * Y. Hara, H. Tomiyama, S. Honda, H. Takada and K. Ishii
 * Nagoya University, Japan
 * All rights reserved.
 *
 * Disclaimer of Warranty
 *
 * These software programs are available to the user without any license fee or
 * royalty on an "as is" basis. The authors disclaims any and all warranties, 
 * whether express, implied, or statuary, including any implied warranties or 
 * merchantability or of fitness for a particular purpose. In no event shall the
 * copyright-holder be liable for any incidental, punitive, or consequential damages
 * of any kind whatsoever arising from the use of these programs. This disclaimer
 * of warranty extends to the user of these programs and user's customers, employees,
 * agents, transferees, successors, and assigns.
 *
 */

using float64 = System.UInt64;
using KiwiSystem; using System;

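static class dfsin
{
  static int gloops = 0; // Global iteration counter - for debugging. (Declaration assumed; it is not shown in this listing.)
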
  public static float64 sin (float64 rad)
  {
    float64 app = rad;
    float64 diff = rad;
    int inc = 1;
    float64 m_rad2 = softfloat.float64_neg (softfloat.float64_mul (rad, rad));
    do
      {
        gloops++; // Global iteration counter - for debugging.
	diff = softfloat.float64_div(softfloat.float64_mul (diff, m_rad2),
				     softfloat.int32_to_float64 ((2 * inc) * (2 * inc + 1)));
	app = softfloat.float64_add (app, diff);
	inc++;
      }
    while (softfloat.float64_ge (softfloat.float64_abs (diff), 0x3ee4f8b588e368f1UL));	/* 0.00001 */
    return app;
  }

/*
+--------------------------------------------------------------------------+
| * Test Vectors (added for CHStone)                                       |
|     test_in : input data                                                 |
|     test_out : expected output data                                      |
+--------------------------------------------------------------------------+
*/
  const int  N = 36;

  static float64 [] test_in =
    {
      0x0000000000000000UL,	/*      0  */
      0x3fc65717fced55c1UL,	/*   PI/18 */
      0x3fd65717fced55c1UL,	/*   PI/9  */
      0x3fe0c151fdb20051UL,	/*   PI/6  */
      0x3fe65717fced55c1UL,	/*  2PI/9  */
      0x3febecddfc28ab31UL,	/*  5PI/18 */
      0x3ff0c151fdb20051UL,	/*   PI/3  */
      0x3ff38c34fd4fab09UL,	/*  7PI/18 */
      0x3ff65717fced55c1UL,	/*  4PI/9  */
      0x3ff921fafc8b0079UL,	/*   PI/2  */
      0x3ffbecddfc28ab31UL,	/*  5PI/9  */
      0x3ffeb7c0fbc655e9UL,	/* 11PI/18 */
      0x4000c151fdb20051UL,	/*  2PI/3  */
      0x400226c37d80d5adUL,	/* 13PI/18 */
      0x40038c34fd4fab09UL,	/*  7PI/9  */
      0x4004f1a67d1e8065UL,	/*  5PI/6  */
      0x40065717fced55c1UL,	/*  8PI/9  */
      0x4007bc897cbc2b1dUL,	/* 17PI/18 */
      0x400921fafc8b0079UL,	/*   PI    */
      0x400a876c7c59d5d5UL,	/* 19PI/18 */
      0x400becddfc28ab31UL,	/* 10PI/9  */
      0x400d524f7bf7808dUL,	/*  7PI/6  */
      0x400eb7c0fbc655e9UL,	/* 11PI/9  */
      0x40100e993dca95a3UL,	/* 23PI/18 */
      0x4010c151fdb20051UL,	/*  8PI/6  */
      0x4011740abd996affUL,	/* 25PI/18 */
      0x401226c37d80d5adUL,	/* 13PI/9  */
      0x4012d97c3d68405bUL,	/*  3PI/2  */
      0x40138c34fd4fab09UL,	/* 14PI/9  */
      0x40143eedbd3715b7UL,	/* 29PI/18 */
      0x4014f1a67d1e8065UL,	/* 15PI/9  */
      0x4015a45f3d05eb13UL,	/* 31PI/18 */
      0x40165717fced55c1UL,	/* 16PI/9  */
      0x401709d0bcd4c06fUL,	/* 33PI/18 */
      0x4017bc897cbc2b1dUL,	/* 17PI/9  */
      0x40186f423ca395cbUL	/* 35PI/18 */
    };
  



  static float64 [] test_out = {
    0x0000000000000000UL,	/*  0.000000 */
    0x3fc63a1a335aadcdUL,	/*  0.173648 */
    0x3fd5e3a82b09bf3eUL,	/*  0.342020 */
    0x3fdfffff91f9aa91UL,	/*  0.500000 */
    0x3fe491b716c242e3UL,	/*  0.642787 */
    0x3fe8836f672614a6UL,	/*  0.766044 */
    0x3febb67ac40b2bedUL,	/*  0.866025 */
    0x3fee11f6127e28adUL,	/*  0.939693 */
    0x3fef838b6adffac0UL,	/*  0.984808 */
    0x3fefffffe1cbd7aaUL,	/*  1.000000 */
    0x3fef838bb0147989UL,	/*  0.984808 */
    0x3fee11f692d962b4UL,	/*  0.939693 */
    0x3febb67b77c0142dUL,	/*  0.866026 */
    0x3fe883709d4ea869UL,	/*  0.766045 */
    0x3fe491b81d72d8e8UL,	/*  0.642788 */
    0x3fe00000ea5f43c8UL,	/*  0.500000 */
    0x3fd5e3aa4e0590c5UL,	/*  0.342021 */
    0x3fc63a1d2189552cUL,	/*  0.173648 */
    0x3ea6aedffc454b91UL,	/*  0.000001 */
    0xbfc63a1444ddb37cUL,	/* -0.173647 */
    0xbfd5e3a4e68f8f3eUL,	/* -0.342019 */
    0xbfdffffd494cf96bUL,	/* -0.499999 */
    0xbfe491b61cb9a3d3UL,	/* -0.642787 */
    0xbfe8836eb2dcf815UL,	/* -0.766044 */
    0xbfebb67a740aae32UL,	/* -0.866025 */
    0xbfee11f5912d2157UL,	/* -0.939692 */
    0xbfef838b1ac64afcUL,	/* -0.984808 */
    0xbfefffffc2e5dc8fUL,	/* -1.000000 */
    0xbfef838b5ea2e7eaUL,	/* -0.984808 */
    0xbfee11f7112dae27UL,	/* -0.939693 */
    0xbfebb67c2c31cb4aUL,	/* -0.866026 */
    0xbfe883716e6fd781UL,	/* -0.766045 */
    0xbfe491b9cd1b5d56UL,	/* -0.642789 */
    0xbfe000021d0ca30dUL,	/* -0.500001 */
    0xbfd5e3ad0a69caf7UL,	/* -0.342021 */
    0xbfc63a23c48863ddUL	/* -0.173649 */
  };


  [Kiwi.OutputBitPort("done")]  static bool done = false;

  [Kiwi.HardwareEntryPoint()]
  public static int Main()
  {
    bm_main(true);
    done = true;
    Kiwi.Pause();
    return 0;
  }

  [Kiwi.InputWordPort(63, 0)] static ulong arg;
  [Kiwi.OutputWordPort(63, 0)] static ulong result;

  //[Kiwi.HardwareEntryPoint()]
  // Pipelined-accelerator entry point
  public static void synthroot()
  {
    result = sin(arg);
  }

  public static int bm_main (bool verbosef)
  {
    Console.WriteLine("dfsin: Testbench start");
    int main_result;
    int i;
    main_result = 0;
    for (i = 0; i < N; i++)
      {
	float64 result;
	Kiwi.Pause();
	result = sin (test_in[i]);
	Kiwi.Pause();
	main_result += (result == test_out[i])?1:0;
	if (verbosef)
	  {
	    Console.WriteLine("Test: input={0:X} expected={1:X} output={2:X} ", test_in[i], test_out[i], result);
	    if ((result ^ test_out[i]) != 0) Console.WriteLine("   hamming error={0:X}", result ^ test_out[i]);
	  }
      }
    if (verbosef)
      {
	Console.WriteLine ("Result: {0}/{1}", main_result, N);
	if (main_result == 36) {
	  Console.WriteLine("RESULT: PASS");
	} else {
	  Console.WriteLine("RESULT: FAIL");
	}
      }
    Console.WriteLine("dfsin: Testbench finished");
    return main_result;
  }
}
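
To illustrate the kind of explicit cast involved, here is a minimal sketch of extracting the fields of a float64 bit pattern. It is not taken from softfloat.cs, and the class and method names are hypothetical. In C, the narrowing from a 64-bit value to int and the use of an integer as a truth value are both implicit; in C# each has to be written out.

```csharp
using System;
using float64 = System.UInt64;

static class CastDemo
{
    // C would accept `return a >> 63;` and treat the integer as a flag.
    public static bool SignOf(float64 a)
    {
        return (a >> 63) != 0;            // explicit comparison yields a bool
    }

    // C would accept `return (a >> 52) & 0x7FF;` with implicit narrowing to int.
    public static int ExponentOf(float64 a)
    {
        return (int)((a >> 52) & 0x7FF);  // explicit narrowing cast
    }

    // The low 52 bits hold the fraction; no cast is needed as the type is unchanged.
    public static ulong FractionOf(float64 a)
    {
        return a & 0x000FFFFFFFFFFFFFUL;
    }

    static void Main()
    {
        float64 piOver6 = 0x3fe0c151fdb20051UL;   // the PI/6 entry of test_in
        Console.WriteLine("sign={0} exp={1} frac={2:x}",
                          SignOf(piOver6), ExponentOf(piOver6), FractionOf(piOver6));
    }
}
```

For the PI/6 test vector the biased exponent field is 0x3fe, i.e. 1022, which corresponds to a value in the range 0.5 to 1.0.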

Convert softfloat.c to softfloat.cs

To save others the effort, here is our manually transcoded softfloat.cs source file. It is, more or less, a line-for-line conversion.

softfloat.cs

KiwiC Compile

mcs dfsin.cs /r:../softfloat.dll /r:/home/djg11/d320/hprls/kiwipro/kiwic/distro/support/Kiwi.dll
dfsin.cs(78,21): warning CS0414: The private field `dfsin.test_in_copy' is assigned but its value is never used
dfsin.cs(203,45): warning CS0414: The private field `dfsin.done' is assigned but its value is never used
dfsin.cs(216,45): warning CS0414: The private field `dfsin.result' is assigned but its value is never used
Compilation succeeded - 3 warning(s)
/home/djg11/d320/hprls/kiwipro/kiwic/distro/bin/kiwic -vnl-resets=synchronous \
   -bevelab-default-pause-mode=soft -bevelab-soft-pause-threshold=10  \
   dfsin.exe -vnl=dfsin.v -vnl-resets=synchronous ../softfloat.dll

RTL SIM

With a suitable testbench we can perform RTL simulation:

module SIMSYS();
   reg clk, reset;
   initial begin reset = 1; clk = 0; #33 reset = 0; end
   always #5 clk = !clk;

   wire done;
   wire [31:0] codesent;
   dfsin the_dfsin(.clk(clk), 
	       .reset(reset),
	       .done(done)
	       );

   always @(posedge clk) begin
      if (done) begin
	 $display("Exit on done asserted after %d clocks.", $time/10);
	 $finish;
	 end
   end
endmodule

This gives the following output:

iverilog dfsin.v vsys.v 
./a.out| tee icarus.spool
VCD info: dumpfile vcd.vcd opened for output.
dfsin: Testbench start
Test: input=0000000000000000 expected=0000000000000000 output=0000000000000000 
Test: input=3fc65717fced55c1 expected=3fc63a1a335aadcd output=3fc63a1a335aadcd 
Test: input=3fd65717fced55c1 expected=3fd5e3a82b09bf3e output=3fd5e3a82b09bf3e 
Test: input=3fe0c151fdb20051 expected=3fdfffff91f9aa91 output=3fdfffff91f9aa91 
Test: input=3fe65717fced55c1 expected=3fe491b716c242e3 output=3fe491b716c242e3 
Test: input=3febecddfc28ab31 expected=3fe8836f672614a6 output=3fe8836f672614a6 
Test: input=3ff0c151fdb20051 expected=3febb67ac40b2bed output=3febb67ac40b2bed 
Test: input=3ff38c34fd4fab09 expected=3fee11f6127e28ad output=3fee11f6127e28ad 
Test: input=3ff65717fced55c1 expected=3fef838b6adffac0 output=3fef838b6adffac0 
Test: input=3ff921fafc8b0079 expected=3fefffffe1cbd7aa output=3fefffffe1cbd7aa 
Test: input=3ffbecddfc28ab31 expected=3fef838bb0147989 output=3fef838bb0147989 
Test: input=3ffeb7c0fbc655e9 expected=3fee11f692d962b4 output=3fee11f692d962b4 
Test: input=4000c151fdb20051 expected=3febb67b77c0142d output=3febb67b77c0142d 
Test: input=400226c37d80d5ad expected=3fe883709d4ea869 output=3fe883709d4ea869 
Test: input=40038c34fd4fab09 expected=3fe491b81d72d8e8 output=3fe491b81d72d8e8 
Test: input=4004f1a67d1e8065 expected=3fe00000ea5f43c8 output=3fe00000ea5f43c8 
Test: input=40065717fced55c1 expected=3fd5e3aa4e0590c5 output=3fd5e3aa4e0590c5 
Test: input=4007bc897cbc2b1d expected=3fc63a1d2189552c output=3fc63a1d2189552c 
Test: input=400921fafc8b0079 expected=3ea6aedffc454b91 output=3ea6aedffc454b91 
Test: input=400a876c7c59d5d5 expected=bfc63a1444ddb37c output=bfc63a1444ddb37c 
Test: input=400becddfc28ab31 expected=bfd5e3a4e68f8f3e output=bfd5e3a4e68f8f3e 
Test: input=400d524f7bf7808d expected=bfdffffd494cf96b output=bfdffffd494cf96b 
Test: input=400eb7c0fbc655e9 expected=bfe491b61cb9a3d3 output=bfe491b61cb9a3d3 
Test: input=40100e993dca95a3 expected=bfe8836eb2dcf815 output=bfe8836eb2dcf815 
Test: input=4010c151fdb20051 expected=bfebb67a740aae32 output=bfebb67a740aae32 
Test: input=4011740abd996aff expected=bfee11f5912d2157 output=bfee11f5912d2157 
Test: input=401226c37d80d5ad expected=bfef838b1ac64afc output=bfef838b1ac64afc 
Test: input=4012d97c3d68405b expected=bfefffffc2e5dc8f output=bfefffffc2e5dc8f 
Test: input=40138c34fd4fab09 expected=bfef838b5ea2e7ea output=bfef838b5ea2e7ea 
Test: input=40143eedbd3715b7 expected=bfee11f7112dae27 output=bfee11f7112dae27 
Test: input=4014f1a67d1e8065 expected=bfebb67c2c31cb4a output=bfebb67c2c31cb4a 
Test: input=4015a45f3d05eb13 expected=bfe883716e6fd781 output=bfe883716e6fd781 
Test: input=40165717fced55c1 expected=bfe491b9cd1b5d56 output=bfe491b9cd1b5d56 
Test: input=401709d0bcd4c06f expected=bfe000021d0ca30d output=bfe000021d0ca30d 
Test: input=4017bc897cbc2b1d expected=bfd5e3ad0a69caf7 output=bfd5e3ad0a69caf7 
Test: input=40186f423ca395cb expected=bfc63a23c48863dd output=bfc63a23c48863dd 
Result: 36/36
RESULT: PASS
dfsin: Testbench finished
Kiwi done at time 673110.  67310 cycles.

All 36 computations of sine gave exactly the same result under mono as under Kiwi. Moreover, these were the same results as are canned in the dfsin testbench as an expected-output array.

Performance

KiwiC supports several fundamental compilation modes. The default 'sequencer mode' provides no outer loop parallelism and gives a basic level of performance:

KiwiC - Sequencer Mode Performance

In sequencer mode, a custom datapath and controller are generated. The trade-off between time and space depends on the 'logic cost' command-line setting.

Performance is also greatly influenced by the type of multipliers and dividers used. In the unmodified dfsin compilation, KiwiC used one 32-bit, signed multiplier and two 64-bit, unsigned multipliers with latencies of 5 and 7 clock cycles respectively. It also used two variable-latency dividers where the latency is twice the distance between the m.s.b. of the numerator and denominator.

//   cell CV_INT_FL1_MULTIPLIER_S count=1
//   cell CV_INT_VL_DIVIDER_US count=2
//   cell CV_INT_FL5_MULTIPLIER_US count=2

The 64-bit multipliers that were instantiated only multiply 32-bit quantities, but a 64-bit result is needed. The FPGA tools notice this and remove some of the DSP slices that would be required for 64-bit input operands. However, KiwiC currently still allocates the extra two clock cycles of latency for this in its static schedules. This extra time no doubt relaxes timing requirements elsewhere, assuming d-type migration is enabled.

We compiled with different LogicCost settings to investigate the tradeoff of time w.r.t. silicon area:

LogicCost=37 Kiwi done at time 673110.  67310 cycles.

LogicCost=35 : Kiwi done at time 681750.  68174 cycles.
Thread HWMain uid=HWMain10 has 2204 CIL instructions in 851 basic blocks
Thread mpc10 has 539 bevelab control states (pauses).


LogicCost=30 : Exit on done asserted after                70806 cycles.
Thread HWMain uid=HWMain10 has 2204 CIL instructions in 851 basic blocks
Thread mpc10 has 478 bevelab control states (pauses)
Reindexed thread xpc10 with 1103 minor control states.

LogicCost=20 : Driven Done at 76131 cycles.
Thread HWMain uid=HWMain10 has 2204 CIL instructions in 851 basic blocks
Thread mpc10 has 613 bevelab control states (pauses)
Reindexed thread xpc10 with 1212 minor control states.

Performance-wise, in sequencer mode, KiwiC is achieving roughly the same figure (within 5 percent) as given in the Canis LegUp paper that reports on the original C++ implementation of this benchmark. [LegUp: An Open Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. A CANIS, ... J ANDERSON]

Assuming an FPGA clock of 200 MHz, the fastest Kiwi implementation (67310 cycles at 5 ns per cycle) completes in about 337 microseconds.

For comparison, the C# version on mono takes 36 microseconds on an Intel i5-4570 CPU @ 3.20GHz, and the original C++ version takes roughly half as long (to be confirmed).
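
The conversion from cycle counts to wall-clock time is a simple division by the clock frequency. The following minimal sketch applies it to the four LogicCost runs listed above, assuming the same 200 MHz clock; the class and method names are hypothetical.

```csharp
using System;

static class CycleTime
{
    // Convert a clock-cycle count to microseconds at the given clock frequency.
    public static double Microseconds(long cycles, double clockHz)
    {
        return cycles / clockHz * 1e6;
    }

    static void Main()
    {
        const double clockHz = 200e6;                        // assumed FPGA clock
        int[]  logicCost = { 37, 35, 30, 20 };
        long[] cycles    = { 67310, 68174, 70806, 76131 };   // cycle counts from the runs above
        for (int i = 0; i < cycles.Length; i++)
            Console.WriteLine("LogicCost={0}: {1} cycles = {2:F1} us",
                              logicCost[i], cycles[i], Microseconds(cycles[i], clockHz));
    }
}
```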

KiwiC - Pipelined Accelerator Mode Performance

In pipelined accelerator mode, KiwiC will initiate a new test every clock cycle, since there are no stallable operations (such as cached DRAM access). Execution time is therefore the number of outer loop iterations times the number of inner loop iterations plus the pipeline depth. This will be about 270 clock cycles (1.6 microseconds) ... TBC

Comparison with KiwiC Floating Point

It is unnecessary to use the softfloat library since KiwiC supports floating point directly. So a C# implementation of the Taylor test was written and compiled.

(This takes 5.1us to run all 36 tests compiled with mcs 3.2.8 under mono 3.2.8 on an Intel i5-4570 CPU @ 3.20GHz).

  public static double sin_floatnat(double rad)
  {
    double app = rad;
    double diff = rad;
    int inc = 1;
    double m_rad2 = - rad * rad;
    //System.Console.WriteLine ("m_rad2={0:X}", m_rad2);
    do
      {
        gloops++; // Global iteration counter - for debugging.
	diff =  (diff *  m_rad2) / ((double)((2 * inc) * (2 * inc + 1)));
	app += diff;
	inc++;
      }
    while (Math.Abs(diff)>0.00001); 	/* , 0x3ee4f8b588e368f1ULL)) */
    return app;
  }

  // KiwiC supports the relevant library functions for the following two shims to
  // adapt the float64 testbench for double. They are nops that compile to nothing.
  static double to_double(float64 farg)
  {
    byte [] asbytes = BitConverter.GetBytes(farg);
    double rr = BitConverter.ToDouble(asbytes, 0);
    return rr;
  }

 static float64 from_double(double darg)
  {
    byte [] asbytes = BitConverter.GetBytes(darg);
    return BitConverter.ToUInt64(asbytes, 0);
  }

  ...
     result = from_double(sin_floatnat(to_double(test_in[i])));
  ...
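
On a desktop runtime the same reinterpretation can be written without the intermediate byte array, using BitConverter.Int64BitsToDouble and BitConverter.DoubleToInt64Bits. The sketch below is intended for mono/.NET only; whether KiwiC accepts these two calls is not established here and should be treated as an assumption to check. The class and method names are hypothetical.

```csharp
using System;
using float64 = System.UInt64;

static class Shims
{
    // Reinterpret a 64-bit pattern as an IEEE double (no numeric conversion).
    public static double ToDouble(float64 farg)
    {
        return BitConverter.Int64BitsToDouble(unchecked((long)farg));
    }

    // Reinterpret an IEEE double as its 64-bit pattern.
    public static float64 FromDouble(double darg)
    {
        return unchecked((ulong)BitConverter.DoubleToInt64Bits(darg));
    }

    static void Main()
    {
        float64 bits = 0x3fe0c151fdb20051UL;   // the PI/6 entry of test_in
        double d = ToDouble(bits);
        Console.WriteLine("{0:X} -> {1}", bits, d);
        Console.WriteLine("round trip ok: {0}", FromDouble(d) == bits);
    }
}
```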

This compiles and runs but yields some small errors in the lsb of the results owing to a slightly non-standard implementation of one of the floating-point primitives in the cv_fparith.v library of components.

iverilog dfsin-floatnat.v vsys.v /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cvgates.v  /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cv_fparith.v
./a.out| tee icarus.spool
VCD info: dumpfile vcd.vcd opened for output.
dfsin: Testbench start
Test: input=3fc65717fced55c1 expected=3fc63a1a335aadcd output=3fc63a1a335aadce 
   hamming error=0000000000000003
Test: input=3fd65717fced55c1 expected=3fd5e3a82b09bf3e output=3fd5e3a82b09bf40 
   hamming error=000000000000007e
Test: input=3fe0c151fdb20051 expected=3fdfffff91f9aa91 output=3fdfffff91f9aa90 
   hamming error=0000000000000001
Test: input=3fe65717fced55c1 expected=3fe491b716c242e3 output=3fe491b716c242e3 
Test: input=3febecddfc28ab31 expected=3fe8836f672614a6 output=3fe8836f672614a6 
Test: input=3ff0c151fdb20051 expected=3febb67ac40b2bed output=3febb67ac40b2bef 
   hamming error=0000000000000002
Test: input=3ff38c34fd4fab09 expected=3fee11f6127e28ad output=3fee11f6127e28ae 
   hamming error=0000000000000003
Test: input=3ff65717fced55c1 expected=3fef838b6adffac0 output=3fef838b6adffac3 
   hamming error=0000000000000003
Test: input=3ff921fafc8b0079 expected=3fefffffe1cbd7aa output=3fefffffe1cbd7ac 
   hamming error=0000000000000006
Test: input=3ffbecddfc28ab31 expected=3fef838bb0147989 output=3fef838bb014798b 
   hamming error=0000000000000002
Test: input=3ffeb7c0fbc655e9 expected=3fee11f692d962b4 output=3fee11f692d962b6 
   hamming error=0000000000000002
Test: input=4000c151fdb20051 expected=3febb67b77c0142d output=3febb67b77c0142e 
   hamming error=0000000000000003
Test: input=400226c37d80d5ad expected=3fe883709d4ea869 output=3fe883709d4ea86c 
   hamming error=0000000000000005
Test: input=40038c34fd4fab09 expected=3fe491b81d72d8e8 output=3fe491b81d72d8eb 
   hamming error=0000000000000003
Test: input=4004f1a67d1e8065 expected=3fe00000ea5f43c8 output=3fe00000ea5f43cd 
   hamming error=0000000000000005
Test: input=40065717fced55c1 expected=3fd5e3aa4e0590c5 output=3fd5e3aa4e0590c2 
   hamming error=0000000000000007
Test: input=4007bc897cbc2b1d expected=3fc63a1d2189552c output=3fc63a1d21895521 
   hamming error=000000000000000d
Test: input=400921fafc8b0079 expected=3ea6aedffc454b91 output=3ea6aedffc03ab53 
   hamming error=000000000046e0c2
Test: input=400a876c7c59d5d5 expected=bfc63a1444ddb37c output=bfc63a1444ddb38c 
   hamming error=00000000000000f0
Test: input=400becddfc28ab31 expected=bfd5e3a4e68f8f3e output=bfd5e3a4e68f8f3b 
   hamming error=0000000000000005
Test: input=400d524f7bf7808d expected=bfdffffd494cf96b output=bfdffffd494cf95e 
   hamming error=0000000000000035
Test: input=400eb7c0fbc655e9 expected=bfe491b61cb9a3d3 output=bfe491b61cb9a3c8 
   hamming error=000000000000001b
Test: input=40100e993dca95a3 expected=bfe8836eb2dcf815 output=bfe8836eb2dcf813 
   hamming error=0000000000000006
Test: input=4010c151fdb20051 expected=bfebb67a740aae32 output=bfebb67a740aae43 
   hamming error=0000000000000071
Test: input=4011740abd996aff expected=bfee11f5912d2157 output=bfee11f5912d2161 
   hamming error=0000000000000036
Test: input=401226c37d80d5ad expected=bfef838b1ac64afc output=bfef838b1ac64b00 
   hamming error=00000000000001fc
Test: input=4012d97c3d68405b expected=bfefffffc2e5dc8f output=bfefffffc2e5dcb2 
   hamming error=000000000000003d
Test: input=40138c34fd4fab09 expected=bfef838b5ea2e7ea output=bfef838b5ea2e7fa 
   hamming error=0000000000000010
Test: input=40143eedbd3715b7 expected=bfee11f7112dae27 output=bfee11f7112dae6c 
   hamming error=000000000000004b
Test: input=4014f1a67d1e8065 expected=bfebb67c2c31cb4a output=bfebb67c2c31cb1e 
   hamming error=0000000000000054
Test: input=4015a45f3d05eb13 expected=bfe883716e6fd781 output=bfe883716e6fd7b8 
   hamming error=0000000000000039
Test: input=40165717fced55c1 expected=bfe491b9cd1b5d56 output=bfe491b9cd1b5da6 
   hamming error=00000000000000f0
Test: input=401709d0bcd4c06f expected=bfe000021d0ca30d output=bfe000021d0ca364 
   hamming error=0000000000000069
Test: input=4017bc897cbc2b1d expected=bfd5e3ad0a69caf7 output=bfd5e3ad0a69cc40 
   hamming error=00000000000006b7
Test: input=40186f423ca395cb expected=bfc63a23c48863dd output=bfc63a23c4886237 
   hamming error=00000000000001ea
Result: 2/36
RESULT: FAIL
dfsin: Testbench finished
gloops=267
Exit on done asserted after                 4460 clocks.
Driven Done at 44605
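
The 'hamming error' printed above is the XOR of the expected and actual bit patterns, which can overstate a small numerical difference when a carry ripples through several bits. A minimal sketch of an alternative measure, the distance in units in the last place (ULPs), follows. The helper name is hypothetical, and the method is only valid when both values have the same sign and neither is a NaN.

```csharp
using System;

static class UlpCheck
{
    // Distance in ULPs between two IEEE doubles given as raw bit patterns.
    // Assumes equal sign bits and no NaNs, so that the bit patterns are
    // ordered in the same way as the magnitudes they encode.
    public static ulong UlpDistance(ulong a, ulong b)
    {
        return a > b ? a - b : b - a;
    }

    static void Main()
    {
        ulong expected = 0x3fd5e3a82b09bf3eUL;   // PI/9 test vector from the log above
        ulong output   = 0x3fd5e3a82b09bf40UL;
        Console.WriteLine("xor={0:X} ulps={1}",
                          expected ^ output, UlpDistance(expected, output));
    }
}
```

For this pair the XOR is 0x7e, yet the two values differ by only 2 ULPs.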

Comparison with Mono, C++ and x86 Floating Point.

Rather than using softfloat, we can use the native implementations of double-precision floating point available in mono and Kiwi. We can also compile the C++ version for x86 and use its native floating-point hardware (which, for double precision, was well known to outperform FPGAs until recently).

A C++ implementation of the same Taylor series code for sine reads as follows:

double df_double_sin (double rad)
{
  double app;
  double diff;
  double m_rad2;
  int inc;
  app = diff = rad;
  inc = 1;
  m_rad2 = - (rad * rad);
  do
    {
      diff = (diff * m_rad2) / ((double)(((2 * inc) * (2 * inc + 1))));
      app += diff;
      inc++;
      gloops ++;
    }
  while (fabs (diff) > 0.00001);
  return app;
}

The inner loop has 13 instructions, four of which are floating point. We do not count the compare as a FLOP since, owing to IEEE coding, all comparisons of FP are cheap. The total accumulated in gloops, over the 36 tests of the dfsin benchmark, is 268.

  400610:       8d 48 ff                lea    -0x1(%rax),%ecx
  400613:       f2 0f 59 cb             mulsd  %xmm3,%xmm1   // DP multiply
  400617:       0f af c8                imul   %eax,%ecx
  40061a:       83 c0 02                add    $0x2,%eax
  40061d:       f2 0f 2a d1             cvtsi2sd %ecx,%xmm2  // DP convert
  400621:       89 d1                   mov    %edx,%ecx
  400623:       83 c2 01                add    $0x1,%edx     //
  400626:       f2 0f 5e ca             divsd  %xmm2,%xmm1   // DP divide
  40062a:       66 0f 28 d1             movapd %xmm1,%xmm2
  40062e:       f2 0f 58 c1             addsd  %xmm1,%xmm0   // DP add
  400632:       66 0f 54 d5             andpd  %xmm5,%xmm2
  400636:       66 0f 2e d4             ucomisd %xmm4,%xmm2  // DP compare
  40063a:       77 d4                   ja     400610  

Running on an Intel i5-4570 CPU @ 3.20GHz, this version takes 2.1, 1.9, 1.9 and 1.9 microseconds when compiled with gcc -O0, -O1, -O2 and -O3 respectively. Using all cores (assuming all four cores have independent FPU resources), the Intel CPU would achieve 268 * 4 * 6 / 1.9E-6 = 3.4 GFLOPS.

The Intel CPU has four cores, so it can run four instances at once, and it consumes about 60 Watts. The FPGA consumes only a few Watts and can run many more instances at once ... ... comparison to be completed ...

Comparison of Cordic to Taylor Sine Series under CHStone Benchmark

We can compare the hardware cost of Taylor with that of Cordic.

public class Cordic 
{
  const int iterations = 40;

  static readonly long [] ctab = new [] { 7853981633974,
    4636476090008, 2449786631269, 1243549945468, 624188099960,
    312398334303, 156237286205, 78123410601, 39062301320, 19531225165,
    9765621896, 4882812112, 2441406201, 1220703119, 610351562,
    305175781, 152587891, 76293945, 38146973, 19073486, 9536743,
    4768372, 2384186, 1192093, 596046, 298023, 149012, 74506, 37253,
    18626, 9313, 4657, 2328, 1164, 582, 291, 146, 73, 36, 18 };

  public static long sin(long theta) 
  {
    const bool vd = false;
    bool flip = false;
    
    const long int_pi = (long)(0.5 + Math.PI * 1e13);

    if (theta < 0)      // Range reduction - make +ve, using sin(-t) = -sin(t)
      {
        theta = -theta;
        flip = !flip;
      }

    while (theta >= int_pi * 2)  // Range reduction - Take Mod 2.PI
      {
	theta -= int_pi * 2;
	Kiwi.NoUnroll(); 
      }

    if (theta >= int_pi)  // Range reduction - Take Mod PI
      {
	theta -= int_pi;
	flip = !flip;	
      }

    if (theta >= int_pi / 2)  // Range reduction - Reflect about PI/2
      {
	theta = int_pi - theta;
      }

    const long x0 = (long) (6.07252935008962e14);  /*prescaled constant for iterations=40*/
    long x=x0, y=0;
    if (vd) Console.WriteLine("Starting value is {0}", x);
    for (int k=0; k<iterations; ++k)
      {
	long tx, ty;
	if (vd) Console.WriteLine("              {0} cf {1}   y={2}", theta, ctab[k], y);
	if (theta >= 0) // All nice simple integer arithmetic
	  {
	    if (vd) Console.WriteLine(" k={0} ABOVE", k);
	    tx = x - (y>>k);
	    ty = y + (x>>k);
	    theta -= ctab[k];
	  }
	else
	  {
	    if (vd) Console.WriteLine(" k={0} BELOW", k);
	    tx = x + (y>>k);
	    ty = y - (x>>k);
	    theta += ctab[k];
	  }
	x = tx; y = ty;
      }  

    long rv = (flip) ? -y: y;
    return rv; 
  }
}

The full source is in the file cordic.cs.
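
The entries of ctab appear to be atan(2^-k) scaled by 1e13 and rounded, matching the 1e13 scaling of theta. A minimal sketch that regenerates such a table, together with the Cordic gain compensation factor scaled by 1e15, is given below. The class and method names are hypothetical, and the computed gain may differ from the x0 constant in the listing in its low-order digits.

```csharp
using System;

static class CordicTable
{
    // Rotation angles atan(2^-k), scaled to integers and rounded to nearest.
    public static long[] BuildAngles(int iterations, double scale)
    {
        var tab = new long[iterations];
        for (int k = 0; k < iterations; k++)
            tab[k] = (long)Math.Round(Math.Atan(Math.Pow(2.0, -k)) * scale);
        return tab;
    }

    // Gain compensation: the product over k of 1/sqrt(1 + 2^(-2k)).
    public static double Gain(int iterations)
    {
        double g = 1.0;
        for (int k = 0; k < iterations; k++)
            g /= Math.Sqrt(1.0 + Math.Pow(2.0, -2 * k));
        return g;
    }

    static void Main()
    {
        long[] tab = BuildAngles(40, 1e13);
        Console.WriteLine("ctab[0..2] = {0}, {1}, {2}", tab[0], tab[1], tab[2]);
        Console.WriteLine("x0 ~ {0:E14}", Gain(40) * 1e15);
    }
}
```

With these scalings, an input of 15707963267948 represents PI/2 and a result of 1000000000000138 represents 1.000000000000138, as seen in the simulation log further down.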

KiwiC Compile

$ mcs cordic.cs  /r:/home/djg11/d320/hprls/kiwipro/kiwic/distro/support/Kiwi.dll
cordic.cs(84,45): warning CS0414: The private field `Cordic.done' is assigned but its value is never used
Compilation succeeded - 1 warning(s)
$ kiwic -res2-loadstore-port-count=0 -vnl-resets=synchronous -vnl-roundtrip=disable -bevelab-default-pause-mode=soft -bevelab-soft-pause-threshold=39 -vnl-rootmodname=cordic1  cordic.exe -vnl=cordic.v
$ ls -l
  -rw-rw-r--   1 djg11 djg11   2838 Oct 12 09:34 icarus.spool
  -rw-rw-r--   1 djg11 djg11 660438 Oct 12 09:34 vcd.vcd
  -rwxr-xr-x   1 djg11 djg11 277952 Oct 12 09:34 a.out
  -rw-rw-r--   1 djg11 djg11  54996 Oct 12 09:31 KiwiC.rpt
  -rw-rw-r--   1 djg11 djg11  83490 Oct 12 09:31 cordic.v
  drwxrwxr-x  12 djg11 djg11   4096 Oct 12 09:31 obj
  -rw-rw-r--   1 djg11 djg11    354 Oct 12 09:31 non-layout.eps
  -rw-rw-r--   1 djg11 djg11  21751 Oct 12 09:30 controlflow_combined.dot
  -rw-rw-r--   1 djg11 djg11  16086 Oct 12 09:30 profiler_cf_combined.xml
  -rwxrwxr-x   1 djg11 djg11   6656 Oct 12 09:30 cordic.exe
  -rw-rw-r--   1 djg11 djg11   7275 Oct 12 09:30 cordic.cs
  -rw-rw-r--   1 djg11 djg11   2154 Oct 12 09:18 Makefile
  -rw-rw-r--   1 djg11 djg11     88 Oct 12 09:18 opath.rpt
  -rw-rw-r--   1 djg11 djg11    846 Oct 11 16:29 vsys.v

Example generated RTL: cordic.v.

Run

We run the baseline Cordic version and see that it takes 6016 clocks for 36 experiments. Each experiment performs 40 inner-loop iterations, giving 36 * 40 = 1440 iterations in total and about 4 clocks per iteration, which is rather low performance for Cordic. By raising the soft-pause-threshold setting for KiwiC, progressively better performance was achieved, crossing over to 'superscalar' operation and, for the settings tabulated here, ending up at about 0.6 clocks per iteration (roughly 1.6 iterations per clock cycle).

Soft Pause Threshold   Clocks for Complete Test   Clocks per Iteration
                  38                       6016                    4.2
                 170                       3062                    2.1
                 270                       2954                    2.1
                 500                       1509                    1.0
                1500                        922                    0.6

iverilog cordic.v vsys.v /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cvgates.v  /home/djg11/d320/hprls/kiwipro/kiwic/distro/lib/cv_fparith.v
./a.out| tee icarus.spool
VCD info: dumpfile vcd.vcd opened for output.
cordic: Testbench start
Test: input=0 expected=0 output=430 
Test: input=3141592653589 expected=309016994374947 output=309016994376533 
Test: input=6283185307179 expected=587785252292473 output=587785252292491 
Test: input=9424777960769 expected=809016994374947 output=809016994375099 
Test: input=12566370614359 expected=951056516295153 output=951056516294784 
Test: input=15707963267948 expected=1000000000000000 output=1000000000000138 
Test: input=18849555921538 expected=951056516295153 output=951056516294784 
Test: input=21991148575128 expected=809016994374947 output=809016994375099 
Test: input=25132741228718 expected=587785252292473 output=587785252292491 
Test: input=28274333882308 expected=309016994374947 output=309016994376533 
Test: input=31415926535897 expected=0 output=430 
Test: input=34557519189487 expected=-309016994374947 output=-309016994376533 
Test: input=37699111843077 expected=-587785252292473 output=-587785252292491 
Test: input=40840704496667 expected=-809016994374947 output=-809016994375099 
Test: input=43982297150257 expected=-951056516295153 output=-951056516294784 
Test: input=47123889803846 expected=-1000000000000000 output=-1000000000000138 
Test: input=50265482457436 expected=-951056516295153 output=-951056516294784 
Test: input=53407075111026 expected=-809016994374947 output=-809016994375099 
Test: input=56548667764616 expected=-587785252292473 output=-587785252292491 
Test: input=59690260418206 expected=-309016994374947 output=-309016994376533 
Test: input=62831853071795 expected=0 output=-430 
Test: input=65973445725385 expected=309016994374948 output=309016994376533 
Test: input=69115038378975 expected=587785252292472 output=587785252292491 
Test: input=72256631032565 expected=809016994374947 output=809016994375099 
Test: input=75398223686155 expected=951056516295153 output=951056516294784 
Test: input=78539816339744 expected=1000000000000000 output=1000000000000138 
Test: input=81681408993334 expected=951056516295153 output=951056516294784 
Test: input=84823001646924 expected=809016994374947 output=809016994375099 
Test: input=87964594300514 expected=587785252292473 output=587785252292491 
Test: input=91106186954104 expected=309016994374947 output=309016994376533 
Test: input=94247779607693 expected=0 output=430 
Test: input=97389372261283 expected=-309016994374947 output=-309016994376533 
Test: input=100530964914873 expected=-587785252292472 output=-587785252292491 
Test: input=103672557568463 expected=-809016994374948 output=-809016994375099 
Test: input=106814150222052 expected=-951056516295154 output=-951056516294784 
Test: input=109955742875642 expected=-1000000000000000 output=-1000000000000138 
Result: 36/36
RESULT: PASS
cordic: Testbench finished
Exit on done asserted after                 6016 clocks.

Basic Amenability to and Optimisation for Standard HLS

There are four floating point operations per inner loop of this Taylor series computation: an add, a divide, a multiply and a convert of integer to float. The comparison on diff can be performed with an integer comparison so does not count.

In terms of parallelism, without loop unwinding, there is none available within the loop, owing to the data dependence between operations, save for the simultaneous comparison of diff with its summation into app.

A benefit from loop unwinding is also not directly achievable using the bracketing in the source code, since the multiplication cannot start until the value of diff from the previous iteration arrives. Re-associating the multiply/divide operators would obviously be an improvement, but it is also fairly obvious that a precomputed lookup table of the reciprocals of the divisors, which are the same for all input arguments, would convert an expensive division into a faster multiplication.
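
The reciprocal-table idea can be sketched as follows, using native double arithmetic. This is an illustration only and has not been compiled with KiwiC. The class name and the table size of 32 are assumptions (the size comfortably covers the argument range of the 36 test vectors), and because a multiply by a rounded reciprocal is not bit-identical to a divide, the low-order bits of the results may differ from those of the division-based version.

```csharp
using System;

static class TaylorRecip
{
    const int TableSize = 32;   // assumed upper bound on the number of loop iterations

    // recip[inc] = 1 / ((2*inc) * (2*inc + 1)), independent of the input argument.
    static readonly double[] recip = BuildTable();

    static double[] BuildTable()
    {
        var t = new double[TableSize];
        for (int inc = 1; inc < TableSize; inc++)
            t[inc] = 1.0 / ((2 * inc) * (2 * inc + 1));
        return t;
    }

    public static double Sin(double rad)
    {
        double app = rad;
        double diff = rad;
        double m_rad2 = -(rad * rad);
        int inc = 1;
        do
        {
            diff = diff * m_rad2 * recip[inc];   // the division is replaced by a multiplication
            app += diff;
            inc++;
        }
        while (Math.Abs(diff) > 0.00001 && inc < TableSize);   // guard against running off the table
        return app;
    }

    static void Main()
    {
        for (int i = 0; i <= 9; i += 3)
        {
            double rad = i * Math.PI / 18;
            Console.WriteLine("sin({0:F6}) ~ {1:F6}", rad, Sin(rad));
        }
    }
}
```

The product m_rad2 * recip[inc] does not depend on diff, so a scheduler is free to compute it ahead of time, leaving a single multiplication on the loop-carried path.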

Conclusions

This is a CPU-bound application benchmark, so a like-for-like implementation on FPGA should be at least a couple of orders of magnitude better than a contemporary, out-of-order, high-performance CPU.

However, the entire contents of softfloat.cs are available as hard macros on x86 processors, owing to them having floating-point hardware instructions, so outperforming the CPU when it uses its hardware floating-point instruction set is a lot tougher. But the latest Altera FPGAs also have hardened floating point, so this will tend to level the playing field ... nonetheless ...

Further conclusions ... to be drawn!

