Kiwi Scientific Acceleration: C# high-level synthesis for FPGA execution.
Linked List Example

Kiwi Scientific Acceleration: Kiwi Bit Tally (Ones Counting) Comparison

Counting the number of ones in a 32-bit word is slow with naive software. RISC chips such as the ARM now break away from their RISC roots by including such instructions in their ever expanding instruction set. Here we look at four routines that realise this function and discuss how they behave under HLS.

Four Algorithms

Here are four sections of code that all perform the same function. Each is written in C#.

Tally00 - Control Flow

The zeroth routine uses a loop with a conditional branch on each execution. It has data-dependent control flow.

  public static uint tally00(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        if (((ind >> v) & 1) != 0) tally ++;
      }
    return tally;
  }

The generated RTL from Kiwi HLS is on this link LINK.v.

Tally01 - Arithmetic

Implementation number one replaces the control flow with arithmetic.

  public static uint tally01(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        tally += ((ind >> v) & 1);
      }
    return tally;
  }

The generated RTL from Kiwi HLS is on this link LINK.v.

Tally02 - From Nifty Bit Hacks

This version uses a nifty programming trick

  public static uint tally02(uint ind)
  {
    // Borrowed from the following, which explains why this works:
    // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel 
    uint output = ind - ((ind >> 1) & 0x55555555);
    output = ((output >> 2) & 0x33333333) + (output & 0x33333333);
    output = ((output + (output >> 4) & 0xF0F0F0F) * 0x1010101);
    return output  >> 24;
  }

The generated RTL from Kiwi HLS is on this link LINK.v.

Tally03 - Look-up Table Efficiency

This one uses a 'reasonable-sized' lookup table.

  // A 256-entry lookup table will comfortably fit in any L1 dcache.
  // But Kiwi requires a mirror mark up to make it produce 4 of these.
  static readonly byte [] tally8 = new byte [] {  
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };

  public static uint tally03(uint ind)
  {
    uint a0 = (ind >> 0) & 255;
    uint a1 = (ind >> 8) & 255;
    uint a2 = (ind >> 16) & 255;
    uint a3 = (ind >> 24) & 255;
    return (uint)(tally8[a0] + tally8[a1] + tally8[a2] + tally8[a3]);
  }

This look-up table needs to be replicated to give, say, four copies, otherwise there will be a structural hazard and performance is low.

The generated RTL from Kiwi HLS with ROM mirroring disabled is on this link LINK.v.

With ROM mirroing enabled we see this inventory

// Structural Resource (FU) inventory:// 1 vectors of width 3
// 8 vectors of width 1
// 8 vectors of width 8
// 3 vectors of width 32
// Total state bits in module = 171 bits.
// 32 continuously assigned (wire/non-state) bits 
//   cell SROM_A_8_US_CC_tally8_SCALbx10_tally8_ARA0_FL1 count=4

Mirrored ROM RTL.v

Testbench


class bench
{
  const int limit = 100000000;  // Some stopping value
  [Kiwi.OutputBitPort("done")] static bool done = false;

  [Kiwi.HardwareEntryPoint()]
  public static void Main()
  {
    Console.WriteLine("BitTally 03 Limit=" + limit);
    for (int jojo=1;jojo<=limit;jojo*=21) 
      {
        Kiwi.NoUnroll(); 
        uint testd = (uint)(51*jojo);
        uint tally = BitTally.tally03(testd);
        Console.WriteLine("   value={0:X}  has {1} ones (decimal).", testd, tally); 
      }  
    Console.WriteLine(" Test BitTally finished.");
    done = true;
    Kiwi.Pause();
  }
}

All four designs work and give the same result when simulated.

Run under mono

MONO_PATH=/home/djg11/d320/hprls/kiwipro/kiwic/distro/support:. mono bittally.exe
BitTally 03 Limit=100000000
   value=33  has 4 ones (decimal).
   value=42F  has 6 ones (decimal).
   value=57DB  has 11 ones (decimal).
   value=734F7  has 13 ones (decimal).
   value=975843  has 11 ones (decimal).
   value=C6A3D7F  has 18 ones (decimal).
   value=4B70B6B  has 15 ones (decimal).
 Test BitTally finished.

Run under RTL_SIM

iverilog bittally.v vsys.v 
./a.out| tee icarus.spool
VCD info: dumpfile vcd.vcd opened for output.
BitTally 03 Limit=100000000
   value=00000033  has 4 ones (decimal).
   value=0000042f  has 6 ones (decimal).
   value=000057db  has 11 ones (decimal).
   value=000734f7  has 13 ones (decimal).
   value=00975843  has 11 ones (decimal).
   value=0c6a3d7f  has 18 ones (decimal).
   value=04b70b6b  has 15 ones (decimal).
 Test BitTally finished.
RTLSIM: Exit on done asserted after                  405/10 cycles.

Performance Comparison

Performance: All designs took about about 10 clocks to run the 7 outerloops, except for the unmirrored ROM version which took 40.

Area and clock frequency comparison: TBC ...

Conclusions

The coding style made little difference, but some HLS tools will require a markup on the ROM to allow it to be relicated. Kiwi uses a figure from its recipe to limit the amount of ROM mirroring but the user can change the recipe or add markups (not shown here).


December 2016.               UP.