Counting the number of ones in a 32-bit word is slow with naive software. RISC chips such as the ARM now break away from their RISC roots by including such instructions in their ever expanding instruction set. Here we look at four routines that realise this function and discuss how they behave under HLS.
Here are four sections of code that all perform the same function. Each is written in C#.
The zeroth routine uses a loop with a conditional branch on each execution. It has data-dependent control flow.
public static uint tally00(uint ind) { uint tally = 0; for (int v =0; v<32; v++) { if (((ind >> v) & 1) != 0) tally ++; } return tally; }
The generated RTL from Kiwi HLS is on this link LINK.v.
Implementation number one replaces the control flow with arithmetic.
public static uint tally01(uint ind) { uint tally = 0; for (int v =0; v<32; v++) { tally += ((ind >> v) & 1); } return tally; }
The generated RTL from Kiwi HLS is on this link LINK.v.
This version uses a nifty programming trick
public static uint tally02(uint ind) { // Borrowed from the following, which explains why this works: // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel uint output = ind - ((ind >> 1) & 0x55555555); output = ((output >> 2) & 0x33333333) + (output & 0x33333333); output = ((output + (output >> 4) & 0xF0F0F0F) * 0x1010101); return output >> 24; }
The generated RTL from Kiwi HLS is on this link LINK.v.
This one uses a 'reasonable-sized' lookup table.
// A 256-entry lookup table will comfortably fit in any L1 dcache. // But Kiwi requires a mirror mark up to make it produce 4 of these. static readonly byte [] tally8 = new byte [] { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, }; public static uint tally03(uint ind) { uint a0 = (ind >> 0) & 255; uint a1 = (ind >> 8) & 255; uint a2 = (ind >> 16) & 255; uint a3 = (ind >> 24) & 255; return (uint)(tally8[a0] + tally8[a1] + tally8[a2] + tally8[a3]); }
This look-up table needs to be replicated to give, say, four copies, otherwise there will be a structural hazard and performance is low.
The generated RTL from Kiwi HLS with ROM mirroring disabled is on this link LINK.v.
With ROM mirroing enabled we see this inventory
// Structural Resource (FU) inventory:// 1 vectors of width 3 // 8 vectors of width 1 // 8 vectors of width 8 // 3 vectors of width 32 // Total state bits in module = 171 bits. // 32 continuously assigned (wire/non-state) bits // cell SROM_A_8_US_CC_tally8_SCALbx10_tally8_ARA0_FL1 count=4
Mirrored ROM RTL.v
class bench { const int limit = 100000000; // Some stopping value [Kiwi.OutputBitPort("done")] static bool done = false; [Kiwi.HardwareEntryPoint()] public static void Main() { Console.WriteLine("BitTally 03 Limit=" + limit); for (int jojo=1;jojo<=limit;jojo*=21) { Kiwi.NoUnroll(); uint testd = (uint)(51*jojo); uint tally = BitTally.tally03(testd); Console.WriteLine(" value={0:X} has {1} ones (decimal).", testd, tally); } Console.WriteLine(" Test BitTally finished."); done = true; Kiwi.Pause(); } }
All four designs work and give the same result when simulated.
Run under mono
MONO_PATH=/home/djg11/d320/hprls/kiwipro/kiwic/distro/support:. mono bittally.exe BitTally 03 Limit=100000000 value=33 has 4 ones (decimal). value=42F has 6 ones (decimal). value=57DB has 11 ones (decimal). value=734F7 has 13 ones (decimal). value=975843 has 11 ones (decimal). value=C6A3D7F has 18 ones (decimal). value=4B70B6B has 15 ones (decimal). Test BitTally finished.
Run under RTL_SIM
iverilog bittally.v vsys.v ./a.out| tee icarus.spool VCD info: dumpfile vcd.vcd opened for output. BitTally 03 Limit=100000000 value=00000033 has 4 ones (decimal). value=0000042f has 6 ones (decimal). value=000057db has 11 ones (decimal). value=000734f7 has 13 ones (decimal). value=00975843 has 11 ones (decimal). value=0c6a3d7f has 18 ones (decimal). value=04b70b6b has 15 ones (decimal). Test BitTally finished. RTLSIM: Exit on done asserted after 405/10 cycles.
Performance: All designs took about about 10 clocks to run the 7 outerloops, except for the unmirrored ROM version which took 40.
Area and clock frequency comparison: TBC ...
The coding style made little difference, but some HLS tools will require a markup on the ROM to allow it to be relicated. Kiwi uses a figure from its recipe to limit the amount of ROM mirroring but the user can change the recipe or add markups (not shown here).
December 2016. UP.