HOME       UP       PREV       NEXT (HLS Synthesisable Subset.)  

Adopting a Suitable Coding Style for HLS

A coding style that works well for a contemporary Vonn Neumann computer may not be ideal for HLS.

For now at least, we cannot simply deploy existing software and expect good results.

Here are four sections of code that all perform the same function. Each is written in CSharp.

The zeroth routine uses a loop with a conditional branch on each execution. It has data-dependent control flow.

  public static uint tally00(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        if (((ind >> v) & 1) != 0) tally ++;
      }
    return tally;
  }

Implementation number one replaces the control flow with arithmetic.

  public static uint tally01(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        tally += ((ind >> v) & 1);
      }
    return tally;
  }

This version uses a nifty programming trick

  public static uint tally02(uint ind)
  {
    // Borrowed from the following, which explains why this works:
    // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel 
    uint output = ind - ((ind >> 1) & 0x55555555);
    output = ((output >> 2) & 0x33333333) + (output & 0x33333333);
    output = ((output + (output >> 4) & 0xF0F0F0F) * 0x1010101);
    return output  >> 24;
  }

This one uses a 'reasonable-sized' lookup table.

  // A 256-entry lookup table will comfortably fit in any L1 dcache.
  // But Kiwi requires a mirror mark up to make it produce 4 of these.
  static readonly byte [] tally8 = new byte [] {  
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    ...
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };

  public static uint tally03(uint ind)
  {
    uint a0 = (ind >> 0) & 255;
    uint a1 = (ind >> 8) & 255;
    uint a2 = (ind >> 16) & 255;
    uint a3 = (ind >> 24) & 255;
    return (uint)(tally8[a0] + tally8[a1] + tally8[a2] + tally8[a3]);
  }
The look-up table needs to be replicated to give four copies. A mark-up may be needed with some HLS tools such as Kiwi.

Which of these works best on FPGA? The analysis on the following link will be discussed in lectures: »Kiwi Bit Tally (Ones Counting) Comparison


14: (C) 2012-17, DJ Greaves, University of Cambridge, Computer Laboratory.