HOME       UP       PREV       NEXT (HLS Synthesisable Subset.)  

Adopting a Suitable Coding Style for HLS

A coding style that works well for a contemporary Von Neumann computer may not be ideal for HLS.

For now at least, we cannot simply deploy existing software and expect good results.

Here are four sections of code that all perform the same function. Each is written in CSharp.

The zeroth routine uses a loop with a conditional branch on each execution. It has data-dependent control flow.

  public static uint tally00(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        if (((ind >> v) & 1) != 0) tally ++;
      }
    return tally;
  }

Implementation number one replaces the control flow with arithmetic.

  public static uint tally01(uint ind)
  {
    uint tally = 0;
    for (int v =0; v<32; v++)
      {
        tally += ((ind >> v) & 1);
      }
    return tally;
  }

This version uses a nifty programming trick

  public static uint tally02(uint ind)
  { // Borrowed from the following, which explains why this works:
    // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel 
    uint output = ind - ((ind >> 1) & 0x55555555);
    output = ((output >> 2) & 0x33333333) + (output & 0x33333333);
    output = ((output + (output >> 4) & 0xF0F0F0F) * 0x1010101);
    return output  >> 24;
  }

This one uses a `reasonable-sized' lookup table.

  // A 256-entry lookup table will comfortably fit in any L1 dcache.
  // But Kiwi requires a mirror mark up to make it produce 4 of these.
  static readonly byte [] tally8 = new byte [] {  
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
    ...
    3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
  };

  public static uint tally03(uint ind)
  {
    uint a0 = (ind >> 0) & 255;
    uint a1 = (ind >> 8) & 255;
    uint a2 = (ind >> 16) & 255;
    uint a3 = (ind >> 24) & 255;
    return (uint)(tally8[a0] + tally8[a1] + tally8[a2] + tally8[a3]);
  }
The look-up table needs to be replicated to give four copies. The logic synthesiser might ultimately replace such a ROM with combinational gates if this will have lower area.

Which of these works best on FPGA? The experiments on the following link may be discussed in lectures: »Kiwi Bit Tally (Ones Counting) Comparison

A good HLS tool should not be sensitive to coding style and should use strength reduction and sub-expression sharing and all standard compiler optimisations. (But we'll soon discuss that layout of data in memories is where we should exercise care and where automated techniques can help less).

(Strength Reduction is compiler-speak for replacing one operator with a less costly operator where possible, such as replacing multiply by -1 with subtract from 0).


14: (C) 2012-18, DJ Greaves, University of Cambridge, Computer Laboratory.