
Conservation Cores Approach

Suppose a code fragment like the following is a dominant consumer of power in a portable embedded device:

  for (int xx=0; xx<1024; xx++) {
     unsigned int d = Data[xx];
     int count = 0;
     while (d > 0) { if (d & 1) count++;   d >>= 1; }   // tally the set bits in d
     if (!xx || count > maxcount) { maxcount = count; where = xx; }   // remember the word with the highest tally
  }

This kernel tallies the set bits in each word and records the index of the word with the highest tally. Such bit-level operations are inefficient when expressed in a general-purpose CPU instruction set.

Dedicated hardware avoids instruction fetch overhead and is generally more power efficient.

Analysis using Amdahl's law and high-level simulation (SystemC TLM) can establish whether a hardware implementation is worthwhile.
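As a rough illustration of the kind of Amdahl's law estimate mentioned above, the following sketch computes the overall speedup when a fraction f of the execution time is moved to hardware that runs it s times faster. The function names and the example figures are assumptions made for illustration, not measurements of the kernel. The same form of argument can plausibly be applied to energy, provided the energy of the unaccelerated remainder is unchanged.

```c
/* Amdahl's law: overall speedup when a fraction f of the run time
 * (0 <= f <= 1) is accelerated by a factor s (s > 0).
 * Hypothetical helper names, used only for this sketch. */
double amdahl_speedup(double f, double s)
{
    /* The remaining (1 - f) runs at the old speed; the fraction f
     * now takes f / s of the original time. */
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the speedup as s grows without limit (requires f < 1):
 * the unaccelerated remainder dominates. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For instance, if the kernel accounted for half of the run time and the hardware were ten times faster, amdahl_speedup(0.5, 10.0) gives about 1.82, and amdahl_limit(0.5) shows that no accelerator could do better than 2 for that workload.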

There are several feasible partitions:

  1. Extend the CPU with a custom datapath and custom ALU for the inner tally function controlled by a custom instruction.
  2. Add a tightly-coupled custom coprocessor with fast data paths to load and store operands from/to the main CPU. The main CPU still generates the address values xx and fetches the data as usual.
  3. Place the whole kernel in a custom peripheral unit with operands being transferred in and out using programmed I/O or pseudo-DMA.
  4. As 3, but with the new IP block having bus master capabilities so that it can fetch the data itself, with polled or interrupt-driven synchronisation with the main CPU.
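To make partition 3 more concrete, here is a minimal sketch of the software side when the whole kernel sits in a peripheral and the CPU feeds it by programmed I/O. The register layout, the type tally_periph_t and the function names are all hypothetical. The peripheral is modelled in plain C so that the sketch can run on a workstation; on a real SoC the call to periph_write_data would be a single store to a memory-mapped register.

```c
#include <stdint.h>

/* Hypothetical register map of the tally peripheral (illustrative only). */
typedef struct {
    uint32_t data_in;   /* write: next operand word                   */
    uint32_t maxcount;  /* read: highest tally seen so far            */
    uint32_t where;     /* read: index of the word with that tally    */
    uint32_t nwords;    /* internal: number of words accepted so far  */
} tally_periph_t;

/* Software model of what the hardware would do on each write to data_in. */
static void periph_write_data(tally_periph_t *p, uint32_t d)
{
    uint32_t count = 0;
    p->data_in = d;
    while (d > 0) { if (d & 1) count++; d >>= 1; }
    if (p->nwords == 0 || count > p->maxcount) {
        p->maxcount = count;
        p->where = p->nwords;
    }
    p->nwords++;
}

/* Driver side for partition 3: the CPU still reads every operand from
 * memory and copies it to the peripheral, then collects the results. */
void tally_by_pio(tally_periph_t *p, const uint32_t *data, uint32_t n,
                  uint32_t *maxcount, uint32_t *where)
{
    p->nwords = 0; p->maxcount = 0; p->where = 0;   /* reset the unit */
    for (uint32_t xx = 0; xx < n; xx++)
        periph_write_data(p, data[xx]);             /* programmed I/O write */
    *maxcount = p->maxcount;
    *where = p->where;
}
```

In partition 4 the loop in tally_by_pio would disappear from the CPU: the driver would instead pass a base address and length to the block and then poll or wait for an interrupt.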

[Figure: A custom ALU operation implemented in two similar ways: as a custom instruction or as a coprocessor.]

[Figure: A custom function implemented as a peripheral IP block, with optional DMA (bus master) capability.]

The special hardware in all approaches may be manually coded in RTL or compiled using HLS from the original C implementation.

In the first two approaches, both the tally and the conditional update of the maxcount variable might be implemented in hardware, but most of the gain would come from the tally function itself. The detailed design might also differ depending on whether a custom instruction or a coprocessor were used.

The custom instruction operates on data held in the normal CPU register file. The bit tally function alone reads one input word and yields one output word, so it easily fits within the addressing modes provided for normal ALU operations.
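The following sketch shows how the kernel might look to the programmer once the tally is a custom instruction with one source register and one destination register. The function tally32 is a hypothetical stand-in: here it is ordinary C so the code runs anywhere, whereas on the extended CPU it would be replaced by whatever intrinsic or inline assembly the toolchain provides for the new instruction. The wrapper find_max_tally and its parameters are likewise assumptions for illustration.

```c
#include <stdint.h>

/* Stand-in for the hypothetical custom instruction:
 * one input word in a register, one result word out. */
static inline uint32_t tally32(uint32_t d)
{
    uint32_t count = 0;
    while (d > 0) { if (d & 1) count++; d >>= 1; }
    return count;
}

/* The original kernel with the inner while loop replaced by one
 * tally32 operation per word. The compare and conditional update stay
 * in normal CPU instructions. If n is 0 the outputs are left untouched. */
void find_max_tally(const uint32_t *data, int n,
                    uint32_t *maxcount, int *where)
{
    for (int xx = 0; xx < n; xx++) {
        uint32_t count = tally32(data[xx]);
        if (!xx || count > *maxcount) { *maxcount = count; *where = xx; }
    }
}
```

Keeping the compare and update in software means the custom instruction needs only a single register file write per use.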

Updating both the maxcount and where registers in one custom instruction would require two register file writes, which may not be possible in one clock cycle. Hence, if this part of the kernel is placed in the custom datapath, we might lean more towards the coprocessor approach.

Whether to use the separate IP block really depends on whether the processor has something better to do in the meantime and whether there is sufficient bus bandwidth for both to operate.

With the increasing transistor count available as dark silicon (i.e. switched off most of the time) in recent and future VLSI, implementing standard kernels as custom hardware cores is a potential major trend for power conservation; such cores are sometimes called conservation cores.

(C) 2008-13, DJ Greaves, University of Cambridge, Computer Laboratory.