HOME UP PREV NEXT (Beyond Pure RTL: Behavioural descriptions of hardware.)

Instruction-Level Parallelism

Q. Does a program have a certain level of implicit parallelism?

A. With respect to a specific compiler optimisation level and a specific input data set it does indeed.

Here is a concrete example:

Prihozhy - Code for counting digits 1..5 in integer n.
void main()
{ unsigned long n=21414;
  int m[5], k=0;
  for (int i=0;i<5;i++) m[i]=0;
  while(n) { m[n%10]=1; n/= 10; }
  for (int j=0;j<5;j++) if (m[j]) k++;
}

Critical Path in Digit Counting C Program (Prihozhy). In their 2003 paper, Prihozhy, Mattavelli and Mlynek, determine the available parallelism in various C programs. For small programs, such as the digit counting example above, with given input data, the critical path length and the total number of instructions (or clock cycles) can be drawn out. The critical path is highlighted with bold/wider arrows. (Someone volunteer to make them organge?) Their ratio gives the available instruction-level parallelism, such as 174/30=5.8. »Data Dependencies Critical Path Evaluation

(Note also David Wall's `Limits of instruction-level parallelism'. In Proc. ASPLOS-4, 1991. and J Mak's `Limits of parallelism using dynamic dependency graphs' 2009.) Basic algorithm: Like the static timing analyser for hardware circuits, for each computation node we add its delay to that of the latest-arriving input.

But with different input data, the control flow varies and the available parallelism varies. Also the code can be re-factored to increase the parallelism.

The parrellism of the digit counter example can be improved from 30 to 25, as shown in the paper, e.g. by using variants of the while-to-do transformation.

  // When we know g initially holds:
  while (g) do { c }  <--> do { c } while (g)

A greater, alternative parallelism metric is obtained by neglecting control hazards. If we ignore the arrival time of the control input to a multiplexing point we typically get a shorter critical path. Speculative execution computes more than one input to a multiplexing point (e.g. a carry-select adder). The same is achieved with perfect branch prediction.

Q. So, does a given program have a fixed certain level of implicit parallelism? If so, we should be able to measure it using static analysis.

A. No. In general it greatly depends on that amount of data-dependent control flow, the disambiguation of array subscript expressions (decideable name aliases) and compiler tricks.

Q. When is one algorithm the same as another that computes the same result?

A. Authors differ, but perhaps the algorithms are the same if there exists a set of transform rules, valid for all programs, that maps them to each other? (That's the DJG definition anyway!)

(This program counted the divide by 10 as one operation: HLS addresses the multi-cycle nature of operations such as load and division, which gives a more complex timing diagram.)