HOME UP PREV NEXT (Atomic Commutable Effects)

Parallel-For and Parallel-ForEach Loops

A number of systems have implemented the 'parallel-For' or 'parallel-ForEach' loop construct. It is normally a mark-up applied to a standard FOR loop that can be safely ignored such that standard sequential execution will return the correct result, allbeit, by taking more time.

For example, in C++ OpenMP, one puts

  #pragma omp parallel for

and in CSharp one can map a delegate using

    Parallel.For(0, matARows, i => ...)

The programmer must be aware of adverse interactions between the loop bodies. Where one body reads a value written by another, the order of schedulling is likely to make a difference.

Commutable effects such as increment and bit-set are ok if they are atomic.

Q. What is the role of a parallel-For inside an HLS tool that is searching for parallelism anyway?

A. Classical HLS tools normally approach FOR-loops using unwinding. Such unwinding typically leads to structural hazards on memory arrays and hence there has emerged a vast literature on memory banking and polyhedral mapping that we have already discussed. Such unwinding typically assumes that there is little control-flow variation between the bodies and is generally applied to inner loops only. A parallel-For is useful for outer loop unwinding or where the same static schedule is unlikely to hold for each iteration owing to data-dependent control flow or variable-length operations (cache misses, divide etc.).

This example shows outer-loop parallelisation for matrix multiply. The inner loops should be unwound to some suitable extent by the HLS tool's normal behaviour.

// https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/
//              how-to-write-a-simple-parallel-for-loop
// A basic matrix multiplication.
// Parallelize the outer loop to partition the source array by rows.
    Parallel.For(0, matARows, i =>
        {
          for (int j = 0; j < matBCols; j++)
            {
              double temp = 0;
              for (int k = 0; k < matACols; k++)
                {
                  temp += matA[i, k] * matB[k, j];
                }
              result[i, j] = temp;
            }
        }); // Parallel.For

But is there an ideal memory layout or banking model for matrix multiplication ?

Note that hardware accelleration of matrix multiplication is currently (2017) receiving a lot of attention as it is the primary cost in convolutional neural networks.

Automatic Parallel Inference:has had limited success in the past but the field is still evolving.
Manual Parallel Markups: have been added to most imperative high-level languages recently, and they also allow the programmer to asser that there are no name aliases which sometimes increases the available parallelism by a factor of 10 or more.