A number of systems have implemented a 'parallel-For' or 'parallel-ForEach' loop construct. It is normally a mark-up applied to a standard FOR loop that can be safely ignored: standard sequential execution will still return the correct result, albeit taking more time.
For example, in C++ OpenMP one writes

    #pragma omp parallel for

and in C# one can map a delegate using

    Parallel.For(0, matARows, i => ...)
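For concreteness, here is a minimal sketch (not from the printed notes; the function and array names vadd, a, b, c, n are illustrative) showing the OpenMP pragma attached to an ordinary loop whose iterations are independent:

    // Hypothetical example: each iteration writes a distinct element of c,
    // so any schedule, including plain sequential execution, gives the
    // same result. Compile with -fopenmp (GCC/Clang).
    void vadd(const double *a, const double *b, double *c, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }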
The programmer must be aware of adverse interactions between the loop bodies. Where one body reads a value written by another, the order of scheduling is likely to make a difference.
Commutative effects, such as increment and bit-set, are safe provided they are atomic.
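A sketch of the atomic-increment case (the names histogram, data, bins are hypothetical): increments to the same bucket commute, so the result is independent of the schedule provided each update is an atomic read-modify-write:

    // Hypothetical example: several iterations may hit the same bucket,
    // but increment commutes, so marking the update atomic makes the
    // loop safe under any interleaving of loop bodies.
    void histogram(const int *data, int n, long *bins)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp atomic
            bins[data[i]]++;
        }
    }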
Q. What is the role of a parallel-For inside an HLS tool that is searching for parallelism anyway?
A. Classical HLS tools normally approach FOR loops using unwinding. Such unwinding typically leads to structural hazards on memory arrays, and hence a vast literature has emerged on memory banking and polyhedral mapping, which we have already discussed. Unwinding also typically assumes that there is little control-flow variation between the bodies, and it is generally applied to inner loops only. A parallel-For is useful for outer-loop unwinding, or where the same static schedule is unlikely to hold for each iteration owing to data-dependent control flow or variable-length operations (cache misses, divide, etc.), as in the sketch below.
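A sketch of the data-dependent control-flow case (the names scan_rows, rowsum and the sentinel convention are hypothetical): each row's trip count depends on the data, so no single static schedule fits all iterations, yet the outer loop can still be marked parallel:

    // Hypothetical example: each row is summed up to a sentinel value (-1),
    // so the work per iteration varies at run time; a static inner-loop
    // schedule cannot be shared across rows, but the rows themselves are
    // independent and can run in parallel.
    void scan_rows(int rows, int cols, const int *data /* rows x cols */,
                   long *rowsum)
    {
        #pragma omp parallel for
        for (int i = 0; i < rows; i++) {
            long acc = 0;
            for (int j = 0; j < cols && data[i * cols + j] != -1; j++)
                acc += data[i * cols + j];   // data-dependent trip count
            rowsum[i] = acc;
        }
    }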
This example shows outer-loop parallelisation for matrix multiply. The inner loops should be unwound to some suitable extent by the HLS tool's normal behaviour.
    // https://docs.microsoft.com/en-us/dotnet/standard/parallel-programming/
    //   how-to-write-a-simple-parallel-for-loop
    // A basic matrix multiplication.
    // Parallelize the outer loop to partition the source array by rows.
    Parallel.For(0, matARows, i =>
    {
        for (int j = 0; j < matBCols; j++)
        {
            double temp = 0;
            for (int k = 0; k < matACols; k++)
            {
                temp += matA[i, k] * matB[k, j];
            }
            result[i, j] = temp;
        }
    }); // Parallel.For
But is there an ideal memory layout or banking model for matrix multiplication?
Note that hardware acceleration of matrix multiplication is currently (2017) receiving a lot of attention, as it is the primary cost in convolutional neural networks. The takeaway from this slide (missing from printed notes) is: