Multiple basic blocks, even from one thread, will be executing at once, owing to pipelining.
Frequently, an inner loop consists of a single basic block executed repeatedly, so it competes with itself for structural resources and creates data hazards with itself.
Each time offset in a block's schedule needs to be checked for structural hazards against the resource use of all other blocks that are potentially running at the same time.
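This pairwise check can be sketched as follows. The data structures here are hypothetical, not those of any particular tool: each block's static schedule is a map from time offset (cycle) to the structural resources it uses in that cycle, and a capacity table gives how many of each resource exist.

```python
def structurally_compatible(sched_a, sched_b, offset, capacity):
    """Return True if block B, launched `offset` cycles after block A,
    never oversubscribes any structural resource while both run.
    Schedules map cycle -> {resource: units used}; capacity maps
    resource -> units available (default 1)."""
    for t_b, uses_b in sched_b.items():
        t_a = t_b + offset              # B's cycle t_b aligns with A's cycle t_b + offset
        uses_a = sched_a.get(t_a, {})
        for res in set(uses_a) | set(uses_b):
            demand = uses_a.get(res, 0) + uses_b.get(res, 0)
            if demand > capacity.get(res, 1):
                return False            # structural hazard at this time offset
    return True

# Example: a single memory read port; both blocks read memory in their slot 0.
A = {0: {"mem_rd": 1}, 1: {"alu": 1}}
B = {0: {"mem_rd": 1}}
caps = {"mem_rd": 1, "alu": 1}
structurally_compatible(A, B, 0, caps)  # False: both need the read port in the same cycle
structurally_compatible(A, B, 2, caps)  # True: B launched 2 cycles later avoids the clash
```

In a full scheduler this check is applied for every launch offset at which two blocks can overlap, which is why the interblock problem described below must be bounded.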
Every block has its own static schedule of determined length.
The early part of the schedule generally contains the control-flow predicate computation that determines which block runs next. This part can be empty if the block is initialisation code for a subsequent loop (i.e. the basic block ends at a branch destination rather than with a conditional branch).
The later part of the block contains data reads, data writes and ALU computations. Data operations can also occur in the control phase, but when resources are tight (typically memory read bandwidth) the control work should be given higher scheduling priority and hence remain at the top of the schedule.
Owing to the concept of a thread, there will be at most one control section running at any one time. But there can be a number of data sections from successors and from predecessors still running.
The interblock scheduling problem has exponential growth properties, with base equal to the average control-flow fan-out, but only a finite part needs to be considered, bounded by the maximal block length.

As well as avoiding structural hazards, the schedule must contain no RaW or WaW hazards. A block must read a datum at a point in its schedule after any earlier block that might still be running has written it; if the read and write fall at the same time, forwarding logic must be synthesised.
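The RaW condition reduces to a simple comparison of absolute cycle numbers in the global schedule. The following is a minimal sketch (the function name and the `have_forwarding` flag are illustrative, not from any real tool):

```python
def raw_ok(write_cycle, read_cycle, have_forwarding=False):
    """Check the read-after-write condition between two overlapping blocks.
    `write_cycle` is the absolute cycle in which the earlier block writes
    the datum; `read_cycle` is the cycle in which the later block reads it."""
    if read_cycle > write_cycle:
        return True               # normal case: the read strictly follows the write
    if read_cycle == write_cycle:
        return have_forwarding    # same cycle: legal only if forwarding logic is synthesised
    return False                  # read precedes write: the stale value would be seen

raw_ok(5, 6)                        # True
raw_ok(5, 5)                        # False: same cycle, no forwarding
raw_ok(5, 5, have_forwarding=True)  # True: bypass network carries the new value
```

The WaW condition is checked analogously: two writes to the same location scheduled in the same cycle (or out of program order) are rejected.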
It may be necessary to add a 'postpad' to relax the schedule: a delay, beyond what the control-flow predicate computation needs, before the control-flow arc is followed. This introduces extra slack in the global schedule, trading time for a reduction in structural resources.
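Choosing the postpad can be done greedily: grow the delay until launching the successor no longer clashes with the still-running data section of the current block. A hedged sketch, where `conflict_at` stands for whatever hazard test the scheduler uses (a hypothetical predicate, one resource-check call per candidate offset):

```python
def minimal_postpad(conflict_at, max_pad=64):
    """Return the smallest extra delay (in cycles) such that launching the
    successor block that many cycles later causes no hazard.
    `conflict_at(pad)` returns True if launch offset `pad` still clashes."""
    for pad in range(max_pad + 1):
        if not conflict_at(pad):
            return pad            # first hazard-free launch offset
    raise ValueError("no feasible postpad within bound")

# Example: suppose offsets 0 and 1 oversubscribe the memory port.
minimal_postpad(lambda pad: pad < 2)  # returns 2
```

A larger `max_pad` bound admits slower but cheaper schedules; the bound keeps the search finite, mirroring the finite window of the interblock problem.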
In the highlighted time slot in the figure, the D3 operations of the first block are concurrent either with the control and data (C1+D1) operations of a later copy of itself (when it has looped back), or with the D1 phase of Block 2 (when its tight loop has been exited).
|19: (C) 2012-17, DJ Greaves, University of Cambridge, Computer Laboratory.|