Take one thread and a body of code:
Works well when there is little or no cycle time variation. Not so good with DRAM+cache or floating point.
Creates a precise schedule of addresses on register file and RAM ports and ALU function codes.
Typically unwinds inner loops by some factor.
Can cope with data-dependent control flow, but predicate hoisting needed if datapath is heavily pipelined.
Data-dependent control flow and RAM bandwidth ultimately limit parallelism.
Profiling or datapath description hints are needed for a sensible datapath structure since sequencer states are not equiprobable and we do not want to deploy resource on seldom-used data paths.
7: (C) 2008-13, DJ Greaves, University of Cambridge, Computer Laboratory. |