
Higher-level: Generative, Behavioural or Declarative?

There are several primary, high-level design expression styles to consider (in practice, a blend of them is typically used):

Purely generative approaches, like the Lava and Chisel hardware construction languages (HCLs), just `print out' a circuit diagram or computation graph.
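As a minimal sketch of this style (the Sig type and halfAdder below are hypothetical names for illustration, not the real Lava or Chisel API), a circuit is just an ordinary host-language value, and `compilation' is simply evaluating the description and printing the resulting graph:

    sealed trait Sig
    final case class In(name: String)     extends Sig
    final case class And2(a: Sig, b: Sig) extends Sig
    final case class Xor2(a: Sig, b: Sig) extends Sig

    object HalfAdderGen extends App {
      // A half adder is an ordinary function returning graph nodes.
      def halfAdder(a: Sig, b: Sig): (Sig, Sig) =
        (Xor2(a, b), And2(a, b))        // (sum, carry)

      val (s, c) = halfAdder(In("a"), In("b"))
      println(s)   // Xor2(In(a),In(b))
      println(c)   // And2(In(a),In(b))
    }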

There is also the generate statement in Verilog and VHDL RTL.

Such approaches do not handle data-dependent IF statements: instead, muxes must be explicitly printed (although Chisel has some syntactic support for mux generation via its when statements). »Lava Compiler (Singh)
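For example, a small chisel3 module (a sketch using the standard chisel3 API): the data-dependent IF is written with when/otherwise, which elaborates to an explicit 2-to-1 mux in the emitted netlist rather than to any run-time branch:

    import chisel3._

    class MuxExample extends Module {
      val io = IO(new Bundle {
        val sel = Input(Bool())
        val a   = Input(UInt(8.W))
        val b   = Input(UInt(8.W))
        val out = Output(UInt(8.W))
      })
      when (io.sel) {
        io.out := io.a      // selected when sel is high
      } .otherwise {
        io.out := io.b      // elaborates to a mux, not control flow
      }
    }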

Generative approaches for supercomputer programming, such as DryadLINQ from Microsoft, also elegantly support rendering large static computation trees (CSP or Kahn networks) that can be split over processing nodes. They are agnostic as to what mix of nodes is used: FPGA, GPU or CPU. »pn tool for Process Networks.

But the most interesting systems support complex data-dependent control flow. These are either behavioural (expressed as imperative threads, as in classical HLS) or declarative (expressed as rules, such as Bluespec's guarded atomic actions).

Historically, the fundamental problem to be addressed was that programmers like imperative programs, but the program counter (PC) concept, with its associated control flow, limits the available parallelism.

An associated problem is that even a fairly pure functional program has limited inferable parallelism in practice »Parallel Haskell (Jones, Marlow, Singh).

A further problem is that RAM memories are a vital component of all hardware systems and they are intrinsically imperative, so a projection-to-imperative step must exist at some point in the design flow. (Write-once semantics may help, given modern, abundant virtual memory.) Parallelisation is limited by the name-alias problem.
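A two-line illustration of the alias problem (hypothetical code, Scala syntax): the two stores below can be scheduled in parallel, or reordered, only if the tool can prove i != j, which is undecidable in general for indexed RAM writes, so a sequential schedule must be assumed:

    object AliasDemo {
      def update(mem: Array[Int], i: Int, j: Int): Unit = {
        mem(i) = mem(i) + 1   // store 1
        mem(j) = mem(j) * 2   // store 2: may alias store 1, so order matters
      }
    }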

Using a parallel set of guarded atomic actions (as in Bluespec) is pure RTL, which is declarative (since there are no threads).
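A minimal sketch of the guarded-atomic-action semantics (illustrative Scala, not Bluespec syntax): a design is a set of rules, each a guard plus an atomic state update, and on each clock cycle the generated scheduler fires enabled rules. For simplicity this sketch fires one rule per cycle, whereas a real scheduler fires a maximal set of non-conflicting rules in parallel:

    final case class Rule(name: String, guard: () => Boolean, action: () => Unit)

    object GaaDemo extends App {
      var count = 0
      var sum   = 0
      val rules = Seq(
        Rule("step",  () => count < 3,  () => count += 1),
        Rule("accum", () => count == 3, () => { sum += count; count = 0 })
      )
      for (cycle <- 1 to 8)                 // eight 'clock cycles'
        rules.find(_.guard()).foreach { r =>
          r.action()
          println(s"cycle $cycle: fired ${r.name} (count=$count sum=$sum)")
        }
    }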

All higher-level styles are amenable to substantial automatic design space exploration by the compiler tool. The tool performs datapath and schedule generation, including re-encoding and re-pipelining to meet timing closure and power budgets.

So-called `classical HLS' converts a behavioural thread to a static schedule (fixed at compile time). But the parallelism inferred is always limited.
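To make this concrete, here is an illustrative behavioural kernel (Scala syntax) with one plausible static schedule shown in comments; the schedule is an assumption for illustration, not the output of any particular tool:

    object DotProduct {
      // Behavioural input to classical HLS: a single sequential thread.
      def dot(a: Array[Int], b: Array[Int]): Int = {
        var acc = 0
        for (i <- a.indices)
          acc += a(i) * b(i)
        acc
      }
      // One plausible compile-time schedule for the loop body:
      //   cycle t  : load a(i) and b(i)
      //   cycle t+1: multiply
      //   cycle t+2: add into acc
      // The loop-carried dependency on acc caps the inferable
      // parallelism, whatever schedule is chosen.
    }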

Sometimes no scheduler is needed at all. This can mean a fully-pipelined implementation can be generated. Alternatively, a systolic array or process network can be rendered that again has no sequencer, but which accepts data with an initiation interval greater than unity (fixed for a systolic array, variable for a CSP/Kahn-like network).
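As a software sketch of the scheduler-free, process-network style (illustrative; the thread-and-FIFO structure models the hardware, it is not generated RTL), three processes are connected by blocking FIFOs, Kahn-style, with no central sequencer; the rate at which new inputs are accepted (the initiation interval) is set dynamically by the slowest stage:

    import java.util.concurrent.ArrayBlockingQueue

    object KpnDemo extends App {
      val q1 = new ArrayBlockingQueue[Int](4)   // bounded FIFO channels
      val q2 = new ArrayBlockingQueue[Int](4)
      val source = new Thread(() => for (i <- 1 to 8) q1.put(i))
      val square = new Thread(() => for (_ <- 1 to 8) q2.put({ val v = q1.take(); v * v }))
      val sink   = new Thread(() => for (_ <- 1 to 8) println(q2.take()))
      Seq(source, square, sink).foreach(_.start())   // no central scheduler
      Seq(source, square, sink).foreach(_.join())
    }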


(C) 2012-18, DJ Greaves, University of Cambridge, Computer Laboratory.