Orangepath Synthesis Engines

The HPR L/S (aka Orangepath) library supports various internal synthesis engines. These are plugins.

Because all input is converted to the HPR virtual machine and all output is from that internal form it is also sensible to use the HPR library for translation purposes without doing any actual synthesis.

All plugins rewrite one HPR machine as another. But some that read in an external file, like the Kiwi front end or the deserialiser or the RTL front end simply ignore the input machine they are fed by the Orangepath recipe.


A* Live Path Interface Synthesiser

The H2 front end tool allows access to the live path interface synthesiser. The A* version is described on this web page. http://www.cl.cam.ac.uk/ djg11/wwwhpr/gpibpage.html

This plugin has not been tested recentl.


Transactor Synthesiser

The transactor synthesiser is described on this link

http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/transactors

This plugin has not been tested recently.


Asynchronous Logic Synthesiser

The H1 tool implements an asynchronous logic synthesiser described on this link.

http://www.cl.cam.ac.uk/ djg11/wwwhpr/dsasynch.html

This plugin has not been tested recently.


SAT-based Logic Synthesiser

The H1 tool implements a SAT-based logic synthesiser described on this link.

http://www.cl.cam.ac.uk/ djg11/wwwhpr/dslogic.html

This synthesiser is currently not part of the main HPR revision control branch.


Bevelab: Synchronous FSM Synthesiser

Bevelab is an HPR plugin that converts HPR threaded forms to RTL form. Both the input and outputs to this stage typically have the concept of a program counter per thread, but the number of program counter states is greatly reduced. In the output form, many assignments and array writes are made in parallel. A custom data path is generated for each thread and the program counter becomes the internal state of a micro-sequencer that controls that data path. The emitted program counter does not need to be treated differently, then on, from any other scalar register, although the distinction is preserved in the output form for readibility, debugging and ease of determining disjoint structural operations in restructure (and perhaps to assist proof tools), and for the HPR Performance Predictor that needs to track the control flow graph through the complete toolchain.

(Alternatives to Bevelab are Systolic and VSFG. VSFG (§25) can achieve greater throughput with heavily-pipelined components in the presence of complex control flow. Systolic requires bounded loops and projects to a systolic array (not described in this manual).)

Usually, the input is in DIC form where the DIC contains assignments, conditional gotos, fork/join and leaf calls to HPR library functions. More-advanced imperative control flow constructs, such as while, for, continue, break, call and return need to have been already removed.

The resulting RTL is generally `synthesisable' as defined by language standards for Verilog, VHDL and SystemC. Although it uses common subexpression sharing, it is hopelessly inefficient since a naive compilation to hardware would instantiate a fresh, flash arithmetic operator at every textual site where an operator occurs. In addition, it will typically be full of structural hazards where RAMs are addressed at multiple locations in one clock cycle, whereas in reality they are limited in number of simultaneous operations by their number of ports. Finally,the RAMs and ALUs are assumed to be combinatorial by this RTL, whereas in reality they are pipelined or variable latency.

Converting to one of the output languages, such as SystemC, is by a subsequent plugin. But the output of Bevelab is normally first passed via Restructure (that overcomes structural hazards, re-pipelines and performs load balancing) to the Verilog-gen plugin where it is converted to Verilog RTL syntax.

Both Bevelab and Restructure can trade execution time against number of resources in parallel use: the time/space fold/unfold. Bevelab is the core component of any `C-to-gates' compiler. It packs a sequential imperative program into a hardware circuit. As well as packing multiple writes into one cycle, it can unwind loops any bounded number of times. Loops that read and write arrays can generate very large multiplexor trees if the array subscripts are incomparable at unwind time, since there are very many possible data bypasses and forwardings needed. Therefore, a packing that minimises the number of multiplexors is normally chosen. A simple greedy algorithm is used by default: as much logic as possible is packed into the first state, defined by the entry point to the thread, subject to four limits:

  1. a multiplexing logic depth heuristic limit being reached,
  2. a name alias (undetermined array address comparison) being needed,
  3. a user-annotatated loop unwind limit being reached, and
  4. containing an intrinsically pausing operation.
Once the first state is generated, which may contain multiple input conditional branches that become predication within that state, successive micro-sequencer states are generated until closure.

Certain operations are already known to be pausing. One is a user-level explicit pause where the source code contains a call to `Pause()'. This is needed for net-level protocols, such as parallel to serial conversion in a UART, and for connecting to hard IP blocks that have synchronous interfaces. Others, such as trying to use results from integer divide, any floating point arithmetic, non-fully-pipelined multiply and reads from RAMs that are known to be registered also generate pauses when their source operands are also generated in the current micro-sequencer state.

Figure 9: Details of the Hard Pause Mode algorithm, as provided by Bevelab plugin in the HPR L/S library.


Table 4: Bevelab Heuristic Table.
Parameter Style Default Max    
Maximum number of name aliases array read   0    
Maximum number of multiplexors in logic path   10    
Maximum default number of iterations to unwind loops 4    


Bevelab operates using the heuristics given in Table 4. It takes an additional input, from the command line, which is an unwind budget: a number of basic blocks to consider in any loop unwind operation. Where loops are nested or fork in flow of control, the budget is divided over the various ways.

The flag generate-nondet-monitors turns on and off the creation of embedded runtime monitors for non-deterministic updates.

The flag preserve-sequencer should be supplied to keep the per-thread vestigal sequencer in RTL output structures. This makes the output code more readable but can make it less compact for synthesis, depending on the capabilites of the FPGA tools to do their own minimisation.

The string -vnl-resets=synchronous should be passed in to add synchronous resets to the generated sequencer logic. This is the default.

The string -vnl-resets=asynchronous should be passed in to add assynchronous resets to the generated sequencer logic.

The string -vnl-resets=none should be passed in to supress reset logic for FPGA targets. FPGA's tend to have built-in, dedicated reset wiring. See §38.

Bevelab has a number of scheduling algorithms (selectable from recipe or command line). Alternatively, Bevelab can be replaced with a different opath plugin, such as VSFG or Systolic.

Bevelab: Hard Pause Mode Internal Operation

This section describes only Hard Pause Mode. This is where the position of clock pulses is under explict programmer control via the insertion of Pause() calls.

The central data structure is the pending activation queue, where an activation consists of a program counter name, program counter value and environment mapping variables that have so far been changed to their new (symbolic) values.

The output is a list of finite-state-machine edges that are finally placed inside a single HPR parallel construct. The edges have to forms (g, v, e) (g, fname, [ args]) where the first form assigns e to v when g holds and the second calls function fname when g holds.

Both the pending activation queue and the output list have checkpoint annotations so that edges generated during a failed attempt at a loop unwind can be discarded.

The pending activation list is initialised with the entry points for each thread. Operation removes one activation and symbolically steps it through a basic block of the program code, at which time zero, one or two activations are returned. These are either added to the output list or to the pending activation list. An exit statement terminates the activation and a basic block terminating in a conditional branch returns two activations. A basic block is terminated with a single activation at a blocking native call, such as hpr_pause. When returned from the symbolic simulator, the activation may be flagged as blocking, in which case it is fed to the output queue. Otherwise, if the unwind budget is not used up the resulting activations are added to the pending queue.

A third queue records successfully processed activations. Activations are discarded and not added to the pending queue if they have already been successfully processed. Checking this requires comparison of symbolic environments. These are kept in a "close to normal form" form so that syntactic equivalence can be used. This list is also subject to rollback.

Operation continues until the pending activation queue is empty. A powerful proof engine for comparing activations would enable this condition to be checked more fully and avoid untermination with a greater number of designs.

Bevelab: Soft Pause Mode Internal Operation

Classical HLS operates by loop unwinding to expose parallelism. This is achieved by Bevelab when running in Soft Pause Mode. It reorganises the input program in terms of the number and size of basic blocks. Instead of expecting explict Pause() calls in the input language, as in Hard Pause Mode, basic blocks of appropriate size and with other properies are automatically generated from raw procedural programming. This is the approach needed for general acceleration of scientific (aka Big Data) programs.

The transforms available are :

  1. Loop fusion: combining the operations of two successive loops with identical trip counts into one loop;
  2. Loop unrolling: expanding the body of a loop by an unwind factor;
  3. Predication: replacing control flow with conditional expressions;
  4. De-predication: converting conditional expressions into control flow.


VSFG - Value State Flow Graph

VSFG is an alternative to the bevelab plugin - it uses distributed dataflow instead of having a centralised micro-sequencer per thread. It is based on the paper `A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware' [5]. It can achieve greater throughput with heavily pipelined components in the presence of complex control flow compared with traditional loop unwinding and static schedulling.

Its implementation within Kiwi is currently experimental (January 2015).


PSL Synthesiser

The PSL synthesiser converts PSL temporal assertions into FSM-based runtime monitors.


Statechart Synthesiser

The Sys-ML statechart synthesiser is built in to the front end of the H2 tool. It must be built in to other front ends that generate HPR VMs,


SSMG Synthesiser

SSMG is the main refinement component that converts assertions to executable logic using goal-directed search. The SSMG synthesiser is described in a separate document and is a complete sub-project with respect to HPR.


Repack Recipe Stage

The repack function is essentially KiwiC-specific. It is therefore described in the KiwiC chapters of this manual (§4.8.1).


Restructure Recipe Stage


Table 5: An Example Structural Resource Guide Table.
Parameter Style Default Max    
Max no of integer adders and subtractors per thread flash unlimited    
Max no of integer multipliers per thread one-cycle 5000 bit products    
Max no of integer dividers per thread vari-latency 5    
Max no of F/P ALUs per thread fixed latency of 5 5    
Max size register file (bits)   512    
Max size single-port block RAM per thread        
Max no of single-port block RAMs per thread   2    
Max no dual-port block RAMs shared over threads   2    
Max size dual-port block RAMs shared over threads   bits    
No of DRAM front-side cache ports   unlimited    
No of DRAM banks   platform-specific    


Restructuring is need to overcome structural hazards arising when there are insufficient resources for all the required operations to take place in parallel and to generally sequence operations in the time domain. Resources are mainly ALUs and memory ports. Table 5 shows the main parameters that control time/space trade off while restructuring a design. Further parameters relate to the cache size and architecture, DRAM clock speed. The repack phase (§29) generated as many memories as possible. These must now be allocated to the allowed hardware resources, which may mean combining memories to reduce their total number, but taking into account a good balance for port bandwidth. Hardware platforms vary in the number of DRAM banks provided. The number of block RAMs inside an individual FPGA, like the number of ALUs to use, can be varied between one compilation and another.

The restructure phase bounds the number of each type of structural resource generated for each thread. It then generates a static schedule for that thread. Certain subsystems can have variable latency, in which case the static schedule is based on the average execution time, with stalls and holding registers being generated for cases that run respectively slower or faster than nominal. The schedule may also get stalled at execution time owing to dynamic events that cannot be predicted in advance. Typical dynamic events are cache misses, contention for shared resources from other threads and blocking message passing between threads.

The scheduller statically maps memory operations to ports on multi-ported memories. It overcomes all static hazards, ensuring that no attempt to use a resource more than once at a time occurs. It therefore ensures that different operations occur in different cycles, with automatic insertion of holding registers to maintain data values that would not be available when needed.

The five-stage pipeline for FPUs consists of, for an add, the following fully-pipelined steps: 1. unpack bit fields and compare mantissas, 2. shift smaller mantissa, 3. add mantissas, 4. normalise, 5. round and repack.



Subsections
David Greaves 2019-11-14