Output and Analysis Recipe Stages

The HPR library contains the Diosim simulator, output generators and other analysis tools. Each is a plugin invoked by an Orangepath recipe stage.

HPR Output Formats Supported

The HPR library contains a number of output code generators. All of these write out a representation of an internal HPR machine. Not all forms of HPR machine can be written out in all output forms, but, where this is not possible, a synthesis engine should be available that can be applied to the internal HPR machine to convert it.

Certain output formats can encode both an RTL/hardware-style and a software/threaded style. For instance, a C-like input file can be rendered out again in threaded C style, or as a list of non-blocking assignments using the SystemC library.

The following output formats are created by selecting plugins:

RTL Form: The RTL output is written as Verilog RTL. One module is created that contains either just the RTL portion of the design, or the RTL together with instances of each MPU that executes software parts of the design.
Netlist Form: The RTL output is compiled to a structural netlist in Verilog that contains nothing but gate and flip-flop instances.
H2 IMP Form: The HPR form is output to an IMP file. This has the same syntax as the imperative subset of H2. Discontinued now.
SMV form: The HPR VM is output as SMV code, and the assertions that have not been compiled or refined are output as assertions for SMV to check.
C++ and CSharp Forms: The HPR VM is output as C++ or C# code suitable for third-party compilers. RTL forms may also be output as synthesisable SystemC.
UIA MPU Form: The IMP imperative language is compiled to IMP assembly language and output as a .s file.
IP-XACT form: The structural components are written out as IP-XACT definitions and instances.
S-expression form: The HPR VM is dumped as a Lisp S-expression to a file.
UIA Machine Code: The IMP assembly is compiled to machine code for the UIA microcontroller. This is output as Intel Hex and also as a list of Verilog assignments for initialising a memory with this code.

The net-based output architecture is suitable for direct implementation as a custom SoC (system on chip). H2 defines its own microcontroller and we use the term MPU to denote an H2 microcontroller with an associated firmware ROM. The net-based architecture consists of RTL logic and some number of MPUs. However, by requesting that all output is as C code for a single MPU, the net-based output degenerates to a single file of portable C code.

Additional output files include log files and synthesisable and high-level models of the UISA microprocessor that executes IMP machine code.

C++, SystemC and C# Output Generators

The cpp-gen recipe stage writes the current design as C++ or SystemC depending on options supplied to it. This can render any mixture of behavioural or structural code, depending on which processing steps come before it in the Orangepath recipe.

It can also generate C# code.

The -cgen2=enable flag causes the tool to generate SystemC output files.

The -csharp-gen=enable flag causes the tool to generate C# output files.

Header and code files are generated with suffixes .h and .cpp respectively. Additional header files are generated for shared interfaces and structures. Generally, to make a design consisting of a number of C++ classes, the tool is run a number of times with different root and sysc command-line options.

C# does not use header files as such, so files with suffix `.cs' are emitted. Classes may be spread over a number of files according to undocumented command-line options.

Note that emitting C# or C++ with the standard recipe writes these output files at the same point in the system flow as is used for RTL output. Hence a large number of parallel, RTL-style assignments will be used. With a shorter recipe, or with some of the intermediate stages disabled, output closer to the input form can be rendered: for instance, with bevelab turned off, assignments will be made in order using a thread instead of an HLS sequencer.

RTL Output Generator

The verilog-gen recipe stage writes the current design as Verilog RTL.

It is not a totally straightforward projection to RTL, since sub-expressions of significant complexity that occur more than once are rendered only once and assigned to intermediate nets using continuous assigns under a greedy algorithm. This keeps the file size sensible for certain functions that would otherwise become exponential (e.g. a barrel shifter). The quality of the sharing is not optimised, owing to the assumption that a subsequent logic synthesis tool will revisit these sharing decisions.
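The sharing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by verilog-gen: expressions are modelled as nested tuples of binary operators, "significant complexity" is approximated by a hypothetical min_size threshold on operator count, and the net names n0, n1, ... are invented for the example. Net declarations and widths are omitted.

```python
from collections import Counter

def size(e):
    """Number of operator nodes in an expression tree (leaves are strings)."""
    return 0 if isinstance(e, str) else 1 + sum(size(a) for a in e[1:])

def count_subtrees(e, counts):
    """Count occurrences of each subtree, descending only on first sight so
    that the children of an already-shared subtree are not counted twice."""
    if isinstance(e, str):
        return
    counts[e] += 1
    if counts[e] == 1:
        for a in e[1:]:
            count_subtrees(a, counts)

def share(exprs, min_size=2):
    """exprs maps output net -> tree of (op, lhs, rhs) tuples.
    Returns Verilog-style continuous assigns with repeated subtrees factored
    out into intermediate nets (hypothetical names n0, n1, ...)."""
    counts = Counter()
    for e in exprs.values():
        count_subtrees(e, counts)
    names, lines = {}, []

    def render(e):
        if isinstance(e, str):
            return e
        if e in names:                      # already emitted: reuse the net
            return names[e]
        op, lhs, rhs = e
        body = f"({render(lhs)} {op} {render(rhs)})"
        if counts[e] > 1 and size(e) >= min_size:
            name = f"n{len(names)}"         # greedy: share on first encounter
            names[e] = name
            lines.append(f"assign {name} = {body};")
            return name
        return body

    for out, e in exprs.items():
        lines.append(f"assign {out} = {render(e)};")
    return lines

common = ("+", ("*", "a", "b"), "c")
for line in share({"y1": ("&", common, "d"), "y2": ("|", common, "e")}):
    print(line)
```

Here the repeated term is rendered once and both outputs refer to the intermediate net. No attempt is made to find the best set of nets to share, which mirrors the point that a later logic synthesis tool is expected to revisit such decisions.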

It can also convert the design to a netlist (i.e. do logic synthesis) and estimate the area of the result. This functionality should be split out into a separate recipe stage so that, for instance, the netlist could be rendered in SystemC instead.

It also contains a roundtrip function, such that the RTL it has generated is converted back into HPR internal form. It does this from the RTL AST so cannot serve for textual RTL input in its current form ... the RTL parser is in cv3cv3.zip and needs integrating ...

The RTL Generator can provide area and wiring length estimates and generate a graphical floorplan to help visualise the circuit structure and understand how much area is devoted to which resources.

Wiring length estimates based on the design hierarchy and Rent's Rule are fairly accurate and do not require an actual layout.
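For reference, Rent's Rule relates the number of external terminals T of a block to the number of gates G it contains as T = t * G^p, where t is the average number of terminals per gate and p is the Rent exponent. The following minimal sketch evaluates that formula over a made-up hierarchy. The function name, the default values of t and p, and the gate counts are all illustrative assumptions; how the RTL generator turns such figures into wiring lengths is not described here.

```python
def rent_terminals(gates, t=4.0, p=0.6):
    """Rent's Rule: expected external terminals T = t * gates**p.
    The defaults for t and p are illustrative, not values used by the tool."""
    if gates < 1:
        raise ValueError("gates must be at least 1")
    return t * gates ** p

# Hypothetical design hierarchy: block name -> gate count (made-up numbers).
hierarchy = {"alu": 2000, "regfile": 8000, "top": 12000}
for name, gates in hierarchy.items():
    print(f"{name}: about {rent_terminals(gates):.0f} boundary nets")
```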

The flag -vnl-layout-delay-estimate=enable will create a layout.eps plot file.

IP-XACT Output Generator

The ip-xact-gen recipe stage writes the current design as an XML document following the IP-XACT `design' schema.

It can also write out bus specs and individual components used in the current design as IP-XACT XML documents.

This plugin is, or formerly was, not freestanding: it could only be invoked via the verilog-gen recipe stage.

Built-in report writers

The Orangepath framework has two built-in rendering tools that produce a textual listing file (called report or report-full) and Graphviz dot figures.

The -report-each-step flag causes textual report files for each recipe stage to be written into the obj folders. Alternatively, a pseudo plugin can be put in the recipe at a stage where such a report should be written.

The -cfg-plot-each-step flag causes the control flow for each recipe stage to be written into a report file in the obj directory. You will typically want to render the dot files with something like dot -Tpng a.dot > a.png; eog a.png.

The restructure stage accepts some older flags such as -dotplot-plot=combined but these may be discontinued.

Arithmetic and RAM Leaf Cells

The tool expects the user to supply definitions of various leaf cells, alongside the output from the tool, as input to the RTL synthesis step. A number of suitable definitions are included in cvgates.v and cv_fpgates.v, and it is commonly sufficient just to include these two files in the RTL compilation.

The leaf cell names follow a few conventions:

All have a clock and reset input, even if not needed.
All have a fail output, even if they cannot fail or will report their error in-band using, for example, NaN.
The main output is listed before the inputs, but associative instantiation is normally used anyway. For divide and mod the numerator is listed before the denominator. For subtractors the lhs is listed first.
The naming convention uses the letters VL for a variable-latency component, which has handshake wires. Otherwise FLn denotes a fixed latency of n clock cycles, fully pipelined. The tool will schedule an average budget for variable-latency components.
Parameter overrides, listed in the order output, first input, second input, set the precision of ALU connections and RAM dimensions.

For variable-latency leaf cells in the library, the VLA protocol is used. The VLA handshake protocol is as follows:

Handshake uses a req input and a rdy output.
New input args are read in on a cycle where req is asserted, which will be just one cycle in response to a req.
Results are ready in a cycle when rdy is asserted.
New work may be presented with req during the same cycle that the output data becomes live (the rdy cycle).
A req asserted before the previous rdy has been delivered is ignored.
The output, once live, remains valid until another operation starts (i.e. until the cycle after req next holds).
No combinational path between inputs and outputs, including req and rdy, is allowed inside the component.
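The rules above can be exercised with a small cycle-level behavioural model. This is a sketch of one reading of the protocol, not code from the library: the class name VlaDivider, the latency model and the use of None to represent a result that is not valid are all assumptions made for the example. Outputs are held in registers that change only on the clock edge, which models the absence of any combinational path from inputs to outputs.

```python
class VlaDivider:
    """Hypothetical VLA-style unsigned divider, modelled one clock edge at a time."""

    def __init__(self, latency_fn=lambda num, den: 1 + num.bit_length()):
        self._latency_fn = latency_fn   # assumed latency model, not from the library
        self._countdown = 0             # cycles left for the operation in flight
        self._operands = None           # operands latched when req was accepted
        self.rdy = False                # registered outputs
        self.fail = False
        self.result = None              # None models "not valid"

    def clock(self, req=False, num=0, den=1):
        """Apply one rising clock edge using the inputs driven in this cycle."""
        if self._countdown > 0:
            # Busy: a req presented before rdy has been delivered is ignored.
            self._countdown -= 1
            if self._countdown == 0:
                n, d = self._operands
                self.fail = d == 0
                self.result = 0 if self.fail else n // d
                self.rdy = True         # visible during the following cycle
            return
        # Idle, or in the rdy cycle itself: new work may be accepted.
        self.rdy = False
        if req:
            self._operands = (num, den)
            self._countdown = max(1, self._latency_fn(num, den))
            self.fail = False
            self.result = None          # old result no longer guaranteed valid

unit = VlaDivider()
unit.clock(req=True, num=42, den=5)
cycles = 1
while not unit.rdy:
    unit.clock()
    cycles += 1
print(f"rdy after {cycles} clock edges, quotient {unit.result}")
```

If no further req is presented, the result register simply keeps its value, matching the rule that the output stays valid until another operation starts.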

Components following the AXI Streaming protocol are also supported. This is the same as the Xilinx LocalLink protocol in all important aspects. It has a pair of handshake nets (ready/valid) for both the input and the output and does not hold its data on completion. Compared with VLA, the AXI streaming component requires another holding register to be instantiated by the HLS tool when it knows it may need the data in more than one subsequent cycle in its schedule.

Note: The above is for on-chip devices instantiated directly by the tool. Off-chip RAM connections use a separate protocol (HSIMPLE, HFAST, AXI, BVCI).

Fixed-point ALUs

The RTL backend will use built-in RTL operators for adders and subtractors. For multipliers, dividers and modulus units with non-constant arguments it instantiates specific units, such as CV_INT_VL_DIVIDER_US. Very small multipliers are rendered with the RTL asterisk infix operator and left to the FPGA tools, as per the adders and subtractors.

Kiwi generally calls out to variable-latency dividers and fixed-latency multipliers. It uses an estimate for the variable-latency computation time in its schedules. When using a fixed latency it increases the latency requested for larger parameter widths. Whether a component is fixed or variable latency is indicated in the component kind name. Instantiated components cope with any argument width as specified by parameter overrides.

Kiwi does not currently generate the fixed-point ALU implementations and it may request one that is not in the provided cvgates.v baseline library, in which case the poor user must provide their own implementation. For example, an extreme design might call for a 512 by 1024 fixed latency multiplier with 5 clock cycle latency.

Recipe parameters alter the points at which the library enlarges the provisioned latency.

Floating-point ALUs

Floating-point ALUs follow the pattern of fixed-point ALUs, except that adders and subtractors are also always instantiated as ALUs and the RTL compiler is not expected to handle them. A different set of recipe parameters controls their structure (fixed/variable latency and expected/required latency).

Only 32 and 64 bit, IEEE standard floating point is currently used by default. A future extension will provide for custom width floating point, since this is a very powerful feature of HLS that can save a lot of energy and area. The extension will give the same behaviour on mono WD as on RTLSIM and FPGA.

A core set of floating-point ALUs is provided in cv_fpgates.v. These are soft macros that the RTL tools are expected to map to whatever is available in the target FPGA or ASIC library. Specific shims and bindings to assist with Altera and Xilinx are likely to be added to the distro in the near future.

Floating-point Convertors

No budget limit on the number of convertors is currently imposed.

The convertors normally required are:

 CV_FP_CVT_FL2_F32_I32  // Integer 32 to float 32 with fixed latency of 2
 CV_FP_CVT_FL2_F32_I64  // Integer 64 to float 32 with fixed latency of 2
 CV_FP_CVT_FL2_F64_I32  // Integer 32 to float 64 with fixed latency of 2
 CV_FP_CVT_FL2_F64_I64  // Integer 64 to float 64 with fixed latency of 2

 CV_FP_CVT_FL2_I32_F32  // Integer 32 from float 32 with fixed latency of 2
 CV_FP_CVT_FL2_I32_F64  // Integer 32 from float 64 with fixed latency of 2
 CV_FP_CVT_FL2_I64_F32  // Integer 64 from float 32 with fixed latency of 2
 CV_FP_CVT_FL2_I64_F64  // Integer 64 from float 64 with fixed latency of 2

 CV_FP_CVT_FL0_F32_F64  // Float 32 from float 64 (FL=0 implies combinational)
 CV_FP_CVT_FL0_F64_F32  // Float 64 from float 32 (FL=0 implies combinational)
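The convertor names appear to follow the general leaf cell convention: a latency field (FLn), then the result type, then the argument type, matching the rule that the output is listed before the inputs. The small decoder below illustrates that reading. The function decode_convertor and its returned dictionary layout are hypothetical, and accepting a VL field for convertors is an assumption based on the general naming convention, since only FL forms are listed here.

```python
import re

# Pattern inferred from the names listed in this section.
_CVT = re.compile(r"^CV_FP_CVT_(?:FL(\d+)|VL)_([FI])(\d+)_([FI])(\d+)$")
_KIND = {"F": "float", "I": "integer"}

def decode_convertor(name):
    """Split a convertor leaf cell name into latency, result and argument types."""
    m = _CVT.match(name)
    if m is None:
        raise ValueError(f"not a recognised convertor name: {name}")
    fl, dst_kind, dst_width, src_kind, src_width = m.groups()
    return {
        "latency": None if fl is None else int(fl),  # None means variable latency
        "result": (_KIND[dst_kind], int(dst_width)),
        "argument": (_KIND[src_kind], int(src_width)),
    }

for cell in ("CV_FP_CVT_FL2_F64_I32", "CV_FP_CVT_FL0_F32_F64"):
    print(cell, decode_convertor(cell))
```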

RAM and ROM Leaf Cells

A set of standard static RAM cells is provided in cvgates.v. These are parameterisable in width, length and number of lanes by overrides. They are single and dual ported and of latencies 0, 1 and 2 clock cycles.

Kiwi and other tools based on the HPR library generate instances of these RAMs.

RTL tools are expected to map these to appropriate structures, such as LUT RAM and block RAM on FPGA.

RAM instances are also generated with no write ports and with static initialisations using Verilog initial statements. RTL tools will treat these as ROMs. Unlike RAMs, where the user is expected to manually couple a definition from cvgates.v or elsewhere to their RTL synthesis step input, ROMs are embedded in the main RTL output files from a run of the tool.


David Greaves 2019-11-14