Output Formats

The HPR library contains a number of output code generators. All of these write out a representation of an internal HPR machine. Not all forms of HPR machine can be written out in all output forms, but, where this is not possible, a synthesis engine should be available that can be applied to the internal HPR machine to convert it.

Certain output formats can encode both an RTL/hardware-style and a software/threaded style. For instance, a C-like input file can be rendered out again in threaded C style, or as a list of non-blocking assignments using the SystemC library.

The following output formats may be created:

  1. RTL Form: The RTL output is written as a Verilog RTL. One module is created that either contains just the RTL portion of the design, or the RTL and instances of each MPU that is executing software parts of the design.

  2. Netlist Form: The RTL output is compiled to a structural netlist in Verilog that contains nothing but gate and flip-flop instances.

  3. H2 IMP Form: The HPR form is output to an IMP file. This has the same syntax as the imperative subset of H2.

  4. SMV form: The HPR VM is output as an SMV code and the assertions that have not been compiled or refined are output as assertions for SMV to check.

  5. C++ and CSharp Forms: The HPR VM is output as C++ or C# code suitable for third-party compilers. RTL forms may also be output as synthesisable SystemC.

  6. UIA MPU Form: The IMP imperative language is compiled to IMP assembly language and output as a .s file.

  7. IP XACT form: The structural components are written out as IP XACT definitions and instances.

  8. S-expression form: The HPR VM is dumped a lisp S-expression to a file.

  9. UIA Machine Code: The IMP assembly is compiled to machine code for the UIA microcontroller. This is output as Intel Hex and also as a list of Verilog assignments for initialising a memory with this code.

The net-based output architecture is suitable for direct implementation as a custom SoC (system on chip). H2 defines its own microcontroller and we use the term MPU to denote an H2 microcontroller with an associated firmware ROM. The net-based architecture consists of RTL logic and some number of MPUs. However, by requesting that all output is as C code for a single MPU, the net-based output degenerates to a single file of portable C code.

Additional output files include log files and synthesisable and high-level models of the UISA microprocessor that executes IMP machine machine code.

Arithmetic and RAM Leaf Cells

The tool will expect the user to provide definitions of various leaf cells with the output from the tool at the input to the RTL synthesis step. A number of suitable definitions are included in cvgates.v and cv_fpgates.v and it may commonly be sufficient just to include these two files in the RTL compilation.

The leaf cell names follow a few conventions:

  1. All have a clock and reset intput, even if not needed.
  2. All have a fail output, even if they cannot fail or will report their error in-band using, for example, NaN.
  3. The main outputs is listed before inputs, but associative instantiation is normally used anyway. For divide and mod the numerator is listed before the denominator. For subtractors the lhs is listed first.
  4. The naming convention has the letters VL for a variable-latency component and this has handshake wires. Otherwise FLn denotes a fixed-latency of $n$ clock cycles, fully-pipelined. The tool will schedule an average budget for variable lantency components.
  5. Parameter overrides, listed in the order output, first intput, second input, set the precision of ALU connections and RAM dimensions.

For variable-latency leaf cells, the handshake protocols is as follows:

Note: The above is for on-chip devices instantiated directly by the tool. Off-chip RAM connections use a separate protocol (HSIMPLE, HFAST, AXI, BVCI).

Fixed-point ALUs

The RTL backend will use built-in RTL operators for adders and substractors. For multipliers and dividers and modulus with non-constant arguments it instantiates specific units, such as CV_INT_VL_DIVIDER_US. Very small multipliers are rendered with the RTL asterisk infix operator and left to the FPGA tools as per the adders/substractors.

Kiwi generally calls out to variable latency dividers and fixed-latency multipliers. It uses an estimate for the variable latency computation time in its schedules. When using a fixed latency it increases the latency requested for larger parameter widths. Whether fixed or variable is indicated in the component kind name. Instantiated components cope with any argument width as specified by parameter overloads.

Kiwi does not currently generate the fixed-point ALU implementations and it may request one that is not in the provided cvgates.v baseline library, in which case the poor user must provide their own implementation. For example, an extreme design might call for a 512 by 1024 fixed latency multiplier with 5 clock cycle latency.

Recipe parameters alter the points at which the library enlarges the provisioned latency.

Floating-point ALUs

Floating-point ALUs follow the pattern of fixed-points ALUs, except that add and subtract are also always instantiated ALUs and the RTL compiler is not expected to handle them. A different set of recipe parameters control their structure (fixed/variable latency and expected/required latency).

Only 32 and 64 bit, IEEE standard floating point is currently used by default. A future extension will provide for custom width floating point, since this is a very powerful feature of HLS that can save a lot of energy and area. The extension will give the same behaviour on mono WD as on RTLSIM and FPGA.

A core set of floating point ALUs is provide in cv_fpgates.v. These are soft macros that the RTL tools are expected to map to whatever is available in the target FPGA or ASIC library. Specific shims and bindings to assist with Altera and Xilinx are likely to be added to the distro in the near future.

Floating-point Convertors

There is no budget limit on the number of convertors is currently imposed.

The convertors required normally are

 CV_FP_CVT_FL2_F32_I32  // Integer 32 to float 32 with fixed latency of 2
 CV_FP_CVT_FL2_F32_I64  // Integer 32 to float 32 with fixed latency of 2
 CV_FP_CVT_FL2_F64_I32  // Integer 32 to float 32 with fixed latency of 2
 CV_FP_CVT_FL2_F64_I64  // Integer 32 to float 32 with fixed latency of 2                    

 CV_FP_CVT_FL2_I32_F32  // Integer 32 from float 32 with fixed latency of 2
 CV_FP_CVT_FL2_I32_F64  // Integer 32 from float 32 with fixed latency of 2
 CV_FP_CVT_FL2_I64_F32  // Integer 32 from float 32 with fixed latency of 2
 CV_FP_CVT_FL2_I64_F64  // Integer 32 from float 32 with fixed latency of 2

 CV_FP_CVT_FL0_F32_F64  // Float 32 from float 64 (FL=0 implies combinational)                                        
 CV_FP_CVT_FL0_F64_F32  // Float 32 from float 64 (FL=0 implies combinational)

RAM and ROM Leaf Cells

A set of standard static RAM cells is provided in cvgates.v. These are parameterisable in width, length and number of lanes by overrides. They are single and dual ported and of latencies 0, 1 and 2 clock cycles.

Kiwi and other tools built in the HPR library generate instances of these RAMs.

RTL tools are expected to map these to appropriate structures, such as LUT RAM and block RAM on FPGA.

RAM instances are also generated with no write ports and static initialisations using the Verilog initial statements. RTL tools will treat these as ROMs. Unlike RAMs, where the user is expected to manually couple a definition from cvagtes.v or elsewhere to their RTL synthesis step input, ROMs are are embedded in the main RTL output files from a run of the tool.



Subsections
David Greaves 2016-12-05