CARDs: Machine-readable FU and IP Block Datasheets for SoC/Multiblade System Assembly and Incremental HLS

Design reuse and portability of RTL modules in Verilog and VHDL are greatly facilitated by machine-readable meta-information, often in the form of IP-XACT port lists in XML. A number of GUI-based development environments support manual click-and-join of compatible ports to create system-level interconnect, sometimes connecting more than 100 nets of a complex bus (e.g. AXI-5) at once. However, using IP-XACT extensions, sufficient information can be recorded for a rule-based system to choose which connections to make and to optimise the solution. For cloud deployment, automatic partitioning over a network of SERDES-connected FPGAs is also possible.
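The rule-based selection of connections can be pictured with a small sketch. It is neither IP-XACT nor the algorithm used by any tool mentioned on this page: the `Port` record, its `role` values and the `auto_connect` function are hypothetical names chosen for illustration. The single rule applied is that each initiator port joins the first unused target port of the same bus type on a different block.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    block: str      # instance name of the owning IP block (hypothetical)
    name: str       # port name within that block
    bus_type: str   # stands in for an IP-XACT bus definition reference
    role: str       # "initiator" or "target"

def auto_connect(ports):
    """Pair each initiator with the first free target of the same bus type.

    Returns (connections, unconnected): connections is a list of
    (initiator, target) tuples, unconnected lists the ports left over.
    """
    free_targets = [p for p in ports if p.role == "target"]
    connections, unconnected = [], []
    for ini in (p for p in ports if p.role == "initiator"):
        match = next((t for t in free_targets
                      if t.bus_type == ini.bus_type and t.block != ini.block),
                     None)
        if match is None:
            unconnected.append(ini)
        else:
            free_targets.remove(match)
            connections.append((ini, match))
    unconnected.extend(free_targets)
    return connections, unconnected

# Illustrative port list; block and port names are made up.
ports = [
    Port("cpu0", "m_axi", "AXI4", "initiator"),
    Port("dram_ctrl", "s_axi", "AXI4", "target"),
    Port("uart0", "s_apb", "APB", "target"),
]
connections, unconnected = auto_connect(ports)
```

A real rule set would also need to weigh data widths, clock domains and address maps; the sketch only shows that a typed port list is enough to drive the choice without manual clicking.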

Moreover, the same level of detail is sufficient for an HLS (high-level synthesis) tool to use/instantiate black-box components and custom FUs that are the outputs of previous HLS runs. IP-XACT allows ports to be described at more than one level of abstraction, with a TLM variant being used in the HLS source code and a structural instance appearing in the output RTL. But further meta-information is needed to describe which methods can be simultaneously invoked and to manage shared DRAM memory space.

This page seeks collaborators to develop this concept for adoption in an industry standards forum. Our approach includes a parameterisable generic hardware interface formalism that encompasses the vast majority of current FU interface paradigms, including simple pipelining with controlled initiation interval, the dual-port put-get paradigm from Bluespec and standard synchronous interfaces (e.g. AXI).

NANDA23 Poster

Poster PDF

IP-XACT Extensions Working Documents

This note defines a generic framework and reference terms for formal specification of synchronous functional hardware units (FUs) to serve as a basis for automated system assembly (and partition over multiple FPGAs in the cloud) and incremental HLS. It recommends a documentation/database style for pre-existing components, including outputs from previous compilations, that is easily represented as IP-XACT extensions. It includes a canonical description framework for protocols on net-level interfaces and their mapping to TLM-style method calls (PDF). Much further information about an FU can also be recorded (such as how to replicate it for load balancing or delete it if seemingly unused).

The current draft of the proposed extensions is on this page: IP-XACT Incremental HLS Extensions.

The current implementation and definition of the IP-XACT extensions are in the source files for the HPR L/S logic synthesiser library.

https://bitbucket.org/djg11/bitbucket-hprls2/src/master/

We hope to see the community adopt a set of IP-XACT extensions for documenting these aspects.

See: Terminology and Definitions for Multi-ported Functional Units (PDF).

BlueParrot HLS Compiler Example

CBG BlueParrot is an experimental HLS compiler that supports incremental compilation. This means that the output of one compilation can be used as a child FU (functional unit) by a parent compilation. Generally, HLS tool chains have not emphasised this design flow because it can defeat inter-unit optimisations and because HLS compilation time has not been significant compared with back-end place and route procedures (for either FPGA or ASIC). As a result, it has been very difficult for pre-existing components, such as CAMs, DRAM controllers and other more-specialist IP blocks, to be exploited via HLS. Implementors of HLS tools nevertheless soon realise that a generic description framework is needed within their tool to encompass the broad range of standard FUs they require (mainly RAMs and ALUs), and there is little reason why this API should not be public.

With suitable extensions, IP-XACT has been adapted for this purpose. The HPR L/S logic synthesis library already writes out IP-XACT summaries of each RTL or SystemC output it creates. It is a small step to document the extended attributes in this IP-XACT file.

Much of HLS research has been directed at garnering the highest possible performance from a set of nested loops using static scheduling. However, most data today is stored in DRAM, which is unpredictable in access time, or conveyed from SSDs or fileservers over networks, which are likewise unpredictable in access time. When these behaviours are all combined on a single FPGA, as a set of concurrent threads rendered in hardware, a static schedule will suffer from pipeline stalls and lose performance. Also, some basic FU operations and algorithms, such as division or greatest-common-factor, are intrinsically variable in latency, even if using a static schedule for their inner loops. Hence there is a need for both static and dynamic scheduling in FPGAs (see "Combining Dynamic & Static Scheduling in HLS" Cheng, Josipovic and Constantinides, 2020).
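The data-dependent latency of greatest-common-factor is easy to see by counting loop iterations. The sketch below uses Euclid's algorithm as an illustration; the function name `gcd_steps` is ours, and each iteration stands for one pass through an inner loop that could itself be statically scheduled.

```python
def gcd_steps(a, b):
    """Return (gcd, iterations) for Euclid's algorithm on non-negative ints."""
    steps = 0
    while b != 0:
        a, b = b, a % b   # one pass of the inner loop
        steps += 1
    return a, steps

easy = gcd_steps(48, 16)   # second operand divides the first: 1 iteration
hard = gcd_steps(89, 55)   # consecutive Fibonacci numbers: 9 iterations
```

Both calls have operands of similar width, yet the iteration counts differ by a factor of nine, so a caller cannot know the completion time in advance and needs some form of handshake or dynamic schedule around the unit.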

BlueParrot uses the VSFG approach (see "A New Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware" Zaidi and Greaves 2014), extended with memory fence instructions, to convert a C-like program into tokenised dataflow graphs. One graph exists for each thread or externally-callable method exported by an FU. These run in parallel and can make TLM-like calls on the methods of child/instantiated FUs using the IP-XACT extensions. Each callable method is re-entrant, enabling a number of concurrent activations to be supported, with results delivered in-order. Server-farm wrappers, also coded in BlueParrot, can exploit the incremental compilation aspects to provide super-scalar performance if needed.
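In-order delivery from a re-entrant method can be modelled behaviourally as follows. This page does not say how BlueParrot achieves it, so the ticket-and-reorder-buffer scheme below is an assumption, and the class and method names (`InOrderPort`, `put`, `complete`) are hypothetical.

```python
class InOrderPort:
    """Issues a ticket per activation and releases results in ticket order."""

    def __init__(self):
        self.next_ticket = 0       # ticket for the next activation
        self.next_to_deliver = 0   # oldest ticket not yet delivered
        self.done = {}             # ticket -> result of finished activations

    def put(self):
        """Start an activation and return its ticket."""
        ticket = self.next_ticket
        self.next_ticket += 1
        return ticket

    def complete(self, ticket, result):
        """Record a finished activation; return results now deliverable."""
        self.done[ticket] = result
        delivered = []
        while self.next_to_deliver in self.done:
            delivered.append(self.done.pop(self.next_to_deliver))
            self.next_to_deliver += 1
        return delivered

port = InOrderPort()
t0, t1, t2 = port.put(), port.put(), port.put()
first = port.complete(t1, "b")    # t0 still outstanding, nothing released
second = port.complete(t0, "a")   # releases "a" then "b"
third = port.complete(t2, "c")    # releases "c"
```

The caller sees results in call order even though the second activation finished first, which is the property that lets a parent compilation treat the child method like an ordinary function call.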

Internally, a naive conversion to a fully tokenised intermediate form is followed by an aggressive optimiser that yields high performance by removing as much redundancy in the token handling as possible. It applies semantics-preserving rewrites, such as the introduction of Mealy paths, permuting the order of some operations and changing FIFO depths.

Traffic Monitor Example

//
// The traffic monitor example - toy version.
//
module VICTIM_DETERMINER() // More-or-less null toy implementation
{
  export uint8_t provide(uint24_t key) { int ctr = 202; return ctr; };
  export void note_use(uint24_t key) { /* nothing */ };
}

module CAM8x24() // A fake stub in place of the hardened real one for test purposes
{
  export uint8_t findindex(uint24_t key) { int vale = 12; return vale; };
  export void settag(uint8_t idx, uint24_t key) { /* nothing */ };
}

module TRAFFIC_MONITOR()
{
  instance CAM8x24: thecam();

  register uint7_t: spare_counter;

  memory sram uint6_t: countram[256];

  instance VICTIM_DETERMINER: vd();

  export void update(uint24_t key)
  {
    int idx = thecam.findindex(key); // The instance is named thecam.
    if (idx < 0)
     {
       idx = vd.provide(key);
       thecam.settag(idx, key);
       countram[idx] = 1;
     }
     else countram[idx] = MIN(15, countram[idx] + 1); // Saturating increment.
     vd.note_use(key);
  };

  export int read(uint24_t key)
  {
     int idx = thecam.findindex(key);
     return (idx<0) ? -1: countram[idx];
  };
}

This example TRAFFIC_MONITOR instantiates four child components: two are hard-coded native in the language (the RAM and the register) and two are defined in IP-XACT. It exports two methods. It is based on a real example from the NetFPGA project.

The two exported methods default to having separate hardware port pairs (each port pair has a put and an aout side). However, they could be instructed with a pragma to share one port pair, in which case the exported IP-XACT will document their code points in an extra METHOD_SELECT subfield on the single put side.
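One way to picture the shared port pair is as a put-side word that carries a method-select field alongside the argument. The text above only says that code points are documented in a METHOD_SELECT subfield; the code point values and the bit layout below (select field placed above the 24-bit key) are assumptions made for illustration.

```python
# Hypothetical code points for the two exported methods.
METHOD_SELECT = {"update": 0, "read": 1}

KEY_BITS = 24
KEY_MASK = (1 << KEY_BITS) - 1

def encode_put(method, key):
    """Pack a call into one put-side word: select field above the key."""
    if not 0 <= key <= KEY_MASK:
        raise ValueError("key must fit in 24 bits")
    return (METHOD_SELECT[method] << KEY_BITS) | key

def decode_put(word):
    """Recover (method name, key) from a put-side word."""
    select, key = word >> KEY_BITS, word & KEY_MASK
    name = next(n for n, code in METHOD_SELECT.items() if code == select)
    return name, key

word = encode_put("read", 0xABCDEF)
decoded = decode_put(word)
```

Sharing one port pair saves wiring at the cost of serialising the two methods, so it suits cases where `update` and `read` are never needed in the same cycle.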

The instantiated CAM appears to have a 24-bit-wide tag input bus, but the real chip had a 12-bit-wide port over which two words need to be transferred. To give the illusion of 24 bits being transferred in one operation, various approaches are possible. One is to instantiate a BlueParrot wrapper around the real chip that exports the preferred wider interface and to make the adaptation in the wrapper. This wrapper could be pre-compiled and imported for the current compilation in IP-XACT form, or it can be compiled at the same time as the TRAFFIC_MONITOR, where it will be 'in-lined' or 'flattened'. The optimiser should be able to make inter-module optimisations in the latter case, giving a better result.
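The width adaptation performed by such a wrapper can be sketched behaviourally. This is a Python model, not BlueParrot source: `NarrowCam` is a stand-in for the real chip, `write12` is a hypothetical name for its 12-bit port, and sending the high word first is an assumption.

```python
def split_key(key24):
    """Split a 24-bit key into (high, low) 12-bit words."""
    if not 0 <= key24 < (1 << 24):
        raise ValueError("key must fit in 24 bits")
    return (key24 >> 12) & 0xFFF, key24 & 0xFFF

class NarrowCam:
    """Stand-in for the real chip: a tag arrives as two 12-bit transfers."""

    def __init__(self):
        self.tags = {}       # idx -> assembled 24-bit tag
        self._partial = {}   # idx -> high word waiting for its low word

    def write12(self, idx, word12):
        if not 0 <= word12 < (1 << 12):
            raise ValueError("word must fit in 12 bits")
        if idx in self._partial:
            self.tags[idx] = (self._partial.pop(idx) << 12) | word12
        else:
            self._partial[idx] = word12

class WideCamWrapper:
    """Exports the 24-bit settag seen by TRAFFIC_MONITOR."""

    def __init__(self, chip):
        self.chip = chip

    def settag(self, idx, key24):
        high, low = split_key(key24)
        self.chip.write12(idx, high)   # first transfer: upper 12 bits
        self.chip.write12(idx, low)    # second transfer: lower 12 bits

chip = NarrowCam()
WideCamWrapper(chip).settag(3, 0xABCDEF)
```

The parent sees a single `settag` call while the chip sees two transfers, which is the illusion the wrapper is meant to provide.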

HPR System Integrator Example

HPR System Integrator is an open-source EDA tool that uses the HPR L/S library. It is primarily used to generate top-level circuit structures around logic generated from diverse sources. Main Page.

Draft/position paper: HPR System Integrator: A SoC/Multiblade Link Editor David J Greaves Computer Laboratory, University of Cambridge, UK. Working draft. DOWNLOAD PDF

FPL 2017 Demo.

Older: 2008 draft page for the original formal proof cards project: CARDs behavioural datasheets.