Kiwi Scientific Acceleration: C# high-level synthesis for FPGA execution.
Zynq Substrate

Kiwi Scientific Acceleration: Zynq Substrate Programmed I/O and DMA Basic Demo


This page illustrates the setup used on the Zynq platform. Other server blades are typically connected via PCIe. Whether the PCIe port is hardened or implemented in the PL, it will typically export both AXI master and slave ports, and the situation will be much the same as for the Zynq.

A major variation between FPGA cards is whether the external DRAM is shared with the local ARM or x86 cores. On the Zynq it is shared, and whether accesses are cache-consistent depends on whether the design uses the ACP or the HP connections on the way out of the PL.

General Substrate Arrangement (Zynq Example)

The general arrangement on Xilinx Zynq is described here.

The design emitted by KiwiC needs to be inserted in the substrate for the appropriate board or server blade.

The main substrate shim is boilerplate RTL code that connects to the M_AXI_GP0 memory-mapped I/O bus for simple start/stop control and parameter exchange. It is recommended that every compiled design has a serial number hard-coded in the C# source code and that this is changed on every design iteration. The first function of the substrate shim is to provide readback of this value.

The other features of the shim are starting and stopping the design and collecting abend codes. Sources of abend are null-pointer de-reference, out-of-memory, divide-by-zero, user assertion failure, and so on.

A Kiwi design that makes access to main memory will have a number of load/store ports. These can be half-duplex or simplex. Simplex is preferred when main memory is served over the AXI bus, as in the Zynq design. (Of course there may be a lot of BRAM memory in the synthesised design itself, but that does not appear on this figure.) Simplex works well with AXI since each AXI port itself consists of two independent simplex ports, one for reading and one for writing.

In the illustrated example, the design uses three simplex load/store ports. These need connecting to the AXI busses that are hardened on the Zynq platform and made available to the FPGA programmable logic. The user has the choice of a cache-coherent, 64-bit AXI bus (the ACP) that will compete with the ARM cores for the L2 cache front-side bandwidth, or four other high-performance 64-bit AXI busses (the HP ports) that offer high DRAM bandwidth. These four are not used in the example figure.

Each KiwiC-generated load-store port is an in-order unit, like an individual load or store station in an out-of-order processor. By multiplexing their traffic onto AXI-4 busses, bus bandwidths are matched and out-of-order service from the DRAM system is exploited.

Each load/store port in the generated RTL is described in the IP-XACT that KiwiC renders for the resulting design. When this IP-XACT is imported into a design suite, the load/store ports can be wired manually to the AXI switch ports in a schematic editor. (Soon this will be automated, when KiwiC invokes the HPR System Integrator tool, which is also built on the HPR L/S library and is part of that distro.)

Note that KiwiC, as of December 2016, generates so-called HFAST ports, which are either half-duplex, loadonly or storeonly. These are what is described in the KiwiC-generated IP-XACT. Currently, the user has to manually instantiate, in the schematic editor, the little protocol converters that come with KiwiC and that convert HFAST variants to AXI variants for connection to the vendor-provided AXI switch blocks. Again, this will be automated when SoC Render is enabled.

Initial Basic Demo (for ACS P35 January 2017)

This basic demo example contains all of the files to install a Kiwi design on a Xilinx Zynq platform and make memory-mapped I/O access to the Kiwi design for control, debug and data transfer using a crude net-level interface.

Here we will edit the RTL manually, since that is a good experience to have!

Everything needed is packaged in a ZIP file or Git folder. This provides a Makefile that invokes the Vivado Design Suite in batch mode without its GUI so no mouse clicks are needed.

Note: The ZIP file contains some IP blocks generated by Vivado where the copyright belongs to Xilinx.

This example illustrates low-level, manual access. For scientific users, a more automated approach via HPR System Integrator or similar wiring tool would be appropriate.

Zynq APU Block Diagram (C) Xilinx inc.
Block Diagram of the Main Zynq Hardened Data Paths.

General structure of Kiwi designs on the Zynq Platform using the ksubs3 substrate.
Block diagram of the Kiwi substrate, ksubs3.1, installed in a Zynq FPGA (all parts simplified). The module kiwi_axi_pio_target is the 'director shim' and the innercore contains user role(s) called DUTs. A substrate server runs on one of the ARM cores for file system and I/O handling.

The ksubs3 features include:

ksubs3.1 is fairly low performance (10 to 20 MB/sec), but ksubs3.2 has per-VC credit-based flow control and has (or will have) much greater throughput and an interrupt facility.

ksubs3 Directory Structure

We are moving to ksubs3.1 Jan 2018:

Director Shim Programmer's View Memory Map

Register numbers 0..31 are defined for the director shim:
Reg Address    Function
 0  0x43c00000 serial no
 1  0x43c00004 counter
 2  0x43c00008 write fold, run/stop and interrupt control in future.
 3  0x43c0000c mon0
 4  0x43c00010 mon1
 5  0x43c00014 waypoint
 6  0x43c00018 result_lo
 7  0x43c0001c result_hi
 8  0x43c00020 tx_Noc16_lo
 9  0x43c00024 tx_Noc16_hi
10  0x43c00028 tx_Noc16_status
11  0x43c0002c
12  0x43c00030 rx_Noc16_lo
13  0x43c00034 rx_Noc16_hi
14  0x43c00038 rx_Noc16_status
15  0x43c0003c

Register numbers 32 onwards are served by the DUT if it wishes.
Reg Address    Function
32  0x43c00080 DUT role register 0
33  0x43c00084 DUT role register 1
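
The address column follows from the register number: each register is one 32-bit word, so its byte address is the base address plus four times the register number. A minimal Python sketch of that arithmetic is given here; the names PIO_BASE, DUT_FIRST_REG, reg_addr and is_dut_register are illustrative and are not part of ksubs3.

```python
# Base address of the director shim register window, taken from the table above.
PIO_BASE = 0x43C00000
# First register number served by the DUT rather than by the director shim.
DUT_FIRST_REG = 32


def reg_addr(regno: int) -> int:
    """Return the byte address of 32-bit register number `regno`."""
    if regno < 0:
        raise ValueError("register number must be non-negative")
    return PIO_BASE + (regno << 2)  # one register per 4-byte word


def is_dut_register(regno: int) -> bool:
    """True when the register is served by the DUT, not by the shim."""
    return regno >= DUT_FIRST_REG


if __name__ == "__main__":
    for n in (0, 5, 14, 32, 33):
        owner = "DUT" if is_dut_register(n) else "shim"
        print(f"reg {n:2d} -> 0x{reg_addr(n):08x} ({owner})")
```

The addresses produced this way can be passed to a tool such as devmem2 on the Zynq command line.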

Simple Memory-Mapped I/O Device

The role (user's design) can contain the following example of an array that is mapped for I/O operation by either of the host ARMs. Since it is marked as public, other parts of the C# application can make random access to it according to some user-defined, application-specific protocol or register file definition.

The RTL director shim (part of the shell/substrate) defines the first 32 (or so) registers in the mapped space. User registers should typically start at byte offset 0x80 (32 << 2), that is, at register number 32.

The array is small enough for Kiwi to map it to a register file instead of a BRAM, so single-cycle access without contention is always possible just using FPGA wiring.

This design template can be adjusted as much as you like. For instance other status and control registers are easy enough to define in C# and make addressable via the I/O bus.

  // Host-driven PIO nets: address, write data and write enable come in from
  // the director shim; read data goes back out.
  [Kiwi.InputWordPort("pio_address")]
  static int pio_address;
  [Kiwi.InputWordPort("pio_wdata")]
  static int pio_wdata;
  [Kiwi.OutputWordPort("pio_rdata")]
  static int pio_rdata = 0;
  [Kiwi.InputBitPort("pio_hwen")]
  static bool pio_hwen;

  public const int pio_size = 10;

  // TODO put Kiwi register file attribute here. [Kiwi.RegisterFile("pio_space")]
  public static int [] pio_space = new int [pio_size];

  [Kiwi.HardwareEntryPoint()]
  public static void host_pio_process()
  {
    while (true)
      {
        // A host write when pio_hwen is set, otherwise a read of the addressed word.
        if (pio_hwen) pio_space[pio_address] = pio_wdata;
        else pio_rdata = pio_space[pio_address];
        Kiwi.Pause();
      }
  }
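
The behaviour of this loop can be tried out in software before committing to a synthesis run. The following Python sketch is only a behavioural model, under the assumption that one loop iteration corresponds to one step between calls to Kiwi.Pause(); the class name PioModel and its step method are hypothetical and are not part of Kiwi.

```python
class PioModel:
    """Software model of the host_pio_process register file (illustrative only)."""

    def __init__(self, size: int = 10):
        self.space = [0] * size  # mirrors pio_space
        self.rdata = 0           # mirrors the pio_rdata output port

    def step(self, address: int, wdata: int, hwen: bool) -> int:
        """Model one loop iteration: write when hwen is set, otherwise read."""
        if hwen:
            self.space[address] = wdata
        else:
            self.rdata = self.space[address]
        return self.rdata  # the read-data port holds its value across writes


if __name__ == "__main__":
    model = PioModel()
    model.step(address=3, wdata=0x1234, hwen=True)            # host write
    print(hex(model.step(address=3, wdata=0, hwen=False)))    # host read
```

A model like this is a convenient place to prototype an application-specific register protocol before it is written in C#.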

Procedure

Copy the files to a folder.

Update the very start of the ksubs3.tcl file to set design_dir to the current folder and any other folders where your RTL source files are to be found.

Update the start of the ksubs3.tcl file to select the target device and pinout file:

# Set the Zynq Chip type
# Parallella wants: set part xc7z010clg400-1  set vdefine PARCARD10=1
# Zedboard wants:   set part xc7z020clg484-1  set vdefine ZEDBOARD20=1
set part xc7z010clg400-1  
set vdefine PARCARD10=1

# Set the PCB pinout
# Parallella wants: set pinout $design_dir/pinouts/parallella10.xdc
# Zedboard wants:   set pinout $design_dir/pinouts/zedboard20.xdc
set pinout $design_dir/pinouts/parallella10.xdc

Your own design is instantiated inside ksubs3_innercore.v, so you need to modify the RTL of ksubs3_innercore.v accordingly (remove all references to primes-offchip.v or similar).

Kiwi HLS Specific Procedure

For Kiwi HLS use, set the ANAME variable at the top of the Makefile to the name of your design (without the .cs suffix).

ANAME=lu-decomp-sp

Adjust your design's 24-bit serial number by adjusting the setting in KSubs3_Director.cs. You should probably do this each time you make a major change.

  [Kiwi.OutputWordPort("design_serial_number")]
  static int design_serial_number = 0x552904;

Non-Kiwi HLS Step

Delete the KiwiC invocation from the Makefile. Provide your own RTL to instantiate as the DUT inside ksubs3_innercore.v instead.

Compile RTL and Install Bitstream

Type

$ make

Between one and 360 minutes later, your design file will be here:

ls -l /tmp/ksubs3-fpga
total 2108
-rw-rw-r-- 1 djg11 djg11   10255 Feb  7 09:22 clock_util.rpt
-rw-rw-r-- 1 djg11 djg11   10213 Feb  7 09:22 post_place_util.rpt
-rw-rw-r-- 1 djg11 djg11   16978 Feb  7 09:22 post_route_power.rpt
-rw-rw-r-- 1 djg11 djg11     651 Feb  7 09:22 post_route_status.rpt
-rw-rw-r-- 1 djg11 djg11   12899 Feb  7 09:22 post_route_timing_summary.rpt
-rw-rw-r-- 1 djg11 djg11    7679 Feb  7 09:22 post_synth_util.rpt
-rw-rw-r-- 1 djg11 djg11 2083856 Feb  7 09:23 topfpga.bit
md5sum /tmp/ksubs3-fpga/topfpga.bit
ee4d365bff12b116832962157a2006cd  /tmp/ksubs3-fpga/topfpga.bit

Copy the bit file to the Zynq Linux system using scp, Dropbox or similar.

From the Zynq command line, install the bit file into the FPGA using:

cat topfpga.bit > /dev/xdevcfg

Read back your serial number ...

root@parcard-djg1:~# gcc -o devmem2 devmem2.c 
root@parcard-djg1:~# ./devmem2 0x43c00000 w
/dev/mem opened.
Memory mapped at address 0xb6fc5000.
Value at address 0x43C00000 (0xb6fc5000): 0x552904
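
The same readback can be scripted. The Python sketch below assumes that /dev/mem is accessible (root on the Zynq Linux), that the director shim sits at the base address used above, and that one page covers its registers; the helper names read_reg32 and map_shim are illustrative. The register decode is kept separate from the mapping so that it can be exercised on any buffer. On real hardware the access width generated by this code is not guaranteed to be a single 32-bit bus transaction, so treat it as a starting point only.

```python
import mmap
import os
import struct

PIO_BASE = 0x43C00000  # director shim base address, as used with devmem2 above
MAP_LEN = 4096         # assumption: one page covers the shim registers


def read_reg32(mem, regno: int) -> int:
    """Read 32-bit little-endian register number `regno` from a buffer."""
    return struct.unpack_from("<I", mem, regno * 4)[0]


def map_shim(base: int = PIO_BASE, length: int = MAP_LEN):
    """Map the shim registers read-only through /dev/mem (target only)."""
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ, offset=base)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
```

On the target, read_reg32(map_shim(), 0) reads register 0, which is the serial number in the director shim memory map.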

The RTL design will most likely not start the moment it is loaded via xdevcfg, since a hardware reset or other run/stop control signal from the director shim is not yet enabled. Starting the ksubs3.1-server binary on an ARM core releases the hold on the RTL, and the role/DUT then runs. A design that does not wait for a go signal from the director shim will most likely stall straight away, blocking on a Noc16 operation such as opening its first file. In the future, a ksubs3-server-daemon might already be running on the ARM side, so the program could more commonly start as soon as it is loaded.

When the substrate server is run with certain debug options turned on, we see the slotted ring packets or the file system operations being reported. Non-empty ring packets can be printed out as they pass through the server code on the ARM. Here the application is sending a file name to be opened:

  Device serial no 65991102
  Ksubs3.1-server: Loaded design serial no is 0x65991102
  ctr code now is b 0x1800000   cmd_code=0
  ctr code now is b 0x10   cmd_code=1
  00  0x00000000 0x00000000 RX 00  0000000000000000
  Code 0x80800010
  00  0x00000000 0x00000000 RX 41  0000000000010000
  Kiwi fserver: cmd open file 1 a2=10000
  TX 50  0000000000000000
  Code 0x80800010
  00  0x00000000 0x00000000 RX 50  0000000000000000
  Code 0x80800010
  00  0x00000000 0x00000000 RX 41  0000000000020054
  Kiwi fserver: cmd open file 1 a2=20054
  TX 50  0000000000000000
  Code 0x80800010
  00  0x00000000 0x00000000 RX 50  0000000000000000
  Code 0x80800010
  00  0x00000000 0x00000000 RX 41  0000000000020065
  Kiwi fserver: cmd open file 1 a2=20065
  ...

If we run without debugging turned on, we see minimal output. The application has run and written its result to the parallel bus provided for simple answers of up to 64 bits. It then wrote code 1 to the abend register, which caused the substrate server to exit.

  Device serial no 65991102
  Ksubs3.1-server: Loaded design serial no is 0x65991102
  Target exits with return code rc=0x1 ... Done
  Result register answer is 0x00000004000002B1
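
The 64-bit answer printed here corresponds to the two 32-bit registers result_lo and result_hi in the director shim memory map. A small Python sketch of how a host program might join them follows, assuming that result_hi holds the upper 32 bits; the function names are illustrative.

```python
def combine_result(result_lo: int, result_hi: int) -> int:
    """Join two 32-bit register values into one 64-bit result."""
    return ((result_hi & 0xFFFFFFFF) << 32) | (result_lo & 0xFFFFFFFF)


def split_result(value: int):
    """Split a 64-bit value into its (result_lo, result_hi) register words."""
    return value & 0xFFFFFFFF, (value >> 32) & 0xFFFFFFFF


if __name__ == "__main__":
    # Register words consistent with the answer shown in the log above.
    print(f"0x{combine_result(0x000002B1, 0x00000004):016X}")
```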

Files for this specific demo

The ARM-side Zynq substrate code was placed here and the ksubs3 rtl and rtl_sim files will be there soon:

 https://djg11@bitbucket.org/djg11/kiwi-ksubs3.git

The Kiwi application used in the above demo, together with the file system stub, is in the Knoc16Test folder.

Substrate and Directing

By directing we mean the ability to start/stop/singlestep and collect run-time faults from an application program.

A runtime fault arises when a subsystem writes a value other than 255 to its abend syndrome register. The code 255 means the subsystem has not started yet, thereby allowing code 0 to denote normal exit. When a fault is detected, the abend syndrome register for the faulting subsystem and the PC values and/or waypoints for all threads in all subsystems are collected by the substrate via the directing interface.
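
The convention for the abend syndrome register can be captured in a few lines. The following Python sketch encodes only what is stated above (255 means not started, 0 means normal exit); reporting every other value simply as a stop code is an assumption, and the names are illustrative.

```python
ABEND_NOT_STARTED = 255  # subsystem has not started yet
ABEND_NORMAL_EXIT = 0    # subsystem finished normally


def classify_abend(code: int) -> str:
    """Classify a value read from an abend syndrome register."""
    if code == ABEND_NOT_STARTED:
        return "not started"
    if code == ABEND_NORMAL_EXIT:
        return "normal exit"
    # Assumption: any other value is reported as-is for the user to interpret.
    return f"stopped with code 0x{code:02x}"


if __name__ == "__main__":
    for value in (255, 0, 1):
        print(value, "->", classify_abend(value))
```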

Under incremental compilation, SoC Render inserts glue logic to aggregate the fault vectors, tagging them with subsystem instance number as needed.

Waypoints and Virtual LEDs

Nearly all FPGA blades have some simple LED indicators connected to I/O pads. Kiwi provides a uniform way to drive these, and the substrate makes their values available to the host CPU, which is useful when the LEDs are in a different room or continent from the application user.

Download

For local use, this is all on the links on the ACS P35 Toolinfo page.

Computer Laboratory users can copy a temporary version of ksubs3 from ~djg11/vivado-cmdlines/ksubs3

ksubs4 has both the DMA mastering and the PIO and NOC16 slave. It does not have the large amount of Xilinx IP that converts from AXI4 to AXI3; instead it connects directly to the hardened AXI4 connections at the PL boundary. The initial version, which does not have credit-based flow control, is here: bitbucket.org/djg11/cbg-hpr-ksubs4a-zynq.git.

Conclusions

This example illustrated low-level, manual access. For scientific users, a more automated approach is needed. Please see ... TBC ...


December 2016.