Kiwi Scientific Acceleration: C# high-level synthesis for FPGA execution.
Zynq Substrate

Kiwi Scientific Acceleration: Zynq Substrate Programmed I/O and DMA Basic Demo

Stop press: March 2017: We are now making first tests of the Kiwi substrate for the Amazon EC2 F1, FPGA-in-the-cloud Service!


This page illustrates the setup used on the Zynq platform. Other server blades are typically connected via PCIe. Whether the PCIe port is hardened or implemented in programmable logic (PL), it will typically export both AXI master and slave ports, so the situation will be much the same as for the Zynq.

A major variation between FPGA cards is whether the external DRAM is shared with the local ARM or x86 cores. On the Zynq it is shared, and whether accesses from the PL are cache-consistent is determined by whether they use the ACP or the HP connections on the way out of the PL.

General Substrate Arrangement (Zynq Example)

The general arrangement on Xilinx Zynq is described here.

The design emitted by KiwiC needs to be inserted in the substrate for the appropriate board or server blade.

The main substrate shim is boilerplate RTL code that connects to the M_AXI_GP0 programmed I/O bus for simple start/stop control and parameter exchange. It is recommended that every compiled design has a serial number hard-coded in the C# source code and that this is modified on every design iteration. The first function of the substrate shim is to provide readback of this value.

The other features of the shim are starting and stopping the design and collecting abend codes. Sources of abend are null-pointer de-reference, out-of-memory, divide-by-zero, user assertion failure, and so on.

A Kiwi design that accesses main memory will have a number of load/store ports. These can be half-duplex or simplex. Simplex is preferred when main memory is served over the AXI bus, as in the Zynq design. (Of course there may be a lot of BRAM memory in the synthesised design itself, but that does not appear in this figure.) Simplex works well with AXI since each AXI port itself consists of two independent simplex ports, one for reading and one for writing.

In the illustrated example, the design used three simplex load/store ports. These need connecting to the available AXI busses hardened on the Zynq platform and made available to the FPGA programmable logic. The user has the choice of a cache-coherent, 64-bit AXI bus that will compete with the ARM cores for the L2 cache front-side bandwidth, or four other high-performance 64-bit AXI busses that offer high DRAM bandwidth. These four are not used in the example figure.

Each KiwiC-generated load-store port is an in-order unit, like an individual load or store station in an out-of-order processor. By multiplexing their traffic onto AXI-4 busses, bus bandwidths are matched and out-of-order service from the DRAM system is exploited.
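The ordering constraint described above can be illustrated with a small behavioural check. This is only a sketch, not KiwiC or substrate code: the function name and the (port, tag) representation are assumptions made for the example. It accepts any interleaving of completions between ports but insists that each port sees its own requests complete in issue order.

```python
from collections import defaultdict, deque


def completion_order_is_legal(issued, completed):
    """Return True if `completed` is a reordering of `issued` that keeps
    each port's requests in their issue order.

    Both arguments are lists of (port_id, request_tag) pairs. Requests
    from different ports may complete in any relative order.
    """
    if sorted(issued) != sorted(completed):
        return False  # a request was dropped, duplicated or invented

    # Per-port FIFO of outstanding requests, in issue order.
    pending = defaultdict(deque)
    for port, tag in issued:
        pending[port].append(tag)

    for port, tag in completed:
        if pending[port][0] != tag:
            return False  # this port was served out of order
        pending[port].popleft()
    return True
```

With three ports, a completion sequence that serves port 1 and port 2 before port 0 is accepted, whereas one that swaps two requests from the same port is rejected.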

Each load/store port in the generated RTL is properly described in the IP-XACT that KiwiC renders to describe the resulting design. When this IP-XACT is imported into a design suite, the load/store ports can be wired manually to the AXI switch ports in a schematic editor. (Soon this will be automated, when KiwiC invokes the SoC Render tool inside the HPR L/S library.)

Note that KiwiC as of December 2016 generates so-called HFAST ports, which are either half-duplex, loadonly or storeonly. These are what is described in the KiwiC-generated IP-XACT. The user currently has to instantiate manually, in the schematic editor, the little protocol convertors that come with KiwiC and that convert HFAST variants to AXI variants for connection to the vendor-provided AXI switch blocks. Again, this will be automated when SoC Render is enabled.

Initial Basic Demo (for ACS P35 January 2017)

This basic demo example contains all of the files to install a Kiwi design on a Xilinx Zynq platform and make programmed I/O access to the Kiwi design for control, debug and data transfer using a crude net-level interface.

Here we will edit the RTL manually, since that is a good experience to have!

Everything needed is packaged in a ZIP file or Git folder. This provides a Makefile that invokes the Vivado Design Suite in batch mode without its GUI so no mouse clicks are needed.

Note: The ZIP file contains some IP blocks generated by Vivado where the copyright belongs to Xilinx.

This example illustrates low-level, manual access. For scientific users, a more automated approach is needed.

Figure: Block diagram of the main Zynq hardened data paths (Zynq APU block diagram (C) Xilinx Inc.).

Figure: Block diagram of Ksubs2 installed in the FPGA (Zynq part simplified).

Directory Structure

Simple Programmed I/O Device

The director shim contains the following example of an array that is mapped for programmed I/O operation. Since it is marked as public, parts of the C# application proper can make random access to it according to some user-defined, application-specific protocol.

The array is small enough for Kiwi to map it to a register file instead of a BRAM, so single-cycle access without contention is always possible just using FPGA wiring.

This design template can be adjusted as much as you like. For instance, other status and control registers are easy enough to define in C# and make addressable via the programmed I/O bus.

  [Kiwi.InputWordPort("pio_address")]
  static int pio_address;
  [Kiwi.InputWordPort("pio_wdata")]
  static int pio_wdata;
  [Kiwi.OutputWordPort("pio_rdata")]
  static int pio_rdata = 0;
  [Kiwi.InputBitPort("pio_hwen")]
  static bool pio_hwen;

  public const int pio_size = 10;

  // TODO put Kiwi register file attribute here. [Kiwi.RegisterFile("pio_space")]
  public static int [] pio_space = new int [pio_size];

  [Kiwi.HardwareEntryPoint()]
  public static void host_pio_process()
  {
    while (true)
      {
        if (pio_hwen) pio_space[pio_address] = pio_wdata;
        else pio_rdata = pio_space[pio_address];
        Kiwi.Pause();
      }
  }
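To reason about what the host sees on the programmed I/O nets, it can help to model the loop in software. The following is a minimal behavioural sketch, not generated code: the class name PioModel and its clock method are hypothetical, one call stands for one pass round the while loop up to Kiwi.Pause(), and the explicit range check is a choice of the model (the hardware behaviour for an out-of-range address is not stated here).

```python
PIO_SIZE = 10  # mirrors pio_size in the C# source


class PioModel:
    """Software model of host_pio_process (illustrative only)."""

    def __init__(self):
        self.pio_space = [0] * PIO_SIZE  # the register file
        self.pio_rdata = 0               # read-data net, holds its last value

    def clock(self, pio_address, pio_wdata, pio_hwen):
        """Run one loop iteration and return the value on pio_rdata."""
        if not 0 <= pio_address < PIO_SIZE:
            raise IndexError("pio_address out of range (model assumption)")
        if pio_hwen:
            # Host write: store the write data, pio_rdata is left unchanged.
            self.pio_space[pio_address] = pio_wdata
        else:
            # Host read: present the addressed word on pio_rdata.
            self.pio_rdata = self.pio_space[pio_address]
        return self.pio_rdata
```

A write followed by a read of the same address returns the written word, and pio_rdata keeps its previous value during a write cycle.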

Procedure

Copy the files to a folder.

Update the very start of the ksubs2.tcl file to set design_dir to the current folder and any other folders where your RTL source files are to be found.

Update the start of the ksubs2.tcl file to select the target device and pinout file:

# Set the Zynq Chip type
# Parallella wants: set part xc7z010clg400-1  set vdefine PARCARD10=1
# Zedboard wants:   set part xc7z020clg484-1  set vdefine ZEDBOARD20=1
set part xc7z010clg400-1  
set vdefine PARCARD10=1

# Set the PCB pinout
# Parallella wants: set pinout $design_dir/pinouts/parallella10.xdc
# Zedboard wants:   set pinout $design_dir/pinouts/zedboard20.xdc
set pinout $design_dir/pinouts/parallella10.xdc

Your own design is instantiated inside ksubs2_innercore.v - you need to modify the RTL of ksubs2_innercore.v accordingly.

Kiwi HLS Specific Procedure

For Kiwi HLS use, set the ANAME variable at the top of the Makefile to the name of your design (without the .cs suffix).

ANAME=lu-decomp-sp

Change your design's 24-bit serial number by editing the setting in KSubs2_Director.cs. You should probably do this each time you make a major change.

  [Kiwi.OutputWordPort("design_serial_number")]
  static int design_serial_number = 0x552904;

Non-Kiwi HLS Step

Delete the KiwiC invocation from the Makefile. Provide your own RTL to instantiate inside ksubs2_innercore.v instead.

Compile RTL and Install Bitstream

Type

$ make

One to 360 minutes later, your design file will be here

ls -l /tmp/ksubs2-fpga
total 2108
-rw-rw-r-- 1 djg11 djg11   10255 Feb  7 09:22 clock_util.rpt
-rw-rw-r-- 1 djg11 djg11   10213 Feb  7 09:22 post_place_util.rpt
-rw-rw-r-- 1 djg11 djg11   16978 Feb  7 09:22 post_route_power.rpt
-rw-rw-r-- 1 djg11 djg11     651 Feb  7 09:22 post_route_status.rpt
-rw-rw-r-- 1 djg11 djg11   12899 Feb  7 09:22 post_route_timing_summary.rpt
-rw-rw-r-- 1 djg11 djg11    7679 Feb  7 09:22 post_synth_util.rpt
-rw-rw-r-- 1 djg11 djg11 2083856 Feb  7 09:23 topfpga.bit
md5sum /tmp/ksubs2-fpga/topfpga.bit
ee4d365bff12b116832962157a2006cd  /tmp/ksubs2-fpga/topfpga.bit

Copy the bit file to the Zynq Linux system using scp, Dropbox, etc.
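Before installing a copied bitstream, it can be worth confirming that its MD5 digest matches the one printed by md5sum on the build machine. The following sketch computes the same digest using Python's standard hashlib module; the function name md5_of_file is a hypothetical helper, not part of the ksubs2 package.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

On the Zynq, the result of md5_of_file("topfpga.bit") can then be compared by eye with the md5sum output from the build machine.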

From the Zynq command line, install the bit file into the FPGA using:

cat topfpga.bit > /dev/xdevcfg

Read back your serial number ...

root@parcard-djg1:~# gcc -o devmem2 devmem2.c 
root@parcard-djg1:~# ./devmem2 0x43c00000 w
/dev/mem opened.
Memory mapped at address 0xb6fc5000.
Value at address 0x43C00000 (0xb6fc5000): 0x552904
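The same readback can be scripted. The following is a rough sketch of what devmem2 does for a single word read, using only Python's standard mmap, os and struct modules. The function name read_word is hypothetical, a little-endian ARM Linux is assumed, and the mem_path parameter exists only so that the function can be tried on an ordinary file.

```python
import mmap
import os
import struct


def read_word(phys_addr, mem_path="/dev/mem"):
    """Read one 32-bit little-endian word at byte address phys_addr."""
    page = mmap.ALLOCATIONGRANULARITY
    base = phys_addr & ~(page - 1)   # mmap offsets must be page aligned
    offset = phys_addr - base        # position of the word inside the page
    fd = os.open(mem_path, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=base) as window:
            return struct.unpack_from("<I", window, offset)[0]
    finally:
        os.close(fd)
```

Run as root on the board used above, hex(read_word(0x43C00000)) should report the serial number, which is 0x552904 for the example design. Note that, unlike the volatile pointer dereference in devmem2.c, this sketch does not guarantee that the bus sees a single 32-bit access.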

Instruct the director to start your program

  ... TBA

Substrate and Directing

By directing we mean the ability to start/stop/singlestep and collect run-time faults from an application program.

A runtime fault arises when a subsystem writes a value other than 255 to its abend syndrome register. The code 255 means the subsystem has not started yet, thereby allowing code 0 to denote normal exit. When a fault is detected, the abend syndrome register for the faulting subsystem and the PC values and/or waypoints for all threads in all subsystems are collected by the substrate via the directing interface.
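On the host side, the syndrome encoding described above can be interpreted along the following lines. This is a sketch under stated assumptions: only the meanings of 255 and 0 come from the text, the function and constant names are hypothetical, and the individual fault codes (null-pointer dereference, divide-by-zero and so on) are not decoded because their numbering is not given here.

```python
ABEND_NOT_STARTED = 255  # subsystem has not started yet
ABEND_NORMAL_EXIT = 0    # subsystem finished normally


def classify_abend(syndrome):
    """Map an abend syndrome register value to a short description."""
    if syndrome == ABEND_NOT_STARTED:
        return "not started"
    if syndrome == ABEND_NORMAL_EXIT:
        return "normal exit"
    # Any other value is reported as an abnormal end; the code itself is
    # passed through undecoded.
    return "abnormal end (code %d)" % syndrome
```

A director script could poll the register, ignore "not started", and fetch the PC values and waypoints only once some other value appears.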

Under incremental compilation, SoC Render inserts glue logic to aggregate the fault vectors, tagging them with subsystem instance number as needed.

Waypoints and Virtual LEDs

Nearly all FPGA blades have some simple LED indicators connected to I/O pads. Kiwi provides a uniform way to drive these, and the substrate makes their values available to the host CPU, which is useful when the LEDs are in a different room or continent from the application user.

Download

This is being made public Feb 2017.

Computer Laboratory users can copy a temporary version from ~djg11/vivado-cmdlines/ksubs2

If you need hprls you will find a stable copy is being added here at some point soon: /usr/groups/han/clteach/hprls2

Conclusions

This example illustrated low-level, manual access. For scientific users, a more automated approach is needed. Please see ... TBC ...


December 2016.