Feb 2017 - ACS P35 Mini-Project or Group Mini-Project Task
----------------------------------------------------------------------------------------

This year P35 Exercise 3 will consist of the following items:

  3.01 - choose an application task amenable to hardware acceleration
         (the suggested task is big data bloom filtering)
  3.02 - get a baseline version of it working in C on a unix workstation
         without acceleration (a minimal sketch of such a baseline is
         given after this list)
  3.03 - port the C version to run on the Prazor simulator without
         acceleration
  3.04 - port the C version to the Zynq ARM without acceleration
  3.05 - designing some sort of hardware accelerator (keep it simple)
  3.06 - documenting the accelerator, giving its programming model and
         estimating its expected performance
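For those taking the suggested bloom filtering task, a minimal C baseline
of the kind meant in 3.02 might look like the sketch below. The filter
size, the pair of hash functions and the add/query interface are all
placeholder choices, not part of the exercise specification:

  #include <stdint.h>

  #define BLOOM_BITS (1u << 20)      /* 1 Mbit filter: placeholder size */
  static uint8_t bloom[BLOOM_BITS / 8];

  /* Two cheap string hashes (FNV-1a and a rescaling of it): placeholder
     choices, not prescribed by the exercise. */
  static uint32_t hash1(const char *s)
  { uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
  }
  static uint32_t hash2(const char *s) { return hash1(s) * 2654435761u; }

  static void bloom_add(const char *key)
  { uint32_t a = hash1(key) % BLOOM_BITS, b = hash2(key) % BLOOM_BITS;
    bloom[a / 8] |= 1u << (a % 8);
    bloom[b / 8] |= 1u << (b % 8);
  }

  /* Returns 0 for definitely absent, 1 for possibly present. */
  static int bloom_query(const char *key)
  { uint32_t a = hash1(key) % BLOOM_BITS, b = hash2(key) % BLOOM_BITS;
    return ((bloom[a / 8] >> (a % 8)) & 1) && ((bloom[b / 8] >> (b % 8)) & 1);
  }

The query loop over a large block of keys is the natural candidate for
acceleration, and its runtime is exactly what the block-size and cache
parameters in the 3.2X items below will perturb.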
----------------------------------------------------------------------------------------

  3.10 - creating both a SystemC TLM model and a synthesisable RTL
         implementation of the accelerator (here 'creating' covers both
         manual coding and HLS)
  3.11 - installing the TLM version in the Prazor virtual platform and
         adapting the 3.03 output to use it
  3.12 - installing the RTL version on the Zynq FPGA and adapting the
         3.04 output to use it

----------------------------------------------------------------------------------------

  3.20 - Evaluate the performance of your 3.12 implementation
  3.21 - Look at the predicted performance from your 3.11 implementation
         and explain discrepancies
  3.22 - Instead of just considering execution time, complete 3.20 and
         3.21 in energy-efficiency terms

Note: in order to generate data points for comparison, and for
cross-checking predicted and measured performance, consider adjusting
the following parameters:

  - data block size
  - gcc optimisation level: -O0 versus -O2
  - L1 and/or L2 cache disable
  - CPU clock frequency adjustment
  - use one or both ARM cores

----------------------------------------------------------------------------------------

Timetable

Exercise 3a - 6/3/17 12:00
  socex3a: Consists of items 3.0X above.

Exercise 3b - 16/3/17 12:00 - Credit 20 MARKS
  socex3b: Consists of all the other items described above, done to some
  minimal standard by the end of the Lent Term. Only a very sketchy
  evaluation is expected at this stage and simply showing 'it all
  worked' will do.

Exercise 3c - also 16/3/17 12:00
  socex3c: Will be a specification of the work for socex4 and any formal
  request for collaboration permission. The work is normally to further
  refine the 3b work in terms of accuracy and fidelity.

----

Exercise 4 SoC Design Mini-Project & Structured Essay - 25/4/17 16:00

Note that you will be able to use any part of socex3 for socex4a - the
Mini-Project - but you may instead want to start again.

For socex3, all parts, working in pairs is allowed, with one person
writing the application code and the other writing the TLM model or
message-passing API etc. But all reports must make it clear who did
what.

For socex4, all parts, you must work alone, or else get express
permission to collaborate on the 4a work.

-------------------

Further notes:

Please turn on the caches for realistic performance - if you are copying
from Hello World then the -caches-off flag should be dropped (or else
the appropriate cache-enable writes to control registers made).

Questions Arising

Q. How do you activate the accelerator on request, without flooding the
AXI bus? (Maybe this is the functionality of your communication
module?)

A. The simplest approach is to make programmed I/O loads and stores to
the accelerator from the ARM and to poll for completion instead of
using interrupts. (A sketch combining this with programmed I/O over
/dev/mem is given after the memory-mapping answer below.)

Q. In the exercise items, does "adapting the 3.03 output" mean calling
the accelerator from the ARM binary?

A. Yes.

Q. Are there any restrictions/permissions on memory for Zynq? Should
the ARM binary decide the memory space to be used by the Zynq FPGA
using malloc? Does it matter if it's in the stack or heap segment?

A. The Zynq space needs to be uncached and, if accessed via linux, the
VM needs mapping too. The simplest approach to programmed I/O on real
Zynq hardware is to adapt the source code of devmem2, which does not
use malloc but instead mmaps /dev/mem; this works when run as root.

  http://free-electrons.com/pub/mirror/devmem2.c

  map_base = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, target & ~MAP_MASK);

The address base used by the ksubs2 example is 0x43c00000. The mmap
code may only make available 4 KBytes at a time: I have not
experimented with more.

The Vivado system has a lot of support for doing this sort of thing in
a much more automated way. It includes the SDK tool that some people
use. There are also off-the-shelf solutions for Zynq FPGA access, like
xillybus.com. But I always like to start off doing it myself with about
half a page of code instead of using fancy GUI tools.

On Prazor, you can bind your new device to busmux0 in zync.cpp at the
same address as used on the real FPGA, but this will be cached unless
you refine the UNCACHED_ADDRESS_SPACE64 macro in
vhls/src/tenos/tenos.h, which simply turns off caching above
0xE000_0000 by default. Or you can bind such devices at higher
addresses, above that limit, so that you know they will not get cached.
Clearly your application code needs to be compiled with the
appropriate, matching base value.
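Putting the polling answer and this memory-mapping answer together, a
devmem2-style user-space sketch might look as follows. Only the
0x43c00000 base comes from ksubs2; REG_CMD, REG_STATUS and DONE_BIT are
invented for illustration and must match whatever register layout your
accelerator actually implements:

  /* Sketch only: devmem2-style mmap of /dev/mem plus a polling loop.
     Run as root. */
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define ACC_BASE   0x43c00000ul
  #define MAP_SIZE   4096ul           /* one page, as noted above        */
  #define REG_CMD    0                /* word offset 0x0: hypothetical   */
  #define REG_STATUS 1                /* word offset 0x4: hypothetical   */
  #define DONE_BIT   1u               /* hypothetical completion flag    */

  int main(void)
  { int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }
    volatile uint32_t *acc = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, ACC_BASE);
    if (acc == MAP_FAILED) { perror("mmap"); return 1; }

    acc[REG_CMD] = 1;                       /* start the accelerator    */
    while (!(acc[REG_STATUS] & DONE_BIT))   /* poll: no interrupts used */
      ;

    munmap((void *)acc, MAP_SIZE);
    close(fd);
    return 0;
  }

Polling in this way costs one AXI read per loop iteration; if that is
still too much bus traffic, a short delay between polls will thin it
out.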
Q. How can I access main memory from Verilog?

A. I have not done this on Zynq so far myself. I will extend the ksubs2
reference design at some point in the future. It needs to have a second
AXI interface configured in the other direction. Without this, you will
have to copy data into and out of the RTL design's private memories
from the ARM.

In RTL for Vivado, there are rules (UG901 RAM HDL Coding Guidelines)
for BRAM inference which, if you do not follow them, will result in LUT
RAM being used instead, and LUT RAM is in much shorter supply. The
ksubs2 folder has a local-memory example in it that follows the rules.
The main rule is that the read data must be stored in one broadside
register and not directly used for anything else.

Q. How can I get the (simulated) chip's clock period programmatically?

A. For the Zynq platform, the CPU clock period can be adjusted by a
write to a control register at 0x1F00_0120. This, I think, is not
modelled in the Prazor Zynq model at the moment. Instead, the ARM
models in Prazor had their clock period factored into the setting of
m_effective_instruction_period in COREISA_IF.h, based on the average
IPC, and this was set at compile time. I believe it has changed a
little since I last looked at it, with the IPC now computed from the
instruction mix. One quick way forward is to implement a NOP-style
backdoor so the ARM can change this setting on the fly. The NOP
backdoors for ARM are implemented in /vhls/src/arm/backdoor_nops.C. The
better way is to implement and contribute the 0x1F00_0120 control
register functionality for Prazor; then the same code can run on either
platform to adjust the frequency.

Q. For the time estimation, if I am using wait(...), how is the delay
value useful?

A. See http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/obj-2.1/zhp0d7af9879.html

Q. In the Q/A there is something about two threads, one for registers
and one for memory. How can I find more information about this?

A. The standard pthreads approach works on the real Zynq. The code
djgthreads provides the same functionality running bare metal on
Prazor. The cachetest demo in the images folder illustrates its use.
There is another one called beebs-product that was run in Manchester
last week and that is known to be robust. I will find it.

Q. Can we use Kiwi HLS?

A. Yes, but the SystemC output is a little flaky/inefficient (as of
mid-Feb 2017). I will be fixing some bugs, so if you are happy to
interact with me on this I'd be very keen for you to try it.

Q. How can I ensure that the work I am trying to do is done in
registers or in memory? I guess I may use malloc() to allocate
variables to memory (and the "register" keyword to give the compiler a
hint to allocate variables to registers?). But is it possible that the
compiler may change things for optimisation and move where the work is
actually done? Or is there a way to force the compiler to do all the
work purely in registers or purely in memory?

A. You can be fairly sure that if your function uses only a few integer
variables (fewer than 5) and only has two arguments then all will be
done in registers, whereas if you call malloc to allocate heap memory
then the loads and stores to struct fields will not be in registers. I
suggest you write very short pieces of code, one or two lines in the
function body, and look at the resulting disassembly or .S files to see
what is going on.
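For example, a pair of test functions along these lines (the names and
the arithmetic are arbitrary) can be compared with gcc -O2 -S:

  /* Compare the .S output of these two. reg_work should compile to
     pure register arithmetic; mem_work is forced by volatile to load
     and store its heap cell on every iteration. */
  #include <stdlib.h>

  int reg_work(int a, int b)
  { int acc = a;
    for (int i = 0; i < b; i++) acc = acc * 3 + i;    /* registers only */
    return acc;
  }

  struct cell { int v; };

  int mem_work(int b)
  { volatile struct cell *c = malloc(sizeof *c);      /* heap resident  */
    c->v = 0;
    for (int i = 0; i < b; i++) c->v = c->v * 3 + i;  /* load+store per pass */
    int r = c->v;
    free((void *)c);
    return r;
  }

The volatile qualifier stops the compiler caching c->v in a register,
so mem_work is guaranteed to perform a load and a store on every
iteration.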
Q. If our program runs "under pthreads with two parallel cores in use",
what work is actually done on the two cores?

A. The simple bare-metal pthreads implementation called djgthreads.c is
non-preemptive and runs one thread per core.

Q. Are we supposed to run each mode (register or memory) on each core?

A. Yes. The number of cores in use is an orthogonal axis of
exploration, and so all combinations should be looked at in turn, with
the non-idle cores doing identical work at any one time.

Q. Is it necessary to use craft_wrch() in Prazor? (I have noticed that
the UART is used anyway with printf.) Or is it necessary for Zynq?

A. For Prazor, running bare metal, craft_wrch is the best output
routine to use. On Zynq there is no such function, but running with
linux you can use putchar, followed by a flush if you need to see the
output straightaway.

Q. (Also, for some reason the UART window is currently very slow -
might be a connection issue?)

A. Normally this is fine when running locally. You can disable the
window with the command-line flag -noxterm, which stops the UART model
using X windows, but the spool file and console output are still
generated by the UART.

Q. Regarding "3.04 - port the C version to the Zynq ARM without
acceleration", how different is this from running on a local UNIX
workstation? I mean, if I will not be using the FPGA, isn't running on
the Zynq ARM cores the same as compiling and running the 3.01 version
inside the card's linux shell?

A. The differences should be small: both are little-endian D32, but the
workstation is A64, and clearly a different compiler is used as well as
a different machine.

Q. Regarding "3.05 - designing some sort of hardware accelerator",
wouldn't this just be describing the accelerated design from the
paper's model?

A. You need to specify the precise split of work between subsystems and
explain why a performance gain or energy saving should be achieved.
This may involve high-level (aka back-of-envelope) modelling equations
relating to memory sizes and data transfer rates.

Q. Regarding "3.06 - documenting the accelerator giving its programming
model and estimating its expected performance", what is meant by
programming model? It can refer to many things. For example, last time
it was referring to the time abstractness of SystemC (loosely-timed vs
approximately-timed).

A. The programming model is the technical documentation a programmer
must have to write the device driver or otherwise use the hardware
resource. It describes the visible register and memory layout. It does
not describe internal state that is not visible to the programmer.
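As an illustration only, the register-layout part of such a document
for a hypothetical accelerator could be as small as the following,
accompanied by a few sentences describing the start/poll protocol.
None of these names or offsets is prescribed by the exercise:

  /* Entirely hypothetical register layout for a small accelerator.
     The programming model documents this plus the usage protocol
     (write src_addr and len, write 1 to cmd, poll status bit 0);
     internal pipeline state is deliberately not described. */
  #include <stdint.h>

  #define ACC_BASE 0x43c00000u  /* AXI slave base, as in ksubs2          */

  struct acc_regs               /* word offsets from ACC_BASE            */
  { uint32_t cmd;               /* 0x00: write 1 to start                */
    uint32_t status;            /* 0x04: bit 0 = done, bit 1 = error     */
    uint32_t src_addr;          /* 0x08: physical address of input block */
    uint32_t len;               /* 0x0C: input length in bytes           */
    uint32_t result;            /* 0x10: result word                     */
  };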