Prazor/VHLS Temporary User Manual


Very high level simulation -> VHLS. We aim to render complete energy estimates from real workloads on complete datacentres with an accuracy of ±(√10)/2.

References

  • PDF: Product Brief

  • PDF: Zynq-7000 Technical Reference Manual

  • PDF: Instruction Set Reference Card

  • PDF: Cortex-A9 MPCore TRM

  • PDF: ARMv7-M Architecture Reference Manual

  • PDF: ARMv7-A/R Architecture Reference Manual

  • PDF: Interrupt Controller (GIC) Specification

  • PDF: ARM1176JZ-S TRM

    Introduction

    The Prazor simulator is written using SystemC TLM 2.0 sockets and can model a number of CPU architectures (x86_64, ARM, MIPS, OpenRISC).

    Currently we are using it to model the Parallella card and the ZedBoard. The model is binary compatible with the real hardware, meaning it can run the same linux kernel and use the same SD card images. We might next make a 4-core model that will be the basis of the new Raspberry Pi 2, or we might make an Allwinner A20 (cubieboard/sunxi) model.

    TLM Zynq and Parallella Model

    Coverage

    The Zynq model within Prazor covers the ARM cores, L1 and L2 caches, GIC, SCU, DRAM, SD card and all peripherals and timers needed to boot linux with UART input and output. The Ethernet is currently missing, as are the Epiphany chip, Neon, USB, the FPGA and the TCM. Floating point is being added. Jazelle cannot be added since its specification is secret. All of these components are modelled as TLM 2.0 blocking SystemC classes with loose timing and a quantum keeper.
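
    For orientation, the sketch below shows the generic loosely-timed TLM 2.0 pattern that such models follow: a blocking b_transport call carries a local time offset which is reconciled with the SystemC kernel through the quantum keeper. This is an illustration of the standard tlm_utils API only, not code lifted from the Prazor sources (Prazor additionally uses an extended payload, described later).

        #include "systemc.h"
        #include "tlm.h"
        #include "tlm_utils/tlm_quantumkeeper.h"

        // One loosely-timed initiator step (generic illustration).
        void lt_step(tlm::tlm_generic_payload &trans,
                     tlm::tlm_initiator_socket<64> &socket,
                     tlm_utils::tlm_quantumkeeper &qk)
        {
            sc_core::sc_time delay = qk.get_local_time(); // offset since last sync
            socket->b_transport(trans, delay);            // blocking call into the target
            qk.set(delay);                                // bank the target's added delay
            if (qk.need_sync()) qk.sync();                // yield to the kernel at quantum end
        }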

    General Download

    Before the open public release, please:

    1. Get an xparch user ID and password from DJG or MP
    
    2. Login via the web interface (password required) to    http://phabricator.xparch.com
       2a. Change your password.
       2b. Install your public key on phabricator.
    
    3. Use ssh-agent to put your private key in scope and then
    
       git clone ssh://vcs@phabricator.xparch.com:22/diffusion/P/vhls.git
    
    
    4. cd to vhls and follow the README.md
    

    Computer Laboratory Local Use

    Owing to the nature of git, you can clone your own git repo from the checkout in /usr/groups/han/clteach/btlm/current (e.g. git clone /usr/groups/han/clteach/btlm/current) or just do a simple cp -ra of that folder so that you get the hidden git repo management files.

    Static snapshots, called pvp_nn, of the simulator code are also installed on the Computer Lab file server and can be copied or linked to

    /usr/groups/han/clteach/btlm/current

    If updates are needed, revised/updated copies will be maintained at

    /usr/groups/han/clteach/btlm/pvp_xx

    Old Copy

    OLD: Prazor can be checked out from an old git repo if you register at bitbucket and pass your user id on to me. (We always hope the public release will be in a month or so.) The git repo (containing x86_64, ARM32/Zynq, MIPS64 and OpenRISC) is

    OLD REPO NOT RECOMMENDED ANY MORE: git clone https://bitbucket.org/prazorvhls/prazor-virtual-platform
    

    Set ups

    You need to set BOOST, SYSTEMC, TLM_POWER3, LDFLAGS and CXXFLAGS for compiling the simulator. To run it you need the PRAZOR and TARCH settings.

    export CLTEACH=/usr/groups/han/clteach
    export BOOST=$CLTEACH/boost/boost_1_48_0
    export BOOST_ROOT=$BOOST
    export SYSTEMC=$CLTEACH/systemc/systemc-current
    export TLM_POWER3=$CLTEACH/tlm-power3
    export LDFLAGS="-L$SYSTEMC/lib-linux64 -L/usr/local/lib -L$TLM_POWER3/src/.libs"
    export CXXFLAGS="-I$SYSTEMC/include/ -I$BOOST_ROOT -I$SYSTEMC/include/tlm_core/tlm_2 -I$TLM_POWER3/include -g -O2"
    export TARCH=ARM32
    export PRAZOR=$CLTEACH/btlm/current
    

    If it complains that it cannot find the TLM_POWER3 library or the SystemC library, you need one or both of the following:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/groups/han/clteach/tlm-power3/lib-linux64
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYSTEMC/lib-linux64
    

    Plugging your own IP Block Models into the Virtual Platform

    One purpose of building your own copy of the simulator is to explore the performance of a program on different architectures, such as ones with different cache sizes. Recompiling the simulator is also needed to include your own IP blocks.

    New blocks should be instantiated inside the platform configuration C++ file, such as

              /usr/groups/han/clteach/btlm/current/vhls/src/platform/arm/zynq/parallella/zynq{.cpp,.h}
    

    To instantiate a simple peripheral for programmed I/O access, first look at the way an existing I/O device is wired in, such as the UART. You can see it is connected to the I/O bus with the following line:

                 BUSMUX64_BIND(busmux0, UARTS[uu]->port0, start, UART_SPACING)
    
    and your own device can be connected to busmux0 with an additional, similar call. The busmux may automatically allocate a programmed I/O base address, spaced by some factor given to its own constructor, or else you can pass in an explicit base address. Note that busmuxes can be arranged in a tree structure, and every busmux on the route to an IP block must forward the transaction appropriately.

    Using a base address of 0x43c00000 will mean the self-same code can run on the virtual ARM as on the real ARM.

    There may be code relating to ETHERNET_CRC in the platform file: if so, this is a placeholder that is otherwise unused, and you can remove it or adapt it as a basis for your own device.

    Note that Prazor uses an extended generic payload, not the default one provided in SystemC. Hence your device will need to instantiate a simple_target_socket with an extra template argument that defines the payload type (the traits name below is illustrative; check the vhls sources for the real one), such as

        tlm_utils::simple_target_socket<my_device, 64, PRAZOR_GP_TYPES> port0;
    
    Note also that Prazor uses 64-bit generic payloads even when modelling 32-bit systems like the ARM7.
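
    Putting the pieces together, a minimal programmed-I/O target might look like the sketch below. Treat it as a hedged illustration: PRAZOR_GP_TYPES again stands for whatever traits class names Prazor's extended payload, and the exact BUSMUX64_BIND arguments must be checked against the UART wiring in zynq.cpp, which is the authoritative pattern.

        #include "systemc.h"
        #include "tlm.h"
        #include "tlm_utils/simple_target_socket.h"
        #include <cstring>

        SC_MODULE(my_device)
        {
          tlm_utils::simple_target_socket<my_device, 64, PRAZOR_GP_TYPES> port0;
          uint32_t reg0;   // one programmed-I/O register

          void b_transport(PRAZOR_GP_TYPES::tlm_payload_type &trans, sc_time &delay)
          {
            unsigned char *ptr = trans.get_data_ptr();
            if (trans.get_command() == tlm::TLM_READ_COMMAND)  memcpy(ptr, &reg0, sizeof reg0);
            if (trans.get_command() == tlm::TLM_WRITE_COMMAND) memcpy(&reg0, ptr, sizeof reg0);
            trans.set_response_status(tlm::TLM_OK_RESPONSE);
          }

          SC_CTOR(my_device) : port0("port0"), reg0(0x552904)
          {
            port0.register_b_transport(this, &my_device::b_transport);
          }
        };

        // In the platform file, alongside the UART binding, something like:
        //   BUSMUX64_BIND(busmux0, mydev->port0, 0x43c00000, MYDEV_SPACING)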

    VHLS link

    When compiling applications for and running the pre-built simulator you will need something like:

    export VHLS=/usr/groups/han/clteach/btlm/current/vhls
    

    but you will have to change this to your own copy when building your own simulator.

    A First Run

    The first step is to run the pre-built simulator with a pre-compiled binary image and then look at the structure of what you have run. The next step is to modify both halves and confirm proficiency with this. This is mostly a matter of getting your environment variables set up correctly.

    The pre-built simulator is here:

    export VHLS=/usr/groups/han/clteach/btlm/current/vhls/src/vhls

    The pre-built hello-world program as an ELF binary for ARM is here:

    /usr/groups/han/clteach/btlm/current/vhls/images/hellow-world/hello

    It is easiest to run under make, but a manual command line run should also work. You'll need something like

       $VHLS \
           -dram-system-ini /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/system.ini.example \
           -dram-device /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini \
           -cores 1 -tracelevel 0 -global-qk-ns 1 -no-caches -image ./hello -name vv \
           -- red yellow green blue well done milos 
    

    The command line is divided into two sections using the -- separator. Three sections, with two of these separators, are sometimes used. The part before the first separator is processed by the Prazor simulator; the part after it is passed to the running program as its argv command line. Generally, the program consists of a wrapper (energyshim.c) that splits the command line again at the second -- separator, with the final part being used by the application proper and the earlier part controlling logging instrumentation.

    You will get a UART output spool file written to your working directory and also an energy use report from the TLM POWER3 library.

    If you get an access-denied error, this may be because the simulator is trying to write its output file to the current folder, to which you do not have write access. You need to copy ...

    Hello World Structure

    Take a look at the makefile and the disassembly of this test program. We are running bare-metal and you will see the following components:

    DRAM data sheet.

    The parallella card uses twin-die MT41K256M32SLD-125 DRAM, 256M x 32: PDF DATASHEET. Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts. 8 Gbit = 1 GByte = 2^30 bytes = 2^3 banks * 2^15 ras * 2^10 cas * 2^2 lanes. At 175 mA IDD1 current it dissipates 236 mW when active. See the bottom of this page for actual measurements.

    A suitable datasheet file for DRAMsim2 may be present here: DRAMSim2 on github; if not, I provide a temporary one: DDR3_micron_32M_8B_x32_parallella.ini. Application note: Micron TN-41-01: Calculating Memory System Power for DDR3.

    DRAM bank interleave is controlled in hardware by the following registers, which are set by linux as

    DRAM_addr_map_bank 0xF800603C  0x777
    DRAM_addr_map_col  0xF8006040  0xFFF00000
    DRAM_addr_map_row  0xF8006044  0x0F666666
    
    Which, in detail, gives the following mapping for the bottom 1 GByte (A[31:30]==2'b00):
       ra15 = 0
       ra[11:2] = 6+11 = A[28:17]
       ra1  = 6+10  = A[16]
       ra0  = 6+9   = A[15]
       ba[2]  = 7+7   = A[14]             //   777 sets this
       ba[1]  = 7+6   = A[13]
       ba[0]  = 7+5   = A[12]
       ca[13:10] = 0
       ca[9]  = 0+11  = A[11]
       ca[8]  = 0+10  = A[10]
       ca[5]  = 0+7   = A[7]
       ca[4]  = 0+6   = A[6]
       ca[3]  = 0+5   = A[5]
       ca[2:0] =   A[4:2]  // Burst/line Offset
       lane[1:0] = A[1:0]  // Byte Lane Offset
    
    So this is standard-enough layout: A[29:0] = { row, bank, col, lane }.

    In terms of DRAMSIM2 from U-Maryland, it is Scheme6, chan:row:bank:rank:col, with the chan and rank fields being null.
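
    As a cross-check, the mapping can be expressed in a few lines of C. This sketch follows the bit fields listed above, assuming that ca[7:6] continue the same pattern (i.e. come from A[9:8]) and taking the row bits from A[29:15] as per the summary line:

        #include <stdint.h>

        // Decode a physical DRAM address into DDR coordinates according
        // to the Zynq DRAM_addr_map_* register values shown above.
        typedef struct { unsigned row, bank, col, lane; } ddr_coord_t;

        static ddr_coord_t zynq_ddr_decode(uint32_t A)
        {
            ddr_coord_t c;
            c.lane = A & 3;              // lane[1:0] = A[1:0], byte lane offset
            c.col  = (A >> 2) & 0x3ff;   // ca[9:0]   = A[11:2]; ca[2:0] is the burst offset
            c.bank = (A >> 12) & 7;      // ba[2:0]   = A[14:12]
            c.row  = (A >> 15) & 0x7fff; // ra bits taken from A[29:15]
            return c;
        }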

    Command Line Flags

    The -cores n flag specifies how many CPU cores are created.

    The -self-starting-cores n flag specifies how many CPU cores are in run mode at reset time. For Zynq linux this option should not be used, leaving the setting at its default of 1, because the linux kernel uses a store to the system controller block from the first CPU to enable the subsequent core(s).

    The -image flag specifies an ELF binary file to be loaded into the DRAM model.

    The -tracelevel n flag turns on a given level of tracing. n is in the range 0 to 9 with 0 being all tracing off. Multiple tracelevel flags can be used, separated by -watch flags.

    The -watch B +N flag takes a hex base address B and a hex length N defining a region to watch. This region is watched at the tracelevel given by the preceding tracelevel flag. Typically you might set tracelevel 9, then define some watch regions, then set tracelevel back to zero so that no tracing is done outside the watched regions.
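
    For example, to trace heavily within just one region (such as a device at base 0x43c00000) while keeping everything else quiet, the flags might read as below; whether the hex values take a 0x prefix should be checked against the flag parser in the sources.

       -tracelevel 9 -watch 43c00000 +100 -tracelevel 0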

    The -name vv flag defines the name used as the root segment in the names of output files. This is useful for distinguishing the output of sequences of experiments run by the same makefile in a common folder.

    The -dram-system-ini filename.ini flag sets up the DRAM simulator parameters. Always just use the standard file provided.

    The -dram-device datasheet.ini sets up the DRAM type in use. Each type of DRAM has different energy and performance details. For the Parallella card, please use XXXX? TBD.

    The -wait-for-debugger flag causes the simulation not to start until a GDB session is remotely attached.

    The -no-caches flag is implemented in abench1.h so as not to instantiate the caches and to connect the ARM cores directly to the DRAM. A variant is perhaps needed to skip either L1 or L2 but not both. With the -no-caches flag the caches are removed from the design and their static power disappears. The processor will run at the speed allowed by the main store, which will be slow if this is DRAM without caches.

    The -no-harvard flag is implemented in abench1.h to instantiate only a single L1 cache per core, rather than split I and D caches. Variants to control the cache size and associativity could easily be implemented.

    There are more - please read the source code in arm7smp.cpp.

    Backdoors

    Some backdoors are available for access to the simulator's argv, argc and getenv.

    Some SWI instructions, using codes not used by linux, are hardwired as backdoors for writing a character without going via a UART or other output model, for exiting the simulator, and for getting core instruction counts without using a platform-specific PMU set up.

     
    // ARM trigger instruction trace - print next 100 instructions
       asm volatile (" swi #203":: );                             
    
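    A convenient way to invoke such a backdoor from C is an inline helper. Only the documented trace-trigger code is shown here; the other code numbers must be looked up in the simulator sources.

        // Ask Prazor to print the next 100 instructions (see above).
        static inline void prazor_trace_next100(void)
        {
            asm volatile(" swi #203" ::: "memory");
        }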

    Prazor Simulator Structure

    In basic terms, the main file you are using is in the arm folder and is called abench1.h. This defines the system topology and is the main file to edit when you adjust the simulated architecture.

    You need to get to the point where you have your own copy of this file where you have made modifications. You can either make links to the existing versions of all the other files in the simulator or have your own copies of them too.

    Compiling your own programs

    For bare-metal operation you will need to follow the structure of the images/dfsin Makefile and Makefile.inc.arm.

    The ARM ISS did not (as of Feb 2015) support Thumb mode correctly. Update Feb 2017: Thumb2 mode is now implemented. If you see 16-bit instructions in your disassembly then the compiler has used Thumb mode and you need to pass appropriate options to GCC such as -march=armv6 -marm.

    For running on more than one core, follow the design pattern of djgradix, which uses the djgthreads implementation of Posix pthreads on bare metal. (NB: today - 19th Feb - we have not tried that for ARM recently; we will double-check all is ok in the next few days.) Note that this is a non-preemptive implementation that does not time-share threads over or within a core. The original program gets core 0 and new threads get exclusive use of further cores until they exit. Therefore, for Zynq with only two cores, please start only one new thread, as in the sketch below.
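
    A minimal sketch of that pattern, assuming the standard Posix entry points that djgthreads provides (remember: on the two-core Zynq, start exactly one extra thread):

        #include <pthread.h>

        static void *worker(void *arg)         /* gets exclusive use of the next core */
        {
            /* ... the parallel half of the work ... */
            return arg;
        }

        void run_two_way(void)
        {
            pthread_t t;
            pthread_create(&t, 0, worker, 0);  /* the one new thread: takes core 1 */
            /* ... this half of the work runs on core 0 ... */
            pthread_join(t, 0);                /* wait for core 1 to finish */
        }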

    As a C library, please use prlibc and not uClibc, unless you want to invest time getting uClibc working bare-metal on ARM (it has worked fine for OpenRISC over the last few years).

    Try to structure your code so that it links together a .o file that can also be run on the real parallella card. That way you are sure you are really running the same binary on the real system as on the simulator. You will need to link this .o file in a different way (using system call stubs etc.) for running on linux compared with bare metal. It is also a good idea to structure your makefiles such that you can run the program natively on your linux workstation. Finally, think about iteration counts: for stopwatch (or time ./a.out) style performance measurement on real hardware you will perhaps need to run the outer loop 1000 times more often than when running on the SystemC model.

    Reusing the code in the images/energyshim.c file enables you to wrap up your benchmark's main entry point (bm_main) in a reusable manner. This shim can report energy and time use on both the real systems and the simulators if you get it all set up right.
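
    A sketch of the intended structure, assuming (as the gdb backtrace later on this page suggests) that the shim provides main() and calls an entry point of the form bm_main(verbosef):

        /* mybench.c -- link against energyshim.o, which supplies main(),
           splits the command line at the second -- separator, and reports
           time and energy around the call to bm_main. */
        int bm_main(int verbosef)
        {
            /* ... the benchmark proper; honour verbosef for logging ... */
            return 0;
        }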

    Overall, structure your code so that you can batch-run it in lots of different configurations from one Makefile (e.g. with different numbers of cores, different cache configurations or different clock frequencies) - this will generate data suitable for plotting in your report.

    Using GDB to debug a program

    The gdb debugger can be connected to a running Prazor simulation for single stepping, inspecting and changing memory and setting break and watchpoints. This uses the RSP protocol over a TCP remote connection.

    You should start the simulator with the -wait-for-debugger option. The simulator will print which port it is listening on; this is commonly localhost:9600.

    Load your binary ELF file into the debugger using the gdb file command before remotely connecting.

      $ /usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb
      GNU gdb (GDB) 7.8
      (gdb) target remote :9600
      Remote debugging using :9600
      0x00000000 in ?? ()
  (gdb) file dfsin -- the path to your binary ELF image
      Reading symbols from dfsin...done.
      (gdb) cont
      Control-C
      (gdb) where
      [Remote target] #1 stopped.
      (gdb) where
  #0  puthex64 (leadingz=leadingz@entry=1, d0=125116699948, f=<optimized out>)
          at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:1052
      #1  0x00041ce8 in locked_printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n", 
          format@entry=0x44fc0  ",2", poi=..., poi@entry=...) at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:274
      #2  0x00041f68 in printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n")
          at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:342
      #3  0x000418b4 in bm_main (verbosef=verbosef@entry=7) at dfsin.c:177
      #4  0x00043af0 in main (argc=7, argv=0xfffe0004) at ../energyshim.c:200
      #0  mul64To128 (a=12105675798371893248, b=7613535337020653568, z0Ptr=0x41134 , z1Ptr=0x3bff50) at softfloat-macros:186
      #1  0x00173a24 in ?? ()
      (gdb) info registers
      r0             0x0      0
      r1             0x100    256
      r2             0xe0001000       3758100480
      r3             0x9      9
      r4             0x2189552c       562648364
      r5             0x1d     29
      r6             0xa      10
      r7             0x452b0  283312
      r8             0xe801c  950300
      r9             0x1      1
      r10            0x0      0
      r11            0x1000000        16777216
      r12            0x45198  283032
      sp             0x3bff20 0x3bff20
      lr             0x43954  276820
      pc             0x430b4  0x430b4 
      cpsr           0x8000005f       2147483743
      (gdb)  x/4x 0x44dc4
      0x44dc4 <__udivdi3+208>:        0xe0a55005      0xe2533001      0x1afffff8      0xe0948000
      (gdb)  x/4i 0x44dc4
      => 0x44dc4 <__udivdi3+208>:     adc     r5, r5, r5
         0x44dc8 <__udivdi3+212>:     subs    r3, r3, #1
         0x44dcc <__udivdi3+216>:     bne     0x44db4 <__udivdi3+192>
         0x44dd0 <__udivdi3+220>:     adds    r8, r4, r0
      (gdb) print num_keys  -- display a global variable
      $1 = 1000
      (gdb) up
  #4  0x00041620 in estimateDiv128To64 (a1=0, b=<optimized out>, a0=4815960295168657408) at softfloat-macros:217
      217       z = (b0 << 32 <= a0) ? LIT64 (0xFFFFFFFF00000000) : (a0 / b0) << 32;
      (gdb) print z -- try to print a local var (but it was in a register)
  $2 = <optimized out>
      (gdb) print a0  -- print a local var that still exists
      $3 = 4815960295168657408
      (gdb) info threads
        Id   Target Id         Frame   -- We only have one CPU core (sim commands line was -cores 1) 
 * 1    Remote target     0x00041620 in estimateDiv128To64 (a1=0, b=<optimized out>, a0=4815960295168657408) at softfloat-macros:217
    
    

    Start gdb and give the command "target remote localhost:9600". You can leave out the machine name (i.e. just use :9600) when you run the simulator and debugger on the same machine.

    Ideally you should use an ARM version of gdb to connect; otherwise you will see registers with x86 names being displayed.

    A copy of gdb for ARM is installed at the Computer Laboratory here:

    /usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb

    There may be more-recent versions around.

    Switching between cores: the thread commands of gdb have been adapted/abused to enable you to connect to different cores of a multiprocessor. Further details of how to switch between cores and of the sPEEDO energy API will be added here ...

    Build gdb for ARM with this argument to configure:
    ./configure --target=arm-linux-gnu
    
    Useful gdb commands:
      target remote localhost:9600
      info registers                -- show register contents
      x/16i $pc                     -- display next 16 instructions disassembled
      x/32x $sp                      -- display memory from sp upwards for 32 words
      file /home/djg11/d320/prazor/trunk/chstone/dfsin/dfsin -- load the symbol table from an ELF binary for symbolic debugging
    -- If it says no debugging symbols then it is stripped binary
      load                          -- download a binary (or the current binary) - better to load from the command line -image arg?
      cont                          -- continue execution
  break *0x44cca                -- set a break point at an address
  where                         -- give a stack backtrace
      up                            -- change current stack frame by moving up one stack frame (towards caller) 
      down                          -- the reverse, move down one stack frame
      print num_keys                -- display a global variable or local var in current stack frame
      info threads -- show threads (may show hardware cores instead using Prazor)
      thread 2 -- switch thread/core
    
    
      set $pc = 0x485               -- perform a jump (not working prior to 18th feb 2015).
      stepi                         -- run one instruction only (not working 18th feb 2015 but being fixed).
    
    // useful additional commands for debugging the debugger:
     set debug remote 1
     set debug target 1
    
    


    Building Code that can run on both the simulator and the real cards

    The gcc C compiler defaults to generating Thumb-2 machine code and hardware floating point. Thumb modes and hardware floating point on the simulator do not currently work (they are just being debugged and should work in the near future), so meanwhile, running the same binary code on both systems requires coercing gcc to use only the classic ARM32 mode with soft floating point. The relevant flags are:

       -marm -mfloat-abi=soft
    

    With these flags, code compiled on the workstation or on the real card is interchangeable, and .o files can freely be copied backwards and forwards using scp.

    However, note that the installed libraries, libc and libgcc, on the Parallella cards also use Thumb mode, so for detailed performance comparison (and to avoid linker errors about VFP) you should avoid using them and instead use your own compiled versions of these too (or the ones on the links in the detailed documentation at the bottom of this section).

    The differences between application binaries targeted at running bare-metal on the simulator and at running under linux on the real card are mainly to do with console I/O. The I/O paths are very different, so performance comparisons of I/O should not be made. They are also incompatible: you need to swap the system calls used on the real card for direct calls to the UART device driver as used on the simulator.

    The best way to redirect the I/O is to link the same .o application and library files with slightly different I/O shims, as in the barelift shim example below. Essentially you want to replace uart64_driver.o with bareliftshim.o. You should link using the ld program (not using gcc as a linker) since this gives you complete control over which kickoff code and libraries are included. Also, always check what you have made, using objdump -d as a disassembler.

    For example, to run the dfsin standard test one would link as follows using the same binaries as ran on the simulator except for the barelift pair.

            LIBGCC=/home/linaro/djg11/libgcc-nothumb.a
            ld -o a.out bareliftcrt.o bareliftshim.o dfsin.o prlibc.o $(LIBGCC)
            objdump -d ./a.out > dis
            # If you see any 16 bit (4 hex digit) opcodes in the disassembly then you are using thumb code.
            ./a.out
    

    Once Thumb modes are working we can compare energy and performance with and without them. (They are working now - thanks Milos - April 2015.)

    Detailed resources: bareliftshim-files.zip ZIP ARCHIVE.

    Archive:  bareliftshim-files.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
   146894  2015-02-24 10:52   home/linaro/libgcc.a -- This libgcc is Thumb and won't currently work on the simulator.
         1250  2015-02-25 08:44   bareliftshim.c
         4860  2015-02-25 08:49   bareliftshim.o
         1406  2015-02-25 08:42   bareliftcrt.S
          984  2015-02-25 08:49   bareliftcrt.o
          631  2015-02-25 08:53   Makefile
         4777  2015-02-25 08:53   BARELIFT-README.txt
        23360  2015-02-25 08:49   dfsin.o
        48168  2015-02-24 18:45   prlibc.o
    ---------                     -------
       232330                     9 files
    

    Running S/W on a Physical Parallella Card

    This photo shows a parallella card with supply monitor and 1 volt regulator. The 1 volt core supply regulator on the parallella card was disabled and an external 1 volt supply wired in. This core supply feeds the ARMs, their caches, the onboard hardened controllers and the FPGA logic. The monitor tracks two currents: one for the 1 volt feed and the other for the 5 volt supply to the rest of the card, which includes the Zynq pad ring.

    There are several cards available; number one has the energy monitors. Number 3 is in SW02 as of 10 Feb 2017. There's also a PYNQ card and a ZedBoard available.

    Name             Kind        Location      Chip             Notes
    ---------------  ----------  ------------  ---------------  ---------------------------
    parcard-djg1.sm  Parallella  FN12          xc7z010clg400-1  Has power probe via bognor
    parcard-djg2.sm  Parallella  FN12          xc7z010clg400-1
    parcard-djg3.sm  Parallella  FW02          xc7z010clg400-1  Has one PIO LED soldered on
    Zedra10.sm       Zedboard    Aaron has it  xc7z020clg484-1
    pynq-djg1.sm     Pynq        FW02          xc7z020clg400-1

    The Ethernet RJ45 has a green LED which indicates the link is operational and an orange LED which indicates network traffic by going off around packets. Clarification: for 10 Mbps the orange LED is on and flashes off for traffic indication, for 100 Mbps the green LED is on and flashes off for traffic indication, and for 1 Gbps both LEDs are on with the orange flashing off to indicate traffic. No, that's not right: even on a 10 Mbps link the green LED comes on solidly once linux is booted.

    See if you can ping a card: it should reply after about 10 seconds from reset or power on.

    Some cards had crashing problems but these are now fixed: they would crash if connected to a 10 Mbps Ethernet cable or if the UART RX input was left floating. Unfortunately the LEDs on the Ethernet socket flash in response to arriving traffic even if the card has not booted or has crashed, so a crash is not visually apparent, but pings stop working.

    The Parallella card UART operates at 115200 baud during boot. Linux sets it back to 9600 when booted, but a devicetree kernel argument can alter that so it serves as a root console at the 115200 rate if needed.

      $ stty < /dev/ttyUSB0 115200
    
    ping parcard-djg2.sm
    PING parcard-djg2.sm.cl.cam.ac.uk (128.232.60.55) 56(84) bytes of data.
    64 bytes from parcard-djg1.sm.cl.cam.ac.uk (128.232.60.55): icmp_req=4 ttl=63 time=0.581 ms
    64 bytes from parcard-djg1.sm.cl.cam.ac.uk (128.232.60.55): icmp_req=6 ttl=63 time=0.586 ms
    

    See if you can log on

      $ ssh linaro@parcard-djg3.sm
    or
      $ ssh hx242@parcard-djg3.sm
    or
      $ ssh mp727@parcard-djg3.sm
    

    You should create an account of your own on the parcard so we don't all share the default user name linaro. Certainly keep your own files under /home/linaro/crsid/... if not /home/crsid, where crsid is your University of Cambridge id.

    If you need to attach via the USB UART, use something like 'cu', which can be obtained with "yum install uucp".

    chmod a+rwx  /dev/ttyACM0 -- otherwise you get a misleading "line in use" error
    cu -s 115200 -l /dev/ttyACM0
    

    Copy personal data on and off using scp

      $ scp -r myfolder linaro@parcard-djg2.sm.cl.cam.ac.uk:/home/crsid/...
    

    Read Energy Figures From parcard-djg1.sm

    Parcard 1 has the energy monitors fixed up.

    There is a real-time graphical plot from the energy monitor that generates images like this boot-sequence example, which has been manually annotated. Note that the core is plotted on a more sensitive ordinate scale.
    [Figure: energy use plots for linux shutdown and re-boot on Zynq/Parallella]

    The energy monitor for parcard1 is generally connected to the machine bognor.sm.cl.cam.ac.uk (128.232.60.58) and normally runs on port 2002. Manual access to the energy readings can be demonstrated using telnet as follows, but energyshim should fetch the readings automatically using the same protocol:

    telnet bognor.sm 2002
    Escape character is '^]'.
    cmd
    ENERGY=86168!5177!
    cmd
    ENERGY=99158!5954!
    

    The protocol is embodied in this little fragment: currentprobe-client.zip. The energyshim client code will redirect to bognor if you set the environment variable

    export CURRENTPROBE=128.232.60.58
    

    The client code may be part of your Prazor repo under vhls/images/powertesters/currentprobe-client{.c,.h}. Energyshim can be compiled to include this, or else to read via the sPEEDO interface, or to return energy nulls. sPEEDO is an API supported in Prazor when POWER3 is enabled so that an application can find out how much energy it has used.

    The numbers returned to the command 'cmd' are energies in mJ. They are ever-increasing running totals. The first number returned is the energy of the whole Parallella card (from summing the 5V and 1V feeds) while the second is the energy used by the FPGA core logic only, which excludes the DRAM and other components on the PCB.
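
    A minimal client sketch in C, following the telnet transcript above (send the line "cmd", read back "ENERGY=card!core!"); the real implementation is in currentprobe-client.c, so treat this only as an illustration of the protocol:

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <sys/socket.h>

        // Fetch the two running energy totals (mJ) from the probe server.
        int read_energy_mj(const char *ip, int port, long *card, long *core)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in sa;
            memset(&sa, 0, sizeof sa);
            sa.sin_family = AF_INET;
            sa.sin_port = htons(port);
            inet_pton(AF_INET, ip, &sa.sin_addr);
            if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) return -1;
            write(fd, "cmd\n", 4);
            char buf[128];
            int n = read(fd, buf, sizeof buf - 1);
            close(fd);
            if (n <= 0) return -1;
            buf[n] = 0;
            return sscanf(buf, "ENERGY=%ld!%ld!", card, core) == 2 ? 0 : -1;
        }

    Call it twice, around the region of interest, and subtract the readings; e.g. with ip taken from getenv("CURRENTPROBE") and port 2002.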

    You should check whether other users are running programs (using top etc.) before starting your own - otherwise performance results will be confused. Also check that the clock is at the correct speed (devmem2 0xF8000120 w should give 0x1F000200) and that caches are on (TBD) ... or else issue a reboot as root to be sure.

    Note, Computer Laboratory networks may not enable telnet (or otherwise open sockets) to bognor.sm from some parts of the outside world, but parcard-djg1.sm does have access, which is what we need.

    Adjusting Clock Frequency on a Real Card

    You can write to the Zynq clock frequency register to slow it down and compare your results at different clock rates with the Prazor simulation at correspondingly modelled different rates.

    The program devmem2 (www.lartmaker.nl/lartware/port/devmem2.c) lets you read and write the Zynq control registers. We can adjust the clock frequency divisor in ARM_CLK_CTRL at 0xF800_0120 (Zynq manual page 1583). Bits 13:8 are the divisor, which must not be 0, 1 or 3. The standard value of 2 gives 666 MHz I think, implying a PLL frequency of about 1333 MHz; programs get noticeably slower if you bump the divisor up to, say, 0xE (14), which gives roughly 1333/14 ≈ 95 MHz, as follows:

     $ devmem2 0xF8000120 w                                  // View current value
         Value at address 0xF8000120 (0xb6ff3120): 0x1F000200
     $ devmem2 0xF8000120 w 0x1F000E00                       // Make it slow down a lot
         Value at address 0xF8000120 (0xb6ff3120): 0x1F000200
         Written 0x1F000E00; readback 0x1F000E00
     $ devmem2 0xF8000120 w 0x1F000200                       // Set back to full speed
    

    Turning off a core on the real card

    The following sequence turns off core 1 leaving only 0.

    root@linaro-nano:/sys/devices/system/cpu# ls online 
    online
    root@linaro-nano:/sys/devices/system/cpu# cat online 
    0-1
    root@linaro-nano:/sys/devices/system/cpu# ls cpu1/online 
    cpu1/online
    root@linaro-nano:/sys/devices/system/cpu# cat cpu1/online 
    1
    root@linaro-nano:/sys/devices/system/cpu# echo 0 > cpu1/online 
    root@linaro-nano:/sys/devices/system/cpu# cat online 
    0
    

    An alternative approach is to set maxcpus=1 on the kernel command line and reboot.

    Reducing O/S crosstalk on the real card

    To get consistent results from profile experiments, it is best to turn off as much of linux as possible to get bare-metal like results.

    One way to do this is to boot single-user mode and log in from the UART serial port.

    Another is to turn off interrupts from user space during the critical part of your program. Most system resources, including remote shells, seem to recover fine after interrupts have been off for a few minutes, but sometimes the system crashes.

    Interrupts can be turned on and off by updating the GIC enable register after mapping it to user space:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define MAP_SIZE 4096UL             /* one page, as in devmem2.c */
    #define MAP_MASK (MAP_SIZE - 1)
    #define LL_TRC(...)                 /* tracing off; define as __VA_ARGS__ to enable */

    void *page_f8_open(off_t target)
    {
        static int fd = -1000;
        static unsigned int page = 0;
        static void *map_base = 0;
        unsigned long read_result;
    
    
        if (fd < 0)
          {
            fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0)
              {
                printf("lowlevel_accessor failed %s:%i Am I running with root privelege?\n", __FILE__, __LINE__);
                exit(1);
              }
    
            LL_TRC(printf("/dev/mem opened.\n"); fflush(stdout));
          }
    
        if (map_base)  // check page is still the correct one                                                                                                                                                       
          {
            unsigned int page_primed = target & ~MAP_MASK;
            if (page != page_primed)
              {
                printf("lowlevel_accessor failed %s:%i new page requested : %x cf %x\n", __FILE__, __LINE__, page_primed, page);
                exit(1);
              }
          }
        else
          {     /* Map one page */
            page = target & ~MAP_MASK;
            map_base = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, page);
            if (map_base == (void *) -1)
              {
                printf("lowlevel_accessor failed %s:%i failed to map page\n", __FILE__, __LINE__);
                exit(1);
              }
          }
    
        LL_TRC(printf("Memory mapped at address %p.\n", map_base); fflush(stdout));
        return map_base;
    }
    
    unsigned int gic_ints_enable(int enablef)
    {
      off_t target = 0xF8F01000; // GIC master interrupt control register                                                                                                                                           
      void *map_base = page_f8_open(target);
      void *virt_addr = map_base + (target & MAP_MASK);
      ((unsigned long *) virt_addr)[0] = (enablef) ? 1:0;
      return enablef;
    }
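
    A typical use is to bracket the measured region of a benchmark (run as root, and remember the caveat above that the system occasionally fails to recover):

        gic_ints_enable(0);    /* interrupts off: no O/S crosstalk */
        /* ... timed/measured region, e.g. a call to bm_main() ... */
        gic_ints_enable(1);    /* interrupts back on */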
    

    Turning on/off Caches on a Real Card and the Prazor Virtual Platform

    ... TBD ... L2 can perhaps be turned on/off by writing to L2_cache reg1_control at 0xF8F02100

    Running linux turns on the caches. For bare-metal runs on real cards or the PRAZOR simulator, please insert something like the following code to enable the L1 and L2 caches. Without this you will see only zeros in the cache hit statistics, since caches power up disabled.

    This code is found in the energyshim.c wrapper:

      asm("mov r0,#0x1000"); // Turn on L1 Cache (see Zynq TRM for further details.)
      asm("orr r0,r0,#4");
      asm("mcr  p15, 0, r0, c1, c0, 0"); //  (r0 = 0x1004)
      // You might possibly also need
      ((int *)0xF8F02100)[0] = 1; // Zynq: turn on L2 cache 
    
    This will cause the following to be printed from Prazor when appropriate tracing is enabled:

      Cache the_top.coreunit_0.l1_d_cache_0 is ENABLED now
      Cache the_top.l2_cache_and_controller is ENABLED now
      Cache the_top.l2_cache_and_controller is ENABLED now
      Cache the_top.coreunit_0.l1_i_cache_0 is ENABLED now
    

    In Prazor, adjusting abench1.h or the command line flags will alter the physical size of the caches and hence their static energy consumption and hit ratios. The hit ratio of a cache affects the dynamic energy use of the components behind it (on its memory side).

    For the Zynq model, the abench1.h file is replaced with the file src/platform/arm/zynq/parallella/zynq.cpp.

    UnixBench Performance on Physical Parallella or Zynq at Different Clock Frequencies

      $ wget http://byte-unixbench.googlecode.com/files/UnixBench5.1.3.tgz
      $ tar -xf UnixBench5.1.3.tgz
      $ ./UnixBench/Run
    

    Results at full and slow speed:

    Summary of results, as reported: no difference! But actual execution real time was greatly different, taking 56 minutes of real time at 1/3rd clock speed. The timers give dilated answers at lower clock speeds, so the reported execution time is the same. This is graphically demonstrated by running 'xclock -update 1'.

    parcard-unixbench-results-fullspeed.txt:System Benchmarks Index Score                                          75.5
    parcard-unixbench-results-fullspeed.txt:System Benchmarks Index Score                                         106.2
    parcard-unixbench-results-thirdspeed.txt:System Benchmarks Index Score                                          77.7
    parcard-unixbench-results-thirdspeed.txt:System Benchmarks Index Score                                         111.4
    

    Running Linux on the Prazor Parallella/Zynq Simulator

    Video Screen Capture of Interactive Linux Shell Session

    Video screen capture : interactive linux session on the SystemC Zynq model: MP4 Video.

    Procedure to run your own kernel on the model:

    Get a copy of linux kernel sources

    The Zynq linux source files can be obtained from git as follows:

    # How to compile the ADI-based linux kernel (uImage)
    git clone https://github.com/parallella/parallella-linux
    cd parallella-linux
    export ARCH=arm
    export CROSS_COMPILE=arm-linux-gnu-
    export PATH=:$PATH
    make ARCH=arm parallella_defconfig
    make ARCH=arm LOADADDR=0x8000 uImage
    

    After this has completed, you will find a file called vmlinux, which is the ELF binary for the kernel that can be loaded into the simulator. This section contains a Very Drafty Description of this process.

    You will need the kernel ELF image, a device tree blob, the loimage boot shim and a disc image (see the command line below).

    Reference copies of some of these are stored in this folder

    $PRAZOR/vhls/boards/parallella/linux
    

    Command Line

      /home/djg11/d320/prazor-virtual-platform/vhls/src/arm/vhls-arm7smp -kernel 
      /home/djg11/parallella-sw/complete-from-git/parallella-linux/vmlinux
      -devicetree prazor-linux.dtb 
      -boot loimage 
      -vdd disk3.img 
      -dram-system-ini /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/system.ini.example  
      -dram-device /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini  
      -cores 2
    

    loimage booter

    This shim replaces the standard linux boot loader (grub or whatever). It runs from the first ARM core's reset vector at 0. It jumps to the linux kernel after having set up r2 to point at the device tree blob.

    Simulated output: CONSOLE LOG.

    Mount Disk Image on Host Workstation (not simultaneously while mounted as a SystemC model?)

    Disks have a root partition that starts at the 2048th block.
      
    fdisk -l ./disk1.img gives:  
    
    Disk ./disk1.img: 16 MB, 16252928 bytes
    4 heads, 31 sectors/track, 256 cylinders, total 31744 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x658d7048
    
          Device Boot      Start         End      Blocks   Id  System
    ./disk1.img1            2048       31743       14848   83  Linux
    

    To mount it on the workstation you need to set the loopback device to start at byte offset block size * start block, which in this case is 2048*512 = 1048576. So you would need to do:

    sudo losetup -o1048576 /dev/loop0 disk1.img
    sudo mount -t ext4 /dev/loop0 /mnt
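
    Remember to unmount and detach the loop device before handing the image back to the simulator:

    sudo umount /mnt
    sudo losetup -d /dev/loop0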
    

    FPGA booting on the Zynq Chips or Parallella Cards

    We assume that there is no parallella.bit.bin file on the SD card; if there were, it would be loaded into the FPGA at boot time. Most likely this would be an X-windows framestore for a local console. But assuming there is no such FPGA boot, command line access via the UART or ssh is needed.

    The basic procedure is to copy the bitstream to the card with scp from the place where it was compiled. Then cat it into /dev/xdevcfg as root.

    The FPGA is not loaded
    
    djg11@parcard-djg1:~$ cat /sys/devices/amba.0/f8007000.ps7-dev-cfg/prog_done 
    0
    
    Or it might be:   cat /sys/devices/soc0/amba/f8007000.devcfg/prog_done
    
    We boot the FPGA like this
    
    root@parcard-djg1:~# cat parallella.bit.bin > /dev/xdevcfg 
    root@parcard-djg1:~# cat /sys/devices/amba.0/f8007000.ps7-dev-cfg/prog_done
    1
    
    
    The booted file looks like this:
    
    root@parcard-djg1:~# od -x parallella.bit.bin | head -10
    0000000 ffff ffff ffff ffff ffff ffff ffff ffff
    *
    0000060 00bb 0000 0044 1122 ffff ffff ffff ffff
    0000100 5566 aa99 0000 2000 2001 3002 0000 0000
    0000120 0001 3002 0000 0000 8001 3000 0000 0000
    0000140 0000 2000 8001 3000 0007 0000 0000 2000
    0000160 0000 2000 6001 3002 0000 0000 2001 3001
    0000200 3fe5 0200 c001 3001 0000 0000 8001 3001
    0000220 2093 0372 8001 3000 0009 0000 0000 2000
    0000240 c001 3000 0401 0000 a001 3000 0501 0000
    root@parcard-djg1:~# ls -l parallella.bit.bin 
    -rw-r--r-- 1 root root 2083760 Oct  1  2015 parallella.bit.bin
    

    Modern releases of the xdevcfg device driver will cope with files of either endianness and strip the leading meta-info from a .bit file as generated by vivado.

    Mouseless FPGA Build - Invoking Vivado by tcl Script (ksubs2 Makefile)

    You can make a bit stream by running Vivado for the appropriate device, such as the XC7Z010clg400-1, which has 17600 LUTs, 60 BRAMs and 80 DSPs.

    Building your first design, and getting programmed I/O from the ARM to the FPGA logic on the Zynq platform all to work, is quite a learning curve with the Vivado tools. As a shortcut, you can copy the rough framework in the ksubs2 folder, which illustrates a simple approach. You can test this from user space on the embedded linux using devmem2, and you can use the code in devmem2.c as a basis for your embedded code: the important lines are an open of /dev/mem and an mmap call.

    The ksubs2 framework is intended for supporting the output from Kiwi HLS. It is used to quickly generate an FPGA design using just a Makefile. Further notes are here: Kiwi Scientific Acceleration: Zynq Substrate Programmed I/O and DMA Basic Demo.

    It's best to include a serial number register in your design so that when you make a programmed I/O read from it you have a sanity check for which design is loaded. For instance, at the start of your programmed I/O register file decode, put something like

     assign maxi_rdata =
         (addr == 0) ? 24'h552904:
         (addr == 1) ? ...
    

    Then, when logged on to the Zynq linux (either on Prazor or the real card) do

    root@parcard-djg1:~# gcc -o devmem2 devmem2.c 
    root@parcard-djg1:~# ./devmem2 0x43c00000 w
    Value at address 0x43C00000 (0xb6fc5000): 0x552904
    

    and check you get back your distinctive number, in this case 0x552904.

    Energy Modelling: Modelled versus Measured Comparison and Results

    There are various sources of discrepancy between the Prazor energy results and the measurements on parcard1.

    Firstly, the implementation of the sPEEDO interface on Prazor returns the core's energy account and the whole-system energy account, whereas the physical probe attached to parcard1 returns the whole-system energy and the energy used by the Zynq core. So the figures reported would not directly agree even if both sides were completely accurate. Note: the Zynq core contains both processors, the L2 cache + GIC + SCU, and the programmable FPGA logic.

    Secondly, the Prazor model has missing features, notably the DRAM driving pads and the Ethernet PHY energy.

    Ethernet Power

    The Ethernet PHY on the real card takes a lot of energy that is not (currently) modelled in Prazor. It has four modes of operation that take different amounts of power. If you unplug the 100 Mbps Ethernet cable and log in via the UART instead, the power saving is about 160 mW. The actual measurements of the 5 volt supply current were:

        Ethernet unplugged   385mA
        10  Mb/s             410mA
        100 Mb/s             425mA
        1 GB/sec             470mA
    

    The power consumed by the PHY can be computed by multiplying the current figure by 0.001 to convert to amps, then by 5 to convert to watts, and finally taking off 20 percent to account for the SMPSU efficiency in converting from 5 to 1.8 volts. For example, unplugging the 100 Mb/s link saves (425 - 385) mA x 5 V x 0.8 ≈ 160 mW, matching the figure above.

    Another 40 mW may be accounted for by the LED indicators, which go off when the Ethernet is unplugged.

    The Ethernet MAC hard macro is quoted in the Xilinx datasheet as taking 15 mW.

    DRAM Power

    The parallella card uses twin-die MT41K256M32SLD-125 DRAM 256Mx32 PDF DATASHEET . Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts.

    During normal quiescent operation (i.e. no L2 cache misses and infrequent refresh) it should take about 32 mW. This should rise to 613 mW under very heavy traffic with no locality of reference (i.e. one burst read or write per row activation) and be perhaps about 400 mW under heavy load with a well-organised data layout (many bursts per row activation).

    The measured power use of the real DRAM was a little different!

    With reset held down it takes about 7 mW, which is as per the data sheet. But for all other traffic loads the supply current varies between 424 and 464 mA (multiply by 1.35 V for mW). This is odd. Writing 0x83 to ddrc_ctrl at 0xf800_6000, which supposedly makes the DRAM more efficient, reduces energy use by only a percent or so.

    Presumably the Zynq chip is performing many more operations on the DRAM than expected, and hence the measured DRAM power does not correlate with the Prazor prediction. This could perhaps be explained by linux having set the refresh period to a stupidly small value, or by ECC scrubbing of L2 operating far too fast.

    In addition, the DRAM driving I/O pads on the Zynq I/O ring take quite a lot: perhaps 100 mW: see data sheet.

    Further analysis is needed ...

    Power use for HDMI, FLASH and Misc Other Devices

    The Parallella card uses an ADV7513BSWZ chip to drive its monitor.

    This takes 256 mW from its 1V8 input and 1 mW from its 3V input, but when powered down takes just 300 uA or so. It resets into power-down mode. It is not modelled in Prazor and is not used in any of our real-card experiments, so it is irrelevant.

    The USB driver chips, FLASH memory and Epiphany chip are also all not modelled in Prazor and not used in our experiments. They should take negligible power in standby mode, so can be ignored. For example, the 3v3 serial NOR quad flash U7 N25Q128A13EF840E has a standby current of less than 14 uA, whereas operating takes 15 mA at 108 MHz and 6 mA at 54 MHz. It takes 20 mA during internal write and erase operations.

    FPGA Programmable Logic Power

    When not loaded with any design the PL on the 7010 takes about 22 mW. Larger chips would take more.

    The FPGA takes an extra 40 mW or so of power on system powerup in some sort of self-clear mode.

    CPU Power

    The dynamic L1+CPU energy when running flat out is about 20 mW more than when idle. (Was this per core? Need to check again.) We note a linear decrease in energy use as the programmable CPU clock frequency is scaled down (666 MHz down to 90 MHz).

    Further verification of the L2 cache and tightly-coupled SRAM is ongoing...


    EOF.