Prazor/VHLS Temporary User Manual

Prazor Virtual Platform Home Page.

The Prazor/VHLS Virtual Platform is a simulator written using SystemC TLM 2.0 sockets and can model a number of CPU architectures (x86_64, ARM, MIPS, OpenRISC).

VHLS stands for Very High-Level Simulation. We aim to render complete energy estimates from real workloads on complete datacentres with an accuracy of about +/- (sqrt 10)/2.

Note: this project was released in the additional material annex for the textbook "Modern System-on-Chip Design on Arm" by D J Greaves, and the associated stable version of PRAZOR/VHLS is available for download from modern-soc-design-addmat.

Contents

  • References

  • Introduction

  • Zynq Platform Coverage

  • General Download

  • Hello World Tiny Demo

  • Command Line Flags

  • Prazor Simulator Backdoors

  • Prazor Simulator Structure

  • Loose Timing

  • Compiling Application Programs Bare Metal

  • Using GDB to debug a program

  • Building Code that can run on both the simulator and the real cards

  • Running S/W on a Physical Zynq Card

  • Running Linux on the Prazor Parallella/Zynq Simulator

  • Energy Budgets and Modelling: Modelled versus Measured Comparison and Results

    References

  • PDF: Product-Brief

  • PDF: Zynq-7000-Technical Reference Manual

  • PDF: instruction-set-reference-card

  • PDF: Cortex-A9-MPCore-TRM

  • PDF: ARMv7-M-manual

  • PDF: ARMv7-A-R-manual

  • PDF: Interrupt Controller Specification

  • PDF: ARM1176JZ-S-TRM

    Introduction

    The Prazor Virtual Platform is a simulator written using SystemC TLM 2.0 sockets and can model a number of CPU architectures (x86_64, ARM, MIPS, OpenRISC).

    Currently we are using it to model the Xilinx Zynq platform, such as the Parallella Card or ZedBoard. The model is binary compatible with the real hardware, meaning it can run the same linux kernel and use the same SD card images. We might next make a 4-core model that will be the basis of the new Raspberry Pi 2, or we might make an Allwinner A20 (cubieboard/sunxi) model.

    The platform can be compiled with or without POWER3 and Speedo (see below for what this means). It can also be built as a static platform compile or a DLL platform compile. Dynamic compiles were introduced to support architectural variations between modelled platforms, but are also useful for loading alternative reconfigurable FPGA designs into the model.

    Figure: TLM Zynq and Parallella model.

    Zynq Architecture Coverage

    The Zynq model within Prazor covers the ARM cores, L1 and L2 caches, GIC, SCU, DRAM, SD Card and all peripherals and timers needed to boot linux with UART input and output. The Ethernet is currently missing, as is the Epiphany chip, Neon, USB, the FPGA and TCM. Floating point is being added. Jazelle cannot be added since it is secret. All of these components are modelled as TLM 2.0 blocking SystemC classes with loose timing and quantum keeper.

    Prazor does not directly model the FPGA fabric (programmable logic) of the Zynq. Instead, one route is to use HLS to generate the FPGA design and to use a SystemC output from the HLS tool for the ESL model (ie. link it into the Prazor binary).

    Computer Laboratory Local Use

    Static snapshots, called pvp_nn, of the simulator code are also installed on the Computer Lab file server and can be copied or linked to

    /usr/groups/han/clteach/btlm/current

    If updates are needed, revised/updated copies will be maintained at

    /usr/groups/han/clteach/btlm/pvp_xx

    A first step is to run the pre-built simulator with a pre-compiled binary image and then look at the structure of what you have run. Later steps will be to modify both halves and confirm proficiency with this. This is mostly a matter of getting your environment variables set up correctly.

    Making a simple first run of the pre-built vhls binary.

    The virtual platform can be run from the command line.

    At the Computer Laboratory you will need the following setup:

    export CLTEACH=/usr/groups/han/clteach
    export TARCH=ARM32
    export PRAZOR=$CLTEACH/btlm/current
    

    The pre-built simulator is here:

    export VHLS=/usr/groups/han/clteach/btlm/current/vhls/src/vhls

    The pre-built hello-world program as an ELF binary for ARM is here:

    /usr/groups/han/clteach/btlm/current/vhls/images/hellow-world/hello

    It is easiest to run under make, but a manual command line run should also work. You'll need something like

       $VHLS \
           -dram-system-ini /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/system.ini.example \
           -dram-device /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini \
           -cores 1 -tracelevel 0 -global-qk-ns 1 -no-caches -image ./hello -name vv \
           -- red yellow green blue well done milos 
    

    The command line is divided into two sections using the -- separator. Three sections with two of these separators are sometimes used. The part before the first separator is processed by the Prazor simulator. The part afterwards is passed in to the running program as its argv command line. Generally, this program consists of a wrapper (energyshim.c) that splits the command line again at the second -- separator with the final part being used by the application proper and the earlier part controlling logging instrumentation.
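
    As an illustration of that splitting idea, the following hedged sketch shows how a wrapper in the style of energyshim.c can locate the "--" separator in its own argv and hand the remainder to the application's bm_main entry point. The real energyshim.c differs in detail; this is only a sketch of the mechanism.

      /* Hedged sketch: split argv at "--", keeping the earlier words as
       * logging/instrumentation options and passing the rest to the
       * application proper (bm_main).  Not the real energyshim.c code.  */
      #include <string.h>

      int bm_main(int argc, char **argv);            /* application entry point */

      int shim_main(int argc, char **argv)
      {
        int split = argc;                            /* default: no "--" present */
        for (int i = 1; i < argc; i++)
          if (strcmp(argv[i], "--") == 0) { split = i; break; }

        /* argv[1 .. split-1]   : options for the shim's own instrumentation   */
        /* argv[split+1 .. end] : the command line seen by the application     */
        int    app_argc = (split < argc) ? argc - split - 1 : 0;
        char **app_argv = (split < argc) ? argv + split + 1 : argv + argc;

        return bm_main(app_argc, app_argv);
      }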

    You will get a UART output spool file written to your working directory and also an energy use report from the TLM POWER3 library if included in the platform version.

    Note: you will see very poor performance. This is because the -no-caches flag has removed the caches from the hardware, so raw DRAM access is being used. Just removing this flag is insufficient to get cached operation; you also need to include some coprocessor writes to turn the caches on (see the Caches section below).

    If you get an access denied error this may be because the simulator is trying to write its output file to the current folder for which you do not have write access. You need to run it in a folder where you have write access.

    Further abbreviations for ease of use

    If you wish to continue running from the command line, it is suggested that you use something like the following macros to abbreviate the lengthy invocation.

    export CLTEACH=/usr/groups/han/clteach
    export TARCH=ARM32
    export PRAZOR=$CLTEACH/btlm/current
    export SIMBIN=/usr/groups/han/clteach/btlm/current/vhls/src/vhls
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYSTEMC/lib-linux64
    export DRAM1="-dram-system-ini /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/system.ini.example"
    export DRAM2="-dram-device /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini"
    export PRARM="$SIMBIN $DRAM1 $DRAM2"
    

    so now you can run something short like

        $PRARM -image hello
      

    ... or else write a little shell script of your own. There is no Prazor GUI.

    If it complains it cannot find the TLM_POWER3 library or the SystemC library you need one or both of the following:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/groups/han/clteach/tlm-power3/lib-linux64
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYSTEMC/lib-linux64
    

    Structure of the Hello World Tiny Demo

    The 'hello-world' program (A Minimalist Bare-Metal Running Program for Prazor - One Thread, No libc, No device drivers) is illustrated HERE. This is about the simplest program you can run.

    Take a look at the makefile and the disassembly of this test program. We are running bare-metal and you will see the following components:

    Building Your Own Copy of Prazor

    One purpose of building your own copy of the simulator is to explore the performance of a program on different architectures, such as with different cache sizes.

    Download or Copy Source Files

    Before open public release, please:

    1. Get an xparch user ID and password from DJG or MP
    
    2. Login via the web interface (password required) to    http://phabricator.xparch.com
       2a and change your password.
       2b Install your public key on phabricator
    
    3. Use ssh-agent to put your private key in scope and then
    
       git clone ssh://vcs@phabricator.xparch.com:22/diffusion/P/vhls.git
    
    
    4. cd to vhls and follow the README.md
    

    Owing to the nature of git, it is possible for local users to clone their own git repo from the checkout in /usr/groups/han/clteach/btlm/current using an ordinary local-path clone (git clone /usr/groups/han/clteach/btlm/current), or just by doing a simple cp -ra of that folder so that you get the hidden .git repo management files.

    Manual Linking Makefile

    The Prazor Virtual Platform is normally built with automake and the platform of interest is then loaded as a DLL (.so) file. But this can be very tricky to use if you have not used it before, and so a simple manual Makefile is given here that you can adapt if you prefer. This makefile also helpfully gives you a strong idea of the overall structure and of what automake is achieving.

    #
    # VHLS Prazor hand-crafted Makefile for SystemC TLM Zynq-style models.
    #
    # People often have problems with automake, so a simple manual Makefile is all we really need.
    # This version assumes the C++ src files have already been compiled to .o form
    # This version without TLM POWER3 and Speedo
    #
    
    VHLSDIR=/usr/groups/han/clteach/btlm/pvp10/vhls
    
    CPP=g++
    
    CPPFLAGS=-O2 -std=c++0x \
       -DTARCH=ARM32 \
       -DVHLS_STATIC=1 \
       -DSC_CPLUSPLUS=199701L -DSC_DISABLE_API_VERSION_CHECK=1 \
       -I/usr/groups/han/clteach/systemc/systemc-current/include \
       -I/usr/groups/han/clteach/systemc/systemc-current/include/tlm_core/tlm_2 \
       -I/usr/groups/han/clteach/boost/boost_1_48_0  \
       -I$(VHLSDIR)/src
    
    LDFLAGS = -L$(SYSTEMC)/lib-x86_64 -lsystemc -lpthread
    
    # Simulator infrastructure
    PRAZOR_SRC= \
    	$(VHLSDIR)/src/tenos/argv_backdoor.o \
    	$(VHLSDIR)/src/tenos/cpu_busaccess.o \
    	$(VHLSDIR)/src/tenos/generic_branch_predictor.o \
    	$(VHLSDIR)/src/tenos/io_backdoor.o \
    	$(VHLSDIR)/src/tenos/MpHash.o \
    	$(VHLSDIR)/src/tenos/tenos.o \
    
    # The platform of interest
    PARALLELLA= \
    	$(VHLSDIR)/src/platform/arm/zynq/parallella/zynq.o 
    
    # Generic ARM IP blocks
    ARM_A9_ZYNQ = \
    	$(VHLSDIR)/src/arm/arm_abt.o \
    	$(VHLSDIR)/src/arm/arm_ccache.o \
    	$(VHLSDIR)/src/arm/armcore_tlm.o \
    	$(VHLSDIR)/src/arm/arm_cortex_a9.o \
    	$(VHLSDIR)/src/arm/arm_cp14.o \
    	$(VHLSDIR)/src/arm/arm_cp15.o \
    	$(VHLSDIR)/src/arm/armdis.o \
    	$(VHLSDIR)/src/arm/armisa.o \
    	$(VHLSDIR)/src/arm/arm_L2Cpl310.o \
    	$(VHLSDIR)/src/arm/arm_mmu.o \
    	$(VHLSDIR)/src/arm/arm_scu.o \
    	$(VHLSDIR)/src/arm/armthumb.o \
    	$(VHLSDIR)/src/arm/arm_timers.o \
    	$(VHLSDIR)/src/arm/gic_arm_tlm.o \
    	$(VHLSDIR)/src/arm/sclr_arm_tlm.o 
    
    # Static memories and caches
    MEMORYSRC= \
    	$(VHLSDIR)/src/memories/base_mmu_tlm.o \
    	$(VHLSDIR)/src/memories/ccache.o \
    	$(VHLSDIR)/src/memories/dram64_cbg.o \
    	$(VHLSDIR)/src/memories/generic_tlm_mem.o \
    	$(VHLSDIR)/src/memories/memloaders.o \
    	$(VHLSDIR)/src/memories/scu.o \
    	$(VHLSDIR)/src/memories/secondary_cache_with_directory.o \
    	$(VHLSDIR)/src/memories/sram64_cbg.o 
    
    # DRAM simulator
    DRAM2SIMSRC= \
    	$(VHLSDIR)/src/dramsim2/dramsim_sc_wrapper.o \
    	$(VHLSDIR)/src/dramsim2/dist/Bank.o \
    	$(VHLSDIR)/src/dramsim2/dist/BankState.o \
    	$(VHLSDIR)/src/dramsim2/dist/BusPacket.o \
    	$(VHLSDIR)/src/dramsim2/dist/CommandQueue.o \
    	$(VHLSDIR)/src/dramsim2/dist/IniReader.o \
    	$(VHLSDIR)/src/dramsim2/dist/MemoryController.o \
    	$(VHLSDIR)/src/dramsim2/dist/MemorySystem.o \
    	$(VHLSDIR)/src/dramsim2/dist/Rank.o \
    	$(VHLSDIR)/src/dramsim2/dist/SimulatorObject.o \
    	$(VHLSDIR)/src/dramsim2/dist/TraceBasedSim.o \
    	$(VHLSDIR)/src/dramsim2/dist/Transaction.o 
    
    # GDB stub via RSP Protocol
    GDBSRC= \
    	$(VHLSDIR)/src/gdbrsp/gdbrsp.o \
    	$(VHLSDIR)/src/gdbrsp/GdbServerSC.o \
    	$(VHLSDIR)/src/gdbrsp/RspConnection.o \
    	$(VHLSDIR)/src/gdbrsp/RspPacket.o \
    	$(VHLSDIR)/src/gdbrsp/Utils.o \
    	$(VHLSDIR)/src/gdbrsp/vhls_soc_debug.o
    
    # Bus Components
    BUSSRC= \
    	$(VHLSDIR)/src/bus/busmux64.o
    
    # Input and Output Devices
    IO= \
    	$(VHLSDIR)/src/io/sdio_cbg.o \
    	$(VHLSDIR)/src/io/uart64_cbg.o
    
    
    all:
    	$(CPP) -o vhls $(CPPFLAGS) $(LDFLAGS) $(PRAZOR_SRC) $(IO)  $(BUSSRC) \
                  $(MEMORYSRC) $(GDBSRC) $(DRAM2SIMSRC) \
                  $(ARM_A9_ZYNQ) $(PARALLELLA) $(VHLSDIR)/src/vhls.cpp \
                  $(LDFLAGS)
    	echo "The executable file vhls should now exist in the current directory"
    	ls -l
    # eof
    

    You will find a copy of this Makefile in

    /usr/groups/han/clteach/btlm/pvp10/ManualMake
    or similar.

    Using Automake to Compile Prazor

    You will first need a copy of the source files.

    Set ups

    You only need to set BOOST, LDFLAGS, TLM_POWER3 and CXXFLAGS for compiling the simulator. For running it you just need the PRAZOR and TARCH settings as mentioned above.

    The following setup is generally what is needed:

    export CLTEACH=/usr/groups/han/clteach
    export BOOST=$CLTEACH/boost/boost_1_48_0
    export BOOST_ROOT=$BOOST
    export SYSTEMC=$CLTEACH/systemc/systemc-current
    export TLM_POWER3=$CLTEACH/tlm-power3
    export LDFLAGS="-L$SYSTEMC/lib-linux64 -L/usr/local/lib -L$TLM_POWER3/src/.libs"
    export CXXFLAGS="-I$SYSTEMC/include/ -I$BOOST_ROOT -I$SYSTEMC/include/tlm_core/tlm_2 -I$TLM_POWER3/include -g -O2"
    export TARCH=ARM32
    export PRAZOR=$CLTEACH/btlm/current
    
    The automake settings precisely needed should be on the toolinfo page. The key fact that was holding us up at the start of term is that you need to add --host=x86_64-pc-linux-gnu to the configure command line.

    Plugging your own IP Block Models into the Virtual Platform

    As noted above, one purpose of building your own copy of the simulator is to explore the performance of a program on different architectures, such as with different cache sizes. You also need to recompile the simulator to include your own IP blocks.

    New blocks should be instantiated inside the platform configuration C++ file, such as

              /usr/groups/han/clteach/btlm/current/vhls/src/platform/arm/zynq/parallella/zynq{.cpp,.h}
    

    You need to get to the point where you have your own copy of this file where you have made modifications. You can either make links to the existing versions of all the other files in the simulator or have your own copies of them too.

    To instantiate a simple peripheral for programmed I/O access, first look at the way an existing I/O device is wired in, such as the UART. You can see it is connected to the I/O bus with the following line

                 BUSMUX64_BIND(busmux0, UARTS[uu]->port0, start, UART_SPACING)
    
    and your own device can be connected to busmux0 with an additional, similar call. The busmux may automatically allocate a programmed I/O base address spaced by some factor given to its own constructor, or else you can pass in an explicit base address. Note that busmuxes can be arranged in a tree structure and all busmuxes along the route to an IP block must forward the transaction appropriately.

    Using a base address of 0x43c00000 will mean the self-same code can run on the virtual ARM as on the real ARM.

    There may be code relating to ETHERNET_CRC in the platform file: if so, this is a placeholder that is otherwise unused and you can remove it or else adapt it as a basis of your own.

    Note that Prazor uses an extended generic payload, not the default one provided in SystemC. Hence, your device will need to instantiate a simple_target_socket where an extra type argument that defines the payload is passed to the socket, such as (illustrative names only)

        tlm_utils::simple_target_socket<my_device, 64, PRAZOR_GP_T> port0;  // copy the exact payload type from an existing device such as the UART
    
    Note also that Prazor uses 64-bit generic payloads even when modelling 32-bit systems like the ARM7.
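
    Below is a hedged sketch of a minimal programmed-I/O peripheral with a single 64-bit register, written against the standard TLM-2.0 convenience socket. The module name, register and latency are invented for illustration, and the Prazor extended payload type is deliberately not spelled out: copy the exact socket declaration and b_transport signature from an existing device such as the UART before using this for real.

      // Hedged sketch only: minimal TLM-2.0 programmed-I/O target device.
      // Prazor devices add the extended payload type as an extra socket
      // template argument; copy the exact form from an existing device.
      #include <cstdint>
      #include <systemc>
      #include <tlm>
      #include <tlm_utils/simple_target_socket.h>

      SC_MODULE(my_pio_device)
      {
        tlm_utils::simple_target_socket<my_pio_device, 64> port0;
        uint64_t scratch;                     // the single readable/writable register

        SC_CTOR(my_pio_device) : port0("port0"), scratch(0)
        {
          port0.register_b_transport(this, &my_pio_device::b_transport);
        }

        void b_transport(tlm::tlm_generic_payload &gp, sc_core::sc_time &delay)
        {
          uint64_t *dptr = reinterpret_cast<uint64_t *>(gp.get_data_ptr());
          if (gp.is_write()) scratch = *dptr;
          else               *dptr   = scratch;
          gp.set_response_status(tlm::TLM_OK_RESPONSE);
          delay += sc_core::sc_time(10, sc_core::SC_NS);  // nominal device latency
        }
      };

    In the platform file it would then be wired to busmux0 in the same way as the UART above, e.g. BUSMUX64_BIND(busmux0, mydev->port0, 0x43c00000, 0x1000), where the spacing argument 0x1000 is illustrative.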

    DRAM Data Sheets.

    The parallella card uses twin-die MT41K256M32SLD-125 DRAM 256Mx32 PDF DATASHEET . Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts. 8Gbit = 1 Gbyte = 2^30 bytes = 2^3 banks * 2^15 ras * 2^10 cas * 2^2 lanes. At 175 mA ICCD1 current it dissipates 236mW when active. See bottom for actual measurements.

    A suitable datasheet file for DRAMsim2 may be present here: DRAMSim2 on github and if not I provide a temporary one: DDR3_micron_32M_8B_x32_parallella.ini. Application note: Micron TN-41-01: Calculating Memory System Power for DDR3.

    DRAM bank interleave is controlled in hardware by the following registers that are set by linux as

    DRAM_addr_map_bank 0xF800603C  0x777
    DRAM_addr_map_col  0xF8006040  0xFFF00000
    DRAM_addr_map_row  0xF8006044  0x0F666666
    
    This gives, in detail, the following mapping for the bottom 1 GByte (A[31:30]==2'b00):
       ra15 = 0
       ra[11:2] = 6+11 = A[28:17]
       ra1  = 6+10  = A[16]
       ra0  = 6+9   = A[15]
       ba[2]  = 7+7   = A[14]             //   777 sets this
       ba[1]  = 7+6   = A[13]
       ba[0]  = 7+5   = A[12]
       ca[13:10] = 0
       ca[9]  = 0+11  = A[11]
       ca[8]  = 0+10  = A[10]
       ca[5]  = 0+7   = A[7]
       ca[4]  = 0+6   = A[6]
       ca[3]  = 0+5   = A[5]
       ca[2:0] =   A[4:2]  // Burst/line Offset
       lane[1:0] = A[1:0]  // Byte Lane Offset
    
    So this is standard-enough layout: A[29:0] = { row, bank, col, lane }.

    In terms of DRAMSIM2 from U-Maryland, it is Scheme6, chan:row:bank:rank:col, with the chan and rank fields being null.
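
    For reference, the byte-address decode implied by that summary can be captured in a few lines of C. This is only a hedged restatement of the register settings listed above (field widths taken from that listing); it is not code from the simulator.

      /* Hedged restatement of the { row, bank, col, lane } layout above. */
      #include <stdint.h>

      struct dram_coord { uint32_t row, bank, col, lane; };

      static struct dram_coord decode_dram_addr(uint32_t a)     /* A[29:0] */
      {
        struct dram_coord c;
        c.lane = a         & 0x3;     /* A[1:0]   byte lane                         */
        c.col  = (a >> 2)  & 0x3FF;   /* A[11:2]  column; low 3 bits = burst offset */
        c.bank = (a >> 12) & 0x7;     /* A[14:12] bank                              */
        c.row  = (a >> 15) & 0x3FFF;  /* A[28:15] row (ra15 is tied to 0)           */
        return c;
      }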

    Prazor Command Line Flags

    The -cores n flag specifies how many CPU cores are created.

    The -self-starting-cores n flag specifies how many CPU cores are in run mode at reset time. For Zynq linux this option should not be used, thus leaving the setting at its default of 1 because the linux kernel uses a store to the system controller block from the first CPU to enable the subsequent core(s).

    The -image flag specifies an ELF binary file to be loaded into the DRAM model.

    The -tracelevel n flag turns on a given level of tracing. n is in the range 0 to 9 with 0 being all tracing off. Multiple tracelevel flags can be used, separated by -watch flags.

    The -watch B +N flag takes a hex base address and a hex length N to define a region to watch. This region is watched at the tracelevel of the previous tracelevel flag. Typically you might set tracelevel 9, then define some watch regions, then set tracelevel back to zero so that no tracing is done outside the watched regions.

    The -name vv flag defines the name to be used as root segment in the name of output files. This is useful to distinguish output for sequences of experiments run by the same makefile in a common folder.

    The -dram-system-ini filename.ini sets up the DRAM simulator parameters. Just use the standard file provided always.

    The -dram-device datasheet.ini sets up the DRAM type in use. Each type of DRAM has different energy and performance details. For the Parallella card, please use XXXX? TBD.

    The -wait-for-debugger causes the simulation not to start until a GDB session is remotely attached.

    The -no-caches flag is implemented in abench1.h to not instantiate the caches and to connect the ARM cores directly to the DRAM. A variant is perhaps needed to skip either L1 or L2 but not both. With the -no-caches flag the caches are removed from the design and their static power disappears. The processor will run at the speed allowable by the main store, which will be slow if this is DRAM without caches.

    The -no-harvard flag is implemented in abench1.h to only instantiate a single L1 cache per core, not split I and D caches. Variants to control the cache size and associativity could easily be implemented.

    There are further options - please read the src code vhls.cpp where they are mainly parsed.

    Prazor Simulator Backdoors

    A 'backdoor' is an artefact not present on the real hardware that can be used on the simulator for access to a simulator feature, such as logging or exiting the simulator.

    Some backdoors are available, for access to simulator argv and argc and getenv.

    Some SWI instructions, using codes not used by linux, are hardwired as backdoors for writing a character without going via a UART or other output model, for exiting the simulator, and for getting core instruction counts without using a platform-specific PMU setup.

     
    // ARM trigger instruction trace - print next 100 instructions
       asm volatile (" swi #203":: );                             
    

    Prazor Simulator Structure

    For the P35 course, the file that defines the architecture you are modelling is

    $PRAZOR/src/platform/arm/zynq/parallella/zynq{.h,.cpp}

    This defines the system topology, as shown in the diagram above. It is the main file to edit when you adjust the simulated architecture. It can be loaded as a DLL or statically linked.

    You should take a good look at that file since you must make minor edits to it when you add your own IP blocks to the system.


    Prazor Loose Timing

    Prazor uses Loose Timing. The LT quantum is set on the command line. If it is set to a value close to a bus-cycle period, then transaction order is guaranteed and it degenerates to an approximately-timed model.
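
    As background, the loosely-timed pattern that the quantum keeper supports looks roughly like the sketch below, which uses only the standard TLM-2.0 tlm_utils::tlm_quantumkeeper (the quantum itself is what the -global-qk-ns flag sets). Prazor's CPU models do the equivalent internally; the names and structure here are illustrative, not Prazor's own classes.

      // Generic TLM-2.0 loosely-timed initiator loop with a quantum keeper.
      // Illustrative only; Prazor's initiators manage this internally.
      #include <systemc>
      #include <tlm>
      #include <tlm_utils/tlm_quantumkeeper.h>

      void lt_run(tlm::tlm_initiator_socket<64> &isock,
                  tlm::tlm_generic_payload &trans)
      {
        tlm_utils::tlm_quantumkeeper qk;
        qk.reset();
        for (int i = 0; i < 1000; i++) {                 // e.g. 1000 transactions
          sc_core::sc_time delay = qk.get_local_time();  // local offset so far
          isock->b_transport(trans, delay);              // target augments the delay
          qk.set(delay);                                 // record the new local offset
          if (qk.need_sync()) qk.sync();                 // yield only at quantum boundaries
        }
      }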

    Q. When I use sc_delay_ += std_delay, I get:

        Assertion `unused_delay == SC_ZERO_TIME' failed.
    

    A. The Prazor model does not normally use the textbook sc_delay value. That is why it is named sc_delay_ throughout, to indicate that the standard field should not be used directly. Instead Prazor is normally compiled to use prazor_gp_t, which has its own delay field called ltd of type lt_delay. The advantage of this type is that it allows forks and joins in the loosely-timed trajectory, with a MAX function being applied correctly at the joins. The coding style used throughout is to use the AUGMENT_LT_DELAY macro in place of the manual addition you have used; the macro augments the correct delay variable according to the coding style in use.


    Compiling Application Programs Bare Metal

    For simplicity and for clear results, rather than booting an operating system on the simulator as well as the application of interest, it is often better to run the application 'bare metal'. For bare-metal operation you statically link a few tiny device drivers or backdoor drivers with the application and just load the combined binary into the simulator.

    The 'hello-world' program (A Minimalist Bare-Metal Running Program for Prazor - One Thread, No libc, No device drivers) is illustrated HERE. This is the simplest program you can run.

    But for more-complex bare-metal operation you will need to follow the structure of the images/dfsin Makefile and Makefile.inc.arm.

    Thumb2 mode is now implemented. If you see 16-bit instructions in your disassembly then the compiler has used Thumb mode. If you wish to avoid this mode, then pass appropriate options to GCC such as -march=armv6 -marm.

    For running on more than one core, follow the design pattern of djgradix, which uses the djgthreads implementation of Posix pthreads on bare metal. (NB: Today - 19th Feb - we have not tried that for ARM recently - will double check all is ok in the next few days). Note that this is a non-preemptive implementation that does not time-share threads over or within a core. The original program gets core 0 and the new threads get exclusive use of further cores until they exit. Therefore, for Zynq with only two cores please only start one new thread.
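
    A hedged sketch of that two-core pattern follows, assuming djgthreads provides the usual pthread_create/pthread_join subset; only one extra thread is started and it takes the second core.

      /* Two-core bare-metal pattern in the djgradix style.  Hedged sketch:
       * assumes the djgthreads library provides this pthread subset.        */
      #include <pthread.h>

      static void *worker(void *arg)          /* runs on the second core */
      {
        /* ... do half of the work ... */
        return 0;
      }

      int bm_main(int argc, char **argv)
      {
        pthread_t t;
        pthread_create(&t, 0, worker, 0);     /* new thread takes the next free core */
        /* ... do the other half of the work here on core 0 ... */
        pthread_join(t, 0);                   /* wait for the second core to finish  */
        return 0;
      }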

    As a C library, please use prlibc and not uClibc, unless you want to invest time getting uClibc working bare-metal on ARM (it has worked fine for OpenRISC in recent years).

    Caches

    Running linux turns on the caches. For bare-metal runs on real cards or the PRAZOR simulator, please insert something like the following code to enable the L1 and L2 caches. Without this you will see only zeros in the cache hit statistics, since caches power up disabled.

    This code is found in the energyshim.c wrapper or you may paste it from here:

      asm("mov r0,#0x1000"); // Turn on L1 Cache (see Zynq TRM for further details.)
      asm("orr r0,r0,#4");
      asm("mcr  p15, 0, r0, c1, c0, 0"); //  (r0 = 0x1004)
      // You might possibly also need
      ((volatile int *)0xF8F02100)[0] = 1; // Zynq: turn on L2 cache 
    
    This will cause the following to be printed from Prazor when appropriate tracing is enabled:
    Cache the_top.coreunit_0.l1_d_cache_0 is ENABLED now
    Cache the_top.l2_cache_and_controller is ENABLED now
    Cache the_top.l2_cache_and_controller is ENABLED now
    Cache the_top.coreunit_0.l1_i_cache_0 is ENABLED now
    
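    A slightly more robust variant, folded into a single GCC extended-asm statement so that it does not rely on r0 being preserved between separate asm() statements, might look like the sketch below. The control-register value and the L2 controller address are the same ones used above; treat the function as illustrative rather than as Prazor-supplied code.

      /* Hedged sketch: the same cache-enable writes as above, in one function. */
      static inline void enable_l1_l2_caches(void)
      {
        unsigned int v = 0x1004;            /* SCTLR: bit 12 = I-cache, bit 2 = D-cache */
        asm volatile ("mcr p15, 0, %0, c1, c0, 0" :: "r"(v) : "memory");
        ((volatile unsigned int *)0xF8F02100)[0] = 1;   /* Zynq: enable the L2 cache */
      }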

    Code Structure

    Try to structure your code so that it links together a .o file that can also be run on the real parallella card. That way you are sure you are really running the same binary on the real system as on the simulator. You will need to link this .o file in a different way (using system call stubs etc) for running on linux compared with bare metal. It is also a good idea to structure your makefiles such that you can run the program natively on your linux workstation. Finally, think about iteration counts: for stopwatch (or time ./a.out) style performance measurement on real hardware you will perhaps need to run the benchmark's outer loop 1000 times more than when running on the SystemC model.

    Reusing the code in the images/energyshim.c file enables you to wrap up your benchmark's main entry point (bm_main) in a reusable manner. This shim can report energy and time use on both the real systems and the simulators if you get it all set up right.

    Overall, structure your code so that you can batch run it in lots of different configurations from one Makefile (e.g. with different numbers of cores or different cache configurations or clock frequencies) - this will generate data suitable for plotting in your report.

    Using GDB to debug a program

    The gdb debugger can be connected to a running Prazor simulation for single stepping, inspecting and changing memory and setting break and watchpoints. This uses the RSP protocol over a TCP remote connection.

    You should start the simulator with the -wait-debugger option. The simulator will print which port it is listening on and this is commonly localhost:9600.

    Load your binary ELF file into the debugger using the gdb file command before remotely connecting.

      $ /usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb
      GNU gdb (GDB) 7.8
      (gdb) target remote :9600
      Remote debugging using :9600
      0x00000000 in ?? ()
      (gdb) file dfsin % The path to your binary ELF image
      Reading symbols from dfsin...done.
      (gdb) cont
      Control-C
      (gdb) where
      [Remote target] #1 stopped.
      (gdb) where
      #0  puthex64 (leadingz=leadingz@entry=1, d0=125116699948, f=)
          at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:1052
      #1  0x00041ce8 in locked_printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n", 
          format@entry=0x44fc0  ",2", poi=..., poi@entry=...) at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:274
      #2  0x00041f68 in printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n")
          at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:342
      #3  0x000418b4 in bm_main (verbosef=verbosef@entry=7) at dfsin.c:177
      #4  0x00043af0 in main (argc=7, argv=0xfffe0004) at ../energyshim.c:200
      #0  mul64To128 (a=12105675798371893248, b=7613535337020653568, z0Ptr=0x41134 , z1Ptr=0x3bff50) at softfloat-macros:186
      #1  0x00173a24 in ?? ()
      (gdb) info registers
      r0             0x0      0
      r1             0x100    256
      r2             0xe0001000       3758100480
      r3             0x9      9
      r4             0x2189552c       562648364
      r5             0x1d     29
      r6             0xa      10
      r7             0x452b0  283312
      r8             0xe801c  950300
      r9             0x1      1
      r10            0x0      0
      r11            0x1000000        16777216
      r12            0x45198  283032
      sp             0x3bff20 0x3bff20
      lr             0x43954  276820
      pc             0x430b4  0x430b4 
      cpsr           0x8000005f       2147483743
      (gdb)  x/4x 0x44dc4
      0x44dc4 <__udivdi3+208>:        0xe0a55005      0xe2533001      0x1afffff8      0xe0948000
      (gdb)  x/4i 0x44dc4
      => 0x44dc4 <__udivdi3+208>:     adc     r5, r5, r5
         0x44dc8 <__udivdi3+212>:     subs    r3, r3, #1
         0x44dcc <__udivdi3+216>:     bne     0x44db4 <__udivdi3+192>
         0x44dd0 <__udivdi3+220>:     adds    r8, r4, r0
      (gdb) print num_keys  -- display a global variable
      $1 = 1000
      (gdb) up
      #4  0x00041620 in estimateDiv128To64 (a1=0, b=, a0=4815960295168657408) at softfloat-macros:217
      217       z = (b0 << 32 <= a0) ? LIT64 (0xFFFFFFFF00000000) : (a0 / b0) << 32;
      (gdb) print z -- try to print a local var (but it was in a register)
      $2 = 
      (gdb) print a0  -- print a local var that still exists
      $3 = 4815960295168657408
      (gdb) info threads
        Id   Target Id         Frame   -- We only have one CPU core (sim commands line was -cores 1) 
       * 1    Remote target     0x00041620 in estimateDiv128To64 (a1=0, b=, a0=4815960295168657408) at softfloat-macros:217
    
    

    Start gdb and give the command "target remote localhost:9600". You can leave out the machine name or the localhost name if you run the simulator and debugger on the same machine.

    Ideally you need to use an ARM version of gdb to connect. Otherwise you will see registers with x86 names being displayed.

    A copy of gdb for ARM is installed at the Computer Laboratory here:

    /usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb
    There may be more-recent versions around.

    Switching between cores: the threads commands of gdb have been adapted/abused to enable you to connect to different cores of a multiprocessor. Further details of how to switch between cores and the spEEDO energy API will be added here ...

    Build gdb for ARM with this arg to configure
    ./configure --target=arm-linux-gnu
    
    Useful gdb commands:
      target remote localhost:9600
      info registers                -- show register contents
      x/16i $pc                     -- display next 16 instructions disassembled
      x/32x $sp                      -- display memory from sp upwards for 32 words
      file /home/djg11/d320/prazor/trunk/chstone/dfsin/dfsin -- load the symbol table from an ELF binary for symbolic debugging
    -- If it says no debugging symbols then it is stripped binary
      load                          -- download a binary (or the current binary) - better to load from the command line -image arg?
      cont                          -- continue execution
      break 0x44cca                 -- set a break point
      where                         -- give stack backtrace
      up                            -- change current stack frame by moving up one stack frame (towards caller) 
      down                          -- the reverse, move down one stack frame
      print num_keys                -- display a global variable or local var in current stack frame
      info threads -- show threads (may show hardware cores instead when using Prazor)
      thread 2 -- switch thread/core
    
    
      set $pc = 0x485               -- perform a jump (not working prior to 18th feb 2015).
      stepi                         -- run one instruction only (not working 18th feb 2015 but being fixed).
    
    // useful additional commands for debugging the debugger:
     set debug remote 1
     set debug target 1
    
    


    Building Code that can run on both the simulator and the real cards

    The gcc C compiler defaults to generating Thumb-2 machine code and hardware floating point. The Thumb modes and hardware floating point on the simulator do not currently work (they are just being debugged and should work in the near future), so meanwhile, running the same binary code on both systems requires coercing gcc to use only the old ARM32 mode with software floating point. The relevant flags are:

       -marm -mfloat-abi=soft
    

    With these flags, code compiled on the workstation or on the real card is interchangeable and .o files can freely be copied backwards and forwards using scp.

    However, note that the installed libraries, libc and libgcc, on the Parallella cards also use Thumb mode, so for detailed performance comparison (and to avoid linker errors about VFP) you should avoid using these and instead use your own compiled versions too (or the ones in the links in the detailed documentation at the bottom of this section).

    The differences between application binaries targeted at running bare metal on the simulator and at running on linux on the real card are mainly to do with console I/O. The I/O paths are very different and so performance comparisons should not be made. Also, they are incompatible and you need to swap the system calls used on the real card for direct calls to the UART device driver as used on the simulator.

    The best way to redirect the I/O is to link the same .o application and library files with some slightly different I/O shims, as in the barelift shim example below. Essentially you want to replace uart64_driver.o with bareliftshim.o. You should link using the ld program (not using gcc as a linker) since this will give you complete control over which kickoff code and libraries are included. Also, always check what you have made using objdump -d as a disassembler.

    For example, to run the dfsin standard test one would link as follows, using the same binaries as were run on the simulator except for the barelift pair.

            LIBGCC=/home/linaro/djg11/libgcc-nothumb.a
            ld -o a.out bareliftcrt.o bareliftshim.o dfsin.o prlibc.o $(LIBGCC)
            objdump -d ./a.out > dis
            # If you see any 16 bit (4 hex digit) opcodes in the disassembly then you are using Thumb code.
            ./a.out
    

    Once Thumb modes are working we can compare energy and performance with and without them. (It is working now - thanks Milos - April 2015).

    Detailed resources: bareliftshim-files.zip ZIP ARCHIVE.

    Archive:  bareliftshim-files.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
       146894  2015-02-24 10:52   home/linaro/libgcc.a -- This libgcc is Thumb and wont currently work on simulator.
         1250  2015-02-25 08:44   bareliftshim.c
         4860  2015-02-25 08:49   bareliftshim.o
         1406  2015-02-25 08:42   bareliftcrt.S
          984  2015-02-25 08:49   bareliftcrt.o
          631  2015-02-25 08:53   Makefile
         4777  2015-02-25 08:53   BARELIFT-README.txt
        23360  2015-02-25 08:49   dfsin.o
        48168  2015-02-24 18:45   prlibc.o
    ---------                     -------
       232330                     9 files
    

    Running S/W on a Physical Zynq Card

    The physical board information is now on a separate page: Own Page.


    Running Linux on the Prazor Parallella/Zynq Simulator

    Video Screen Capture of Interactive Linux Shell Session

    Video screen capture : interactive linux session on the SystemC Zynq model: MP4 Video.

    Procedure to compile and run your own Linux kernel on Prazor:

    Get a copy of linux kernel sources

    The Zynq linux source files can be obtained from git as follows:

    How to compile the ADI based linux kernel? (uImage)
    git clone https://github.com/parallella/parallella-linux
    cd parallella-linux
    bash
    export ARCH=arm
    export CROSS_COMPILE=arm-linux-gnu-
    export PATH=:$PATH
    make ARCH=arm parallella_defconfig
    make ARCH=arm LOADADDR=0x8000 uImage
    

    After this has completed, you will find a file called vmlinux which is the ELF binary for the kernel that can be loaded into the simulator. This section contains a Very Drafty Description of this process.

    You will need:

    Reference copies of some of these are stored in this folder

    $PRAZOR/vhls/boards/parallella/linux
    

    Command Line

      /home/djg11/d320/prazor-virtual-platform/vhls/src/arm/vhls-arm7smp -kernel 
      /home/djg11/parallella-sw/complete-from-git/parallella-linux/vmlinux
      -devicetree prazor-linux.dtb 
      -boot loimage 
      -vdd disk3.img 
      -dram-system-ini /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/system.ini.example  
      -dram-device /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini  
      -cores 2
    

    loimage booter

    This shim replaces the standard linux grub or whatever boot loader. It runs from the first ARM core reset vector of 0. It jumps to the linux kernel after having set up r2 to point at the device tree blob.

    Simulated output: CONSOLE LOG.

    Mount Disk Image on Host Workstation (not simultaneously while mounted as a SystemC model?)

    Disks have a root partition that starts at the 2048th block.
      
    fdisk -l ./disk1.img gives:  
    
    Disk ./disk1.img: 16 MB, 16252928 bytes
    4 heads, 31 sectors/track, 256 cylinders, total 31744 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x658d7048
    
          Device Boot      Start         End      Blocks   Id  System
    ./disk1.img1            2048       31743       14848   83  Linux
    

    To mount it on the workstation you need to set the loopback device to start at byte offset block_size * start_block, which in this case is 512*2048 = 1048576. So you would need to do:

    sudo losetup -o1048576 /dev/loop0 disk1.img
    sudo mount -t ext4 /dev/loop0 /mnt
    

    Energy Budgets and Modelling: Modelled versus Measured Comparison and Results

    There are various sources of discrepancy between the Prazor energy results and the measurements on parcard1.

    Firstly, the implementation of the sPEEDO interface on Prazor returns the core's energy account and the whole-system energy account, whereas the physical probe attached to parcard1 returns the whole-system energy and the energy used by the Zynq core. So the figures reported would not directly agree even if both sides were completely accurate. Note: the Zynq core contains both processors, the L2 cache + GIC + SCU, and the programmable FPGA logic.

    Secondly, the Prazor model is missing some features, notably the DRAM driving pads and the Ethernet PHY energy.

    Ethernet Power

    The Ethernet PHY on the real card takes a lot of energy that is not modelled (currently) in Prazor. This has four modes of operation that take different amounts of power. If you unplug the 100 Mbps Ethernet cable and log in via the UART instead the power saving is about 160 mW. The actual measurements of the 5 volt supply current were:

        Ethernet unplugged   385mA
        10  Mb/s             410mA
        100 Mb/s             425mA
        1 GB/sec             470mA
    

    The power consumed by the PHY can be computed by multiplying the milliamp figure by 0.001 to convert to amps, then multiplying by 5 to convert to watts, and finally taking off 20 percent to account for the SMPSU efficiency in converting from 5 to 1.8 volts.
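
    For example, applying that rule to the measured step between the 100 Mb/s and unplugged cases listed above:

        (425 - 385) mA  x 0.001  =  0.040 A
        0.040 A         x 5 V    =  0.200 W
        0.200 W         x 0.8    =  0.160 W   (the ~160 mW saving quoted above)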

    Another 40 mW may be accounted for by the LED indicators, which go off when the Ethernet is unplugged.

    The Ethernet MAC hard macro is quoted in the Xilinx datasheet as taking 15 mW.

    DRAM Power

    The parallella card uses twin-die MT41K256M32SLD-125 DRAM 256Mx32 PDF DATASHEET . Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts.

    During normal quiescent operation (i.e. no L2 cache misses and infrequent refresh) it should take about 32 mW. This should rise to 613 mW under very heavy traffic with no locality of reference (i.e. one burst read or write per row activation) and be perhaps about 400 mW under heavy load with well-organised data layout (many bursts per row activation).

    The measured power use of the real DRAM was a little different!

    With reset held down it takes about 7 mW, which is as per the data sheet. But for all other traffic loads the supply current varies between 424 and 464 mA (multiply by the 1.35 V supply voltage to get mW, i.e. roughly 570 to 630 mW). This is odd. Writing 0x83 to ddrc_ctrl at 0xf800_6000, which supposedly makes the DRAM more efficient, reduces energy use by only a percent or so.

    Presumably the Zynq chip is performing many more operations on the DRAM than expected, and hence the measured DRAM power does not correlate with the Prazor prediction. This could be explained, perhaps, by linux having set the refresh period to a stupidly small value (but inspection of reg_ddrc_t_rfc_nom_x at 0xF8006004 reveals 0x61, which is a sensible value giving a 5 us or so interval), by anti-rowhammer cycles, or by ECC scrubbing of L2 operating far too fast.

    In addition, the DRAM driving I/O pads on the Zynq I/O ring take quite a lot: perhaps 100 mW: see data sheet.

    Further analysis is needed ...

    Power use for HDMI, FLASH and Misc Other Devices

    The Parallella card uses an ADV7513BSWZ chip to drive its monitor.

    This takes 256 mW from its 1v8 input and 1mW from its 3V input, but when powered down takes just 300 uA or so. It resets to power down mode. It is not modelled in Prazor and it is not used in any of our real-card experiments, so is irrelevant.

    The USB driver chips, FLASH memory and Epiphany chips are also all not modelled in Prazor and not used in our experiments. They should take negligible power in standby mode, so can be ignored. For example, the 3v3 serial NOR Quad Flash U7 N25Q128A13EF840E has a standby current of less than 14 uA, whereas operating takes 15 mA at 108 MHz and 6 mA at 54 MHz. It takes 20 mA during write and erase internal operations.

    FPGA Programmable Logic Power

    When not loaded with any design the PL on the 7010 takes about 22 mW. Larger chips would take more.

    The FPGA takes an extra 40 mW or so of power on system powerup in some sort of self-clear mode.

    CPU Power

    The dynamic L1+CPU power when running flat out is about 20 mW more than when idle. Was this per core? Need to check again. We note a linear decrease in energy use as the programmable CPU clock frequency is scaled down (666 MHz down to 90 MHz).

    Further verification of the L2 cache and tightly-coupled SRAM is ongoing...


    Course Home.

    EOF.       Prazor Home Page.