Prazor Virtual Platform Home Page.
The Prazor/VHLS Virtual Platform is a simulator written using SystemC TLM 2.0 sockets that can model a number of CPU architectures (x86_64, ARM, MIPS, OpenRISC).
Very high level simulation -> VHLS. We aim to render complete energy estimates from real workloads on complete datacentres with accuracy of +/- (sqroot 10) / 2.
Note: this project was released in the additional material annex for the textbook, "Modern System-on-Chip Design on Arm" by D J Greaves and the associated stable version of PRAZOR/VHLS is available for download from modern-soc-design-addmat.
Currently we are using it to model Xilinx Zynq platforms such as the Parallella card or ZedBoard. The model is binary compatible with the real hardware, meaning it can run the same linux kernel and use the same SD card images. We might next make a 4 core model that will be the basis of the new Raspberry Pi 2 or we might make an Allwinner A20 (cubieboard/sunxi) model.
The platform can be compiled with and without POWER3 and Speedo (see below for what this means.) It can also be a static platform compile or a DLL platform compile. Dynamic compiles were supported for architectural variations between modelled platforms, but are also useful for loading alternative reconfigurable FPGA designs into the model.
The Zynq model within Prazor covers the ARM cores, L1 and L2 caches, GIC, SCU, DRAM, SD Card and all peripherals and timers needed to boot linux with UART input and output. The Ethernet is currently missing, as is the Epiphany chip, Neon, USB, the FPGA and TCM. Floating point is being added. Jazelle cannot be added since it is secret. All of these components are modelled as TLM 2.0 blocking SystemC classes with loose timing and quantum keeper.
Prazor does not directly model the FPGA fabric (programmable logic) of the Zynq. Instead, one route is to use HLS to generate the FPGA design and to use a SystemC output from the HLS tool for the ESL model (ie. link it into the Prazor binary).
Static snapshots, called pvp_nn, of the simulator code are also installed on the Computer Lab file server and can be copied or linked to at
/usr/groups/han/clteach/btlm/current.
If updates are needed, revised/updated copies will be maintained at
/usr/groups/han/clteach/btlm/pvp_xx
A first step is to run the pre-built simulator with a pre-compiled binary image and then look at the structure of what you have run. Later steps will be to modify both halves and confirm proficiency with this. This is mostly a matter of getting your environment variables set up correctly.
The virtual platform can be run from the command line.
At the Computer Laboratory you will need the following set ups:
export CLTEACH=/usr/groups/han/clteach
export TARCH=ARM32
export PRAZOR=$CLTEACH/btlm/current
The pre-built simulator is here:
export VHLS=/usr/groups/han/clteach/btlm/current/vhls/src/vhls
The pre-built hello-world program as an ELF binary for ARM is here:
/usr/groups/han/clteach/btlm/current/vhls/images/hellow-world/hello
It is easiest to run under make, but a manual command line run should also work. You'll need something like
$VHLS \
  -dram-system-ini /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/system.ini.example \
  -dram-device /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini \
  -cores 1 -tracelevel 0 -global-qk-ns 1 -no-caches -image ./hello -name vv \
  -- red yellow green blue well done milos
The command line is divided into two sections using the -- separator. Three sections with two of these separators are sometimes used. The part before the first separator is processed by the Prazor simulator. The part afterwards is passed in to the running program as its argv command line. Generally, this program consists of a wrapper (energyshim.c) that splits the command line again at the second -- separator with the final part being used by the application proper and the earlier part controlling logging instrumentation.
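The second split described above can be pictured with a small helper. This is only a sketch of the idea and is not the code in energyshim.c: the function name find_separator and its start parameter are hypothetical, chosen for illustration.

```c
#include <string.h>

/* Hypothetical helper (not taken from energyshim.c): return the index of the
   first "--" found in argv[start .. argc-1], or argc if there is none. */
int find_separator(int argc, char **argv, int start)
{
    for (int i = start; i < argc; i++) {
        if (strcmp(argv[i], "--") == 0)
            return i;           /* arguments after this index belong to the application */
    }
    return argc;                /* no separator: everything belongs to the first part */
}
```

Under this sketch, arguments before the returned index would control the logging instrumentation and arguments after it would be handed to the application proper.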
You will get a UART output spool file written to your working directory and also an energy use report from the TLM POWER3 library if included in the platform version.
Note, you will see very poor performance. This is because the -no-caches flag has removed the caches from the hardware, so raw DRAM access is being used. Just removing this flag is insufficient to get cached operation: you also need to include some coprocessor writes to turn the caches on (see the Caches section below).
If you get an access denied error this may be because the simulator is trying to write its output file to the current folder for which you do not have write access. You need to run it in a folder where you have write access.
If you wish to continue running from command line, to abbreviate the lengthy command line it is suggested you use something like the following macros.
export CLTEACH=/usr/groups/han/clteach
export TARCH=ARM32
export PRAZOR=$CLTEACH/btlm/current
export SIMBIN=/usr/groups/han/clteach/btlm/current/vhls/src/vhls
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYSTEMC/lib-linux64
export DRAM1="-dram-system-ini /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/system.ini.example"
export DRAM2="-dram-device /usr/groups/han/clteach/btlm/current/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini"
export PRARM="$SIMBIN $DRAM1 $DRAM2"
so now you can run something short like
$PRARM -image hello
... or else write a little shell script of your own. There is no prazor GUI.
If it complains it cannot find the TLM_POWER3 library or the SystemC library you need one or both of the following:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/groups/han/clteach/tlm-power3/lib-linux64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYSTEMC/lib-linux64
The 'hello-world' program, A Minimalist Bare-Metal Running Program for Prazor - One Thread, No libc, No device drivers, is illustrated HERE. This is about the simplest program you can run.
Take a look at the makefile and the disassembly of this test program. We are running bare-metal and you will see the following components:
One purpose of building your own copy of the simulator is to explore the performance of a program on different architectures, such as with different cache sizes.
Before open public release, please :
1. Get an xparch user ID and password from DJG or MP.
2. Login via the web interface (password required) to http://phabricator.xparch.com
   2a. Change your password.
   2b. Install your public key on phabricator.
3. Use ssh-agent to put your private key in scope and then
   git clone ssh://vcs@phabricator.xparch.com:22/diffusion/P/vhls.git
4. cd to vhls and follow the README.md
Owing to the nature of git, it is possible for local users to clone their own git repo from the checkout in /usr/groups/han/clteach/btlm/current by giving that path as the source to git clone, or by just doing a simple cp -ra of that folder so that you get the hidden git repo management files.
The Prazor Virtual platform is normally built with automake and the platform of interest is then loaded as a DLL (.so) file. But this can be very tricky to use if you have not used it before and so a simple manual Makefile is given here that you can adapt if you prefer. This makefile also helpfully gives you a strong idea of the overall structure and what automake is achieving.
#
# VHLS Prazor hand-crafted Makefile for SystemC TLM Zynq-style models.
#
# People often have problems with automake, so a simple manual Makefile is all we really need.
# This version assumes the C++ src files have already been compiled to .o form
# This version without TLM POWER3 and Speedo
#
VHLSDIR=/usr/groups/han/clteach/btlm/pvp10/vhls

CPP=g++

CPPFLAGS=-O2 -std=c++0x \
  -DTARCH=ARM32 \
  -DVHLS_STATIC=1 \
  -DSC_CPLUSPLUS=199701L -DSC_DISABLE_API_VERSION_CHECK=1 \
  -I/usr/groups/han/clteach/systemc/systemc-current/include \
  -I/usr/groups/han/clteach/systemc/systemc-current/include/tlm_core/tlm_2 \
  -I/usr/groups/han/clteach/boost/boost_1_48_0 \
  -I$(VHLSDIR)/src

LDFLAGS = -L$(SYSTEMC)/lib-x86_64 -lsystemc -lpthread

# Simulator infrastructure
PRAZOR_SRC= \
  $(VHLSDIR)/src/tenos/argv_backdoor.o \
  $(VHLSDIR)/src/tenos/cpu_busaccess.o \
  $(VHLSDIR)/src/tenos/generic_branch_predictor.o \
  $(VHLSDIR)/src/tenos/io_backdoor.o \
  $(VHLSDIR)/src/tenos/MpHash.o \
  $(VHLSDIR)/src/tenos/tenos.o

# The platform of interest
PARALLELLA= \
  $(VHLSDIR)/src/platform/arm/zynq/parallella/zynq.o

# Generic ARM IP blocks
ARM_A9_ZYNQ = \
  $(VHLSDIR)/src/arm/arm_abt.o \
  $(VHLSDIR)/src/arm/arm_ccache.o \
  $(VHLSDIR)/src/arm/armcore_tlm.o \
  $(VHLSDIR)/src/arm/arm_cortex_a9.o \
  $(VHLSDIR)/src/arm/arm_cp14.o \
  $(VHLSDIR)/src/arm/arm_cp15.o \
  $(VHLSDIR)/src/arm/armdis.o \
  $(VHLSDIR)/src/arm/armisa.o \
  $(VHLSDIR)/src/arm/arm_L2Cpl310.o \
  $(VHLSDIR)/src/arm/arm_mmu.o \
  $(VHLSDIR)/src/arm/arm_scu.o \
  $(VHLSDIR)/src/arm/armthumb.o \
  $(VHLSDIR)/src/arm/arm_timers.o \
  $(VHLSDIR)/src/arm/gic_arm_tlm.o \
  $(VHLSDIR)/src/arm/sclr_arm_tlm.o

# Static memories and caches
MEMORYSRC= \
  $(VHLSDIR)/src/memories/base_mmu_tlm.o \
  $(VHLSDIR)/src/memories/ccache.o \
  $(VHLSDIR)/src/memories/dram64_cbg.o \
  $(VHLSDIR)/src/memories/generic_tlm_mem.o \
  $(VHLSDIR)/src/memories/memloaders.o \
  $(VHLSDIR)/src/memories/scu.o \
  $(VHLSDIR)/src/memories/secondary_cache_with_directory.o \
  $(VHLSDIR)/src/memories/sram64_cbg.o

# DRAM simulator
DRAM2SIMSRC= \
  $(VHLSDIR)/src/dramsim2/dramsim_sc_wrapper.o \
  $(VHLSDIR)/src/dramsim2/dist/Bank.o \
  $(VHLSDIR)/src/dramsim2/dist/BankState.o \
  $(VHLSDIR)/src/dramsim2/dist/BusPacket.o \
  $(VHLSDIR)/src/dramsim2/dist/CommandQueue.o \
  $(VHLSDIR)/src/dramsim2/dist/IniReader.o \
  $(VHLSDIR)/src/dramsim2/dist/MemoryController.o \
  $(VHLSDIR)/src/dramsim2/dist/MemorySystem.o \
  $(VHLSDIR)/src/dramsim2/dist/Rank.o \
  $(VHLSDIR)/src/dramsim2/dist/SimulatorObject.o \
  $(VHLSDIR)/src/dramsim2/dist/TraceBasedSim.o \
  $(VHLSDIR)/src/dramsim2/dist/Transaction.o

# GDB stub via RSP Protocol
GDBSRC= \
  $(VHLSDIR)/src/gdbrsp/gdbrsp.o \
  $(VHLSDIR)/src/gdbrsp/GdbServerSC.o \
  $(VHLSDIR)/src/gdbrsp/RspConnection.o \
  $(VHLSDIR)/src/gdbrsp/RspPacket.o \
  $(VHLSDIR)/src/gdbrsp/Utils.o \
  $(VHLSDIR)/src/gdbrsp/vhls_soc_debug.o

# Bus Components
BUSSRC= \
  $(VHLSDIR)/src/bus/busmux64.o

# Input and Output Devices
IO= \
  $(VHLSDIR)/src/io/sdio_cbg.o \
  $(VHLSDIR)/src/io/uart64_cbg.o

all:
	$(CPP) -o vhls $(CPPFLAGS) $(LDFLAGS) $(PRAZOR_SRC) $(IO) $(BUSSRC) \
	  $(MEMORYSRC) $(GDBSRC) $(DRAM2SIMSRC) \
	  $(ARM_A9_ZYNQ) $(PARALLELLA) $(VHLSDIR)/src/vhls.cpp \
	  $(LDFLAGS)
	echo "The executable file vhls should now exist in the current directory"
	ls -l

# eof
You will find a copy of this Makefile in
/usr/groups/han/clteach/btlm/pvp10/ManualMake or similar.
You will first need a copy of the source files.
You only need to set BOOST, LDFLAGS, TLM_POWER3 and CXXFLAGS for compiling the simulator. For running it you just need the PRAZOR and TARCH settings as mentioned above.
The following setup is generally what is needed:
export CLTEACH=/usr/groups/han/clteach
export BOOST=$CLTEACH/boost/boost_1_48_0
export BOOST_ROOT=$BOOST
export SYSTEMC=$CLTEACH/systemc/systemc-current
export TLM_POWER3=$CLTEACH/tlm-power3
export LDFLAGS="-L$SYSTEMC/lib-linux64 -L/usr/local/lib -L$TLM_POWER3/src/.libs"
export CXXFLAGS="-I$SYSTEMC/include/ -I$BOOST_ROOT -I$SYSTEMC/include/tlm_core/tlm_2 -I$TLM_POWER3/include -g -O2"
export TARCH=ARM32
export PRAZOR=$CLTEACH/btlm/current
The automake settings precisely needed should be on the toolinfo page. The key fact that was holding us up at the start of term is that you need to add --host=x86_64-pc-linux-gnu to the configure command line.
As said, one purpose of building your own copy of the simulator is to explore the performance of a program on different architectures, such as with different cache sizes. You also need to recompile the simulator to include your own IP blocks.
New blocks should be instantiated inside the platform configuration C++ file, such as
/usr/groups/han/clteach/btlm/current/vhls/src/platform/arm/zynq/parallella/zynq{.cpp,.h}
You need to get to the point where you have your own copy of this file where you have made modifications. You can either make links to the existing versions of all the other files in the simulator or have your own copies of them too.
To instantiate a simple peripheral for programmed I/O access, first look at the way an existing I/O device is wired in, such as the UART. You can see it is connected to the I/O bus with the following line
BUSMUX64_BIND(busmux0, UARTS[uu]->port0, start, UART_SPACING)
and your own device can be connected to busmux0 with an additional, similar such call. The busmux may automatically allocate a programmed I/O base address spaced by some factor given to its own constructor or else you can pass in an explicit base address. Note, busmuxes can be in a tree structure and all such along the route to an IP block must forward the transaction appropriately.
Using a base address of 0x43c00000 will mean the self-same code can run on the virtual ARM as the real ARM.
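On the software side, programmed I/O to such a device is just loads and stores through a volatile pointer. The following is a minimal sketch, assuming a device whose registers are 32 bits wide and sit at consecutive word offsets; that register layout and the names MY_DEVICE_BASE, reg_write and reg_read are hypothetical and not part of Prazor.

```c
#include <stdint.h>

/* Base address suggested in the text; the register layout below is an assumption. */
#define MY_DEVICE_BASE 0x43c00000u

/* Write a 32-bit value to register number 'index' of the device at 'base'. */
static inline void reg_write(volatile uint32_t *base, unsigned index, uint32_t value)
{
    base[index] = value;        /* volatile: the store is not optimised away */
}

/* Read register number 'index' of the device at 'base'. */
static inline uint32_t reg_read(volatile uint32_t *base, unsigned index)
{
    return base[index];
}
```

On the target, base would be (volatile uint32_t *)MY_DEVICE_BASE, giving the same binary on the virtual ARM and the real ARM. On a workstation the helpers can be exercised against an ordinary array instead.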
There may be code relating to ETHERNET_CRC in the platform file: if so, this is a placeholder that is otherwise unused and you can remove it or else adapt it as a basis of your own.
Note that Prazor uses an extended generic payload, not the default one provided in SystemC. Hence, your device will need to instantiate a simple_target_socket where an extra type argument that defines the payload is passed to the socket, such as
tlm_utils::simple_target_socket<...> port0;
Note also that Prazor uses 64-bit generic payloads even when modelling 32-bit systems like the ARM7.
The parallella card uses twin-die MT41K256M32SLD-125 DRAM 256Mx32 PDF DATASHEET . Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts. 8Gbit = 1 Gbyte = 2^30 bytes = 2^3 banks * 2^15 ras * 2^10 cas * 2^2 lanes. At 175 mA ICCD1 current it dissipates 236mW when active. See bottom for actual measurements.
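The capacity and power figures quoted above can be checked with a few lines of integer arithmetic. This is a sketch only and the function names are illustrative: it works in milliamps and millivolts so that the product comes out in microwatts without floating point.

```c
#include <stdint.h>

/* Number of byte-address bits: log2(banks) + log2(rows) + log2(columns) + log2(byte lanes). */
unsigned dram_address_bits(unsigned bank_bits, unsigned row_bits,
                           unsigned col_bits, unsigned lane_bits)
{
    return bank_bits + row_bits + col_bits + lane_bits;
}

/* Power in microwatts from supply current in mA and supply voltage in mV (mA * mV = uW). */
uint32_t dram_power_uw(uint32_t current_ma, uint32_t voltage_mv)
{
    return current_ma * voltage_mv;
}
```

With the figures in the text, dram_address_bits(3, 15, 10, 2) gives 30 address bits (1 Gbyte) and dram_power_uw(175, 1350) gives 236250 uW, which is the 236 mW quoted.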
A suitable datasheet file for DRAMsim2 may be present here: DRAMSim2 on github and if not I provide a temporary one: DDR3_micron_32M_8B_x32_parallella.ini. Application note: Micron TN-41-01: Calculating Memory System Power for DDR3.
DRAM bank interleave is controlled in hardware by the following registers that are set by linux as
DRAM_addr_map_bank  0xF800603C  0x777
DRAM_addr_map_col   0xF8006040  0xFFF00000
DRAM_addr_map_row   0xF8006044  0x0F666666
Which, in detail, gives the following mapping for the bottom 1 GByte (A[31:30]==2'b00):
ra15       = 0
ra[11:2]   = 6+11 = A[28:17]
ra1        = 6+10 = A[16]
ra0        = 6+9  = A[15]
ba[2]      = 7+7  = A[14]   // 777 sets this
ba[1]      = 7+6  = A[13]
ba[0]      = 7+5  = A[12]
ca[13:10]  = 0
ca[9]      = 0+11 = A[11]
ca[8]      = 0+10 = A[10]
ca[5]      = 0+7  = A[7]
ca[4]      = 0+6  = A[6]
ca[3]      = 0+5  = A[5]
ca[2:0]    = A[4:2]         // Burst/line Offset
lane[1:0]  = A[1:0]         // Byte Lane Offset
So this is a standard-enough layout: A[29:0] = { row, bank, col, lane }.
In terms of DRAMSIM2 from U-Maryland, it is Scheme6, chan:row:bank:rank:col, with the chan and rank fields being null.
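As an illustration of the { row, bank, col, lane } layout, the sketch below splits a physical address into those four fields. The field widths (15 row bits, 3 bank bits, 10 column bits, 2 lane bits) are taken from the geometry given earlier on this page; the struct and function names are hypothetical and this is not code from Prazor or DRAMSim2.

```c
#include <stdint.h>

/* Decoded fields of a DRAM byte address, A[29:0] = { row, bank, col, lane }. */
struct dram_addr {
    uint32_t row;   /* A[29:15], 15 bits */
    uint32_t bank;  /* A[14:12],  3 bits */
    uint32_t col;   /* A[11:2],  10 bits */
    uint32_t lane;  /* A[1:0],    2 bits */
};

/* Split address 'a' into its fields; bits above A[29] are ignored. */
struct dram_addr dram_decode(uint32_t a)
{
    struct dram_addr d;
    d.lane = a & 0x3u;
    d.col  = (a >> 2) & 0x3ffu;
    d.bank = (a >> 12) & 0x7u;
    d.row  = (a >> 15) & 0x7fffu;
    return d;
}
```

For example, address 0x1000 has only A[12] set and so decodes to bank 1 with row, column and lane all zero, so consecutive 4 KByte blocks land in different banks.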
The -cores n flag specifies how many CPU cores are created.
The -self-starting-cores n flag specifies how many CPU cores are in run mode at reset time. For Zynq linux this option should not be used, thus leaving the setting at its default of 1 because the linux kernel uses a store to the system controller block from the first CPU to enable the subsequent core(s).
The -image flag specifies an ELF binary file to be loaded into the DRAM model.
The -tracelevel n flag turns on a given level of tracing. n is in the range 0 to 9 with 0 being all tracing off. Multiple tracelevel flags can be used, separated by -watch flags.
The -watch B +N flag takes a hex base address and a hex length N to define a region to watch. This region is watched at the tracelevel of the previous tracelevel flag. Typically you might set tracelevel 9, then define some watch regions, then set tracelevel back to zero so that no tracing is done outside the watched regions.
The -name vv flag defines the name to be used as root segment in the name of output files. This is useful to distinguish output for sequences of experiments run by the same makefile in a common folder.
The -dram-system-ini filename.ini sets up the DRAM simulator parameters. Just use the standard file provided always.
The -dram-device datasheet.ini sets up the DRAM type in use. Each type of DRAM has different energy and performance details. For the Parallella card, please use XXXX? TBD.
The -wait-for-debugger causes the simulation not to start until a GDB session is remotely attached.
The -no-caches flag is implemented in abench1.h to not instantiate the caches and to connect the ARM cores directly to the DRAM. A variant is perhaps needed to skip either L1 or L2 but not both. With the -no-caches flag the caches are removed from the design and their static power disappears. The processor will run at the speed allowable by the main store, which will be slow if this is DRAM without caches.
The -no-harvard flag is implemented in abench1.h to only instantiate a single L1 cache per core, not split I and D caches. Variants to easily control the cache size and associativity could be easily implemented.
There are further options - please read the src code vhls.cpp where they are mainly parsed.
A 'backdoor' is an artefact not present on the real hardware that can be used on the simulator for access to a simulator feature, such as logging or exiting the simulator.
Some backdoors are available for access to simulator argv and argc and getenv.
Some SWI instructions, using codes not used for linux, are hardwired as backdoors for writing a character without going via a UART or other output model, exiting the simulator and getting core instruction counts without using a platform-specific PMU set up.
Prazor Simulator Backdoors
// ARM trigger instruction trace - print next 100 instructions
asm volatile (" swi #203":: );
For the P35 course, the file that defines the architecture you are modelling is
$PRAZOR/vhls/src/platform/arm/zynq/parallella/zynq{.h,.cpp}
This defines the system topology, like the diagram above. This is the main file for editing when you adjust the simulated architecture. It can be loaded as a DLL or statically linked.
You should take a good look at that file since you must make minor edits to it when you add your own IP blocks to the system.
Prazor uses Loose Timing. The LT quantum is set on the command line. If it is set to a value close to a bus-cycle period, then transaction order is guaranteed and it degenerates to an approximately-timed model.
Q. When I use sc_delay_ += std_delay, I get:
Assertion `unused_delay == SC_ZERO_TIME' failed.
A. The prazor model does not normally use the textbook sc_delay value. That is why it is named sc_delay_ throughout, to indicate that the standard field should not be directly used. Instead Prazor is normally compiled to use prazor_gp_t which has its own delay field called ltd of type lt_delay. The advantage of this type is it allows forks and joins in the loosely-timed trajectory, with a MAX function being applied correctly at the joins. The coding style used throughout is to use the AUGMENT_LT_DELAY macro in place of the manual addition you have used. The macro augments the correct delay variable according to the coding style used.
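The fork and join behaviour mentioned in this answer can be pictured with plain integers. The sketch below is not Prazor's lt_delay type or its AUGMENT_LT_DELAY macro; the type delay_ps and the functions lt_advance and lt_join are hypothetical names used only to show why a join takes the maximum of its branches.

```c
#include <stdint.h>

typedef uint64_t delay_ps;      /* hypothetical: an accumulated delay in picoseconds */

/* Advance a trajectory by the latency of one more step. */
delay_ps lt_advance(delay_ps t, delay_ps step)
{
    return t + step;
}

/* Join two branches that forked from a common point: the joined
   trajectory can only continue once the slower branch has finished. */
delay_ps lt_join(delay_ps a, delay_ps b)
{
    return (a > b) ? a : b;
}
```

If two branches fork at 100 ps and take 30 ps and 50 ps respectively, the join resumes at 150 ps, whereas simply adding both latencies to a single delay field would give 180 ps.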
For simplicity and for clear results, rather than booting an operating system on to the simulator as well as an application of interest, it is often better to run the application 'bare metal'. For bare-metal operation you statically link a few tiny device drivers or backdoor drivers with the application and just load the combined binary into the simulator.
The 'hello-world' program, A Minimalist Bare-Metal Running Program for Prazor - One Thread, No libc, No device drivers, is illustrated HERE. This is the simplest program you can run.
But for more-complex bare-metal operation you will need to follow the structure of the images/dfsin Makefile and Makefile.inc.arm.
Thumb2 mode is now implemented. If you see 16 bit instructions in your disassembly then the compiler has used Thumb mode. If you wish to avoid this mode, then pass appropriate options to GCC such as -march=armv6 -marm.
For running on more than one core follow the design pattern of djgradix which uses the djgthreads implementation of Posix pthreads on bare-metal. (NB: Today - 19th Feb - we have not tried that for ARM recently - will double check all is ok in the next few days). Note that this is a non-preemptive implementation that does not time share threads over or within a core. The original program gets core 0 and the new threads get exclusive use of further cores until they exit. Therefore, for Zynq with only two cores please only start one new thread.
As a C library, please use prlibc and not uClibc unless you want to invest time getting uClibc working bare-metal on ARM (it worked fine for OpenRISC last few years).
Caches
Running linux turns on the caches. On the bare-metal runs on real cards or the PRAZOR simulator, please insert something like the following code to enable the L1 and L2 caches. Without this you will see only zeros in the cache hit statistics since caches power up disabled.
This code is found in the energyshim.c wrapper or you may paste it from here:
asm("mov r0,#0x1000");            // Turn on L1 Cache (see Zynq TRM for further details.)
asm("orr r0,r0,#4");
asm("mcr p15, 0, r0, c1, c0, 0"); // (r0 = 0x1004)
// You might possibly also need
((volatile int *)0xF8F02100)[0] = 1; // Zynq: turn on L2 cache
This will cause the following to be printed from Prazor when appropriate tracing is enabled:
Cache the_top.coreunit_0.l1_d_cache_0 is ENABLED now
Cache the_top.l2_cache_and_controller is ENABLED now
Cache the_top.l2_cache_and_controller is ENABLED now
Cache the_top.coreunit_0.l1_i_cache_0 is ENABLED now
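The value 0x1004 written to the CP15 control register is the OR of two enable bits: bit 12 (instruction cache) and bit 2 (data cache). The sketch below shows that bit arithmetic as a host-testable helper. The function name is hypothetical, and unlike the snippet above it keeps whatever other bits are already set in the value passed in, which matters if, for example, the MMU enable (bit 0) is in use.

```c
#include <stdint.h>

#define SCTLR_C (1u << 2)    /* data cache enable, the #4 in the snippet above */
#define SCTLR_I (1u << 12)   /* instruction cache enable, the #0x1000 above    */

/* Return the control register value with both L1 caches enabled and all
   other bits of the supplied value left unchanged. */
uint32_t sctlr_with_caches_enabled(uint32_t sctlr)
{
    return sctlr | SCTLR_I | SCTLR_C;
}
```

Starting from zero this yields 0x1004, matching the value loaded into r0 before the mcr instruction.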
Try to structure your code so that it links together a .o file that can also be run on the real parallella card. That way you are sure you are really running the same binary on the real system as on the simulator. You will need to link this .o file in a different way (using system call stubs etc) for running on linux compared with bare-metal. It is also a good idea to structure your makefiles such that you can run the program natively on your linux workstation. Finally, think about iteration counts. For stopwatch (or time ./a.out) style performance measurement you will perhaps need to run the outer loop of the simulation 1000 times more than when running on the SystemC model.
Reusing the code in the images/energyshim.c file enables you to wrap up your benchmark's main entry point (bm_main) in a reusable manner. This shim can report energy and time use on both the real systems and the simulators if you get it all set up right.
Overall, structure your code so that you can batch run it in lots of different configurations from one Makefile (e.g. with different numbers of cores or different cache configurations or clock frequencies) - this will generate data suitable for plotting in your report.
The gdb debugger can be connected to a running Prazor simulation for single stepping, inspecting and changing memory and setting break and watchpoints. This uses the RSP protocol over a TCP remote connection.
You should start the simulator with the -wait-debugger option. The simulator will print which port it is listening on and this is commonly localhost:9600.
Load your binary ELF file into the debugger using the gdb file command before remotely connecting.
$ /usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb
GNU gdb (GDB) 7.8
(gdb) target remote :9600
Remote debugging using :9600
0x00000000 in ?? ()
(gdb) file dfsin                      % The path to your binary ELF image
Reading symbols from dfsin...done.
(gdb) cont
Control-C
(gdb) where
[Remote target] #1 stopped.
(gdb) where
#0  puthex64 (leadingz=leadingz@entry=1, d0=125116699948, f=) at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:1052
#1  0x00041ce8 in locked_printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n", format@entry=0x44fc0 ",2", poi=..., poi@entry=...) at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:274
#2  0x00041f68 in printf (format=0x4516c "input=%016llx expected=%016llx output=%016llx ok=%i\n") at /home/djg11/d320/prazor-virtual-platform/vhls/src/crt/prlibc/prlibc.c:342
#3  0x000418b4 in bm_main (verbosef=verbosef@entry=7) at dfsin.c:177
#4  0x00043af0 in main (argc=7, argv=0xfffe0004) at ../energyshim.c:200
#0  mul64To128 (a=12105675798371893248, b=7613535337020653568, z0Ptr=0x41134 , z1Ptr=0x3bff50) at softfloat-macros:186
#1  0x00173a24 in ?? ()
(gdb) info registers
r0             0x0         0
r1             0x100       256
r2             0xe0001000  3758100480
r3             0x9         9
r4             0x2189552c  562648364
r5             0x1d        29
r6             0xa         10
r7             0x452b0     283312
r8             0xe801c     950300
r9             0x1         1
r10            0x0         0
r11            0x1000000   16777216
r12            0x45198     283032
sp             0x3bff20    0x3bff20
lr             0x43954     276820
pc             0x430b4     0x430b4
cpsr           0x8000005f  2147483743
(gdb) x/4x 0x44dc4
0x44dc4 <__udivdi3+208>: 0xe0a55005 0xe2533001 0x1afffff8 0xe0948000
(gdb) x/4i 0x44dc4
=> 0x44dc4 <__udivdi3+208>: adc r5, r5, r5
   0x44dc8 <__udivdi3+212>: subs r3, r3, #1
   0x44dcc <__udivdi3+216>: bne 0x44db4 <__udivdi3+192>
   0x44dd0 <__udivdi3+220>: adds r8, r4, r0
(gdb) print num_keys                  -- display a global variable
$1 = 1000
(gdb) up
#4  0x00041620 in estimateDiv128To64 (a1=0, b= , a0=4815960295168657408) at softfloat-macros:217
217 z = (b0 << 32 <= a0) ? LIT64 (0xFFFFFFFF00000000) : (a0 / b0) << 32;
(gdb) print z                         -- try to print a local var (but it was in a register)
$2 =
(gdb) print a0                        -- print a local var that still exists
$3 = 4815960295168657408
(gdb) info threads
  Id   Target Id         Frame        -- We only have one CPU core (sim commands line was -cores 1)
* 1    Remote target     0x00041620 in estimateDiv128To64 (a1=0, b= , a0=4815960295168657408) at softfloat-macros:217
Start gdb and give the command "target remote localhost:9600". You can leave out the machine name or the localhost name if you run the simulator and debugger on the same machine.
Ideally you need to use an ARM version of gdb to connect. Otherwise you will see registers with x86 names being displayed.
A copy of gdb for ARM is installed at the Computer Laboratory here:
/usr/groups/han/clteach/arm-gdb/gdb-7.8/gdb. There may be more-recent versions around.
Switching between cores: the threads commands of gdb have been adapted/abused to enable you to connect to different cores with a multiprocessor. Further details of how to switch between cores and the sPEEDO energy API will be added here ...
Build gdb for ARM with this arg to configure
./configure --target=arm-linux-gnu
Useful gdb commands:
target remote localhost:9600
info registers      -- show register contents
x/16i $pc           -- display next 16 instructions disassembled
x/32x $sp           -- display memory from sp upwards for 32 words
file /home/djg11/d320/prazor/trunk/chstone/dfsin/dfsin
                    -- load the symbol table from an ELF binary for symbolic debugging
                    -- If it says no debugging symbols then it is stripped binary
load                -- download a binary (or the current binary) - better to load from the command line -image arg?
cont                -- continue execution
break 0x44cca       -- set a break point
where               -- give stack backtrace
up                  -- change current stack frame by moving up one stack frame (towards caller)
down                -- the reverse, move down one stack frame
print num_keys      -- display a global variable or local var in current stack frame
info threads        -- show threads (may show hardware cores instead using Prazor)
thread 2            -- switch thread/core
set $pc = 0x485     -- perform a jump (not working prior to 18th feb 2015).
stepi               -- run one instruction only (not working 18th feb 2015 but being fixed).
// useful additional commands for debugging the debugger:
set debug remote 1
set debug target 1
The gcc C compiler defaults to generating Thumb-2 machine code and hardware floating point. The Thumb modes and hardware floating point on the simulator do not currently work (they are just being debugged and should work in the near future), so meanwhile, running the same binary code on both systems requires coercing gcc to use only the old ARM32 mode. The relevant flags are:
-marm -mfloat-abi=soft
With these flags, code compiled on the workstation or on the real card is interchangeable and .o files can freely be copied backwards and forwards using scp.
However, note the installed libraries, libc and libgcc, on the Parallella cards also use Thumb mode so, for detailed performance comparison (and to avoid linker errors about VFP) you should avoid using these and instead use your own compiled versions of these too (or the ones on the links in the detailed documentation at the bottom of this section).
The differences between application binaries targeted at running bare metal on the simulator and those running on linux on the real card are mainly to do with console I/O. The I/O paths are very different and so performance comparisons should not be made. Also, they are incompatible and you need to swap the system calls used on the real card with direct calls to the UART device driver as used on the simulator.
The best way to redirect the I/O is to link the same .o application and library files with some slightly different I/O shims, as in the barelift shim example below. Essentially you want to replace uart64_driver.o with bareliftshim.o. You should link using the ld program (not using gcc as a linker) since this will give you complete control over which kickoff code and libraries are included. Also, always check what you have made using objdump -d as a disassembler.
For example, to run the dfsin standard test one would link as follows using the same binaries as ran on the simulator except for the barelift pair.
LIBGCC=/home/linaro/djg11/libgcc-nothumb.a
ld -o a.out bareliftcrt.o bareliftshim.o dfsin.o prlibc.o $(LIBGCC)
objdump -d ./a.out > dis
# If you see any 16 bit (4 hex digit) opcodes in the disassembly then you are using thumb code.
./a.out
Once Thumb modes are working we can compare energy and performance with and without them. (It is working now - thanks Milos - April 2015).
Detailed resources: bareliftshim-files.zip ZIP ARCHIVE.
Archive:  bareliftshim-files.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   146894  2015-02-24 10:52   home/linaro/libgcc.a   -- This libgcc is Thumb and won't currently work on simulator.
     1250  2015-02-25 08:44   bareliftshim.c
     4860  2015-02-25 08:49   bareliftshim.o
     1406  2015-02-25 08:42   bareliftcrt.S
      984  2015-02-25 08:49   bareliftcrt.o
      631  2015-02-25 08:53   Makefile
     4777  2015-02-25 08:53   BARELIFT-README.txt
    23360  2015-02-25 08:49   dfsin.o
    48168  2015-02-24 18:45   prlibc.o
---------                     -------
   232330                     9 files
The physical board information is now on a separate page: Own Page.
Video screen capture : interactive linux session on the SystemC Zynq model: MP4 Video.
The Zynq linux source files can be obtained from git as follows:
How to compile the ADI based linux kernel? (uImage)
git clone https://github.com/parallella/parallella-linux
cd parallella-linux
bash
export ARCH=arm
export CROSS_COMPILE=arm-linux-gnu-
export PATH=:$PATH
make ARCH=arm parallella_defconfig
make ARCH=arm LOADADDR=0x8000 uImage
After this has completed, you will find a file called vmlinux which is the ELF binary for the kernel that can be loaded into the simulator. This section contains a Very Drafty Description of this process.
You will need:
Reference copies of some of these are stored in this folder
$PRAZOR/vhls/boards/parallella/linux
/home/djg11/d320/prazor-virtual-platform/vhls/src/arm/vhls-arm7smp -kernel /home/djg11/parallella-sw/complete-from-git/parallella-linux/vmlinux -devicetree prazor-linux.dtb -boot loimage -vdd disk3.img -dram-system-ini /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/system.ini.example -dram-device /home/djg11/d320/prazor-virtual-platform/vhls/src/dramsim2/dist/ini/DDR3_micron_8M_8B_x16_sg15.ini -cores 2
This shim replaces grub or whichever boot loader linux would normally use. It runs from the reset vector of the first ARM core, at address 0, and jumps to the linux kernel after setting r2 to point at the device tree blob.
Simulated output: CONSOLE LOG.
Disks have a root partition that starts at the 2048th block. fdisk -l ./disk1.img gives:

Disk ./disk1.img: 16 MB, 16252928 bytes
4 heads, 31 sectors/track, 256 cylinders, total 31744 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x658d7048

      Device Boot      Start         End      Blocks   Id  System
./disk1.img1            2048       31743       14848   83  Linux
To mount it on the workstation you need to set the loopback offset, in bytes, to the block size multiplied by the start block, which in this case is 2048*512=1048576. So you would need to do:
sudo losetup -o1048576 /dev/loop0 disk1.img
sudo mount -t ext4 /dev/loop0 /mnt
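The offset arithmetic can be sketched in shell as follows. This is a minimal sketch and not part of the Prazor distribution: the variable names are illustrative, and the start sector and sector size are copied by hand from the fdisk listing for disk1.img, so they would need changing for any other image.

```shell
# Values copied from the `fdisk -l ./disk1.img` listing (assumed unchanged).
START_SECTOR=2048   # "Start" column for ./disk1.img1
SECTOR_SIZE=512     # "Units = sectors of 1 * 512 = 512 bytes"

# Byte offset of the root partition within the image file.
OFFSET=$((START_SECTOR * SECTOR_SIZE))
echo "partition offset: $OFFSET bytes"

# Print (rather than run) the loopback command, since it needs root.
echo "sudo losetup -o$OFFSET /dev/loop0 disk1.img"
```

Printing the command instead of running it keeps the sketch safe to try on a machine where /dev/loop0 may already be in use.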
There are various sources of discrepancy between the Prazor energy results and the measurements on parcard1.
Firstly, the implementation of the sPEEDO interface on Prazor returns the core's energy account and the whole system energy account, whereas the physical probe attached to parcard1 returns the whole system energy and the energy used by the Zynq core. So the figures reported would not directly agree even if both sides were completely accurate. Note: The Zynq core contains both processors, the L2 cache + GIC + SCU and the programmable FPGA logic.
Secondly, the Prazor system has features missing from its model, notably the DRAM driving pads and the Ethernet PHY energy.
The Ethernet PHY on the real card takes a lot of energy that is not (currently) modelled in Prazor. It has four modes of operation that take different amounts of power. If you unplug the 100 Mbps Ethernet cable and log in via the UART instead, the power saving is about 160 mW. The actual measurements of the 5 volt supply current were:
Ethernet unplugged   385 mA
10 Mb/s              410 mA
100 Mb/s             425 mA
1 Gb/s               470 mA
The power consumed by the PHY can be computed by taking the increase in supply current over the unplugged figure, multiplying by 0.001 to convert to amps, then multiplying by 5 to convert to watts, and finally taking off 20 percent to account for the SMPSU efficiency in converting from 5 to 1.8 volts.
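As a rough worked example of that calculation, the sketch below applies it to the measured currents. It assumes the PHY power is the increase over the unplugged baseline (which is what reproduces the 160 mW figure quoted for the 100 Mbps case); the function and variable names are illustrative only.

```shell
# Measured 5 V supply currents in mA, taken from the measurements above.
BASE_MA=385          # Ethernet unplugged
SUPPLY_MV=5000       # 5 volt supply, in millivolts
EFFICIENCY_PCT=80    # assumes the quoted 20 percent SMPSU loss

phy_power_mw() {
    # $1 = measured supply current in mA with the link up
    delta_ma=$(( $1 - BASE_MA ))
    # mA * mV gives microwatts; divide by 1000 for mW, then apply efficiency.
    echo $(( delta_ma * SUPPLY_MV / 1000 * EFFICIENCY_PCT / 100 ))
}

P10=$(phy_power_mw 410)
P100=$(phy_power_mw 425)
P1000=$(phy_power_mw 470)
echo "10 Mb/s: ${P10} mW, 100 Mb/s: ${P100} mW, gigabit: ${P1000} mW"
```

Integer arithmetic is enough here because every intermediate value is a whole number of milliwatts.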
Another 40 mW may be accounted for by the LED indicators, which go off when the Ethernet is unplugged.
The Ethernet MAC hard macro is quoted in the Xilinx datasheet as taking 15 mW.
The Parallella card uses a twin-die MT41K256M32SLD-125 DRAM (256Mx32): PDF DATASHEET. Speed grade 125 has main RCD-RP-CL parameters of 11-11-11 at 1600 MT/s. Self refresh adjusts its rate according to die temperature. It runs from 1.35 volts.
During normal quiescent operation (i.e. no L2 cache misses and infrequent refresh) it should take about 32 mW. This should rise to 613 mW under very heavy traffic with no locality of reference (i.e. one burst read or write per row activation) and be perhaps about 400 mW under heavy load with well-organised data layout (many bursts per row activation).
The measured power use of the real DRAM was a little different!
With reset held down it takes about 7 mW, which is as per the data sheet. But for all other traffic loads the supply current varies between 424 and 464 mA (multiply by 1.3 for mW). This is odd. Writing 0x83 to ddrc_ctrl at 0xf800_6000, which supposedly makes the DRAM more efficient, reduces energy use by only a percent or so.
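To make the oddity concrete, the measured currents can be converted with the 1.3 scale factor given above. This is only a sketch: the helper name is illustrative, and because shell arithmetic is integer-only the factor is written as 13/10 and the results are truncated to whole milliwatts.

```shell
# Convert a measured DRAM supply current (mA) to power (mW) using the
# scale factor from the text. Integer arithmetic: 1.3 is written as 13/10.
dram_mw() {
    echo $(( $1 * 13 / 10 ))
}

LOW_MW=$(dram_mw 424)
HIGH_MW=$(dram_mw 464)
echo "measured DRAM power: ${LOW_MW} to ${HIGH_MW} mW"
```

Under these assumptions the measured power sits near the top of the 32 to 613 mW range expected from the data sheet whatever the traffic load, which is the discrepancy being discussed.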
Presumably the Zynq chip is making many more operations on the DRAM than expected and hence the measured DRAM power does not correlate with the Prazor prediction. This could be explained, perhaps, by linux having set the refresh period to a stupidly small value (but inspection of reg_ddrc_t_rfc_nom_x at 0xF8006004 reveals 0x61, which is a sensible value giving an interval of 5 us or so), by anti-rowhammer cycles, or by ECC scrubbing of L2 operating far too fast.
In addition, the DRAM driving I/O pads on the Zynq I/O ring take quite a lot: perhaps 100 mW: see data sheet.
Further analysis is needed ...
The Parallella card uses an ADV7513BSWZ chip to drive its monitor.
This takes 256 mW from its 1v8 input and 1mW from its 3V input, but when powered down takes just 300 uA or so. It resets to power down mode. It is not modelled in Prazor and it is not used in any of our real-card experiments, so is irrelevant.
The USB driver chips, FLASH memory and Epiphany chips are also all not modelled in Prazor and not used in our experiments. They should take negligible power in standby mode, so can be ignored. For example, the 3v3 serial NOR Quad Flash U7 N25Q128A13EF840E has a standby current of less than 14 uA, whereas operating takes 15 mA at 108 MHz and 6 mA at 54 MHz. It takes 20 mA during internal write and erase operations.
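A quick calculation shows how small the flash standby figure is. The sketch below assumes the part runs from a 3.3 volt rail (the 3v3 in the text) and uses the quoted currents as upper bounds; the names are illustrative.

```shell
# Flash supply rail in millivolts (assumed 3.3 V, from "3v3" in the text).
FLASH_MV=3300

# uA * mV gives nanowatts; divide by 1000 for microwatts (truncated).
flash_uw() {
    echo $(( $1 * FLASH_MV / 1000 ))
}

STANDBY_UW=$(flash_uw 14)       # standby: less than 14 uA
ACTIVE_UW=$(flash_uw 15000)     # reading at 108 MHz: 15 mA
echo "flash standby: under ${STANDBY_UW} uW, active: about ${ACTIVE_UW} uW"
```

Standby therefore costs tens of microwatts, around three orders of magnitude below the active figure.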
When not loaded with any design the PL on the 7010 takes about 22 mW. Larger chips would take more.
The FPGA takes an extra 40 mW or so of power on system powerup in some sort of self-clear mode.
The dynamic L1+CPU energy when running flat out is about 20 mW more than when idle. Was this per core? Need to check again. We note a linear decrease in energy use as the programmable CPU clock frequency is scaled down (666 MHz down to 90 MHz).
Further verification of the L2 cache and tightly-coupled SRAM is ongoing...