Ex4: 2010/11: ACS P35 SoC D/M Week 7 Exercise. 10 Credit Marks.

Mini-Project part I: This exercise again involves the OR1K non-blocking TLM model. (Part II of the mini-project development will be part of Ex5.)

Select one of the tasks you looked at for Ex3, or choose another IP block of your own design or from the list in Ex3. Recalling that the highest-level ESL model enables firmware to be compiled directly with the behavioural model of the device (grail slide), ensure that the main functions of your IP block are directly callable as C++ methods (as well as being callable as TLM targets).

By extending main.cpp from the btlm-ref-design, or otherwise, create a System-On-Chip containing one (or more) OR1K processors and two or more instances of your chosen IP block, plus any other IP blocks needed.

Optional part: Implement two modelling styles for your on-chip network: one where queuing contention is directly modelled (i.e. one transaction has to wait for another) and another where contention is measured and an estimate is added to the time delays. Use a #define to switch between the modelling styles.

Write and run a test application program on the OR1K core, along with any device drivers, that generates traffic for your IP blocks. Your device driver code should be modifiable with a #define so that it may either be compiled to OR1K machine code to run on the ISS, or compiled along with the behavioural model of your IP block as a single program.

Under SystemC, adjust the TLM quantum keeper setting to observe two different interleavings of transactions. Measure the simulation performance (transactions per second simulated) for both models of on-chip network (if you have them), at different settings of the TLM quantum, and for the direct-calling ESL model. Note any sequential inconsistencies and say what sort of hazard they present (RaW, WaW, WaR, etc.).

With or without using SystemC, run your test application in the directly-calling mode.
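The two contention-modelling styles in the optional part can be sketched independently of SystemC. The sketch below is illustrative only; the class name, the `XFER` cost and the utilisation heuristic are assumptions, not course code, and in a real model the delays would feed the TLM delay annotation rather than be returned directly.

```cpp
#include <cassert>
#include <cstdint>

// Switch modelling styles with a #define, as the exercise suggests.
#define MODEL_CONTENTION_DIRECTLY 1   // set to 0 for the estimated style

struct BusModel {
    uint64_t busy_until = 0;     // time the bus becomes free (direct style)
    uint64_t window_start = 0;   // start of current measurement interval
    unsigned window_count = 0;   // transactions seen in the interval
    static constexpr uint64_t XFER = 10;     // nominal cost of one transfer
    static constexpr uint64_t WINDOW = 1000; // measurement interval length

    // Returns the completion time of a transaction issued at 'now'.
    uint64_t transact(uint64_t now) {
#if MODEL_CONTENTION_DIRECTLY
        // Style 1: directly modelled queuing - a transaction must wait
        // behind any transaction still occupying the bus.
        uint64_t start = now > busy_until ? now : busy_until;
        busy_until = start + XFER;
        return busy_until;
#else
        // Style 2: count traffic in a recent interval and add an estimated
        // penalty proportional to measured utilisation (assumed heuristic).
        if (now - window_start >= WINDOW) { window_start = now; window_count = 0; }
        window_count++;
        uint64_t estimated_wait = (window_count - 1) * XFER / 2;
        return now + XFER + estimated_wait;
#endif
    }
};
```

In the direct style, back-to-back transactions serialise on `busy_until`; in the estimated style they complete immediately but carry a growing delay estimate, which is cheaper to simulate but only statistically accurate.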
Compare the transaction order and simulation performance.

Submit listings and printouts together with a short report that answers the following questions:

Questions:

Q1. For the target system: tabulate realistic assumptions for system parameters such as memory size, bus width, clock frequency and bus throughput.

Q2. Modelling accuracy: list measurements of your system, such as memory bandwidth used for instruction fetch, data load/store, bus utilisation, contention delays and so on, commenting on how their values or accuracy might vary as the style of modelling is adjusted.

Q3. What is the simulation performance? I.e. how many clock cycles, instructions or transactions per second are simulated on the workstation?

Q4. Given that you have multiple SystemC threads and/or methods in the simulation, is simulation performance maximised by minimising the number of times they block or context switch? Discuss any useful mechanisms that might be applied to your system to maximise simulation performance while maintaining sequential consistency.

Q5. Optional: Estimate the power consumption and silicon area for this system assuming it was implemented in 45 nm VLSI.

Collaborative work is allowed provided it is made very clear who did what.

------------------------------------------------------------------
Further Notes

Q. Do you mean that methods in the IP block (SystemC model) should be callable from the C source code (software running on the OR1K)? If so, how do you achieve this? Could we get some examples?

A. 'Using the C Preprocessor to Adapt Firmware' shows a typical modification to the device driver: http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhp75d177e02.html The one-channel DMA controller example http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhp003e249cf.html has a slave-write function. This is what would typically be called by the alternative setting of the preprocessor flags.
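A minimal sketch of that preprocessor adaptation is shown below. The macro name `ACTUAL_PLATFORM`, the register offset, and the `mydevice_slave_write` stand-in model are illustrative assumptions, not part of the course code; the point is that the driver proper (`dma_start`) is identical in both builds.

```c
#include <stdint.h>
#include <assert.h>

#define DMA_CTRL_REG 0x0   /* illustrative register offset */

#ifdef ACTUAL_PLATFORM
/* OR1K build: a store to the memory-mapped register reaches the ISS bus. */
static void dev_write(uint32_t base, uint32_t reg, uint32_t data) {
    *(volatile uint32_t *)(uintptr_t)(base + reg) = data;
}
#else
/* Direct-call build: invoke the behavioural model's slave-write method.
   Here a tiny stand-in model takes its place so the sketch is self-contained. */
static uint32_t model_regs[4];
static void mydevice_slave_write(uint32_t reg, uint32_t data) {
    model_regs[reg / 4] = data;    /* behave like the memory-mapped device */
}
static void dev_write(uint32_t base, uint32_t reg, uint32_t data) {
    (void)base;                    /* no bus address decode in this mode */
    mydevice_slave_write(reg, data);
}
#endif

/* The device driver proper is identical in both builds. */
void dma_start(uint32_t base, uint32_t src_addr) {
    dev_write(base, DMA_CTRL_REG, src_addr);
}
```

Compiling with -DACTUAL_PLATFORM and the OR1K cross-compiler gives the ISS firmware; compiling without it, together with the behavioural model, gives the single-program ESL build.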
A variant of this, for a more complex device, would be that the slave_write function itself dispatches its calls to a variety of different routines according to which register is written to, and the device driver code would have separate stubs for each different write operation. In general, it is not just register writes that can be 'inlined' in this way, but also invocations of major components inside the addressed device, such as a rectangle shading function in a graphics controller, or a packet decoding operation in an encryption co-processor.

Q. What is the quantum keeper?

A. The QK is illustrated on 'Loose Timing and Temporal Decoupling': http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhpc4b5acced.html In the OR1200.cpp and orsim_sc.cpp implementations of a TLM OR1K we see the following explicit code for a simple QK:

    if (maindelay > local_quantum) { wait(maindelay); maindelay = period; }

A slightly more complex implementation of this is provided by the TLM library quantum keeper, which is invoked with:

    if (m_qk.need_sync()) m_qk.sync();

Q. Can you explain the optional part more clearly? By on-chip network do you mean the instantiations of busmux? To model queuing contention do you connect all IP blocks to the same initiator socket and only allow one transaction through at a time? How do you measure contention?

A. There are various ways to model queuing contention, as discussed in http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhpe214fa1ca.html and http://www.cl.cam.ac.uk/teaching/1011/P35/obj4.1/zhpbb8fdd908.html Contention can be estimated from the measured utilisation, using a mechanism such as the one on the slide, where the number of transactions in a recent interval is measured.

Q. Can you give an example of how sequential consistency might be lost?

A. Sequential consistency will be lost when two OR1K cores are using a region of shared memory at the same time and their QK setting is more than a few instructions.

Q.
What is the purpose of creating two or more instances of the same IP block?

A. Using the same program on two cores, each addressing a different instance of the IP block, is perhaps the simplest way to demonstrate contention on shared resources, such as a bus or cache. It is also another source of sequential inconsistency (even with one core, if the internals of the IP block use their own thread for loosely-timed operation), since we might see I/O operations performed in a different order. Having them be the same type of IP block makes differences in I/O order more apparent, since we are generally not very interested in I/O ordering between different types of IP block: e.g. do we care whether a pixel is plotted before or after an Ethernet packet is sent?

Q. How can you run the test application without SystemC?

A. The first item on the 'Levels of Modelling' slide http://www.cl.cam.ac.uk/teaching/1011/P35/obj1.1/zhpbefbf2f19.html describes running a program that gives the same 'output'. This may be the same C program just compiled for a different platform but linked with different libraries, or it may be just parts of the C program, such as the algorithmic core. Examples of the former are the musicbox OR1K application and all of the splash2 benchmark suite (/usr/groups/han/clteach/orpsoc-old/splash2), which can be run on the OR1K or else compiled natively for running on the workstation. They give the same 'output' in both cases.

Q. It would be really useful to get an example of how the firmware can be compiled directly with the behavioural model. After spending a day working on this I still haven't managed to do it. It would appear that this is considerably more complex than the slide you linked to would suggest.

A. I have placed a basic example here (also under CVS): /usr/groups/han/clteach/orpsoc/ethercrc

Q. Say we take the example of reading a buffered Ethernet packet from an Ethernet device.
Should the callable method in the IP block return the entire packet or, say, 32 bits at a time (as would happen using actual firmware)? The first approach seems to be more high level, but on the IP block side it would actually require implementing most of the logic twice, which seems to defeat the purpose.

A. There will be several potential interface points, but all of them will maintain the same interface between the device driver and its test application. These may not be separate components in a small get-started implementation, but the idea should be present. The behavioural model core code should also be the same. What does vary, as you mention, is how much of the interface between the device and the bottom of the device driver is fully modelled: the number of bus operations may be accurate or abstracted. With good structure to your code, I don't see that very much needs implementing twice; it is more a matter of leaving out the intermediate details in the higher-level model.

Q. I do not see any sources of contention.

A. Within a single core there can be sources of contention: instruction fetch and load/store can both demand the bus at the same time, but not with the setup in the reference model as originally provided: it uses just one thread to fill its I-cache, run its internals and perform loads and stores. However, it will suffer contention if other bus masters are present (such as devices that perform DMA, and other active cores). Without modifying any core models, if you set the number of cores on the command line of the reference model to 2, you will certainly have contention. The mixbug example shows how to set the stack pointer and how to branch conditionally depending on core number.

Q. I do not see any sequential inconsistencies.

A. In a simple system with only one core running you may not notice any, even with large LT quanta: there is no overtaking point where something can get out of order.
However, if you have a cache with slow write-back, or a DMA controller, or two cores, or a NoC with significant buffering, they will soon start to crop up. You might even think about where to detect them, or where to add fences to enforce sequence.

Q. What figures should I use if I do any power modelling?

A. For SRAM see http://www.cs.virginia.edu/~skadron/Papers/islped05_li.pdf and create your own model that gives a figure in pJ per read operation, assuming one of the optimal internal implementations (you can assume the bits are in a simple square array; answers between 1E-8 and 1E-7 Joules per read are roughly correct). For on-chip interconnect use 0.3 pJ per transition per millimetre at a 1.0 Volt supply rail, and scale with V^2. For instruction execution, assume ...

Q. Can I implement a device and device driver that implements dynamic frequency and voltage scaling?

A. Yes, that would be a fine thing to do - it would integrate many parts of the lectured material.

Q. I keep getting runtime errors like "Segmentation Fault" or "Illegal Instruction" which are very difficult to debug.

A. To debug segmentation faults and so on, it is easiest to run the program under gdb. Make sure that each C or C++ compilation has the -g option in it so that debugging information is present in the files. Give the executable file of the SystemC simulation on the gdb command line and then, inside gdb, use 'run args' where args is all the other arguments that you normally put after the executable name. If it stops, use the 'where' gdb command; if it does not, press control-C and then use the 'where' and 'cont' commands to follow progress.

Q. Would you expect the following to work as a mutual exclusion lock on the OR1K?

    static volatile int lock = 0;
    while (lock == 1) continue;
    lock = 1;
    critical_section();
    lock = 0;

I'm getting strange behaviour with this. Are there any atomic instructions you can use for the OR1K to implement a proper mutex?

A.
The load-locked/store-conditional cycles on the memory system will provide mutexes that are safe when multiple cores contend. Using this atomic instruction sequence is demonstrated in solosbits.c:

    // Actually an atomic exchange instruction, despite its name.
    int _test_and_set(volatile int *addr, int wdata)
    {
      // The appropriate prefix should go before the store/load sequence:
      //   15 00 00 05   l.nop 0x5
      //   d4 03 20 00   l.sw 0x0(r3),r4
      //   85 63 00 00   l.lwz r11,0x0(r3)
      // Weirdly, the test-and-set is coded as set-then-test in the assembler.
      // The order of transactions is permuted in the hardware (and if it
      // were not, for some reason, we would always fail in mutex acquire
      // since we would see ourself).
      asm volatile ("l.nop %0" : : "K" (NOP_ATOMIC) : "memory"); // memory clobber keeps the compiler from reordering the accesses
      *addr = wdata;
      int r = *addr;
      return r;
    }

The putchar function for the UART without interrupts is thread-safe in the sense that nothing will crash when it is called from multiple threads without locks, but characters from different threads can get well mixed up. I recall there is a version of printf there somewhere that takes out a lock for the duration of each output line, thereby making uart/console output from different cores mix on a line-by-line basis, which is what we normally expect.
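A spin-lock built on such an atomic exchange might look like the sketch below. Since `_test_and_set` above only exists on the OR1K, a C11 `atomic_exchange` is used here as a workstation stand-in with the same exchange semantics; the function and variable names are illustrative.

```c
#include <stdatomic.h>
#include <assert.h>

/* Workstation stand-in for the OR1K _test_and_set: atomically store wdata
   and return the value that was there before. */
static int test_and_set(volatile atomic_int *addr, int wdata) {
    return atomic_exchange(addr, wdata);
}

static atomic_int lock = 0;

static void mutex_acquire(void) {
    /* Spin until the exchange returns 0, i.e. we were the one to set it.
       Unlike the plain read-then-write loop in the question, the read and
       the write happen as a single atomic operation, so two cores cannot
       both observe the lock free and claim it. */
    while (test_and_set(&lock, 1) != 0)
        continue;
}

static void mutex_release(void) {
    atomic_store(&lock, 0);
}
```

The broken example in the question fails precisely because its test (`while (lock == 1)`) and its set (`lock = 1`) are separate bus operations, between which another core can slip in.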