Ex4: 2010/11: ACS P35 SoC D/M Week 7 Exercise. 10 Credit Marks.

Mini-Project part I: This exercise again involves the OR1K non-blocking TLM model. (Part II of the mini-project development will be part of Ex5.)

Select one of the tasks you looked at for Ex3, or choose another IP block of your own design or from the list in Ex3. Recalling that the highest-level ESL model enables firmware to be compiled directly with the behavioural model of the device (grail slide), ensure that the main functions of your IP block are directly callable as C++ methods (as well as being callable as TLM targets).

By extending main.cpp from the btlm-ref-design, or otherwise, create a System-On-Chip containing one (or more) OR1K processors and two or more instances of your chosen IP block, plus any other IP blocks needed.

Optional part: Implement two modelling styles for your on-chip network: one where queuing contention is directly modelled (i.e. one transaction has to wait for another) and another where contention is measured and an estimate is added to the time delays. Use a #define to switch between the modelling styles.

Write and run a test application program on the OR1K core, along with any device drivers, that generates traffic for your IP blocks. Your device driver code should be modifiable with a #define so that it may either be compiled to OR1K machine code to run on the ISS, or compiled along with the behavioural model of your IP block as a single program.

Under SystemC, adjust the TLM quantum keeper setting to observe two different interleavings of transactions. Measure the simulation performance (transactions per second simulated) for both models of on-chip network (if you have them), at different settings of the TLM quantum, and for the direct-calling ESL model. Note any sequential inconsistencies and say what sort of hazard they present (RaW, WaW, WaR, etc.).

With or without using SystemC, run your test application in the directly-calling mode.
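The two contention-modelling styles in the optional part can be sketched independently of SystemC. The sketch below is illustrative only; the class name, the `XFER` cost and the utilisation heuristic are assumptions, not course code, and in a real model the delays would feed the TLM delay annotation rather than be returned directly.

```cpp
#include <cassert>
#include <cstdint>

// Switch modelling styles with a #define, as the exercise suggests.
#define MODEL_CONTENTION_DIRECTLY 1   // set to 0 for the estimated style

struct BusModel {
    uint64_t busy_until = 0;     // time the bus becomes free (direct style)
    uint64_t window_start = 0;   // start of current measurement interval
    unsigned window_count = 0;   // transactions seen in the interval
    static constexpr uint64_t XFER = 10;     // nominal cost of one transfer
    static constexpr uint64_t WINDOW = 1000; // measurement interval length

    // Returns the completion time of a transaction issued at 'now'.
    uint64_t transact(uint64_t now) {
#if MODEL_CONTENTION_DIRECTLY
        // Style 1: directly modelled queuing - a transaction must wait
        // behind any transaction still occupying the bus.
        uint64_t start = now > busy_until ? now : busy_until;
        busy_until = start + XFER;
        return busy_until;
#else
        // Style 2: count traffic in a recent interval and add an estimated
        // penalty proportional to measured utilisation (assumed heuristic).
        if (now - window_start >= WINDOW) { window_start = now; window_count = 0; }
        window_count++;
        uint64_t estimated_wait = (window_count - 1) * XFER / 2;
        return now + XFER + estimated_wait;
#endif
    }
};
```

In the direct style, back-to-back transactions serialise on `busy_until`; in the estimated style they complete immediately but carry a growing delay estimate, which is cheaper to simulate but only statistically accurate.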
Compare the transaction order and simulation performance.

Submit listings and printouts together with a short report that answers the following questions:

Questions:

Q1. For the target system: tabulate realistic assumptions for system parameters such as memory size, bus width, clock frequency and bus throughput.

Q2. Modelling accuracy: list measurements of your system, such as memory bandwidth used for instruction fetch, data load/store, bus utilisation, contention delays and so on, commenting on how their values or accuracy might vary as the style of modelling is adjusted.

Q3. What is the simulation performance? I.e. how many clock cycles, instructions or transactions per second are simulated on the workstation?

Q4. Given that you have multiple SystemC threads and/or methods in the simulation, is simulation performance maximised by minimising the number of times they block or context switch? Discuss any useful mechanisms that might be applied to your system to maximise simulation performance while maintaining sequential consistency.

Q5. Optional: Estimate the power consumption and silicon area for this system assuming it was implemented in 45 nm VLSI.

Collaborative work is allowed provided it is made very clear who did what.

------------------------------------------------------------------
Further Notes

Q. Do you mean that methods in the IP block (SystemC model) should be callable from the C source code (software running on the OR1K)? If so, how do you achieve this? Could we get some examples?

A. 'Using the C Preprocessor to Adapt Firmware' shows a typical modification to the device driver: http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhp75d177e02.html The one-channel DMA controller example http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhp003e249cf.html has a slave-write function. This is what would typically be called by the alternative setting of the preprocessor flags.
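A minimal sketch of that preprocessor adaptation is shown below. The macro name `ACTUAL_PLATFORM`, the register offset, and the `mydevice_slave_write` stand-in model are illustrative assumptions, not part of the course code; the point is that the driver proper (`dma_start`) is identical in both builds.

```c
#include <stdint.h>
#include <assert.h>

#define DMA_CTRL_REG 0x0   /* illustrative register offset */

#ifdef ACTUAL_PLATFORM
/* OR1K build: a store to the memory-mapped register reaches the ISS bus. */
static void dev_write(uint32_t base, uint32_t reg, uint32_t data) {
    *(volatile uint32_t *)(uintptr_t)(base + reg) = data;
}
#else
/* Direct-call build: invoke the behavioural model's slave-write method.
   Here a tiny stand-in model takes its place so the sketch is self-contained. */
static uint32_t model_regs[4];
static void mydevice_slave_write(uint32_t reg, uint32_t data) {
    model_regs[reg / 4] = data;    /* behave like the memory-mapped device */
}
static void dev_write(uint32_t base, uint32_t reg, uint32_t data) {
    (void)base;                    /* no bus address decode in this mode */
    mydevice_slave_write(reg, data);
}
#endif

/* The device driver proper is identical in both builds. */
void dma_start(uint32_t base, uint32_t src_addr) {
    dev_write(base, DMA_CTRL_REG, src_addr);
}
```

Compiling with -DACTUAL_PLATFORM and the OR1K cross-compiler gives the ISS firmware; compiling without it, together with the behavioural model, gives the single-program ESL build.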
A variant of this, for a more complex device, would be that the slave_write function itself dispatches its calls to a variety of different routines according to which register is written to, and the device driver code would have separate stubs for each different write operation. In general, it is not just register writes that can be 'inlined' in this way, but also invocations of major components inside the addressed device, such as a rectangle shading function in a graphics controller, or a packet decoding operation in an encryption co-processor.

Q. What is the quantum keeper?

A. The QK is illustrated on 'Loose Timing and Temporal Decoupling': http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhpc4b5acced.html In the OR1200.cpp and orsim_sc.cpp implementations of a TLM OR1K we see the following explicit code for a simple QK:

    if (maindelay > local_quantum) { wait(maindelay); maindelay = period; }

A slightly more complex implementation of this is provided by the TLM library quantum keeper, which is invoked with:

    if (m_qk.need_sync()) m_qk.sync();

Q. Can you explain the optional part more clearly? By on-chip network do you mean the instantiations of busmux? To model queuing contention do you connect all IP blocks to the same initiator socket and only allow one transaction through at a time? How do you measure contention?

A. There are various ways to model queuing contention, as discussed in http://www.cl.cam.ac.uk/teaching/1011/P35/obj2.1/zhpe214fa1ca.html and http://www.cl.cam.ac.uk/teaching/1011/P35/obj4.1/zhpbb8fdd908.html Contention can be estimated from the measured utilisation, using a mechanism such as the one on the slide, where the number of transactions in a recent interval is measured.

Q. Can you give an example of how sequential consistency might be lost?

A. Sequential consistency will be lost when two OR1K cores are using a region of shared memory at the same time and their QK setting is more than a few instructions.

Q.
What is the purpose of creating two or more instances of the same IP block?

A. Using the same program on two cores, each addressing a different instance of the IP block, is perhaps the simplest way to demonstrate contention on shared resources, such as a bus or cache. It is also another source of sequential inconsistency (even with one core, if the internals of the IP block use their own thread for loosely-timed operation), since we might see I/O operations performed in a different order. Having them be the same type of IP block makes differences in I/O order more apparent, since we are generally not very interested in I/O ordering between different types of IP block: e.g. do we care whether a pixel is plotted before or after an Ethernet packet is sent?

Q. How can you run the test application without SystemC?

A. The first item on the 'Levels of Modelling' slide http://www.cl.cam.ac.uk/teaching/1011/P35/obj1.1/zhpbefbf2f19.html describes running a program that gives the same 'output'. This may be the same C program just compiled for a different platform but linked with different libraries, or it may be just parts of the C program, such as the algorithmic core. Examples of the former are the musicbox OR1K application and all of the splash2 benchmark suite (/usr/groups/han/clteach/orpsoc-old/splash2), which can be run on the OR1K or else compiled natively for running on the workstation. They give the same 'output' in both cases.

Q. It would be really useful to get an example of how the firmware can be compiled directly with the behavioural model. After spending a day working on this I still haven't managed to do it. It would appear that this is considerably more complex than the slide you linked to would suggest.

A. I have placed a basic example here (also under CVS): /usr/groups/han/clteach/orpsoc/ethercrc

Q. Say we take the example of reading a buffered Ethernet packet from an Ethernet device.
Should the callable method in the IP block return the entire packet or, say, 32 bits at a time (as would happen using actual firmware)? The first approach seems to be more high level, but on the IP block side it would actually require implementing most of the logic twice, which seems to defeat the purpose.

A. There will be several potential interface points, but all of them will maintain the same interface between the device driver and its test application. These may not be separate components in a small get-started implementation, but the idea should be present. The behavioural model core code should also be the same. What does vary, as you mention, is how much of the interface between the device and the bottom of the device driver is fully modelled: the number of bus operations may be accurate or abstracted. With good structure to your code, I don't see that very much needs implementing twice; it is more a matter of leaving out the intermediate details in the higher-level model.

Q. I do not see any sources of contention.

A. Within a single core there can be sources of contention: instruction fetch and load/store can both demand the bus at the same time, but not with the setup in the reference model as originally provided: it uses just one thread to fill its I-cache, run its internals and perform loads and stores. However, it will suffer contention if other bus masters are present (such as devices that perform DMA, and other active cores). Without modifying any core models, if you set the number of cores on the command line of the reference model to 2, you will certainly have contention. The mixbug example shows how to set the stack pointer and how to branch conditionally depending on core number.

Q. I do not see any sequential inconsistencies.

A. In a simple system with only one core running you may not notice any, even with large LT quanta: there is no overtaking point where something can get out of order.
However, if you have a cache with slow write-back, or a DMA controller, or two cores, or a NoC with significant buffering, they will soon start to crop up. You might even think about where to detect them, or where to add fences to enforce sequence.

Q. What figures should I use if I do any power modelling?

A. For SRAM see http://www.cs.virginia.edu/~skadron/Papers/islped05_li.pdf and create your own model that gives a figure in pJ per read operation, assuming one of the optimal internal implementations (you can assume the bits are in a simple square array; answers between 1E-8 and 1E-7 Joules per read are roughly correct). For on-chip interconnect use 0.3 pJ per transition per millimetre at a 1.0 Volt supply rail, and scale with V^2. For instruction execution, assume ...

Q. Can I implement a device and device driver that implements dynamic frequency and voltage scaling?

A. Yes, that would be a fine thing to do - it would integrate many parts of the lectured material.

Q. I keep getting runtime errors like "Segmentation Fault" or "Illegal Instruction" which are very difficult to debug.

A. To debug segmentation faults and so on, it is easiest to run the program under gdb. Make sure that each C or C++ compilation has the -g option in it so that debugging information is present in the files. Give the executable file of the SystemC simulation on the gdb command line and then, inside gdb, use 'run args' where args is all the other arguments that you normally put after the executable name. If it stops, use the 'where' gdb command; if it does not, press control-C and then use the 'where' and 'cont' commands to follow progress.

Q. Would you expect the following to work as a mutual exclusion lock on the OR1K?

    static volatile int lock = 0;
    while (lock == 1) continue;
    lock = 1;
    critical_section();
    lock = 0;

I'm getting strange behaviour with this. Are there any atomic instructions you can use for the OR1K to implement a proper mutex?

A.
The load-locked/store-conditional cycles on the memory system will provide mutexes that are safe when multiple cores contend. Using this atomic instruction sequence is demonstrated in solosbits.c:

    // Actually an atomic exchange instruction, despite its name.
    int _test_and_set(volatile int *addr, int wdata)
    {
      // The appropriate prefix should go before the store/load sequence:
      //   15 00 00 05   l.nop 0x5
      //   d4 03 20 00   l.sw 0x0(r3),r4
      //   85 63 00 00   l.lwz r11,0x0(r3)
      // Weirdly, the test-and-set is coded as set-then-test in the assembler.
      // The order of transactions is permuted in the hardware (and if it
      // were not, for some reason, we would always fail in mutex acquire
      // since we would see ourself).
      asm volatile ("l.nop %0" : : "K" (NOP_ATOMIC) : "memory"); // memory clobber keeps the compiler from reordering the accesses
      *addr = wdata;
      int r = *addr;
      return r;
    }

The putchar function for the UART without interrupts is thread-safe in the sense that nothing will crash when it is called from multiple threads without locks, but characters from different threads can get well mixed up. I recall there is a version of printf there somewhere that takes out a lock for the duration of each output line, thereby making uart/console output from different cores mix on a line-by-line basis, which is what we normally expect.
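A spin-lock built on such an atomic exchange might look like the sketch below. Since `_test_and_set` above only exists on the OR1K, a C11 `atomic_exchange` is used here as a workstation stand-in with the same exchange semantics; the function and variable names are illustrative.

```c
#include <stdatomic.h>
#include <assert.h>

/* Workstation stand-in for the OR1K _test_and_set: atomically store wdata
   and return the value that was there before. */
static int test_and_set(volatile atomic_int *addr, int wdata) {
    return atomic_exchange(addr, wdata);
}

static atomic_int lock = 0;

static void mutex_acquire(void) {
    /* Spin until the exchange returns 0, i.e. we were the one to set it.
       Unlike the plain read-then-write loop in the question, the read and
       the write happen as a single atomic operation, so two cores cannot
       both observe the lock free and claim it. */
    while (test_and_set(&lock, 1) != 0)
        continue;
}

static void mutex_release(void) {
    atomic_store(&lock, 0);
}
```

The broken example in the question fails precisely because its test (`while (lock == 1)`) and its set (`lock = 1`) are separate bus operations, between which another core can slip in.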