SoC D/M Proficiency Tick 5+6: 5+5 credits.

Ticks 5 and 6 concern implementing and using an inter-core communication (ICC) mechanism between a number of ORP cores on a single SoC. There are three styles of ICC to consider and full credit will be awarded for implementing and using one of them. Tick 5 is the generation of a design specification and tick 6 is simply showing that it worked. There is a lot of freedom over what you actually do.

For these two ticks, you may wish to work in a tiny team of two (or at most three) people, where one does the hardware and the other does the software. However, please provide hardcopy of all source code created by the tiny team with your own submission. There must be a completely clear separation of roles between team members (for assessment purposes) and each file should have just one, named author.

The preferred starting point for this work is the blocking TLM reference implementation of an N-core ORP platform that provides basic shared memory but has no cache subsystem. The getting started material is:

  Blocking TLM Hardware: /usr/groups/han/clteach/orpsoc/btlm-ref-design
  Hello World Software: /usr/groups/han/clteach/orpsoc/sw/hello-world
  Monitor Software: /usr/groups/han/clteach/orpsoc/sw/mixbug-or1k
  Work-stealing software: /usr/groups/han/clteach/orpsoc/workstealer
  Music/Audio material: /usr/groups/han/clteach/orpsoc/musicbox

You may alternatively use the blocking model or the RTL-style cycle-accurate model from the neighbouring directory, or start from scratch yourself. The provided TLM model(s) use the industry-standard OSCI TLM 2.0 convenience sockets, but you may prefer to use the 'toy' TLM style from the earlier taught material.

The three styles of ICC implementation are (choose one):

  ICC Style 1: Shared memory with cache
  ICC Style 2: Message-passing hardware
  ICC Style 3: Hardware transactional memory
The three applications are (choose one):

  App 1: Trivial demo
  App 2: Work stealing demo
  App 3: Music synthesiser (optionally using work stealing)

Further details:

ICC Style 1: Shared memory with cache

For style 1, please implement a SystemC module that is a cache memory that connects between a processor core and its main memory. Implement an associative, set-associative or direct-mapped cache using a memory component instantiated inside your module. In slidepack 4.x (link to be added) there is some TLM pseudocode for a cache memory that you may copy. Note that the 'hard' part of this exercise is the cache consistency mechanism: you will need to implement some sort of MESI algorithm as well as some snooping/invalidate mechanism. (Note that the OR1200, as provided, has its own caches in the pre-verilated RTL. These are turned off by default and you should leave them off. To implement style 1 you must write your own cache that sits outside the provided core.)

ICC Style 2: Message-passing hardware

For style 2, please modify the OR1200 module so that it has one or more instances of some sort of channel that can be directly or indirectly wired to other such instances of your modified ORP module. Provide an API that enables 'flits' of 64 bytes to be sent between the cores. You may wish to use interrupts, but polling is acceptable. These channels should be implemented using the same level of abstraction as the surrounding design (e.g. TLM 2.0 sockets). If desired, you can augment the ORP instruction set to handle the channels (one can modify the ORP RTL code and re-run verilator on it, but it is easier to follow the pattern already in use, where certain NOPs have been given special functions using specific C code in the OR1200 module).

ICC Style 3: Hardware transactional memory

For style 3, proceed as per style 1, building a cache that connects between the processor and its memory, but do not build a cache consistency mechanism.
Instead, implement a software API that the processor can use to control write back from the local store to the main memory. Your implementation should preferably implement a mechanism that avoids committing transactions based on dirty reads [2].

App 1: Trivial demo

To implement a trivial demo, please send a 64 byte datum from software running on one core to that on another and back again, with the bits flipped or some other marking operation applied. The cores should communicate using your implemented ICC protocol.

App 2: Work stealing demo

Work stealing [3] is a load balancing technique where each core keeps a FIFO of pending work items. Take a look at the provided work stealing code on the link above. This is designed to use ICC x and will have to be adapted if you are using x'. TO BE COMPLETED.

App 3: Music synthesiser (optionally using work stealing)

Take a copy of the Music/Audio material that is provided (or use your own) and partition or extend it so that parts of the work are done on at least two different processor cores. The cores should communicate using your implemented ICC protocol. The 'mixwav' material should be adapted to be a SystemC model of the audio output DAC that writes a .wav audio file. As an extension, the cores should balance their relative loads automatically, perhaps using the work stealing algorithm.

Questions for Tick 5:

1. Did you work in a tiny team? If so, who else was involved and what did each person do?
2. Which style of ICC are you going to implement? Describe your design.
3. Using B-TLM methodology for timing estimation, you will have to increment sc_time at one or more points. Where will you do this, by how much, and what model of queueing contention is embodied (e.g. will you use the 1/(1-p) formula)?
4. Do you rely on polling for synchronisation? If so, could improvements be made using interrupts or custom extensions to the processor itself?
5.
What variations in behaviour would you see as the quantum keeping interval is increased, and might these ever affect correctness? (Correctness can be interpreted to mean that the same sequence of transactions occurs as in an RTL implementation, or it may be interpreted to just mean that the application code, if well written, produces the same output.) NB: You should consider workloads whose components have variable execution times.

Questions for Tick 6 (please answer three or more):

1. Does your design work for N=4? What tests have you made? Can N easily be made larger? What happens and how would you address this?
2. How does your design compare to an alternative system created using bus bridges? (Think about uniformity of the address space in terms of addressability and access time.) Would you need to create a 'doorbell' inter-core interrupter?
3. Do you support full sequential consistency [1], i.e. is it possible that data appears to have been sent in an order different from the actual one? Does this get relaxed if you increase the 'quantum' in any of the quantum keepers?
4. Briefly comment on how your answers to the above would vary if you had selected the other forms of ICC.
5. The consistent cache approach to inter-core communication is used in most contemporary PCs, workstations and laptops and it would appear to have a less complex API than the explicit message passing and transactional memory alternatives. However, relaxed memory ordering in modern implementations is increasing the need for memory fence instructions, meaning that this form of ICC is becoming more explicit. Whatever form of ICC you used, could you couple your mechanism to the quantum keeper(s) so that memory fences, other explicit synchronisation instructions (e.g. test and set) or those of your own API always see correct semantics where necessary, while allowing the rest of the model to run in a highly temporally decoupled way?
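As background to tick 5 question 3: the 1/(1-p) formula is read here as the standard single-server queueing result, in which the mean time spent at a resource is its service time divided by (1 - p), with p the measured utilisation of that resource. The sketch below is a minimal illustration under that assumption. The names contended_delay_ns and LoadMeter are hypothetical and are not part of the provided models, and the cap at 0.99 is an arbitrary guard against division by zero.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not part of the provided reference design.
// service_ns: uncontended service time of the resource (e.g. one bus transfer).
// p:          measured utilisation of the resource, busy time / elapsed time.
// Returns the service time inflated by the 1/(1-p) queueing factor.
double contended_delay_ns(double service_ns, double p) {
    assert(service_ns >= 0.0 && p >= 0.0);
    const double capped = std::min(p, 0.99);  // arbitrary cap: avoid dividing by zero near saturation
    return service_ns / (1.0 - capped);
}

// Hypothetical running utilisation estimate for one contention point.
struct LoadMeter {
    double busy_ns = 0.0;

    // Call once per transaction with its uncontended service time.
    void record(double service_ns) { busy_ns += service_ns; }

    // Fraction of the elapsed simulated time for which the resource was busy.
    double utilisation(double now_ns) const {
        return now_ns > 0.0 ? busy_ns / now_ns : 0.0;
    }
};
```

In a blocking TLM target, a value computed this way could be added to the delay argument of the transport call instead of blocking the calling thread; whether that is accurate enough for your design is one of the things the question asks you to consider.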
Credit Matrix

Tick 5: Design one ICC and one app and answer all of the tick 5 questions.
Tick 6: Implement a design that would qualify as tick 5 and make it work. Include hardcopy of the output (or fileserver path for .wav files) and answer three of the tick 6 questions.
Mini-project II: Parts of tick 5/6 can be included in the write up for research essay II. This essay will be set in the last session.

References:

[1] http://en.wikipedia.org/wiki/Sequential_consistency
[2] http://en.wikipedia.org/wiki/Transactional_memory
[3] http://en.wikipedia.org/wiki/Cilk#Work-stealing

END
-----------------------------------------------------

Questions arising:

> I was trying to implement a MESI protocol for the private caches of
> the orp cores and I was wondering how am I going to invalidate copies
> of a block from (let’s say) core A to all others. I probably need to
> use a transaction but it’s not very clear to me how to do this. Will
> it be a tlm_generic_payload? Should I declare it as a write
> (trans.set_write)?

This is a key part of the exercise and it requires creative thought...

For high-level ESL modelling, we may not wish to model the details of the invalidate mechanism, so an implementation that has little to do with reality could be used. If we wish to do more-detailed timing and contention modelling, then something that resembles the physical structure is needed so that the contention can be measured. But if the physical structure is not a limitation on system performance, then contention in it will not affect the timing results.

For up to four cores, the snoopy bus protocol is commonly used, and even this is not a bad way to do 16 cores as a two-level tree. (Only four cores are needed for the working tick.) So perhaps some sort of snoopy bus model is a good start.
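At its most abstract, a snoopy bus model can be little more than a list of callbacks, one per cache, that is walked whenever a cache broadcasts an invalidate. The following is a minimal sketch in plain C++ with no SystemC or TLM dependency. All names (SnoopyBus, ToyCache, attach, invalidate, holds) are hypothetical and are not taken from the reference design. The caches track only which lines are present, so MESI states, data and timing are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical model of a snoopy bus: a registry of per-cache callbacks.
class SnoopyBus {
public:
    // Callback arguments: id of the cache that issued the invalidate, line address.
    using Snooper = std::function<void(int, std::uint32_t)>;

    // Each cache installs one callback when it is constructed.
    void attach(Snooper s) {
        assert(s != nullptr);
        snoopers_.push_back(std::move(s));
    }

    // Broadcast: call every installed callback in turn.
    void invalidate(int source, std::uint32_t line_addr) {
        ++broadcasts_;
        for (auto &s : snoopers_) s(source, line_addr);
    }

    unsigned broadcasts() const { return broadcasts_; }

private:
    std::vector<Snooper> snoopers_;
    unsigned broadcasts_ = 0;
};

// Hypothetical cache that only records which lines it currently holds.
class ToyCache {
public:
    ToyCache(int id, SnoopyBus &bus) : id_(id), bus_(bus) {
        bus_.attach([this](int source, std::uint32_t line_addr) {
            if (source != id_) lines_.erase(line_addr);  // ignore our own broadcast
        });
    }
    ToyCache(const ToyCache &) = delete;  // the callback captures 'this'

    // A read simply installs the line locally.
    void read(std::uint32_t line_addr) { lines_.insert(line_addr); }

    // A write invalidates every other copy, then keeps the line locally.
    void write(std::uint32_t line_addr) {
        bus_.invalidate(id_, line_addr);
        lines_.insert(line_addr);
    }

    bool holds(std::uint32_t line_addr) const { return lines_.count(line_addr) != 0; }

private:
    int id_;
    SnoopyBus &bus_;
    std::unordered_set<std::uint32_t> lines_;
};
```

For example, if two ToyCache objects on the same bus both read line 0x40 and one then writes it, only the writer still holds the line afterwards. The broadcasts() counter is the kind of hook from which a load measurement could later be derived.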
Most abstractly, you could make every cache install a callback in a data structure that represents a snoopy bus and implement the invalidates using an iterator that calls every callback in turn, mirroring the broadcast nature of the bus. You could use a TLM generic payload between appropriate TLM sockets for this, or you can just use simple C++ object calls of your own invention (perhaps using the C++ boost library to assist with passing pointers to object methods).

To make the model less abstract you could account for the time taken by an invalidate operation. You could dynamically measure the load at a contention point and use the 1/(1-p) formula to simulate queuing, or you could just block the threads and get actual queueing.

The main alternative to the snoopy cache is the directory-based system and I think this could fairly readily be coded up in TLM style.

> 1. I can either use asm() function or C pointer to access the memory.
> Is there any restriction you expected?

Just use something like

  *((volatile int *)0xFFEEDDCC)

Using this C pointer form is easier than embedded assembler. There are no restrictions that spring to mind.

> 2. When I declare a local variable, I noticed that its address is
> 65520. I am confused because we should only be in the range of 0 to
> 4000?

A local variable will be allocated on the stack and its address depends on the base value loaded into r1 at the start of the run time system (crt.S) and on how much calling/nesting is currently active. One version of the crt that I provided offsets the stacks of each core by the core number shifted left by a small amount, but you would have to increase this if the stack is heavily used. Such a shift is not needed at all if the stacks are in local RAMs, as is the case in one of the examples I provided.

Static variables, on the other hand, are allocated successive addresses starting from the value passed in to the link editor, and again you can arrange for these to be in shared or local RAM.

> 3.
> Is there any way in software to control which part of program
> runs in which CPU (like thread)?

The mixbug file shows how to call a different main according to which processor core you are on: it just branches according to the core ID at the start of its 'main' code. Alternatively, you can compile separate programs for each core (provided they have different RAMs addressed at their reset vector, address 0x100), but this tends to be more tedious.

> 4. Is there any printf function that I can use, since it is using
> "r32-elf-gcc" without standard IO library of GCC?

Yes, the full gcc library is present in the lib directory and it should work fine (the bootable linux image uses it), but I am unsure about how that version of libc does console/uart input and output, so it is probably easier to start with the mixbug example code, which uses its own simplistic printf functions provided in /usr/groups/han/clteach/orpsoc/sw/mixbug-or1k/prlibc.c
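Returning to the answer to question 3, the branch-on-core-ID pattern can be sketched with a small table of per-core entry points. This is an illustration only and is not a copy of the mixbug code. The function names and return values are hypothetical. The core ID is passed in as a parameter because the way it is read from the hardware is platform specific and is best taken from the mixbug source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-core entry points. Each returns a tag so that the
// dispatch can be observed when the sketch is run on a workstation.
int producer_main() { return 100; }
int consumer_main() { return 200; }
int idle_main()     { return 0; }

using CoreMain = int (*)();

// Table indexed by core ID. Cores beyond the end of the table run idle_main.
static const CoreMain kCoreMains[] = { producer_main, consumer_main };

// Called at the top of the real main() with the ID of the executing core.
int dispatch(unsigned core_id) {
    const std::size_t n = sizeof(kCoreMains) / sizeof(kCoreMains[0]);
    if (core_id >= n) return idle_main();
    assert(kCoreMains[core_id] != nullptr);  // every table entry must be filled in
    return kCoreMains[core_id]();
}
```

With a single binary shared by all cores, this keeps the per-core roles in one place. Adding a core then means adding one table entry.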