SoC D/M Proficiency Tick 5+6: 5+5 credits. 


Ticks 5 and 6 concern implementing and using an inter-core
communication (ICC) mechanism between a number of ORP cores on a
single SoC.  There are three styles of ICC to consider and full credit
will be awarded for implementing and using one of them.  Tick 5 is the
generation of a design specification and tick 6 is simply showing
that it works.  There is a lot of freedom over what you actually do.

For these two ticks, you may wish to work in a tiny team of two (or at most
three) people where one does the hardware and the other does the
software.  However, please provide hardcopy of all source code created
by the tiny team with your own submission.  There must be a completely clear
separation of roles between team members (for assessment purposes) and
each file should have just one, named author.

The preferred starting point for this work is the blocking TLM
reference implementation of an N-core ORP platform that provides basic
shared memory but has no cache subsystem.

The getting started material is: 
  Blocking TLM Hardware:  /usr/groups/han/clteach/orpsoc/btlm-ref-design
  Hello World Software:   /usr/groups/han/clteach/orpsoc/sw/hello-world
  Monitor Software:       /usr/groups/han/clteach/orpsoc/sw/mixbug-or1k
  Work-stealing Software: /usr/groups/han/clteach/orpsoc/workstealer
  Music/Audio material:   /usr/groups/han/clteach/orpsoc/musicbox


You may alternatively use the non-blocking model or the RTL-style
cycle-accurate model from the neighbouring directory, or start from
scratch yourself.  The provided TLM model(s) use the
industry-standard OSCI TLM 2.0 convenience sockets, but you may prefer
to use the 'toy' TLM style from the earlier taught material.


The three styles of ICC implementation are (choose one):

ICC Style 1: Shared memory with cache,
ICC Style 2: Message-passing hardware,
ICC Style 3: Hardware transactional memory.

The three applications are (choose one):

App 1: Trivial demo,
App 2: Work stealing demo,
App 3: Music synthesiser (optionally using work stealing).


Further details:

ICC Style 1: Shared memory with cache.

For style 1, please implement a SystemC cache-memory module that sits
between a processor core and its main memory.  Implement a fully
associative, set-associative or direct-mapped cache using a memory
component instantiated inside your module.  In slidepack 4.x (link to
be added) there is some TLM pseudocode for a cache memory that you may
copy.  Note that the 'hard' part of this exercise is the cache
consistency mechanism: you will need to implement some sort of MESI
algorithm as well as some snooping/invalidate mechanism.
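
As a rough illustration of the state machine involved, the sketch below
encodes one common textbook formulation of the MESI transitions for a
single cache line in plain C++.  It is not taken from the slidepack: the
names Mesi, Event, next_state and needs_writeback, and the
others_have_copy flag, are assumptions made for this example.  A real
module would also have to issue the bus operations (read, invalidate,
write back) that accompany each transition.

```cpp
#include <cassert>

// Per-line MESI state.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Events seen by one cache line: local processor accesses and
// traffic snooped from other caches.
enum class Event { LocalRead, LocalWrite, SnoopRead, SnoopWrite };

// Returns the next state of a line.  `others_have_copy` is only
// consulted on a local read miss (Invalid + LocalRead), where it
// selects Shared rather than Exclusive.
inline Mesi next_state(Mesi s, Event e, bool others_have_copy = false) {
    switch (e) {
    case Event::LocalRead:
        if (s == Mesi::Invalid)
            return others_have_copy ? Mesi::Shared : Mesi::Exclusive;
        return s;                 // read hit: state unchanged
    case Event::LocalWrite:
        return Mesi::Modified;    // from S or I, other copies must be invalidated first
    case Event::SnoopRead:
        if (s == Mesi::Invalid) return s;
        return Mesi::Shared;      // from M, the dirty data must be supplied first
    case Event::SnoopWrite:
        return Mesi::Invalid;     // another core is taking ownership
    }
    return s;
}

// True when the line holds the only up-to-date copy and a snooped
// access forces that data to be written back or forwarded.
inline bool needs_writeback(Mesi s, Event e) {
    return s == Mesi::Modified &&
           (e == Event::SnoopRead || e == Event::SnoopWrite);
}
```

Keeping the transition function free of side effects makes it easy to
test on its own before it is wired into a cache module.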


(Note that the OR1200, as provided, has its own caches in the
pre-verilated RTL.  These are turned off by default and you should
leave them off. To implement style 1 you must write
your own cache that sits outside the provided core.)


ICC Style 2: Message-passing hardware

For style 2, please modify the OR1200 module so that it has one or
more instances of some sort of channel that can be directly or
indirectly wired to other such instances of your modified ORP module.
Provide an API that enables 'flits' of 64 bytes to be sent between the
cores.  You may wish to use interrupts, but polling is acceptable.
These channels should be implemented using the same level of
abstraction as the surrounding design (e.g. TLM 2.0 sockets).  If
desired, you can augment the ORP instruction set to handle the
channels.  One can modify the ORP RTL code and re-run verilator
on it, but it is easier to follow the pattern already in use
where certain NOPs have been given special functions using specific C
code in the OR1200 module.
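
The shape of a polled flit API can be sketched in plain C++ as below.
This is only a functional illustration under assumed names (Flit,
FlitChannel, try_send, try_recv); it models one direction of a channel
as a bounded FIFO and does not use TLM sockets or account for time,
both of which a real implementation at the level of the surrounding
design would need.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

constexpr std::size_t kFlitBytes = 64;
using Flit = std::array<std::uint8_t, kFlitBytes>;

// One direction of a channel between two cores, modelled as a
// bounded FIFO of 64-byte flits.
class FlitChannel {
public:
    explicit FlitChannel(std::size_t depth) : depth_(depth) {}

    // Non-blocking send: returns false when the channel is full,
    // so the sending core can poll and retry.
    bool try_send(const Flit& f) {
        if (queue_.size() >= depth_) return false;
        queue_.push_back(f);
        return true;
    }

    // Non-blocking receive: returns false when nothing is pending.
    bool try_recv(Flit& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::size_t depth_;
    std::deque<Flit> queue_;
};
```

An interrupt-driven variant would raise an interrupt on the receiving
core from try_send instead of relying on the receiver to poll.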


ICC Style 3: Hardware transactional memory

For style 3, proceed as per style 1, building a cache that connects
between the processor and its memory, but do not build a cache
consistency mechanism.  Instead, implement a software API that the
processor can use to control write back from the local store to the
main memory.  Your implementation should preferably implement a
mechanism that avoids committing transactions based on dirty reads [2].
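
One possible reading of this, sketched in plain C++, is to buffer
writes locally and to validate at commit time that nothing read during
the transaction has since been changed by another core.  The version
counters and the names VersionedMemory and Transaction are assumptions
made for this example and are not part of the provided platform; a
hardware scheme would track read and write sets in the cache itself.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Main memory with a per-word version counter that is bumped on
// every committed write (an assumed mechanism for this sketch).
struct VersionedMemory {
    struct Word { std::uint32_t value = 0; std::uint32_t version = 0; };
    std::map<std::uint32_t, Word> words;   // address -> word
};

class Transaction {
public:
    explicit Transaction(VersionedMemory& m) : mem_(m) {}

    std::uint32_t read(std::uint32_t addr) {
        auto w = writes_.find(addr);
        if (w != writes_.end()) return w->second;  // see our own buffered write
        const auto& word = mem_.words[addr];
        reads_.emplace(addr, word.version);        // keep the first version seen
        return word.value;
    }

    // Writes stay in the local store until commit.
    void write(std::uint32_t addr, std::uint32_t value) { writes_[addr] = value; }

    // Write back only if every word read is still at the version we
    // saw; otherwise discard the transaction and report failure.
    bool commit() {
        bool ok = true;
        for (const auto& r : reads_)
            if (mem_.words[r.first].version != r.second) { ok = false; break; }
        if (ok) {
            for (const auto& w : writes_) {
                auto& word = mem_.words[w.first];
                word.value = w.second;
                ++word.version;
            }
        }
        reads_.clear();
        writes_.clear();
        return ok;
    }

private:
    VersionedMemory& mem_;
    std::map<std::uint32_t, std::uint32_t> reads_;   // address -> version read
    std::map<std::uint32_t, std::uint32_t> writes_;  // address -> pending value
};
```

Software that receives false from commit would normally retry the
whole transaction from its first read.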


App 1: Trivial demo

To implement a trivial demo, please send a 64-byte datum from
software running on one core to software running on another and back
again with the bits flipped or some other marking operation applied.
The cores should communicate using your implemented ICC protocol.

App 2: Work stealing demo

Work stealing [3] is a load balancing technique where each core keeps
a FIFO of pending work items.  Take a look at the provided
work-stealing code at the path given above.  This is designed to use
ICC x and will have to be adapted if you are using x'.  TO BE COMPLETED.
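
For orientation only, the core of the technique can be sketched in
plain C++ as below.  This is not the provided work-stealing code: the
names WorkItem, CoreQueue and next_item are assumptions, the queue is
made double-ended (the owner takes its newest item, a thief takes the
oldest, as is usual in Cilk-style schedulers), and there is no locking,
so it shows the policy rather than a multi-core-safe implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

using WorkItem = int;   // placeholder for a real task descriptor

struct CoreQueue {
    std::deque<WorkItem> items;

    void push(WorkItem w) { items.push_back(w); }

    // Owner takes the most recently pushed item.
    bool pop(WorkItem& out) {
        if (items.empty()) return false;
        out = items.back();
        items.pop_back();
        return true;
    }

    // A thief takes the oldest item from the other end.
    bool steal(WorkItem& out) {
        if (items.empty()) return false;
        out = items.front();
        items.pop_front();
        return true;
    }
};

// Next item for core `self`: its own queue first, then each other
// core in turn.  Returns false when no work is left anywhere.
inline bool next_item(std::vector<CoreQueue>& cores, std::size_t self,
                      WorkItem& out) {
    if (cores[self].pop(out)) return true;
    for (std::size_t i = 1; i < cores.size(); ++i) {
        std::size_t victim = (self + i) % cores.size();
        if (cores[victim].steal(out)) return true;
    }
    return false;
}
```

On the real platform the queues would live in whatever storage your
ICC mechanism makes visible to other cores, and push, pop and steal
would need to be made atomic with respect to each other.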


App 3: Music synthesiser (optionally using work stealing)

Take a copy of the Music/Audio material that is provided (or use your
own) and partition or extend it so that parts of the work are done on
at least two different processor cores.  The cores should communicate
using your implemented ICC protocol. The 'mixwav' material should be
adapted to be a SystemC model of the audio output DAC that writes a
.wav audio file.  As an extension, the cores should balance their
relative loads automatically, perhaps using the work stealing
algorithm.


Questions for Tick 5:

  1. Did you work in a tiny team? If so, who else was involved and what did
each person do?

  2. Which style of ICC are you going to implement?  Describe your design.

  3. Using B-TLM methodology for timing estimation, you will have to increment
sc_time at one or more points.  Where will you do this, by how much and
what model of queueing contention is embodied (e.g. will you use the 1/(1-p) formula)?


  4. Do you rely on polling for synchronisation? If so, could improvements
be made using interrupts or custom extensions to the processor itself?

  5. What variations in behaviour would you see as the quantum keeping interval
is increased and might these ever affect correctness? (Correctness can be
interpreted to mean that the same sequence of transactions occurs as in an RTL
implementation or it may be interpreted to just mean that the application code,
if well written, produces the same output.) NB: You should consider workloads
whose components have variable execution times.


Questions for Tick 6 (please answer three or more):

  1. Does your design work for N=4?  What tests have you made? Can N
  easily be made larger? What happens as N grows and how would you
  address any problems?

  2. How does your design compare to an alternative system created using bus bridges?
(Think about uniformity of the address space in terms of addressability and
access time.) Would you need to create a 'doorbell' inter-core interrupter?

  3. Do you support full sequential consistency [1]: i.e. is it possible that
data appears to have been sent in a different order from the actual order?  Does
this get relaxed if you increase the 'quantum' in any of the quantum keepers?

  4. Briefly comment on how your answers to the above would vary if you had
selected the other forms of ICC.

  5. The consistent cache approach to inter-core communication is used
in most contemporary PCs, workstations and laptops and it would appear
to have a less complex API than the explicit message passing and
transactional memory alternatives.  However, relaxed memory ordering
in modern implementations is increasing the need for memory fence
instructions, meaning that this form of ICC is becoming more explicit.
Whatever form of ICC you used, could you couple your mechanism to the
quantum keeper(s) so that memory fence, other explicit synchronisation
instructions (e.g. test and set) or those of your own API always see
correct semantics where necessary while allowing the rest of the model
to run in a highly-temporally decoupled way?


Credit Matrix

Tick 5: Design one ICC and one app and answer all of the tick 5 questions.

Tick 6: Implement a design that would qualify as tick 5 and make it work.
Include hardcopy of the output (or fileserver path for .wav files) and
answer three of the tick 6 questions.

Mini-project II: Parts of tick 5/6 can be included in the write up
for research essay II.  This essay will be set in the last session.


References:


[1] http://en.wikipedia.org/wiki/Sequential_consistency

[2] http://en.wikipedia.org/wiki/Transactional_memory

[3] http://en.wikipedia.org/wiki/Cilk#Work-stealing

END

-----------------------------------------------------

Questions arising:

> I was trying to implement a MESI protocol for the private caches of
> the orp cores and I was wondering how am I going to invalidate copies
> of a block from (let’s say) core A to all others.  I probably need to
> use a transaction but it’s not very clear to me how to do this.  Will
> it be a tlm_generic_payload? Should I declare it as a write
> (trans.set_write)?

This is a key part of the exercise and it requires creative thought...


For high-level ESL modelling, we may not wish to model the details of
the invalidate mechanism, so an implementation that has little to do with
reality could be used.  If we wish to do more-detailed timing and
contention modelling, then something that resembles the physical
structure is needed so that the contention can be measured.  But
if the physical structure is not a limitation on system performance
then contention in it will not affect the timing results.

For up to four cores, the snoopy bus protocol is commonly used, and
even this is not a bad way to do 16 cores as a two-level tree.  (Only
four cores are needed for the working tick.)


So perhaps some sort of snoopy bus model is a good start.  Most
abstractly, you could make every cache install a callback in a data
structure that represents a snoopy bus and implement the invalidates
using an iterator that calls every callback in turn, mirroring the
broadcast nature of the bus.  You could use a TLM generic payload
between appropriate TLM sockets for this, or you can just use simple
C++ object calls of your own invention (perhaps using the C++ boost
library to assist with passing pointers to object methods).
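
A minimal sketch of the callback approach in plain C++ follows.  It
uses std::function rather than boost, carries no timing, and the names
SnoopBus, attach and invalidate are assumptions made for this example
rather than anything in the provided platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Abstract snoopy bus: a list of callbacks, one per attached cache.
class SnoopBus {
public:
    using InvalidateFn = std::function<void(std::uint32_t line_addr)>;

    // Each cache registers once; the returned id identifies it as
    // the originator of later invalidates.
    std::size_t attach(InvalidateFn fn) {
        listeners_.push_back(std::move(fn));
        return listeners_.size() - 1;
    }

    // Broadcast: call every callback in turn, except the originator's.
    void invalidate(std::size_t sender, std::uint32_t line_addr) {
        for (std::size_t i = 0; i < listeners_.size(); ++i)
            if (i != sender) listeners_[i](line_addr);
    }

private:
    std::vector<InvalidateFn> listeners_;
};
```

A cache would typically attach a lambda that captures its own this
pointer and marks the matching line invalid.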

To make the model less abstract you could account for the time taken
by an invalidate operation.  You could dynamically measure the load
at a contention point and use the 1/(1-p) formula to simulate queuing
or you could just block the threads and get actual queueing.
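
The 1/(1-p) approach amounts to scaling the uncontended service time by
the measured utilisation p of the contention point.  The plain C++
sketch below shows the arithmetic; the function names and the 0.99 cap
on p are assumptions made for this example.  In a B-TLM model the
result would be added to the delay argument of b_transport.

```cpp
#include <cassert>
#include <cmath>

// Utilisation estimate for a contention point: time it was busy
// divided by the length of the measurement window.
inline double utilisation(double busy_ns, double window_ns) {
    return window_ns > 0.0 ? busy_ns / window_ns : 0.0;
}

// Scale an uncontended service time by 1/(1-p).  p is clamped to
// [0, 0.99] (an assumed cap) so the delay stays finite as p -> 1.
inline double contended_delay_ns(double service_ns, double p) {
    const double max_p = 0.99;
    if (p < 0.0) p = 0.0;
    if (p > max_p) p = max_p;
    return service_ns / (1.0 - p);
}
```

For example, a 10 ns operation on a bus that is busy half the time
would be charged 20 ns, and 40 ns at 75% utilisation.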

The main alternative to the snoopy cache is the directory-based
system and I think this could fairly-readily be coded up in TLM style.


>1. I can either use the asm() function or a C pointer to access the memory.
>Is there any restriction you expect?

Just use something like
   *((volatile int *)0xFFEEDDCC) 

Using this C pointer form is easier than embedded assembler.  There
are no restrictions that spring to mind.
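
Wrapped up as helpers, the pointer form might look like the sketch
below.  The names reg_read and reg_write are assumptions, and
0xFFEEDDCC above is only a placeholder address.  The volatile
qualifier stops the compiler from caching or removing the accesses.

```cpp
#include <cassert>
#include <cstdint>

// Write a 32-bit word to a memory-mapped location.
inline void reg_write(std::uintptr_t addr, std::uint32_t value) {
    *reinterpret_cast<volatile std::uint32_t*>(addr) = value;
}

// Read a 32-bit word from a memory-mapped location.
inline std::uint32_t reg_read(std::uintptr_t addr) {
    return *reinterpret_cast<volatile std::uint32_t*>(addr);
}
```

Taking the address as an integer means the same helpers can be
exercised on a host machine by passing the address of an ordinary
variable.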

>2. When I declare a local variable, I noticed that its address is
>65520. I am confused because we should only be in the range of 0 to
>4000?

A local variable will be allocated on the stack and its address
depends on the base value loaded into r1 at the start of the run-time
system (crt.S) and on how much calling/nesting is currently active.
One version of the crt that I provided offsets the stacks of each core
by the core number shifted left by a small amount.  You would have
to increase this shift if the stack is heavily used, and no such shift is
needed at all if the stacks are in local RAMs, as is the case in one of the
examples I provided.  Static variables, on the other hand, are allocated
successive addresses starting from the value passed in to the link
editor and again you can arrange for these to be in shared or local RAM.
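
The per-core stack offset is simple arithmetic, sketched below in
plain C++.  The function names are assumptions, and so is the
direction of the offset: the sketch subtracts it from a shared top
address on the basis that stacks grow downwards, which may not match
the provided crt.S.

```cpp
#include <cassert>
#include <cstdint>

// Initial stack pointer for a core: the shared top address minus
// (core_id << shift).  Direction of the offset is assumed.
constexpr std::uint32_t stack_top_for_core(std::uint32_t shared_top,
                                           unsigned core_id,
                                           unsigned shift) {
    return shared_top - (static_cast<std::uint32_t>(core_id) << shift);
}

// Space each core has before it runs into its neighbour's stack.
constexpr std::uint32_t stack_bytes_per_core(unsigned shift) {
    return std::uint32_t{1} << shift;
}
```

With a shift of 10, for instance, each core gets 1024 bytes, which is
why a heavily used stack needs a larger shift.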

> 3. Is there any way in software to control which part of program
> runs in which CPU (like thread)?

The mixbug file shows how to call a different main according to which
processor core you are on: it just branches according to the core ID
at the start of its 'main' code.  Alternatively, you can compile
separate programs for each core (provided they have different RAMs addressed
at their reset vector (address 0x100)), but this tends to be more tedious.
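
The branch on core ID can be expressed as a small dispatch table, as
in the plain C++ sketch below.  The function names are hypothetical
and the core ID is passed in as a parameter, since how it is read on
the real platform is not shown here (see the mixbug source).

```cpp
#include <cassert>
#include <cstddef>

using CoreMain = int (*)();

// Hypothetical per-core entry points; the return values are just
// markers for this example.
inline int main_core0()  { return 100; }
inline int main_worker() { return 200; }

// Select the entry point for a core.  Cores beyond the end of the
// table all share the last entry.
inline int dispatch(unsigned core_id) {
    static const CoreMain table[] = { main_core0, main_worker };
    const std::size_t n = sizeof(table) / sizeof(table[0]);
    return table[core_id < n ? core_id : n - 1]();
}
```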

> 4. Is there any printf function that I can use, since it is using
> "or32-elf-gcc" without the standard IO library of GCC?

Yes, the full gcc library is present in the lib directory and it
should work fine (the bootable linux image uses it) but I am unsure
about how that version of libc does console/uart input and output so
it is probably easier to start with the mixbug example code that
uses its own simplistic printf functions provided in
/usr/groups/han/clteach/orpsoc/sw/mixbug-or1k/prlibc.c