ACS SOC D/M P35: Ex 4 A/B 2017/18:  Mini-Project and Structured Research Essay.

The deadline for all P35 work is the first day of Easter Term.


Notes:

Please ensure you have completed earlier exercises and feel free to
reuse text or results from earlier exercises for the Mini-Project (4a).

Collaborating is not allowed for the Research Essay and is only
allowed for any parts of the mini-project that are borrowed from the
term-time work or with express permission that will only be granted if
the nature of the collaboration will enable individual contributions
to be clearly discriminated.


Your audience is the External Examiner, Second Assessor
and readers of Design and Reuse or Electronics Times.  It is therefore
worthwhile explaining material that would perhaps be well known to others
directly involved in this module.

Please feel free to contact DJG as much as you like for assistance and
advice with Exercise 4 A/B over the Easter Vac.


---------------------------------------------------------------------
Exercise 4a: See companion sheet.

---------------------------------------------------------------------

Exercise 4b: Structured Research Essay Task

The exercise is to write an essay.  The essay must be structured
as described here for ease of marking, but it is not sufficient to
simply follow the structure. One quarter of the credit is reserved
for overall coherency of the argument.


Please note, an essay argues a point: it is not just a set of facts
delivered in your own prose.  An essay should have a title and include
at least a pair of paragraphs at the start and end that serve as an
introduction and conclusion.  As well as these components, your essay
will consist of FIVE main sections whose headings are taken from the
list below.  Each section should consist of approximately 400 to 800
words of text and an appropriate number of diagrams.  Make sure it
is totally clear which section in your essay relates to which heading 
below.


Use the materials on the reading list, recent articles in
Design+Reuse, EDA Cafe and EE Times, the undergraduate lecture notes for System
on Chip, the more-advanced slides lectured for P35 and your own
research. You may assume the reader has just read your 4a submission.

Your essay can simply be titled 'System On Chip Design and Modelling'
but it is better to choose a title that relates to the argument you
are making.  

You may refer to Ex 4a in Ex 4b but do not repeat the same arguments since
no further credit will be awarded for them. It is best to avoid in 4b any
section heading that directly aligns with what you did in 4a.

The available section headings are now listed.  Please choose FIVE and
place them in the best order to construct your overall argument. Along with
each heading I have put some suggestions for what to discuss, but you
may vary these.

1. Architectural Exploration (AE) (and ease of design re-partition).

   Explain the need for AE.  Describe techniques where a target
   application can be 'run' (i.e. explored) on SoC models that vary in
   their level of abstraction and show how having a lower-level model
   makes architectural changes, such as changing the number of
   processors, memories or bus/network-on-chip structures more
   difficult.  Mention the relative roles of ISS and cycle-callable
   models of processor cores and cache systems.  Discuss various
   abstraction levels for modelling of processor cores, including
   cross compilation of firmware for the modelling workstation through
   to cycle-accurate models of every target instruction.  Mention
   styles for modelling caches (e.g. is the hit ratio estimated or
   measured ?).

2. ABD: Use of assertions and temporal logic in SoC modelling.

   Using several example assertions that are potentially relevant at
   several levels of abstraction, show how they are re-applied at each
   level or mention whether this is not always the case.

3. Network On Chip (NoC)

   Explain how and why the old idea of a bus has almost always now
   been replaced with something more like a NoC.  Why are tri-states
   not used?  What must be done to contact 'the other side of the
   chip'?  What techniques are used for flow control?  What set of
   transactions is typically needed?  How important is read and write
   ordering?  How does performance scale with chip size?
 

4. Statistics collection and modelling contention and queuing.
   Show how performance can be estimated or measured using SoC models
   at various levels of abstraction according to how many of the
   contended resources, such as memory ports or bus bridges are
   modelled and the style of modelling used for them (e.g. actual
   queues versus estimated queuing delays). Perhaps review Sniper,
   Gem5, Prazor/VHLS, ZSim, Qemu and other high-level virtual
   platforms in terms of the modelling approximations they use to
   efficiently exploit a multi-core workstation to model a multi-core
   SoC.


5. Virtualization of embedded software.
   Consider how firmware and high-level models of devices should be
   modified to run an application without any hardware model
   (i.e. with direct calls between h/w and s/w components) compared
   with its final form. Include an interrupt service routine.  For
   example, the ESL slide pack illustrates using direct calls between
   device drivers compared with abstract and concrete bus/NoC models.
   

6. Clock frequency and power consumption modelling.
   Discuss how well a high-level SoC model can be used to
   estimate system clock frequency (i.e. critical path) and
   power consumption (including dynamic frequency and voltage 
   scaling) compared with pre-synthesis, post-synthesis and 
   post-layout RTL models. What accuracy should be expected?

7. The role of high-level synthesis and synthesis from formal
specifications in SoC design flow.
   What is HLS? Is it mature? Can everyday software be compiled?
   How is time/space best traded off? Does a sequencer reduce
   parallelism? Is generating systolic arrays helpful? What memory
   packing decisions could be directed by profiling?
   Explore how part of your example design could have looked
   if synthesised from a higher-level form. (You may have included
   this anyway.)

8. Evaluation and automatic generation of glue logic and/or SoC bus/NoC.
   (We did not discuss this topic this year but it was the main topic in 2015/16.)
   Explain how the components of your example design are
   connected to each other at the various levels of abstraction
   and perhaps discuss the potential for automatic generation of 
   address maps (via IP-XACT) and automatic synthesis of the joining code or logic.
   (Perhaps refer to http://www.cl.cam.ac.uk/~djg11/pubs/joining-fdl10 or
    Google for 'network on chip traffic generator')

9. Relative role of FPGA, CGRA and ASIC in contemporary products. 
   Tape-out mask costs are ever increasing. FPGA has now eaten all of
   the low-volume ASIC market and the Structured ASIC has died.  Which
   sectors still do SoC design ? Is the CGRA going to be useful?  What
   style of re-configurable computing is best for embedded devices and
   what is best for HPC (high-performance computing) server blades?
   What impact is FPGA in the cloud going to have?  What is a good mix of
   hard and soft area on FPGA die?

10. Near-data processing, processing in memory (PIM), and other accelerator architectures.
   Wimpy versus strong cores.  Custom accelerator architecture.  Should accelerators  
   be put on the same silicon as memory? Or should they be bumps in the wires so
   that acceleration is achieved while the data is already on the move?

11. An appropriate topic of your own choosing.


END


---------------------------------------------------------------------------
4b: Further Notes arising this year (March/April 2018):  - This will be updated in response to email interactions


The original marking scheme for the P35 module, as reflected on Moodle, was
    3a 0
    3b 20
    4a 30
    4b 40

But as some people have noticed, this has been changed to
        original  revised
    3a     0         5
    3b    20        25
    4a    30        30
    4b    40        30
The explanation for this is that the 20 marks for exercise 3 were not sufficient to reflect
the different levels of sophistication in different implementations.  Also, in previous years, more
material was lectured and so there was more to examine in 4b.

Note that you only need take note of the general area or title of each of the points above: the detailed
questions that follow are just suggestions for what to discuss under that heading.


Note that DJG is happy to review ONE draft of each of 4a and 4b, provided it is sent in advance of the final
deadline by a reasonable margin, and give brief feedback regarding any noteworthy points that are immediately apparent.

---------------------------------------------------------------------------

Questions Previously Arising (clarification points from email exchanges from previous years.)

In the above, by 'various levels of abstraction' we refer to ESL models spanning:
  1. Application software and device drivers with no hardware model at all,
  2. High-level TLM modelling, loosely timed, with no models of bus or
     network structures,
  3. Lower-level TLM modelling that accurately models contention points,
  4. Cycle-accurate modelling.
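
As a rough illustration of the difference between levels 2 and 3, the
following C++ sketch contrasts a fixed-latency memory model with one that
treats its single port as a contended resource. It is not taken from the
course materials: the class names, the single-port assumption and the
latency figure are all hypothetical, and a real TLM model would normally be
written against the SystemC TLM interfaces rather than plain C++.

```cpp
#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;  // simulated time in arbitrary units

// Level 2 style (hypothetical): every access costs the same fixed latency,
// no matter how many initiators are active. Contention is not modelled.
class LooselyTimedMemory {
public:
    explicit LooselyTimedMemory(Time latency) : latency_(latency) {}
    // Returns the completion time of an access issued at issue_time.
    Time access(Time issue_time) const { return issue_time + latency_; }
private:
    Time latency_;
};

// Level 3 style (hypothetical): the single port is a contended resource.
// The model remembers when the port next becomes free, so a request that
// arrives while the port is busy waits, and the waiting time is recorded.
class ContendedMemoryPort {
public:
    explicit ContendedMemoryPort(Time latency) : latency_(latency) {}
    Time access(Time issue_time) {
        Time start = std::max(issue_time, busy_until_);
        total_wait_ += start - issue_time;  // queuing delay for this access
        busy_until_ = start + latency_;
        return busy_until_;
    }
    Time total_wait() const { return total_wait_; }
private:
    Time latency_;
    Time busy_until_ = 0;
    Time total_wait_ = 0;
};
```

With a latency of 10 and two requests issued at times 0 and 2, the first
model reports completions at 10 and 12, whereas the second reports 10 and
20 and records 8 units of queuing delay.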


If you state that a given approach is more suitable for FPGA than ASIC
please explain why, since many aspects of the design flow are normally similar.

When you write a phrase like 'higher-level model' the hyphen should be present.  Grammatically this is a compound adjective.



---------------------------------------------------------------------------


Q. In question one you use the term cycle-callable. Is this the same
as cycle-accurate?

A. A cycle-callable model is a cycle-accurate model of a subsystem,
implemented in a non-blocking style, where one clock cycle is executed
for each call (method invocation).
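
A minimal C++ sketch of this style, assuming a simple down-counting timer
as the subsystem, is given here. The class and method names are
hypothetical and are not taken from the course notes. Each call to
clock_tick() advances the model by exactly one clock cycle and returns
without blocking, so the caller decides when the clock advances.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cycle-callable model of a down-counting timer.
class CycleCallableTimer {
public:
    // Request a new count; it is registered on the next clock_tick().
    void load(std::uint32_t value) {
        assert(value > 0 && "a zero count would never expire");
        pending_load_ = value;
        load_req_ = true;
    }

    // Execute exactly one clock cycle and return immediately.
    void clock_tick() {
        if (load_req_) {
            count_ = pending_load_;
            load_req_ = false;
            expired_ = false;
        } else if (count_ > 0) {
            --count_;
            expired_ = (count_ == 0);
        }
        ++cycles_;
    }

    bool expired() const { return expired_; }
    std::uint32_t count() const { return count_; }
    std::uint64_t cycles() const { return cycles_; }

private:
    std::uint32_t count_ = 0;
    std::uint32_t pending_load_ = 0;
    bool load_req_ = false;
    bool expired_ = false;
    std::uint64_t cycles_ = 0;
};

// Helper: advance the model by n clock cycles.
inline void run_cycles(CycleCallableTimer& timer, unsigned n) {
    for (unsigned i = 0; i < n; ++i) timer.clock_tick();
}
```

A blocking-style model of the same timer would instead contain its own
loop with a wait on the clock inside it, which is why it needs a simulator
thread, whereas the cycle-callable form can be driven from any ordinary
loop.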


Q. Looking at the Part II course notes, does cross compiling the firmware
for the modelling workstation represent the "Functional Modelling"
level of abstraction?

A. In broad terms yes. My notes define this term to mean the output of
the simulation is correct.  It implies that the same algorithm is used
to arrive at that output as well. Cross compiling firmware should lead
to correct output but also models further aspects of the
implementation beyond those needed to just get the output correct. For
instance, the UML diagram of the class instances used by the
cross-compiled code would be the same as the actual implementation
whereas there could also be different implementations that still use the
same algorithm (e.g. minor variations in record field structure, a
different calling pattern between methods or executing on a different
number of CPU cores).

Q. Would I be correct in saying that if you cross compile the firmware
for the workstation then there is actually no model of the processor
core at all?

A. Yes, but one can still profile the code using gprof, valgrind,
oprofile or whatever to find out how many instructions it used and
method calls it made to get some idea of what the target processor
would have consumed.  Using oprofile you can get cache hits and misses
and other details.  This might be useful at the very early stages of
system development (e.g. for a new data coding scheme like low-density
parity checks or candidates for 4G mobile telephony or a replacement
for DES) to understand what class of core or number of cores are going
to be needed and to estimate the basic cost of a product based on this
technology.

Q. In question 5 you've asked for a description of an interrupt service
routine. Have you any instructions on how to write an ISR for the
or1k?

A. There is an example of an interrupt service routine in my notes and
this is the same as the SystemC UART with interrupts
(/home/djg11/example-uart/example-uart-with-interrupts/) I have not
made this work on the OR1K personally (I did not get as far as finding
the OR1K instruction to enable interrupts, which needs adding to the
crt.S).

The Linux kernel compiled for OR1K uses them of course, so perhaps
look in a kernel source folder such as  linux-2.6.24/arch/or32/kernel
for real OR1K interrupt handlers.



Q. "Show how part of your example design could have looked
   if synthesised from a higher-level form. (You may have included
   this anyway.)"  By this do you mean, say, how would the Ethernet TLM model
look if it had actually been implemented in, say, C# with Kiwi attributes?

A. This question potentially covers a great deal of ground.  For
instance, you might generate a protocol or packet checker by compiling
a formal spec to include in the system RTL.

You could also cite work regarding synthesis of memory maps, bus
structures and other glue logic needed to connect parts together.

Perhaps the most obvious thing to do is to talk about compiling a
behavioural model of the subsystem into synthesisable RTL for the
target implementation, commenting on what is likely to work (or even
trying out one or two experiments on KiwiC or one of the online
C-to-gates servers).

Considering TLM, which you mention, manually-coded TLM models of
devices are very much like high-level behavioural models written
specifically for synthesis by C-to-gates flows.  So if comparing these
two forms you would mostly comment on what parts of the TLM model can
and cannot be expected to be synthesisable to RTL implementations.
There are a number of research papers on this but no accepted
standards.

Generating a TLM model from Kiwi is not something I have considered:
although KiwiC can generate SystemC output, this is RTL-style
code, not TLM code.  I guess when you say the 'ethernet TLM' you
really mean the Ethernet synthesisable RTL implementation that I spoke
of above under 'most obvious thing to do'?


END.