ACS SOC D/M P35: Ex 4 A/B 2016/17: Mini-Project and Structured Research Essay.

The deadline for all P35 work is the first day of Easter Term.

Notes: Please ensure you have completed the earlier exercises, and feel free to reuse text or results from them for the Mini-Project (4a). Collaboration is not allowed for the Research Essay. For the Mini-Project it is allowed only for parts borrowed from the term-time work, or with express permission, which will be granted only if the nature of the collaboration enables individual contributions to be clearly distinguished.

Your audience is the External Examiner, the Second Assessor and readers of Design and Reuse or Electronics Times. It is therefore worthwhile explaining material that would perhaps be well known to others directly involved in this module.

Please feel free to contact DJG as much as you like for assistance and advice with Exercise 4 A/B over the Easter Vac.

---------------------------------------------------------------------

Exercise 4a (accounts for 30 credit points): Mini-Project

Construct an interesting argument based on practical work you have conducted using design tools for FPGA and System-on-Chip embedded software and accelerators. This will most likely contain an evaluation of the group mini-project from last term, but you need to make perfectly clear what your own contribution to the work is, and any measurements must be your own work.

Example arguments for 4a are:

1. Accelerating our project application saves energy because ...
2. Using a virtual platform was a good idea because ...
3. Having determined that the performance looked good using a simple FPGA experiment, we can explore rack-scale and/or custom silicon performance using ...

But you may expound on any sensible and interesting result from your practical work.

Write a report in a style suitable for publication in Electronics Times or Design and Reuse (or similar). You should aim to write at least 2000 words, but full credit is available if information is instead conveyed in diagrams and figures. All of the words must be your own work, but diagrams from any source may be included if credited properly. Your argument itself does not have to be original: basing your report on an existing D&R or ET article is acceptable.

Most importantly: think carefully about your report structure. Cite relevant prior work or alternative solutions. It is generally easiest to use a provocative title that poses a question, then expand on the question in the introduction and answer it at the end. Feel free to ask DJG for further pointers on specific topics.

---------------------------------------------------------------------

Exercise 4b: Structured Research Essay

Task: See companion sheet.

---------------------------------------------------------------------

4a: Further Notes arising this year (March/April 2017)

- This section will be updated in response to email interactions.

Q. In our current implementation, the device sets a "finished" flag to 1 once it finishes the calculation, and the ARM keeps polling this address to see the state of the flag. This works fine when compiler optimisation is turned off (-O0), but not when it is on (-O2). I am guessing the ARM is caching the value at this address after the first query and therefore cannot see the update from the device. Does this seem a reasonable guess? Is there any workaround other than turning off the cache?

A. You don't say whether this is on the real or the virtual platform - the means of implementing cache bypass differ slightly between the two. Prazor bare-metal programs simply exploit Prazor's UNCACHED_ADDRESS_SPACE64 macro, but when the MMU is enabled it takes over from the macro, as on the real hardware. If the compiler optimisations are having the effect you report, then it is not a matter of caching anyway, but of compiler code generation. Have you used the keyword volatile, as in the examples in my notes?

  int a = *((volatile int *)0xblah);

This will stop the compiler caching the result in a register. Check the assembler generated by the C compiler using objdump -d on the object file - you should see whether the compiler has hoisted the poll out of your spinning loop.
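As a concrete sketch of the advice above, a minimal bare-metal polling loop might look like the C fragment below. The register address and flag encoding are invented placeholders, not the ones used in the practical work.

  #include <stdint.h>

  /* Hypothetical address of the accelerator's "finished" flag register
     (placeholder value - substitute the address from your own design). */
  #define FINISHED_FLAG_ADDR 0x43C00004u

  static void wait_for_device(void)
  {
      /* volatile forces a fresh load on every iteration, so -O2 cannot
         hoist the read out of the loop or keep it in a register.       */
      volatile uint32_t *finished = (volatile uint32_t *)FINISHED_FLAG_ADDR;
      while (*finished == 0)
          ;  /* spin until the device writes 1 */
  }

Applying objdump -d to the compiled object should then show a load instruction inside the loop body rather than a single load hoisted above it. Remember that volatile only constrains the compiler: on the real board the flag's address must also be mapped uncached (or the data cache maintained explicitly) for the core to see the device's write.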
Q. How can I estimate gate counts, area and energy?

A. This is in my lecture notes somewhere - but I cannot find it just now ...

Estimating Area and Energy values to insert in a high-level model of a peripheral.
----------------------------------------------------------------------------------

You need to choose a target technology: ASIC and FPGA will differ in area by a factor of roughly 100. Let us select an ASIC in 22 nm with a 1 volt supply and 7 layers of metal wiring.

Although silicon dioxide has a relative permittivity (e_r) of about 3.8, meaning the speed of light in it is divided by about 2, the resistance of the tracks degrades this quite a bit further. So you can assume that the effective propagation speed for logic values on the wiring layers is roughly 0.1 times the speed of light, i.e. about 30 metres per microsecond.

The wiring capacitance for a pair of conductors surrounded by silicon dioxide (permittivity about 4), whose width is the same as their spacing, is pi * e_0 * e_r Farads per metre, which is roughly 100 pF/m. For VLSI the equivalent units of 0.1 pF/mm or 0.1 fF/um are more useful. A 1 mm conductor will therefore have a capacitance of 0.1 pF and, with a one-volt supply, will dissipate 0.1 pJ over a full charge and discharge cycle. Note that toggle rates are twice activity ratios, so do not count the dynamic energy for both the charge and the discharge!

This figure is hardwired into the pw_tlm_payload.cpp file, but there are two minor mistakes that largely cancel. That file wrongly multiplies the energy by 0.5 and should be changed not to do so, since that energy is dissipated on both the charge and the discharge of the nets. It also uses a figure of 0.3 pF/mm, based on the permittivity of silicon being 12; in reality, conductors in VLSI are mostly surrounded by silicon dioxide and perhaps silicon nitride (permittivity 8). Overall the computation seems accurate enough in its errored form.

A small logic gate, such as a 2-input NAND gate, will have dimensions of about 22x13 lambda, and so occupies about 0.14 square microns in area for lambda = 22 nm. Note there are 1e6 square microns in a square millimetre.
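As a worked example of these figures, the short C program below folds the 0.1 pF/mm and 0.14 square-micron rules of thumb into a wire-energy and gate-density estimate. The 1 mm net length and 100 MHz toggle rate are made-up illustrative inputs, not measurements.

  #include <stdio.h>

  /* Rule-of-thumb constants from the notes above (22 nm ASIC, 1 V supply). */
  #define WIRE_CAP_PF_PER_MM  0.1    /* roughly 100 pF/m                   */
  #define SUPPLY_VOLTS        1.0
  #define NAND2_AREA_UM2      0.14   /* 22x13 lambda at lambda = 22 nm     */

  int main(void)
  {
      double net_mm     = 1.0;    /* example: a 1 mm on-chip net           */
      double toggles_hz = 100e6;  /* example: 100 MHz toggle rate          */

      double cap_pf = WIRE_CAP_PF_PER_MM * net_mm;

      /* Energy for a full charge-and-discharge cycle: E = C * V^2.
         The toggle rate counts both edges, so halve it to get cycles.     */
      double e_pj     = cap_pf * SUPPLY_VOLTS * SUPPLY_VOLTS;
      double power_uw = (e_pj * 1e-12) * (toggles_hz / 2.0) * 1e6;

      printf("C = %.2f pF, E = %.2f pJ per cycle, P = %.1f uW\n",
             cap_pf, e_pj, power_uw);
      printf("Roughly %.1f million NAND2 equivalents per square mm\n",
             (1e6 / NAND2_AREA_UM2) / 1e6);
      return 0;
  }

For these inputs the program reports 0.10 pF and 0.10 pJ per cycle, matching the figures quoted above, and about 5 uW for the assumed toggle rate.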
Broadside Arith/Logic Operator Equivalent Gate Count: Rules of Thumb for Area Estimation (W = word size)

  W-bit register - six times the number of bits in the register
  Shift by constant, left or right - no logic
  AND or OR - number of bits in the word times 1
  XOR - number of bits in the word times 1.5
  Adder/subtractor - sum of the number of non-constant bits in the two operands times 2.5
  Multiply/divide by constant - number of non-zero bits in the constant multiplied by the word width multiplied by the adder width
  Multiply two operands (flash) - longer operand times shorter operand times 2.5
  Divide two operands (fixed or variable-latency long division) - four registers plus two adders plus 100
  Static RAM (small RAMs, WxL < 1000 bits) - 600 square lambda per bit
  Static RAM (larger RAMs, WxL < 12800 bits) - 400 square lambda per bit

END

Q. In the TLM model, if my custom device takes some time to process a task, should I put the task in the blocking transport function, estimate the processing time of my logic and add it to the delay variable? If so, is there a guideline I can follow for this estimation? Otherwise, should I use something like a separate SC_THREAD?

A. A separate SC_THREAD may be a good idea for a device that does a good deal of independent processing, especially when transaction order needs to be preserved. But for basic target devices that simply process slave operations in order and do not initiate any of their own, the target should be able to just augment the delay field in the transaction to model a slow response. This coding style is appropriate when, in reality, the bus resources between the initiator and the target would be held up for the entire transaction. Many target devices are not like that: the individual bus transactions on them operate at normal speed, taking just one bus clock cycle, but the initiator has to wait for completion using polling or interrupts. So it depends on the type of concurrency that is going on.

However, there was a bug in Prazor, finally fixed just two weeks ago, where large values added to the TLM delay parameter when accessing a device upset the performance modelling of instructions being processed by the CPU core at the same time. This matters for out-of-order cores only, and the ARM Cortex-A9 on the Zynq board is pretty much in-order. "Large" means likely to cross a loosely-timed quantum boundary. The fix, which is a little temporary at the moment, involved replacing the delay variable in the transaction with a pair: the delay and the kernel time it is relative to. This pair is now kept in the generic payload itself and is called lt_delay. I will post this information into the Prazor Temporary Reference Manual and provide a pointer to a rebuilt version shortly. If you get a new copy of Prazor from Phabricator it uses this coding style. The augment to the delay should be done with the lt_delay override of the plus operator; see src/tenos/lt_delay.h.

Also, when I last checked, it was necessary to compile with TLM_POWER3 turned on and with a hack in TLM_POWER3, as reflected in /usr/groups/han/clteach/tlm-power3/include/pw_tlm_payload.h, that disables the definition of pw_base_protocol_types that is currently redefined in prazor.h:

  #ifndef TEMP_PT_BYPASS
  struct tlm_pw_base_protocol_types1
  {
    typedef PW_TLM_PAYTYPE tlm_payload_type;
    typedef tlm::tlm_phase tlm_phase_type;
  };
  #endif

Hope that helps!
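For the common case of a simple in-order slave, the standard TLM-2.0 loosely-timed coding style looks roughly like the sketch below. It uses the stock sc_time delay argument of b_transport rather than Prazor's new lt_delay pair (for which see the next answer), and the module name, socket name and 50-cycle processing latency are invented for illustration.

  #include <systemc>
  #include <tlm>
  #include <tlm_utils/simple_target_socket.h>

  // Sketch of an in-order slave that models its processing time by
  // augmenting the delay argument of the blocking transport call.
  struct my_accel : sc_core::sc_module
  {
      tlm_utils::simple_target_socket<my_accel> port0;
      sc_core::sc_time clock_period;

      SC_CTOR(my_accel) : port0("port0"), clock_period(4, sc_core::SC_NS)
      {
          port0.register_b_transport(this, &my_accel::b_transport);
      }

      void b_transport(tlm::tlm_generic_payload &trans, sc_core::sc_time &delay)
      {
          // ... decode trans.get_address() and perform the read or write ...

          delay += clock_period;      // one bus cycle for the access itself

          // If the access by design stalls the initiator until the result
          // is ready, add the estimated processing time too - here a
          // ballpark guess of 50 clock cycles.
          if (trans.is_write())
              delay += 50 * clock_period;

          trans.set_response_status(tlm::TLM_OK_RESPONSE);
      }
  };

With the newer Prazor coding style described in the next answer, the same increment would instead be applied to the lt_delay field carried in the payload.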
By "target device should be able to just augment the delay field in the transaction", do you mean the delay field will be incremented automatically by the TLM module, or I need to estimate the amount of delay and add it to the delay field? What guideline shall I follow if I need to do the estimation? The code for a peripheral device should augment the delay field in the TLM transaction, just like in the Toy ESL classes. For most simple devices, the augment is just one clock cycle, such as 4 ns. But for a slow device which will stall the initiating core by design, it will be much more. The amount of increment in a high-level model will be an estimate: perhaps just a ballpark guess of the number of clock cycles. If you do not have a low-level model then you can do no better. If you have also coded the design in RTL, you can measure the performance of the low-level design on an RTL simulator or count the clock cycles by hand. With the changes to Prazor earlier this month, you must increment the field in the payload and not the field in the transaction socket callback. For example, uart64_cbg at line 485 now uses the following macro to increment the delay AUGMENT_LT_DELAY(trans.ltd, delay, latency) which can be adjusted in prazor.h to augment the new or old delay variable. But I think I will only be using the new field from now on. BUG: A bug in the switch to the new coding style: Class variable were being masked by redeclaration as locals. @@ -487,8 +489,8 @@ void armcore_tlm::run() } - lt_delay lt_i_delay = master_runahead; - lt_delay lt_d_delay = master_runahead; + lt_i_delay = master_runahead; + lt_d_delay = master_runahead; if (reset_or_yield_countdown > 0) reset_or_yield_countdown -= 1; else { Q. We planned to extend the work in 3b to a chained matrix multiplication program, like the form Y=(A*X+B)*C+D, and we planed to divided the work just as we did on 3b, i.e. I work on TLM, Aaron works on RTL, Chris works on energy/timing statistics collection, is this kind of collaboration allowed? If yes, can I use Chris's statistics in my report directly if I reference that? Extensions to the practical work of 3b sound like a good idea and can be shared. It is important that this is all made clear in your write up. The analysis and reports for Exercise 4 should be your own individual work and where this is based on the shared work of Exercise 3 again needs to be clear. As you recall, Exercise 3 is about getting the shared practical work into working order and having confidence in it. If another member of the group has done some in-depth analysis of the performance they will probably be using this for their own Exercise 4a write up. You need to make your own contribution to gain credit in Exercise 4a. You can certainly quote someone else's work but you will not be awarded Exercise 4 credit for it. --------------------------------------------------------------------------- 4a: Further Notes arising from previous years. Q. How about the length of 4a? The articles on design and reuse vary from very short to quite lengthy. A. I agree they vary a lot. For your masters degree you need to show that you have 'mastered' a particular subject, and this does not directly relate to some word count. Also I've tried not to be overly prescriptive with exactly what gets written up where, but being underly prescriptive is also unhelpful. And the word count will vary greatly according to how many diagrams used to tell the story. Overall about 5 pages for each of Ex 4a and Ex4b (total 10) should be sufficient. 
Q. We planned to extend the work of 3b to a chained matrix multiplication program of the form Y = (A*X + B)*C + D, and we planned to divide the work just as we did for 3b, i.e. I work on the TLM, Aaron works on the RTL and Chris works on energy/timing statistics collection. Is this kind of collaboration allowed? If so, can I use Chris's statistics directly in my report if I reference them?

A. Extensions to the practical work of 3b sound like a good idea and can be shared. It is important that this is all made clear in your write-up. The analysis and reports for Exercise 4 should be your own individual work, and where they are based on the shared work of Exercise 3 this again needs to be made clear. As you recall, Exercise 3 is about getting the shared practical work into working order and having confidence in it. If another member of the group has done some in-depth analysis of the performance, they will probably be using it for their own Exercise 4a write-up. You need to make your own contribution to gain credit in Exercise 4a. You can certainly quote someone else's work, but you will not be awarded Exercise 4 credit for it.

---------------------------------------------------------------------------

4a: Further Notes arising from previous years.

Q. How about the length of 4a? The articles on Design and Reuse vary from very short to quite lengthy.

A. I agree they vary a lot. For your master's degree you need to show that you have 'mastered' a particular subject, and this does not directly relate to some word count. I have tried not to be overly prescriptive about exactly what gets written up where, but being under-prescriptive is also unhelpful. The word count will also vary greatly according to how many diagrams are used to tell the story. Overall, about 5 pages for each of Ex 4a and Ex 4b (10 pages in total) should be sufficient.

Other points you might consider are:

- In the case of IP-XACT, it is not clear how many SoC designs use it, but certainly not all, so there are alternative approaches, including some commercial products.
- Chisel is a new hardware language and rather simple, although elegantly embedded in Scala, which provides considerable metaprogramming power. What is its future? What should Chisel 3.0 aim to include?
- What percentage of industrial high-level modelling uses SystemC, and what else is used?
- Is UML/SysML or GUI-based SoC design serious or toy?
- What would be needed to take your practical work to industry and for it to be adopted?
- What is the complete bundle for IP block distribution? This clearly includes high-level models, actual implementations, data sheets and machine-readable meta-information for energy and area accounting, test-programme generation, software programming, automatic configuration ...

END