//
// SoC P35 Exercise 3b 2013/14 
// 

Exercise 3b: Preliminary investigation of performance, power and area trade-offs, etc.

// Please do the following 

  1. Investigate the execution time and energy use figures for your project work
     described in Exercise 3a and generate a spreadsheet or other table showing their variation
     as parameters change.  You should aim to have three or more parameters and 20 or more
     results in your table.  You should probably write a Makefile or shell script that
     automates the experiments.
      
     Parameters to adjust typically include input data size, number of CPU cores, cache size and
     structure, and parameters specific to your project work.  You may also include other results
     in your table, such as actual simulation time, cache hit ratios, number of DRAM columns
     addressed, etc.

     (Note: the SHA-3 work is unlikely to use multiple cores this year.)

  2. Compare your results to an analytical model of your own creation or results in a published article.
     This should consist of one or more simple formulae that roughly predict the time and energy
     figures you have measured.
     You should include an explanation of the coefficients in your formulae, saying whether you think
     they are reasonable or what else we can learn from them.

  3. Optional: for further understanding, make variations to the model, such as: enable and
     disable the cache(s), vary their allocation policy or clock frequency, add a separate
     I-cache, or change the main memory from DRAM to SRAM.  Your experiments should help confirm
     your analytical model and enable you to conclude whether your selected partition was a
     sensible approach.  Perhaps compare with another partition.
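
The experiments in item 1 can be automated with a short script.  The sketch
below is only an illustration: the parameter names, their values and the
`./run_sim` command line are hypothetical placeholders, not the interface of
the course simulator.

```python
import itertools

# Hypothetical parameter values; substitute those relevant to your project.
PARAMS = {
    "data_bytes": [1024, 4096, 16384],
    "cores": [1, 2],
    "cache_bytes": [8192, 32768, 131072, 524288],
}

def sweep(params):
    """Return one dict per combination of parameter values."""
    names = list(params)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(params[n] for n in names))]

def command_line(point):
    """Build a command for one run. './run_sim' and its flags are placeholders."""
    flags = " ".join(f"--{name}={value}" for name, value in point.items())
    return f"./run_sim {flags}"

points = sweep(PARAMS)
for point in points:
    print(command_line(point))
```

Three parameters with 3, 2 and 4 values give 24 runs, which meets the target
of 20 or more results.  Each printed line could be handed to a Makefile rule
or run directly, with the results appended to a CSV file for the spreadsheet.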


  Full marks will be awarded for a short report (4 pages or so) that
  includes the main spreadsheet with additional plots of power
  consumption and execution speed as measured and as predicted by the
  model.
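
  As an illustration of the kind of analytical model asked for in item 2, the
  sketch below assumes one cycle per instruction plus a fixed stall per cache
  miss, and charges a fixed energy per instruction and per miss on top of
  static power.  The model form and every coefficient value are assumptions
  made for this example; your own formulae and fitted coefficients will differ.

```python
def predict_time_s(instructions, misses, clock_hz, miss_penalty_s):
    """One cycle per instruction plus a fixed stall per cache miss."""
    return instructions / clock_hz + misses * miss_penalty_s

def predict_energy_j(instructions, misses, time_s,
                     e_instr_j, e_miss_j, p_static_w):
    """Dynamic energy per instruction and per miss, plus static power over time."""
    return instructions * e_instr_j + misses * e_miss_j + p_static_w * time_s

# Illustrative numbers only; replace with values fitted to your measurements.
t = predict_time_s(instructions=1_000_000, misses=10_000,
                   clock_hz=200e6, miss_penalty_s=100e-9)
e = predict_energy_j(1_000_000, 10_000, t,
                     e_instr_j=100e-12, e_miss_j=10e-9, p_static_w=0.05)
print(f"time = {t * 1e3:.3f} ms, energy = {e * 1e6:.1f} uJ")
```

  Plotting these predictions against the measured figures for each row of the
  spreadsheet shows directly where the model holds and where it breaks down.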


  Please give in at least a preliminary report by the published deadline
  to get feedback.  Re-submission to get a higher mark is allowed at any
  point up to the final deadline at the start of the Easter Term.

END

Marking Scheme:
                         out of     marks
  Method                      5         0
  Spreadsheet and plots       5         0
  Analytical formulae         5         0
  Discussion                  5         0
  Total                      20         0


---------------------------------------
Points arising this year:

When describing cache size please use units of bytes (not some strange
logarithmic parameter from the benchx.h file). Also give its
associativity in ways (the next parameter passed to the consistent_cache
constructor, defaulting to 8).  In fact, please give a full
description of the memory system(s) you are using.  There is no hard
disk model in use this year.  We are running bare metal with the
program and data initially loaded into DRAM by a process that is not
part of the model.

Note that a smaller cache will use less static energy, but its hit
ratio is lower, causing a rise in DRAM energy which is generally
larger than that saved in the cache.  The University of Maryland DRAM simulator
allocates energy to both the static and dynamic energy accounts, but
the distinction is not very important.  Only the total is important.

Note that cache misses can be classified as compulsory, sharing or
capacity misses.  It is good to use these terms in your report.

Q. How many cycles does the OR1K use?

A. Provided there are no cache misses, the fastiss executes each
instruction in one cycle, including instructions that in reality would
take multiple cycles. There is some commented out code in the fastiss
that partially implements accurate cycle counts.  The verilated
version is cycle accurate.  

Q. What clock frequency does the OR1K run at?

A. It is set up as running at 200 MHz.  An obvious thing to try is
adjusting its clock frequency.  But of course, above a certain
frequency, performance will start to be limited by cache hit access
time.  The current set up of the OR1K does not accurately adjust its
static and dynamic power model with frequency: the static power will
remain constant and the dynamic power will rise linearly.  A more
realistic approach would be to adjust these with the clock frequency
based on the supply voltage or drive strength needed in the gates.
Basically the supply voltage needs to be raised according to a 1.2 to
1.3 or so power law and the static and dynamic energy will follow the
square of that (around 1.5) but more citable figures should be found in
the literature.
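
The behaviour of the current set-up (constant static power, dynamic power
linear in frequency) can be sketched as follows.  The power figures are
hypothetical, and the cycle count is assumed not to change with frequency,
which ignores the cache and DRAM access-time limit mentioned above.

```python
def energy_at_frequency(cycles, f_hz, p_static_w, p_dyn_ref_w, f_ref_hz=200e6):
    """Return (time, static energy, dynamic energy) for a fixed cycle count."""
    time_s = cycles / f_hz
    p_dyn_w = p_dyn_ref_w * f_hz / f_ref_hz   # dynamic power rises linearly
    return time_s, p_static_w * time_s, p_dyn_w * time_s

for f in (100e6, 200e6, 400e6):
    t, e_s, e_d = energy_at_frequency(2_000_000, f,
                                      p_static_w=0.05, p_dyn_ref_w=0.2)
    print(f"{f / 1e6:.0f} MHz: {t * 1e3:.1f} ms, "
          f"static {e_s * 1e6:.0f} uJ, dynamic {e_d * 1e6:.0f} uJ")
```

Under these assumptions the dynamic energy for a fixed amount of work does not
change with frequency, while the static energy falls as the run gets shorter.
A voltage-scaled model would make the energy per operation rise with
frequency instead.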

Q. How can I find the hotspots in my code?

A. One way is to compile it natively for the workstation with the GNU
profiling option -pg.  If you extend an outer loop so that it runs for a
few minutes of real time you will then get pretty good indications
from gprof.


Q. How can I estimate the energy and timing performance from a high-level model of my new hardware?

A. There are many papers on the related subject of pre-synthesis RTL
performance estimation.  In general there are a number of answers
according to the time/space tradeoff envisaged for the final
implementation.  The easiest answer to find is the one where the final
hardware will use the same structure and same loop counts as the
high-level model, but structural hazards may require its refinement.
Structural hazards occur mainly on memory ports and floating point
ALUs.  In C code you can use as many array operations as you like in a
set of expressions, but if these are mapped to hardware RAMs they may
stall a loop if more values need to be accessed than there are RAM
ports. Two address ports is the maximum number normally
available. Within a single port it is also typically possible to read
out the old data and write new data at a location in one clock cycle.
(Register files can potentially have unlimited numbers of ports as
well as dedicated input and output wiring for specific locations but
much more energy and area use.)
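
The effect of port contention on a loop can be estimated with a simple
calculation: each RAM needs at least ceil(accesses / ports) cycles per
iteration, and the busiest RAM sets the pace.  The sketch below assumes one
access per port per cycle and uses hypothetical RAM names and access counts.

```python
import math

def cycles_per_iteration(accesses_per_ram, ports_per_ram):
    """Minimum cycles per loop iteration imposed by RAM port contention.

    accesses_per_ram maps a RAM name to the number of accesses the loop body
    makes to it; ports_per_ram maps the same names to the port count.
    """
    return max(math.ceil(accesses_per_ram[name] / ports_per_ram[name])
               for name in accesses_per_ram)

# Hypothetical loop body: five accesses to 'state', two to 'consts'.
print(cycles_per_iteration({"state": 5, "consts": 2},
                           {"state": 2, "consts": 1}))
```

Splitting an overloaded RAM into two, so that its accesses are shared between
more ports, lowers the bound; this is the kind of refactoring that removes a
structural hazard.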

Work out how many RAMs you need and their sizes (length, width and
number of ports) and what each will contain, such that its ports are
not overloaded. Put simple scalars in registers. Refactor your
high-level design to balance the load on memory ports and to avoid
structural hazards.

We now need to estimate the area, energy and clock frequency for your design.

   Firstly, you must either instrument your (refactored) model to count loop
iterations and the Hamming distance in register assignments, or else estimate
these numbers.

   Secondly, you must compute a logic depth and complexity for the
right-hand sides of your assignments. You can use one gate per bit for
logic operations, four gates per bit for arithmetic operations and
quadratic cost for multipliers.  Constant shifts are free of course.
Better still, take your inner loop and use LegUp or manually convert it
to your favourite RTL and synthesise it for some technology you have
to hand: gate counts and depths will be roughly the same for all
technologies.  I will put some more information here on sheet 2:
  http://www.cl.cam.ac.uk/~djg11/greaves-vlsitech-spreadsheet.ods
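
The rule-of-thumb gate costs above can be applied mechanically to each
right-hand side.  In the sketch below the operator names are hypothetical
labels, and the multiplier cost uses a constant of one gate per bit product,
which is an assumption; only the quadratic growth comes from the rule above.

```python
GATES_PER_BIT = {"logic": 1, "arith": 4}

def gate_count(op, width):
    """Gate estimate for one operator of the given bit width."""
    if op == "mul":
        return width * width        # quadratic cost; unit constant assumed
    if op == "shift_const":
        return 0                    # constant shifts are just wiring
    return GATES_PER_BIT[op] * width

# Hypothetical right-hand side: (a ^ b) + (c << 3), all 64 bits wide.
ops = [("logic", 64), ("shift_const", 64), ("arith", 64)]
print(sum(gate_count(op, w) for op, w in ops))
```

Summing such estimates over all assignments gives the gate count used in the
area and static energy estimates.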


   Thirdly, estimate the area of your design, using the RAM areas and
register and gate counts.  The average net length will be about 0.3
times the square root of the area.  Dynamic and wiring energy is then
estimated using the same activity ratio as you estimated or measured
for the registers. For a hash function, activity is going to be
maximal, which means a bit will discharge 25 percent of the time.
Static energy is estimated from the gate+register+RAM count. The clock
period will be proportional to the logic depth, so the clock frequency is
the reciprocal of the logic depth expressed as a delay.  Remember to
count only the discharge wiring energy, since counting the charge (0 to 1
transitions) as well would double count it.  The TLM POWER3 library
correctly counts only one direction.
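
The wiring energy estimate can be sketched as below.  The capacitance per
unit length, supply voltage, net count and area are hypothetical values, and
charging each counted discharge with the full C*V^2 of its charge/discharge
pair is an assumption of this sketch rather than a statement about any
library.

```python
import math

def average_net_length(area):
    """Average net length: about 0.3 times the square root of the area."""
    return 0.3 * math.sqrt(area)

def wiring_energy_j(nets, area_mm2, cycles, activity, cap_per_mm_f, vdd_v):
    """Wiring energy, counting discharge transitions only.

    Each counted discharge is charged the full C*V^2 of its charge/discharge
    pair (an assumption of this sketch), so charges are not counted again.
    """
    cap_f = cap_per_mm_f * average_net_length(area_mm2)
    discharges = nets * cycles * activity
    return discharges * cap_f * vdd_v ** 2

# Hypothetical design: 10,000 nets in 1 mm^2, 1,000 cycles, 25% discharge rate.
energy = wiring_energy_j(nets=10_000, area_mm2=1.0, cycles=1_000,
                         activity=0.25, cap_per_mm_f=0.2e-12, vdd_v=1.0)
print(f"wiring energy = {energy * 1e9:.1f} nJ")
```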

Note, to be independent of technology you can report your final time
and area results in units of FO4 delays and square lambda
respectively. 


---------------------------------------

Points arising last year:

Note that the performance of the simulator can be greatly improved by
recompiling without PW_TLM_PAYLOAD=3 (which tracks transitions in the
generic payload), or without TLM_POWER3 being installed at all, and a
rough doubling in performance is possible using a large quantum keeper
constant. You may wish to investigate these effects and write them up.

If your measured results do not fit your analytical model do not worry:
full credit can still be achieved in this Exercise, but an
explanation should be included if you use this work as your
mini-project.

---------------------

You can also modify the DRAM bank interleave and see what effect that
has.  Instead of having one DRAM completely after the other, another
common design approach is for them to be finely interleaved at the
granularity of a cache line.


The change can basically be made in the ten characters or so of C code
in the busmux that divides the work between the DRAMs.

But you'll also have to implement a solution to loading the
program ELF file for such a memory structure, e.g. change memloader to
reflect your interleaving or load the program into another memory or
region of memory that is not so interleaved.
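
The two bank-selection schemes differ only in the address arithmetic.  The
sketch below shows that arithmetic in Python rather than as the busmux C
code; the line size, DRAM capacity and bank count are hypothetical values.

```python
LINE_BYTES = 32          # hypothetical cache line size
BANK_BYTES = 1 << 26     # hypothetical capacity of each DRAM (64 MiB)
N_BANKS = 2

def bank_contiguous(addr):
    """One DRAM completely after the other."""
    return addr // BANK_BYTES

def bank_interleaved(addr):
    """Consecutive cache lines alternate between DRAMs."""
    return (addr // LINE_BYTES) % N_BANKS

for addr in (0, 32, 64, BANK_BYTES):
    print(hex(addr), bank_contiguous(addr), bank_interleaved(addr))
```

A loader for the interleaved layout has to apply the same mapping when it
places the ELF contents, which is why memloader needs to change along with
the busmux.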