// SoC P35 Exercise 3a 2013/14

Note: Although there is zero credit for this Exercise, please hand in your report as hardcopy to the ACS Admin office and also email a PDF to DJ Greaves.

Please write a short report entitled 'Exercise 3a: Planning an investigation of XXX', of about 3 pages excluding listings. It should have something sensible instead of XXX and embody the following two aspects:

3a.1: Describe a question that you expect to investigate further using the OpenRISC blocking-TLM model, augmented with any special hardware models you need. The investigation should implement the same function in various ways, each of which gives the same output but with differing amounts of silicon use, timing performance or energy cost.

3a.2: Demonstrate basic proficiency with the OpenRISC blocking-TLM model by showing output from a working system running a small test that would yield one (baseline or otherwise) data point for your investigation.

Note: you should use an instruction or loop count of more than 10,000 for each data point, but do keep things very simple overall.

Note: If you do not make any modifications to the provided OpenRISC simulator, you must still ultimately demonstrate proficiency with TLM modelling by altering the memory structure or parameters as part of a subsequent Exercise, so please make sure you can compile and run your own version of the or1ksmp program.

Within Exercise 3, working in pairs or groups is allowed, with, for example, one person writing the code to run on the OpenRISC and another extending the TLM model. In this case, an 'interface spec' should be agreed by all parties and included or linked to in the reports.

END

=====================================================================
The remainder of this page consists of notes from last year that may not be relevant this year.
=====================================================================

// OPTION 1 - Hardware assist: Please do the following:

1.
Starting with the ETHERCRC example (in /usr/groups/han/clteach/btlm-baseline/ethercrc or at http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/ethercrc/), or with your PERIPHERAL_DEVICE from Exercise 2, or using another design of your own (such as a message-passing interface, a Bloom filter, or a multimedia or encryption algorithm), create an application and unit test so that it has a top-level part and one or more sub-parts, where the sub-parts are suitable for implementation in custom hardware (or possibly on separate CPU cores). (The ETHERCRC example is already partitioned to be implemented as a bus-connected I/O device, but an alternative partition would be as a co-processor or custom instruction.) See also http://www.cl.cam.ac.uk/teaching/1112/SysOnChip/slides/sp7tte/zhpf2ee9a4a1.html

For simplicity, the top-level part should contain all necessary input data sets for several tests of different lengths to be run (i.e. it has embedded unit tests). Your partition should result in two (or more) sections of C code that can easily be joined up again to run as a single program, perhaps under the control of conditional compilation flags with the C pre-processor. Ideally you should select something where you can easily compare your ultimate results with an existing publication.

2. Modify the OpenRISC or1ksmp simulator to include your I/O device as a TLM model or custom instruction or whatever is needed (possibly nothing if just using multiple cores), and compile the remainder of your application and test for OpenRISC.

3. Compile your unpartitioned application so that it runs natively on a workstation.

4. Compile your unpartitioned application so that it runs on a single core of OpenRISC without any hardware assistance.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.
=====================================================================
// OPTION 2 - Natively-compiled versus VM: Please do the following:

Many applications are coded in Java or CLR/dotnet and run as interpreted bytecode. This has a different power and performance profile compared with native code. However, some code can be compiled from dotnet into hardware using HLS tools to make accelerators, or it can be JIT-compiled to native form.

There is an example dotnet program (a primes computation) here:
/usr/groups/han/clteach/btlm-baseline/openrisc/sw/primes-dotnet

There is a really trivial version of the dotnet VM, ported for OpenRISC, here (it only supports static classes and integer arithmetic at the moment):
/usr/groups/han/clteach/btlm-baseline/openrisc/sw/tiny-clr-temporary

It is 'temporary' since it is so cut down and does not use the full C library, libc, but a cut-down one from mixbug.

For your reference, there is an RTL compilation of the primes example here:
http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/offchip.html
However, do not use this for Exercise 3. It can be used or recompiled later in Exercise 4 or 5.

1. Write a very simple integer application of your own in C and C# (or continue with the primes example).

2. Run both forms of the program on a single core of the OpenRISC or1ksmp simulator.

3. Make one of the following changes to the system (the changes may or may not need the high-level application to be slightly recoded and recompiled):

3a. Extend the example so that it uses multiple cores (there are many ways to do this, so briefly discuss the options first).

3b. Add a special coprocessor or instruction to the OpenRISC that accelerates the system or reduces power consumption. This might accelerate only the chosen application or it might potentially benefit all apps.

3c. Add special support for arrays of booleans.

3d.
Consider which part of the application might be JIT-compiled, and manually modify the mix of native and interpreted code to reflect this change.

4. Write a note comparing the stack machines of the JVM and CLR/dotnet with the Android DVM register machine, and speculate on which is easier to accelerate with custom hardware or coprocessors.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.

=====================================================================
// OPTION 3 - "Single-core algorithms are not the best on parallel architectures?"

Investigate how well a textbook algorithm works on contemporary and future architectures. Please do the following:

There is a school of thought that mutable data structures, although good for single-core programs, can waste a lot of energy on multi-core. Better is to disable cache consistency entirely, or to use an immutable heap that does not cause write evictions. Also, we might consider whether a custom instruction or SIMD structure is useful for accelerating textbook algorithms. (Finally, we might investigate the power consumption and best algorithm partition under GP-GPU acceleration, but this would involve importing a GPU model into our system, which would take far more work than is required for full marks.)

Suitable algorithms for this investigation might be:

-- Tree Quicksort (Guy Blelloch, CMU) - an immutable form of quicksort.

-- Polyphase radix sort, from the SPLASH benchmarks. DJG has already ported this to our OpenRISC simulator:
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix

-- Merge sort - better than radix sort for longer keys: "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Nadathur Satish et al., SIGMOD'10. Compare different memory architectures to see how memory bandwidth impacts the best sorting algorithm, or add a SIMD instruction to the OpenRISC ISA.
-- Bloom filter implementation: http://antognini.ch/papers/BloomFilters20080620.pdf
Additional boolean store instructions added to the OpenRISC ISA might help.

-- Dijkstra's shortest path: http://www.mcs.anl.gov/~itf/dbpp/text/node35.html
Although this is the best for single core, the all-paths algorithm parallelises better. A conditional compare-and-store instruction added to the OpenRISC ISA might help.

You should:

1. Adapt or write a C implementation of the algorithm that runs natively on a workstation and that can be adapted for the OpenRISC platform using conditional-compilation preprocessor flags.

2. If a thread library is needed, use pthreads/parmacs. Gain basic proficiency with the OpenRISC version of these:
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/or1k.threads
following the basic application in
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix/radix.c

3. Get your C implementation running on one core of the OpenRISC simulator without any hardware assist.

4. By enabling further conditional compilation flags, demonstrate a run, however small or trivial, of your partitioned implementation that uses multiple cores, custom instructions, coprocessors, etc.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.

END

-------------
Points arising:

Q. I cannot think of four pages' worth of material to write for my report.

A. Ultimately you will need several pages of text for your final two Exercises; some parts of your report for Exercise 3 can be a first draft of bits of that. You should explain what you have done and also some of the infrastructure you are using. Write it as though addressed to an external examiner, not to DJG. DJG knows all about the OR1KSMP system, but an external examiner will not.
-------------