//
// SoC P35  Exercise 3   2012/13
//

-------------
Points arising:

Q. I cannot think of four pages' worth of material to write for my report.

A. Ultimately you will need several pages of text for your final two
Exercises; parts of your report for Exercise 3 can serve as a first draft of
that. You should explain what you have done and also some of the
infrastructure you are using. Write it as though addressed to an external
examiner, not to DJG. DJG knows all about the OR1KSMP system but an external
examiner will not.
-------------

Demonstrate proficiency with the OpenRISC blocking-TLM model by preparing an
experiment/investigation that can potentially be taken further in the next
Exercise. In the current Exercise you just need to get one form of the
experiment basically working.

Your experiment must involve parallelism, either in the form of multiple
OpenRISC cores co-operating or through adding some custom hardware to the
system that operates in parallel, such as a co-processor, custom instruction
or special peripheral. (Even if you do not add custom extensions to the
OpenRISC simulator, you must still ultimately demonstrate proficiency with
TLM modelling by altering the memory structure or parameters as part of the
next Exercise.)

Within Exercise 3, working in pairs is allowed, with one person writing the
code to run on the OpenRISC and the other extending the TLM model. In this
case, the 'interface spec' part below should be done first and each person
should refer to this document as the basis for their work.

A number of options are given below for the experiment/investigation. Choose
one. Full marks will be awarded for a short report (about 4 pages including
the interface specification but excluding listings) that shows a working
system running a small test (an instruction count of around 10,000). Please
hand in at least a preliminary report by the published deadline to get
feedback. Re-submission to get a higher mark is allowed at any point up to
the final deadline at the start of the Easter Term.

=====================================================================
// OPTION 1 - Hardware assist:

Please do the following:

1. Starting with the ETHERCRC example
(in /usr/groups/han/clteach/btlm-baseline/ethercrc or at
http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/ethercrc/), or your
PERIPHERAL_DEVICE from Exercise 2, or using another design of your own (such
as a message-passing interface, a Bloom filter, or a multimedia or encryption
algorithm), create an application and unit test structured so that it has a
top-level part and one or more sub-parts, where the sub-parts are suitable
for implementation in custom hardware (or possibly on separate CPU cores).
(The ETHERCRC example is already partitioned to be implemented as a
bus-connected I/O device, but an alternative partition would be as a
co-processor or custom instruction.) See also
http://www.cl.cam.ac.uk/teaching/1112/SysOnChip/slides/sp7tte/zhpf2ee9a4a1.html
For simplicity, the top-level part should contain all necessary input data
sets for several tests of different lengths to be run (i.e. it has embedded
unit tests). Your partition should result in two (or more) sections of C code
that can easily be joined up again to run as a single program, perhaps under
control of conditional compilation flags with the C pre-processor (a sketch
in this style is given after these numbered steps). Ideally you should select
something where you can easily compare your ultimate results with an existing
publication.

2. Modify the OpenRISC or1ksmp simulator to include your I/O device as a TLM
model or custom instruction or whatever is needed (possibly nothing if you
are just using multiple cores), and compile the remainder of your application
and test for OpenRISC.

3. Compile your unpartitioned application so that it runs natively on a
workstation.

4. Compile your unpartitioned application so that it runs on a single core of
OpenRISC without any hardware assistance.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.
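As an illustration of the kind of partition intended in step 1, here is a
minimal C sketch of a CRC-style application with embedded unit tests. The
pure-software build (no flags) runs as a single program, natively or on one
OpenRISC core; defining USE_HW_CRC swaps the sub-part for accesses to an
assumed memory-mapped device with a simple reset/data/result register
interface. The macro name, base address and register offsets are hypothetical
placeholders; they are exactly the sort of detail your written interface
specification must pin down to match your TLM model.

  /* crc_app.c -- top-level part with embedded unit tests, plus a sub-part
     that is a candidate for custom hardware.  Build with no flags for the
     pure-software version; define USE_HW_CRC for the partitioned version.
     All device names and addresses below are illustrative only. */

  #include <stdint.h>
  #include <stdio.h>

  static const uint8_t test0[] = { 0x31, 0x32, 0x33 };                   /* short test  */
  static const uint8_t test1[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x42 }; /* longer test */

  /* Sub-part: bit-at-a-time CRC-32, the piece that could move to hardware. */
  static uint32_t crc32_sw(const uint8_t *buf, unsigned len)
  {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned i = 0; i < len; i++) {
      crc ^= buf[i];
      for (int b = 0; b < 8; b++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0);
    }
    return ~crc;
  }

  #ifdef USE_HW_CRC
  /* Hypothetical memory-mapped CRC device: the address map must match the
     TLM model and belongs in the interface specification. */
  #define CRC_BASE   0x90001000u
  #define CRC_RESET  (*(volatile uint32_t *)(CRC_BASE + 0x0))
  #define CRC_DATA   (*(volatile uint32_t *)(CRC_BASE + 0x4))
  #define CRC_RESULT (*(volatile uint32_t *)(CRC_BASE + 0x8))

  static uint32_t crc32(const uint8_t *buf, unsigned len)
  {
    CRC_RESET = 1;                                  /* start a new CRC      */
    for (unsigned i = 0; i < len; i++)
      CRC_DATA = buf[i];                            /* feed one byte at a time */
    return CRC_RESULT;                              /* read back the result */
  }
  #else
  #define crc32 crc32_sw                            /* unpartitioned build  */
  #endif

  int main(void)
  {
    /* Top-level part: embedded unit tests of different lengths. */
    printf("crc(test0)=%08x\n", (unsigned)crc32(test0, sizeof test0));
    printf("crc(test1)=%08x\n", (unsigned)crc32(test1, sizeof test1));
    return 0;
  }

Both builds should print identical CRC values for every embedded test; the
native run of step 3 and the software-only OpenRISC run of step 4 then serve
as the reference results against which the hardware-assisted run is checked.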
=====================================================================
// OPTION 2 - Natively-compiled versus VM:

Please do the following:

Many applications are coded in Java or for the CLR/dotnet and run as
interpreted bytecode. This has a different power and performance profile
compared with native code. However, some dotnet code can be compiled into
hardware accelerators using HLS tools, or it can be JIT-compiled to native
form.

There is an example dotnet program (a primes computation) here:
  /usr/groups/han/clteach/btlm-baseline/openrisc/sw/primes-dotnet

There is a really trivial version of the dotnet VM, ported for OpenRISC, here:
  /usr/groups/han/clteach/btlm-baseline/openrisc/sw/tiny-clr-temporary
It only supports static classes and integer arithmetic (at the moment). It is
'temporary' since it is so cut down and does not use the full C library,
libc, but a cut-down one from mixbug.

For your reference, there is an RTL compilation of the primes example here:
  http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/offchip.html
However, do not use this for Exercise 3. It can be used or recompiled later,
in Exercise 4 or 5.

1. Write a very simple integer application of your own in C and C# (or
continue with the primes example); a sketch of the C form is given after
these numbered steps.

2. Run both forms of the program on a single core of the OpenRISC or1ksmp
simulator.

3. Make one of the following changes to the system (the change may or may not
need the high-level application to be slightly recoded and recompiled):

  3a. Extend the example so that it uses multiple cores (there are many ways
  to do this, so briefly discuss the options first).

  3b. Add a special coprocessor or instruction to the OpenRISC that
  accelerates the code or reduces power consumption. This might accelerate
  only the chosen application or it might potentially benefit all apps.

  3c. Add special support for arrays of booleans.

  3d. Consider which part of the application might be JIT-compiled and
  manually modify the mix of native and interpreted code to reflect this
  change.

4. Write a note comparing the stack machines of the JVM and CLR/dotnet with
the Android DVM register machine and speculate which is easier to accelerate
with custom hardware or coprocessors.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.
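As an illustration of step 1, here is a minimal sketch in C of the sort of
integer-only application intended. A C# counterpart would mirror it using
only static methods and integer arithmetic, so that it stays within what the
tiny CLR currently supports. The limit of 100 is an arbitrary choice made
here to keep the run small, in line with the roughly 10,000-instruction tests
asked for above.

  /* primes_count.c -- trivial integer-only workload for comparing a
     natively-compiled build against the interpreted dotnet form. */

  #include <stdio.h>

  /* Trial division; integer arithmetic only. */
  static int is_prime(int n)
  {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
      if (n % d == 0) return 0;
    return 1;
  }

  int main(void)
  {
    int count = 0;
    for (int n = 2; n < 100; n++)
      count += is_prime(n);
    printf("primes below 100: %d\n", count);   /* expect 25 */
    return 0;
  }

The same arithmetic, rewritten as a static C# class, gives the interpreted
form to run under the tiny CLR alongside the native build in step 2.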
=====================================================================
// OPTION 3 - "Single-core algorithms are not the best on parallel architectures?"

Investigate how well a textbook algorithm works on contemporary and future
architectures. Please do the following:

There is a school of thought that mutable data structures, although good for
single-core programs, can waste a lot of energy on multi-core. It may be
better to disable cache consistency entirely, or to use an immutable heap
that does not cause write evictions. Also, we might consider whether a custom
instruction or SIMD structure is useful for accelerating textbook algorithms.
(Finally, we might investigate the power consumption and best algorithm
partition under GP/GPU acceleration, but this would involve importing a GPU
model into our system, which would take far more work than is required for
full marks.)

Suitable algorithms for this investigation might be:

-- Tree quicksort (Guy Blelloch, CMU) - an immutable form of quicksort.

-- Polyphase radix sort, from the SPLASH benchmarks. DJG has already ported
this to our OpenRISC simulator:
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix

-- Merge sort - better than radix sort for longer keys: "Fast Sort on CPUs
and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Nadathur Satish et al.
in SIGMOD'10. Compare different memory architectures to see how memory
bandwidth affects the choice of sorting algorithm, or add a SIMD instruction
to the OpenRISC ISA.

-- Bloom filter implementation:
  http://antognini.ch/papers/BloomFilters20080620.pdf
Additional boolean store instructions added to the OpenRISC ISA might help.

-- Dijkstra's shortest path:
  http://www.mcs.anl.gov/~itf/dbpp/text/node35.html
Although it is the best on a single core, the all-pairs algorithm
parallelises better. A conditional compare-and-store instruction added to the
OpenRISC ISA might help.

You should:

1. Adapt or write a C implementation of the algorithm that runs natively on a
workstation and that can be adapted for the OpenRISC platform using
conditional compilation preprocessor flags.

2. If a thread library is needed, then use pthreads/parmacs. Gain basic
proficiency with the OpenRISC version of these in
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/or1k.threads
following the basic application in
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix/radix.c

3. Get your C implementation running on one core of the OpenRISC simulator
without any hardware assist.

4. By enabling further conditional compilation flags, demonstrate a run,
however small or trivial, of your partitioned implementation that uses
multiple cores, custom instructions, coprocessors, etc.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.

END