//
// SoC P35  Exercise 3   2012/13
//

-------------
Points arising:

Q. I cannot think of four pages' worth of material to write for my report.

A. Ultimately you will need several pages of text for your final two
Exercises; parts of your report for Exercise 3 can serve as a first draft of
that. You should explain what you have done and also some of the
infrastructure you are using. Write it as though addressed to an external
examiner, not to DJG. DJG knows all about the OR1KSMP system but an external
examiner will not.
-------------

Demonstrate proficiency with the OpenRISC blocking-TLM model by preparing an
experiment/investigation that can potentially be taken further in the next
Exercise. In the current Exercise you just need to get one form of the
experiment basically working.

Your experiment must involve parallelism, either in the form of multiple
OpenRISC cores co-operating or through adding some custom hardware to the
system that operates in parallel, such as a co-processor, custom instruction
or special peripheral. (Even if you do not add custom extensions to the
OpenRISC simulator, you must still ultimately demonstrate proficiency with
TLM modelling by altering the memory structure or parameters as part of the
next Exercise.)

Within Exercise 3, working in pairs is allowed, with one person writing the
code to run on the OpenRISC and the other extending the TLM model. In this
case, the 'interface spec' part below should be done first and each person
should refer to this document as the basis for their work.

A number of options are given below for the experiment/investigation. Choose
one. Full marks will be awarded for a short report (about 4 pages including
the interface specification but excluding listings) that shows a working
system running a small test (an instruction count of around 10,000). Please
hand in at least a preliminary report by the published deadline to get
feedback. Re-submission to get a higher mark is allowed at any point up to
the final deadline at the start of the Easter Term.

=====================================================================
// OPTION 1 - Hardware assist:

Please do the following:

1. Starting with the ETHERCRC example
(in /usr/groups/han/clteach/btlm-baseline/ethercrc or at
http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/ethercrc/), or your
PERIPHERAL_DEVICE from Exercise 2, or using another design of your own (such
as a message-passing interface, a Bloom filter, or a multimedia or encryption
algorithm), create an application and unit test structured so that it has a
top-level part and one or more sub-parts, where the sub-parts are suitable
for implementation in custom hardware (or possibly on separate CPU cores).
(The ETHERCRC example is already partitioned to be implemented as a
bus-connected I/O device, but an alternative partition would be as a
co-processor or custom instruction.) See also
http://www.cl.cam.ac.uk/teaching/1112/SysOnChip/slides/sp7tte/zhpf2ee9a4a1.html
For simplicity, the top-level part should contain all necessary input data
sets for several tests of different lengths to be run (i.e. it has embedded
unit tests). Your partition should result in two (or more) sections of C code
that can easily be joined up again to run as a single program, perhaps under
control of conditional compilation flags with the C pre-processor (a sketch
in this style is given after these numbered steps). Ideally you should select
something where you can easily compare your ultimate results with an existing
publication.

2. Modify the OpenRISC or1ksmp simulator to include your I/O device as a TLM
model or custom instruction or whatever is needed (possibly nothing if you
are just using multiple cores), and compile the remainder of your application
and test for OpenRISC.

3. Compile your unpartitioned application so that it runs natively on a
workstation.

4. Compile your unpartitioned application so that it runs on a single core of
OpenRISC without any hardware assistance.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.
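As an illustration of the kind of partition intended in step 1, here is a
minimal C sketch of a CRC-style application with embedded unit tests. The
pure-software build (no flags) runs as a single program, natively or on one
OpenRISC core; defining USE_HW_CRC swaps the sub-part for accesses to an
assumed memory-mapped device with a simple reset/data/result register
interface. The macro name, base address and register offsets are hypothetical
placeholders; they are exactly the sort of detail your written interface
specification must pin down to match your TLM model.

  /* crc_app.c -- top-level part with embedded unit tests, plus a sub-part
     that is a candidate for custom hardware.  Build with no flags for the
     pure-software version; define USE_HW_CRC for the partitioned version.
     All device names and addresses below are illustrative only. */

  #include <stdint.h>
  #include <stdio.h>

  static const uint8_t test0[] = { 0x31, 0x32, 0x33 };                   /* short test  */
  static const uint8_t test1[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x42 }; /* longer test */

  /* Sub-part: bit-at-a-time CRC-32, the piece that could move to hardware. */
  static uint32_t crc32_sw(const uint8_t *buf, unsigned len)
  {
    uint32_t crc = 0xFFFFFFFFu;
    for (unsigned i = 0; i < len; i++) {
      crc ^= buf[i];
      for (int b = 0; b < 8; b++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0);
    }
    return ~crc;
  }

  #ifdef USE_HW_CRC
  /* Hypothetical memory-mapped CRC device: the address map must match the
     TLM model and belongs in the interface specification. */
  #define CRC_BASE   0x90001000u
  #define CRC_RESET  (*(volatile uint32_t *)(CRC_BASE + 0x0))
  #define CRC_DATA   (*(volatile uint32_t *)(CRC_BASE + 0x4))
  #define CRC_RESULT (*(volatile uint32_t *)(CRC_BASE + 0x8))

  static uint32_t crc32(const uint8_t *buf, unsigned len)
  {
    CRC_RESET = 1;                                  /* start a new CRC      */
    for (unsigned i = 0; i < len; i++)
      CRC_DATA = buf[i];                            /* feed one byte at a time */
    return CRC_RESULT;                              /* read back the result */
  }
  #else
  #define crc32 crc32_sw                            /* unpartitioned build  */
  #endif

  int main(void)
  {
    /* Top-level part: embedded unit tests of different lengths. */
    printf("crc(test0)=%08x\n", (unsigned)crc32(test0, sizeof test0));
    printf("crc(test1)=%08x\n", (unsigned)crc32(test1, sizeof test1));
    return 0;
  }

Both builds should print identical CRC values for every embedded test; the
native run of step 3 and the software-only OpenRISC run of step 4 then serve
as the reference results against which the hardware-assisted run is checked.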
=====================================================================
// OPTION 2 - Natively-compiled versus VM:

Please do the following:

Many applications are coded in Java or for the CLR/dotnet and run as
interpreted bytecode. This has a different power and performance profile
compared with native code. However, some dotnet code can be compiled into
hardware accelerators using HLS tools, or it can be JIT-compiled to native
form.

There is an example dotnet program (a primes computation) here:
  /usr/groups/han/clteach/btlm-baseline/openrisc/sw/primes-dotnet

There is a really trivial version of the dotnet VM, ported for OpenRISC, here:
  /usr/groups/han/clteach/btlm-baseline/openrisc/sw/tiny-clr-temporary
It only supports static classes and integer arithmetic (at the moment). It is
'temporary' since it is so cut down and does not use the full C library,
libc, but a cut-down one from mixbug.

For your reference, there is an RTL compilation of the primes example here:
  http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/offchip.html
However, do not use this for Exercise 3. It can be used or recompiled later,
in Exercise 4 or 5.

1. Write a very simple integer application of your own in C and C# (or
continue with the primes example); a sketch of the C form is given after
these numbered steps.

2. Run both forms of the program on a single core of the OpenRISC or1ksmp
simulator.

3. Make one of the following changes to the system (the change may or may not
need the high-level application to be slightly recoded and recompiled):

  3a. Extend the example so that it uses multiple cores (there are many ways
  to do this, so briefly discuss the options first).

  3b. Add a special coprocessor or instruction to the OpenRISC that
  accelerates the code or reduces power consumption. This might accelerate
  only the chosen application or it might potentially benefit all apps.

  3c. Add special support for arrays of booleans.

  3d. Consider which part of the application might be JIT-compiled and
  manually modify the mix of native and interpreted code to reflect this
  change.

4. Write a note comparing the stack machines of the JVM and CLR/dotnet with
the Android DVM register machine and speculate which is easier to accelerate
with custom hardware or coprocessors.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.
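As an illustration of step 1, here is a minimal sketch in C of the sort of
integer-only application intended. A C# counterpart would mirror it using
only static methods and integer arithmetic, so that it stays within what the
tiny CLR currently supports. The limit of 100 is an arbitrary choice made
here to keep the run small, in line with the roughly 10,000-instruction tests
asked for above.

  /* primes_count.c -- trivial integer-only workload for comparing a
     natively-compiled build against the interpreted dotnet form. */

  #include <stdio.h>

  /* Trial division; integer arithmetic only. */
  static int is_prime(int n)
  {
    if (n < 2) return 0;
    for (int d = 2; d * d <= n; d++)
      if (n % d == 0) return 0;
    return 1;
  }

  int main(void)
  {
    int count = 0;
    for (int n = 2; n < 100; n++)
      count += is_prime(n);
    printf("primes below 100: %d\n", count);   /* expect 25 */
    return 0;
  }

The same arithmetic, rewritten as a static C# class, gives the interpreted
form to run under the tiny CLR alongside the native build in step 2.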
=====================================================================
// OPTION 3 - "Single-core algorithms are not the best on parallel architectures?"

Investigate how well a textbook algorithm works on contemporary and future
architectures. Please do the following:

There is a school of thought that mutable data structures, although good for
single-core programs, can waste a lot of energy on multi-core. It may be
better to disable cache consistency entirely, or to use an immutable heap
that does not cause write evictions. Also, we might consider whether a custom
instruction or SIMD structure is useful for accelerating textbook algorithms.
(Finally, we might investigate the power consumption and best algorithm
partition under GP/GPU acceleration, but this would involve importing a GPU
model into our system, which would take far more work than is required for
full marks.)

Suitable algorithms for this investigation might be:

-- Tree quicksort (Guy Blelloch, CMU) - an immutable form of quicksort.

-- Polyphase radix sort, from the SPLASH benchmarks. DJG has already ported
this to our OpenRISC simulator:
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix

-- Merge sort - better than radix sort for longer keys: "Fast Sort on CPUs
and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Nadathur Satish et al.
in SIGMOD'10. Compare different memory architectures to see how memory
bandwidth affects the choice of sorting algorithm, or add a SIMD instruction
to the OpenRISC ISA.

-- Bloom filter implementation:
  http://antognini.ch/papers/BloomFilters20080620.pdf
Additional boolean store instructions added to the OpenRISC ISA might help.

-- Dijkstra's shortest path:
  http://www.mcs.anl.gov/~itf/dbpp/text/node35.html
Although it is the best on a single core, the all-pairs algorithm
parallelises better. A conditional compare-and-store instruction added to the
OpenRISC ISA might help.

You should:

1. Adapt or write a C implementation of the algorithm that runs natively on a
workstation and that can be adapted for the OpenRISC platform using
conditional compilation preprocessor flags.

2. If a thread library is needed, then use pthreads/parmacs. Gain basic
proficiency with the OpenRISC version of these in
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/or1k.threads
following the basic application in
  /usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix/radix.c

3. Get your C implementation running on one core of the OpenRISC simulator
without any hardware assist.

4. By enabling further conditional compilation flags, demonstrate a run,
however small or trivial, of your partitioned implementation that uses
multiple cores, custom instructions, coprocessors, etc.

5. Write a text 'interface specification' that clearly defines the interface
between the partitioned parts.

END