// SoC P35 Exercise 3a 2013/14

Note: Although there is zero credit for this Exercise, please hand in your report as hardcopy to the ACS Admin office and also email a PDF to DJ Greaves.

Please write a short report entitled 'Exercise 3a: Planning an investigation of XXX', of about 3 pages excluding listings. It should have something sensible instead of XXX and embody the following two aspects:

3a.1: Describe a question that you expect to investigate further using the OpenRISC blocking-TLM model, augmented with any special hardware models you need. The investigation should implement the same function in various ways, each of which gives the same output but with differing amounts of silicon use, timing performance or energy cost.

3a.2: Demonstrate basic proficiency with the OpenRISC blocking-TLM model by showing output from a working system running a small test that would yield one (baseline or otherwise) data point for your investigation.

Note: you should use an instruction or loop count of more than 10,000 for each data point, but do keep things very simple overall.

Note: If you do not make any modifications to the provided OpenRISC simulator, you must still ultimately demonstrate proficiency with TLM modelling by altering the memory structure or parameters as part of a subsequent Exercise, so please make sure you can compile and run your own version of the or1ksmp program.

Within Exercise 3, working in pairs or groups is allowed, with, for example, one person writing the code to run on the OpenRISC and another extending the TLM model. In this case, an 'interface spec' should be agreed by all parties and included or linked to in the reports.

END

=====================================================================
The remainder of this page consists of notes from last year that may not be relevant this year.
=====================================================================

// OPTION 1 - Hardware assist: Please do the following:

1.
Starting with the ETHERCRC example (in /usr/groups/han/clteach/btlm-baseline/ethercrc or at http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/ethercrc/), or with your PERIPHERAL_DEVICE from Exercise 2, or using another design of your own (such as a message-passing interface, a Bloom filter, or a multimedia or encryption algorithm), create an application and unit test so that it has a top-level part and one or more sub-parts, where the sub-parts are suitable for implementation in custom hardware (or possibly on separate CPU cores). (The ETHERCRC example is already partitioned to be implemented as a bus-connected I/O device, but an alternative partition would be as a co-processor or custom instruction.) See also http://www.cl.cam.ac.uk/teaching/1112/SysOnChip/slides/sp7tte/zhpf2ee9a4a1.html

For simplicity, the top-level part should contain all necessary input data sets for several tests of different lengths to be run (i.e. it has embedded unit tests). Your partition should result in two (or more) sections of C code that can easily be joined up again to run as a single program, perhaps under the control of conditional compilation flags with the C pre-processor. Ideally you should select something where you can easily compare your ultimate results with an existing publication.

2. Modify the OpenRISC or1ksmp simulator to include your I/O device as a TLM model or custom instruction or whatever is needed (possibly nothing if just using multiple cores), and compile the remainder of your application and test for OpenRISC.

3. Compile your unpartitioned application so that it runs natively on a workstation.

4. Compile your unpartitioned application so that it runs on a single core of OpenRISC without any hardware assistance.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.
=====================================================================
// OPTION 2 - Natively-compiled versus VM: Please do the following:

Many applications are coded in Java or CLR/dotnet and run as interpreted bytecode. This has a different power and performance profile compared with native code. However, some code can be compiled from dotnet into hardware using HLS tools to make accelerators, or it can be JIT-compiled to native form.

There is an example dotnet program (a primes computation) here:
/usr/groups/han/clteach/btlm-baseline/openrisc/sw/primes-dotnet

There is a really trivial version of the dotnet VM, ported for OpenRISC, here (it only supports static classes and integer arithmetic at the moment):
/usr/groups/han/clteach/btlm-baseline/openrisc/sw/tiny-clr-temporary

It is 'temporary' since it is so cut down and does not use the full C library, libc, but a cut-down one from mixbug.

For your reference, there is an RTL compilation of the primes example here:
http://www.cl.cam.ac.uk/research/srg/han/hprls/orangepath/timestable-demo/offchip.html
However, do not use this for Exercise 3. It can be used or recompiled later in Exercise 4 or 5.

1. Write a very simple integer application of your own in C and C# (or continue with the primes example).

2. Run both forms of the program on a single core of the OpenRISC or1ksmp simulator.

3. Make one of the following changes to the system (the changes may or may not need the high-level application to be slightly recoded and recompiled):

3a. Extend the example so that it uses multiple cores (there are many ways to do this, so briefly discuss the options first).

3b. Add a special coprocessor or instruction to the OpenRISC that accelerates the system or reduces power consumption. This might accelerate only the chosen application or it might potentially benefit all apps.

3c. Add special support for arrays of booleans.

3d.
Consider which part of the application might be JIT-compiled, and manually modify the mix of native and interpreted code to reflect this change.

4. Write a note comparing the stack machines of the JVM and CLR/dotnet with the Android DVM register machine, and speculate on which is easier to accelerate with custom hardware or coprocessors.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.

=====================================================================
// OPTION 3 - "Single-core algorithms are not the best on parallel architectures?"

Investigate how well a textbook algorithm works on contemporary and future architectures. Please do the following:

There is a school of thought that mutable data structures, although good for single-core programs, can waste a lot of energy on multi-core. Better is to disable cache consistency entirely, or to use an immutable heap that does not cause write evictions. Also, we might consider whether a custom instruction or SIMD structure is useful for accelerating textbook algorithms. (Finally, we might investigate the power consumption and best algorithm partition under GP-GPU acceleration, but this would involve importing a GPU model into our system, which would take far more work than is required for full marks.)

Suitable algorithms for this investigation might be:

-- Tree Quicksort (Guy Blelloch, CMU) - an immutable form of quicksort.

-- Polyphase radix sort, from the SPLASH benchmarks. DJG has already ported this to our OpenRISC simulator:
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix

-- Merge sort - better than radix sort for longer keys: "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort" by Nadathur Satish et al., SIGMOD'10. Compare different memory architectures to see how memory bandwidth impacts the best sorting algorithm, or add a SIMD instruction to the OpenRISC ISA.
-- Bloom filter implementation: http://antognini.ch/papers/BloomFilters20080620.pdf
Additional boolean store instructions added to the OpenRISC ISA might help.

-- Dijkstra's shortest path: http://www.mcs.anl.gov/~itf/dbpp/text/node35.html
Although this is the best for single core, the all-paths algorithm parallelises better. A conditional compare-and-store instruction added to the OpenRISC ISA might help.

You should:

1. Adapt or write a C implementation of the algorithm that runs natively on a workstation and that can be adapted for the OpenRISC platform using conditional-compilation preprocessor flags.

2. If a thread library is needed, use pthreads/parmacs. Gain basic proficiency with the OpenRISC version of these:
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/or1k.threads
following the basic application in
/usr/groups/han/clteach/btlm-baseline/openrisc/splash2/isca95/codes/kernels/radix/radix.c

3. Get your C implementation running on one core of the OpenRISC simulator without any hardware assist.

4. By enabling further conditional compilation flags, demonstrate a run, however small or trivial, of your partitioned implementation that uses multiple cores, custom instructions, coprocessors, etc.

5. Write a text 'interface specification' that clearly defines the interface between the partitioned parts.

END

-------------
Points arising:

Q. I cannot think of four pages' worth of material to write for my report.

A. Ultimately you will need several pages of text for your final two Exercises; some parts of your report for Exercise 3 can be a first draft of bits of that. You should explain what you have done and also some of the infrastructure you are using. Write it as though addressed to an external examiner, not to DJG. DJG knows all about the OR1KSMP system, but an external examiner will not.
-------------