SoC P35 Exercise 3b 2013/14

Exercise 3b: Preliminary investigation of performance, power and area trade-offs, etc.

Please do the following:

1. Investigate the execution time and energy use figures for your project work described in Exercise 3a and generate a spreadsheet or other table showing how they vary as parameters change. You should aim to have three or more parameters and 20 or more results in your table. You should probably write a Makefile or shell script that automates the experiments. Parameters to adjust typically include input data size, number of CPU cores, cache size and structure, and parameters specific to your project work. You may also include other results in your table, such as actual simulation time, cache hit ratios, number of DRAM columns addressed, etc. (Note: the sha3 work is not likely to use multiple cores this year.)

2. Compare your results to an analytical model of your own creation or to results in a published article. The model should consist of one or more simple formulae that roughly predict the time and energy figures you have measured. You should include an explanation of the coefficients in your formulae, saying whether you think they are reasonable or what else we can learn from them.

3. Optional: for further understanding, make variations to the model, such as: enable and disable the cache(s), vary their allocation policy or clock frequency, add a separate I-cache, or change the main memory from DRAM to SRAM.

Your experiments should help confirm your analytical model and enable you to conclude whether your selected partition was a sensible approach. Perhaps compare with another partition.

Full marks will be awarded for a short report (4 pages or so) that includes the main spreadsheet with additional plots of power consumption and execution speed, both as measured and as predicted by the model.

Please give in at least a preliminary report by the published deadline to get feedback.
Re-submission to get a higher mark is allowed at any point up to the final deadline at the start of the Easter Term.

END

Marking Scheme:

                          out of   marks
  Method                     5       0
  Spreadsheet and plots      5       0
  Analytical formulae        5       0
  Discussion                 5       0
  Total                     20       0

---------------------------------------
Points arising this year:

When describing cache size, please use units of bytes (not some strange logarithmic parameter from the benchx.h file). Also give its associativity in ways (the next parameter passed to the consistent_cache constructor, defaulting to 8). In fact, please give a full description of the memory system(s) you are using.

There is no hard disk model in use this year. We are running bare metal, with the program and data initially loaded into DRAM by a process that is not part of the model.

Note that a smaller cache will use less static energy, but its hit ratio is lower, causing a rise in DRAM energy that is generally larger than the energy saved in the cache.

The University of Maryland DRAM simulator allocates energy to both the static and dynamic energy accounts, but the distinction is not very important. Only the total is important.

Note that cache misses can be classified as compulsory, sharing or capacity misses. It is good to use these terms in your report.

Q. How many cycles does the OR1K use?
A. Provided there are no cache misses, the fastiss executes each instruction in one cycle, including instructions that in reality would take multiple cycles. There is some commented-out code in the fastiss that partially implements accurate cycle counts. The verilated version is cycle accurate.

Q. What clock frequency is the OR1K?
A. It is set up as running at 200 MHz. An obvious thing to try is adjusting its clock frequency. But of course, above a certain frequency, performance will start to be limited by cache hit access time.
The current set-up of the OR1K does not accurately adjust its static and dynamic power model with frequency: the static power will remain constant and the dynamic power will rise linearly. A more realistic approach would be to adjust these with the clock frequency, based on the supply voltage or drive strength needed in the gates. Basically, the supply voltage needs to be raised according to a power law with an exponent of 1.2 to 1.3 or so, and the static and dynamic energy will follow the square of that (around 1.5), but more citable figures should be found in the literature.

Q. How can I find the hotspots in my code?
A. One way is to compile it natively for the workstation with the GNU profiling option -pg. If you extend an outer loop so that it runs for a few minutes of real time, you will then get pretty good indications from gprof.

Q. How can I estimate the energy and timing performance from a high-level model of my new hardware?
A. There are many papers on the related subject of pre-synthesis RTL performance estimation. In general there are a number of answers, according to the time/space trade-off envisaged for the final implementation. The easiest answer to find is the one where the final hardware will use the same structure and same loop counts as the high-level model, but structural hazards may require its refinement.

Structural hazards occur mainly on memory ports and floating-point ALUs. In C code you can use as many array operations as you like in a set of expressions, but if these are mapped to hardware RAMs they may stall a loop if more values need to be accessed than there are RAM ports. Two address ports is the maximum number normally available. Within a single port it is also typically possible to read out the old data and write new data at a location in one clock cycle. (Register files can potentially have unlimited numbers of ports, as well as dedicated input and output wiring for specific locations, but at the cost of much more energy and area.)
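
As a rough illustration of the port-limit argument above, the following Python sketch computes a lower bound on the number of clock cycles one loop iteration needs when each RAM can serve only as many accesses per cycle as it has address ports. The function name, the dictionary layout and the example RAM names are hypothetical, and the sketch ignores both the read-old/write-new facility of a single port and any data dependencies.

```python
import math

def min_cycles_per_iteration(accesses, ports):
    """Lower bound on cycles per loop iteration imposed by RAM ports.

    accesses: dict mapping RAM name -> number of accesses per iteration
    ports:    dict mapping RAM name -> number of address ports (normally 1 or 2)
    """
    worst = 1  # an iteration takes at least one cycle
    for ram, n in accesses.items():
        # Each cycle, a RAM can serve at most ports[ram] accesses.
        worst = max(worst, math.ceil(n / ports[ram]))
    return worst

# Hypothetical inner loop: 5 accesses to 'state', 1 to 'round_consts'.
accesses = {"state": 5, "round_consts": 1}
ports = {"state": 2, "round_consts": 1}
print(min_cycles_per_iteration(accesses, ports))  # 3
```

In this made-up case the dual-ported 'state' RAM is the bottleneck. Splitting its contents across two RAMs, if the access pattern allows it, could lower the bound to 2.
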
Work out how many RAMs you need and their sizes (length, width and number of ports) and what each will contain, such that its ports are not overloaded. Put simple scalars in registers. Refactor your high-level design to balance the load on memory ports and to avoid structural hazards.

We now need to estimate the area, energy and clock frequency for your design.

Firstly, you must either instrument your (refactored) model to count loops and Hamming distance in register assigns, or else estimate these numbers.

Secondly, you must compute a logic depth and complexity for the right-hand sides of your assignments. You can use one gate per bit for logic operations, four gates per bit for arithmetic operations and quadratic cost for multipliers. Constant shifts are free, of course. Better still, take your inner loop and use LegUp, or manually convert it to your favourite RTL, and synthesise it for some technology you have to hand: gate counts and depths will be roughly the same for all technologies. I will put some more information here on sheet 2: http://www.cl.cam.ac.uk/~djg11/greaves-vlsitech-spreadsheet.ods

Thirdly, estimate the area of your design, using the RAM areas and the register and gate counts. The average net length will be about 0.3 times the square root of the area. Dynamic and wiring energy is then estimated using the same activity ratio as you estimated or measured for the registers. For a hash function, activity is going to be maximal, which means a bit will discharge 25 percent of the time. Static energy is estimated from the gate+register+RAM count. Clock frequency will be the reciprocal of the logic depth, expressed as a delay.

Remember to count only the discharge wiring energy, since counting the charge (0 to 1) transitions as well would double count it. The TLM POWER3 library correctly counts only one direction.

Note that, to be independent of technology, you can report your final time and area results in units of FO4 delays and square lambda respectively.
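
The rules of thumb above can be collected into a small calculator. The following Python sketch is only an illustration: the function names, the operator categories and the example datapath are hypothetical, and the multiplier cost uses a unit coefficient on the quadratic term, which is an assumption rather than a figure from this sheet.

```python
import math

def gate_count(op, width):
    """Rule-of-thumb gate count for one operator of the given bit width."""
    if op == "logic":        # AND, OR, XOR, NOT: one gate per bit
        return width
    if op == "arith":        # add, subtract, compare: four gates per bit
        return 4 * width
    if op == "mul":          # quadratic in width (unit coefficient assumed)
        return width * width
    if op == "const_shift":  # constant shifts are only wiring
        return 0
    raise ValueError("unknown operator kind: " + op)

def average_net_length(area):
    """Average net length: about 0.3 times the square root of the area."""
    return 0.3 * math.sqrt(area)

# Hypothetical 64-bit datapath: 3 XORs, 1 adder and 2 constant shifts.
ops = [("logic", 64)] * 3 + [("arith", 64)] + [("const_shift", 64)] * 2
total_gates = sum(gate_count(op, w) for op, w in ops)
print(total_gates)  # 448

# Area in square lambda gives a net length in lambda (about 300 here).
print(average_net_length(1.0e6))
```

The gate total feeds the static energy estimate, and the net length, combined with the measured or estimated activity ratio, feeds the wiring energy estimate.
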
---------------------------------------
Points arising last year:

Note that the running time of the simulator can be greatly reduced by recompiling without PW_TLM_PAYLOAD=3 (which tracks transitions in the generic payload), or without TLM_POWER3 being installed at all, and a rough doubling in performance is possible using a large quantum keeper constant. You may wish to investigate these effects and write them up.

If your measured results do not fit your analytical model, do not worry: full credit can still be achieved in this Exercise, but an explanation should be included if you use this work as your mini-project.

---------------------

You can also modify the DRAM bank interleave and see what effect that has. Instead of having one DRAM completely after the other, another common design approach is for them to be finely interleaved at the granularity of a cache line. The change can basically be made in the ten characters or so of C code in the busmux that divides the work between the DRAMs. But you'll also have to implement a solution to loading the program ELF file for such a memory structure, e.g. change memloader to reflect your interleaving, or load the program into another memory or region of memory that is not so interleaved.
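
To make the two address mappings concrete, here is a minimal Python sketch of how a bus multiplexer might choose a DRAM for a given byte address. It is not the busmux code itself: the function name, its parameters and the example sizes (two 64 MiB DRAMs, 32-byte cache lines) are all assumptions made for illustration.

```python
def map_address(addr, n_drams, dram_bytes, line_bytes, interleaved):
    """Return (dram_index, offset_within_dram) for a physical byte address."""
    if not interleaved:
        # One DRAM completely after the other.
        return addr // dram_bytes, addr % dram_bytes
    # Fine interleave: consecutive cache lines go to consecutive DRAMs.
    line = addr // line_bytes
    dram = line % n_drams
    offset = (line // n_drams) * line_bytes + addr % line_bytes
    return dram, offset

# Hypothetical set-up: two 64 MiB DRAMs and 32-byte cache lines.
for a in (0x00, 0x20, 0x40, 0x60):
    print(hex(a), map_address(a, 2, 64 << 20, 32, True))
```

With these assumed sizes, addresses 0x00, 0x20, 0x40 and 0x60 land in DRAM 0, 1, 0 and 1 respectively, whereas the sequential mapping places all four in DRAM 0. A program loader for the interleaved structure would need to apply the same mapping when it places the ELF contents.
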