ACS SOC D/M P35: Ex 4 A/B 2016/17: Mini-Project and Structured Research Essay.

The deadline for all P35 work is the first day of Easter Term.

Notes: Please ensure you have completed the earlier exercises, and feel free to reuse text or results from them for the Mini-Project (4a). Collaboration is not allowed for the Research Essay. For the Mini-Project it is allowed only for parts borrowed from the term-time work, or with express permission, which will be granted only if the nature of the collaboration enables individual contributions to be clearly distinguished.

Your audience is the External Examiner, the Second Assessor and readers of Design and Reuse or Electronics Times. It is therefore worthwhile explaining material that would perhaps be well known to others directly involved in this module.

Please feel free to contact DJG as much as you like for assistance and advice with Exercise 4 A/B over the Easter Vac.

---------------------------------------------------------------------

Exercise 4a (accounts for 30 credit points): Mini-Project

Construct an interesting argument based on practical work you have conducted using design tools for FPGA and System-on-Chip embedded software and accelerators. This will most likely contain an evaluation of the group mini-project from last term, but you need to make perfectly clear what your own contribution to the work is, and any measurements must be your own work.

Example arguments for 4a are:

1. Accelerating our project application saves energy because ...
2. Using a virtual platform was a good idea because ...
3. Having determined that the performance looked good using a simple FPGA experiment, we can explore rack-scale and/or custom silicon performance using ...

But you may expound on any sensible and interesting result from your practical work.

Write a report in a style suitable for publication in Electronics Times or Design and Reuse (or similar). You should aim to write at least 2000 words, but full credit is available if information is instead conveyed in diagrams and figures. All of the words must be your own work, but diagrams from any source may be included if credited properly. Your argument itself does not have to be original: basing your report on an existing D&R or ET article is acceptable.

Most importantly: think carefully about your report structure. Cite relevant prior work or alternative solutions. It is generally easiest to use a provocative title that poses a question, then expand on the question in the introduction and answer it at the end. Feel free to ask DJG for further pointers on specific topics.

---------------------------------------------------------------------

Exercise 4b: Structured Research Essay

Task: See companion sheet.

---------------------------------------------------------------------

4a: Further Notes arising this year (March/April 2017)

- This section will be updated in response to email interactions.

Q. In our current implementation, the device sets a "finished" flag to 1 once it finishes the calculation, and the ARM keeps polling this address to see the state of the flag. This works fine when compiler optimisation is turned off (-O0), but not when it is on (-O2). I am guessing the ARM is caching the value at this address after the first query and therefore cannot see the update from the device. Does this seem a reasonable guess? Is there any workaround other than turning off the cache?

A. You don't say whether this is on the real or the virtual platform - the means of implementing cache bypass differ slightly between the two. Prazor bare-metal programs simply exploit Prazor's UNCACHED_ADDRESS_SPACE64 macro, but when the MMU is enabled it takes over from the macro, as on the real hardware. If the compiler optimisations are having the effect you report, then it is not a matter of caching anyway, but of compiler code generation. Have you used the keyword volatile, as in the examples in my notes?

  int a = *((volatile int *)0xblah);

This will stop the compiler caching the result in a register. Check the assembler generated by the C compiler using objdump -d on the object file - you should see whether the compiler has hoisted the poll out of your spinning loop.
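As a concrete sketch of the advice above, a minimal bare-metal polling loop might look like the C fragment below. The register address and flag encoding are invented placeholders, not the ones used in the practical work.

  #include <stdint.h>

  /* Hypothetical address of the accelerator's "finished" flag register
     (placeholder value - substitute the address from your own design). */
  #define FINISHED_FLAG_ADDR 0x43C00004u

  static void wait_for_device(void)
  {
      /* volatile forces a fresh load on every iteration, so -O2 cannot
         hoist the read out of the loop or keep it in a register.       */
      volatile uint32_t *finished = (volatile uint32_t *)FINISHED_FLAG_ADDR;
      while (*finished == 0)
          ;  /* spin until the device writes 1 */
  }

Applying objdump -d to the compiled object should then show a load instruction inside the loop body rather than a single load hoisted above it. Remember that volatile only constrains the compiler: on the real board the flag's address must also be mapped uncached (or the data cache maintained explicitly) for the core to see the device's write.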
Q. How can I estimate gate counts, area and energy?

A. This is in my lecture notes somewhere - but I cannot find it just now ...

Estimating Area and Energy values to insert in a high-level model of a peripheral.
----------------------------------------------------------------------------------

You need to choose a target technology: ASIC and FPGA will differ in area by a factor of roughly 100. Let us select an ASIC in 22 nm with a 1 volt supply and 7 layers of metal wiring.

Although silicon dioxide has a relative permittivity (e_r) of about 3.8, meaning the speed of light in it is divided by about 2, the resistance of the tracks degrades this quite a bit further. So you can assume that the effective propagation speed for logic values on the wiring layers is roughly 0.1 times the speed of light, i.e. about 30 metres per microsecond.

The wiring capacitance for a pair of conductors surrounded by silicon dioxide (permittivity about 4), whose width is the same as their spacing, is pi * e_0 * e_r Farads per metre, which is roughly 100 pF/m. For VLSI the equivalent units of 0.1 pF/mm or 0.1 fF/um are more useful. A 1 mm conductor will therefore have a capacitance of 0.1 pF and, with a one-volt supply, will dissipate 0.1 pJ over a full charge and discharge cycle. Note that toggle rates are twice activity ratios, so do not count the dynamic energy for both the charge and the discharge!

This figure is hardwired into the pw_tlm_payload.cpp file, but there are two minor mistakes that largely cancel. That file wrongly multiplies the energy by 0.5 and should be changed not to do so, since that energy is dissipated on both the charge and the discharge of the nets. It also uses a figure of 0.3 pF/mm, based on the permittivity of silicon being 12; in reality, conductors in VLSI are mostly surrounded by silicon dioxide and perhaps silicon nitride (permittivity 8). Overall the computation seems accurate enough in its errored form.

A small logic gate, such as a 2-input NAND gate, will have dimensions of about 22x13 lambda, and so occupies about 0.14 square microns in area for lambda = 22 nm. Note there are 1e6 square microns in a square millimetre.
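As a worked example of these figures, the short C program below folds the 0.1 pF/mm and 0.14 square-micron rules of thumb into a wire-energy and gate-density estimate. The 1 mm net length and 100 MHz toggle rate are made-up illustrative inputs, not measurements.

  #include <stdio.h>

  /* Rule-of-thumb constants from the notes above (22 nm ASIC, 1 V supply). */
  #define WIRE_CAP_PF_PER_MM  0.1    /* roughly 100 pF/m                   */
  #define SUPPLY_VOLTS        1.0
  #define NAND2_AREA_UM2      0.14   /* 22x13 lambda at lambda = 22 nm     */

  int main(void)
  {
      double net_mm     = 1.0;    /* example: a 1 mm on-chip net           */
      double toggles_hz = 100e6;  /* example: 100 MHz toggle rate          */

      double cap_pf = WIRE_CAP_PF_PER_MM * net_mm;

      /* Energy for a full charge-and-discharge cycle: E = C * V^2.
         The toggle rate counts both edges, so halve it to get cycles.     */
      double e_pj     = cap_pf * SUPPLY_VOLTS * SUPPLY_VOLTS;
      double power_uw = (e_pj * 1e-12) * (toggles_hz / 2.0) * 1e6;

      printf("C = %.2f pF, E = %.2f pJ per cycle, P = %.1f uW\n",
             cap_pf, e_pj, power_uw);
      printf("Roughly %.1f million NAND2 equivalents per square mm\n",
             (1e6 / NAND2_AREA_UM2) / 1e6);
      return 0;
  }

For these inputs the program reports 0.10 pF and 0.10 pJ per cycle, matching the figures quoted above, and about 5 uW for the assumed toggle rate.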
Broadside Arith/Logic Operator Equivalent Gate Count: Rules of Thumb for Area Estimation (W = word size)

  W-bit register - six times the number of bits in the register
  Shift by constant, left or right - no logic
  AND or OR - number of bits in the word times 1
  XOR - number of bits in the word times 1.5
  Adder/subtractor - sum of the number of non-constant bits in the two operands times 2.5
  Multiply/divide by constant - number of non-zero bits in the constant multiplied by the word width multiplied by the adder width
  Multiply two operands (flash) - longer operand times shorter operand times 2.5
  Divide two operands (fixed or variable-latency long division) - four registers plus two adders plus 100
  Static RAM (small RAMs, WxL < 1000 bits) - 600 square lambda per bit
  Static RAM (larger RAMs, WxL < 12800 bits) - 400 square lambda per bit

END

Q. In the TLM model, if my custom device takes some time to process a task, should I put the task in the blocking transport function, estimate the processing time of my logic and add it to the delay variable? If so, is there a guideline I can follow for this estimation? Otherwise, should I use something like a separate SC_THREAD?

A. A separate SC_THREAD may be a good idea for a device that does a good deal of independent processing, especially when transaction order needs to be preserved. But for basic target devices that simply process slave operations in order and do not initiate any of their own, the target should be able to just augment the delay field in the transaction to model a slow response. This coding style is appropriate when, in reality, the bus resources between the initiator and the target would be held up for the entire transaction. Many target devices are not like that: the individual bus transactions on them operate at normal speed, taking just one bus clock cycle, but the initiator has to wait for completion using polling or interrupts. So it depends on the type of concurrency that is going on.

However, there was a bug in Prazor, finally fixed just two weeks ago, where large values added to the TLM delay parameter when accessing a device upset the performance modelling of instructions being processed by the CPU core at the same time. This matters for out-of-order cores only, and the ARM Cortex-A9 on the Zynq board is pretty much in-order. "Large" means likely to cross a loosely-timed quantum boundary. The fix, which is a little temporary at the moment, involved replacing the delay variable in the transaction with a pair: the delay and the kernel time it is relative to. This pair is now kept in the generic payload itself and is called lt_delay. I will post this information into the Prazor Temporary Reference Manual and provide a pointer to a rebuilt version shortly. If you get a new copy of Prazor from Phabricator it uses this coding style. The augment to the delay should be done with the lt_delay override of the plus operator; see src/tenos/lt_delay.h.

Also, when I last checked, it was necessary to compile with TLM_POWER3 turned on and with a hack in TLM_POWER3, as reflected in /usr/groups/han/clteach/tlm-power3/include/pw_tlm_payload.h, that disables the definition of pw_base_protocol_types that is currently redefined in prazor.h:

  #ifndef TEMP_PT_BYPASS
  struct tlm_pw_base_protocol_types1
  {
    typedef PW_TLM_PAYTYPE tlm_payload_type;
    typedef tlm::tlm_phase tlm_phase_type;
  };
  #endif

Hope that helps!
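For the common case of a simple in-order slave, the standard TLM-2.0 loosely-timed coding style looks roughly like the sketch below. It uses the stock sc_time delay argument of b_transport rather than Prazor's new lt_delay pair (for which see the next answer), and the module name, socket name and 50-cycle processing latency are invented for illustration.

  #include <systemc>
  #include <tlm>
  #include <tlm_utils/simple_target_socket.h>

  // Sketch of an in-order slave that models its processing time by
  // augmenting the delay argument of the blocking transport call.
  struct my_accel : sc_core::sc_module
  {
      tlm_utils::simple_target_socket<my_accel> port0;
      sc_core::sc_time clock_period;

      SC_CTOR(my_accel) : port0("port0"), clock_period(4, sc_core::SC_NS)
      {
          port0.register_b_transport(this, &my_accel::b_transport);
      }

      void b_transport(tlm::tlm_generic_payload &trans, sc_core::sc_time &delay)
      {
          // ... decode trans.get_address() and perform the read or write ...

          delay += clock_period;      // one bus cycle for the access itself

          // If the access by design stalls the initiator until the result
          // is ready, add the estimated processing time too - here a
          // ballpark guess of 50 clock cycles.
          if (trans.is_write())
              delay += 50 * clock_period;

          trans.set_response_status(tlm::TLM_OK_RESPONSE);
      }
  };

With the newer Prazor coding style described in the next answer, the same increment would instead be applied to the lt_delay field carried in the payload.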
By "target device should be able to just augment the delay field in the transaction", do you mean the delay field will be incremented automatically by the TLM module, or I need to estimate the amount of delay and add it to the delay field? What guideline shall I follow if I need to do the estimation? The code for a peripheral device should augment the delay field in the TLM transaction, just like in the Toy ESL classes. For most simple devices, the augment is just one clock cycle, such as 4 ns. But for a slow device which will stall the initiating core by design, it will be much more. The amount of increment in a high-level model will be an estimate: perhaps just a ballpark guess of the number of clock cycles. If you do not have a low-level model then you can do no better. If you have also coded the design in RTL, you can measure the performance of the low-level design on an RTL simulator or count the clock cycles by hand. With the changes to Prazor earlier this month, you must increment the field in the payload and not the field in the transaction socket callback. For example, uart64_cbg at line 485 now uses the following macro to increment the delay AUGMENT_LT_DELAY(trans.ltd, delay, latency) which can be adjusted in prazor.h to augment the new or old delay variable. But I think I will only be using the new field from now on. BUG: A bug in the switch to the new coding style: Class variable were being masked by redeclaration as locals. @@ -487,8 +489,8 @@ void armcore_tlm::run() } - lt_delay lt_i_delay = master_runahead; - lt_delay lt_d_delay = master_runahead; + lt_i_delay = master_runahead; + lt_d_delay = master_runahead; if (reset_or_yield_countdown > 0) reset_or_yield_countdown -= 1; else { Q. We planned to extend the work in 3b to a chained matrix multiplication program, like the form Y=(A*X+B)*C+D, and we planed to divided the work just as we did on 3b, i.e. I work on TLM, Aaron works on RTL, Chris works on energy/timing statistics collection, is this kind of collaboration allowed? If yes, can I use Chris's statistics in my report directly if I reference that? Extensions to the practical work of 3b sound like a good idea and can be shared. It is important that this is all made clear in your write up. The analysis and reports for Exercise 4 should be your own individual work and where this is based on the shared work of Exercise 3 again needs to be clear. As you recall, Exercise 3 is about getting the shared practical work into working order and having confidence in it. If another member of the group has done some in-depth analysis of the performance they will probably be using this for their own Exercise 4a write up. You need to make your own contribution to gain credit in Exercise 4a. You can certainly quote someone else's work but you will not be awarded Exercise 4 credit for it. --------------------------------------------------------------------------- 4a: Further Notes arising from previous years. Q. How about the length of 4a? The articles on design and reuse vary from very short to quite lengthy. A. I agree they vary a lot. For your masters degree you need to show that you have 'mastered' a particular subject, and this does not directly relate to some word count. Also I've tried not to be overly prescriptive with exactly what gets written up where, but being underly prescriptive is also unhelpful. And the word count will vary greatly according to how many diagrams used to tell the story. Overall about 5 pages for each of Ex 4a and Ex4b (total 10) should be sufficient. 
Q. We planned to extend the work of 3b to a chained matrix multiplication program of the form Y = (A*X + B)*C + D, and we planned to divide the work just as we did for 3b, i.e. I work on the TLM, Aaron works on the RTL and Chris works on energy/timing statistics collection. Is this kind of collaboration allowed? If so, can I use Chris's statistics directly in my report if I reference them?

A. Extensions to the practical work of 3b sound like a good idea and can be shared. It is important that this is all made clear in your write-up. The analysis and reports for Exercise 4 should be your own individual work, and where they are based on the shared work of Exercise 3 this again needs to be made clear. As you recall, Exercise 3 is about getting the shared practical work into working order and having confidence in it. If another member of the group has done some in-depth analysis of the performance, they will probably be using it for their own Exercise 4a write-up. You need to make your own contribution to gain credit in Exercise 4a. You can certainly quote someone else's work, but you will not be awarded Exercise 4 credit for it.

---------------------------------------------------------------------------

4a: Further Notes arising from previous years.

Q. How about the length of 4a? The articles on Design and Reuse vary from very short to quite lengthy.

A. I agree they vary a lot. For your master's degree you need to show that you have 'mastered' a particular subject, and this does not directly relate to some word count. I have tried not to be overly prescriptive about exactly what gets written up where, but being under-prescriptive is also unhelpful. The word count will also vary greatly according to how many diagrams are used to tell the story. Overall, about 5 pages for each of Ex 4a and Ex 4b (10 pages in total) should be sufficient.

Other points you might consider are:

- In the case of IP-XACT, it is not clear how many SoC designs use it, but certainly not all, so there are alternative approaches, including some commercial products.
- Chisel is a new hardware language and rather simple, although elegantly embedded in Scala, which provides considerable metaprogramming power. What is its future? What should Chisel 3.0 aim to include?
- What percentage of industrial high-level modelling uses SystemC, and what else is used?
- Is UML/SysML or GUI-based SoC design serious or toy?
- What would be needed to take your practical work to industry and for it to be adopted?
- What is the complete bundle for IP block distribution? This clearly includes high-level models, actual implementations, data sheets and machine-readable meta-information for energy and area accounting, test-programme generation, software programming, automatic configuration ...

END