D J Greaves + M Yasin: Very High Level Simulation (and Modelling).
Our proposed work is a series of totally new experiments to test the applicability of the Rent/Greenfield rule and to further refine it at even higher levels of system description. The highest level we consider is a computer program and its input data without any description of the type or size of computer it is to run on. These experiments will give new insight into the applicability of the rule and could lead to a grand contribution that helps steer the direction of computing research in general.
The main questions we investigate will be `How high a level model of computation can we use while still obtaining a reasonably accurate power and performance figures and what are the key facets to include in such a model ?' Or put another way : `Given the source code for a specific application benchmark, what types of useful model exist that span the distance between an abstract interpreted run of the program and a TLM model of a hardware platform executing a compiled version of that program ?'. We shall also discover further results on the general applicability of Rent's Rule and test out the analytical model (consisting of Maxwell-like boundary integrals) in Greenfield's work. A single-threaded interpreter walking the abstract syntax tree for the benchmark (or a related form known as its VSDG) to execute a program can easily collect certain baseline statistics such as the total number of instructions executed or the total number of memory reads and writes (in the absence of speculation), but most statistics of interest, such as how long a memory location remains live (from value being written to last read before next write) depend on implementation details of the interpreter, such as the order of argument evaluation in function applications. In general, as we range over different execution styles for a given program with given input data, there should be several important factors that influence the metrics determined by a very high-level simulator/emulator. Binding (or mapping) is the process of allocating variables and operations (such as a floating point addition) to particular structural resources (i.e. physical components such as memories and adders). This also happens in the compiler when variables are allocated forms of storage, such as register/stack/heap etc. and expressions are allocated an order of computation. Placement is the process of allocating co-ordinates to each resource in two dimensions (X-Y location on a chip) or greater for multi-board/multi-chip systems. Routing is the process of allocating a path through the 2-D or higher space a communication takes (either by physical wire or packet switched path). These three processes involve a lot of detailed decision making, but the Rent/Greenfield power laws are able to predict the macroscopic result at each level without performing the allocation at the next lower level.
A placement model is potentially needed to collect statistics about the length of wire that a datum travelled along. Such a model can be based on the co-ordindates and size of each memory and ALU the Manhattan distance between operand sources and destinations can then be readily computed. The switching activity of the bus nets or on-chip network can also be readily estimated (although how much effect there is from high-order address and data lines having lower activity than low order lines (owing to spatial locality of reference and small values stored in large words) perhaps needs to be taken into account). We need to compare a placed design with an un-placed model that uses the expected distance that data should move as given by Rent/Greenfield model.
Even without placement co-ordinates, we must find out whether it is important to identify the number of memories and register files and the number of ports on them and the mapping ratios of user variables to them. For instance, local variables on the stack might seldom leave L1 cache whereas statically allocated global variables might see more activity in the L2 or L3 caches. The extent to which loops are unwound by the compiler makes a big difference to the number of expressions that need placement. Can this myriad of decisions be mapped to a single scalar metric that represents the general 'meatyness' of the target implementation: ranging from low-power battery computer to datacentre mainframe [CITE AZIZI Azizi:2010:ETP:1815961.1815967] ?
Recent VLSI has greatly increased static power consumption compared with previous generations. With high static power consumption, it is better to keep the powered-up elements busy (with good load balancing and short critical path to the computation) whereas the lowest power result can be obtained using a fewer number of powered-up elements that perform close to serial execution with low data movement distances. On the other hand, another technological pressure is to exploit `dark silicon' where large amounts of VLSI can be filled with low-activity, highly-specialised circuitry: this is the dark silicon or conservation cores approach [CITE]. Can these technology-lead pressures be simply accommodated in a macroscopic model?
Some implementation technologies may have overheads in dynamically mapping work to processing elements, either in terms of determining that a migration would be useful or in performing the migration. However, the cost of this can be set to zero during a very high-level simulator/emulator run while still exploring the variations that arise from selectively allowing or disallowing dynamic load balancing. For instance, if migration at a particular granularity is allowed, then the critical path of the execution should only be that distance away from its minimum length. So the granularity of migration is an important factor that influences a high-level simulation and the costs of the migration are a secondary factor that can also be summed in the results.
Our starting point is a high-level SystemC simulator (described below). We will augment this with certain lower-level models and much higher-level models, including LLVM models of standard C++ benchmarks developed by M Ali (PhD student of Greaves).
Overall we generate one or more power models and shall investigate which parameters and design styles affect their accuracy. Our hunch is that some parameters are more important than others.
This work will be a success if, over a wide variety of experiments, a commonly applicable model emerges. This should also predict other key metrics, relating to memory and cache sizes and interconnection bandwidths.
Our work programme is:
- Select a variety of application programs drawn from standard benchmarks, general workloads and embedded systems. Examples are MPEG compression, chess playing and weather forecasting. Also provide a number of input data sets for each program that exercise it in different ways.
- Construct a variety of simulation platforms at different levels of abstraction, each
capable of executing the applications where each platform is itself widely parameterisable. The levels of abstraction
- Nominal interpreter of the syntax tree-walking style,
- Loosely-timed SystemC transactional model,
- Cycle-accurate SystemC/RTL model,
- Hardware gate-level FPGA/ASIC model (synthesised to gate level with fully-annotated place and route).
- Instruction word (32 or 64 bit) size
- Number of processor cores
- Cache heirarchy and main memory structure
- Interconnection fabric (simple bus, switch or network on chip).
- For each application program, for each data set, run each simulator at each design point. These are the experiments.
- Apply the Greenfield-Rent analysis for each experiment to see which metrics can be accurately predicted from which other within an experiment.
- Perform curve fitting and compute the correlation coefficients to generate the main model.
The aim of the curve fitting is to create an overarching economic model for contemporary computing. This includes the vast number of design parameters available (cores per chip, no of circuit boards, supply voltage, clock freq, cache size, network capacity) and the various needs (parallelism, speed, battery life, energy per map-reduce op) in different scenarios (from embedded to data centre). Energy consumption and cooling needs are as important as capital cost and performance - hence the macro model is indeed overarching.
Simulator (Greaves, Yasin, Puzovic)
The simulator uses SystemC Transactional Level Modelling (TLM) and the TLM POWER3 library for power annotation. With TLM, rather than modelling the individual wires between components, we model complete transactions between them as atomic operations, such as passing a packet.
The simulator currently includes three types of processor core, including MIPS, OpenRISC and a trace-driven core that re-runs traces generated by the PIN tool. The OpenRISC core is available in two forms (cycle-accurate (generated from RTL using Verilator)) and fast ISS (based on ORSIM) and it can run busybox linux or a bare-metal implementation of libc with SMP pthreads. The DRAM model is a high-level model or a wrapper for instances of the DRAMSIM2 from University of Maryland. The SRAM models are from McPat/Cacti. We have run the SPLASH-2 benchmarks directly on the OpenRISC multicore ISS based model and traces from other benchmarks using x86 can be imported.
No results online here yet ...