## spEEDO: Energy Efficiency through Debug suppOrt (& On Chip Analytics)

PEHAM Project: Power estimation from high-level models

David Greaves, Ali Zaidi, Klaus McDonald-Maier

University of Cambridge Computer Laboratory and Ultrasoc Ltd

NMI Multicore Meeting, Cambridge, March 2014.



David Greaves + Ali Zaidi

## **Computer Laboratory Research 1**

- Energy Management Techniques in Modern Mobile Handsets

(N Vallina-Rodriguez, J Crowcroft, IEEE COMMUNICATIONS SURVEYS 2012).

- Dynamic Microarchitectural Adaptation Using Machine Learning

(C Dubach, TM Jones, EV Bonilla, ACM Transactions on Architecture and Code Optimization (TACO), 2013)

- The Smart Cache: An Energy-Efficient Cache Architecture Through Dynamic Adaptation

(KT Sundararajan, TM Jones and NP Topham, International Journal of Parallel Programming, 2013)

- Computer Laboratory: C-AWARE

C-AWARE aims to build services to improve users' awareness of their personal energy consumption, and modify their energy demand.

### **Gates Building Power**

We have a log of nearly all the power used in our building in Cambridge.




That's in the *C-Aware Project* which has installed monitors on all the mains cables in the switch room.

The picture shows just four of many.



NMI Multicore Cambridge



| Description               | Start    | End      | Avg kW Selected | Avg kW Entire | Total Energy (kWh) |
|---------------------------|----------|----------|-----------------|---------------|--------------------|
| Entire Building           | Dec 2013 | Jan 2014 | 179.78          | 161.02        | 236,542            |
| Logical Sum of Sub Meters | Dec 2013 | Jan 2014 | 172.72          | 154.53        | 227,007            |
| Machine Rooms             | Dec 2013 | Jan 2014 | 56.46           | 57.44         | 84,381             |
| Sockets                   | Dec 2013 | Jan 2014 | 50.90           | 46.10         | 67,727             |
| Miscellaneous             | Dec 2013 | Jan 2014 | 4.85            | 4.01          | 5,889              |


## **Computer Laboratory Research 2**

- 'TLM POWER 3: Power Estimation Methodology for SystemC TLM 2.0' (DJ Greaves & MM Yasin, FDL'12)

- KIWI – Compiling dotnet C# programs to FPGA for low energy execution (DJ Greaves + S Singh).

```
for (i = 0; i < 100; i++)
  if (A[i] > 0) foo();
bar();
```

- Achieving Superscalar Performance without Superscalar Overheads – A Dataflow Compiler IR for Custom Computing (AM Zaidi and DJ Greaves).





#### PC CPU Power Probe



#### The same USB probe

Measures 12 volt rail to motherboard CPU socket.

Measures volts and amps at 10 Hz rate.

Accuracy: consistency of about 1 percent between runs.
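As a minimal sketch of how such readings turn into an energy figure, the block below integrates volts times amps over the sample period with the rectangle rule. The `Sample` type and both function names are assumptions made for illustration; they are not part of the probe software.

```cpp
#include <cassert>
#include <vector>

// One reading from the probe: rail voltage and current at a sample instant.
struct Sample {
    double volts;
    double amps;
};

// Rectangle-rule integration: E = sum(V_k * I_k) * dt, with dt = 1 / sample rate.
double energy_joules(const std::vector<Sample>& samples, double sample_rate_hz) {
    assert(sample_rate_hz > 0.0);
    const double dt = 1.0 / sample_rate_hz;
    double joules = 0.0;
    for (const Sample& s : samples) joules += s.volts * s.amps * dt;
    return joules;
}

// Mean of the instantaneous power readings, in watts.
double average_power_watts(const std::vector<Sample>& samples) {
    if (samples.empty()) return 0.0;
    double sum = 0.0;
    for (const Sample& s : samples) sum += s.volts * s.amps;
    return sum / static_cast<double>(samples.size());
}
```

For example, ten samples of 12 V at 2 A taken at 10 Hz cover one second and integrate to about 24 J.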


### **Probed and Probing Machines**



AMD 6-Core Phenom 64 Processor with TCP connection to power probe machine.


### Splash-2 'RADIX' : First Test Setup



Plot shows two runs with two cores and then one run with one core.

- Problem: the power probe was running on the same machine (spikes).
- Problem: some spikes were missed owing to aliasing (missing ADC LPF).
- Fixed thereafter (use a separate probe machine and add an RC filter).
- Note: this older CPU used 3x the power of the Phenom...

### **TLM Power 2 Library**

```
class FOO:
  public sc_module,
  public pw_module
{
  public:
   SC_HAS_PROCESS(FOO);
   FOO(const sc_module_name& p_name):
     sc_module(p_name),
     pw_module("config.txt")
   {
     SC_THREAD(process);
   }
   void process(void)
   {
     update_power(PW_MODE_ON, PW_PHASE_IDLE);
     wait(10, SC_NS);
     // Perform some computation
     update_power(PW_MODE_ON, PW_PHASE_COMPUTE);
     wait(20, SC_NS);
     update_power(PW_MODE_OFF); // Turn off module
   }
};
```

- TLM POWER 2 developed at France CEA (Lebreton/Vivet)
  - Used phase/mode modelling
  - No LT
  - No TLM socket integration.

### **TLM POWER 3: Motivation**

- Power estimation from high-level models.
- Rapid architectural exploration using SystemC.
- Absolute accuracy goal: correct order of magnitude at least!
- Relative accuracy goal: 30 percent or so.
- Want correct polarity of the parameter derivatives: *a change is better or worse*!

## **Physical Units**

- SystemC provides overloaded sc\_time units
- TLM POWER 2 added pw\_energy and pw\_power units with all appropriate overloads.
- TLM POWER 3 adds pw\_voltage for F/V scaling.
- TLM POWER 3 also adds pw\_length and pw\_area.

Basic physics: energy divided by time ---> power

Basic physics: length times length ---> area
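The idea of typed physical units can be sketched in plain C++ with small wrapper types and overloaded operators. The type names below (`Energy`, `Time`, `Power`, `Length`, `Area`) are illustrative assumptions and are not the `pw_*` classes of the library; the sketch only shows how the compiler can enforce the two rules above.

```cpp
#include <cassert>

// Hypothetical unit wrappers; each holds one value in SI base units.
struct Time   { double seconds; };
struct Energy { double joules; };
struct Power  { double watts; };
struct Length { double metres; };
struct Area   { double square_metres; };

// energy divided by time ---> power
inline Power operator/(Energy e, Time t) {
    assert(t.seconds > 0.0);
    return Power{e.joules / t.seconds};
}

// power times time ---> energy
inline Energy operator*(Power p, Time t) {
    return Energy{p.watts * t.seconds};
}

// length times length ---> area
inline Area operator*(Length a, Length b) {
    return Area{a.metres * b.metres};
}
```

With these overloads, an expression such as `Energy{10.0} / Length{2.0}` fails to compile, which is the point of carrying units in the type system.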


### **Setting Static Parameters**

```
class FOO:
  public sc_module,
  public pw_module
{
    public:
    SC_HAS_PROCESS(FOO);
    FOO(const sc_module_name& p_name, int width):
        sc_module(p_name),
        pw_module("config.txt")
    {
        set_excess_area(pw_length(50.0 * width, PW_um), pw_length(5.0 * width, PW_um));
    }
};
```

Excess area: the local increment above the sum of the instantiated modules below.

Typically set the area and static power in the constructor.

Example: for a RAM, the area can depend on the number of bits.

#### LT b\_transport energy annotation

```
tac_response tac_multiport_router::b_transport(tlm_generic_payload &trans, sc_time &delay)
{
    unsigned int len = trans.get_data_length();

    ... // Main body of the behavioural model

    sc_time lt_activity_time = ...;
    delay += lt_activity_time; // Or use qk_inc to perform this addition

#ifdef TLM_POWER3
    // bit_width has been set in the constructor... etc
    pw_energy energy_cost = pw_energy((double) (5 * len), pw_energy_unit::PW_pJ);
    pw_module_base::update_energy(energy_cost, lt_activity_time);
#endif
}
```

Bad: This shows computation of energy per transaction in the body of the transaction.

Better: Energy and floating point computations done in RECOMPUTE\_PVT callback.
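One way to picture the "better" style is a model that does its floating-point work once, when the operating point changes, and leaves only an integer multiply on the transaction path. The class below is a hypothetical sketch: the name `RouterEnergyModel`, the method names, and the assumption that dynamic energy scales with the square of the supply voltage from a nominal 5 pJ per byte at 1.0 V are all illustrative and not taken from the TLM POWER 3 sources.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-module energy model with a cached per-byte cost.
class RouterEnergyModel {
public:
    // Called rarely, e.g. when the supply voltage changes.
    // Assumption: energy per byte is 5 pJ at 1.0 V and scales as V squared.
    void recompute_pvt(double supply_volts) {
        assert(supply_volts > 0.0);
        const double pj_per_byte = 5.0 * supply_volts * supply_volts;
        fj_per_byte_ = static_cast<std::uint64_t>(pj_per_byte * 1000.0 + 0.5);
    }

    // Called for every transaction: integer arithmetic only.
    std::uint64_t transaction_energy_fj(unsigned len_bytes) const {
        return fj_per_byte_ * len_bytes;
    }

private:
    std::uint64_t fj_per_byte_ = 5000;  // 5 pJ per byte, in femtojoules
};
```

The transaction body then only calls `transaction_energy_fj(len)`, so the cost of the annotation stays small even for short transactions.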


### **Spatial Layout Support**

- Every SC\_MODULE has a chip/region designation.
- The area of a module is sum of
  - its children with the same chip/region name
  - its locally defined 'excess area'.
- Inter-module wiring lengths can be estimated using Rent's Rule on area of lowest-common-parent.
- Actual X-Y co-ordinates could be allocated by a placer.
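The area rule in the list above can be sketched as a recursive sum. The `Module` struct and `area_um2` function are assumptions for illustration (the library attaches this information to SystemC modules instead); the sketch covers only the area roll-up, not the Rent's Rule wiring estimate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a module in the design hierarchy.
struct Module {
    std::string region;            // chip/region designation
    double excess_area_um2;        // locally defined excess area
    std::vector<Module> children;  // instantiated sub-modules
};

// Area = local excess area + areas of children in the same chip/region.
// Children placed in a different region are not counted here.
double area_um2(const Module& m) {
    assert(m.excess_area_um2 >= 0.0);
    double total = m.excess_area_um2;
    for (const Module& child : m.children) {
        if (child.region == m.region) total += area_um2(child);
    }
    return total;
}
```

A core with 50 um² of excess area, an on-chip RAM of 100 um² and an off-chip DRAM child would report 150 um², since the DRAM sits in a different region.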

### Report Formats (2: Ascii-art text file)

Title: privmem-c1n6000-dramsim-withcache-nile-gash-harvard

Creation Date: 17:27:22 15/09/2012

Simulation duration: 24826590001096 ps

| MODULE NAME           | STATIC0 ENERGY | %     | DYNAMIC1 ENERGY | %     | WIRING2 ENERGY  | %     |
|-----------------------|----------------|-------|-----------------|-------|-----------------|-------|
| Standalone modules:   |                |       |                 |       |                 |       |
| Memory 0 (DRAM)       | 0.173879501J   | 3.49% | 0.0875462788J   | 1.76% | 4.48687512e-07J | 0.00% |
| the_top.uart0         | 0J             | 0.00% | 1.644e-06J      | 0.00% | 6.7041e-11J     | 0.00% |
| the_top.busmux0       | 0J             | 0.00% | 1.1905216e-05J  | 0.00% | 0J              | 0.00% |
| the_top.dram=0        | 0.173879501J   | 3.49% | 0.0875462788J   | 1.76% | 4.48687512e-07J | 0.00% |
| top.coreunit_0.core_0 | 0.2482659J     | 4.99% | 0.0044012626J   | 0.09% | 1.34648772e-05J | 0.00% |
| reunit_0.l1_d_cache_0 | 0J             | 0.00% | 0.000594064671J | 0.01% | 6.14810556e-06J | 0.00% |
| 0.l1_d_cache_0.Data_0 | 0.0333542257J  | 0.67% | 0.000107935695J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Tags_0 | 0.0317907464J  | 0.64% | 4.18042825e-05J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Data_1 | 0.0333542257J  | 0.67% | 0.000105833853J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Tags_1 | 0.0317907464J  | 0.64% | 3.37903219e-05J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Data_2 | 0.0333542257J  | 0.67% | 0.000105435493J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Tags_2 | 0.0317907464J  | 0.64% | 2.60627187e-05J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Data_3 | 0.0333542257J  | 0.67% | 0.000108887529J | 0.00% | 0J              | 0.00% |
| 0.l1_d_cache_0.Tags_3 | 0.0317907464J  | 0.64% | (illegible)     | 0.00% | 0J              | 0.00% |


#### spEEDO

spEEDO: Energy Efficiency through Debug suppOrt

- University of Cambridge Computer Laboratory in collaboration with Ultrasoc Limited.
- Funded for six months by the UK TSB.
- Started October 2013.

#### spEEDO

- Develop a power API for three purposes:
  - Embedded software energy reflection API
  - Remote debugger energy accounting and logging
  - Debug access to power-gated regions

Current activities:

- Develop a strawman energy API for access to 'On Chip Analytics'
- Trials on SystemC virtual SoC
- Extend GDB schemas for energy regs


#### **Reference Architecture**




### **Existing Power Events**

#### Typical device driver stats:

eth0 Link encap:Ethernet HWaddr 00:13:20:84:5d:81 inet addr:128.232.9.140 Bcast:128.232.15.255 Mask:255.255.240.0 inet6 addr: fe80::213:20ff:fe84:5d81/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:24110214 errors:0 dropped:0 overruns:0 frame:0 TX packets:15028627 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 RX bytes:3461755890 (3.4 GB) TX bytes:15455753259 (15.4 GB)

Existing event counters in device drivers and hardware can be projected through a calibration matrix to give energy estimates.
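A calibration matrix of this kind can be sketched as a plain matrix-vector product: each row holds the assumed energy cost per event for one estimate, and the vector holds the observed counter values. The function name `estimate_energy` and the layout of the matrix are assumptions for illustration; real coefficients would have to come from measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Each row of joules_per_event yields one energy estimate:
// estimate[r] = sum over i of joules_per_event[r][i] * event_counts[i].
std::vector<double> estimate_energy(const Matrix& joules_per_event,
                                    const std::vector<double>& event_counts) {
    std::vector<double> joules;
    joules.reserve(joules_per_event.size());
    for (const std::vector<double>& row : joules_per_event) {
        assert(row.size() == event_counts.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < row.size(); ++i) {
            sum += row[i] * event_counts[i];
        }
        joules.push_back(sum);
    }
    return joules;
}
```

For a network interface, the counts could be the RX and TX packet and byte totals shown above, with one matrix row per component whose energy is being estimated.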


#### MSRs

#### Machine-Specific Registers:

#### Oprofile example.

Oprofile gives a uniform API to a wide variety of hardware platforms.

Listing shows monitorable event counters on AMD x86-Hammer.

#### **New Power Supply Monitors**




#### Intel's Power Gadget MSRs

Intel has implemented a Running Average Power Limit (RAPL) on Sandy Bridge processors.

A number of machine-specific registers are defined containing energy information.

Sandy Bridge:

- MSR\_RAPL\_POWER\_UNIT
- MSR\_PKG\_POWER\_LIMIT
- MSR\_PKG\_ENERGY\_STATUS
- MSR\_PP0\_POLICY
- MSR\_PP0\_PERF\_STATUS
- MSR\_PKG\_POWER\_INFO
- MSR\_PP0\_POWER\_LIMIT
- MSR\_PP0\_ENERGY\_STATUS

See: Measuring Energy Consumption for Short Code Paths Using RAPL, Hähnel 2012.
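As a sketch of how the raw register values become joules: the energy status units field occupies bits 12:8 of MSR\_RAPL\_POWER\_UNIT and gives the counter resolution as 1/2^ESU joules, and the energy status registers are 32-bit counters that wrap. The two helper functions below are hypothetical names; they take the raw values as arguments and do not show how the MSRs are read, which needs privileged access.

```cpp
#include <cassert>
#include <cstdint>

// Decode the energy status unit (bits 12:8 of MSR_RAPL_POWER_UNIT).
// The result is the size of one counter tick in joules: 1 / 2^ESU.
double rapl_energy_unit_joules(std::uint64_t power_unit_msr) {
    const unsigned esu = static_cast<unsigned>((power_unit_msr >> 8) & 0x1F);
    return 1.0 / static_cast<double>(1ULL << esu);
}

// Energy consumed between two reads of a 32-bit energy status counter.
// Unsigned subtraction gives the right tick count across one wraparound.
double rapl_delta_joules(std::uint32_t before, std::uint32_t after,
                         double unit_joules) {
    assert(unit_joules > 0.0);
    const std::uint32_t ticks = after - before;
    return static_cast<double>(ticks) * unit_joules;
}
```

Because the counter wraps, the two reads must be close enough together that at most one wraparound can occur between them.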



### Energy Aware COmputing Framework (EACOF)

Hayden Field / James Pedlingham – University of Bristol

Basically an SQL networked server where:

- Multiple sensors and other providers can log energy use

- Multiple customers and analytics can inspect.






### Existing GDB Energy Capability ...

`(gdb) info all-registers`

| Register | Hex    | Value           |
|----------|--------|-----------------|
| r0       | 0x0    | 0               |
| r1       | 0x0    | 0x0             |
| r2       | 0x0    | 0x0             |
| r3       | 0x0    | 0               |
| r4       | 0x0    | 0               |
| r5       | 0x0    | 0               |
| r6       | 0x0    | 0               |
| r7       | 0x0    | 0               |
| r8       | 0x0    | 0               |
| ...      |        |                 |
| r29      | 0x0    | 0               |
| r30      | 0x0    | 0               |
| r31      | 0x0    | 0               |
| ppc      | 0x0    | 0               |
| npc      | 0x100  | `0x100 <reset>` |
| sr       | 0x8001 | 32769           |

`(gdb) gdbEPT`

`Energy = 256 j, Time = 0 ms, Power = 0 mW`

#### ... is inadequate!

#### **Register Power ABI Strawman**

Typical hardware registers to implement the spEEDO hardware API (unbanked version):

| Register                   | Offset  | Description                                     |
|----------------------------|---------|-------------------------------------------------|
| `SPEEDO_REG_MONICA`        | 0       | Contains an identifying constant                |
| `SPEEDO_REG_ABI`           | 8       | Version number of the interface                 |
| `SPEEDO_REG_ENERGY_UNITS`  | 16      | Energy units for the following …                |
| `SPEEDO_REG_CMD_STATUS`    | 40      | Capability description and commands for res…    |
| `SPEEDO_REG_GLOBAL_ENERGY` | 48      | Running total energy in the units given i…      |
| `SPEEDO_REG_TIME_UNITS`    | 56      | Units for ticks in the time register.           |
| `SPEEDO_REG_CTX0_BASE`     | 512     |                                                 |
| `SPEEDO_REG_CTX1_BASE`     | 512+256 |                                                 |
| `SPEEDO_REFLECTION_URL0`   | 1024    | First location of a canned URL giving further … |

Each hardware context contains:

| Register                      | Offset | Description                                  |
|-------------------------------|--------|----------------------------------------------|
| `SPEEDO_CTX_REG_LOCAL_ENERGY` | 8      | Running local energy in the units given …    |
| `SPEEDO_CTX_REG_LOCAL_TIME`   | 16     | Running local time (if implemented) for the … |

### C API – Registers via HAL

```
extern u32_t get_units();
extern u32_t get_local_energy(); // same as get_customer_energy(get_local_core_no());
extern u32_t get_customer_energy(customer_t customer_no);
extern u32_t get_global_energy();
extern const char *get_reflection_uri();

extern int reset_energy_counters(u32_t mask);
  // Returns 0 if ok.
  // Returns -ve error code if a selected register cannot be reset.

extern float report_average_power(customer_t no, int window_milliseconds) ... // TBD som
```


#### **Customer Number**

```
typedef unsigned int customer_t; // Value zero is reserved to denote the system global total.

extern customer_t get_local_customer_no();
extern int get_context_field(customer_t c);
extern int get_core_field(customer_t c);

int get_local_core_no() { return get_core_field(get_local_customer_no()); }
int get_local_context_no() { return get_context_field(get_local_customer_no()); }
```
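The slides do not define how the core and context fields are packed into a customer number. The sketch below shows one possible encoding, purely as an assumption: customers are numbered from 1 so that zero stays reserved for the global total, with a fixed number of contexts per core. `make_customer` and `CONTEXTS_PER_CORE` are hypothetical names, and the two field accessors are example definitions, not the real HAL.

```cpp
#include <cassert>

typedef unsigned int customer_t;  // zero is reserved for the system global total

// Assumed for illustration only: number of hardware contexts per core.
const unsigned CONTEXTS_PER_CORE = 2;

// Hypothetical encoding: customer = 1 + core * CONTEXTS_PER_CORE + context.
customer_t make_customer(unsigned core_no, unsigned context_no) {
    assert(context_no < CONTEXTS_PER_CORE);
    return 1u + core_no * CONTEXTS_PER_CORE + context_no;
}

// Example definitions of the field accessors under that encoding.
int get_core_field(customer_t c) {
    assert(c != 0);  // zero denotes the global total, which has no core
    return static_cast<int>((c - 1u) / CONTEXTS_PER_CORE);
}

int get_context_field(customer_t c) {
    assert(c != 0);
    return static_cast<int>((c - 1u) % CONTEXTS_PER_CORE);
}
```

Any encoding works as long as it is a bijection between (core, context) pairs and non-zero customer numbers.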


#### Context Swap H/W Energy Bank

#### C language 32-bit API - multi-tasking extensions

It is preferable to support at least two hardware contexts so that one can be active while the other is paused and being context swapped.

`extern int set_current_customer(int core_no, int context_no);`

Depending on the hardware implementation, an access-denied type of error may be returned if the core\_no is not the local core.

This sets the current virtual context number for the specified core. If the underlying hardware supports multiple contexts, no context swap is needed. Otherwise, the hardware abstraction layer replaces the current settings with the new settings. Having a minimum of two hardware contexts enables an atomic swap from one set to the other, so that no energy is lost between reading and writing an active register.
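The benefit of two hardware banks can be shown with a toy model. In the sketch below, `EnergyBanks` stands in for the hardware (only the active bank accumulates) and `Hal` for the hardware abstraction layer; both classes and all their members are assumptions for illustration. The swap preloads the idle bank, flips the active bank in one step, and only then reads the old bank, which by that point is frozen.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

// Toy model of a core with two energy banks; only the active bank accumulates.
struct EnergyBanks {
    std::array<std::uint64_t, 2> bank{{0, 0}};
    int active = 0;
    void consume(std::uint64_t units) { bank[active] += units; }  // hardware metering
};

class Hal {
public:
    Hal(EnergyBanks& hw, int first_context) : hw_(hw), current_(first_context) {}

    void set_current_context(int next) {
        assert(next >= 0);
        const int old_bank = hw_.active;
        const int idle = 1 - old_bank;
        hw_.bank[idle] = saved_[next];          // 1. preload the idle bank
        hw_.active = idle;                      // 2. single-step switch of metering
        saved_[current_] = hw_.bank[old_bank];  // 3. old bank is frozen; save it
        current_ = next;
    }

    std::uint64_t energy_of(int context) const {
        if (context == current_) return hw_.bank[hw_.active];
        std::map<int, std::uint64_t>::const_iterator it = saved_.find(context);
        return it == saved_.end() ? 0 : it->second;
    }

private:
    EnergyBanks& hw_;
    int current_;
    std::map<int, std::uint64_t> saved_;  // totals of contexts not currently running
};
```

With a single bank, step 3 would have to read a register that is still counting, and any energy metered between the read and the following write would be attributed to nobody.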


### A Hello World, very-simple C app.

```
#include <stdio.h>

#define SOCDAM_SPEEDO_REGS_BASE 0xFFFD0000
/* volatile: each use must perform a fresh read of the hardware register */
#define READ_SPEEDO(X) (((volatile unsigned int *)(SOCDAM_SPEEDO_REGS_BASE + (X)))[0])

int main(int argc, char *argv[])
{
  int i;
  printf("Hello World %x\n", READ_SPEEDO(SPEEDO_REG_MONICA));
  printf("Global energy units at start are %i\n", READ_SPEEDO(SPEEDO_REG_GLOBAL_ENERGY));
  for (i = 0; i < 10; i++)
    {
      int le = READ_SPEEDO(SPEEDO_REG_CTX0_BASE + SPEEDO_CTX_REG_LOCAL_ENERGY);
      printf("Core %i: Energy units are %i\n", SOCDAM_READ_PID_REG(0), le);
    }
  printf("Global energy units at end are %i\n", READ_SPEEDO(SPEEDO_REG_GLOBAL_ENERGY));
  killsim(0); // This makes a nice exit from SystemC - seems better at making or1ksmp exit!
  return 0;
}
```

Output from the very-simple C program:

- Hello World 45457073
- Global energy units at start are 847327
- Core 0: Energy units are 524070
- Core 0: Energy units are 846693
- Core 0: Energy units are 1171122
- Core 0: Energy units are 1511514
- Core 0: Energy units are 1852918
- Core 0: Energy units are 2195073
- Core 0: Energy units are 2537936
- Core 0: Energy units are 2880756
- Core 0: Energy units are 3224286
- Core 0: Energy units are 3568353
- Global energy units at end are 12006801

### **Energy Report With Customer Nos**

| MODULE NAME           | STATIC0 ENERGY | %      | DYNAMIC1 ENERGY | %      | WIRING2 ENERGY  | %     |
|-----------------------|----------------|--------|-----------------|--------|-----------------|-------|
| Standalone modules:   |                |        |                 |        |                 |       |
| top.coreunit_0.core_0 | 9.997983e-05J  | 0.77%  | 3.25128e-05J    | 0.25%  | 1.35116151e-07J | 0.00% |
| Memory 0 (DRAM)       | 0.00866173075J | 66.65% | 0.00419979737J  | 32.32% | 1.32334593e-07J | 0.00% |
| the_top.uart0         | 0J             | 0.00%  | 8.84e-07J       | 0.01%  | 2.746e-12J      | 0.00% |
| Customer Accounts:    |                |        |                 |        |                 |       |
| anonymous             | 0.00866173075J | 66.65% | 3.25128e-05J    | 0.25%  | 2.6745349e-07J  | 0.00% |
| busaccess_0           | 0J             | 0.00%  | 0.00420136352J  | 32.33% | 0J              | 0.00% |
| TOP LEVEL             | 0.00876171058J | 67.42% | 0.00423387632J  | 32.58% | 2.6745349e-07J  | 0.00% |

Each line is for a separately-traced subsystem. These lines may be neither disjoint nor complete. The TOP LEVEL figure is simply another line in the table that relates to the highest module found. Total energy used: 12900 uJ (12995854356318 fJ)

| MODULE NAME           | STATIC0 POWER | %      | DYNAMIC1 POWER  | %       | WIRING2 POWER   | %     |
|-----------------------|---------------|--------|-----------------|---------|-----------------|-------|
| Standalone modules:   |               |        |                 |         |                 |       |
| top.coreunit_0.core_0 | 0.01W         | 75.38% | 0.00325193592W  | 24.51%  | 1.35143409e-05W | 0.10% |
| Memory 0 (DRAM)       | 0.866347818W  | 67.35% | 0.420064464W    | 32.65%  | 1.3236129e-05W  | 0.00% |
| the_top.uart0         | 0W            | 0.00%  | 8.84178339e-05W | 100.00% | 2.74655e-10W    | 0.00% |
| Customer Accounts:    |               |        |                 |         |                 |       |
| anonymous             | 0.866347818W  | 99.62% | 0.00325193592W  | 0.37%   | 2.67507446e-05W | 0.00% |
| busaccess_0           | 0W            | 0.00%  | 0.420221111W    | 100.00% | 0W              | 0.00% |
| TOP LEVEL             | 0.876347818W  | 67.42% | 0.423473047W    | 32.58%  | 2.67507446e-05W | 0.00% |

Each line is for a separately-traced subsystem. These lines may be neither disjoint nor complete. The TOP LEVEL figure is simply another line in the table that relates to the highest module found. Average power used: 1290 mW (1299847614895725 fW)

#### Running on two cores...

| MODULE NAME           | STATIC0 ENERGY  | %      | DYNAMIC1 ENERGY | %      | WIRING2 ENERGY | %     |
|-----------------------|-----------------|--------|-----------------|--------|----------------|-------|
| Standalone modules:   |                 |        |                 |        |                |       |
| top.coreunit_0.core_0 | 4.806e-08J      | 0.30%  | 1.3e-08J        | 0.08%  | 9.0815e-11J    | 0.00% |
| top.coreunit_1.core_1 | 4.806e-08J      | 0.30%  | 1.46e-08J       | 0.09%  | 8.411e-11J     | 0.00% |
| Memory 0 (DRAM)       | 1.04443197e-05J | 64.51% | 5.6217599e-06J  | 34.72% | 1.46992e-10J   | 0.00% |
| Customer Accounts:    |                 |        |                 |        |                |       |
| anonymous             | 1.04443197e-05J | 64.51% | 2.76e-08J       | 0.17%  | 3.21917e-10J   | 0.00% |
| busaccess_0           | 0J              | 0.00%  | 2.89187835e-06J | 17.86% | 0J             | 0.00% |
| busaccess_1           | 0J              | 0.00%  | 2.73060475e-06J | 16.87% | 0J             | 0.00% |
| TOP LEVEL             | 1.05404397e-05J | 65.10% | 5.6500831e-06J  | 34.90% | 3.21917e-10J   | 0.00% |

Each line is for a separately-traced subsystem. These lines may be neither disjoint nor complete. The TOP LEVEL figure is simply another line in the table that relates to the highest module found. Total energy used: 16100 nJ (16190844749 fJ)

| MODULE NAME           | STATIC0 POWER | %      | DYNAMIC1 POWER | %       | WIRING2 POWER   | %     |
|-----------------------|---------------|--------|----------------|---------|-----------------|-------|
| Standalone modules:   |               |        |                |         |                 |       |
| top.coreunit_0.core_0 | 0.01W         | 78.59% | 0.00270495214W | 21.26%  | 1.88961715e-05W | 0.15% |
| top.coreunit_1.core_1 | 0.01W         | 76.60% | 0.00303786933W | 23.27%  | 1.75010404e-05W | 0.13% |
| Memory 0 (DRAM)       | 2.17318346W   | 65.01% | 1.16973781W    | 34.99%  | 3.0585102e-05W  | 0.00% |
| Customer Accounts:    |               |        |                |         |                 |       |
| anonymous             | 2.17318346W   | 99.73% | 0.00574282147W | 0.26%   | 6.69823138e-05W | 0.00% |
| busaccess_0           | 0W            | 0.00%  | 0.601722503W   | 100.00% | 0W              | 0.00% |
| busaccess_1           | 0W            | 0.00%  | 0.568165783W   | 100.00% | 0W              | 0.00% |
| TOP LEVEL             | 2.19318346W   | 65.10% | 1.17563111W    | 34.90%  | 6.69823138e-05W | 0.00% |

# Thank you for listening

David Greaves, Ali Zaidi, Klaus McDonald-Maier

University of Cambridge Computer Laboratory

FOSDEM'14 Energy Efficient Computing.




#### BACKUP SLIDES NOW FOLLOW

# TLM Modelling and TLM POWER 3




#### SMP OpenRISC Demo Platform



- 1 to 64 cores (four shown)
- Shared or split or no L1 cache
- Flexible cache architectures
- L2 and L3 caches easily added
- Each cache has power-annotated tag and data RAMs
- SRAM parameters from CACTI
- DRAM modelled by Univ Maryland DRAMSIM2

#### SystemC

A free C++ library that provides:

- A hardware module description system where a module is a C++ class,
- An eventing and threading kernel,
- Compute/commit signals as well as other forms of channel,
- A library of fixed-precision integers,
- Plotting and logging facilities for generating output,
- Two transactional modelling libraries.

Originally aimed as an RTL replacement, for low-level hardware modelling.

Now being used for high-level (esp. transactional) modelling for architectural exploration.

Also now being used as an implementation language with its own synthesis tools.

## SystemC: Example Module

In this example a C++ class is defined using the SC\_MODULE macro.

```
SC_MODULE(mycounter)
{
   sc_in < bool > clk, enable, reset;
   sc_out < sc_int<10> > sum;
   void m() // Behaviour
   {
      if (reset) sum = 0;
      else if (enable) sum = sum.read()+1;
      // Use .read() since sc_out makes a signal.
   }
   SC_CTOR(mycounter) // constructor
     { SC_METHOD(m);
       sensitive << clk.pos();
     }
};
```

Modules inherit various attributes appropriate for an hierarchic hardware design including an instance name field and a channel binding capability.


### SystemC: Structural Netlist




The sc\_signal primitive channel should be used to obtain the compute/commit paradigm. This avoids non-determinacy from races on zero-delay flip-flops.

sc\_in and sc\_out are ports specialised for binding to signals.

General SystemC channel provides general purpose interface between components.

Other SystemC channel types include FIFOs and semaphores.

sc\_port and sc\_export needed for TLM modelling.
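The compute/commit paradigm can be sketched without SystemC as a value holder with separate current and next slots. The `Signal` class below is an illustrative assumption, not the sc\_signal implementation: writes are staged during the compute phase and become visible only when the kernel commits them after every process has run.

```cpp
#include <cassert>

// Minimal two-phase signal: reads see the committed value, writes are staged.
template <typename T>
class Signal {
public:
    explicit Signal(T init) : current_(init), next_(init) {}

    T read() const { return current_; }  // value committed in the previous step
    void write(T v) { next_ = v; }       // compute phase: not yet visible
    void commit() { current_ = next_; }  // commit phase: run after all processes

private:
    T current_;
    T next_;
};
```

Two flip-flops that swap values show why this matters: each one reads the other's old value regardless of the order in which the two processes run, so the result does not depend on scheduling.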

David Greaves + Ali Zaidi

// Example of structural hierarchy and wiring
// between levels:

```
SC MODULE(shiftreg) // Two-bit shift register
    sc in < bool > clk, reset, din;
{
    sc out < bool > dout;
    sc signal < bool > q1 s;
    dff dff1, dff2;
                        // Instantiate FFs
    SC_CTOR(shiftreg) :
                 dff1("dff1"), dff2("dff2")
        dff1.clk(clk);
    {
        dff1.reset(reset);
        dff1.d(din);
        dff1.q(q1 s);
        dff2.clk(clk);
        dff2.reset(reset);
        dff2.d(q1 s);
        dff2.q(dout);
```
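The compute/commit paradigm can be illustrated without SystemC. The following is a minimal sketch in plain C++, not the real `sc_signal` implementation: the `Signal`, `Dff` and `clock_shiftreg` names are invented for illustration. Writes are staged and only become visible on commit, so the two flip-flops of a shift register give the same result whichever one is evaluated first.

```cpp
#include <cassert>

// Plain C++ sketch of a compute/commit signal (hypothetical, not sc_signal).
class Signal {
public:
    explicit Signal(bool init = false) : current_(init), next_(init) {}
    bool read() const { return current_; }   // always the committed value
    void write(bool v) { next_ = v; }        // compute phase: stage the new value
    void commit() { current_ = next_; }      // commit phase: make it visible
private:
    bool current_;
    bool next_;
};

// A D-type flip-flop that samples d and drives q on a clock edge.
struct Dff {
    Signal* d;
    Signal* q;
    void clock_edge() const {
        assert(d != nullptr && q != nullptr);
        q->write(d->read());
    }
};

// Clock a two-bit shift register once. 'reverse' swaps the evaluation order
// of the two flip-flops; the committed result does not depend on it.
inline void clock_shiftreg(Signal& din, Signal& q1, Signal& dout, bool reverse) {
    Dff ff1{&din, &q1};
    Dff ff2{&q1, &dout};
    if (reverse) { ff2.clock_edge(); ff1.clock_edge(); }
    else         { ff1.clock_edge(); ff2.clock_edge(); }
    // Commit only after every process has computed.
    q1.commit();
    dout.commit();
}
```

With a direct (uncommitted) assignment, evaluating `ff1` before `ff2` would let the input race through both stages in one clock; the staged write prevents that.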


## **Transaction Level Modelling**



Note that the roles of initiator and target do not necessarily relate to the sources and sinks of the data.

In fact, an initiator can commonly make both a read and a write transaction on a given target, so the direction of data transfer is dynamic.
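A minimal sketch of this point in plain C++ follows. The `Payload`, `MemoryTarget` and `Initiator` types are hypothetical stand-ins, much simpler than the TLM 2.0 generic payload and sockets. The initiator starts both transactions, while the data moves in opposite directions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified stand-in for a TLM generic payload.
enum class Command { Read, Write };
struct Payload {
    Command cmd;
    std::size_t address;
    std::uint8_t* data;
    std::size_t length;
    bool ok;
};

// The target only responds; it never starts a transaction.
class MemoryTarget {
public:
    void b_transport(Payload& trans) {
        assert(trans.data != nullptr);
        if (trans.address + trans.length > sizeof(mem_)) { trans.ok = false; return; }
        if (trans.cmd == Command::Write)
            std::memcpy(mem_ + trans.address, trans.data, trans.length);  // initiator -> target
        else
            std::memcpy(trans.data, mem_ + trans.address, trans.length);  // target -> initiator
        trans.ok = true;
    }
private:
    std::uint8_t mem_[256] = {};
};

// The initiator starts every transaction, whichever way the data moves.
class Initiator {
public:
    explicit Initiator(MemoryTarget& t) : target_(t) {}
    bool write_byte(std::size_t addr, std::uint8_t v) {
        Payload p{Command::Write, addr, &v, 1, false};
        target_.b_transport(p);
        return p.ok;
    }
    bool read_byte(std::size_t addr, std::uint8_t& v) {
        Payload p{Command::Read, addr, &v, 1, false};
        target_.b_transport(p);
        return p.ok;
    }
private:
    MemoryTarget& target_;
};
```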


# **TLM: Loose Timing**

**Naive Coding Style** 

```
void b_putbyte(char d)
{
    printf("Byte '%c'\n", d);
    wait(250, SC_NS);
}
```

#### **Loosely-Timed Coding Style**

Have a local variable 'delay' associated with each thread.

```
void b_putbyte(char d, sc_time &delay)
{
   sc_time del(250, SC_NS);
   printf("Byte '%c'\n", d);
   delay += del;
}
```

But, at any point, any thread can resynch itself with the kernel by performing:

```
// Resynch idiomatic form:
wait(delay);
delay = SC_ZERO_TIME;
```

Simulation performance is reduced when there are frequent resynchs, but true transaction ordering will be modelled correctly.
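The trade-off can be sketched in plain C++ without a simulation kernel. The `LooseThread` type below is hypothetical: it keeps a local `delay`, adds 250 ns per byte as in the slide's example, and counts how often it would have to resynch for a given quantum. The quantum threshold is an assumption, in the spirit of a TLM quantum keeper.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ stand-in for a loosely-timed thread; all times in nanoseconds.
struct LooseThread {
    std::uint64_t kernel_time_ns = 0;  // time known to the (imaginary) kernel
    std::uint64_t delay_ns = 0;        // local time run ahead of the kernel
    std::uint64_t quantum_ns;          // resynch once delay reaches this
    unsigned resynchs = 0;             // number of kernel synchronisations

    explicit LooseThread(std::uint64_t quantum) : quantum_ns(quantum) {
        assert(quantum > 0);
    }

    // Loosely-timed transaction: add to the local delay instead of waiting.
    void b_putbyte(char /*d*/) { delay_ns += 250; }

    // Resynch idiom: hand the accumulated delay to the kernel, then clear it.
    void resynch() {
        kernel_time_ns += delay_ns;    // stands in for wait(delay)
        delay_ns = 0;
        ++resynchs;
    }

    void send(unsigned n_bytes) {
        for (unsigned i = 0; i < n_bytes; ++i) {
            b_putbyte('x');
            if (delay_ns >= quantum_ns) resynch();
        }
        if (delay_ns > 0) resynch();   // flush any remainder
    }
};
```

Sending 1000 bytes advances simulated time by the same 250 us in either case, but a 250 ns quantum costs 1000 resynchs while a 100 us quantum costs 3.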


#### Loosely-timed TLM Modelling: General Structure




### Records, Accounts and Observers

- Every monitored module is tied to a *power record* 
  - by inheritance or
  - by SystemC attribute.
- Every power record contains a set of accounts.
- Accounts have common (user-defined) names and purposes across the system. Typically:
  - A1 Static power
  - A2 Dynamic energy
  - A3 Wiring energy
- Each account can track both energy and power.
- An *observer* sums activity in a collection of records keeping accounts separate.
- A report file has one observer per line.
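The relationship between records, accounts and observers can be sketched as follows. This is an illustrative plain C++ model only: the class and method names are assumptions and are not the TLM POWER library's API. The account names follow the list above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Account names follow the slide; the classes themselves are hypothetical.
enum Account : std::size_t { A1_Static = 0, A2_Dynamic = 1, A3_Wiring = 2, kNumAccounts = 3 };

// One power record per monitored module, holding a set of accounts.
class PowerRecord {
public:
    explicit PowerRecord(std::string name) : name_(std::move(name)) {}
    void set_power(Account a, double watts) { assert(a < kNumAccounts); power_w_[a] = watts; }
    void add_energy(Account a, double joules) { assert(a < kNumAccounts); energy_j_[a] += joules; }
    double power(Account a) const { return power_w_[a]; }
    double energy(Account a) const { return energy_j_[a]; }
    const std::string& name() const { return name_; }
private:
    std::string name_;
    std::array<double, kNumAccounts> power_w_{};   // each account tracks power ...
    std::array<double, kNumAccounts> energy_j_{};  // ... and energy
};

// An observer sums a collection of records, keeping the accounts separate.
class Observer {
public:
    void attach(const PowerRecord& r) { records_.push_back(&r); }
    double total_power(Account a) const {
        double sum = 0.0;
        for (const PowerRecord* r : records_) sum += r->power(a);
        return sum;
    }
    double total_energy(Account a) const {
        double sum = 0.0;
        for (const PowerRecord* r : records_) sum += r->energy(a);
        return sum;
    }
private:
    std::vector<const PowerRecord*> records_;
};
```

A report writer would then emit one line per observer, with one column per account.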

# Hop Tracking: Origin/Hop/Terminus.

Option 1: Track transaction trajectory to get distance travelled.

trans.pw\_set\_origin(this, PW\_TGP\_ADDRESS | PW\_TGP\_ACCT\_SRC, &frontside\_bus);

initiator\_socket->b\_transport(trans, delay);

trans.pw\_terminus(this);

- Initiator makes the origin and terminus calls.
- Intermediate nodes (cache and bus models) call log\_hop.
- Flags enable energy to be logged at src or dest.
- Options 1+2:
  - For additional transition counting, need to know which bus the transaction is on and which fields in the TLM payload are active.
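The origin/hop/terminus idea can be sketched in plain C++. Everything here is an assumption made for illustration: the `Node` placement in millimetres, the Manhattan distance metric, the energy-per-millimetre constant and the method names, which only mirror the calls named above.

```cpp
#include <cassert>
#include <cmath>

// A module with a placement on the die and a wiring-energy account.
struct Node {
    double x_mm;
    double y_mm;
    double wiring_energy_j;
};

// Hypothetical transaction extension that tracks the trajectory.
class TrackedTransaction {
public:
    // Called by the initiator before sending; the flag picks src or dest accounting.
    void set_origin(Node& src, bool debit_source) {
        origin_ = &src;
        last_ = &src;
        distance_mm_ = 0.0;
        debit_source_ = debit_source;
    }
    // Called by each intermediate node (bus, cache) and the final target.
    void log_hop(Node& n) {
        assert(last_ != nullptr);
        distance_mm_ += std::fabs(n.x_mm - last_->x_mm) + std::fabs(n.y_mm - last_->y_mm);
        last_ = &n;
    }
    // Called by the initiator on return; debits wiring energy and
    // returns the distance travelled.
    double terminus(double joules_per_mm) {
        assert(origin_ != nullptr);
        Node* payer = debit_source_ ? origin_ : last_;
        payer->wiring_energy_j += distance_mm_ * joules_per_mm;
        const double travelled = distance_mm_;
        origin_ = nullptr;
        last_ = nullptr;
        return travelled;
    }
private:
    Node* origin_ = nullptr;
    Node* last_ = nullptr;
    double distance_mm_ = 0.0;
    bool debit_source_ = true;
};
```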

# Report Formats (3: VCD)



- Each account and its summations can be plotted in various forms
  - 1: Ascii-art table format
  - 2: SYLK or CSV spreadsheet format
  - 3: VCD temporal display (using Dirac impulse response or average over interval)
- A physical layout file is also written.

# An OpenRISC Core in TLM Form

Two approaches to getting an OpenRISC core:

1. Verilated:
   - Use OR1200 in Verilog and pass through Verilator to create net-level SystemC.
   - Manually write a TLM 2.0 wrapper for it.
2. ORSIM ISS:
   - Take the (auto-generated?) sim.C code from orsim.
   - Add some backdoor nops (e.g. atomic prefix for load-linked bus transaction).
   - Manually write a SystemC TLM wrapper for it.

### **OpenRISC Core Power Annotation**

Power annotation for each of the two approaches:

1. Verilated:
   - Add a static power consumption in the constructor.
   - Modify Verilator's net update macros to debit energy quanta according to Hamming distance (TODO).
2. ORSIM ISS:
   - Add a static power consumption in the constructor.
   - Adjust the static power on any sleep-mode change.
   - Add an array giving the complexity of each instruction.
   - On each instruction, debit dynamic energy proportional to complexity.
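Both annotation styles can be sketched in a few lines of plain C++. The opcode classes, complexity weights and power figures below are invented for illustration and are not taken from the OR1200 or from orsim. `hamming_distance` shows the per-net-update cost a Verilated model could debit, and `on_instruction` shows the per-instruction debit for an ISS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bits that toggle when a 32-bit net changes value.
inline unsigned hamming_distance(std::uint32_t before, std::uint32_t after) {
    std::uint32_t diff = before ^ after;
    unsigned count = 0;
    while (diff != 0) { diff &= diff - 1; ++count; }  // clear the lowest set bit
    return count;
}

// Hypothetical opcode classes with illustrative complexity weights.
enum Opcode : std::size_t { OP_NOP = 0, OP_ADD, OP_MUL, OP_LOAD, OP_COUNT };
constexpr double kComplexity[OP_COUNT] = {0.5, 1.0, 3.0, 2.0};

class CoreEnergyModel {
public:
    // Static power for the active and sleep modes is set at construction.
    CoreEnergyModel(double active_static_w, double sleep_static_w, double joules_per_unit)
        : active_w_(active_static_w), sleep_w_(sleep_static_w),
          joules_per_unit_(joules_per_unit) {}

    void set_sleeping(bool asleep) { asleep_ = asleep; }
    double static_power_w() const { return asleep_ ? sleep_w_ : active_w_; }

    // ISS style: debit dynamic energy in proportion to instruction complexity.
    void on_instruction(Opcode op) {
        assert(op < OP_COUNT);
        dynamic_j_ += kComplexity[op] * joules_per_unit_;
    }
    // Verilated style: debit one energy quantum per toggled bit.
    void on_net_update(std::uint32_t before, std::uint32_t after, double joules_per_toggle) {
        dynamic_j_ += hamming_distance(before, after) * joules_per_toggle;
    }
    double dynamic_energy_j() const { return dynamic_j_; }
private:
    double active_w_;
    double sleep_w_;
    double joules_per_unit_;
    double dynamic_j_ = 0.0;
    bool asleep_ = false;
};
```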

#### AMD Phenom 6 Core Model




### Phenom Corner Cases: 1 to 8 threads




## Splash-2 'RADIX' : Power + Energy

Plot legend: Energy, Time x100, Power.



Running the RADIX test on n = 1 to 6 cores.

Program modified to suit n not a power of 2.

Increasing n ---> increased performance.

Increasing n ---> better efficiency.

Strange power humps!

One DRAM DIMM shared.


# Phenom Energy Coefficients

| Event              | Energy |
|--------------------|--------|
| Instruction        | 1 nJ   |
| I Cache Miss       | 50 nJ  |
| D Cache Miss       | 15 uJ  |
| D Cache Snoop Read | 4 mJ   |
| D Cache Evict      | 7 mJ   |
Values obtained from curve fitting

CPU + Caches only.

DRAM excluded.
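One way to read these coefficients is as a linear model: predicted energy is the sum over event types of event count times fitted energy per event. The sketch below assumes that reading. The struct and function names are hypothetical, and the coefficient values are the ones listed in the table.

```cpp
#include <cassert>
#include <cstdint>

// Event counts gathered from one run (e.g. from performance counters).
struct EventCounts {
    std::uint64_t instructions;
    std::uint64_t icache_misses;
    std::uint64_t dcache_misses;
    std::uint64_t dcache_snoop_reads;
    std::uint64_t dcache_evicts;
};

// Energy per event, in joules.
struct EnergyCoefficients {
    double instruction_j;
    double icache_miss_j;
    double dcache_miss_j;
    double dcache_snoop_read_j;
    double dcache_evict_j;
};

// Values from the table above: 1 nJ, 50 nJ, 15 uJ, 4 mJ, 7 mJ.
const EnergyCoefficients kPhenomFit{1e-9, 50e-9, 15e-6, 4e-3, 7e-3};

// Linear model (an assumption): energy = sum of count * coefficient.
// Covers CPU and caches only; DRAM is excluded, as on the slide.
inline double predict_energy_j(const EventCounts& n, const EnergyCoefficients& c) {
    return n.instructions       * c.instruction_j
         + n.icache_misses      * c.icache_miss_j
         + n.dcache_misses      * c.dcache_miss_j
         + n.dcache_snoop_reads * c.dcache_snoop_read_j
         + n.dcache_evicts      * c.dcache_evict_j;
}
```

Fitting would run the other way: given measured energies and event counts for many runs, solve for the coefficients, for instance by least squares.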


### Measured v Predicted: Runs 19-24 extrapolated from data fitting on 1-18.




# Static or Initial Parameters (2)

- Set up static parameters in constructor:
  - Excess or actual area or dimensions
  - Static power consumption
  - Chip/region name
  - VCC supply voltage
- Optional per-instance or per-kind technology file (XML) can be accessed (defines phases and modes and default VCC ...).
- Some are less static:
  - Set these in PVT change callback (virtual function).
  - Call that yourself from constructor.
- The PVT callback is invoked when VCC changes.
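A minimal plain C++ sketch of the PVT callback pattern follows. The class names are hypothetical, and the energy formula (switched capacitance times VCC squared per access) is a textbook assumption rather than the library's model. The derived constructor calls `set_vcc` itself, so the voltage-dependent parameter is set up once at construction and again on every later supply change.

```cpp
#include <cassert>

// Base class: owns VCC and invokes a virtual hook whenever it changes.
class PowerModule {
public:
    virtual ~PowerModule() = default;
    void set_vcc(double volts) {
        assert(volts > 0.0);
        vcc_ = volts;
        pvt_changed();          // PVT change callback
    }
    double vcc() const { return vcc_; }
protected:
    virtual void pvt_changed() {}
private:
    double vcc_ = 1.0;
};

// A component whose energy per access depends on the supply voltage.
class SramModel : public PowerModule {
public:
    SramModel(double vcc, double switched_capacitance_f)
        : cap_f_(switched_capacitance_f) {
        // Called from the derived constructor body, so the override below runs.
        set_vcc(vcc);
    }
    double energy_per_access_j() const { return energy_per_access_j_; }
protected:
    void pvt_changed() override {
        // Assumed model: E = C * V^2 per access.
        energy_per_access_j_ = cap_f_ * vcc() * vcc();
    }
private:
    double cap_f_;
    double energy_per_access_j_ = 0.0;
};
```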

### **Confidence Switcher**



Generic API for a measuring and estimating component.

Use for time, energy, transition count and so on ...

Very simple implementation if we just want an estimate of the average metric:

Discard the first N measurements, average the next N, then return this average while making an actual measurement once in every N calls to check for LOSS OF CONFIDENCE.
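The simple implementation described above might look like the following plain C++ sketch. The class name, the relative tolerance used to decide loss of confidence, and the choice to retrain from scratch when confidence is lost are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

class ConfidenceSwitcher {
public:
    ConfidenceSwitcher(unsigned n, double tolerance) : n_(n), tolerance_(tolerance) {
        assert(n > 0);
    }

    // True once the average has been formed and is being used as the estimate.
    bool confident() const { return calls_ >= 2UL * n_; }

    // Returns a real measurement while training, otherwise the stored average.
    double get(const std::function<double()>& measure) {
        ++calls_;
        if (calls_ <= n_) return measure();            // first N: measured, not averaged
        if (calls_ <= 2UL * n_) {                      // next N: measured and averaged
            const double m = measure();
            sum_ += m;
            if (calls_ == 2UL * n_) average_ = sum_ / n_;
            return m;
        }
        if ((calls_ - 2UL * n_) % n_ != 0) return average_;  // cheap path: estimate
        const double m = measure();                    // one in N: spot check
        if (std::fabs(m - average_) <= tolerance_ * std::fabs(average_)) return average_;
        calls_ = 0;                                    // loss of confidence: retrain
        sum_ = 0.0;
        return m;
    }
private:
    unsigned n_;
    double tolerance_;
    unsigned long calls_ = 0;
    double sum_ = 0.0;
    double average_ = 0.0;
};
```

The same object can wrap a time, energy or transition-count measurement, since it only sees a callable that returns a number.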

#### Augmented DMI Flow



Latency can be credited to the initiating thread's 'delay' as always.

Energy should be credited to the intermediate components:

So the DMI record at the initiator is extended with either:

- a) a list of intermediate agents that have their own records, or
- b) bulk read and write energy records (simpler, not shown).
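Option (a) can be sketched in plain C++ as below. The `Agent`, `DmiRecord` and `DmiInitiator` types are hypothetical and far simpler than the TLM 2.0 DMI descriptor. The point shown is only the bookkeeping: a direct access bypasses the bus and cache models, so the initiator adds latency to its own delay and credits energy to each bypassed agent on their behalf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// An intermediate component (bus, cache) with its own energy record.
struct Agent {
    double energy_per_access_j;
    double energy_j;
};

// DMI record held at the initiator, extended with the bypassed agents.
struct DmiRecord {
    std::uint8_t* base;         // direct pointer into the target's storage
    std::size_t size;
    std::uint64_t latency_ns;   // latency charged per direct access
    std::vector<Agent*> path;   // agents a normal transaction would traverse
};

class DmiInitiator {
public:
    explicit DmiInitiator(const DmiRecord& dmi) : dmi_(dmi) {}

    std::uint8_t read(std::size_t offset) {
        assert(offset < dmi_.size);
        delay_ns_ += dmi_.latency_ns;                  // latency: thread's delay
        for (Agent* a : dmi_.path)                     // energy: intermediate agents
            a->energy_j += a->energy_per_access_j;
        return dmi_.base[offset];
    }
    std::uint64_t delay_ns() const { return delay_ns_; }
private:
    DmiRecord dmi_;
    std::uint64_t delay_ns_ = 0;
};
```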


### **Power Estimation: Project Flow**




#### **Backup Slide: ESL Modelling Flow**




### Talk Overview

- SystemC + TLM Introduction
- TLM POWER 2
- TLM POWER 3
  - Loose timing
  - Energy based
  - Layout aware
  - Bit transition counting
- Splash-2 benchmarks, power probed.
- Data fit x86\_64 to OpenRISC !

## Loosely-Timed: Effect of Quantum

Two cores running: main() { for(i=0;i<5;i++) puts("Hello World"); }

Core clock is 200 MHz (5ns period).

Three different settings of the global quantum:

- Global Q = 5ns (lock-step execution):

  Sim Start: cores=2 HHelleol IWoo rWlodr Id HHeelllloo WWoorrlldd HHeelllloo Wwoorrlldd HHeelllloo WWoorrlldd H eHlellol oW oWrolrd Id CPU 0 exit: insns #717 CPU 1 exit: insns #717

- Global Q = 1us (finely interleaved):

  Sim Start: cores=2 Hello World HeHello World Hello World Hello Woolo World Hello rld Hello World World Hello Wor Hello World CPU 0 exit: insns #717 CPU 1 exit: insns #717

- Global Q = 100us (coarsely interleaved):

  Sim Start: cores=2 Hello World CPU 0 exit: insns #717 CPU 1 exit: insns #717
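The interleaving effect can be reproduced with a toy scheduler in plain C++. This is a sketch under simplifying assumptions: each character costs a fixed `char_time_ns` of local time (in the real model the cost comes from the instructions executed), the two cores are served round-robin, and a core yields as soon as its local delay reaches the quantum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the shared console output of two cores printing the same message.
inline std::string run_two_cores(const std::string& msg,
                                 std::uint64_t char_time_ns,
                                 std::uint64_t quantum_ns) {
    assert(char_time_ns > 0 && quantum_ns > 0);
    std::string console;
    std::size_t pos[2] = {0, 0};          // next character for each core
    while (pos[0] < msg.size() || pos[1] < msg.size()) {
        for (int core = 0; core < 2; ++core) {
            std::uint64_t delay_ns = 0;   // local time since the last resynch
            // Run ahead of the kernel until the quantum is used up.
            while (pos[core] < msg.size() && delay_ns < quantum_ns) {
                console += msg[pos[core]++];
                delay_ns += char_time_ns;
            }
            // Quantum exhausted or core finished: resynch and let the other core run.
        }
    }
    return console;
}
```

With the quantum equal to the per-character cost the two outputs interleave character by character, and with a large quantum each core prints its whole message before the other runs.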


#### **Loosely-Timed Performance Lost**



Relative performance of LT TLM Model (2 cores, running SPLASH-2 Radix Sort n=100)
