spEEDO: Energy Efficiency through Debug suppOrt: On Chip Analytics

Temporary Web Site

spEEDO: A power debug and monitoring API for virtual platforms and real silicon.

Partners: University of Cambridge Computer Laboratory and Ultrasoc Ltd.

Funder: The UK TSB.

Commence date: 1st October 2013.

Talk at FOSDEM 2014, Brussels: SLIDES (PDF).

Abstract

spEEDO: Energy Efficiency through Debug suppOrt

Dr David J Greaves.

University of Cambridge Computer Laboratory in Collaboration with Ultrasoc Limited.

The spEEDO project aims to augment existing debug APIs (such as GNU's RSP and ARM's Coresight) with a power component for reporting and tracing energy use in multicore systems-on-chip. Energy is logged per IP block and per application thread, and reports are made available to the operating system, to application programs and over the debug interface. The aim is to facilitate optimizations for energy efficiency at all stages of software and silicon development.

Background

Power-efficient design of computer hardware and software is very important. Portable systems, such as smart phones, must be power efficient for good battery life and for other reasons. Both hardware designers and software programmers increasingly require tools that automate power management. Direct or indirect measurement of power consumption in the running hardware provides the ultimate basis for such decisions. Also important are power-annotated, high-level simulation models and power-aware code generator stages in compilers and in the Android or Java virtual machine.

There is an intrinsic affinity between power management, test and debug. All three rely on information flow outside of the normal system data paths between memories, processor elements and other subsystems. Examples of interdependencies are: 1: debug infrastructure can be used to report power measurements, 2: power management must be overridable by debug hardware to test its own operation and for access to resources that otherwise would be off, and 3: dynamic migration of tasks between processing elements and dynamic reconfiguration of cache and bus structures according to current load requires access to the same counters and other load metrics that are needed for engineering and high-level design exploration.

In addition, the power management and debug interfaces to IP blocks could benefit from standardisation in the same way that the main datapaths have long been converged around a few on-chip standards. If a chip-wide debug architecture is used then a chip-wide power management system should be deployable with the same engineering flow and at the same time. This is a logical next step in the infrastructure for System-on-Chip design and will be widely useful.

Aims

  • Specify suitable hardware interface structures for measuring energy usage, and for performing some of the analysis, efficiently in hardware, while maintaining the association between each measurement and the software executing in the system at that time.
  • Identify a higher-level structure that will make use of these measurements and conduct further analysis based on the measurements that the hardware provides.
  • Propose a 'strawman' standard that is sufficiently flexible to cope with different forms of hardware interface and support in real silicon.
  • Promote the resulting strawman to industry forums.

    Basic Approach

    Extend and integrate existing debug protocols and GUIs so that energy and power figures can be conveyed and displayed.

    Make a practical demonstrator using OpenOCD/GDB/Eclipse and SystemC TLM Power3.

    Strawman Machine Register Interface

    First alpha draft.

    This is the hardware register set read by an operating system kernel running on a core on the SoC.

    This is a typical example. The techniques used by GNU oprofile for portability over architectures can be used.

    // Typical hardware register set to implement the spEEDO hardware API - unbanked version.
    
    #define SPEEDO_REG_MONICA            0   // Contains an identifying constant
    #define SPEEDO_REG_ABI               8   // Version number of the interface
    #define SPEEDO_REG_ENERGY_UNITS      16  // Energy units for the following energy registers
    #define SPEEDO_REG_CMD_STATUS        40  // Capability description and commands for resetting totals and so on and also the current hardware context. 
    #define SPEEDO_REG_GLOBAL_ENERGY     48  // Running total energy in the units given - includes local energy
    #define SPEEDO_REG_TIME_UNITS        56  // Units for ticks in the time register.
    
    #define SPEEDO_REG_CTX0_BASE     512
    #define SPEEDO_REG_CTX1_BASE     (512+256)
    
    #define SPEEDO_REFLECTION_URL0   1024 // First location of a canned URL giving further information on this ABI 
    
    
    // Each hardware context contains:
    
    #define SPEEDO_CTX_REG_LOCAL_ENERGY  8  // Running local energy in the units given
    #define SPEEDO_CTX_REG_LOCAL_TIME    16 // Running local time (if implemented) for the context in the time units given
    
    
    // Interrupts?
    // Do we want an interrupt watchpoint on these registers - e.g. interrupt every so many microjoules used?
    
    
    
    
    // Sometimes the system has a separate resource for timer ticks, needed  so that energy reports can be 
    // converted to power.  A hardware abstraction layer or device driver for this register set will integrate the time factor
    // and provide the PET account abstraction.
    
    
    // spEEDO monica is the following data (or its low 32 bits on a D32 machine).
    //      data = ((u64_t)'s' << 0) | ((u64_t)'p' << 8) | ((u64_t)'E' << 16) | ((u64_t)'E' << 24) | ((u64_t)'D' << 32) | ((u64_t)'O' << 40);
    
    #define SPEEDO_API_ALPHA    1000  // This value is returned by the ABI register in the prototype implementation.
    
    // The values returned in the energy registers need to be divided by the value in the units register to give a figure in Joules.
    #define SPEEDO_UNIT_pJ      (1000ULL*1000*1000*1000)  // Exceeds 32 bits, so the constant must be 64-bit.
    #define SPEEDO_UNIT_nJ      (1000*1000*1000)
    #define SPEEDO_UNIT_uJ      (1000*1000)
    #define SPEEDO_UNIT_mJ      (1000)
    #define SPEEDO_UNIT_UNCAL   0 // The UNCAL value shall be returned if the calibration of the energy registers is unknown.
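
The identifying constant and the units convention above can be exercised with a short sketch. This is only an illustration under stated assumptions: the function names `speedo_monica`, `speedo_monica32` and `speedo_raw_to_joules` are hypothetical and not part of the draft interface, and the units register is assumed to hold counts per joule, as the SPEEDO_UNIT_* values suggest.

```c
#include <stdint.h>

/* Mirrors the draft definition: 0 means the energy registers are uncalibrated. */
#define SPEEDO_UNIT_UNCAL 0

/* Hypothetical helper: build the identifying constant from 's','p','E','E','D','O'. */
uint64_t speedo_monica(void)
{
    return ((uint64_t)'s' << 0)  | ((uint64_t)'p' << 8)  |
           ((uint64_t)'E' << 16) | ((uint64_t)'E' << 24) |
           ((uint64_t)'D' << 32) | ((uint64_t)'O' << 40);
}

/* Hypothetical helper: the low 32 bits, as read on a D32 machine. */
uint32_t speedo_monica32(void)
{
    return (uint32_t)speedo_monica();
}

/* Hypothetical helper: convert a raw energy register value to joules by
 * dividing by the units value. Returns a negative number when uncalibrated. */
double speedo_raw_to_joules(uint64_t raw, uint64_t units)
{
    if (units == SPEEDO_UNIT_UNCAL)
        return -1.0;
    return (double)raw / (double)units;
}
```

On a 32-bit read the constant comes out as 0x45457073, so a driver could compare the first register against that value before trusting the rest of the register set.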
    
    

    Strawman C-language API

    First alpha draft.

    Customer Number

    The API uses the concept of a customer number. This is an opaque handle that is actually an integer with various bit fields used for custom purposes on a specific platform. We have standard accessor functions to access the bit fields. These generally include:
  • an ip_block_number that is typically a CPU core number but may be used for other purposes.
  • a virtual context number that is used for threads or processes running on a core or across cores.
    
    typedef unsigned int customer_t; // Value zero is reserved to denote the system global total.
    
    
    extern customer_t get_local_customer_no(); 
    extern int get_context_field(customer_t c); 
    extern int get_core_field(customer_t c); 
    
    int get_local_core_no()    { return get_core_field(get_local_customer_no()); }
    int get_local_context_no() { return get_context_field(get_local_customer_no()); }
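
One possible shape for the accessor functions is sketched below. The bit layout is purely an assumption for illustration (core number in bits 0 to 7, context number in bits 8 to 23, and a 'valid' bit 31 so that a real handle never collides with the reserved value zero); `make_customer` is a hypothetical constructor, and an actual platform is free to choose different fields.

```c
typedef unsigned int customer_t; /* Value zero is reserved for the system global total. */

/* Hypothetical field layout, assuming a 32-bit unsigned int. */
#define CUST_CORE_SHIFT 0
#define CUST_CORE_MASK  0xFFu
#define CUST_CTX_SHIFT  8
#define CUST_CTX_MASK   0xFFFFu
#define CUST_VALID_BIT  0x80000000u /* keeps every real handle non-zero */

/* Hypothetical constructor: pack a core number and a context number into a handle. */
customer_t make_customer(int core_no, int context_no)
{
    return CUST_VALID_BIT
         | (((customer_t)core_no & CUST_CORE_MASK) << CUST_CORE_SHIFT)
         | (((customer_t)context_no & CUST_CTX_MASK) << CUST_CTX_SHIFT);
}

/* Accessors matching the extern declarations in the draft API. */
int get_core_field(customer_t c)
{
    return (int)((c >> CUST_CORE_SHIFT) & CUST_CORE_MASK);
}

int get_context_field(customer_t c)
{
    return (int)((c >> CUST_CTX_SHIFT) & CUST_CTX_MASK);
}
```

Because callers only go through the accessors, the layout can change per platform without affecting code that treats the handle as opaque.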
    
    

    HAL-exported API

    C language 32-bit API - bare metal/single threaded for local core.

    
     extern u32_t get_units();
    
     extern u32_t get_local_energy(); // same as get_customer_energy(get_local_customer_no());
    
     extern u32_t get_customer_energy(customer_t customer_no);
    
     extern u32_t get_global_energy();
    
     extern const char *get_reflection_uri();
    
     extern int reset_energy_counters(u32_t mask); 
      // Returns 0 if ok. 
      // Returns -ve error code if a selected register cannot be reset.
    
    extern float report_average_power(customer_t no, int window_milliseconds) ... // TBD some running average indication 
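
Until the running-average call is defined, a client can derive average power from two energy readings and the elapsed time. The sketch below is an assumption-laden illustration, not the HAL's method: `average_power_watts` is a hypothetical name, the units value is taken to be counts per joule, and the elapsed time is assumed to come from some separate timer.

```c
#include <stdint.h>

/* Hypothetical helper: average power in watts between two readings of a
 * 32-bit energy counter taken 'seconds' apart.
 * Returns a negative number if the counter is uncalibrated (units == 0)
 * or the interval is not positive. */
double average_power_watts(uint32_t e_start, uint32_t e_end,
                           uint32_t units, double seconds)
{
    if (units == 0 || seconds <= 0.0)
        return -1.0;
    /* Unsigned subtraction gives the right delta across a single wrap. */
    uint32_t delta = e_end - e_start;
    return ((double)delta / (double)units) / seconds;
}
```

For example, a delta of 1000000 counts with units of 1000000 counts per joule over half a second corresponds to 2 W.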
    

    C language 32-bit API - multi-tasking extensions

    It is preferable to support at least two hardware contexts so that one can be active while the other is paused and being context swapped.

    
     extern int set_current_customer(int core_no, int context_no);
    
    // Depending on the hardware implementation, an access-denied type of error may be
    // returned if the core_no is not the local core.
    

    This will set the current virtual context number for the specified core. The underlying hardware may support multiple contexts, in which case no context swap is needed; otherwise the hardware abstraction layer replaces the current settings with the new settings. Having a minimum of two hardware contexts enables an atomic swap from one set to the other, so that no energy goes unaccounted between reading and writing an active register.

    
    // Return codes for C-API
    #define SPEEDO_RC_NO_POWER_MONITOR -199
    #define SPEEDO_RC_INVALID_REG -99
    
    

    Architecture

    Power Supply Monitoring

  • Monitor duty cycle of SMPSU controllers - they are digital to start with!
  • Measure power using a resistive shunt and an ADC.
  • Battery managers on tablets and similar devices can provide some low-resolution information, but it is probably not sufficiently accurate.
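
The shunt-and-ADC option reduces to Ohm's law. The following sketch assumes an ideal unipolar ADC and a shunt in series with the supply rail; the function names, the full-scale convention and the parameter values are all hypothetical.

```c
/* Hypothetical helper: convert an ADC code to volts, assuming an ideal
 * unipolar converter where the maximum code corresponds to vref. */
double adc_to_volts(unsigned code, unsigned bits, double vref)
{
    double full_scale = (double)((1u << bits) - 1u);
    return (double)code * vref / full_scale;
}

/* Hypothetical helper: power drawn through a series shunt.
 * I = v_shunt / r_shunt, P = v_supply * I. */
double shunt_power_watts(double v_shunt, double r_shunt, double v_supply)
{
    double current = v_shunt / r_shunt;
    return v_supply * current;
}
```

Integrating the power samples over time then gives an energy figure that can feed the energy registers described earlier.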

    Overall Architecture - for feedback to local software

    The kernel keeps task control blocks for each context. The energy aggregate can be kept in this alongside the CPU time aggregate commonly maintained.
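
A minimal sketch of this bookkeeping is shown below, assuming a scheduler that samples a 32-bit local energy register and a tick counter at every context switch. The structure and function names are hypothetical and are not taken from any particular kernel.

```c
#include <stdint.h>

/* Hypothetical task control block fields. */
struct task_cb {
    uint64_t cpu_ticks;    /* CPU time aggregate, as commonly maintained */
    uint64_t energy_units; /* energy aggregate in raw spEEDO units */
};

/* Hypothetical per-core state: register values seen at the last switch. */
struct core_acct {
    uint32_t last_energy;
    uint32_t last_time;
};

/* Charge the outgoing task with the energy and time used since the
 * previous switch, then remember the new baseline for the next task. */
void account_on_switch(struct task_cb *outgoing, struct core_acct *core,
                       uint32_t energy_now, uint32_t time_now)
{
    outgoing->energy_units += (uint32_t)(energy_now - core->last_energy);
    outgoing->cpu_ticks    += (uint32_t)(time_now - core->last_time);
    core->last_energy = energy_now;
    core->last_time   = time_now;
}
```

The unsigned 32-bit subtraction keeps the delta correct across a single counter wrap, while the 64-bit aggregates in the task control block avoid overflow over long runs.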

    We can use higher-order address bits to carry customer identifiers to index banked energy registers in peripherals. On our SystemC model a 'customer_id' field is instead used in the 'generic payload' passed between IP blocks.
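
As a rough illustration of the address-bit scheme, the sketch below assumes a 32-bit bus address whose top four bits carry the customer identifier, and a peripheral with sixteen banked energy totals. The field widths and all names are assumptions chosen for the example only.

```c
#include <stdint.h>

/* Hypothetical layout: top 4 address bits carry the customer identifier. */
#define CUST_ADDR_SHIFT  28
#define CUST_ADDR_MASK   0xFu
#define ADDR_OFFSET_MASK 0x0FFFFFFFu

/* Initiator side: tag an address with the current customer identifier. */
uint32_t tag_address(uint32_t addr, unsigned customer_id)
{
    return (addr & ADDR_OFFSET_MASK)
         | ((uint32_t)(customer_id & CUST_ADDR_MASK) << CUST_ADDR_SHIFT);
}

/* Target side: recover the customer identifier and the real offset. */
unsigned customer_of(uint32_t tagged)
{
    return (unsigned)((tagged >> CUST_ADDR_SHIFT) & CUST_ADDR_MASK);
}

uint32_t untagged(uint32_t tagged)
{
    return tagged & ADDR_OFFSET_MASK;
}

/* Target side: one energy total per customer bank. */
struct banked_energy {
    uint64_t bank[16];
};

/* Charge the cost of one access to the bank selected by the address tag. */
void charge_access(struct banked_energy *p, uint32_t tagged_addr, uint64_t cost)
{
    p->bank[customer_of(tagged_addr)] += cost;
}
```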

    Using Event Counters in Hardware and Device Drivers

    There are many event counters already present in contemporary systems. See oprofile and /sbin/ifconfig for instance. With appropriate calibration factors the hardware abstraction layer can multiply these up to give energy estimates. For per-customer energy profiling, these energy figures can be context swapped by the kernel as virtual registers.

    Typical device driver stats:

    eth0      Link encap:Ethernet  HWaddr 00:13:20:84:5d:81
              inet addr:128.232.9.140  Bcast:128.232.15.255  Mask:255.255.240.0
              inet6 addr: fe80::213:20ff:fe84:5d81/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:24110214 errors:0 dropped:0 overruns:0 frame:0
              TX packets:15028627 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:100
              RX bytes:3461755890 (3.4 GB)  TX bytes:15455753259 (15.4 GB)
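
Turning such counters into an energy estimate is a weighted sum. The sketch below assumes one calibration factor (joules per event) per counter; the structure and function names are hypothetical, and real calibration factors would have to be measured for the device in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pairing of an event count with its calibration factor. */
struct event_cal {
    uint64_t count;            /* e.g. RX bytes or TX packets from the driver */
    double   joules_per_event; /* calibration factor, platform specific */
};

/* Estimated energy in joules: sum of count * calibration factor. */
double estimate_energy_joules(const struct event_cal *ev, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += (double)ev[i].count * ev[i].joules_per_event;
    return total;
}
```

A hardware abstraction layer could run this over the RX and TX byte counts shown above and expose the result as a virtual energy register for the network interface.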
    

    Here are, for example, the oprofile events monitored on the x86-64 AMD hammer architectures:

    GDB Abuse and Extensions for remote debugging

  • Getting GDB to display registers in Joules (extend native formats beyond hex/decimal/ASCII/bitfield etc..) - needs an extension of the register description XML schema.
  • Abusing the 'H' thread selector command to be a customer selector - for core and/or process.
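
For the second point, the packet on the wire would keep the normal RSP framing: a '$', the payload, a '#', and a two-digit hex checksum that is the modulo-256 sum of the payload characters. The sketch below frames an 'Hg' packet whose thread-id field carries a customer number. Reusing the field this way is the proposal under discussion, not existing GDB behaviour, and the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* RSP checksum: modulo-256 sum of the payload characters. */
unsigned rsp_checksum(const char *payload)
{
    unsigned sum = 0;
    while (*payload)
        sum += (unsigned char)*payload++;
    return sum & 0xFFu;
}

/* Hypothetical helper: frame "Hg<customer in hex>" as "$payload#cs".
 * Returns the packet length, or -1 if the output buffer is too small. */
int rsp_select_customer(char *out, size_t out_len, unsigned customer_no)
{
    char payload[32];
    snprintf(payload, sizeof payload, "Hg%x", customer_no);
    int n = snprintf(out, out_len, "$%s#%02x", payload, rsp_checksum(payload));
    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

A stub that understands the extension would then report energy registers for the selected customer rather than for a thread.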

    Experiments

    Experiment 1: Single-threaded register interface test

    The following simple test program was run bare metal on the VHLS/Prazor simulator implemented in SystemC TLM. The simulation setup was as simple as possible: one OpenRISC core with no caches, one DRAM bank and one UART.
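    #include <stdio.h>
    #define SOCDAM_SPEEDO_REGS_BASE   0xFFFD0000
    // The register read is volatile so the compiler cannot cache it, and X is parenthesised so expressions are safe as arguments.
    #define READ_SPEEDO(X)    (((volatile unsigned int *)(SOCDAM_SPEEDO_REGS_BASE + (X)))[0])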
    
    
    
    int main(int argc, char *argv[])
    {
      int j;
      printf("Hello World %x\n", READ_SPEEDO(SPEEDO_REG_MONICA));
      printf("Global energy units at start are %i\n", READ_SPEEDO(SPEEDO_REG_GLOBAL_ENERGY));
      for (j = 0; j < 10; j++)
        {
          int le = READ_SPEEDO(SPEEDO_REG_CTX0_BASE + SPEEDO_CTX_REG_LOCAL_ENERGY);
          printf("Core %i: Energy units are %i\n", SOCDAM_READ_PID_REG(0), le);
        }
    
      printf("Global energy units at end are %i\n", READ_SPEEDO(SPEEDO_REG_GLOBAL_ENERGY));
      _killsim(0); // This makes a nice exit from SystemC - seems better at making or1ksmp exit!
    }
    

    Power Report File

    The customer concept is new and not listed in the report file but it will be.

    ################################################################################
    #              TLM POWER 3.1  (University of Cambridge, UK)                    #
    #                                                                              #
    #                Statistics file: energy/power consumption.                    #
    # ---------------------------------------------------------------------------- #
    # For more information see the TLM POWER3 manual pdf.                   p      #
    # ---------------------------------------------------------------------------- #
    # Creation Date: 17:35:43 -- 25/01/2014                                        #
    ################################################################################
    
    
    Title: notitle
    # Simulation duration 9997983 ns(sc=9997983 ns)
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | MODULE  NAME             |              STATIC0 ENERGY |             DYNAMIC1 ENERGY |              WIRING2 ENERGY |
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Standalone modules:
    | ...top.coreunit_0.core_0 |       9.997983e-05J   0.77% |        3.25128e-05J   0.25% |     1.35116151e-07J   0.00% | 
    | Memory 0 (DRAM)          |      0.00866173075J  66.65% |      0.00419979737J  32.32% |     1.32334593e-07J   0.00% | 
    | the_top.uart0            |                  0J   0.00% |           8.84e-07J   0.01% |          2.746e-12J   0.00% | 
    Customer Accounts:
    | anonymous                |      0.00866173075J  66.65% |        3.25128e-05J   0.25% |      2.6745349e-07J   0.00% | 
    | busaccess_0              |                  0J   0.00% |      0.00420136352J  32.33% |                  0J   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | TOP LEVEL++              |      0.00876171058J  67.42% |      0.00423387632J  32.58% |      2.6745349e-07J   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Each line is for a separately-traced subsystem. These lines may be neither disjoint or complete.
    The TOP LEVEL figure is simply another line in the table that relates to the highest module found.
    Total energy used: 12900 uJ  (12995854356318 fJ)
    
    
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | MODULE  NAME             |               STATIC0 POWER |              DYNAMIC1 POWER |               WIRING2 POWER |
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Standalone modules:
    | ...top.coreunit_0.core_0 |               0.01W  75.38% |      0.00325193592W  24.51% |     1.35143409e-05W   0.10% | 
    | Memory 0 (DRAM)          |        0.866347818W  67.35% |        0.420064464W  32.65% |      1.3236129e-05W   0.00% | 
    | the_top.uart0            |                  0W   0.00% |     8.84178339e-05W 100.00% |        2.74655e-10W   0.00% | 
    Customer Accounts:
    | anonymous                |        0.866347818W  99.62% |      0.00325193592W   0.37% |     2.67507446e-05W   0.00% | 
    | busaccess_0              |                  0W   0.00% |        0.420221111W 100.00% |                  0W   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | TOP LEVEL++              |        0.876347818W  67.42% |        0.423473047W  32.58% |     2.67507446e-05W   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Each line is for a separately-traced subsystem. These lines may be neither disjoint or complete.
    The TOP LEVEL figure is simply another line in the table that relates to the highest module found.
    Average power used: 1290 mW  (1299847614895725 fW)
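
The average power line can be cross-checked against the total energy and the simulation duration reported above, since average power is simply energy divided by time. The helper name below is hypothetical.

```c
/* Hypothetical helper: average power in watts from a total energy in
 * femtojoules and a duration in nanoseconds. */
double average_power_from_report(double energy_fJ, double duration_ns)
{
    double joules  = energy_fJ * 1e-15;
    double seconds = duration_ns * 1e-9;
    return joules / seconds;
}
```

With 12995854356318 fJ over 9997983 ns this gives roughly 1.2998 W, in agreement with the reported 1299847614895725 fW.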
    

    UART Output

    The thing to note is that the CPU core is able to see the energy consumed by the DRAM on its behalf.
    Hello World 45457073
    
    Global energy units at start are 847327
    
    Core 0: Energy units are 524070
    
    Core 0: Energy units are 846693
    
    Core 0: Energy units are 1171122
    
    Core 0: Energy units are 1511514
    
    Core 0: Energy units are 1852918
    
    Core 0: Energy units are 2195073
    
    Core 0: Energy units are 2537936
    
    Core 0: Energy units are 2880756
    
    Core 0: Energy units are 3224286
    
    Core 0: Energy units are 3568353
    
    Global energy units at end are 12006801
    

    With two cores in the simulation, where the second core sits in a spin loop, we get the following report. The DRAM energy can be seen to be roughly shared between the two customer cores.

    ################################################################################
    #              TLM POWER 3.1  (University of Cambridge, UK)                    #
    #                                                                              #
    #                Statistics file: energy/power consumption.                    #
    # ---------------------------------------------------------------------------- #
    # For more information see the TLM POWER3 manual pdf.                   p      #
    # ---------------------------------------------------------------------------- #
    # Creation Date: 17:32:43 -- 25/01/2014                                        #
    ################################################################################
    
    
    Title: notitle
    # Simulation duration 4806 ns(sc=4806 ns)
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | MODULE  NAME             |              STATIC0 ENERGY |             DYNAMIC1 ENERGY |              WIRING2 ENERGY |
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Standalone modules:
    | ...top.coreunit_0.core_0 |          4.806e-08J   0.30% |            1.3e-08J   0.08% |         9.0815e-11J   0.00% | 
    | ...top.coreunit_1.core_1 |          4.806e-08J   0.30% |           1.46e-08J   0.09% |          8.411e-11J   0.00% | 
    | Memory 0 (DRAM)          |     1.04443197e-05J  64.51% |      5.6217599e-06J  34.72% |        1.46992e-10J   0.00% | 
    Customer Accounts:
    | anonymous                |     1.04443197e-05J  64.51% |           2.76e-08J   0.17% |        3.21917e-10J   0.00% | 
    | busaccess_0              |                  0J   0.00% |     2.89187835e-06J  17.86% |                  0J   0.00% | 
    | busaccess_1              |                  0J   0.00% |     2.73060475e-06J  16.87% |                  0J   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | TOP LEVEL++              |     1.05404397e-05J  65.10% |      5.6500831e-06J  34.90% |        3.21917e-10J   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Each line is for a separately-traced subsystem. These lines may be neither disjoint or complete.
    The TOP LEVEL figure is simply another line in the table that relates to the highest module found.
    Total energy used: 16100 nJ  (16190844749 fJ)
    
    
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | MODULE  NAME             |               STATIC0 POWER |              DYNAMIC1 POWER |               WIRING2 POWER |
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Standalone modules:
    | ...top.coreunit_0.core_0 |               0.01W  78.59% |      0.00270495214W  21.26% |     1.88961715e-05W   0.15% | 
    | ...top.coreunit_1.core_1 |               0.01W  76.60% |      0.00303786933W  23.27% |     1.75010404e-05W   0.13% | 
    | Memory 0 (DRAM)          |         2.17318346W  65.01% |         1.16973781W  34.99% |      3.0585102e-05W   0.00% | 
    Customer Accounts:
    | anonymous                |         2.17318346W  99.73% |      0.00574282147W   0.26% |     6.69823138e-05W   0.00% | 
    | busaccess_0              |                  0W   0.00% |        0.601722503W 100.00% |                  0W   0.00% | 
    | busaccess_1              |                  0W   0.00% |        0.568165783W 100.00% |                  0W   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    | TOP LEVEL++              |         2.19318346W  65.10% |         1.17563111W  34.90% |     6.69823138e-05W   0.00% | 
    +--------------------------+-----------------------------+-----------------------------+-----------------------------+
    Each line is for a separately-traced subsystem. These lines may be neither disjoint or complete.
    The TOP LEVEL figure is simply another line in the table that relates to the highest module found.
    Average power used: 3360 mW  (3368881554099043 fW)
    

    Debugger Access to Powered-Down Regions

    As well as extending the RSP protocol/schema of GNU's debugger, GDB, to support reporting energy units, we are also putting in some hooks for debug access to power-gated regions. This will amount to recording the ancillary steps that are needed to run certain primary debug operations, and making GDB or on-chip hardware trigger these steps automatically when needed.

    (Potential) Achievements

  • 2nd Feb 2014: Talk at FOSDEM 2014, Energy-efficient Computing devroom: SLIDES (PDF).

  • Potential integration with EACOF from Univ Bristol, The Energy-Aware COmputing Framework.

  • Potential talk at the Multicore - NMI Processor and Processing workshop 13th March 2014

  • Potential submission to the GNU project of GDB extensions to display floating-point registers with units, such as milliwatts to 3 significant figures.

    Links

  • Energy Aware Computing Framework (TSB/Univ Bristol).

  • Intel's Power Gadget.