# ParaVerser: Harnessing Heterogeneous Parallelism for **Affordable Fault Detection in Data Centers**

Minli Julie Liao\*, Sam Ainsworth<sup>†</sup>, Lev Mukhanov<sup>‡</sup>, Adrian Barredo<sup>#</sup>, Markos Kynigos<sup>§</sup>, Timothy Jones\*

\*University of Cambridge <sup>#</sup>Barcelona Supercomputing Center

<sup>†</sup>University of Edinburgh <sup>§</sup>The University of Manchester <sup>‡</sup>Queen Mary University of London


#### **Errors at Server-Scale**

Frequent, undetected hard faults causing silent data corruption (SDC)

- **Insufficient** existing software scanners:
  - Infrequent out-of-production tests
    - $\rightarrow$ SDC goes on undetected for **months**
  - In-production light-weight tests
    - $\rightarrow$ Low error detection coverage
- **Unaffordable** dual/triple-core lockstep:
  - Guarantees full-coverage error detection
  - Real-time with low performance overhead
  - **Double/triple** energy + hardware area

## **Heterogeneous Parallel Error Detection**

Full-coverage error detection through redundancy with low energy overhead

- **Extra parallelism** in redundant execution enabled through checkpointing + load-store logging
- Multiple energy **efficient** checker cores **in parallel** for redundant execution:
  - Keep up with high performance main core
  - At much lower energy cost

### **Affordable hardware overhead**

Minor alterations to **existing** cores in the system

- Repurpose existing L1 data cache to store load-store log (LSL) when used as checker
  - 1 extra bit per cache line
- 1064B per-core overhead
  - Mainly for register checkpoint (cpt)
  - Any core can be main or checker, same overhead for all cores
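To make the "1 extra bit per cache line" figure concrete, the arithmetic can be sketched as below. The cache size and line size are assumptions chosen for illustration only (they are not given here), and whether the marker bits are counted inside the quoted 1064B is not stated, so the two numbers are reported separately.

```python
def extra_bits_per_cache(cache_bytes: int, line_bytes: int) -> int:
    """One marker bit per cache line, so the bit count equals the line count."""
    if cache_bytes % line_bytes:
        raise ValueError("cache size must be a multiple of the line size")
    return cache_bytes // line_bytes


ASSUMED_L1D_BYTES = 64 * 1024    # assumption: 64 KiB L1 data cache
ASSUMED_LINE_BYTES = 64          # assumption: 64 B cache lines
PER_CORE_OVERHEAD_BYTES = 1064   # per-core figure quoted above

marker_bits = extra_bits_per_cache(ASSUMED_L1D_BYTES, ASSUMED_LINE_BYTES)
print(marker_bits, "marker bits =", marker_bits // 8, "bytes")
# Quoted per-core overhead relative to the assumed L1 data cache capacity
print(f"{PER_CORE_OVERHEAD_BYTES / ASSUMED_L1D_BYTES:.2%} of the assumed L1D")
```

Under these assumed parameters the marker bits amount to 128 bytes per cache; other cache geometries scale linearly with the number of lines.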

## **Basic Operations**

#### Main core

1. Take register cpt + start inst counter
2. Run: log and push loads/stores to checker's LSL\$
3. Take cpt + stop counter

Figure: core pipeline with labels: fetch, decode, rename, ROB, register file, register cpt unit, inst counter, load-store queue, load-store comparator, load-store push unit, load-store log cache (LSL\$), L1 data cache, L1 inst cache.

#### **Checker core**

1. Set register from cpt + start inst counter
2. Run: get loads/stores from LSL\$ + compare (e.g. address, store value)
3. Stop at count + take cpt + compare
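The main-core and checker-core steps above can be illustrated with a toy model. This is a minimal sketch, not the hardware design: the four-instruction register machine, the tuple encoding `(op, a, b)`, and the `fault_at` fault-injection parameter are all hypothetical. It shows the key property that the checker never touches memory: loads are replayed from the log, while load addresses, store addresses and store values are compared against it.

```python
def run_main(program, regs, memory, fault_at=None):
    """Main core: checkpoint, run while logging loads/stores, checkpoint again."""
    start_cpt = list(regs)                  # 1. register checkpoint
    regs, log, count = list(regs), [], 0    # LSL entries + instruction counter
    for op, a, b in program:                # 2. run, logging every load/store
        if op == "li":                      # load immediate: r[a] = b
            regs[a] = b
        elif op == "add":                   # r[a] += r[b]
            regs[a] += regs[b]
            if count == fault_at:
                regs[a] ^= 1                # hypothetical injected bit flip
        elif op == "load":                  # r[a] = mem[r[b]]
            addr = regs[b]
            regs[a] = memory[addr]
            log.append(("load", addr, regs[a]))
        elif op == "store":                 # mem[r[b]] = r[a]
            addr = regs[b]
            memory[addr] = regs[a]
            log.append(("store", addr, regs[a]))
        count += 1
    return start_cpt, regs, log, count      # 3. end checkpoint + final count


def run_checker(program, start_cpt, end_cpt, log, count):
    """Checker core: replay from the start checkpoint using only the log."""
    regs, entries = list(start_cpt), iter(log)   # 1. set registers from cpt
    for op, a, b in program[:count]:             # 3. stop at the recorded count
        if op == "li":
            regs[a] = b
        elif op == "add":
            regs[a] += regs[b]
        else:                                    # 2. consume one log entry
            entry = next(entries, None)
            if op == "load":
                if entry is None or entry[:2] != ("load", regs[b]):
                    return False                 # load address mismatch
                regs[a] = entry[2]               # load value comes from the log
            elif entry != ("store", regs[b], regs[a]):
                return False                     # store address/value mismatch
    return regs == end_cpt                       # compare end checkpoints


program = [("li", 0, 8), ("load", 1, 0), ("li", 2, 5), ("add", 1, 2), ("store", 1, 0)]

start, end, log, n = run_main(program, [0, 0, 0, 0], {8: 37})
print(run_checker(program, start, end, log, n))   # True

start, end, log, n = run_main(program, [0, 0, 0, 0], {8: 37}, fault_at=3)
print(run_checker(program, start, end, log, n))   # False
```

In the second run the injected fault corrupts the `add` result on the main core, so the logged store value no longer matches what the checker recomputes and the segment is flagged.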

## **Evaluation**

High-performance X2 + energy-efficient A510 system

Full-speed X2 main core + various checker core types/counts at various DVFS points


## **ParaVerser**

Affordable, adjustable error detection with heterogeneous cores in data centers



#### Adjustable performance and error coverage

- **Full-coverage**: error detection first
  - Guarantees **full** error detection **coverage**
  - Insufficient resources for error detection $\rightarrow$ original execution stalls
- **Opportunistic**: performance first
  - Guarantees minimal performance overhead
  - Insufficient resources for error detection $\rightarrow$ original execution **continues** with segments left **unchecked**

## **Full-coverage error detection**

- DSN18 & Paradox: prior works, **25% area** overhead from **dedicated** checker cores
- Homogeneous: similar to dual-core **lockstep**, **95%** energy overhead
- **ParaVerser** with 4\*A510 checkers at the min ED2P (energy \* delay^2) DVFS config:
  - **4.3%** performance **degradation** (vs Baseline without error detection)
  - **70% reduction** in energy overhead (vs Homogeneous)

## **Opportunistic error detection**

- <1% performance overhead
- High error detection coverage with very few resources
- Potential for sampling
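The difference between the full-coverage and opportunistic modes can be sketched with a toy scheduling model. Everything here is an assumption for illustration: the main core runs one segment per time unit, each checker needs `check_time` units per segment measured from the segment's start, and a checker must be free when a segment begins. The model does not attempt to reproduce the measured results.

```python
def simulate(n_segments, n_checkers, check_time, mode):
    """Return (total main-core time, fraction of segments checked)."""
    free_at = [0.0] * n_checkers     # time at which each checker becomes free
    now, checked = 0.0, 0
    for _ in range(n_segments):
        i = min(range(n_checkers), key=lambda k: free_at[k])
        if free_at[i] > now:         # no checker available for this segment
            if mode == "full-coverage":
                now = free_at[i]     # main core stalls until one frees up
            else:                    # opportunistic
                now += 1.0           # main core runs the segment unchecked
                continue
        free_at[i] = now + check_time
        checked += 1
        now += 1.0                   # main core executes the segment
    return now, checked / n_segments


# Assumed scenario: checkers are 4x slower than the main core.
for mode in ("full-coverage", "opportunistic"):
    for n_checkers in (4, 2):
        time, coverage = simulate(100, n_checkers, 4.0, mode)
        print(f"{mode:14s} checkers={n_checkers} time={time:6.1f} coverage={coverage:.0%}")
```

With enough checkers the two modes behave identically. With too few, full-coverage pays in execution time while opportunistic pays in coverage, which is the trade-off the two modes expose.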

## **NoC overhead and hash-mode**

- LSL traffic causes heavy NoC contention on a slower NoC
- Hash-mode hashes all LSL traffic except for load values
  - Greatly reduces NoC contention
  - Similar slowdown to the fast NoC (2x width, +1/3 clock rate)
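One way to read hash-mode is sketched below: load values are still sent raw, since the checker needs them to replay execution, while load addresses, store addresses and store values are only ever compared, so a single digest per segment can stand in for them. This is an interpretation, not the documented mechanism. SHA-256, the 8-byte word size and all function names are assumptions; the hash function actually used is not specified here.

```python
import hashlib
import struct

WORD_BYTES = 8  # assumption: 64-bit addresses and values


def lsl_digest(accesses):
    """Hash everything except load values: kinds, addresses, store values."""
    h = hashlib.sha256()             # stand-in hash, not the one used in hardware
    for kind, addr, value in accesses:
        h.update(kind.encode())
        h.update(struct.pack("<Q", addr))
        if kind == "store":
            h.update(struct.pack("<Q", value))
    return h.digest()


def hash_mode_message(accesses):
    """What the main core sends per segment: raw load values plus one digest."""
    load_values = [value for kind, _, value in accesses if kind == "load"]
    return load_values, lsl_digest(accesses)


def raw_bytes(accesses):
    return len(accesses) * 2 * WORD_BYTES        # address + value per entry


def hash_mode_bytes(accesses):
    load_values, digest = hash_mode_message(accesses)
    return len(load_values) * WORD_BYTES + len(digest)


# Hypothetical segment: 500 loads interleaved with 500 stores.
main_log = []
for i in range(500):
    main_log.append(("load", 8 * i, i))
    main_log.append(("store", 8 * i, i + 1))

print(raw_bytes(main_log), hash_mode_bytes(main_log))    # 16000 4032

# The checker hashes its own recomputed accesses and compares digests.
faulty = list(main_log)
faulty[1] = ("store", 0, 999)                            # corrupted store value
print(lsl_digest(main_log) == lsl_digest(list(main_log)))  # True
print(lsl_digest(main_log) == lsl_digest(faulty))          # False
```

The byte counts only illustrate why hashing reduces traffic under these assumptions. One consequence of this reading is that a mismatch is detected at digest comparison, at the end of the segment, rather than at the individual access.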