# Machine Learning Systems

#### 4: Hardware Acceleration

Nicholas D. Lane

- HW enabling Deep Learning
- Performance Metrics
- Where does Energy Go?
- Hardware Efficiency Options
- Hardware Case Studies





#### HW enabling Deep Learning








http://mlsys.cst.cam.ac.uk/teach

#### HW & Deep Learning Basics

- 1986: Backpropagation published
- ~30 years of trying this on CPUs
  - Sequential, complex instruction set, great for control flow
- 2012: AlexNet paper $\Rightarrow$ +10% accuracy $\Rightarrow$ Deep learning explosion
  - GPUs: parallel, great for matrix math, bad for control flow



#### Hardware types





#### Specialized Hardware








#### Compute Performance Metrics



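- MACs/s: multiply-accumulate operations per second
  - One MAC is a multiply plus an add, so a MACs/s figure is half the equivalent FLOPs/s figure
- OPs/s: used for non-floating-point operations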
- Chips are often labeled with "peak FLOPs/s"
  - Not achievable under normal workloads
  - Very rough indication of performance
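As a rough illustration of how these units relate, the sketch below counts the MACs in one fully connected layer and converts them to FLOPs. The layer size and the `peak_flops_per_s` value are made-up numbers for illustration, not figures from any real chip.

```python
def dense_layer_macs(in_features: int, out_features: int, batch: int = 1) -> int:
    """MACs for y = W x: one multiply-accumulate per weight per input sample."""
    return in_features * out_features * batch


def macs_to_flops(macs: int) -> int:
    """One MAC is a multiply plus an add, i.e. two floating-point operations."""
    return 2 * macs


macs = dense_layer_macs(1024, 1024)
flops = macs_to_flops(macs)

# Hypothetical peak rate (assumption). Real workloads rarely reach the peak,
# so this is only a lower bound on the runtime.
peak_flops_per_s = 10e12
best_case_seconds = flops / peak_flops_per_s

print(f"{macs} MACs = {flops} FLOPs, at least {best_case_seconds:.2e} s")
```

The resulting time is a best case only: as noted above, peak FLOPs/s is not achievable under normal workloads.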







#### Memory Performance Metrics

- Memory capacity [GB]
- Memory bandwidth [GB/s]
  - Transfer speed from memory chip to compute chip
- More complicated because there is a memory hierarchy
  - Showing "external"/"main" memory
  - Can have caches, local memory, registers with much higher bandwidth

Figure: an accelerator chip connected to a memory chip (memory bandwidth e.g. 20 GB/s, memory capacity e.g. 8 GB).
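To get a feel for what the example numbers imply, the sketch below estimates how long it takes to move data over the external memory link. It uses the 8 GB and 20 GB/s example values from the figure; the 100-million-parameter FP32 model is a hypothetical workload, and caches and on-chip memory are ignored.

```python
GB = 1e9  # bytes


def transfer_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Best-case time to move num_bytes across a link of the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Reading the whole 8 GB memory once at 20 GB/s.
full_sweep_s = transfer_time_s(8 * GB, 20 * GB)

# Hypothetical model (assumption): 100M parameters stored as 4-byte FP32 values,
# all fetched from external memory once per inference.
weights_bytes = 100e6 * 4
per_inference_s = transfer_time_s(weights_bytes, 20 * GB)

print(f"full sweep: {full_sweep_s:.2f} s, weights per inference: {per_inference_s * 1e3:.0f} ms")
```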



#### **DNN Performance**

- Latency: number of seconds per inference (unit = seconds)
- Throughput: number of inferences per second (unit = inferences/second)
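Throughput is only the reciprocal of latency when one inference runs at a time. The sketch below uses a toy cost model (an assumption, not a measurement) in which each batch pays a fixed overhead plus a per-sample cost, to show how batching can raise throughput while also raising latency.

```python
def batch_latency_s(batch_size: int,
                    fixed_overhead_s: float = 0.002,
                    per_sample_s: float = 0.0005) -> float:
    """Toy model: time to finish one batch (every sample waits for the whole batch)."""
    return fixed_overhead_s + per_sample_s * batch_size


def throughput_per_s(batch_size: int) -> float:
    """Inferences completed per second when batches run back to back."""
    return batch_size / batch_latency_s(batch_size)


for b in (1, 8, 64):
    print(f"batch={b:2d}  latency={batch_latency_s(b) * 1e3:5.1f} ms  "
          f"throughput={throughput_per_s(b):7.1f} inf/s")
```

Under this model, larger batches amortize the fixed overhead, so the right batch size depends on whether latency or throughput matters more.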










#### Where does the energy go?

- Energy breakdown of an add instruction in a 45nm CPU
- How can we optimize this?



Source: Horowitz



#### Amortize Overhead



Source: Dally



#### "Special" Instruction Examples



16x16 = 256 MAC/cycle per tensor core (~500 tensor cores per GPU)



256x256 = 64 kMAC/cycle
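The arithmetic behind these two figures can be sketched as follows. The array sizes and the count of roughly 500 tensor cores come from the slide; the 1 GHz clock is an assumed value used only to turn MACs/cycle into MACs/s.

```python
def macs_per_cycle(rows: int, cols: int, units: int = 1) -> int:
    """MACs issued per cycle by `units` arrays of rows x cols multipliers."""
    return rows * cols * units


tensor_core = macs_per_cycle(16, 16)            # one 16x16 unit
gpu_total = macs_per_cycle(16, 16, units=500)   # ~500 such units per GPU
large_array = macs_per_cycle(256, 256)          # one 256x256 array

clock_hz = 1e9  # assumption: 1 GHz clock, for illustration only
print(tensor_core, gpu_total, large_array)
print(f"peak for the 256x256 array: {large_array * clock_hz / 1e12:.1f} TMAC/s")
```

Each special instruction amortizes its fetch and decode overhead across hundreds or thousands of MACs.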




#### Numerical Format & Precision



- Floating point
  - IEEE standard includes FP32 and FP16
  - Many non-standard FP formats are used in DNNs, e.g. bfloat, minifloat
- Integer
  - Whole numbers only
  - (Typically) much cheaper in circuit area and power
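The trade-offs between these formats can be explored with a small sketch using only the Python standard library. The helper names are hypothetical. bfloat16 is emulated by truncating an FP32 value to its top 16 bits, which is a simplification (real hardware typically rounds), and the INT8 path assumes a simple symmetric scale.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to IEEE FP32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round to IEEE FP16 (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits) by truncating FP32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_int8(x: float, scale: float) -> int:
    """Symmetric INT8 quantization: round x / scale and clamp to [-128, 127]."""
    return max(-128, min(127, round(x / scale)))


x = 0.1
errors = {name: abs(fn(x) - x)
          for name, fn in [("fp32", to_fp32), ("fp16", to_fp16), ("bfloat16", to_bfloat16)]}
print(errors)
print(to_int8(1.0, scale=0.01), to_int8(5.0, scale=0.01))
```

bfloat16 keeps the FP32 exponent range but has fewer mantissa bits than FP16, so it represents 0.1 less precisely while covering much larger magnitudes.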



#### Cost of Arithmetic Operations








- HW enabling Deep Learning
- Performance Metrics
- Where does Energy Go?
- Hardware Efficiency Options
  - Arithmetic
  - Memory
  - Ineffectual Operation
- Hardware Case Studies






#### Memory is the bottleneck





#### Memory Hierarchy Optimizations

1. Get data close to the computation. (LOCALITY)
2. Once data is close, perform all computations with this data. (REUSE)
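A tiled matrix multiplication is one common way to apply both ideas. The sketch below is a plain-Python illustration, not an accelerator implementation: it copies a tile of each operand into a local buffer (locality), uses every element of the tile before moving on (reuse), and counts how many words were fetched from the "far" memory. The function name and the word-counting model are assumptions made for illustration.

```python
def matmul_tiled(A, B, n, tile):
    """Multiply two n x n matrices tile by tile; return (C, words_loaded)."""
    C = [[0] * n for _ in range(n)]
    words_loaded = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # LOCALITY: bring one tile of A and one tile of B close to compute.
                a = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                words_loaded += sum(len(r) for r in a) + sum(len(r) for r in b)
                # REUSE: every loaded element takes part in `tile` MACs.
                for i, a_row in enumerate(a):
                    for k, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b[k]):
                            C[i0 + i][j0 + j] += a_val * b_val
    return C, words_loaded


n = 8
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i - j for j in range(n)] for i in range(n)]

C1, loads_no_reuse = matmul_tiled(A, B, n, tile=1)  # every MAC fetches its operands
C4, loads_tiled = matmul_tiled(A, B, n, tile=4)     # each fetched word is used 4 times
print(loads_no_reuse, loads_tiled)
```

In this model the number of fetched words drops by a factor equal to the tile size, while the number of MACs stays the same.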





#### Memory Hierarchy

Why do we have a memory hierarchy?

- The closer you get to compute, the more expensive and scarce the memory resource becomes
- In most cases, the DNN parameters live off chip and are fetched layer-by-layer or tile-by-tile
- Data locality: how to get data close to the processing elements (PEs) to keep them fully utilized





#### Memory Hierarchy Examples





#### Data Reuse





#### Stationary Weights?







#### Stationary Inputs?

































#### Kinds of Sparsity




#### Coarse-grained "Block" Sparsity

- All DNN accelerators are parallel
  - Multiple MACs/cycle
- The smallest unit of computation that can be skipped is a large block (recall amortized overhead)
- Example:
  - Systolic array with 64 MACs/cycle
    - 8x8 pattern
  - 64x64 matrix = 4096 MACs
  - Total # cycles = 4096/64 = 64 cycles (one cycle per 8x8 block)
  - Block sparsity pattern needs to skip blocks of 8x8
  - With 26 of the 64 blocks zero: speedup = 64/(64-26) ≈ 1.7x
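The cycle count in this example can be reproduced with a short sketch. It assumes, as in the example, that the array processes one 8x8 block per cycle and that an all-zero block costs no cycles. Which 26 blocks are zero is an arbitrary choice here, since only their number affects the result.

```python
def block_mask(weights, block=8):
    """True for every block x block tile that contains at least one non-zero weight."""
    n = len(weights)
    return [[any(weights[i][j] != 0
                 for i in range(bi, bi + block)
                 for j in range(bj, bj + block))
             for bj in range(0, n, block)]
            for bi in range(0, n, block)]


def cycles_needed(mask):
    """One cycle per non-zero block; all-zero blocks are skipped entirely."""
    return sum(1 for row in mask for nonzero in row if nonzero)


n, block = 64, 8
weights = [[1.0] * n for _ in range(n)]

# Zero out the first 26 blocks in row-major order (arbitrary placement).
zeroed = 0
for bi in range(0, n, block):
    for bj in range(0, n, block):
        if zeroed < 26:
            for i in range(bi, bi + block):
                for j in range(bj, bj + block):
                    weights[i][j] = 0.0
            zeroed += 1

dense_cycles = (n // block) ** 2
sparse_cycles = cycles_needed(block_mask(weights))
speedup = dense_cycles / sparse_cycles
print(dense_cycles, sparse_cycles, round(speedup, 2))
```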

Figure: dense weights vs. block-sparse weights on an array with 64 MACs/cycle.




#### Simplest way to leverage sparsity with low overhead

- ⇒ Single bit per 8x8 block (1 bit per 64 weights: 1/64 ≈ 1.6% overhead)
- ⇒ Simple control logic because entire block is skipped



#### Fine-grained Sparsity (Ampere GPUs)

- With the Ampere generation, fine-grained sparsity was added to Tensor Cores on Nvidia GPUs
- In every block of 4 elements, 2 elements must be zero (2:4 structured sparsity)
- Requires retraining to regain accuracy
- Overhead?
  - 2 bits of index metadata per stored (non-zero) 8-bit element
  - 12.5% memory overhead relative to the dense matrix
  - Control logic? Performance improvement? Power savings?
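The storage cost can be checked with a small sketch of a 2:4 compressed format. Everything here is illustrative: the function names are hypothetical, the two survivors in each group are chosen by magnitude (one simple pruning rule, not necessarily what Nvidia's tools do), and the overhead is measured against the dense matrix size.

```python
def compress_2_4(values):
    """Keep the 2 largest-magnitude entries of every group of 4.

    Returns (kept_values, indices), where each index is the 2-bit position
    of a kept value inside its group.
    """
    assert len(values) % 4 == 0
    kept, indices = [], []
    for g in range(0, len(values), 4):
        group = values[g:g + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(by_magnitude[:2]):
            kept.append(group[i])
            indices.append(i)
    return kept, indices


def decompress_2_4(kept, indices):
    """Rebuild the dense vector; positions that were dropped become zero."""
    out = []
    for g in range(0, len(kept), 2):
        group = [0] * 4
        for value, i in zip(kept[g:g + 2], indices[g:g + 2]):
            group[i] = value
        out.extend(group)
    return out


dense = [3, 0, 0, -1, 0, 5, 2, 0]   # already satisfies the 2:4 pattern
kept, indices = compress_2_4(dense)

element_bits, index_bits = 8, 2
dense_bits = len(dense) * element_bits
metadata_bits = len(indices) * index_bits
compressed_bits = len(kept) * element_bits + metadata_bits

print(kept, indices)
print(f"metadata overhead vs dense: {metadata_bits / dense_bits:.1%}")
print(f"compressed size vs dense:   {compressed_bits / dense_bits:.1%}")
```

The index bits add 12.5% of the dense size, while the stored values shrink to half of it.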






#### Nvidia GPU Progression



Source: Dally



#### Google TPU





Source: D. Harris

#### Groq

- Programmable dataflow architecture
- 1000 TOPs/s peak INT8 performance
- 200 MB on-chip SRAM (80 TB/s)
  - No external memory, scales by increasing number of chips
- FP16 and INT8 precision
- Philosophy: "unroll" a multicore architecture on-chip spatially to allow for custom instructions





Source: Groq








#### Cerebras



#### Largest Chip Ever Built

- 46,225 mm<sup>2</sup> silicon
- 1.2 trillion transistors
- 400,000 AI optimized cores
- 18 Gigabytes of On-chip Memory
- 9 PByte/s memory bandwidth
- 100 Pbit/s fabric bandwidth
- TSMC 16nm process



For comparison, the largest GPU: 21.1 billion transistors, 815 mm<sup>2</sup> silicon

Source: Cerebras
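The scale difference can be made concrete with a quick ratio calculation. All input numbers are taken from the slide above; the sketch only divides them.

```python
wafer_scale_chip = {"area_mm2": 46_225, "transistors": 1.2e12}
largest_gpu = {"area_mm2": 815, "transistors": 21.1e9}

area_ratio = wafer_scale_chip["area_mm2"] / largest_gpu["area_mm2"]
transistor_ratio = wafer_scale_chip["transistors"] / largest_gpu["transistors"]

# Both ratios come out close to each other, i.e. similar transistor density.
print(f"area: {area_ratio:.1f}x, transistors: {transistor_ratio:.1f}x")
```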



#### Summary of the Day

- HW enabling Deep Learning
- Performance Metrics
- Where does Energy Go?
- Hardware Efficiency Options
- Hardware Case Studies