# Superscalar Performance without Superscalar Overheads

## Ali Mustafa Zaidi & David Greaves

#### The 'Dark Silicon' Challenge

In the 'Dark Silicon Era', achievable speedups will be far below the exponential growth suggested by Moore's Law, primarily due to constrained power budgets. To mitigate the effects of Dark Silicon, we must dramatically improve the energy efficiency of computation.

Computer architects are increasingly relying on custom hardware to achieve high performance with very high energy efficiency. Unfortunately, this approach has two problems: (1) **Amenability**: custom hardware currently performs poorly on irregular, control-flow intensive applications and on those with low data parallelism. (2) **Programmability**: developing custom hardware is far more difficult and time-consuming than writing code for a sequential processor.

Thus, to balance energy, performance and programmability, we must build heterogeneous designs that combine complex, power-hungry superscalar processors with energy-efficient custom hardware.

### The Dark Silicon Problem

University of Cambridge Computer Laboratory



Superscalar processors achieve high performance by internally converting from the sequential Von Neumann computational model to the highly concurrent Dataflow model: (1) branch prediction is used to overcome control dependences, (2) register renaming is used to overcome name dependences, and (3) out-of-order issue, together with bypass and forwarding logic, is used to accelerate true dependences. However, performing this conversion at runtime in a general-purpose processor incurs energy overheads 10-1000x greater than those of the equivalent custom or reconfigurable hardware implementation of an application.

#### **Dataflow Computing for Efficiency and Performance**

Our research goal is to address the Programmability and Amenability issues of custom hardware, enabling a broader, more energy-efficient design space for heterogeneous systems. We do this by:

(1) Developing a new intermediate representation (IR), the 'Value-State Flow Graph (VSFG)', that is inherently dataflow oriented and exposes much greater concurrency than existing IRs such as the conventional Control Flow Graph (CFG).

(2) Providing a new high-level synthesis (HLS) toolchain that converts input high-level language code into our dataflow IR and then implements it as custom dataflow hardware.

(3) Generating custom hardware that, unlike the output of conventional HLS tools but like superscalar processors, uses dynamic *execution scheduling* to provide further performance improvements on control-flow intensive code.
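As a rough illustration of the gap between sequential (Von Neumann) and dataflow execution discussed here, the sketch below counts execution steps for a small straight-line program under both models. The program `OPS`, the function names, unit latency for every operation, and an unlimited number of functional units are all assumptions made for illustration; none of them come from the VSFG toolchain. Each value is written exactly once, so name dependences are already absent and only true dependences remain.

```python
# Hypothetical straight-line program, listed in program order.
# Each entry is (destination, [source operands]); only true
# (read-after-write) dependences are recorded.
OPS = [
    ("a", []),          # a = load x
    ("b", []),          # b = load y
    ("c", ["a", "b"]),  # c = a + b
    ("d", ["a"]),       # d = a * 2
    ("e", ["b"]),       # e = b * 3
    ("f", ["d", "e"]),  # f = d - e
    ("g", ["c", "f"]),  # g = c + f
]


def sequential_steps(ops):
    """Sequential model: one operation completes per step, in program order."""
    return len(ops)


def dataflow_steps(ops):
    """Dataflow model: an operation fires one step after its last operand is ready.

    Assumes unit latency, unlimited functional units, and that every
    source is defined earlier in the list.
    """
    ready = {}  # destination name -> step at which its value is available
    for dest, srcs in ops:
        ready[dest] = 1 + max((ready[s] for s in srcs), default=0)
    return max(ready.values(), default=0)


if __name__ == "__main__":
    print("sequential steps:", sequential_steps(OPS))
    print("dataflow steps:  ", dataflow_steps(OPS))
```

Under these assumptions the dataflow step count equals the length of the longest dependence chain, which is the limit that both superscalar hardware and statically generated dataflow hardware aim to approach.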



Ali-Mustafa.Zaidi@cl.cam.ac.uk David.Greaves@cl.cam.ac.uk

## **Computer Architecture Group**

http://www.cl.cam.ac.uk/research/comparch/