# The future of instruction-level parallelism (ILP) Alexandra W. Chadwick, Márton Erdős, Utpal Bora, Akshay Bhosale, Yuxin Guo, Timothy M. Jones ldr x8, [x0, #264] #### Instruction-level parallelism Programs consist of instructions. These are simple operations that a microprocessor can execute, such as addition or multiplication. Abstractly, executing a program consists of executing its instructions in program order. However, for decades, processors have utilised instruction-level parallelism (ILP) to execute multiple instructions simultaneously, and thus to execute programs faster than the implications of program order allow. Processor vendors have recently started to use ILP more aggressively to achieve performance. **Figure 1**: Maximum measurable instructions per cycle (IPC) of an optimal program executed on commercial processors from six vendors. Notice the increasing trend since 2015. mov x0, x19 thr w8, [x19, #460]add x9, x9, #0x300movk x10, #0x80, lsl #48 orr x25, x9, x10 str x23, [sp, #16][dr x9, [x20, #296] orr x25, x10, x11 x8 -ldr w10, [x0, #4388] | mov x15, x6 tst x9, x8 stp x22, x21, [sp, #322]fx w2, w8, #6, #6 ubfx x9, x8, #6, #6 b.eq b824 st x9, x8 -str x18, [x10, #520] \$tr w11, [x0, #476] | str w12, [x10, #504] adr x10, 8334 b.eq 1560 b.gt b7a8 b.eq 100d8 ldr x26, [x0, #264] ret add w16, w17, w16 add x8, x19, x8, lsl #2 cmp x27, x22 b.ge b9e4 add x9, x21, x9, lsl #2 ldrsw x9, [x9, #4] lsl x9, x9, #8 #### Dynamic dependency graphs (DDG) When a program executes, each dynamic instruction may consume values produced by previous instructions, and produce values that future instructions may consume. By tracking these producer/consumer relationships we can build a *dynamic dependency graph* (DDG). This consists of one node per dynamic instruction, with directed edges from producers to consumers. The background of this poster is a tiny fraction of a DDG for the deepsjeng benchmark from the SPEC CPU2017 benchmark suite, compiled for AArch64. The full DDG for this program would have 2 trillion nodes, a modern high-performance processor would execute all of the instructions on this poster in about 0.3 microseconds. stur x8, [x29, #-8] stur and uad str x8, ldr x8, orr x8, x8, x22 and x0, x27, x8 fmov d0, x0 cnt v0.8b, v0.8b uaddlv h0, v0.8b fmov w0, s0 add w9, w9, w0 add w9, w9, w9, lsl #1 madd w9, w10, w11, w9 add w8, w9, w8 add w8, w10, w8 cp w9, w8, [x11, #52 stp x28, x27, [sp, #192] ldp x28, x27, [sp, #192] mov x8, x27 add x27, x8, #0x1 stp x28, x27, [sp, #192] ldp x28, x27, [sp, #192] sbfiz x25, x20, #2, #32 ldr w1, [x19, x25] ubfx w9, w1, #6, #6 add x3, x0, w9, uxtw #2 ldr w5, [x3, #4] sxtw x5, w5 add x17, x0, x5, lsl #3 ldr x1, [x17, #288] eor x1, x1, x16 str x1, [x17, #288] ldr x9, [x21, #368] orr x21, x9, x8 tst x8. x21 b.eq 17f4 ldp x22, x21, [sp, #32] mov x0, x21 mov x19, x0 mov x0, x19 stur x8, [x29, #-8] ldr x8, [x0] neg x9, x8 and x9, x8, x9 mul x8, x9, x8 lsr x8, x8, #56 and x8, x8, #0xfc ldr w0, [x9, x8] mov x1, x0 mov w10, w1 lsl x10, x10, #3 and x12, x11, x8 mul x12, x12, x13 lsr x12, x12, #52 and x12, x12, #0xfc0 add x12, x14, x12 ldr x11, [x11, x10] sub mul lsr add ldr > str > stp x22 and ldr x8, [sp, #64] [sp, #64] sub w21 stp x22 add x8, x12, x8 x8, [x8, x10] L x8, x8, x9 x0, x11, x8 x12, x12, #52 12, x12, #0xfc0 x12, x14, x12 x12, [x12, x9] x11, x12, x11 x0, x8, x11 nov x27, x0 x8, x27, x22 x8, [sp, #40] w28, w0, lsl #1 x27, [sp, #128] x23, [sp, #160] w28, [x29, #-36] w12, w0, w8, w19 ov x26, x12 w9, w9, w10 w19, w8, w9 w8, w10, w8 w8, w0, w8, w19 w8, [x11, #32] w0, [x29, #-32] w9, w9, lsl #2 w9, w10, w11, w9 , w9, w8, lsl #1 dd w9, w9, w19 add w8, w8, w12 add w8, w8, w13 add w8, w8, w14 add w8, w8, w15 ret ldr x9, [x20, #376] str w11, [x10, #496] ldp w16, w17, [sp, #48] mul w18, w16, w16 sub w16, w16, w18, lsr #8 add w16, w16, w0, lsr #8 mul w16, w16, w17 add w17, w16, #0x7ff csel w16, w17, w16, lt add w22, w8, w16, asr #11 mov x1, x22 mov x19, x1 mov x0, x19 sub w8, w0, w23 str w8, [x9, #2800] cmp w9, w8 cmp w12, w9 b.ge b6f0 and x9, x9, #0xfc0 add x9, x10, x9 dr x9, [x9, w10, uxtw] lsl x0, x9, x8 and x8, x0, x25 orr x21, x8, x21 orr x21, x8, x21 mov x0, x21 mov x23, x0 cbnz x23, ff4c b.eq 10000 and x10, x10, #0xfc0 add x10, x11, x10 and x0, x8, x9 tst x0, x21 b.eq 17f4 lsl w8, w8, #10 smull x8, w8, w24 lsr x9, x8, #32 add w22, w8, w9, asr #5 mov x1, x22 stur w1, [x29, #-12] mul x8, x8, x13 lsr x8, x8, #52 and x8, x8, #0xfc0 add x8, x14, x8 r x8, [x10, w8, uxtw #3] nov x22, x0 Program Order The longest path through a DDG is called the critical path. Assuming no processor can ever execute any instruction in less than one cycle, the length of the critical path should be a lower bound for the execution time of the program. This is because the instructions on the critical path must all execute, in sequence, in order for the program to finish. The critical path for the DDG in the background of the poster is shown in bold in the middle. The function that each instruction belongs to is indicated by a coloured background. ### **Speculative execution** High-performance microprocessors use a technique called speculative execution to improve performance. The processor predicts the result or outcome of some instructions, and can thus execute instructions that depend upon that result earlier. If the prediction is correct, this can allow a processor to achieve higher performance than the critical path in the DDG would suggest. Speculative execution is used for all branch outcomes in state-of-the-art processors, as well as to disambiguate memory operations. We can nevertheless model the effects of speculative execution in a DDG; we simply delete any edges representing results that predictors would allow the processor to predict. In practice this means we delete the edges outward from most branch instructions as can be seen in the indicated positions. These branch instructions have no dependencies because a simulated branch predictor could predict their target. #### **Limits of ILP** Using the DDG, we can evaluate the limits of instruction-level parallelism in current code. We created the DDG for the first billion instructions of each of the SPEC CPU2017 benchmark programs. We break any dependencies that state-of-the-art branch prediction could successfully predict, thus modelling speculative execution. We then split the DDG into contiguous 'windows' of power-of-two sized groups of instructions. This models the execution paradigm of real-world processors, which cannot view the entire DDG simultaneously, but instead see only small finite groups of instructions (about $2^9 = 512$ for the biggest processors today). For each of these windows we can use the DDG to compute the length of the critical path. This gives the minimum execution time of that window. If we divide the number of instructions in each window by its execution time, we obtain the average number of instructions that can execute in parallel in that part of the program. These average parallelism values vary considerably in different parts of the program. **Figure 2**: Worryingly, according to our model, even with extreme window scaling, a large majority of program execution time is spent in regions with an average ILP below 16. ## Opportunities for the future? The limits of ILP found in this study are concerning, suggesting that future microprocessors would struggle to ever achieve more than 16 IPC on average as shown in figure 2. This is concerning given the current scaling trend towards increasing hardware IPC resources shown in figure 1. Can we use our knowledge of the DDG to spot improvements? Take a look at the indicated parts of the critical path in the DDG. Instructions are coloured according to the function they are part of in the source code. The calling convention utilised by this architecture dictates that callee functions must save and restore certain registers (x19–x29). We see sequences of instructions on the critical path that are enforcing this convention. The caller moves values it wishes to save into these registers, and the callee saves and restores them. The fact that these convention-enforcing instructions are on the critical path suggests a missed opportunity. To investigate, we cut the edges in the DDG that just amounted to saving and restoring callee-saved registers, and recomputed the critical path. Doing so allows us to evaluate the potential IPC improvement for a processor that could somehow avoid the cost of this calling convention (either via architectural or microarchitectural innovation). **Figure 3**: As figure 2, but with edges cut representing the calling convention. A significant boost to ILP is observed.