Extract from "HOW TO OPTIMIZE FOR THE PENTIUM PROCESSORS" ********************************************************* Copyright (c) 1996, 1997 by Agner Fog. Last modified 1998-03-05. 12. JUMPS AND BRANCHES ====================== The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction is correct, then it can save a considerable amount of time by loading the subsequent instructions into the pipeline and start decoding them before the jump is executed. If the prediction turns out to be wrong, then the pipeline has to be flushed, which will cost a penalty depending on the length of the pipeline. The predictions are based on a Branch Target Buffer (BTB) which stores the history for each branch or jump instruction and makes predictions based on the prior history of executions of each instruction. The BTB is organized like a set-associative cache where new entries are allocated according to a pseudo-random replacement method. When optimizing code, it is important to minimize the number of misprediction penalties. This requires a good understanding of how the jump prediction works. The branch prediction mechanisms are not described adequately in Intel manuals or anywhere else. I am therefore giving a very detailed description here. This information is based on my own research (with the help of Karki Jitendra Bahadur for the PPlain). In the following, I will use the term 'control transfer instruction' for any instruction which can change the instruction pointer, including conditional and unconditional, direct and indirect, near and far, jumps, calls, and returns. All these instructions use prediction. 12.2 Branch prediction in PMMX, PPro and PII ============================================= 12.2.1 BTB organization (PMMX, PPro and PII) --------------------------------------------- The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16 sets. Each entry is identified by bits 2-31 of the address of the last byte of the control transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB as a tag. Control transfer instructions which are spaced 64 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways per set, this won't happen too often. The branch target buffers of the PPro and PII have 512 entries. I have no information about how they are organized into sets. The PPro and PII allocate a BTB entry to any control transfer instruction the first time it is executed. The PMMX allocates it the first time it jumps. A branch instruction which never jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in the BTB, even if it never jumps again. An entry may be pushed out of the BTB when another control transfer instruction with the same set-value needs a BTB entry. 12.2.2 misprediction penalty (PMMX, PPro and PII) --------------------------------------------------- In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe, and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4 clocks. In the PPro and PII, the misprediction penalty is very high due to the long pipeline. A misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to be aware of poorly predictable branches when running on PPro or PII. 
12.2.3 Pattern recognition for conditional jumps (PMMX, PPro and PII)
----------------------------------------------------------------------
These processors have an advanced pattern recognition mechanism which will
correctly predict a branch instruction which, for example, is taken every
fourth time and falls through the other three times. In fact, they can
predict any repetitive pattern of jumps and nojumps with a period of up to
five, and many patterns with higher periods.

The mechanism is a so-called "two-level adaptive branch prediction scheme",
invented by T.-Y. Yeh and Y. N. Patt. It is based on the same kind of
two-bit counters as described above for the PPlain (but without the
asymmetry flaw). The counter is incremented when the jump is taken and
decremented when not taken. There is no wrap-around when counting up from 3
or down from 0. A branch instruction is predicted to be taken when the
corresponding counter is in state 2 or 3, and to fall through when in state
0 or 1.

An impressive improvement is now obtained by having sixteen such counters
for each BTB entry. The mechanism selects one of these sixteen counters
based on the history of the branch instruction for the last four
executions. If, for example, the branch instruction jumps once and then
falls through three times, then you have the history bits 1000 (1=jump,
0=nojump). This will make it use counter number 8 (1000 binary = 8) for
predicting the next time and update counter 8 afterwards.

If the sequence 1000 is always followed by a 1, then counter number 8 will
soon end up in its highest state (state 3) so that it will always predict a
1000 sequence to be followed by a 1. It will take two deviations from this
pattern to change the prediction. The repetitive pattern 100010001000 will
have counter 8 in state 3, and counters 1, 2 and 4 in state 0. The other
twelve counters will be unused.

12.2.4 Perfectly predicted patterns (PMMX, PPro and PII)
---------------------------------------------------------
Below is a list of repetitive branch patterns which are predicted perfectly.

Period  Patterns
-----------------------------------------------------------------------------
 1 - 5  all
     6  000011, 000101, 000111, 001011
     7  0000101, 0000111, 0001011
     8  00001011, 00001111, 00010011, 00010111, 00101101
     9  000010011, 000010111, 000100111, 000101101
    10  0000100111, 0000101101, 0000101111, 0000110111, 0001010011, 0001011101
    11  00001001111, 00001010011, 00001011101, 00010100111
    12  000010100111, 000010111101, 000011010111, 000100110111, 000100111011
    13  0000100110111, 0000100111011, 0000101001111
    14  00001001101111, 00001001111011, 00010011010111, 00010011101011,
        00010110011101, 00010110100111
    15  000010011010111, 000010011101011, 000010100110111, 000010100111011,
        000010110011101, 000010110100111, 000010111010011, 000011010010111
    16  0000100110101111, 0000100111101011, 0000101100111101, 0000101101001111
-----------------------------------------------------------------------------

When reading this table, you should be aware that if a pattern is predicted
correctly then the same pattern reversed (read backwards) is also predicted
correctly, as well as the same pattern with all bits inverted.

Example: In the table we find the pattern 0001011. Reversing this pattern
gives 1101000. Inverting all bits gives 1110100. Both reversing and
inverting gives 0010111. These four patterns are all recognizable. Rotating
the pattern one place to the left gives 0010110. This is of course not a
new pattern, only a phase shifted version of the same pattern.
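The counter selection described in section 12.2.3 can be illustrated with a
minimal sketch. This is only a model of the scheme for a single BTB entry,
not the actual hardware. It assumes that the last four branch outcomes are
kept in the low four bits of BL (1 = jump, 0 = nojump), that the sixteen
two-bit counters are in a byte table defined as COUNTERS DB 16 DUP (0), and
that the actual outcome of the current execution is in DL (0 or 1):

        MOVZX   ESI, BL
        AND     ESI, 0FH                    ; four history bits select a counter
        CMP     BYTE PTR [COUNTERS+ESI], 2
        SETAE   AL                          ; prediction: taken if state 2 or 3
        ; update the selected counter with the actual outcome,
        ; saturating at 0 and 3:
        TEST    DL, DL
        JZ      NOTTAKEN
        CMP     BYTE PTR [COUNTERS+ESI], 3
        JE      NEWHIST
        INC     BYTE PTR [COUNTERS+ESI]
        JMP     NEWHIST
NOTTAKEN:
        CMP     BYTE PTR [COUNTERS+ESI], 0
        JE      NEWHIST
        DEC     BYTE PTR [COUNTERS+ESI]
NEWHIST:
        SHL     BL, 1
        OR      BL, DL                      ; shift the newest outcome into the history

For the repetitive pattern 100010001000 mentioned above, this model leaves
counter 8 in state 3 and counters 1, 2 and 4 in state 0, in agreement with
the description.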
All patterns which can be derived from one of the patterns in the table by
reversing, inverting and rotating are also recognizable. For reasons of
brevity, these are not listed.

It takes two periods for the pattern recognition mechanism to learn a
regular repetitive pattern after the BTB entry has been allocated. The
pattern of mispredictions in the learning period is not reproducible. This
is probably because the BTB entry contained something prior to allocation.
Since BTB entries are allocated according to a random scheme, there is
little chance of predicting what happens during the initial learning period.

12.2.5 Handling deviations from a regular pattern (PMMX, PPro and PII)
-----------------------------------------------------------------------
The branch prediction mechanism is also extremely good at handling 'almost
regular' patterns, or deviations from the regular pattern. Not only does it
learn what the regular pattern looks like; it also learns what deviations
from the regular pattern look like. If deviations are always of the same
type, then it will remember what comes after the irregular event, and the
deviation will cost only one misprediction.

Example:

  0001110001110001110001011100011100011100010111000
                        ^                   ^

In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns
that the repeated sequence is 000111. The first irregularity is an
unexpected 0, which I have marked with a ^. After this 0 the next three
jumps may be mispredicted, because it hasn't learned what comes after 0010,
0101, and 1011. After one or two irregularities of the same kind it has
learned that after 0010 comes a 1, after 0101 comes a 1, and after 1011
comes a 1. This means that after at most two irregularities of the same
kind, it has learned to handle this kind of irregularity with only one
misprediction.

The prediction mechanism is also very effective when alternating between
two different regular patterns. If, for example, we have the pattern 000111
(with period 6) repeated many times, then the pattern 01 (period 2) many
times, and then return to the 000111 pattern, then the mechanism doesn't
have to relearn the 000111 pattern, because the counters used in the 000111
sequence have been left untouched during the 01 sequence. After a few
alternations between the two patterns, it has also learned to handle the
changes of pattern with only one misprediction for each time the pattern is
switched.

12.2.6 Patterns which are not predicted perfectly (PMMX, PPro and PII)
-----------------------------------------------------------------------
The simplest branch pattern which cannot be predicted perfectly is a branch
which is taken on every 6'th execution. The pattern is:

  000001000001000001
      ^^    ^^    ^^
      ab    ab    ab

The sequence 0000 is alternately followed by a 0, in the positions marked a
above, and by a 1, in the positions marked b. This affects counter number
0, which will count up and down all the time. If counter 0 happens to start
in state 0 or 1, then it will alternate between state 0 and 1. This will
lead to a misprediction in position b. If counter 0 happens to start in
state 3, then it will alternate between state 2 and 3, which will cause a
misprediction in position a. The worst case is when it starts in state 2.
It will alternate between state 1 and 2 with the unfortunate consequence
that we get a misprediction in both position a and position b. (This is
analogous to the worst case for the PPlain explained at the end of section
12.1.1 above.)
Which of these four situations we will get depends on the history of the
BTB entry prior to allocation to this branch. This is beyond our control
because of the random allocation method.

In principle, it is possible to avoid the worst case situation, where we
have two mispredictions per period, by giving it an initial branch sequence
which is specially designed for putting the counter in the desired state.
Such an approach cannot be recommended, however, because of the
considerable extra code complexity required, and because whatever
information we have put into the counter is likely to be lost during the
next timer interrupt or task switch.

12.2.7 Completely random patterns (PMMX, PPro and PII)
-------------------------------------------------------
The powerful capability of pattern recognition has a minor drawback in the
case of completely random sequences with no regularities. The following
table lists the experimental fraction of mispredictions for a completely
random sequence of jumps and nojumps:

  fraction of       fraction of
  jumps/nojumps     mispredictions
  ---------------------------------------
  0.001/0.999       0.001001
  0.01/0.99         0.0101
  0.05/0.95         0.0525
  0.10/0.90         0.110
  0.15/0.85         0.171
  0.20/0.80         0.235
  0.25/0.75         0.300
  0.30/0.70         0.362
  0.35/0.65         0.418
  0.40/0.60         0.462
  0.45/0.55         0.490
  0.50/0.50         0.500
  ---------------------------------------

The fraction of mispredictions is slightly higher than it would be without
pattern recognition because the processor keeps trying to find repeated
patterns in a sequence which has no regularities.

12.2.8 Tight loops (PMMX)
--------------------------
The branch prediction is not reliable in tiny loops where the pattern
recognition mechanism doesn't have time to update its data before the next
branch is met. This means that simple patterns, which would normally be
predicted perfectly, are not recognized. Incidentally, some patterns which
normally would not be recognized are predicted perfectly in tight loops.
For example, a loop which always repeats 6 times would have the branch
pattern 111110 for the branch instruction at the bottom of the loop. This
pattern would normally have one or two mispredictions per period, but in a
tight loop it has none. The same applies to a loop which repeats 7 times.
Most other repeat counts are predicted more poorly in tight loops than
normal. This means that a loop which iterates 6 or 7 times should
preferably be tight, whereas other loops should preferably not be tight.
You may unroll a loop if necessary to make it less tight (see the example
below).

To find out whether a loop will behave as 'tight' on the PMMX you may use
the following rule of thumb: count the number of instructions in the loop.
If the number is 6 or less, then the loop will behave as tight. If you have
more than 7 instructions, then you can be reasonably sure that the pattern
recognition functions normally. Strangely enough, it doesn't matter how
many clock cycles each instruction takes, whether it has stalls, or whether
it is paired or not. Complex integer instructions do not count. A loop can
have lots of complex integer instructions and still behave as a tight loop.
A complex integer instruction is a non-pairable integer instruction which
always takes more than one clock cycle. Complex floating point instructions
and MMX instructions still count as one. Note that this rule of thumb is
heuristic and not completely reliable. In important cases you may want to
do your own testing.
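As a sketch of the unrolling advice above, consider a simple loop summing
an array of dwords. The first version has only four instructions and will
behave as a tight loop on the PMMX; the second version, unrolled by four,
has eight instructions, so the pattern recognition can be expected to
function normally. The registers and labels are only an example; the
unrolled version assumes that ECX is divisible by 4 and that the two
partial sums in EAX and EDX are added together after the loop:

; tight version: 4 instructions in the loop body
L1:     ADD     EAX, [ESI]
        ADD     ESI, 4
        DEC     ECX
        JNZ     L1

; unrolled by four: 8 instructions, no longer 'tight'
; (ECX must be divisible by 4; add EAX and EDX after the loop)
L2:     ADD     EAX, [ESI]
        ADD     EDX, [ESI+4]
        ADD     ESI, 8
        ADD     EAX, [ESI]
        ADD     EDX, [ESI+4]
        ADD     ESI, 8
        SUB     ECX, 4
        JNZ     L2

Using two accumulators (EAX and EDX) avoids a dependency chain between
consecutive additions.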
You can use performance monitor counter number 35H for the PMMX (0C5H for
PPro and PII) to count branch mispredictions. Test results may not be
completely deterministic, because branch predictions may depend on the
history of the BTB entry prior to allocation.

I have no exact experimental information on tight loops in PPro and PII.

12.2.9 Indirect jumps and calls (PMMX, PPro and PII)
-----------------------------------------------------
There is no pattern recognition for indirect jumps and calls, and the BTB
can remember no more than one target for an indirect jump. It is simply
predicted to go to the same target as it did last time.

12.2.10 JECXZ and LOOP (PMMX)
------------------------------
There is no pattern recognition for these two instructions in the PMMX.
They are simply predicted to go the same way as the last time they were
executed. These two instructions should be avoided in time-critical code
for the PMMX.

12.2.11 Returns (PMMX, PPro and PII)
-------------------------------------
The PMMX, PPro and PII processors have a Return Stack Buffer (RSB) which is
used for predicting return instructions. The RSB works as a
first-in-last-out buffer. Each time a CALL instruction is executed, the
corresponding return address is pushed into the RSB, and each time a RET
instruction is executed, a return address is pulled out of the RSB and used
for prediction of the RET. This mechanism makes sure that return
instructions are correctly predicted when the same subroutine is called
from several different locations.

In order to make this mechanism work correctly, you must make sure that all
calls and returns are matched. Never jump out of a subroutine without a
return, and never use a return as an indirect jump if speed is critical
(see the example below).

The RSB can hold only four entries in the PMMX. In the case where the RSB
is empty, the return instruction is predicted in the same way as an
indirect jump, i.e. it is expected to go to the same target as it did last
time. When subroutines are nested deeper than four levels, the innermost
four levels use the RSB, whereas all subsequent returns from the outer
levels use the simpler prediction mechanism as long as there are no new
calls. A return instruction which uses the RSB still occupies a BTB entry.
Four entries in the RSB doesn't sound like much, but it is probably
sufficient. Subroutine nesting deeper than four levels is certainly not
unusual, but only the innermost levels matter in terms of speed. The RSB in
the PPro and PII can probably hold 12 entries.

12.2.12 Static prediction (PMMX)
---------------------------------
A control transfer instruction which has not been seen before or which is
not in the BTB is always predicted to fall through on the PMMX. It doesn't
matter whether it goes forwards or backwards.

A branch instruction will not get a BTB entry if it always falls through.
As soon as it is taken once, it will get into the BTB and stay there no
matter how many times it falls through. A control transfer instruction can
only go out of the BTB when it is pushed out by another control transfer
instruction which steals its BTB entry.

Any control transfer instruction which jumps to the address immediately
following itself will not get a BTB entry. Example:

        JMP     SHORT LL
LL:

This instruction will never get a BTB entry and will therefore always have
a misprediction penalty.
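As a sketch of the advice in section 12.2.11 about keeping calls and
returns matched, the old trick of using a return as an indirect jump should
be avoided on these processors (TARGET is just a hypothetical label):

; pattern to avoid: RET used as an indirect jump
        PUSH    OFFSET TARGET
        RET                     ; predicted from the RSB entry left by the
                                ; latest CALL, not TARGET, and that entry is
                                ; then missing when the subroutine's real RET
                                ; is executed

; preferred: keep the RSB in step by using a plain jump
        JMP     TARGET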
12.2.13 Static prediction (PPro and PII)
-----------------------------------------
On the PPro and PII, a control transfer instruction which has not been seen
before or which is not in the BTB is predicted to fall through if it goes
forwards, and to be taken if it goes backwards (e.g. a loop). Static
prediction takes longer than dynamic prediction on these processors.

12.2.14 Close jumps (PMMX)
---------------------------
On the PMMX, there is a risk that two control transfer instructions will
share the same BTB entry if they are too close to each other. The obvious
result is that they will always be mispredicted.

The BTB entry for a control transfer instruction is identified by bits 2-31
of the address of the last byte in the instruction. If two control transfer
instructions are so close together that they differ only in bits 0-1 of the
address, then we have the problem of a shared BTB entry. Example:

        CALL    P
        JNC     SHORT L

If the last byte of the CALL instruction and the last byte of the JNC
instruction lie within the same dword of memory, then we have the penalty.
You have to look at the output list file from the assembler to see whether
the two addresses are separated by a dword boundary or not. (A dword
boundary is an address divisible by 4.)

There are various ways to solve this problem:

1. Move the code sequence a little up or down in memory so that you get a
   dword boundary between the two addresses.

2. Change the short jump to a near jump (with 4 bytes displacement) so that
   the end of the instruction is moved further down. There is no way you
   can force the assembler to use anything but the shortest form of an
   instruction, so you have to hard-code the near branch if you choose this
   solution.

3. Put some instruction in between the CALL and the JNC instructions. This
   is the easiest method, and the only method if you don't know where the
   dword boundaries are because your segment is not dword aligned or
   because the code keeps moving up and down as you make changes in the
   preceding code:

        CALL    P
        MOV     EAX,EAX         ; two bytes filler to be safe
        JNC     SHORT L

If you want to avoid problems on the PPlain too, then put in two NOP's
instead to prevent pairing (see section 12.1.3 above).

The RET instruction is particularly prone to this problem because it is
only one byte long:

        JNZ     NEXT
        RET

Here you may need up to three bytes of fillers:

        JNZ     NEXT
        NOP
        MOV     EAX,EAX
        RET

12.2.15 Consecutive calls or returns (PMMX)
--------------------------------------------
There is a penalty when the first instruction pair following the target
label of a call contains another call instruction, or if a return follows
immediately after another return. Example:

FUNC1   PROC    NEAR
        NOP                     ; avoid call after call
        NOP
        CALL    FUNC2
        CALL    FUNC3
        NOP                     ; avoid return after return
        RET
FUNC1   ENDP

Two NOP's are required before CALL FUNC2 because a single NOP would pair
with the CALL. One NOP is enough before the RET because RET is unpairable.
No NOP's are required between the two CALL instructions because there is no
penalty for call after return. (On the PPlain you would need two NOP's here
too.)

The penalty for chained calls only occurs when the same subroutines are
called from more than one location (probably because the RSB needs
updating). Chained returns always have a penalty. There is sometimes a
small stall for a jump after a call, but no penalty for return after call;
call after return; jump, call, or return after jump; or jump after return.
12.2.16 Designing for branch predictability (PMMX, PPro and PII)
-----------------------------------------------------------------
Multiway branches (switch/case statements) are implemented either as an
indirect jump using a list of jump addresses, or as a tree of branch
instructions. Since indirect jumps are poorly predicted, the latter method
may be preferred if easily predicted patterns can be expected and you have
enough BTB entries.

You may want to reorganize your code so that branch patterns which are not
predicted perfectly can be replaced by other patterns which are. Consider,
for example, a loop which always executes 20 times. The conditional jump at
the bottom of the loop is taken 19 times and falls through every 20'th
time. This pattern is regular, but not recognized by the pattern
recognition mechanism. You may make two nested loops of four and five
iterations, or unroll the loop by four and let it execute 5 times, in order
to have only recognizable patterns. Such complicated schemes are only worth
the extra code on the PPro and PII processors, where mispredictions are
very expensive. For higher loop counts there is no reason to do anything
about the single misprediction.

12.3 Avoiding branches (all processors)
----------------------------------------
Sometimes it is possible to obtain the same effect as a branch by ingenious
manipulation of bits and flags. For example, you may calculate the absolute
value of a signed number without branching:

        CDQ
        XOR     EAX,EDX
        SUB     EAX,EDX

(On PPlain and PMMX, use MOV EDX,EAX / SAR EDX,31 instead of CDQ.)

The carry flag is particularly useful for this kind of trick:

Setting carry if a value is zero:              CMP [value],1
Setting carry if a value is not zero:          XOR EAX,EAX / CMP EAX,[value]
Incrementing a counter if carry:               ADC EAX,0
Setting a bit for each time the carry is set:  RCL EAX,1
Generating a bit mask if carry is set:         SBB EAX,EAX
Setting a bit on an arbitrary condition:       SETcond AL
Setting eight bits on an arbitrary condition:  SETNcond AL / DEC AL
(remember to reverse the condition in the last example)

This example finds the minimum of two unsigned numbers: if (b < a) a = b;

        SUB     EBX,EAX
        SBB     ECX,ECX
        AND     ECX,EBX
        ADD     EAX,ECX

This example chooses between two numbers: if (a != 0) a = b; else a = c;

        CMP     EAX,1
        SBB     EAX,EAX
        XOR     ECX,EBX
        AND     EAX,ECX
        XOR     EAX,EBX

Whether or not such tricks are worth the extra code depends on how
predictable a conditional jump would be, whether the extra pairing
opportunities of the branch-free code can be utilized, and whether there
are other jumps following immediately after which could suffer the
penalties of consecutive jumps.

The PPro and PII processors have conditional move instructions intended
specifically for avoiding branches. For code that will run only on these
processors, you should replace all poorly predictable branches with
conditional moves if possible.
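As a minimal sketch, the unsigned minimum above could be written with a
conditional move on the PPro and PII (the label M1 is only an example; the
assembler must accept PPro instructions):

; branch version: poorly predicted if the comparison result is random
        CMP     EAX,EBX
        JBE     SHORT M1
        MOV     EAX,EBX         ; a = b
M1:

; branch-free version using a conditional move (PPro and PII only)
        CMP     EAX,EBX
        CMOVA   EAX,EBX         ; EAX = min(EAX,EBX), unsigned

The conditional move version executes the same instructions regardless of
the outcome of the comparison, so there is no branch to mispredict.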