Extract from "HOW TO OPTIMIZE FOR THE PENTIUM PROCESSORS" ********************************************************* Copyright (c) 1996, 1997 by Agner Fog. Last modified 1998-03-05. 12. JUMPS AND BRANCHES ====================== The Pentium family of processors attempt to predict where a jump will go to, and whether a conditional jump will be taken or fall through. If the prediction is correct, then it can save a considerable amount of time by loading the subsequent instructions into the pipeline and start decoding them before the jump is executed. If the prediction turns out to be wrong, then the pipeline has to be flushed, which will cost a penalty depending on the length of the pipeline. The predictions are based on a Branch Target Buffer (BTB) which stores the history for each branch or jump instruction and makes predictions based on the prior history of executions of each instruction. The BTB is organized like a set-associative cache where new entries are allocated according to a pseudo-random replacement method. When optimizing code, it is important to minimize the number of misprediction penalties. This requires a good understanding of how the jump prediction works. The branch prediction mechanisms are not described adequately in Intel manuals or anywhere else. I am therefore giving a very detailed description here. This information is based on my own research (with the help of Karki Jitendra Bahadur for the PPlain). In the following, I will use the term 'control transfer instruction' for any instruction which can change the instruction pointer, including conditional and unconditional, direct and indirect, near and far, jumps, calls, and returns. All these instructions use prediction. 12.2 Branch prediction in PMMX, PPro and PII ============================================= 12.2.1 BTB organization (PMMX, PPro and PII) --------------------------------------------- The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16 sets. Each entry is identified by bits 2-31 of the address of the last byte of the control transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB as a tag. Control transfer instructions which are spaced 64 bytes apart have the same set-value and may therefore occasionally push each other out of the BTB. Since there are 16 ways per set, this won't happen too often. The branch target buffers of the PPro and PII have 512 entries. I have no information about how they are organized into sets. The PPro and PII allocate a BTB entry to any control transfer instruction the first time it is executed. The PMMX allocates it the first time it jumps. A branch instruction which never jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in the BTB, even if it never jumps again. An entry may be pushed out of the BTB when another control transfer instruction with the same set-value needs a BTB entry. 12.2.2 misprediction penalty (PMMX, PPro and PII) --------------------------------------------------- In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe, and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4 clocks. In the PPro and PII, the misprediction penalty is very high due to the long pipeline. A misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to be aware of poorly predictable branches when running on PPro or PII. 
12.2.3 Pattern recognition for conditional jumps (PMMX, PPro and PII)
----------------------------------------------------------------------
These processors have an advanced pattern recognition mechanism which will
correctly predict a branch instruction which, for example, is taken every
fourth time and falls through the other three times. In fact, they can
predict any repetitive pattern of jumps and nojumps with a period of up to
five, and many patterns with higher periods.

The mechanism is a so-called "two-level adaptive branch prediction scheme",
invented by T.-Y. Yeh and Y. N. Patt. It is based on the same kind of
two-bit counters as described above for the PPlain (but without the
asymmetry flaw). The counter is incremented when the jump is taken and
decremented when not taken. There is no wrap-around when counting up from 3
or down from 0. A branch instruction is predicted to be taken when the
corresponding counter is in state 2 or 3, and to fall through when in state
0 or 1.

An impressive improvement is now obtained by having sixteen such counters
for each BTB entry. The mechanism selects one of these sixteen counters
based on the history of the branch instruction for the last four
executions. If, for example, the branch instruction jumps once and then
falls through three times, then you have the history bits 1000 (1=jump,
0=nojump). This will make it use counter number 8 (1000 binary = 8) for
predicting the next time and update counter 8 afterwards.

If the sequence 1000 is always followed by a 1, then counter number 8 will
soon end up in its highest state (state 3) so that it will always predict a
1000 sequence to be followed by a 1. It will take two deviations from this
pattern to change the prediction. The repetitive pattern 100010001000 will
have counter 8 in state 3, and counters 1, 2 and 4 in state 0. The other
twelve counters will be unused.

12.2.4 Perfectly predicted patterns (PMMX, PPro and PII)
---------------------------------------------------------
Below is a list of repetitive branch patterns which are predicted perfectly.

Period  Patterns
-----------------------------------------------------------------------------
 1 - 5  all
     6  000011, 000101, 000111, 001011
     7  0000101, 0000111, 0001011
     8  00001011, 00001111, 00010011, 00010111, 00101101
     9  000010011, 000010111, 000100111, 000101101
    10  0000100111, 0000101101, 0000101111, 0000110111, 0001010011, 0001011101
    11  00001001111, 00001010011, 00001011101, 00010100111
    12  000010100111, 000010111101, 000011010111, 000100110111, 000100111011
    13  0000100110111, 0000100111011, 0000101001111
    14  00001001101111, 00001001111011, 00010011010111, 00010011101011,
        00010110011101, 00010110100111
    15  000010011010111, 000010011101011, 000010100110111, 000010100111011,
        000010110011101, 000010110100111, 000010111010011, 000011010010111
    16  0000100110101111, 0000100111101011, 0000101100111101, 0000101101001111
-----------------------------------------------------------------------------

When reading this table, you should be aware that if a pattern is predicted
correctly then the same pattern reversed (read backwards) is also predicted
correctly, as well as the same pattern with all bits inverted.

Example: In the table we find the pattern 0001011. Reversing this pattern
gives 1101000. Inverting all bits gives 1110100. Both reversing and
inverting gives 0010111. These four patterns are all recognizable. Rotating
the pattern one place to the left gives 0010110. This is of course not a
new pattern, only a phase shifted version of the same pattern.
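The counter selection described in section 12.2.3 can be illustrated with a
minimal sketch. This is only a model of the scheme for a single BTB entry,
not the actual hardware. It assumes that the last four branch outcomes are
kept in the low four bits of BL (1 = jump, 0 = nojump), that the sixteen
two-bit counters are in a byte table defined as COUNTERS DB 16 DUP (0), and
that the actual outcome of the current execution is in DL (0 or 1):

        MOVZX   ESI, BL
        AND     ESI, 0FH                    ; four history bits select a counter
        CMP     BYTE PTR [COUNTERS+ESI], 2
        SETAE   AL                          ; prediction: taken if state 2 or 3
        ; update the selected counter with the actual outcome,
        ; saturating at 0 and 3:
        TEST    DL, DL
        JZ      NOTTAKEN
        CMP     BYTE PTR [COUNTERS+ESI], 3
        JE      NEWHIST
        INC     BYTE PTR [COUNTERS+ESI]
        JMP     NEWHIST
NOTTAKEN:
        CMP     BYTE PTR [COUNTERS+ESI], 0
        JE      NEWHIST
        DEC     BYTE PTR [COUNTERS+ESI]
NEWHIST:
        SHL     BL, 1
        OR      BL, DL                      ; shift the newest outcome into the history

For the repetitive pattern 100010001000 mentioned above, this model leaves
counter 8 in state 3 and counters 1, 2 and 4 in state 0, in agreement with
the description.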
All patterns which can be derived from one of the patterns in the table by
reversing, inverting and rotating are also recognizable. For reasons of
brevity, these are not listed.

It takes two periods for the pattern recognition mechanism to learn a
regular repetitive pattern after the BTB entry has been allocated. The
pattern of mispredictions in the learning period is not reproducible. This
is probably because the BTB entry contained something prior to allocation.
Since BTB entries are allocated according to a random scheme, there is
little chance of predicting what happens during the initial learning period.

12.2.5 Handling deviations from a regular pattern (PMMX, PPro and PII)
-----------------------------------------------------------------------
The branch prediction mechanism is also extremely good at handling 'almost
regular' patterns, or deviations from the regular pattern. Not only does it
learn what the regular pattern looks like; it also learns what deviations
from the regular pattern look like. If deviations are always of the same
type, then it will remember what comes after the irregular event, and the
deviation will cost only one misprediction.

Example:

  0001110001110001110001011100011100011100010111000
                        ^                   ^

In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns
that the repeated sequence is 000111. The first irregularity is an
unexpected 0, which I have marked with a ^. After this 0 the next three
jumps may be mispredicted, because it hasn't learned what comes after 0010,
0101, and 1011. After one or two irregularities of the same kind it has
learned that after 0010 comes a 1, after 0101 comes a 1, and after 1011
comes a 1. This means that after at most two irregularities of the same
kind, it has learned to handle this kind of irregularity with only one
misprediction.

The prediction mechanism is also very effective when alternating between
two different regular patterns. If, for example, we have the pattern 000111
(with period 6) repeated many times, then the pattern 01 (period 2) many
times, and then return to the 000111 pattern, then the mechanism doesn't
have to relearn the 000111 pattern, because the counters used in the 000111
sequence have been left untouched during the 01 sequence. After a few
alternations between the two patterns, it has also learned to handle the
changes of pattern with only one misprediction for each time the pattern is
switched.

12.2.6 Patterns which are not predicted perfectly (PMMX, PPro and PII)
-----------------------------------------------------------------------
The simplest branch pattern which cannot be predicted perfectly is a branch
which is taken on every 6'th execution. The pattern is:

  000001000001000001
      ^^    ^^    ^^
      ab    ab    ab

The sequence 0000 is alternately followed by a 0, in the positions marked a
above, and by a 1, in the positions marked b. This affects counter number
0, which will count up and down all the time. If counter 0 happens to start
in state 0 or 1, then it will alternate between state 0 and 1. This will
lead to a misprediction in position b. If counter 0 happens to start in
state 3, then it will alternate between state 2 and 3, which will cause a
misprediction in position a. The worst case is when it starts in state 2.
It will alternate between state 1 and 2 with the unfortunate consequence
that we get a misprediction in both position a and position b. (This is
analogous to the worst case for the PPlain explained at the end of section
12.1.1 above.)
Which of these four situations we will get depends on the history of the
BTB entry prior to allocation to this branch. This is beyond our control
because of the random allocation method.

In principle, it is possible to avoid the worst case situation, where we
have two mispredictions per period, by giving it an initial branch sequence
which is specially designed for putting the counter in the desired state.
Such an approach cannot be recommended, however, because of the
considerable extra code complexity required, and because whatever
information we have put into the counter is likely to be lost during the
next timer interrupt or task switch.

12.2.7 Completely random patterns (PMMX, PPro and PII)
-------------------------------------------------------
The powerful capability of pattern recognition has a minor drawback in the
case of completely random sequences with no regularities. The following
table lists the experimental fraction of mispredictions for a completely
random sequence of jumps and nojumps:

  fraction of       fraction of
  jumps/nojumps     mispredictions
  ---------------------------------------
  0.001/0.999       0.001001
  0.01/0.99         0.0101
  0.05/0.95         0.0525
  0.10/0.90         0.110
  0.15/0.85         0.171
  0.20/0.80         0.235
  0.25/0.75         0.300
  0.30/0.70         0.362
  0.35/0.65         0.418
  0.40/0.60         0.462
  0.45/0.55         0.490
  0.50/0.50         0.500
  ---------------------------------------

The fraction of mispredictions is slightly higher than it would be without
pattern recognition because the processor keeps trying to find repeated
patterns in a sequence which has no regularities.

12.2.8 Tight loops (PMMX)
--------------------------
The branch prediction is not reliable in tiny loops where the pattern
recognition mechanism doesn't have time to update its data before the next
branch is met. This means that simple patterns, which would normally be
predicted perfectly, are not recognized. Incidentally, some patterns which
normally would not be recognized are predicted perfectly in tight loops.
For example, a loop which always repeats 6 times would have the branch
pattern 111110 for the branch instruction at the bottom of the loop. This
pattern would normally have one or two mispredictions per period, but in a
tight loop it has none. The same applies to a loop which repeats 7 times.
Most other repeat counts are predicted more poorly in tight loops than
normal. This means that a loop which iterates 6 or 7 times should
preferably be tight, whereas other loops should preferably not be tight.
You may unroll a loop if necessary to make it less tight (see the example
below).

To find out whether a loop will behave as 'tight' on the PMMX you may use
the following rule of thumb: count the number of instructions in the loop.
If the number is 6 or less, then the loop will behave as tight. If you have
more than 7 instructions, then you can be reasonably sure that the pattern
recognition functions normally. Strangely enough, it doesn't matter how
many clock cycles each instruction takes, whether it has stalls, or whether
it is paired or not. Complex integer instructions do not count. A loop can
have lots of complex integer instructions and still behave as a tight loop.
A complex integer instruction is a non-pairable integer instruction which
always takes more than one clock cycle. Complex floating point instructions
and MMX instructions still count as one. Note that this rule of thumb is
heuristic and not completely reliable. In important cases you may want to
do your own testing.
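As a sketch of the unrolling advice above, consider a simple loop summing
an array of dwords. The first version has only four instructions and will
behave as a tight loop on the PMMX; the second version, unrolled by four,
has eight instructions, so the pattern recognition can be expected to
function normally. The registers and labels are only an example; the
unrolled version assumes that ECX is divisible by 4 and that the two
partial sums in EAX and EDX are added together after the loop:

; tight version: 4 instructions in the loop body
L1:     ADD     EAX, [ESI]
        ADD     ESI, 4
        DEC     ECX
        JNZ     L1

; unrolled by four: 8 instructions, no longer 'tight'
; (ECX must be divisible by 4; add EAX and EDX after the loop)
L2:     ADD     EAX, [ESI]
        ADD     EDX, [ESI+4]
        ADD     ESI, 8
        ADD     EAX, [ESI]
        ADD     EDX, [ESI+4]
        ADD     ESI, 8
        SUB     ECX, 4
        JNZ     L2

Using two accumulators (EAX and EDX) avoids a dependency chain between
consecutive additions.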
You can use performance monitor counter number 35H for the PMMX (0C5H for
PPro and PII) to count branch mispredictions. Test results may not be
completely deterministic, because branch predictions may depend on the
history of the BTB entry prior to allocation.

I have no exact experimental information on tight loops in PPro and PII.

12.2.9 Indirect jumps and calls (PMMX, PPro and PII)
-----------------------------------------------------
There is no pattern recognition for indirect jumps and calls, and the BTB
can remember no more than one target for an indirect jump. It is simply
predicted to go to the same target as it did last time.

12.2.10 JECXZ and LOOP (PMMX)
------------------------------
There is no pattern recognition for these two instructions in the PMMX.
They are simply predicted to go the same way as the last time they were
executed. These two instructions should be avoided in time-critical code
for the PMMX.

12.2.11 Returns (PMMX, PPro and PII)
-------------------------------------
The PMMX, PPro and PII processors have a Return Stack Buffer (RSB) which is
used for predicting return instructions. The RSB works as a
first-in-last-out buffer. Each time a CALL instruction is executed, the
corresponding return address is pushed into the RSB, and each time a RET
instruction is executed, a return address is pulled out of the RSB and used
for prediction of the RET. This mechanism makes sure that return
instructions are correctly predicted when the same subroutine is called
from several different locations.

In order to make this mechanism work correctly, you must make sure that all
calls and returns are matched. Never jump out of a subroutine without a
return, and never use a return as an indirect jump if speed is critical
(see the example below).

The RSB can hold only four entries in the PMMX. In the case where the RSB
is empty, the return instruction is predicted in the same way as an
indirect jump, i.e. it is expected to go to the same target as it did last
time. When subroutines are nested deeper than four levels, the innermost
four levels use the RSB, whereas all subsequent returns from the outer
levels use the simpler prediction mechanism as long as there are no new
calls. A return instruction which uses the RSB still occupies a BTB entry.
Four entries in the RSB doesn't sound like much, but it is probably
sufficient. Subroutine nesting deeper than four levels is certainly not
unusual, but only the innermost levels matter in terms of speed. The RSB in
the PPro and PII can probably hold 12 entries.

12.2.12 Static prediction (PMMX)
---------------------------------
A control transfer instruction which has not been seen before or which is
not in the BTB is always predicted to fall through on the PMMX. It doesn't
matter whether it goes forwards or backwards.

A branch instruction will not get a BTB entry if it always falls through.
As soon as it is taken once, it will get into the BTB and stay there no
matter how many times it falls through. A control transfer instruction can
only go out of the BTB when it is pushed out by another control transfer
instruction which steals its BTB entry.

Any control transfer instruction which jumps to the address immediately
following itself will not get a BTB entry. Example:

        JMP     SHORT LL
LL:

This instruction will never get a BTB entry and will therefore always have
a misprediction penalty.
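As a sketch of the advice in section 12.2.11 about keeping calls and
returns matched, the old trick of using a return as an indirect jump should
be avoided on these processors (TARGET is just a hypothetical label):

; pattern to avoid: RET used as an indirect jump
        PUSH    OFFSET TARGET
        RET                     ; predicted from the RSB entry left by the
                                ; latest CALL, not TARGET, and that entry is
                                ; then missing when the subroutine's real RET
                                ; is executed

; preferred: keep the RSB in step by using a plain jump
        JMP     TARGET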
12.2.13 Static prediction (PPro and PII)
-----------------------------------------
On the PPro and PII, a control transfer instruction which has not been seen
before or which is not in the BTB is predicted to fall through if it goes
forwards, and to be taken if it goes backwards (e.g. a loop). Static
prediction takes longer than dynamic prediction on these processors.

12.2.14 Close jumps (PMMX)
---------------------------
On the PMMX, there is a risk that two control transfer instructions will
share the same BTB entry if they are too close to each other. The obvious
result is that they will always be mispredicted.

The BTB entry for a control transfer instruction is identified by bits 2-31
of the address of the last byte in the instruction. If two control transfer
instructions are so close together that they differ only in bits 0-1 of the
address, then we have the problem of a shared BTB entry. Example:

        CALL    P
        JNC     SHORT L

If the last byte of the CALL instruction and the last byte of the JNC
instruction lie within the same dword of memory, then we have the penalty.
You have to look at the output list file from the assembler to see whether
the two addresses are separated by a dword boundary or not. (A dword
boundary is an address divisible by 4.)

There are various ways to solve this problem:

1. Move the code sequence a little up or down in memory so that you get a
   dword boundary between the two addresses.

2. Change the short jump to a near jump (with 4 bytes displacement) so that
   the end of the instruction is moved further down. There is no way you
   can force the assembler to use anything but the shortest form of an
   instruction, so you have to hard-code the near branch if you choose this
   solution.

3. Put some instruction in between the CALL and the JNC instructions. This
   is the easiest method, and the only method if you don't know where the
   dword boundaries are because your segment is not dword aligned or
   because the code keeps moving up and down as you make changes in the
   preceding code:

        CALL    P
        MOV     EAX,EAX         ; two bytes filler to be safe
        JNC     SHORT L

If you want to avoid problems on the PPlain too, then put in two NOP's
instead to prevent pairing (see section 12.1.3 above).

The RET instruction is particularly prone to this problem because it is
only one byte long:

        JNZ     NEXT
        RET

Here you may need up to three bytes of fillers:

        JNZ     NEXT
        NOP
        MOV     EAX,EAX
        RET

12.2.15 Consecutive calls or returns (PMMX)
--------------------------------------------
There is a penalty when the first instruction pair following the target
label of a call contains another call instruction, or if a return follows
immediately after another return. Example:

FUNC1   PROC    NEAR
        NOP                     ; avoid call after call
        NOP
        CALL    FUNC2
        CALL    FUNC3
        NOP                     ; avoid return after return
        RET
FUNC1   ENDP

Two NOP's are required before CALL FUNC2 because a single NOP would pair
with the CALL. One NOP is enough before the RET because RET is unpairable.
No NOP's are required between the two CALL instructions because there is no
penalty for call after return. (On the PPlain you would need two NOP's here
too.)

The penalty for chained calls only occurs when the same subroutines are
called from more than one location (probably because the RSB needs
updating). Chained returns always have a penalty. There is sometimes a
small stall for a jump after a call, but no penalty for return after call;
call after return; jump, call, or return after jump; or jump after return.
12.2.16 Designing for branch predictability (PMMX, PPro and PII)
-----------------------------------------------------------------
Multiway branches (switch/case statements) are implemented either as an
indirect jump using a list of jump addresses, or as a tree of branch
instructions. Since indirect jumps are poorly predicted, the latter method
may be preferred if easily predicted patterns can be expected and you have
enough BTB entries.

You may want to reorganize your code so that branch patterns which are not
predicted perfectly can be replaced by other patterns which are. Consider,
for example, a loop which always executes 20 times. The conditional jump at
the bottom of the loop is taken 19 times and falls through every 20'th
time. This pattern is regular, but not recognized by the pattern
recognition mechanism. You may make two nested loops of four and five
iterations, or unroll the loop by four and let it execute 5 times, in order
to have only recognizable patterns. Such complicated schemes are only worth
the extra code on the PPro and PII processors, where mispredictions are
very expensive. For higher loop counts there is no reason to do anything
about the single misprediction.

12.3 Avoiding branches (all processors)
----------------------------------------
Sometimes it is possible to obtain the same effect as a branch by ingenious
manipulation of bits and flags. For example, you may calculate the absolute
value of a signed number without branching:

        CDQ
        XOR     EAX,EDX
        SUB     EAX,EDX

(On PPlain and PMMX, use MOV EDX,EAX / SAR EDX,31 instead of CDQ.)

The carry flag is particularly useful for this kind of trick:

Setting carry if a value is zero:              CMP [value],1
Setting carry if a value is not zero:          XOR EAX,EAX / CMP EAX,[value]
Incrementing a counter if carry:               ADC EAX,0
Setting a bit for each time the carry is set:  RCL EAX,1
Generating a bit mask if carry is set:         SBB EAX,EAX
Setting a bit on an arbitrary condition:       SETcond AL
Setting eight bits on an arbitrary condition:  SETNcond AL / DEC AL
(remember to reverse the condition in the last example)

This example finds the minimum of two unsigned numbers: if (b < a) a = b;

        SUB     EBX,EAX
        SBB     ECX,ECX
        AND     ECX,EBX
        ADD     EAX,ECX

This example chooses between two numbers: if (a != 0) a = b; else a = c;

        CMP     EAX,1
        SBB     EAX,EAX
        XOR     ECX,EBX
        AND     EAX,ECX
        XOR     EAX,EBX

Whether or not such tricks are worth the extra code depends on how
predictable a conditional jump would be, whether the extra pairing
opportunities of the branch-free code can be utilized, and whether there
are other jumps following immediately after which could suffer the
penalties of consecutive jumps.

The PPro and PII processors have conditional move instructions intended
specifically for avoiding branches. For code that will run only on these
processors, you should replace all poorly predictable branches with
conditional moves if possible.
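As a minimal sketch, the unsigned minimum above could be written with a
conditional move on the PPro and PII (the label M1 is only an example; the
assembler must accept PPro instructions):

; branch version: poorly predicted if the comparison result is random
        CMP     EAX,EBX
        JBE     SHORT M1
        MOV     EAX,EBX         ; a = b
M1:

; branch-free version using a conditional move (PPro and PII only)
        CMP     EAX,EBX
        CMOVA   EAX,EBX         ; EAX = min(EAX,EBX), unsigned

The conditional move version executes the same instructions regardless of
the outcome of the comparison, so there is no branch to mispredict.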