HOW TO OPTIMIZE FOR THE PENTIUM PROCESSORS
******************************************
Copyright (c) 1996, 1997 by Agner Fog. Last modified 1998-03-05.
(This document should be printed with a fixed-width font, such as courier)


Contents:
*********

1.  introduction
2.  literature
3.  debugging and verifying
4.  memory model
5.  alignment
6.  cache
7.  address generation interlock (PPlain and PMMX)
8.  pairing integer instructions (PPlain and PMMX)
9.  first time vs. repeated execution
10. imperfect pairing (PPlain and PMMX)
11. splitting complex instructions into simpler ones (PPlain and PMMX)
12. jumps and branches
13. prefixes
14. reducing code size
15. scheduling floating point code (PPlain and PMMX)
16. loops (PPlain and PMMX)
17. discussion of special instructions
18. using integer instructions to do floating point operations
19. using floating point instructions to do integer operations
20. list of integer instructions (PPlain and PMMX)
21. list of floating point instructions (PPlain and PMMX)
22. MMX instructions (PMMX)
23. testing code speed
24. Pentium Pro and Pentium II processors
25. Comparison of the different microprocessors


1. INTRODUCTION
===============
This manual describes in detail how to write optimized assembly language
code, with particular focus on the Pentium® family of microprocessors.

This manual is more comprehensive and exact than other sources of
information and contains many details not found anywhere else.
This information will enable you in many cases to calculate exactly how
many clock cycles a piece of code will take.

The following versions of Pentium processors are discussed in this manual:

abbreviation:   name:
------------------------------------------------
PPlain          plain old Pentium (without MMX)
PMMX            Pentium with MMX
PPro            Pentium Pro
PII             Pentium II
------------------------------------------------

The information in this manual is based on my own research and testing,
supplemented with information I have received from various people. I
want to thank those who have sent me additional information for this
manual. My research has so far only covered the PPlain and PMMX. the
PPro and PII are only described briefly.

Programming in assembly language is much more difficult than high level
language. Making bugs is very easy, and finding them is very difficult.
Now you have been warned! It is assumed that the reader is already
experienced in assembly programming. If not, then please read some books
on the subject and get some programming experience before you begin to do
complicated optimations.

The hardware design of the PPlain and PMMX chips has many features which
are optimized specifically for some commonly used instructions or
instruction combinations, rather than using general optimation methods.
Consequently, the rules for optimizing software for this design are quite
complicated and have many exceptions, but the possible gain in performance
may be substantial.

Before you start to convert your code to assembly, make sure that your
algorithm is optimal. Often you can improve a piece of code much more by
improving the algorithm than by converting it to assembly code.

Next, you have to identify the critical part of your program. Often more
than 99% of the CPU time is used in the innermost loop of a program. In
this case you should optimize only that loop and leave everything else
in high level language. Some assembly programmers waste a lot of energy
optimizing the wrong parts of their programs, the only significant effect
of their effort being that the programs become more difficult to debug
and maintain!

If it is not obvious where the critical parts of your program are then
you may use a profiler to find them. If it turns out that the bottleneck
is disk access, then you may modify your program to make disk access
sequential, in order to improve disk caching, rather than turning to
assembly programming. If the bottleneck is graphics output then you may
look for a way of reducing the number of calls to graphic procedures.

Some high level language compilers offer relatively good optimation for
the Pentium, and further optimation by hand may not be worth the effort
except for the most critical pieces of code - or for the sport of it.

Please don't send your programming questions to me. I am not gonna do
your homework for you!

Good luck with your hunt for nanoseconds!

PS. I have deliberately misspelled 'optimization' because I think
'optimation' is more optimal to pronounce.


2. LITERATURE
=============
A lot of useful literature can be downloaded for free from Intel's www
site or acquired in print or on CD-ROM.

I will not give the URL's here because they change very often.
You can find the documents you need by using the search facilities at:
http://www.intel.com/sites/developer/search.htm or follow the links from
http://announce.com/agner/assem

Some documents are in .PDF format. If you don't have software for viewing
or printing .PDF files, then you may download the Acrobat file reader from
www.adobe.com

The use of MMX instructions for optimizing specific applications are
described in several application notes. The MMX instruction set is
described in various manuals and tutorials.

A lot of other sources than Intel also have useful information. These
sources are listed in the FAQ for the newsgroup comp.lang.asm.x86.
The shareware editor ASMEDIT has an online help which covers all
instruction codes etc. It is available from:
http://www.inf.tu-dresden.de/~ok3/asmedit.html

For other internet ressources follow the links from
http://announce.com/agner/assem/


3. DEBUGGING AND VERIFYING
==========================
Debugging assembly code can be quite hard and frustrating, as you probably
already have discovered. I would recommend that you start with writing the
piece of code you want to optimize as a subroutine in a high level language.
Next, write a test program that will test your subroutine thoroughly. Make
sure the test program goes into all branches and special cases.

When your high level language subroutine works with your test program, then
you are ready to translate the code to assembly language. You can either
use inline assembly or code the subroutine entirely in assembly language
and link it into your project. If you choose the latter option, then it
is recommended that you use a compiler which is capable of translating
high level code directly to assembly. This assures that you get the
function calling method right. Most C++ compilers can do this.

The methods for function calling and name mangling can be quite
complicated. There are many different calling conventions, and the
different brands of compilers are not compatible in this respect.
If you are calling assembly language subroutines from C++, then the
best method in terms of consistency and compatibility is to declare
your function  extern"C"  and  _cdecl. The assembly code must then have
the function name prefixed with an underscore (_) and be assembled with
case sensitivity on externals (option -mx). Parameters are pushed on the
stack in reverse order, so that the first parameters come at the lowest
address. Parameters are removed from the stack by the caller, not by
the called function. The return value is in register eax or st(0).
The function must save all used registers and restore them before
returning, except for eax, ecx and edx (in 16 bit mode: ax, bx, cx, dx, es).
If you need name mangling or another calling convention, then you have
to use a compiler which can translate high level code to assembly, in
order to get the assembly code right.

Now you can start to optimize. Each time you have made a modification you
should run it on the test program to see if it works correctly.

Number all your versions and save them so that you can go back and test
them again in case you discover an error which the test program didn't
catch (such as writing to a wrong address).

Test the speed of the most critical part of your program with the method
described in chapter 23. If the code is significantly slower than expected,
then the most probable reasons are: cache misses (chapter 6), misaligned
operands (chapter 5), first time penalty (chapter 9), and branch
mispredictions (chapter 12).


4. MEMORY MODEL
===============
The Pentiums are designed primarily for 32 bit code, and the performance
is inferior on 16 bit code. Segmenting your code and data also degrades
performance significantly, so you should generally prefer 32 bit flat mode,
and an operating system which supports this mode (e.g. Windows 95, OS/2, or
a 32 bit DOS extender). The code examples shown in this manual assume a 32
bit flat memory model, unless otherwise specified.


5. ALIGNMENT
============
All data in RAM should be aligned to addresses divisible by 2, 4, or 8
according to this scheme:

                     alignment        alignment
operand size      PPlain and PMMX    PPro and PII
--------------------------------------------------
1  (byte)              1                   1
2  (word)         addr MOD 4 < 3           2
4  (dword)             4                   4
6  (fword)             4                   8
8  (qword)             8                   8
10 (tbyte)             8                  16
--------------------------------------------------

Misaligned data will take at least 3 clock cycles extra to access. The
penalty is higher when a cache line boundary is crossed.

Aligning code is not necessary on the PPlain and PMMX, but for optimal
performance on PPro and PII you may align subroutine entries and important
loop entries by 16.


6. CACHE  (all processors)
==========================
The PPlain and PPro have 8 kb of on-chip cache (level one cache) for
code, and 8 kb for data. The PMMX and PII have 16 kb for code and 16 kb
for data. Data in the level 1 cache can be read or written to in just one
clock cycle, whereas a cache miss may cost many clock cycles. It is
therefore important that you understand how the cache works in order to
use it most efficiently.

The data cache consists of 256 or 512 lines of 32 bytes each. Each time
you read a data item which is not cached, the processor will read an
entire cache line from memory. The cache lines are always aligned to a
physical address divisible by 32. When you have read a byte at an address
divisible by 32, then the next 31 bytes can be read or written to at
almost no extra cost. You can take advantage of this by arranging data
items which are used near each other together into aligned blocks of 32
bytes of memory. If, for example, you have a loop which accesses two
arrays, then you may interleave the two arrays into one array of
structures, so that data which are used together are also stored together.

If the size of an array or other data structure is a multiple of 32 bytes,
then you should preferably align it by 32.

The cache is set-associative. This means that a cache line can not be
assigned to an arbitrary memory address. Each cache line has a 7-bit
set-value which must match bits 5 through 11 of the physical RAM address
(bit 0-4 define the 32 bytes within a cache line). The PPlain and PPro
have two cache lines for each of the 128 set-values, so there are two
possible cache lines to assign to any RAM address. The PMMX and PII
versions have four.

The consequence of this is that the cache can hold no more than two or
four different data blocks which have the same value in bits 5-11 of the
address. You can determine if two addresses have the same set-value by
the following method: Strip off the lower 5 bits of each address to get
a value divisible by 32. If the difference between the two truncated
addresses is a multiple of 4096 (=1000H), then the addresses have the
same set-value.

Let me illustrate this by the following piece of code, where ESI holds an
address divisible by 32:

AGAIN:  MOV  EAX, [ESI]
        MOV  EBX, [ESI + 13*4096 +  4]
        MOV  ECX, [ESI + 20*4096 + 28]
        DEC  EDX
        JNZ  AGAIN

The three addresses used here all have the same set-value because the
differences between the truncated addresses are multipla of 4096. This
loop will perform very poorly on the PPlain and PPro. At the time
you read ECX there is no free cache line with the proper set-value so
the processor takes the least recently used of the two possible cache
lines, that is the one which was used for EAX, and fills it with the
data from [ESI + 20*4096] to [ESI + 20*4096 + 31] and reads ECX.  Next,
when reading EAX, you find that the cache line that held the value for
EAX has now been discarded, so you take the least recently used line,
which is the one holding the EBX value, and so on..  You have nothing
but cache misses and the loop takes something like 60 clock cycles.
If the third line is changed to:

        MOV  ECX, [ESI + 20*4096 + 32]

then we have crossed a 32 byte boundary, so that we do not have the same
set-value as in the first two lines, and there will be no problem
assigning a cache line to each of the three addresses. The loop now
takes only 3 clock cycles (except for the first time) - a very
considerable improvement!

It may be very difficult to determine if your data addresses have the
same set-values, especially if they are scattered around in different
segments. The best thing you can do to avoid problems of this kind is
to keep all data used in the critical part or your program within one
contiguous block not bigger than the cache, or two contiguous blocks
no bigger than half that size (for example one block for static data
and another block for data on the stack). This will make sure that
your cache lines are used optimally.

If the critical part of your code accesses big data structures or random
data addresses, then you may want to keep all frequently used variables
(counters, pointers, control variables, etc.) within a single contiguous
block of max 4 kbytes so that you have a complete set of cache lines free
for accessing random data. Since you probably need stack space anyway for
subroutine parameters and return addresses, the best thing is to copy all
frequently used static data to dynamic variables on the stack, and copy
them back again outside the critical loop if they have been changed.

Reading a data item which is not in the level one cache causes an entire
cache line to be filled from the level two cache, which takes approximately
200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz
system), but the bytes you ask for first are available already after 50-100
ns. If the data item is not in the level two cache either, then you will
get a delay of something like 200-300 ns. This delay will be somewhat longer
if you cross a DRAM page boundary. (The size of a DRAM page is 1 kb for
4 and 8 MB 72 pin RAM modules, and 2 kb for 16 and 32 MB modules).

When you write to an address which is not in the level 1 cache, then the
value will go right through to the level 2 cache or to the RAM (depending
on how the level 2 cache is set up) on the PPlain and PMMX. This takes
approximately 100 ns. If you write eight or more times to the same 32
byte block of memory without also reading from it, and the block is not
in the level one cache, then it may be advantageous to make a dummy read
from the block first to load it into a cache line. All subsequent writes
to the same block will then go to the cache instead, which takes only one
clock cycle. This is not needed on a PPro or PII, which always load a
cache line on write misses. There is sometimes a small penalty for writing
repeatedly to the same address without reading in between on PPlain and
PMMX.

Another way to reduce the number of write misses is to write 8 bytes at a
time using  FILD / FISTP  with qword operands rather than writing 4 bytes
at a time using integer registers. The extra cost of the slow FILD and
FISTP instructions is compensated for by the fact that you only have to do
half as many writes. However, this method is only advantageous on a PPlain
and PMMX, and only if the destination is not in the level 1 cache. The
method is explained in section 19 below. If you have the PMMX or PII,
then you may of course use MMX instructions rather than floating point
instructions to move 8 bytes at a time.

The PPlain has two write buffers, PMMX has four, and PPro has eight.
On the PMMX you may have up to four unfinished writes to uncached memory
without delaying the subsequent instructions.

Temporary data may conveniently be stored on the stack because the stack
area is very likely to be in the cache. You should be aware, however,
that if you store QWORD data on a DWORD size stack, or DWORD data on a
WORD size stack, then you have a potential alignment problem.

If the life ranges of two data structures do not overlap, then they may
use the same RAM area to increase cache efficiency. This is consistent
with the common practice of allocating space for temporary variables on
the stack.

Storing temporary data in registers is of course even more efficient.
Since registers is a scarce ressource you may want to use [ESP] rather
than [EBP] for addressing data on the stack, in order to free EBP for
other purposes. Just don't forget that the value of ESP changes every
time you do a PUSH or POP. (You cannot use ESP under 16-bit Windows
because the timer interrupt will modify the high word of ESP at
unpredictable places in your code.)

There is a separate cache for code, which is similar to the data cache.
The size of the code cache is 8 kb on PPlain and PPro and 16 kb on the
PMMX and PII. It is important that the critical part of your code
(the innermost loops) fit in the code cache. Frequently used pieces of
code or routines which are used together should preferable be stored near
each other. Seldom used branches or procedures should be put away in the
bottom of your code or somewhere else.


7. ADDRESS GENERATION INTERLOCK (AGI)  (PPlain and PMMX)
========================================================
It takes one clock cycle to calculate the address needed by an instruction
which accesses memory. Normally, this calculation is done at a separate
stage in the pipeline while the preceding instruction or instruction pair is
executing. But if the address depends on the result of an instruction
executing in the preceding clock cycle, then you have to wait an extra clock
cycle for the address to be calculated. This is called an AGI stall.

Example:
ADD EBX,4 / MOV EAX,[EBX]    ; AGI stall
The stall in this example can be removed by putting some other instructions
in between  ADD EBX,4  and  MOV EAX,[EBX]  or by rewriting the code to:
MOV EAX,[EBX+4] / ADD EBX,4

You can also get an AGI stall with instructions which use ESP implicitly
for addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed
in the preceding clock cycle by instructions such as MOV, ADD, or SUB.
The PPlain and PMMX have special circuitry to predict the value of ESP
after a stack operation so that you do not get an AGI delay after changing
ESP with PUSH, POP, or CALL. You can get an AGI stall after RET only if it
has an immediate operand to add to ESP.

Examples:
ADD ESP,4 / POP ESI            ; AGI stall
POP EAX   / POP ESI            ; no stall, pair
MOV ESP,EBP / RET              ; AGI stall
CALL L1 / L1: MOV EAX,[ESP+8]  ; no stall
RET / POP EAX                  ; no stall
RET 8 / POP EAX                ; AGI stall

The LEA instruction is also subject to an AGI stall if it uses a base or
index register which has been changed in the preceding clock cycle. Example:
INC ESI / LEA EAX,[EBX+4*ESI]  ; AGI stall

PPro and PII have no AGI stalls.


8. PAIRING INTEGER INSTRUCTIONS (PPlain and PMMX)
=================================================
The PPlain and PMMX have two pipelines for executing instructions, called
the U-pipe and the V-pipe. Under certain conditions it is possible to
execute two instructions simultaneously, one in the U-pipe and one in the
V-pipe. This can almost double the speed. It is therefore advantageous to
reorder your instructions to make them pair.

The following instructions are pairable in either pipe:
  MOV register, memory, or immediate into register or memory
  PUSH register or immediate
  POP register
  LEA, NOP
  INC, DEC, ADD, SUB, CMP, AND, OR, XOR,
  and some forms of TEST (see section 17.3)

The following instructions are pairable in the U-pipe only:
ADC, SBB
SHR, SAR, SHL, SAL with immediate count
ROR, ROL, RCR, RCL with an immediate count of 1

The following instructions can execute in either pipe but are only pairable
when in the V-pipe: near call, short and near jump, short and near
conditional jump.

All other integer instructions can execute in the U-pipe only, and are not
pairable.

Two consecutive instructions will pair when the following conditions are met:

8.1 The first instruction is pairable in the U-pipe and the second
    instruction is pairable in the V-pipe.

8.2 The second instruction cannot read or write a register which the first
    instruction writes to. Examples:
    MOV EAX, EBX / MOV ECX, EAX     ; read after write, do not pair
    MOV EAX, 1   / MOV EAX, 2       ; write after write, do not pair
    MOV EBX, EAX / MOV EAX, 2       ; write after read, pair OK
    MOV EBX, EAX / MOV ECX, EAX     ; read after read, pair OK
    MOV EBX, EAX / INC EAX          ; read and write after read, pair OK

8.3 In rule 8.2 partial registers are treated as full registers. Example:
    MOV AL, BL  /  MOV AH, 0        ; writes to different parts of the same
                                    ; register, do not pair

8.4 Two instructions which both write to parts of the flags register can
    pair despite rule 8.2 and 8.3.  Example:
    SHR EAX, 4 / INC EBX            ; pair OK

8.5 An instruction which writes to the flags can pair with a conditional
    jump despite rule 8.2.  Example:
    CMP EAX, 2 / JA LabelBigger     ; pair OK

8.6 The following instruction combinations can pair despite the fact that
    they both modify the stack pointer:
    PUSH + PUSH,  PUSH + CALL,  POP + POP

8.7 There are restrictions on the pairing of instructions with prefix.
    There are several types of prefixes:
    - instructions addressing a non-default segment have a segment prefix.
    - instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit
      mode have an operand size prefix.
    - instructions using 32 bit base or index registers in 16 bit mode have
      an address size prefix.
    - repeated string instructions have a repeat prefix.
    - locked instructions have a LOCK prefix.
    - many instructions which were not implemented on the 8086 processor
      have a two byte opcode where the first byte is 0FH. The 0FH byte
      behaves as a prefix on the PPlain, but not on the other versions.
      The most common instructions with 0FH prefix are: MOVZX,
      MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT,
      BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, and IMUL with two operands and
      no immediate operand.

    On the PPlain, a prefixed instruction can only execute in the U-pipe,
    except for conditional near jumps.

    On the PMMX, instructions with operand size, address size, or 0FH
    prefix can execute in either pipe, whereas instructions with
    segment, repeat, or lock prefix can only execute in the U-pipe.

8.8 An instruction which has both a displacement and immediate data is not
    pairable on the PPlain and only pairable in the U-pipe on the PMMX:
    MOV DWORD PTR DS:[1000], 0    ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX+8], 1       ; not pairable or only in U-pipe
    CMP BYTE PTR [EBX], 1         ; pairable
    CMP BYTE PTR [EBX+8], AL      ; pairable
    (Another problem with instructions which have both a displacement and
    immediate data on the PMMX is that such instructions may be longer
    than 7 bytes, which means that only one instruction can be decoded
    per clock cycle, as explained in section 13.)

8.9 Both instructions must be preloaded and decoded. This is explained in
    section 9.

8.10 There are special pairing rules for MMX instructions on the PMMX:
    - MMX shift, pack or unpack instructions can execute in either pipe
      but cannot pair with other MMX shift, pack or unpack instructions.
    - MMX multiply instructions can execute in either pipe but cannot pair
      with other MMX multiply instructions. They take 3 clock cycles and
      the last 2 clock cycles can overlap with subsequent instructions
      in the same way as floating point instructions can (see chapter 15).
    - an MMX instruction which accesses memory or integer registers can
      execute only in the U-pipe and cannot pair with a non-MMX instruction.


9. FIRST TIME VERSUS REPEATED EXECUTION  (all processors)
=========================================================
A piece of code usually takes much more time the first time it is executed
than when it is repeated. The reasons are the following:

9.1 Loading the code from RAM into the cache takes longer time than
    executing it.

9.2 In some cases decoding the code is the bottleneck. If it takes one clock
    cycle to determine the length of an instruction, then it is not possible
    to decode two instructions per clock cycle, because the processor
    doesn't know where the second instruction begins. The Pentium solves
    this problem by remembering the length of any instruction which has
    remained in the cache since last time it was executed. As a consequence
    of this, a set of instructions will not pair in the PPlain and PMMX
    the first time they are executed, unless the first of the two
    instructions is only one byte long.

9.3 Jump instructions will not be in the branch target buffer the first
    time they execute, and therefore are less likely to be predicted
    correctly. See chapter 12.

9.4 Any data accessed by the code has to be loaded into the cache, which
    may take much more time than executing the instructions. When the code is
    repeated, then the data are more likely to be in the cache.

For these four reasons, a piece of code inside a loop will generally take
much more time the first time it executes than the subsequent times.

If you have a big loop which doesn't fit into the code cache then you will
get the penalty of 9.1 and 9.2 all the time because it doesn't run from the
cache. You should therefore try to reorganize the loop to make it fit into
the cache.

If you have many jumps and branches inside a loop, then you may get the
penalty in 9.3 all the time.

Likewise, if a loop repeatedly accesses a data structure too big for
the data cache, then you will get the penalty of data cache misses all the
time.


10. IMPERFECT PAIRING (PPlain and PMMX)
=======================================
There are situations where the two instructions in a pair will not execute
simultaneously, or only partially overlap in time. They should still be
considered a pair, though, because the first instruction executes in the
U-pipe, and the second in the V-pipe. No subsequent instruction can start
to execute before both instructions in the imperfect pair have finished.

Imperfect pairing will happen in the following cases:

10.1 If the second instructions suffers an AGI stall (see chapter 7).

10.2 Two instructions cannot access the same dword of memory simultaneously.
     The following examples assume that ESI is divisible by 4:
     MOV AL, [ESI] / MOV BL, [ESI+1]
     The two operands are within the same dword, so they cannot execute
     simultaneously. The pair takes 2 clock cycles.
     MOV AL, [ESI+3] / MOV BL, [ESI+4]
     Here the two operands are on each side of a dword boundary, so they
     pair perfectly, and take only one clock cycle.

10.3 Rule 10.2 is extended to the case where bit 2-4 is the same in the two
     addresses (cache bank conflict). For dword addresses this means that
     the difference between the two addresses should not be divisible by 32.
     Examples:
     MOV [ESI], EAX / MOV [ESI+32000], EBX ;  imperfect pairing
     MOV [ESI], EAX / MOV [ESI+32004], EBX ;  perfect pairing

Pairable integer instructions which do not access memory take one clock
cycle to execute, except for mispredicted jumps. MOV instructions to or
from memory also take only one clock cycle if the data area is in the
cache and properly aligned. There is no speed penalty for using complex
addressing modes such as scaled index registers.

A pairable integer instruction which reads from memory, does some
calculation, and stores the result in a register or flags, takes 2 clock
cycles. (read/modify instructions).

A pairable integer instruction which reads from memory, does some
calculation, and writes the result back to the memory, takes 3 clock
cycles. (read/modify/write instructions).

10.4 If a read/modify/write instruction is paired with a read/modify or
     read/modify/write instruction, then they will pair imperfectly.

The number of clock cycles used is given in the following table:

                      |             Second instruction
                      | MOV or             read/       read/modify/
First  instruction    | register only      modify      write
----------------------|----------------------------------------------
MOV or register only  |      1               2              3
read/modify           |      2               2              3
read/modify/write     |      3               4              5
----------------------|-----------------------------------------------

Example:
ADD [mem1], EAX / ADD EBX, [mem2]  ; 4 clock cycles
ADD EBX, [mem2] / ADD [mem1], EAX  ; 3 clock cycles

10.5 When two paired instructions both take extra time due to cache
     misses, misalignment, or jump misprediction, then the pair will take
     more time than each instruction, but less than the sum of the two.

10.6 A pairable floating point instruction followed by FXCH will make
     imperfect pairing if the next instruction is not a floating point
     instruction.

In order to avoid imperfect pairing you have to know which instructions go
into the U-pipe, and which to the V-pipe. You can find out this by looking
backwards in your code and search for instructions which are unpairable,
pairable only in one of the pipes, or cannot pair due to one of the rules
in chapter 8.

Imperfect pairing can often be avoided by reordering your instructions.
Example:

L1:     MOV     EAX,[ESI]
        MOV     EBX,[ESI]
        INC     ECX

Here the two MOV instructions form an imperfect pair because they both
access the same memory location, and the sequence takes 3 clock cycles.
You can improve the code by reordering the instructions so that  INC ECX
pairs with one of the MOV instructions.

L2:     MOV     EAX,OFFSET A
        XOR     EBX,EBX
        INC     EBX
        MOV     ECX,[EAX]
        JMP     L1

The pair  INC EBX / MOV ECX,[EAX]  is imperfect because the latter
instruction has an AGI stall. The sequence takes 4 clocks. If you insert
a NOP or any other instruction so that  MOV ECX,[EAX] pairs with  JMP L1
instead, then the sequence takes only 3 clocks.

10.7
----
The next example is in 16 bit mode, assuming that SP is divisible by 4:

L3:     PUSH    AX
        PUSH    BX
        PUSH    CX
        PUSH    DX
        CALL    FUNC

Here the PUSH instructions form two imperfect pairs, because both operands
in each pair go into the same dword of memory. PUSH BX could possibly pair
perfectly with PUSH CX (because they go on each side of a dword boundary)
but it doesn't because it has already been paired with PUSH AX. The sequence
therefore takes 5 clocks. If you insert a NOP or any other instruction so
that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the
sequence will take only 3 clocks. Another way to solve the problem is to
make sure that SP is not divisible by 4. Knowing whether SP is divisible by
4 or not in 16 bit mode can be difficult, so the best way to avoid this
problem is to use 32 bit mode.


11. SPLITTING COMPLEX INSTRUCTIONS INTO SIMPLER ONES (PPlain and PMMX)
======================================================================
You may split up read/modify and read/modify/write instructions to improve
pairing. Example:
ADD [mem1],EAX / ADD [mem2],EBX    ; 5 clock cycles
This code may be split up into a sequence which takes only 3 clock cycles:
MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
MOV [mem1],ECX / MOV [mem2],EDX

Likewise you may split up non-pairable instructions into pairable
instructions:
PUSH [mem1] / PUSH [mem2]  ; non-pairable
Split up into:
MOV EAX,[mem1] / MOV EBX,[mem2] / PUSH EAX / PUSH EBX  ; everything pairs

Other examples of non-pairable instructions which may be split up into 
simpler pairable instructions:
CDQ  split into:  MOV EDX,EAX / SAR EDX,31
NOT EAX  change to  XOR EAX,-1
NEG EAX  split into  XOR EAX,-1 / INC EAX
MOVZX EAX,BYTE PTR [mem]  split into  XOR EAX,EAX / MOV AL,BYTE PTR [mem]
JECXZ  split into  TEST ECX,ECX / JZ
LOOP   split into  DEC ECX / JNZ
XLAT   change to   MOV AL,[EBX+EAX]

If splitting instructions doesn't improve speed, then you may keep the
complex or nonpairable instructions in order to reduce code size.
Splitting instructions is not needed on the PPro and PII.


12. JUMPS AND BRANCHES (all processors)
=======================================
The Pentium family of processors attempt to predict where a jump will go
to, and whether a conditional jump will be taken or fall through. If the
prediction is correct, then it can save a considerable amount of time by
loading the subsequent instructions into the pipeline and start decoding
them before the jump is executed. If the prediction turns out to be wrong,
then the pipeline has to be flushed, which will cost a penalty depending
on the length of the pipeline.

The predictions are based on a Branch Target Buffer (BTB) which stores
the history for each branch or jump instruction and makes predictions
based on the prior history of executions of each instruction. The BTB
is organized like a set-associative cache where new entries are allocated
according to a pseudo-random replacement method.

When optimizing code, it is important to minimize the number of
misprediction penalties. This requires a good understanding of how the
jump prediction works.

The branch prediction mechanisms are not described adequately in Intel
manuals or anywhere else. I am therefore giving a very detailed
description here. This information is based on my own research (with the
help of Karki Jitendra Bahadur for the PPlain).

In the following, I will use the term 'control transfer instruction' for
any instruction which can change the instruction pointer, including
conditional and unconditional, direct and indirect, near and far, jumps,
calls, and returns. All these instructions use prediction.


12.1.1 Branch prediction in the PPlain
======================================
The branch prediction mechanism for the PPlain is very different from the
other three processors. Information found in Intel documents and
elsewhere on this subject is directly misleading, and following the
advises given is such documents is likely to lead to sub-optimal code.

The PPlain has a branch target buffer (BTB), which can hold information
for up to 256 jump instructions. The BTB is organized like a 4-way set-
associative cache with 64 entries per way. This means that the BTB can
hold no more than 4 entries with the same set value. Unlike the data
cache, the BTB uses a pseudo random replacement algorithm, which means
that a new entry will not necessarily displace the least recently used
entry of the same set-value. How the set-value is calculated will be
explained later. Each BTB entry stores the address of the jump target
and a prediction state, which can have four different values:

state 0: "strongly not taken"
state 1: "weakly not taken"
state 2: "weakly taken"
state 3: "strongly taken"

A branch instruction is predicted to jump when in state 2 or 3, and to
fall through when in state 0 or 1. The state transition works like a two-
bit counter, so that the state is incremented when the branch is taken,
and decremented when it falls through. The counter saturates, rather than
wrap around, so that it does not decrement beyond 0 or increment beyond 3.
Ideally, this would provide a reasonably good prediction, because a branch
instruction would have to deviate twice from what it does most of the time,
before the prediction changes.

However, this mechanism has been compromised by the fact that state
0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as
no BTB entry. This makes sense, because a branch instruction is predicted
to fall through if it has no BTB entry. This improves the utilization of
the BTB, because a branch instruction which is seldom taken will most of
the time not take up any BTB entry.

Now, if a jumping instruction has no BTB entry, then a new BTB entry will
be generated, and this new entry will always be set to state 3. This means
that it is impossible to go from state 0 to state 1 (except for a very
special case discussed later). From state 0 you can only go to state 3, if
the branch is taken. If the branch falls through, then it will stay out of
the BTB.

This is a serious design flaw. By throwing state 0 entries out of the BTB
and always setting new entries to state 3, the designers apparently have
given priority to minimizing the first time penalty for unconditional
jumps and branches often taken, and ignored that this seriously compromises
the basic idea behind the mechanism and reduces the performance in small
innermost loops. The consequence of this flaw is, that a branch instruction
which falls through most of the time will have up to three times as many
mispredictions as a branch instruction which is taken most of the time.

You may take this asymmetry into account by organizing your branches so
that they are taken more often than not.

Consider for example this if-then-else construction:

        TEST EAX,EAX
        JZ   A
        <branch 1>
        JMP  E
A:      <branch 2>
E:

If branch 1 is executed more often than branch 2, and branch 2 is seldom
executed twice in succession, then you can reduce the number of branch
mispredictions by up to a factor 3 by swapping the two branches so that
the branch instruction will jump more often than fall through:

        TEST EAX,EAX
        JNZ  A
        <branch 2>
        JMP  E
A:      <branch 1>
E:

(This is _contrary_ to the recommendations in Intel's manuals and
tutorials. They don't seem to recognize the asymmetry in branch
prediction. But it is easy to prove that the latter example is superior).

There may be reasons to put the most often executed branch first, however:
1. Putting seldom executed branches away in the bottom of your code can
   improve code cache utilization.
2. A branch instruction seldom taken will stay out of the BTB most of the
   time, possibly improving BTB utilization.
3. The branch instruction will be predicted as not taken if it has been
   flushed out of the BTB by other branch instructions.
4. The asymmetry in branch prediction only exists on the PPlain.

These considerations have little weight, however, for small critical loops,
so I would still recommend organizing branches with a skewed distribution
so that the branch instruction is taken more often than not, unless branch
2 is executed so seldom, that misprediction doesn't matter.

Likewise, you should preferably organize loops with the testing branch
instruction at the bottom, as in this example:

        MOV ECX, [N]
L:      MOV [EDI],EAX
        ADD EDI,4
        DEC ECX
        JNZ L

If N is high, then the JNZ instruction here will be taken more often than
not, and never fall through twice in succession.

Consider the situation where a branch is taken every second time. The
first time it jumps the BTB entry will go into state 3, and will then
alternate between state 2 and 3. It is predicted to jump all the time,
which gives 50% mispredictions. Assume now that it deviates from this
regular pattern and falls through an extra time. The jump pattern is:
01010100101010101010101, where 0 means nojump, and 1 means jump.
       ^
The extra nojump is indicated with a ^ above. After this incident, the
BTB entry will alternate between state 1 and 2, which gives 100%
mispredictions. It will continue in this unfortunate mode until there
is another deviation from the 0101 pattern. This is the worst case for
this branch prediction mechanism.


12.1.2 BTB is looking ahead  (PPlain)
-------------------------------------
The BTB mechanism is counting instruction pairs, rather than single
instructions, so you have to know how instructions are pairing in order
to analyze where a BTB entry is stored. The BTB entry for any control
instruction is attached to the address of the U-pipe instruction in the
preceding instruction pair. (An unpaired instruction counts as one pair).
Example:

        SHR EAX,1
        MOV EBX,[ESI]
        CMP EAX,EBX
        JB  L

Here  SHR  pairs with  MOV,  and  CMP  pairs with  JB. The BTB entry for
JB L  is thus attached to the address of the SHR EAX,1 instruction. When
this BTB entry is met, and if it is in state 2 or 3, then the Pentium will
read the target address from the BTB entry, and load the instructions
following L into the pipeline. This happens before the branch instruction
has been decoded, so the Pentium relies solely on the information in the
BTB when doing this.

You may remember, that instructions are seldom pairing the first time they
are executed (see paragraph 9.2). If the instructions above are not
pairing, then the BTB entry should be attached to the address of the CMP
instruction, and this entry would be wrong on the next execution, when
instructions are pairing. However, in most cases the PPlain is smart enough
to not make a BTB entry when there is an unused pairing opportunity, so you
don't get a BTB entry until the second execution, and hence you won't get a
prediction until the third execution. (In the rare case, where every second
instruction is a single-byte instruction, you may get a BTB entry on the
first execution which becomes invalid in the second execution, but since
the instruction it is attached to will then go to the V-pipe, it is ignored
and gives no penalty. A BTB entry is only read if it is attached to the
address of a U-pipe instruction).

A BTB entry is identified by its set-value which is equal to bits 0-5
of the address it is attached to. Bits 6-31 are then stored in the BTB as
a tag. Addresses which are spaced a multiple of 64 bytes apart will have
the same set-value. You can have no more than four BTB entries with the
same set-value. If you want to check whether your jump instructions contend
for the same BTB entries, then you have to compare bits 0-5 of the addresses
of the U-pipe instructions in the preceding instruction pairs. This is very
tedious, and I have never heard of anybody doing so. There are no tools
available to do this job for you.


12.1.3 Consecutive branches  (PPlain)
-------------------------------------
When a jump is mispredicted, then the pipeline gets flushed. If the next
instruction pair executed also contains a control transfer instruction,
then the PPlain won't load its target because it cannot load a new
target while the pipeline is being flushed. The result is that the
second jump instruction is predicted to fall through regardless of the
state of its BTB entry. Therefore, if the second jump is also taken, then
you will get another penalty. The state of the BTB entry for the second
jump instruction does get correctly updated, though. If you have a long
chain of control transfer instructions, and the first jump in the chain is
mispredicted, then the pipeline will get flushed all the time, and you
will get nothing but mispredictions until you meet an instruction pair
which does not jump. The most extreme case of this is a loop which jumps
to itself: It will get a misprediction penalty for each iteration.

This is not the only problem with consecutive control transfer instructions.
Another problem is that you can have another branch instruction between
a BTB entry and the control transfer instruction it belongs to. If the
first branch instruction jumps to somewhere else, then strange things
may happen. Consider this example:

        SHR EAX,1
        MOV EBX,[ESI]
        CMP EAX,EBX
        JB  L1
        JMP L2

L1:     MOV EAX,EBX
        INC EBX

When  JB L1  falls through, then you will get a BTB entry for  JMP L2
attached to the address of  CMP EAX,EBX.  But what will happen when
JB L1  later is taken?  At the time when the BTB entry for  JMP L2  is
read, the processor doesn't know that the next instruction pair does not
contain a jump instruction, so it will actually predict the instruction
pair  MOV EAX,EBX / INC EBX  to jump to L2.  The penalty for predicting
non-jump instructions to jump is 3 clock cycles.  The BTB entry for  JMP L2
will get its state decremented, because it is applied to something which
doesn't jump.  If we keep going to L1, then the BTB entry for  JMP L2
will be decremented to state 1 and 0, so that the problem will disappear
until next time  JMP L2  is executed.

The penalty for predicting the non-jumping instructions to jump only occurs
when the jump to L1 is predicted. In the case that  JB L1  is mispredictedly
jumping, then the pipeline gets flushed and we won't get the false L2 target
loaded, so in this case we will not see the penalty of predicting the
non-jumping instructions to jump, but we do get the BTB entry for  JMP L2
decremented.

Suppose, now, that we replace the  INC EBX  instruction above with another
jump instruction.  This third jump instruction will then use the same
BTB entry as  JMP L2  with the possible penalty of predicting a wrong target,
(unless it happens to also have L2 as target).

To summarize, consecutive jumps can lead to the following problems:
- failure to load a jump target when the pipeline is being flushed by a
  preceding mispredicted jump.
- a BTB entry being mis-applied to non-jumping instructions and predicting
  them to jump.
- a second consequence of the above is that a mis-applied BTB entry will
  get its state decremented, possibly leading to a later misprediction of
  the jump it belongs to. Even unconditional jumps can be predicted to fall
  through for this reason.
- two jump instructions may share the same BTB entry, leading to the
  prediction of a wrong target.

All this mess may give you a lot of penalties, so you should definitely
avoid having an instruction pair containing a jump immediately after
another poorly predictable control transfer instruction or its target.

It is time for an illustrative example:

        CALL P
        TEST EAX,EAX
        JZ   L2
L1:     MOV  [EDI],EBX
        ADD  EDI,4
        DEC  EAX
        JNZ  L1
L2:     CALL P

This looks like a quite nice and normal piece of code: A function call,
a loop which is bypassed when the count is zero, and another function call.
How many problems can you spot in this program?

First, we may note that the function P is called alternatingly from two
different locations. This means that the target for the return from P will
be changing all the time. Consequently, the return from P will always be
mispredicted.

Assume, now, that EAX is zero. The jump to L2 will not have its target
loaded because the mispredicted return caused a pipeline flush. Next, the
second  CALL P  will also fail to have its target loaded because  JZ L2
caused a pipeline flush. Here we have the situation where a chain of
consecutive jumps makes the pipeline flush repeatedly because the first
jump was mispredicted. The BTB entry for  JZ L2  is stored at the address
of P's return instruction. This BTB entry will now be mis-applied to
whatever comes after the second  CALL P,  but that doesn't give a penalty
because the pipeline is flushed by the mispredicted second return.

Now, let's see what happens if EAX has a nonzero value the next time:
JZ L2  is always predicted to fall through because of the flush. The second
CALL P  has a BTB entry at the address of  TEST EAX,EAX.  This entry will
be mis-applied to the  MOV/ADD  pair, predicting it to jump to P. This
causes a flush which prevents  JNZ L1  from loading its target. If we have
been here before, then the second  CALL P  will have another BTB entry at
the address of  DEC EAX.  On the second and third iteration of the loop, this
entry will also be mis-applied to the  MOV/ADD  pair, until it has had its
state decremented to 1 or 0. This will not cause a penalty on the second
iteration because the flush from  JNZ L1  prevents it from loading its false
target, but on the third iteration it will. The subsequent iterations of the
loop have no penalties, but when it exits,  JNZ L1  is mispredicted. The
flush would now prevent  CALL P  from loading its target, were it not for
the fact that the BTB entry for  CALL P  has already been destroyed by
being mis-applied several times.

We can improve this code by putting in some NOP's to separate all consecutive
jumps:

        CALL P
        TEST EAX,EAX
        NOP
        JZ   L2
L1:     MOV  [EDI],EBX
        ADD  EDI,4
        DEC  EAX
        JNZ  L1
L2:     NOP
        NOP
        CALL P

The extra NOP's cost 2 clock cycles, but they save much more. Furthermore,
JZ L2  is now moved to the U-pipe which reduces its penalty from 4 to 3
when mispredicted. The only problem that remains is that the returns from
P are always mispredicted. This problem can only be solved by replacing the
call to P by an inline macro (if you have enough code cache).

The lesson to learn from this example is that you should always look
carefully for consecutive jumps and see if you can save time by inserting
some NOP's. You should be particularly aware of those situations where
misprediction is unavoidable, such as loop exits and returns from procedures
which are called from varying locations. If you have something useful to
put in, in stead of the NOP's, then you should of course do so.

Multiway branches (case statements) may be implemented either as a tree of
branch instructions or as a list of jump addresses. If you choose to use
a tree of branch instructions, then you have to include some NOP's or
other instructions to separate the consecutive branches. A list of jump
addresses may therefore be a better solution on the PPlain. The list of
jump addresses should be placed in the data segment. Never put data in
the code segment!


12.1.4 Multiple BTB entries  (PPlain)
-------------------------------------
While two jump instructions can share the same BTB entry, one jump
instruction can also have multiple BTB entries if it is jumped to from
different locations, or if its pairing is changing. Example:

        JZ  L1
        MOV EAX,1
L1:     MOV EBX,2
        MOV ECX,3
        JC  L2

If  JZ L1  falls through, then  MOV EAX,1  pairs with  MOV EBX,2, and
MOV ECX,3  pairs with  JC L2.  The BTB entry for  JC L2  will be attached
to the address of  MOV EAX,1. When  JZ L1  is taken, then  MOV EBX,2 will
pair with  MOV ECX,3,  JC L2  will be unpaired, and its BTB entry will be
at  MOV EBX,2.  So you have two BTB entries for  JC L2.  The first entry
is used when the zero flag is clear, the other when set.  This may actually
be an advantage, because it will improve the predictability of the second
branch instruction if there is a correllation between the two.  Assume, for
example, that this code sequence is preceded by a  CMP  instruction. Then
we can be certain that the zero flag and the carry flag will never both be
set at the same time. We will get no second BTB entry for  JC L2  at
MOV EBX,2  because it always falls through if the zero flag is set, and
we will never get a misprediction for  JC L2  when  JZ L1  is taken.
The situations where you can take advantage of this trick are rare,
however, because you might as well let L1 go to another branch which
doesn't test the carry flag at all.


12.1.5 Tight loops  (PPlain)
----------------------------
In a small loop you will often access the same BTB entry repeatedly with
small intervals. This never causes a stall. Rather than waiting for a BTB
entry to be updated, the PPlain somehow bypasses the pipeline and gets the
resulting state from the last jump before it has been written to the BTB.
This mechanism is almost transparent to the user, but it does in some cases
have funny effects: You can see a branch prediction going from state 0 to
state 1, rather than to state 3, if the zero has not yet been written to the
BTB. This happens if the loop has no more than four instruction pairs. In
loops with only two instruction pairs you may sometimes have state 0 for two
consecutive iterations without going out of the BTB. In such small loops it
also happens in rare cases that the prediction uses the state resulting from
two iterations ago, rather than from the last iteration. These funny effects
will usually not have any negative effects on performance.


12.2  Branch prediction in PMMX, PPro and PII
=============================================
12.2.1  BTB organization (PMMX, PPro and PII)
---------------------------------------------
The branch target buffer (BTB) of the PMMX has 256 entries organized
as 16 ways * 16 sets. Each entry is identified by bits 2-31 of
the address of the last byte of the control transfer instruction it
belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the
BTB as a tag. Control transfer instructions which are spaced 64 bytes
apart have the same set-value and may therefore occasionally push each
other out of the BTB. Since there are 16 ways per set, this won't happen
too often.

The branch target buffers of the PPro and PII have 512 entries. I have no
information about how they are organized into sets.

The PPro and PII allocate a BTB entry to any control transfer instruction
the first time it is executed. The PMMX allocates it the first time it
jumps. A branch instruction which never jumps will stay out of the BTB
on the PMMX. As soon as it has jumped once, it will stay in the BTB, even
if it never jumps again.

An entry may be pushed out of the BTB when another control transfer
instruction with the same set-value needs a BTB entry.


12.2.2  misprediction penalty  (PMMX, PPro and PII)
---------------------------------------------------
In the PMMX, the penalty for misprediction of a conditional jump is 4
clocks in the U-pipe, and 5 clocks if it is executed in the V-pipe.
For all other control transfer instructions it is 4 clocks.

In the PPro and PII, the misprediction penalty is very high due to the
long pipeline. A misprediction usually costs between 10 and 20 clock
cycles. It is therefore very important to be aware of poorly predictable
branches when running on PPro or PII.


12.2.3 Pattern recognition for conditional jumps  (PMMX, PPro and PII)
----------------------------------------------------------------------
These processors have an advanced pattern recognition mechanism which will
correctly predict a branch instruction which, for example, is taken
every fourth time and falls through the other three times. In fact,
they can predict any repetitive pattern of jumps and nojumps with a
period of up to five, and many patterns with higher periods.

The mechanism is a so-called "two-level adaptive branch prediction scheme",
invented by T.-Y. Yeh and Y. N. Patt. It is based on the same kind of two-
bit counters as described above for the PPlain (but without the assymmetry
flaw). The counter is incremented when the jump is taken and decremented
when not taken. There is no wrap-around when counting up from 3 or down
from 0. A branch instruction is predicted to be taken when the corresponding
counter is in state 2 or 3, and to fall through when in state 0 or 1.
An impressive improvement is now obtained by having sixteen such
counters for each BTB entry. It selects one of these sixteen counters
based on the history of the branch instruction for the last four executions.
If, for example, the branch instruction jumps once and then falls through
three times, then you have the history bits 1000 (1=jump, 0=nojump). This
will make it use counter number 8 (1000 = binary 8) for predicting the
next time and update counter 8 afterwards.

If the sequence 1000 is always followed by a 1, then counter number 8
will soon end up in its highest state (state 3) so that it will always
predict a 1000 sequence to be followed by a 1. It will take two deviations
from this pattern to change the prediction. The repetitive pattern
100010001000 will have counter 8 in state 3, and counter 1, 2 and 4 in state
0. The other twelve counters will be unused.

12.2.4 Perfectly predicted patterns  (PMMX, PPro and PII)
---------------------------------------------------------
Below is a list of repetitive branch patterns which are predicted perfectly.
Period     Patterns
-----------------------------------------------------------------------------
1 - 5      all
6          000011, 000101, 000111, 001011
7          0000101, 0000111, 0001011
8          00001011, 00001111, 00010011, 00010111, 00101101
9          000010011, 000010111, 000100111, 000101101
10         0000100111, 0000101101, 0000101111, 0000110111, 0001010011,
           0001011101
11         00001001111, 00001010011, 00001011101, 00010100111
12         000010100111, 000010111101, 000011010111, 000100110111,
           000100111011
13         0000100110111, 0000100111011, 0000101001111
14         00001001101111, 00001001111011, 00010011010111, 00010011101011
           00010110011101, 00010110100111
15         000010011010111, 000010011101011, 000010100110111, 000010100111011
           000010110011101, 000010110100111, 000010111010011, 000011010010111
16         0000100110101111, 0000100111101011, 0000101100111101,
           0000101101001111
-----------------------------------------------------------------------------

When reading this table, you should be aware that if a pattern is
predicted correctly than the same pattern reversed (read backwards) is
also predicted correctly, as well as the same pattern with all bits
inverted. Example:
In the table we find the pattern: 0001011.
Reversing this pattern gives:     1101000.
Inverting all bits gives:         1110100.
Both reversing and inverting:     0010111.
These four patterns are all recognizable. Rotating the pattern one place
to the left gives: 0010110. This is of course not a new pattern, only
a phase shifted version of the same pattern. All patterns which can be
derived from one of the patterns in the table by reversing, inverting
and rotating are also recognizable. For reasons of brevity, these are not
listed.

It takes two periods for the pattern recognition mechanism to learn a
regular repetitive pattern after the BTB entry has been allocated.
The pattern of mispredictions in the learning period is not reproducible.
This is probably because the BTB entry contained something prior to
allocation. Since BTB entries are allocated according to a random scheme,
there is little chance of predicting what happens during the initial
learning period.

12.2.5 Handling deviations from a regular pattern  (PMMX, PPro and PII)
-----------------------------------------------------------------------
The branch prediction mechanism is also extremely good at handling
'almost regular' patterns, or deviations from the regular pattern.
Not only does it learn what the regular pattern looks like. It also
learns what deviations from the regular pattern look like. If deviations
are always of the same type, then it will remember what comes after the
irregular event, and the deviation will cost only one misprediction.

Example:
0001110001110001110001011100011100011100010111000
                      ^                   ^
In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns
that the repeated sequence is 000111. The first irregularity is an
unexpected 0, which I have marked with a ^. After this 0 the next three
jumps may be mispredicted, because it hasn't learned what comes after 0010,
0101, and 1011. After one or two irregularities of the same kind it has
learned that after 0010 comes a 1, after 0101 comes 1, and after 1011 comes
1. This means that after at most two irregularities of the same kind, it
has learned to handle this kind of irregularity with only one misprediction.

The prediction mechanism is also very effective when alternating between
two different regular patterns. If, for example, we have the pattern 000111
(with period 6) repeated many times, then the pattern 01 (period 2) many
times, and then return to the 000111 pattern, then the mechanism doesn't
have to relearn the 000111 pattern, because the counters used in the 000111
sequence have been left un-touched during the 01 sequence. After a few
alternations between the two patterns, it has also learned to handle the
changes of pattern with only one misprediction for each time the pattern
is switched.

12.2.6 Patterns which are not predicted perfectly  (PMMX, PPro and PII)
-----------------------------------------------------------------------
The simplest branch pattern which cannot be predicted perfectly is a
branch which is taken on every 6'th execution. The pattern is:
000001000001000001
    ^^    ^^    ^^
    ab    ab    ab
The sequence 0000 is alternatingly followed by a 0, in the positions
marked a above, and by a 1, in the positions marked b. This affects
counter number 0 which will count up and down all the time. If counter
0 happens to start in state 0 or 1, then it will alternate between
state 0 and 1. This will lead to a misprediction in position b. If counter
0 happens to start in state 3, then it will alternate between state 2 and
3 which will cause a misprediction in position a. The worst case is when
it starts in state 2. It will alternate between state 1 and 2 with the
unfortunate consequence that we get a misprediction both in position
a and b. (This is analogous to the worst case for the PPlain explained in
the end of section 12.1.1 above). Which of these four situations we will
get depends on the history of the BTB entry prior to allocation to this
branch. This is beyond our control because of the random allocation method.

In principle, it is possible to avoid the worst case situation where we
have two mispredictions per cycle by giving it an initial branch sequence
which is specially designed for putting the counter in the desired state.
Such an approach cannot be recommended, however, because of the considerable
extra code complexity required, and because whatever information we have
put into the counter is likely to be lost during the next timer interrupt
or task switch.

12.2.7 Completely random patterns  (PMMX, PPro and PII)
-------------------------------------------------------
The powerful capability of pattern recognition has a minor drawback in
the case of completely random sequences with no regularities.

The following table lists the experimental fraction of mispredictions
for a completely random sequence of jumps and nojumps:

  fraction           fraction of
jumps/nojumps       mispredictions
---------------------------------------
0.001/0.999           0.001001
 0.01/0.99            0.0101
 0.05/0.95            0.0525
 0.10/0.90            0.110
 0.15/0.85            0.171
 0.20/0.80            0.235
 0.25/0.75            0.300
 0.30/0.70            0.362
 0.35/0.65            0.418
 0.40/0.60            0.462
 0.45/0.55            0.490
 0.50/0.50            0.500
---------------------------------------

The fraction of mispredictions is slightly higher than it would be
without pattern recognition because the processor keeps trying to find
repeated patterns in a sequence which has no regularities.

12.2.8 Tight loops  (PMMX)
--------------------------
The branch prediction is not reliable in tiny loops where the pattern
recognition mechanism doesn't have time to update its data before the
next branch is met. This means that simple patterns, which would normally
be predicted perfectly, are not recognized. Incidentally, some patterns
which normally would not be recognized, are predicted perfectly in tight
loops. For example, a loop which always repeats 6 times would have the
branch pattern 111110 for the branch instruction at the bottom of the
loop. This pattern would normally have one or two mispredictions per
iteration, but in a tight loop it has none. The same applies to a loop
which repeats 7 times. Most other repeat counts are predicted poorer in
tight loops than normally. This means that a loop which iterates 6 or 7
times should preferably be tight, whereas other loops should preferably
not be tight. You may unroll a loop if necessary to make it less tight.

To find out whether a loop will behave as 'tight' on the PMMX you may follow
the following rule of thumb: Count the number of instructions in the loop.
If the number is 6 or less, then the loop will behave as tight. If you
have more than 7 instructions, then you can be reasonably sure that the
pattern recognition functions normally. Strangely enough, it doesn't
matter how many clock cycles each instruction takes, whether it has
stalls, or whether it is paired or not. Complex integer instructions
do not count. A loop can have lots of complex integer instructions and
still behave as a tight loop. A complex integer instruction is a non-
pairable integer instruction which always takes more than one clock
cycle. Complex floating point instructions and MMX instructions still
count as one. Note, that this rule of thumb is heuristic and not
completely reliable. In important cases you may want to do your own
testing. You can use performance monitor counter number 35H for the
PMMX (0C5H for PPro and PII) to count branch mispredictions. Test
results may not be completely deterministic, because branch predictions
may depend on the history of the BTB entry prior to allocation.

I have no exact experimental information on tight loops in PPro and PII.


12.2.9 Indirect jumps and calls  (PMMX, PPro and PII)
-----------------------------------------------------
There is no pattern recognition for indirect jumps and calls, and the
BTB can remember no more than one target for an indirect jump. It is
simply predicted to go to the same target as it did last time.


12.2.10 JECXZ and LOOP  (PMMX)
------------------------------
There is no pattern recognition for these two instructions in the PMMX.
They are simply predicted to go the same way as last time they were executed.
These two instructions should be avoided in time-critical code for PMMX.


12.2.11 Returns  (PMMX, PPro and PII)
-------------------------------------
The PMMX, PPro and PII processors have a Return Stack Buffer (RSB) which
is used for predicting return instructions. The RSB works as a
First-In-Last-Out  buffer. Each time a CALL instruction is executed, the
corresponding return address is pushed into the RSB. And each time a RET
instruction is executed, a return address is pulled out of the RSB and
used for prediction of the RET. This mechanism makes sure that return
instructions are correctly predicted when the same subroutine is called
from several different locations.

In order to make sure this mechanism works correctly, you must make sure
that all calls and returns are matched. Never jump out of a subroutine
without a return and never use a return as an indirect jump if speed is
critical.

The RSB can hold only four entries in the PMMX. In the case where the RSB
is empty, the return instruction is predicted in the same way as an
indirect jump, i.e. it is expected to go to the same target as it did
last time. When subroutines are nested deeper than four levels, then the
innermost four levels use the RSB, whereas all subsequent returns from
the outer levels use the simpler prediction mechanism as long as there
are no new calls. A return instruction which uses the RSB still occupies
a BTB entry.

Four entries in the RSB doesn't sound of much, but it is probably
sufficient. Subroutine nesting deeper than four levels is certainly not
unusual, but only the innermost levels matter in terms of speed.

The RSB in PPro and PII can probably hold 12 entries.


12.2.12 Static prediction  (PMMX)
---------------------------------
A control transfer instruction which has not been seen before or which
is not in the BTB is always predicted to fall through on the PMMX. It
doesn't matter whether it goes forward or backwards.

A branch instruction will not get a BTB entry if it always falls through.
As soon as it is taken once, it will get into the BTB and stay there no
matter how many times it falls through. A control transfer instruction
can only go out of the BTB when it is pushed out by another control
transfer instruction which steals its BTB entry.

Any control transfer instruction which jumps to the address immediately
following itself will not get a BTB entry. Example:
        JMP SHORT LL
LL:

This instruction will never get a BTB entry and therefore always have
a misprediction penalty.


12.2.13 Static prediction  (PPro and PII)
-----------------------------------------
On PPro and PII, a control transfer instruction which has not been seen
before or which is not in the BTB is predicted to fall through if it goes
forwards, and to be taken if it goes backwards (e.g. a loop). Static
prediction takes longer time than dynamic prediction on these processors.


12.2.14 Close jumps  (PMMX)
---------------------------
On the PMMX, there is a risk that two control transfer instructions will
share the same BTB entry if they are too close to each other. The obvious
result is that they will always be mispredicted.

The BTB entry for a control transfer instruction is identified by bits
2-31 of the address of the last byte in the instruction. If two control
transfer instructions are so close together that they differ only in
bits 0-1 of the address, then we have the problem of a shared BTB entry.
Example:

        CALL    P
        JNC     SHORT L

If the last byte of the CALL instruction and the last byte of the JNC
instruction lie within the same dword of memory, then we have the
penalty. You have to look at the output list file from the assembler to
see whether the two addresses are separated by a dword boundary or not.
(A dword boundary is an address divisible by 4).

There are various ways to solve this problem:
1. Move the code sequence a little up or down in memory so that you get
   a dword boundary between the two addresses.
2. Change the short jump to a near jump (with 4 bytes displacement) so
   that the end of the instruction is moved further down. There is no
   way you can force the assembler to use anything but the shortest
   form of an instruction so you have to hard-code the near branch if
   you choose this solution.
3. Put in some instruction between the CALL and the JNC instructions.
   This is the easiest method, and the only method if you don't know
   where dword boundaries are because your segment is not dword aligned
   or because the code keeps moving up and down as you make changes in
   the preceding code:

        CALL    P
        MOV     EAX,EAX         ; two bytes filler to be safe
        JNC     SHORT L

If you want to avoid problems on the PPlain too, then put in two NOP's
in stead to prevent pairing (see section 12.1.3 above).

The RET instruction is particularly prone to this problem because it is
only one byte long:

        JNZ     NEXT
        RET

Here you may need up to three bytes of fillers:

        JNZ     NEXT
        NOP
        MOV     EAX,EAX
        RET


12.2.15 Consequtive calls or returns  (PMMX)
--------------------------------------------
There is a penalty when the first instruction pair following the target
label of a call contains another call instruction or if a return follows
immediately after another return. Example:

FUNC1   PROC    NEAR
        NOP             ; avoid call after call
        NOP
        CALL    FUNC2
        CALL    FUNC3
        NOP             ; avoid return after return
        RET
FUNC1   ENDP

Two NOP's are required before  CALL FUNC2  because a single NOP would
pair with the CALL. One NOP is enough before the RET because RET is
unpairable. No NOP's are required between the two CALL instructions
because there is no penalty for call after return. (On the PPlain you
would need two NOP's here too).

The penalty for chained calls only occurs when the same subroutines are
called from more than one location (probably because the RSB needs
updating). Chained returns always have a penalty. There is sometimes a
small stall for a jump after a call, but no penalty for return after call;
call after return; jump, call, or return after jump; or jump after return.


12.2.16 Designing for branch predictability  (PMMX, PPro and PII)
-----------------------------------------------------------------
Multiway branches (switch/case statements) are implemented either as
an indirect jump using a list of jump addresses, or as a tree of branch
instructions. Since indirect jumps are poorly predicted, the latter
method may be preferred if easily predicted patterns can be expected
and you have enough BTB entries.
                 
You may want to reorganize your code so that branch patterns which are
not predicted perfectly can be replaced by other patterns which are.
Consider, for example, a loop which always executes 20 times. The
conditional jump at the bottom of the loop is taken 19 times and falls
through every 20'th time. This pattern is regular, but not recognized by
the pattern recognition mechanism. You may make two nested loops by
four and five, or unroll the loop by four and let it execute 5 times,
in order to have only recognizable patterns. This kind of complicated
schemes are only worth the extra code on the PPro and PII processors
where mispredictions are very expensive. For higher loop counts there
is no reason to do anything about the single misprediction.

    
12.3 Avoiding branches  (all processors)
----------------------------------------
Sometimes it is possible to obtain the same effect as a branch by ingenious
manipulation of bits and flags. For example you may calculate the absoulte
value of a signed number without branching:
        CWD
        XOR EAX,EDX
        SUB EAX,EDX
(On PPlain and PMMX, use  MOV EDX,EAX / SAR EDX,31  in stead of  CWD).

The carry flag is particularly useful for this kind of tricks.
Setting carry if a value is zero:  CMP [value],1
Setting carry if a value is not zero:  XOR EAX,EAX / CMP EAX,[value]
Incrementing a counter if carry:  ADC EAX,0
Setting a bit for each time the carry is set:  RCL EAX,1
Generating a bit mask if carry is set:  SBB EAX,EAX
Setting a bit on an arbitrary condition:  SETcond AL
Setting eight bits on an arbitrary condition:  SETNcond AL / DEC AL
(remember to reverse the condition in the last example)

This example finds the minimum of two unsigned numbers: if (b < a) a = b;
        SUB EBX,EAX
        SBB ECX,ECX
        AND ECX,EBX
        ADD EAX,ECX

This example chooses between two numbers:  if (a != 0) a = b; else a = c;
        CMP EAX,1
        SBB EAX,EAX
        XOR ECX,EBX
        AND EAX,ECX
        XOR EAX,EBX

Whether or not such tricks are worth the extra code depend on how
predictable a conditional jump would be, whether the extra pairing
opportunities of the branch-free code can be utilized, and whether there
are other jumps following immediately after which could suffer the penalties
of consecutive jumps.

The PPro and PII processors have conditional move instructions intended
specially for avoiding branches. For code that will run only on these
processors, you should replace all poorly predictable branches with
conditional moves if possible.


13. PREFIXES (PPlain and PMMX)
==============================
An instruction with one or more prefixes may not be able to execute in the
V-pipe (se paragraph 8.7), and it may take more than one clock cycle to
decode.

On the PPlain, the decoding delay is one clock cycle for each prefix
except for the 0Fh prefix of conditional near jumps.

The PMMX has no decoding delay for 0Fh prefix. Segment and repeat prefixes
take one clock extra to decode. Address and operand size prefixes take
two clocks extra to decode. The PMMX can decode two instructions per clock
cycle if the first instruction has a segment or repeat prefix or no prefix,
and the second instruction has no prefix. Instructions with address or
operand prefixes can only decode alone on the PMMX. Instructions with more
than one prefix take one clock extra for each prefix.

Address size prefixes can be avoided by using 32 bit mode.
Segment prefixes can be avoided in 32 bit mode by using a flat memory
model. Operand size prefixes can be avoided in 32 bit mode by using
only 8 bit and 32 bit integers.

Where prefixes are unavoidable, the decoding delay may be masked if a
preceding instruction takes more than one clock cycle to execute. The rule
for the PPlain is that any instruction which takes N clock cycles to
execute (not to decode) can 'overshadow' the decoding delay of N-1
prefixes in the next two (sometimes three) instructions or instruction
pairs. In other words, each extra clock cycle that an instruction takes to
execute can be used to decode one prefix in a later instruction. This
shadowing effect even extends across a predicted branch. Any instruction
which takes more than one clock cycle to execute, and any instruction which
is delayed because of an AGI stall, cache miss, misalignment, or any other
reason except decoding delay and branch misprediction, has a shadowing
effect.

The PMMX has a similar shadowing effect, but the mechanism is different.
Decoded instructions are stored in a transparent first-in-first-out (FIFO)
buffer, which can hold up to four instructions. As long as there are
instructions in the FIFO buffer you get no delay. When the buffer is empty
then instructions are executed as soon as they are decoded. The buffer is
filled when instructions are decoded faster than they are executed, i.e.
when you have unpaired or multi-cycle instructions. The FIFO buffer is
emptied when instructions execute faster than they are decoded, i.e. when
you have decoding delays due to prefixes. The FIFO buffer is empty after
a mispredicted branch. The FIFO buffer can receive two instructions per
clock cycle provided that the second instruction is without prefixes and
none of the instructions are longer than 7 bytes. The two execution
pipelines (U and V) can each receive one instruction per clock cycle from
the FIFO buffer.


Examples:

CLD / REP MOVSD
The CLD instruction takes two clock cycles and can therefore overshadow
the decoding delay of the REP prefix. The code would take one clock cycle
more if the CLD instruction was placed far from the REP MOVSD.

CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
The CMP instruction takes two clock cycles here because it is a read/modify
instruction. The 0FH prefix of the SETNZ instruction is decoded during the
second clock cycle of the CMP instruction, so that the decoding delay is
hidden on the PPlain (The PMMX has no decoding delay for the 0FH).


14. REDUCING CODE SIZE  (all processors)
========================================
As explained in paragraph 6, the code cache is 8 or 16 kb. If you have
problems keeping the critical parts of your code within the code cache,
then you may consider reducing the size of your code.

32 bit code is usually bigger than 16 bit code because addresses and data
constants take 4 bytes in 32 bit code and only 2 bytes in 16 bit code.
However, 16 bit code has other penalties such as prefixes and problems
with accessing adjacent words simultaneously (see chapter 10). Some other
methods for reducing the size or your code are discussed below.

Both jump addresses, data addresses, and data constants take less space if
they can be expressed as a sign-extended byte, i.e. if they are within the
interval from -128 to +127.

For jump addresses this means that short jumps take two bytes of code,
whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes
if conditional.

Likewise, data addresses take less space if they can be expressed as a
pointer and a displacement between -128 and +127.
Example:
MOV EBX,DS:[100000] / ADD EBX,DS:[100004]          ; 12 bytes
Reduce to:
MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4]   ; 10 bytes

The advantage of using a pointer obviously increases if you use it many
times. Storing data on the stack and using EBP or ESP as pointer will thus
make your code smaller than if you use static memory locations and absolute
addresses, provided of course that your data are within 127 bytes of the
pointer. Using PUSH and POP to write and read temporary data is even more
advantageous.

Data constants may also take less space if they are between -128 and +127.
Most instructions with immediate operands have a short form where the
operand is a sign-extended single byte. Examples:
PUSH 200      ; 5 bytes
PUSH 100      ; 2 bytes

ADD EBX,128   ; 6 bytes
SUB EBX,-128  ; 3 bytes

The most important instruction with an immediate operand which doesn't
have such a short form is MOV. Examples:
MOV EAX, 1              ; 5 bytes
May be changed to:
XOR EAX,EAX / INC EAX   ; 3 bytes
or:
PUSH 1 / POP EAX        ; 3 bytes

A MOV with a 4-byte immediate operand may sometimes be replaced by an
arithmetic instruction if the value of the register before the MOV is
known. Example:
        MOV     [mem1],0        ; 10 bytes
        MOV     [mem2],-1       ; 10 bytes
        MOV     EAX,100         ;  5 bytes
        MOV     EBX,150         ;  5 bytes
May be changed to:
        XOR     EAX,EAX         ;  2 bytes
        MOV     [mem1],EAX      ;  5 bytes
        DEC     EAX             ;  1 byte
        MOV     [mem2],EAX      ;  5 bytes
        ADD     EAX,101         ;  3 bytes
        LEA     EBX,[EAX+50]    ;  3 bytes
Be aware of the AGI stall in the LEA instruction.

If the same constant is used more than once then you may of course load it
into a register. Example:
MOV DWORD PTR [EBX],0 / MOV DWORD PTR [EBX+4],0   ; 13 bytes
XOR EAX,EAX / MOV [EBX],EAX / MOV [EBX+4],EAX     ;  7 bytes

You may also consider that different instructions have different lengths.
The following instructions take only one byte and are therefore very
attractive: PUSH reg, POP reg, INC reg32, DEC reg32.
INC and DEC with 8 bit registers take 2 bytes, so  INC EAX  is shorter
than  INC AL.

XCHG EAX,reg is also a single-byte instruction and thus takes less space 
than MOV EAX,reg, but it is slower and not pairable.

Some instructions take one byte less when they use the accumulator than
when they use any other register. Examples:

MOV EAX,DS:[100000]  is smaller than  MOV EBX,DS:[100000]
ADD EAX,1000         is smaller than  ADD EBX,1000

Instructions with pointers take one byte less when they have only a base 
pointer (not ESP) and a displacement than when they have a scaled index 
register, or both base pointer and index register, or ESP as base pointer. 
Examples:
MOV EAX,[array][EBX]  is smaller than  MOV EAX,[array][EBX*4]
MOV EAX,[EBP+12]      is smaller than  MOV EAX,[ESP+12]
Instructions with EBP as base pointer and no displacement and no index
take one byte more than with other registers:
MOV EAX,[EBX]    is smaller than  MOV EAX,[EBP],  but
MOV EAX,[EBX+4]  is same size as  MOV EAX,[EBP+4].


15. SCHEDULING FLOATING POINT CODE (PPlain and PMMX)
====================================================
Floating point instructions cannot pair the way integer instructions can,
except for one special case, defined by the following rules:

- the first instruction (executing in the U-pipe) must be FLD, FADD, FSUB,
  FMUL, FDIV, FCOM, FCHS, or FABS

- the second instruction (in V-pipe) must be FXCH

- the instruction following the FXCH must be a floating point instruction,
  otherwise the FXCH will pair imperfectly and take an extra clock cycle.

This special pairing is important, as will be explained shortly.

While floating point instructions in general cannot be paired, many can be
pipelined, i.e. one instruction can begin before the previous instruction has 
finished. Example:

FADD ST(1),ST(0)   ; clock cycle 1-3
FADD ST(2),ST(0)   ; clock cycle 2-4
FADD ST(3),ST(0)   ; clock cycle 3-5
FADD ST(4),ST(0)   ; clock cycle 4-6

Obviously, two instructions cannot overlap if the second instruction needs
the result of the first. Since almost all floating point instructions
involve the top of stack register, ST(0), there are seemingly not very many
possibilities for making an instruction independent of the result of
previous instructions. The solution to this problem is register renaming.
The FXCH instruction does not in reality swap the contents of two registers,
it only swaps their names. Instructions which push or pop the register
stack also work by renaming. Floating point register renaming has been
highly optimized on the Pentiums so that a register may be renamed while
in use. Register renaming never causes stalls - it is even possible to
rename a register more than once in the same clock cycle, as for example
when you pair FLD or FCOMPP with FXCH.

By the proper use of FXCH instructions you may obtain a lot of overlapping in
your floating point code. Example:

FLD     [a1]    ; clock cycle 1
FADD    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FADD    [b2]    ; clock cycle 4-6
FLD     [c1]    ; clock cycle 5
FADD    [c2]    ; clock cycle 6-8
FXCH    ST(2)   ; clock cycle 6
FADD    [a3]    ; clock cycle 7-9
FXCH    ST(1)   ; clock cycle 7
FADD    [b3]    ; clock cycle 8-10
FXCH    ST(2)   ; clock cycle 8
FADD    [c3]    ; clock cycle 9-11
FXCH    ST(1)   ; clock cycle 9
FADD    [a4]    ; clock cycle 10-12
FXCH    ST(2)   ; clock cycle 10
FADD    [b4]    ; clock cycle 11-13
FXCH    ST(1)   ; clock cycle 11
FADD    [c4]    ; clock cycle 12-14
FXCH    ST(2)   ; clock cycle 12

In the above example we are interleaving three independent threads. Each
FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle.
When we have started an FADD in the 'a' thread we have time to start two new
FADD instructions in the 'b' and 'c' threads before returning to the 'a'
thread, so every third FADD belongs to the same thread. We are using FXCH
instructions every time to get the register that belongs to the desired
thread into ST(0). As you can see in the example above, this generates a
regular pattern, but note well that the FXCH instructions repeat with a
period of two while the threads have a period of three. This can be quite
confusing, so you have to 'play computer' in order to know which registers
are where.

All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock
cycles and are able to overlap, so that these instructions may be scheduled
using the method described above. Using a memory operand does not take more
time than a register operand if the memory operand is in the level 1 cache
and properly aligned.

By now you must be used to the rules having exceptions, and the overlapping
rule is no exception: You cannot start an FMUL instruction one clock cycle 
after another FMUL instruction, because the FMUL circuitry is not perfectly
pipelined. It is recommended that you put another instruction in between two
FMUL's. Example:

FLD     [a1]    ; clock cycle 1
FLD     [b1]    ; clock cycle 2
FLD     [c1]    ; clock cycle 3
FXCH    ST(2)   ; clock cycle 3
FMUL    [a2]    ; clock cycle 4-6
FXCH            ; clock cycle 4
FMUL    [b2]    ; clock cycle 5-7    (stall)
FXCH    ST(2)   ; clock cycle 5
FMUL    [c2]    ; clock cycle 7-9    (stall)
FXCH            ; clock cycle 7
FSTP    [a3]    ; clock cycle 8-9
FXCH            ; clock cycle 10     (unpaired)
FSTP    [b3]    ; clock cycle 11-12
FSTP    [c3]    ; clock cycle 13-14

Here you have a stall before  FMUL [b2]  and before  FMUL [c2]  because 
another FMUL started in the preceding clock cycle. You can improve this code
by putting FLD instructions in between the FMUL's:

FLD     [a1]    ; clock cycle 1
FMUL    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FMUL    [b2]    ; clock cycle 4-6
FLD     [c1]    ; clock cycle 5
FMUL    [c2]    ; clock cycle 6-8
FXCH    ST(2)   ; clock cycle 6
FSTP    [a3]    ; clock cycle 7-8
FSTP    [b3]    ; clock cycle 9-10
FSTP    [c3]    ; clock cycle 11-12

In other cases you may put FADD, FSUB, or anything else in between FMUL's to
avoid the stalls.

Overlapping floating point instructions requires of course that you have
some independent threads that you can interleave. If you have only one big 
formula to execute, then you may compute parts of the formula in parallel to 
achieve overlapping. If, for example, you want to add six numbers, then you
may split the operations into two threads with three numbers in each, and
add the two threads in the end:

FLD     [a]     ; clock cycle 1
FADD    [b]     ; clock cycle 2-4
FLD     [c]     ; clock cycle 3
FADD    [d]     ; clock cycle 4-6
FXCH            ; clock cycle 4
FADD    [e]     ; clock cycle 5-7
FXCH            ; clock cycle 5
FADD    [f]     ; clock cycle 7-9    (stall)
FADD            ; clock cycle 10-12  (stall)

Here we have a one clock stall before  FADD [f]  because it is waiting for
the result of  FADD [d]  and a two clock stall before the last FADD because
it is waiting for the result of  FADD [f]. The latter stall can be hidden by 
filling in some integer instructions, but the first stall can not because an
integer instruction at this place would make the FXCH unpairable.  

The first stall can be avoided by having three threads rather than two, but
that would cost an extra FLD so we do not save anything by having three
threads rather than two unless there are at least eight numbers to add.

Not all floating point instructions can overlap. And some floating point
instructions can overlap more subsequent integer instructions than 
subsequent floating point instructions. The FDIV instruction, for example,
takes 39 clock cycles. All but the first clock cycle can overlap with
integer instructions, but only the last two clock cycles can overlap with 
floating point instructions. Example: 

FDIV            ; clock cycle 1-39  (U-pipe)
FXCH            ; clock cycle 1-2   (V-pipe, imperfect pairing)
SHR EAX,1       ; clock cycle 3     (U-pipe)
INC EBX         ; clock cycle 3     (V-pipe)
CMC             ; clock cycle 4-5   (non-pairable)
FADD [x]        ; clock cycle 38-40 (U-pipe, waiting while f.p. unit busy)
FXCH            ; clock cycle 38    (V-pipe)
FMUL [y]        ; clock cycle 40-42 (U-pipe, waiting for result of FDIV)

The first FXCH pairs with the FDIV, but takes an extra clock cycle because
it is not followed by a floating point instruction. The SHR / INC pair
starts before the FDIV is finished, but has to wait for the FXCH to
finish. The FADD has to wait till clock 38 because new floating point
instructions can only execute during the last two clock cycles of the
FDIV. The second FXCH pairs with the FADD. The FMUL has to wait for the
FDIV to finish because it uses the result of the division.

If you have nothing else to put in after a floating point instruction with
a large integer overlap, such as FDIV or FSQRT, then you may put in a
dummy read from an address which you expect to need later in the program
to make sure it is in the level one cache. Example:
        FDIV    QWORD PTR [EBX]
        CMP     [ESI],EAX
        FMUL    QWORD PTR [ESI]
Here we use the integer overlap to pre-load the value at [ESI] into the
cache while the FDIV is being computed (we don't care what the result of
the CMP is).

Paragraph 21 gives a complete listing of floating point instructions, and
what they can pair or overlap with.

There is no penalty for using a memory operand on floating point
instuctions because the arithmetic unit is one step later in the pipeline
than the read unit. The tradeoff of this comes when you store floating
point data to memory. The FST or FSTP instruction with a memory operand
takes two clock cycles in the execution stage, but it needs the data one
clock earlier so you will get a one clock stall if the value to store is
not ready one clock cycle in advance. This is analogous to an AGI stall.
Example:

FLD     [a1]    ; clock cycle 1
FADD    [a2]    ; clock cycle 2-4
FLD     [b1]    ; clock cycle 3
FADD    [b2]    ; clock cycle 4-6
FXCH            ; clock cycle 4
FSTP    [a3]    ; clock cycle 6-7
FSTP    [b3]    ; clock cycle 8-9

The  FSTP [a3]  stalls for one clock cycle because the result of  FADD [a2]  
is not ready in the preceding clock cycle. In many cases you cannot hide 
this type of stall without scheduling your floating point code into four
threads or putting some integer instructions in between.
The two clock cycles in the execution stage of the FST(P) instruction
cannot pair or overlap with any subsequent instructions.

Instructions with integer operands such as FIADD, FISUB, FIMUL, FIDIV, FICOM 
may be split up into simpler operations in order to improve overlapping. 
Example:

FILD    [a]     ; clock cycle 1-3
FIMUL   [b]     ; clock cycle 4-9

Split up into:

FILD    [a]     ; clock cycle 1-3
FILD    [b]     ; clock cycle 2-4
FMUL            ; clock cycle 5-7

In this example, you save two clocks by overlapping the two FILD
instructions.


16. Loop Optimation (PPlain and PMMX)
=====================================
When analyzing a program you often find that most of the time consumption
lies in the innermost loop. The way to improve the speed is to carefully
optimize the most time-consuming loop using ASM language. The rest of the 
program may be left in high-level language.  

A loop generally contains a counter controlling how many times to iterate,
and often array access reading or writing one array element for each
iteration. I have chosen as example a procedure which reads integers from
an array, changes the sign of each integer, and stores the results in
another array.

A C language code for this procedure would be:

void ChangeSign (int * A, int * B, int N) {
  int i;
  for (i=0; i<N; i++) B[i] = -A[i];}

Translating to assembler, we might write the procedure like this: 

Example 1:

_ChangeSign PROCEDURE NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]

        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET                     ; (no extra pop if C calling convention)
_ChangeSign     ENDP

This looks like a nice solution, but it is not optimal because it uses
slow non-pairable instructions. It takes 11 clock cycles per iteration
if all data are in the level one cache.

Using pairable instructions only  (PPlain and PMMX)
---------------------------------------------------

Example 2:

        MOV     ECX, [N]
        MOV     ESI, [A]
        TEST    ECX, ECX
        JZ      SHORT L2
        MOV     EDI, [B]
L1:     MOV     EAX, [ESI]       ; u
        XOR     EBX, EBX         ; v (pairs)
        ADD     ESI, 4           ; u
        SUB     EBX, EAX         ; v (pairs)
        MOV     [EDI], EBX       ; u
        ADD     EDI, 4           ; v (pairs)
        DEC     ECX              ; u
        JNZ     L1               ; v (pairs)
L2:

Here we have used pairable instructions only, and scheduled the instruc-
tions so that everything pairs. It now takes only 4 clock cycles per
iteration. We could have obtained the same speed without splitting the
NEG  instruction, but the other unpairable instructions should be split
up.

Using the same register for counter and index
---------------------------------------------

Example 3:

        MOV     ESI, [A]
        MOV     EDI, [B]
        MOV     ECX, [N]
        XOR     EDX, EDX
        TEST    ECX, ECX
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*EDX]          ; u
        NEG     EAX                       ; u
        MOV     [EDI+4*EDX], EAX          ; u
        INC     EDX                       ; v (pairs)
        CMP     EDX, ECX                  ; u
        JB      L1                        ; v (pairs)
L2:

Using the same register for counter and index gives us fewer instructions
in the body of the loop, but it still takes 4 clocks because we have two
unpaired instructions.

Letting the counter end at zero
-------------------------------
We want to get rid of the CMP instruction in example 3 by letting the 
counter end at zero and use the zero flag for detecting when we are
finished as we did in example 2. One way to do this would be to execute
the loop backwards taking the last array elements first. However, data
caches are optimized for accessing data forwards, not backwards, so if
cache misses are likely, then you should rather start the counter at -N
and count through negative values up to zero. The base registers should
then point to the end of the arrays rather than the beginning:

Example 4:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L2
L1:     MOV     EAX, [ESI+4*ECX]          ; u
        NEG     EAX                       ; u
        MOV     [EDI+4*ECX], EAX          ; u
        INC     ECX                       ; v (pairs)
        JNZ     L1                        ; u
L2:

We are now down at five instructions in the loop body but it still takes
4 clocks because of poor pairing. (If the addresses and sizes of the
arrays are constants we may save two registers by substituting A+SIZE A
for ESI and B+SIZE B for EDI). Now let's see how we can improve pairing.

Pairing calculations with loop overhead  (PPlain and PMMX)
----------------------------------------------------------
We may want to improve pairing by intermingling calculations with the loop
control instructions. If we want to put something in between INC ECX and
JNZ L1, it has to be something that doesn't affect the zero flag. The
MOV [EDI+4*ECX],EBX  instruction after  INC ECX  would generate an AGI
delay, so we have to be more ingenious:

Example 5:

        MOV     EAX, [N]
        XOR     ECX, ECX
        SHL     EAX, 2                    ; 4 * N
        JZ      SHORT L3
        MOV     ESI, [A]
        MOV     EDI, [B]
        SUB     ECX, EAX                  ; - 4 * N
        ADD     ESI, EAX                  ; point to end of array A
        ADD     EDI, EAX                  ; point to end of array B
        JMP     SHORT L2
L1:     MOV     [EDI+ECX-4], EAX          ; u
L2:     MOV     EAX, [ESI+ECX]            ; v (pairs)
        XOR     EAX, -1                   ; u
        ADD     ECX, 4                    ; v (pairs)
        INC     EAX                       ; u
        JNC     L1                        ; v (pairs)
        MOV     [EDI+ECX-4], EAX
L3:

I have used a different way to calculate the negative of EAX here:
inverting all bits and adding one. The reason why I am using this method
is that I can use a dirty trick with the INC instruction: INC doesn't
change the carry flag, whereas ADD does. I am using ADD rather than INC
to increment my loop counter and testing the carry flag rather than the
zero flag. It is then possible to put the INC EAX in between without
affecting the carry flag. You may think that we could have used
LEA EAX,[EAX+1]  here instead of INC EAX,  at least that doesn't change
any flags, but the LEA instruction would have an AGI stall so that's not
the best solution.

I have obtained perfect pairing here and the loop now takes only 3 clock
cycles. Whether you want to increment the loop counter by 1 (as in example
4) or by 4 (as in example 5) is a matter of taste, it makes no difference
in loop timing.

Overlapping the end of one operation with the beginning of the next
-------------------------------------------------------------------
The method used in example 5 is not very generally applicable so we may
look for other methods of improving pairing opportunities. One way is to
reorganize the loop so that the end of one operation overlaps with the
beginning of the next. I will call this convoluting the loop. A convoluted
loop has an unfinished operation at the end of each loop iteration which
will be finished in the next run. Actually, example 5 did pair the last
MOV of one iteration with the first MOV of the next, but we want to
explore this method further:

Example 6:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L3
        XOR     EBX, EBX
        MOV     EAX, [ESI+4*ECX]
        INC     ECX
        JZ      SHORT L2
L1:     SUB     EBX, EAX                  ; u
        MOV     EAX, [ESI+4*ECX]          ; v (pairs)
        MOV     [EDI+4*ECX-4], EBX        ; u
        INC     ECX                       ; v (pairs)
        MOV     EBX, 0                    ; u
        JNZ     L1                        ; v (pairs)
L2:     SUB     EBX, EAX
        MOV     [EDI+4*ECX-4], EBX
L3:

Here we begin reading the second value before we have stored the first,
and this of course improves pairing opportunities. The  MOV EBX,0
instruction has been put in between INC ECX and JNZ L1 not to improve
pairing but to avoid AGI stall.

Rolling out a loop
------------------
The most generally applicable way to improve pairing opportunities is to
do two operations for each run and do half as many runs. This is called
rolling out a loop:

Example 7:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L2
        TEST    AL,1                      ; test if N is odd
        JZ      SHORT L1
        MOV     EAX, [ESI+4*ECX]          ; N is odd. do the odd one
        NEG     EAX
        MOV     [EDI+4*ECX], EAX
        INC     ECX                       ; make counter even
        JZ      SHORT L2                  ; N = 1
L1:     MOV     EAX, [ESI+4*ECX]          ; u
        MOV     EBX, [ESI+4*ECX+4]        ; v (pairs)
        NEG     EAX                       ; u
        NEG     EBX                       ; u
        MOV     [EDI+4*ECX], EAX          ; u
        MOV     [EDI+4*ECX+4], EBX        ; v (pairs)
        ADD     ECX, 2                    ; u
        JNZ     L1                        ; v (pairs)
L2:

Now we are doing two operations in parallel which gives the best pairing
opportunities. We have to test if N is odd and if so do one operation
outside the loop because the loop can only do an even number of operations.

The loop has an AGI stall at the first MOV instruction because ECX has
been incremented in the preceding clock cycle. The loop therefore takes
6 clock cycles for two operations.

Reorganizing a loop to remove AGI stall  (PPlain and PMMX)
----------------------------------------------------------
Example 8:

        MOV     ESI, [A]
        MOV     EAX, [N]
        MOV     EDI, [B]
        XOR     ECX, ECX
        LEA     ESI, [ESI+4*EAX]          ; point to end of array A
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+4*EAX]          ; point to end of array B
        JZ      SHORT L3
        TEST    AL,1                      ; test if N is odd
        JZ      SHORT L2
        MOV     EAX, [ESI+4*ECX]          ; N is odd. do the odd one
        NEG     EAX                       ; no pairing opportunity
        MOV     [EDI+4*ECX-4], EAX
        INC     ECX                       ; make counter even
        JNZ     SHORT L2
        NOP                               ; add NOP's if JNZ L2 not predictable
        NOP
        JMP     SHORT L3                  ; N = 1
L1:     NEG     EAX                       ; u
        NEG     EBX                       ; u
        MOV     [EDI+4*ECX-8], EAX        ; u
        MOV     [EDI+4*ECX-4], EBX        ; v (pairs)
L2:     MOV     EAX, [ESI+4*ECX]          ; u
        MOV     EBX, [ESI+4*ECX+4]        ; v (pairs)
        ADD     ECX, 2                    ; u
        JNZ     L1                        ; v (pairs)
        NEG     EAX
        NEG     EBX
        MOV     [EDI+4*ECX-8], EAX
        MOV     [EDI+4*ECX-4], EBX
L3:

The trick is to find a pair of instructions that do not use the loop
counter as index and reorganize the loop so that the counter is
incremented in the preceding clock cycle. We are now down at 5 clock
cycles for two operations which is close to the best possible.

If data caching is critical, then you may improve the speed further by
interleaving the A and B arrays into one structured array so that each
B[i] comes immediately after the corresponding A[i]. If the structured
array is aligned by at least 8 then B[i] will always be in the same cache
line as A[i], so you will never have a cache miss when writing B[i]. This
may of course have a tradeoff in other parts of the program so you have
to weigh the costs against the benefits.


Rolling out by more than 2
--------------------------
You may think of doing more than two operations per iteration in order to 
reduce the loop overhead per operation. But since the loop overhead in
most cases can be reduced to only one clock cycle per iteration, then
rolling out the loop by 4 rather than by 2 would only save 1/4 clock cycle
per operation, which is hardly worth the effort. Only if the loop overhead
cannot be reduced to one clock cycle and if N is very big, should you
think of unrolling by 4.

The drawbacks of excessive loop unrolling are: 
1.  You need to calculate N MODULO R, where R is the unrolling factor,
    and do N MODULO R operations before or after the main loop in order
    to make the remaining number of operations divisible by R. This takes
    a lot of extra code and poorly predictable branches. And the loop body
    of course also becomes bigger.
2.  A Piece of code usually takes much more time the first time it
    executes, and the penalty of first time execution is bigger the more
    code you have, especially if N is small.
3.  Excessive code size makes the utilization of the code cache less
    effective.


Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers
-------------------------------------------------------------------------
(all processors)
----------------
If you need to manipulate arrays of 8 or 16 bit operands, then there is a
problem with unrolled loops because you may not be able to pair two memory 
access operations. For example  MOV AL,[ESI] / MOV BL,[ESI+1]  will not pair
if the two operands are within the same dword of memory. But there may be a 
much smarter method, namely to handle four bytes at a time in the same 32
bit register.

The following example adds 2 to all elements of an array of bytes:

Example 9:

        MOV     ESI, [A]         ; address of byte array
        MOV     ECX, [N]         ; number of elements in byte array
        TEST    ECX, ECX         ; test if N is 0
        JZ      SHORT L2
        MOV     EAX, [ESI]       ; read first four bytes
L1:     MOV     EBX, EAX         ; copy into EBX
        AND     EAX, 7F7F7F7FH   ; get lower 7 bits of each byte in EAX
        XOR     EBX, EAX         ; get the highest bit of each byte in EBX
        ADD     EAX, 02020202H   ; add desired value to all four bytes
        XOR     EBX, EAX         ; combine bits again
        MOV     EAX, [ESI+4]     ; read next four bytes
        MOV     [ESI], EBX       ; store result
        ADD     ESI, 4           ; increment pointer
        SUB     ECX, 4           ; decrement loop counter
        JA      L1               ; loop
L2:

This loop takes 5 clock cycles for every 4 bytes. The array should of course 
be aligned by 4. If the number of elements in the array is not divisible by 
four, then you may padd it in the end with a few extra bytes to make the
length divisible by four. This loop will always read past the end of the
array, so you should make sure the array is not placed at the end of a
segment to avoid a general protection error.

Note that I have masked out the highest bit of each byte to avoid a possible 
carry from each byte into the next when adding. I am using XOR rather than 
ADD when putting in the high bit again to avoid carry.

The  ADD ESI,4  instruction could have been avoided by using the loop
counter as index as in example 4. However, this would give an odd number of 
instructions in the loop body, so there would be one unpaired instruction 
and the loop would still take 5 clocks. Making the branch instruction
unpaired would save one clock after the last operation when the branch is 
mispredicted, but we would have to spend an extra clock cycle in the prolog 
code to setup a pointer to the end of the array and calculate -N, so the two
methods will be exactly equally fast. The method presented here is the
simplest and shortest.  

The next example finds the length of a zero-terminated string by searching 
for the first byte of zero. It is faster than using REP SCASB:

Example 10:

STRLEN  PROC    NEAR
        MOV     EAX,[ESP+4]               ; get pointer
        MOV     EDX,7
        ADD     EDX,EAX                   ; pointer+7 used in the end
        PUSH    EBX
        MOV     EBX,[EAX]                 ; read first 4 bytes
        ADD     EAX,4                     ; increment pointer
L1:     LEA     ECX,[EBX-01010101H]       ; subtract 1 from each byte
        XOR     EBX,-1                    ; invert all bytes
        AND     ECX,EBX                   ; and these two
        MOV     EBX,[EAX]                 ; read next 4 bytes
        ADD     EAX,4                     ; increment pointer
        AND     ECX,80808080H             ; test all sign bits
        JZ      L1                        ; no zero bytes, continue loop
        TEST    ECX,00008080H             ; test first two bytes
        JNZ     SHORT L2
        SHR     ECX,16                    ; not in the first 2 bytes
        ADD     EAX,2
L2:     SHL     CL,1                      ; use carry flag to avoid a branch
        POP     EBX
        SBB     EAX,EDX                   ; compute length
        RET                               ; (or RET 4 if pascal)
STRLEN  ENDP

Again we have used the method of overlapping the end of one operation with
the beginning of the next to improve pairing. I have not unrolled the loop
because it is likely to repeat relatively few times. The string should of 
course be aligned by 4. The code will always read past the end of the 
string, so the string should not be placed at the end of a segment.

The loop body has an odd number of instructions so there is one unpaired.  
Making the branch instruction unpaired rather than one of the other
instructions has the advantage that it saves 1 clock cycle when the branch
is mispredicted.

The  TEST ECX,00008080H  instruction is non-pairable. You could use the 
pairable instruction  OR CH,CL  here instead, but then you would have to
put in a NOP or something to avoid the penalties of consequtive branches.
Another problem with  OR CH,CL  is that it would cause a partial register
stall on a PPro or PII. So I have chosen to keep the unpairable TEST
instruction.

Handling 4 bytes simultaneously can be quite difficult. The code uses a
formula which generates a nonzero value for a byte if, and only if, the
byte is zero. This makes it possible to test all four bytes in one
operation. This algorithm involves the subtraction of 1 from all bytes (in
the LEA instruction). I have not masked out the highest bit of each byte
before subtracting, as I did in the previous example, so the subtraction may
generate a borrow to the next byte, but only if it is zero, and this is 
exactly the situation where we don't care what the next byte is, because we 
are searching forwards for the first zero. If we were searching backwards 
then we would have to re-read the dword after detecting a zero, and then
test all four bytes to find the last zero, or use BSWAP to reverse the order
of the bytes.

If you want to search for a byte value other than zero, then you may XOR all 
four bytes with the value you are searching for, and then use the method
above to search for zero.


Loops with MMX operations (PMMX)
--------------------------------
Handling multiple operands in the same register is easier on the MMX
processors because they have special instructions and special 64 bit
registers for exactly this purpose.

Returning to the problem of adding two to all bytes in an array, we may
take advantage of the MMX instructions:

Example 11:
.data
ALIGN   8
ADDENTS DQ      0202020202020202h       ; specify byte to add eight times
A       DD      ?                       ; address of byte array
N       DD      ?                       ; number of iterations

.code
        MOV     ESI, [A]
        MOV     ECX, [N]
        MOVQ    MM2, [ADDENTS]
        JMP     SHORT L2
        ; top of loop
L1:     MOVQ    [ESI-8], MM0    ; store result
L2:     MOVQ    MM0, MM2        ; load addents
        PADDB   MM0, [ESI]      ; add eight bytes in one operation
        ADD     ESI, 8
        DEC     ECX
        JNZ     L1
        MOVQ    [ESI-8], MM0    ; store last result
        EMMS

The store instruction is moved to after the loop control instructions in
order to avoid a store stall.

This loop takes 4 clocks because the PADDB instruction doesn't pair with
ADD ESI,8. (A MMX instruction with memory access cannot pair with a non-
MMX instruction or with another MMX instruction with memory access).
We could get rid of  ADD ESI,8  by using ECX as index, but that would give
an AGI stall.

Since the loop overhead is considerable we might want to unroll the loop:

Example 12:

.data
ALIGN   8
ADDENTS DQ      0202020202020202h       ; specify byte to add eight times
A       DD      ?                       ; address of byte array
N       DD      ?                       ; number of iterations

.code
        MOVQ    MM2, [ADDENTS]
        MOV     ESI, [A]
        MOV     ECX, [N]
        MOVQ    MM0, MM2
        MOVQ    MM1, MM2
L3:     PADDB   MM0, [ESI]
        PADDB   MM1, [ESI+8]
        MOVQ    [ESI], MM0
        MOVQ    MM0, MM2
        MOVQ    [ESI+8], MM1
        MOVQ    MM1, MM2
        ADD     ESI, 16
        DEC     ECX
        JNZ     L3
        EMMS

This unrolled loop takes 6 clocks per iteration for adding 16 bytes.
The PADDB instructions are not paired. The two threads are interleaved
to avoid a store stall.

Using the MMX instructions has a high penalty if you are using floating
point instructions shortly afterwards, so there may still be situations
where you want to use 32 bit registers as in example 9.


Loops with floating point operations (PPlain and PMMX)
------------------------------------------------------
The methods of optimizing floating point loops are basically the same as
for integer loops, although the floating point instructions are
overlapping rather than pairing.

Consider the C language code:

  int i, n;  double * X;  double * Y;  double DA;
  for (i=0; i<n; i++)  Y[i] = Y[i] - DA * X[i];

This piece of code (called DAXPY) has been studied extensively because
it is the key to solving linear equations.

Example 11:

DSIZE   = 8                                      ; data size
        MOV     EAX, [N]                         ; number of elements
        MOV     ESI, [X]                         ; pointer to X
        MOV     EDI, [Y]                         ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]             ; point to end of X
        SUB     ECX, EAX                         ; -N
        LEA     EDI, [EDI+DSIZE*EAX]             ; point to end of Y
        JZ      SHORT L3                         ; test for N = 0
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]        ; DA * X[0]
        JMP     SHORT L2                         ; jump into loop
L1:     FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]        ; DA * X[i]
        FXCH                                     ; get old result
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]  ; store Y[i]
L2:     FSUBR   DSIZE PTR [EDI+DSIZE*ECX]        ; subtract from Y[i]
        INC     ECX                              ; increment index
        JNZ     L1                               ; loop
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]  ; store last result
L3:

Here we are using the same methods as in example 6: Using the loop counter
as index register and counting through negative values up to zero. The end
of one operation overlaps with the beginning of the next.

The interleaving of floating point operations work perfectly here: The 2
clock stall between FMUL and FSUBR is filled with the FSTP of the previous
result. The 3 clock stall between FSUBR and FSTP is filled with the loop
overhead and the first two instructions of the next operation. An AGI stall
has been avoided by reading the only parameter that doesn't depend on the
index in the first clock cycle after the index has been incremented.

This solution takes 6 clock cycles per operation, which is better than the
unrolled solution published by Intel!

Unrolling floating point loops (PPlain and PMMX)
------------------------------------------------
The DAXPY loop unrolled by 3 is quite complicated:

Example 12:
DSIZE = 8                                 ; data size
IF DSIZE EQ 4
SHIFTCOUNT = 2
ELSE
SHIFTCOUNT = 3
ENDIF

        MOV     EAX, [N]                  ; number of elements
        MOV     ECX, 3*DSIZE              ; counter bias
        SHL     EAX, SHIFTCOUNT           ; DSIZE*N
        JZ      L4                        ; N = 0
        MOV     ESI, [X]                  ; pointer to X
        SUB     ECX, EAX                  ; (3-N)*DSIZE
        MOV     EDI, [Y]                  ; pointer to Y
        SUB     ESI, ECX                  ; end of pointer - bias
        SUB     EDI, ECX
        TEST    ECX, ECX
        FLD     DSIZE PTR [ESI+ECX]       ; first X
        JNS     SHORT L2                  ; less than 4 operations
L1:     ; main loop
        FMUL    DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+ECX+DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX]
        FXCH
        FLD     DSIZE PTR [ESI+ECX+2*DSIZE]
        FMUL    DSIZE PTR [DA]
        FXCH
        FSUBR   DSIZE PTR [EDI+ECX+DSIZE]
        FXCH    ST(2)
        FSTP    DSIZE PTR [EDI+ECX]
        FSUBR   DSIZE PTR [EDI+ECX+2*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+DSIZE]
        FLD     DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        ADD     ECX, 3*DSIZE
        JS      L1                        ; loop
L2:     FMUL    DSIZE PTR [DA]            ; finish leftover operation
        FSUBR   DSIZE PTR [EDI+ECX]
        SUB     ECX, 2*DSIZE              ; change pointer bias
        JZ      SHORT L3                  ; finished
        FLD     DSIZE PTR [DA]            ; start next operation
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
        JZ      SHORT L3                  ; finished
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+ECX+3*DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
        FSUBR   DSIZE PTR [EDI+ECX+3*DSIZE]
        ADD     ECX, 1*DSIZE
L3:     FSTP    DSIZE PTR [EDI+ECX+2*DSIZE]
L4:

The reason why I am showing you how to unroll a loop by 3 is not to 
recommend it, but to warn you how difficult it is! Be prepared to spend a 
considerable amount of time debugging and verifying your code when doing
something like this. There are several problems to take care of: In most 
cases, you cannot remove all stalls from a floating point loop unrolled by
less than 4 unless you convolute it (i.e. there are unfinished operations
at the end of each run which are being finished in the next run). The last
FLD in the main loop above is the beginning of the first operation in the
next run. It would be tempting here to make a solution which reads past
the end of the array and then discards the extra value in the end, as in
example 9 and 10, but that is not recommended in floating point loops
because the reading of the extra value might generate a denormal operand
exception in case the memory position after the array doesn't contain a
valid floating point number. To avoid this, we have to do at least one
more operation after the main loop.  

The number of operations to do outside an unrolled loop would normally be
N MODULO R, where N is the number of operations, and R is the unrolling
factor.  But in the case of a convoluted loop, we have to do one more,
i.e. (N-1) MODULO R + 1, for the abovementioned reason.

Normally, we would prefer to do the extra operations before the main loop,
but here we have to do them afterwards for two reasons: One reason is to
take care of the leftover operand from the convolution. The other reason
is that calculating the number of extra operations requires a division if
R is not a power of 2, and a division is time consuming. Doing the extra
operations after the loop saves the division.

The next problem is to calculate how to bias the loop counter so that it
will change sign at the right time, and adjust the base pointers so as to
compensate for this bias. Finally, you have to make sure the leftover
operand from the convolution is handled correctly for all values of N.

The epilog code doing 1-3 operations could have been implemented as a
separate loop, but that would cost an extra branch misprediction, so the 
solution above is faster.  

Now that I have scared you by demonstrating how difficult it is to unroll
by 3, I will show you that it is much easier to unroll by 4:

Example 13:

DSIZE   = 8                               ; data size
        MOV     EAX, [N]                  ; number of elements
        MOV     ESI, [X]                  ; pointer to X
        MOV     EDI, [Y]                  ; pointer to Y
        XOR     ECX, ECX
        LEA     ESI, [ESI+DSIZE*EAX]      ; point to end of X
        SUB     ECX, EAX                  ; -N
        LEA     EDI, [EDI+DSIZE*EAX]      ; point to end of Y
        TEST    AL,1                      ; test if N is odd
        JZ      SHORT L1
        FLD     DSIZE PTR [DA]            ; do the odd operation
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        INC     ECX                       ; adjust counter
        FSTP    DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
L1:     TEST    AL,2                      ; test for possibly 2 more operations
        JZ      L2
        FLD     DSIZE PTR [DA]            ; N MOD 4 = 2 or 3. Do two more
        FMUL    DSIZE PTR [ESI+DSIZE*ECX]
        FLD     DSIZE PTR [DA]
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        ADD     ECX, 2                    ; counter is now divisible by 4
L2:     TEST    ECX, ECX
        JZ      L4                        ; no more operations
L3:     ; main loop:
        FLD     DSIZE PTR [DA]
        FLD     DSIZE PTR [ESI+DSIZE*ECX]
        FMUL    ST,ST(1)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
        FMUL    ST,ST(2)
        FLD     DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
        FMUL    ST,ST(3)
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX]
        FXCH    ST(3)
        FMUL    DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FXCH    ST(2)
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FXCH
        FSUBR   DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        FXCH    ST(3)
        FSTP    DSIZE PTR [EDI+DSIZE*ECX]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
        FSTP    DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
        ADD     ECX, 4                             ; increment index by 4
        JNZ     L3                                 ; loop
L4:

It is usually quite easy to find a stall-free solution when unrolling by
4, and there is no need for convolution. The number of extra operations
to do outside the main loop is N MODULO 4, which can be calculated easily
without division, simply by testing the two lowest bits in N. The extra
operations are done before the main loop rather than after, to make the
handling of the loop counter simpler.

The tradeoff of loop unrolling is that the extra operations outside the
loop are slower due to incomplete overlapping and possible branch
mispredictions, and the first time penalty is higher because of increased
code size.

As a general recommendation, I would say that if N is big or if
convoluting the loop without unrolling cannot remove enough stalls, then
you should unroll critical integer loops by 2 and floating point loops
by 4.


17. DISCUSSION OF SPECIAL INSTRUCTIONS
======================================

17.1 LEA  (all processors)
--------------------------
The LEA instruction is useful for many purposes because it can do a shift,
two additions, and a move in just one pairable instruction taking one clock
cycle.  Example:
LEA EAX,[EBX+8*ECX-1000]
is much faster than
MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
The LEA instruction can also be used to do an add or shift without changing
the flags. The source and destination need not have the same word size, so
LEA EAX,[BX]  is a useful replacement for  MOVZX EAX,BX.

You must be aware, however, that the LEA instruction will suffer an AGI
stall on the PPlain and PMMX if it uses a base or index register which
has been changed in the preceding clock cycle.

Since the LEA instruction is pairable in the V-pipe (on PPlain and PMMX)
and shift instructions are not, you may use LEA as a substitute for a
SHL by 1, 2, or 3 if you want the instruction to execute in the V-pipe.

The 32 bit processors have no documented addressing mode with a scaled index
register and nothing else, so an instruction like  LEA EAX,[EAX*2]  is
actually coded as  LEA EAX,[EAX*2+00000000]  with an immediate displacement
of 4 bytes.  You may reduce the instruction size by instead writing  LEA
EAX,[EAX+EAX] or even better  ADD EAX,EAX.  The latter code cannot have an
AGI delay. If you happen to have a register which is zero (like a loop
counter after a loop), then you may use it as a base register to reduce the
code size:

LEA EAX,[EBX*4]     ; 7 bytes
LEA EAX,[ECX+EBX*4] ; 3 bytes


17.2 MOV [MEM],ACCUM  (PPlain and PMMX)
---------------------------------------
The instructions  MOV [mem],AL   MOV [mem],AX   MOV [mem],EAX
are treated by the pairing circuitry as if they were writing to the
accumulator. Thus the following instructions do not pair:
MOV [mydata], EAX
MOV EBX, EAX

This problem occurs only with the short form of the MOV instruction which
can not have a base or index register, and which can only have the
accumulator as source. You can avoid the problem by using another register,
by reordering your instructions, by using a pointer, or by hard-coding the
general form of the MOV instruction.

In 32 bit mode you can write the general form of MOV [mem],EAX:
DB 89H, 05H
DD OFFSET DS:mem

In 16 bit mode you can write the general form of MOV [mem],AX:
DB 89H, 06H
DW OFFSET DS:mem

To use AL instead of (E)AX, you replace 89H with 88H

This bug has not been fixed in the MMX version.


17.3 TEST  (PPlain and PMMX)
----------------------------
The TEST instruction with an immediate operand is only pairable if the
destination is AL, AX, or EAX.

TEST register,register  and  TEST register,memory  is always pairable.

Examples:
TEST ECX,ECX                ; pairable
TEST [mem],EBX              ; pairable
TEST EDX,256                ; not pairable
TEST DWORD PTR [EBX],8000H  ; not pairable
To make it pairable, use any of the following methods:
MOV EAX,[EBX] / TEST EAX,8000H
MOV EDX,[EBX] / AND  EDX,8000H
MOV AL,[EBX+1] / TEST AL,80H
MOV AL,[EBX+1] / TEST AL,AL  ; (result in sign flag)
It is also possible to test a bit by shifting it into the carry flag:
MOV EAX,[EBX] / SHR EAX,16   ; (result in carry flag)

(The reason for this non-pairability is probably that the first byte of
the 2-byte instruction is the same as for some other non-pairable
instructions, and the processor cannot afford to check the second byte too
when determining pairability.)


17.4 XCHG  (all processors)
---------------------------
The  XCHG register,[memory]  instruction is dangerous. By default this
instruction has an implicit LOCK prefix which prevents it from using the
cache. The instruction is therefore very time consuming, and should always
be avoided.


17.5 rotates through carry  (PPlain and PMMX)
---------------------------------------------
RCR and RCL with a count different from one are slow and should be avoided.


17.6 string instructions  (PPlain and PMMX)
-------------------------------------------
String instructions without a repeat prefix are too slow, and should always
be replaced by simpler instructions. The same applies to LOOP and JECXZ.  

String instructions with repeat may be optimal. Always use the dword version
if possible, and make sure that both source and destination are aligned by 4.

REP MOVSD is the fastest way to move blocks of data when the destination
is in the level 1 cache. See section 19 for an alternative. Note that
while the REP MOVS instruction writes a word to the destination, it reads
the next word from the source in the same clock cycle. According to rule
10.3 above, you have a cache bank conflict if bit 2-4 are the same in
these two addresses. In other words, you will get a penalty of one clock
extra per iteration if ESI+(wordsize)-EDI is divisible by 32. The easiest
way to avoid cache bank conflicts is to use the DWORD version and align
both source and destination by 8. Never use MOVSB or MOVSW in optimized
code, not even in 16 bit mode.

REP STOSD is optimal when the destination is in the cache.

REP LOADS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
loops. See section 16 example 10 for an alternative to REP SCASB.
REP CMPS may suffer cache bank conflicts if bit 2-4 are the same in ESI
and EDI.


17.7 bit scan  (PPlain and PMMX)
--------------------------------
BSF and BSR are the poorest optimized instructions on the PPlain and PMMX,
taking approximately 11 + 2*n clock cycles, where n is the number of zeros
skipped.

The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with old processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20        ; isolate exponent
        SUB     ECX,3FFH      ; adjust
        TEST    EAX,EAX       ; clear zero flag
BS1:

The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX
        AND     EAX,ECX
        MOV     DWORD PTR [TEMP],EAX
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX       ; clear zero flag
BS2:

These emulation codes should not be used on the PPro and PII, where the
bit scan instructions take only 1 or 2 clocks, and where the emulation
codes shown above have two partial memory stalls.


17.8 bit test  (PPlain and PMMX)
--------------------------------
BT, BTC, BTR, and BTS instructions should preferably be replaced by
instructions like TEST, AND, OR, XOR, or shifts.


17.9 integer multiplication  (PPlain and PMMX)
----------------------------------------------
An integer multiplication takes approximately 9 clock cycles. It is
therefore advantageous to replace a multiplication by a constant with a
combination of other instructions such as SHL, ADD, SUB, and LEA.
Example:
IMUL EAX,10
can be replaced with
MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
or
LEA EAX,[EAX+4*EAX] / ADD EAX,EAX

Floating point multiplication is faster than integer multiplication on
PPlain and PMMX, but the time used to convert integers to float and
convert the product back again is usually more than the time saved by
using floating point multiplication, except when the number of conversions
is low compared with the number of multiplications.
MMX multiplication is fast, but is only available with 16-bit operands.


17.10 division  (all processors)
--------------------------------
Division is quite time consuming. The DIV instruction takes 17, 25, or 41
clock cycles for byte, word, and dword divisors respectively on the
PPlain and PMMX. The IDIV instruction takes 5 clock cycles more. It is
therefore preferable to use the smallest operand size possible that won't
generate an overflow, even if it costs an operand size prefix, and use
unsigned division if possible.

Division by a constant  (all processors)
----------------------------------------
Unsigned division by a power of two can be done with SHR.  Division of a
signed number by a power of two can be done with SAR, but the result with
SAR is rounded towards minus infinity, whereas the result with IDIV is
truncated towards zero.

Dividing by a constant can be done by multiplying with the reciprocal.
To calculate q = x / d, you first calculate the reciprocal f = 2^r / d,
where r defines the position of the binary decimal point (radix point).
Then multiply x with f and shift right r positions. The maximum value
of r is 32+b, where b is the number of binary digits in d minus 1.
(b is the highest integer for which 2^b <= d). Use r = 32+b to cover
the maximum range for the value of the dividend x.

This method needs some refinement in order to compensate for rounding
errors. The following algorithm will give you the correct result for
unsigned integer division with truncation, i.e. the same as the DIV
instruction does:

  b = the number of significant bits in d - 1
  r = 32 + b
  f = 2^r / d
  If f is an integer then d is a power of 2: goto case A.
  If f is not an integer, then check if the fractional part of f is < 0.5
  If the fractional part of f < 0.5: goto case B.
  If the fractional part of f > 0.5: goto case C.

  case A: (d = 2^b)
  result = x SHR b

  case B: (fractional part of f < 0.5)
  round f down to nearest integer
  result = ((x+1) * f) SHR r

  case C: (fractional part of f > 0.5)
  round f up to nearest integer
  result = (x * f) SHR r

Example:
Assume that you want to divide by 5.
5 = 00000101b.
b = number of significant binary digits - 1 = 2
r = 32+2 = 34
f = 2^34 / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)
The fractional part is greater than a half: use case C.
Round f up to 0CCCCCCCDh.

This code divides EAX by 5 and returns the result in EDX:

        MOV     EDX,0CCCCCCCDh
        MUL     EDX
        SHR     EDX,2

After the multiplication, EDX contains the product shifted right 32
places. Since r = 34 you have to shift 2 more places to get the result.
To divide by 10 you just change the last line to SHR EDX,3.

In case B you would have:

        INC     EAX
        MOV     EDX,f
        MUL     EDX
        SHR     EDX,b

This code works for all values of x except 0FFFFFFFFH which gives zero
because of overflow in the INC instruction. If x = 0FFFFFFFFH is possible,
then change the code to:

        MOV     EDX,f
        ADD     EAX,1
        JC      DOVERFL
        MUL     EDX
DOVERFL:SHR     EDX,b

If the value of x is limited, then you may use a lower value of r, i.e.
fewer digits. There can be several reasons to use a lower value of r:
- you may set r = 32 to avoid the SHR EDX,b in the end.
- you may set r = 16+b and use a multiplication instruction that gives
  a 32 bit result rather than 64 bits. This will free the EDX register:
  IMUL EAX,0CCCDh / SHR EAX,18
- you may choose a value of r that gives case C rather than case B in
  order to avoid the INC EAX instruction

The maximum value for x in these cases is at least 2^(r-b), sometimes
higher. You have to do a systematic test if you want to know the exact
maximum value of x for which your code works correctly.

You may want to replace the slow multiplication instruction with faster
instructions as explained in paragraph 17.9.

The following example divides EAX by 10 and returns the result in EAX.
I have chosen r=17 rather than 19 because it happens to give a code,
which is easier to optimize, and covers the same range for x.
f = 2^17 / 10 = 3333h, case B: q = (x+1)*3333h:

        LEA     EBX,[EAX+2*EAX+3]
        LEA     ECX,[EAX+2*EAX+3]
        SHL     EBX,4
        MOV     EAX,ECX
        SHL     ECX,8
        ADD     EAX,EBX
        SHL     EBX,8
        ADD     EAX,ECX
        ADD     EAX,EBX
        SHR     EAX,17

A systematic test shows that this code works correctly for all x < 10004H.

Repeated division by the same value  (all processors)
-----------------------------------------------------
If the divisor is not known at assembly time, but you are dividing
repeatedly with the same divisor, then you may use the same method as
above. The code has to distinguish between case A, B and C and calculate
f before doing the divisions.

The code that follows shows how to do many divisions with the same
divisor (unsigned division with truncation). First call SET_DIVISOR
to specify the divisor and calculate the reciprocal, then call
DIVIDE_FIXED for each value to divide by the same divisor.

.data

RECIPROCAL_DIVISOR DD ?                ; rounded reciprocal divisor
CORRECTION         DD ?                ; case A: -1, case B: 1, case C: 0
BSHIFT             DD ?                ; number of bits in divisor - 1

.code

SET_DIVISOR PROC NEAR                  ; divisor in EAX
        PUSH    EBX
        MOV     EBX,EAX
        BSR     ECX,EAX                ; b = number of bits in divisor - 1
        MOV     EDX,1
        JZ      ERROR                  ; error: divisor is zero
        SHL     EDX,CL                 ; 2^b
        MOV     [BSHIFT],ECX           ; save b
        CMP     EAX,EDX
        MOV     EAX,0
        JE      SHORT CASE_A           ; divisor is a power of 2
        DIV     EBX                    ; 2^(32+b) / d
        SHR     EBX,1                  ; divisor / 2
        XOR     ECX,ECX
        CMP     EDX,EBX                ; compare remainder with divisor/2
        SETBE   CL                     ; 1 if case B
        MOV     [CORRECTION],ECX       ; correction for rounding errors
        XOR     ECX,1
        ADD     EAX,ECX                ; add 1 if case C
        MOV     [RECIPROCAL_DIVISOR],EAX ; rounded reciprocal divisor
        POP     EBX
        RET
CASE_A: MOV     [CORRECTION],-1        ; remember that we have case A
        POP     EBX
        RET
SET_DIVISOR     ENDP

DIVIDE_FIXED PROC NEAR                 ; dividend in EAX, result in EAX
        MOV     EDX,[CORRECTION]
        MOV     ECX,[BSHIFT]
        TEST    EDX,EDX
        JS      SHORT DSHIFT           ; divisor is power of 2
        ADD     EAX,EDX                ; correct for rounding error
        JC      SHORT DOVERFL          ; correct for overflow
        MUL     [RECIPROCAL_DIVISOR]   ; multiply with reciprocal divisor
        MOV     EAX,EDX
DSHIFT: SHR     EAX,CL                 ; adjust for number of bits
        RET
DOVERFL:MOV     EAX,[RECIPROCAL_DIVISOR] ; dividend = 0FFFFFFFFH
        SHR     EAX,CL                 ; do division by shifting
        RET
DIVIDE_FIXED    ENDP

This code gives the same result as the DIV instruction for
0 <= x < 2^32,  0 < d < 2^32.

Note: The line  JC DOVERFL  and its target are not needed if you are
certain that x < 0FFFFFFFFH.

If powers of 2 occur so seldom that it is not worth optimizing for them,
then you may leave out the jump to DSHIFT and in stead do a multiplication
with CORRECTION = 0 for case A.

If the divisor changes so often that SET_DIVISOR needs optimizing, then
you may replace the BSR instruction with the code given in section 17.7.


floating point division  (all processors)
-----------------------------------------
Floating point division takes 39 clock cycles for the highest precision
on the PPlain and PMMX. You can save time by specifying a lower precision
in the floating point control word (only FDIV and FIDIV are faster at low
precision, no other instructions can be speeded up this way).

It is possible to do a floating point division and an integer division in
parallel to save time.  Example: A = A1 / A2;  B = B1 / B2

        FILD    [B1]
        FILD    [B2]
        MOV     EAX, [A1]
        MOV     EBX, [A2]
        CDQ
        FDIV
        DIV     EBX
        FISTP   [B]
        MOV     [A], EAX
        
(make sure you set the floating point control word to the desired
rounding method)

Obviously, you should always try to minimize the number of divisions.
Floating point division with a constant or repeated division with the same
value should of course by done by multiplying with the reciprocal.
But there are many other situations where you can reduce the number of
divisions. For example:
if (A/B > C)...  can be rewritten as  if (A > B*C)...  when B is
positive, and the opposite when B is negative.

A/B + C/D  can be rewritten as  (A*D + C*B) / (B*D)

If you are using integer division, then you should be aware that the
rounding errors may be different when you rewrite the formulas.


17.11 WAIT  (all processors)
----------------------------
You can often increase speed by omitting the WAIT instruction.
The WAIT instruction has three functions:

  a. The old 8087 processor requires a WAIT before _every_ floating point
     instruction to make sure the coprocessor is ready to receive it.

  b. WAIT is used to coordinate memory access between the floating point
     unit and the integer unit. Examples:
     b.1.  FISTP [mem32]
           WAIT             ; wait for floating point unit to write before..
           MOV EAX,[mem32]  ; reading the result with the integer unit

     b.2.  FILD [mem32]
           WAIT             ; wait for floating point unit to read value..
           MOV [mem32],EAX  ; before overwriting it with integer unit

     b.3.  FLD QWORD PTR [ESP]
           WAIT             ; prevent an accidental hardware interrupt from..
           ADD ESP,8        ; overwriting value on stack before it is read

  c. WAIT is sometimes used to check for exceptions. It will generate an
     interrupt if an unmasked exception bit in the floating point status
     word has been set by a preceding floating point instruction.

Regarding a:
The function in point a is never needed on any other processors than the old
8087. Unless you want your code to be compatible with the 8087 you should
tell your assembler not to put in these WAITs by specifying a higher
processor. A 8087 floating point emulator also inserts WAIT instructions.
You should therefore tell your assembler not to generate emulation code
unless you need it.

Regarding b:
WAIT instructions to coordinate memory access are definitely needed on the
8087 and 80287 but not on the Pentiums. It is not quite clear whether it is
needed on the 80387 and 80486. I have made several tests on these Intel
processors and not been able to provoke any error by omitting the WAIT on
any 32 bit Intel processor, although Intel manuals say that the WAIT is
needed for this purpose except after FNSTSW and FNSTCW. Omitting WAIT
instructions for coordinating memory access is not 100 % safe, even when
writing 32 bit code, because the code may be able to run on a 80386 main
processor with a 287 coprocessor, which requires the WAIT. Also, I have not
tested all possible hardware and software combinations, so there may be
other situations where the WAIT is needed.

If you want to be certain that your code will work on _any_ 32 bit processor
(including non-Intel processors) then I would recommend that you include the
WAIT here in order to be safe.

Regarding c:
The assembler automatically inserts a WAIT for this purpose before the
following instructions:  FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW.
You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT
is unneccessary in most cases because these instructions without WAIT will
still generate an interrupt on exceptions except for FNCLEX and FNINIT on
the 80387. (There is some inconsistency about whether the IRET from the
interrupt points to the FN.. instruction or to the next instruction).

Almost all other floating point instructions will also generate an interrupt
if a previous floating point instruction has set an unmasked exception bit,
so the exception is likely to be detected sooner or later anyway. You may
insert a WAIT after the last floating point instruction in your program to
be sure to catch all exceptions.

You may still need the WAIT if you want to know exactly where an exception
occurred in order to be able to recover from the situation. Consider, for
example, the code under b.3 above: If you want to be able to recover from
an exception generated by the FLD here, then you need the WAIT because an
interrupt after ADD ESP,8 would overwrite the value to load.


17.12 FCOM + FNSTSW AX  (all processors)
----------------------------------------
The PPro and PII processors have FCOMI instructions to avoid the slow
FNSTSW. On processors without FCOMI instructions, the usual way of doing
floating point comparisons is:

FLD [a]
FCOMP [b]
FSTSW AX
SAHF
JB ASmallerThanB

You may improve this code by using FNSTSW AX rather than FSTSW AX and
test AH directly rather than using the non-pairable SAHF.
(TASM version 3.0 has a bug with the FNSTSW AX instruction)

FLD [a]
FCOMP [b]
FNSTSW AX
SHR AH,1
JC ASmallerThanB

Testing for zero or equality:

FTST
FNSTSW AX
AND AH,40H
JNZ IsZero     ; (the zero flag is inverted!)

Test if greater:

FLD [a]
FCOMP [b]
FNSTSW AX
AND AH,41H
JZ AGreaterThanB

Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as
it would produce a partial register stall on the PPro and PII.

On the PPlain and PMMX, the FNSTSW instruction takes 2 clocks, but it is
delayed for an additional 4 clocks after any floating point instruction
because it is waiting for the status word to retire from the pipeline.
This delay comes even after FNOP which cannot change the status word,
but not after integer instructions. You can fill the latency between FCOM
and FNSTSW with integer instructions taking up to four clock cycles.
A paired FXCH immediately after FCOM doesn't delay the FNSTSW, not even
if the pairing is imperfect:

FCOM                  ; clock 1
FXCH                  ; clock 1-2 (imperfect pairing due to integer instruction)
INC DWORD PTR [EBX]   ; clock 3-5
FNSTSW AX             ; clock 6-7

You may want to use FCOM rather than FTST here because FTST is not pairable.
Remember to include the N in FNSTSW. FSTSW (without N) has a WAIT prefix
which delays it further.


It is sometimes faster to use integer instructions for comparing floating
point values, as described in paragraph 18 below.


17.13 FLDCW  (PPro and PII)
---------------------------
The PPro and PII have a serious stall after the FLDCW instruction if
followed by any floating point instruction which reads the control word
(which almost all floating point instructions do).

When C or C++ code is compiled it often generates a lot of FLDCW
instructions because conversion of floating point numbers to integers
is done with truncation while other floating point instructions use
rounding. After translation to assembly, you can improve this code by
using rounding in stead of truncation where possible, or by moving the
FLDCW out of a loop where truncation is needed inside the loop.


17.14 freeing floating point registers  (PPlain and PMMX)
---------------------------------------------------------
You have to free all used floating point registers before exiting a
subroutine, except for any register used for the result.

The fastest way to free one register is  FSTP ST.
The fastest way to free two registers is  FCOMPP.
It is not recommended to use  FFREE.


17.15 FISTP  (PPlain and PMMX)
------------------------------
Converting a floating point number to integer is normally done like this:
        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]

An alternative method is:

.DATA
ALIGN 8
TEMP    DQ      ?
MAGIC   DD      59C00000H   ; f.p. representation of 2^51 + 2^52

.CODE
        FADD    [MAGIC]
        FSTP    QWORD PTR [TEMP]
        MOV     EAX, DWORD PTR [TEMP]

Adding the 'magic number' of 2^51 + 2^52 has the effect that any integer
between -2^31 and +2^31 will be aligned in the lower 32 bits when storing
as a double precision floating point number. The result is the same as you
get with FISTP for all rounding methods except truncation towards zero.
The result is different from FISTP if the control word specifies truncation
or in case of overflow. You may need a WAIT instruction for compatibility
with older processors, see section 17.11.

This method is not faster than using FISTP, but it gives better scheduling
opportunities because there is a 3 clock void between FADD and FSTP which
may be filled with other instrucions. You may multiply or divide the number
by a power of 2 in the same operation by doing the opposite to the magic
number. You may also add a constant by adding it to the magic number,
which then has to be double precision.


17.16 FPTAN  (all processors)
-----------------------------
According to the manuals, FPTAN returns two values X and Y and leaves it to
the programmer to divide Y with X to get the result, but in fact it always
returns 1 in X so you can save the division. My tests show that on all 32
bit Intel processors with floating point unit or coprocessor, FPTAN always
returns 1 in X regardless of the argument. If you want to be absolutely sure
that your code will run correctly on all processors, then you may test if X
is 1, which is faster than dividing with X. The Y value may be very high,
but never infinity, so you don't have to test if Y contains a valid number
if you know that the argument is valid.


18. USING INTEGER INSTRS TO DO FLOATING POINT OPERATIONS (PPlain and PMMX)
==========================================================================
Integer instructions are generally faster than floating point instructions,
so it is often advantageous to use integer instructions for doing simple
floating point operations. The most obvious example is moving data. Example:

FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]

Change to:

MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX

The former code takes 4 clocks, the latter takes 2.

Testing if a floating point value is zero:

The floating point value of zero is usually represented as 32 or 64 bits of 
zero, but there is a pitfall here: The sign bit may be set! Minus zero is
regarded as a valid floating point number, and the processor may actually
generate a zero with the sign bit set if for example multiplying a negative
number with zero. So if you want to test if a floating point number is zero,
you should not test the sign bit. Example:

FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero

Use integer instructions instead, and shift out the sign bit:

MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero

The former code takes 9 clocks, the latter takes only 2.
If the floating point number is double precision (QWORD) then you only have
to test bit 32-62. If they are zero, then the lower half will also be zero
if it is a normal floating point number.

Testing if negative:
A floating point number is negative if the sign bit is set and at least one
other bit is set. Example:
MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative

Manipulating the sign bit:
You can change the sign of a floating point number simply by flipping the
sign bit. Example:
XOR BYTE PTR [a] + (TYPE a) - 1, 80H

Likewise you may get the absolute value of a floating point number by simply
ANDing out the sign bit.

Comparing numbers:
Floating point numbers are stored in a unique format which allows you to use
integer instructions for comparing floating point numbers, except for the
sign bit. If you are certain that two floating point numbers both are normal
and positive then you may simply compare them as integers. Example:

FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB

Change to:

MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB

This method only works if the two numbers have the same precision and you
are certain that none of the numbers have the sign bit set.

If negative numbers are possible, then you have to convert the
negative numbers to 2-complement, and do a signed compare:

        MOV     EAX, [a]
        MOV     EBX, [b]
        MOV     ECX, EAX
        MOV     EDX, EBX
        SAR     ECX, 31              ; copy sign bit
        AND     EAX, 7FFFFFFFH       ; remove sign bit
        SAR     EDX, 31
        AND     EBX, 7FFFFFFFH
        XOR     EAX, ECX             ; make 2-complement if sign bit was set
        XOR     EBX, EDX
        SUB     EAX, ECX
        SUB     EBX, EDX
        CMP     EAX, EBX
        JL      ASmallerThanB        ; signed comparison

This method works for all normal floating point numbers, including -0.


19. USING FLOATING POINT INSTRS TO DO INTEGER OPERATIONS (PPlain and PMMX)
==========================================================================

19.1 Moving data  (PPlain and PMMX)
-----------------------------------
Floating point instructions can be used to move 8 bytes at a time:
FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
This is only an advantage if the destination is not in the cache. The
optimal way to move a block of data to uncached memory on the PPlain is:

TOP:    FILD    QWORD PTR [ESI]
        FILD    QWORD PTR [ESI+8]
        FXCH
        FISTP   QWORD PTR [EDI]
        FISTP   QWORD PTR [EDI+8]
        ADD     ESI, 16
        ADD     EDI, 16
        DEC     ECX
        JNZ     TOP

The source and destination should of course be aligned by 8. The extra
time used by the slow FILD and FISTP instructions is compensated for by
the fact that you only have to do half as many write operations.
Note that this method is only advantageous on the PPlain and PMMX and
only if the destination is not in the level 1 cache. On all other
processors it is faster to use  REP MOVSD.

On MMX processors it is faster to use MMX instructions to move eight
bytes at a time:

TOP:    MOVQ    MM0,[ESI]
        MOVQ    [EDI],MM0
        ADD     ESI,8
        ADD     EDI,8
        DEC     ECX
        JNZ     TOP

There is no need to unroll this loop or optimize it further if cache
misses are expected, because memory access is the bottleneck here, not
instruction execution.


19.2 Integer multiplication  (PPlain and PMMX)
----------------------------------------------
Floating point multiplication is faster than integer multiplication on the
PPlain and PMMX, but the price for converting integer factors to float
and converting the result back to integer is high, so floating point
multiplication is only advantageous if the number of conversions needed is
low compared with the number of multiplications. (It may be tempting to
use denormal floating point operands to save some of the conversions here,
but the handling of denormals is very slow, so this is not a good idea!)

On the PMMX, MMX multiplication instructions are faster than integer
multiplication, and can be pipelined to a throughput of one multiplication
per clock cycle, so this may be the best solution for doing fast
multiplication on the PMMX, if you can live with 16 bit precision.

Integer multiplication is faster than floating point on PPro and PII.


19.3 Integer division  (PPlain and PMMX)
----------------------------------------
Floating point division is not faster than integer division, but you can do
other integer operations (including integer division, but not integer
multiplication) while the floating point unit is working on the division.
See paragraph 17.10 above for an example.


19.4 Converting binary to decimal numbers  (all processors)
-----------------------------------------------------------
Using the FBSTP instruction is a simple and convenient way of converting a
binary number to decimal, although not necessarily the fastest method.


20. LIST OF INTEGER INSTRUCTIONS  (PPlain and PMMX)
===================================================
Explanations:
Operands: r=register, m=memory, i=immediate data, sr=segment register
m32= 32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably.

Pairability:
u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
np=not pairable

Opcode                 Operands            Clock cycles        Pairability
----------------------------------------------------------------------------
NOP                                        1                   uv
MOV                    r/m, r/m/i          1                   uv
MOV                    r/m, sr             1                   np
MOV                    sr , r/m            >= 2 b)             np
MOV                    m  , accum          1                   uv h)
XCHG                   (E)AX, r            2                   np
XCHG                   r  ,   r            3                   np
XCHG                   r  ,   m            >15                 np
XLAT                                       4                   np
PUSH                   r/i                 1                   uv
POP                    r                   1                   uv
PUSH                   m                   2                   np
POP                    m                   3                   np
PUSH                   sr                  1 b)                np
POP                    sr                  >= 3 b)             np
PUSHF                                      3-5                 np
POPF                                       4-6                 np
PUSHA POPA                                 5-9 i)              np
PUSHAD POPAD                               5                   np
LAHF SAHF                                  2                   np
MOVSX MOVZX            r, r/m              3 a)                np
LEA                    r/m                 1                   uv
LDS LES LFS LGS LSS    m                   4 c)                np
ADD SUB AND OR XOR     r  , r/i            1                   uv
ADD SUB AND OR XOR     r  , m              2                   uv
ADD SUB AND OR XOR     m  , r/i            3                   uv
ADC SBB                r  , r/i            1                   u
ADC SBB                r  , m              2                   u
ADC SBB                m  , r/i            3                   u
CMP                    r  , r/i            1                   uv
CMP                    m  , r/i            2                   uv
TEST                   r  , r              1                   uv
TEST                   m  , r              2                   uv
TEST                   r  , i              1                   f)
TEST                   m  , i              2                   np
INC DEC                r                   1                   uv
INC DEC                m                   3                   uv
NEG NOT                r/m                 1/3                 np
MUL IMUL               r8/r16/m8/m16       11                  np
MUL IMUL               all other versions  9 d)                np
DIV                    r8/m8               17                  np
DIV                    r16/m16             25                  np
DIV                    r32/m32             41                  np
IDIV                   r8/m8               22                  np
IDIV                   r16/m16             30                  np
IDIV                   r32/m32             46                  np
CBW CWDE                                   3                   np
CWD CDQ                                    2                   np
SHR SHL SAR SAL        r  , i              1                   u
SHR SHL SAR SAL        m  , i              3                   u
SHR SHL SAR SAL        r/m, CL             4/5                 np
ROR ROL RCR RCL        r/m, 1              1/3                 u
ROR ROL                r/m, i(><1)         1/3                 np
ROR ROL                r/m, CL             4/5                 np
RCR RCL                r/m, i(><1)         8/10                np
RCR RCL                r/m, CL             7/9                 np
SHLD SHRD              r, i/CL             4 a)                np
SHLD SHRD              m, i/CL             5 a)                np
BT                     r, r/i              4 a)                np
BT                     m, i                4 a)                np
BT                     m, r                9 a)                np
BTR BTS BTC            r, r/i              7 a)                np
BTR BTS BTC            m, i                8 a)                np
BTR BTS BTC            m, r                14 a)               np
BSF BSR                r  , r/m            7-73 a)             np
SETcc                  r/m                 1/2 a)              np
JMP CALL               short/near          1   e)              v
JMP CALL               far                 >= 3 e)             np
conditional jump       short/near          1/4/5/6 e)          v
CALL JMP               r/m                 2/5 e)              np
RETN                                       2/5 e)              np
RETN                   i                   3/6 e)              np
RETF                                       4/7 e)              np
RETF                   i                   5/8 e)              np
J(E)CXZ                short               4-11 e)             np
LOOP                   short               5-10 e)             np
BOUND                  r  , m              8                   np
CLC STC CMC CLD STD                        2                   np
CLI STI                                    6-9                 np
LODS                                       2                   np
REP LODS                                   7+3*n    g)         np
STOS                                       3                   np
REP STOS                                   10+n     g)         np
MOVS                                       4                   np
REP MOVS                                   12+n     g)         np
SCAS                                       4                   np
REP(N)E SCAS                               9+4*n    g)         np
CMPS                                       5                   np
REP(N)E CMPS                               8+4*n    g)         np
BSWAP                                      1 a)                np
CPUID                                      13-16    a)         np
RDTSC                                      6-13  a) j)         np
----------------------------------------------------------------------------
Notes:
a) this instruction has a 0FH prefix which takes one clock cycle extra to
   decode on a PPlain unless preceded by a multicycle instruction (see
   section 13 above).
b) versions with FS and GS have a 0FH prefix. see note a.
c) versions with SS, FS, and GS have a 0FH prefix. see note a.
d) versions with two operands and no immediate have a 0FH prefix, see note a.
e) see section 12 above
f) only pairable if register is accumulator. see paragraph 17.3 above
g) add one clock cycle for decoding the repeat prefix unless preceded by a
   multicycle instruction (such as CLD. see section 13 above).
h) pairs as if it were writing to the accumulator. see paragraph 17.2 above.
i) 9 if SP divisible by 4. See section 10.7 above.
j) on PPlain: 6 in priviledged or real mode, 11 in nonpriviledged, error in
   virtual mode. On PMMX: 8 and 13 clocks respectively.


21. LIST OF FLOATING POINT INSTRUCTIONS
=======================================
Explanations:
Operands: r=register, m=memory, m32=32 bit memory operand, etc.

Clock cycles:
The numbers are minimum values. Cache misses, misalignment, denormal
operands, and exceptions may increase the clock counts considerably.

Pairability:
+=pairable with FXCH, np=not pairable with FXCH.

i-ov:
Overlap with integer instructions. i-ov = 4 means that the last four clock
cycles can overlap with subsequent integer instructions.

fp-ov:
Overlap with floating point instructions. fp-ov = 2 means that the last two
clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here)

Opcode               Operand     Clock cycles    Pairability    i-ov   fp-ov
-----------------------------------------------------------------------------
FLD                  r/m32/m64         1         +              0      0
FLD                  m80               3         np             0      0
FBLD                 m80           48-58         np             0      0
FST(P)               r                 1         np             0      0
FST(P)               m32/m64           2 m)      np             0      0
FST(P)               m80               3 m)      np             0      0
FBSTP                m80         148-154         np             0      0
FILD                 m                 3         np             2      2
FIST(P)              m                 6         np             0      0
FLDZ FLD1                              2         np             0      0
FLDPI FLDL2E etc.                      5 s)      np             2      2
FNSTSW               AX/m16            6 q)      np             0      0
FLDCW                m16               8         np             0      0
FNSTCW               m16               2         np             0      0

FADD(P)              r/m               3         +              2      2
FSUB(R)(P)           r/m               3         +              2      2
FMUL(P)              r/m               3         +              2      2 n)
FDIV(R)(P)           r/m        19/33/39 p)      +             38 o)   2
FCHS FABS                              1         +              0      0
FCOM(P)(P) FUCOM     r/m               1         +              0      0
FIADD FISUB(R)       m                 6         np             2      2
FIMUL                m                 6         np             2      2
FIDIV(R)             m          22/36/42 p)      np            38 o)   2
FICOM                m                 4         np             0      0
FTST                                   1         np             0      0
FXAM                               17-21         np             4      0
FPREM                              16-64         np             2      2
FPREM1                             20-70         np             2      2
FRNDINT                             9-20         np             0      0
FSCALE                             20-32         np             5      0
FXTRACT                            12-66         np             0      0

FSQRT                                 70         np            69 o)   2
FSIN FCOS                         65-100 r)      np             2      2
FSINCOS                           89-112 r)      np             2      2
F2XM1                              53-59 r)      np             2      2
FYL2X                                103 r)      np             2      2
FYL2XP1                              105 r)      np             2      2
FPTAN                            120-147 r)      np            36 o)   0
FPATAN                           112-134 r)      np             2      2

FNOP                                   1         np             0      0
FXCH                 r                 1         np             0      0
FINCSTP FDECSTP                        2         np             0      0
FFREE                r                 2         np             0      0
FNCLEX                               6-9         np             0      0
FNINIT                             12-22         np             0      0
FNSAVE               m           124-300         np             0      0
FRSTOR               m             70-95         np             0      0
WAIT                                   1         np             0      0
-----------------------------------------------------------------------------
Notes:
m) The value to store is needed one clock cycle in advance.
n) 1 if the overlapping instruction is also an FMUL.
o) Cannot overlap integer multiplication instructions.
p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
   respectively. FIDIV takes 3 clocks more. The precision is defined by bit
   8-9 of the floating point control word.
q) The first 4 clock cycles can overlap with preceding integer instructions.
   See section 17.12 above.
r) clock counts are typical. Trivial cases may be faster, extreme cases may
   be slower.
s) may be up to 3 clocks more when output needed for fst, fchs, or fabs.


22. MMX instructions  (PMMX)
============================
A list of MMX instruction timings is not needed because they all take
one clock cycle, except the MMX multiply instructions which take 3.
MMX multiply instructions can be overlapped and pipelined to yield a
throughput of one multiplication per clock cycle.

The EMMS instruction takes only one clock cycle, but the first floating
point instruction after an EMMS takes approximately 58 clocks extra, and
the first MMX instruction after a floating point instruction takes
approximately 38 clocks extra. There is no penalty for an MMX instruction
after EMMS on the PMMX, but a possible penalty on the PII.

There is no penalty for using a memory operand in an MMX instruction
because the MMX arithmetic unit is one step later in the pipeline than
the load unit. But the penalty comes when you store data from an MMX
register to memory or to a 32 bit register: The data have to be ready
one clock cycle in advance. This is analogous to the floating point
store instructions.

All MMX instructions except EMMS are pairable in either pipe. Pairing
rules for MMX instructions are described in paragraph 8.10 above.


23. TESTING SPEED  (all processors)
===================================
The Pentium family or processors have an internal 64 bit clock counter
which can be read into EDX:EAX using the instruction RDTSC (read time
stamp counter). This is very useful for testing exactly how many clock
cycles a piece of code takes.

The program below is useful for measuring the number of clock cycles a piece
of code takes. The program executes the code to test 10 times and stores the
10 clock counts. The program can be used in both 16 and 32 bit mode on the
PPlain and PMMX:

ITER    EQU     10              ; number of iterations
OVERHEAD EQU    15              ; 15 for PPlain, 17 for PMMX

RDTSC   MACRO                   ; define RDTSC instruction
        DB      0FH,31H
ENDM

.DATA                           ; data segment
ALIGN   4
COUNTER DD      0               ; loop counter
TICS    DD      0               ; temporary storage of clock
RESULTLIST  DD  ITER DUP (0)    ; list of test results

.CODE                           ; code segment
BEGIN:  MOV     [COUNTER],0     ; reset loop counter
TESTLOOP:                       ; test loop
;****************   Do any initializations here:    ************************
        FINIT
;****************   End of initializations          ************************
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        CLD                     ; non-pairable filler
REPT    8
        NOP                     ; eight NOP's to avoid shadowing effect
ENDM

;****************   Put instructions to test here:  ************************
        FLDPI                   ; this is only an example
        FSQRT
        RCR     EBX,10
        FSTP    ST
;********************* End of instructions to test  ************************

        CLC                     ; non-pairable filler with shadow
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        SUB     EAX,OVERHEAD    ; subtract the clk cycl used by fillers etc
        MOV     EDX,[COUNTER]   ; loop counter
        MOV     [RESULTLIST][EDX],EAX   ; store result in table
        ADD     EDX,TYPE RESULTLIST     ; increment counter
        MOV     [COUNTER],EDX           ; store counter
        CMP     EDX,ITER * (TYPE RESULTLIST)
        JB      TESTLOOP                ; repeat ITER times

; insert here code to read out the values in RESULTLIST


The 'filler' instructions before and after the piece of code to test are
are included in order to get consistent results on the PPlain.
The CLD is a non-pairable instruction which has been inserted to
make sure the pairing is the same the first time as the subsequent times.
The eight NOP instructions are inserted to prevent any prefixes in the code
to test to be decoded in the shadow of the preceding instructions on the
PPlain. Single byte instructions are used here to obtain the same pairing
the first time as the subsequent times. The CLC after the code to test is
a non-pairable instruction which has a shadow under which the 0FH prefix
of the RDTSC can be decoded so that it is independent of any shadowing
effect from the code to test on the PPlain.

On The PMMX you may want to insert  XOR EAX,EAX / CPUID  before the
instructions to test if you want the FIFO instruction buffer to be
empty, or some time-consuming instruction (f.ex. CLI or AAD) if you
want the FIFO buffer to be full.

On the PPro and PII you have to put in a serializing instruction like
CPUID before and after each RDTSC to prevent it from executing in parallel
with anything else. (CPUID is a serializing instruction which means that
it flushes the pipeline and waits for all pending operations to finish
before proceeding. This is useful for testing purposes. CPUID has no
shadow under which prefixes of subsequent instructions can decode.)

The RDTSC instruction cannot execute in virtual mode on the PPlain and
PMMX, so if you are running DOS programs you must run in real mode. (Press
F8 while booting and select 'safe mode command prompt only' or 'bypass
startup files').

The Pentium processors have special performance monitor counters which can
count events such as cache misses, misalignments, AGI stalls, etc. Details
about how to use the performance monitor counters are not covered by this
manual but can be found in the MMX technology developer's manual.


24. PENTIUM PRO AND PENTIUM II MICROPROCESSORS
==============================================
The Pentium Pro and Pentium II processors are very different from the
earlier versions. Each instruction is decoded into micro-operations
(uops) which are reordered before execution. The main advantage of this
is that the processor does most of the optimations for you: splitting
complex instructions into simpler ones, and reordering instructions so
that independent operations can be done in parallel. These dynamic
processors therefore perform better on non-optimized code than before.
Another advantage is automatic register renaming. Example:
        MOV     EAX,[ESI]
        INC     EAX
        MOV     [EDI],EAX
        MOV     EAX,[ESI+4]
        INC     EAX
        MOV     [EDI+4],EAX
In this case the dynamic processor will use a different physical register
for the last three instructions and rename it to EAX so that the two
calculations can be done in parallel. In case of a cache miss for [ESI]
and a hit for [ESI+4], you will actually have the last calculation done
first. There are more physical registers than logical register names, so
you can take advantage of this automatic register renaming if you are
short on registers.

The dynamic processors also have drawbacks, however. Most importantly,
mispredicted jumps are very expensive (10-20 clock cycles!). To
compensate for this problem, the dynamic processors have conditional
move instructions which you can use in some situations in stead of
conditional jumps.

Another drawback is partial register stalls and partial memory stalls.
Using a full register after writing to part of the register will cause a
severe delay on the PPro and PII. Example:
MOV AL,[EBX] / MOV ECX,EAX
You may avoid this penalty by using  MOVZX EAX,BYTE PTR [EBX]
or by zeroing the full register first:
XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
The dynamic processors have been designed specially to recognize this
combination where a register is XOR'ed with itself and avoid a partial
register stall. This is the preferred method for code that will execute
on both Pentium and PentiumPro processors. The processor remembers that
the register has been xor'ed with itself until the next mispredicted
branch or interrupt.

The partial register stall occurs whenever the processor needs to combine
data from different sources into different parts of a register. Example:
        MOV     AL, [mem1]
        MOV     AH, [mem2]
        MOV     [mem3], AX
should be changed to:
        MOVZX   EAX, BYTE PTR [mem1]
        MOVZX   EBX, BYTE PTR [mem2]
        SHL     EBX, 8
        OR      EAX, EBX
        MOV     [mem3], AX

The partial register stall occurs whenever the processor has to combine
bits from two different sources into one register. This can also happen
with the flags register:

        SAHF
        JL      XX

SAHF writes to several of the flag bits, but not the overflow flag.
The JL instruction reads both the sign flag and the overflow flag, so
it has to combine the sign flag from SAHF with the overflow flag from
before the SAHF. This gives a partial register stall. The same happens
when PUSHF or PUSHFD follows after an instruction which modifies the
flags, such as ADD.  PUSHFD reads all 32 bits of the flags register,
but ADD writes to only six of these, so you will get a partial stall
here too.

Memory operations can also cause a partial stall when you address the
same piece of memory with different data sizes. Example:
MOV [ESI],AL  /  MOV EBX,[ESI]
This situation is analogous to the partial register stall: The first
instruction writes to the first byte of the dword that ESI points to,
and the second instruction reads four bytes. The byte from AL has to
be combined with the remaining three bytes from memory to give the
dword to read into EBX.

This does not give a partial memory stall:
MOV [ESI],EAX  /  MOV BL,[ESI]
but this does:
MOV [ESI],EAX  /  MOV BL,[ESI+1]
This situation is not analogous to the partial register stall. The rule
here is that reading part of what you have written causes a partial stall
if the alignment isn't the same.

A third kind of stall occurs when reading any flags bits after a shift
or rotate by CL.

These three kinds of stall can be masked if you put other instructions
in between.


A thorough optimation for the dynamic processors require that you know
how ASM instructions are broken down into uops. You will need Intel's
application note: 'AP-526 Optimizations for Intel's 32-bit Processors'
which has a table of this in appendix D. The tutorial 'Optimizing for
the Pentium® Pro Processor Family' is also recommended.

The maximum throughput in the PPro and PII processors is
3 uops per clock cycle. In order to achieve this, you should beware of
the possible bottlenecks. The instruction decode stage can decode up to
three ASM instructions per clock cycle, but only if the first of these
generates no more than 4 uops and the next two instructions generate
only 1 uop each. This means that a throughput of 6 uops in the decode
stage is achieved with a 4-1-1 4-1-1 sequence, whereas you get only 2
uops per clock if all instructions generate 2 uops each. In order to get
a throughput in the decode stage of at least 3 uops per clock cycle you
should, if possible, intersperce instructions that generate more than
one uop with one or two instructions that generate only one uop each.

More often you will get a bottleneck in the execution stage. There are
five execution pipelines called port 0-4. Port 0 and 1 correspond roughly
to the u and v pipes of the PPlain. Port 0 can do all artihmetic
operations whereas port 1 can do only simple ALU operations. Port 2 does
all memory read operations. A memory write operation requires two uops
which go to port 3 and 4 respectively.

Each port can receive only one uop per clock cycle. In order to get the
maximum throughput of 3 uops in the execution stage you must make sure
that no port gets more than one third of the uops. If, for example, you
have too many uops for port 0 and 1, then you may replace some
MOV reg,reg  or  MOV reg,constant  instructions with MOV reg,[mem]
instructions, which go to port 2 instead or port 0 or 1.

You also have to take data-flow dependency into account. If all uops
depend on the result of the preceding one, then you can execute only one
per clock cycle. The dynamic processor can reorder your uops in order to
first execute any operation that does not depend on the preceding ones,
but this requires that your code is not completely sequential.

Normally, you wouldn't know what the reorder-buffer contains at the
start of the piece of code you are optimizing, but if you are coding a
loop you can assume that you will reach a steady-state after one or a
few iterations, so that you only have to count the uops inside the loop.


25. COMPARISON OF THE DIFFERENT MICROPROCESSORS
===============================================
The following table summarizes some important differences between the
microprocessors in the Pentium family:

                                  PPlain    PMMX     PPro      PII
--------------------------------------------------------------------
code cache, kb                        8       16        8       16
data cache, kb                        8       16        8       16
MMX instructions                     no      yes       no      yes
conditional move instructructions    no       no      yes      yes
dynamic execution                    no       no      yes      yes
branch prediction                  poor     good     good     good
branch target buffer entries        256      256      512      512
return stack buffer                  no      yes      yes      yes
branch misprediction penalty        3-4      4-5    10-20    10-20
partial register stall                0        0       >6        ?
FMUL latency                          3        3        5        5
FMUL throughput                     1/2      1/2      1/2      1/2
IMUL latency                          9        9      4-5      4-5
IMUL throughput                     1/9      1/9      1/1      1/1
--------------------------------------------------------------------

Comments to the table:
Code cache size is important if the critical part of your program is
not limited to a small memory space.

Data cache size is important for all programs that handle more than
small amounts of data in the critical part.

MMX instructions are useful for programs that handle massively parallel
integer data, such as sound and image processing. In other applications
it may not be possible to take advantage of the MMX instructions.

Conditional move instructructions are useful for avoiding poorly
predictable conditional jumps.

Dynamic execution improves performance on non-optimized code. It includes
automatic instruction reordering and register renaming.

Processors with a good branch prediction method can predict simple
repetitive patterns. A good branch prediction is most important if the
branch misprediction penalty is high.

A return stack buffer improves prediction of return instructions when
a subroutine is called alternatingly from different locations.

A partial register stall makes handling of mixed data sizes (8, 16, 32 bit)
more difficult.

The latency of a multiplication instruction is the time it takes in
sequential code. A throughput of 1/2 means that the execution can be
pipelined so that a new multiplication can begin every second clock cycle.
This defines the speed for handling parallel data.

As you can see from the table, the PII processor will be the best choise
in most situations, but definitely not in all situations. A fully
optimized program with a lot of sequential floating point code and a lot
of unpredictable branches might actually run fastest on the simple PPlain
(at the same clock frequency).


Most of the optimations described in this document have little or no
negative effects on other microprocessors, including non-Intel processors,
but there are some problems to be aware of.

Scheduling floating point code for the PPlain and PMMX often requires a
lot of extra FXCH instructions. This will slow down execution on older
microprocessors, but not on the Pentium family and advanced non-Intel
processors.

The MMX instructions available on the PMMX and PII processors
are very useful for massively parallel integer calculations, but they
cannot be used without a serious penalty if you have floating point
instructions in the same part of the program.

Taking advantage of the MMX instructions in the PMMX and PII processors
or the conditional moves in the PPro and PII will create problems if you
want your code to be compatible with earlier microprocessors. The solution
may be to write several versions of your code, each optimized for a
particular processor. Your program should automatically detect which
processor it is running on and select the appropriate version of code.
Such a complicated approach is of course only needed for the most critical
parts of your program.