Computer Laboratory

Bit Twiddling Hacks

I have often had to optimise assembly code by hand to get maximum performance for various reasons. Part of this is algorithmic, but part of this is about finding the fastest way to realise some conceptually simple operation in assembly code. In the same vein as Sean Eron Anderson's excellent collection of snippets, this page is a collection of some of these results.

Please feel free to use any of these in your work. They come with no warranty or guarantee of correctness. All code snippets are public domain unless otherwise noted. The text of this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The examples given herein are in RISC like assembly code. The destination register is always given first in an argument list. I believe the main use of these snippets would be in assembly code; generally if you're working in some higher level language, a few extra operations is the least of your problems. I've not made any particular attempt to minimise register use; depending on the use case various temporaries may be able to alias the arguments.

1 SIMD in Software

SIMD (Single Instruction Multiple Data) operations operate on multiple independent operands with a single instruction. Some architectures have such operations as instructions, but architectures which don't support these can still emulate the effect. Even architectures which do support them may only support certain specific operations.

1.1 Addition/Subtraction

from chessprogramming.wikispaces.com

Addition and subtraction of arbitrary width fields packed into a register can be done with around 6 typical machine instructions. For example, adding four 8 bit values packed into a 32 bit register to another such four values.

# result = ((x&0x7f7f7f7f)+(y&0x7f7f7f7f))^((x^y)&0x80808080)
and tmp1, x, 0x7f7f7f7f
and tmp2, y, 0x7f7f7f7f
add tmp1, tmp1, tmp2
xor tmp2, x, y
and tmp2, tmp2, 0x80808080
xor result, tmp1, tmp2

Arbitrary fields can be supported like this, by changing the mask value from 0x7f7f7f7f to instead have a 0 at the top bit of each field packed into the word. Architectures supporting "add with carry" can use this to support fields packed arbitrarily across many registers by using "add with carry" for the add instruction.

Subtraction is similar but requires slightly different treatement of the subtracted register.

# result = ((x|0x80808080)-(y&0x7f7f7f7f))^((x^(~y))&0x80808080)
or tmp1, x, 0x80808080
and tmp2, y, 0x7f7f7f7f
sub tmp1, tmp1, tmp2
not tmp2, y
xor tmp2, x, tmp2
and tmp2, tmp2, 0x80808080
xor result, tmp1, tmp2

1.1.1 Exactly Two Fields

by Alex Chadwick

Addition and subtraction of exactly two arbitrary width fields packed into a register can be done with around 5 typical machine instructions. For example, adding two 16 bit values packed into a 32 bit register to another such two values.

# result = (x+y)-(((x+y)^(x^y))&0x00010000)
add tmp1, x, y
xor tmp2, x, y
xor tmp2, tmp2, tmp1
and tmp2, tmp2, 0x00010000
sub result, tmp1, tmp2

This works by adding the numbers normally, then detecting cases where a one has carried from one field to the next by inspecting the bottom bit of each field and comparing it with a non-carrying addition (XOR). If these differ, a one must be subtracted to counteract the carry.

Arbitrary fields can be supported like this, by changing the mask value from 0x00010000 to instead have a 1 at the bottom bit of the upper field. Architectures supporting "add with carry" can use this to support two fields packed arbitrarily across many registers by using "add with carry" for the first instruction.

Subtraction can be supported by inverting all addition and subtraction operations.

# result = (x-y)+(((x-y)^(x^y))&0x00010000)
sub tmp1, x, y
xor tmp2, x, y
xor tmp2, tmp2, tmp1
and tmp2, tmp2, 0x00010000
add result, tmp1, tmp2

1.1.2 Sparse Packing

The above code focuses on the case of dense packing where every field is adjacent to every other. This requirement can be relaxed however to yield simpler code. For example, the following code adds three 10 bit values packed into a 32 bit register with 1 bit of padding between each value to another such three values. This assumes the padding bit is zero.

# result = (x+y)&0xffdffbff
add tmp, x, y
and result, tmp, 0xffdffbff

This works by allowing the field to overflow into the padding bit, and then masking out any possible carry.

Arbitrary fields can be supported like this, by changing the mask value from 0xffdffbff to instead have a 1 in every non-padding bit location. Architectures supporting "add with carry" can use this to support fields packed arbitrarily across many registers by using "add with carry" for the first instruction.

Subtraction is similar, but the padding bits of the first operand must be set to one before hand. This is to ensure that the fields don't have any carry between them.

# result = ((x|0x00200400)-y)&0xffdffbff
or tmp, x, 0x00200400
sub tmp, tmp, y
and result, tmp, 0xffdffbff

1.2 Any Field Satisfiying Tests

These operations attempt to test if any field in a register satisifies some condition. They put a non-zero value in the result if any field does satisfy the condition, zero otherwise.

1.2.1 Any Field Equals 0

apparently original from Alan Mycroft

A very useful test for strlen is to test whether any byte in a word is zero. The following code snippet tests if any 8 bit value packed into a 32 bit word is zero.

# result = (x-0x01010101)&(~x)&0x80808080
sub tmp1, x, 0x01010101
not tmp2, x
and tmp1, tmp1, tmp2
and result, tmp1, 0x80808080

This works by setting the first field that is all zero to all ones, then confirming that the top bit of that field used to be zero.

Arbitrary fields can be supported like this, by changing the mask value from 0x01010101 to instead have a 1 at the bottom bit of each field packed into the word and changing the mask 0x80808080 to have a 1 at the top bit of each field.

1.2.2 Any Field Equals N

A simple extension to the above is to test any field equals some other value. Each field may test a different value simultaneously. The following example tests if any 8 bit value packed into a 32 bit word equals the corresponding value in another such register.

# result = ((x^y)-0x01010101)&(~(x^y))&0x80808080
xor tmp1, x, y
sub tmp2, tmp1, 0x01010101
not tmp1, tmp1
and tmp1, tmp1, tmp2
and result, tmp1, 0x80808080

This works by XORing the two values in order to set any equal fields to all 0s. It then uses the zero test as above.

Arbitrary fields can be supported like this, by changing the mask value from 0x01010101 to instead have a 1 at the bottom bit of each field packed into the word and changing the mask 0x80808080 to have a 1 at the top bit of each field.