# Bit Twiddling Hacks

I have often had to optimise assembly code by hand to get maximum performance for various reasons. Part of this is algorithmic, but part of this is about finding the fastest way to realise some conceptually simple operation in assembly code. In the same vein as Sean Eron Anderson's excellent collection of snippets, this page is a collection of some of these results.

Please feel free to use any of these in your work. They come with no warranty or guarantee of correctness. All code snippets are public domain unless otherwise noted. The text of this page is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The examples given herein are in RISC like assembly code. The destination register is always given first in an argument list. I believe the main use of these snippets would be in assembly code; generally if you're working in some higher level language, a few extra operations is the least of your problems. I've not made any particular attempt to minimise register use; depending on the use case various temporaries may be able to alias the arguments.

Contents |

## 1 SIMD in Software

SIMD (Single Instruction Multiple Data) operations operate on multiple independent operands with a single instruction. Some architectures have such operations as instructions, but architectures which don't support these can still emulate the effect. Even architectures which do support them may only support certain specific operations.

### 1.1 Addition/Subtraction

from chessprogramming.wikispaces.com

Addition and subtraction of arbitrary width fields packed into a register can be done with around 6 typical machine instructions. For example, adding four 8 bit values packed into a 32 bit register to another such four values.

#result= ((x&0x7f7f7f7f)+(y&0x7f7f7f7f))^((x^y)&0x80808080) and tmp1,x, 0x7f7f7f7f and tmp2,y, 0x7f7f7f7f add tmp1, tmp1, tmp2 xor tmp2,x,yand tmp2, tmp2, 0x80808080 xorresult, tmp1, tmp2

Arbitrary fields can be supported like this, by changing the mask value from `0x7f7f7f7f`

to instead have a 0 at the top bit of each field packed into the word. Architectures supporting "add with carry" can use this to support fields packed arbitrarily across many registers by using "add with carry" for the add instruction.

Subtraction is similar but requires slightly different treatement of the subtracted register.

#result= ((x|0x80808080)-(y&0x7f7f7f7f))^((x^(~y))&0x80808080) or tmp1,x, 0x80808080 and tmp2,y, 0x7f7f7f7f sub tmp1, tmp1, tmp2 not tmp2,yxor tmp2,x, tmp2 and tmp2, tmp2, 0x80808080 xorresult, tmp1, tmp2

#### 1.1.1 Exactly Two Fields

by Alex Chadwick

Addition and subtraction of exactly two arbitrary width fields packed into a register can be done with around 5 typical machine instructions. For example, adding two 16 bit values packed into a 32 bit register to another such two values.

#result= (x+y)-(((x+y)^(x^y))&0x00010000) add tmp1,x,yxor tmp2,x,yxor tmp2, tmp2, tmp1 and tmp2, tmp2, 0x00010000 subresult, tmp1, tmp2

This works by adding the numbers normally, then detecting cases where a one has carried from one field to the next by inspecting the bottom bit of each field and comparing it with a non-carrying addition (XOR). If these differ, a one must be subtracted to counteract the carry.

Arbitrary fields can be supported like this, by changing the mask value from `0x00010000`

to instead have a 1 at the bottom bit of the upper field. Architectures supporting "add with carry" can use this to support two fields packed arbitrarily across many registers by using "add with carry" for the first instruction.

Subtraction can be supported by inverting all addition and subtraction operations.

#result= (x-y)+(((x-y)^(x^y))&0x00010000) sub tmp1,x,yxor tmp2,x,yxor tmp2, tmp2, tmp1 and tmp2, tmp2, 0x00010000 addresult, tmp1, tmp2

#### 1.1.2 Sparse Packing

The above code focuses on the case of dense packing where every field is adjacent to every other. This requirement can be relaxed however to yield simpler code. For example, the following code adds three 10 bit values packed into a 32 bit register with 1 bit of padding between each value to another such three values. This assumes the padding bit is zero.

#result= (x+y)&0xffdffbff add tmp,x,yandresult, tmp, 0xffdffbff

This works by allowing the field to overflow into the padding bit, and then masking out any possible carry.

Arbitrary fields can be supported like this, by changing the mask value from `0xffdffbff`

to instead have a 1 in every non-padding bit location. Architectures supporting "add with carry" can use this to support fields packed arbitrarily across many registers by using "add with carry" for the first instruction.

Subtraction is similar, but the padding bits of the first operand must be set to one before hand. This is to ensure that the fields don't have any carry between them.

#result= ((x|0x00200400)-y)&0xffdffbff or tmp,x, 0x00200400 sub tmp, tmp,yandresult, tmp, 0xffdffbff

### 1.2 Any Field Satisfiying Tests

These operations attempt to test if any field in a register satisifies some condition. They put a non-zero value in the result if any field does satisfy the condition, zero otherwise.

#### 1.2.1 Any Field Equals 0

apparently original from Alan Mycroft

A very useful test for `strlen`

is to test whether any byte in a word is zero. The following code snippet tests if any 8 bit value packed into a 32 bit word is zero.

#result= (x-0x01010101)&(~x)&0x80808080 sub tmp1,x, 0x01010101 not tmp2,xand tmp1, tmp1, tmp2 andresult, tmp1, 0x80808080

This works by setting the first field that is all zero to all ones, then confirming that the top bit of that field used to be zero.

Arbitrary fields can be supported like this, by changing the mask value from `0x01010101`

to instead have a 1 at the bottom bit of each field packed into the word and changing the mask `0x80808080`

to have a 1 at the top bit of each field.

#### 1.2.2 Any Field Equals N

A simple extension to the above is to test any field equals some other value. Each field may test a different value simultaneously. The following example tests if any 8 bit value packed into a 32 bit word equals the corresponding value in another such register.

#result= ((x^y)-0x01010101)&(~(x^y))&0x80808080 xor tmp1,x,ysub tmp2, tmp1, 0x01010101 not tmp1, tmp1 and tmp1, tmp1, tmp2 andresult, tmp1, 0x80808080

This works by XORing the two values in order to set any equal fields to all 0s. It then uses the zero test as above.

Arbitrary fields can be supported like this, by changing the mask value from `0x01010101`

to instead have a 1 at the bottom bit of each field packed into the word and changing the mask `0x80808080`

to have a 1 at the top bit of each field.