Binary search trees are simple and perform well on random
data, but they perform poorly on ordered data, degrading to
O(n)
performance for common operations.
Red-black trees are a popular family of balanced binary search
trees, which keep the tree approximately balanced.
Every node in a red-black tree is colored either red or black. We maintain two invariants
Together, these invariants ensure that the longest possible path—one with alternating red and black nodes—is no more than twice as long as the shortest possible path—one with black nodes only.
* Based on Section 3.3 of Purely Functional Data Structures
Recall the datatype of a basic binary search tree
datatype 'a tree = Lf | Br of 'a * 'a tree * 'a tree
We create a color
type
datatype color = R | B
and augment the binary search tree type to include a color
field. All Lf
nodes are considered to be black.
datatype 'a rbtree = Lf | Br of color * 'a * 'a rbtree * 'a rbtree
lookup
is similar to a basic binary search tree
- fun lookup Lf _ = false | lookup (Br (_, elt, l, r)) x = if x < elt then lookup l x else if x > elt then lookup r x else true; > val lookup = fn : int rbtree -> int -> bool
We ignore the color field, and everything else is the same.
insert
must maintain the two invariants, so it
has a few more changes
- fun insert root x = let fun ins Lf = Br (R, x, Lf, Lf) | ins (node as Br (color, elt, l, r)) = if x < elt then balance (color, elt, ins l, r) else if x > elt then balance (color, elt, l, ins r) else node val Br (_, elt, l, r) = ins root (* never returns Lf *) in Br (B, elt, l, r) end; > val insert = fn : int rbtree -> int -> int rbtree
There are three significant changes
ins Lf
case)ins
Br
constructor with
calls to the balance
function in the two
recursive cases.The balance
function works like a
Br
constructor, but it rearranges its arguments
when necessary to maintain the balance invariants.
Making all new nodes red maintains invariant 2, but violates
invariant 1 if the parent of the new node is also red.
balance
works by fixing this red-red violation,
possibly creating a new one higher up the tree.
At most one red-red violation exists at a time, and the calls
to balance
make it bubble up the tree to the root.
Change #2 in the insert
function fixes this case by
always making the root node black.
See the figure to understand how balance
works.
In each case, balance
transforms a black node
with a red child and a red grandchild into the same
configuration—a red node with two black children.
The new red node may have a red parent, but this will be handled at the next level up as the recursion unwinds.
In each of the red-red cases, we label the values so that an in-order traversal yields the sequence {A, X, B, Y, C, Z, D}, which matches the output configuration.
Pattern matching makes the code amazingly simple and elegant (compare this to an implementation in Java or a similar language). We simply describe the four possible cases, and in each case we construct the exact same thing. The code is a direct transcription of the diagram
- fun balance (B, z, Br(R, y, Br(R, x, a, b), c), d) = … | balance (B, z, Br(R, x, a, Br(R, y, b, c)), d) = … | balance (B, x, a, Br(R, z, Br(R, y, b, c), d)) = … | balance (B, x, a, Br(R, y, b, Br(R, z, c, d))) = Br(R, y, Br(B, x, a, b), Br(B, z, c, d)) | balance body = Br body; > val 'a balance = fn : color * 'a * 'a rbtree * 'a rbtree -> 'a rbtree
Here the first three cases are abbreviated for readability, but all four cases have the same code as the fourth case shown here.
Another interesting kind of tree is useful for data encoding and forms the basis of Huffman data compression.
We construct a binary tree with a character stored in each leaf. Instead of coding a character as a byte (8 bits), we encode it as the path through the tree to find the leaf with that character.
Consider how a decode would work: if it starts at the root of the tree, then a zero means follow the left branch, and a one means follow the right branch. When it hits a leaf, it outputs the character stored in that leaf and starts back at the root to find the next character.
If we put frequently used characters near the root and infrequently used characters deeper in the tree, then our encoding will be more efficient than using a fixed number of bits for each character. In the worst case it will be exactly the same.
This encoding scheme relies on the prefix property: no character encoding can be the prefix of any other.
We define two datatypes
- datatype direction = L | R; - datatype huff = C of int * char | B of int * huff * huff;
The int
field will hold the number of
occurrences of a char
in the case of
C
, or all char
s under the given node
in the case of B
.
The decode1
function takes a Huffman tree and a
list of directions, and returns the next character found and the
remaining directions
- exception Bad_input; > exn Bad_input = Bad_input : exn - fun decode1 (C (_, c)) rest = (c, rest) | decode1 (B (_, l, _)) (L :: rest) = decode1 l rest | decode1 (B (_, _, r)) (R :: rest) = decode1 r rest | decode1 _ _ = raise Bad_input; > val decode1 = fn : huff -> direction list -> char * direction list
Using decode1
to decode a full string is
straightforward. A typical implementation will include a special
sentinal value in the tree that marks the end-of-input.
Encoding is trickier. Here is a simple first attempt
- fun member (C (_, c)) ch = c = ch | member (B (_, l, r)) ch = member l ch orelse member r ch; > val member = fn : huff -> char -> bool - fun encode1 (C _) _ = [] | encode1 (B (_, l, r)) ch = if member l ch then L :: encode1 l ch else R :: encode1 r ch; > val encode1 = fn : huff -> char -> direction list
The problem with this approach is in the repeated calls to
member
. Since member
is an
O(n)
operation, in the worst case
encode1
takes O(n2)
.
We can improve encoding time to O(n)
for
each character with this version of encode1
- fun encode1 tree ch = let fun f (C (_, c)) = if c = ch then [[]] else [] | f (B (_, l, r)) = map (fn x => L :: x) (f l) @ map (fn x => R :: x) (f r) in hd (f tree) end; > val encode1 = fn : huff -> char -> direction list
Or we can use exceptions to quit searching when we find the path
- exception Success of direction list; > exn Success = fn : direction list -> exn - fun encode1 tree ch = let fun f (C (_, c)) a = if c <> ch then () else raise (Success (rev a)) | f (B (_, l, r)) a = (f l (L :: a); f r (R :: a)) in (f tree []; []) handle Success res => res end; > val encode1 = fn : huff -> char -> direction list
A better approach is to gather all encodings and put them in a dictionary, e.g., a red-black tree dictionary. We can gather the encodings by walking over all the nodes in the tree and keeping track of every path we take
- fun encodeGather tree = let fun f (C (_, c)) lst a = (c, rev lst) :: a | f (B (_, l, r)) lst a = f r (R :: lst) (f l (L :: lst) a) in f tree [] [] end; > val encodeGather = fn : huff -> (char * direction list) list
This function gathers a list of
(char, encoding)
pairs, which
can easily be folded into a dictionary.
Constructing an optimal Huffman tree is quite simple. Suppose
we are given a list of characters and their relative frequencies
as a list of (freq, char)
pairs
in order from least frequent to most.
First we convert this into a list of Huffman trees using
map C
. Then we repeatedly combine the two
least frequent Huffman trees until we have just one tree
left. We start with a few helper functions
- fun freq (C (x, _)) = x | freq (B (x, _, _)) = x; > val freq = fn : huff -> int - fun insert elt [] = [elt] | insert elt (lst as x::xs) = if freq elt < freq x then elt :: lst else x :: insert elt xs; > val insert = fn : huff -> huff list -> huff list
Then actually constructing the tree is simple
- fun build [] = raise Bad_input | build [x] = x | build (x1::x2::xs) = build (insert (B (freq x1 + freq x2, x1, x2)) xs); > val build = fn : huff list -> huff
build
repeatedly combines the two trees with
lowest frequency (the first two in a sorted list) and puts the
result back into the list until there is only a single tree
left.