Next: Summary Up: Coding and Compression Previous: Effect of Network Loss

Processing Requirements for Video Compression

In this section, we present a brief analysis of the complexity f implementing a lossy video compression algorithm in software. We use an example loosely based on the H621 scheme.

There has been some disagreement about just how many MIPs ^4.3 it takes to compress to a particular bandwidth.

Rather than look in detail at the exact standard, lets take a hypothetical coding standard, called VRUM (Video Reduction for Usable Multimedia). This takes a nominal 1024 pel square true color (24 bit per pel) video frame sequence, at 25 frames per second (75 Mbytes/Sec), and compresses as follows:

Intra frame coding

Divide the 1024*1024 area into 8*8 blocks and run a discrete cosine transform (16384 times).

Inter frame coding

Meanwhile, for

1.

i) difference each 8*8 area with same area in previous frame.

2.

ii) For i=1 to 8 frames in sequence, 
	For each neigbour block in successive frames
		if neigbour == current block offset by i pels
		    send motion vector for this block for all 8 frames
		    (and 1st DCT coded block, of course)

Obviously, a naive implementation of this could be very expensive, but there are a number of very obvious optimisations (not the least is not to bother calculating DCT on blocks that have not changed, or have simply moved and have motion vectors being sent).

Now, DCT looks like this:

for (x=0..7, y=0..7)
f(x,y) = 1/4 Sum(i=0..7, j=0..7) C(i) C(j) 
		Cos(pi (2x+1) i/16)
		Cos(pi (2y+1) j/16)

C(k) = 1/root2 for i=0, else 1

The range of f is -256..255.

We can precalculate all discrete Cos values for the 256 possible input values (i.e. only have a table lookup of 256*2 entries for the transform itself). We could go further and have a table of sums of cosines, but lets just leave it here for now.

This then gives each block taking:

8*8 * (add + (8*8 (2 table lookups) + multiply)
= 64 * 130 = 8320

So for full inter-frame coded stream of 25 frames each taking

16384 blocks, this takes
25*16384*8320 = 
3407872000 = 3000 MIPs
without any differencing...

However, if any significant number of blocks are similar, then we need to lok at the relativer cost of differencing with ``DCT-ing'':

To difference 16384 blocks, a simple minded cmp does this

blkcmp()
for(i=0..7, j=0..7)
    pel = i*8+j
    if (frame[seq][block[pel]] != frame[seq-1][block[pel]]
	return false
return true

Since a pel is 3 bytes, this is 3 compares * 64 to find true, or on average (whatever that is) a lot less to find false.

This gives a result of roughly 196 instructions, to avoid 8320 DCT instructions.

In other words, if we run a simple blk cmp, and get some proportion n of blocks the same, we reduce the MIP count to: n*196 + ((1-n) * (8320 + (p * 196)) * 25*16384 where p is the proportion a changed block has changed. which (for largish n) is approximately: (1-n) * 3000 MIPS For n = 2/3 (e.g. head and shoulders view) we are now down to 1000 MIPs.

To reduce further, we should consider pre-calculating some subset of the DCT Cos products, but need to then check that we do not have too high a memory requirement. Now, the searches are a bit more complex: Each search is just like the differencing code, only it costs more as we have to do it across 8 frames, and across +/- 8 in x and y directions, i.e. 16*16*8 times as much however, this is only incrementing once per new frame, so effectively it is only calculated across the 16 * 16 moves.

In fact, if we roll this into the difference code (since a motion vector of 0,0 is the same as an unchanged block), we have 16*16*196 instructions to see if a block is unchanged, and moved or unmoved. This results in around 50176 instructions

So now we get

(n*50176 + (1-n) * (8320 + (q * 50176))) * 25*16384

(4.1)

where q is the proportion a changed block has changed (whetehr moved ot not). q is probably very small.

Unfortunately, this is rather worse than the mere intra-coded frame.

However, when doing the differencing, we can cache all previous points reached in previous blocks, so we only need run another 196 instructions per frame per block, and we get back to the previous result (1-n) * 3000 MIPS For n = .9, this then gets us to around 300 MIPS

Note that for H.261 input is defined to be CCIR 601 which is 720 x 486 (I think), not 1024 square - i.e. a further factor of 3 smaller in area.
In addition, your compares are on a byte by byte basis. You could take into consideration a 32 bit word size.
It is hard to see how you can use the results of previous comparisons in doing the differencing. The origin of the new block with respect to the old is changed for each iteration: they have nothing in common.
If you reduce the frame rate to 5 Hz, then there will be less correspondence between frames, and the motion compensation will give you less in terms of compression. Since it's optional for the encoder in the H.261 standard, why not just difference. Ok - slightly less compression but a big win in terms of computation. (On the other hand maybe not - differences will be detected quicker).

Now the implementation results on modest performance PC processors achieve 5 frames per second (i.e. 1/5 our frame rate) and 1/4 CIF on a 30 MIPs machine, which confirms our approximate analysis.

Arguments like these have been behind the design of rhe Intel MMX (Multi-Media eXtensions) to the PC processor, and Sun Microsystems Visual Instruction Set - along the lines of RISC arguments for running conventional programs, there is a small number of possible additional instructions that are frequently used can make a big difference to performance.

Next: Summary Up: Coding and Compression Previous: Effect of Network Loss

Jon CROWCROFT
1998-12-03