We mentioned in chapter one that multimedia data originates in some analog form. However, all data is eventually turned into some (typically repetitive) digital code, and the statistics of the digital code are of great interest when we want to think about compressing a set of such data. The statistics are important at several levels of granularity.
Even text input from a given language has several levels of interest, such as characters, words, and grammatical structure (sentences, paragraphs, etc.).
In a similar way, speech or music signals have repetitive structure which shows correlation at several levels of detail. Images may have some structure, although in natural images this tends to be very subtle (fractal). Moving images clearly have at least two timescales of interest, the scanline and the frame, partly due to the nature of the input and display devices (and the human eye).
Thus coding and compression go hand in hand. We choose some levels of granularity at which to code an input signal or message. This determines the initial input buffer size over which we run our code, which, for real-time applications, determines the CODEC delay. It also determines the number of quanta, or different values, for each "letter of the alphabet" in the input buffer.
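As a rough illustration of how the buffer size feeds into delay, the sketch below computes the time needed just to fill one input buffer, and the number of distinct values available to each sample. The function names and the example figures (160 samples at 8000 Hz, 8 bits per sample) are assumptions chosen for illustration; a real CODEC adds processing and transmission time on top of this buffering delay.

```python
def buffering_delay_ms(samples_per_buffer: int, sample_rate_hz: int) -> float:
    """Time in milliseconds needed to fill one input buffer.

    This is only the buffering component of the CODEC delay; coding
    and transmission time are not modelled here.
    """
    return 1000.0 * samples_per_buffer / sample_rate_hz


def quantisation_levels(bits_per_sample: int) -> int:
    """Number of distinct values each sample ("letter") can take."""
    return 2 ** bits_per_sample


if __name__ == "__main__":
    # Hypothetical figures: a 160-sample buffer of 8000 Hz audio.
    print(buffering_delay_ms(160, 8000), "ms to fill the buffer")
    print(quantisation_levels(8), "levels per 8-bit sample")
```

Doubling the buffer doubles this delay, which is why real-time applications tend to favour short buffers even when a longer one would expose more redundancy to the coder.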
The selection of these two (or more) timescales may be determined long in advance for some data types. For example, for text in a given language and alphabet there are large sets of samples that are amenable to analysis, so we can find nearly optimal digital representations of the data in terms of storage. However, there may be other factors that affect our design of a code. For example, a variable-length code for the English alphabet could be devised that used fewer bits for the average block of text than a fixed-length codeword such as 7-bit ASCII. On the other hand, it may be more efficient in computing terms to trade off a small overhead (even as much as 50%) in storage for speed of processing and choose a fixed-length code, which is what has been done in practice for text.
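The trade-off between variable-length and fixed-length codes can be made concrete with a small sketch. The text does not name a particular method; Huffman coding is used here as one well-known way to build a variable-length code from symbol statistics. The function names and the sample sentence are illustrative assumptions, and a realistic design would gather statistics from a large corpus rather than one sentence.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text: str) -> dict:
    """Return the Huffman codeword length in bits for each symbol in text."""
    if not text:
        raise ValueError("text must not be empty")
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def average_bits_per_symbol(text: str) -> float:
    """Average codeword length over text when coded with its own Huffman code."""
    lengths = huffman_code_lengths(text)
    counts = Counter(text)
    return sum(counts[s] * lengths[s] for s in counts) / len(text)


if __name__ == "__main__":
    sample = "the statistics of the digital code are of great interest"
    print("variable-length:", round(average_bits_per_symbol(sample), 2), "bits/char")
    print("fixed-length:    7 bits/char (7-bit ASCII)")
```

The variable-length code wins on storage, but decoding it means walking the bit stream one codeword at a time, whereas a fixed-length code lets a program jump straight to the n-th character. That is the processing-speed side of the trade-off described above.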
For audio, while a speech input signal is composed of a stream of phonemes, words and sentences, it is relatively hard to build an intermediate representation of this, so typically we start with a set of fixed-length samples of short time spans of the signal and work from there.
Similarly, with a still image (or a single image from a moving sequence), we choose to sample the input scene at typically fixed horizontal and vertical intervals, giving a 2D image of a given resolution. We make a design decision when we choose the number of levels (quantisation) of the samples; the familiar 8-bit versus 24-bit color display is such a decision (1 or 3 bytes of storage per pixel of the image).
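To see what this design decision costs in storage, the sketch below computes the uncompressed size of an image for a given resolution and pixel depth, and shows a simple uniform requantisation of an 8-bit sample to fewer levels. The function names and the 640 by 480 resolution are assumptions for illustration only, and dropping low-order bits is just one simple way to reduce the number of levels.

```python
def image_storage_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed storage for one image, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


def requantise_8bit(sample: int, bits: int) -> int:
    """Map an 8-bit sample (0..255) onto 2**bits levels by dropping low-order bits."""
    if not 0 <= sample <= 255:
        raise ValueError("sample must be in 0..255")
    if not 1 <= bits <= 8:
        raise ValueError("bits must be in 1..8")
    return sample >> (8 - bits)


if __name__ == "__main__":
    # Hypothetical 640 x 480 image at the two pixel depths mentioned above.
    for depth in (8, 24):
        print(depth, "bits/pixel:", image_storage_bytes(640, 480, depth), "bytes")
    # The same 8-bit sample expressed with only 16 levels (4 bits).
    print(requantise_8bit(200, 4))
```

Moving from 8 to 24 bits per pixel triples the storage for every frame, so the choice of quantisation has as much effect on the size of the data as the choice of spatial resolution.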