"The convolution of a monkey is a man."
What is MPEG?
MPEG stands for Moving Picture Experts Group, an international group that came up with the standards for compression of different types of media. There are various MPEG standards, including MPEG-1 for audio and video, MPEG-2 for HDTV, MPEG-4 for video coding, and MPEG-7 and MPEG-21 for various types of multimedia. These last two are still being worked out by the MPEG committee as of this date.
As a group we are focusing on audio compression layers 1, 2 and 3 of
the MPEG-1 standard. Layer 1 is the simplest, layer 2 builds on layer 1,
and layer 3 is the most complex. Layer 3 is known as MP3. The standard
was developed over 3 years. The coolest thing by far about it, though,
is that a multi-billion dollar industry got the shaft. Unfortunately,
the ownership of the standard is still kind of up in the air.
MPEG Audio:
MPEG audio involves 4 basic parts.
The Filter Bank is:
- A set of overlapping band-pass filters which are implemented using
a DCT.
- Layers 1 and 2 use equally spaced bands.
- Layer 3 uses a special filter bank called a hybrid filter bank, which
makes use of a modified DCT (MDCT) that takes advantage of the sometimes
tremendous overlap between consecutive bands.
- In essence, this filter bank is meant to mimic what goes on in
the ear.
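To make the MDCT idea a bit more concrete, here is a minimal sketch in Python. It is not the layer 3 hybrid filter bank itself: it only shows the textbook MDCT on 50% overlapping blocks and checks that overlap-adding the inverse transforms gives back the input. The sine window, the block size and the function names (`sine_window`, `mdct`, `imdct`, `mdct_roundtrip`) are assumptions made for illustration.

```python
import math

def sine_window(n2):
    """Sine window of length 2N (an assumed choice, not taken from the standard)."""
    return [math.sin(math.pi * (i + 0.5) / n2) for i in range(n2)]

def mdct(block):
    """Turn 2N windowed time samples into N coefficients."""
    n2 = len(block)
    n = n2 // 2
    return [
        sum(block[i] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
            for i in range(n2))
        for k in range(n)
    ]

def imdct(coeffs):
    """Turn N coefficients back into 2N (time-aliased) samples."""
    n = len(coeffs)
    return [
        2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                      for k in range(n))
        for i in range(2 * n)
    ]

def mdct_roundtrip(signal, n):
    """Analyse 50% overlapping blocks, then overlap-add the inverse transforms."""
    window = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + i] * window[i] for i in range(2 * n)]
        rebuilt = imdct(mdct(block))
        for i in range(2 * n):
            out[start + i] += rebuilt[i] * window[i]
    return out
```

Each block of 2N samples yields only N coefficients, so the overlap costs nothing extra in data. The aliasing left in one block is cancelled by its neighbours when they are added together, which is why only samples covered by two blocks are reconstructed exactly.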
Specifically, the filter bank takes time-domain samples in chunks and stores
them in a buffer.
- These chunks are different sizes in the different layers.
- In Layer 1, the buffers (chunks of time samples) are 512 samples
long.
- Layers 2 and 3 switch between buffers of length 64 and 1024.
- Samples are added to the beginning (32 at a time for layer 1) and taken off
at the end, making this a dynamic buffer (FIFO).
These buffers are then multiplied point-wise by a window function, which reduces edge effects (see figure below).
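The FIFO buffering and point-wise multiplication described above can be sketched roughly as follows, using the layer 1 numbers (512-sample buffer, 32 new samples per step). The Hann window is only a stand-in, since the standard's own window coefficients are not reproduced here, and the names `hann_window` and `windowed_buffers` are made up for this example.

```python
import math
from collections import deque

BUFFER_LEN = 512  # layer 1 buffer length, from the text
HOP = 32          # new samples shifted in per step for layer 1

def hann_window(n):
    """Hann window, used here as an assumed stand-in for the real analysis window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_buffers(samples, buffer_len=BUFFER_LEN, hop=HOP):
    """Yield one windowed copy of the FIFO for every `hop` new input samples."""
    window = hann_window(buffer_len)
    # A deque with maxlen drops the oldest sample from the far end automatically
    fifo = deque([0.0] * buffer_len, maxlen=buffer_len)
    for start in range(0, len(samples) - hop + 1, hop):
        for s in samples[start:start + hop]:
            fifo.appendleft(s)  # newest sample goes to the beginning
        # Point-wise multiplication tapers the buffer towards zero at its edges
        yield [x * w for x, w in zip(fifo, window)]
```

Feeding in 64 samples, for example, produces two windowed buffers of 512 values each, with the second one containing all 64 input samples.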
Now each of the band-passed signals (subbands) is sampled in the
time domain a fixed number of times, chosen to minimize noise.
- 12 times for layer 1 (12 x 32 = 384 samples)
- 36 times for layers 2 & 3 (36 x 32 = 1152 samples)
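The arithmetic above is easy to check in a few lines of Python. The 32 subbands come from the "12 x 32" in the list; the 44100 Hz sample rate used to turn sample counts into durations is an assumption, as are the function names.

```python
SUBBANDS = 32  # number of subbands implied by "12 x 32 = 384"

# Samples taken from each subband, per layer (from the list above)
SAMPLES_PER_SUBBAND = {"layer 1": 12, "layer 2": 36, "layer 3": 36}

def frame_length(layer):
    """Total number of samples covered by one group of subband samples."""
    return SAMPLES_PER_SUBBAND[layer] * SUBBANDS

def frame_duration_ms(layer, sample_rate_hz=44100):
    """Duration of that many samples, assuming CD-rate audio by default."""
    return 1000.0 * frame_length(layer) / sample_rate_hz
```

At the assumed 44100 Hz, 384 samples last a little under 9 ms and 1152 samples a little over 26 ms.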
The Psychoacoustic model has 3 parts:
The Threshold in Quiet, Masking, and Critical Bands.
Threshold in Quiet:
Refers to the sensitivity of the ear as a function of frequency. This function is logarithmic (read as "non-linear"). Sounds that fall underneath the curve are cut off because the ear can't hear them. The curve also forms the basis of the global masking threshold. It was painstakingly created via statistical processes. In other words, the industry (Fraunhofer, Thomson, and others) hired people (or sometimes just made grad students do it) to sit and listen to sounds under certain conditions, and built a model from their results.
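For a feel of what such a curve looks like, here is a small Python sketch using a commonly quoted closed-form approximation of the threshold in quiet (attributed to Terhardt). This is an assumption brought in for illustration and is not necessarily the curve used by any particular encoder; `threshold_in_quiet_db` and `is_audible` are hypothetical names.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold in quiet, in dB SPL, at a frequency in Hz."""
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

def is_audible(freq_hz, level_db_spl):
    """A tone counts as audible only if it lies above the curve."""
    return level_db_spl > threshold_in_quiet_db(freq_hz)
```

With this approximation the ear is most sensitive at a few kHz, and the threshold climbs steeply at both ends of the spectrum, so a quiet 50 Hz tone falls under the curve while an equally quiet tone near 3 kHz does not.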
Masking:
Masking occurs when a loud sound overshadows or “masks” another sound in a similar frequency range or band.
- For example, during the day, you cannot see stars because the sun is too bright.
Masking is used to reduce the amount of information sent in each band by removing frequencies that cannot be discerned by the ear. In practice, a “global mask” is formed from all the individual band masks and is then used to remove all the superfluous frequencies from the critical bands at once.
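The idea of building one global mask out of the individual ones can be sketched in Python as below. This is a toy model under loud assumptions: the mask shape (masker level minus a fixed offset, falling off linearly with band distance), the `offset_db` and `slope_db_per_band` values, and all function names are invented for illustration and are not taken from the standard.

```python
import math

def individual_mask_db(masker_band, masker_level_db, band,
                       offset_db=10.0, slope_db_per_band=15.0):
    """Toy mask that one band casts on another (assumed shape and numbers)."""
    return masker_level_db - offset_db - slope_db_per_band * abs(band - masker_band)

def global_mask_db(levels_db, quiet_db):
    """Combine the threshold in quiet and every individual mask by adding powers."""
    mask = []
    for band in range(len(levels_db)):
        power = 10 ** (quiet_db[band] / 10)
        for masker, level in enumerate(levels_db):
            if masker != band:
                power += 10 ** (individual_mask_db(masker, level, band) / 10)
        mask.append(10 * math.log10(power))
    return mask

def audible_bands(levels_db, quiet_db):
    """True for each band whose level rises above the global mask."""
    return [level > m for level, m in zip(levels_db, global_mask_db(levels_db, quiet_db))]
```

With band levels of 80, 60, 20 and 5 dB and a flat 10 dB threshold in quiet, this toy model keeps the first two bands and drops the last two, which are drowned out by their louder neighbours.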
Critical Bands:
Healthy, young people can hear frequencies between 20 Hz and 20 kHz. The ear splits up sounds in this range into 26 unequal, overlapping partitions called critical bands which are:
- Logarithmically distributed
- The physiological basis for masking effects.
This picture demonstrates the correlation between the evenly spaced subbands from the filter banks of MPEG audio layers 1 and 2 and the critical bands that actually exist in the ear. The bands of MPEG audio layer 3 correlate more closely with the critical bands.
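The mismatch between evenly spaced subbands and critical bands can also be estimated numerically. The sketch below uses the Zwicker and Terhardt approximation of the Bark scale (one Bark is roughly one critical band) and measures how many Bark each of 32 equal-width subbands covers. The formula, the 44100 Hz sample rate and the function names are assumptions for illustration, not part of the source.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def subband_spans_in_bark(sample_rate_hz=44100, n_subbands=32):
    """Width of each equally spaced subband, measured in Bark."""
    width_hz = (sample_rate_hz / 2) / n_subbands
    return [hz_to_bark((k + 1) * width_hz) - hz_to_bark(k * width_hz)
            for k in range(n_subbands)]
```

Under these assumptions the lowest subband covers roughly six critical bands, while the highest covers only a small fraction of one.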
Intelligent Quantization:
The samples in each subband are quantized by applying the masks from the
psychoacoustic model.
Quantizing introduces a certain degree of error (“quantization error”),
or “noise”.
The encoder must minimize this error according to stringent criteria.
To this end, each subband is assigned a “scalefactor”, which is
a number that determines how much of that band is represented in the final
signal.
Error is also thought of in terms of the number of bits required to
eliminate it.
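The link between bits and error can be tried out with a small Python sketch. It assumes a plain uniform quantizer and takes the scalefactor to be the peak magnitude of the band, used to normalize the samples before rounding; both choices, and the function names, are simplifications made for illustration rather than the encoder's actual procedure.

```python
import math

def quantize_band(samples, bits):
    """Return (scalefactor, integer codes) for one subband; needs bits >= 2."""
    # Assumed scalefactor: the peak magnitude (1.0 for an all-zero band)
    scalefactor = max(abs(s) for s in samples) or 1.0
    levels = 2 ** (bits - 1) - 1  # symmetric quantizer with a zero level
    codes = [round(s / scalefactor * levels) for s in samples]
    return scalefactor, codes

def dequantize_band(scalefactor, codes, bits):
    """Rebuild approximate samples from the codes and the scalefactor."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scalefactor for c in codes]

def snr_db(original, decoded):
    """Signal-to-noise ratio of the quantization error, in dB."""
    signal = sum(s * s for s in original)
    noise = sum((s - d) ** 2 for s, d in zip(original, decoded))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)
```

Running a test tone through `quantize_band` and `dequantize_band` with more bits gives a higher `snr_db`, so an encoder can spend bits where the noise would otherwise rise above the mask and save them where it would stay hidden.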