Voice compression is an important problem in digital signal processing and communications.
The fundamental goal is to reduce the data rate of the voice signal as far as possible
while still keeping the signal intelligible. Voice compression is crucial
to many engineering applications, most importantly in telecommunications, for example the
vocoders in cell phones. Reducing the data rate required for a given signal lets a channel
accommodate more users. This lowers overall cost and frees the saved bandwidth
for other demanding uses such as new 3G web-oriented services.
Voice compression also has Internet applications, given the rising demand for streaming
multimedia files and voice over IP, both of which have real-time constraints and thus little
tolerance for delay.
- Decrease required signal bandwidth
- Utilize frequency domain properties to reduce signal data rate
- Preserve signal integrity
- Minimize processing requirements
The starting point is a continuous-time signal, which cannot be analyzed easily or
quickly in its original state. Sampling converts it to a discrete-time signal that
can be examined and processed with a digital signal processor.
Once the signal is sampled, the Fast Fourier Transform (FFT) can be applied. The FFT is a
transformation that allows a signal to be analyzed with respect to its frequencies: it provides
a representation of the signal in the frequency domain. This frequency-domain representation
lends itself to many different types of compression. Modern audio compression schemes such as
MP3 likewise operate on a frequency-domain representation.
In the time domain, the signal is continuous and requires ongoing analysis before it can be
sent, whereas in the frequency domain a section can be analyzed, have certain portions
removed and compressed, and then be converted back to yield a smaller signal.
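The idea of analyzing a section, removing portions, and converting back can be sketched as follows. This is a minimal illustration, not the project's code: it uses a direct DFT rather than an optimized FFT so that it needs only the standard library, and the test segment, the function names, and the choice of which bin to keep are all assumptions made for the example.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT; returns the real part, assuming a conjugate-symmetric input."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def zero_bins(spectrum, keep):
    """Keep the listed bins and their mirror bins (n - k); zero everything else."""
    n = len(spectrum)
    kept = set()
    for k in keep:
        kept.add(k % n)
        kept.add((-k) % n)  # mirror bin, needed for a real-valued result
    return [spectrum[k] if k in kept else 0j for k in range(n)]

# Hypothetical section: a strong tone in bin 3 plus a weak tone in bin 10.
N = 32
section = [math.sin(2 * math.pi * 3 * t / N) + 0.05 * math.sin(2 * math.pi * 10 * t / N)
           for t in range(N)]

spectrum = dft(section)
compressed = zero_bins(spectrum, keep=[3])   # discard the weak component
restored = idft(compressed)

nonzero_bins = sum(1 for c in compressed if abs(c) > 1e-9)
max_error = max(abs(a - b) for a, b in zip(section, restored))
print(nonzero_bins, round(max_error, 3))
```

Only two of the 32 bins survive, and the reconstruction error is bounded by the amplitude of the component that was discarded.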
Before the signal is sent, it is band-pass filtered from 300 Hz to 3000 Hz. Since the majority of
the frequency content of the human voice is concentrated in this region, it is logical to use
only this band. This is consistent with landline telephone systems, in which the signal is
band-limited to the same region before it is sampled to yield a 64 kbps signal.
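The numbers above can be checked with a short calculation. The sketch below assumes the 8000 Hz, 8-bit sampling stated in this report and an illustrative section length of 64; the function names are hypothetical. It computes the raw bit rate and lists which frequency bins of a length-64 transform fall inside the 300 Hz to 3000 Hz passband.

```python
def raw_bit_rate(sample_rate_hz, bits_per_sample):
    """Bits per second of an uncompressed sampled signal."""
    return sample_rate_hz * bits_per_sample

def passband_bins(n, sample_rate_hz, low_hz=300.0, high_hz=3000.0):
    """Bin indices k in 0..n/2 whose centre frequency k*fs/n lies in [low_hz, high_hz]."""
    return [k for k in range(n // 2 + 1)
            if low_hz <= k * sample_rate_hz / n <= high_hz]

rate = raw_bit_rate(8000, 8)       # 64000 bits per second
bins = passband_bins(64, 8000)     # bin spacing is 8000 / 64 = 125 Hz
print(rate, bins[0], bins[-1], len(bins))
```

With a 125 Hz bin spacing, bins 3 (375 Hz) through 24 (3000 Hz) are kept, so 22 of the 33 non-negative-frequency bins survive the band-pass step.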
First, input speech is sampled at 8000 Hz with 8-bit resolution. This yields a raw bit rate of
64 kbps. The signal is then segmented into sections of length N. Next, the sections are
transformed into the frequency domain using an FFT.
Then, these sections are band-pass filtered, allowing frequencies between 300 Hz and 3000 Hz to
pass. Finally, the segments are compressed. At this key step, several methods of compression were tried.
One is a pseudo-random zeroing out of frequencies to reduce the data requirements. We first
randomly chose a set of frequencies, then eliminated those same frequencies from every input
signal. The receiving system then takes these “gaps” and fills them in through an interpolation
process. The same zeroing out was used in a second compression scheme, but in that scheme the
gaps are simply ignored upon reception, and the signal is rebuilt without the zeroed-out
frequencies. Another compression scheme we tried was an absolute quantization in the frequency
domain, which maps the frequency-domain amplitudes onto a fixed set of levels so as to reduce
the number of possible values.
To improve signal integrity, a non-absolute quantization process was also analyzed. This variation
provides a differential resolution, allowing for greater resolution where the difference between
amplitudes is small, and less where the amplitude differences are large.
This process is carried out for each of the length-N sections. Finally, the signal is sent in the
frequency domain and reconstructed once it reaches its destination.
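The segmentation and pseudo-random zeroing steps described above might look like the following sketch. It is an illustration under assumptions, not the project's implementation: the zero-padding of the last section, the use of a shared seed so that sender and receiver derive the same set of dropped frequencies, and all function names are hypothetical. The receiver shown here is the variant that simply ignores the gaps.

```python
import random

def segment(samples, n):
    """Split samples into consecutive sections of length n, zero-padding the last one."""
    padded = list(samples) + [0] * (-len(samples) % n)
    return [padded[i:i + n] for i in range(0, len(padded), n)]

def make_mask(n_bins, n_zeroed, seed):
    """Pick the bins to drop; the same seed yields the same bins on both ends."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_bins), n_zeroed))

def drop_bins(spectrum, mask):
    """Sender side: transmit only the values that are not in the mask."""
    dropped = set(mask)
    return [v for k, v in enumerate(spectrum) if k not in dropped]

def restore_bins(values, mask, n_bins):
    """Receiver side: put received values back in place and leave the gaps at zero."""
    dropped = set(mask)
    received = iter(values)
    return [0 if k in dropped else next(received) for k in range(n_bins)]

sections = segment(range(10), 4)

# Stand-in for one section's frequency-domain values.
spectrum = [10, 11, 12, 13, 14, 15, 16, 17]
mask = make_mask(len(spectrum), 3, seed=1)
sent = drop_bins(spectrum, mask)
rebuilt = restore_bins(sent, mask, len(spectrum))
print(len(sent), rebuilt)
```

Because the mask is fixed in advance and identical for every section, the positions of the dropped frequencies never have to be transmitted; only the surviving values are sent.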
These are the specifics for the two main algorithms that we implemented. The data in the results
section are taken from these two implementations.
Peak Finding Algorithm
Once the frequency information is computed, the peaks are located, and their values and locations
are recorded. The values are then truncated to predetermined levels. In the reconstruction,
the "zeroed-out" frequencies are estimated through an interpolation method.
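A minimal sketch of the Peak Finding steps follows. The report does not say how peaks are detected, what the predetermined levels are, or which interpolation method is used, so the choices here are assumptions: peaks are interior local maxima of the magnitude spectrum, values are rounded down to a multiple of a hypothetical `step`, and gaps are filled by linear interpolation between neighbouring peaks. All function names are illustrative.

```python
def find_peaks(mag):
    """Return (index, value) for every interior local maximum of mag."""
    return [(k, mag[k]) for k in range(1, len(mag) - 1)
            if mag[k - 1] < mag[k] >= mag[k + 1]]

def truncate(value, step):
    """Round value down to a multiple of step (the assumed predetermined levels)."""
    return step * int(value // step)

def interpolate(peaks, n):
    """Rebuild an n-point magnitude curve by linear interpolation between peaks.

    Bins outside the first and last recorded peak are left at zero.
    """
    out = [0.0] * n
    if len(peaks) == 1:
        out[peaks[0][0]] = peaks[0][1]
    for (k0, v0), (k1, v1) in zip(peaks, peaks[1:]):
        for k in range(k0, k1 + 1):
            out[k] = v0 + (v1 - v0) * (k - k0) / (k1 - k0)
    return out

# Hypothetical magnitude spectrum for one section.
mag = [0, 2, 9.7, 3, 1, 4, 6.2, 5, 0]

peaks = find_peaks(mag)                                # [(2, 9.7), (6, 6.2)]
coded = [(k, truncate(v, 1.0)) for k, v in peaks]      # [(2, 9.0), (6, 6.0)]
rebuilt = interpolate(coded, len(mag))
print(coded, rebuilt)
```

Only the peak locations and truncated values are transmitted. Note how the interpolated curve replaces the dip between the two peaks with a straight line, which is one way interpolation can add energy that was not in the original spectrum.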
Normalized Truncation
Once the frequency information is computed, it is normalized. The magnitude values
are normalized to lie between 0 and 1, and the phase values between -1 and 1. Once
normalized, the values are truncated to predetermined levels. In the reconstruction, the
"zeroed-out" frequencies are ignored and lost.
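The Normalized Truncation steps can be sketched as below. The report gives the target ranges but not the scaling, so this sketch assumes that magnitudes are divided by the largest magnitude in the section and phases are divided by pi, and that "truncated to predetermined values" means truncation toward zero onto a grid of 2^bits steps per unit. The function names and the test spectrum are hypothetical.

```python
import cmath
import math

def normalize(spectrum):
    """Scale magnitudes to [0, 1] by the peak magnitude and phases to [-1, 1] by pi."""
    mags = [abs(c) for c in spectrum]
    peak = max(mags) or 1.0          # avoid dividing by zero for a silent section
    norm_mags = [m / peak for m in mags]
    norm_phases = [cmath.phase(c) / math.pi for c in spectrum]
    return norm_mags, norm_phases, peak

def truncate(value, bits):
    """Truncate toward zero onto a grid with 2**bits steps per unit."""
    levels = 2 ** bits
    return math.trunc(value * levels) / levels

def decode(q_mags, q_phases, peak):
    """Rebuild complex values; anything truncated to zero magnitude stays zero."""
    return [cmath.rect(m * peak, p * math.pi) for m, p in zip(q_mags, q_phases)]

# Hypothetical frequency-domain values for one section.
spectrum = [8 + 0j, 0 + 4j, 0.5 + 0j, -2 + 0j]
bits = 3

norm_mags, norm_phases, peak = normalize(spectrum)
q_mags = [truncate(m, bits) for m in norm_mags]       # [1.0, 0.5, 0.0, 0.25]
q_phases = [truncate(p, bits) for p in norm_phases]   # [0.0, 0.5, 0.0, 1.0]
decoded = decode(q_mags, q_phases, peak)
print(q_mags, q_phases)
```

The third value has a normalized magnitude of 0.0625, which falls below the smallest 3-bit step of 0.125 and is truncated to zero. Under these assumptions, small components are what end up as the "zeroed-out" frequencies that the receiver ignores.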
The analysis primarily focused on the bandwidth reduction achieved by each implementation
compared to the non-compressed signal. Additionally, the signal integrity of the various compressed
versions was compared with the non-compressed versions. Signal integrity was judged by how
accurately listeners could understand the reproduced speech.
The following tables report the results from the implementation of our two main algorithms. Various
chunk sizes and encoding bit depths were used. For each combination, the number of zero values was
recorded for both real and imaginary values. The total number of bits used to transmit the audio clip
and the corresponding bit rate were then computed. The last column links to the resulting sound
clip. The original sound clips are linked at the top of each table.
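As a worked check on the tables that follow, the Normalized Truncation totals are consistent with charging one bit for each zero value and b bits for each non-zero value, where b is the encoding bit depth. That costing rule is inferred from the reported numbers rather than stated in the text, it does not describe the Peak Finding rows, and the function name below is illustrative. The clip duration is not given either, so the sketch backs it out of one table row as total bits divided by bit rate.

```python
def total_bits(zero_count, nonzero_count, bits_per_value):
    """Assumed costing: 1 bit per zero value, bits_per_value bits per non-zero value."""
    return zero_count + nonzero_count * bits_per_value

# Alex's clip, Normalized Truncation, chunk size 64 (counts copied from the table).
zeros = 20744 + 20923        # real + imaginary zero counts
nonzeros = 1720 + 1541       # real + imaginary non-zero counts

bits_3 = total_bits(zeros, nonzeros, 3)   # 51450, matches the 3-bit row
bits_7 = total_bits(zeros, nonzeros, 7)   # 64494, matches the 7-bit row

# Duration implied by the 3-bit row: total bits / bit rate.
implied_duration_s = 51450 / 13742
rate_7 = round(bits_7 / implied_duration_s)   # 17226, matches the 7-bit row

print(bits_3, bits_7, rate_7)
```

Under this rule, each extra encoding bit adds exactly one bit per non-zero value, which is why the totals in a given chunk-size group rise by a constant amount from row to row.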
Original Sound Clip - Note: Alex's voice has a great deal of bass, which makes the original
hard to understand, so the compression only makes it more difficult to understand.
Peak Finding Algorithm
Chunk Size / # of Bits | Real 0 Count | Real Non-0 Count | Imag 0 Count | Imag Non-0 Count | Total Bits | Bit Rate (bps) | Resulting Clip |
64 / 3-bit | 0 | 5148 | 0 | 2340 | 53352 | 14250 | Play Clip |
64 / 4-bit | 0 | 5148 | 0 | 2340 | 53352 | 14250 | Play Clip |
64 / 5-bit | 0 | 5148 | 0 | 2340 | 53352 | 14250 | Play Clip |
64 / 6-bit | 0 | 5148 | 0 | 2340 | 53352 | 14250 | Play Clip |
64 / 7-bit | 0 | 5148 | 0 | 2340 | 53352 | 14250 | Play Clip |
128 / 3-bit | 0 | 2574 | 0 | 1170 | 26676 | 7125 | Play Clip |
128 / 4-bit | 0 | 2574 | 0 | 1170 | 26676 | 7125 | Play Clip |
128 / 5-bit | 0 | 2574 | 0 | 1170 | 26676 | 7125 | Play Clip |
128 / 6-bit | 0 | 2574 | 0 | 1170 | 26676 | 7125 | Play Clip |
128 / 7-bit | 0 | 2574 | 0 | 1170 | 26676 | 7125 | Play Clip |
256 / 3-bit | 0 | 1287 | 0 | 585 | 2106 | 13338 | Play Clip |
256 / 4-bit | 0 | 1287 | 0 | 585 | 2106 | 13338 | Play Clip |
256 / 5-bit | 0 | 1287 | 0 | 585 | 2106 | 13338 | Play Clip |
256 / 6-bit | 0 | 1287 | 0 | 585 | 2106 | 13338 | Play Clip |
256 / 7-bit | 0 | 1287 | 0 | 585 | 2106 | 13338 | Play Clip |
512 / 3-bit | 0 | 638 | 0 | 290 | 6612 | 1781 | Play Clip |
512 / 4-bit | 0 | 638 | 0 | 290 | 6612 | 1781 | Play Clip |
512 / 5-bit | 0 | 638 | 0 | 290 | 6612 | 1781 | Play Clip |
512 / 6-bit | 0 | 638 | 0 | 290 | 6612 | 1781 | Play Clip |
512 / 7-bit | 0 | 638 | 0 | 290 | 6612 | 1781 | Play Clip |
Normalized Truncation Algorithm
Chunk Size / # of Bits | Real 0 Count | Real Non-0 Count | Imag 0 Count | Imag Non-0 Count | Total Bits | Bit Rate (bps) | Resulting Clip |
64 / 3-bit | 20744 | 1720 | 20923 | 1541 | 51450 | 13742 | Play Clip |
64 / 4-bit | 20744 | 1720 | 20923 | 1541 | 54711 | 14613 | Play Clip |
64 / 5-bit | 20744 | 1720 | 20923 | 1541 | 57972 | 15484 | Play Clip |
64 / 6-bit | 20744 | 1720 | 20923 | 1541 | 61233 | 16355 | Play Clip |
64 / 7-bit | 20744 | 1720 | 20923 | 1541 | 64494 | 17226 | Play Clip |
128 / 3-bit | 20891 | 1573 | 20889 | 1575 | 51224 | 13682 | Play Clip |
128 / 4-bit | 20891 | 1573 | 20889 | 1575 | 54372 | 14522 | Play Clip |
128 / 5-bit | 20891 | 1573 | 20889 | 1575 | 57520 | 15363 | Play Clip |
128 / 6-bit | 20891 | 1573 | 20889 | 1575 | 60668 | 16204 | Play Clip |
128 / 7-bit | 20891 | 1573 | 20889 | 1575 | 63816 | 17045 | Play Clip |
256 / 3-bit | 20957 | 1507 | 21017 | 1447 | 50836 | 13578 | Play Clip |
256 / 4-bit | 20957 | 1507 | 21017 | 1447 | 53790 | 14367 | Play Clip |
256 / 5-bit | 20957 | 1507 | 21017 | 1447 | 56744 | 15156 | Play Clip |
256 / 6-bit | 20957 | 1507 | 21017 | 1447 | 59698 | 15945 | Play Clip |
256 / 7-bit | 20957 | 1507 | 21017 | 1447 | 62652 | 16734 | Play Clip |
512 / 3-bit | 20696 | 1576 | 20792 | 1480 | 50656 | 13647 | Play Clip |
512 / 4-bit | 20696 | 1576 | 20792 | 1480 | 53712 | 14470 | Play Clip |
512 / 5-bit | 20696 | 1576 | 20792 | 1480 | 56768 | 15293 | Play Clip |
512 / 6-bit | 20696 | 1576 | 20792 | 1480 | 59824 | 16116 | Play Clip |
512 / 7-bit | 20696 | 1576 | 20792 | 1480 | 62880 | 16940 | Play Clip |
Original Sound Clip - Note: Britney's voice is
higher in pitch, making it easier to understand.
Peak Finding Algorithm
Chunk Size / # of Bits | Real 0 Count | Real Non-0 Count | Imag 0 Count | Imag Non-0 Count | Total Bits | Bit Rate (bps) | Resulting Clip |
64 / 3-bit | 0 | 3531 | 0 | 1605 | 36594 | 14250 | Play Clip |
64 / 4-bit | 0 | 3531 | 0 | 1605 | 36594 | 14250 | Play Clip |
64 / 5-bit | 0 | 3531 | 0 | 1605 | 36594 | 14250 | Play Clip |
64 / 6-bit | 0 | 3531 | 0 | 1605 | 36594 | 14250 | Play Clip |
64 / 7-bit | 0 | 3531 | 0 | 1605 | 36594 | 14250 | Play Clip |
128 / 3-bit | 0 | 1760 | 0 | 800 | 18240 | 7125 | Play Clip |
128 / 4-bit | 0 | 1760 | 0 | 800 | 18240 | 7125 | Play Clip |
128 / 5-bit | 0 | 1760 | 0 | 800 | 18240 | 7125 | Play Clip |
128 / 6-bit | 0 | 1760 | 0 | 800 | 18240 | 7125 | Play Clip |
128 / 7-bit | 0 | 1760 | 0 | 800 | 18240 | 7125 | Play Clip |
256 / 3-bit | 0 | 880 | 0 | 400 | 9120 | 3563 | Play Clip |
256 / 4-bit | 0 | 880 | 0 | 400 | 9120 | 3563 | Play Clip |
256 / 5-bit | 0 | 880 | 0 | 400 | 9120 | 3563 | Play Clip |
256 / 6-bit | 0 | 880 | 0 | 400 | 9120 | 3563 | Play Clip |
256 / 7-bit | 0 | 880 | 0 | 400 | 9120 | 3563 | Play Clip |
512 / 3-bit | 0 | 440 | 0 | 200 | 4560 | 1781 | Play Clip |
512 / 4-bit | 0 | 440 | 0 | 200 | 4560 | 1781 | Play Clip |
512 / 5-bit | 0 | 440 | 0 | 200 | 4560 | 1781 | Play Clip |
512 / 6-bit | 0 | 440 | 0 | 200 | 4560 | 1781 | Play Clip |
512 / 7-bit | 0 | 440 | 0 | 200 | 4560 | 1781 | Play Clip |
Normalized Truncation Algorithm
Chunk Size / # of Bits | Real 0 Count | Real Non-0 Count | Imag 0 Count | Imag Non-0 Count | Total Bits | Bit Rate (bps) | Resulting Clip |
64 / 3-bit | 6817 | 8591 | 7112 | 8296 | 64590 | 25152 | Play Clip |
64 / 4-bit | 6817 | 8591 | 7112 | 8296 | 81477 | 31728 | Play Clip |
64 / 5-bit | 6817 | 8591 | 7112 | 8296 | 98364 | 38304 | Play Clip |
64 / 6-bit | 6817 | 8591 | 7112 | 8296 | 115251 | 44880 | Play Clip |
64 / 7-bit | 6817 | 8591 | 7112 | 8296 | 132138 | 51456 | Play Clip |
128 / 3-bit | 7184 | 8176 | 7363 | 7997 | 63066 | 24635 | Play Clip |
128 / 4-bit | 7184 | 8176 | 7363 | 7997 | 79239 | 30953 | Play Clip |
128 / 5-bit | 7184 | 8176 | 7363 | 7997 | 95412 | 37270 | Play Clip |
128 / 6-bit | 7184 | 8176 | 7363 | 7997 | 111585 | 43588 | Play Clip |
128 / 7-bit | 7184 | 8176 | 7363 | 7997 | 127758 | 49905 | Play Clip |
256 / 3-bit | 7394 | 7966 | 7599 | 7761 | 62174 | 24287 | Play Clip |
256 / 4-bit | 7394 | 7966 | 7599 | 7761 | 77901 | 30430 | Play Clip |
256 / 5-bit | 7394 | 7966 | 7599 | 7761 | 93628 | 36573 | Play Clip |
256 / 6-bit | 7394 | 7966 | 7599 | 7761 | 109355 | 42717 | Play Clip |
256 / 7-bit | 7394 | 7966 | 7599 | 7761 | 125082 | 48860 | Play Clip |
512 / 3-bit | 7344 | 8016 | 7402 | 7958 | 62668 | 24480 | Play Clip |
512 / 4-bit | 7344 | 8016 | 7402 | 7958 | 78642 | 30720 | Play Clip |
512 / 5-bit | 7344 | 8016 | 7402 | 7958 | 94616 | 36959 | Play Clip |
512 / 6-bit | 7344 | 8016 | 7402 | 7958 | 110590 | 43199 | Play Clip |
512 / 7-bit | 7344 | 8016 | 7402 | 7958 | 126564 | 49439 | Play Clip |
These techniques allow for greatly reduced bandwidth at a small sacrifice in quality compared to
simply sending the raw, uncompressed signal.
The interpolation that was attempted upon arrival of the signal proved to be inaccurate. In fact,
it added noise to the signal, making the speech impossible to understand.
In the end, the “pseudo-random zeroing” compression with the differential quantization process
produced the lowest bit rate while still retaining signal integrity.
This technique gives dramatic improvements over the non-compressed version, and the reduction in
bandwidth has several benefits. For example, the bits saved can be spent on
error-correcting codes or even encryption while keeping the overall data rate unchanged.
However, this technique clearly requires more processing power than simply sending the raw
data, because of the many FFTs that its implementation inherently requires.
Additionally, these results, although quite low, are not as low as the current industry standards,
which are in the mid thousands of bits per second.
Several aspects of this project remain to be explored in the future.
The most important is a psychoacoustic model of the human voice.
This model alone can reduce the data rate from 64 kbps to as little as 4 kbps. Psychoacoustics
is critical in the compression of audio signals. For example, MP3's compression scheme uses
a model suited to typical music, significantly lowering the bit rate while maintaining signal quality.
Unfortunately, the MP3 psychoacoustic model is not well fitted to human speech because it preserves
high frequencies that are not present in human speech.
A psychoacoustic model of the human voice would have to be carefully researched and fine-tuned.
Many thanks to Justin Romberg for his help and guidance with this project and throughout this
semester. In addition, the assistance of Patrick Frantz and Patrick Cresap was greatly appreciated.
Duties of each member
Davy Ho - Main MATLAB coder
Dee Fernandez - In charge of testing and data gathering
Alex Cobb - Head of poster design
Brian Lin - Main report writer/coder
We all worked together on each part of the project. The duties described were not exclusive;
this was a collaborative effort.