ELEC 301 Project: Alex Cobb, Dee Fernandez, Davy Ho, Brian Lin

Voice compression is an important problem in digital signal processing and communications. The goal is to reduce the data rate of a voice signal as far as possible while keeping the speech intelligible. Voice compression is crucial to many engineering applications, most notably telecommunications, where vocoders are used in cell phones. Reducing the data rate required for each signal lets a network accommodate more users, which lowers the overall cost and frees bandwidth for other demanding uses such as new 3G and web-oriented services.
Voice compression also has applications on the Internet, where demand is rising for streaming multimedia and voice over IP. Both have real-time constraints and therefore little tolerance for delay.

The goals of this project were to:
Decrease the required signal bandwidth
Use frequency-domain properties to reduce the signal data rate
Preserve signal integrity
Minimize processing requirements

The starting point is a continuous-time signal, which cannot be analyzed easily or quickly in its original form. Sampling converts it to a discrete-time signal that can be examined and processed with a digital signal processor.
Once the signal is sampled, the Fast Fourier Transform (FFT) can be applied. The FFT efficiently computes the discrete Fourier transform, which represents the signal in the frequency domain so that it can be analyzed in terms of its frequency content. This frequency-domain representation lends itself to many types of compression; modern audio codecs such as MP3 also work on a frequency-domain representation of the signal.
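As a small illustration of the frequency-domain view described above, the following Python sketch transforms one 64-sample section of a test tone and locates its strongest frequency bin. It is not the project's MATLAB code: the naive `dft` function stands in for an FFT (which computes the same values faster), and the 1000 Hz tone is a hypothetical input chosen for the example.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform; an FFT gives the same values faster."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

SAMPLE_RATE = 8000  # Hz, the sampling rate used in this project
N = 64              # section length (one of the chunk sizes tested)
TONE_HZ = 1000      # hypothetical test tone, not project data

samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(N)]
spectrum = dft(samples)

# For a real-valued input only bins 0..N/2 are independent.
magnitudes = [abs(c) for c in spectrum[: N // 2 + 1]]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N  # each bin spans SAMPLE_RATE / N = 125 Hz

print(peak_bin, peak_hz)  # 8 1000.0
```

With 64 samples at 8000 Hz, each bin is 125 Hz wide, so the whole tone shows up in a single bin and the other bins could be discarded without losing information.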
In the time domain, the signal arrives as a continuous stream and must be analyzed on an ongoing basis before it can be sent. In the frequency domain, a section can be analyzed, have certain portions removed or compressed, and then be converted back to yield a smaller signal.
Before the signal is sent, it is band-pass filtered from 300 Hz to 3000 Hz. Since most of the frequency content of the human voice is concentrated in this region, it is logical to use only this band. This is similar to landline telephone systems, which band-limit the signal to roughly the same region before sampling it to yield a 64 kb/s signal.
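The band limiting described above can be sketched in Python as a mask over frequency bins. This is a minimal sketch that assumes the filter is applied by zeroing bins of an N-point transform; the function names `passband_bins` and `bandpass` are hypothetical, and the project may have filtered differently.

```python
SAMPLE_RATE = 8000      # Hz
BITS_PER_SAMPLE = 8
LOW_HZ, HIGH_HZ = 300, 3000

# Raw rate of the uncompressed signal: 8000 samples/s * 8 bits = 64000 bits/s.
raw_bit_rate = SAMPLE_RATE * BITS_PER_SAMPLE

def passband_bins(n, sample_rate=SAMPLE_RATE, low=LOW_HZ, high=HIGH_HZ):
    """Indices k in 0..n/2 whose centre frequency k * fs / n lies in [low, high]."""
    return [k for k in range(n // 2 + 1) if low <= k * sample_rate / n <= high]

def bandpass(spectrum, keep):
    """Zero every bin outside `keep`, also keeping the mirrored bins n - k
    so that the inverse transform of a real signal stays real."""
    n = len(spectrum)
    keep_set = set(keep) | {(n - k) % n for k in keep}
    return [c if k in keep_set else 0j for k, c in enumerate(spectrum)]

bins_64 = passband_bins(64)
print(raw_bit_rate, bins_64[0], bins_64[-1], len(bins_64))
```

For a 64-sample section the bin spacing is 125 Hz, so the pass band covers bins 3 (375 Hz) through 24 (3000 Hz), or 22 of the 33 independent bins.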

First, input speech is sampled at 8000 Hz with 8-bit resolution, which yields a raw bit rate of 64 kbps. The signal is then segmented into sections of length N, and each section is transformed into the frequency domain using an FFT. The sections are then band-pass filtered so that only frequencies between 300 Hz and 3000 Hz pass. Finally, the sections are compressed.
Several methods of compression were tried at this key step. One is a pseudo-random zeroing out of frequencies to reduce the data requirements: we first chose a set of frequencies at random, then eliminated those same frequencies from every input signal. The receiving system then fills in these “gaps” through an interpolation process. A second scheme used the same zeroing out, but upon reception the gaps are simply ignored and the signal is rebuilt without the zeroed-out frequencies. A third scheme was an absolute quantization in the frequency domain, which maps the frequency-domain values onto a fixed set of levels so as to reduce the number of possible values. To improve signal integrity, a non-absolute quantization process was also analyzed. This variation provides a differential resolution, with finer resolution where the differences between amplitudes are small and coarser resolution where they are large.
This process is carried out for each of the N-length sections. The signal is then sent in the frequency domain and reconstructed once it reaches its destination.
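The two zeroing schemes above differ only in what the receiver does with the gaps. The following Python sketch shows one way this could work, assuming the sender and receiver agree on the dropped bins through a shared random seed and that gaps are filled by linear interpolation between the nearest transmitted bins. The seed value, function names, and interpolation rule are assumptions made for illustration, not details taken from the project.

```python
import random

def zero_mask(num_bins, num_dropped, seed=301):
    """Choose the bins to drop; the same seed gives the same bins at both ends."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_bins), num_dropped))

def compress(values, dropped):
    """Sender: transmit only the bins that are not in the drop list."""
    dropped_set = set(dropped)
    return [v for k, v in enumerate(values) if k not in dropped_set]

def rebuild(sent, num_bins, dropped, interpolate):
    """Receiver: restore bin positions, then fill each gap with 0 (gaps ignored)
    or with a linear interpolation between the nearest transmitted bins."""
    dropped_set = set(dropped)
    it = iter(sent)
    out = [None if k in dropped_set else next(it) for k in range(num_bins)]
    kept = [k for k in range(num_bins) if k not in dropped_set]
    for k in dropped:
        left = max((j for j in kept if j < k), default=None)
        right = min((j for j in kept if j > k), default=None)
        if not interpolate or left is None or right is None:
            out[k] = 0.0
        else:
            w = (k - left) / (right - left)
            out[k] = (1 - w) * out[left] + w * out[right]
    return out

# Hypothetical magnitudes for six bins, with bins 2 and 3 dropped.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dropped = [2, 3]
sent = compress(values, dropped)
print(rebuild(sent, len(values), dropped, interpolate=False))
print(rebuild(sent, len(values), dropped, interpolate=True))
```

Interpolation recovers this toy input well only because its magnitudes lie on a straight line. Real speech spectra vary much less smoothly from bin to bin, so interpolated values can differ substantially from the originals.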
Further Details
These are the specifics for the two main algorithms that we implemented. The data in the results section are taken from these two implementations.
Peak Finding Algorithm
Once the frequency-domain representation has been computed, the peaks are located and their values and locations are recorded. The values are then truncated to predetermined levels. In the reconstruction, the "zeroed-out" frequencies are estimated through interpolation.
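A minimal Python sketch of the peak-finding idea follows. It assumes that a peak is a strict local maximum of the magnitude spectrum, that truncation snaps a value down to the nearest level in a fixed list, and that the receiver interpolates linearly between neighbouring peaks. All three choices, along with the function names and sample values, are assumptions for illustration rather than the project's exact rules.

```python
def find_peaks(magnitudes):
    """Return (index, value) for every strict local maximum."""
    return [
        (k, magnitudes[k])
        for k in range(1, len(magnitudes) - 1)
        if magnitudes[k - 1] < magnitudes[k] > magnitudes[k + 1]
    ]

def truncate(value, levels):
    """Snap value down to the largest predetermined level not exceeding it."""
    candidates = [lv for lv in levels if lv <= value]
    return max(candidates) if candidates else min(levels)

def reconstruct(peaks, num_bins):
    """Estimate the bins between consecutive peaks by linear interpolation;
    bins before the first peak and after the last peak stay at zero."""
    out = [0.0] * num_bins
    for (k0, v0), (k1, v1) in zip(peaks, peaks[1:]):
        for k in range(k0, k1 + 1):
            w = (k - k0) / (k1 - k0)
            out[k] = (1 - w) * v0 + w * v1
    if len(peaks) == 1:
        out[peaks[0][0]] = peaks[0][1]
    return out

# Hypothetical magnitude spectrum and quantization levels.
magnitudes = [0.0, 0.2, 0.9, 0.3, 0.1, 0.6, 0.2, 0.0]
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

peaks = [(k, truncate(v, levels)) for k, v in find_peaks(magnitudes)]
print(peaks)  # [(2, 0.75), (5, 0.5)]
print(reconstruct(peaks, len(magnitudes)))
```

Only the peak locations and truncated values need to be transmitted, which is why the number of values sent depends on how many peaks each section contains.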
Normalized Truncation
Once the frequency-domain representation has been computed, it is normalized: magnitudes are scaled to lie between 0 and 1, and phases to lie between -1 and 1. The normalized values are then truncated to predetermined levels. In the reconstruction, the "zeroed-out" frequencies are ignored, so their information is lost.
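The normalization and truncation steps can be sketched in Python as follows. The sketch assumes that magnitudes are divided by the largest magnitude in the section, that phases are divided by pi, and that the predetermined levels are 2^b evenly spaced values for a b-bit encoding. These are illustrative assumptions, as are the function names and the sample spectrum.

```python
import cmath
import math

def normalize(spectrum):
    """Scale magnitudes into [0, 1] and phases into [-1, 1]."""
    peak = max(abs(c) for c in spectrum) or 1.0  # avoid dividing by zero
    mags = [abs(c) / peak for c in spectrum]
    phases = [cmath.phase(c) / math.pi for c in spectrum]
    return mags, phases, peak

def truncate_to_bits(x, bits, lo, hi):
    """Truncate x in [lo, hi] down to one of 2**bits evenly spaced levels."""
    steps = 2 ** bits - 1
    index = math.floor((x - lo) / (hi - lo) * steps)
    return lo + index * (hi - lo) / steps

# Hypothetical four-bin spectrum.
spectrum = [4 + 0j, 0 + 2j, -1 + 0j, 0.1 + 0j]
mags, phases, peak = normalize(spectrum)

quantized = [truncate_to_bits(m, 3, 0.0, 1.0) for m in mags]
print(mags)       # [1.0, 0.5, 0.25, 0.025]
print(quantized)  # the smallest magnitude truncates to 0.0
```

Under these assumptions, small magnitudes fall below the first non-zero level and truncate to exactly zero, which is one way the "zeroed-out" frequencies mentioned above can arise.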

The analysis focused primarily on the bandwidth reduction that each implementation achieved relative to the uncompressed signal. In addition, the signal integrity of the compressed versions was compared with that of the uncompressed versions. Signal integrity was judged by how accurately listeners could understand the reproduced speech.

The following tables report the results from our two main algorithms. Various chunk sizes and numbers of encoding bits were used. For each combination, the number of zero and non-zero values was recorded for both the real and imaginary parts, along with the total number of bits used to transmit the audio clip and the corresponding bit rate. The last column links to the resulting sound clip. The original sound clips are linked at the top of each table.
Alex's Voice
Original Sound Clip - Note: Alex's voice has a great deal of bass, which makes it difficult to understand even in the original, and compression makes it harder still.
Peak Finding Algorithm

Chunk Size / # of Bits  Real 0 Count  Real Non-0 Count  Imag 0 Count  Imag Non-0 Count  Total Bits  Bit Rate (bps)  Resulting Clip
64 / 3-bit              0             5148              0             2340              53352       14250           Play Clip
64 / 4-bit              0             5148              0             2340              53352       14250           Play Clip
64 / 5-bit              0             5148              0             2340              53352       14250           Play Clip
64 / 6-bit              0             5148              0             2340              53352       14250           Play Clip
64 / 7-bit              0             5148              0             2340              53352       14250           Play Clip
128 / 3-bit             0             2574              0             1170              26676       7125            Play Clip
128 / 4-bit             0             2574              0             1170              26676       7125            Play Clip
128 / 5-bit             0             2574              0             1170              26676       7125            Play Clip
128 / 6-bit             0             2574              0             1170              26676       7125            Play Clip
128 / 7-bit             0             2574              0             1170              26676       7125            Play Clip
256 / 3-bit             0             1287              0             585               13338       3563            Play Clip
256 / 4-bit             0             1287              0             585               13338       3563            Play Clip
256 / 5-bit             0             1287              0             585               13338       3563            Play Clip
256 / 6-bit             0             1287              0             585               13338       3563            Play Clip
256 / 7-bit             0             1287              0             585               13338       3563            Play Clip
512 / 3-bit             0             638               0             290               6612        1781            Play Clip
512 / 4-bit             0             638               0             290               6612        1781            Play Clip
512 / 5-bit             0             638               0             290               6612        1781            Play Clip
512 / 6-bit             0             638               0             290               6612        1781            Play Clip
512 / 7-bit             0             638               0             290               6612        1781            Play Clip

Normalized Truncation Algorithm

Chunk Size / # of Bits  Real 0 Count  Real Non-0 Count  Imag 0 Count  Imag Non-0 Count  Total Bits  Bit Rate (bps)  Resulting Clip
64 / 3-bit              20744         1720              20923         1541              51450       13742           Play Clip
64 / 4-bit              20744         1720              20923         1541              54711       14613           Play Clip
64 / 5-bit              20744         1720              20923         1541              57972       15484           Play Clip
64 / 6-bit              20744         1720              20923         1541              61233       16355           Play Clip
64 / 7-bit              20744         1720              20923         1541              64494       17226           Play Clip
128 / 3-bit             20891         1573              20889         1575              51224       13682           Play Clip
128 / 4-bit             20891         1573              20889         1575              54372       14522           Play Clip
128 / 5-bit             20891         1573              20889         1575              57520       15363           Play Clip
128 / 6-bit             20891         1573              20889         1575              60668       16204           Play Clip
128 / 7-bit             20891         1573              20889         1575              63816       17045           Play Clip
256 / 3-bit             20957         1507              21017         1447              50836       13578           Play Clip
256 / 4-bit             20957         1507              21017         1447              53790       14367           Play Clip
256 / 5-bit             20957         1507              21017         1447              56744       15156           Play Clip
256 / 6-bit             20957         1507              21017         1447              59698       15945           Play Clip
256 / 7-bit             20957         1507              21017         1447              62652       16734           Play Clip
512 / 3-bit             20696         1576              20792         1480              50656       13647           Play Clip
512 / 4-bit             20696         1576              20792         1480              53712       14470           Play Clip
512 / 5-bit             20696         1576              20792         1480              56768       15293           Play Clip
512 / 6-bit             20696         1576              20792         1480              59824       16116           Play Clip
512 / 7-bit             20696         1576              20792         1480              62880       16940           Play Clip

Britney Spears Clip
Original Sound Clip - Note: Britney's voice is higher in pitch, making it easier to understand.
Peak Finding Algorithm

Chunk Size / # of Bits  Real 0 Count  Real Non-0 Count  Imag 0 Count  Imag Non-0 Count  Total Bits  Bit Rate (bps)  Resulting Clip
64 / 3-bit              0             3531              0             1605              36594       14250           Play Clip
64 / 4-bit              0             3531              0             1605              36594       14250           Play Clip
64 / 5-bit              0             3531              0             1605              36594       14250           Play Clip
64 / 6-bit              0             3531              0             1605              36594       14250           Play Clip
64 / 7-bit              0             3531              0             1605              36594       14250           Play Clip
128 / 3-bit             0             1760              0             800               18240       7125            Play Clip
128 / 4-bit             0             1760              0             800               18240       7125            Play Clip
128 / 5-bit             0             1760              0             800               18240       7125            Play Clip
128 / 6-bit             0             1760              0             800               18240       7125            Play Clip
128 / 7-bit             0             1760              0             800               18240       7125            Play Clip
256 / 3-bit             0             880               0             400               9120        3563            Play Clip
256 / 4-bit             0             880               0             400               9120        3563            Play Clip
256 / 5-bit             0             880               0             400               9120        3563            Play Clip
256 / 6-bit             0             880               0             400               9120        3563            Play Clip
256 / 7-bit             0             880               0             400               9120        3563            Play Clip
512 / 3-bit             0             440               0             200               4560        1781            Play Clip
512 / 4-bit             0             440               0             200               4560        1781            Play Clip
512 / 5-bit             0             440               0             200               4560        1781            Play Clip
512 / 6-bit             0             440               0             200               4560        1781            Play Clip
512 / 7-bit             0             440               0             200               4560        1781            Play Clip

Normalized Truncation Algorithm

Chunk Size / # of Bits  Real 0 Count  Real Non-0 Count  Imag 0 Count  Imag Non-0 Count  Total Bits  Bit Rate (bps)  Resulting Clip
64 / 3-bit              6817          8591              7112          8296              64590       25152           Play Clip
64 / 4-bit              6817          8591              7112          8296              81477       31728           Play Clip
64 / 5-bit              6817          8591              7112          8296              98364       38304           Play Clip
64 / 6-bit              6817          8591              7112          8296              115251      44880           Play Clip
64 / 7-bit              6817          8591              7112          8296              132138      51456           Play Clip
128 / 3-bit             7184          8176              7363          7997              63066       24635           Play Clip
128 / 4-bit             7184          8176              7363          7997              79239       30953           Play Clip
128 / 5-bit             7184          8176              7363          7997              95412       37270           Play Clip
128 / 6-bit             7184          8176              7363          7997              111585      43588           Play Clip
128 / 7-bit             7184          8176              7363          7997              127758      49905           Play Clip
256 / 3-bit             7394          7966              7599          7761              62174       24287           Play Clip
256 / 4-bit             7394          7966              7599          7761              77901       30430           Play Clip
256 / 5-bit             7394          7966              7599          7761              93628       36573           Play Clip
256 / 6-bit             7394          7966              7599          7761              109355      42717           Play Clip
256 / 7-bit             7394          7966              7599          7761              125082      48860           Play Clip
512 / 3-bit             7344          8016              7402          7958              62668       24480           Play Clip
512 / 4-bit             7344          8016              7402          7958              78642       30720           Play Clip
512 / 5-bit             7344          8016              7402          7958              94616       36959           Play Clip
512 / 6-bit             7344          8016              7402          7958              110590      43199           Play Clip
512 / 7-bit             7344          8016              7402          7958              126564      49439           Play Clip
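The totals reported in the Normalized Truncation tables are consistent with a cost of one bit per zero value plus b bits per non-zero value, where b is the number of encoding bits. This is an observation about the tabulated numbers rather than a documented description of the encoder, and the Python sketch below simply checks it against two rows. The function names are hypothetical.

```python
def total_bits(zero_count, nonzero_count, bits):
    """One bit per zero value plus `bits` bits per non-zero value."""
    return zero_count + nonzero_count * bits

def row_total(row, bits):
    """row = (real zeros, real non-zeros, imag zeros, imag non-zeros)."""
    real_zero, real_nonzero, imag_zero, imag_nonzero = row
    return total_bits(real_zero + imag_zero, real_nonzero + imag_nonzero, bits)

# Counts copied from the 64-sample rows of the Normalized Truncation tables.
alex_64 = (20744, 1720, 20923, 1541)
britney_64 = (6817, 8591, 7112, 8296)

print(row_total(alex_64, 3))     # 51450, matching the table
print(row_total(britney_64, 7))  # 132138, matching the table
```

Under this reading, each extra encoding bit adds exactly one bit per non-zero value, which explains why the bit rate grows much faster with b for the Britney Spears clip, where far more values are non-zero.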

These techniques allow for greatly reduced bandwidth at a small sacrifice in quality compared to simply sending the raw, uncompressed signal.
The interpolation that was attempted at the receiver proved to be inaccurate. It actually added noise to the signal, making the speech impossible to understand.
In the end, the “pseudo-random zeroing” compression combined with the differential quantization process produced the lowest bit rate while still retaining signal integrity.

This technique gives dramatic improvements over the uncompressed version, and the reduction in bandwidth has several benefits. For example, the bits saved can be spent on error-correcting codes or even encryption while keeping the overall data rate the same.
However, this technique clearly requires more processing power than simply sending the raw data, because of the many FFTs inherent in its implementation.
Additionally, the resulting bit rates, although low, are not as low as those of current industry standards, which are in the mid thousands of bits per second.
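To put the improvement over the uncompressed signal in concrete terms, the Python sketch below computes compression ratios against the 64 kbps raw rate for three bit rates copied from the results tables (64-sample chunks, 3-bit encoding). The helper name `compression_ratio` is hypothetical.

```python
RAW_BIT_RATE = 8000 * 8  # 8000 samples/s at 8 bits each = 64000 bits/s

def compression_ratio(bit_rate):
    """How many times smaller the compressed rate is than the raw rate."""
    return RAW_BIT_RATE / bit_rate

# Bit rates copied from the results tables (64-sample chunks, 3-bit encoding).
reported = {
    "Peak Finding (either clip)": 14250,
    "Normalized Truncation, Alex's voice": 13742,
    "Normalized Truncation, Britney Spears clip": 25152,
}

for name, rate in reported.items():
    print(f"{name}: {compression_ratio(rate):.2f}x")
```

These rows correspond to reductions of roughly 2.5 to 4.7 times relative to the raw signal.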

Several aspects of this project remain to be explored. The most important is a psychoacoustic model of the human voice. Such a model alone can reduce the data rate from 64 kbps to as little as 4 kbps. Psychoacoustics is critical in the compression of audio signals. For example, MP3's compression scheme uses a model suited to typical music, which significantly lowers the bit rate while maintaining signal quality. Unfortunately, the MP3 psychoacoustic model is not well suited to human speech because it preserves high frequencies that are not present in speech. A psychoacoustic model of the human voice would have to be carefully researched and fine-tuned.

Many thanks to Justin Romberg for his help and guidance with this project and throughout this semester. In addition, the assistance of Patrick Frantz and Patrick Cresap was greatly appreciated.
Final Details
Duties of each member
Davy Ho - Main MATLAB coder
Dee Fernandez - In charge of testing and data gathering
Alex Cobb - Head of poster design
Brian Lin - Main report writer/coder

We all worked together on each part of the project. The duties described were not exclusive; this was a collaborative effort.