Speech Recognition Based on Cepstral Distances


While many techniques exist for automated speech recognition, the one which most clearly highlights the principles of linear prediction and autoregressive system modeling is based solely upon an AR parameterization of speech. This technique, as with linear predictive speech coding, attempts to segment speech into quasi-stationary frames, and then to characterize these frames by their underlying AR model parameters. Test phrases are compared in the frequency domain to various reference phrases, where the frequency domain representation of all phrases is based solely upon their AR model parameters. Note that this technique makes no attempt to utilize the residual speech component (either voiced or unvoiced) which remains after an AR analysis is performed, but rather bases its recognition entirely upon the AR vocal tract model. In this regard, the speech recognition algorithm is somewhat speaker independent -- the vocal tract model should, in theory, capture the "shape" of the vocal tract but not the underlying waveforms (a quasi-periodic pulse train for voiced speech, white noise for unvoiced speech) which "excite" the vocal tract.

The first step in our speech recognition tests involved the recording of reference phrases against which various test phrases could later be compared. Two members of our group, Adnan and Charles, each recorded five separate utterances of the numbers zero through nine. These phrases were originally recorded at 16-bit resolution and 48000 Hz, then digitally filtered in Matlab using a 5th order lowpass Butterworth filter with a cutoff frequency of 3600 Hz. The filtered phrases were then resampled at 8000 Hz. The resulting speech sequences (8000 Hz, 16-bit resolution) were smaller in size and more accurately represented the frequency domain characteristics of speech transmitted across a standard POTS telephone connection.
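
As an illustration, a minimal Python/SciPy sketch of this preprocessing step (our actual implementation was in Matlab; the function and parameter names here are illustrative):

    from scipy.signal import butter, lfilter, resample_poly

    def preprocess(x, fs_in=48000, fs_out=8000, cutoff=3600.0, order=5):
        # 5th order Butterworth lowpass at 3600 Hz, normalized to Nyquist.
        b, a = butter(order, cutoff / (fs_in / 2.0), btype='low')
        filtered = lfilter(b, a, x)
        # 48000 Hz -> 8000 Hz is an integer factor of 6; resample_poly
        # applies its own anti-aliasing filter in addition.
        return resample_poly(filtered, up=1, down=fs_in // fs_out)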

After representing the speech in a more manageable form, we "manually" cropped each speech file to eliminate pre-speech and post-speech anomalies which are not actually part of each spoken phrase. Because our reference phrases numbered 100, we opted to write dedicated Matlab code which allowed us to graphically crop the files while we listened to the results. Our original aim was to crop the files to identical lengths, allowing us to perform each AR analysis using a fixed number of frames having a fixed length. We learned very quickly that this was not possible without significant "padding" of the inherently shorter phrases. Since information contained in the padding would be superfluous, we opted to simply crop each file to the portion containing speech, and to deal with the length mismatches in a different way.

After cropping, our phrases ranged in length from approximately 400 milliseconds to approximately 800 milliseconds. We chose to analyze each phrase using 20 frames of 20 milliseconds each, spaced evenly throughout the phrase: the first sample of the first frame coincided with the first speech sample, the last sample of the 20th frame coincided with the last speech sample, and the remaining frames were distributed evenly between them. Note that 20 frames of 20 milliseconds span exactly 400 milliseconds, so even our shortest phrase produced no frame overlap. We arbitrarily chose a model order of 16 for each of the AR analyses, resulting in 16 AR coefficients for each speech frame. These coefficients were then organized into an array of 100 phrases by 20 frames by 16 coefficients.
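
The frame placement and AR analysis can be sketched as follows, again in Python rather than our actual Matlab code, and assuming the autocorrelation (Yule-Walker) method of estimating the AR coefficients:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    FS, N_FRAMES, ORDER = 8000, 20, 16
    FRAME_LEN = int(0.020 * FS)   # 20 milliseconds -> 160 samples

    def frame_starts(n_samples):
        # First frame begins at the first speech sample, the last frame
        # ends at the last speech sample, the rest are spread evenly.
        return np.round(np.linspace(0, n_samples - FRAME_LEN,
                                    N_FRAMES)).astype(int)

    def ar_coefficients(frame, order=ORDER):
        # Autocorrelation method: solve the Toeplitz normal equations
        # R a = r for the AR predictor coefficients.
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        return solve_toeplitz(r[:order], r[1:order + 1])
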
After generating the array of AR coefficient values for our reference phrases, we calculated the frequency response (log magnitude scale) of each speech frame from its AR model parameters. Thirty-two frequency domain coefficients were calculated for each frame, resulting in an array of 100 phrases by 20 frames by 32 frequency domain coefficients (log magnitude scale). This array served as our "lookup table" against which each of our test phrases was later compared.
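
Continuing the sketch, the log magnitude response of each frame follows directly from its AR coefficients, since the model is an all-pole filter (the sign convention matches ar_coefficients above; the choice of natural log is an assumption):

    import numpy as np
    from scipy.signal import freqz

    def log_mag_spectrum(a, n_points=32):
        # AR model transfer function H(z) = 1 / A(z), where
        # A(z) = 1 - a[1] z^-1 - ... - a[p] z^-p.
        A = np.concatenate(([1.0], -np.asarray(a)))
        _, h = freqz([1.0], A, worN=n_points)
        # Small offset guards against log(0) in silent frames.
        return np.log(np.abs(h) + 1e-12)
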
Once our reference lookup table was completed, we proceeded to record test phrases with which we could perform our speech recognition tests. Our two original speakers, Adnan and Charles, each recorded a single utterance of the numbers zero through nine. Our goal was to compare these twenty test phrases to our lookup table entries without first manually cropping them, which required some form of automated temporal alignment. The most widely suggested approach is to align the peak energies of two phrases, which we found to perform poorly at best: phrases with relatively high energy throughout their duration often had their peak energies at very different relative positions. In response to this problem, we developed an ad hoc technique involving forward and backward linear estimates of the energy envelopes at the start and end of each phrase (the attack energy and decay energy, respectively).
To align our phrases temporally, the peak energy in each test phrase and in each reference phrase was first calculated. Working forward from the beginning of each phrase, we recorded the first sample with energy greater than 10% of the phrase's peak energy and the first sample with energy greater than 90% of the phrase's peak energy. A line was drawn through these two points in the phrase's energy signal and extended back to the time axis, with the intersection chosen to be the start of the phrase. This linear estimate was then repeated working backwards from the end of the file: using the same energy thresholds, a line was drawn forward to the time axis to determine the end of the phrase. Our lookup table was then regenerated using the reference files after they were automatically cropped in this way. Again, an AR analysis was performed on each test phrase, with the same number of frames (20) and the same frame length (20 milliseconds).
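
A sketch of this endpoint estimate follows; the smoothed squared signal used as the energy envelope, and its window length, are illustrative assumptions:

    import numpy as np

    def short_time_energy(x, win=160):
        # Smoothed squared signal as a rough energy envelope.
        return np.convolve(np.asarray(x, float) ** 2,
                           np.ones(win) / win, mode='same')

    def attack_intercept(energy, lo=0.10, hi=0.90):
        # First samples exceeding 10% and 90% of the peak energy; the
        # line through these two points, extended back to zero energy,
        # intersects the time axis at the estimated phrase start.
        peak = energy.max()
        i_lo = int(np.argmax(energy > lo * peak))
        i_hi = int(np.argmax(energy > hi * peak))
        if i_hi <= i_lo:
            return i_lo
        slope = (energy[i_hi] - energy[i_lo]) / (i_hi - i_lo)
        return max(int(i_lo - energy[i_lo] / slope), 0)

    def auto_crop(x):
        # Forward estimate for the start (attack); the same estimate on
        # the reversed signal locates the end (decay).
        e = short_time_energy(x)
        start = attack_intercept(e)
        end = len(e) - 1 - attack_intercept(e[::-1])
        return x[start:end + 1]
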
To compare a test phrase to a reference phrase, the sum of the squared differences between the two phrases' frequency domain coefficients was calculated for each analysis frame. These 20 per-frame squared difference measurements were then summed to yield a total distance between the two phrases. Each test phrase was compared to each reference phrase in this manner, with the reference phrase yielding the smallest distance chosen as the correct match.
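
In the running sketch, with each phrase reduced to a 20-by-32 array of log magnitude coefficients, this comparison collapses to a nearest-neighbor search (function names are illustrative):

    import numpy as np

    def phrase_distance(test_spec, ref_spec):
        # Both arguments: (20 frames, 32 log magnitude coefficients).
        # Per-frame sums of squared differences, summed over frames,
        # equal one total squared difference over the whole array.
        return float(np.sum((test_spec - ref_spec) ** 2))

    def best_match(test_spec, lookup_table):
        # lookup_table: (100 phrases, 20, 32); return the index of the
        # reference phrase with the smallest total distance.
        dists = np.sum((lookup_table - test_spec) ** 2, axis=(1, 2))
        return int(np.argmin(dists))
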
Our phrase recognition rate over the twenty test phrases was 100%, while our speaker recognition rate was somewhat lower. This is consistent, however, with AR-based techniques, which tend to be somewhat speaker independent.