Endpoint Detection
  The implementation of the speaker verification system first addresses the problem of finding the endpoints of speech in a waveform. The code that executes the algorithm is in the file locatespeech.m. The algorithm finds the start and end of speech in a given waveform, allowing the speech to be extracted and analyzed. Our implementation uses this algorithm for the short-time magnitude analysis of the speech, but not for cutting the unvoiced regions out of the pitch track.

  It is important to note that this algorithm returns the entire region where speech exists in an input signal, which can include both voiced and unvoiced regions. Voiced speech includes hard consonants such as "ggg" and "ddd", while unvoiced speech includes fricatives such as "fff" and "sss". The short-time magnitude of a speech signal needs all of the speech that this algorithm locates. Short-time pitch, however, is concerned only with the voiced regions of speech. For the pitch track we therefore do not use this algorithm and instead use the energy in the signal to separate the voiced regions from the unvoiced ones. This is developed further in the Short-Time Frequency section.
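
  To make the voiced/unvoiced distinction concrete, here is a minimal Python sketch of the two short-time measures that the steps below rely on: average magnitude and zero-crossing rate. It is not taken from locatespeech.m. The function names, the 8 kHz sampling rate, the 10 ms frame, and the synthetic test frames (a 150 Hz tone standing in for voiced speech, weak white noise standing in for a fricative) are all assumptions chosen for illustration.

```python
import math
import random

def average_magnitude(frame):
    """Mean absolute sample value of one frame."""
    return sum(abs(x) for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

FS = 8000   # assumed sampling rate in Hz
N = 80      # assumed frame length: 10 ms at 8 kHz

rng = random.Random(0)  # fixed seed so the example is repeatable

# Voiced-like frame (assumption): a strong, low-frequency tone.
voiced = [0.8 * math.sin(2 * math.pi * 150 * n / FS) for n in range(N)]
# Unvoiced-like frame (assumption): weak white noise.
unvoiced = [0.1 * rng.uniform(-1, 1) for _ in range(N)]

for name, frame in (("voiced", voiced), ("unvoiced", unvoiced)):
    print(name, average_magnitude(frame), zero_crossing_rate(frame))
```

  In this toy case the voiced-like frame has the larger magnitude and the unvoiced-like frame has the larger zero-crossing rate, which is why a magnitude test alone can miss a fricative at the edge of a phrase.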

The endpoint detection algorithm functions as follows:

  1. Remove any DC offset from the signal. This step is important because the zero-crossing rate of the signal is calculated and helps determine where unvoiced sections of speech exist. If the DC offset is not removed, the signal may rarely or never cross zero, so we cannot measure the zero-crossing rate of the noise in order to eliminate the noise from our signal.

  2. Compute the average magnitude and zero-crossing rate of the signal, as well as the average magnitude and zero-crossing rate of the background noise. The noise measurements are taken from the first hundred milliseconds of the signal. The means and standard deviations of both the average magnitude and the zero-crossing rate of the noise are calculated, which lets us set a threshold for each that separates the actual speech signal from the background noise.

  3. Starting at the beginning of the signal, search for the first point where the signal magnitude exceeds the previously set threshold for the average magnitude. This location marks the beginning of the voiced section of the speech.

  4. From this point, search backwards until the magnitude drops below a lower magnitude threshold.

  5. From here, search the previous twenty-five frames of the signal to determine whether, and where, the zero-crossing rate drops below the previously set threshold. Such a point, if found, shows that the speech begins with an unvoiced sound, and it lets the algorithm return a starting point for the speech that includes any unvoiced section at the start of the phrase.

  6. Repeat the above process from the end of the speech signal, searching in the opposite direction, to locate an endpoint for the speech.
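
  The six steps above can be sketched in Python as follows. This is an illustrative reading of the description, not the code in locatespeech.m. The function names, the threshold multipliers (6, 3 and 2 standard deviations), the use of a per-frame zero-crossing count, the 8-sample frames, and the toy signal are all assumptions. Step 5 is interpreted as "keep moving outward while the zero-crossing count stays above its threshold, and stop where it drops below".

```python
import statistics

def remove_dc(signal):
    """Step 1: subtract the mean so zero crossings are counted about zero."""
    offset = sum(signal) / len(signal)
    return [s - offset for s in signal]

def frame_features(signal, frame_len):
    """Step 2: average magnitude and zero-crossing count of each frame."""
    mags, zcrs = [], []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        mags.append(sum(abs(x) for x in frame) / frame_len)
        zcrs.append(sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:])))
    return mags, zcrs

def noise_thresholds(mags, zcrs, noise_frames):
    """Step 2: thresholds from the leading noise frames.
    The multipliers 6, 3 and 2 are assumptions, not values from the source."""
    m_mu = statistics.mean(mags[:noise_frames])
    m_sd = statistics.pstdev(mags[:noise_frames])
    z_mu = statistics.mean(zcrs[:noise_frames])
    z_sd = statistics.pstdev(zcrs[:noise_frames])
    return m_mu + 6 * m_sd, m_mu + 3 * m_sd, z_mu + 2 * z_sd

def find_edge(mags, zcrs, order, upper, lower, z_thr, lookback=25):
    """Steps 3 to 5 along `order`, a list of frame indices that runs from
    the outside of the signal inward."""
    # Step 3: first frame whose magnitude exceeds the upper threshold.
    pos = next((k for k, i in enumerate(order) if mags[i] > upper), None)
    if pos is None:
        return None  # no speech found
    # Step 4: move back outward until the magnitude drops below `lower`.
    while pos > 0 and mags[order[pos - 1]] >= lower:
        pos -= 1
    # Step 5: over at most `lookback` frames, keep moving outward while the
    # zero-crossing count is above its threshold (unvoiced speech).
    steps = 0
    while pos > 0 and steps < lookback and zcrs[order[pos - 1]] > z_thr:
        pos -= 1
        steps += 1
    return order[pos]

def locate_speech(signal, frame_len, noise_frames=10):
    """Return (start, stop) sample indices of the speech, or None."""
    mags, zcrs = frame_features(remove_dc(signal), frame_len)
    upper, lower, z_thr = noise_thresholds(mags, zcrs, noise_frames)
    order = list(range(len(mags)))
    start = find_edge(mags, zcrs, order, upper, lower, z_thr)
    if start is None:
        return None
    # Step 6: same search, run from the end of the signal.
    end = find_edge(mags, zcrs, order[::-1], upper, lower, z_thr)
    return start * frame_len, (end + 1) * frame_len

# Toy signal (assumption): 8-sample frames riding on a DC offset of 0.5.
def tile(pattern, amp):
    return [0.5 + amp * p for p in pattern]

LOW = [1, 1, 1, 1, -1, -1, -1, -1]    # 1 zero crossing per frame
MID = [1, 1, -1, -1, 1, 1, -1, -1]    # 3 zero crossings per frame
HIGH = [1, -1, 1, -1, 1, -1, 1, -1]   # 7 zero crossings per frame

noise = tile(LOW, 0.010) + tile(MID, 0.012)   # two quiet background frames
signal = (noise * 6                  # frames 0-11: background noise
          + tile(HIGH, 0.012) * 3    # frames 12-14: quiet unvoiced onset
          + tile(LOW, 0.4) * 10      # frames 15-24: loud voiced speech
          + tile(HIGH, 0.012) * 2    # frames 25-26: quiet unvoiced tail
          + noise * 5)               # frames 27-36: background noise

print(locate_speech(signal, frame_len=8))  # (96, 216), i.e. frames 12 to 26
```

  In the toy signal the unvoiced frames are no louder than the background, so the magnitude thresholds alone would place the endpoints at frames 15 and 24. The zero-crossing search is what pulls them out to frames 12 and 26. Skipping remove_dc on this signal leaves every sample positive, so no zero crossings are counted at all.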


© 1999
Sara MacAlpine
JP Slavinsky
Nipul Bharani
Aamir Virani