Frequency-Domain Pitch Period Estimation

Essential features of the human vocal tract

Voiced sounds are produced by forcing air through the glottis (the opening between the vocal cords) while the tension of the vocal cords is adjusted so that they oscillate and modulate the airflow into quasi-periodic pulses. These pulses excite the resonances of the remainder of the vocal tract. Different sounds are produced as muscles change the shape of the vocal tract and thereby its resonant frequencies, called formant frequencies. The rate of the pulses is called the fundamental frequency, or pitch.

Frequency-Domain Pitch Period Estimation

Cepstral Pitch Determination

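Cepstral analysis provides a way to estimate pitch. Assume that a sequence of voiced speech s[n] is the result of convolving the glottal excitation sequence e[n] with the vocal tract’s discrete impulse response q[n]. In the frequency domain, the convolution becomes a multiplication. Taking the logarithm of the magnitude spectrum and using the property log AB = log A + log B then turns the multiplication into an addition, so the excitation and the vocal tract contribute additively. Finally, the real cepstrum of a signal s[n] = e[n] * q[n] is defined as

$$c_s[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left| S(e^{j\omega}) \right| e^{j\omega n} \, d\omega$$

where

$$S(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s[n] \, e^{-j\omega n}$$

is the discrete-time Fourier transform of s[n]. Because the excitation of voiced speech is quasi-periodic, its contribution shows up in the cepstrum as a peak at the index equal to the pitch period, separated from the low-index contribution of the vocal tract.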

Figure 1: Cepstral pitch detection
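The cepstral procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind Figure 1: the function names, the 60 to 400 Hz search range, the 1e-12 logarithm floor, and the synthetic test frame are all chosen here for the example. A naive DFT keeps the sketch free of dependencies; a real system would use an FFT routine.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT, kept dependency-free; an FFT would normally be used."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    # The 1e-12 floor is an assumption that avoids log(0) on empty bins.
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


def cepstral_pitch_period(frame, fs, f_min=60.0, f_max=400.0):
    """Return the pitch period in samples as the index of the largest
    cepstral value inside the assumed pitch range [f_min, f_max]."""
    c = real_cepstrum(frame)
    lo = int(fs / f_max)                        # shortest period searched
    hi = min(int(fs / f_min), len(frame) // 2)  # longest period searched
    return max(range(lo, hi + 1), key=lambda q: c[q])


# Synthetic voiced frame (an assumption for the demo): a damped 700 Hz
# resonance restarted every 80 samples, sampled at 8 kHz.
fs = 8000
frame = [
    math.exp(-(n % 80) / 15.0) * math.sin(2 * math.pi * 700 * (n % 80) / fs)
    for n in range(400)
]
period = cepstral_pitch_period(frame, fs)
print(period, fs / period)  # pitch period in samples, pitch in Hz
```

Restricting the search to a plausible pitch range keeps the low-index vocal tract contribution from being mistaken for the excitation peak.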

Time-Domain Pitch Period Estimation

The periodic variations in a voiced signal are so evident in the waveform that time-domain techniques should be capable of detecting its pitch period. Most time-domain pitch period estimation techniques use the autocorrelation function (ACF).

Several properties make the ACF an attractive basis for estimating periodicities in all sorts of signals, including speech:

It is an even function.

It attains its maximum at k = 0.

Its value at k = 0 equals the energy for deterministic signals, or the average power for random or periodic signals.
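For a finite-energy deterministic signal, these three properties can be written compactly as below. The symbols R[k] for the ACF and s[n] for the signal are notation assumed here for illustration.

```latex
\begin{aligned}
R[k]  &= \sum_{n=-\infty}^{\infty} s[n]\, s[n+k]      && \text{(definition)} \\
R[-k] &= R[k]                                         && \text{(even function)} \\
|R[k]| &\le R[0]                                      && \text{(maximum at } k = 0\text{)} \\
R[0]  &= \sum_{n=-\infty}^{\infty} s^{2}[n]           && \text{(signal energy)}
\end{aligned}
```

For random or periodic signals the sum is replaced by a time average, so that R[0] becomes the average power.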

The following three time-domain pitch detection schemes were considered:

Autocorrelation Method (ACM)

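The ACF of a voiced frame is defined as the short-time autocorrelation function. For a frame of N samples s[0], ..., s[N-1] taken with a rectangular window, it is

$$R[k] = \sum_{m=0}^{N-1-k} s[m]\, s[m+k]$$

where k is the lag.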

Figure 2: Typical ACF of voiced signal

One of the major limitations of the autocorrelation representation is that it retains too much of the information in the speech signal. Most of the peaks in the ACF can be attributed to the damped oscillations of the vocal tract response, which are responsible for the shape of each period of the speech wave. In addition, if the window is too short compared to the pitch period, a false pitch period estimate may result.

Thus, when the autocorrelation peaks due to the vocal tract response are larger than those due to the periodicity of the vocal excitation, the simple procedure of picking the largest peak in the autocorrelation function fails.
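The peak-picking procedure described above can be sketched as follows. The function names, the 60 to 400 Hz search range, and the synthetic frame are assumptions made for this example. The frame is deliberately well behaved (five full pitch periods), so the largest peak does fall at the pitch period.

```python
import math


def short_time_acf(frame, max_lag):
    """R[k] = sum over m of s[m] * s[m + k], for k = 0..max_lag,
    using a rectangular window (the sum shrinks as the lag grows)."""
    n = len(frame)
    return [
        sum(frame[m] * frame[m + k] for m in range(n - k))
        for k in range(max_lag + 1)
    ]


def acf_pitch_period(frame, fs, f_min=60.0, f_max=400.0):
    """Pick the lag of the largest ACF value inside the assumed pitch range."""
    lag_lo = int(fs / f_max)
    lag_hi = min(int(fs / f_min), len(frame) - 1)
    r = short_time_acf(frame, lag_hi)
    return max(range(lag_lo, lag_hi + 1), key=lambda k: r[k])


# Synthetic voiced frame (an assumption for the demo): a damped 700 Hz
# resonance restarted every 80 samples, sampled at 8 kHz.
fs = 8000
frame = [
    math.exp(-(n % 80) / 15.0) * math.sin(2 * math.pi * 700 * (n % 80) / fs)
    for n in range(400)
]
print(acf_pitch_period(frame, fs))  # estimated pitch period in samples
```

The lower search bound matters: without it, the peaks caused by the resonance itself at short lags compete with the pitch peak.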

Center-clipping Autocorrelation Method (CC-ACM)

CC-ACM is an improvement on the ACF method. It belongs to the large group of "spectrum flattening" techniques. The segment of speech used to compute the ACF is first preprocessed by passing it through a center clipper, which removes the low-amplitude part of the waveform and with it much of the vocal tract detail. In this way, a clearer indication of pitch periodicity is obtained, as shown in Figure 3.

Figure 3: ACF of a clipped voiced signal

The clipping level is set to 60% of the smaller of the two maximum amplitudes found in the first third and the last third of the segment.
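A minimal sketch of the clipping step follows. Two details are assumptions, since the text does not spell them out: "maximum amplitude" is taken as the peak absolute value, and the clipper subtracts the clipping level from samples that exceed it (the classic center-clipper characteristic) rather than passing them unchanged. The function names are hypothetical.

```python
def clipping_level(frame, fraction=0.6):
    """60% of the smaller peak found in the first and last third of the frame.
    Assumption: 'maximum amplitude' means peak absolute value."""
    third = len(frame) // 3
    peak_first = max(abs(v) for v in frame[:third])
    peak_last = max(abs(v) for v in frame[-third:])
    return fraction * min(peak_first, peak_last)


def center_clip(frame, level):
    """Zero everything inside [-level, +level] and shift the rest toward zero.
    Assumption: the output is (sample - level) above the level and
    (sample + level) below the negative level."""
    clipped = []
    for v in frame:
        if v > level:
            clipped.append(v - level)
        elif v < -level:
            clipped.append(v + level)
        else:
            clipped.append(0.0)
    return clipped


frame = [0.0, 1.0, -0.5, 0.2, 2.0, -1.0, 0.1, 0.8, -0.3]
level = clipping_level(frame)
print(level, center_clip(frame, level))
```

The clipped frame is then passed to the same autocorrelation and peak-picking steps as in the plain ACM.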

Modified Autocorrelation Method (MACM)

Choosing the window size N for pitch detection involves conflicting requirements. Because the properties of speech change over time, N should be as small as possible. On the other hand, to get any indication of periodicity in the autocorrelation function, the window must span at least two periods of the waveform. To reconcile these requirements, the modified short-time ACF is used:

$$\hat{R}[k] = \sum_{m=0}^{N-1} s[m]\, s[m+k], \qquad 0 \le k \le K$$

where K is the greatest lag of interest. Every lag is computed from the same number of terms, N, at the cost of using N + K samples in total.
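The sketch below illustrates one common form of the modified short-time ACF, in which every lag is summed over the same N terms and the second factor reaches up to K samples beyond the window. This form, the function name, and the synthetic signal are assumptions for illustration. With a window of only 100 samples (barely longer than one 80-sample pitch period), the value at lag 80 still equals the value at lag 0.

```python
import math


def modified_acf(samples, start, n, max_lag):
    """R_hat[k] = sum_{m=0}^{n-1} s[start+m] * s[start+m+k], for k = 0..max_lag.
    Needs n + max_lag samples starting at index `start`."""
    if start + n + max_lag > len(samples):
        raise ValueError("need n + max_lag samples from the start index")
    return [
        sum(samples[start + m] * samples[start + m + k] for m in range(n))
        for k in range(max_lag + 1)
    ]


# Synthetic voiced signal (an assumption for the demo): a damped 700 Hz
# resonance restarted every 80 samples, sampled at 8 kHz.
fs = 8000
signal = [
    math.exp(-(i % 80) / 15.0) * math.sin(2 * math.pi * 700 * (i % 80) / fs)
    for i in range(400)
]
r_hat = modified_acf(signal, start=0, n=100, max_lag=133)
print(max(range(20, 134), key=lambda k: r_hat[k]))  # lag of the largest peak
```

Unlike the plain short-time ACF, the result is no longer an even function of the lag in general, because the two factors are taken from different stretches of the signal.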

Two-state LPC vocoder synthesizer

Speech can be synthesized from the linear predictive parameters. The figure shows a block diagram of the speech synthesizer. The time-varying control parameters needed by the synthesizer are the pitch period, a voiced/unvoiced switch, the gain, and the predictor coefficients. The impulse generator acts as the excitation source for voiced sounds, producing a pulse of unit amplitude at the beginning of each pitch period. The white noise generator acts as the excitation source for unvoiced sounds, producing uncorrelated samples with a zero-mean Gaussian distribution. The voiced/unvoiced control selects between the two sources, and the gain control determines the overall amplitude of the excitation.
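The signal flow described above can be sketched for a single frame with fixed parameters. The all-pole recursion s[n] = G e[n] + a_1 s[n-1] + ... + a_p s[n-p], the sign convention of the predictor coefficients, unit noise variance, and the function names are assumptions for this example; a complete synthesizer would also update the parameters from frame to frame and carry the filter state across frames.

```python
import random


def excitation(num_samples, voiced, pitch_period=None, rng=None):
    """Two-state source: unit impulses every pitch period when voiced,
    zero-mean Gaussian white noise (unit variance assumed) when unvoiced."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = rng or random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def lpc_synthesize(excite, gain, predictor):
    """All-pole synthesis filter (assumed sign convention):
    s[n] = gain * e[n] + sum_k predictor[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excite):
        acc = gain * e
        for k, a in enumerate(predictor, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out


# Voiced frame: pitch period of 4 samples, a single predictor coefficient.
voiced = lpc_synthesize(excitation(6, True, pitch_period=4), 2.0, [0.5])
print(voiced)

# Unvoiced frame: seeded noise source so the run is repeatable.
unvoiced = lpc_synthesize(excitation(6, False, rng=random.Random(0)), 2.0, [0.5])
print(unvoiced)
```

The voiced/unvoiced flag only changes the source; the gain and the predictor filter are applied in the same way to both.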