Vocal-Tract Characterization by Linear Prediction Coefficients

Although the acoustic tube and transmission line model is relatively accurate, its complexity makes real-time speech synthesis difficult: we cannot easily perform calculations over many concatenated acoustic tubes or cascaded transmission lines. Fortunately, another model exists that greatly simplifies things. Notice that in the transmission line model, a signal flows from source to load through a series of delays, and the signal that finally reaches the load (the lips) is a linear combination of the many reflected and transmitted waves created at the transmission line junctions.

This strongly implies that we can model the output of the vocal tract as the summation of past outputs and past and current inputs. If we let Yapprox be the output of the vocal tract and, for the moment, simply neglect the input, we can write the equation:

Yapprox[n] = a1*y[n-1] + a2*y[n-2] + ... + ap*y[n-p]

This is the linear prediction (LP) approximation! The aj's are called the LP coefficients; their weights uniquely characterize the difference equation.
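In practice, the LP coefficients are found by fitting this difference equation to an observed signal. As a minimal sketch (not the project's actual analysis code), the coefficients can be estimated by least squares: build a matrix of past samples and solve for the weights that best predict each current sample. The helper name lp_coefficients is hypothetical.

```python
import numpy as np

def lp_coefficients(y, p):
    """Estimate order-p LP coefficients a_1..a_p by least squares,
    fitting y[n] ~ a1*y[n-1] + ... + ap*y[n-p]."""
    N = len(y)
    # Row i holds the p samples preceding y[p + i]
    A = np.column_stack([y[p - j:N - j] for j in range(1, p + 1)])
    b = y[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# Example: a signal generated by a known difference equation,
# y[n] = 1.5*y[n-1] - 0.7*y[n-2] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
y = np.zeros_like(e)
for n in range(2, len(y)):
    y[n] = 1.5 * y[n - 1] - 0.7 * y[n - 2] + e[n]

a = lp_coefficients(y, 2)  # recovers approximately [1.5, -0.7]
```

With enough samples, the estimated weights converge to the true coefficients of the generating difference equation.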

Now, restoring an excitation input e[n] so that y[n] = a1*y[n-1] + ... + ap*y[n-p] + e[n], we can take the z-transform and get the transfer function:

H(z) = Y(z)/E(z) = 1 / (1 - a1*z^-1 - a2*z^-2 - ... - ap*z^-p)

This H(z) is the system response of an all-pole filter! If we pass the excitation source through this filter, the source signal is shaped into our desired utterance. This is the method we use in this project to synthesize speech.


Discrete-Time Model of Synthesis Using Linear Prediction

Figure 4 is a model of speech production using LP analysis.

The excitation signal is either an impulse train or white noise. For voiced speech, the excitation is a periodic impulse train whose period equals the pitch period of the speech. This impulse train is passed through a glottal filter that models the airflow from the lungs through the vocal folds; after all, the pulses produced by our vocal folds more closely resemble Fig 5 than an ideal impulse train. For unvoiced speech, a white noise signal is produced instead. A switch routes the desired source to the filter.
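The voiced/unvoiced switch can be sketched as a small source generator. This is an illustrative helper (the name make_excitation and its parameters are assumptions, not part of the model's specification):

```python
import numpy as np

def make_excitation(voiced, length, pitch_period=None, seed=0):
    """Generate the source signal: a periodic impulse train for
    voiced speech, white noise for unvoiced speech."""
    if voiced:
        e = np.zeros(length)
        e[::pitch_period] = 1.0  # one impulse per pitch period
        return e
    rng = np.random.default_rng(seed)
    return rng.standard_normal(length)

voiced_src = make_excitation(True, 400, pitch_period=80)
unvoiced_src = make_excitation(False, 400)
```

A real synthesizer would also pass the voiced source through the glottal filter described above before it reaches the vocal tract filter.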

Figure 5

The vocal tract filter is characterized by the LP coefficients. The radiation filter models the propagation of the sound waves once they leave the lips, but it is neglected in our simplified model. Beyond this, the main difference in the filter setup is that we have simplified the pole-zero filter that fully characterizes the vocal tract into an all-pole filter, motivated by ease of calculation. Fortunately, this simplification is justified for simple speech synthesis, since the poles are what determine the essential formant peaks of voiced signals. However, by discarding the zeros, we effectively remove the nasal cavity, the alternate air passageway, from our simplified model. The filter will therefore fall short when we wish to synthesize nasal sounds like /m/ and /n/.
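The claim that poles produce the formant peaks can be checked directly: a pole pair near the unit circle creates a resonance in the filter's magnitude response at roughly the pole angle. A small sketch, with an illustrative pole placement rather than a measured formant:

```python
import numpy as np
from scipy.signal import freqz

# Hypothetical pole pair: radius 0.95, angle 0.3*pi radians
r, theta = 0.95, 0.3 * np.pi
poles = [r * np.exp(1j * theta), r * np.exp(-1j * theta)]

# All-pole filter: denominator polynomial with these roots
denom = np.poly(poles).real

# Magnitude response of H(z) = 1 / A(z) on the unit circle
w, h = freqz([1.0], denom, worN=1024)
peak_freq = w[np.argmax(np.abs(h))]  # close to the pole angle 0.3*pi
```

Moving the pole closer to the unit circle sharpens the peak, which is how the LP coefficients encode the formant frequencies and bandwidths of a vowel.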

In general, since most of the information in perceived speech comes from the vowels, simple speech synthesizers have worked by concentrating their efforts on producing accurate vowels.