Our work only begins the journey toward the ideal speech synthesizer. There is room for improvement and refinement in every step we took; given more time, we would have pursued several such refinements.
First, we could have incorporated more linguistic features into our analysis of the reference speech signals. Examining characteristics such as stress, intonation, segment duration, and coarticulation, and devising ways to transfer them to the synthesis end, would yield more convincing speech.
Dynamic generation of LPC coefficients could help smooth the synthesized speech. An abrupt change in coefficients is audible, so computing a new set of coefficients every few milliseconds would remove the obvious breaks and give us finer temporal resolution.
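One way to realize this smoothing is to interpolate between the coefficient sets of adjacent analysis frames. The sketch below shows the idea with linear interpolation; the coefficient values and sub-frame count are purely illustrative, and a production system would typically interpolate in a transformed domain (such as line spectral frequencies) to guarantee a stable filter at every intermediate point.

```python
import numpy as np

def interpolate_lpc(a_prev, a_next, n_subframes):
    """Linearly interpolate between two LPC coefficient sets.

    Returns one coefficient vector per sub-frame, ramping from
    a_prev to a_next. Direct interpolation of predictor
    coefficients can produce unstable intermediate filters; this
    is a sketch of the concept, not a production method.
    """
    a_prev = np.asarray(a_prev, dtype=float)
    a_next = np.asarray(a_next, dtype=float)
    steps = np.linspace(0.0, 1.0, n_subframes)
    return [(1.0 - t) * a_prev + t * a_next for t in steps]

# Hypothetical predictor polynomials for two adjacent frames.
frame_a = [1.0, -1.2, 0.5]
frame_b = [1.0, -0.8, 0.3]
subframe_coeffs = interpolate_lpc(frame_a, frame_b, 4)
```

Each sub-frame of excitation would then be filtered with its own interpolated coefficient set, so no single abrupt switch is ever audible.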
Taking the model further, we could have retained the zeros in the original linear prediction filter. We removed them for simplicity's sake, discarding the effect of the nasal cavity along with them. Restoring them would allow the synthesis of nasal sounds such as /n/ and /m/.
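The distinction can be illustrated with a pole-zero (ARMA) synthesis filter H(z) = B(z)/A(z) versus the all-pole case, where B(z) reduces to a constant gain. The polynomials below are hypothetical, chosen only to show that the numerator contributes the zeros (anti-resonances) that nasal coupling introduces; this is not the filter from our implementation.

```python
import numpy as np
from scipy.signal import lfilter

# Hypothetical voiced excitation: an impulse train at a 100 Hz pitch
# for an assumed 8 kHz sampling rate.
excitation = np.zeros(400)
excitation[::80] = 1.0

# Illustrative predictor (denominator) polynomial A(z).
a = [1.0, -1.3, 0.7]

# All-pole synthesis: H(z) = 1 / A(z). The numerator is a constant,
# so the spectrum has resonances (poles) but no anti-resonances.
all_pole = lfilter([1.0], a, excitation)

# Pole-zero synthesis: H(z) = B(z) / A(z). The numerator B(z)
# contributes a zero, modeling the spectral notch the nasal cavity
# adds for sounds like /n/ and /m/.
b = [1.0, -0.9]
pole_zero = lfilter(b, a, excitation)
```

The two outputs share the same resonant structure but differ wherever the zero carves a notch into the spectrum.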
Finally, we could incorporate this simple model into the grand scheme of a text-to-speech system. Such a system would take a word of text and synthesize the corresponding speech, drawing on its knowledge of word structure and phoneme characteristics.
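The front end of such a system might be sketched as follows. The lexicon, phoneme inventory, and per-phoneme parameters here are all hypothetical placeholders; a real system would need letter-to-sound rules for unknown words and far richer per-phoneme data.

```python
# Hypothetical word-to-phoneme lexicon.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "man": ["m", "ae", "n"],
}

# Hypothetical per-phoneme parameters: (LPC coefficients,
# duration in frames, voiced flag). Values are placeholders.
PHONE_PARAMS = {
    "k":  ([1.0, -0.5, 0.2], 3, False),
    "ae": ([1.0, -1.2, 0.6], 8, True),
    "t":  ([1.0, -0.4, 0.1], 3, False),
    "m":  ([1.0, -1.0, 0.4], 5, True),
    "n":  ([1.0, -1.1, 0.5], 5, True),
}

def text_to_parameters(word):
    """Map a text word to a frame-by-frame LPC parameter track."""
    track = []
    for phone in LEXICON[word]:
        coeffs, n_frames, voiced = PHONE_PARAMS[phone]
        track.extend([(coeffs, voiced)] * n_frames)
    return track

track = text_to_parameters("cat")
```

The resulting frame track would then drive the LPC synthesis filter, one coefficient set per frame.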