Speech transcription is a large application area in which the lengthy computation time required to slow down the signal would not be an obstacle. Similarly, both music lovers and musicians would benefit from the ability to slow down music. Musicians might use this function to slow down a recording of themselves in order to improve their performance, or to transcribe an improvisational jazz solo.
Answering machine messages constrain the listener to taking in information at the speed at which the caller recorded the message, which is often too fast to both understand and write down. Being able to slow down a phone number or address would be a perfect application: very useful and commercially viable.
The study of foreign languages could be made easier by slowing down the speech of native speakers so that students can better distinguish the differences between that language and their native tongue.
Also useful is the ability to increase the speed of speech in commercials, for both television and radio. Because air time is so expensive, being able to communicate more information about a product in a given amount of time would be very valuable to advertising agencies and their clients.
From this point the phases and frequencies are fed into a "black box" that performs frame-to-frame phase unwrapping and interpolation. After interpolation, the data are fed to a sine wave generator. In addition, linear interpolation is performed between the amplitudes of each pair of matched frames. Finally, the values are summed across all frames to produce the synthesized speech.
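For orientation, here is a minimal sketch of how these stages chain together. The function names are the files described in this report, but the argument lists are our assumptions for illustration; the actual signatures may differ.

    % Hypothetical top-level driver; argument lists are illustrative assumptions.
    [freqs, mags, phases] = stft(x, winlen, hop);     % analysis: spectral peaks per frame
    tracks = freqmat2(freqs, mags, phases, delta);    % frame-to-frame frequency matching
    [M, alpha, beta] = unwarp(tracks, hop);           % phase-unwrapping smoothing constants
    [phi, amp] = inter(tracks, M, alpha, beta, hop);  % interpolated phase and amplitude
    y = sumy(phi, amp);                               % amplitude-weighted sum of cosines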
The first part of the code takes the STFT of an audio signal; the file stft.m performs this function. To take the STFT of an input audio signal, the data is windowed at overlapping intervals, where the overlap is measured by the distance between the peaks of consecutive windows. For each windowed segment of data, the DFT is computed, and the peaks of the DFT are picked out using peaks.m. The data returned for each window constitutes one frame, and includes the frequencies, magnitudes, and phases for the peaks of that window. The distance (in samples) between frames is the same as the distance between windows. This distance can be as small as one sample, but ideally it is many samples. Interpolating between frames is how we slow down or speed up speech.
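The analysis stage can be sketched as follows. This is illustrative rather than the actual stft.m: the name stft_sketch is ours, findpeaks (from the Signal Processing Toolbox) stands in for peaks.m, and x is assumed to be a column vector.

    % Minimal analysis sketch: window, DFT, pick peaks, store one frame per column.
    % hop is the distance in samples between consecutive window peaks.
    function [freqs, mags, phases] = stft_sketch(x, winlen, hop)
        w = hamming(winlen);                          % windowing function
        nframes = floor((length(x) - winlen)/hop) + 1;
        freqs = []; mags = []; phases = [];           % one column per frame
        for k = 1:nframes
            seg = x((k-1)*hop + (1:winlen)) .* w;     % windowed segment
            X = fft(seg);
            X = X(1:floor(winlen/2));                 % positive frequencies only
            [pks, locs] = findpeaks(abs(X));          % spectral peak picking
            if isempty(locs), continue; end           % no peaks: column stays zero
            freqs(1:length(locs), k) = 2*pi*(locs - 1)/winlen;  % rad/sample
            mags(1:length(locs), k)  = pks;
            phases(1:length(locs), k) = angle(X(locs));
        end
    end

Unused entries in each column remain zero, which is consistent with the zero-insertion convention used for dead tracks below.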
The next piece of code performs the frequency matching between frames; the function freqmat2.m does this job. To match frequency peaks picked from one frame to peaks in the next frame, we implemented the method presented in the McAulay and Quatieri paper. The idea is to choose a matching interval that picks out frequencies in the next frame that are possible matches to the frequency of interest in the current frame. In our case we iterated from the lowest to the highest frequency in a frame, with frames given as columns in the matrix returned by stft.m. We settled on an interval size of between .06 and .1 radians (i.e., delta = .03 or .05).
After picking the possible matches from the next frame and tentatively taking the nearest one, the next frequency in the current frame is examined to see whether any of its possible matches coincide with the match the previous frequency found. If so, we check which of the two frequencies is closer to the contested candidate. If the current frequency is closer, the match stands. If the next frequency in the current frame is closer, the current frequency must take its second choice; if no second choice is available, that frequency track is said to have died and a zero is inserted. We then repeat the process for the next frequency. Note that the next frequency has already been evaluated for possible matches, so we need not do this again, which saves a good deal of computation time.
If, after going through an entire frame, there are frequencies in the next frame that have not been matched to any frequency in the current frame, they start a new track and are termed born. These new tracks are simply placed in an open area of the matrix at the corresponding frame (or row index). In this way we were able to store the data in an efficient manner that allowed for easy extraction of frequency tracks. A plot of these tracks can be obtained using trackplot.m, which simply searches for tracks and plots them as lines. Figure 3 shows a sample output from this function.
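A simplified sketch of the matching rule for a single pair of frames follows; the helper name matchframes and its signature are our own, and the real freqmat2.m also handles the matrix bookkeeping described above.

    % cur, nxt: sorted peak frequencies (rad/sample) of the current/next frame.
    % match(i) is the index in nxt matched to cur(i), or 0 if the track dies.
    function match = matchframes(cur, nxt, delta)
        match = zeros(size(cur));
        taken = false(size(nxt));
        for i = 1:length(cur)
            cand = find(~taken & abs(nxt - cur(i)) <= delta);  % candidates in window
            if isempty(cand), continue; end                    % no match: track dies
            [~, j] = min(abs(nxt(cand) - cur(i)));             % nearest candidate
            best = cand(j);
            % Yield if the next current-frame frequency is closer to this candidate.
            if i < length(cur) && abs(nxt(best) - cur(i+1)) < abs(nxt(best) - cur(i))
                cand(j) = [];                                  % fall back to second choice
                if isempty(cand), continue; end                % none left: track dies
                [~, j] = min(abs(nxt(cand) - cur(i)));
                best = cand(j);
            end
            match(i) = best;
            taken(best) = true;
        end
    end

Entries of nxt with taken still false afterwards are exactly the frequencies that are born as new tracks.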
The interpolation code has two parts, unwarp.m and inter.m. The idea behind phase unwrapping and interpolation is that the function connecting the information taken in each frame should be as smooth as possible. The 2-pi periodicity of phase makes the problem more difficult, but using McAulay and Quatieri's calculations, the computation was fairly straightforward.
Unwarp uses the equations given in McAulay and Quatieri to calculate the smoothing constants M, alpha, and beta for each frame. Then the function inter.m calculates the interpolating functions for the phase and the amplitude. The phase interpolant depends on these constants, on the current and next-frame phases and frequencies, and on the distance in samples between the frames. The amplitude is a straightforward linear interpolation between the current and next-frame amplitudes.
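Concretely, for one matched peak the McAulay and Quatieri equations reduce to the following. The variable names here are illustrative: th0, w0, and A0 are the phase, frequency (rad/sample), and amplitude at the current frame; th1, w1, and A1 the same at the next frame; and T is the distance in samples between frames.

    % Maximally smooth unwrapping constant M (nearest integer).
    M = round(((th0 + w0*T - th1) + (w1 - w0)*T/2) / (2*pi));
    % Cubic coefficients alpha and beta from the boundary conditions
    % phase(T) = th1 + 2*pi*M and phase'(T) = w1.
    b = th1 - th0 - w0*T + 2*pi*M;
    alpha = (3/T^2)*b - (1/T)*(w1 - w0);
    beta  = (-2/T^3)*b + (1/T^2)*(w1 - w0);
    % Interpolated phase (cubic) and amplitude (linear) across the frame.
    t = (0:T-1)';
    phase = th0 + w0*t + alpha*t.^2 + beta*t.^3;
    amp   = A0 + (A1 - A0)*t/T;

One can verify that phase(T) = th1 + 2*pi*M and that its derivative at T equals w1, so both the phase and the frequency connect smoothly to the next frame.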
The final signal output by the system is calculated in sumy.m. It is simply the sum of the cosines of the interpolated phases for each frame, weighted by their amplitudes. In this way, we did not have to explicitly call a function to calculate the inverse Fourier transform.
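In code, assuming phi and amp are matrices with one row per output sample and one column per track (our names, not necessarily those in sumy.m), this sum is a one-liner:

    % Each track contributes an amplitude-weighted cosine of its phase;
    % summing across tracks (columns) yields the synthesized signal.
    y = sum(amp .* cos(phi), 2);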
The variables we altered were the window size, the type of window, the spacing between window peaks, the interval over which we searched for a peak, and the delta that determined how selective the frequency matching was. Our best results were with a window size of 255, a distance between window peaks of 1, a peak-searching interval of 5, a Hamming window, and a frequency-matching delta of .03. With these parameters the character of the person speaking was recognizable, as was the general structure of the word, but a control listener who did not know the test word beforehand would likely have been unable to identify it. Thus, our largest problem was distortion.
It was also a bit disappointing that we had to set the distance between windows to 1 in order to get intelligible synthesized speech from our model. Since speeding up and slowing down speech is done by interpolating between frames (i.e., adding or removing samples), setting this value to one means we clearly cannot speed up speech, since we cannot remove any more samples without losing information.
Choosing a window size of 255 gives a broad spectrum and helps select accurate frequencies to characterize the signal. However, such a large window means we do not weight the first and last samples of the signal equally. A fix would be to zero-pad the signal so that all of the samples are given equal weight by the windowing mechanism.
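A minimal sketch of that fix, assuming a column-vector signal x and window length winlen:

    % Zero-pad by half a window on each side so the first and last samples
    % fall under a full complement of overlapping windows.
    pad = floor(winlen/2);
    xp = [zeros(pad, 1); x(:); zeros(pad, 1)];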
Figure 4 shows an example of synthesized speech. You can clearly see from the figure that the waveform itself is an excellent match. However, due to some distortion, higher frequencies are lost (i.e., the peaks are not as sharp).
Improving computation time is a non-trivial problem. The system can be seen as a series of algorithms, and due to the nature of the data and the processing required, the algorithms are all on the order of N^2. Without creating completely new algorithms, it seems theoretically impossible to lower their order, so improvements in computation time must come from reducing how much data we process at any one stage. Processing less data (i.e., less information) could make distortion worse, or distortion could stay the same or even improve if other changes were made. One such change might be to use a smaller delta to enforce a tighter match within a track while using fewer tracks to reduce the computational load.
Using a DSP chip would reduce the problem of computation time significantly. The algorithms, bulky though they are in Matlab, would shrink dramatically in length, and the process could approach real time for shorter signals.
Overall, we were fairly pleased with our project. Although we produced no spectacular results, we feel we learned a lot about speech processing and now have a full appreciation for the difficulty of the task.