Introduction/Goal

Manipulating the duration of a signal without altering its frequency content presents a unique problem. The most obvious method of changing duration is simply to move each data point in time so that the signal length is transformed. However, this results in a frequency alteration: the signal will sound lower if made longer and higher if made shorter. This suggests that the frequency content of a signal must be analyzed if it is to be preserved. Thus, the goal of our project is to use frequency-analysis techniques to change specific properties of a signal, specifically its duration. Ideally, this should work on any arbitrary signal, especially speech signals.

 

Overview

In order to accomplish the goal outlined above, we decided to implement the speech analysis/synthesis algorithm introduced in a paper by Robert McAulay and Thomas Quatieri. Briefly, this algorithm stipulates that the original signal be divided into overlapping Hamming windows and a short-time Fourier transform (STFT) be performed on these windows. The main frequency peaks of each window are isolated, and these peaks are then subjected to a frequency-matching algorithm, which establishes which frequencies are linked across windows as well as when specific frequencies are born or die out. With this knowledge, it is possible either to interpolate data between each window in order to lengthen the signal or to remove appropriate data in order to shorten it without appreciably changing its character. A new signal can then be constructed from these interpolations; this new signal should be identical to the original in frequency content but different in length.

 

Algorithm

After the windowed segments were run through the STFT, the frequency peaks in each frame were found, where a peak is defined as the frequency with the most power in a specified delta region around it. The signal is modeled by tracking the changes in the locations and power of these peaks over time.
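The peak-picking step can be sketched as follows (the project's actual code is in Matlab; this Python/NumPy version is purely illustrative, and the function name, the delta width in bins, and the peak count are assumptions):

```python
import numpy as np

def find_peaks(mag, delta_bins=4, n_peaks=10):
    """Pick spectral peaks: a bin is a peak if it holds the most power
    within +/- delta_bins of itself; keep the n_peaks strongest."""
    peaks = []
    for k in range(delta_bins, len(mag) - delta_bins):
        window = mag[k - delta_bins:k + delta_bins + 1]
        if mag[k] > 0 and mag[k] == window.max():
            peaks.append(k)
    peaks.sort(key=lambda k: mag[k], reverse=True)  # strongest first
    return sorted(peaks[:n_peaks])                  # return in bin order
```

Each STFT frame's magnitude spectrum would be passed through a routine like this before matching.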

To record these changes, the frequency, amplitude, and phase of every peak were passed to the function freqmat. Freqmat considers each window (or "frame") of the signal individually. The function loops through the length of the signal, comparing the peaks in the kth frame to the peaks in the k+1th frame. A matrix is created such that every peak in the kth frame has a row and every peak in the k+1th frame has a column. The difference in frequency between every pair of peaks across the two frames is placed into this matrix, so that element (m,n) represents the difference between the mth peak in the kth frame and the nth peak in the k+1th frame. Freqmat then takes the minimum entry in the matrix, representing the best match out of all possible pairings. These two peaks are grouped together and removed from further consideration. These groupings of peaks from frame to frame over the length of a signal are referred to as tracks.

Freqmat considers the possibility of a track dying or being born by defining a delta interval around each peak. Based on previous research, a delta of 0.3 was selected. If there is a peak in the kth frame that cannot be matched to a peak in the k+1th frame within its delta interval, the peak's track is said to have died. Similarly, if there is a peak in the k+1th frame that has no corresponding peak in the kth frame, it becomes the first peak in a new track.
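A minimal sketch of this matching scheme, in Python/NumPy rather than the group's Matlab (the function name and the list-of-pairs return format are assumptions): the difference matrix is built, the smallest entry is repeatedly paired off, and any peak left unmatched within delta dies or is born.

```python
import numpy as np

def match_frames(freqs_k, freqs_k1, delta=0.3):
    """Greedily pair peaks in frame k with peaks in frame k+1.
    Returns (m, n) index pairs; unmatched peaks die or are born."""
    # Element (m, n) is the frequency difference between peak m and peak n.
    diff = np.abs(np.subtract.outer(freqs_k, freqs_k1))
    pairs = []
    while diff.size and diff.min() <= delta:
        m, n = np.unravel_index(np.argmin(diff), diff.shape)
        pairs.append((int(m), int(n)))
        diff[m, :] = np.inf  # remove both peaks from further consideration
        diff[:, n] = np.inf
    return pairs
```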

Once created, the list of tracks is a record of how the signal's frequency peaks change over time. Each track in the kth frame represents the frequencies present in the overlap between the kth and k+1th windows. To recreate the signal, a sample the length of the overlap was synthesized for each frame. These samples were then summed together to recreate the signal in the kth frame, and these sums were combined to form a synthesized version of the original signal.
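As an illustrative Python/NumPy sketch of that summation (the hop size and function names are assumptions, not the group's Matlab code), each frame's synthesized segment is added into the output at its frame offset:

```python
import numpy as np

def overlap_add(segments, hop):
    """Sum per-frame synthesized segments, offset by hop samples each,
    to rebuild the full signal."""
    n = hop * (len(segments) - 1) + len(segments[-1])
    out = np.zeros(n)
    for i, seg in enumerate(segments):
        out[i * hop : i * hop + len(seg)] += seg  # add at frame offset
    return out
```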

The process used by the interpolate function is slightly more complex. Although the frequency and amplitude of the peaks can be assumed to change linearly between frames, phase is best modeled as a cubic. Phase must also be "unwrapped"; that is, the fact that a phase of pi equals a phase of -pi must be taken into consideration. The function unwrap takes the track information, unwraps the phases, and then generates the coefficients for the cubic. All the collected information is then fed into interpolate, which generates a sample signal for each frame.
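The cubic fit can be sketched as follows (illustrative Python/NumPy; the function name and this particular choice of the unwrapping integer M are assumptions, derived simply from requiring phase and frequency to match at both frame boundaries):

```python
import numpy as np

def cubic_phase_coeffs(theta1, omega1, theta2, omega2, T):
    """Fit theta(t) = theta1 + omega1*t + a*t**2 + b*t**3 on [0, T] so
    that phase and frequency match both frame boundaries.  M unwraps
    the end phase to the nearest consistent multiple of 2*pi."""
    M = round(((theta1 + omega1 * T - theta2) + (omega2 - omega1) * T / 2)
              / (2 * np.pi))
    # Boundary conditions: theta(T) = theta2 + 2*pi*M, theta'(T) = omega2.
    rhs = np.array([theta2 + 2 * np.pi * M - theta1 - omega1 * T,
                    omega2 - omega1])
    A = np.array([[T**2, T**3],
                  [2 * T, 3 * T**2]])
    a, b = np.linalg.solve(A, rhs)
    return a, b
```

When the frequency is constant across the frame, the quadratic and cubic terms vanish and the phase is simply linear.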

When synthesizing the signal at its original length, the samples generated per frame by interpolate are the same length as the overlap of the Hamming windows. However, it is a simple matter to increase or decrease the length of each generated sample, thereby lengthening or compressing the signal while preserving its frequency content. Simply put, if a track in the original signal showed a peak moving from 2000 Hz to 5000 Hz over the course of 10 samples, interpolate could synthesize a signal with a frequency peak moving from 2000 Hz to 5000 Hz over the course of 20 samples.
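A simplified sketch of that idea (Python/NumPy, ignoring the cubic phase model and using plain linear ramps; the names and sample rate are assumptions):

```python
import numpy as np

def synth_track_segment(amp1, amp2, f1, f2, n_out, fs=8000.0):
    """Synthesize one track over n_out samples: amplitude and frequency
    ramp linearly from their frame-k to frame-k+1 values, so stretching
    the signal is just a matter of choosing a larger n_out."""
    amp = np.linspace(amp1, amp2, n_out)
    freq = np.linspace(f1, f2, n_out)
    phase = 2 * np.pi * np.cumsum(freq) / fs  # integrate instantaneous frequency
    return amp * np.cos(phase)
```

Synthesizing the same 2000 Hz to 5000 Hz sweep over 20 samples instead of 10 doubles its duration without changing the frequencies it passes through.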

 

Future Improvements to Code

stft.m: Because the FFT command does not return any information about the basis vectors it uses, this function must be able to take the numerical index of a peak within a vector and convert it to a frequency. The original code made several mistakes in the conversion, making it impossible to regenerate the input signal. These mistakes were corrected in the group's code.
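The conversion the fix amounts to can be written in one line (illustrative; note that Matlab's 1-based indexing means its version needs k-1 where a 0-based language uses k):

```python
def bin_to_freq(k, n_fft, fs):
    """Convert a 0-based FFT bin index to its frequency in Hz."""
    return k * fs / n_fft
```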

freqmat.m: The code was originally written with a series of nested for loops. It also ran incorrectly, because the nth peak in the kth frame was compared to only the n-1th, nth, and n+1th peaks in the k+1th frame. The group rewrote the code to base it on vector operations, which significantly reduced runtime. More importantly, each peak in the kth frame was compared to every peak in the k+1th frame, thereby ensuring that the best possible matches were found.

interpolate.m: This function originally computed the time indices incorrectly when synthesizing the new signal. Because of the way the coefficients were passed to the next function, this essentially resulted in all frequency and phase information being lost, preserving only amplitude. Because the amplitude of the signal peaks was the only information retained, doubling the length of the signal cut the frequency in half, and signal compression was impossible. The group corrected the time indices and a number of other computational errors.

 

Wavelet Approach

Fourier analysis is not optimal for all types of signals, especially non-stationary signals. There are many ways to analyze a signal other than the sinusoidal analysis that the Fourier transform offers. One of these alternatives is the wavelet transform, which is most beneficial for localized analysis of a signal or when edge detection is critical.

In the wavelet approach, the original signal is converted to the length-16 Haar wavelet basis instead of the sinusoidal Fourier basis set (makewavelet.m). In the Haar basis set, the basis functions are defined by two indices, scale (j) and position (k). The scale index represents the size of the edge, while the position index represents the location of the edge in time. In the wavelet method of this project, wavelet coefficients are grouped together by the scale index, while the position index is used as an analogue to phase (wavepeaks.m). For each scale j, a signal is formed with the k values serving as the analogue of time. When this grouping is done, the edges in the bases correspond to frequency.
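A sketch of one common way to build such a basis (illustrative Python/NumPy; the report's makewavelet.m may use a different ordering or normalization):

```python
import numpy as np

def haar_basis(n=16):
    """Build the n Haar basis vectors of length n (n a power of two).
    Row 0 is the constant (scaling) function; the rest are wavelets
    indexed by scale j and position k, each an edge of width n // 2**j."""
    basis = [np.ones(n) / np.sqrt(n)]
    j = 0
    while 2 ** j < n:
        width = n // 2 ** j        # support of a scale-j wavelet
        for k in range(2 ** j):    # position index along the signal
            v = np.zeros(n)
            v[k * width : k * width + width // 2] = 1.0
            v[k * width + width // 2 : (k + 1) * width] = -1.0
            basis.append(v / np.linalg.norm(v))
        j += 1
    return np.array(basis)
```

Because the rows are orthonormal, analysis is coeffs = B @ x and the inverse transform is simply x = B.T @ coeffs.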

The frequencies at which the edges occur are determined by taking the Fast Fourier Transform (FFT) of the wavelet coefficient signals. Because there are only 16 wavelet basis functions in the chosen basis, the maximum possible length of a wavelet coefficient signal is eight. After the FFT is taken, the eight frequency coefficients are retained. For those signals with fewer than eight Fourier coefficients, the signal is zero-padded in the frequency domain before interpolation is done (peaks2.m).
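Zero-padding in the frequency domain amounts to band-limited interpolation; a sketch (illustrative Python/NumPy, with the scaling convention chosen so the original sample values are preserved):

```python
import numpy as np

def freq_zero_pad(x, n_out):
    """Interpolate a real signal to n_out samples by appending zeros to
    the high end of its spectrum and inverse-transforming."""
    X = np.fft.rfft(x)
    X_padded = np.zeros(n_out // 2 + 1, dtype=complex)
    X_padded[:len(X)] = X
    # Rescale so the original sample values are preserved.
    return np.fft.irfft(X_padded, n_out) * (n_out / len(x))
```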

Afterwards, the McAulay-Quatieri algorithm is used to interpolate the data between the existing samples (waveinterp.m). Once the magnitudes of the frequencies of the wave coefficients are interpolated linearly and the phases cubically, the data is converted back to the wavelet domain and the wave coefficients are ungrouped. When this is completed, twice as many wavelet coefficient samples exist, and the inverse wavelet transform is taken to synthesize the extended signal (wavesnd.m).

 

Results:

Because editing and fixing the code found in the DSP project took longer than anticipated (none of the code worked, so all of it had to be rewritten), the wavelet code that was written is not fully debugged and does not run properly. However, because this project's emphasis is on speech signal manipulation, the wavelet approach would not have been nearly as successful as the Fourier approach. Speech signals are represented more naturally as a sum of sinusoids than as a series of edges. For other signals, such as a series of heartbeats, the wavelet approach would probably be more successful than the Fourier approach because the signal is better represented as a series of edges.

 

Improvements for the future:

There are many possibilities for improvements in the future. Many improvements can be made both to the Fourier and Wavelet algorithms. The speed of the algorithms could be increased and the amount of distortion in the signal could be decreased. Also, additional attempts to debug the wavelet code could be made. Additionally, a DSP board could be used rather than Matlab to test the signal manipulation.

Additionally, testing of a greater variety of signals could be done. The limiting factor that occurred during this experimentation was the speed of Matlab. If the code could be improved or if the experimentation could be performed on a DSP board, then speech signals and signals that have more localized data and emphasize edges could be tested.

Finally, further testing beyond signal elongation could expand this project. The code presented in this project can compress a signal as well; however, signal compression testing was not done. Additionally, more subtle manipulations of the signal are possible. Timbre modification of an acoustic signature could be accomplished by altering or removing specific frequency tracks of a signal.

 

Conclusion/Results

After a great deal of difficulty in getting the Matlab code to work, we eventually began getting some desired results. By using the McAulay-Quatieri algorithm, we are now able to, in general, extend a signal in length while retaining enough frequency content to make the signal intelligible when listened to. There is some distortion, since we are only analyzing a few peak frequencies rather than the entire frequency content; however, this is not very significant, and could be fixed if we decided to trade computing time for more analyzed peaks.

There are a few specific instances, however, when this algorithm fails. The underlying assumption of dividing the signal into windows is that each window will be short enough for the signal within the window to be a close approximation of a stationary wave. A signal that is rapidly changing in frequency renders this assumption invalid. The problem can be fixed by taking smaller windows, but this will lead to more computation time. Additionally, a signal with a very high frequency will have aliasing errors, due to the nature of Matlab sampling.

Similarly, a signal of very low frequency will create problems. In this case, the window will actually be too short; the frequency of the signal within the window will not be accurately found since the signal will not go through a full period within the length of the window. The solution to this is to lengthen the window, but this may cause problems if the signal later begins changing frequency rapidly, as described above.

Another assumption made by this algorithm is that the signal has only a few key frequency components. If the signal contains a wide range of nearly equal frequencies, the peak-selecting function will probably remove some strong frequency peaks that need to be included in order to adequately reconstruct a lengthened version of the original signal. This could be remedied by taking more peaks for analysis, but this will again result in more computation time.

In general, this algorithm works very well. In a large number of instances, the original signal can be lengthened without noticeable deterioration. The extreme cases that cause problems can usually be fixed by alterations that increase computing time; in effect, we can normally trade longer computation for greater accuracy. The problem of low frequencies combined with rapidly changing frequencies could perhaps be fixed by an adaptively changing window size, although this would be very difficult to implement. Despite these particular cases, however, the algorithm as implemented is sufficient for increasing the length of most signals while retaining their frequency content.