Speaker Verification: Dynamic Time Warping

		Dynamic Time Warping
		What happens when people vary their rate of speech during a phrase? How can a speaker verification system with a password of "Project" accept the user when he says "Prrroooject"? Obviously, a simple linear squeezing of this longer password will not match the key signal because the user slowed down the first syllable while he kept a normal speed for the "ject" syllable. We need a way to non-linearly time-scale the input signal to the key signal so that we can line up appropriate sections of the signals (i.e. so we can compare "Prrrooo" to "Pro" and "ject" to "ject"). The solution to this problem is to use a technique known as "Dynamic Time Warping" (DTW). This procedure computes a non-linear mapping of one signal onto another by minimizing the distances between the two. In order to get an idea of how to minimize the distances between two signals, let's go ahead and define two: K(n), n = 1,2,...,N, and I(m), m = 1,2,...,M. We can develop a local distance matrix (LDM) which contains the differences between each point of one signal and all the points of the other signal. Basically, if n = 1,2,...,N defines the columns of the matrix and m = 1,2,...,M defines the rows, the values in the local distance matrix LDM(m,n) would be the absolute value of (I(m) - K(n)). The purpose of DTW is to produce a warping function that minimizes the total distance between the respective points of the signals. We now introduce the concept of an accumulated distance matrix (ADM). The ADM contains the respective value in the local distance matrix plus the smallest neighboring accumulated distance. We can use this matrix to develop a mapping path which travels through the cells with the smallest accumulated distances, thereby minimizing the total distance difference between the two signals. There are some constraints that we would want to place on such a mapping path. We would not want this path to wander too far from a linear mapping (we don't want to stretch or squeeze the signal too much). We should also say that the path can only map points that haven't been covered yet. For example, we can't map point 6, point 5, and then point 6 again. And for boundary considerations we would want the endpoints of the signals to map to each other (I(1) maps to K(1) and I(M) maps to K(N)). With these constraints in mind we can develop this concept of an accumulated distance matrix. Above we said that ADM values will be the LDM values at that location plus the smallest neighboring accumulated distances. To keep from backtracking over points that have already been covered, we use the Itakura method to define what these "neighboring" points are. The Itakura method says that neighboring points of the point at (m,n) are (m,n-1), (m-1,n-1), and (m-2,n-1). Defining ADM(1,1) to be LDM(1,1), we can write the equation for the other cells in the accumulated distance matrix: ADM(m,n) = LDM(m,n) + min{ADM(m,n-1), ADM(m-1,n-1), ADM(m-2,n-1)} With this matrix developed, we can create the path by starting at the value in ADM(M,N) and working our way backwards to ADM(1,1), travelling through the cells with the lowest accumulated distances subject to the Itakura constraints. In order to stay near a linear mapping, another Itakura constraint says that the path can't stay in the same row for two consecutive cell movements (i.e. we can't go from (m,n) to (m,n-1) and then to (m,n-2)). Our group added the additional constraint that the path cannot choose to skip two rows twice consecutively. We noticed that this technique slightly improved the quality of the time warping. Once the path has been traced out, the signals can be mapped onto each other in the following way: I onto K: warping_path(i = 1..N) = ADM row number that intersects path at ADM column i time_warped_I = I(warping_path) K onto I: warping_path(i = 1..M) = ADM column number that intersects path at ADM row i time_warped_K = K(warping_path) This epic brought to you by dtw4.m and the letter M. << 4 \|\| 6 >> CONFORMITY - "When people are free to do as they please, they usually imitate each other."
© 1999 Sara MacAlpine JP Slavinsky Nipul Bharani Aamir Virani