The speech recognition problem can be described as a function that maps the acoustic evidence to a single word or a sequence of words.
Let X = (x1, x2, x3, …, xt) represent the acoustic evidence generated over time (indicated by the index t) from a given speech signal, where X belongs to the complete set of acoustic sequences, XX. Let W = (w1, w2, w3, …, wn) denote a sequence of n words, each belonging to a fixed and known set of possible words, WW. Two frameworks are used to describe the speech recognition function:
Template Framework
Recognition is performed by finding the sequence of words W that minimizes a distance function between the acoustic evidence X and a sequence of word reference patterns (templates).
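The template idea can be sketched for the simplest case, a single isolated word. The notes do not say which distance function is used; dynamic time warping (DTW) is assumed here because it tolerates differences in speaking rate. The names `dtw_distance`, `recognize`, and `TEMPLATES`, and the toy one-value-per-frame features, are all hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[len(a)][len(b)]


def recognize(x, templates):
    """Return the word whose reference template is closest to x."""
    return min(templates, key=lambda word: dtw_distance(x, templates[word]))


# Hypothetical reference patterns: one scalar feature per frame.
TEMPLATES = {
    "yes": [1.0, 3.0, 5.0, 3.0, 1.0],
    "no": [5.0, 4.0, 2.0, 1.0, 1.0],
}

# A slower utterance of "yes": same shape, stretched in time.
utterance = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 3.0, 1.0]
best = recognize(utterance, TEMPLATES)
```

Recognizing a sequence of words rather than one word would additionally require searching over concatenations of templates, which this sketch leaves out.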
Statistical Framework
The statistical framework has dominated the development of speech recognition systems since the 1980s.
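In its usual form, the statistical framework picks the word sequence with the highest posterior probability given the acoustic evidence. A sketch of that decision rule in standard notation, obtained with Bayes' rule (the symbol \hat{W} for the recognized sequence is an assumption, and the notation used elsewhere may differ):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

The last step holds because P(X) does not depend on W, so it does not affect which W attains the maximum.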
The above equation establishes the components of a speech recognizer:
The likelihood P(X | W) is determined by a set of acoustic models.
The statistical framework for speech recognition raises four problems that must be addressed.
The acoustic processing problem.
Extract features from the speech signal that are low-dimensional, discriminative, and robust (feature extraction).
The acoustic modeling problem.
Decide how P(X | W) should be computed. The acoustic models are usually estimated using HMMs.
The language modeling problem.
Decide how to compute the prior probability P(W) for a sequence of words, for example with an N-gram model.
The search problem.
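The language modeling and search problems above can be sketched together on a toy scale. The example assumes a bigram model (N = 2) with add-one smoothing and an exhaustive search over a handful of candidates; real recognizers search a far larger space with dynamic programming. The corpus, the candidate list, and the acoustic scores are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training text for the language model.
CORPUS = [
    "recognize speech",
    "recognize speech today",
    "wreck a nice beach",
]


def train_bigram(sentences):
    """Count history tokens and bigrams, with <s> and </s> as boundaries."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {cur for (_, cur) in bigrams}      # tokens that can be predicted
    return histories, bigrams, len(vocab)


def log_prior(words, histories, bigrams, vocab_size):
    """log P(W) under the bigram model with add-one smoothing."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
    return total


def decode(acoustic_scores, histories, bigrams, vocab_size):
    """Exhaustive search: argmax over W of log P(X | W) + log P(W)."""
    def score(words):
        return acoustic_scores[words] + log_prior(words, histories, bigrams, vocab_size)
    return max(acoustic_scores, key=score)


histories, bigrams, vocab_size = train_bigram(CORPUS)

# Hypothetical acoustic log-likelihoods log P(X | W) for two candidates.
candidates = {
    ("recognize", "speech"): -10.0,
    ("wreck", "a", "nice", "beach"): -9.5,
}
acoustic_only = max(candidates, key=candidates.get)
best = decode(candidates, histories, bigrams, vocab_size)
```

Here the acoustic score alone slightly favors the longer candidate, while adding the prior P(W) changes the decision, which is the role the language model plays in the decision rule.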
Due to physical limitations on how fast the articulators can move, a sufficiently short segment of speech can be considered equivalent to a stationary process.
In practical terms, a sliding window (with a fixed length and shape) is used to isolate each segment from the speech signal. Typically, the segments are between 20 ms and 30 ms long and overlap by 10 ms.
This approach is commonly referred to as short-time analysis.
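A minimal sketch of this framing step, assuming 25 ms frames with 10 ms of overlap and a Hamming window. The notes only say the window has a fixed length and shape, so the window type, the function name `frame_signal`, and its parameters are assumptions.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, overlap_ms=10.0):
    """Split samples into overlapping, Hamming-windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    overlap = int(round(sample_rate * overlap_ms / 1000.0))
    hop = frame_len - overlap  # samples between successive frame starts
    if hop <= 0:
        raise ValueError("overlap must be shorter than the frame")
    # Hamming window (assumed shape); tapers the frame edges toward 0.08.
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        segment = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


# One second of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
```

At 8 kHz, a 25 ms frame is 200 samples and a 10 ms overlap is 80 samples, so successive frames start 120 samples apart.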
The only assumption is that the signal is stationary.
Two methods are commonly used:
Filter banks
Filter banks estimate the frequency content of a signal using a bank of bandpass filters whose coverage spans the frequency range of interest in the signal (e.g., 100-3000 Hz for telephone speech signals).
The most common technique for implementing a filter bank is the short-time Fourier transform (STFT).
The discrete STFT is estimated using the following equation:
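A common textbook form of the discrete STFT is sketched here. The symbols are assumptions: x(m) is the speech signal, w is the analysis window positioned at time index n, k is the frequency bin, and N is the number of frequency bins.

```latex
X(n, k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j 2\pi k m / N}
```

Because w is nonzero only over one short segment, the sum is finite in practice: each X(n, k) is the k-th DFT coefficient of the windowed segment around time n.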
In speech applications, the fast Fourier transform is used to efficiently compute X(n,k).
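A minimal pure-Python sketch of computing the STFT with an FFT, one spectrum per windowed frame. The radix-2 recursion, the Hamming window, the frame length of 256, and the hop of 128 are assumptions chosen for illustration; a real system would call an optimized FFT library.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def stft(samples, frame_len=256, hop=128):
    """Return one spectrum (list of complex bins) per windowed frame.

    Phase is measured relative to each frame's start; magnitudes are
    the same as with an absolute time reference.
    """
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectra.append(fft(frame))
    return spectra


# Test tone at 1000 Hz, sampled at 8 kHz: falls exactly on bin 32 of 256.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(1024)]
spectra = stft(tone)
magnitudes = [abs(c) for c in spectra[0][:129]]  # bins 0 .. N/2
peak_bin = magnitudes.index(max(magnitudes))
peak_hz = peak_bin * sample_rate / 256
```

Each bin k corresponds to the frequency k * sample_rate / frame_len, so the bins play the role of the bandpass channels in the filter bank.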
Wavelet transform
Wavelets were introduced to allow signal analysis with different levels of resolution.
Unlike the STFT, the width of the wavelet function changes with each spectral component, so that at high frequencies it gives good time resolution and poor frequency resolution, whereas at low frequencies it gives good frequency resolution and poor time resolution.
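The resolution trade-off can be seen in a small multi-level decomposition. This sketch assumes the Haar wavelet, the simplest choice; the notes do not name a particular wavelet, and the function names are hypothetical.

```python
import math


def haar_step(signal):
    """One Haar DWT level: returns (approximation, detail), each half length."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level Haar DWT: returns (final approximation, details per level)."""
    if len(signal) % (2 ** levels):
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        # Each level splits the current low-pass band in two again.
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Flat signal with one brief spike at sample 5.
signal = [4.0, 4.0, 4.0, 4.0, 4.0, 12.0, 4.0, 4.0]
approx, details = haar_dwt(signal, levels=3)
sizes = [len(d) for d in details]  # coefficients per level, finest first
```

The finest level (highest frequencies) keeps four coefficients, each covering two samples, so the spike is localized precisely in time. The coarsest level keeps a single coefficient covering the whole signal, trading time resolution for a narrower frequency band.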
The discrete wavelet transform is estimated using the following equation:
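One common dyadic form of the discrete wavelet transform is sketched here. The symbols are assumptions: x(n) is the signal, ψ is the mother wavelet, j is the scale index, and k is the translation index. Other references use different sign or normalization conventions.

```latex
X(j, k) = 2^{-j/2} \sum_{n} x(n)\, \psi\!\left(2^{-j} n - k\right)
```

Larger j stretches the wavelet in time, which corresponds to analyzing lower frequencies with coarser time resolution.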