TERM: MFCC (Mel-Frequency Cepstral Coefficients)
In any automatic speech recognition system, the first step is usually feature extraction: identifying the components of the audio signal that are useful for recognising the linguistic content, while discarding everything irrelevant, such as background noise and emotion.
The first thing to understand about speech is that the sound a person produces is filtered by the shape of the vocal tract, including the tongue, teeth and so on. This shape determines what sound comes out. If we can determine the shape accurately, we can accurately identify the phoneme being produced.
A phoneme is, by definition, a unit that distinguishes meaning: if two sounds represent the same word with the same meaning, the variant sounds can be treated as one phoneme; conversely, if any phoneme in a word is replaced by another, it is no longer the same word and its meaning changes. Every meaningful word can be built from phonemes, yet substituting phonemes does not guarantee a meaningful word; the result may be a meaningless string of sounds. Each language has its own set of phonemes, which form its phonological system, and phonemes can be used to study how sounds combine into words in a particular language.
The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope. That is what this article is about.
Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s and have been state-of-the-art ever since.
We first give a high-level overview of the implementation steps, then dig into why we do each of them.
An audio signal is constantly changing, so to simplify things we assume that on short time scales the audio signal doesn’t change much. This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don’t have enough samples to get a reliable spectral estimate, if it is longer the signal changes too much throughout the frame.
The next step is to calculate the power spectrum of each frame. The periodogram spectral estimate performs a job similar to that of the human cochlea, identifying which frequencies are present in the frame.
The periodogram spectral estimate still contains a lot of information not required for Automatic Speech Recognition (ASR). In particular, the cochlea cannot discern the difference between two closely spaced frequencies. This effect becomes more pronounced as the frequencies increase.
For this reason we take clumps of periodogram bins and sum them up to get an idea of how much energy exists in various frequency regions.
This is performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less concerned about variations.
Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with.
So once we have the filterbank energies, we take the logarithm of them.
Why the logarithm and not a cube root? The logarithm allows us to use cepstral mean subtraction, which is a channel normalisation technique.
The final step is to compute the DCT of the log filterbank energies.
Because our filterbanks are all overlapping, the filterbank energies are quite correlated with each other. The DCT decorrelates the energies which means diagonal covariance matrices can be used to model the features in e.g. a HMM classifier.
Only 12 of the 26 DCT coefficients are kept.
This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.
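As a sketch of these last two steps (the log, the DCT, and keeping only 12 of the 26 coefficients), assuming NumPy and the orthonormal DCT-II; the function names here are my own, not from any particular library:

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II along the last axis."""
    N = x.shape[-1]
    n = np.arange(N)
    # basis[k, n] = cos(pi/N * (n + 0.5) * k)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5) * n[:, None])
    y = x @ basis.T
    # Orthonormal scaling so the transform is energy-preserving.
    y[..., 0] *= np.sqrt(1.0 / N)
    y[..., 1:] *= np.sqrt(2.0 / N)
    return y

def cepstral_coefficients(log_energies, num_ceps=12):
    """DCT of the log filterbank energies, keeping only the lowest 12 of 26."""
    return dct2_ortho(log_energies)[..., :num_ceps]
```

For constant log energies, every coefficient except the zeroth comes out (numerically) zero, which illustrates why the higher coefficients capture only the fast changes across the filterbank.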
There are a few more things commonly done, sometimes the frame energy is appended to each feature vector. Delta and Delta-Delta features are usually also appended. Liftering is also commonly applied to the final features.
The Mel scale relates perceived frequency, or pitch of a pure tone to its actual measured frequency.
Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies; using the Mel scale makes our features match more closely what humans actually hear.
The formula for converting from frequency to Mel scale is:

    M(f) = 1125 * ln(1 + f/700)    (1)
To go from Mels back to frequency:

    M^-1(m) = 700 * (exp(m/1125) - 1)    (2)
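These two conversions can be sketched directly from equations 1 and 2 (function names are my own):

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale (equation 1)."""
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Convert a Mel value back to frequency in Hz (equation 2)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

With these, 300 Hz comes out at roughly 401.25 Mels and 8000 Hz at roughly 2834.99 Mels, matching the worked example later in this article.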
Let us start with the speech signal; we will assume a sampling rate of 16 kHz.
1. Frame the signal into 20-40 ms frames. 25ms is standard. This means the frame length for a 16kHz signal is 0.025*16000 = 400 samples. Frame step is usually something like 10ms (160 samples), which allows some overlap to the frames. The first 400 sample frame starts at sample 0, the next 400 sample frame starts at sample 160 etc. until the end of the speech file is reached. If the speech file does not divide into an even number of frames, pad it with zeros so that it does.
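The framing step can be sketched as follows (assuming NumPy; the function name `frame_signal` is my own, and the defaults correspond to 25 ms frames with a 10 ms step at 16 kHz):

```python
import numpy as np

def frame_signal(signal, frame_len=400, frame_step=160):
    """Split a 1-D signal into overlapping frames, zero-padding the end."""
    n = len(signal)
    # Number of frames needed to cover the whole signal.
    if n <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + int(np.ceil((n - frame_len) / frame_step))
    # Pad with zeros so the last frame is complete.
    pad_len = (num_frames - 1) * frame_step + frame_len
    padded = np.concatenate([signal, np.zeros(pad_len - n)])
    # Index matrix: row i selects the samples of frame i,
    # i.e. samples i*frame_step .. i*frame_step + frame_len - 1.
    idx = (np.arange(frame_len)[None, :]
           + frame_step * np.arange(num_frames)[:, None])
    return padded[idx.astype(np.int32)]
```

For a 1000-sample signal this yields 5 frames of 400 samples, with the last frame partly zero-padded.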
The next steps are applied to every single frame, and one set of 12 MFCC coefficients is extracted for each frame. A short aside on notation: we call our time-domain signal s(n). Once it is framed we have s_i(n), where n ranges over 1-400 (if our frames are 400 samples) and i ranges over the number of frames. When we calculate the complex DFT, we get S_i(k), where the i denotes the frame number corresponding to the time-domain frame. P_i(k) is then the power spectrum of frame i.
2. To take the Discrete Fourier Transform of the frame, perform the following:

    S_i(k) = sum_{n=1..N} s_i(n) h(n) exp(-j*2*pi*k*n/N),    1 <= k <= K

where h(n) is an N-sample-long analysis window (e.g. a Hamming window) and K is the length of the DFT. The periodogram-based power spectral estimate for the speech frame s_i(n) is given by:

    P_i(k) = (1/N) * |S_i(k)|^2

This is called the periodogram estimate of the power spectrum. We take the absolute value of the complex Fourier transform and square the result. We would generally perform a 512-point FFT and keep only the first 257 coefficients.
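A minimal sketch of this step, assuming NumPy and frames shaped (num_frames, 400) as above (the function name is my own):

```python
import numpy as np

def periodogram(frames, nfft=512):
    """Periodogram power-spectral estimate for each windowed frame.

    Returns the first nfft/2 + 1 coefficients (257 for a 512-point FFT).
    """
    # Apply a Hamming analysis window h(n) to each frame.
    windowed = frames * np.hamming(frames.shape[1])
    # Magnitude of the complex DFT, keeping only non-negative frequencies.
    mag = np.abs(np.fft.rfft(windowed, n=nfft))
    # Square and scale by 1/N to get P_i(k).
    return (1.0 / nfft) * mag ** 2
```

`np.fft.rfft` zero-pads each 400-sample frame out to the 512-point FFT length and returns exactly the 257 non-redundant coefficients.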
3. Compute the Mel-spaced filterbank. This is a set of 20-40 (26 is standard) triangular filters that we apply to the periodogram power spectral estimate from step 2. Our filterbank comes in the form of 26 vectors of length 257 (assuming the FFT settings from step 2). Each vector is mostly zeros, but is non-zero for a certain section of the spectrum. To calculate filterbank energies we multiply each filterbank with the power spectrum, then add up the coefficients. Once this is performed we are left with 26 numbers that give us an indication of how much energy was in each filterbank. For a detailed explanation of how to calculate the filterbanks see below. Here is a plot to hopefully clear things up:
Plot of Mel Filterbank and windowed power spectrum
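Multiply-and-sum over all filters is just a matrix product. A sketch, assuming NumPy, a (26, 257) filterbank matrix and the (num_frames, 257) periodogram from step 2 (function name mine):

```python
import numpy as np

def filterbank_energies(pspec, fbank):
    """Total power falling inside each triangular filter, per frame.

    pspec: (num_frames, 257) periodogram estimate.
    fbank: (26, 257) Mel filterbank, one filter per row.
    """
    energies = pspec @ fbank.T          # shape: (num_frames, 26)
    # Floor zero energies at a tiny value so the upcoming log is defined.
    return np.where(energies == 0, np.finfo(float).eps, energies)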
Finally, we take the log of each of the 26 filterbank energies and then the DCT of those log energies, keeping only the lowest 12 coefficients, as described above. The resulting features (12 numbers for each frame) are called Mel Frequency Cepstral Coefficients.
Next we turn to the question raised above: how is the Mel filterbank computed?
In this section the example will use 10 filterbanks because it is easier to display, in reality you would use 26-40 filterbanks.
To get the filterbanks shown in figure 1(a) we first have to choose a lower and upper frequency. Good values are 300Hz for the lower and 8000Hz for the upper frequency. Of course if the speech is sampled at 8000Hz our upper frequency is limited to 4000Hz. Then follow these steps:
Using equation 1, convert the upper and lower frequencies to Mels. In our case 300Hz is 401.25 Mels and 8000Hz is 2834.99 Mels.
For this example we will do 10 filterbanks, for which we need 12 points. This means we need 10 additional points spaced linearly between 401.25 and 2834.99. This comes out to:
m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74,
1949.99, 2171.24, 2392.49, 2613.74, 2834.99
Now use equation 2 to convert these back to Hertz:

h(i) = 300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 3261.62, 4122.63, 5170.76, 6446.70, 8000
Notice that our start- and end-points are at the frequencies we wanted.
We don’t have the frequency resolution required to put filters at the exact points calculated above, so we need to round those frequencies to the nearest FFT bin. This process does not affect the accuracy of the features. To convert the frequencies to FFT bin numbers we need to know the FFT size and the sample rate:
f(i) = floor((nfft+1)*h(i)/samplerate)
This results in the following sequence:
f(i) = 9, 16, 25, 35, 47, 63, 81, 104, 132, 165, 206, 256
We can see that the final filterbank finishes at bin 256, which corresponds to 8kHz with a 512 point FFT size.
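The whole worked example above can be reproduced in a few lines (assuming the standard library only; the function name is my own):

```python
import math

def fft_bin_points(low_hz=300.0, high_hz=8000.0, num_filters=10,
                   nfft=512, samplerate=16000):
    """Map num_filters + 2 Mel-spaced points to FFT bin numbers f(i)."""
    low_mel = 1125.0 * math.log(1.0 + low_hz / 700.0)     # ~401.25
    high_mel = 1125.0 * math.log(1.0 + high_hz / 700.0)   # ~2834.99
    # num_filters + 2 points spaced linearly on the Mel scale.
    mels = [low_mel + i * (high_mel - low_mel) / (num_filters + 1)
            for i in range(num_filters + 2)]
    # Convert back to Hz (equation 2), then round down to the nearest bin.
    hz = [700.0 * (math.exp(m / 1125.0) - 1.0) for m in mels]
    return [int(math.floor((nfft + 1) * h / samplerate)) for h in hz]
```

With the defaults this returns exactly the f(i) sequence listed above, ending at bin 256 for 8 kHz.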
Now we create our filterbanks. The first filterbank will start at the first point, reach its peak at the second point, then return to zero at the 3rd point. The second filterbank will start at the 2nd point, reach its max at the 3rd, then be zero at the 4th etc. A formula for calculating these is as follows:

    H_m(k) = 0                                  if k < f(m-1)
           = (k - f(m-1)) / (f(m) - f(m-1))     if f(m-1) <= k <= f(m)
           = (f(m+1) - k) / (f(m+1) - f(m))     if f(m) <= k <= f(m+1)
           = 0                                  if k > f(m+1)

where m ranges over the M filters we want, and f() is the list of M+2 Mel-spaced frequencies (converted to FFT bin numbers as above).
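A direct sketch of this construction, assuming NumPy and the f(i) bin list from the previous step (function name mine):

```python
import numpy as np

def triangular_filterbank(bins, nfft=512):
    """Build M triangular filters H_m(k) from the M+2 FFT bin points.

    bins: e.g. [9, 16, 25, ..., 256]; returns an array (M, nfft//2 + 1).
    """
    num_filters = len(bins) - 2
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        # Rising slope from f(m-1) up to the peak at f(m) ...
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / (centre - left)
        # ... and falling slope from f(m) down to zero at f(m+1).
        for k in range(centre, right + 1):
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank
```

Each row peaks at 1.0 at its centre bin and is zero at its neighbours' centres, so adjacent filters overlap exactly as in the plot below.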
The final plot of all 10 filters overlayed on each other is:
A Mel-filterbank containing 10 filters. This filterbank starts at 0Hz and ends at 8000Hz. This is a guide only, the worked example above starts at 300Hz.
Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but it seems like speech would also have information in the dynamics i.e. what are the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance by quite a bit (if we have 12 MFCC coefficients, we would also get 12 delta coefficients, which would combine to give a feature vector of length 24).
To calculate the delta coefficients, the following formula is used:

    d_t = ( sum_{n=1..N} n * (c_{t+n} - c_{t-n}) ) / ( 2 * sum_{n=1..N} n^2 )

where d_t is a delta coefficient of frame t, computed in terms of the static coefficients c_{t-N} to c_{t+N}. A typical value for N is 2. Delta-Delta (Acceleration) coefficients are calculated in the same way, but they are calculated from the deltas, not the static coefficients.
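The delta computation can be sketched as follows (assuming NumPy; the function name and the edge-padding choice at the start and end of the utterance are mine):

```python
import numpy as np

def delta(feat, N=2):
    """Delta coefficients of a (num_frames, num_ceps) feature matrix.

    Frames beyond the edges reuse the first/last frame (edge padding).
    """
    num_frames = feat.shape[0]
    # Denominator: 2 * sum of n^2 for n = 1..N (10 when N = 2).
    denom = 2.0 * sum(n ** 2 for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    out = np.zeros_like(feat, dtype=float)
    for t in range(num_frames):
        # padded[t + N] corresponds to original frame t.
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out
```

As a sanity check, features that increase by 1 per frame yield interior deltas of exactly 1. Delta-Delta coefficients would be obtained by calling the same function on the deltas.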
I have implemented MFCCs in python, available here. Use the ‘Download ZIP’ button on the right hand side of the page to get the code. Documentation can be found at readthedocs. If you have any troubles or queries about the code, you can leave a comment at the bottom of this page.
There is a good MATLAB implementation of MFCCs over here.
Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366.
X. Huang, A. Acero, and H. Hon. Spoken Language Processing: A guide to theory, algorithm, and system development. Prentice Hall, 2001.