
Mel Frequency Cepstral Coefficient (MFCC) tutorial

20 Oct 2016

TERM: MFCC, Mel Frequency Cepstral Coefficients (Chinese: 梅尔频率倒谱系数)

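In any automatic speech recognition system, the first step is usually feature extraction: identifying the components of the audio signal that help us recognize the linguistic content, and discarding everything else, such as background noise and emotion.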

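The main thing to understand about speech is that the sounds a person produces are filtered by the shape of the vocal tract, including the tongue, teeth and so on. This shape determines what sound comes out. If we can determine the shape accurately, we get an accurate representation of the phoneme being produced.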

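A phoneme is defined by its ability to distinguish meaning. If two sounds stand for the same word and the same meaning, they are allophones and can be treated as one phoneme. Conversely, if any phoneme in a word is replaced by another, it is no longer the same word and its meaning changes. Every meaningful word is made up of phonemes, but substituting one of them is not guaranteed to produce a meaningful word; the result may be a meaningless string of sounds. Each language has its own set of phonemes, which forms the sound system of that language, and phonemes can be used to study how a particular language combines sounds into words. (In Chinese, the term 音位 is sometimes also rendered as 音素.)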

The shape of the vocal tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to represent this envelope accurately. That is what this post is about.

Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s, and have been state-of-the-art ever since.

Steps at a Glance

We start with a high-level outline of the implementation steps, then look in more depth at why each step is done.

A few more things are commonly done. Sometimes the frame energy is appended to each feature vector, Delta and Delta-Delta features are usually appended as well, and liftering is commonly applied to the final features.

Mel Scale

The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency.

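Humans are better at discerning small changes in pitch at low frequencies than at high frequencies. Using the Mel scale makes our features match more closely what humans actually hear.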

The formula for converting from frequency to Mel scale is:

$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right) \tag{1}$$

To go from Mels back to frequency:

$$M^{-1}(m) = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right) \tag{2}$$
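The two conversions can be sketched in a few lines of Python. This is a minimal sketch that assumes the common form M(f) = 1125 ln(1 + f/700), which reproduces the Mel values quoted in the worked example further down (about 401.25 Mels for 300Hz and 2834.99 Mels for 8000Hz). The function names `hz_to_mel` and `mel_to_hz` are illustrative choices, not taken from any particular library.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to Mels (assumed form of equation 1)."""
    return 1125.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Convert a value in Mels back to Hz (assumed form of equation 2)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


if __name__ == "__main__":
    for hz in (300.0, 1000.0, 8000.0):
        mel = hz_to_mel(hz)
        # Converting back should recover the original frequency
        # up to floating point error.
        print(hz, mel, mel_to_hz(mel))
```

Because the mapping is logarithmic, equal steps in Mels correspond to narrow steps in Hz at low frequencies and wide steps at high frequencies, which is the behaviour the filterbank construction relies on.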

Implementation Steps

We start with a speech signal, and we'll assume it is sampled at 16kHz.

Figure: Plot of Mel Filterbank and windowed power spectrum

The resulting features (12 numbers for each frame) are called Mel Frequency Cepstral Coefficients.
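As a rough sketch of how those 12 numbers might be obtained from one frame's Mel filterbank energies, the code below assumes the conventional final stages: take the log of each filterbank energy, apply a DCT, and keep coefficients 1 to 12 while dropping the 0th. The choice of an unnormalised DCT-II, the energy floor, and the names `dct_ii` and `cepstral_coefficients` are assumptions made for illustration.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_coefficients(filterbank_energies, num_ceps=12, floor=1e-10):
    """Log-compress the filterbank energies, then decorrelate with a DCT.

    Assumption: coefficient 0 is dropped and the next `num_ceps` are kept.
    `floor` guards against taking the log of zero.
    """
    log_energies = [math.log(max(e, floor)) for e in filterbank_energies]
    cepstrum = dct_ii(log_energies)
    return cepstrum[1:num_ceps + 1]


if __name__ == "__main__":
    # 26 made-up filterbank energies for a single frame.
    energies = [1.0 + 0.5 * i for i in range(26)]
    print(cepstral_coefficients(energies))
```

The DCT is used here because neighbouring filterbank energies overlap and are therefore correlated; keeping only the low-order coefficients retains the smooth spectral envelope and discards fine detail.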

Next comes the question of how to compute the Mel filterbank mentioned above:

Computing the Mel filterbank

In this section the example uses 10 filterbanks because that is easier to display; in practice you would use 26-40 filterbanks.

To get the filterbanks shown in figure 1(a) we first have to choose a lower and upper frequency. Good values are 300Hz for the lower and 8000Hz for the upper frequency. Of course if the speech is sampled at 8000Hz our upper frequency is limited to 4000Hz. Then follow these steps:

  1. Using equation 1, convert the upper and lower frequencies to Mels. In our case 300Hz is 401.25 Mels and 8000Hz is 2834.99 Mels.

  2. For this example we will do 10 filterbanks, for which we need 12 points. This means we need 10 additional points spaced linearly between 401.25 and 2834.99. This comes out to:

    m(i) = 401.25, 622.50, 843.75, 1065.00, 1286.25, 1507.50, 1728.74, 
           1949.99, 2171.24, 2392.49, 2613.74, 2834.99
    
  3. Now use equation 2 to convert these back to Hertz:

    h(i) = 300, 517.33, 781.90, 1103.97, 1496.04, 1973.32, 2554.33, 
           3261.62, 4122.63, 5170.76, 6446.70, 8000
    

    Notice that our start- and end-points are at the frequencies we wanted.

  4. We don’t have the frequency resolution required to put filters at the exact points calculated above, so we need to round those frequencies to the nearest FFT bin. This process does not affect the accuracy of the features. To convert the frequencies to FFT bin numbers we need to know the FFT size and the sample rate:

    f(i) = floor((nfft+1)*h(i)/samplerate)
    

    This results in the following sequence:

    f(i) =  9, 16,  25,   35,   47,   63,   81,  104,  132, 165,  206,  256
    

    We can see that the final filterbank finishes at bin 256, which corresponds to 8kHz with a 512 point FFT size.

  5. Now we create our filterbanks. The first filterbank will start at the first point, reach its peak at the second point, then return to zero at the 3rd point. The second filterbank will start at the 2nd point, reach its max at the 3rd, then be zero at the 4th etc. A formula for calculating these is as follows:

    $$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\ 0 & k > f(m+1) \end{cases}$$

    where $M$ is the number of filters we want, and $f()$ is the list of $M+2$ Mel-spaced frequencies.
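The five steps above can be sketched in plain Python. This is an illustrative sketch rather than a reference implementation: it assumes the Mel conversion M(f) = 1125 ln(1 + f/700), a 512 point FFT and a 16kHz sample rate as in the worked example, and the function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    return 1125.0 * math.log(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def filterbank_bins(num_filters, low_hz, high_hz, nfft, sample_rate):
    """Steps 1-4: Mel-spaced points converted to Hz, then to FFT bin numbers."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    # num_filters + 2 points, spaced linearly on the Mel scale.
    mel_points = [low_mel + i * step for i in range(num_filters + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    return [int(math.floor((nfft + 1) * h / sample_rate)) for h in hz_points]


def triangular_filters(bins, nfft):
    """Step 5: one triangular filter per interior point of `bins`."""
    num_fft_bins = nfft // 2 + 1
    filters = []
    for m in range(1, len(bins) - 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * num_fft_bins
        for k in range(left, centre):           # rising edge
            row[k] = (k - left) / (centre - left)
        row[centre] = 1.0                       # peak
        for k in range(centre + 1, right + 1):  # falling edge
            row[k] = (right - k) / (right - centre)
        filters.append(row)
    return filters


if __name__ == "__main__":
    bins = filterbank_bins(10, 300.0, 8000.0, nfft=512, sample_rate=16000)
    print(bins)  # compare with the f(i) sequence in step 4
    print(len(triangular_filters(bins, nfft=512)))
```

Setting the peak explicitly and looping over the two edges separately avoids a division by zero when two neighbouring points round to the same FFT bin, which can happen with many filters and a small FFT.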

The final plot of all 10 filters overlaid on each other is:

Figure: A Mel-filterbank containing 10 filters. This filterbank starts at 0Hz and ends at 8000Hz. This is a guide only; the worked example above starts at 300Hz.

Deltas and Delta-Deltas 

Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but speech also carries information in its dynamics, i.e. in the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance by quite a bit. If we have 12 MFCC coefficients, we would also get 12 delta coefficients, which would combine to give a feature vector of length 24.

To calculate the delta coefficients, the following formula is used:

$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}$$

where $d_t$ is a delta coefficient, from frame $t$ computed in terms of the static coefficients $c_{t-N}$ to $c_{t+N}$. A typical value for $N$ is 2. Delta-Delta (Acceleration) coefficients are calculated in the same way, but they are calculated from the deltas, not the static coefficients.
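A minimal sketch of the delta computation follows, assuming the usual regression form in which each delta is a weighted sum of differences between frames n steps ahead and n steps behind, divided by twice the sum of n squared. How to handle the first and last frames is not specified above; this sketch repeats the edge frame, which is one common choice and is an assumption here. The function name `delta` is illustrative.

```python
def delta(features, n_window=2):
    """Delta coefficients for a list of frames.

    `features` is a list of frames, each frame a list of coefficients.
    Frames beyond either end are replaced by the nearest edge frame
    (an assumption; other padding schemes are possible).
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, n_window + 1))
    deltas = []
    for t in range(num_frames):
        frame = []
        for j in range(len(features[t])):
            total = 0.0
            for n in range(1, n_window + 1):
                ahead = features[min(t + n, num_frames - 1)][j]
                behind = features[max(t - n, 0)][j]
                total += n * (ahead - behind)
            frame.append(total / denom)
        deltas.append(frame)
    return deltas


if __name__ == "__main__":
    # Six frames with one coefficient each, rising by 1 per frame.
    static = [[float(t)] for t in range(6)]
    d = delta(static)
    dd = delta(d)  # Delta-Deltas are deltas of the deltas
    # Append deltas to the static features, frame by frame.
    combined = [s + x + y for s, x, y in zip(static, d, dd)]
    print(combined)
```

For a coefficient that rises by exactly 1 per frame, the interior frames get a delta of 1, which is a quick sanity check on the weighting.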

Implementations 

I have implemented MFCCs in python, available here. Use the ‘Download ZIP’ button on the right hand side of the page to get the code. Documentation can be found at readthedocs. If you have any troubles or queries about the code, you can leave a comment at the bottom of this page.

There is a good MATLAB implementation of MFCCs over here.

References 

Davis, S. and Mermelstein, P. (1980) Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28 No. 4, pp. 357-366.

X. Huang, A. Acero, and H. Hon. Spoken Language Processing: A guide to theory, algorithm, and system development. Prentice Hall, 2001.


Read the original article

Another blog post on this topic, in Chinese

CMU's MFCC lecture slides

