
End to End Models for Speech Processing

30 Jul 2017
CTC - a probabilistic model p(Y | X), where Y is the output label sequence and X is the input audio.

CTC has a specific structure that is suited for speech (described below).

The way this model works is as follows:

The bidirectional arrows represent a bidirectional RNN.

How frame predictions map to output sequences
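
To make the mapping concrete, here is a minimal sketch (plain Python) of the standard CTC collapsing rule: merge runs of repeated frame labels, then drop the blank symbol. The "_" blank and the example labels are just illustrative.

```python
BLANK = "_"  # CTC blank symbol (illustrative choice)

def ctc_collapse(frame_labels):
    """Map per-frame predictions to an output sequence:
    1) merge runs of repeated labels, 2) remove blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:          # step 1: drop consecutive repeats
            collapsed.append(label)
        prev = label
    return [c for c in collapsed if c != BLANK]  # step 2: drop blanks

# e.g. the greedy (argmax) frame labels "_cc_aa_t_" collapse to "cat"
print(ctc_collapse(list("_cc_aa_t_")))   # ['c', 'a', 't']
```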

CTC - Language Models

Sequence to sequence with attention for Speech

We have a model here which basically does next-step prediction: you're given some data x and you've produced some symbols y_1 to y_i, and your model is just going to predict the probability of the next symbol y_{i+1}.

In translation, X would be the source language; in speech, x itself is this huge sequence of audio, which is then encoded with a recurrent neural network.
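
A minimal sketch of that next-step prediction loop, assuming a hypothetical `model.next_token_probs(x, prefix)` that returns a distribution over the vocabulary (not any particular library's API):

```python
def greedy_decode(model, x, max_len=100, eos="<eos>"):
    """Greedy next-step prediction: at each step, feed the symbols
    produced so far (y_1..y_i) and pick the most likely y_{i+1}."""
    prefix = []
    for _ in range(max_len):
        probs = model.next_token_probs(x, prefix)   # p(y_{i+1} | y_1..y_i, x)
        next_token = max(probs, key=probs.get)      # greedy choice
        if next_token == eos:
            break
        prefix.append(next_token)
    return prefix
```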

Attention example

What it needs in order to function is the ability to look at different parts of the temporal space, because the input is really, really long.

The attention vector essentially looks at different time steps of the input, so if you keep doing this over the entire input stream, you get a forward-moving attention that the model learns by itself.

Here it produces the transcript "cancel cancel cancel".

The entire input is looked at at every time step, so you don't really need to add breaks anywhere; you just produce one token and then produce the next one conditioned on the last token you produced.

So essentially it's just doing next-step prediction: you have a neural network, the decoder in a seq2seq model, that looks at the entire input through the encoder. You feed in the past symbols that you produced, and because it's a recurrent neural network, you can just keep feeding symbols and the length issue does not arise.

So you feed in the past symbols as the RNN runs, and you're just predicting the next token as the output.

How does the Listen, Attend and Spell attention model work?

You have an encoder on the left-hand side. It seems to have a special structure, but for now just forget that; just remember that for every time step of the input it produces some vector representation which encodes the input, represented as a hidden vector h_t at time step t. So you have h_t, and you're generating the next character at every time step with the decoder (right-hand side).

What you do is take the state vector of the decoder, i.e. the bottom layer of the recurrent neural network that is the decoder, and compare that state vector against each of the hidden time steps of the encoder.

So you basically have the function e_t here: a function f takes in the concatenation of the encoder hidden state h_t at time step t with the decoder state s, and produces a single number e_t = f([h_t, s]).

Now you do that for every time step of the encoder, so you have a trend in time over the encoder space, and that's kind of like a similarity between your query (the decoder state) and your source from the encoder.

So you get this trend of e_t values, and of course these are just scalars; you want to keep their magnitudes under control, so you pass them through a softmax that normalizes across the time steps. That's what's called the attention vector, and plotting it shows you how the attention shifts as the query changes over time.

So at every time step you get an attention vector which shows you where to look for that time step. Then you move to the next time step, recompute a new attention vector, and do that over and over again.

What you do now is use these probabilities over time steps to blend the hidden states together and get one context vector, which is the representation that is of interest to you in actually making the prediction for that time step.

So here you would take all the hidden states and the corresponding attention values, multiply them and add them together; that gives you a context vector, and this context vector is really the content that will guide the prediction you make. You take the context vector, concatenate it with the state of your RNN, pass it through a neural network, and you get a prediction at that time step. This prediction (for example, the symbol Y) is the probability of the next token, given all the past tokens you produced and all the input that was fed into the encoder.
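
Putting those steps together, here is a minimal sketch of one decoder time step in plain Python. The `score_fn` and `predict_fn` names stand in for the small neural networks (the scoring function f and the output layer); the vectors are plain Python lists, so concatenation is just list addition.

```python
import math

def softmax(scores):
    """Normalize the scalar scores e_t across encoder time steps."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_and_predict(encoder_states, decoder_state, score_fn, predict_fn):
    """One decoder step with attention:
    1) e_t = score_fn([h_t; s])   for each encoder state h_t
    2) a   = softmax(e)           the attention vector
    3) c   = sum_t a_t * h_t      the context vector
    4) predict_fn([c; s])         distribution over the next token"""
    scores = [score_fn(h + decoder_state) for h in encoder_states]   # concat = list +
    attention = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(attention, encoder_states))
               for d in range(dim)]
    return predict_fn(context + decoder_state), attention
```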

Dive into the Encoder

This hierarchical encoder is a replacement for a plain recurrent neural network: instead of processing one frame at every time step, you collapse neighboring frames as you feed them into the next layer. What this does is reduce, layer by layer, the number of time steps you have to process, and it also makes the processing faster.

So if you do this a few times, by the time you get to the top layer of the encoder the number of time steps has been reduced significantly, and your attention model is able to work a lot better.
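
A minimal sketch of that time reduction, assuming each layer simply concatenates pairs of neighboring frames before handing them to the next RNN layer (the real pyramidal encoder in Listen, Attend and Spell runs an RNN over these concatenated frames; the sketch only shows the reduction in time steps):

```python
def reduce_time(frames):
    """Collapse neighboring frames: concatenate each pair into one frame,
    halving the number of time steps at this layer."""
    if len(frames) % 2 == 1:                 # pad by repeating the last frame if odd
        frames = frames + [frames[-1]]
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]

# e.g. 8 input frames -> 4 -> 2 after two pyramid layers,
# so the attention model only has to scan 2 time steps instead of 8
frames = [[float(t)] for t in range(8)]
layer1 = reduce_time(frames)   # 4 frames, each of dimension 2
layer2 = reduce_time(layer1)   # 2 frames, each of dimension 4
print(len(layer1), len(layer2))  # 4 2
```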

LAS highlights - Multimodal (multi-peaked) outputs

Another aspect of the model is causality.

If you look at where the attention goes before it produces "st", and how it then starts moving forward when "mary" comes along, it just dwells in that part of the attention space. So that's a case where whatever symbol you've produced really affects how the neural network behaves at the next few time steps, and that's a very strong characteristic of this model.

LAS - Results

Limitations of LAS (Seq2Seq)

Online Sequence to Sequence Models

This model is called a Neural Transducer

You take the input as it comes in, and every so often, at regular intervals, you run a sequence-to-sequence model on what you received in the last block.

Notice that since we've blocked up the inputs, we have this situation where you may have received some input but you can't produce an output yet, so we need a blank symbol in this model.
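
A rough sketch of that block-wise behaviour, with a hypothetical `seq2seq_decode_block` function standing in for the per-block attention decoder (in the real model the decoder also carries recurrent state from block to block):

```python
END_OF_BLOCK = "<e>"   # emitted when there is nothing (more) to output for this block

def neural_transducer_decode(audio_stream, block_size, seq2seq_decode_block):
    """Consume the input in fixed-size blocks; after each block arrives, run
    the seq2seq decoder on it and emit tokens until it produces the
    end-of-block symbol (any trailing partial block is ignored for brevity)."""
    outputs, block = [], []
    for frame in audio_stream:
        block.append(frame)
        if len(block) == block_size:
            tokens = seq2seq_decode_block(block, outputs)   # hypothetical decoder call
            outputs.extend(t for t in tokens if t != END_OF_BLOCK)
            block = []
    return outputs
```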

One nice thing about this model is that it maintains causality.

So the transducer preserves the advantages of a seq2seq model, but it also introduces an alignment problem: you have to produce some symbols as outputs, but you don't know which chunk those symbols should be aligned to, and you have to solve that problem during learning.

The probability distribution turns out to be the probability of y_1 to y_S given x, modeled by summing over the possible ways of aligning those output tokens to the input blocks; in practice this is approximated by keeping only the best alignments, as described next.

The way this works is that you consider the best candidates at the end of a block: from having produced j-1 tokens, or j-2 tokens, or j tokens at the end of block b-1. So you know, if I wanted to have produced j-2 tokens at the end of the previous block, what the best probability is; from that point (bottom left) you can now extend by one, two, or three symbols, and you get different paths that reach the same point. So when you're considering the different ways of reaching a point, you just find the best one and keep that around, and then extend those to the next block.

It's kind of an approximate procedure, because this ability to extend by a symbol is not Markovian: taking the max of the previous step extended by one may be wrong, because the true best candidate might come from two steps away.
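
A rough sketch of that approximate dynamic program, with a hypothetical `extend_log_prob(b, j_prev, j)` scoring the emission of target tokens j_prev+1 through j in block b:

```python
import math

def approximate_alignment(num_blocks, num_tokens, extend_log_prob):
    """best[b][j]: (approximate) best log-probability of having emitted the
    first j target tokens by the end of block b.  Only the single best way of
    reaching each (b, j) is kept; because the extension score really depends
    on the decoder state along the kept path, this is approximate, not exact."""
    best = [[-math.inf] * (num_tokens + 1) for _ in range(num_blocks + 1)]
    best[0][0] = 0.0
    for b in range(1, num_blocks + 1):
        for j in range(num_tokens + 1):
            for j_prev in range(j + 1):      # extend by j - j_prev tokens in block b
                cand = best[b - 1][j_prev] + extend_log_prob(b, j_prev, j)
                if cand > best[b][j]:
                    best[b][j] = cand
    return best[num_blocks][num_tokens]
```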

To make the seq2seq model better…

Stack them as feature maps and put a convolutional neural network on top.
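
A minimal sketch of that idea in PyTorch, assuming the inputs are filterbank frames stacked with their deltas as the channels of a 2-D feature map (a common choice, not necessarily the exact setup described here):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch of 8 utterances, 3 channels
# (filterbank, delta, delta-delta), 40 filterbank bins, 200 frames.
features = torch.randn(8, 3, 40, 200)

conv_frontend = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # downsample freq & time
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

out = conv_frontend(features)                          # (8, 32, 10, 50)
# Flatten channels x frequency into one vector per time step and feed the
# resulting sequence to the seq2seq encoder RNN.
encoder_inputs = out.permute(0, 3, 1, 2).flatten(2)    # (8, 50, 320)
print(encoder_inputs.shape)
```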

Choosing the correct output targets

Word Pieces

The above is not very appropriate for audio, so our approach was to try to learn this automatically:

Latent Sequence Decompositions

