Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of grouped discrete data. Each group is described as a random mixture over a set of latent topics, where each topic is a discrete distribution over the collection’s vocabulary.
Corpus: a collection of documents (the groups)
Data: the words in each document
The generative process for a document collection D under the LDA model:
For each topic k = 1, …, K:
a) draw a topic-word distribution phi_k from Dirichlet(beta)
For each document d in D:
a) draw a document-topic distribution theta_d for the current document from Dirichlet(alpha)
b) for each word position i in d:
i. z_i <- Discrete(theta_d)
ii. w_i <- Discrete(phi_{z_i})
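The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under some assumptions that the notes do not state: alpha and beta are treated as scalars (symmetric Dirichlet priors), every document has the same fixed length, and the function name `generate_corpus` and its parameters are hypothetical.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process.

    Assumptions (not from the notes): symmetric scalar alpha and beta,
    and a fixed number of words per document.
    """
    rng = np.random.default_rng(seed)

    # For k = 1..K: draw a topic-word distribution phi_k ~ Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)   # shape (K, V)

    # For each document d: draw a document-topic distribution theta_d ~ Dirichlet(alpha).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)    # shape (D, K)

    docs, assignments = [], []
    for d in range(n_docs):
        # i. draw a topic assignment z_i ~ Discrete(theta_d) for every word position
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # ii. draw each word w_i ~ Discrete(phi_{z_i})
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        assignments.append(z)
        docs.append(w)
    return phi, theta, docs, assignments

phi, theta, docs, assignments = generate_corpus(
    n_docs=3, doc_len=5, n_topics=2, vocab_size=7, alpha=0.5, beta=0.5
)
```

Each entry of `docs` is an array of word ids, and the matching entry of `assignments` holds the latent topic that generated each word. In real data only the words are observed; the topics, theta and phi have to be inferred.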
The generative process described above results in the following joint distribution:
**p(w, z, θ, φ | α, β) = p(φ | β) p(θ | α) p(z | θ) p(w | φ, z)**
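Written out over topics, documents and word positions, the same factorization can be expanded as follows. The indexing is an assumption made here for illustration: N_d denotes the number of words in document d, and z_{d,i}, w_{d,i} are the topic assignment and word at position i of document d.

```latex
p(w, z, \theta, \phi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\phi_k \mid \beta)
    \prod_{d \in D} \left[ p(\theta_d \mid \alpha)
    \prod_{i=1}^{N_d} p(z_{d,i} \mid \theta_d)\, p(w_{d,i} \mid \phi_{z_{d,i}}) \right]
```

Each factor corresponds to one draw in the generative process: one Dirichlet draw per topic, one Dirichlet draw per document, and two discrete draws per word.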