Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a popular form of statistical topic modeling. In LDA, each document is represented as a mixture of topics, and each topic is a collection of related words. The topics are never observed directly; they sit in a hidden layer, also known as a latent layer.

Given a document, LDA infers the set of topics most likely to have generated the words in it. If a document uses many of the words that carry weight in a topic, the document is probably, at least in part, about that topic.

Although a topic is made up of words, those words are not equally likely. For example, the topic “domesticated animals” might assign a probability of 50% to dog, 30% to cat, and 20% to goldfish.
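To make the idea concrete, here is a minimal sketch in Python of a topic as a word-probability table, and of a document drawn from a mixture of topics. The "wild animals" topic, the 80/20 mixture, and the function name generate_document are assumptions made up for illustration; only the "domesticated animals" numbers come from the example above.

```python
import random

# The first topic uses the probabilities from the example above.
# The second topic is hypothetical and exists only for contrast.
topics = {
    "domesticated animals": {"dog": 0.5, "cat": 0.3, "goldfish": 0.2},
    "wild animals": {"wolf": 0.6, "lion": 0.4},
}

def generate_document(topic_mix, length, seed=0):
    """Draw `length` words: pick a topic per word, then a word from that topic."""
    rng = random.Random(seed)
    names = list(topic_mix)
    mix_weights = [topic_mix[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=mix_weights)[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

# A document that is 80% about domesticated animals (an assumed mixture).
mix = {"domesticated animals": 0.8, "wild animals": 0.2}
doc = generate_document(mix, length=5000)

# Observed share of each word in the generated document.
share = {w: doc.count(w) / len(doc) for w in ("dog", "cat", "goldfish")}
print(share)
```

With a long enough document, the share of dog should land near 0.8 × 0.5 = 0.4, since a word is produced by first choosing the topic and then choosing the word within it.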

LDA works with two kinds of quantities: the words within each document, which are known, and the probability of each word belonging to each topic, which must be calculated. For a given document, the algorithm tries to determine how many of its words belong to a specific topic. It also tries to determine, across all documents, how often a given word is assigned to a specific topic.
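The two counts described above can be sketched with collapsed Gibbs sampling, which is one common way to fit LDA (the text does not say which inference method is intended, so treat this choice as an assumption). The function name lda_gibbs, the hyperparameters alpha and beta, and the toy corpus are all made up for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]  # words in doc d assigned to topic k
    topic_word = [{} for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                # all words assigned to topic k
    assignments = []

    def update(d, w, k, delta):
        # Add or remove one word occurrence from all three count tables.
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            update(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)  # forget the current assignment
                # Weight of topic k: how much doc d uses k, times how much k uses w.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, w, k, +1)
    return doc_topic, topic_word

# Toy corpus, invented for this sketch.
docs = [
    "dog cat dog goldfish cat dog".split(),
    "cat dog goldfish dog cat".split(),
    "stock market bank stock trade".split(),
    "bank trade market stock bank".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(doc_topic)   # per document: how many words were assigned to each topic
print(topic_word)  # per topic: how often each word was assigned to it
```

Here doc_topic corresponds to "how many words in a document belong to a topic", and topic_word to "how often a word is assigned to a topic". Results vary with the seed and the hyperparameters, and a corpus this small is only useful for seeing the mechanics.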

LDA Topic Models