This paper proposes an improvement to LSA (Latent Semantic Analysis) that rests on a sound statistical foundation. LSA is an approach to automatic indexing and information retrieval that attempts to overcome the synonymy and polysemy problems of exact term matching by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the vector-space representation of documents based on term frequencies and applies a Singular Value Decomposition (SVD) to the corresponding term/document matrix.
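As a concrete picture of that LSA step, here is a minimal NumPy sketch of a truncated SVD of a term/document matrix; the toy count matrix below is made up for illustration:

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows: terms, cols: documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

# LSA: SVD of the term/document matrix, then keep only k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of X

# Documents now live in a k-dimensional latent semantic space:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]       # shape (k, n_docs)
```

The rank-k matrix X_k replaces the raw counts, and the columns of doc_vectors are the document coordinates in the latent space where similarity is computed.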
Model Formulation
Probabilistic Latent Semantic Analysis (PLSA) is based on the likelihood principle and defines a proper generative model of the data. The PLSA model, which has been called the aspect model, is defined by three kinds of variables: a latent class variable z, an observed word w, and an observed document d.
The generative model is defined as follows:

P(d, w) = P(d) P(w|d)    (1)

P(w|d) = Σ_z P(w|z) P(z|d)    (2)
And by using Bayes' rule we can re-parameterize (1) and (2) as

P(d, w) = Σ_z P(z) P(d|z) P(w|z)
Following the likelihood principle, one determines P(d), P(z|d), and P(w|z) by maximizing the log-likelihood function

L = Σ_d Σ_w n(d, w) log P(d, w),

where n(d, w) denotes the number of times word w occurs in document d.
The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm.
For this aspect model, the E-step is:

P(z|d, w) = P(z|d) P(w|z) / Σ_{z'} P(z'|d) P(w|z')
And the M-step:

P(w|z) ∝ Σ_d n(d, w) P(z|d, w),

P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
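The two steps above can be sketched as a toy NumPy implementation under the asymmetric parameterization P(d, w) = P(d) Σ_z P(w|z) P(z|d); the counts, sizes, and iteration budget below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_words = 4, 2, 6

# Hypothetical term counts n(d, w) standing in for a real corpus.
n = rng.integers(1, 5, size=(n_docs, n_words)).astype(float)

# Random normalized initial guesses for P(z|d) and P(w|z).
p_z_d = rng.random((n_docs, n_topics))
p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((n_topics, n_words))
p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: P(z|d,w) ∝ P(z|d) P(w|z), normalized over z.
    post = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (d, z, w)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected counts n(d,w) P(z|d,w).
    expected = n[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

# P(d) is simply the empirical fraction of tokens in each document.
p_d = n.sum(axis=1) / n.sum()
```

Each M-step renormalizes the expected counts, so every row of p_w_z and p_z_d remains a proper probability distribution throughout.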
To avoid overfitting, they modified the traditional EM algorithm by introducing a parameter β, modifying the E-step to

P(z|d, w) = P(z|d) [P(w|z)]^β / Σ_{z'} P(z'|d) [P(w|z')]^β

β = 1 recovers the standard E-step, while for β < 1 the likelihood part in Bayes' formula is discounted. This is their so-called tempered EM (TEM) algorithm.
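A minimal sketch of that tempered E-step, following the same asymmetric parameterization; the toy parameter values below are made up for illustration:

```python
import numpy as np

def tempered_posterior(p_z_d, p_w_z, beta):
    # Tempered E-step: P(z|d,w) ∝ P(z|d) [P(w|z)]**beta, normalized over z.
    # beta = 1 recovers the standard E-step; beta < 1 discounts the
    # likelihood part P(w|z) of Bayes' formula.
    post = p_z_d[:, :, None] * p_w_z[None, :, :] ** beta
    return post / post.sum(axis=1, keepdims=True)

# Toy parameters: 2 documents, 2 topics, 3 words (assumed values).
p_z_d = np.array([[0.7, 0.3], [0.4, 0.6]])
p_w_z = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
post_std = tempered_posterior(p_z_d, p_w_z, beta=1.0)
post_tem = tempered_posterior(p_z_d, p_w_z, beta=0.8)
```

Lowering β flattens the posterior over z, which is what damps the over-aggressive fitting of the standard E-step.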
The author also explains the idea through the geometry of the model, but I skip that part here.
Indexing
One of the most popular families of information retrieval techniques is based on the Vector-Space Model (VSM) for documents. A VSM variant is characterized by three ingredients:
(i) a transformation function (also called local term weight)
(ii) a term weighting scheme (also called global term weight)
(iii) a similarity measure
Their matching function is:

s(q, d) = Σ_w idf(w) n(q, w) · idf(w) n(d, w) / ( sqrt(Σ_w [idf(w) n(q, w)]²) · sqrt(Σ_w [idf(w) n(d, w)]²) ),

where the terms in the numerator, of the form idf(w)·n(d, w), are the weighted word frequencies.
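A minimal sketch of such a cosine match over weighted word frequencies; the vocabulary size, counts, and idf values below are made up for illustration:

```python
import numpy as np

def vsm_score(n_q, n_d, idf):
    # Cosine similarity between the weighted word frequencies
    # idf(w) * n(q, w) and idf(w) * n(d, w), over a shared vocabulary.
    q, d = idf * n_q, idf * n_d
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom else 0.0

# Toy vocabulary of 4 words; query and document term counts plus idf weights.
idf = np.array([0.2, 1.5, 1.0, 2.0])
query = np.array([1.0, 2.0, 0.0, 0.0])
doc = np.array([1.0, 1.0, 3.0, 0.0])
score = vsm_score(query, doc, idf)
```

Because counts and idf weights are non-negative, the score lies in [0, 1], with 1 reached only when the weighted vectors are parallel.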
Comments:
This work shows the novelty of applying standard statistical techniques to questions like model fitting, model combination, and complexity control.
And it improves performance over LSA, with the added benefit of being able to detect synonyms as well as words that refer to the same topic.
But since their method relies on mixture decomposition, the EM algorithm takes a lot of computation time as the number and size of documents grow, which is an obstacle to scaling this method.







