This paper proposes an improvement to LSA (Latent Semantic Analysis) that rests on a sound statistical foundation. LSA is an approach to automatic indexing and information retrieval that attempts to overcome the synonymy and polysemy problems of exact term matching by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the vector-space representation of documents based on term frequencies and applies a Singular Value Decomposition (SVD) to the corresponding term/document matrix.
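As a concrete picture of that LSA step, here is a minimal NumPy sketch of a truncated SVD of a term/document matrix; the toy count matrix below is made up for illustration:

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows: terms, cols: documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

# LSA: SVD of the term/document matrix, then keep only k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of X

# Documents now live in a k-dimensional latent semantic space:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]       # shape (k, n_docs)
```

The rank-k matrix X_k replaces the raw counts, and the columns of doc_vectors are the document coordinates in the latent space where similarity is computed.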
Model Formulation
Probabilistic Latent Semantic Analysis (PLSA) is based on the likelihood principle and defines a proper generative model of the data. The PLSA model, which has been called the aspect model, is defined by three kinds of variables: a latent class variable z, an observed word w, and an observed document d.
The generative model is defined as follows:

P(d, w) = P(d) P(w|d)    (1)

P(w|d) = Σ_z P(w|z) P(z|d)    (2)
And by using Bayes' rule we can re-parameterize (1) and (2) as

P(d, w) = Σ_z P(z) P(d|z) P(w|z)
Following the likelihood principle, one determines P(d), P(z|d), and P(w|z) by maximizing the log-likelihood function

L = Σ_d Σ_w n(d, w) log P(d, w),

where n(d, w) denotes the number of times word w occurs in document d.
The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm.
For this aspect model, the E-step is:

P(z|d, w) = P(z|d) P(w|z) / Σ_{z'} P(z'|d) P(w|z')
And the M-step:

P(w|z) ∝ Σ_d n(d, w) P(z|d, w),

P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
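The two steps above can be sketched as a toy NumPy implementation under the asymmetric parameterization P(d, w) = P(d) Σ_z P(w|z) P(z|d); the counts, sizes, and iteration budget below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_words = 4, 2, 6

# Hypothetical term counts n(d, w) standing in for a real corpus.
n = rng.integers(1, 5, size=(n_docs, n_words)).astype(float)

# Random normalized initial guesses for P(z|d) and P(w|z).
p_z_d = rng.random((n_docs, n_topics))
p_z_d /= p_z_d.sum(axis=1, keepdims=True)
p_w_z = rng.random((n_topics, n_words))
p_w_z /= p_w_z.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: P(z|d,w) ∝ P(z|d) P(w|z), normalized over z.
    post = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (d, z, w)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected counts n(d,w) P(z|d,w).
    expected = n[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

# P(d) is simply the empirical fraction of tokens in each document.
p_d = n.sum(axis=1) / n.sum()
```

Each M-step renormalizes the expected counts, so every row of p_w_z and p_z_d remains a proper probability distribution throughout.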
To avoid overfitting, they modified the traditional EM algorithm by introducing a parameter β, modifying the E-step to

P(z|d, w) = P(z|d) [P(w|z)]^β / Σ_{z'} P(z'|d) [P(w|z')]^β

β = 1 recovers the standard E-step, while for β < 1 the likelihood part in Bayes' formula is discounted. This is their so-called tempered EM (TEM) algorithm.
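A minimal sketch of that tempered E-step, following the same asymmetric parameterization; the toy parameter values below are made up for illustration:

```python
import numpy as np

def tempered_posterior(p_z_d, p_w_z, beta):
    # Tempered E-step: P(z|d,w) ∝ P(z|d) [P(w|z)]**beta, normalized over z.
    # beta = 1 recovers the standard E-step; beta < 1 discounts the
    # likelihood part P(w|z) of Bayes' formula.
    post = p_z_d[:, :, None] * p_w_z[None, :, :] ** beta
    return post / post.sum(axis=1, keepdims=True)

# Toy parameters: 2 documents, 2 topics, 3 words (assumed values).
p_z_d = np.array([[0.7, 0.3], [0.4, 0.6]])
p_w_z = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
post_std = tempered_posterior(p_z_d, p_w_z, beta=1.0)
post_tem = tempered_posterior(p_z_d, p_w_z, beta=0.8)
```

Lowering β flattens the posterior over z, which is what damps the over-aggressive fitting of the standard E-step.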
The author also explains the idea through the geometry of the model, but I skip that part here.
Indexing
One of the most popular families of information retrieval techniques is based on the Vector-Space Model (VSM) for documents. A VSM variant is characterized by three ingredients:
(i) a transformation function (also called local term weight)
(ii) a term weighting scheme (also called global term weight)
(iii) a similarity measure
Their matching function is:

s(q, d) = Σ_w idf(w) n(q, w) · idf(w) n(d, w) / ( sqrt(Σ_w [idf(w) n(q, w)]²) · sqrt(Σ_w [idf(w) n(d, w)]²) ),

where the terms in the numerator, of the form idf(w)·n(d, w), are the weighted word frequencies.
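A minimal sketch of such a cosine match over weighted word frequencies; the vocabulary size, counts, and idf values below are made up for illustration:

```python
import numpy as np

def vsm_score(n_q, n_d, idf):
    # Cosine similarity between the weighted word frequencies
    # idf(w) * n(q, w) and idf(w) * n(d, w), over a shared vocabulary.
    q, d = idf * n_q, idf * n_d
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom else 0.0

# Toy vocabulary of 4 words; query and document term counts plus idf weights.
idf = np.array([0.2, 1.5, 1.0, 2.0])
query = np.array([1.0, 2.0, 0.0, 0.0])
doc = np.array([1.0, 1.0, 3.0, 0.0])
score = vsm_score(query, doc, idf)
```

Because counts and idf weights are non-negative, the score lies in [0, 1], with 1 reached only when the weighted vectors are parallel.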
Comments:
This work shows the novelty of applying standard statistical techniques to questions like model fitting, model combination, and complexity control.
And it improves performance over LSA, with the added benefit of being able to detect synonyms as well as words that refer to the same topic.
But since their method relies on mixture decomposition, the EM algorithm takes a lot of computation time as the number and size of documents grow, which is an obstacle to scaling this method.







