Friday, April 27, 2012

[Paper Summary Lec_08-09] -1 Nonlinear dimensionality reduction by locally linear embedding

"Nonlinear dimensionality reduction by locally linear embedding," 
Roweis & Saul, Science, 2000.


This paper aims to find a compact representation of data on a low-dimensional space called a manifold (a topological space that is locally Euclidean).
Below is Figure 1, cropped from the paper, to give a basic understanding of the idea of a manifold. Coherent structure in the world leads to strong correlations between data points. As shown in the figure, the similarity of neighboring points on the surface can be understood through some "low-dimensional" description.



The LLE algorithm, summarized in the following figure, is based on simple geometric intuitions. The work characterizes the local structure of these patches by linear coefficients that reconstruct each data point from its neighbors, where the reconstruction cost function is

E(W) = Σ_i | Xi − Σ_j Wij Xj |²

The weights Wij summarize the contribution of the jth data point to the ith reconstruction. They can be used to preserve the geometric structure of the data after dimensionality reduction: each high-dimensional observation Xi is mapped to a low-dimensional vector Yi representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Yi to minimize the embedding cost function

Φ(Y) = Σ_i | Yi − Σ_j Wij Yj |²

Subject to constraints that make the problem well-posed, this cost can be minimized by solving a sparse N × N eigenvalue problem, whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.
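The three steps of the algorithm (find neighbors, solve for reconstruction weights, solve the eigenvalue problem) can be sketched in NumPy. This is an illustrative brute-force version with my own naming, not the authors' code:

```python
import numpy as np

def lle(X, n_neighbors=10, d=2, reg=1e-3):
    """Sketch of Locally Linear Embedding.

    X: (N, D) data matrix. Returns an (N, d) embedding.
    """
    N = X.shape[0]
    # Step 1: nearest neighbors by brute-force Euclidean distance.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :n_neighbors]

    # Step 2: weights Wij minimizing |Xi - sum_j Wij Xj|^2
    # subject to sum_j Wij = 1 (solved per point via the local Gram matrix).
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]                    # center neighbors on Xi
        C = Z @ Z.T                                   # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)  # regularization
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()              # enforce sum-to-one

    # Step 3: embedding cost is |(I - W) Y|^2; the bottom d nonzero
    # eigenvectors of M = (I - W)^T (I - W) give the coordinates Y.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)              # ascending eigenvalues
    return eigvecs[:, 1:d + 1]                        # skip the constant eigenvector
```

For real use the neighbor search would be done with a spatial index rather than the O(N²) distance matrix shown here.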


Finally, the projection results on some toy examples are shown; we can see that the neighborhood similarities have indeed been preserved.


Comments:

Manifold methods can preserve local spatial similarity that traditional methods such as PCA and LDA cannot capture.
The inference procedure proposed by this work is easy to implement, conceptually intuitive, and simple.
Also, LLE scales well with the intrinsic manifold dimensionality d and does not require a discretized gridding of the embedding space.
Finally, LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning, leaving plenty of potential for future development across many kinds of work.

Wednesday, April 25, 2012

[Paper Summary Lec_06] Latent Dirichlet allocation

"Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan
 Journal of Machine Learning Research, 2003



In the IR field, the historical evolution is:
TF-IDF -> LSA -> pLSA -> LDA, where pLSA is the model covered in the previous summary.
The following figures illustrate the evolution across these methods:


d means document; w means word; z means the latent variable (topic);
α and β will be described later.

pLSI does capture the possibility that a document may contain multiple topics since p(z|d) serves as the mixture weights of the topics for a particular document d.
However, it is important to note that d is a dummy index into the list of documents in the training set. Thus, d is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures p(z|d) only for those documents on which it is trained.

A further difficulty with pLSI, which also stems from the use of a distribution indexed by training documents, is that the number of parameters which must be estimated grows linearly with the number of training documents.

LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.

In LDA, the probability of each document is defined as a continuous mixture distribution:

p(w | α, β) = ∫ p(θ|α) ( Π_n p(wn | θ; β) ) dθ,   with p(wn | θ; β) = Σ_z p(z|θ) p(wn | z; β)

where the p(wn|θ; β) are the mixture components and p(θ|α) are the mixture weights.
The model is parameterized by the k-dimensional Dirichlet parameter α = (α1, …, αk) and a k × |V| matrix β, whose rows are the parameters of the k multinomial distributions over words.
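As a sketch, the generative process implied by this parameterization can be written as follows (function and variable names are my own, not from the paper):

```python
import numpy as np

def lda_generate(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative model.

    alpha: (k,) Dirichlet parameters.
    beta:  (k, V) topic-word probabilities (each row sums to 1).
    Returns a list of word indices of length n_words.
    """
    theta = rng.dirichlet(alpha)                   # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)        # draw a topic from theta
        w = rng.choice(beta.shape[1], p=beta[z])   # draw a word from topic z
        words.append(w)
    return words
```

Note that theta is drawn once per document, which is exactly what lets LDA assign topic mixtures to documents outside the training set, unlike pLSI's p(z|d).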



The figure above is an example density on unigram distributions p(w|θ ; β) under LDA for three words and four topics. Each of the vertices of the triangle corresponds to a deterministic distribution that assigns probability one to one of the words. The four points marked with an x are the locations of the multinomial distributions p(w|z) for each of the four topics, and the surface shown on top of the simplex is an example of a density over the (V − 1) simplex (multinomial distributions of words) given by LDA.


As in pLSA, the parameters are estimated with an EM-style algorithm (variational EM in this paper), but the formulations are lengthy, so I skip them here.

Comments:
LDA performs better than pLSA, as shown in the experimental part of this paper. It proposes adding a Dirichlet prior on the per-document topic distribution, which solves pLSA's overfitting problem as the size of the corpus increases.

Wednesday, April 11, 2012

[Paper Summary Lec_06] Probabilistic latent semantic indexing

"Probabilistic latent semantic indexing," T. Hofmann, SIGIR, 1999.


This paper proposes an improvement of LSA (Latent Semantic Analysis) that rests on a solid statistical foundation. LSA is an approach to automatic indexing and information retrieval that attempts to overcome the problems of exact lexical matching (such as synonymy and polysemy) by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the vector-space representation of documents based on term frequencies and applies Singular Value Decomposition (SVD) to the corresponding term/document matrix.

Model Formulation
Probabilistic Latent Semantic Analysis (PLSA) is based on the likelihood principle and defines a proper generative model of the data. The PLSA model, which has been called the aspect model, is defined by three kinds of variables: a latent variable z (the topic), an observed word w, and a document d.
The generative model is defined as follows:

P(d, w) = P(d) P(w|d),   P(w|d) = Σ_z P(w|z) P(z|d)   (1), (2)

And by using Bayes' rule we can re-parameterize (1) and (2) as

P(d, w) = Σ_z P(z) P(d|z) P(w|z)

Following the likelihood principle, one determines the parameters by maximizing the log-likelihood function

L = Σ_d Σ_w n(d, w) log P(d, w)

where n(d, w) is the count of word w in document d.
The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm.
For this aspect model (in the symmetric parameterization), the E-step computes the posterior over topics:

P(z|d, w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')

And the M-step re-estimates the parameters from the expected counts:

P(w|z) ∝ Σ_d n(d, w) P(z|d, w)
P(d|z) ∝ Σ_w n(d, w) P(z|d, w)
P(z) ∝ Σ_d Σ_w n(d, w) P(z|d, w)

To avoid overfitting, they modify the traditional EM algorithm by introducing a parameter β, changing the E-step to

P(z|d, w) ∝ P(z) [P(d|z) P(w|z)]^β

β = 1 recovers the standard E-step, while for β < 1 the likelihood part in Bayes' formula is discounted. This yields their so-called tempered EM (TEM) algorithm.
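The E- and M-steps with the tempering parameter β can be sketched in NumPy. This is an illustrative implementation under my own naming, using the symmetric parameterization P(z), P(d|z), P(w|z):

```python
import numpy as np

def plsa_tem(n_dw, k, n_iter=50, beta=1.0, seed=0):
    """Sketch of (tempered) EM for the pLSA aspect model.

    n_dw: (D, W) count matrix n(d, w); k: number of latent topics.
    beta=1 gives standard EM; beta<1 discounts the likelihood in
    the E-step (tempering). Returns P(z), P(d|z), P(w|z).
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z = np.full(k, 1.0 / k)
    p_d_z = rng.dirichlet(np.ones(D), size=k)        # (k, D)
    p_w_z = rng.dirichlet(np.ones(W), size=k)        # (k, W)

    for _ in range(n_iter):
        # E-step: P(z|d,w) ∝ P(z) [P(d|z) P(w|z)]^beta
        post = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        post /= post.sum(axis=0, keepdims=True)      # normalize over z

        # M-step: re-estimate from expected counts n(d,w) P(z|d,w)
        expected = post * n_dw[None, :, :]           # (k, D, W)
        p_w_z = expected.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = expected.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = expected.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z
```

In the paper, β is additionally annealed on held-out data; this sketch keeps it fixed for simplicity.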

The author also gives a geometric interpretation of the model to explain the idea, but I skip it here.

Indexing
One of the most popular families of information retrieval techniques is based on the Vector-Space Model (VSM) for documents. A VSM variant is characterized by three ingredients:
(i) a transformation function (also called local term weight)
(ii) a term weighting scheme (also called global term weight)
(iii) a similarity measure

Their matching function is:

where the first term of the numerator, defined as idf(w) · n(d, w), gives the weighted word frequencies.
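As a toy illustration of the three VSM ingredients above (not the paper's exact matching function), idf-weighted term frequencies combined with cosine similarity can be sketched as:

```python
import math
from collections import Counter

def tfidf_cosine(query, docs):
    """Score documents against a query with idf * tf weights and
    cosine similarity. query: list of terms; docs: list of term lists.
    """
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequencies
    idf = {t: math.log(N / c) for t, c in df.items()}    # global term weight

    def vec(terms):
        tf = Counter(terms)                              # local term weight
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    q = vec(query)
    scores = []
    for d in docs:
        v = vec(d)
        dot = sum(q.get(t, 0.0) * v[t] for t in v)
        norm = math.sqrt(sum(x * x for x in q.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        scores.append(dot / norm if norm else 0.0)       # similarity measure
    return scores
```

The paper's PLSI-based matching replaces these raw term frequencies with smoothed, model-based ones, but the overall (i)-(iii) structure is the same.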

Comments:
This work showed how to apply standard statistical techniques to questions such as model fitting, model combination, and complexity control.
It improves on LSA's performance and has the benefit of being able to detect synonyms as well as words that refer to the same topic.

However, since their method fits a mixture decomposition with the EM algorithm, it requires a lot of computation time as the number and size of documents grow, which is an obstacle for this method.