Saturday, September 15, 2012

Mobile Phone Programming Enthrallment Part 3

Traveling by public transit in an unfamiliar place, I stared at the bus stop signs in total confusion...


So I took out my pad and opened 台北等公車 (a Taipei bus-arrival app).


Which routes there are, and how many minutes until each bus arrives, are clear at a glance.
Trip planning becomes much easier!


Friday, June 8, 2012

[Paper Summary Lec_13] Learning to rank: from pairwise approach to listwise approach

Like the previous paper, this work addresses the ranking problem, but it refines the pairwise formulation into a listwise formulation.

This work points out four problems with the pairwise approach:

First, the objective of learning is formalized as minimizing errors in the classification of document pairs,
rather than minimizing errors in the ranking of documents. Second, the training process is computationally costly, since
the number of document pairs is very large. Third, the assumption that the document pairs are generated i.i.d. is too strong. Fourth, the number of generated document pairs varies widely from query to query, which results in a trained model biased toward queries with more document pairs.

The list-wise procedure can be summarized by the following picture,



where the q(i) are the queries, d(i) is the list of documents associated with query q(i), and the judgment y(i)j represents the relevance degree of document d(i)j to q(i).
The x(i) are the corresponding feature vectors and the z(i) are the lists of scores produced by the ranking function; the objective of learning is formalized as minimizing the total loss with respect to the training data.


To define the loss function, they first convert each ranking into probability scores
by computing the permutation probability of the score list.

To simplify the computation, they use the top-one probability instead.
With top-one probabilities, given two lists of scores, any metric between the two resulting probability distributions can serve as the listwise loss function.
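As a concrete illustration (a minimal sketch, not the paper's implementation), the top-one probability with the exponential form is just a softmax over the score list, and the cross entropy between two such distributions can serve as the listwise loss:

```python
import math

def top_one_prob(scores):
    """Top-one probability of each item: a softmax over the score list
    (the exponential form of the mapping function, as in ListNet)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(true_scores, model_scores):
    """Cross entropy between the two top-one probability distributions."""
    p = top_one_prob(true_scores)
    q = top_one_prob(model_scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is smallest when the model's score list induces the same top-one distribution as the ground truth, which is what the gradient method then minimizes.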


For further simplicity, they express the top-one probability in exponential form,

and finally they use a neural network as the ranking model and a gradient method to solve the resulting optimization problem;
the algorithm is summarized as follows:











[Paper Summary Lec_12] Support vector learning for ordinal regression

"Support vector learning for ordinal regression," R. Herbrich, ICANN, 1999.

This work aims to reduce the problem of ordinal regression (ranking) to a classification problem.
In ordinal regression, the loss function depends on the ordering of the data in the original space and in the mapped space, whereas in classification it is simply 0 or 1.

The goal is to find the mapping function that maps each data point to its rank. Given the samples, the loss is formulated as follows:

The way this work transforms the ranking problem into a classification problem can be shown as follows:

S is the original data space, and S' is the transformed data set that is fed into the SVM for the classification learning process.
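The S to S' transformation can be sketched as follows (a minimal hypothetical version: each strictly ordered pair of samples becomes a labeled difference vector, which an ordinary binary classifier such as a linear SVM can then separate):

```python
def pairwise_transform(X, y):
    """Turn ranked samples (S) into labeled difference vectors (S')
    for a standard binary classifier.
    X: list of feature vectors; y: rank labels (larger = higher rank)."""
    X_diff, z = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if y[i] > y[j]:  # keep only pairs with a strict rank order
                X_diff.append([a - b for a, b in zip(X[i], X[j])])
                z.append(+1)
                X_diff.append([b - a for a, b in zip(X[i], X[j])])
                z.append(-1)
    return X_diff, z
```

A linear classifier trained on S' yields a weight vector w whose projection w.x orders the original samples.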


They then derive the linear mapping function and solve for the weight vector w,
where zi is the label used for training.


And the result will be:


Sunday, May 6, 2012

[Paper Summary Lec_08-09] - 2 A Global Geometric Framework for Nonlinear Dimensionality Reduction

"A Global Geometric Framework for Nonlinear Dimensionality Reduction," Joshua B. Tenenbaum, Vin de Silva, John C. Langford, Science, 2000.



This work addresses the same dimensionality reduction task as the previous one
([Paper Summary Lec_08-09] -1 Nonlinear dimensionality reduction by locally linear embedding):
the goal is to uncover the underlying nonlinear structure of the data and retain its characteristics during the dimensionality reduction process. It targets tasks that the traditional PCA and MDS methods cannot solve well.

Their approach builds on classical MDS ( http://en.wikipedia.org/wiki/Multidimensional_scaling ) but seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points.


The toy example in this work is again the "swiss roll," as shown in the following figure.
A is the original data distribution, where the goal is to measure the distance along the blue line.
B shows the neighborhood structure between nearby data points.
C is the unrolled structure this work aims to find.

They formulate the problem as three steps:


The first step determines which points are neighbors on the manifold M, based on the distances dX(i,j) between pairs of points i,j in the input space X.


The second step, Isomap estimates the geodesic distances dM(i,j) between all pairs of points on the manifold M by computing their shortest path distances dG(i,j) in the graph G.

The final step applies classical MDS to the matrix of graph distances DG = {dG(i,j)}, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold's estimated intrinsic geometry (Fig. 3C) by minimizing this cost function,
where DY denotes the matrix of Euclidean distances and ||A||L2 is the L2 matrix norm.
The τ operator converts distances to inner products.
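The three steps can be sketched in a few lines (a simplified toy version that assumes a small dataset with a connected neighborhood graph; the function and parameter names are mine, not the paper's):

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    """Toy sketch of the three Isomap steps: kNN graph, geodesic
    distances via shortest paths, and classical MDS."""
    n = len(X)
    # Step 1: pairwise Euclidean distances and the k-nearest-neighbor graph
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D[i, j]
    # Step 2: geodesic distances = shortest paths in the graph (Floyd-Warshall)
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Step 3: classical MDS on the geodesic distance matrix;
    # the tau operator converts squared distances to inner products
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * H @ (G ** 2) @ H
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]                # top-d eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

The Floyd-Warshall step is O(n^3), which is fine for a toy example but not for large datasets, where Dijkstra's algorithm over the sparse graph is the usual choice.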


Here is an overview of the algorithm:


And below are some experimental results on several classes of images, with interpolations along straight lines in the Isomap coordinate space (analogous to the blue line in Fig. 3C):





Friday, April 27, 2012

[Paper Summary Lec_08-09] -1 Nonlinear dimensionality reduction by locally linear embedding

"Nonlinear dimensionality reduction by locally linear embedding," 
Roweis & Saul, Science, 2000.


This paper aims to find a compact representation of data on a low-dimensional space called a manifold (a locally Euclidean topological space).
Below is Figure 1, cropped from the paper, to give a basic sense of the idea of a manifold. Coherent structure in the world leads to strong correlations between data. As shown in the figure, the similarity of neighboring points on the surface can be understood through some "low-dimensional" description.



The LLE algorithm, which is based on simple geometric intuitions, can be summarized by the following figure. The work characterizes the local structure of these patches by linear coefficients that reconstruct each data point from its neighbors,
with the cost function:

The weights Wij summarize the contribution of the jth data point to the ith reconstruction. They preserve the geometric structure of the data after dimensionality reduction: each high-dimensional observation Xi is mapped to a low-dimensional vector Yi representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Yi to minimize the embedding cost function:

Subject to constraints that make the problem well-posed, this can be minimized by solving a sparse N × N eigenvalue problem, whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.
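The two stages, solving for the reconstruction weights W and then for the embedding Y, can be sketched as follows (a toy dense version; the small regularizer reg is something I add for numerical stability, and the names are mine):

```python
import numpy as np

def lle(X, n_neighbors=5, d=2, reg=1e-3):
    """Toy sketch of LLE: per-point reconstruction weights, then an
    embedding from the bottom eigenvectors of (I - W)^T (I - W)."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                           # neighbors shifted to the origin
        C = Z @ Z.T                                  # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nbrs))   # regularize for stability
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                     # weights sum to one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]       # skip the bottom (constant) eigenvector
```

Note the paper solves a sparse eigenproblem; the dense `eigh` here is only for clarity on small examples.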


Finally, the projection results on some toy examples are shown; we can see that the similarities are indeed preserved.


Comments:

Manifold methods can preserve the spatial similarity information that traditional PCA and LDA methods cannot capture.
The inference procedure proposed by this work is conceptually intuitive, simple, and easy to implement.
Also, LLE scales well with the intrinsic manifold dimensionality d and does not require a discretized gridding of the embedding space.
Finally, LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning, leaving plenty of potential for future development across many kinds of work.

Wednesday, April 25, 2012

[Paper Summary Lec_06] Latent Dirichlet allocation

"Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan, Journal of Machine Learning Research, 2003.



In the IR field, the historical evolution is:
TF-IDF -> LSA -> pLSA -> LDA, where pLSA is the model discussed in the previous summary.
The following figure illustrates the evolution of these methods:


d denotes a document; w a word; z the latent variable (topic);
α and β will be described later.

pLSI does capture the possibility that a document may contain multiple topics since p(z|d) serves as the mixture weights of the topics for a particular document d.
However, it is important to note that d is a dummy index into the list of documents in the training set. Thus, d is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures p(z|d) only for those documents on which it is trained.

A further difficulty with pLSI, which also stems from the use of a distribution indexed by training documents, is that the number of parameters which must be estimated grows linearly with the number of training documents.

LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.

In LDA, the probability of each document is defined as a continuous mixture distribution:

where the p(wn|θ; β) are the mixture components and p(θ|α) are the mixture weights.
The model is parameterized by the k-dimensional Dirichlet parameter α = (α1, ..., αk) and a k × |V| matrix β, whose rows are the parameters of the k multinomial distributions over words.
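The generative process this parameterization describes can be sketched by sampling a toy document (the function name and arguments here are mine, not the paper's):

```python
import numpy as np

def generate_document(alpha, beta, vocab, doc_len, rng=None):
    """Sample one document from the LDA generative model:
    theta ~ Dirichlet(alpha); for each word, z ~ Mult(theta), w ~ Mult(beta[z]).
    alpha: length-k Dirichlet parameter; beta: k x |V| topic-word matrix."""
    rng = rng or np.random.default_rng(0)
    theta = rng.dirichlet(alpha)                # per-document topic mixture
    words = []
    for _ in range(doc_len):
        z = rng.choice(len(alpha), p=theta)     # pick a topic
        w = rng.choice(len(vocab), p=beta[z])   # pick a word from that topic
        words.append(vocab[w])
    return words
```

Because θ is drawn fresh per document, each document gets its own topic mixture without adding a parameter per training document, which is exactly how LDA avoids pLSI's linear parameter growth.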



The figure above is an example density on unigram distributions p(w|θ ; β) under LDA for three words and four topics. Each of the vertices of the triangle corresponds to a deterministic distribution that assigns probability one to one of the words. The four points marked with an x are the locations of the multinomial distributions p(w|z) for each of the four topics, and the surface shown on top of the simplex is an example of a density over the (V − 1) simplex (multinomial distributions of words) given by LDA.


As in pLSA, the parameters are inferred with an EM-style algorithm (here, variational EM), but the formulations are lengthy and are skipped here.

Comments:
LDA outperforms pLSA, as shown in the experimental section of this paper. It proposes adding a Dirichlet prior on the per-document topic distribution, which solves pLSA's problem of overfitting as the size of the corpus increases.

Wednesday, April 11, 2012

[Paper Summary Lec_06] Probabilistic latent semantic indexing

"Probabilistic latent semantic indexing," T. Hofmann, SIGIR, 1999.


This paper proposes an improvement to LSA (Latent Semantic Analysis) built on a solid statistical foundation. LSA is an approach to automatic indexing and information retrieval that attempts to overcome the problems of exact term matching by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the vector space representation of documents based on term frequencies and applies Singular Value Decomposition (SVD) to the corresponding term/document matrix.

Model Formulation
Probabilistic Latent Semantic Analysis (PLSA) is based on the likelihood principle and defines a proper generative model of the data. The model of PLSA, which has been called the aspect model, is defined by three kinds of variables: the latent variable z, the word occurrence w, and the document d.
The generative model is defined as follows:


And by using Bayes' rule we can re-parameterize (1) and (2) as:

Following the likelihood principle, one determines P(d), P(z|d), and P(w|z) by maximization of the log-likelihood function

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm.
For this aspect model, the E-step is:

And the M step:


To avoid overfitting, they modify the traditional EM algorithm by introducing a parameter β, changing the E-step according to:

β = 1 gives the standard E-step, while for β < 1 the likelihood part in Bayes' formula is discounted. This is their so-called tempered EM algorithm (TEM).
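Both the standard and the tempered E-step can be sketched in a small EM loop (a toy dense implementation; I assume β is applied to the word-likelihood term, and the variable names are mine, so consult the paper's Eq. for the exact placement):

```python
import numpy as np

def plsa_em(counts, k, n_iter=50, beta_temp=1.0, rng=None):
    """Toy sketch of (tempered) EM for the pLSA aspect model.
    counts: n_docs x n_words matrix of term frequencies.
    beta_temp < 1 discounts the likelihood part in the E-step (TEM)."""
    rng = rng or np.random.default_rng(0)
    n_d, n_w = counts.shape
    p_z_d = rng.dirichlet(np.ones(k), size=n_d)   # P(z|d), one row per document
    p_w_z = rng.dirichlet(np.ones(n_w), size=k)   # P(w|z), one row per topic
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)^beta_temp
        post = p_z_d[:, :, None] * (p_w_z[None, :, :] ** beta_temp)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        exp_counts = counts[:, None, :] * post    # n_d x k x n_w
        p_w_z = exp_counts.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = exp_counts.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

The dense n_d x k x n_w posterior array is only practical for toy corpora; real implementations iterate over the sparse nonzero counts.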

The authors also use the geometry of the model to explain their idea, but I skip it here.

Indexing
One of the most popular families of information retrieval techniques is based on the Vector-Space Model (VSM) for documents. A VSM variant is characterized by three ingredients:
(i) a transformation function (also called local term weight)
(ii) a term weighting scheme (also called global term weight)
(iii) a similarity measure

Their matching function is:

where the first term of the numerator, defined as idf(w)·n(d,w), gives the weighted word frequencies.
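A minimal sketch of such a matching score, assuming a plain cosine over idf(w)·n(d,w) weighted frequencies (the exact normalization in the paper may differ, and these function names are mine):

```python
import math

def idf(word, docs):
    """Inverse document frequency over a collection of token lists."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def match_score(query, doc, docs):
    """Cosine-style match between a query and a document, with document
    terms weighted by idf(w) * n(d, w)."""
    num = sum(idf(w, docs) * doc.count(w) * query.count(w) for w in set(query))
    d_norm = math.sqrt(sum((idf(w, docs) * doc.count(w)) ** 2 for w in set(doc)))
    q_norm = math.sqrt(sum(query.count(w) ** 2 for w in set(query)))
    return num / (d_norm * q_norm) if d_norm and q_norm else 0.0
```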

Comments:
This work shows the novelty of applying standard statistical techniques to questions like model fitting, model combination, and complexity control.
It improves on LSA's performance, with the added benefit of being able to detect synonyms as well as words that refer to the same topic.

However, since their method uses mixture decomposition, the EM algorithm takes a great deal of computation time as the number and size of documents grow, which is an obstacle for this method.