2012年9月15日 星期六

Mobile Phone Programming Enthrallment Part 3

Riding public transit somewhere unfamiliar, it is easy to get confused squinting at the bus-stop signs...


So I take out my pad and open 台北等公車 (a Taipei bus-arrival app).


Which routes stop here, and how many minutes until each bus arrives: everything is clear at a glance,
and planning the trip becomes much easier!

Just take out the app!

2012年6月8日 星期五

[Paper Summary Lec_13] Learning to rank: from pairwise approach to listwise approach

"Learning to rank: from pairwise approach to listwise approach," Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, ICML, 2007.

Like the previous one, this work aims to solve the ranking problem, but it refines the pairwise formulation into a listwise one.

This work points out four problems with the previous one:

First, the objective of learning is formalized as minimizing errors in the classification of document pairs,
rather than minimizing errors in the ranking of documents. Second, the training process is computationally costly, as
the number of document pairs is very large. Third, the assumption that the document pairs are generated i.i.d. is also too strong. Fourth, the number of generated document pairs varies largely from query to query, which will result in training a model biased toward queries with more document pairs.

The list-wise procedure can be summarized by the following picture,



where the q(i) are the queries, d(i) is the list of documents associated with query q(i), and the judgment y(i)j represents the relevance degree of document d(i)j to q(i).
The x(i) are the corresponding feature vectors and the z(i) are the lists of scores obtained; the objective of learning is formalized as minimization of the total losses with respect to the training data.


To measure the loss, they first convert a ranking into probability scores
by calculating the permutation probabilities of the list.

To simplify computation, however, they use the top-one probability instead.
With top-one probabilities, given two lists of scores, any metric between the two distributions can serve as the distance, i.e., the listwise loss function.


For simplicity again, they express the top-one probability in exponential (softmax) form.

And finally they use a neural network and a gradient method to solve the resulting optimization problem;
the algorithm is summarized as follows:
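As a concrete sketch of the two ingredients above (my own toy code and function names, not the authors' implementation): the top-one probability in exponential form is just a softmax over the score list, and the listwise loss between a ground-truth and a predicted score list can be taken as the cross entropy of their top-one distributions.

```python
import numpy as np

def top_one_prob(scores):
    """Top-one probability in exponential form: a softmax over the score list."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

def listwise_loss(true_scores, pred_scores):
    """Listwise loss: cross entropy between the two top-one distributions."""
    p = top_one_prob(np.asarray(true_scores, dtype=float))
    q = top_one_prob(np.asarray(pred_scores, dtype=float))
    return -np.sum(p * np.log(q))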











[Paper Summary Lec_12] Support vector learning for ordinal regression

"Support vector learning for ordinal regression,
" R. Herbrich, ICANN, 1999.

This work aims to reduce the problem of ordinal regression (ranking) to a classification problem.
In ordinal regression, the loss function is defined by the ordering of the data in the original space and in the mapped space, while in classification it is just 0 or 1.

The goal is to find a mapping function that maps each data point to its rank.

Given samples,


the loss is formulated as follows:

The way this work transforms the ranking problem into a classification problem can be shown as follows:

S is the original data space, and S' is the transformed data that is fed into the SVM for the classification learning process.
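A minimal sketch of that transformation (my own illustration, not the paper's code): every pair of differently ranked samples becomes a difference vector labeled by the sign of the rank difference, which a linear binary classifier such as an SVM can then consume.

```python
import numpy as np

def pairwise_transform(X, y):
    """Map ranked samples (x_i, y_i) to difference vectors x_i - x_j,
    labeled z = +1 if y_i > y_j and z = -1 if y_i < y_j."""
    X_diff, z = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] != y[j]:
                X_diff.append(X[i] - X[j])
                z.append(1 if y[i] > y[j] else -1)
    return np.array(X_diff), np.array(z)
```

A linear SVM trained on (X_diff, z) yields a weight vector w, and projecting the original samples onto w recovers their ordering.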


To derive the linear mapping function and solve for w, an SVM optimization problem is set up over the transformed pairs, where zi is the label used for training.


And the result will be:


2012年5月6日 星期日

[Paper Summary Lec_08-09] - 2 A Global Geometric Framework for Nonlinear Dimensionality Reduction

"A Global Geometric Framework for Nonlinear Dimensionality Reduction," 

Joshua B. Tenenbaum, Vin de Silva, John C. Langford, 2000. 



This work aims to solve the dimension reduction task, as the previous one did
([Paper Summary Lec_08-09] -1 Nonlinear dimensionality reduction by locally linear embedding);
the goal is to uncover the underlying nonlinear structure of the data and retain its characteristics while performing the dimension reduction. It aims to solve those tasks that traditional PCA and MDS methods cannot solve well.

Their approach builds on classical MDS ( http://en.wikipedia.org/wiki/Multidimensional_scaling ) but seeks to preserve the intrinsic geometry of the data, as captured in the geodesic manifold distances between all pairs of data points.


The toy example of this work is also the "swiss roll", as in the previous one, shown in the following figure.
A shows the original data distribution; the goal is to measure distance along the manifold, as the blue line shows.
B shows the neighborhood structure between nearby data points.
C shows the unrolled structure this work aims to find.

They formulate the problem in three steps:


The first step determines which points are neighbors on the manifold M, based on the distances dX(i,j) between pairs of points i,j in the input space X.


The second step, Isomap estimates the geodesic distances dM(i,j) between all pairs of points on the manifold M by computing their shortest path distances dG(i,j) in the graph G.

The final step applies classical MDS to the matrix of graph distances DG={dG(i,j)}, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold’s estimated intrinsic geometry (Fig. 3C) by minimizing this cost function
where DY denotes the matrix of Euclidean distances, ||A||L2 is the L2 matrix norm,
and the τ operator converts distances to inner products.
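The three steps can also be sketched end to end in a few lines. This is a naive toy implementation of the idea (dense distance matrices, Floyd-Warshall shortest paths), not the paper's code:

```python
import numpy as np

def isomap(X, n_neighbors=5, d=2):
    """Minimal Isomap: kNN graph -> shortest paths -> classical MDS."""
    n = len(X)
    # Step 1: input-space distances and a symmetric kNN graph
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D[i, j]
    # Step 2: geodesic distances via Floyd-Warshall shortest paths
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Step 3: classical MDS on the graph distances (tau operator: centering)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J          # converts distances to inner products
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d]        # top d eigenpairs give the embedding
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

On a line of evenly spaced points the recovered 1-D coordinates reproduce the original spacing, since the graph distances equal the true geodesics.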


Here is an overview of the algorithm:


And below are some experimental results on several classes of images, showing interpolations along straight lines in the Isomap coordinate space (analogous to the blue line in Fig. 3C).





2012年4月27日 星期五

[Paper Summary Lec_08-09] -1 Nonlinear dimensionality reduction by locally linear embedding

"Nonlinear dimensionality reduction by locally linear embedding," 
Roweis & Saul, Science, 2000.


This paper aims to find a compact representation of data on a low-dimensional manifold (a topological space that is locally Euclidean).
Below is Figure 1, cropped from the paper to give a basic understanding of the idea of a manifold. Coherent structure in the world leads to strong correlations between data. As shown in the figure, the similarity of neighboring points on the surface can be understood through some "low dimensional" description.



The LLE algorithm can be summarized in the following figure, which is based on simple geometric intuitions. The work characterizes the local structure of these patches by linear coefficients that reconstruct each data point from its neighbors.
Where the cost function

The weights Wij summarize the contribution of the jth data point to the ith reconstruction. They can be used to keep the geometric structure of the data after dimension reduction: each high-dimensional observation Xi is mapped to a low-dimensional vector Yi representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Yi to minimize the embedding cost function

Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse N × N eigenvalue problem, whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.
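A toy sketch of the two stages, reconstruction weights followed by the bottom eigenvectors of (I - W)ᵀ(I - W); the regularization constant and neighbor count are my own choices, not the paper's:

```python
import numpy as np

def lle(X, n_neighbors=5, d=2, reg=1e-3):
    """Minimal LLE: local reconstruction weights, then bottom eigenvectors."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        Z = X[nbrs] - X[i]                        # neighbors centered on x_i
        C = Z @ Z.T                               # local Gram matrix
        C += reg * np.trace(C) * np.eye(len(nbrs))  # regularize if singular
        w = np.linalg.solve(C, np.ones(len(nbrs)))
        W[i, nbrs] = w / w.sum()                  # weights sum to one
    # Embedding cost is quadratic in Y: solve sparse eigenproblem on M
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                       # skip the constant eigenvector
```

The bottom eigenvector of M is the constant vector (eigenvalue zero), so it is discarded and the next d eigenvectors give the embedding coordinates.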


Finally, projection results on some toy examples are shown; we can see that the similarities have indeed been preserved.


Comments:

Manifold methods can preserve the spatial-similarity information that traditional PCA and LDA methods cannot capture.
The method proposed in this work is easy to implement, conceptually intuitive, and simple.
Also, LLE scales well with the intrinsic manifold dimensionality, d, and does not require a discretized gridding of the embedding space.
Finally, LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning, suggesting a lot of potential for future development across many kinds of applications.

2012年4月25日 星期三

[Paper Summary Lec_06] Latent Dirichlet allocation

"Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan
 Journal of Machine Learning Research, 2003



In the IR field, the historical evolution is:
TF-IDF -> LSA -> pLSA -> LDA, where pLSA is the one covered in the previous summary.
And we can use the following figures to see the evolution of these methods:


d means document; w means word; z means the latent variable (topic);
α and β will be described later.

pLSI does capture the possibility that a document may contain multiple topics since p(z|d) serves as the mixture weights of the topics for a particular document d.
However, it is important to note that d is a dummy index into the list of documents in the training set. Thus, d is a multinomial random variable with as many possible values as there are training documents and the model learns the topic mixtures p(z|d) only for those documents on which it is trained.

A further difficulty with pLSI, which also stems from the use of a distribution indexed by training documents, is that the number of parameters which must be estimated grows linearly with the number of training documents.

LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set.

In LDA, the probability of each document is defined as a continuous mixture distribution:

where p(wn| θ ; β) are the mixture components and p(θ|α) are the mixture weights.
And the model is parameterized by the k-dimensional Dirichlet parameter α = (α1, ..., αk) and a k × |V| matrix β, whose rows are the parameters of the k multinomial distributions over words.
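The generative process this parameterization describes can be sketched directly (a toy sampler with a function name of my own choosing): draw θ from Dirichlet(α) once per document, then for each token draw a topic z from Multinomial(θ) and a word w from the topic's row of β.

```python
import numpy as np

def lda_generate(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.
    alpha: Dirichlet parameters, shape (k,); beta: k x |V| topic-word matrix."""
    theta = rng.dirichlet(alpha)                  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # sample a topic
        w = rng.choice(beta.shape[1], p=beta[z])  # sample a word from that topic
        words.append(w)
    return theta, words
```

Because θ is redrawn per document rather than indexed by a training-set identifier, the model generalizes to unseen documents, which is exactly the advantage over pLSI described above.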



The figure above is an example density on unigram distributions p(w|θ ; β) under LDA for three words and four topics. Each of the vertices of the triangle corresponds to a deterministic distribution that assigns probability one to one of the words. The four points marked with an x are the locations of the multinomial distributions p(w|z) for each of the four topics, and the surface shown on top of the simplex is an example of a density over the (V − 1) simplex (multinomial distributions of words) given by LDA.


As in pLSA, the parameters are inferred with an EM-style algorithm, but the formulations are lengthy and we skip them here.

Comments:
LDA has better performance than pLSA, as shown in the experimental part of this paper. It proposes adding a Dirichlet prior on the per-document topic distribution, and it solves the overfitting problem that pLSA suffers as the size of the corpus increases.

2012年4月11日 星期三

[Paper Summary Lec_06] Probabilistic latent semantic indexing

"Probabilistic latent semantic indexing," T. Hofmann, SIGIR, 1999.


This paper proposes an improvement to LSA (Latent Semantic Analysis) by giving it a solid statistical foundation. LSA is an approach to automatic indexing and information retrieval that attempts to overcome the problems of term matching by mapping documents as well as terms to a representation in the so-called latent semantic space. LSA usually takes the vector-space representation of documents based on term frequencies and applies Singular Value Decomposition (SVD) to the corresponding term/document matrix.

Model Formulation
Probabilistic Latent Semantic Analysis (PLSA) is based on the likelihood principle and defines a proper generative model of the data. The model of PLSA, which has been called the aspect model, is defined over three kinds of variables: the latent topic variable z, the word occurrence variable w, and the document d.
The generative model is defined as follows:


And by using Bayes' rule we can re-parameterize (1) and (2) as

Following the likelihood principle, one determines P(d), P(z|d), and P(w|z) by maximization of the log-likelihood function

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm.
For this aspect model, the E-step is:

And the M step:


To avoid overfitting, they modify the traditional EM algorithm by introducing a parameter β, changing the E-step according to

β = 1 results in the standard E-step, while for β < 1 the likelihood part in Bayes' formula is discounted. This becomes their so-called tempered EM algorithm (TEM).
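For concreteness, here is a minimal sketch of the standard (β = 1) EM iteration for the aspect model on a document-word count matrix, using the P(z|d), P(w|z) parameterization; this is my own toy code, not Hofmann's implementation:

```python
import numpy as np

def plsa_em(N, k, n_iter=50, seed=0):
    """Minimal pLSA EM on a D x W term-document count matrix N."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, k)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((k, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) proportional to P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]        # D x k x W
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        nz = N[:, None, :] * post                           # D x k x W
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d, p_w_z
```

Tempering would simply raise the likelihood factor in the E-step to the power β before normalizing, discounting it for β < 1.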

The author also used the geometry of the model to explain the idea, but I skip that here.

Indexing
One of the most popular families of information retrieval techniques is based on the Vector Space Model (VSM) for documents. A VSM variant is characterized by three ingredients:
(i) a transformation function (also called local term weight)
(ii) a term weighting scheme (also called global term weight)
(iii) a similarity measure

Their matching function is:

where the first term of the numerator, defined as idf(w) * n(d,w), gives the weighted word frequencies.

Comments:
This work showed the novelty of applying standard statistical techniques to questions like model fitting, model combination, and complexity control.
It improves the performance over LSA and has the benefit of being able to detect synonyms as well as words that refer to the same topic.

But since their method uses mixture decomposition, the EM algorithm takes a lot of computation time as the number and size of documents grow, which is an obstacle for this method.

2012年3月28日 星期三

[Paper Summary Lec_05] Semi-Supervised Hashing for Scalable Image Retrieval


J. Wang et al, "Semi-Supervised Hashing for Scalable Image Retrieval," CVPR, 2010

Hashing methods are used to deal with problems like search in large-scale data, similar-content retrieval, etc.

Hashing methods can be divided into two main categories: unsupervised methods and supervised methods. While unsupervised methods do not require any labeled data and their parameters are easy to learn, in some vision problems similarity cannot be defined by a simple metric. In other words, semantic similarity may not be preserved by the image descriptor under the similarity metric. These problems can be addressed with supervised hashing methods, but supervised learning processes of any kind tend to be much slower than unsupervised ones.

This paper proposed a novel idea for semi-supervised hash learning.
The idea is to label a fraction of the given data pairs into two categories, M (neighbor pairs) and C (non-neighbor pairs).
The goal of SSH is to learn hash functions that minimize the error on the labeled training data, while maximally satisfying the desirable properties of hashing, e.g., independence of bits and balanced partitioning.

Basic Formulation

The formulation is as follows: given a vector wk ∈ R^D, the kth hash function is defined as hk(x) = sgn(wk·x + bk),


where bk = 0, since the data X can be normalized to have zero mean.

One can get the corresponding binary bit as,

They define the following objective function to measure the empirical accuracy on the labeled data for a family of hash functions H:


By defining a matrix S ∈ R^(L×L), where L is the number of labeled data points,
the above formulation can be expressed in a compact form:

And with the idea that each hash function should be able to separate the dataset into two balanced parts, the objective function is subject to additional constraints.
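The basic encoding step can be sketched as follows (a toy illustration with my own function names, ignoring the learning of W, which the later sections derive): project onto the learned directions, take signs, and compare codes by Hamming distance.

```python
import numpy as np

def hash_bits(X, W):
    """k-bit codes: h_k(x) = sgn(w_k . x), mapped to {0, 1} bits.
    Data is centered so the offsets b_k are all zero."""
    X = X - X.mean(axis=0)             # zero-mean normalization
    return (X @ W > 0).astype(int)     # sign of each projection as a bit

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))
```

At query time, nearest neighbors are retrieved by ranking database codes by Hamming distance from the query's code.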



Relaxing Objective Function

The concept of relaxing the objective function is quite intuitive: just replace the "same sign" criterion with signed magnitude. That is, similar points are desired not only to have the same sign but also large projection magnitudes, while dissimilar points are projected not only with different signs but also as far apart as possible.
And the new objective function can be rewritten as a function of W, 


And since it’s hard to ensure perfectly balanced partitions, they replace the hard balancing constraint by a “soft” constraint maximizing the variance of the bits.

And with some derivation they obtain a more compact form, since the data are pre-normalized to zero mean. The overall objective function then becomes


To summarize, the first (supervised) term is the empirical accuracy of the learned hash functions on the pairwise labeled data, while the second (unsupervised) term is a regularizer that prefers directions maximizing the variance of the projections, subject to orthogonality constraints.

Relaxing Orthogonality Constraints

The orthogonality constraints force one to progressively pick directions that have very low variance, substantially reducing the quality of the lower bits. Depending on the application, it may make sense to pick a direction that is not necessarily orthogonal to the previous ones but has higher variance as well as low empirical error on the labeled set.
So instead of imposing hard orthogonality constraints, they convert them into a penalty term added to the objective function, which finally becomes

This can be solved using properties of matrix operations; I'll skip those details here.

Comments:
This work proposed a novel method that reaches a given accuracy with small Hamming distances over few bits. Compared with state-of-the-art methods like LSH (Locality Sensitive Hashing), SH (Spectral Hashing), and RBMs (Restricted Boltzmann Machines), SSH achieves better performance with fewer bits, and also less training time, especially compared with RBMs.
As data scale becomes extremely large in the future, methods like this will become very influential, and the formulation process may be reused in other similar works.

Experimental results on the one-million GIST dataset, from the paper.


2012年3月21日 星期三

[Paper Summary Lec_04] Online Dictionary Learning For Sparse Coding

Online dictionary learning for sparse coding.
Mairal et al.  ICML 2009. 

Sparse coding models data vectors as sparse linear combinations of basis elements.
The linear decomposition of a signal using a few atoms of a learned dictionary, instead of a predefined one (based on wavelets, for example), has recently led to state-of-the-art results for numerous low-level image processing tasks.
In this work, they show that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms, and without the need for explicit learning-rate tuning.


Dictionary learning can be considered as optimizing the empirical cost function


The loss function l is the distance between the original signal and its dictionary representation, plus a regularization term (to preserve sparsity).


Online Dictionary Learning


The whole procedure is shown in the figures of Algorithm 1 and Algorithm 2.



We can see that the procedure iteratively updates the two variables, fixing one while optimizing the other, so that each subproblem is a convex optimization problem.
(Alpha is the vector of linear-combination coefficients, and D is the dictionary matrix.)
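The two alternating subproblems can be sketched as follows. This is my own simplified code: the sparse-coding step uses plain ISTA rather than the LARS solver the paper uses, while the dictionary update follows the block-coordinate form of the paper's Algorithm 2, with A and B the accumulated sufficient statistics.

```python
import numpy as np

def sparse_code(D, x, lam, n_iter=100):
    """ISTA for the lasso step: min_a 0.5 * ||x - D a||^2 + lam * ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)              # gradient of the smooth part
        a = a - g / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

def dict_update(D, A, B):
    """Block-coordinate dictionary update given sufficient statistics
    A = sum of a a^T and B = sum of x a^T, column by column."""
    for j in range(D.shape[1]):
        if A[j, j] > 0:
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project to unit ball
    return D
```

In the online setting, A and B are updated cheaply after each new signal, so the dictionary step never needs to revisit past data, which is the source of the low memory consumption mentioned above.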


For further improvement of the algorithm, the authors mention some points to focus on:
Handling Fixed-Size Datasets
- When dealing with large training sets, it is impossible to store all the past coefficients.
Mini-Batch Extension
- Process several signals at each iteration instead of a single one.

Purging the Dictionary of Unused Atoms
- Dictionary atoms that are never used should be dropped.

The authors also gave a rigorous proof in their convergence analysis, but I skip that part in my summary.

Comments:
The contributions of this work:
It casts the dictionary learning problem as the optimization of a nonconvex objective function over convex sets.
The algorithm is significantly faster than previous approaches to dictionary learning.
As the resolution of images and videos grows extremely large in the future, this kind of method will become quite important for handling such large-scale signal data efficiently.