Wednesday, March 28, 2012

[Paper Summary Lec_05] Semi-Supervised Hashing for Scalable Image Retrieval


J. Wang et al., "Semi-Supervised Hashing for Scalable Image Retrieval," CVPR 2010

Hashing methods are used to tackle problems such as search in large-scale data, similar-content retrieval, etc.

Hashing methods can be divided into two main categories: unsupervised methods and supervised methods. While unsupervised methods do not require any labeled data and their parameters are easy to learn, in some vision problems similarity cannot be defined by a simple metric. In other words, semantic similarity may not be preserved by the image descriptor under that similarity metric. These problems can be addressed by supervised hashing methods, but supervised learning processes tend to be much slower than unsupervised ones.

This paper proposes a novel idea for semi-supervised hash learning (SSH).
The idea is to label a fraction of the data pairs as belonging to one of two categories: M (neighbor pairs) and C (non-neighbor pairs).
The goal of SSH is to learn hash functions that minimize the error on the labeled training data, while maximally satisfying the desirable properties of hashing, e.g., independence of bits and balanced partitioning.

Basic Formulation

The formulation is as follows: given a vector wk ∈ R^D, the kth hash function is defined as

h_k(x) = sgn(w_k^T x + b_k)

where b_k = 0 since the data X can be normalized to have zero mean.

One can get the corresponding binary bit as y_k(x) = (1 + h_k(x)) / 2 ∈ {0, 1}.
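As a minimal sketch of this step (my own illustration, not the authors' code; the projection matrix W and the data X below are random placeholders):

```python
import numpy as np

def hash_bits(X, W):
    """Map D-dim points (rows of X) to K binary bits: y_k = (1 + sgn(w_k^T x)) / 2.

    Assumes X has been normalized to zero mean, so the bias b_k = 0.
    """
    H = np.sign(X @ W)                 # signs in {-1, 0, +1}
    H[H == 0] = 1                      # treat sgn(0) as +1
    return ((1 + H) / 2).astype(int)   # bits in {0, 1}

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
X -= X.mean(axis=0)                    # zero-mean normalization
W = rng.standard_normal((16, 8))
print(hash_bits(X, W).shape)           # (5, 8): 5 points, 8 bits each
```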

They define the following objective function to measure the empirical accuracy on the labeled data for a family of hash functions H:

J(H) = Σ_k { Σ_{(x_i,x_j)∈M} h_k(x_i) h_k(x_j) − Σ_{(x_i,x_j)∈C} h_k(x_i) h_k(x_j) }

By defining a matrix S ∈ R^{L×L} (L is the number of labeled data points) with S_ij = 1 for pairs in M, S_ij = −1 for pairs in C, and S_ij = 0 otherwise,
the above formulation can be expressed in a compact form:

J(H) = (1/2) tr{ sgn(W^T X_l) S sgn(W^T X_l)^T }

And with the idea that each hash function should be able to separate the dataset into two balanced parts, the objective function is maximized subject to an additional balance constraint, Σ_i h_k(x_i) = 0.
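The empirical-accuracy term can be sketched in a few lines (a toy illustration with a hypothetical 4-point labeled set; `empirical_accuracy` is my own name, not the paper's):

```python
import numpy as np

def empirical_accuracy(Xl, W, S):
    """J(H) = 1/2 * tr( sgn(W^T Xl) S sgn(W^T Xl)^T ).

    Xl: D x L labeled points (columns). S: L x L with S_ij = +1 for
    neighbor pairs (M), -1 for non-neighbor pairs (C), 0 otherwise.
    """
    H = np.sign(W.T @ Xl)
    return 0.5 * np.trace(H @ S @ H.T)

# Toy labeled set: pair (0,1) is similar, pair (2,3) is dissimilar.
rng = np.random.default_rng(1)
Xl = rng.standard_normal((8, 4))
W = rng.standard_normal((8, 3))
S = np.zeros((4, 4))
S[0, 1] = S[1, 0] = 1.0
S[2, 3] = S[3, 2] = -1.0
print(empirical_accuracy(Xl, W, S))    # higher is better
```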



Relaxing Objective Function

The concept of relaxing the objective function is quite intuitive: just replace the "same sign" criterion with signed magnitude. This not only encourages similar points to have the same sign but also large projection magnitudes, while projecting dissimilar points not only with different signs but also as far apart as possible.
The new objective function can then be rewritten as a function of W:

J(W) = (1/2) tr{ W^T X_l S X_l^T W }


And since it's hard to ensure perfectly balanced partitions, they replace the hard balancing constraint with a "soft" constraint that maximizes the variance of the bits.

With some derivation, they arrive at a more compact form, since the data are pre-normalized to zero mean. The overall objective function becomes

J(W) = (1/2) tr{ W^T M W }, with M = X_l S X_l^T + η X X^T, subject to W^T W = I

where η weights the variance (regularization) term.


To summarize, the first (supervised) term is the empirical accuracy of the learned hash functions on the pairwise labeled data, while the second (unsupervised) term is a regularizer that prefers those directions that maximize the variance of the projections, subject to orthogonality constraints.

Relaxing Orthogonality Constraints

The orthogonality constraints force one to progressively pick directions that have very low variance, substantially reducing the quality of the lower bits. Depending on the application, it may make sense to pick a direction that is not necessarily orthogonal to the previous directions but has higher variance as well as low empirical error on the labeled set.
So instead of imposing hard orthogonality constraints, they convert them into a penalty term added to the objective function, which finally becomes

J(W) = (1/2) tr{ W^T M W } − (ρ/4) || W^T W − I ||_F^2

where ρ weights the orthogonality penalty.

This can be solved using standard properties of matrix operations (an eigendecomposition in the orthogonal case); I'll skip the details here.
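For the orthogonal case, the solution can be sketched as taking the top eigenvectors of the adjusted matrix M = X_l S X_l^T + η X X^T (my reading of the method, not the authors' code; the penalty-based variant is solved with similar matrix tools):

```python
import numpy as np

def learn_projections(X, Xl, S, eta=0.5, n_bits=8):
    """Sketch of SSH learning under hard orthogonality constraints:
    maximize (1/2) tr(W^T M W) with M = Xl S Xl^T + eta * X X^T,
    solved by the top-n_bits eigenvectors of the symmetric matrix M."""
    M = Xl @ S @ Xl.T + eta * (X @ X.T)
    vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    return vecs[:, ::-1][:, :n_bits]     # top eigenvectors as columns of W

rng = np.random.default_rng(2)
X = rng.standard_normal((16, 100))       # toy data: 16-dim, 100 points
X -= X.mean(axis=1, keepdims=True)       # zero-mean as the paper assumes
Xl = X[:, :10]                           # pretend the first 10 are labeled
S = np.zeros((10, 10))
S[0, 1] = S[1, 0] = 1.0                  # one neighbor pair
S[2, 3] = S[3, 2] = -1.0                 # one non-neighbor pair
W = learn_projections(X, Xl, S)
print(W.shape)                           # (16, 8)
```

Since `eigh` returns orthonormal eigenvectors of a symmetric matrix, the resulting W automatically satisfies W^T W = I.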

Comments:
This work proposes a novel idea that reaches good accuracy with a small number of hash bits. Compared with state-of-the-art methods like LSH (Locality Sensitive Hashing), SH (Spectral Hashing), and RBMs (Restricted Boltzmann Machines), SSH performs better with fewer bits and also needs less training time, especially in comparison with RBMs.
As data scales become extremely large in the future, methods like this will become very influential, and the formulation process may be reused in other similar works.

Experimental results on the one-million GIST dataset, from the paper.


Wednesday, March 21, 2012

[Paper Summary Lec_04] Online Dictionary Learning For Sparse Coding

J. Mairal et al., "Online Dictionary Learning for Sparse Coding," ICML 2009

Sparse coding models data vectors as sparse linear combinations of basis elements.
The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one—based on wavelets for example—has recently led to state-of-the-art results for numerous low-level image processing tasks.
In this work, they showed that it is possible to go further and exploit the specific structure of sparse coding in the design of an optimization procedure dedicated to the problem of dictionary learning, with low memory consumption and lower computational cost than classical second-order batch algorithms and without the need of explicit learning rate tuning.


Dictionary learning can be considered as optimizing the empirical cost function

f_n(D) = (1/n) Σ_{i=1..n} l(x_i, D)

where the loss function l is the distance between the original signal and its dictionary representation, plus a regularization term (to preserve sparsity), typically

l(x, D) = min_α (1/2) ||x − Dα||² + λ ||α||₁
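A minimal sketch of solving this sparse coding subproblem with ISTA (a simple stand-in solver for illustration; the paper itself uses a LARS-based Lasso solver):

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    """Minimize (1/2)||x - D a||^2 + lam * ||a||_1 by iterative
    soft-thresholding (ISTA). Not the paper's solver, just a sketch."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)                # gradient of the quadratic part
        a = a - g / L                        # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(3)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)               # unit-norm dictionary atoms
x = 2.0 * D[:, 0] + 0.01 * rng.standard_normal(20)  # signal near atom 0
a = sparse_code(x, D)
print(np.count_nonzero(np.abs(a) > 1e-3))    # number of active atoms
```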


Online Dictionary Learning


The whole procedure is shown in the figures for Algorithm 1 and Algorithm 2.



We can see that the procedure iteratively updates the two variables, fixing one while optimizing the other, making each step a convex optimization problem.
(α holds the linear-combination coefficients, and D is the dictionary matrix.)
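The dictionary-update half of the alternation can be sketched as a column-wise block-coordinate update driven by accumulated statistics A = Σ αα^T and B = Σ xα^T (my rough rendering of the algorithm; the random A and B below are placeholders, not real accumulated statistics):

```python
import numpy as np

def dictionary_update(D, A, B, n_iter=5):
    """Block-coordinate descent on the columns of D.
    A accumulates sums of alpha alpha^T and B sums of x alpha^T
    over the signals seen so far."""
    for _ in range(n_iter):
        for j in range(D.shape[1]):
            if A[j, j] < 1e-12:
                continue                                # unused atom: skip (or purge)
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project onto the unit ball
    return D

rng = np.random.default_rng(5)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)                          # unit-norm atoms
R = rng.standard_normal((20, 20))
A = R @ R.T                                             # stand-in statistics (PSD)
B = rng.standard_normal((10, 20))
D = dictionary_update(D, A, B)
print(np.max(np.linalg.norm(D, axis=0)))                # every atom has norm <= 1
```

The unit-ball projection keeps the dictionary in the constraint set, and the `A[j, j]` check is where purging of unused atoms would hook in.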


For further improvement of the algorithm, the authors mention some points to focus on:
Handling Fixed-Size Datasets
-When dealing with large training sets, it is impossible to store all the past coefficients
Mini-Batch Extension
-Process several signals at each iteration instead of a single one

Purging the Dictionary of Unused Atoms
-Dictionary atoms that are never used should be dropped

The authors also give a rigorous proof in their convergence analysis, but I skip that part in my summary.

Comments:
The contributions of this work:
It casts the dictionary learning problem as the optimization of a nonconvex objective function over convex sets.
The algorithm is significantly faster than previous approaches to dictionary learning.
As the resolution of images and videos keeps growing, this kind of method will become quite important for handling such large-scale signals efficiently.

Wednesday, March 14, 2012

[Paper Summary Lec_03] Aggregating local descriptors into a compact image representation

H. Jégou et al., "Aggregating Local Descriptors into a Compact Image Representation," CVPR 2010


The goal of this work is to address the problem of searching the most similar images in a very large image database (million scale) under the joint optimization of three constraints: the search accuracy, its efficiency and the memory usage.
The evaluation shows that their search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10-million-image dataset takes about 50 ms.

The whole procedure can be generalized to three stages as following,

Aggregate local image descriptors into a vector representation
They propose a descriptor, derived from both BOF and the Fisher kernel, that aggregates SIFT descriptors and produces a compact representation, termed VLAD (Vector of Locally Aggregated Descriptors).
Similar to the BOF feature, the idea of the VLAD descriptor is to accumulate, for each visual word c_i, the differences x − c_i of the vectors x assigned to c_i (where c_i is a center formed by the k-means algorithm).
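A minimal VLAD sketch under this definition (random vectors stand in for the SIFT descriptors and the k-means vocabulary; my own illustration, not the authors' code):

```python
import numpy as np

def vlad(X, centers):
    """Accumulate residuals x - c_i over the descriptors x assigned to
    each k-means center c_i, then L2-normalize the concatenation."""
    k, d = centers.shape
    # nearest center for each descriptor
    assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        v[i] = (X[assign == i] - centers[i]).sum(axis=0)  # residual sum per word
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 8))        # stand-ins for SIFT descriptors
centers = rng.standard_normal((16, 8))   # stand-in 16-word visual vocabulary
print(vlad(X, centers).shape)            # (128,) = k * d
```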


Dimensionality reduction of these vectors
A projection method (PCA) is used to reduce the dimension of the feature vector, and each vector is then represented by a quantized version using the ADC (asymmetric distance computation) method, referenced from previous work.

The indexing algorithm
After the quantized codewords are constructed, an inverted file is built, termed IVFADC; search is then based on it, as in a traditional IR task.
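A toy inverted file over coarse quantizer cells might look like this (my own simplification; the real IVFADC additionally stores the ADC residual codes inside each list):

```python
from collections import defaultdict

def build_inverted_file(codes):
    """Map coarse-quantizer cell id -> list of image ids assigned to it."""
    index = defaultdict(list)
    for image_id, cell in enumerate(codes):
        index[cell].append(image_id)
    return index

# Images 0..4 assigned to cells 3, 1, 3, 0, 1 by the coarse quantizer.
index = build_inverted_file([3, 1, 3, 0, 1])
print(index[3])   # [0, 2]: only images in the query's cell are scanned
```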



Comments: 

The VLAD feature outperforms BOF at the same size; the computation is more efficient, and the memory cost is lower after the PCA dimension reduction.
After dimension reduction, mAP drops the most for VLAD in comparison with BOF and Fisher alone.

[Paper Summary Lec_02] Efficient visual search of videos cast as text retrieval

Efficient visual search of videos cast as text retrieval
Josef Sivic and Andrew Zisserman, TPAMI, 2009


This paper casts the problem of object search queries in video as a traditional text retrieval problem, by using "visual words" as the words in text. By applying mature technologies from the IR field and some post-processing, such as checking the spatial layout of candidate regions, the goal of localizing the occurrences of a query object is achieved.



Off-line
Building visual words and key frame representation

As in the paper just summarized, affine covariant regions are first detected in each key-frame of the video; each detected region is then represented by a SIFT descriptor.
To reduce noise and reject unstable regions, information is aggregated over a sequence of frames, and unstable regions are removed after certain velocity and correlation tests. (Regions that do not survive for more than 3 frames are rejected.)

Samples of normalized affine covariant regions from clusters corresponding to a single visual word


        After the SIFT descriptors are calculated, a visual vocabulary is constructed using the k-means algorithm; each region descriptor in each key-frame is then assigned to the nearest cluster center (using the Mahalanobis distance as the metric). With this vocabulary, each key-frame is represented as a histogram of visual words.

        As in traditional text retrieval, a stop list is built by filtering out the frequent visual words that occur in almost all images, and the remaining visual words are weighted by the tf-idf weighting method.

        Finally, the inverted file indexing structure can be built.

On-line
Given the query region, the set of visual words it contains is computed, and key-frames are retrieved based on visual word frequencies. Documents are ranked by the normalized scalar product (cosine of the angle) between their tf-idf weighted histograms, as follows
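A small sketch of this ranking step (the toy histograms and idf values are my own, not the paper's data):

```python
import numpy as np

def rank_frames(query_hist, frame_hists, idf):
    """Rank frames by the normalized scalar product (cosine similarity)
    between tf-idf weighted visual-word histograms."""
    q = query_hist * idf
    F = frame_hists * idf
    q = q / max(np.linalg.norm(q), 1e-12)
    F = F / np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    scores = F @ q                       # cosine of the angle per frame
    return np.argsort(-scores), scores   # best-first order, raw scores

# 3 frames, vocabulary of 3 visual words; word counts per frame:
hists = np.array([[2., 0., 1.], [0., 3., 0.], [2., 0., 2.]])
idf = np.log(3 / np.array([2., 1., 2.]))     # idf = log(N / n_word)
order, scores = rank_frames(np.array([1., 0., 1.]), hists, idf)
print(order[0])   # 2: frame 2 has the same word proportions as the query
```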

Finally, as Google does in document search ranking, spatial consistency is a good choice of post-processing. In the paper, they define a search area by the 15 nearest spatial neighbors of each match in the query and target frames. Each region that also matches within the search area casts a vote for that frame, and the voting score can be used to re-rank the results.



This picture shows the case where the search area is defined by 3 nearest neighbors:
Comments: 
The most remarkable contribution of this work is that they carry well-developed techniques from text retrieval over to object queries in videos.
The spatial consistency stage seems effective, but if the number of neighbors is too large, efficiency declines.
With only 3 movies as the dataset, the experiments are not very convincing.



Sunday, March 11, 2012

[Paper Summary Lec_02] Distinctive Image Features from Scale-Invariant Keypoints

"Distinctive Image Features from Scale-Invariant Keypoints,"
D. Lowe, IJCV 2004

The feature presented in this paper is well known as SIFT, which is widely used for matching objects in the computer vision community. This local feature descriptor achieves invariance to scale and rotation, and some degree of invariance to affine transformation.

The whole method can be generalized to four stages as following,

1.      Scale-space extrema detection
It is implemented efficiently by using a difference-of-Gaussian function (differences of consecutive Gaussian kernels whose scales differ by a factor k in σ) to identify potential interest points that are invariant to scale and orientation.

2.      Keypoint localization
      Extrema locations from the previous stage are selected by comparing each sample to its eight neighbors in the current scale and its nine neighbors in each of the two adjacent scales, as shown in the following figure.
      A later filtering process, based on the stability of each point, eliminates low-contrast points and edge responses.

3.      Orientation assignment
              An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint. The orientation histogram has 36 bins covering the 360-degree range of orientations. Peaks in the orientation histogram correspond to dominant directions, and the gradients are expressed relative to the dominant direction, so that rotation invariance is attained.
Any local peak within 80% of the highest peak is also used to create a keypoint with that orientation; so if there are multiple orientations of similar magnitude, the keypoint is duplicated with different orientation assignments.
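These two steps can be sketched as follows (a rough illustration; the real implementation also Gaussian-weights the samples and interpolates peak positions between bins):

```python
import numpy as np

def dominant_orientations(grad_mag, grad_ori):
    """Build a 36-bin orientation histogram over a keypoint's neighborhood,
    then return every local peak within 80% of the highest peak (degrees)."""
    hist, _ = np.histogram(grad_ori, bins=36, range=(0.0, 360.0),
                           weights=grad_mag)
    peaks = []
    for b in range(36):
        left, right = hist[(b - 1) % 36], hist[(b + 1) % 36]  # circular neighbors
        if hist[b] >= left and hist[b] >= right and hist[b] >= 0.8 * hist.max():
            peaks.append(b * 10.0 + 5.0)   # bin center in degrees
    return peaks

# Toy gradients: two strong samples near 13 degrees, weaker ones elsewhere.
mag = np.array([1.0, 1.0, 0.9, 0.2])
ori = np.array([12.0, 14.0, 200.0, 90.0])
print(dominant_orientations(mag, ori))     # [15.0]: single dominant direction
```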


4.      Keypoint descriptor
      A keypoint descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location, as shown on the left. These are weighted by a Gaussian window, indicated by the overlaid circle. The samples are then accumulated into orientation histograms summarizing the contents of 4x4 sub-regions, as shown on the right, with the length of each arrow corresponding to the sum of the gradient magnitudes near that direction within the region.
      The figure shows a 2x2 descriptor array computed from an 8x8 set of samples, whereas the experiments in the paper use 4x4 descriptors computed from a 16x16 sample array.

Comments:

Merits
  • The keypoint descriptors are highly distinctive, which allows a single feature to find its correct match with good probability in a large database of features.
  • Many matching tasks are based on the Harris corner detector, which is very sensitive to changes in image scale, while SIFT overcomes this thorny problem.
Defects

  • When facing object recognition problems with similar characteristics on different objects, the performance of the SIFT feature declines.

  • All the parameter settings are based on experimental results, not on a derivation of an optimal solution.



Thursday, March 8, 2012