Wednesday, March 14, 2012

[Paper Summary Lec_02] Efficient visual search of videos cast as text retrieval

Efficient visual search of videos cast as text retrieval
Josef Sivic and Andrew Zisserman    TPAMI, 2009


This paper casts the problem of searching for a query object in video as a traditional text-retrieval problem, by treating "visual words" as the analogue of words in text. By applying mature technologies from the IR field, plus some post-processing based on the spatial layout of candidate regions, the goal of localizing occurrences of a query object is achieved.



Off-line
Building visual words and key frame representation

As in the paper summarized previously, affine covariant regions are first detected in each key-frame of the video, and each detected region is represented by a SIFT descriptor.
To reduce noise and reject unstable regions, information is aggregated over a sequence of frames: regions are tracked, unstable ones are removed by velocity and correlation tests, and regions that do not survive for more than 3 frames are rejected.
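The stability filter above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the track IDs and frame indices are hypothetical placeholders, and the velocity/correlation tests are omitted.

```python
# Hedged sketch: reject region tracks that survive fewer than 3 consecutive
# key-frames. Track IDs and frame lists are toy placeholder data.

def stable_tracks(tracks, min_length=3):
    """Keep only region tracks observed in at least `min_length` frames."""
    return {tid: frames for tid, frames in tracks.items()
            if len(frames) >= min_length}

tracks = {
    "t0": [0, 1, 2, 3],   # survives 4 frames -> kept
    "t1": [5],            # one-frame detection -> rejected as noise
    "t2": [7, 8, 9],      # exactly 3 frames -> kept
}
print(sorted(stable_tracks(tracks)))  # ['t0', 't2']
```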

(Figure: samples of normalized affine covariant regions from clusters corresponding to a single visual word.)


After the SIFT descriptors are calculated, a visual vocabulary is constructed using the k-means algorithm, and each region descriptor in each key-frame is assigned to the nearest cluster center (using the Mahalanobis distance as the metric). With this vocabulary, each key-frame is represented as a histogram of visual-word occurrences.
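The assignment-and-histogram step might look like the sketch below. For simplicity, plain Euclidean distance stands in for the Mahalanobis metric used in the paper, and tiny 2-D "descriptors" stand in for 128-D SIFT vectors.

```python
# Hedged sketch: assign each descriptor to its nearest cluster centre and
# build a bag-of-visual-words histogram for one key-frame.
import math

def nearest_center(desc, centers):
    """Index of the closest cluster centre (Euclidean, not Mahalanobis)."""
    return min(range(len(centers)),
               key=lambda i: math.dist(desc, centers[i]))

def frame_histogram(descriptors, centers):
    """Count how many regions in the frame fall into each visual word."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    return hist

centers = [(0.0, 0.0), (10.0, 10.0)]            # toy vocabulary (k = 2)
frame = [(0.1, 0.2), (9.8, 10.1), (10.2, 9.9)]  # toy descriptors
print(frame_histogram(frame, centers))          # [1, 2]
```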

As in the traditional text retrieval task, a stop list is built to filter out the most frequent visual words, which occur in almost all images, and the remaining visual-word frequencies are weighted by the tf-idf scheme.
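A minimal sketch of these two steps, under the assumption (for illustration only) that the stop list drops any word present in every document; the real stop list uses a frequency threshold, and the histograms here are toy values.

```python
# Hedged sketch: crude stop list + tf-idf weighting of visual-word histograms.
import math

def tfidf(histograms):
    n_docs = len(histograms)
    n_words = len(histograms[0])
    # document frequency of each visual word
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    # toy stop list: words occurring in every document carry no information
    stop = {w for w in range(n_words) if df[w] == n_docs}
    weighted = []
    for h in histograms:
        total = sum(h) or 1
        row = []
        for w in range(n_words):
            if w in stop or df[w] == 0:
                row.append(0.0)
            else:
                row.append((h[w] / total) * math.log(n_docs / df[w]))
        weighted.append(row)
    return weighted, stop

hists = [[2, 1, 0], [3, 0, 1], [1, 2, 2]]
weighted, stop = tfidf(hists)
print(stop)  # word 0 occurs in every frame, so it is stopped
```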

Finally, an inverted file index is built over the key-frames.
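An inverted file maps each visual word to the list of key-frames containing it, so that a query only touches frames sharing at least one word with it. A minimal sketch with toy frame IDs:

```python
# Hedged sketch of an inverted file index: visual word -> frames containing it.
from collections import defaultdict

def build_inverted_index(histograms):
    index = defaultdict(list)
    for frame_id, hist in enumerate(histograms):
        for word, count in enumerate(hist):
            if count > 0:
                index[word].append(frame_id)
    return dict(index)

hists = [[2, 0, 1], [0, 1, 1]]
index = build_inverted_index(hists)
print(index[2])  # word 2 appears in frames 0 and 1
```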

On-line
Given the query region, the set of visual words inside it is computed, and key-frames are retrieved based on visual-word frequencies. Documents are ranked by the normalized scalar product (cosine of the angle) between the query's tf-idf weighted word-frequency vector v_q and each frame's vector v_d: sim(q, d) = (v_q · v_d) / (||v_q|| ||v_d||).
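The ranking step can be sketched as below; the vectors are toy stand-ins for tf-idf weighted word-frequency vectors.

```python
# Hedged sketch: rank frames by the normalized scalar product (cosine of
# the angle) between the query vector and each frame vector.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_frames(query_vec, frame_vecs):
    scores = [(cosine(query_vec, v), fid) for fid, v in enumerate(frame_vecs)]
    return [fid for _, fid in sorted(scores, reverse=True)]

query = [1.0, 0.0, 1.0]
frames = [[1.0, 0.0, 1.0],   # identical to the query -> ranked first
          [0.0, 1.0, 0.0],   # orthogonal -> ranked last
          [0.5, 0.5, 0.5]]
print(rank_frames(query, frames))  # [0, 2, 1]
```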

Finally, as Google does in document-search ranking, spatial consistency is a good choice of post-processing. In the paper, a search area is defined by the 15 nearest spatial neighbors of each match in both the query and target frames. Each region that also matches within the search areas casts a vote for that frame, and the voting score is used to re-rank the results.
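A toy sketch of this voting scheme, assuming 2-D region positions and a small neighbor count k (the paper uses k = 15); the point coordinates are made-up illustration data, not the paper's matcher.

```python
# Hedged sketch of spatial-consistency re-ranking: a matched region casts a
# vote for each other match that lies among the k nearest spatial neighbours
# of its position in BOTH the query frame and the target frame.
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding itself)."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return set(order[:k])

def spatial_votes(query_pts, target_pts, k=2):
    # query_pts[m] and target_pts[m] are positions of the m-th match
    votes = 0
    for m in range(len(query_pts)):
        nq = knn(query_pts, m, k)
        nt = knn(target_pts, m, k)
        votes += len(nq & nt)  # matches consistent in both layouts
    return votes

pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
scrambled = [(0.0, 0.0), (10.0, 10.0), (1.0, 0.0)]
print(spatial_votes(pts, pts, k=1))        # consistent layout scores higher
print(spatial_votes(pts, scrambled, k=1))  # inconsistent layout scores lower
```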



(Figure: the case where the search area is defined by the 3 nearest neighbors.)
Comments: 
The most remarkable contribution of this work is that it transfers well-developed techniques from text retrieval to object queries in video.
The spatial consistency stage seems effective, but if the number of neighbors is too large, efficiency will decline.
With only 3 movies as the dataset, the experiments are not very convincing.


