Efficient Visual Search of Videos Cast as Text Retrieval. Josef Sivic and Andrew Zisserman, TPAMI 2009
This paper casts the problem of searching for a query object in video as a traditional text-retrieval problem, treating each "visual word" as the analogue of a word in text. By applying mature techniques from the IR field, together with post-processing based on the spatial layout of candidate regions, the system can localize the occurrences of a query object.
Finally, analogous to how Google re-ranks documents in text search, spatial consistency is used as a post-processing step. In the paper they define a search area by the 15 nearest spatial neighbors of each match in the query and target frames. Each region that also matches within the search area casts a vote for that frame, and the accumulated voting score is used to re-rank the results.
The figure illustrates the case where the search area is defined by the 3 nearest neighbors:
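A rough sketch of this voting scheme, assuming we already have one-to-one tentative matches between regions (here reduced to 2D point coordinates; the function name and the toy setup are illustrative, not from the paper):

```python
import numpy as np

def spatial_votes(query_pts, target_pts, k=15):
    """query_pts[i] tentatively matches target_pts[i].
    Each match votes for every other match that lies among its k
    nearest spatial neighbors in BOTH the query and the target frame;
    the total vote count scores the target frame for re-ranking."""
    def knn(pts, k):
        # pairwise distances, then the k nearest neighbors of each
        # point (column 0 of the argsort is the point itself)
        d = np.linalg.norm(pts[:, None] - pts[None], axis=2)
        return np.argsort(d, axis=1)[:, 1:k + 1]

    nq, nt = knn(query_pts, k), knn(target_pts, k)
    votes = 0
    for i in range(len(query_pts)):
        votes += len(set(nq[i]) & set(nt[i]))
    return votes
```

When the target layout is a rigid copy of the query layout, every neighbor relation is preserved and the frame collects the maximum number of votes; scrambled matches agree on fewer neighbors and score lower.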
Off-line
Building visual words and key frame representation
As in the previously summarized work, affine covariant regions are first detected in each key-frame of the video, and each detected region is represented by a SIFT descriptor.
To reduce noise and reject unstable regions, information is aggregated over a sequence of frames: regions are tracked, and tracks that fail a velocity and correlation test are discarded. In particular, regions that do not survive for more than 3 frames are rejected.
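The stability filter reduces to a simple length test once tracks are formed (the track representation below is a hypothetical one; the paper's velocity and correlation tests are omitted):

```python
def stable_tracks(tracks, min_len=4):
    """Keep only regions tracked through more than 3 frames.
    tracks: dict mapping a track id to its list of per-frame regions."""
    return {t: regions for t, regions in tracks.items()
            if len(regions) >= min_len}
```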
(Figure: samples of normalized affine covariant regions from clusters corresponding to a single visual word.)
After the SIFT descriptors are calculated, a visual vocabulary is constructed using the k-means algorithm, and each region descriptor in each key-frame is assigned to the nearest cluster center (using the Mahalanobis distance as the metric). With this vocabulary, each key-frame is represented as a histogram of visual-word counts.
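A minimal sketch of the vocabulary and histogram steps, using plain Euclidean k-means rather than the paper's Mahalanobis metric, and random vectors standing in for real SIFT descriptors:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Plain k-means over descriptors; returns the cluster centers
    that serve as the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to the id of its nearest visual word."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    return d.argmin(axis=1)

def word_histogram(words, k):
    """Represent one key-frame as a histogram of visual-word counts."""
    return np.bincount(words, minlength=k)
```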
As in traditional text retrieval, a stop list is built to filter out the most frequent visual words, which occur in almost all images, and the remaining visual words are weighted by the tf-idf (term frequency-inverse document frequency) scheme.
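The stop list and weighting can be sketched as follows (the 5% stop-list fraction is an assumption for illustration, not a value from the paper):

```python
import numpy as np

def tfidf_weight(hist, stop_fraction=0.05):
    """hist: (n_frames, n_words) matrix of raw visual-word counts.
    Zeros out the most frequent words (stop list), then applies
    tf-idf weighting to the rest."""
    hist = hist.astype(float)
    n_frames, n_words = hist.shape
    df = (hist > 0).sum(axis=0)  # document frequency of each word
    # stop list: drop the top fraction of words by document frequency
    n_stop = int(stop_fraction * n_words)
    stopped = np.argsort(df)[::-1][:n_stop]
    hist[:, stopped] = 0.0
    # term frequency normalized per frame, inverse document frequency
    tf = hist / np.maximum(hist.sum(axis=1, keepdims=True), 1)
    idf = np.log(n_frames / np.maximum(df, 1))
    return tf * idf
```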
Finally, an inverted file index is built over the visual words.
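An inverted file is just a map from each visual word to the frames containing it, so a query only touches the posting lists of its own words; a minimal sketch:

```python
from collections import defaultdict

def build_inverted_index(frame_words):
    """frame_words: dict mapping frame id -> iterable of visual word
    ids occurring in that frame. Returns word id -> set of frame ids."""
    index = defaultdict(set)
    for frame, words in frame_words.items():
        for w in words:
            index[w].add(frame)
    return index

def candidate_frames(index, query_words):
    """Union of the posting lists for the query's visual words."""
    hits = set()
    for w in query_words:
        hits |= index.get(w, set())
    return hits
```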
On-line
Given the query region, the set of visual words inside it is computed, and key-frames are retrieved based on their visual-word frequencies. Frames are ranked by the normalized scalar product (cosine of the angle) between the weighted query vector v_q and each frame vector v_d: score(q, d) = (v_q · v_d) / (||v_q|| ||v_d||).
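The ranking step is a straightforward cosine similarity over the tf-idf weighted vectors; a sketch:

```python
import numpy as np

def rank_frames(query_vec, frame_vecs):
    """Rank frames by the normalized scalar product (cosine of the
    angle) between the query vector and each frame's word vector.
    Returns frame indices in descending score order, with scores."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    F = frame_vecs / (np.linalg.norm(frame_vecs, axis=1,
                                     keepdims=True) + 1e-12)
    scores = F @ q
    order = np.argsort(scores)[::-1]
    return order, scores[order]
```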
Comments:
The most remarkable contribution of this work is that it carries the well-developed machinery of text retrieval over to object search in videos.
The spatial consistency stage seems effective, but if the number of neighbors is too large, efficiency declines.
With only 3 movies as the dataset, the experiments are not very convincing.


