International Journal of Engineering Trends and Technology (IJETT) – Volume 32 Number 3- February 2016
A Survey and Proposal on Video Data Retrieval Technique
Aniket Sugandhi¹, Deepshikha Sharma²
¹Research Scholar, Department of Information Technology, SVITS, Indore, India
²Assistant Professor, Department of Information Technology, SVITS, Indore, India
Abstract— As computational technologies grow, the need for digital data also increases. In huge data sources, much of the time spent in utilizing the required data is consumed by the retrieval process. Data retrieval is therefore the process of finding relevant information in a given data source. A number of techniques are available for accurate retrieval of text and image data, but far fewer efforts are reported for retrieving accurate results in video search. In this paper a review of the available video data retrieval processes is performed. On the basis of the concluded outcomes, a suitable data model is proposed for improving search relevancy in video data retrieval systems.
Keywords— Improvement of Methods, Information
Retrieval, Review, Search Relevancy, Video Data
Analysis
1. INTRODUCTION
Data retrieval is the process by which data is selected and extracted from a file or a database. To retrieve the desired data, the user inputs a set of criteria through a query, such as a Structured Query Language (SQL) statement. But when the data is not available in a structured format, data retrieval becomes complicated. Video is a widely used data format in computer technology and the most complex media form, containing a large amount of information and images. Its unstructured data format and non-transparent content make it difficult to search and manage.
Retrieving desired video clips quickly and accurately from a large video database is one of the challenging tasks in developing video databases. Former video-based information retrieval systems find videos based on keywords. But manual treatment of data increases time and effort; additionally, the text associated with a video does not define the entire video content, and can cause a large number of poor results. The improvement in multimedia storage technology deals with the extraction of implicit knowledge, multimedia data relationships, or other patterns not explicitly stored.
The storage and management of multimedia data is one of the crucial tasks in data mining due to the non-structured nature of the data, and handling data with a complex structure, such as video and audio, is one of the challenging tasks. Most people have access to a tremendous amount of video from the internet, so there are rich video-based applications in many areas including business, security and surveillance, medicine, news, education, and entertainment.
Video data contains various kinds of data such as video, audio, and text. A video consists of recurrent images with some temporal information. The video content may be defined in three categories:
[i.] Low-level feature information, which includes features such as color, texture, shape, and so on;
[ii.] Syntactic information, which describes the contents of the video, including salient objects, their spatio-temporal positions, and the spatio-temporal relations between them; and
[iii.] Semantic information, which describes what happens in the video along with what is perceived by the users.
The semantic information used to identify video events has two important aspects:
[i.] A spatial aspect presented by a video frame, such as the location, objects, and characters displayed in the frame.
[ii.] A temporal aspect presented by a sequence of video frames in time, such as the characters' actions and the objects' movements presented in the sequence.
The higher-level semantic information of a video is extracted by examining the features of the audio, video, and text. Additionally, other features are fully exploited and used to capture the semantics, bridging the gap between the high-level semantics and the low-level features. Three modalities are identified within a video:
[i.] The visual modality, containing the scene that can be seen in the video;
[ii.] The auditory modality, with the speech, music, and environmental sounds that can be heard along with the video;
[iii.] The textual modality, having the textual resources which describe the content of the video document.
Video databases are huge and video data sets are extremely large in size. There are many tools for managing and searching within such collections, but some tool is needed to extract the hidden knowledge within the video data for applications. Here, we first discuss some issues regarding data retrieval:
[i.] Poor-quality data: noisy data, missing values, inadequate size, and poor data sampling.
[ii.] Lack of understanding and lack of diffusion of data retrieval techniques.
[iii.] Data variety: accommodating data that comes from different sources and in a variety of different forms (images, geo data, text, social, numeric, etc.).
[iv.] Data velocity: online machine learning requires models to be constantly updated with new, incoming data.
[v.] Dealing with huge datasets, or 'Big Data', that require distributed approaches.
[vi.] Coming up with the right question or problem: "more data beats a better algorithm, but smarter questions beat more data".
[vii.] Remaining objective and letting the data lead you, not the opposite.
2. LITERATURE SURVEY
This section provides a study of recently developed video data retrieval techniques. The study of these techniques provides guidelines for improving the performance of video data retrieval.
Case I
Current semantic video search approaches usually build upon text search against text associated with the video content. The additional use of features such as image content, audio, face detection, and high-level concepts has been shown to improve upon text-based video searching. However, these multimodal systems aim to get the most improvement through supporting multiple queries, applying specific semantic concept detectors, or developing highly-tuned retrieval models for specific domains. It is difficult for users of multimodal search systems to acquire example images for their queries. Retrieval by matching semantic concepts, though promising, strongly depends on the availability of robust detectors and the required training data. It is also difficult for developers to build highly-tuned models and apply the system to different domains. Clearly, approaches are needed for leveraging the available multimodal techniques in search without complicating things [1].
Pseudo-relevance feedback (PRF) is a tool used to extend text search results for text and video retrieval. PRF was initially introduced in [2], where the top-ranked documents are used to re-rank the retrieved documents. In that context users explicitly provide feedback by labeling top results as positive or negative. The same concept has been implemented in video retrieval. In [3], the authors used textual information in the top-ranking shots to obtain additional keywords to perform retrieval and re-ranking; the experiment improved the mean average precision (MAP) from 0.120 to 0.124 in the TRECVID 2004 video search task [4]. In [5], the authors sampled pseudo-negative images from the lowest ranks of the initial query results, took the query videos and images as positive examples, and formulated re-ranking as a classification problem, improving search performance from MAP 0.105 to 0.112 in TRECVID 2003.
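The classification-style PRF idea can be made concrete with a small sketch. This is an illustration only, not the exact formulation of [3] or [5]: the top-ranked shots are taken as pseudo-positive, the bottom-ranked as pseudo-negative, a classifier is fit on their visual features, and its scores are fused with the text scores. The `features` and `text_scores` arrays and the equal fusion weights are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prf_rerank(features, text_scores, k_pos=10, k_neg=100):
    """Pseudo-relevance feedback as classification: pseudo-label the
    extremes of the initial text ranking, then re-score every shot."""
    order = np.argsort(-text_scores)              # best text hits first
    pos, neg = order[:k_pos], order[-k_neg:]      # pseudo-positive / -negative
    X = np.vstack([features[pos], features[neg]])
    y = np.concatenate([np.ones(k_pos), np.zeros(k_neg)])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    visual = clf.predict_proba(features)[:, 1]    # visual relevance scores
    return np.argsort(-(0.5 * text_scores + 0.5 * visual))
```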
To overcome the problem of example-based approaches and avoid highly-tuned models, the goal in [1] is to utilize both pseudo-positive and pseudo-negative examples and learn the recurrent relevant visual patterns from estimated "soft" pseudo-labels instead of using "hard" pseudo-labels. The probabilistic relevance score of each shot is then smoothed over the entire raw search results through a kernel density estimation process. An information-theoretic approach is then applied to cluster visual patterns of similar semantics. The clusters are reordered by cluster relevance, and the images within the same cluster are ranked by their feature density. To balance re-ranking performance and efficiency, experiments are conducted with different parameters used in the information bottleneck (IB) ranking. This re-ranking approach is highly generic and requires no training in advance, yet is comparable to the top automatic and manual video search systems. The authors thus propose a novel re-ranking process for video search which requires no image search examples and is based on the rigorous IB principle. Evaluated on the TRECVID 2003-2005 data sets, the approach improves the text search baseline across different topics by up to 23% in average performance.
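A rough sketch of the density-based smoothing step follows, assuming a Gaussian kernel over shot features; the bandwidth is arbitrary and the IB clustering of [1] is not reproduced here.

```python
import numpy as np

def smooth_scores(features, raw_scores, bandwidth=1.0):
    """Kernel-smoothed relevance: each shot's score becomes a weighted
    average of all raw scores, weighted by visual similarity, so shots
    lying in dense regions of relevant material are promoted."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))      # Gaussian kernel weights
    return (w @ raw_scores) / w.sum(axis=1)
```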
Case II
Recent video search approaches are mostly restricted to text-based solutions that process keyword queries against text tokens associated with the video content. However, textual information does not necessarily come with image or video sets. The use of other modalities such as image content, audio, face detection, and high-level concept detection has been used to improve upon text-based video search systems. Such multimodal systems improve search performance by introducing multiple query example images, specific semantic concept detectors, or highly-tuned retrieval models for specific types of queries. Additionally, it is still an open issue whether concept models for different classes of queries can be developed and proved effective across multiple domains. Therefore, the incorporation of multimodal search methods should be as transparent and non-intrusive as possible, in order to keep the simple search mechanism preferred by typical users today.
Based on the above observations, [6] conducts semantic video search in a re-ranking manner, automatically re-ranking the initial text search results based on the "recurrent patterns" or "contextual patterns" between videos in the initial search results. Such patterns exist because of common semantic topics shared between different videos, and additionally because of common visual content. Such visual duplicates provide strong links between broadcast news videos or web news articles and help cross-domain information exploitation. In [7], the authors analyzed the frequency of such recurrent patterns for cross-language topic tracking.
The idea, therefore, is to use multimodal similarities between semantic video documents as basic contextual patterns and to develop a new re-ranking method, context re-ranking. The multimodal contextual links are used to link stories missed by the initial text queries and to further improve search accuracy. Because there are few explicit links between video stories, context re-ranking is formulated as a random walk over a context graph whose nodes represent documents in the search set. These nodes are connected by edges weighted with the pair-wise multimodal contextual similarities. The stationary probability of the random walk is used to compute the final scores of the videos after re-ranking, and the walk is biased with a preference towards stories with higher initial text search scores. The contextual re-ranking method is inspired by page ranking techniques. The authors also investigate the optimal weights for combining the text search scores and multimodal re-ranking for different query classes. Experiments show that the approach can improve baseline text retrieval by up to 32%. Such results are remarkable since no additional advanced methods are used. Furthermore, for people-related queries, which usually have recurrent coverage across news sources, a 40% relative improvement in story-level MAP is recorded. Through parameter sensitivity tests, the optimal text-versus-visual weight ratio discovered for re-ranking initial text search results is 0.15 to 0.85.
Case III
The amount of online video has exploded in the past decade due to the advancement of internet and multimedia technologies. On YouTube [8], for example, over 72 hours of video are uploaded every minute. However, video search and retrieval mostly depend on tags annotated by users, which may not always be available or relevant to the video contents. Therefore, it is still very challenging to search for the specific video segments containing the objects of interest (OOI), a capability required by several video-based applications including embedded marketing [9], [10]. In practice, content providers may not always know exactly when particular items are embedded in their videos. Such information, even if available in the scripts, might not be successfully transferred to the TV stations or online video platforms which broadcast the videos. For old movies or TV series, the aforementioned information is obviously not available either. Without the ability to automatically search for the OOI across video frames based on a query input, it will be challenging for advertisers or video deliverers to design and provide interaction services in future smart digital TV applications.
In [11], the authors present a novel learning framework based on multiple instance learning (MIL), which integrates the aforementioned weakly supervised and object-matching based strategies. Given a query image containing the OOI, the approach only requires the user to provide label information for a few video frames. With the query image and the selected video frames, a novel MIL algorithm is derived with additional constraints on preserving the discriminating ability. This improves the detection performance over prior MIL-based video instance search approaches. Compared to prior object-matching based methods, the approach utilizes the query image and the input video itself, so there is no need to collect training data for the OOI. As a result, this self-training/detection strategy makes the proposed framework more suitable for practical use.
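To make the multiple instance setting concrete, here is a minimal MI-SVM-style sketch, not the q-MIL algorithm of [11]: each labeled frame is a bag of candidate regions, a positive bag is assumed to contain at least one region showing the OOI, and training alternates between fitting a classifier and re-selecting each positive bag's best "witness" region.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mil_train(bags, bag_labels, rounds=5):
    """bags: list of (n_i, d) arrays of region features; bag_labels: 0/1."""
    X = np.vstack(bags)
    # start by letting every region inherit its bag's label
    y = np.concatenate([np.full(len(b), l) for b, l in zip(bags, bag_labels)])
    clf = LinearSVC().fit(X, y)
    offsets = np.cumsum([0] + [len(b) for b in bags[:-1]])
    for _ in range(rounds):
        y = np.zeros(len(X))
        for off, b, l in zip(offsets, bags, bag_labels):
            if l == 1:                       # keep only the witness region
                y[off + np.argmax(clf.decision_function(b))] = 1
        clf = LinearSVC().fit(X, y)
    return clf       # score a frame by its maximum region score
```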
To evaluate the performance of the proposed method, two datasets are considered in the experiments: a collection of commercial videos, and the TRECVID dataset. The proposed q-MIL algorithm is compared with matching-based and weakly supervised instance search approaches. The experiments on these two video datasets verified the effectiveness and robustness of the approach, which was shown to outperform existing video summarization and object matching methods for video instance retrieval.
Case IV
Nowadays, online video propagation has surged to an exceptional level. Within the large video pool, videos about celebrities appear highly frequently and are closely followed by users because of the "Celebrity Effect". It is common for celebrities to be engaged in multiple domains. The challenges of realizing a personalized celebrity video search scheme lie in three aspects:
[i.] The various fields that a certain celebrity is involved in are not always clear and need to be explored.
[ii.] Users seldom explicitly provide their interest profiles, and interest-oriented preferences are not available at the topic level.
[iii.] How to connect user interest with celebrity popularity: generally, user interest and celebrity popularity are extracted in different spaces from heterogeneous data sources, so exploring the latent association of the two spaces is the key factor in solving the problem.
An Interest-Popularity Cross-space Mining based method is proposed to address the above problems [12]. The standard topic modeling method of Latent Dirichlet Allocation (LDA) is applied to extract the celebrity popularity distribution. On the user side, since an off-the-shelf user profile is unavailable or hardly informative, user interest is modeled from the user's online activities, and LDA is again utilized for user interest topic extraction. Given the derived heterogeneous popularity and interest spaces, a cross-space correlation method is introduced: semantic and contextual intra-word relations are refined by a random walk to bridge the interest and popularity spaces. The inputs include the celebrities' Wikipedia profiles and the users' uploaded and favorite videos with their associated tags; the output is the generated video ranking list. The framework contains three components: interest and popularity space construction, cross-space correlation, and video re-ranking. Video re-ranking is based on the joint probability distribution of the user, the celebrity, and the videos in the interest space. The main contributions of the authors are as follows:
[i.] The novel problem of personalized celebrity video search, exploiting user interest and celebrity popularity.
[ii.] A cross-space correlation method to connect heterogeneous spaces, which serves as a feasible solution to other cross-domain problems.
[iii.] With celebrity as a special case of a distributed query, one of the first attempts to address the query understanding challenge in personalized search.
The optimal performance is achieved at around λ = 0.2. This result confirms the significance of the random walk in updating the topic-word distribution. On the other hand, statistics on the tags of the user and celebrity spaces show an overlapping rate of around 70%, which in turn confirms the necessity of adopting the random walk. Promising experimental results have demonstrated the effectiveness of the method.
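For illustration, a minimal sketch of extracting topic distributions from tag documents with LDA follows (scikit-learn shown here). The tiny corpora, vocabulary, and ten-topic setting are assumptions, and the cross-space random walk of [12] is not reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical tag documents: one string per celebrity / per user
celebrity_docs = ["basketball finals mvp", "charity gala fashion"]
user_docs = ["fashion week runway", "playoffs dunk highlight"]

vec = CountVectorizer()
X = vec.fit_transform(celebrity_docs + user_docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topics = lda.fit_transform(X)                     # per-document topic mixtures
celebrity_topics = topics[:len(celebrity_docs)]   # popularity space
user_topics = topics[len(celebrity_docs):]        # interest space
```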
Case V
Effective utilization of semantic concept detectors for large-scale video search has recently become a topic of intensive study. Using semantic detectors for video search is known as one of the most interesting approaches to bridging the "semantic gap". In video search, given a user query, detectors that reflect the semantics of the query are chosen for query answering. Video segments which contain the desired concepts, as indicated by the detectors, are retrieved, ranked, and returned. The selected detectors basically serve as a bridge to narrow the gap between user semantics and features. This retrieval methodology is often referred to as concept-based video search. A recent simulation study predicts that a large pool of concept detectors, fewer than five thousand, with a mean average precision (MAP) of 10% can already achieve highly accurate search performance.
The open issues underlying such search are the appropriate mapping and fusion of detectors to queries. Generally, query-to-detector mapping involves semantic reasoning, domain knowledge, or spatio-temporal information. A subsequent interesting question after selection is how to use the set of ultimately selected detectors and the concepts they involve. These detectors are correlated: complementary through a parenthood relationship, supportive of one another, or contrasting with each other. Understanding and exploiting the nature of the detectors could ideally yield an effective fusion strategy, which is the main issue addressed. In essence, the selection and fusion of detectors requires support from multiple sources including ontology, statistics, and the reliability of concept detection. [13] considers the following aspects of semantic detectors for large-scale video search:
Semantics refers to the lexical relatedness between concepts. A common approach for semantic measurement between a query and a detector is the use of an ontology such as WordNet to reason about their hyponym (is-a) relationship (see the sketch after this list). Such semantic reasoning relies only on a local view of the sub-graph structure in which the concepts under investigation reside, and does not allow uniform measurement between concepts. A semantic space (SS) is therefore proposed: a vector space facilitating uniform mapping of queries to detectors. The semantic space is an orthogonal space which also emphasizes optimal coverage of the semantics built upon a vocabulary set.
Observability refers to the frequent occurrence of certain concepts in the desired domain, concepts which correlate with each other or even coexist in some unknown subspaces. This information is not directly observable from semantic reasoning. The authors propose building a vector space, namely the observability space (OS), to mine this information. While SS narrows the "semantic gap", OS addresses the bridging of the "observability gap". Traditional correlation measures capture only a local view of the correlation among a few concepts, thus prohibiting effective mining of correlated concepts in a global view. In contrast, OS offers a global and uniform view of how concepts co-occur in a vector space. In OS, given any two concepts, the subspace embedded by these concepts can be efficiently mined to infer useful detectors for video search.
Reliability refers to the robustness of detectors in terms of detection. Intuitively, only robust detectors should be considered for query answering, or at least the robustness of the detectors should be enhanced as much as possible before they are utilized for video search. To improve the reliability of a detector, OS is employed to uniformly determine a set of positively and negatively correlated detectors. The robustness of the original detector is then enhanced by jointly fusing it with the set of correlated detectors.
Diversity refers to the variety of detectors used in answering a query. Considering each group of detectors separately during fusion, instead of each detector individually, avoids the case where certain groups have more selected detectors and thus bias the final search result. There are various ways of inferring the diversity of detectors, for example by linguistic understanding of the query.
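As a small illustration of query-to-detector mapping by lexical relatedness, the following sketch ranks a hypothetical detector vocabulary against a query term using WordNet path similarity via NLTK. The detector pool is an assumption, and this local measure is exactly the kind of sub-graph reasoning that the semantic space of [13] is designed to go beyond.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

DETECTORS = ["car", "road", "building", "animal", "boat"]  # hypothetical pool

def relatedness(a, b):
    """Best path similarity over all noun-sense pairs (0 if no senses)."""
    pairs = [(s1, s2) for s1 in wn.synsets(a, "n") for s2 in wn.synsets(b, "n")]
    return max((s1.path_similarity(s2) or 0) for s1, s2 in pairs) if pairs else 0

def select_detectors(query, k=3):
    """Pick the k detectors lexically closest to the query term."""
    return sorted(DETECTORS, key=lambda d: -relatedness(query, d))[:k]

print(select_detectors("vehicle"))   # e.g. ['car', 'boat', ...]
```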
Based on these four aspects of detectors, a novel fusion strategy is proposed. The framework consists of two major parts: detector selection and multi-level fusion. In detector selection, detectors are selected separately from SS and OS. The detectors picked from SS, named anchor concept detectors, are semantically related to the query. The detectors selected from OS, called bridge concept detectors, bridge the observability gap of the anchor detectors in the subspace of OS. During detector fusion, different levels of fusion are required, each of which emphasizes one aspect of the semantic detectors. In reliability-based fusion, positively and negatively correlated detectors are further selected from OS to enhance the robustness of the anchor and bridge detectors. In observability-based fusion, the chance of co-occurrence among concepts is exploited by fusing bridge detectors with their nearest anchor detectors; this not only enhances the reliability but also enriches the observability of the anchor detectors. Finally, the set of anchor detectors is fused either semantically or after further analysis of their diversity: the semantic-based fusion is conducted using SS, while the diversity-based fusion involves combinations of detectors in both the semantic and observability spaces.
The experiments use datasets (TV05, TV06, TV07) from TRECVID 2005 to 2007 respectively, involving a total of 72 search queries. The outputs of three classifiers are combined into the detection score with average fusion. In the experiments, the retrieved items (shots) are ranked according to their scores from the selected concept detectors.
3. PROPOSED WORK
A number of issues and challenges exist in retrieving videos relevant to the user query, and different kinds of techniques have therefore been developed for improving the accuracy of search systems. In the presented work, a new model for video data search is given using the low-level features of segmented video frames. The method also incorporates a text annotation technique for supporting text-based query search. The basic conceptual model is therefore presented in two major modules. In the first module the training of the system is performed; the basic concept of training and the participating components are described in figure 1. In addition, the second module performs the retrieval of relevant videos from the database; figure 2 shows the components of the information retrieval system.
Figure 1 Training module
According to the given diagram (figure 1), the initial input of the system is a video object uploaded by the end user. A video is basically a collection of images and audio, but in the proposed technique the visual features are utilized for learning the video objects. Therefore, in the next step the video is fragmented using the FFmpeg API. This technique converts the entire video into a set of images, either according to a time parameter or in a random manner. The extracted images are then utilized in two different processes: first, image annotation is performed with user text or tag input, and next, the low-level features are computed from the extracted and tagged images. For the low-level features, the texture feature is estimated using LBP (local binary patterns), the edge feature using the edge histogram technique, and the color feature using the grid color moment technique. These features and tags are then normalized for storage in the database. The database contains the normalized low-level features, the associated tags, and the video name, which is identified using this annotation and these features.
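A minimal sketch of this training pipeline, under stated assumptions: frames are sampled with the FFmpeg command-line tool (one frame per second here), and the three low-level features are approximated with scikit-image, using an LBP histogram for texture, a Sobel-magnitude histogram as a stand-in for the edge histogram descriptor, and per-cell color means and standard deviations for the grid color moments. File paths, bin counts, and grid size are illustrative.

```python
import os
import subprocess
import numpy as np
from skimage import io, color, filters
from skimage.feature import local_binary_pattern

def extract_frames(video_path, out_dir="frames"):
    """Sample one frame per second with the ffmpeg CLI."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(["ffmpeg", "-i", video_path, "-vf", "fps=1",
                    f"{out_dir}/frame_%04d.png"], check=True)

def frame_features(png_path, grid=4):
    rgb = io.imread(png_path)
    gray = color.rgb2gray(rgb)
    # texture: histogram of uniform LBP codes
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    tex, _ = np.histogram(lbp, bins=10, density=True)
    # edges: histogram of Sobel gradient magnitudes (edge-histogram stand-in)
    edg, _ = np.histogram(filters.sobel(gray), bins=8, density=True)
    # color: per-cell channel means and standard deviations (grid color moments)
    h, w = gray.shape
    cells = [rgb[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
             for i in range(grid) for j in range(grid)]
    col = np.concatenate([[c.mean(axis=(0, 1)), c.std(axis=(0, 1))]
                          for c in cells]).ravel()
    return np.concatenate([tex, edg, col])   # normalized later for storage
```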
Next, the video retrieval process takes place, as demonstrated in figure 2. In this process both kinds of user query can be used: the end user can search for a video by an example image or by a text query, so provision is made for inputting both kinds of query. When the query input is an example image, the low-level features are extracted from the query example and the associated video is classified using the KNN classifier. Otherwise, when the query is text, the text tokens are searched against the associated tags of the videos and the search results are produced.
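A sketch of the query-by-example path, under stated assumptions: the stored frame features and their video labels are loaded from the database built during training (the `.npy` file names are hypothetical), and scikit-learn's k-nearest-neighbour classifier returns the most likely video for the query image's features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical database arrays: one feature row per tagged frame
stored_features = np.load("features.npy")   # shape (n_frames, n_dims)
video_labels = np.load("video_ids.npy")     # video name per frame

knn = KNeighborsClassifier(n_neighbors=5).fit(stored_features, video_labels)

def search_by_example(query_image_path):
    """Classify the query image's features to the nearest stored video."""
    q = frame_features(query_image_path)    # reuse the extractor sketched above
    return knn.predict([q])[0]              # best-matching video name
```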
Figure 2 Retrieval module

The proposed working model has been described in this section for efficient and accurate video retrieval. In the near future the proposed model will be implemented and its performance measured.

4. CONCLUSION
Information retrieval is a classical domain of research, and IR techniques depend directly upon the data and its format. In this paper the video information retrieval techniques are surveyed, and different kinds of retrieval techniques and aspects are reviewed. In addition, to find an appropriate technique for video retrieval, the recently developed techniques are also discussed. Finally, a new technique for improving video retrieval performance, based on text annotation and classification, is proposed. For the classification, low-level features are computed from the video frames and used with the classifier to approximate the videos nearest to the user's requirement. For effectiveness of the system implementation, the proposed technique incorporates both query by example and query by text. In the near future the proposed technique will be implemented using Java technology and its performance published.
REFERENCES
[1] Winston H. Hsu, Lyndon S. Kennedy, Shih-Fu Chang, "Video Search Reranking via Information Bottleneck Principle", in MM'06, October 23–27, 2006.
[2] J. G. Carbonell et al., "Translingual information retrieval: A comparative evaluation", in International Joint Conference on Artificial Intelligence, 1997.
[3] T.-C. Chang et al., "TRECVID 2004 search and feature extraction task by NUS PRIS", in TRECVID Workshop, Washington DC, 2004.
[4] TRECVID, "TREC Video Retrieval Evaluation", http://www-nlpir.nist.gov/projects/trecvid/.
[5] R. Yan, A. Hauptmann, and R. Jin, "Multimedia search with pseudo-relevance feedback", in CIVR, Urbana-Champaign, IL, 2003.
[6] Winston H. Hsu, Lyndon S. Kennedy, Shih-Fu Chang, "Video Search Reranking through Random Walk over Document-Level Context Graph", in MM'07, September 23–28, 2007.
[7] W. H. Hsu and S.-F. Chang, "Topic tracking across broadcast news videos with visual duplicates and semantic concepts", in International Conference on Image Processing (ICIP), Atlanta, GA, USA, 2006.
[8] YouTube. [Online]. Available: http://www.youtube.com
[9] X.-S. Hua, T. Mei, and A. Hanjalic, Online Multimedia Advertising: Techniques and Technologies. Hershey, PA, USA: IGI Global, 2011.
[10] T.-C. Lin, J.-H. Kao, C.-T. Liu, C.-Y. Tsai, and Y.-C. F. Wang, "Video instance search for embedded marketing", in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., Dec. 2012, pp. 1–4.
[11] Ting-Chu Lin, Min-Chun Yang, Chia-Yin Tsai, and Yu-Chiang Frank Wang, "Query-Adaptive Multiple Instance Learning for Video Instance Retrieval", IEEE Transactions on Image Processing, vol. 24, no. 4, April 2015.
[12] Zhengyu Deng, Jitao Sang, and Changsheng Xu, "Personalized Celebrity Video Search Based on Cross-Space Mining", in L. Weisi et al. (Eds.): PCM 2012, LNCS 7674, pp. 455–463, Springer-Verlag Berlin Heidelberg, 2012.
[13] Xiao-Yong Wei, Chong-Wah Ngo, "Fusing Semantics, Observability, Reliability and Diversity of Concept Detectors for Video Search", in MM'08, October 26–31, 2008.