A Survey and Proposal on Video Data Retrieval Technique

Aniket Sugandhi¹, Deepshikha Sharma²
¹Research Scholar, Department of Information Technology, SVITS, Indore, India
²Assistant Professor, Department of Information Technology, SVITS, Indore, India

Abstract— As computational technologies grow, the need for digital data also increases. In large data sources, most of the effort in using the required data is consumed by the retrieval process; data retrieval is the process of finding relevant information in a given data source. A number of techniques are available for accurate retrieval of text and image data, but far fewer efforts address accurate retrieval for video search. In this paper a review of the available video data retrieval processes is performed. On the basis of the concluded outcomes, a suitable data model is proposed for improving search relevancy in video data retrieval systems.

Keywords— Improvement of Methods, Information Retrieval, Review, Search Relevancy, Video Data Analysis

1. INTRODUCTION

Data retrieval is the process by which data is selected and extracted from a file or a database. To retrieve the desired data, the user inputs a set of criteria through a query, for example in Structured Query Language (SQL). When the data is not available in a structured format, however, retrieval becomes complicated. Video is a widely used data format in computer technology and one of the most complex media forms, containing a large amount of information and imagery, but its unstructured format and non-transparent content make it difficult to search and manage. Retrieving desired video clips quickly and accurately from a large video database is one of the challenging tasks in developing video databases. Earlier video-based information retrieval systems found videos based on keywords, but manual treatment of data increases time and effort; additionally, the text associated with a video does not describe the entire video content and can produce a large number of poor results.

Improvements in multimedia storage technology deal with the extraction of implicit knowledge, multimedia data relationships, or other patterns not explicitly stored. The storage and management of multimedia data is one of the crucial tasks in data mining due to the non-structured nature of the data, and handling data with complex structure, such as video and audio, is a challenging task. Most people can access a tremendous number of videos from the internet, so there are rich video-based applications in many areas including business, security and surveillance, medicine, news, education, and entertainment. Video data contains various kinds of data such as video, audio and text; a video consists of recurrent images carrying temporal information. Video content may be divided into three categories:

[i.] Low-level feature information, which includes features such as color, texture, and shape (a small example of such a feature follows this list);
[ii.] Syntactic information, which describes the contents of the video, including salient objects, their spatial-temporal positions, and the spatial-temporal relations between them; and
[iii.] Semantic information, which describes what happens in the video along with what is perceived by the users.
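To make category [i.] concrete, the sketch below computes one classic low-level color feature, a quantized joint color histogram, from a single frame. This is an illustrative example rather than a method from the surveyed papers; the frame is assumed to be an RGB array, and the function name and bin count are arbitrary choices.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame (H x W x 3, uint8) into a joint color
    histogram, one simple low-level feature of type [i.]."""
    # Map each 0-255 channel value to one of `bins` levels.
    quantized = (frame.astype(np.int32) * bins) // 256
    # Combine the three channel levels into a single bin index (0..bins^3 - 1).
    index = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(index.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()  # normalize so frames of any size compare

# Toy usage: a random 64x64 "frame".
frame = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(color_histogram(frame).shape)  # (512,)
```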
The semantic information used to identify video events has two important aspects:

[i.] A spatial aspect, presented by a video frame, such as the location, objects and characters displayed in the frame;
[ii.] A temporal aspect, presented by a sequence of video frames in time, such as the characters' actions and the objects' movements across the sequence.

The higher-level semantic information of a video is extracted by examining the features of the audio, video and text. Additionally, other features are exploited to capture the semantics, bridging the gap between the high-level semantics and the low-level features. Three modalities are identified within a video:

[i.] The visual modality, containing the scene that can be seen in the video;
[ii.] The auditory modality, with the speech, music, and environmental sounds that can be heard along with the video;
[iii.] The textual modality, comprising the textual resources that describe the content of the video document.

Video databases are huge and video data sets are extremely large in size. There are many tools for managing and searching within such collections, but tools are needed to extract the hidden knowledge within video data for applications. Some issues regarding data retrieval are:

[i.] Poor-quality data: noisy data, missing values, inadequate size, and poor data sampling.
[ii.] Lack of understanding and lack of diffusion of data retrieval techniques.
[iii.] Data variety: accommodating data that comes from different sources and in a variety of different forms (images, geo data, text, social, numeric, etc.).
[iv.] Data velocity: online machine learning requires models to be constantly updated with new, incoming data.
[v.] Dealing with huge datasets, or "Big Data", that require distributed approaches.
[vi.] Coming up with the right question or problem: "More data beats the better algorithm, but smarter questions beat more data."
[vii.] Remaining objective and allowing the data to lead you, not the opposite.

2. LITERATURE SURVEY

This section provides a study of recently developed video data retrieval techniques. The study of these techniques provides guidelines for improving the performance of video data retrieval.

Case I

Current semantic video search approaches usually build upon text search against the text associated with the video content. The additional use of features such as image content, audio, face detection, and high-level concepts has been shown to improve upon text-based video searching. However, these multimodal systems achieve most of their improvement by supporting multiple query examples, applying specific semantic concept detectors, or developing highly tuned retrieval models for specific domains. It is difficult for users of multimodal search systems to acquire example images for their queries; retrieval by matching semantic concepts, though promising, strongly depends on the availability of robust detectors and the required training data; and it is difficult for developers to build highly tuned models and apply the system to different domains. Clearly, approaches are needed that leverage the available multimodal techniques in search without complicating the user's task [1]. A minimal version of the underlying text-search baseline is sketched below.
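As a point of reference for Case I, the following is a minimal sketch of the kind of text-search baseline such systems build upon: TF-IDF vectors over the text associated with each shot, ranked by cosine similarity. The shot texts and query are invented for illustration, and scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical text associated with three video shots (titles, tags, ASR).
shot_text = [
    "president speech white house press conference",
    "basketball game final quarter highlights",
    "white house tour rose garden",
]

vectorizer = TfidfVectorizer()
shot_vectors = vectorizer.fit_transform(shot_text)

query = "white house"
query_vector = vectorizer.transform([query])

# Rank shots by cosine similarity between the query and each shot's text.
scores = cosine_similarity(query_vector, shot_vectors).ravel()
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])
```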
Pseudo-relevance feedback (PRF) is a tool used to extend text search results in text and video retrieval. PRF was initially introduced in [2], where the top-ranked documents are used to re-rank the retrieved documents; in that context users explicitly provide feedback by labeling top results as positive or negative. The same concept has been applied to video retrieval. In [3], the authors used textual information in the top-ranking shots to obtain additional keywords with which to retrieve and re-rank; the experiment improved mean average precision (MAP) from 0.120 to 0.124 in the TRECVID 2004 video search task [4]. In [5], the authors sampled pseudo-negative images from the lowest ranks of the initial query results, took the query videos and images as positive examples, and formulated retrieval as a classification problem, improving search performance from MAP 0.105 to 0.112 in TRECVID 2003.

To avoid the problems of example-based approaches and highly tuned models, the goal in [1] is to utilize both pseudo-positive and pseudo-negative examples and learn the recurrent relevant visual patterns from estimated "soft" pseudo-labels instead of "hard" pseudo-labels. The probabilistic relevance score of each shot is smoothed over the entire raw search result through a kernel density estimation process. An information-theoretic approach is then applied to cluster visual patterns of similar semantics; the clusters are reordered by cluster relevance, and the images within the same cluster are ranked by their feature density. To balance re-ranking performance and efficiency, experiments were conducted with different parameters of the information bottleneck (IB) ranking. This re-ranking approach is highly generic and requires no training in advance, yet is comparable to the top automatic and manual video search systems. The authors thus propose a novel re-ranking process for video search that requires no image search examples and is based on a rigorous IB principle. Evaluated on the TRECVID 2003–2005 data sets, the approach improves the text search baseline across different topics by up to 23% in terms of average performance. A simplified stand-in for the soft pseudo-label smoothing step is sketched below.
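The following sketch illustrates the soft pseudo-label idea of Case I in a drastically simplified form: top-ranked shots act as pseudo-positives, bottom-ranked shots as pseudo-negatives, and each shot's relevance is kernel-smoothed over the visual feature space. It is a Nadaraya-Watson style stand-in for the paper's kernel density estimation, omits the information bottleneck clustering entirely, and all parameter values are illustrative.

```python
import numpy as np

def prf_rerank(features, text_scores, n_pos=10, n_neg=10, bandwidth=1.0):
    """Pseudo-relevance-feedback re-ranking sketch (after Case I).
    Top text-search results act as soft pseudo-positives, bottom
    results as pseudo-negatives; each shot's relevance is then
    smoothed over the visual feature space with a Gaussian kernel."""
    order = np.argsort(text_scores)[::-1]
    # Soft pseudo-labels: 1.0 for top-ranked shots, 0.0 for bottom-ranked.
    labels = {i: 1.0 for i in order[:n_pos]}
    labels.update({i: 0.0 for i in order[-n_neg:]})
    anchors = np.array(list(labels.keys()))
    anchor_labels = np.array([labels[i] for i in anchors])

    # Kernel smoothing: weight each anchor by visual similarity to the shot.
    diffs = features[:, None, :] - features[anchors][None, :, :]
    weights = np.exp(-np.sum(diffs ** 2, axis=2) / (2 * bandwidth ** 2))
    smoothed = (weights * anchor_labels).sum(axis=1) / weights.sum(axis=1)

    # Blend the smoothed visual relevance with the normalized text scores.
    norm_text = (text_scores - text_scores.min()) / (np.ptp(text_scores) + 1e-12)
    return 0.5 * smoothed + 0.5 * norm_text

# Toy usage: 100 shots with 16-dimensional visual features.
rng = np.random.default_rng(0)
reranked = prf_rerank(rng.normal(size=(100, 16)), rng.random(100))
```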
Case II

Recent video search approaches are mostly restricted to text-based solutions that process keyword queries against text tokens associated with the video content. However, textual information does not necessarily come with image or video sets. The use of other modalities such as image content, audio, face detection, and high-level concept detection has been employed to improve upon text-based video search systems. Such multimodal systems improve search performance by introducing multiple query example images, specific semantic concept detectors, or highly tuned retrieval models for specific types of queries. Additionally, it remains an open issue whether concept models for different classes of queries can be developed and proven effective across multiple domains. Therefore, the incorporation of multimodal search methods should be as transparent and non-intrusive as possible, in order to keep the simple search mechanism preferred by typical users today.

Based on the above observations, [6] conducts semantic video search in a re-ranking manner that automatically re-ranks the initial text search results based on the "recurrent patterns" or "contextual patterns" among videos in the initial search results. Such patterns exist because of common semantic topics shared between different videos, and common visual content is exploited as well: visual duplicates provide strong links between broadcast news videos or web news articles and help cross-domain information exploitation. In [7], the authors analyzed the frequency of such recurrent patterns for cross-language topic tracking. The idea, then, is to use multimodal similarities between semantic video documents as basic contextual patterns and to develop a new re-ranking method, context re-ranking. The multimodal contextual links are used to connect otherwise missing stories to the initial text queries and thereby improve search accuracy. Because there are few explicit links between video stories, context re-ranking is formulated as a random walk over a context graph whose nodes represent documents in the search set; the nodes are connected by edges weighted with the pair-wise multimodal contextual similarities. The stationary probability of the random walk is used to compute the final scores of the videos after re-ranking, and the walk is biased with a preference towards the stories with higher initial text search scores. The contextual re-ranking method is inspired by page ranking techniques. The authors also investigate the optimal weights for combining the text search scores and the multimodal re-ranking for different query classes. Experiments show that the approach can improve the baseline text retrieval by up to 32%, a remarkable result given that no additional advanced methods are used. Furthermore, for people-related queries, which usually have recurrent coverage across news sources, a 40% relative improvement in story-level MAP is recorded. Through parameter sensitivity tests, the authors also discovered that the optimal text-versus-visual weight ratio for re-ranking the initial text search results is 0.15 to 0.85. A minimal version of this random walk is sketched below.
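A minimal version of the context re-ranking random walk of Case II might look as follows, assuming a precomputed matrix of pair-wise multimodal story similarities. The restart weight loosely echoes the paper's reported 0.15/0.85 text-versus-visual ratio, but the implementation is an illustrative personalized-PageRank sketch, not the authors' code.

```python
import numpy as np

def context_rerank(similarity, text_scores, alpha=0.85, iters=100):
    """Context re-ranking sketch (after Case II): a random walk over the
    document-level context graph, biased toward stories with higher
    initial text-search scores (personalized-PageRank style)."""
    # Row-normalize pairwise multimodal similarities into transition probabilities.
    P = similarity / similarity.sum(axis=1, keepdims=True)
    # Preference (restart) vector derived from the initial text scores.
    pref = text_scores / text_scores.sum()
    x = pref.copy()
    for _ in range(iters):
        # With probability alpha follow the context graph, else restart at pref.
        x = alpha * (P.T @ x) + (1 - alpha) * pref
    return x  # stationary probabilities = re-ranked relevance scores

# Toy usage: 4 stories with symmetric multimodal similarities.
sim = np.array([[1.0, 0.8, 0.1, 0.1], [0.8, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.7], [0.1, 0.1, 0.7, 1.0]])
scores = np.array([0.9, 0.1, 0.5, 0.1])
print(context_rerank(sim, scores))
```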
Case III

The amount of online video has exploded in the past decade due to advances in internet and multimedia technologies; on YouTube [8], for example, over 72 hours of video are uploaded every minute. However, video search and retrieval mostly depend on tags annotated by users, which are not always available or relevant to the video contents. It is therefore still very challenging to search for the specific video segments containing the objects of interest (OOI), a task important to several video-based applications including embedded marketing [9], [10]. In practice, content providers do not always know exactly when particular items are embedded in their videos; such information, even if available in the scripts, might not be successfully transferred to the TV stations or online video platforms that broadcast the videos, and for old movies or TV series it is obviously unavailable as well. Without the ability to automatically search for the OOI across video frames based on a query input, it will be difficult for advertisers or video deliverers to design and provide interaction services in future smart digital TV applications.

In [11], the authors present a novel learning framework based on multiple instance learning (MIL) that integrates the weakly supervised and object-matching based strategies mentioned above. Given a query image containing the OOI, the approach only requires the user to provide label information for a few video frames. With the query image and the selected video frames, a novel MIL algorithm with additional constraints preserves the discriminating ability, improving detection performance over prior MIL-based video instance search approaches. Compared to prior object-matching based methods, the approach utilizes only the query image and the input video itself, so there is no need to collect training data for the OOI; this self-training/detection strategy makes the framework preferable for practical use. To evaluate the proposed method, two datasets are considered: a collection of commercial videos and the TRECVID dataset. The proposed q-MIL algorithm is compared with matching-based and weakly supervised instance search approaches; the experiments on these two video datasets verify the effectiveness and robustness of the approach, which outperforms existing video summarization and object matching methods for video instance retrieval. A bare-bones MIL alternation is sketched below.
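To show the flavor of multiple instance learning used in Case III, the sketch below implements a bare-bones mi-SVM-style alternation: instances inherit their bag's label, a classifier is trained, and instance labels inside positive bags are re-estimated while keeping at least one positive per positive bag. This is not the q-MIL algorithm of [11], which adds discriminability-preserving constraints; the feature extraction and all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mil_train(bags, bag_labels, iters=5):
    """Minimal multiple-instance learning sketch (after Case III).
    Each bag is an array of instance features (e.g., region descriptors
    from one video frame); a bag is positive if ANY instance shows the
    object of interest."""
    # Start: every instance inherits its bag's label.
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), l) for b, l in zip(bags, bag_labels)])
    clf = LogisticRegression(max_iter=1000)
    for _ in range(iters):
        clf.fit(X, y)
        # Re-label instances inside positive bags, keeping at least one
        # positive instance per positive bag (the MIL constraint).
        start = 0
        for b, l in zip(bags, bag_labels):
            if l == 1:
                scores = clf.decision_function(b)
                pred = (scores > 0).astype(int)
                if pred.sum() == 0:
                    pred[np.argmax(scores)] = 1  # enforce the constraint
                y[start:start + len(b)] = pred
            start += len(b)
    return clf

# Toy usage: positive bags contain the OOI somewhere, negative bags do not.
rng = np.random.default_rng(1)
pos_bags = [rng.normal(1.0, 1.0, (5, 4)) for _ in range(3)]
neg_bags = [rng.normal(-1.0, 1.0, (5, 4)) for _ in range(3)]
model = mil_train(pos_bags + neg_bags, [1] * 3 + [0] * 3)
```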
Case IV

Online video propagation has surged to an exceptional level. Within the large video pool, videos about celebrities appear very frequently and are closely followed by users because of the "celebrity effect", and celebrities are commonly engaged in multiple domains. The challenges of realizing a personalized celebrity video search scheme lie in three aspects:

[i.] The various fields in which a certain celebrity is involved are not always clear and need to be explored.
[ii.] Users seldom explicitly provide their interest profiles, and interest-oriented preferences are not available at the topic level.
[iii.] User interest must be connected with celebrity popularity. Generally, the two are extracted in different spaces from heterogeneous data sources, so exploring the latent association between the two spaces is the key to solving the problem.

An interest-popularity cross-space mining method is proposed to address these problems [12]. The standard topic modeling method of Latent Dirichlet Allocation (LDA) is applied to extract the celebrity popularity distribution. On the user side, since an off-the-shelf user profile is unavailable or hardly informative, user interest is derived from the user's online activities, and LDA is again utilized for user interest topic extraction. Given the derived heterogeneous popularity and interest spaces, a cross-space correlation method is introduced: semantic and contextual intra-word relations are refined by a random walk to bridge the interest and popularity spaces. The inputs include the celebrities' Wikipedia profiles and the users' uploaded and favorite videos with their associated tags; the output is the generated video ranking list. The framework contains three components: interest and popularity space construction, cross-space correlation, and video re-ranking, where re-ranking is based on the joint probability distribution of user, celebrity and videos in the interest space. The main contributions of the authors are: the novel problem of personalized celebrity video search, exploiting user interest and celebrity popularity; a cross-space correlation method to connect heterogeneous spaces, which also serves as a feasible solution to other cross-domain problems; and, with celebrity as a special case of a distributed query, one of the first attempts to address the query understanding challenge in personalized search. The optimal performance is achieved at around λ = 0.2, which confirms the significance of the random walk in updating the topic-word distribution. The statistics on the tags of the user and celebrity spaces show an overlapping rate of around 70%, which in turn confirms the necessity of adopting the random walk. Promising experimental results demonstrate the effectiveness of the method. The LDA step is sketched below.
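The LDA step of Case IV can be sketched with scikit-learn as follows: topic distributions are extracted for a celebrity's profile text and for a user's video tags, and cross-space correlation is reduced here to a cosine similarity between the two topic vectors (the paper instead refines intra-word relations by random walk). All documents, tag strings, and the topic count are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents: a celebrity's Wikipedia profile vs. a
# user's uploaded-video tags (the two heterogeneous data sources).
celebrity_docs = ["film actor drama award ceremony charity foundation"]
user_docs = ["favorite drama movie clips charity marathon running"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(celebrity_docs + user_docs)

# LDA gives each document a distribution over latent topics, i.e. the
# "popularity" and "interest" spaces of Case IV.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topics = lda.fit_transform(counts)

# Cross-space correlation reduced to cosine similarity of topic vectors.
a, b = topics[0], topics[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```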
Case V

The effective utilization of semantic concept detectors for large-scale video search has recently become a topic of intensive study. Using semantic detectors for video search is known as one of the most interesting approaches to bridging the "semantic gap". Given a user query, detectors that reflect the semantics of the query are chosen for query answering; video segments that contain the desired concepts, as indicated by the detectors, are retrieved, ranked and returned. The selected detectors serve as a bridge narrowing the gap between user semantics and low-level features, a retrieval methodology often referred to as concept-based video search. A recent simulation study predicts that a pool of fewer than five thousand concept detectors with a mean average precision (MAP) of 10% can already achieve high search accuracy. The open issues underlying such search are the appropriate mapping and fusion of detectors to queries. Generally, query-to-detector mapping involves semantic reasoning, domain knowledge, or spatio-temporal information. A subsequent question after selection is how to use the ultimately selected set of detectors: the detectors are correlated, whether complementarily through parenthood relationships, supportively of one another, or in contrast to each other. Understanding and exploiting the nature of the detectors could ideally yield an effective fusion strategy, which is the main issue addressed. In essence, the selection and fusion of detectors requires support from multiple sources including ontology, statistics, and the reliability of concept detection. [13] considers the following aspects of semantic detectors for large-scale video search:

Semantics refers to the lexical relatedness between concepts. A common approach to measuring the semantics between a query and a detector is the use of an ontology such as WordNet to reason about their hyponym (is-a) relationship. Such reasoning relies only on a local view of the sub-graph structure in which the concepts under investigation reside and does not allow a uniform measurement of concepts. A semantic space (SS) is therefore built: a vector space facilitating the uniform mapping of queries to detectors. The semantic space is an orthogonal space that also emphasizes optimal coverage of the semantics built upon a vocabulary set.

Observability refers to the frequent occurrence of certain concepts in the desired domain, concepts that correlate with each other or even coexist in some unknown subspaces. This information is not directly observable from semantic reasoning, so the authors propose building a vector space, the observability space (OS), to mine it. While SS narrows the "semantic gap", OS addresses the bridging of the "observability gap". Traditional correlation measures take only a local view of the correlation among a few concepts, prohibiting effective mining of correlated concepts in a global view; in contrast, OS offers a global and uniform view of how concepts co-occur in a vector space. In OS, given any two concepts, the subspace embedded by these concepts can be efficiently mined to infer useful detectors for video search.

Reliability refers to the robustness of detectors. Intuitively, only robust detectors should be considered for query answering, or at least the robustness of the detectors should be enhanced as much as possible before they are utilized for video search. To improve the reliability of a detector, OS is employed to uniformly determine the set of positively and negatively correlated detectors, and the robustness of the original detector is enhanced by jointly fusing it with this set.

Diversity refers to the variety of detectors used in answering a query. Considering each group of detectors separately during fusion, instead of each detector individually, avoids the case in which certain groups have more selected detectors and thus bias the final search result. There are various ways of inferring the diversity of detectors, for example by linguistic understanding of a query.

Based on these four aspects, a novel fusion strategy is proposed. The framework consists of two major parts: detector selection and multi-level fusion. In detector selection, detectors are selected separately from SS and OS: the detectors picked from SS, named anchor concept detectors, are semantically related to the query, while the detectors selected from OS, called bridge concept detectors, bridge the observability gap of the anchor detectors in the subspace of OS. During detector fusion, different levels of fusion are applied, each emphasizing one aspect of the semantic detectors. In reliability-based fusion, positively and negatively correlated detectors are further selected from OS to enhance the robustness of the anchor and bridge detectors. In observability-based fusion, the co-occurrence of concepts is exploited by fusing bridge detectors with their nearest anchor detectors, which not only enhances the reliability but also enriches the observability of the anchor detectors. Finally, the set of anchor detectors is fused either semantically or after further analysis of their diversity; semantic-based fusion is conducted using SS, while diversity-based fusion combines detectors in both the semantic and observability spaces. Experiments are conducted using the TRECVID 2005-2007 datasets (TV05, TV06, TV07), involving a total of 72 search queries. The outputs of three classifiers are combined into the detection score with average fusion, and the retrieved items (shots) are ranked according to their scores against the selected concept detectors. A toy version of the selection-and-fusion step is sketched below.
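A toy rendering of Case V's detector selection and fusion: queries and detectors live in a shared vector space (a stand-in for the paper's semantic space SS), the most related detectors are selected as anchors, and their per-shot scores are fused with softmax similarity weights. The embeddings here are random placeholders, and the paper's observability, reliability, and diversity fusion levels are omitted.

```python
import numpy as np

def concept_based_search(query_vec, detector_vecs, detector_scores, top_k=3):
    """Pick the detectors most semantically related to the query, then
    fuse their per-shot scores with softmax similarity weights."""
    q = query_vec / np.linalg.norm(query_vec)
    D = detector_vecs / np.linalg.norm(detector_vecs, axis=1, keepdims=True)
    sims = D @ q                              # cosine similarity per detector
    picked = np.argsort(sims)[::-1][:top_k]   # "anchor" detectors
    w = np.exp(sims[picked])
    w /= w.sum()                              # softmax fusion weights
    return (detector_scores[picked] * w[:, None]).sum(axis=0)

# Toy usage: 5 detectors scoring 4 shots, 8-dimensional concept embeddings.
rng = np.random.default_rng(0)
fused = concept_based_search(rng.normal(size=8),
                             rng.normal(size=(5, 8)),
                             rng.random((5, 4)))
print(fused)  # one fused relevance score per shot
```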
3. PROPOSED WORK

A number of issues and challenges exist in retrieving videos relevant to a user query, and different kinds of techniques have been developed to improve the accuracy of search systems. In the presented work, a new model for video data search is proposed using the low-level features of segmented video frames. The method also incorporates a text annotation technique to support text-based query search. The basic conceptual model is therefore presented in two major modules: in the first module the training of the system is performed (the basic concept of training and the participating components are described in figure 1), and the second module performs the retrieval of relevant videos from the database (figure 2 shows the components of the information retrieval system).

Figure 1: Training module

According to figure 1, the initial input to the system is a video object uploaded by the end user. A video is basically a collection of images and audio, but in the proposed technique the visual features are utilized for learning the video objects. Therefore, in the next step the video is fragmented using the FFmpeg API, which converts the entire video into a set of images according to a time parameter or in a random manner. These extracted images are then utilized in two processes: first, image annotation is performed through user text or tag input, and second, low-level features are computed from the extracted and tagged images. For the low-level features, the texture feature is estimated using LBP (local binary patterns), the edge feature using the edge histogram technique, and the color feature using the grid color moment technique. These features and tags are then normalized for storage in the database, which contains the normalized low-level features, the associated tags, and the video names identified through this annotation and these features.

In the next step the video retrieval process takes place, as demonstrated in figure 2. In this process both kinds of user query can be used: the user can search for a video by an example image or by a text query, so provision is made for both kinds of input. When the query input is an example image, the low-level features are extracted from the query example and the associated video is classified using a KNN classifier; when the query is text, the text tokens are searched against the associated tags of the videos and the search results are produced.

Figure 2: Retrieval module

The proposed working model for efficient and accurate video retrieval has been described in this section. In the near future the proposed model will be implemented and its performance measured; a minimal sketch of the LBP-plus-KNN query-by-example step appears below.
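Under the assumption that frames have already been dumped with FFmpeg (for example, `ffmpeg -i input.mp4 -vf fps=1 frame_%04d.png` extracts one frame per second), the sketch below covers the texture half of the proposed pipeline: an LBP histogram per frame and KNN classification for query by example. The edge histogram and grid color moment features, the tag index, and the normalization step are omitted; video names and frame sizes are placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def lbp_histogram(gray):
    """8-neighbour local binary pattern histogram of a grayscale frame,
    the texture component of the proposed feature vector."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                      # centre pixels
    code = np.zeros_like(c)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        # Compare each neighbour against the centre and set one bit.
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (neighbour >= c).astype(np.int32) << bit
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# Training: one LBP vector per extracted frame, labelled by video name.
frames = [np.random.randint(0, 256, (32, 32)) for _ in range(6)]
labels = ["video_a"] * 3 + ["video_b"] * 3   # hypothetical annotations
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit([lbp_histogram(f) for f in frames], labels)

# Query by example: classify the query image's features to a video.
query = np.random.randint(0, 256, (32, 32))
print(knn.predict([lbp_histogram(query)]))
```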
4. CONCLUSION

Information retrieval is a classical domain of research, and IR techniques depend directly upon the data and its formats. In this paper the video information retrieval techniques have been surveyed and different kinds of retrieval techniques and aspects reviewed. In addition, to identify an appropriate technique for video retrieval, the recently developed techniques were also discussed. Finally, a new technique based on text annotation and classification is proposed for improving video retrieval performance. For the classification, low-level features are computed from video frames and used with the classifier to approximate the videos nearest to the user's requirement. For an effective implementation of the system, the proposed technique incorporates both query by example and query by text. In the near future the proposed technique will be implemented using Java technology and its performance published.

REFERENCES

[1] Winston H. Hsu, Lyndon S. Kennedy, and Shih-Fu Chang, "Video Search Reranking via Information Bottleneck Principle", in MM'06, October 23–27, 2006.
[2] J. G. Carbonell et al., "Translingual Information Retrieval: A Comparative Evaluation", in International Joint Conference on Artificial Intelligence, 1997.
[3] T.-C. Chang et al., "TRECVID 2004 Search and Feature Extraction Task by NUS PRIS", in TRECVID Workshop, Washington DC, 2004.
[4] TRECVID, "TREC Video Retrieval Evaluation", http://www-nlpir.nist.gov/projects/trecvid/.
[5] R. Yan, A. Hauptmann, and R. Jin, "Multimedia Search with Pseudo-Relevance Feedback", in CIVR, Urbana-Champaign, IL, 2003.
[6] Winston H. Hsu, Lyndon S. Kennedy, and Shih-Fu Chang, "Video Search Reranking through Random Walk over Document-Level Context Graph", in MM'07, September 23–28, 2007.
[7] W. H. Hsu and S.-F. Chang, "Topic Tracking across Broadcast News Videos with Visual Duplicates and Semantic Concepts", in International Conference on Image Processing (ICIP), Atlanta, GA, USA, 2006.
[8] YouTube. [Online]. Available: http://www.youtube.com
[9] X.-S. Hua, T. Mei, and A. Hanjalic, Online Multimedia Advertising: Techniques and Technologies. Hershey, PA, USA: IGI Global, 2011.
[10] T.-C. Lin, J.-H. Kao, C.-T. Liu, C.-Y. Tsai, and Y.-C. F. Wang, "Video Instance Search for Embedded Marketing", in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf., Dec. 2012, pp. 1–4.
[11] Ting-Chu Lin, Min-Chun Yang, Chia-Yin Tsai, and Yu-Chiang Frank Wang, "Query-Adaptive Multiple Instance Learning for Video Instance Retrieval", IEEE Transactions on Image Processing, vol. 24, no. 4, April 2015.
[12] Zhengyu Deng, Jitao Sang, and Changsheng Xu, "Personalized Celebrity Video Search Based on Cross-Space Mining", in W. Lin et al. (Eds.): PCM 2012, LNCS 7674, pp. 455–463, Springer-Verlag Berlin Heidelberg, 2012.
[13] Xiao-Yong Wei and Chong-Wah Ngo, "Fusing Semantics, Observability, Reliability and Diversity of Concept Detectors for Video Search", in MM'08, October 26–31, 2008.