Abbass S. Abbass, Aliaa A. A. Youssif, Atef Z. Ghalwash
Faculty of Computers and Information, Helwan University, Cairo, Egypt
Abstract

With the development of media technologies, more and more videos are digitally produced, stored and distributed. Video content is shared on the internet using hosting services like YouTube, which has stated that material is uploaded to its servers at a rate of up to 20 hours of video per minute. Therefore, a video fingerprinting system to automatically identify uploaded videos is sorely needed. Illegal online distribution of copyrighted videos is a huge problem, especially for commercial businesses. With digital processing tools, videos can be transformed into different versions and distributed on the internet. The need for identification and management of video content grows proportionally with the widespread availability of digital videos. Fingerprints are compact content-based signatures that summarize a video signal or another media signal. Several video fingerprinting methods have been proposed for identifying video, in which fingerprints are extracted by analyzing the video in both the spatial and temporal dimensions.
However, these conventional methods have one thing in common: video decompression is still required for extracting the fingerprint from a compressed video. In practice, faster computation can be achieved if the fingerprint is extracted directly in the compressed domain. So far, few methods for video fingerprinting in the compressed domain have been proposed. This paper presents a simple but effective video fingerprinting technique that works directly in the compressed domain. Experimental results show that the proposed fingerprints are highly robust against most signal processing transformations.
Keywords: video fingerprinting, compressed domain, perceptual hash.
1. Introduction

Due to the advances in compression technology, the expansion of low-cost storage media, and the growth of the internet, the quantity of digitally available video has grown enormously. Some of the potential applications where digital video is used are multimedia information systems, digital libraries, the movie industry, and Video-on-Demand services. In all these services digital video needs to be properly processed and inserted into a video server for future access. These processes include compressing, segmenting and indexing a video sequence. The users can then query the videos by an example image or several chosen features, and the system compares this query with various segments of the video [1, 2].
Content-based video search and retrieval has attracted strong research interest. According to different user requirements, there are different kinds of video search intentions, including the detection of duplicate/near-duplicate entries of a given clip, the identification of recurrent instances of a given clip, and similar-content search in a sense of coarse or clear semantic meanings. Applications comprise video copyright management, video content identification and content-based video retrieval.
Content-Based Copy Detection (CBCD) schemes are an alternative to the watermarking approach for persistent identification of images and video clips [4, 5]. As opposed to watermarking, the CBCD approach only uses a content-based comparison between the original and the candidate objects. For storage and computational considerations, it generally consists of extracting as few features as possible from the candidate objects and matching them with a database (DB). Since the features must be discriminative enough to identify an image or a video, they are often called fingerprints or signatures. CBCD presents two major advantages. First, a video clip which has already been delivered can be recognized. Secondly, content-based features are intrinsically more robust than inserted watermarks because they contain information that cannot be suppressed without considerably corrupting the perceptual content itself.
Among various solutions to these problems, fingerprinting is receiving increased attention.
Similar to cryptographic hashing, a video is represented by a hash value derived from its content.
However, this value is robust against perceptual changes applied to the video content; therefore, video fingerprinting is also known as perceptual hashing. Furthermore, in comparison with watermarking, no data embedding is required, since the content itself is used to generate the video's discriminative features .
Several video fingerprinting algorithms that work at the pixel level have been proposed. Working directly with pixels is, nowadays, computationally feasible and accurate. However, those solutions do not address the magnitude of the resources needed for such a system, and little analysis is provided on using a video fingerprinting solution in practical cases. Compressed-domain processing techniques use information extracted directly from the compressed bitstream and are therefore computationally more advantageous than uncompressed-domain processing. To achieve this, the information already inherent in the video stream, which was included during the compression stage, is utilized .
A partial decompression must still be done to extract the information necessary for processing; however, this overhead is small compared to full decompression of the video stream. It has been shown that in MPEG video decompression, approximately 40% of the CPU time is spent in Inverse Discrete Cosine Transform (IDCT) calculations, even when using fast DCT algorithms . This paper proposes a video fingerprinting technique that works directly in the compressed domain at the stage of variable-length decoding (Figure ), so it is computationally more efficient than uncompressed-domain techniques and even than methods that utilize partially decoded DCT coefficients.
Figure : Full decompression versus partial decompression of the compressed video (the bold segmented rectangle is the zone where the partially-decoded-based techniques work and the rounded rectangle is where the proposed technique works).
The paper is organized as follows: a literature survey and related work are given in section (2). Section (3) describes our proposed fingerprinting system in detail. Experimental results, comparisons and discussion are shown in section (4). Finally, section (5) concludes the paper and points out some directions for future work.
2. Literature Survey and Related Work

The following figure (Figure ) depicts the taxonomy of the various approaches on which video fingerprinting techniques are based; these techniques are introduced briefly in this section.
Figure : Taxonomy of existing video fingerprinting features.
2.1 Global-based techniques
Global-based fingerprints exploit features from all the pixels in the video frame.
2.1.1 Histogram-Based Techniques

Many of the techniques used in the design of video fingerprinting systems depend heavily on histograms; for example, color, luminance and intensity histograms are used. In the YCbCr technique, a window P consisting of N frames divided by K key frames is extracted; 5 bins are allocated for each of the Y, Cb and Cr axes, resulting in a 125-dimensional feature vector for each window . Also, the color coherence vector  is used to characterize key frames from the video clip, and the space-time color feature was used, where each shot is divided into a fixed number of equal-size segments, each of which is represented by a blending image obtained by averaging the pixel values of all the frames in the segment.
Each blending image is divided into blocks of equal size, and then two color signatures are extracted per block.
2.1.2 Differentiation-Based Techniques
This class of techniques tries to capture the relationship among the pixels in the spatial dimension, the temporal dimension, or both.
In one method, the video frames are partitioned into blocks, and the motion direction histograms are estimated and used as features .
Another method is based on measuring the temporal activity of the pixels in the video frames and using the phase of the spectral information as a feature .
Each frame is divided into a grid of R rows and C columns, resulting in R×C blocks. To capture relevant information from both the temporal and spatial domains, temporal and spatial differentiation is applied in the feature space: the differences of corresponding features extracted from subsequent blocks within one frame, and from subsequent frames in the temporal direction, are taken. This is equivalent to applying 2×2 spatio-temporal Haar filters on the randomized block means of the luminance component .
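As a rough illustration, the 2×2 spatio-temporal differencing of block means can be sketched as follows (a hypothetical re-implementation for this survey entry, not the cited authors' code; the block layout, sign convention and thresholding at zero are assumptions):

```python
def haar_bits(block_means):
    """block_means[t][b]: mean luminance of block b in frame t.

    For each pair of horizontally adjacent blocks in consecutive frames,
    the 2x2 spatio-temporal Haar response is
        (m[t][b+1] - m[t][b]) - (m[t-1][b+1] - m[t-1][b]),
    and one bit is kept per response: 1 if positive, else 0.
    """
    bits = []
    for t in range(1, len(block_means)):
        cur, prev = block_means[t], block_means[t - 1]
        for b in range(len(cur) - 1):
            spatial_diff_cur = cur[b + 1] - cur[b]
            spatial_diff_prev = prev[b + 1] - prev[b]
            bits.append(1 if spatial_diff_cur - spatial_diff_prev > 0 else 0)
    return bits
```

Keeping only the sign of each filter response is what makes the resulting bit string compact and robust against smooth luminance changes.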
2.1.3 Ordinal measure
Linear correspondence measures like correlation and the sum of squared difference between intensity distributions are known to be fragile. Ordinal measures which are based on relative ordering of intensity values in windows—rank permutations—have demonstrable robustness. By using distance metrics between two rank permutations, ordinal measures are defined. These measures are independent of absolute intensity scale and invariant to monotone transformations of intensity values like gamma variation between images.
Ordinal measures have been applied in both the spatial and temporal dimensions of the video, and recently the extended temporal ordinal measure was proposed, which is more robust than its predecessors [14, 15, 16].
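The rank-permutation idea behind ordinal measures can be sketched in a few lines (an illustrative sketch; the exact distance metric varies between the cited works):

```python
def rank_permutation(values):
    """Rank of each intensity value within the window (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def ordinal_distance(a, b):
    """L1 distance between the rank permutations of two windows."""
    ra, rb = rank_permutation(a), rank_permutation(b)
    return sum(abs(x - y) for x, y in zip(ra, rb))
```

Because ranks are unchanged by any monotone intensity transformation, two windows that differ only by such a transformation (e.g. a gamma-like scaling) have distance zero even though their absolute intensities differ.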
2.1.4 Gradient and Gradient Orientation Methods
The gradient orientation is the direction in which the directional derivative has the largest value. Lowe  used the histogram of gradient orientations as local descriptors for object recognition, and comparative test in  showed that it performs best among various local descriptors. However its high dimensionality renders the histogram of gradient orientations unsuitable for video fingerprinting. Instead, Lee  proposed using the center of gradient orientation as a robust feature.
2.1.5 Slice Entropy Scattergraph (SES)
SES employs video spatio-temporal slices, which can greatly decrease the storage and computational complexity. It is based on entropy and its deviation so as to preserve as much of the video information as possible.
Besides, SES takes advantage of a scattergraph, which is a succinct and efficient way to plot the distribution of video content. To effectively describe the SES, projection histograms, shape contexts and polynomial coefficients are used .
2.1.6 Transform Domain-Based Techniques
Compact Fourier Mellin Transform (CFMT)
First, the video is summarized into key frames. Then the CFMT representation is computed by applying the Fourier Transform to each frame, after which the Analytical Fourier-Mellin Transform (AFMT) is applied. The last steps to obtain the CFMT descriptor are applying Principal Component Analysis and Lloyd-Max non-uniform scalar quantization to the resultant magnitude [9, 21].
3D-DCT (Discrete Cosine Transform)
The input video is first normalized to a standard format. After that, a 3D-DCT is applied to this standardized video and specific transform coefficients are selected. Finally, a hash is generated from the magnitude relationship among the selected coefficients. The innovation of this method is that it makes use of both spatial and temporal video information simultaneously via a three-dimensional transform .
Malekesmaeili et al.  extract a hash of a video from temporally informative representative images (TIRIs). A TIRI for a video segment is derived by taking weighted temporal averages of the luminance pixel values of the frames in the segment. The authors apply the aforementioned DCT-based hashing algorithm with minor modifications to each TIRI.
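A TIRI of this kind can be sketched as a weighted temporal average of luminance frames (the exponential weighting `alpha**k` used here is an illustrative assumption; the cited work defines its own weights):

```python
def tiri(frames, alpha=0.5):
    """Temporally Informative Representative Image for one segment.

    frames: list of equally sized luminance frames (2-D lists of values).
    Each pixel of the TIRI is a weighted temporal average of the
    co-located pixels, with weight alpha**k for the k-th frame.
    """
    weights = [alpha ** k for k in range(len(frames))]
    total = sum(weights)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(w * f[r][c] for w, f in zip(weights, frames)) / total
             for c in range(cols)]
            for r in range(rows)]
```

With alpha = 1 this degrades to a plain temporal average (a blending image); smaller alpha emphasizes the earlier frames of the segment.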
2D-DCT TMO (Temporal Maximum Occurrence)
This method is a video hash scheme that utilizes image hashes and the spatio-temporal information contained in the video. A video clip is first segmented into shots, and the hash is derived per shot: the 2D-DCT of each frame in the shot is taken and quantized, and the temporal occurrence of each co-located coefficient is recorded. The pair of closest values is then chosen as the DCT coefficient for every co-located entry. Finally, the result is inverse-transformed to the spatial domain, and image hashes are derived from the two feature frames using radial-projection-based hashing .
3D-DWT (Discrete Wavelet Transform)
This technique presents a solution for hashing videos at the group-of-frames level using the 3-dimensional discrete wavelet transform. A hash of a group of frames is computed from the spatio-temporal low-pass band of the transform; the local variances of the wavelet coefficients in that band are used to derive the hash .
2.1.7 Nonnegative Matrix Factorization (NMF)
NMF converts a data matrix into the product of two matrices, say W and H. The system extracts features by applying NMF to individual frames of the video to construct their characteristic matrices W and H; the resultant matrices are then used as features. The main steps are ranking, resizing and applying binary hashes to construct the video fingerprint [26, 27].
2.2 Local-based techniques
Local-based fingerprints exploit features from and around the points of interest. These points characterize important features in local regions of the frames and can be easily detected using Harris detector.
2.2.1 Joly method
The technique is based on an improved version of the Harris interest point detector and a differential description of the local region around each interest point. The features are not extracted in every frame of the video but only in key-frames corresponding to extrema of the global intensity of motion .
2.2.2 Video Copy tracking (ViCopT)
Harris points of interest are extracted in every frame. These points are associated from frame to frame to build trajectories, and each trajectory is used to characterize the spatio-temporal content of the video; this enriches the local description with the spatial, dynamic and temporal behavior of the point.
For each trajectory, the signal description finally kept is the average of each component of the local description. By using the properties of the built trajectories, a behavior label can be assigned to the corresponding local description as temporal content .
2.2.3 Space time interest points
The space time interest points correspond to points where the frame values have significant local variation in both space and time. Space time interest points are described by the spatio-temporal third order local jet leading to a 34-dimensional vector .
2.2.4 Scale invariant feature transform (SIFT)
SIFT descriptors were introduced in  as features that are relatively robust to scale, perspective, and rotation changes, and their accuracy in object recognition has led to extensive use in image similarity.
2.3 Compressed Domain Techniques
Several video fingerprinting methods have been proposed for identifying video, in which fingerprints are extracted by analyzing the video in both the spatial and temporal dimensions. However, these conventional methods have one thing in common: video decompression is still required for extracting the fingerprint from a compressed video. In practice, faster computation can be achieved if the fingerprint is extracted directly in the compressed domain.
So far, few methods for video fingerprinting in the compressed domain are known. In one of them, the DC coefficient is used to model the fingerprint : given the extracted DC coefficients, the method reconstructs approximate video frames, which are evaluated to obtain the key frames of the video; these are then further analyzed to generate the fingerprints.
Recently, M. AlBaqir  proposed a video fingerprinting method in which motion vectors are used to model the fingerprint. He utilizes the motion vectors to construct an approximated motion field, since motion vectors are routinely generated during video compression to exploit the temporal redundancy within a video.
3. The Proposed Technique

The proposed technique (Figure ) uses compressed-domain macroblock type information to extract the video fingerprint, representing it as a 10-bin one-dimensional Macroblock Type Histogram (MBTH):

MBTH_t(b) = (1/η_t) Σ_{i=1}^{η_t} δ(type(MB_i), b),    b = 1, ..., 10    (1)

where η_t is the total number of MBs in frame t, and δ(type(MB_i), b) equals 1 when macroblock MB_i falls in bin b and 0 otherwise.
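Assuming the macroblock types of a frame arrive as integer labels 0-9 after variable-length decoding (an encoding chosen here purely for illustration), the MBTH computation reduces to a counting loop followed by normalization:

```python
def mbth(mb_types, bins=10):
    """Normalized Macroblock Type Histogram for one frame.

    mb_types: list of integer type labels (0..bins-1), one per macroblock.
    Returns a list of bin fractions that sums to 1.
    """
    hist = [0] * bins
    for t in mb_types:
        hist[t] += 1          # count macroblocks of each type
    n = len(mb_types)
    return [h / n for h in hist]  # divide by the total MB count (eta_t)
```

Since only the type label of each macroblock is needed, no DCT coefficients or motion vectors have to be reconstructed, which is the source of the technique's low computational cost.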
Figure : Proposed technique block diagram: partial decoding of the i-th frame (extracting macroblock types), followed by 10-bin histogram generation.
In general, an MPEG encoder classifies the video frames into I (intra) frames, P (predicted) frames and B (bi-directional) frames. Each frame is divided into macroblocks (MBs), which are 16x16-pixel motion compensation units within a frame. I frames can contain only intra blocks, while P and B pictures can have different modes according to the motion content. Macroblock type modes in P and B frames are given in Figure  and Figure  respectively. The coding scheme is then performed on a macroblock basis. If intra coding is selected, the corresponding MB is encoded individually by exploiting the two-dimensional discrete cosine transform (2D-DCT) coefficients (eqn. (3)). This is performed to reduce the video frame's spatial redundancy.
On the other hand, if inter coding is selected, the MB is encoded with respect to a temporal reference using motion estimation/motion compensation (ME/MC) algorithms [34, 35].

x(m, n) = (2/M) c_m c_n Σ_{i=0}^{M-1} Σ_{j=0}^{M-1} X_{i,j} cos[(2i+1)mπ / 2M] cos[(2j+1)nπ / 2M]    (3)

where c_k = 1/√2 for k = 0, c_k = 1 otherwise, and X_{i,j} is the M×M block of pixels.
Figure : Macroblock Type Modes in P Pictures
The purpose of using ME/MC is to reduce redundancy in the temporal direction. To exploit temporal redundancy, MPEG algorithms compute an interframe difference called the prediction error:

prediction error(i, j) = (1/MN) Σ_{m=1}^{M} Σ_{n=1}^{N} [I(m, n, t) − I(m − i, n − j, t − 1)]²    (4)

where M and N are the macroblock dimensions and are usually set to 16x16. For a given macroblock, the encoder first determines whether it is Motion Compensated (MC) or Non Motion Compensated (NO_MC). The input video frame is analyzed by a motion compensation estimator/predictor. For each macroblock, a scheme is used to decide whether the current block is intra- or inter-coded based on the prediction error. The scheme can be quite complex, but the general idea is to code the difference between the target macroblock and the reference macroblock when the prediction error is small; otherwise, the macroblock is intra-coded. In the special situation when the prediction error (for a perfect match) is zero, the macroblock is not coded at all and is skipped. In general, the condition for No_MC inter-coding is as follows :
Σ_x | I_c(x) − I_r(x) | < σ    (5)

where x = (x, y) is the position of a point in the macroblock, σ is the threshold, I_c is the current macroblock, and I_r is the reference macroblock, which is in either an I frame or a P frame. No_MC macroblocks can be further categorized into two types: intra-coded and inter-coded. In a special case, when the macroblock perfectly matches its reference, it is skipped and not coded at all.
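The mode decision described above can be sketched as a toy example (illustrative only: it uses the mean-squared form of eqn (4) for the error and a single threshold in place of eqn (5); a real encoder uses more elaborate criteria and searches over many displacements):

```python
def prediction_error(cur, ref, di, dj):
    """Mean squared interframe difference of eqn (4) for displacement (di, dj).

    cur, ref: 2-D lists of luminance values. The reference is sampled at
    (m - di, n - dj); out-of-range positions are skipped here for simplicity
    (a real encoder pads the frame or restricts the search range).
    """
    total, count = 0.0, 0
    for m in range(len(cur)):
        for n in range(len(cur[0])):
            rm, rn = m - di, n - dj
            if 0 <= rm < len(ref) and 0 <= rn < len(ref[0]):
                total += (cur[m][n] - ref[rm][rn]) ** 2
                count += 1
    return total / count


def macroblock_mode(cur, ref, sigma=100.0):
    """Toy mode decision mirroring the text: skip on a perfect
    zero-displacement match, inter-code (No_MC) when the error is below
    the threshold sigma, otherwise intra-code. Thresholds are illustrative."""
    err = prediction_error(cur, ref, 0, 0)
    if err == 0:
        return "skipped"
    return "inter (No_MC)" if err < sigma else "intra"
```

It is exactly these per-macroblock decisions, made by the encoder and stored in the bitstream, that the proposed MBTH fingerprint reads back for free.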
Figure : Macroblock Type Modes in B Pictures
For normal scenes, prediction is performed in P and B frames. When there is a scene change, this prediction drops significantly, which leads some macroblocks to be encoded in intra mode. Strictly speaking, if a frame is inside a shot, then its macroblocks should be well predicted from the previous or next frames. However, when the frames are on a shot boundary, they cannot be predicted from the related macroblocks, and a high prediction error occurs . This causes most of the macroblocks of the P frames to be intra-coded instead of motion compensated.
The proposed technique makes use of the aforementioned facts about the macroblock type information (Table ) to capture the intrinsic content of the video. Only VLC decoding, plus one addition operation per macroblock, is necessary to obtain the MB data; therefore it is computationally very efficient (see the rounded rectangle zone in Figure ).
Table : The used 10-bin histogram fingerprint description
1 Intra macroblocks in I frames
2 Intra macroblocks in P frames
3 Intra macroblocks in B frames
4 Non-motion-compensated macroblocks in P frames
5 Non-motion-compensated macroblocks in B frames
6 Forward motion macroblocks in P frames
7 Forward motion macroblocks in B frames
8 Backward motion macroblocks
10 Bi-directionally predicted macroblocks
To further reduce redundancy in the resultant fingerprint, the proposed technique quantizes the generated fingerprint into a binary sequence. This is achieved by fixing a threshold level and quantizing every bin value within each histogram according to that threshold. Some experiments were conducted to choose the threshold operator. We tested three operators: a median operator, which binarizes the resultant fingerprint of each frame individually; a second operator, which compares each bin of the fingerprints of two consecutive frames in a greater-than/less-than manner to perform the binarization; and a third operator, which uses the minimum value of the two consecutive fingerprints as the resultant fingerprint for each histogram bin. The logic behind the last two operators is to exploit the temporal dimension of the video frames in the design of the fingerprints. After testing the three operators, we decided to use the median operator.
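A minimal sketch of the chosen median operator, assuming the per-frame fingerprint is a list of 10 bin values:

```python
import statistics


def binarize_median(fingerprint):
    """Median-operator binarization: each histogram bin becomes 1 if it
    exceeds the median of that frame's bin values, else 0."""
    med = statistics.median(fingerprint)
    return [1 if v > med else 0 for v in fingerprint]
```

Because the median is computed per frame, the binarization is independent of the absolute macroblock counts, which vary with frame size and bit rate.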
Furthermore, with the fingerprint as a binary sequence, a more efficient matching process can be performed using the Hamming distance between the compared fingerprints, instead of using the histogram intersection (eqn. (6)) or even the Jaccard coefficient (eqn. (7)) to compare the resultant fingerprints.

W(Q, R) = Σ_{i=1}^{N} min(Q(i), R(i)) / Σ_{i=1}^{N} R(i)    (6)

J(Q, R) = Σ_{i=1}^{N} min(Q(i), R(i)) / Σ_{i=1}^{N} max(Q(i), R(i))    (7)

where Q and R are a pair of fingerprints, each containing N bins.
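The three comparison measures can be sketched directly from eqns (6) and (7) and the Hamming distance (a straightforward transcription; variable names are illustrative):

```python
def hamming(a, b):
    """Hamming distance between two binary fingerprints."""
    return sum(x != y for x, y in zip(a, b))


def histogram_intersection(q, r):
    """Eqn (6): bin-wise minima normalized by the reference histogram."""
    return sum(min(qi, ri) for qi, ri in zip(q, r)) / sum(r)


def jaccard(q, r):
    """Eqn (7): sum of bin-wise minima over sum of bin-wise maxima."""
    return (sum(min(qi, ri) for qi, ri in zip(q, r))
            / sum(max(qi, ri) for qi, ri in zip(q, r)))
```

The Hamming distance needs only comparisons (XORs on packed bits in an optimized implementation), while the two histogram measures require floating-point arithmetic per bin, which is why binarization speeds up matching.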
4. Experimental Results and Discussion

To evaluate the performance of the proposed technique, a test set of 200 videos, all taken from the ocean , was used. Seventeen attacks were then individually mounted on the videos, yielding a new test set of 3600 videos. The mounted attacks included an added watermark, mosaic effect, embossment effect, flipping, blindness effect, cropping, contrast adjustment, brightness modification, and bit rate change, as Table  illustrates. The hardware used for the experiments was an Intel dual-core PC running at 2 GHz with 3 GB of memory; however, all tests were run using a single core.
Three categories of experiments are conducted in this study:
1. Investigating the binarization issue (as mentioned in section (3)).
2. Studying the behavior of the proposed technique against content-preserving attacks (Table ) to assess the robustness and uniqueness of the proposed fingerprint.
3. Comparing the proposed work against an existing technique working in the compressed domain (Motion Field ).
Table : Distortions used in the study
Adding Horizontal Lines
Adding Vertical Lines
8 Cropping (Big Window)
9 Cropping (Small Window)
10 Contrast Adjustment (negative image)
11 Contrast Adjustment (maximum contrast)
12 Brightness Adjustment (-50%)
13 Brightness Adjustment (-25%)
14 Brightness Adjustment (+50%)
15 Brightness Adjustment (+25%)
16 Different Bit Rate (512 Kbps)
17 Different Bit Rate (800 Kbps)
The average retrieval rate across all 17 attacks for all the techniques is shown in Table . As the table shows, the proposed work outperforms the motion field technique; moreover, the proposed technique followed by the median operator as the binarization block gives the best results. Figure  shows the same results in a more detailed manner, broken down by the individual distortions (1 to 17, as listed in Table ).
Table : Average retrieval rate across all distortions
Technique                      Average Retrieval Rate
Proposed Technique + Median    84.1%
Motion Field Technique         74.1%
[Bar chart: correctly retrieved percentage (0-100%) for each of the 17 distortions, comparing the Proposed Technique + Median against the Motion Field Technique.]
Figure : The performance of the proposed technique against the baseline technique across all used distortions.
A plausible explanation of the results in Table  and Figure  is as follows. The motion field technique tries to build the motion trajectories of the video content from the information available in the compressed domain, but not all the macroblocks in the compressed domain carry motion information; rather, many of them are labeled as NO_MC or skipped macroblocks. The proposed technique alleviates this drawback by using the information associated with every macroblock in the compressed stream; despite its simplicity, it overcomes the gap problem in the motion field construction .
Looking closely at Figure , we can see that the proposed technique gives better results than the motion field approach, especially in robustness against the flipping and blindness effects (distortions 4, 5, 6 and 7), which change the geometry and the content of the video frames severely. However, the motion field outperforms the proposed work under mild cropping (distortion 8), as Table  shows.
Table : Motion Field technique against the proposed technique (correctly retrieved %)
Distortion    1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Motion Field  90  90  90  60  20  10  40  50  90  90  90  90  90  90  90  90  90
Proposed      90  70  90  100 100 60  60  30  70  90  80  90  100 100 100 100 100
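Averaging the per-distortion rates in the table above reproduces the overall ranking reported earlier:

```python
# Per-distortion correctly-retrieved percentages, copied from the table above.
motion_field = [90, 90, 90, 60, 20, 10, 40, 50, 90, 90, 90, 90, 90, 90, 90, 90, 90]
proposed = [90, 70, 90, 100, 100, 60, 60, 30, 70, 90, 80, 90, 100, 100, 100, 100, 100]

avg_mf = sum(motion_field) / len(motion_field)  # about 74.1%
avg_pr = sum(proposed) / len(proposed)          # about 84.1%
```

The roughly 10-point gap in the averages comes almost entirely from distortions 4-7 (flipping and blindness), where the motion field degrades sharply.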
5. Conclusion and Future Work

This paper proposed a video fingerprinting method in the compressed domain that exploits the macroblock states in the video stream. The proposed work gives promising results with a low computational overhead, requiring only partial decompression of the video clips. Thus, it can be adapted to work in real-time environments.
One direction for future work is combining this technique with compressed domain watermarking methods to design a robust content management methodology and apply it in the broadcast monitoring area.
References

[1] M. AlBaqir, "Literature Study Report on Content Based Identification", Information and Communication Theory (ICT) Group, Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), Delft University of Technology, April 2008.
[2] S. Lee and C. D. Yoo, "Robust video fingerprinting for content-based video identification", IEEE Transactions on Circuits and Systems for Video Technology, pp. 983-988, 2008.
[3] Nianhua Xie, Li Li, Xianglin Zeng, and S. Maybank, "A Survey on Visual Content-Based Video Indexing and Retrieval", IEEE SMC, pp. 797-819, 2011.
[4] A. Joly, C. Frélicot, and O. Buisson, "Robust content-based video copy identification in a large reference database", in Int. Conf. on Image and Video Retrieval, pp. 414-424, 2003.
[5] Sooyeun Jung, Dongeun Lee, Seongwon Lee, and Joonki Paik, "Robust Watermarking for Compressed Video Using Fingerprints and Its Applications", International Journal of Control, Automation, and Systems, vol. 6, no. 6, pp. 794-799, 2008.
[6] Liu Yang and Yao Li, "Research of Robust Video Fingerprinting", in Proceedings of the International Conference on Computer Application and System Modeling, pp. 43-46, 2010.
[7] Maria Chatzigiorgaki and Athanassios N. Skodras, "Real-time keyframe extraction towards video content identification", 16th International Conference on Digital Signal Processing, pp. 1-6, 2009.
[8] Young-min Kim, Sung Woo Choi, and Seong-whan Lee, "Fast Scene Change Detection Using Direct Feature Extraction From MPEG Compressed Videos", International Conference on Pattern Recognition (ICPR'00), vol. 3, 2000.
[9] A. Sarkar, P. Ghosh, E. Moxley, and B. S. Manjunath, "Video fingerprinting: Features for duplicate and similar video detection and query-based video retrieval", in SPIE - Multimedia Content Access: Algorithms and Systems II, Jan. 2008.
[10] X. Yang, Q. Tian, and E. C. Chang, "A color fingerprint of video shot for content identification", in Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 276-279, 2004.
[11] A. Hampapur, K. Hyun, and R. M. Bolle, "Comparison of sequence matching techniques for video copy detection", SPIE, vol. 4676, pp. 194-201, 2002.
[12] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford, "Video copy detection: a comparative study", in CIVR '07: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 371-378, 2007.
[13] J. Oostveen, T. Kalker, and J. Haitsma, "Feature extraction and a database strategy for video fingerprinting", in Proceedings of Visual Information and Information Systems, pp. 117-128, 2002.
[14] Dinkar N. Bhat and Shree K. Nayar, "Ordinal Measures for Image Correspondence", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4, 1998.
[15] Xian-sheng Hua, Xian Chen, and Hong-jiang Zhang, "Robust Video Signature Based on Ordinal Measure".
[16] Heung Kyu Lee and June Kim, "Extended Temporal Ordinal Measurement Using Spatially Normalized Mean for Video Copy Detection", ETRI Journal, vol. 32, no. 3, pp. 490-492, 2010.
[17] David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, pp. 1150-1157, 1999.
[18] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", in Proc. CVPR, vol. 2, pp.
[19] S. Lee and C. D. Yoo, "Video fingerprinting based on centroids of gradient orientations", in Proceedings of ICASSP, pp. 401-404, 2006.
[20] Peng Cui, Zhipeng Wu, Shuqiang Jiang, and Qingming Huang, "Fast copy detection based on Slice Entropy Scattergraph", IEEE International Conference on Multimedia and Expo (ICME), pp. 1236-1241, 2010.
[21] S. Derrode and F. Ghorbel, "Robust and efficient Fourier-Mellin transform approximations for gray-level image reconstruction and complete invariant description", Computer Vision and Image Understanding (CVIU), pp. 57-78, 2001.
[22] B. Coskun, B. Sankur, and N. Memon, "Spatiotemporal transform based video hashing", IEEE Transactions on Multimedia, vol. 8, no. 6, pp. 1190-1208, Dec. 2006.
[23] M. Malekesmaeili and M. Fatourechi, "Video Copy Detection Using Temporally Informative Representative Images", International Conference on Machine Learning and Applications, pp. 69-74, 2009.
[24] Qian Chen, Jun Tian, and Dapeng Wu, "A Robust Video Hash Scheme Based on 2D-DCT Temporal Maximum Occurrence", accepted by Security and Communication Networks, Wiley, 2010.
[25] N. Saikia and P. K. Bora, "Robust video hashing using the 3D-DWT", National Conference on Communications (NCC), pp. 1-5, 2011.
[26] M. Cooper and J. Foote, "Summarizing Video Using Non-Negative Similarity Matrix Factorization", in Proceedings of the IEEE International Conference on Multimedia Signal Processing, pp. 25-28, 2002.
[27] O. Cirakman et al., "Key-Frame Based Video Fingerprinting by NMF", in Proceedings of the IEEE 17th International Conference on Image Processing, pp. 25-28, 2010.
[28] A. Joly, O. Buisson, and C. Frelicot, "Content-based copy detection using distortion-based probabilistic similarity search", IEEE Transactions on Multimedia, 2007.
[29] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa, "Robust voting algorithm based on labels of behavior for video copy detection", in ACM Multimedia, MM'06, 2006.
[30] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors", in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 506-513, 2004.
[31] D. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, pp. 91-110, 2004.
[32] A. Mikhalev et al., "Video fingerprint structure, database construction and search algorithms", Direct Video & Audio Content Search Engine (DIVAS) project, Deliverable number D 4.2, February 2008.
[33] M. AlBaqir, "Video fingerprinting in compressed domain", MSc. Thesis, Delft University of Technology.
[34] I. E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia", Wiley & Sons, England, 2003.
[35] T. Heath, T. Howlett, and J. Keller, "Automatic Video Segmentation in the Compressed Domain", IEEE Aerospace Conference, 2002.
[36] S. C. Pei and Y. Z. Chou, "Efficient MPEG Compressed Video Analysis Using Macroblock Type Information", IEEE Transactions on Multimedia, vol. 1, no. 4, pp. 321-333, 1999.
[37] B.-L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Video", IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, pp. 533-544, 1995.
[38] ReefVid: Free Reef Video Clip Database [Online]. Available: http://www.reefvid.org/