COMPRESSED DOMAIN VIDEO FINGERPRINTING BASED ON MACROBLOCKS INFORMATION

Abbass S. Abbass, Aliaa A. A. Youssif, Atef Z. Ghalwash

Faculty of Computers and Information, Helwan University, Cairo, Egypt

Abstract:

With the development of media technologies, more and more videos are digitally produced, stored, and distributed. Video content is widely shared on the internet through hosting services such as YouTube, which has stated that its servers receive up to 20 hours of video material per minute. A video fingerprinting system that automatically identifies uploaded videos is therefore needed. Illegal online distribution of copyrighted videos is a serious problem, especially for commercial businesses, and digital processing tools allow videos to be transformed into different versions and redistributed on the internet. The need to identify and manage video content grows with the widespread availability of digital video. Fingerprints are compact content-based signatures that summarize a video signal or another media signal. Several video fingerprinting methods have been proposed for identifying video, in which fingerprints are extracted by analyzing the video in both the spatial and temporal dimensions.

However, these conventional methods share one trait: video decompression is still required to extract the fingerprint from a compressed video. In practice, faster computation can be achieved if the fingerprint is extracted directly in the compressed domain, yet few methods so far propose video fingerprinting in the compressed domain. This paper presents a simple but effective video fingerprinting technique that works directly in the compressed domain. Experimental results show that the proposed fingerprints are highly robust against most signal processing transformations.

Keywords: video fingerprinting, compressed domain, perceptual hash.

1. Introduction

Due to advances in compression technology, the expansion of low-cost storage media, and the growth of the internet, the quantity of digitally available video has grown enormously. Potential applications of digital video include multimedia information systems, digital libraries, the movie industry, and Video-on-Demand services. In all these services, digital video needs to be properly processed and inserted into a video server for future access. This processing includes compressing, segmenting, and indexing a video sequence. Users can then query the videos by an example image or several chosen features, and the system compares this query with various segments of the video [1, 2].

Content-based video search and retrieval has attracted strong research interest. Depending on user requirements, video search intentions differ: detecting duplicate or near-duplicate entries of a given clip, identifying recurrent instances of a given clip, or searching for similar content in a coarse or precise semantic sense. Applications include video copyright management, video content identification, and content-based video retrieval [3].

Content-Based Copy Detection (CBCD) schemes are an alternative to the watermarking approach for persistent identification of images and video clips [4, 5]. As opposed to watermarking, the CBCD approach relies only on a content-based comparison between the original and candidate objects. For storage and computational reasons, it generally consists in extracting as few features as possible from the candidate objects and matching them against a database (DB). Since the features must be discriminative enough to identify an image or a video, they are often called fingerprints or signatures. CBCD presents two major advantages. First, a video clip that has already been delivered can be recognized. Second, content-based features are intrinsically more robust than inserted watermarks because they carry information that cannot be suppressed without considerably corrupting the perceptual content itself.

Among the various solutions to these problems, fingerprinting is receiving increased attention. As in cryptographic hashing, a video is represented by a hash value derived from its content; unlike a cryptographic hash, however, this value is robust to perceptual changes applied to the content. Video fingerprinting is therefore also known as perceptual hashing. Furthermore, in contrast to watermarking, no data embedding is required, since the content itself is used to generate the video's discriminative features [6].

Several video fingerprinting algorithms that work at the pixel level have been proposed. Working directly with pixels is nowadays computationally feasible and accurate. However, those solutions do not address the magnitude of the resources needed for such a system, and little analysis is provided on using a video fingerprinting solution in practical cases. Compressed domain processing techniques use information extracted directly from the compressed bitstream and are therefore computationally more advantageous than uncompressed domain techniques. To achieve this, they exploit information already inherent in the video stream, included during the compression stage [7].

A partial decompression must still be done to extract the information necessary for processing; however, this overhead is small compared to full decompression of the video stream. In MPEG video decompression, approximately 40% of the CPU time is spent on Inverse Discrete Cosine Transform (IDCT) calculations, even when fast DCT algorithms are used [8]. This paper proposes a video fingerprinting technique that works directly in the compressed domain at the variable length decoding stage (Figure [1]), so it is computationally more efficient than uncompressed domain techniques and even than methods that use partially decoded DCT coefficients.

[Figure: MPEG decoding pipeline: parsing, variable length decoding, inverse quantization, IDCT, motion compensation against reference frames, post-processing, raw frames.]

Figure [1]: Full decompression versus partial decompression of the compressed video (the bold segmented rectangle is the zone where the partially-decoded-based techniques work and the rounded rectangle is where the proposed technique works).

The paper is organized as follows: a literature survey and related work are given in section (2). Section (3) describes our proposed fingerprinting system in detail. Experimental results, comparisons, and discussion are presented in section (4). Finally, section (5) concludes the paper and points out some directions for future work.


2. Literature Survey

Figure [2] depicts the taxonomy of approaches on which video fingerprinting techniques are based; these techniques are introduced briefly in this section.

Figure [2]: Taxonomy of existing video fingerprinting features.

2.1 Global-based techniques

Global-based fingerprints exploit features from all the pixels in the video frame.

2.1.1 Histogram

Many video fingerprinting systems depend heavily on histograms; for example, color, luminance, and intensity histograms have all been used. In the YCbCr technique, a window P of N frames, divided into K key frames, is extracted; for each of the Y, Cb, and Cr axes, 5 bins are allocated, yielding a 125-dimensional feature vector per window [9]. Also, the color coherence vector [10] is used to characterize key frames from the video clip, and a space-time color feature has been used in which each shot is divided into a fixed number of equal-size segments, each represented by a blending image obtained by averaging the pixel values of all frames in the segment. Each blending image is divided into blocks of equal size, and two color signatures are extracted per block.

2.1.2 Differentiation-Based Techniques

This class of techniques tries to capture the relationship among pixels in the spatial dimension, the temporal dimension, or both.

Motion direction

The video frames are partitioned into blocks, and the motion direction histogram is estimated and used as a feature [11].

Temporal Activity

The method is based on measuring the temporal activity of the pixels in the video frames, and using the phase of the spectral information as a feature [12].

Oostveen method

Each frame is divided into a grid of R rows and C columns, resulting in R×C blocks. To capture relevant information from both the temporal and spatial domains, temporal and spatial differentiation is applied in the feature space: differences are taken between features extracted from adjacent blocks within one frame and between subsequent frames in the temporal direction. This is equivalent to applying 2×2 spatio-temporal Haar filters to the randomized block means of the luminance component [13].
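As a rough illustration of this spatio-temporal differentiation (not the authors' exact implementation, which uses randomized block means), the following Python sketch differences horizontally adjacent block means within each frame, then differences the result across consecutive frames, and keeps the sign as the fingerprint bit:

```python
import numpy as np

def oostveen_bits(block_means):
    """Sketch of Oostveen-style spatio-temporal differentiation [13].

    block_means: array of shape (T, R, C) holding the mean luminance of
    each block in each frame. Differencing adjacent blocks within a frame
    and then differencing across consecutive frames approximates a 2x2
    spatio-temporal Haar filter; the sign of the filtered value yields
    one fingerprint bit.
    """
    spatial = np.diff(block_means, axis=2)       # horizontally adjacent blocks
    spatio_temporal = np.diff(spatial, axis=0)   # consecutive frames
    return (spatio_temporal >= 0).astype(np.uint8)

# Example: 10 frames, a 4x8 grid of block means
bits = oostveen_bits(np.random.rand(10, 4, 8))
print(bits.shape)  # (9, 4, 7)
```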

2.1.3 Ordinal measure

Linear correspondence measures like correlation and the sum of squared difference between intensity distributions are known to be fragile. Ordinal measures which are based on relative ordering of intensity values in windows—rank permutations—have demonstrable robustness. By using distance metrics between two rank permutations, ordinal measures are defined. These measures are independent of absolute intensity scale and invariant to monotone transformations of intensity values like gamma variation between images.

Ordinal measures have been used for both the spatial and temporal dimensions of video. Recently, the extended temporal ordinal measure was proposed, which is more robust than its predecessors [14, 15, 16].
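The following is a minimal sketch of a spatial ordinal measure in the spirit of [14]: block-mean intensities are converted to rank permutations and compared with an L1 distance. The partition size and function names are illustrative choices of ours:

```python
import numpy as np

def rank_permutation(frame, rows=3, cols=3):
    """Rank the mean intensities of a rows x cols partition of the frame."""
    h, w = frame.shape
    means = [frame[i*h//rows:(i+1)*h//rows, j*w//cols:(j+1)*w//cols].mean()
             for i in range(rows) for j in range(cols)]
    return np.argsort(np.argsort(means))  # rank of each block's mean

def ordinal_distance(frame_a, frame_b):
    """L1 distance between rank permutations; largely insensitive to
    monotone intensity remappings such as gamma changes, as noted in [14]."""
    ra, rb = rank_permutation(frame_a), rank_permutation(frame_b)
    return int(np.abs(ra - rb).sum())

# Example usage with two frames given as 2-D numpy arrays
a, b = np.random.rand(90, 120), np.random.rand(90, 120)
print(ordinal_distance(a, b))
```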

2.1.4 Gradient and Gradient Orientation Methods

The gradient orientation is the direction in which the directional derivative has the largest value. Lowe [17] used histograms of gradient orientations as local descriptors for object recognition, and a comparative test in [18] showed that they perform best among various local descriptors. However, its high dimensionality renders the histogram of gradient orientations unsuitable for video fingerprinting. Instead, Lee [19] proposed using the center of gradient orientations as a robust feature.

2.1.5 Slice Entropy Scattergraph (SES)

SES employs video spatio-temporal slices, which greatly decrease storage and computational complexity. It is based on entropy and its deviation, so as to preserve as much of the video information as possible. Moreover, SES uses a scattergraph, a succinct and efficient way to plot the distribution of video content. To describe SES effectively, projection histograms, shape contexts, and polynomial coefficients are used [20].


2.1.6 Transform Domain-Based Techniques

Compact Fourier Mellin Transform (CFMT)

First, the video is summarized into key frames. The CFMT representation is then computed by applying the Fourier Transform to each frame, followed by the Analytical Fourier Mellin Transform (AFMT). The final steps to obtain the CFMT descriptor are applying Principal Component Analysis and Lloyd-Max non-uniform scalar quantization to the resulting magnitude [9, 21].

3D-DCT (Discrete Cosine Transform)

The input video is first normalized to a standard format. A 3D-DCT is then applied to this standardized video, and specific transform coefficients are selected. Finally, a hash is generated from the magnitude relationships among the selected coefficients. The innovation in this method is that it makes use of both spatial and temporal video information simultaneously via a three-dimensional transform [22].
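The following Python sketch illustrates the 3D-DCT hashing idea under simplifying assumptions: the choice of a low-frequency corner of coefficients and the median-based bit assignment are ours, not the exact selection rule of [22]:

```python
import numpy as np
from scipy.fft import dctn

def three_d_dct_hash(clip, n_coeffs=64):
    """Sketch of a 3D-DCT based hash in the spirit of [22].

    clip: luminance volume of shape (T, H, W), assumed already normalized
    to a fixed size with T, H, W >= 4. Low-frequency 3D-DCT coefficients
    are selected, and each bit records whether a coefficient's magnitude
    exceeds the median magnitude of the selection.
    """
    coeffs = dctn(clip.astype(np.float64), norm='ortho')
    low = np.abs(coeffs[:4, :4, :4]).ravel()[:n_coeffs]  # low-frequency corner
    return (low > np.median(low)).astype(np.uint8)

# Example on a random 8-frame clip of 32x32 luminance frames
print(three_d_dct_hash(np.random.rand(8, 32, 32)))
```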

Malekesmaeili et al. [23] extract a video hash from temporally informative representative images (TIRIs). A TIRI for a video segment is derived by taking weighted temporal averages of the luminance pixel values of the frames in the segment. The authors apply the aforementioned DCT-based hashing algorithm, with minor modifications, to each TIRI.

2D-DCT TMO (Temporal Maximum Occurrence)

This method utilizes an image hash and the spatio-temporal information contained in video to generate a video hash. A video clip is first segmented into shots, and the hash is derived per shot. The 2D-DCT of each frame in the shot is taken and quantized, and the temporal occurrence of each co-located coefficient is recorded. The pair of closest values is then chosen as the DCT coefficient for every co-located entry; lastly, an inverse transform to the spatial domain is performed, and image hashes are derived from two feature frames using a radial-projection-based hash [24].

3D-DWT (Discrete Wavelet Transform)

This technique hashes videos at the group-of-frames level using the 3-dimensional discrete wavelet transform. A hash of a group of frames is computed from the spatio-temporal low-pass band of the transform; the local variances of the wavelet coefficients in that band are used to derive the hash [25].

2.1.7 Nonnegative Matrix Factorization (NMF)

NMF factorizes a data matrix into the product of two matrices, say W and H. The system extracts features by applying NMF to individual frames of the video to construct their characteristic matrices W and H; the resulting matrices are then used as features. The main steps are ranking, resizing, and applying binary hashes to construct the video fingerprint [26, 27].
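A minimal sketch of the factorization step using scikit-learn; the rank and initialization are illustrative choices, and the subsequent ranking, resizing, and binary hashing steps of [26, 27] are omitted:

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_frame_features(frame, rank=8):
    """Factor a non-negative grayscale frame into W @ H and return the
    factors as features, loosely following [26, 27]."""
    model = NMF(n_components=rank, init='nndsvda', max_iter=400)
    W = model.fit_transform(frame)  # (rows, rank) activations
    H = model.components_           # (rank, cols) basis rows
    return W, H

# Example on a random non-negative "frame"
W, H = nmf_frame_features(np.random.rand(64, 64))
print(W.shape, H.shape)  # (64, 8) (8, 64)
```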

2.2 Local-based techniques

Local-based fingerprints exploit features from and around points of interest. These points characterize important features in local regions of the frames and can be easily detected using, for example, the Harris detector.

2.2.1 Joly method

The technique is based on an improved version of the Harris interest point detector and a differential description of the local region around each interest point. The features are not extracted in every frame of the video but only in key-frames corresponding to extrema of the global intensity of motion [28].

2.2.2 Video Copy Tracking (ViCopT)

Harris points of interest are extracted in every frame. These points are associated from frame to frame to build trajectories, and each trajectory is used to characterize the spatio-temporal content of the video. This enriches the local description with the spatial, dynamic, and temporal behavior of the point. For each trajectory, the signal description finally kept is the average of each component of the local description. Using the properties of the built trajectories, a behavior label can be assigned to the corresponding local description as temporal content [29].

2.2.3 Space time interest points

Space-time interest points correspond to points where the frame values have significant local variation in both space and time. They are described by the spatio-temporal third-order local jet, leading to a 34-dimensional vector [30].

2.2.4 Scale invariant feature transform (SIFT)

SIFT descriptors were introduced in [31] as features that are relatively robust to scale, perspective, and rotation changes, and their accuracy in object recognition has led to their extensive use in image similarity tasks.

2.3 Compressed Domain Techniques

As noted in the introduction, most video fingerprinting methods extract fingerprints by analyzing video in both the spatial and temporal dimensions, and they share one trait: video decompression is still required before the fingerprint can be extracted from a compressed video. In practice, faster computation can be achieved if the fingerprint is extracted directly in the compressed domain.

So far, few methods propose video fingerprinting in the compressed domain. In one of them, the DC coefficients are used to model the fingerprint [32]: given the extracted DC coefficients, the method constructs an approximation of each video frame; the modeled frames are evaluated to obtain the key frames of the video, which are then analyzed further to generate the fingerprints.

Recently, M. AlBaqir [33] proposed a video fingerprinting method in which motion vectors are used to model the fingerprint. He utilizes the motion vectors to construct an approximated motion field, since motion vectors are routinely generated during video compression to exploit the temporal redundancy within a video.

3. Proposed Technique

The proposed technique (Figure [3]) uses compressed domain macroblock type information to extract the video fingerprint by representing it as a 10-bin one-dimensional histogram (MBTH):

$$\mathrm{MBTH}_t = \left[ X_1, X_2, X_3, \ldots, X_{10} \right]_t \qquad (1)$$

$$\eta_t = \sum_{i=1}^{10} X_i(t) \qquad (2)$$

where $\eta_t$ is the total number of macroblocks (MB) in frame $t$.
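As a concrete illustration of eqns (1)-(2), the following Python sketch builds the per-frame MBTH from a list of per-macroblock bin labels; the function name and the input representation are our own, and the bin labels follow Table [1] below:

```python
import numpy as np

def mbth(bin_labels):
    """Eqns (1)-(2): MBTH_t is the 10-bin histogram [X_1, ..., X_10] of
    macroblock-type labels for frame t, and eta_t is the total macroblock
    count, i.e. the sum of all bins.

    bin_labels: iterable of bin indices in 1..10, one per macroblock of
    the frame (see Table [1] for the meaning of each bin).
    """
    hist = np.bincount(np.asarray(bin_labels) - 1, minlength=10)  # eqn (1)
    eta_t = int(hist.sum())                                       # eqn (2)
    return hist, eta_t

# Example: a P frame with 6 skipped, 3 intra, and 1 forward-motion MB
hist, eta = mbth([1]*6 + [3]*3 + [7])
print(hist, eta)  # [6 0 3 0 0 0 1 0 0 0] 10
```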


[Figure: block diagram: MPEG video → variable length decoding → partial decoding of the i-th frame (extract macroblock types) → 10-bin histogram generation → binarization (with a one-frame delay $z^{-1}$) → fingerprint.]

Figure [3]: Proposed Technique Block Diagram

In general, an MPEG encoder classifies video frames into I (intra) frames, P (predicted) frames, and B (bi-directional) frames. Each frame is divided into macroblocks (MB), the 16×16-pixel motion compensation units within a frame. I frames can contain only intra blocks, while P and B pictures admit different macroblock modes according to motion content; the macroblock type modes in P and B frames are given in Figure [4] and Figure [5] respectively. Coding is then performed on a macroblock basis. If intra coding is selected, the corresponding MB is encoded individually using its two-dimensional discrete cosine transform (2D-DCT) coefficients (eqn. (3)), which reduces the frame's spatial redundancy.

On the other hand, if inter coding is selected, the MB is encoded with a temporal reference using motion estimation/motion compensation (ME/MC) algorithms [34, 35]:

$$x(n_1, n_2) = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} X_{i,j}\, c_i\, c_j \cos\!\left[\frac{(2n_1 + 1)\, i\, \pi}{2M}\right] \cos\!\left[\frac{(2n_2 + 1)\, j\, \pi}{2M}\right] \qquad (3)$$

where

$$c_i = \begin{cases} \sqrt{1/M}, & i = 0 \\ \sqrt{2/M}, & i \neq 0 \end{cases}$$

and $X_{i,j}$ are the 2D-DCT coefficients of the $M \times M$ block of pixels $x(n_1, n_2)$.


[Figure: decision tree for an input block in a P picture: MC Inter → {Coded, Coded+macroblock_quant, Not Coded}; No MC → Inter → {Coded, Coded+macroblock_quant, Not Coded (skipped)}, or Intra → {Coded, Coded+macroblock_quant}.]

Figure [4]: Macroblock Type Modes in P Pictures

The purpose of ME/MC is to reduce redundancy in the temporal direction. To exploit temporal redundancy, MPEG algorithms compute an interframe difference called the prediction error [36]:

$$\text{prediction error}(i, j) = \frac{1}{MN} \sum_{m=-M/2}^{M/2} \sum_{n=-N/2}^{N/2} \left[ I(m, n, t) - I(m+i, n+j, t-1) \right]^2 \qquad (4)$$

where M and N are the macroblock dimensions, usually set to 16×16. For a given macroblock, the encoder first determines whether it is Motion Compensated (MC) or Non Motion Compensated (No_MC). The input video frame is analyzed by a motion compensation estimator/predictor, and for each macroblock a scheme decides, based on the prediction error, whether the current block is intra- or inter-coded. The scheme can be quite complex, but the general idea is to code the difference between the target and reference macroblocks when the prediction error is small, and otherwise to intra-code the macroblock. In the special situation when the prediction error is zero (a perfect match), the macroblock is not coded using the prediction error and is skipped. In general, the condition for No_MC inter-coding is as follows [36]:

$$\sum_{x} \left( I_c(x) - I_r(x) \right)^2 < \sigma \qquad (5)$$

where $x = (x, y)^t$ is the position of a point in the macroblock, $\sigma$ is the threshold, $I_c$ is the current macroblock, and $I_r$ is the reference macroblock, which comes from either an I frame or a P frame. The No_MC macroblock can be categorized into two types: intra-coded and inter-coded. In the special case when the macroblock perfectly matches its reference, it is skipped and not coded at all.
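To make the decision flow concrete, here is a minimal Python sketch of the encoder-side mode decision described by eqns (4)-(5). It is illustrative only: real encoders use more elaborate rules, and the threshold value sigma is a placeholder of ours, not one prescribed by the paper or the MPEG standards:

```python
import numpy as np

def coding_mode(current_mb, reference_mb, sigma=1024.0):
    """Illustrative mode decision following eqns (4)-(5): skip when the
    macroblock matches its reference exactly, No_MC inter-code when the
    squared error stays below the threshold sigma, otherwise fall back
    to motion search / intra coding.
    """
    err = np.sum((current_mb.astype(float) - reference_mb.astype(float)) ** 2)
    if err == 0:
        return 'skipped'          # perfect match: not coded at all
    if err < sigma:
        return 'no_mc_inter'      # eqn (5) satisfied
    return 'mc_or_intra'          # motion-compensate or intra-code

# Example: identical macroblocks are skipped
mb = np.random.randint(0, 255, (16, 16))
print(coding_mode(mb, mb))  # 'skipped'
```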

[Figure: decision tree for an input block in a B picture: MC → Forward, Backward, or Interpolated, each → {Coded, Coded+macroblock_quant, Not Coded}; No MC → Intra → {Coded, Coded+macroblock_quant}.]

Figure [5]: Macroblock Type Modes in B Pictures

For normal scenes, prediction is performed in P and B frames. When there is a scene change, prediction quality drops significantly, which leads some macroblocks to be encoded in intra mode. Strictly speaking, if a frame lies inside a shot, its macroblocks should be well predicted from previous or next frames. When frames lie on a shot boundary, however, they cannot be predicted from the related macroblocks, and a high prediction error occurs [37]; this causes most of the macroblocks of the P frames to be intra-coded instead of motion compensated.

The proposed technique makes use of the aforementioned facts about macroblock type information (Table [1]) to capture the intrinsic content of the video. Only VLC decoding, plus one addition per macroblock, is necessary to obtain the MB data, so the technique is computationally very efficient (see the rounded rectangle zone in Figure [1]).

Table [1]: The 10-bin histogram fingerprint description

Bin Index   Interpretation
1           Skipped macroblocks
2           Intra macroblocks in I frames
3           Intra macroblocks in P frames
4           Intra macroblocks in B frames
5           Non-motion compensated macroblocks in P frames
6           Non-motion compensated macroblocks in B frames
7           Forward motion macroblocks in P frames
8           Forward motion macroblocks in B frames
9           Backward motion macroblocks
10          Bi-directionally predicted macroblocks
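A sketch of how parsed macroblock information might be mapped to the bins of Table [1]. The dictionary flags (skipped, intra, forward, backward) are hypothetical stand-ins for the macroblock_type fields a real MPEG parser would expose; the bin assignments themselves follow the table:

```python
def mb_bin(frame_type, mb):
    """Map a parsed macroblock to its Table [1] bin (1..10).

    frame_type: 'I', 'P' or 'B'; mb: dict with hypothetical boolean flags
    (skipped, intra, forward, backward). Assumes I-frame macroblocks are
    always intra, per the MPEG constraints discussed above.
    """
    if mb.get('skipped'):
        return 1
    if mb.get('intra'):
        return {'I': 2, 'P': 3, 'B': 4}[frame_type]
    fwd, bwd = mb.get('forward', False), mb.get('backward', False)
    if fwd and bwd:
        return 10                           # bi-directionally predicted (B only)
    if bwd:
        return 9                            # backward motion (B only)
    if fwd:
        return 7 if frame_type == 'P' else 8
    return 5 if frame_type == 'P' else 6    # no motion compensation

print(mb_bin('P', {'skipped': True}))                     # 1
print(mb_bin('B', {'forward': True, 'backward': True}))   # 10
```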

To further reduce fingerprint redundancy, the proposed technique quantizes the generated fingerprint into a binary sequence. This is achieved by fixing a threshold level and quantizing every bin value within each histogram against that threshold. Experiments were conducted to choose the threshold operator; three operators were tested. The first is a median operator that binarizes the resulting fingerprint for each frame individually. The second compares each bin of two consecutive frames' fingerprints in a greater-than/less-than manner to perform the binarization. The third uses the minimum value of the two consecutive fingerprints as the resulting fingerprint for each histogram bin. The logic behind the last two operators is to exploit the temporal dimension of the video frames in the design of the fingerprints. After testing the three operators, we decided to use the median operator (a sketch is given below).
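A minimal sketch of the adopted median operator and, for contrast, the consecutive-frame comparison operator, assuming each frame's fingerprint arrives as a 10-bin numpy array:

```python
import numpy as np

def binarize_median(hist):
    """Median operator (the one adopted in the paper): a bin maps to 1
    if it exceeds the median of the frame's own histogram, else 0."""
    return (hist > np.median(hist)).astype(np.uint8)

def binarize_temporal(hist_t, hist_prev):
    """Second tested operator: compare the bins of two consecutive
    frames' histograms in a greater-than manner."""
    return (hist_t > hist_prev).astype(np.uint8)

h = np.array([6, 0, 3, 0, 0, 0, 1, 0, 0, 0], dtype=float)
print(binarize_median(h))  # [1 0 1 0 0 0 1 0 0 0]
```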

Furthermore, with the fingerprint in a binary sequence, a more efficient matching process can be performed using the Hamming distance between the compared fingerprints, instead of the histogram intersection (eqn. (6)) or the Jaccard coefficient (eqn. (7)):

$$W(Q, R) = \sum_{i=1}^{N} \min\left( Q(i), R(i) \right) \qquad (6)$$

$$J(Q, R) = \frac{|Q \cap R|}{|Q \cup R|} \qquad (7)$$

where Q and R are a pair of fingerprints, each containing N bins.
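For completeness, here is a sketch contrasting the three comparison measures: the Hamming distance used for the binarized fingerprints, and eqns (6)-(7) computed on the histograms. Binary fingerprints reduce matching to simple bit counting:

```python
import numpy as np

def hamming_distance(a, b):
    """Matching adopted in the paper: number of differing bits between
    two binary fingerprints; very cheap compared with eqns (6)-(7)."""
    return int(np.count_nonzero(a != b))

def histogram_intersection(Q, R):
    """Eqn (6): W(Q, R) = sum over N bins of min(Q(i), R(i))."""
    return float(np.minimum(Q, R).sum())

def jaccard(Q, R):
    """Eqn (7) for binary fingerprints: |Q and R| / |Q or R|."""
    inter = np.logical_and(Q, R).sum()
    union = np.logical_or(Q, R).sum()
    return float(inter) / float(union) if union else 1.0

a = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0], np.uint8)
b = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1], np.uint8)
print(hamming_distance(a, b))  # 2
print(jaccard(a, b))           # 0.5
```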

4. Experimental Results

To evaluate the performance of the proposed technique, a test set of 200 videos, all taken from the ocean [38], was used. Attacks were then mounted individually on the videos, generating a new test set of 3600 videos. The mounted attacks included an added watermark, mosaic effect, embossment effect, flipping, blindness effect, cropping, contrast adjustment, brightness modification, and bit rate change, as Table [2] illustrates. The PC used for the experiments was an Intel dual-core running at 2 GHz with 3 GB of memory; all tests were run on a single core.


Three categories of experiments are conducted in this study:

1. Investigating the binarization issue (as mentioned in section (3)).
2. Studying the behavior of the proposed technique against content-preserving attacks (Table [2]) to assess the robustness and uniqueness of the proposed fingerprint.
3. Comparing the proposed work against an existing technique working in the compressed domain (Motion Field [33]).

Table [2]: Distortions used in the study

Index   Distortion
1       Watermark
2       Mosaic
3       Embossment
4       Horizontal Flipping
5       Vertical Flipping
6       Adding Horizontal Lines
7       Adding Vertical Lines
8       Cropping (Big Window)
9       Cropping (Small Window)
10      Contrast Adjustment (negative image)
11      Contrast Adjustment (maximum contrast)
12      Brightness Adjustment (-50%)
13      Brightness Adjustment (-25%)
14      Brightness Adjustment (+50%)
15      Brightness Adjustment (+25%)
16      Different Bit Rate (512 Kbps)
17      Different Bit Rate (800 Kbps)

The average retrieval rate across all 17 attacks for all techniques is shown in Table [3]. As the table shows, the proposed work outperforms the motion field technique, and the proposed technique followed by the median operator as the binarization block gives the best results. Figure [6] shows the same results in more detail, broken down by distortion (1 to 17, as listed in Table [2]).

Table [3]: Average retrieval rate across all distortions

Technique                      Average Retrieval Rate
Motion Field [33]              74.12%
Proposed Technique             -
Proposed Technique + Median    82.94%


[Figure: horizontal bar chart, one group per distortion of Table [2] (watermark, mosaic, embossment, flips, added lines, cropping, contrast and brightness adjustments, bit rate changes), showing the percentage of correctly retrieved videos (0-100%) for the Motion Field technique, the proposed technique, and the proposed technique with the median operator.]

Figure [6]: The performance of the proposed technique against the baseline technique across all used distortions.

A reasonable explanation of the results in Table [3] and Figure [6] is as follows. The motion field technique tries to build the motion trajectories of the video content from the information available in the compressed domain, but not all macroblocks in the compressed domain carry motion information; many are labeled as No_MC or skipped macroblocks. The proposed technique alleviates this drawback by using the information associated with every macroblock in the compressed stream; despite its simplicity, it overcomes the gap problem in the motion field construction.


Looking closely at Figure [6], the proposed technique gives better results than the motion field approach, especially in robustness against the flipping and blindness effects (distortions 4, 5, 6, and 7), which severely change the geometry and content of the video frames. However, the motion field technique outperforms the proposed work under mild cropping (distortion 8), as Table [4] shows.

Table [4]: Motion Field technique against the proposed technique (correctly retrieved %)

Technique      1   2   3   4    5    6   7   8   9   10  11  12  13   14   15   16   17
Motion Field   90  90  90  60   20   10  40  50  90  90  90  90  90   90   90   90   90
Proposed       90  70  90  100  100  60  60  30  70  90  80  90  100  100  100  100  100

5. CONCLUSION

This paper proposes a video fingerprinting method in the compressed domain that exploits the macroblock states in the video stream. The proposed work gives promising results despite the low computational overhead of its partial decompression of the video clips; thus, it can be adapted to real-time environments.

One direction for future work is combining this technique with compressed domain watermarking methods to design a robust content management methodology and applying it in the broadcast monitoring area.

6. REFERENCES

[1] M. AlBaqir, "Literature Study Report on Content Based Identification", Information and Communication Theory (ICT) Group, Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), Delft University of Technology, April 2008.

[2] S. Lee and C. D. Yoo, "Robust video fingerprinting for content-based video identification", IEEE Transactions on Circuits and Systems for Video Technology, pp. 983-988, 2008.

[3] Nianhua Xie, Li Li, Xianglin Zeng, and S. Maybank, "A Survey on Visual Content-Based Video Indexing and Retrieval", IEEE SMC, pp. 797-819, 2011.

[4] A. Joly, C. Frélicot, and O. Buisson, "Robust content-based video copy identification in a large reference database", in Int. Conf. on Image and Video Retrieval, pp. 414-424, 2003.

[5] Sooyeun Jung, Dongeun Lee, Seongwon Lee, and Joonki Paik, "Robust Watermarking for Compressed Video Using Fingerprints and Its Applications", International Journal of Control, Automation, and Systems, vol. 6, no. 6, pp. 794-799, 2008.

[6] Liu Yang and Yao Li, "Research of Robust Video Fingerprinting", in Proceedings of the International Conference on Computer Application and System Modeling, pp. 43-46, 2010.

[7] Maria Chatzigiorgaki and Athanassios N. Skodras, "Real-time keyframe extraction towards video content identification", 16th International Conference on Digital Signal Processing, pp. 1-6, 2009.

[8] Young-min Kim, Sung Woo Choi, and Seong-whan Lee, "Fast Scene Change Detection Using Direct Feature Extraction from MPEG Compressed Videos", International Conference on Pattern Recognition (ICPR'00), vol. 3, 2000.

[9] A. Sarkar, P. Ghosh, E. Moxley, and B. S. Manjunath, "Video fingerprinting: Features for duplicate and similar video detection and query-based video retrieval", in SPIE - Multimedia Content Access: Algorithms and Systems II, Jan. 2008.

[10] X. Yang, Q. Tian, and E. C. Chang, "A color fingerprint of video shot for content identification", in Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 276-279, 2004.

[11] A. Hampapur, K. Hyun, and R. M. Bolle, "Comparison of sequence matching techniques for video copy detection", SPIE, vol. 4676, pp. 194-201, 2002.

[12] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford, "Video copy detection: a comparative study", in CIVR '07: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, pp. 371-378, 2007.

[13] J. Oostveen, T. Kalker, and J. Haitsma, "Feature extraction and a database strategy for video fingerprinting", in Proceedings of Visual Information and Information Systems, pp. 117-128, 2002.

[14] Dinkar N. Bhat and Shree K. Nayar, "Ordinal Measures for Image Correspondence", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4, 1998.

[15] Xian-sheng Hua, Xian Chen, and Hong-jiang Zhang, "Robust Video Signature Based on Ordinal Measure", ICIP, 2004.

[16] Heung-Kyu Lee and June Kim, "Extended Temporal Ordinal Measurement Using Spatially Normalized Mean for Video Copy Detection", ETRI Journal, vol. 32, no. 3, pp. 490-492, 2010.

[17] David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, pp. 1150-1157, 1999.

[18] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", in Proc. CVPR, vol. 2, pp. 257-263, 2003.

[19] S. Lee and C. D. Yoo, "Video fingerprinting based on centroids of gradient orientations", in Proceedings of ICASSP, pp. 401-404, 2006.

[20] Peng Cui, Zhipeng Wu, Shuqiang Jiang, and Qingming Huang, "Fast copy detection based on Slice Entropy Scattergraph", IEEE International Conference on Multimedia and Expo (ICME), pp. 1236-1241, 2010.

[21] S. Derrode and F. Ghorbel, "Robust and efficient Fourier-Mellin transform approximations for gray-level image reconstruction and complete invariant description", Computer Vision and Image Understanding (CVIU), pp. 57-78, 2001.

[22] B. Coskun, B. Sankur, and N. Memon, "Spatio-temporal transform based video hashing", IEEE Transactions on Multimedia, vol. 8, no. 6, pp. 1190-1208, Dec. 2006.

[23] M. Malekesmaeili and M. Fatourechi, "Video Copy Detection Using Temporally Informative Representative Images", International Conference on Machine Learning and Applications, pp. 69-74, 2009.

[24] Qian Chen, Jun Tian, and Dapeng Wu, "A Robust Video Hash Scheme Based on 2D-DCT Temporal Maximum Occurrence", accepted by Security and Communication Networks, Wiley, 2010.

[25] N. Saikia and P. K. Bora, "Robust video hashing using the 3D-DWT", National Conference on Communications (NCC), pp. 1-5, 2011.

[26] M. Cooper and J. Foote, "Summarizing Video Using Non-Negative Similarity Matrix Factorization", in Proceedings of the IEEE International Conference on Multimedia Signal Processing, pp. 25-28, 2002.

[27] O. Cirakman et al., "Key-Frame Based Video Fingerprinting by NMF", in Proceedings of the IEEE 17th International Conference on Image Processing, pp. 25-28, 2010.

[28] A. Joly, O. Buisson, and C. Frelicot, "Content-based copy detection using distortion-based probabilistic similarity search", IEEE Transactions on Multimedia, 2007.

[29] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa, "Robust voting algorithm based on labels of behavior for video copy detection", in ACM Multimedia, MM'06, 2006.

[30] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors", in IEEE Conf. on Computer Vision and Pattern Recognition, pp. 506-513, 2004.

[31] D. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, pp. 91-110, 2004.

[32] A. Mikhalev et al., "Video fingerprint structure, database construction and search algorithms", Direct Video & Audio Content Search Engine (DIVAS) project, Deliverable D4.2, February 2008.

[33] M. AlBaqir, "Video fingerprinting in compressed domain", MSc thesis, Delft University of Technology, 2009.

[34] I. E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia", Wiley & Sons, England, 2003.

[35] T. Heath, T. Howlett, and J. Keller, "Automatic Video Segmentation in the Compressed Domain", IEEE Aerospace Conference, 2002.

[36] S. C. Pei and Y. Z. Chou, "Efficient MPEG Compressed Video Analysis Using Macroblock Type Information", IEEE Transactions on Multimedia, vol. 1, no. 4, pp. 321-333, 1999.

[37] B.-L. Yeo and B. Liu, "Rapid Scene Analysis on Compressed Video", IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, pp. 533-544, 1995.

[38] ReefVid: Free Reef Video Clip Database [Online]. Available: http://www.reefvid.org/
