Discovering Underlying Similarities in Video

Arman M. Garakani
Reify Corp. (http://www.reifycorp.com)
One Broadway, Cambridge, MA 02142
arman.garakani@reifycorp.com

Abstract

The concept of interrogating similarities within a data set has a long history in fields ranging from medicinal chemistry to image analysis. We define a descriptor as an entropic measure of similarity between an image and the neighborhood of images surrounding it. Differential changes in descriptor values imply differential changes in the structure underlying the images. For example, at the location of a zero crossing in the descriptor values, the corresponding image is a watershed image sitting clearly between two dissimilar groups of images. This paper describes a fast algorithm for image sequence clustering based on this concept. Developed initially for an adaptive system for the capture, analysis and storage of lengthy dynamic visual processes9, the algorithm uncovers underlying spatio-temporal structures without a priori information or segmentation. The algorithm estimates the average amount of information each image, in an ordered set, conveys about the structure underlying that set. Such signatures enable capture of relevant and salient time periods, directly reducing the cost of follow-up analysis and storage. As part of the video capture system, this characterization may provide predictive feedback to an adaptive capture sub-system controlling temporal sampling, frame rate and exposure. Details of the algorithm, examples of its application to the quantification of biological motion, and video identification and recognition are presented. Prior to the workshop, an efficient implementation will be posted as a web service to generate characterizations of unknown videos online.
Introduction

Traditional approaches to extracting information from video captures of dynamic processes assume that the full information content of the image sequence is too complex, too incomplete or too large to be analyzed without intervention. These factors lead existing technologies to depend heavily on user input or assumptions for precise segmentation, tracking and quantification. Current approaches to motion segmentation include:

• frame-based techniques such as frame-to-frame differencing10,4
• localization of coherent regions in optical flow fields (excellent review by Barron, Fleet and Beauchemin3)
• spatio-temporal approaches such as coherent bundles of pixel evolution over time12
• level set approaches to identify trajectories of moving objects11
• use of perceptual organization of spatio-temporal volumes15.

While spatio-temporal approaches are less prone to noise and illumination changes, all approaches fail if the object moves too rapidly, goes out of focus or deforms4. Paradoxically, many motion characterization algorithms depend on robust segmentation13 or extract a specific pattern such as cyclic or periodic motion5,6. Another common approach has been to treat an image sequence as a spatio-temporal volume and to locate patches, small enough to contain uniform motion, within a larger volume16. However, like many other approaches, it breaks down in the presence of deformations.

The information extracted by our algorithm is a single value for each image in the sequence. We call this time series EMS (entropic measure of similarity). EMS can be analyzed to identify and assess expected temporal relationships, and to discover unanticipated ones. EMS can be computed from an entire sequence or from a moving temporal window.

Applications and future work

The algorithm outlined here has so far been used entirely for extracting phenotypic signatures in biological screening.
Some of the areas to which we have applied the algorithm are:

• Detecting periodic motions (without looking for them)
• Registering videos by using the output of our algorithm as mutual information
• Detecting epicenters of periodic motions
• Detecting direction in transparent motion (blood flow)
• Detecting diffusions (or rates of degeneration)
• Generating audio from video motion

Our future plans focus on detecting and locating image sequences within others (video ID and recognition).

Overview and definitions

We define an image as a region (typically rectangular) of a stored memory buffer that covers the entire buffer or a section of it. Image sequences can likewise be "window" sequences. Color and audio information were ignored for all examples. We define an ordered set in time as a set of images captured and time stamped sequentially, i.e. the time stamp of any image is greater than those of all previous images in the set. An unordered data set can be sorted using entropic similarity.

Writing It for the image at time t and Rt for the corresponding mapping from surface orientation to intensity (see the collaborative property below), It(x, y) = Rt(p, q). For small transformations and small time intervals, changes in I are correlated with changes in R. In cases where transformations, deformations and so on are too large, the normalized correlation value falls rapidly toward zero. Put another way, when images represent smooth transformations, normalized correlation values vary smoothly; if there are abrupt transformations, the normalized correlation value drops abruptly.

Invariant to linear changes in brightness

This property falls out of the definition of normalized correlation. Note, however, that if motion in an image sequence is radiometric (i.e. not a geometric movement, but a change in luminance), successive normalized correlation values will still indicate a change. This is a desirable quality, as we found in a long biological observation of random cell movements: a clear linear component in our otherwise random characterization turned out to be the microscope lamp slowly dying.
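The invariance and smooth-degradation properties above can be checked with a minimal sketch of the pair-wise similarity function. This is an illustrative Python/NumPy reimplementation, not the paper's own code (an Octave/Matlab implementation is mentioned elsewhere); the frame data here is synthetic.

```python
import numpy as np

def normalized_correlation(a, b):
    """Normalized (Pearson) correlation between two equal-size images."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return 0.0 if denom == 0.0 else float(np.dot(a, b) / denom)

rng = np.random.default_rng(7)
img = rng.random((32, 32))

# Invariance to linear brightness change: gain and offset leave the
# correlation value untouched.
brighter = 1.7 * img + 10.0
assert abs(normalized_correlation(img, brighter) - 1.0) < 1e-9

# A geometric transformation (here a one-pixel shift) lowers the value.
shifted = np.roll(img, 1, axis=1)
assert normalized_correlation(img, shifted) < 1.0
```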
Basic Concepts

The basic concepts of the algorithm are summarized in the following sections. We describe our choices for the similarity and grouping functions, and subsequently exemplify their application.

Image to image (similarity or dissimilarity)

Normalized correlation was chosen as the pair-wise image similarity function for the following reasons:

Linearity and stability

A well-known characteristic of normalized correlation is its predictable and largely linear degradation as one image is transformed or deformed a short distance over another1,2. Specifically, within a small range, changes in the normalized correlation value vary linearly with the transformation parameters. Normalized correlation is often used to provide an estimate in target tracking17. Fig 1 shows trends of variation in the normalized correlation value as scale and position change1; the shape of the trend depends on the spatial frequency of the patterns used.

Collaborative property

With illumination and observation angles fixed, changes in shape manifest as changes in surface orientations. Assume we have access to a function R representing the mapping from surface orientations to image intensities; we can then relate, for a single image, the gray value at position (x, y) to its surface orientation and surface reflection (p, q) using R.

Figure 1: Changes in correlation (y axis) as scale and viewing position are changed (photocopy of the original data)1

Image in the containing image set (order)

It should be clear by now that our approach is centered on the concept that sharing similar or dissimilar images should motivate grouping. Shannon's entropy2,14 measures the average amount of information a correlation event conveys when it appears with some probability. Using the notation above, we have a self-similarity matrix M, M in R^(n x n), where Mij = s(Ii, Ij) is the pair-wise similarity between Ii and Ij, i < j, for the ordered set in time.
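Building the self-similarity matrix M can be sketched as follows. The helper names and the toy "drifting square" sequence are illustrative assumptions (Python/NumPy, not the paper's implementation); the only element taken from the text is the definition Mij = s(Ii, Ij) with normalized correlation as s.

```python
import numpy as np

def normalized_correlation(a, b):
    a = a.astype(np.float64).ravel(); a -= a.mean()
    b = b.astype(np.float64).ravel(); b -= b.mean()
    d = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return 0.0 if d == 0.0 else float(np.dot(a, b) / d)

def self_similarity_matrix(frames):
    """M[i, j] = s(I_i, I_j): pair-wise similarity over the ordered set.
    M is symmetric, so only the upper triangle is computed."""
    n = len(frames)
    M = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = normalized_correlation(frames[i], frames[j])
    return M

# Toy ordered set: a bright square drifting one pixel per frame.
frames = []
for t in range(5):
    f = np.zeros((16, 16))
    f[4:8, 2 + t:6 + t] = 1.0
    frames.append(f)

M = self_similarity_matrix(frames)
# For this smooth motion, similarity decays with temporal distance.
assert M[0, 1] > M[0, 4]
```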
For any row of M, say row k, KM denotes the collection of similarity values in that row, forming a probability distribution given by

p(KM) = KM / Σk KM

and the entropy of this distribution is

H(KM) = −Σ p(KM) log p(KM)

We call H(KM) the entropic measure of similarity (EMS). EMS is an estimate of the disorder in the similarities involving image Ik, and thus an estimate of the disorder in the system being viewed. In our future work, we will appraise similarity functions by how close an estimate the entropic similarity measure is of the underlying system.

A simple example (point set)

To illustrate the power of this approach, consider a two-dimensional point set of 1000 points, normalized to zero mean and unit variance, sampled randomly from a circular annulus with center pc, inner radius r1, outer radius r2 and Euclidean distance function D; that is, the set of points p such that

r1^2 <= D(p, pc)^2 <= r2^2

Figure 2: Entropic similarity measure plotted for a 2d point set (annulus ring data)
Figure 3: A subset of a heart muscle contraction sequence
Figure 4: Entropic similarity measure for the heart muscle contraction sequence

Single moving object sequence

Fig 3 shows a subset of image captures of a single contracting heart muscle cell, and Fig 4 the corresponding entropic measure of similarity time series. Contraction in heart muscle cells has a rapid acceleration phase followed by relaxation. We have used this method of direct measurement to quantify the type and magnitude of motion in different biological systems.

Multiple objects / multiple movement sequence

A sample video, Finale.mov, was taken from the tutorial clips in Adobe Premiere©. The results shown in Figure 5 illustrate the ability to characterize pans and periodic movements.
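The row-wise probability and entropy computation defined above can be sketched as follows. This is a Python/NumPy illustration; clipping negative correlations to zero before normalization is an assumption on our part, as the paper does not specify how negative similarity values are handled.

```python
import numpy as np

def ems(M):
    """Entropic measure of similarity: one entropy value per row of the
    self-similarity matrix M, via p(K_M) = K_M / sum_k K_M and
    H(K_M) = -sum p(K_M) log p(K_M)."""
    K = np.clip(M, 0.0, None)              # assumption: drop negative values
    P = K / K.sum(axis=1, keepdims=True)   # each row becomes a distribution
    logs = np.log(np.where(P > 0.0, P, 1.0))  # log(1) = 0 kills zero terms
    return -(P * logs).sum(axis=1)

# Two similar pairs separated by a watershed: rows within a group spread
# their probability mass, so entropy dips at the group boundary.
M = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.2, 0.2, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
h = ems(M)

# A row of uniform similarities attains the maximum entropy log(n).
assert abs(ems(np.ones((4, 4)))[0] - np.log(4)) < 1e-12
assert h[0] < np.log(4)
```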
For the point set, Euclidean distance is the similarity function used (Octave/Matlab code available). The data shown in Fig 2 is the entropic measure of similarity plotted as z over the point set p. The entropic similarity surface captures the underlying structure, the ring, in the 2d data set.

Movie trailer: content, different compression settings and clip localization

A movie trailer (http://www.apple.com/trailers/weinstein/thenannydiaries/) was analyzed. The entropic similarity measure is nearly identical for the trailers encoded at different resolutions, Fig 6. To demonstrate clip localization, a segment was selected at frame 1490 lasting 140 frames. Fig 7 shows the section of the content signature at frame 1490 in contrast with the signature generated from the 140 frames of the clip.

Figure 5: Analysis of Finale.mov from Adobe© Premiere Training (annotated events: the pan sequence center, a rider riding away, two cuts, more riders, pan + movement; x axis in seconds at 30 frames per second)
Figure 6: Entropic similarity measure for the movie trailer at low and medium resolutions
Figure 7: Entropic similarity measure for the clip versus the corresponding section of the content signature

Acknowledgements

This work, including the unique video analysis software implementation, would not have been possible without the many contributions of a small group of remarkable biologists, doctors, engineers, and colleagues I have worked with throughout my career, and without the consistent support of my wife Jenna from my very first incoherent rambling about how images could talk.

References

1. Horn, B.K.P. & Bachman, B.L. Using Synthetic Images to Register Real Images with Surface Models. In Artificial Intelligence: An MIT Perspective, Vol. 2 (eds. Winston, P.H. & Brown, R.H.), 129-160 (MIT Press, 1979).
2. S. Theodoridis and K.
Koutroumbas. Pattern Recognition, 3rd edition, p. 106 (Academic Press, 2006).
3. Barron, J.L., Fleet, D.J., and Beauchemin, S. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43-77, 1994.
4. Weiming Hu, Tieniu Tan, Liang Wang, and S. Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(3):334-352, Aug. 2004.
5. Trammell, J. and Ktonas, P. A simple nonlinear deterministic process may generate the timing of rapid eye movements during human REM sleep. Conference Proceedings, First International IEEE EMBS Conference on Neural Engineering, 2003, pp. 324-327.
6. Aaron M. Plotnik and Stephen M. Rock. Quantification of Cyclic Motion of Marine Animals from Computer Vision. In Proceedings of MTS/IEEE Oceans 2002, vol. 3, Biloxi, MS, USA, pp. 1575-1581.
7. R. Cutler and L. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781-796, Aug 2000.
8. Yingen Xiong and Francis Quek. Hand Motion Oscillatory Gestures and Multimodal Discourse Analysis. International Journal of Human-Computer Interaction, 21(3):285-312, December 2006.
9. Garakani A, Hack A, Roberts P, Walter S. Method and Apparatus for Acquisition, Compression, and Characterization of Spatiotemporal Signals. US20030185450A1, 2003.
10. Davide Migliore, Matteo Matteucci, Matteo Naccari and Andrea Bonarini. A revaluation of frame difference in fast and robust motion detection. Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, 2006, pp. 215-218.
11. Mansouri, A. and Konrad, J. Multiple motion segmentation with level sets. IEEE Transactions on Image Processing, 12:201-219, 2003.
12.
Motion Segmentation by Spatiotemporal Smoothness Using 5D Tensor Voting. Changki Min and G. Medioni. Computer Vision and Pattern Recognition Workshop (CVPRW'06), June 2006.
13. Guoying Zhao, Li Cui, Hua Li, and Matti Pietikäinen. Gait Recognition Using Fractal Scale and Wavelet Moments. 18th International Conference on Pattern Recognition (ICPR'06), Volume 4, pp. 453-456.
14. Shannon, C. E. & Weaver, W. The Mathematical Theory of Communication (Univ. Illinois Press, 1949).
15. Kishore Korimilli and Sudeep Sarkar. Motion Segmentation Based on Perceptual Organization of Spatio-Temporal Volumes. 15th International Conference on Pattern Recognition (ICPR'00), Volume 3, p. 3852.
16. Gérard G. Medioni, Isaac Cohen, François Brémond, Somboon Hongeng, and Ramakant Nevatia. Event Detection and Analysis from Video Streams. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):781-796, Aug 2000.
17. R. M. B. Jackson. Target tracking using area correlation. International Conference on Advanced Infrared Detectors and Systems, London, UK, 29-30 October 1981. IEE Conference Publication No. 204, pp. 124-130.
18. David A. Drubin, Arman M. Garakani and Pamela A. Silver. Motion as a phenotype: the use of live-cell imaging and machine visual screening to characterize transcription-dependent chromosome dynamics. BMC Cell Biology 2006, 7:19.
19. Andrew A. Hack, Jeremy Ahouse, Peter G. Roberts and Arman M. Garakani. New Methods for Automated Phenotyping of Complex Cellular Behaviors. 26th IEEE EMBS Annual International Conference, San Francisco, 2004.