Discovering Underlying Similarities in Video
Arman M. Garakani
Reify Corp. (http://www.reifycorp.com)
One Broadway, Cambridge, MA 02142
arman.garakani@reifycorp.com
Abstract
The concept of interrogating similarities within a data set
has a long history in fields ranging from medicinal
chemistry to image analysis. We define a descriptor as an
entropic measure of similarity for an image and the
neighborhood of images surrounding it. Differential
changes in descriptor values imply differential changes in
the structure underlying the images. For example, at the
location of a zero crossing in the descriptor values, the
corresponding image is a watershed image clearly sitting
between two dissimilar groups of images.
This paper describes a fast algorithm for image sequence
clustering based on the above concept. Developed initially
for an adaptive system for capture, analysis and storage of
lengthy dynamic visual processes, the algorithm uncovers
underlying spatio-temporal structures without a priori
information or segmentation9.
The algorithm estimates the average amount of information
each image, in an ordered set, conveys about the structure
underlying the ordered set. Such signatures enable capture
of relevant and salient time periods, directly leading to reductions in the cost of follow-up analysis and storage. As a
part of the video capture system, the above characterization
may provide predictive feedback to an adaptive capture
sub-system controlling temporal sampling, frame-rate and
exposure. Details of the algorithm, examples of its
application to quantification of biological motion, and
video identification and recognition are presented.
Prior to the workshop, an efficient implementation will be
posted as a web service to generate characterizations of
unknown videos online.
Introduction
Traditional approaches to extracting information from
video captures of dynamic processes assume that the full
information content of the image sequence is too complex,
too incomplete or too large to be analyzed without
intervention. These factors lead existing technologies to
depend heavily on user input or assumptions for precise
segmentation, tracking and quantification.
Current approaches to motion segmentation include:
• frame-based techniques such as frame-to-frame
differencing10,4
• localization of coherent regions in optical flow fields
(excellent review by Barron, Fleet and Beauchemin3)
• spatio-temporal approaches such as coherent bundles of
pixel evolution over time12
• level set approaches to identify trajectories of moving objects11
• use of perceptual organization of spatio-temporal
volumes15.
While spatio-temporal approaches are less prone to noise and illumination changes, all approaches fail if the object moves too rapidly, goes out of focus, or deforms4.
Paradoxically, many motion characterization algorithms
depend on robust segmentation13 or extract a specific
pattern such as cyclic or periodic motion5,6. Another
common approach has been to treat an image sequence as a
spatio-temporal volume and to locate patches, small
enough to contain uniform motion, in a larger volume16.
However, like many other approaches, it breaks down in the presence of deformations.
The information extracted by the algorithm is a single value for each image in the sequence. We call this time series EMS (entropic measure of similarity). EMS can be analyzed to identify and assess expected temporal relationships, and to discover unanticipated ones. EMS can be computed from an entire sequence or from a moving temporal window.
Applications and future work
The algorithm outlined here has been used entirely for
extracting phenotypic signatures in biological screening.
Some of the areas to which we have applied the algorithm
are:
• Detecting periodic motions (without looking for them)
• Registering videos by using the output of our algorithm as mutual information
• Detecting epicenters of periodic motions
• Detecting direction in transparent motion (blood flow)
• Detecting diffusions (or rate of degeneration)
• Generating audio from video motion
Our future plans focus on detecting and locating image sequences within others (video ID and recognition).
Overview and definitions
We define an image as a region (typically rectangular) of a stored memory buffer that covers the entire buffer or a section of it. Similarly, image sequences can be “window” sequences as well. Color and audio information were ignored for all examples.
We define an ordered set in time as a set of images captured and time-stamped sequentially, i.e. the time stamp of any image is greater than those of all previous images in the set. An unordered data set can be sorted using entropic similarity.
Basic Concepts
The basic concepts of this algorithm are summarized in the
following sections. We will describe our choice for the
similarity and grouping function. We subsequently
exemplify its application.
Image to Image (similarity or dissimilarity)
Normalized correlation was chosen as the pair-wise image
similarity function for the following reasons:
Linearity and stability
A well known characteristic of normalized correlation is its predictable and largely linear degradation as one image is transformed or deformed a short distance over another1,2. Specifically, within a small range, changes in normalized correlation value vary linearly with the transformation parameters. Normalized correlation is often used to provide an estimate in target tracking17. Fig 1 shows the trends of variation in normalized correlation value as scale and position change1. The shape of the trend depends on the spatial frequency of the patterns used.

Figure 1: Changes in correlation (y axis) as scale and viewing position are changed (photocopy of the original data)1
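To make the pair-wise measure concrete, the following is a minimal sketch of normalized correlation in Python/NumPy. It is an illustrative re-implementation under our own assumptions (grayscale images as 2d arrays, Pearson-style normalization), not the code used in the system described here.

import numpy as np

def ncorr(a, b):
    # Pearson normalized correlation between two equally sized grayscale images.
    # Returns a value in [-1, 1]; invariant to linear brightness changes
    # a -> g*a + o with gain g > 0 (see "Invariant to linear changes in
    # brightness" below).
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0.0:
        return 0.0  # constant images carry no correlation information
    return float(np.dot(a, b) / denom)

For example, ncorr(2.0 * img + 10.0, ref) equals ncorr(img, ref) up to floating point error.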
Collaborative property
With illumination and observation angles fixed, changes in shape manifest as changes in surface orientations. Let's assume that we have access to a function R representing the mapping from surface orientations to image intensities; we could then relate, for a single image, the gray value at position (x, y) to its surface orientation and surface reflectance (p, q) using R, that is
$I_t(x,y) = R_t(p,q)$

For small transformations and small time intervals, changes in I are correlated with changes in R. In cases where transformations, deformations and so on are too large, the normalized correlation value drops rapidly toward zero. Put another way, when images represent smooth transformations, normalized correlation values vary smoothly; if there is an abrupt transformation, the normalized correlation value drops abruptly.

Invariant to linear changes in brightness
This property falls out of the definition of normalized correlation. Note, however, that if motion in an image sequence is radiometric (i.e. not a geometric movement but a change in luminance), successive normalized correlation values will still indicate a change. This is a desirable quality, as we found in a long biological observation of random cell movements: a clear linear component in our otherwise random characterization turned out to be the microscope lamp dying slowly.

Image in the containing image set (order)
It should be clear by now that our approach is centered on the concept that sharing similar or dissimilar images should motivate grouping. Shannon's entropy2,14 measures the average amount of information a correlation event conveys when it appears with some probability.
Using similar notation as above, we have a self-similarity matrix M, $M \in \mathbb{R}^{n \times n}$, where $M_{ij} = s(I_i, I_j)$ is the pairwise
similarity between $I_i$ and $I_j$, $i < j$, for the ordered set in time. For any row of M, say k, $K_M$ denotes the collection of similarity values, forming a probability distribution given by

$p(K_M) = K_M / \sum K_M$

and the entropy of this distribution is

$H(K_M) = -\sum p(K_M) \log p(K_M)$

We call $H(K_M)$ the entropic measure of similarity (EMS). EMS is an estimate of the disorder in the similarities involving image $I_k$, and hence an estimate of the disorder in the system being viewed. In our future work, we will appraise similarity functions by how closely the entropic similarity measure estimates the underlying system.
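The following is a minimal sketch of the EMS computation as defined above, again in Python/NumPy and under our own assumptions; in particular, normalized correlation values can be negative, and the paper does not say how they enter the row distributions, so we clip them to a small positive floor. A windowed variant, one plausible reading of the moving temporal window mentioned in the introduction, is also sketched.

import numpy as np

def ems(images, similarity):
    # Self-similarity matrix M with M[i, j] = s(I_i, I_j).
    n = len(images)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = similarity(images[i], images[j])
    M = np.clip(M, 1e-12, None)  # assumption: floor values so each row sums to > 0
    # Each row k becomes the distribution p(K_M) = K_M / sum(K_M).
    P = M / M.sum(axis=1, keepdims=True)
    # H(K_M) = -sum p(K_M) log p(K_M): one EMS value per image.
    return -(P * np.log(P)).sum(axis=1)

def ems_windowed(images, similarity, window=9):
    # EMS of each frame within its odd-length temporal window.
    half = window // 2
    out = [ems(images[t - half:t + half + 1], similarity)[half]
           for t in range(half, len(images) - half)]
    return np.array(out)

Calling ems(frames, ncorr) on an ordered list of frames yields the kind of time series plotted in the figures that follow.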
A simple example (point set)
To illustrate the power of this approach, let us consider a two dimensional point set of 1000 points sampled randomly from a circular annulus with center pc, inner radius r1 and outer radius r2, with Euclidean distance function D and the set of points p normalized to a mean of zero and variance of 1, so that
$r_1^2 \leq D(p, p_c)^2 \leq r_2^2$

Euclidean distance is the similarity function used (Octave/Matlab code available). In Fig 2, the entropic measure of similarity is plotted as z for the point set p. The entropic similarity surface captures the underlying structure, the ring, in the 2d data set.

Figure 2: Entropic similarity measure plotted for the 2d annulus point set
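The Octave/Matlab code referred to above is the original; as an illustration, here is our own Python re-sketch of the annulus experiment. The exact similarity function is an assumption on our part (a Gaussian of the Euclidean distance), since the text only states that Euclidean distance is the basis of the similarity.

import numpy as np

rng = np.random.default_rng(0)

# 1000 points sampled from a circular annulus r1 <= |p - pc| <= r2 (pc at the origin).
n, r1, r2 = 1000, 1.0, 2.0
theta = rng.uniform(0.0, 2.0 * np.pi, n)
radius = np.sqrt(rng.uniform(r1 ** 2, r2 ** 2, n))  # uniform over the annulus area
pts = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)    # zero mean, unit variance

# Pairwise Euclidean distances D(p_i, p_j), mapped to similarities (our choice).
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
M = np.exp(-(D ** 2))

# Row-normalize and take the Shannon entropy: one EMS value per point,
# plotted as z over (x, y) to obtain the surface of Fig 2.
P = M / M.sum(axis=1, keepdims=True)
z = -(P * np.log(P)).sum(axis=1)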
Single moving object sequence
Fig 3 shows a subset of image captures of a single contracting heart muscle cell. Fig 4 shows the entropic measure of similarity time series. Contraction in heart muscle cells has a rapid acceleration phase followed by relaxation. We have used this method of direct measurement to quantify the type and magnitude of motion in different biological systems.

Figure 3: A subset of a heart muscle contraction sequence

Figure 4: Entropic similarity measure for the heart muscle contraction sequence (x axis: seconds at 30 frames per second; y axis: EMS)
Multiple objects / multiple movement sequence
A sample video, Finale.mov, was taken from the tutorial clips in Adobe Premiere©. The results shown in Figure 5 illustrate the ability to characterize pans and periodic movements.
Movie trailer content, different compression settings, and clip localization
Movie trailer: http://www.apple.com/trailers/weinstein/thenannydiaries/
The entropic similarity measure is nearly identical for the trailers encoded at different resolutions (Fig 6). To demonstrate clip localization, a segment was selected at frame 1490, lasting 140 frames. Fig 7 shows the section of the content signature at frame 1490 in contrast with the signature generated from the 140 frames of the clip.
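The paper does not spell out the matching procedure behind clip localization; one plausible minimal sketch is to slide the clip's EMS signature along the content signature and score each offset with normalized correlation. The function below is our assumption, not the published method; for the example above it should peak near frame 1490.

import numpy as np

def locate_clip(content_ems, clip_ems):
    # Slide the clip signature over the content signature and return the
    # (start_frame, score) of the best normalized-correlation match.
    s = np.asarray(content_ems, dtype=np.float64)
    c = np.asarray(clip_ems, dtype=np.float64)
    m = len(c)
    c = c - c.mean()
    cn = np.sqrt(np.dot(c, c))
    best_start, best_score = 0, -np.inf
    for start in range(len(s) - m + 1):
        w = s[start:start + m]
        w = w - w.mean()
        denom = cn * np.sqrt(np.dot(w, w))
        score = np.dot(w, c) / denom if denom > 0.0 else 0.0
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score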
Figure 5: Analysis of Finale.mov from Adobe© Premiere Training (EMS over roughly 16 seconds; annotated events include the pan sequence center, a rider riding away, cuts, more riders, and pan + movement)
Figure 6: Entropic similarity measure for the movie trailer at low and medium resolution

Figure 7: Entropic similarity measure for the clip: the content signature section at the clip location and the clip's own signature
Acknowledgements
This work, including the unique video analysis software
implementation, would not have been possible without the
many contributions of a small group of remarkable
biologists, doctors, engineers, and colleagues I have
worked with throughout my career and without the
consistent support of my wife Jenna from my very first
incoherent rambling about how images could talk.
References
1. Horn, B.K.P. & Bachman, B.L. Using Synthetic Images to Register Real Images with Surface Models. In Artificial Intelligence: An MIT Perspective, Vol. 2 (eds. Winston, P.H. & Brown, R.H.), 129-160 (MIT Press, 1979).
2. Theodoridis, S. & Koutroumbas, K. Pattern Recognition, 3rd edition, p. 106 (Academic Press, 2006).
3. Barron, J.L., Fleet, D.J. & Beauchemin, S. Performance of optical flow techniques. International Journal of Computer Vision 12(1):43-77 (1994).
4. Hu, W., Tan, T., Wang, L. & Maybank, S. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34(3):334-352 (2004).
5. Trammell, J. & Ktonas, P. A simple nonlinear deterministic process may generate the timing of rapid eye movements during human REM sleep. In Proceedings of the First International IEEE EMBS Conference on Neural Engineering, 324-327 (2003).
6. Plotnik, A.M. & Rock, S.M. Quantification of cyclic motion of marine animals from computer vision. In Proceedings of MTS/IEEE Oceans 2002, Vol. 3, 1575-1581 (Biloxi, MS, USA, 2002).
7. Cutler, R. & Davis, L. Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):781-796 (2000).
8. Xiong, Y. & Quek, F. Hand motion oscillatory gestures and multimodal discourse analysis. International Journal of Human-Computer Interaction 21(3):285-312 (2006).
9. Garakani, A., Hack, A., Roberts, P. & Walter, S. Method and Apparatus for Acquisition, Compression, and Characterization of Spatiotemporal Signals. US20030185450A1 (2003).
10. Migliore, D., Matteucci, M., Naccari, M. & Bonarini, A. A revaluation of frame difference in fast and robust motion detection. In Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, 215-218 (2006).
11. Mansouri, A. & Konrad, J. Multiple motion segmentation with level sets. IEEE Transactions on Image Processing 12(2):201-220 (2003).
12. Min, C. & Medioni, G. Motion segmentation by spatiotemporal smoothness using 5D tensor voting. In Computer Vision and Pattern Recognition Workshop (CVPRW'06) (2006).
13. Zhao, G., Cui, L., Li, H. & Pietikäinen, M. Gait recognition using fractal scale and wavelet moments. In 18th International Conference on Pattern Recognition (ICPR'06), Vol. 4, 453-456 (2006).
14. Shannon, C.E. & Weaver, W. The Mathematical Theory of Communication (Univ. Illinois Press, 1949).
15. Korimilli, K. & Sarkar, S. Motion segmentation based on perceptual organization of spatio-temporal volumes. In 15th International Conference on Pattern Recognition (ICPR'00), Vol. 3, 3852 (2000).
16. Medioni, G.G., Cohen, I., Brémond, F., Hongeng, S. & Nevatia, R. Event detection and analysis from video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(8):873-889 (2001).
17. Jackson, R.M.B. Target tracking using area correlation. In International Conference on Advanced Infrared Detectors and Systems, IEE Conference Publication No. 204, 124-130 (London, UK, 1981).
18. Drubin, D.A., Garakani, A.M. & Silver, P.A. Motion as a phenotype: the use of live-cell imaging and machine visual screening to characterize transcription-dependent chromosome dynamics. BMC Cell Biology 7:19 (2006).
19. Hack, A.A., Ahouse, J., Roberts, P.G. & Garakani, A.M. New methods for automated phenotyping of complex cellular behaviors. In 26th IEEE EMBS Annual International Conference (San Francisco, 2004).