Video Object Recognition
Chenyi Chen
Motion is important
• How important?
• Let’s first look at “Visual Parsing After
Recovery From Blindness”
• This is a real “vision” paper
Background
• Study how three Indian patients (subjects)
develop object recognition ability after long-term
blindness
• Give treatment to the subjects
• During recovery, test the subjects to see how
they perform on recognition tasks
Background
• The subjects are:
• S.K.: age 29, male, blind from birth, M.A. in
political science
• J.A.: age 13, male, blind from birth, never
received formal education
• P.B.: age 7, male, blind from birth
• Control group: 4 normally sighted adults from a
similar social background
Subjects’ parsing of static images
S.K. versus a simple region-partition
algorithm
Dynamic information in object
segregation
Motility rating and object recognition
results
Follow-up testing after several months
What do we learn about developing
visual parsing skills?
• Early stages: integrative impairments and
over-fragmentation of images compromise
recognition performance
• However, motion effectively mitigates these
integrative difficulties
• Motion appears to be instrumental both in
segregating objects and in binding their
constituents into representations for
recognition
• So we have some insight into how people
develop visual recognition ability
• Can we reproduce visual learning process on a
robot?
• Let’s look at “Learning about Humans During
the First 6 Minutes of Life”
A baby robot
Hypothesis in social development
• The infant brain is particularly sensitive to the
presence of contingencies
• The contingency drives the definition and
recognition of caregivers
• Human faces become attractive because they
tend to occur in high contingency situations
Goal
• Determine whether acoustic contingency
information (sound) is sufficient for the robot
to develop preferences for human faces
• If so, get a sense of the time scale of the
learning problem
A baby robot
Settings
• The baby robot interacted with the lab
members while recording the images it saw
• A contingency-detection engine analyzes the
sound signal for the presence of contingencies
• Whether people were present is not specified
• Whether people were of any particular
relevance is not specified
• The only training label is the acoustic
contingency signal
Visual learning engine
• Probabilistic model
• Only needs the images to be weakly labeled as
containing the object of interest with high or
low probability; the labels need not indicate
where the objects are located on the image
plane
• Implementable in a neural network
• Runs in real time at video frame rate
Hardware
• Plush baby doll
• IEEE1394a webcam (capture images, only
grayscale images used for training)
• Microphone (receive auditory signal)
• Loudspeaker (baby makes excited noise)
Collecting data
• Record the auditory and visual signals for 88
minutes
• 2877 positive examples
• 824 negative examples
• The baby robot was placed in a chair, a stroller,
and a crib, under bright or dim lighting conditions
• 9 persons interacted with the baby robot
Collecting data
• Select 34 positive examples and 200 negative
examples for training (approx. 5 min 34 sec).
The rest are used for testing
• The labels are noisy
Results
• Evaluation: 2-Alternative Forced Choice task (2AFC);
a minimal scoring sketch follows this list
• 86.17% correct on the face detection task (i.e.,
deciding which of two images contained a face)
• 89.7% correct on the contingency task (i.e.,
deciding which of two images was more likely to
be associated with an auditory contingency)
• 92.3% correct on the person detection task (i.e.,
deciding which image contained a person)
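A 2AFC score can be computed directly from per-image detector outputs. A minimal sketch, assuming the model produces one scalar score per image (function and variable names are illustrative):

    import numpy as np

    def two_afc_accuracy(pos_scores, neg_scores):
        # Fraction of (positive, negative) pairs in which the positive
        # image gets the higher score; ties count as half. This equals
        # the area under the ROC curve.
        pos = np.asarray(pos_scores, dtype=float)[:, None]
        neg = np.asarray(neg_scores, dtype=float)[None, :]
        wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
        return wins / (pos.size * neg.size)

    print(two_afc_accuracy([0.9, 0.8], [0.3, 0.4]))  # 1.0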
Results
• Example images and their pixel-wise
probability maps
Results
• Infants showed a significant tracking-preference
order favoring the face stimulus, followed by the
scrambled stimulus, followed by the empty
stimulus
• The robot reproduces this preference order
• Video usually contains more data for object
detector training
• There is a domain difference between videos
and still images
• So “Analysing domain shift factors between
videos and images for object detection” is
necessary
Goal
• For a given target test domain (image or
video), the performance of the detector
depends on the domain it was trained on.
• Examine the reasons behind this performance
gap.
• Train an object detector with samples either
from still images or from video frames and
then test the detector on both domains.
Dataset
• Still images (VOC)
• PASCAL VOC 2007
• 10 classes of moving objects chosen
Dataset
• Video frames (VID)
• YouTube-Objects dataset
• 10 classes of moving objects
• Further annotated a few images to make the
dataset have labels comparable with VOC
Equalizing the number of samples per
class
• Equalize the training samples of VOC and VID
• 3097 samples in total over the 10 classes (Table 1)
• Only the equalized training sets are used
• The resulting sets are called trainVOC and trainVID
Domain shift factors
• Spatial location accuracy: accuracy of
bounding box
• Appearance diversity: consecutive frames in
video are similar, thus less diverse
• Image quality: compression, motion blur etc.
in video images
• Object detector: DPM
Spatial location accuracy
• Methods of getting bounding boxes on video:
• PRE: worst
• FVS: better
• Manual labeling: best
Spatial location accuracy
• Reduces the gap by almost 4% (testing on VOC)
Spatial location accuracy
• Equalization: using the ground-truth (human-labeled)
bounding boxes on trainVID
Appearance diversity
• Near-identical samples of an object in video
Appearance diversity
• Measure diversity:
• Clustering (agglomerative clustering, L2
distance of HOG features): each cluster
contains visually very similar samples
• Measure appearance diversity by counting the
number of clusters (see the sketch below)
• Equalization: resample the training sets so the
numbers of images and clusters (of trainVOC
and trainVID) are equal
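A minimal sketch of this diversity measure, assuming grayscale object crops resized to a common size; the merge threshold is an illustrative value, not the paper's:

    import numpy as np
    from skimage.feature import hog
    from sklearn.cluster import AgglomerativeClustering

    def count_appearance_clusters(crops, merge_threshold=1.0):
        # One HOG descriptor per crop; agglomerative clustering with
        # L2 (Euclidean) distances groups visually near-identical samples.
        feats = np.stack([hog(c) for c in crops])
        clustering = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=merge_threshold,  # assumed value
            linkage="average",                   # Euclidean metric by default
        )
        labels = clustering.fit_predict(feats)
        return len(set(labels))                  # diversity proxy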
Appearance diversity
Appearance diversity
• Bridges the gap by 3.5% (testing on VOC)
Image quality
• Gradient energy: sum of gradient magnitudes in
HOG cells
• Equalization: blur trainVOC by applying a
Gaussian filter (see the sketch below)
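A sketch of both steps, assuming grayscale images; the range of candidate sigmas is an assumption:

    import numpy as np
    from scipy import ndimage

    def gradient_energy(gray):
        # Sum of gradient magnitudes (the paper sums per HOG cell;
        # a global sum is used here for brevity).
        gx = ndimage.sobel(gray.astype(float), axis=1)
        gy = ndimage.sobel(gray.astype(float), axis=0)
        return np.hypot(gx, gy).sum()

    def blur_to_match(gray, target_energy, sigmas=np.linspace(0.1, 3.0, 30)):
        # Equalization: pick the Gaussian sigma whose blur brings the
        # VOC image's gradient energy closest to the (lower) VID level.
        blurred = [ndimage.gaussian_filter(gray.astype(float), s) for s in sigmas]
        errors = [abs(gradient_energy(b) - target_energy) for b in blurred]
        return blurred[int(np.argmin(errors))]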
Image quality
Image quality
• Closes the gap by 1% (testing on VOC)
Training-test set correlation
• The final 7% performance gap
• Domain-specific correlation/bias
• Find nearest neighbor of testing images in
both training sets
Training-test set correlation
• According to the nearest-neighbor criterion:
• testVOC is most similar to trainVOC
• testVID is most similar to trainVID
• Such correlation leads to the final
performance gap
• Now we understand the gap between video
domain and still image domain
• We still want to try transferring the knowledge
learnt in video domain to image domain
• OK, then let’s look at “Learning Object Class
Detectors from Weakly Annotated Video”
Benefits of video
• Easier to automatically segment the object
from the background based on motion
information
• Shows significant appearance variations of an
object
• Provide a large number of training images
Pipeline
• Each video contains an object as indicated by
the video tag
• Automatically localize object in video clips,
output one bounding box for each video
• Learn a detector from the video images and
corresponding bounding boxes
• Domain adaptation
• Test the detector on PASCAL 07 dataset
Localizing objects in real-world videos
• Extract shots of coherent motion from each
video
• Robustly fit spatio-temporal bounding boxes
(tubes) to each shot (3-15 tubes per shot)
• Jointly select one tube per shot by
minimizing an energy function based on similarity
• The selected tubes are the output of the
algorithm (used to train a detector)
Localizing objects in real-world videos
Localizing objects in real-world videos
• Temporal partitioning into shots
• Detect abrupt changes of the visual content of
the video
• Threshold color-histogram differences between
consecutive frames (see the sketch below)
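A minimal sketch of such a shot-boundary detector; the bin count and threshold are assumed values:

    import numpy as np

    def shot_boundaries(frames, bins=16, threshold=0.3):
        # Normalized 3-D color histogram per frame, then an L1-based
        # difference between consecutive frames; a boundary is declared
        # where the difference exceeds the threshold.
        hists = []
        for frame in frames:                     # frame: HxWx3, uint8
            h, _ = np.histogramdd(frame.reshape(-1, 3), bins=bins,
                                  range=[(0, 256)] * 3)
            hists.append(h.ravel() / h.sum())
        diffs = [0.5 * np.abs(h2 - h1).sum()     # in [0, 1]
                 for h1, h2 in zip(hists, hists[1:])]
        return [i + 1 for i, d in enumerate(diffs) if d > threshold]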
Localizing objects in real-world videos
• Forming candidate tubes
• Large-displacement optical flow (LDOF)
• Clustering the dense point tracks based on the
similarity in their motion and proximity in
location
• Fit spatio-temporal bounding box to each
motion segment
Localizing objects in real-world videos
• Example of tubes
Localizing objects in real-world videos
• Joint selection of tubes
• Energy function: E(L) = Σ_s Φ(l_s) + α Σ_(s,q) Ψ(l_s, l_q)
• Each shot s has multiple candidate tubes
• Select one tube l_s from the candidates for each
shot s
• L: the selected tubes over all shots
• Coefficient α weighs the pairwise term against
the unary term
Localizing objects in real-world videos
• The pairwise potential Ψ
• Measures appearance dissimilarity
• Encourages selecting tubes that look similar over
time
• Tubes l_s, l_q come from two different shots s, q
• Dissimilarity functions Δ (over two types of features)
compare the appearance of the two tubes
Localizing objects in real-world videos
• The unary potential Φ
• Measures the cost of selecting tube l_s in shot s
• Δ: prefers tubes that are visually homogeneous
• Γ: percentage of the bounding-box perimeter
touching the image border
• Ω: objectness probability of the bounding box
Localizing objects in real-world videos
• Minimization
• Find the configuration L* of tubes over all
shots that minimizes the energy E (a
dynamic-programming sketch follows)
• L* is the final output and will be used to train
an object detector
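As a concrete illustration, here is a Viterbi-style minimizer for an energy of this form, under the simplifying assumption that pairwise terms link consecutive shots only (the paper's energy is more general):

    import numpy as np

    def select_tubes(unary, pairwise, alpha=1.0):
        # Minimizes E(L) = sum_s phi(l_s) + alpha * sum_s psi(l_s, l_{s+1})
        # by dynamic programming over the chain of shots.
        #   unary[s][i]       : phi for tube i in shot s
        #   pairwise[s][i, j] : psi between tube i (shot s) and tube j (shot s+1)
        cost = [np.asarray(unary[0], dtype=float)]
        back = []
        for s in range(1, len(unary)):
            trans = cost[-1][:, None] + alpha * np.asarray(pairwise[s - 1])
            back.append(np.argmin(trans, axis=0))     # best predecessor
            cost.append(np.asarray(unary[s]) + trans.min(axis=0))
        tubes = [int(np.argmin(cost[-1]))]
        for b in reversed(back):                      # backtrack
            tubes.append(int(b[tubes[-1]]))
        return tubes[::-1]                            # one tube index per shot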
Results
• Automatic tube selection
• Compare with ground-truth bounding boxes
(IoU >= 50%)
• The automatic tube selection technique
selects the best available tube most of the time
Learning a detector from the selected
tubes
• Sampling bounding boxes
• Reduce their number to a manageable quantity
• Select the samples most likely to contain relevant
objects (reusing exactly the Γ and Ω terms above)
• Train the object detector
• DPM
• SPM
Results
• Models:
• VOC: model trained on PASCAL dataset
• VMA: model trained on manually annotated
frames from video
• VID: model trained on video with the
proposed automatic pipeline
• Test on PASCAL dataset
Results
• Object detector without domain adaptation
• The bounding boxes generated by the
proposed pipeline are close to the manually
labeled ones
• The performance gap across domains is large
Domain adaptation
• Domain difference:
• Higher HOG gradient energy in images
• An SVM based on GIST features distinguishes
video frames from still images with 83% accuracy
Domain adaptation
• Large quantity of video (source domain)
training data, small number of PASCAL image
(target domain) training data
• Adaptation methods:
• All: directly train a single classifier using the
union of all available training data
• Pred: use the output of the source classifier as
an additional feature for training the target
classifier
Domain adaptation
• Adaptation methods (cont.):
• Prior: the parameters of the source classifier
are used as a prior when learning the target
classifier
• LinInt: first train two separate classifiers fs(x),
ft(x) from the source and the target training
data, and then linearly interpolate their
predictions on new target data at test time
(one-line sketch below)
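A one-line sketch of LinInt; the interpolation weight lam would be chosen on held-out target data (names are illustrative):

    def lin_int(f_source, f_target, lam):
        # Linearly interpolate the prediction scores of two
        # independently trained classifiers at test time.
        return lambda x: lam * f_source(x) + (1.0 - lam) * f_target(x)

Because lam = 0 recovers the target-only classifier, a properly validated LinInt cannot do worse than training on the target domain alone, which explains its resistance to negative transfer.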
Results
• Object detector with domain adaptation
• Improvement w.r.t. VOC model
• Most methods (those combining VOC training data at
an early stage) degrade performance
• LinInt is immune to negative transfer
• Knowledge can be transferred from video to image
domain
• We can not only automatically output
bounding boxes on video, but also
automatically segment the video into
background and foreground objects
• Let's look at "Fast object segmentation in
unconstrained video"
Goal
• Propose an automatic technique for
separating foreground objects from the
background in a video
• Two main stages:
• 1) Efficient initial foreground estimation
• 2) Foreground-background labeling
refinement
Efficient initial foreground estimation
• Optical flow: uses a method that supports large
displacements and has an efficient GPU implementation
• Motion boundaries:
• magnitude of the gradient of the optical flow
field
• difference in direction between the motion of
pixel p and its neighbors N(p): if p is moving in a
different direction than all its neighbors, it is
likely to lie on a motion boundary (see the sketch below)
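A sketch of both motion-boundary cues, assuming the dense flow components u, v are given as 2-D arrays:

    import numpy as np

    def flow_gradient_magnitude(u, v):
        # Magnitude of the spatial gradient of the flow field;
        # strong values indicate likely motion boundaries.
        guy, gux = np.gradient(u)
        gvy, gvx = np.gradient(v)
        return np.sqrt(gux**2 + guy**2 + gvx**2 + gvy**2)

    def direction_disagreement(u, v):
        # Smallest angular difference between each pixel's motion
        # direction and its 4-neighbours: a large value means the
        # pixel moves differently from *all* of its neighbours.
        ang = np.arctan2(v, u)
        out = np.full(ang.shape, np.inf)
        for axis, shift in ((0, 1), (0, -1), (1, 1), (1, -1)):
            d = np.abs(ang - np.roll(ang, shift, axis=axis))
            out = np.minimum(out, np.minimum(d, 2 * np.pi - d))
        return out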
Efficient initial foreground estimation
Efficient initial foreground estimation
• Problems with the motion boundaries:
• Do not completely cover the whole object
boundary
• Subject to false positives
Efficient initial foreground estimation
• Inside-outside map (e.g. pixel level, 0: outside,
1: inside)
• Estimates whether a pixel is inside the object
based on the point-in-polygon problem
• Any ray originating inside a closed curve
intersects it an odd number of times; any ray
originating outside intersects it an even
number of times
Efficient initial foreground estimation
• Inside-outside map (cont.)
• Handles incomplete motion boundaries
• Shoot 8 rays spaced by 45 degrees
• Majority vote for the final decision (a brute-force
sketch follows)
• Optimized data structure for a linear-time
implementation when computing the map
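A brute-force sketch of the per-pixel ray-casting vote; the paper's linear-time structure is replaced here by a direct walk along each ray, so this runs in O(pixels x ray length):

    import numpy as np

    def inside_outside_map(boundary):
        # boundary: binary 2-D array marking motion-boundary pixels.
        # Each of 8 rays votes "inside" if it crosses the boundary an
        # odd number of times; the majority of the rays decides.
        h, w = boundary.shape
        dirs = [(0, 1), (1, 1), (1, 0), (1, -1),
                (0, -1), (-1, -1), (-1, 0), (-1, 1)]
        inside = np.zeros((h, w), dtype=bool)
        for y in range(h):
            for x in range(w):
                votes = 0
                for dy, dx in dirs:
                    crossings, prev_on = 0, False
                    py, px = y, x
                    while 0 <= py < h and 0 <= px < w:
                        on = bool(boundary[py, px])
                        if on and not prev_on:   # a run of boundary
                            crossings += 1       # pixels counts once
                        prev_on = on
                        py, px = py + dy, px + dx
                    votes += crossings % 2       # odd => "inside" vote
                inside[y, x] = votes >= 5        # majority of 8 rays
        return inside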
Foreground-background labelling
refinement
• Pixel labelling problem with two labels
(foreground and background)
• Oversegment each frame into superpixels
• Assign labels to superpixels
• Superpixel i at frame t takes a label
• All superpixels' labels in all frames
Foreground-background labelling
refinement
• Energy function over the superpixel labels, with
unary terms (appearance A, location L) and pairwise
terms (spatial V, temporal W); see the next slide
• The output segmentation is the labelling that
minimizes the energy
• Minimize with graph cut
Foreground-background labelling
refinement
• In the energy function:
• A: appearance model, one for foreground, one
for background; estimated based on the
inside-outside map
• L: location model; propagates the per-frame
inside-outside maps over time to build a more
complete location prior
• V: spatial smoothness potential defined over
edges
• W: temporal smoothness potential defined over
edges (a graph-cut sketch follows)
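A sketch of this kind of two-label graph-cut refinement using the PyMaxflow package, with a pixel grid and a simple Potts smoothness term standing in for the paper's superpixel graph and its V and W potentials (cost inputs and weight are illustrative):

    import numpy as np
    import maxflow  # PyMaxflow

    def refine_labels(fg_cost, bg_cost, smoothness=1.0):
        # fg_cost / bg_cost: per-pixel unary costs of labelling a pixel
        # foreground / background (e.g. negative log-likelihoods of the
        # appearance and location models).
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(fg_cost.shape)
        g.add_grid_edges(nodes, smoothness)          # 4-connected Potts term
        g.add_grid_tedges(nodes, fg_cost, bg_cost)   # unary terms
        g.maxflow()                                  # exact minimum cut
        return g.get_grid_segments(nodes)            # True = foreground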
Experiment evaluation
• SegTrack dataset: 6 videos
• Evaluation: number of wrongly labeled pixels,
averaged over all frames
Experiment evaluation
• Considerably outperforms [6, 4, 18]
• On par with [14], which is remarkable, given
that the proposed approach is simpler
• [27] achieves a lower error, but is much slower
• The SegTrack dataset is saturated
Experiment evaluation
• YouTube-Objects dataset:
• 10 diverse object classes
• Ground truth bounding box provided for some
frames
• Fit a bounding box to the largest connected
component of the segmentation output
• Evaluation: PASCAL criterion (IoU>=0.5)
Experiment evaluation
Experiment evaluation
• Runtime:
• Intel Core i7 2.0GHz machine
• Given optical flow and superpixels, it takes 0.5
sec/frame
• Considerably faster than the other strong
baselines (typically >100 sec/frame)
• With videos we can extract objects, but we can
do something even crazier: revealing subtle
movements of objects
• Finally, let's look at "Eulerian Video
Magnification for Revealing Subtle Changes in
the World"
Goal
• Reveal temporal variations in videos that are
difficult or impossible to see with the naked
eye and display them in an indicative manner
• Input: a standard video sequence
• Output: an amplified signal that reveals hidden
information
Results
• https://www.youtube.com/watch?v=e9ASH8IBJ2U
It’s amazing, right?
How does it work?
• First-order motion example:
• I(x, t) = f(x + δ(t)): image intensity at position x
and time t
• δ(t): displacement function, with I(x, 0) = f(x)
• Motion magnification: synthesize the signal
Î(x, t) = f(x + (1 + α)δ(t))
• α: amplification factor
How does it work?
• B(x, t): the result of applying a broadband temporal
bandpass filter to I(x, t), picking out everything
except f(x)
• Define B(x, t) = I(x, t) − f(x) ≈ δ(t) ∂f(x)/∂x
(first-order Taylor expansion)
• Then add the amplified band back:
Ĩ(x, t) = I(x, t) + αB(x, t)
• And Ĩ(x, t) ≈ f(x + (1 + α)δ(t)): the displacement
is magnified (a numeric check follows)
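A 1-D numeric check of this first-order analysis, with an assumed smooth profile and a single small translation:

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 1000)
    f = np.sin                        # low-spatial-frequency profile
    delta, alpha = 0.01, 5.0          # small motion, amplification factor

    I = f(x + delta)                  # observed frame: f(x + delta)
    B = I - f(x)                      # bandpass keeps everything but f(x)
    magnified = I + alpha * B         # I + alpha * B
    target = f(x + (1 + alpha) * delta)

    # The two agree to first order (max error ~1e-3 here)
    print(np.abs(magnified - target).max())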
How does it work?
How does it work?
• General case: δ(t) is not entirely within the
passband of the temporal filter
• δ_k(t): the different temporal spectral components
of δ(t), each attenuated by the filter by a
factor γ_k, so B(x, t) ≈ Σ_k γ_k δ_k(t) ∂f(x)/∂x
• Then, letting α_k = γ_k α, the output is
Ĩ(x, t) ≈ f(x + Σ_k (1 + α_k) δ_k(t)): each
component gets its own effective amplification
How does it work?
• We need the first-order Taylor expansion to hold
• For high spatial frequencies and a large
amplification factor α, it may not
• So we have the bound (1 + α)δ(t) < λ/8, where λ
is the spatial wavelength
• Higher spatial frequencies (smaller λ) allow only a
smaller α (worked example below)
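A worked instance of the bound with assumed numbers:

    # For spatial wavelength lam = 16 pixels and a true displacement
    # of delta = 0.1 pixels, (1 + alpha) * delta < lam / 8 gives:
    lam, delta = 16.0, 0.1
    alpha_max = lam / (8 * delta) - 1
    print(alpha_max)   # 19.0; halving lam would halve (1 + alpha_max)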
How does it work?
• Artifacts
How does it work?
• Multiscale analysis
• Scale-varying process
• Assign different magnification factors α to
different spatial frequency bands
Pipeline
Pipeline
• Spatial decomposition: video pyramid
constructed by a separable binomial filter of
size five
• Example temporal filters (used to produce
B(x, t)); the choice is task-specific (see the
sketch below)
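A minimal sketch of both pipeline stages: one spatial pyramid level built with the size-5 separable binomial filter, and one possible temporal filter, an ideal FFT bandpass (the cutoff frequencies are task-specific assumptions):

    import numpy as np
    from scipy import ndimage

    BINOMIAL5 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

    def pyramid_down(frame):
        # One level of the spatial decomposition: separable binomial
        # blur of size five, then subsampling by a factor of 2.
        blurred = ndimage.convolve1d(frame, BINOMIAL5, axis=0)
        blurred = ndimage.convolve1d(blurred, BINOMIAL5, axis=1)
        return blurred[::2, ::2]

    def ideal_bandpass(video, fps, lo_hz, hi_hz):
        # video: T x H x W array; keep only temporal frequencies in
        # [lo_hz, hi_hz] (e.g. around the heart rate for pulse videos).
        freqs = np.fft.rfftfreq(video.shape[0], d=1.0 / fps)
        spectrum = np.fft.rfft(video, axis=0)
        spectrum[(freqs < lo_hz) | (freqs > hi_hz)] = 0.0
        return np.fft.irfft(spectrum, n=video.shape[0], axis=0)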
Thank you!