Video Object Recognition
Chenyi Chen

Motion is important
• How important?
• Let's first look at "Visual Parsing After Recovery From Blindness"
• This is a real "vision" paper

Background
• Studies how three Indian patients (the subjects) develop object recognition ability after long-term blindness
• The subjects were given treatment
• During recovery, the subjects were tested to see how they perform on recognition tasks

Background
• The subjects:
• S.K.: age 29, male, blind from birth, M.A. in political science
• J.A.: age 13, male, blind from birth, never received formal education
• P.B.: age 7, male, blind from birth
• Control group: 4 normally sighted adults with a similar social background

Subjects' parsing of static images

S.K. versus a simple region-partition algorithm

Dynamic information in object segregation

Motility rating and object recognition results

Follow-up testing after several months

What do we learn about developing visual parsing skill?
• Early stages: integrative impairments and over-fragmentation of images compromise recognition performance
• However, motion effectively mitigates these integrative difficulties
• Motion appears to be instrumental both in segregating objects and in binding their constituents into representations for recognition

• So we have some insight into how people develop visual recognition ability
• Can we reproduce this visual learning process on a robot?
• Let's look at "Learning about Humans During the First 6 Minutes of Life"

A baby robot

Hypothesis in social development
• The infant brain is particularly sensitive to the presence of contingencies
• Contingency drives the definition and recognition of caregivers
• Human faces become attractive because they tend to occur in high-contingency situations

Goal
• Test whether acoustic contingency information (sound) is sufficient for the robot to develop a preference for human faces
• If so, get a sense of the time scale of the learning problem

A baby robot

Settings
• The baby robot interacted with lab members while recording the images it saw
• A contingency-detection engine analyzes the sound signal for the presence of contingencies
• Whether people were present is not specified
• Whether people were of any particular relevance is not specified
• The only training label is the acoustic contingency signal

Visual learning engine
• Probabilistic model
• Only needs the images to be weakly labeled as containing the object of interest with high or low probability; it does not need to know where the objects are located on the image plane (see the sketch below)
• Implementable in a neural network
• Runs in real time at video frame rate

Hardware
• Plush baby doll
• IEEE 1394a webcam (captures images; only grayscale images are used for training)
• Microphone (receives the auditory signal)
• Loudspeaker (the baby makes excited noises)

Collecting data
• Recorded the auditory and visual signals for 88 minutes
• 2877 positive examples
• 824 negative examples
• The baby robot was placed in a chair, a stroller, and a crib, under bright or dim lighting conditions
• 9 people interacted with the baby robot

Collecting data
• 34 positive examples and 200 negative examples were selected for training (approx. 5 min 34 sec); the rest are used for testing
• The labels are noisy
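The paper's learning engine is a generative probabilistic model; as a stand-in, the sketch below trains a plain logistic-regression classifier on downsampled grayscale frames, using only the binary acoustic-contingency signal as the weak, noisy label, and scores it with a 2-alternative forced choice. The feature choice, scikit-learn, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: weakly supervised "face preference" learning.
# Assumption: frames are HxW grayscale arrays and the only label is whether
# an acoustic contingency was detected while the frame was captured.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(frame, size=(16, 16)):
    """Downsample a grayscale frame by block averaging and flatten it."""
    h, w = frame.shape
    bh, bw = h // size[0], w // size[1]
    small = frame[:bh * size[0], :bw * size[1]] \
        .reshape(size[0], bh, size[1], bw).mean(axis=(1, 3))
    return small.ravel() / 255.0

def train_contingency_model(frames, contingency_labels):
    """frames: list of HxW uint8 arrays; contingency_labels: 0/1 per frame."""
    X = np.stack([features(f) for f in frames])
    y = np.asarray(contingency_labels)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)   # weak labels only: no bounding boxes, no face annotation
    return clf

def two_afc_accuracy(clf, pos_frames, neg_frames):
    """2-Alternative Forced Choice: over all (positive, negative) pairs,
    count how often the positive image receives the higher score."""
    sp = clf.predict_proba(np.stack([features(f) for f in pos_frames]))[:, 1]
    sn = clf.predict_proba(np.stack([features(f) for f in neg_frames]))[:, 1]
    return (sp[:, None] > sn[None, :]).mean()
```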
Results
• Evaluation: 2-Alternative Forced Choice task (2AFC)
• 86.17% correct on the face detection task (i.e., deciding which of two images contained a face)
• 89.7% correct on the contingency task (i.e., deciding which of two images was more likely to be associated with an auditory contingency)
• 92.3% correct on the person detection task (i.e., deciding which of two images contained a person)

Results
• Example images and their pixel-wise probability maps

Results
• Infants show a significant order of tracking preference: the face stimulus, followed by the scrambled stimulus, followed by the empty stimulus
• The robot reproduces this preference order

• Video usually contains more data for training object detectors
• But there is a domain difference between video frames and still images
• So "Analysing domain shift factors between videos and images for object detection" is necessary

Goal
• For a given target test domain (image or video), the performance of a detector depends on the domain it was trained on
• Examine the reasons behind this performance gap
• Train an object detector with samples either from still images or from video frames, then test the detector on both domains

Dataset
• Still images (VOC)
• PASCAL VOC 2007
• 10 classes of moving objects chosen

Dataset
• Video frames (VID)
• YouTube-Objects dataset
• The same 10 classes of moving objects
• A few additional frames were annotated so that the dataset has labels comparable to VOC

Equalizing the number of samples per class
• Equalize the training samples of VOC and VID: 3097 samples in total over the 10 classes (Table 1)
• Only the equalized training sets, trainVOC and trainVID, are used

Domain shift factors
• Spatial location accuracy: how accurate the bounding boxes are
• Appearance diversity: consecutive frames in a video are similar, thus less diverse
• Image quality: compression, motion blur, etc. in video frames
• Object detector used throughout: DPM

Spatial location accuracy
• Methods of getting bounding boxes on video: PRE (worst), FVS (better), manual labeling (best)

Spatial location accuracy
• Equalization: use the ground-truth (human-labeled) bounding boxes for trainVID
• This closes almost 4% of the gap (testing on VOC)

Appearance diversity
• Video contains near-identical samples of an object (consecutive frames)

Appearance diversity
• Measuring diversity:
• Clustering (agglomerative clustering on the L2 distance of HOG features): each cluster contains visually very similar samples (see the sketch below)
• Appearance diversity is measured by counting the number of clusters
• Equalization: resample the training sets so that the numbers of images and of clusters in trainVOC and trainVID are equal

Appearance diversity
• Bridges the gap by 3.5% (testing on VOC)
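As a rough illustration of the diversity measure above, the sketch below clusters per-object HOG descriptors with agglomerative clustering and counts the clusters obtained at a fixed distance threshold. The HOG parameters, the linkage method, the distance threshold, and the use of scikit-image/SciPy are assumptions; the paper's exact settings may differ.

```python
# Sketch: measure appearance diversity of a set of object crops by
# agglomerative clustering of HOG features and counting the clusters.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from scipy.cluster.hierarchy import linkage, fcluster

def hog_descriptor(crop, size=(64, 64)):
    """Resize a grayscale object crop to a fixed size and compute HOG."""
    crop = resize(crop, size, anti_aliasing=True)
    return hog(crop, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def count_appearance_clusters(crops, distance_threshold=1.0):
    """Agglomerative clustering (average linkage, L2 distance) of HOG
    descriptors; near-duplicate frames collapse into the same cluster,
    so the number of clusters is a proxy for appearance diversity."""
    X = np.stack([hog_descriptor(c) for c in crops])
    Z = linkage(X, method='average', metric='euclidean')
    labels = fcluster(Z, t=distance_threshold, criterion='distance')
    return len(np.unique(labels)), labels
```

Equalization along the lines of the slide above could then keep only a fixed number of samples per cluster so that trainVOC and trainVID end up with the same number of images and clusters.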
Image quality
• Gradient energy: the sum of gradient magnitudes over HOG cells
• Equalization: blur trainVOC by applying a Gaussian filter

Image quality
• Closes the gap by 1% (testing on VOC)

Training-test set correlation
• The final 7% of the performance gap
• Domain-specific correlation/bias
• Find the nearest neighbor of each test image in both training sets

Training-test set correlation
• According to the nearest-neighbor criterion:
• testVOC is most similar to trainVOC
• testVID is most similar to trainVID
• This correlation accounts for the final performance gap

• Now we understand the gap between the video domain and the still-image domain
• We still want to try transferring the knowledge learnt in the video domain to the image domain
• OK, then let's look at "Learning Object Class Detectors from Weakly Annotated Video"

Benefits of video
• It is easier to automatically segment the object from the background based on motion information
• Video shows significant appearance variations of an object
• Video provides a large number of training images

Pipeline
• Each video contains an object as indicated by the video tag
• Automatically localize the object in the video clips, outputting one bounding box per video
• Learn a detector from the video frames and corresponding bounding boxes
• Domain adaptation
• Test the detector on the PASCAL VOC 2007 dataset

Localizing objects in real-world videos
• Extract shots of coherent motion from each video
• Robustly fit spatio-temporal bounding boxes ("tubes") to each shot (3-15 tubes per shot)
• Jointly select one tube per video by minimizing an energy function based on appearance similarity
• The selected tubes are the output of the algorithm (used to train a detector)

Localizing objects in real-world videos
• Temporal partitioning into shots
• Detect abrupt changes in the visual content of the video
• Threshold color-histogram differences between consecutive frames (see the sketch below)
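A minimal version of this shot-partitioning step: compare normalized color histograms of consecutive frames and declare a shot boundary when the difference exceeds a threshold. The histogram size, the distance measure, and the threshold value are illustrative assumptions, not the paper's exact choices.

```python
# Sketch: temporal partitioning of a video into shots by thresholding
# color-histogram differences between consecutive frames.
import cv2
import numpy as np

def color_histogram(frame, bins=(8, 8, 8)):
    """L1-normalized 3D BGR color histogram of a frame."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, list(bins),
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1).flatten()

def detect_shot_boundaries(video_path, threshold=0.4):
    """Return the frame indices where a new shot starts."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = color_histogram(frame)
        if prev_hist is not None:
            # Chi-square distance between consecutive histograms
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CHISQR)
            if diff > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```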
Localizing objects in real-world videos
• Forming candidate tubes
• Large-displacement optical flow (LDOF)
• Cluster the dense point tracks based on the similarity of their motion and their proximity in location
• Fit a spatio-temporal bounding box to each motion segment

Localizing objects in real-world videos
• Examples of tubes

Localizing objects in real-world videos
• Joint selection of tubes via an energy function
• Each shot s has multiple candidate tubes; select one tube l_s from the candidates of each shot
• The energy is defined over the selected tubes L across all shots, with a coefficient α weighting the pairwise term: E(L) = Σ_s Φ(l_s) + α Σ_(s,q) Ψ(l_s, l_q)

Localizing objects in real-world videos
• The pairwise potential Ψ(l_s, l_q)
• Measures appearance dissimilarity between tubes l_s, l_q selected in two different shots s, q
• Encourages selecting tubes that look similar over time
• Dissimilarity functions Δ (over two types of features) compare the appearance of the two tubes

Localizing objects in real-world videos
• The unary potential Φ(l_s)
• Measures the cost of selecting tube l_s in shot s
• Δ: prefers visually homogeneous tubes
• Γ: the percentage of the bounding-box perimeter touching the image border
• Ω: the objectness probability of the bounding box

Localizing objects in real-world videos
• Minimization
• Find the configuration L* of tubes over all shots that minimizes the energy E
• L* is the final output and is used to train an object detector

Results
• Automatic tube selection
• Compared against the ground-truth bounding boxes (IoU >= 50%)
• The automatic tube selection technique selects the best available tube most of the time

Learning a detector from the selected tubes
• Sampling bounding boxes
• Reduce their number to a manageable quantity
• Select the samples more likely to contain relevant objects (using exactly the Γ and Ω terms above)
• Train the object detector
• DPM
• SPM

Results
• Models:
• VOC: model trained on the PASCAL dataset
• VMA: model trained on manually annotated video frames
• VID: model trained on video with the proposed automatic pipeline
• All tested on the PASCAL dataset

Results
• Object detectors without domain adaptation
• The bounding boxes generated by the proposed pipeline are close to the manually labeled ones
• The performance gap across domains is large

Domain adaptation
• Domain differences:
• Higher HOG gradient energy in still images
• An SVM on GIST features distinguishes video frames from still images with 83% accuracy

Domain adaptation
• Large quantity of video (source domain) training data, small number of PASCAL images (target domain) for training
• Adaptation methods:
• All: directly train a single classifier on the union of all available training data
• Pred: use the output of the source classifier as an additional feature when training the target classifier

Domain adaptation
• Adaptation methods (cont.):
• Prior: the parameters of the source classifier are used as a prior when learning the target classifier
• LinInt: first train two separate classifiers f_s(x), f_t(x) from the source and the target training data, then linearly interpolate their predictions on new target data at test time

Results
• Object detectors with domain adaptation
• Improvement w.r.t. the VOC model
• Most methods (those that mix in the VOC training data at an early stage) degrade performance
• LinInt is immune to this negative transfer

• Knowledge can be transferred from the video domain to the image domain
• We can not only automatically output bounding boxes on video, but also automatically segment a video into background and a foreground object
• "Fast Object Segmentation in Unconstrained Video"

Goal
• Propose an automatic technique for separating foreground objects from the background in a video
• Two main stages:
• 1) Efficient initial foreground estimation
• 2) Foreground-background labelling refinement

Efficient initial foreground estimation
• Optical flow: supports large displacements and has an efficient GPU implementation
• Motion boundaries:
• the magnitude of the gradient of the optical flow field
• the difference in direction between the motion of pixel p and its neighbors N (if p is moving in a different direction than all its neighbors, it is likely to lie on a motion boundary)

Efficient initial foreground estimation
• Problems with the motion boundaries:
• They do not completely cover the whole object boundary
• They are subject to false positives

Efficient initial foreground estimation
• Inside-outside map (pixel level; 0: outside, 1: inside)
• Estimates whether a pixel is inside the object, based on the point-in-polygon test
• Any ray originating inside a closed curve intersects it an odd number of times; any ray originating outside intersects it an even number of times

Efficient initial foreground estimation
• Inside-outside map (cont.)
• The motion boundary is incomplete
• Shoot 8 rays spaced by 45 degrees
• Take a majority vote for the final decision
• An optimized data structure gives a linear-time implementation when computing the map (a simplified sketch follows below)
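The sketch below illustrates both steps under simplifying assumptions: motion boundaries are taken as pixels where the optical-flow gradient magnitude is large, and the inside-outside map uses only the 4 axis-aligned rays (the paper uses 8, including diagonals, and also exploits direction differences). The flow is assumed to be an HxWx2 array, e.g. from an off-the-shelf optical flow routine, and the threshold is illustrative.

```python
# Sketch: motion boundaries from optical flow + a simplified inside-outside
# map via ray casting (parity of boundary crossings), 4 rays instead of 8.
import numpy as np

def motion_boundaries(flow, threshold=1.0):
    """flow: HxWx2 optical flow. A pixel is a motion boundary if the
    magnitude of the spatial gradient of the flow field is large."""
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    grad_mag = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    return grad_mag > threshold                     # boolean HxW boundary map

def inside_outside_map(boundary):
    """For each pixel, shoot rays left/right/up/down, count how many
    boundary pixels each ray crosses, and call the pixel 'inside' when the
    count is odd. Majority vote over the 4 rays; cumulative sums give
    linear time, standing in for the paper's optimized data structure."""
    b = boundary.astype(np.int32)
    left  = np.cumsum(b, axis=1)                    # crossings toward the left
    right = np.cumsum(b[:, ::-1], axis=1)[:, ::-1]  # toward the right
    up    = np.cumsum(b, axis=0)                    # upward
    down  = np.cumsum(b[::-1, :], axis=0)[::-1, :]  # downward
    votes = sum((c % 2) for c in (left, right, up, down))
    return votes >= 3                               # majority of 4 rays say "inside"
```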
Foreground-background labelling refinement
• A pixel-labelling problem with two labels (foreground and background)
• Oversegment each frame into superpixels
• Assign labels to superpixels: superpixel i at frame t takes a label, and the collection of all superpixels' labels over all frames is the segmentation to be estimated

Foreground-background labelling refinement
• Energy function: the output segmentation is the labelling that minimizes the energy
• Minimized with graph cuts

Foreground-background labelling refinement
• In the energy function:
• A: appearance models, one for the foreground and one for the background, estimated from the inside-outside map
• L: location model; propagates the per-frame inside-outside maps over time to build a more complete location prior
• V: spatial smoothness potential defined over edges within a frame
• W: temporal smoothness potential defined over edges across frames

Experiment evaluation
• SegTrack dataset: 6 videos
• Evaluation: number of wrongly labeled pixels, averaged over all frames

Experiment evaluation
• Considerably outperforms [6, 4, 18]
• On par with [14], which is remarkable given that the proposed approach is simpler
• [27] achieves a lower error, but is much slower
• The SegTrack dataset is saturated

Experiment evaluation
• YouTube-Objects dataset:
• 10 diverse object classes
• Ground-truth bounding boxes are provided for some frames
• A bounding box is fit to the largest connected component of the segmentation output
• Evaluation: PASCAL criterion (IoU >= 0.5)

Experiment evaluation
• Runtime:
• Intel Core i7 2.0 GHz machine
• Given optical flow and superpixels, it takes 0.5 sec/frame
• Considerably faster than the other strong baselines (typically > 100 sec/frame)

• With videos we are able to extract objects, but we can do something even crazier: revealing subtle movements of objects
• At last, let's look at "Eulerian Video Magnification for Revealing Subtle Changes in the World"

Goal
• Reveal temporal variations in videos that are difficult or impossible to see with the naked eye, and display them in an indicative manner
• Input: a standard video sequence
• Output: the video with the signal amplified to reveal the hidden information

Results
• https://www.youtube.com/watch?v=e9ASH8IBJ2U
• It's amazing, right? How does it work?

How it works?
• First-order motion example:
• I(x, t): image intensity at position x and time t
• δ(t): displacement function, so that I(x, t) = f(x + δ(t)) and I(x, 0) = f(x)
• Motion magnification: synthesize the signal Î(x, t) = f(x + (1 + α)δ(t))
• α: amplification factor

How it works?
• B(x, t): the result of applying a broadband temporal bandpass filter to I(x, t), picking out everything except f(x)
• Define B(x, t) = δ(t) ∂f(x)/∂x, from the first-order Taylor expansion I(x, t) ≈ f(x) + δ(t) ∂f(x)/∂x
• Then amplify and add back: Ĩ(x, t) = I(x, t) + αB(x, t)
• And Ĩ(x, t) ≈ f(x + (1 + α)δ(t)), i.e., the motion is magnified by (1 + α)

How it works?
• General case: δ(t) is not entirely within the passband of the temporal filter B(x, t)
• δ_k(t): the different temporal spectral components of δ(t); each is attenuated by the filter by a factor γ_k
• Then each component gets its own effective magnification: letting α_k = γ_k α, the output is approximately f(x + Σ_k (1 + α_k) δ_k(t))

How it works?
• We need (1 + α)δ(t) < λ/8, where λ is the spatial wavelength
• For high spatial frequencies (small λ) and a large amplification factor α, the first-order Taylor expansion may not hold
• So: the higher the spatial frequency, the smaller the α

How it works?
• Artifacts appear when this bound is violated

How it works?
• Multiscale analysis: a scale-varying process
• Assign different magnification factors α to different spatial frequencies

Pipeline
• Spatial decomposition: a video pyramid constructed with a separable binomial filter of size five
• Example temporal filter B(x, t): task-specific (a minimal end-to-end sketch follows below)
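A minimal end-to-end sketch of linear Eulerian magnification under simplifying assumptions: a single Gaussian-pyramid level stands in for the full pyramid, and an ideal temporal bandpass (FFT masking) stands in for the task-specific filter. The frequency band, pyramid depth, and α are illustrative values, not the paper's settings.

```python
# Sketch: linear Eulerian video magnification at one spatial scale.
# Spatially low-pass each frame (small Gaussian-pyramid level), temporally
# bandpass each pixel, amplify by alpha, upsample, and add back.
import cv2
import numpy as np

def gaussian_level(frame, levels=3):
    """Downsample a grayscale frame `levels` times with cv2.pyrDown."""
    small = frame.astype(np.float32)
    for _ in range(levels):
        small = cv2.pyrDown(small)
    return small

def temporal_ideal_bandpass(stack, fps, low_hz, high_hz):
    """Ideal temporal bandpass along the time axis via FFT masking.
    stack: T x h x w array holding one pyramid level over time."""
    freqs = np.fft.fftfreq(stack.shape[0], d=1.0 / fps)
    keep = (np.abs(freqs) >= low_hz) & (np.abs(freqs) <= high_hz)
    spectrum = np.fft.fft(stack, axis=0)
    spectrum[~keep] = 0
    return np.real(np.fft.ifft(spectrum, axis=0))

def magnify(frames, fps, low_hz=0.8, high_hz=1.5, alpha=20, levels=3):
    """frames: list of HxW grayscale frames. Returns magnified frames."""
    stack = np.stack([gaussian_level(f, levels) for f in frames])
    band = temporal_ideal_bandpass(stack, fps, low_hz, high_hz)
    out = []
    for f, b in zip(frames, band):
        h, w = f.shape[:2]
        amplified = cv2.resize((b * alpha).astype(np.float32), (w, h),
                               interpolation=cv2.INTER_LINEAR)
        out.append(np.clip(f.astype(np.float32) + amplified, 0, 255).astype(np.uint8))
    return out
```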
Thank you!