Semantic Sparse Learning in Images and Videos
Ph.D. Dissertation Defense
Qiang Zhang

Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
• Future Work

Sparse Learning in Images and Videos
• Sparse learning has seen many applications in computer vision:
– Image/video encoding/compression;
– Image/video denoising;
– Image/video super-resolution;
– Image/video classification;
– Face recognition;
– Video tracking…

Sparse Learning
• What is sparse learning: introducing a sparsity constraint into a traditional learning algorithm, e.g., least-squares regression → LASSO;
• The sparsity constraint can be induced by the ℓ1 norm or the ℓ0 "norm":
– ||x||_1 = Σ_i |x_i|
– ||x||_0 = #{i : x_i ≠ 0}
• The ℓ1 norm is convex; the ℓ0 "norm" is nonconvex.

Sparse Learning in Matrices
• Sparse learning on matrices has also drawn much attention;
• Instead of measuring the ℓ1 or ℓ0 norm of a matrix, we are more interested in its rank;
• The rank of a matrix is not convex, so a convex relaxation, the nuclear norm (trace norm), is more widely used:
– ||X||_* = Σ_i σ_i(X)
• Applications: matrix completion, robust PCA.

Sparse Learning and Statistical Learning
• Sparse learning is related to the beta process in statistical learning:
– The support and the values of the sparse vector x are controlled separately;
• It can be combined with other graphical models.
[Graphical model: (α, β) → beta process → π → Bernoulli → z; (μ, Σ) → Gaussian; the dot product of the two branches generates x, observed as y.]

Introducing Semantics to Sparse Learning
• Images and videos contain rich semantic information:
– Similar patches should have similar sparse codes;
– Face images of the same subject under different illumination conditions form a low-dimensional subspace;
– Human vision is more sensitive to patterns that differ from their neighbors;
– Skills improve over the course of a training process;
• Many existing sparse learning methods ignore such semantic information;
– If properly used, semantics could boost their performance.

Semantic Sparse Learning
• My study: how to model the semantic information of visual data and explicitly incorporate it into sparse learning;
• Identify four vision problems of great importance and broad interest to the community, covering:
– Different types of context information,
– Different data modalities.
• We show how semantic information can enhance sparse learning in those problems.

Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
• Future Work

Overview of Proposal Defense
• In the proposal defense, we presented how semantic information can enhance sparse learning in four important problems:
– Discriminative Dictionary Learning;
– Subspace Learning;
– Spatiotemporal Visual Saliency;
– Relative Hidden Markov Model.

Discriminative Dictionary Learning
• Sparse representation has shown great success in face recognition, e.g., SRC;
– A face image of a subject can be reconstructed from other images of the same subject;
• SRC uses the training images as the columns of the dictionary:
– A large dictionary is required for good performance;
– A large dictionary means high computational cost.
• A smaller dictionary is possible by learning it from the training data.

K-SVD: Learning a Dictionary for Sparse Representation
• K-SVD learns a dictionary to sparsely represent the input signal;
– Objective: Reconstruction + Sparsity;
– Successful in signal reconstruction tasks;
• We can improve the discriminative power of the dictionary by adding an extra term to K-SVD (see the sketch after this slide):
– Objective: Reconstruction + Classification + Sparsity;
• It can still be solved efficiently under the K-SVD framework;
– The dictionary is both reconstructive and discriminative.
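A minimal Python sketch of this idea, assuming the stacked-matrix formulation: dictionary learning is run on [Y; √α·H] so the learned atoms split into a reconstructive part D and a linear classifier W. scikit-learn's DictionaryLearning stands in for the K-SVD solver, and the dimensions, α, and toy data are all made up.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import orthogonal_mp

# Toy data (all sizes hypothetical): Y holds training signals as columns,
# H is the binary label matrix with one column per sample.
rng = np.random.default_rng(0)
n_feat, n_cls, n_train = 64, 5, 200
Y = rng.normal(size=(n_feat, n_train))
labels = rng.integers(0, n_cls, size=n_train)
H = np.eye(n_cls)[labels].T

alpha, n_atoms, n_nonzero = 1.0, 40, 5

# The trick: learn one dictionary for the stacked signal [Y; sqrt(alpha)*H];
# its atoms then split into [D; sqrt(alpha)*W].
Z = np.vstack([Y, np.sqrt(alpha) * H])
learner = DictionaryLearning(n_components=n_atoms,
                             transform_algorithm='omp',
                             transform_n_nonzero_coefs=n_nonzero,
                             random_state=0)
learner.fit(Z.T)                        # scikit-learn wants samples as rows
stacked = learner.components_.T         # (n_feat + n_cls) x n_atoms
D = stacked[:n_feat]                    # reconstructive part
W = stacked[n_feat:] / np.sqrt(alpha)   # linear classifier part

def classify(y):
    """Sparse-code y against D, then read the label off W @ x."""
    x = orthogonal_mp(D, y, n_nonzero_coefs=n_nonzero)
    return int(np.argmax(W @ x))

print(classify(Y[:, 0]))
```

Any column renormalization of D (with a matching rescaling of W) used by the full method is omitted here for brevity.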
Experiment
• Compare the proposed method (D-KSVD) with SRC and K-SVD;
• Use two datasets: the Extended YaleB dataset and the AR dataset;
• Compute the accuracy and the histogram of correlations between dictionary atoms.
[Table: recognition accuracy (%) of D-KSVD, SRC and K-SVD on Extended YaleB and AR; running times: 62 s, 131 s, 76 s.]

Subspace Learning
• Multiple copies of signals from the same source are very common;
• These signals should share some common components while keeping unique components;
• We may want to decompose a signal into multiple components:
– to obtain a better compression rate;
– to extract more relevant features.

Proposed Model
• We decompose the image set X = {X_ij} as:
X_ij = C_j + A_i + E_ij, ∀ X_ij ∈ X
– X_ij is the i-th image of the j-th subject;
• C_j is a matrix representing the common information of the images of subject j;
• A_i, low rank, captures the global information of the image, e.g., illumination conditions;
• E_ij, sparse, stores image-specific details, such as expression or noise, with sparse support.

Decomposing the Extended YaleB Dataset
[Figure. Left: the common components; right: the low-rank components.]

Face Recognition: Robustness over Pose Variations
• We build subspaces from the decomposed components and from the test image;
• Subspaces are compared via principal angles (see the sketch after this slide);
• The test image is assigned to the closest subspace;
• Experimental results on Multi-PIE:

#train per subject   Proposed      SRC           Volterrafaces   SUN
20                   99.98±0.03%   99.98±0.03%   99.60±0.22%     99.93±0.05%
15                   99.92±0.06%   99.45±0.03%   98.37±0.47%     99.38±0.14%
10                   99.24±0.06%   96.79±0.28%   97.63±0.28%     97.89±0.30%
5                    90.95±0.70%   86.98±0.16%   89.72±1.45%     88.29±0.02%
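A small sketch of the subspace comparison, assuming the standard SVD-based computation of principal angles; the distance function and the way bases are assembled from the decomposed components are illustrative choices, not necessarily the thesis's exact protocol.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spaces of A and B.
    The singular values of Qa^T Qb are the cosines of the angles."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def subspace_distance(A, B):
    # One common choice: the 2-norm of the sines of the principal angles.
    return np.linalg.norm(np.sin(principal_angles(A, B)))

def classify(test_basis, bases):
    """Assign the test subspace to the nearest subject subspace;
    bases[j] stacks decomposed components of subject j as columns."""
    return min(range(len(bases)),
               key=lambda j: subspace_distance(bases[j], test_basis))
```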
Spatiotemporal Saliency
• Visual saliency predicts the regions in the field of view that draw the most visual attention;
• Visual saliency has attracted much interest and has many applications;
• Spectrum-analysis-based visual saliency has been popular due to its simplicity;
• The human vision system interacts with dynamic scenes; temporal information should be modeled for visual saliency.

Proposed Method
• Assumption: the foreground object is small compared with the whole spatiotemporal volume, and the background is sparse in the frequency domain;
• FFT: the phase map captures the locations of events;
• We propose to detect salient regions with the phase map of the video (a code sketch appears at the end of this part):
S = |F^{-1}(F / |F|)|^2,  F = F(V)
• Only the Fourier transform is involved: simple but effective.

Simulation Experiment
• We verify the proposed method on designed psychological test patterns:
[Figure: test patterns Blind, Flicker, Direction, Velocity.]

Experiment on Video Datasets
• We also test on two challenging video datasets for saliency evaluation, CRCNS-ORIG and DIEM;
• To measure performance, we compute the area under curve (AUC).

CRCNS-ORIG:
Method     AUC
AWS        0.6
HouNIPS    0.5967
Bian       0.595
IO         0.595
SR         0.5867
Torralba   0.5833
Judd       0.5833
Marat      0.5833
Rarity-G   0.5767
CIOFM      0.5767
Proposed   0.6639

DIEM:
Method     AUC
AWS        0.577
Bian       0.573
Marat      0.573
Judd       0.57
AIM        0.568
HouNIPS    0.563
Torralba   0.584
GBVS       0.562
SR         0.561
CIO        0.556
Proposed   0.6896

Relative Hidden Markov Model
• Understanding human motion is an important task in many fields, e.g., surgery;
• One key problem in such applications is the analysis of skills associated with body motion;
• Many computational methods have been developed for this purpose, especially HMM-based ones;
• One practical difficulty: those methods typically require skill labels for the training data;
• Obtaining labels for the training data, currently done by senior surgeons, is very difficult and costly.

Relative Labels Instead of Absolute Labels
[Figure: three face images (a), (b), (c).]
• It is hard to say whether (b) is smiling or not;
• But it is easy to see that (b) is smiling less than (a) and more than (c);
• We use a similar idea in our motion analysis: given two videos, we only need to know which one is better.

Proposed Method
• The proposed method learns a model from a small number of pairwise rankings of the training data;
• The skill is linked to the likelihood of the input under the learned model:
max_λ Σ_i log P(x_i|λ)  s.t.  F(x_i|λ) > F(x_j|λ), ∀(i, j) ∈ I (x_i preferred over x_j)
• Skill is measured by F(x|λ), either as:
– the data likelihood P(x|λ), or
– the likelihood ratio P(x|λ_1) / P(x|λ_2).

Experiments: Skill Curve
[Figure: skill curves.]

Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Future Work

Spatiotemporal Saliency
• Previously: a spatiotemporal saliency detection method for video was proposed;
• New progress:
– More theoretical analysis of the algorithm;
– Application to abnormality detection;
– Application to spatiotemporal interest point detection.

An Explanation from the Human Vision System
[Figure.]

Explanation and Analysis
• The saliency map is related to the primary visual cortex of the human vision system:
– Orientation selective: the cells are tuned to specific frequencies and orientations;
– Lateral surround inhibition: similarly tuned cells are suppressed depending on their total response.
• This can be modeled by the phase-only transform:
– Spectral magnitude = total response of the cells tuned to a specific frequency and orientation;
– Discarding the spectral magnitude = suppression by the total response.

Abnormality Detection
• Abnormality detection identifies the volumes that are different from the others;
– Such regions would also be salient;
• We can use spatiotemporal saliency for abnormality detection in video:
– Compute the saliency map of the input video;
– Compute the average saliency score of each frame;
– Pick the frames/regions above a certain threshold;
• Fast, effective and unsupervised.

Example: Abnormal Frame
[Figure.]

Example: Abnormal Region
[Figure.]

Abnormality Detection: Results

UMN dataset:
Method              AUC
Optical Flow        0.84
Social Force        0.96
Chaotic Invariant   0.99
NN                  0.93
Sparse Recon.       0.978
Interaction Force   0.9961
Proposed            0.9378

UCSD dataset:
Method         EER
Social Force   37%
MPPCA          35%
MDT            25%
Adam           40%
Reddy          21.25%
Proposed       23%

Spatiotemporal Saliency Point Detector
• Spatiotemporal interest points (STIPs) have been widely used in action recognition;
– Following the bag-of-words pipeline from image classification;
• STIPs are usually selected from regions where the (3D) gradient is dominant;
– Simple, but without psychological justification.
• The regions that attract the most human attention should also contribute the most to the perception of the scene.

Spatiotemporal Saliency Point Detector
• An efficient yet effective STIP detector (see the sketch after the results below):
– Compute the saliency map of the input video;
– Detect the non-local maxima of the saliency map via non-local maximum suppression;
• Feed the detected STIPs to existing algorithms:
– A STIP can be described by HoG, HoF or even its local saliency value;
– Bag of words can be used directly.

Example: STIP
[Figure.]

Experiment Results

Method      Weizmann   KTH      UCF sports
Harris3D    85.60%     91.80%   78.10%
Gabor       N.A.       88.70%   77.70%
Hessian3D   N.A.       88.70%   79.30%
Dense       N.A.       86.10%   81.60%
Proposed    84.50%     88.00%   86.70%
Proposed*   95.60%     92.60%   85.60%

For "Proposed*", we extract the descriptor on the saliency map instead of on the video.
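A minimal sketch of the phase-spectrum saliency map and the frame-level abnormality score built on it. Only the transform S = |F^{-1}(F/|F|)|^2 comes from the slides above; the smoothing scale, the epsilon, and the threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def phase_saliency(video, eps=1e-8, sigma=(1, 3, 3)):
    """Phase-spectrum saliency of a video volume of shape (T, H, W):
    S = |F^{-1}(F / |F|)|^2, followed by Gaussian smoothing."""
    F = np.fft.fftn(video)
    S = np.abs(np.fft.ifftn(F / (np.abs(F) + eps))) ** 2  # drop the magnitude
    return gaussian_filter(S, sigma=sigma)

def abnormal_frames(video, thresh):
    """Frames whose average saliency exceeds a user-chosen threshold."""
    S = phase_saliency(video)
    frame_score = S.reshape(S.shape[0], -1).mean(axis=1)
    return np.nonzero(frame_score > thresh)[0]
```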
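And a sketch of the saliency-based STIP detector, consuming the saliency volume from the previous sketch; non-local maximum suppression is done here with a sliding maximum filter, and the window size and response floor are illustrative.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_stips(S, window=(5, 9, 9), min_response=0.0):
    """Spatiotemporal interest points as non-local maxima of the saliency
    volume S (shape T x H x W): keep voxels that dominate their window."""
    is_max = S == maximum_filter(S, size=window)
    is_max &= S > min_response            # suppress weak responses
    t, y, x = np.nonzero(is_max)
    return np.stack([t, y, x], axis=1)    # one (t, y, x) triple per STIP
```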
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Future Work

Relative Hidden Markov Model
• Previously: a relative HMM for modeling motion skills in videos was proposed;
• New progress:
– An improved algorithm;
– Theoretical analysis via a comparison to latent SVM;
– Application to emotion recognition from speech signals.

Proposed Method
• To make the algorithm robust to outliers, we introduce slack variables ξ:
min_{λ∈Ω} Σ_{x_i∈X} −log P(x_i|λ) + γ Σ_{(i,j)∈I} ξ_ij
s.t. F(x_i|λ) − F(x_j|λ) + ξ_ij ≥ δ, ξ_ij ≥ 0
• The objective function requires that:
– the model fits the data well;
– the pairwise comparisons are satisfied.

Proposed Method Cont'd
• P(x|λ) can be approximated by P(x, z*|λ);
– Merhav and Ephraim (1991);
– The approximation is better for longer sequences.
• For multinomial distributions: log P(x, z*|λ) = (log λ)^T h(x, z*);
– z* is the state path given by the Viterbi algorithm;
• The constraint λ ∈ Ω can be written (in terms of θ = log λ) as: Ce^θ = 1, θ ≤ 0.

Update the Model
• To update the model, we solve:
min_{θ,ξ} c^T θ + γ 1^T ξ
s.t. Aθ + ξ ≥ b, Ce^θ = 1, θ ≤ 0, ξ ≥ 0
– Here c and A collect the sufficient-statistics counts h(x, z*) from the Viterbi paths;
• This is a nonconvex nonlinear problem;
– Previous work uses a primal-dual interior point method.
• We introduce λ = e^θ and apply augmented Lagrange multipliers:
min_{θ,ξ,λ} c^T θ + γ 1^T ξ + <y, θ − log λ> + (μ/2) ||θ − log λ||²_2
s.t. Aθ + ξ ≥ b, Cλ = 1, θ ≤ 0, ξ ≥ 0, λ ≥ 0

Update the Model Cont'd
• We can use block coordinate descent to solve this problem.
• Sub-problem 1 (QP):
min_{θ,ξ} c^T θ + γ 1^T ξ + <y, θ − log λ> + (μ/2) ||θ − log λ||²_2
s.t. Aθ + ξ ≥ b, θ ≤ 0, ξ ≥ 0
• Sub-problem 2 (nonlinear program):
min_λ <y, θ − log λ> + (μ/2) ||θ − log λ||²_2
s.t. Cλ = 1, λ ≥ 0

Sub-problem 2
min_λ <y, θ − log λ> + (μ/2) ||θ − log λ||²_2, s.t. Cλ = 1, λ ≥ 0
• Sub-problem 2 is not convex;
• C is well structured, so the problem can be divided into several much smaller and easier problems, one per probability block λ^k:
min_{λ^k} <y^k, θ^k − log λ^k> + (μ/2) ||θ^k − log λ^k||²_2
s.t. 1^T λ^k = 1, 0 ≤ λ^k ≤ 1
• Solved via a primal-dual interior point method:
– Easily computable gradient and (diagonal) Hessian;
– The starting point can be computed in closed form.

Algorithm
• Initialize λ with an ordinary HMM;
• While not converged:
– Compute the optimal state path z for each x;
– Solve sub-problem 2;
– Solve sub-problem 1;
– Update y ← y + μ(θ − log λ), μ ← ρμ;
– Check convergence.

Comparing the Cost of Optimization Methods
[Figure.]

Relationship to Latent SVM
• Latent SVM aims to learn a predictor
f_w(x) = max_z w^T Ψ(x, z), with z the latent variable;
• Its objective function can be written as:
min_w (1/2)||w||²_2 + C Σ_i ξ_i, s.t. y_i f_w(x_i) + ξ_i ≥ 1, ξ_i ≥ 0
• The proposed problem fits into this form with the state-path pair as the latent variable:
min_w (1/2)||w||²_2 + C Σ_i ξ_i
s.t. y_i max_{z_i^L, z_i^R} w^T (h(x_i^L, z_i^L) − h(x_i^R, z_i^R)) + ξ_i ≥ 1, ξ_i ≥ 0

Relationship to Latent SVM Cont'd
• Latent SVM cannot guarantee that w is a valid HMM:
– Max-margin is the only requirement;
• The two state paths z_i^L, z_i^R are optimized jointly; no efficient algorithm is known;
– The Viterbi algorithm is used for the proposed method.
• The w learned by latent SVM only works on a pair of data, so it is only usable for comparisons;
– The proposed method can assign a score to a single datum.
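To make the scoring side concrete, here is a small numpy sketch of the Viterbi approximation log P(x, z*|λ) used as the skill score F(x|λ); the parameter layout (π, A, B in the log domain, discrete observations) is an assumption for illustration.

```python
import numpy as np

def viterbi_loglik(obs, log_pi, log_A, log_B):
    """Joint log-likelihood log P(x, z* | lambda) of a discrete observation
    sequence and its best state path, used as the skill score F(x|lambda).

    obs    : length-T int array of symbols
    log_pi : (S,)   log initial-state probabilities
    log_A  : (S, S) log transition probabilities
    log_B  : (S, V) log emission probabilities (multinomial)
    """
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # best predecessor for each state, then emit the current symbol
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(delta.max())

# Likelihood-ratio variant of the score, given two learned models:
# F(x) = viterbi_loglik(x, *params_1) - viterbi_loglik(x, *params_2)
```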
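And a sketch of one block of sub-problem 2, with scipy's SLSQP standing in for the primal-dual interior-point solver described above; the normalized-exponential starting point is a plausible closed-form initializer, not necessarily the one the slides refer to.

```python
import numpy as np
from scipy.optimize import minimize

def solve_block(theta_k, y_k, mu):
    """One block of sub-problem 2:
    min <y, theta - log lam> + (mu/2) ||theta - log lam||^2
    s.t. 1^T lam = 1, 0 <= lam <= 1."""
    def objective(lam):
        d = theta_k - np.log(lam)
        return y_k @ d + 0.5 * mu * (d @ d)
    lam0 = np.exp(theta_k)
    lam0 /= lam0.sum()                      # feasible closed-form start
    res = minimize(objective, lam0, method='SLSQP',
                   bounds=[(1e-9, 1.0)] * len(theta_k),  # keep log() defined
                   constraints=[{'type': 'eq',
                                 'fun': lambda lam: lam.sum() - 1.0}])
    return res.x

# Example: one row of a 4-state transition matrix.
print(solve_block(np.log(np.full(4, 0.25)), np.zeros(4), mu=1.0))
```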
Experiment: Emotion Recognition
• Recognizing the emotional state of a speaker is very important;
– e.g., for human-computer interaction.
• Existing methods try to classify the audio into predefined labels or levels:
– Labeled training data is required;
• We can leverage the power of pairwise comparisons via the proposed method.

[Pipeline: Training data → Extract MFCC → Bag of words → Pairwise ranks → RHMM models → Emotion recognition with RHMM. 991 audio clips, 6 emotions at 7 levels each; half used for training, with 1000 randomly selected pairs as input. A sketch of the feature step follows the results table.]

Experiment Results

Dimension      Improved   Baseline   HMM
Pleasantness   77.30%     57.96%     75.05%
Arousal        86.95%     55.74%     69.55%
Dominance      87.95%     63.04%     77.32%
Credibility    76.68%     55.11%     71.74%
Interest       81.90%     62.56%     78.07%
Positivity     74.99%     67.84%     70.36%
Average        81.28%     53.14%     73.72%
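A sketch of the MFCC bag-of-words front end, assuming frame-level MFCCs quantized against a k-means codebook to produce the discrete sequences the RHMM consumes; the librosa/scikit-learn calls, the codebook size, and n_mfcc are illustrative choices rather than the thesis's exact settings.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_frames(path, n_mfcc=13):
    """Frame-level MFCCs of one clip, shape (n_frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def build_codebook(train_paths, n_words=64):
    """Learn the bag-of-words codebook on frames pooled over training clips."""
    pooled = np.vstack([mfcc_frames(p) for p in train_paths])
    return KMeans(n_clusters=n_words, random_state=0).fit(pooled)

def symbol_sequence(path, codebook):
    """Quantize one clip into the discrete symbol sequence fed to the RHMM."""
    return codebook.predict(mfcc_frames(path))
```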
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Conclusion and Future Work

Future Work
• Subspace learning via decomposition:
– Allow each image to be a weighted combination of several imaging conditions;
– Automatically figure out the weights and conditions.
• Relative hidden Markov models:
– Theoretical analysis of the learned model;
– Allowing more types of observation models;
– Modeling multiple relative attributes jointly via a multi-task learning framework.

Selected Publications
• Lin Chen, Qiang Zhang and Baoxin Li. Predicting Multiple Attributes via Relative Multi-task Learning. IEEE Computer Vision and Pattern Recognition (CVPR) 2014, Columbus, OH.
• Qiang Zhang and Baoxin Li. Relative Hidden Markov Models for Evaluating Motion Skills. IEEE Computer Vision and Pattern Recognition (CVPR) 2013, Portland, OR.
• Lin Chen, Qiongjie Tian, Qiang Zhang and Baoxin Li. Learning Skill-Defining Latent Space in Video-Based Analysis of Surgical Expertise: A Multi-Stream Fusion Approach. NextMed/MMVR20, San Diego, CA, 2013.
• Qiongjie Tian, Lin Chen, Qiang Zhang and Baoxin Li. Enhancing Fundamentals of Laparoscopic Surgery Trainer Box via Designing a Multi-Sensor Feedback System. NextMed/MMVR20, San Diego, CA, 2013.
• Qiang Zhang, Lin Chen, Qiongjie Tian and Baoxin Li. Video-Based Analysis of Motion Skills in Simulation-Based Surgical Training. SPIE Multimedia Content Access: Algorithms and Systems VII, San Francisco, CA, 2013.
• Qiang Zhang and Baoxin Li. Video-Based Motion Expertise Analysis in Simulation-Based Surgical Training Using Hierarchical Dirichlet Process Hidden Markov Model. In Proceedings of the International ACM Workshop on Medical Multimedia Analysis and Retrieval (MMAR '11), ACM, New York, NY, 2011, 19-24.
• Qiang Zhang and Baoxin Li. Towards Computational Understanding of Skill Levels in Simulation-Based Surgical Training via Automatic Video Analysis. International Symposium on Visual Computing (ISVC) 2010, Las Vegas, NV.
• Qiang Zhang and Baoxin Li. Mining Discriminative Components with Low-Rank and Sparsity Constraints for Face Recognition. The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2012).
• Qiang Zhang and Baoxin Li. Joint Sparsity Model with Matrix Completion for an Ensemble of Images. IEEE International Conference on Image Processing (ICIP) 2010, Hong Kong, China.

Selected Publications (cont'd)
• Qiang Zhang and Baoxin Li. Discriminative K-SVD for Dictionary Learning in Face Recognition. IEEE Computer Vision and Pattern Recognition (CVPR) 2010, San Francisco, CA.
• Qiang Zhang and Baoxin Li. Max-Margin Multi-Attribute Learning with Low Rank Constraint. IEEE Transactions on Image Processing.
• Yilin Wang, Qiang Zhang and Baoxin Li. Semantic Saliency Weighted SSIM for Video Quality Assessment. VPQM 2014, Chandler, AZ.
• Qiang Zhang, Chang Yuan, Xinyu Xu, Peter Van Beek, Hae Jong Seo and Baoxin Li. Efficient Defect Detection with Sign Information of Walsh Hadamard Transform. IS&T/SPIE Image Processing: Machine Vision Applications VI, San Francisco, CA, 2013.
• Jin Zhou, Qiang Zhang, Baoxin Li and Ananya Das. Synthesis of Stereoscopic Views from Monocular Endoscopic Videos. IEEE Computer Vision and Pattern Recognition (CVPR) 2010 Workshop on Mathematical Methods in Biomedical Image Analysis, San Francisco, CA.
• Qiang Zhang, Pengfei Xu, Wen Li, Zhongke Wu and Mingquan Zhou. Efficient Edge Matching Using Improved Hierarchical Chamfer Matching. IEEE International Symposium on Circuits and Systems (ISCAS) 2009, Taipei, Taiwan.
• Qiang Zhang, Hua Li, Yan Zhao and Xinlu Liu. Exploration of Event-Evoked Oscillatory Activities during a Cognitive Task. The 4th International Conference on Natural Computation and the 5th International Conference on Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) 2008, Jinan, China.
• Qiang Zhang and Baoxin Li. Relative Hidden Markov Models for Video-Based Evaluation of Motion Skills in Surgical Training. IEEE Transactions on Pattern Analysis and Machine Intelligence [under review].

Acknowledgements
• Thesis Committee
– Baoxin Li, Pavan Turaga, Yalin Wang and Jieping Ye
• Visual Presentation & Processing Group
– Xinyu Xu, Jin Zhou, Zheshen Wang, Xiaolong Zhang, Pradeep Nagesh, Naveen Kulkarni, Devi Archana Paladugu, Nan Li, Peng Zhang, Lin Chen, Qiongjie Tian, Yilin Wang, Xu Zhou, Parag Chandakkar, Ragav Venkatesan, Hima Bindu Maguluri, Collin Walker

Thank You! Questions?