Semantic Sparse Learning in Images and Videos
Ph.D. Dissertation Defense
Qiang Zhang
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
• Future Work
Sparse Learning in Image
and Video
• Sparse learning has seen a lot of applications in computer vision:
– Image/video encoding/compression;
– Image/video denoising;
– Image/video super-resolution;
– Image/video classification;
– Face recognition;
– Video tracking…
Sparse Learning
• What is sparse learning: introducing a sparsity constraint into traditional learning algorithms, e.g., least squares regression → LASSO;
• The sparsity constraint can be induced by the $\ell_1$ norm or the $\ell_0$ “norm”:
– $\|x\|_1 = \sum_i |x_i|$
– $\|x\|_0 = \sum_{i: x_i \neq 0} 1$
• The $\ell_1$ norm is convex while the $\ell_0$ “norm” is nonconvex.
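To make the least-squares-vs-LASSO contrast above concrete, here is a minimal sketch using scikit-learn; the synthetic data, dimensions, and penalty weight are illustrative choices, not from the dissertation. The $\ell_1$ penalty drives most coefficients exactly to zero, while ordinary least squares keeps them all nonzero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_w = np.zeros(50)
true_w[:5] = rng.normal(size=5)                 # only 5 active features
y = X @ true_w + 0.01 * rng.normal(size=100)

ols = LinearRegression().fit(X, y)              # least squares: dense coefficients
lasso = Lasso(alpha=0.05).fit(X, y)             # + l1 penalty: sparse coefficients

print((np.abs(ols.coef_) > 1e-6).sum())         # ~50 nonzero coefficients
print((np.abs(lasso.coef_) > 1e-6).sum())       # close to 5 nonzero coefficients
```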
Sparse Learning in Matrix
• Sparse learning for matrices has also drawn a lot of attention;
• Instead of measuring the $\ell_1$ or $\ell_0$ norm of a matrix, we are more interested in its rank;
• The rank of a matrix is not convex, so a convex relaxation, the nuclear norm (trace norm), is more widely used:
– $\|X\|_* = \sum_i \sigma_i$
• Applications: matrix completion, robust PCA.
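As a small illustration of the nuclear norm and one of its applications, the sketch below computes $\|X\|_*$ from the singular values and runs a basic singular-value-thresholding style matrix completion; this is a generic textbook scheme, not necessarily the solver used in the dissertation, and the parameters are arbitrary.

```python
import numpy as np

def nuclear_norm(X):
    """||X||_* : the sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def complete_matrix(M, mask, tau=5.0, n_iter=200):
    """Tiny matrix-completion sketch: iterative singular value thresholding
    while re-imposing the observed entries (mask == True).  tau and n_iter
    are illustrative, not tuned values."""
    X = np.zeros_like(M)
    for _ in range(n_iter):
        X[mask] = M[mask]                             # enforce observed entries
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt       # shrink singular values
    return X
```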
Sparse Learning with Statistical Learning
• Sparse learning is related to the beta process in statistical learning:
– The sparsity pattern and the values of the sparse vector $x$ are controlled separately;
• It can be combined with other graphical models.
[Graphical model: $(\alpha, \beta)$ → Beta process → $\pi$ → Bernoulli → $z$; $(\mu, \Sigma)$ → Gaussian distribution → $y$; $z$ and $y$ combine via a dot product to produce $x$.]
Introduce Semantics to
Sparse Learning
• Image and video contain rich semantic information:
– Similar patches should have similar sparse codes;
– Face images of the same subject under different illumination conditions form a low-dimensional subspace;
– Human vision is more sensitive to patterns which are
different from their neighbors;
– Skills improve over the course of a training process;
• Many existing sparse learning methods ignore such semantic information;
– If properly used, semantics could boost their performance.
Semantic Sparse Learning
• My study: how to model the semantic information of visual data and explicitly incorporate it into sparse learning;
• Identify four vision problems of great importance and broad interest to the community;
– Different types of context information,
– Different data modalities.
• We show how semantic information can
enhance sparse learning in those problems.
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
• Future Work
Overview of Proposal
Defense
• In the proposal defense, we presented, for four important problems, how semantic information can enhance sparse learning:
– Discriminative Dictionary Learning;
– Subspace Learning;
– Spatiotemporal Visual Saliency;
– Relative Hidden Markov Model;
Discriminative Dictionary
Learning
• Sparse representation has shown great
success in face recognition, e.g., SRC;
– A face image of a subject can be reconstructed from other images of the same subject;
• SRC: images as columns of the dictionary,
– A large dictionary is required for good performance;
– A large dictionary means high computational cost.
• A smaller dictionary is possible by learning it
from the training data.
K-SVD: Learning Dictionary
for Sparse Representation
• K-SVD learns a dictionary to sparsely represent the input signal;
– Objective: Reconstruction + Sparsity;
– Successful in signal reconstruction tasks;
• We can improve the discriminative power of the dictionary by adding an extra term to K-SVD:
– Objective: Reconstruction+Classification+Sparsity;
• It can still be solved under K-SVD framework
efficiently;
– Dictionary is both reconstructive and discriminative.
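A rough sketch of the stacking idea behind the extra classification term: the signals Y and the label matrix H are stacked and a joint dictionary [D; W] is learned, so the linear classifier W comes out of the same K-SVD-style optimization. Since no reference K-SVD code is assumed here, scikit-learn's DictionaryLearning stands in for K-SVD, and the function name d_ksvd and its parameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning  # stand-in for K-SVD

def d_ksvd(Y, H, n_atoms=128, alpha_cls=1.0, sparsity=0.1):
    """Sketch of discriminative dictionary learning via stacking.

    Y: (d, n) training signals as columns; H: (c, n) one-hot label matrix.
    The classification term ||H - WX||^2 is folded into the reconstruction
    objective by stacking [Y; sqrt(alpha)*H] and learning a joint dictionary
    [D; sqrt(alpha)*W] with a standard dictionary learner.
    """
    stacked = np.vstack([Y, np.sqrt(alpha_cls) * H])            # (d + c, n)
    learner = DictionaryLearning(n_components=n_atoms, alpha=sparsity,
                                 transform_algorithm='omp', max_iter=20)
    learner.fit(stacked.T)                                       # samples as rows
    joint = learner.components_.T                                # (d + c, n_atoms)
    D = joint[:Y.shape[0]]                                       # reconstructive part
    W = joint[Y.shape[0]:] / np.sqrt(alpha_cls)                  # classifier part
    norms = np.linalg.norm(D, axis=0) + 1e-12                    # renormalize atoms
    return D / norms, W / norms
```

At test time, one would sparse-code a test face over D (e.g., with OMP) and predict the class as the argmax of W applied to that code.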
Experiment
• Compare the proposed method (D-KSVD) with SRC and K-SVD;
• Use two datasets, the Extended YaleB dataset and the AR dataset;
• Compute the accuracy and the histogram of correlations between dictionary atoms.
[Table: recognition accuracy (%) of SRC, K-SVD and the proposed D-KSVD on the Extended YaleB and AR datasets, and running time in seconds.]
Subspace Learning
• Multiple copies of signals from the same
source are very common;
• Those signals should have some common
components and unique components;
• We may want to decompose the signal into
multiple components:
– Obtain a better compression rate;
– Extract more relevant features.
Proposed Model
• We decompose the images $\mathbb{X} = \{X_{ij}\}_{i,j=1}^{N,M}$ as:
$X_{ij} = C_i + A_j + E_{ij}, \quad \forall X_{ij} \in \mathbb{X}$
– $X_{ij}$ is the $j$-th image of the $i$-th subject;
• $C_i$ is a matrix representing the common information of the images for subject $i$;
• $A_j$, low rank, captures the global information of the image, e.g., illumination conditions;
• $E_{ij}$, sparse, stores image-specific details, such as expression conditions or noise, with sparse support.
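A minimal alternating sketch of this decomposition, with singular value thresholding for the low-rank $A_j$ and soft-thresholding for the sparse $E_{ij}$; this is a simplified illustrative solver (the dissertation's optimization may differ), and tau/lam are arbitrary penalty levels.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, lam):
    """Entrywise soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

def decompose(X, n_iter=50, tau=1.0, lam=0.05):
    """X: array of shape (N, M, h, w) with X[i, j] the j-th image of subject i.
    Alternating updates for X_ij = C_i + A_j + E_ij (illustrative only)."""
    N, M, h, w = X.shape
    C = np.zeros((N, h, w)); A = np.zeros((M, h, w)); E = np.zeros_like(X)
    for _ in range(n_iter):
        for i in range(N):                                   # common component per subject
            C[i] = (X[i] - A - E[i]).mean(axis=0)
        for j in range(M):                                   # low-rank global component
            A[j] = svt(X[:, j].mean(axis=0) - C.mean(axis=0) - E[:, j].mean(axis=0), tau)
        E = soft(X - C[:, None] - A[None, :], lam)           # sparse image-specific part
    return C, A, E
```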
Decomposing Extended YaleB Dataset
[Figure: left, the common components; right, the low-rank components.]
Face Recognition: Robustness over Pose Variations
• We build subspaces from the decomposed components and the test image;
• Subspaces are compared via principal angles;
• The test image is assigned to the closest subspace;
• Experimental results on Multi-PIE:
#train per subject   20            15            10            5
Proposed             99.98±0.03%   99.92±0.06%   99.24±0.06%   90.95±0.70%
SRC                  99.98±0.03%   99.45±0.03%   96.79±0.28%   86.98±0.16%
Volterrafaces        99.60±0.22%   98.37±0.47%   97.63±0.28%   89.72±1.45%
SUN                  99.93±0.05%   99.38±0.14%   97.89±0.30%   88.29±0.02%
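A small sketch of the nearest-subspace classification described above, using SciPy's principal-angle routine; the function name, the use of the largest principal angle as the score, and the input conventions are my own illustrative choices.

```python
import numpy as np
from scipy.linalg import orth, subspace_angles

def classify_by_subspace(test_columns, subject_bases):
    """Assign a test image to the subject whose subspace is closest.

    test_columns: (d, k) matrix whose columns span the test-image subspace.
    subject_bases: dict mapping subject id -> (d, r) matrix spanning that
    subject's subspace (e.g. built from the recovered C_i + A_j components).
    Closeness is measured here by the largest principal angle.
    """
    Q_test = orth(test_columns)
    best, best_angle = None, np.inf
    for subject, basis in subject_bases.items():
        angles = subspace_angles(orth(basis), Q_test)   # principal angles
        if angles.max() < best_angle:
            best, best_angle = subject, angles.max()
    return best
```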
Spatiotemporal Saliency
• Visual saliency predicts regions in the field of
view that draw the most visual attention;
• Visual saliency has attracted a lot of interest
and has a lot of applications;
• Spectrum-analysis-based visual saliency has been popular due to its simplicity;
• The human vision system interacts with dynamic scenes—temporal information should be modeled for visual saliency.
Proposed Method
• Assumption: the foreground object is small compared to the whole spatiotemporal volume, and the background is sparse in the frequency domain;
• FFT: the phase map captures the location of events;
• We propose to detect salient regions with the phase map of the video as:
$Z = \left| \mathcal{F}^{-1}\!\left( \frac{Y}{|Y|} \right) \right|^2, \quad Y = \mathcal{F}(X)$
• Only Fourier transform is involved—simple but
effective.
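The formula above translates almost directly into code. Below is a minimal sketch with NumPy's 3-D FFT; the Gaussian smoothing at the end is a common post-processing step added here for illustration, not necessarily part of the author's method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spatiotemporal_saliency(video, smooth_sigma=(1, 3, 3)):
    """Phase-only spectrum saliency: Z = |F^{-1}(Y / |Y|)|^2 with Y = F(X).

    video: float array of shape (frames, height, width).
    smooth_sigma controls the optional Gaussian smoothing of the map."""
    Y = np.fft.fftn(video)                       # 3-D FFT over (t, y, x)
    phase_only = Y / (np.abs(Y) + 1e-12)         # discard the spectral magnitude
    Z = np.abs(np.fft.ifftn(phase_only)) ** 2    # squared inverse transform
    return gaussian_filter(Z, smooth_sigma)      # optional smoothing
```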
Simulation Experiment
• We verify the proposed method on designed psychophysical patterns:
[Figure: test patterns — blind, flicker, direction, velocity.]
Experiment on Video
Datasets
• We also test on two challenging video datasets for saliency evaluation, CRCNS-ORIG and DIEM.
• To measure the performance, we compute the area under the curve (AUC).
CRCNS-ORIG (Method / AUC):
AWS        0.6
HouNIPS    0.5967
Bian       0.595
IO         0.595
SR         0.5867
Torralba   0.5833
Judd       0.5833
Marat      0.5833
Rarity-G   0.5767
CIOFM      0.5767
Proposed   0.6639

DIEM (Method / AUC):
AWS        0.577
Bian       0.573
Marat      0.573
Judd       0.57
AIM        0.568
HouNIPS    0.563
Torralba   0.584
GBVS       0.562
SR         0.561
CIO        0.556
Proposed   0.6896
Relative Hidden Markov
Model
• Understanding human motion is an important task in
many fields, e.g., surgery;
• One key problem in such applications is the analysis
of skills associated with body motion;
• Many computational methods have been developed for this purpose, especially HMM-based ones;
• One practical difficulty: those methods typically
require the skill labels for the training data;
• Obtaining labels of training data, currently done by
senior surgeons, is very difficult and costly.
Relative Label Instead of
Absolute Label
• It is hard to say whether
(b) is smiling or not.
• But it is easy to find (b)
is less smiling than (a)
but more than (c).
• We use a similar idea in our motion analysis: given two videos, we only need to know which one is better.
Proposed Method
• The proposed method learns a model from a small number of pairwise rankings of the training data;
• The skill is linked to the likelihood of the inputs under the learned model:
$\theta: \max_{\theta} \prod_{X^i} p(X^i \mid \theta) \quad \text{s.t. } F(X^i \mid \theta) > F(X^j \mid \theta)\ \forall (i,j) \in \mathbb{E}$
– $(i,j) \in \mathbb{E}$ means $X^i$ is preferred over $X^j$.
• Skill is measured by $F(X^i \mid \theta)$:
– Data likelihood $p(X^i \mid \theta)$;
– Likelihood ratio $p(X^i \mid \theta_1) / p(X^i \mid \theta_2)$.
Experiments: Skill Curve
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Future Work
Spatiotemporal Saliency
• Previously: spatiotemporal saliency detection for video was proposed;
• New progress:
– More theoretical analysis of the algorithm;
– Application on abnormality detection;
– Application on spatiotemporal interest point
detection.
An Explanation from the Human Vision System
Explanation and Analysis
• The saliency map is related to the primary visual cortex of the human vision system:
– Orientation selective: the cells are tuned to a specific frequency and orientation;
– Lateral surround inhibition: similarly tuned cells are suppressed depending on the total response.
• This can be modeled by the phase-only transform:
– Spectral magnitude = total response of cells tuned to a specific frequency and orientation;
– Discarding the spectral magnitude = suppression by the total response.
Abnormality Detection
• Abnormality detection identifies the volumes which are different from the others;
– Such regions would also be salient;
• We can use the spatiotemporal saliency for abnormality detection in video:
– Compute the saliency map for the input video;
– Compute the average saliency score for each frame;
– Pick the frames/regions above a certain threshold;
• Fast, effective and unsupervised.
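A minimal sketch of these three steps, reusing the spatiotemporal_saliency function from the earlier sketch; the default threshold rule (mean plus two standard deviations of the per-frame scores) is my own illustrative choice.

```python
import numpy as np

def abnormal_frames(video, threshold=None):
    """Flag abnormal frames by thresholding the mean saliency per frame.
    Requires the spatiotemporal_saliency sketch defined earlier."""
    S = spatiotemporal_saliency(video)          # (frames, h, w) saliency volume
    frame_scores = S.mean(axis=(1, 2))          # average saliency per frame
    if threshold is None:                       # illustrative default threshold
        threshold = frame_scores.mean() + 2 * frame_scores.std()
    return np.flatnonzero(frame_scores > threshold), frame_scores
```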
Example: Abnormal Frames
Example: Abnormal Regions
Abnormality Detection:
Results
UMN dataset (Method / AUC):
Optical Flow        0.84
Social Force        0.96
Chaotic Invariant   0.99
NN                  0.93
Sparse Recon.       0.978
Interaction Force   0.9961
Proposed            0.9378

UCSD dataset (Method / EER):
Social Force   37%
MPPCA          35%
MDT            25%
Adam           40%
Reddy          21.25%
Proposed       23%
Spatiotemporal Saliency
Point Detector
• Spatiotemporal interest points (STIPs) have been widely used in action recognition;
– Following the bag-of-words framework from image classification;
• STIPs are usually selected from regions where the (3D) gradient is dominant;
– Simple, but with no psychological justification.
• The regions which attract human attention most would also contribute most to people's perception of the scene.
Spatiotemporal Saliency
Point Detector
• An efficient yet effective STIP detector:
– Compute the saliency map for the input video;
– Suppress non-maximal responses;
– Detect the local maxima in the saliency map;
• Feed the detected STIPs to existing algorithms:
– STIPs can be described by HoG, HoF, or even their local saliency values;
– Bag of words can be directly used.
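A compact sketch of this detector, again reusing the earlier spatiotemporal_saliency sketch; the non-maximal suppression window and the number of kept points are illustrative parameters.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_stips(video, n_points=200, nms_size=(5, 9, 9)):
    """Saliency-based spatiotemporal interest points: keep the strongest local
    maxima of the saliency volume after 3-D non-maximal suppression."""
    S = spatiotemporal_saliency(video)                       # earlier sketch
    local_max = (S == maximum_filter(S, size=nms_size))      # 3-D non-max suppression
    t, y, x = np.nonzero(local_max)
    order = np.argsort(S[t, y, x])[::-1][:n_points]          # strongest responses first
    return np.stack([t[order], y[order], x[order]], axis=1)  # (n_points, 3) as (t, y, x)
```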
Example: STIP
Experiment Results
Method      Weizmann   KTH      UCF sports
Harris3D    85.60%     91.80%   78.10%
Gabor       N.A.       88.70%   77.70%
Hessian3D   N.A.       88.70%   79.30%
Dense       N.A.       86.10%   81.60%
Proposed    84.50%     88.00%   86.70%
Proposed*   95.60%     92.60%   85.60%

For “Proposed*”, we extract the descriptor on the saliency map instead of on the video.
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Future Work
Relative Hidden Markov
Model
• Previously: the relative HMM for modeling motion skills in videos was proposed;
• New progress:
– An improved algorithm;
– Theoretical analysis via comparison to latent SVM;
– Application on emotion recognition from
speech signal.
Proposed Method
• To make the algorithm robust to outliers, we introduce slack variables $\epsilon$:
$\theta: \min_{\theta \in \Omega} \sum_{X^i \in \mathbb{X}} -\log p(X^i \mid \theta) + \gamma \sum_{(i,j) \in \mathbb{E}} \epsilon_{ij}$
$\text{s.t. } F(X^i \mid \theta) - F(X^j \mid \theta) + \epsilon_{ij} \ge \rho,\ \epsilon_{ij} \ge 0$
• The objective function requires:
– The model fits data well;
– The pairwise comparison is satisfied.
Proposed Method Cont’d
• $p(x \mid \theta)$ can be approximated by $p(x, z^* \mid \theta)$;
– Merhav and Ephraim (1991);
– Better approximation for longer sequences.
• For a multinomial distribution: $\log p(x, z^* \mid \theta) = (\log\theta)^T h(x, z^*)$;
– $z^*$: the optimal state path, found via the Viterbi algorithm.
• The constraint $\theta \in \Omega$ can be written as:
$C\theta = 1,\ \theta \ge 0$
Update the Model
• To update the model, we solve:
$\psi, \epsilon: \min_{\psi,\epsilon} f^T\psi + \gamma^T\epsilon$
$\text{s.t. } A\psi + \epsilon \le \rho,\ Ce^{\psi} = 1,\ \psi \le 0,\ \epsilon \ge 0$
• This is a nonconvex nonlinear problem;
– Previous work uses a primal-dual interior point method.
• We introduce $\theta = e^{\psi}$ and apply augmented Lagrange multipliers:
$\min_{\theta,\psi,\epsilon} f^T\psi + \gamma^T\epsilon + \langle \lambda, \psi - \log\theta \rangle + \frac{\mu}{2}\|\psi - \log\theta\|_2^2$
$\text{s.t. } A\psi + \epsilon \le \rho,\ C\theta = 1,\ \psi \le 0,\ \theta \ge 0,\ \epsilon \ge 0$
Update the Model Cont’d
• We can use block coordinate descent to solve this problem:
• Sub-problem 1 (QP):
$\min_{\psi,\epsilon} f^T\psi + \gamma^T\epsilon + \langle \lambda, \psi - \log\theta \rangle + \frac{\mu}{2}\|\psi - \log\theta\|_2^2$
$\text{s.t. } A\psi + \epsilon \le \rho,\ \psi \le 0,\ \epsilon \ge 0$
• Sub-problem 2 (nonlinear prog.):
$\min_{\theta} \langle \lambda, \psi - \log\theta \rangle + \frac{\mu}{2}\|\psi - \log\theta\|_2^2$
$\text{s.t. } C\theta = 1,\ \theta \ge 0$
Sub-problem 2
πœ‡
min < πœ†, πœ“ − log πœƒ > + πœ“ − log πœƒ
πœƒ
2
• Sub-problem 2 is not convex; πΆπœƒ = 1, πœƒ ≥ 0
• C is well-structured—it can be divided into
several much smaller and easier problems.
πœƒ π‘˜ : min
πœƒπ‘˜
<
πœ‡ π‘˜
−
> + πœ“ − log πœƒ π‘˜
2
1𝑇 πœƒ π‘˜ = 1,0 ≤ πœƒ π‘˜ ≤ 1
πœ†π‘˜ , πœ“ π‘˜
log πœƒ π‘˜
2
2
• Solved via primal dual interior point method:
– Easy computable gradient, hessian (diagnoal);
– Starting point can be computed in a closed form.
2
2
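For concreteness, here is a sketch of one per-row block of Sub-problem 2. It uses SciPy's SLSQP solver rather than the primal-dual interior point method described on the slide, and the function name and defaults are illustrative; the gradient follows directly from the objective.

```python
import numpy as np
from scipy.optimize import minimize

def solve_subproblem2_block(psi_k, lam_k, mu):
    """One block of Sub-problem 2: minimize <lam_k, psi_k - log theta> +
    (mu/2)||psi_k - log theta||^2 subject to sum(theta) = 1, 0 <= theta <= 1."""
    d = len(psi_k)

    def obj(theta):
        r = psi_k - np.log(theta)
        return lam_k @ r + 0.5 * mu * (r @ r)

    def grad(theta):
        r = psi_k - np.log(theta)
        return -(lam_k + mu * r) / theta          # derivative of obj w.r.t. theta

    x0 = np.exp(psi_k) / np.exp(psi_k).sum()      # feasible start near exp(psi_k)
    res = minimize(obj, x0, jac=grad, method='SLSQP',
                   bounds=[(1e-8, 1.0)] * d,
                   constraints=[{'type': 'eq', 'fun': lambda t: t.sum() - 1.0}])
    return res.x
```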
Algorithms
• Initialize πœƒ by ordinary HMM;
• While not converged:
– Compute optimal state path z for each x;
– Solve Sub-problem 2;
– Solve Sub-problem 1;
– Update $\lambda = \lambda + \mu(\psi - \log\theta)$, $\mu = \mu\sigma$;
– Check convergence;
• End
Comparing Cost of
Optimization Methods
Relationship to Latent SVM
• Latent SVM aims to learn a predictor:
$f_w(x) = \max_z w^T \Psi(x, z)$, with $z$ the latent variable.
• The objective function of the proposed method can be written as:
$\psi: \min_{e^{\psi} \in \Omega} \sum_{x^i \in \mathbb{X}} -\psi^T h(x^i, z^{i*}) + \gamma \sum_{(i,j) \in \mathbb{E}} \epsilon_{ij}$
$\text{s.t. } \psi^T \left( h(x^i, z^{i*}) - h(x^j, z^{j*}) \right) + \epsilon_{ij} \ge \rho,\ \epsilon_{ij} \ge 0$
• Fitting the proposed problem into the latent SVM form, with the state path pair $(z_i^L, z_i^R)$ as the latent variable and $y_i$ the preference label of the pair $(x_i^L, x_i^R)$:
$\min_w \frac{1}{2}\|w\|_2^2 + C \sum_i \epsilon_i$
$\text{s.t. } y_i \max_{z_i^L, z_i^R} w^T \left( h(x_i^L, z_i^L) - h(x_i^R, z_i^R) \right) + \epsilon_i \ge 1,\ \epsilon_i \ge 0$
Relationship to Latent SVM
• Latent SVM cannot guarantee that $w$ is a valid HMM:
– Max-margin is the only requirement;
• The two state paths $z_i^L, z_i^R$ are optimized jointly—there is no known efficient algorithm;
– The Viterbi algorithm is used for the proposed method.
• The $w$ learned by latent SVM only works on a pair of data—usable only for comparisons;
– The proposed method can assign a score to a single datum.
Experiment: Emotion
Recognition
• Recognizing the emotional state of the speaker is very important;
– Human-computer interaction;
• Existing methods try to classify the audio into predefined labels or levels:
– Labeled training data is required;
• We can leverage the power of pairwise comparison via the proposed method.
[Pipeline: training data → extract MFCC → bag of words → pairwise ranks → RHMM models → emotion recognition with RHMM. Dataset: 991 audio clips, 6 emotions at 7 levels; half used for training, with 1000 randomly selected pairs as input.]
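A sketch of the front end suggested by this pipeline: MFCC frames are extracted with librosa and quantized against a K-means codebook to form bag-of-words histograms (for an HMM, one would keep the discrete word sequence per clip instead of the histogram). The codebook size, number of MFCCs, and function name are illustrative, not from the dissertation.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def mfcc_bag_of_words(wav_paths, n_words=64, n_mfcc=13):
    """Extract MFCC frames per clip, learn a codebook, and return one
    normalized bag-of-words histogram per clip plus the codebook."""
    all_frames, per_clip = [], []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=None)
        m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, n_mfcc)
        per_clip.append(m)
        all_frames.append(m)
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(all_frames))
    histograms = []
    for m in per_clip:
        words = codebook.predict(m)                              # frame -> word index
        histograms.append(np.bincount(words, minlength=n_words) / len(words))
    return np.array(histograms), codebook
```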
Experiment Results
Dimension      Improved   Baseline   HMM
Pleasantness   77.30%     57.96%     75.05%
Arousal        86.95%     55.74%     69.55%
Dominance      87.95%     63.04%     77.32%
Credibility    76.68%     55.11%     71.74%
Interest       81.90%     62.56%     78.07%
Positivity     74.99%     67.84%     70.36%
Average        81.28%     53.14%     73.72%
Outline
• Introduction
• Review of Proposal Defense
• Progress After Proposal Defense
– Spatiotemporal Visual Saliency
– Relative Hidden Markov Model
• Conclusion and Future Work
Future Work
• Subspace learning via decomposition:
– Allow each image to be a weighted combination of several imaging conditions;
– Automatically determine the weights and conditions.
• Relative hidden Markov models:
– Theoretical analysis of the learned model;
– Allowing more types of observation models;
– Modeling multiple relative attributes jointly via
multi-task learning framework.
Selected Publications
• Lin Chen, Qiang Zhang and Baoxin Li, Predicting Multiple Attributes via Relative Multi-task Learning, IEEE Computer Vision and Pattern Recognition (CVPR) 2014, Columbus, OH.
• Qiang Zhang and Baoxin Li, Relative Hidden Markov Models for Evaluating Motion Skills, IEEE Computer Vision and Pattern Recognition (CVPR) 2013, Portland, OR.
• Lin Chen, Qiongjie Tian, Qiang Zhang and Baoxin Li, Learning Skill-Defining Latent Space in Video-Based Analysis of Surgical Expertise: A Multi-Stream Fusion Approach, NextMed/MMVR20, San Diego, CA, 2013.
• Qiongjie Tian, Lin Chen, Qiang Zhang and Baoxin Li, Enhancing Fundamentals of Laparoscopic Surgery Trainer Box via Designing a Multi-Sensor Feedback System, NextMed/MMVR20, San Diego, CA, 2013.
• Qiang Zhang, Lin Chen, Qiongjie Tian and Baoxin Li, Video-based Analysis of Motion Skills in Simulation-based Surgical Training, SPIE Multimedia Content Access: Algorithms and Systems VII, San Francisco, CA, 2013.
• Qiang Zhang and Baoxin Li, Video-based Motion Expertise Analysis in Simulation-based Surgical Training Using Hierarchical Dirichlet Process Hidden Markov Model, Proceedings of the 2011 International ACM Workshop on Medical Multimedia Analysis and Retrieval (MMAR '11), ACM, New York, NY, USA, 19-24.
• Qiang Zhang and Baoxin Li, Towards Computational Understanding of Skill Levels in Simulation-Based Surgical Training via Automatic Video Analysis, International Symposium on Visual Computing (ISVC) 2010, Las Vegas, NV.
• Qiang Zhang and Baoxin Li, Mining Discriminative Components with Low-Rank and Sparsity Constraints for Face Recognition, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2012).
• Qiang Zhang and Baoxin Li, Joint Sparsity Model with Matrix Completion for an Ensemble of Images, IEEE International Conference on Image Processing (ICIP) 2010, Hong Kong, China.
Selected Publications
• Qiang Zhang and Baoxin Li, Discriminative K-SVD for Dictionary Learning in Face Recognition, IEEE Computer Vision and Pattern Recognition (CVPR) 2010, San Francisco, CA.
• Qiang Zhang and Baoxin Li, "Max Margin Multi-Attribute Learning with Low Rank Constraint," IEEE Transactions on Image Processing.
• Yilin Wang, Qiang Zhang and Baoxin Li, Semantic Saliency Weighted SSIM for Video Quality Assessment, VPQM 2014, Chandler, AZ.
• Qiang Zhang, Chang Yuan, Xinyu Xu, Peter Van Beek, Hae Jong Seo and Baoxin Li, Efficient Defect Detection with Sign Information of Walsh Hadamard Transform, IS&T/SPIE Image Processing: Machine Vision Applications VI, San Francisco, CA, 2013.
• Jin Zhou, Qiang Zhang, Baoxin Li and Ananya Das, Synthesis of Stereoscopic Views from Monocular Endoscopic Videos, IEEE Computer Vision and Pattern Recognition (CVPR) 2010 Workshop on Mathematical Methods in Biomedical Image Analysis, San Francisco, CA.
• Qiang Zhang, Pengfei Xu, Wen Li, Zhongke Wu and Mingquan Zhou, Efficient Edge Matching Using Improved Hierarchical Chamfer Matching, IEEE International Symposium on Circuits and Systems (ISCAS) 2009, Taipei, Taiwan.
• Qiang Zhang, Hua Li, Yan Zhao and Xinlu Liu, Exploration of Event-Evoked Oscillatory Activities during a Cognitive Task, The 4th International Conference on Natural Computation and The 5th International Conference on Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) 2008, Jinan, China.
• Qiang Zhang and Baoxin Li, "Relative Hidden Markov Models for Video-based Evaluation of Motion Skills in Surgical Training," IEEE Transactions on Pattern Analysis and Machine Intelligence [under review].
Acknowledgements
• Thesis Committee
– Baoxin Li, Pavan Turaga, Yalin Wang and
Jieping Ye
• Visual Presentation & Processing Group
– Xinyu Xu, Jin Zhou, Zheshen Wang, Xiaolong Zhang,
Pradeep Nagesh, Naveen Kulkarni, Devi Archana Paladugu,
Nan Li, Peng Zhang, Lin Chen, Qiongjie Tian, Yilin Wang,
Xu Zhou, Parag Chandakkar, Ragav Venkatesan, Hima
Bindu Maguluri, Collin Walker
Thank You!
Questions?