Shape-Based Human Detection and
Segmentation via Hierarchical Part-Template Matching
Zhe Lin, Member, IEEE
Larry S. Davis, Fellow, IEEE
APRIL 2010
• Introduction
• Previous Work
• Proposed Approach
– Hierarchical Part-Template Matching
– Pose-Adaptive Descriptors
– Combining With Calibration And Background Subtraction
• Experiment Result
• Conclusion
Introduction
• Robust human tracking and identification are
highly dependent on reliable human
detection and human segmentation.
• Detection remains challenging due to variations in
body posture, illumination, occlusion,
and viewpoint.
• Goal: Develop a robust and efficient approach
to detect and segment humans.
• Method: Shape-based, part-template
Previous Work
• Shape Feature extraction schemes
– Model human shapes globally [1],[2],[3]
– Model shapes using sparse local features
• Learning Perspective
– Generative approach – tree-based data structure
– Discriminative approach – using SVMs as the test
classifiers [3]
• Surveillance scenarios
– Motion blob information [35],[36]
Proposed Approach
• A hierarchical part-template matching approach
combined with discriminative learning.
Hierarchical Part-Template Matching
• Generating the part-template tree model
– Synthesizing global shape models
– Generating parts by decomposition
– Constructing an initial tree model using parts
• Learning the part-template tree
• Hierarchical part-template matching
Synthesizing Global Shape Models
• Analyze the articulation of the human body as six parts
– Head, torso, pair of upper legs, pair of lower legs
– The parameters of these parts are quantized into {3,2,3,3,3,3} values, respectively
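The quantization above can be illustrated with a short sketch. It assumes that every combination of quantized parameter values yields one global shape model; the slide does not state this explicitly, and the function name is hypothetical.

```python
from itertools import product

# Number of quantized values per articulation parameter, as listed on the slide.
QUANT_LEVELS = (3, 2, 3, 3, 3, 3)

def enumerate_shape_models(levels=QUANT_LEVELS):
    """Return every combination of quantized parameter indices.

    Assumption: each combination corresponds to one synthesized global shape.
    """
    return list(product(*(range(n) for n in levels)))

models = enumerate_shape_models()
print(len(models))  # product of the quantization levels
```

Under that assumption the number of global shape models is simply the product 3 x 2 x 3 x 3 x 3 x 3.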
Generating Parts by Decomposition
• Binarize (a) to obtain (b), then extract
the boundaries of the silhouettes to get (c).
• Silhouettes are decomposed into three
parts (head-torso, upper legs, and lower legs).
• The parameters of the silhouettes are denoted by
θj, which consists of an index and a location.
Constructing an Initial Tree Model
Using Parts
• A part-template tree is constructed by placing
the decomposed part regions or fragments into
a tree.
• Four layers, L0 to L3, denote the root, head-torso,
upper legs, and lower legs, respectively.
• The tree consists of 186 part-templates (6 ht
models, 18 ul models, and 162 ll models).
• Much larger set only slightly improves in
• Applying fast hierarchical shape matching
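A minimal sketch of such a four-layer tree follows. The uniform branching factors (6, 3, 9) are an assumption chosen only because they reproduce the stated counts of 6, 18, and 162 part-templates; the slides do not say how children are distributed among parents. `Node`, `build_tree`, and `count_per_layer` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    layer: int                      # 0 = root, 1 = head-torso, 2 = upper legs, 3 = lower legs
    index: int                      # index of the part-template within its layer
    children: list = field(default_factory=list)

def build_tree(branching=(6, 3, 9)):
    """Build a tree where every node in a layer has the same number of children.

    Assumption: uniform branching, picked to match the counts on the slide.
    """
    root = Node(0, 0)
    frontier = [root]
    for layer, n_children in enumerate(branching, start=1):
        next_frontier = []
        for parent in frontier:
            for _ in range(n_children):
                child = Node(layer, len(next_frontier))
                parent.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root

def count_per_layer(root):
    """Count nodes in each layer with an iterative depth-first traversal."""
    counts = {}
    stack = [root]
    while stack:
        node = stack.pop()
        counts[node.layer] = counts.get(node.layer, 0) + 1
        stack.extend(node.children)
    return counts

counts = count_per_layer(build_tree())
print(sum(n for layer, n in counts.items() if layer > 0))  # part-templates below the root
```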
Learning the Part-Template Tree
• The tree doesn’t contain any prior statistics
from real human silhouettes.
• The learning is performed by matching the
tree to a set of real human silhouette images.
• The goal is to explicitly estimate branching
probability distributions (conditional
probability distributions).
• Learning method:
– The training silhouette is passed through the tree
from root to estimate the matching score and find
the optimal path.
– Based on the set of paths, a branching probability
distribution is estimated for each node.
– Each node contains a binary image of the part-template,
its sample point coordinates, and a
branching probability.
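The path-counting step can be sketched as follows. This assumes the branching probability of a child is its relative frequency among the optimal paths that pass through its parent; the slides do not name the estimator, so smoothing or priors may differ in the original method. Node identifiers such as "ht0" are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_branching(paths):
    """Estimate P(child | parent) from a set of root-to-leaf optimal paths.

    paths: iterable of node-id sequences, each starting at the root.
    Assumption: plain relative frequencies, without smoothing.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for parent, child in zip(path, path[1:]):
            counts[parent][child] += 1
    probs = {}
    for parent, child_counts in counts.items():
        total = sum(child_counts.values())
        probs[parent] = {child: n / total for child, n in child_counts.items()}
    return probs

# Four hypothetical training silhouettes and their optimal paths.
paths = [
    ("root", "ht0", "ul1", "ll3"),
    ("root", "ht0", "ul1", "ll4"),
    ("root", "ht1", "ul5", "ll9"),
    ("root", "ht0", "ul2", "ll7"),
]
probs = estimate_branching(paths)
print(probs["root"])
```

With more training silhouettes, branches that are rarely taken receive small probabilities and contribute less to the matching score.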
Hierarchical Part-Template Matching
• The matching model is similar to the one used for tree learning.
• The overall matching score for a detection
window is simply modeled as a summation of
the scores of all nodes along the path.
• The score of a node is the product of its part-template
matching score and the probability
of the node.
• Matching method is similar to Chamfer
matching [6].
– The matching score of a sample point on the
contour is measured by edge-orientation
matching to find the optimal human pose.
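The scoring rule described above (sum along a path of template score times node probability) can be sketched as a small recursive search. The exhaustive search is an assumption made for clarity; a fast hierarchical matcher may prune branches instead. The tree layout, scores, and probabilities below are hypothetical.

```python
def best_path(tree, node, template_score, branch_prob):
    """Return (score, path) for the best root-to-leaf path starting at `node`.

    tree:           dict mapping a node id to a list of child ids
    template_score: dict mapping a node id to its part-template matching score
    branch_prob:    dict mapping a node id to its branching probability
    The root has no template, so it contributes a score of 0.
    """
    own = template_score.get(node, 0.0) * branch_prob.get(node, 1.0)
    children = tree.get(node, [])
    if not children:
        return own, [node]
    tail_score, tail_path = max(
        (best_path(tree, child, template_score, branch_prob) for child in children),
        key=lambda result: result[0],
    )
    return own + tail_score, [node] + tail_path

tree = {"root": ["ht0", "ht1"], "ht0": ["ul0", "ul1"], "ht1": ["ul2"]}
scores = {"ht0": 0.8, "ht1": 0.9, "ul0": 0.5, "ul1": 0.7, "ul2": 0.6}
probs = {"ht0": 0.6, "ht1": 0.4, "ul0": 0.5, "ul1": 0.5, "ul2": 1.0}

score, path = best_path(tree, "root", scores, probs)
print(path, round(score, 2))
```

The selected path doubles as the pose estimate, since each node on it identifies one part-template.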
[6] D.M. Gavrila and V. Philomin, “Real-Time Object Detection for SMART Vehicles,”
Proc. IEEE
Pose-Adaptive Descriptors
• Introduce a pose-adaptive feature
computation method for detecting
humans in images using an SVM.
• Candidate detection windows are obtained with a
method similar to the HOG descriptor approach [3].
• Given a candidate detection
window, hierarchical part-template
matching is performed to estimate the
optimal pose.
• After the pose is estimated, block
[3] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,”
Proc. IEEE Conf.
Low-Level Features
• Similar to [3]
• Given an image, calculate gradient
magnitudes |G| and edge orientation O
• Divide the image into 8x8 nonoverlapping
cells, each represented by a histogram of edge orientations
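A rough sketch of these low-level features, in the spirit of [3], is given below. The choices of 9 unsigned orientation bins of width π/9, magnitude-weighted voting, and central-difference gradients are assumptions; the function name is hypothetical.

```python
import numpy as np

def cell_orientation_histograms(image, cell=8, n_bins=9):
    """Compute gradient magnitude |G|, edge orientation O, and per-cell histograms.

    Assumptions: unsigned orientations in [0, pi), n_bins equal-width bins,
    and each pixel votes with its gradient magnitude.
    """
    img = np.asarray(image, dtype=float)
    gy, gx = np.gradient(img)                      # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)        # fold orientation into [0, pi)
    bins = np.minimum((ori / (np.pi / n_bins)).astype(int), n_bins - 1)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, n_bins))
    for r in range(rows):
        for c in range(cols):
            block = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(
                bins[block].ravel(), weights=mag[block].ravel(), minlength=n_bins
            )
    return mag, ori, hist

# A horizontal intensity ramp: every gradient points along the column axis.
ramp = np.tile(np.arange(16, dtype=float), (16, 1))
mag, ori, hist = cell_orientation_histograms(ramp)
print(hist.shape)
```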
Pose Inference on the Low-Level Features
• An optimal tree path is estimated based on
the matching score.
• Among matching score, the part-template
score is measured by an average of gradient
• Matching score
where B(t) = [O(t)/(π/9)] and h is the
orientation histogram
• The average score of the part-template is
Representation Using Pose-Adaptive Descriptors
• The global shape models are represented as a
set of boundary points with corresponding
edge orientations.
Scene-to-Camera Calibration
• To obtain a mapping between head points and
foot points in the image, estimate the
homography between the head plane and the
foot plane in the image.
• The head point is obtained as ph = f(pf), where pf is an
arbitrary foot point.
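The mapping f can be sketched as a planar homography applied in homogeneous coordinates. The matrix `H` below is a made-up example (a pure upward shift of 100 pixels), not a calibrated value; in practice it would be estimated from the scene. `map_foot_to_head` is a hypothetical name.

```python
import numpy as np

def map_foot_to_head(H, p_foot):
    """Map a foot point (x, y) to its head point with a 3x3 homography H."""
    x, y = p_foot
    v = H @ np.array([x, y, 1.0])
    return v[:2] / v[2]          # back from homogeneous to image coordinates

# Hypothetical homography: the head is 100 pixels above the foot everywhere.
H = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, -100.0],
    [0.0, 0.0, 1.0],
])
p_head = map_foot_to_head(H, (320.0, 400.0))
print(p_head)
```

Because homographies are defined up to scale, multiplying `H` by any nonzero constant gives the same head point.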
Combining With Background Subtraction
• Find foot regions Rfoot = {x | ϒx ≥ ξ}.
• Use part-template matching to find
regions that may be legs.
• Given the estimated human vertical axis vx
and an adaptive rectangular window
W(x,(w0,h0)), obtain the human detection.
• Obtain the human segmentation.
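The foot-region test can be sketched under a strong assumption: that ϒx is the fraction of foreground pixels inside a window anchored with its bottom edge at x. The slides do not define ϒx, and the window here has a fixed size rather than an adaptive one. Both function names are hypothetical.

```python
import numpy as np

def foreground_ratio(fg_mask, w0, h0):
    """Fraction of foreground pixels in a window whose bottom row is at each pixel.

    Assumption: the window spans h0 rows upward and about w0 columns centered
    on the pixel, clipped at the image border. Uses an integral image.
    """
    fg = np.asarray(fg_mask, dtype=float)
    n_rows, n_cols = fg.shape
    integral = np.zeros((n_rows + 1, n_cols + 1))
    integral[1:, 1:] = fg.cumsum(axis=0).cumsum(axis=1)
    ratio = np.zeros((n_rows, n_cols))
    for y in range(n_rows):
        for x in range(n_cols):
            top, bottom = max(0, y - h0 + 1), y + 1
            left, right = max(0, x - w0 // 2), min(n_cols, x + w0 // 2 + 1)
            total = (integral[bottom, right] - integral[top, right]
                     - integral[bottom, left] + integral[top, left])
            ratio[y, x] = total / ((bottom - top) * (right - left))
    return ratio

def foot_regions(fg_mask, w0, h0, xi):
    """Boolean mask of candidate foot pixels: ratio(x) >= xi."""
    return foreground_ratio(fg_mask, w0, h0) >= xi

mask = np.zeros((6, 6))
mask[2:5, 2:4] = 1                      # a small foreground blob
ratio = foreground_ratio(mask, w0=2, h0=3)
feet = foot_regions(mask, w0=2, h0=3, xi=0.6)
print(feet[4, 2], feet[0, 0])
```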
Combining With Calibration and
Background Subtraction
Experiment Result
• Results of the human detector
on two public pedestrian data sets.
• Results of the multiple occluded human
detector on three crowded image and video
data sets.
• Comparison with other approaches using DET curves
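For readers unfamiliar with this kind of comparison, here is a minimal sketch assuming "DET" refers to detection error tradeoff curves as used in [3], that is, miss rate plotted against false positives per window. The scores and the function name are hypothetical.

```python
import numpy as np

def det_curve(pos_scores, neg_scores, thresholds):
    """Return (false positives per window, miss rate) for each threshold.

    pos_scores: classifier scores of windows that contain a human
    neg_scores: classifier scores of windows that do not
    A window is accepted when its score is >= the threshold.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    miss_rate = np.array([(pos < t).mean() for t in thresholds])
    fppw = np.array([(neg >= t).mean() for t in thresholds])
    return fppw, miss_rate

pos = [0.9, 0.8, 0.3, 0.6]
neg = [0.1, 0.4, 0.7, 0.2, 0.05]
fppw, miss = det_curve(pos, neg, thresholds=[0.5])
print(fppw, miss)
```

Sweeping the threshold traces the curve; a detector whose curve lies lower has fewer misses at the same false-positive rate.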
Experiment of Detection Result
• Better performance than HOG-SVM.
• Not only detecting but also segmenting
human poses.
• Can be further improved because of capability
of being extended to cover more pose or
• Successfully detects difficult poses that the
HOG-based detector misses.
Experiment of Segmentation Result
• The pose model and the probabilistic
hierarchical part-template matching algorithm
give very accurate segmentations on the MIT-CBCL and INRIA data sets.
Experiment Without Subtraction
Experiment With Subtraction
• Data set
– CAVIAR benchmark data set
– Munich Airport data set collected by Siemens
Corporate Research
• Good results are obtained even with poor and
inaccurate background subtraction.
Conclusion
• A hierarchical part-template matching
approach is employed to match
human shapes with images, detecting and
segmenting humans simultaneously.
• Many of the misdetections are due to
pose estimation failures.
• Future work
– Investigating the addition of color and
texture statistics to the local contextual