An Overview of Vision-Based 3D Human Body Representations

Angel D. Sappa (1), Niki Aifanti (2), Nikos Grammalidis (2),
and Sotiris Malassiotis (2)

(1) Computer Vision Center
Edifici O Campus UAB
08193 - Bellaterra, Barcelona, Spain
Tel: +34 93 581 3036
Fax: +34 93 581 1670
e-mail: angel.sappa@cvc.uab.es

(2) Informatics & Telematics Institute
1st Km Thermi-Panorama Road
Thermi-Thessaloniki, Greece
Tel: +30 2310 46 41 60
Fax: +30 2310 46 41 64
e-mail: {naif, ngramm, malasiot}@iti.gr
Keywords: Human body modeling; Vision-based body representations; Coding standards
INTRODUCTION
Human body modeling was first tackled within the computer graphics (CG) community, to serve
applications in the film industry and computer games. Since then, several tools have been
developed for editing and animating 3D digital body models. Although most of those tools were
initially devised within the CG community, nowadays a lot of work comes from the computer
vision (CV) community. In spite of this overlapping interest, there is a considerable difference
between CG and CV human body model (HBM) applications. CG pursues realistic models of both
human body geometry and its associated motion. CV, in contrast, seeks an efficient model rather
than an accurate one, for applications such as intelligent video surveillance, motion analysis,
telepresence, and 3D video sequence processing and coding.
This work focuses on vision-based human body modeling systems. It presents some of the
techniques proposed in the literature, together with their advantages and disadvantages. The
outline is as follows. First, the geometrical primitives and mathematical formalisms used for 3D
model representation are addressed. Next, a brief description of the standards used for coding
HBMs is given. Finally, future trends and conclusions are presented.
3D HUMAN BODY REPRESENTATIONS
Modeling a human body implies, first, the definition of an articulated 3D structure that
represents the biomechanical features of the human body. Second, it involves the choice of an
appropriate mathematical model to govern the movements of that articulated structure.
[Figure 1. Stick representation of an articulated model defined by 22 DOF. Labels in the figure:
global, root, hip, knee, Link, Joint and point A; joints are annotated with 3 DOF or 1 DOF.]
Several 3D articulated representations and mathematical formalisms have been proposed in the
literature to model both the structure and the movements of a human body (Green & Guan, 2004).
Generally, an HBM is represented as a chain of rigid bodies, called links, interconnected to one
another by joints. Links can be represented by means of sticks (Yoo, Nixon & Harris, 2002;
Taylor, 2000), polyhedra (Saito & Hoshino, 2001), generalized cylinders (Sidenbladh, Black &
Sigal, 2002) or superquadrics (Marzani, Calais & Legrand, 2001). A joint interconnects two links
by means of rotational motions about its axes. The number of independent rotation parameters
defines the degrees of freedom (DOF) associated with a given joint. Fig. 1 presents an
illustration of an articulated model defined by 12 links (sticks) and 10 joints. Other HBM
representations, which do not follow the aforementioned links-and-joints philosophy, have also
been proposed in the literature to tackle specific applications. For example, Douros, Dekker and
Buxton (1999) present a technique to represent HBMs as single entities by means of smooth
surfaces or polygonal meshes. This kind of representation is only useful as a rigid description of
the human body. In contrast, Plänkers and Fua (2003) and Aubel, Boulic and Thalmann (2000)
present frameworks that retain an articulated structure represented by sticks, but replace the
simple geometric primitives with soft objects. The result of this soft surface representation is a
realistic model where body parts such as the chest, abdomen or biceps muscles are well modeled.
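The links-and-joints description above maps naturally onto a small data structure. The following is a minimal sketch, not taken from any of the cited systems: the class names, the two-joint leg chain and its DOF values (a 3-DOF hip and a 1-DOF knee) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Joint:
    name: str
    dof: int  # number of independent rotation parameters of this joint


@dataclass(frozen=True)
class Link:
    name: str
    parent_joint: str             # joint at the proximal end of the rigid body
    child_joint: Optional[str]    # joint at the distal end, None for a terminal link


def total_dof(joints: List[Joint]) -> int:
    """Total DOF of an articulated model is the sum over its joints."""
    return sum(j.dof for j in joints)


# Hypothetical leg chain (values are illustrative assumptions).
joints = [Joint("hip", 3), Joint("knee", 1)]
links = [Link("thigh", "hip", "knee"), Link("shin", "knee", None)]
leg_dof = total_dof(joints)
```

A full-body model would simply extend the two lists; the trade-off discussed in the text shows up directly as the length of the parameter vector implied by `total_dof`.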
The simplest 3D articulated structure is a stick representation with no associated volume or
surface (Liebowitz & Carlsson, 2001). Planar 2D representations, such as cardboard models,
have also been widely used (Huang & Huang, 2002). However, volumetric representations are
preferred when more realistic models need to be generated. In other words, there is a trade-off
between accuracy of representation and complexity: the models should be reasonably realistic,
yet have few enough parameters to be processed in real time. Table 1 presents a summary of
some of the approaches followed in the literature:
Authors                          DOF   Geometrical Model Representation
-------------------------------  ----  ----------------------------------------------------------
Delamarre and Faugeras (2001)    22    Truncated cones (arms and legs), spheres (neck, joints and
                                       head) and right parallelepipeds (hands, feet and torso)
Gavrila (1999)                   22    Superquadrics
Barron and Kakadiaris (2000)     60    Sticks
Cohen, Medioni and Gu (2001)     32    Generalized cylinders
Ning et al. (2004)               12    Truncated cones (torso, arms and legs) and a sphere (head)

Table 1. Human body structure representations
Each of the aforementioned geometrical structures is complemented by a motion model that
governs its movements (Rohr, 1997); the objective is for the full body to perform realistic
movements. There is a wide variety of ways to model articulated systems mathematically from a
kinematics and dynamics point of view. A mathematical model includes the parameters that
describe the links as well as information about the constraints associated with each joint. A
model that only includes this information is called a kinematics model and describes the possible
static states of a system. The state vector of a kinematics model consists of the model state and
the model parameters. To model a system in motion, the dynamics of the system must be modeled
as well. A dynamics model describes the state evolution of the system over time; its state vector
includes linear and angular velocities as well as position.
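The difference between the two state vectors can be made concrete with a short sketch. Everything below is an illustrative assumption rather than a model from the cited literature: the class names, the choice of joint angles and link lengths as the kinematic state and parameters, and in particular the constant-velocity rule used to evolve the dynamic state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KinematicState:
    joint_angles: List[float]   # model state: one value per DOF, in radians
    link_lengths: List[float]   # model parameters describing the links


@dataclass
class DynamicState:
    joint_angles: List[float]       # position part of the state
    joint_velocities: List[float]   # angular velocities, in rad/s


def predict(state: DynamicState, dt: float) -> DynamicState:
    """Assumed constant-velocity evolution: each angle advances by velocity * dt."""
    new_angles = [a + v * dt
                  for a, v in zip(state.joint_angles, state.joint_velocities)]
    return DynamicState(new_angles, list(state.joint_velocities))


# Hypothetical two-DOF example.
current = DynamicState(joint_angles=[0.0, 0.5], joint_velocities=[1.0, -1.0])
predicted = predict(current, dt=0.1)
```

A kinematic state can only say which postures are possible; the dynamic state adds what is needed to predict the next posture from the current one.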
After selecting an appropriate model for a particular application, it is necessary to develop a
concise mathematical formulation for a general solution to the kinematics and dynamics
problems, which are non-linear. Different formalisms have been proposed to assign local
reference frames to the links. The simplest approach is to introduce joint hierarchies formed by
independent articulations of one DOF, described in terms of Euler angles. The body posture is
then synthesized by concatenating the transformation matrices associated with the joints,
starting from the root. To illustrate this notation, let us express the coordinates of point A in
the global reference frame associated with the root of the model (see Fig. 1):
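\[
A_{\mathrm{global}} = \mathrm{Trans}_{\mathrm{root-global}} \times \mathrm{Trans}_{\mathrm{hip-root}} \times \mathrm{Trans}_{\mathrm{knee-hip}} \times A_{\mathrm{knee}}
\]
where $A_{\mathrm{knee}}$ represents the coordinates of point A relative to the local reference
frame placed at the knee joint, and $\mathrm{Trans}_{i-j}$ are the transformation matrices that
express reference frame i in reference frame j. These matrices are defined as:
\[
\mathrm{Trans}_{i-j} =
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} =
\begin{bmatrix}
C_\phi C_\theta C_\psi - S_\phi S_\psi & -C_\phi C_\theta S_\psi - S_\phi C_\psi & C_\phi S_\theta & t_x \\
S_\phi C_\theta C_\psi + C_\phi S_\psi & -S_\phi C_\theta S_\psi + C_\phi C_\psi & S_\phi S_\theta & t_y \\
-S_\theta C_\psi & S_\theta S_\psi & C_\theta & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\]
where C and S denote the cosine and sine of the subscripted angle, respectively, and
(φ, θ, ψ) are the Euler angles. This kind of matrix concatenation can be used to express every
body part in the global reference frame of the body.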
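The concatenation can be sketched in a few lines of code. The rotation block of the matrix above equals the product Rz(φ) Ry(θ) Rz(ψ), and the function below fills in its entries directly. The joint angles, translations and coordinates of point A are made-up values for illustration only, and the helper names are hypothetical.

```python
import math


def trans(phi, theta, psi, t):
    """4x4 homogeneous matrix [R | T; 0 | 1] with the Euler-angle rotation block above."""
    cf, sf = math.cos(phi), math.sin(phi)
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    tx, ty, tz = t
    return [
        [cf * ct * cp - sf * sp, -cf * ct * sp - sf * cp, cf * st, tx],
        [sf * ct * cp + cf * sp, -sf * ct * sp + cf * cp, sf * st, ty],
        [-st * cp, st * sp, ct, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(m, p):
    """Apply a homogeneous transform to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Hypothetical posture: only the knee frame is rotated (theta = 90 degrees).
trans_root_global = trans(0.0, 0.0, 0.0, (0.0, 0.0, 1.0))
trans_hip_root = trans(0.0, 0.0, 0.0, (0.1, 0.0, -0.1))
trans_knee_hip = trans(0.0, math.pi / 2, 0.0, (0.0, 0.0, -0.4))
a_knee = (0.0, 0.0, -0.4)

# Concatenate from the root outwards, then map point A into the global frame.
chain = matmul(matmul(trans_root_global, trans_hip_root), trans_knee_hip)
a_global = transform_point(chain, a_knee)
```

Because the matrices are multiplied starting from the root, changing one joint angle automatically moves every body part further down the chain.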
3D HUMAN BODY CODING STANDARDS
In order to animate or interchange HBMs, a standard representation is required. Related
standards, such as the Web3D H-Anim standards, the MPEG-4 face and body animation, and the
MPEG-4 AFX extensions for humanoid animation, allow compatibility between different HBM
processing tools (e.g. HBMs created using an editing tool could be animated using another,
completely different tool).
The Web3D H-Anim working group was formed so that developers could agree on a standard
naming convention for human body parts and joints. This group has produced the Humanoid
Animation Specification (H-Anim) standards, which describe a standard way of representing
humanoids in VRML. These standards allow humanoids created using authoring tools from one
vendor to be animated using tools from another. H-Anim humanoids can be animated using
keyframing, inverse kinematics, performance animation systems and other techniques. The three
main design goals of the H-Anim standards are:
• Compatibility: Humanoids should be able to display/animate in any VRML compliant browser.
• Flexibility: No assumptions are made about the types of applications that will use humanoids.
• Simplicity: The specification should contain only what is absolutely necessary.
Accordingly, an H-Anim file defines a hierarchy of Joint nodes, each of which defines the
rotation center of a joint. The most common implementation for a Joint is a VRML Transform
node, which is used to define the relationship of each body segment to its immediate parent.
Each Joint node can contain other Joint nodes, and may also contain a Segment node, which
contains information about the 3D geometry, colour and texture of the body part associated with
that joint. Each Segment can also have a number of Site nodes, which define specific locations
relative to the segment. Joint nodes may also contain additional hints for inverse-kinematics
systems that wish to control the H-Anim figure.
The H-Anim Joint and Segment hierarchy is shown in Fig. 2.
Figure 2: The H-anim 1.1 Joint and Segment hierarchy (from H-anim website). Three
sets of joints are identified, classified according to their significance, so that H-Anim
models of different complexity can be produced. Segments are shown with dark grey
colour and Sites with light grey colour. Each object beginning with l_ (left) has a
corresponding object beginning with r_ (right). Chart was produced by J. Eric Mason
and Veronica Polo, VR Telecom Inc.
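The Joint, Segment and Site nesting described above can be mirrored by a small tree structure. This is only a sketch in Python, not VRML and not the H-Anim node definitions themselves: the field names are simplified assumptions, the joint and segment names follow the l_ prefix convention mentioned in the caption but are used here purely as examples, and the site name and all coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Site:
    name: str
    translation: Vec3   # location relative to the owning segment


@dataclass
class Segment:
    name: str
    sites: List[Site] = field(default_factory=list)


@dataclass
class Joint:
    name: str
    center: Vec3                         # rotation center of the joint
    segment: Optional[Segment] = None    # body part attached to this joint
    children: List["Joint"] = field(default_factory=list)


def joint_names(joint: Joint) -> List[str]:
    """Depth-first list of joint names, parents before children."""
    names = [joint.name]
    for child in joint.children:
        names.extend(joint_names(child))
    return names


# Illustrative fragment of a left leg (all numbers are made up).
l_knee = Joint("l_knee", (0.1, 0.5, 0.0),
               Segment("l_calf", [Site("l_knee_marker", (0.0, 0.02, 0.05))]))
l_hip = Joint("l_hip", (0.1, 0.9, 0.0), Segment("l_thigh"), [l_knee])
root = Joint("HumanoidRoot", (0.0, 0.95, 0.0), Segment("pelvis"), [l_hip])
```

Because every tool walks the same named hierarchy, an animation authored against `l_knee` in one tool can be applied to the `l_knee` of a model built in another, which is the compatibility goal stated above.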
Furthermore, the MPEG-4 SNHC (Synthetic and Natural Hybrid Coding) group has
standardized two types of streams in order to animate avatars:
• The Face/Body Definition Parameters (FDP/BDP) are avatar-specific and based on the
H-Anim specifications.
• The Face/Body Animation Parameters (FAP/BAP) are used to animate face/body models.
More specifically, 168 Body Animation Parameters (BAPs) are defined by MPEG-4 SNHC
to describe almost any possible body posture. Thus, a single set of FAPs/BAPs can be used
to describe the face/body posture of different avatars. MPEG-4 has also standardized the
compressed form of the resulting animation stream using two techniques: DCT-based or
prediction-based. Typical bitrates for these compressed bitstreams are 2 kbps for facial
animation and 10 to 30 kbps for body animation.
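To give a feel for the bitrates quoted above, the short calculation below converts them into stream sizes for one minute of animation. It is plain arithmetic under stated assumptions: a constant bitrate, 1 kbps taken as 1000 bits per second, 1 kB taken as 1000 bytes, and a hypothetical 60-second clip.

```python
def stream_size_kilobytes(bitrate_kbps: float, duration_s: float) -> float:
    """Size of a constant-bitrate stream (1 kbps = 1000 bit/s, 1 kB = 1000 bytes)."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1000


duration_s = 60  # hypothetical clip length
face_kb = stream_size_kilobytes(2, duration_s)        # facial animation
body_low_kb = stream_size_kilobytes(10, duration_s)   # body animation, lower bound
body_high_kb = stream_size_kilobytes(30, duration_s)  # body animation, upper bound
```

Under these assumptions a minute of facial animation takes 15 kB and a minute of body animation between 75 kB and 225 kB.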
In addition, complex 3D deformations that can result from the movement of specific body parts
(e.g. muscle contraction, clothing folds, etc.) can be modeled by using Face/Body Animation
Tables (FAT/BATs), which specify sets of vertices that undergo non-rigid motion and a function
that describes this motion with respect to the values of specific BAPs/FAPs. However, a
significant problem with such tables is that they are body model-dependent and require a
complex modeling stage. To solve such problems, MPEG-4 addresses new animation
functionalities in the framework of the AFX group by also including a generic seamless virtual
model definition and bone-based animation. In particular, the AFX specification describes
state-of-the-art components for rendering geometry, textures, volumes and animation. A
hierarchy of geometry, modeling, physics and biomechanical models is described, along with
advanced tools for animating these models (Figure 3).
[Figure 3. Hierarchy of AFX Animation Models. Levels labeled in the figure: Cognitive,
Behavior, Biomechanical, Physics, Modeling, Geometry.]
Specifically, the new Humanoid Animation Framework defined by MPEG-4 SNHC (Preda,
2002; Preda & Prêteux, 2001) is a biomechanical model in AFX based on a rigid skeleton. The
skeleton consists of bones, which are rigid objects that can be transformed (rotated around
specific joints) but not deformed. A skin model is attached to the skeleton and smoothly follows
any skeleton movement.
FUTURE TRENDS AND CONCLUSIONS
Vision-based applications have grown considerably fast during the last two decades. As a
result, current technology can tackle tasks as complex as human body modeling, although at the
moment only under well-defined constraints. In addition, the knowledge collected during this
time from different research areas (e.g. video processing, rigid/articulated object modeling,
human body/motion models, etc.) also helps in addressing vision-based human body modeling.
However, in spite of this large amount of work, many issues are still open. Problems such as
the development of models including prior knowledge, the modeling of multiple-person
environments, and real-time performance still need to be solved efficiently.
In addition to the aforementioned issues, the reduction of processing time is one of the
milestones in the non-rigid object modeling field. It depends heavily on two factors:
computational complexity on the one hand, and current technology on the other. Judging by the
evolution of recent years, computational complexity is not expected to be significantly reduced
in the coming years. In contrast, improvements in technology have become commonplace (e.g.
reduction in acquisition and processing times, increase in memory size). Therefore, algorithms
that are computationally prohibitive nowadays are expected to perform well with upcoming
technologies. This gives rise to a promising future for HBM applications and, by extension, for
non-rigid object modeling in general.
REFERENCES
Aubel, A., Boulic, R., & Thalmann, D. (2000). Real-time display of virtual humans: Levels of
details and impostors. IEEE Trans. on Circuits and Systems for Video Technology, Special
Issue on 3D Video Technology, 10(2), 207-217.
Barron, C., & Kakadiaris, I. (2000). Estimating anthropometry and pose from a single camera.
IEEE Int. Conf. on Computer Vision and Pattern Recognition. Hilton Head Island, USA.
Cohen, I., Medioni, G., & Gu, H. (2001). Inference of 3D human body posture from multiple
cameras for vision-based user interface. World Multiconference on Systemics, Cybernetics
and Informatics. USA.
Delamarre, Q., & Faugeras, O. (2001). 3D articulated models and multi-view tracking with
physical forces. Special Issue on Modelling People, Computer Vision and Image
Understanding, 81, 328-357.
Douros, I., Dekker, L., & Buxton, B. (1999). An improved algorithm for reconstruction of the
surface of the human body from 3D scanner data using local B-spline patches. IEEE Int.
Workshop on Modeling People. Corfu, Greece.
Gavrila, D. M. (1999). The visual analysis of human movement: A Survey. Computer Vision and
Image Understanding, 73(1), 82-98.
Green, R., & Guan, L. (2004). Quantifying and recognizing human movement patterns from
monocular video images—Part I: a new framework for modelling human motion. IEEE
Trans. on Circuits and Systems for Video Technology, 14(2), 179-190.
Huang, Y., & Huang, T. (2002). Model-based human body tracking. 16th Int. Conf. on Pattern
Recognition. Quebec City, Canada.
Liebowitz, D., & Carlsson, S. (2001). Uncalibrated motion capture exploiting articulated
structure constraints. IEEE Int. Conf. on Computer Vision. Vancouver, Canada.
Marzani, F., Calais, E., & Legrand, L. (2001). A 3-D marker-free system for the analysis of
movement disabilities - an application to the legs. IEEE Trans. on Information Technology in
Biomedicine, 5(1), 18-26.
Ning, H., Tan, T., Wang, L., & Hu, W. (2004). Kinematics-based tracking of human walking in
monocular video sequences. Image and Vision Computing, 22, 429-441.
Plänkers, R., & Fua, P. (2003). Articulated soft objects for multiview shape and motion capture.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(9), 1182-1188.
Preda, M. (Ed.). (2002). MPEG-4 Animation Framework eXtension (AFX) VM 9.0.
ISO/IEC JTC1/SC29/WG11 N5245.
Preda, M., & Prêteux, F. (2001). Advanced virtual humanoid animation framework based on the
MPEG-4 SNHC Standard. Euroimage ICAV 3D 2001 Conference. Mykonos, Greece.
Rohr, K. (1997). Human movement analysis based on explicit motion models. In M. Shah &
R. Jain (Eds.), Motion-Based Recognition (Chapter 8, pp. 171-198). Dordrecht/Boston: Kluwer
Academic Publishers.
Saito, H., & Hoshino, J. (2001). A Match Moving Technique for Merging CG and Human Video
Sequences. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing. Salt Lake City,
USA.
Sidenbladh, H., Black, M.J., & Sigal, L. (2002). Implicit probabilistic models of human motion
for synthesis and tracking. European Conf. on Computer Vision. Copenhagen, Denmark.
Taylor, C. (2000). Reconstruction of articulated objects from point correspondences in a single
uncalibrated image. IEEE Int. Conf. on Computer Vision and Pattern Recognition. Hilton
Head Island, USA.
Yoo, J., Nixon, M., & Harris, C. (2002). Extracting human gait signatures by body segment
properties. 5th IEEE Southwest Symp. on Image Analysis and Interpretation, Santa Fe, USA.
Terms and Definitions
Human Body Modeling: digital model generally describing the shape and motion of a human
body.
Articulated Object: structure composed of two or more rigid bodies interconnected by means of
joints. The degrees of freedom associated with each joint define the different structure
configurations.
Virtual Reality: 3D digital world, simulating the real one, allowing a user to interact with
objects as if inside it.
VRML: Virtual Reality Modeling Language, a platform-independent language for virtual reality
scene description.
H-anim: VRML Consortium Charter for Humanoid Animation Working Group. This group has
recently produced the International Standard “Information technology — Computer graphics
and image processing — Humanoid animation (H-Anim)”, i.e. an abstract representation for
modeling three-dimensional human figures.
Rotation matrix: A linear operator rotating a vector in a given space. A rotation matrix has only
three degrees of freedom in 3D and one in 2D. It can be parameterized in various ways,
usually through Euler angles, yaw-pitch-roll angles, rotation angles around the coordinate
axes, etc.
MPEG: Moving Picture Experts Group. A group developing standards for coding digital audio
and video, as used in e.g. video CD, DVD and digital television. This term is often used to
refer to media that is stored in the MPEG-1 format.
MPEG-2: A standard formulated by the ISO Moving Picture Experts Group (MPEG), a subset of
ISO Recommendation 13818, meant for transmission of studio-quality audio and video. It
covers 4 levels of video resolution.
MPEG-4: A standard formulated by the ISO Moving Picture Experts Group (MPEG), originally
concerned with applications similar to those of H.263 (very low bit rate channels, up to
64 kbps). Subsequently extended to encompass a large set of multimedia applications, including
applications over the Internet.
MPEG-4 AFX: MPEG-4 extension with the aim to define high-level components and a
framework to describe realistic animations and 2D/3D objects.
MPEG-7: A standard formulated by the ISO Moving Picture Experts Group (MPEG). Unlike
MPEG-2 and MPEG-4, which deal with compressing multimedia contents within specific
applications, it specifies the structure and features of the compressed multimedia content
produced by the different standards, for instance to be used in search engines.