An Overview of Vision-Based 3D Human Body Representations

Angel D. Sappa (1), Niki Aifanti (2), Nikos Grammalidis (2), and Sotiris Malassiotis (2)

(1) Computer Vision Center, Edifici O, Campus UAB, 08193 Bellaterra, Barcelona, Spain
Tel: +34 93 581 3036, Fax: +34 93 581 1670, e-mail: angel.sappa@cvc.uab.es

(2) Informatics & Telematics Institute, 1st Km Thermi-Panorama Road, Thermi-Thessaloniki, Greece
Tel: +30 2310 46 41 60, Fax: +30 2310 46 41 64, e-mail: {naif, ngramm, malasiot}@iti.gr

Keywords: Human body modeling; Vision-based body representations; Coding standards

INTRODUCTION

The problem of human body modeling was first tackled within the computer graphics (CG) community, for applications related to the film industry and computer games. Since then, several tools have been developed for editing and animating 3D digital body models. Although most of those tools were initially devised within the CG community, a great deal of work now comes from the computer vision (CV) community. Despite this shared interest, there is a considerable difference between CG and CV human body model (HBM) applications. CG pursues realistic models of both human body geometry and its associated motion. CV, in contrast, favors an efficient model over an accurate one, for applications such as intelligent video surveillance, motion analysis, telepresence, and 3D video sequence processing and coding.

The current work focuses on vision-based human body modeling systems. This overview presents some of the techniques proposed in the literature, together with their advantages and disadvantages. The outline of this work is as follows. First, the geometrical primitives and mathematical formalisms used for 3D model representation are addressed. Next, a brief description of the standards used for coding HBMs is given. Finally, future trends and conclusions are presented.
3D HUMAN BODY REPRESENTATIONS

Modeling a human body first requires the definition of an articulated 3D structure that represents the biomechanical features of the body. Second, it requires the choice of an appropriate mathematical model to govern the movements of that articulated structure. Several 3D articulated representations and mathematical formalisms have been proposed in the literature to model both the structure and the movements of a human body (Green & Guan, 2004).

[Figure 1. Stick representation of an articulated model defined by 22 DOF. Labels in the figure: root, hip, knee, link, joint, point A, global reference frame; joints are annotated with 3 DOF or 1 DOF.]

Generally, an HBM is represented as a chain of rigid bodies, called links, interconnected to one another by joints. Links can be represented by means of sticks (Yoo, Nixon & Harris, 2002; Taylor, 2000), polyhedra (Saito & Hoshino, 2001), generalized cylinders (Sidenbladh, Black & Sigal, 2002) or superquadrics (Marzani, Calais & Legrand, 2001). A joint interconnects two links by means of rotational motions about the joint axes. The number of independent rotation parameters defines the degrees of freedom (DOF) associated with a given joint. Fig. 1 illustrates an articulated model defined by 12 links (sticks) and 10 joints.

Other HBM representations, which do not follow the links-and-joints philosophy, have also been proposed in the literature to tackle specific applications. For example, Douros, Dekker and Buxton (1999) present a technique to represent HBMs as single entities by means of smooth surfaces or polygonal meshes. This kind of representation is only useful as a rigid description of the human body. In contrast, Plänkers and Fua (2003) and Aubel, Boulic and Thalmann (2000) present frameworks that retain an articulated structure represented by sticks but replace the simple geometric primitives with soft objects.
The result of this soft surface representation is a realistic model in which body parts such as the chest, abdomen or biceps muscles are well modeled.

The simplest 3D articulated structure is a stick representation with no associated volume or surface (Liebowitz & Carlsson, 2001). Planar 2D representations, such as cardboard models, have also been widely used (Huang & Huang, 2002). However, volumetric representations are preferred when more realistic models need to be generated. In other words, there is a trade-off between accuracy of representation and complexity. The models used should be reasonably realistic, but they should have a small number of parameters so that they can be processed in real time. Table 1 summarizes some of the approaches followed in the literature:

| Authors | DOF | Geometrical model representation |
|---|---|---|
| Delamarre and Faugeras (2001) | 22 | Truncated cones (arms and legs), spheres (neck, joints and head) and right parallelepipeds (hands, feet and torso) |
| Gavrila (1999) | 22 | Superquadrics |
| Barron and Kakadiaris (2000) | 60 | Sticks |
| Cohen, Medioni and Gu (2001) | 32 | Generalized cylinders |
| Ning et al. (2004) | 12 | Truncated cones (torso, arms and legs) and a sphere (head) |

Table 1. Human body structure representations

Each of the aforementioned geometrical structures is complemented by a motion model that governs its movements (Rohr, 1997); the objective is for the full body to perform realistic movements. There is a wide variety of ways to mathematically model articulated systems from a kinematics and dynamics point of view. A mathematical model includes the parameters that describe the links as well as information about the constraints associated with each joint. A model that only includes this information is called a kinematics model and describes the possible static states of a system. The state vector of a kinematics model consists of the model state and the model parameters. A system in motion is modeled when the dynamics of the system are modeled as well.
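As a minimal sketch of the links-and-joints structure and of the kinematics state vector just described, the following Python fragment builds a small joint hierarchy, counts its DOF, checks a pose against the joint constraints and assembles a state vector from joint angles (model state) and link lengths (model parameters). The class `Joint`, the helper names, the three-joint chain and all numeric limits and lengths are illustrative assumptions and are not taken from any of the cited systems.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Joint:
    """A joint together with the rigid link (stick) that starts at it."""
    name: str
    limits: List[Tuple[float, float]]  # one (min, max) pair in radians per DOF
    link_length: float                 # model parameter (made-up value, metres)
    children: List["Joint"] = field(default_factory=list)

    @property
    def dof(self) -> int:
        # Number of independent rotation parameters of this joint.
        return len(self.limits)


def walk(joint: Joint) -> List[Joint]:
    """Depth-first list of joints, starting from the given joint."""
    joints = [joint]
    for child in joint.children:
        joints.extend(walk(child))
    return joints


def total_dof(root: Joint) -> int:
    return sum(j.dof for j in walk(root))


def satisfies_constraints(root: Joint, angles: List[float]) -> bool:
    """Check a flat list of joint angles against every joint limit."""
    limits = [lim for j in walk(root) for lim in j.limits]
    if len(angles) != len(limits):
        return False
    return all(lo <= a <= hi for a, (lo, hi) in zip(angles, limits))


def kinematic_state(root: Joint, angles: List[float]) -> List[float]:
    """Model state (joint angles) followed by model parameters (link lengths)."""
    return list(angles) + [j.link_length for j in walk(root)]


# Hypothetical chain: root (3 DOF) -> hip (3 DOF) -> knee (1 DOF).
knee = Joint("knee", [(0.0, 2.4)], 0.45)
hip = Joint("hip", [(-1.0, 2.0), (-0.8, 0.8), (-0.7, 0.7)], 0.45, [knee])
root = Joint("root", [(-math.pi, math.pi)] * 3, 0.10, [hip])

pose = [0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.6]
print(total_dof(root))
print(satisfies_constraints(root, pose))
print(kinematic_state(root, pose))
```

A dynamics model would extend the same vector with linear and angular velocities; the sketch stops at the static (kinematics) description.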
A dynamics model describes the state evolution of the system over time. In a dynamics model the state vector includes linear and angular velocities as well as position.

After selecting an appropriate model for a particular application, it is necessary to develop a concise mathematical formulation for a general solution to the kinematics and dynamics problems, which are non-linear. Different formalisms have been proposed in order to assign local reference frames to the links. The simplest approach is to introduce joint hierarchies formed by independent one-DOF articulations, described in terms of Euler angles. The body posture is then synthesized by concatenating the transformation matrices associated with the joints, starting from the root. To illustrate this notation, let us express the coordinates of point A in the global reference frame associated with the root of the model (see Fig. 1):

$$
A_{global} = \mathrm{Trans}_{root\text{-}global} \times \mathrm{Trans}_{hip\text{-}root} \times \mathrm{Trans}_{knee\text{-}hip} \times A_{knee}
$$

where $A_{knee}$ represents the coordinates of point A relative to the local reference frame placed at the knee joint, and $\mathrm{Trans}_{i\text{-}j}$ are the transformation matrices that express reference frame i in reference frame j. These matrices are defined as:

$$
\mathrm{Trans}_{i\text{-}j} =
\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} =
\begin{bmatrix}
C_\phi C_\theta C_\psi - S_\phi S_\psi & -C_\phi C_\theta S_\psi - S_\phi C_\psi & C_\phi S_\theta & t_x \\
S_\phi C_\theta C_\psi + C_\phi S_\psi & -S_\phi C_\theta S_\psi + C_\phi C_\psi & S_\phi S_\theta & t_y \\
-S_\theta C_\psi & S_\theta S_\psi & C_\theta & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

where C and S denote the cosine and sine respectively, $(\phi, \theta, \psi)$ are the Euler angles, and $(t_x, t_y, t_z)$ are the components of the translation T. This kind of matrix concatenation can be used to express every body part in the global reference frame of the body.

3D HUMAN BODY CODING STANDARDS

In order to animate or interchange HBMs, a standard representation is required. Related standards, such as the Web3D H-anim standards, the MPEG-4 face and body animation, and the MPEG-4 AFX extensions for humanoid animation, allow compatibility between different HBM processing tools (e.g. HBMs created using an editing tool could be animated using another, completely different tool).

The Web3D H-anim working group (H-anim) was formed so that developers could agree on a standard naming convention for human body parts and joints. This group has produced the Humanoid Animation Specification (H-anim) standards, which describe a standard way of representing humanoids in VRML. These standards allow humanoids created using authoring tools from one vendor to be animated using tools from another. H-anim humanoids can be animated using keyframing, inverse kinematics, performance animation systems and other techniques. The three main design goals of the H-anim standards are:

• Compatibility: Humanoids should be able to display/animate in any VRML compliant browser.
• Flexibility: No assumptions are made about the types of applications that will use humanoids.
• Simplicity: The specification should contain only what is absolutely necessary.

For this reason, an H-anim file defines a hierarchy of Joint nodes, each defining the rotation center of a joint. The most common implementation for a Joint is a VRML Transform node, which is used to define the relationship of each body segment to its immediate parent. Each Joint node can contain other Joint nodes, and may also contain a Segment node, which holds information about the 3D geometry, colour and texture of the body part associated with that joint. Each Segment can also have a number of Site nodes, which define specific locations relative to the segment. Joint nodes may also contain additional hints for inverse-kinematics systems that wish to control the H-anim figure. The H-anim Joint and Segment hierarchy is shown in Fig. 2.

Figure 2: The H-anim 1.1 Joint and Segment hierarchy (from H-anim website). Three sets of joints are identified, classified according to their significance, so that H-anim models of different complexity can be produced.
Segments are shown with dark grey colour and Sites with light grey colour. Each object beginning with l_ (left) has a corresponding object beginning with r_ (right). Chart was produced by J. Eric Mason and Veronica Polo, VR Telecom Inc.

Furthermore, the MPEG-4 SNHC (Synthetic and Natural Hybrid Coding) group has standardized two types of streams in order to animate avatars:

• The Face/Body Definition Parameters (FDP/BDP) are avatar-specific and based on the H-anim specifications.
• The Face/Body Animation Parameters (FAP/BAP) are used to animate face/body models. More specifically, 168 Body Animation Parameters (BAPs) are defined by MPEG-4 SNHC to describe almost any possible body posture. Thus, a single set of FAPs/BAPs can be used to describe the face/body posture of different avatars.

MPEG-4 has also standardized the compressed form of the resulting animation stream using two techniques: DCT-based or prediction-based. Typical bitrates for these compressed bitstreams are 2 kbps for facial animation and 10 to 30 kbps for body animation.

In addition, complex 3D deformations that can result from the movement of specific body parts (e.g. muscle contraction, clothing folds, etc.) can be modeled by using Face/Body Animation Tables (FAT/BATs), which specify sets of vertices that undergo non-rigid motion and a function that describes this motion with respect to the values of specific BAPs/FAPs. However, a significant problem with such tables is that they are body model-dependent and require a complex modeling stage. To solve such problems, MPEG-4 addresses new animation functionalities in the framework of the AFX group by also including a generic seamless virtual model definition and bone-based animation. In particular, the AFX specification describes state-of-the-art components for rendering geometry, textures, volumes and animation.
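The statement above that a single set of BAPs can describe the posture of different avatars can be illustrated with the transformation-matrix concatenation used earlier for the root, hip and knee frames. The sketch below is not the MPEG-4 bitstream format: it simply applies one set of Euler angles (standing in for animation parameters) to two avatars whose segment offsets (standing in for definition parameters) differ. The function names, dictionary keys and all numeric values are hypothetical.

```python
import math


def trans(phi, theta, psi, t):
    """4x4 homogeneous transform [R | T; 0 | 1], with R = Rz(phi) Ry(theta) Rz(psi)."""
    cf, sf = math.cos(phi), math.sin(phi)
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    tx, ty, tz = t
    return [
        [cf * ct * cp - sf * sp, -cf * ct * sp - sf * cp, cf * st, tx],
        [sf * ct * cp + cf * sp, -sf * ct * sp + cf * cp, sf * st, ty],
        [-st * cp, st * sp, ct, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(m, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


def point_a_global(posture, avatar):
    """A_global = Trans_root-global x Trans_hip-root x Trans_knee-hip x A_knee."""
    chain = trans(*posture["root"], avatar["root_in_global"])
    chain = matmul(chain, trans(*posture["hip"], avatar["hip_in_root"]))
    chain = matmul(chain, trans(*posture["knee"], avatar["knee_in_hip"]))
    return transform_point(chain, avatar["a_in_knee"])


# Avatar-specific geometry (made-up offsets, in metres).
small_avatar = {"root_in_global": (0.0, 0.0, 1.0), "hip_in_root": (0.0, 0.10, 0.0),
                "knee_in_hip": (0.0, 0.0, -0.4), "a_in_knee": (0.0, 0.0, -0.4)}
tall_avatar = {"root_in_global": (0.0, 0.0, 1.2), "hip_in_root": (0.0, 0.12, 0.0),
               "knee_in_hip": (0.0, 0.0, -0.5), "a_in_knee": (0.0, 0.0, -0.5)}

# One posture (Euler angles per joint) shared by both avatars: knee bent by 90 degrees.
posture = {"root": (0.0, 0.0, 0.0), "hip": (0.0, 0.0, 0.0),
           "knee": (0.0, math.pi / 2, 0.0)}

print(point_a_global(posture, small_avatar))
print(point_a_global(posture, tall_avatar))
```

The same `posture` drives both figures; only the geometry dictionaries change, which is the separation between definition and animation parameters that the two stream types express.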
A hierarchy of geometry, modeling, physics and biomechanical models is described, along with advanced tools for animating these models (Figure 3).

[Figure 3. Hierarchy of AFX Animation Models. Levels shown in the figure: Geometry, Modeling, Physics, Biomechanical, Behavior, Cognitive.]

Specifically, the new Humanoid Animation Framework defined by MPEG-4 SNHC (Preda, 2002; Preda & Prêteux, 2001) is specified as a biomechanical model in AFX and is based on a rigid skeleton. The skeleton consists of bones, which are rigid objects that can be transformed (rotated around specific joints) but not deformed. A skin model is attached to the skeleton and smoothly follows any skeleton movement.

FUTURE TRENDS AND CONCLUSIONS

Vision-based applications have grown considerably over the last two decades. As a result of that growth, current technology can tackle tasks as complex as human body modeling, although for now only under well-defined constraints. In addition, the knowledge collected during this time in different research areas (e.g. video processing, rigid/articulated object modeling, human body/motion models, etc.) also supports vision-based human body modeling.

However, in spite of this large amount of work, many issues are still open. Problems such as the development of models that include prior knowledge, the modeling of multiple-person environments, and real-time performance still need to be solved efficiently. In addition to these issues, the reduction of processing time is one of the milestones in the non-rigid object modeling field. It depends strongly on two factors: on the one hand the computational complexity, and on the other the current technology. Judging from the evolution of recent years, computational complexity is not likely to be reduced significantly in the coming years. In contrast, improvements in the current technology have become commonplace (e.g.
reduction in acquisition and processing times, increase in memory size). Therefore, algorithms that are computationally prohibitive today are expected to perform well with upcoming technologies. This gives rise to a promising future for HBM applications and, by extension, for non-rigid object modeling in general.

REFERENCES

Aubel, A., Boulic, R., & Thalmann, D. (2000). Real-time display of virtual humans: Levels of details and impostors. IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on 3D Video Technology, 10(2), 207-217.
Barron, C., & Kakadiaris, I. (2000). Estimating anthropometry and pose from a single camera. IEEE Int. Conf. on Computer Vision and Pattern Recognition. Hilton Head Island, USA.
Cohen, I., Medioni, G., & Gu, H. (2001). Inference of 3D human body posture from multiple cameras for vision-based user interface. World Multiconference on Systemics, Cybernetics and Informatics. USA.
Delamarre, Q., & Faugeras, O. (2001). 3D articulated models and multi-view tracking with physical forces. Special Issue on Modelling People, Computer Vision and Image Understanding, 81, 328-357.
Douros, I., Dekker, L., & Buxton, B. (1999). An improved algorithm for reconstruction of the surface of the human body from 3D scanner data using local B-spline patches. IEEE Int. Workshop on Modeling People. Corfu, Greece.
Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1), 82-98.
Green, R., & Guan, L. (2004). Quantifying and recognizing human movement patterns from monocular video images - Part I: A new framework for modelling human motion. IEEE Trans. on Circuits and Systems for Video Technology, 14(2), 179-190.
Huang, Y., & Huang, T. (2002). Model-based human body tracking. 16th Int. Conf. on Pattern Recognition. Quebec City, Canada.
Liebowitz, D., & Carlsson, S. (2001). Uncalibrated motion capture exploiting articulated structure constraints. IEEE Int. Conf. on Computer Vision. Vancouver, Canada.
Marzani, F., Calais, E., & Legrand, L. (2001). A 3-D marker-free system for the analysis of movement disabilities: An application to the legs. IEEE Trans. on Information Technology in Biomedicine, 5(1), 18-26.
Ning, H., Tan, T., Wang, L., & Hu, W. (2004). Kinematics-based tracking of human walking in monocular video sequences. Image and Vision Computing, 22, 429-441.
Plänkers, R., & Fua, P. (2003). Articulated soft objects for multiview shape and motion capture. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(9), 1182-1188.
Preda, M. (Ed.). (2002). MPEG-4 Animation Framework eXtension (AFX) VM 9.0. ISO/IEC JTC1/SC29/WG11 N5245.
Preda, M., & Prêteux, F. (2001). Advanced virtual humanoid animation framework based on the MPEG-4 SNHC Standard. Euroimage ICAV 3D 2001 Conference. Mykonos, Greece.
Rohr, K. (1997). Human movement analysis based on explicit motion models. Chapter 8 in M. Shah & R. Jain (Eds.), Motion-Based Recognition (pp. 171-198). Dordrecht, Boston: Kluwer Academic Publishers.
Saito, H., & Hoshino, J. (2001). A match moving technique for merging CG and human video sequences. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing. Salt Lake City, USA.
Sidenbladh, H., Black, M. J., & Sigal, L. (2002). Implicit probabilistic models of human motion for synthesis and tracking. European Conf. on Computer Vision. Copenhagen, Denmark.
Taylor, C. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. IEEE Int. Conf. on Computer Vision and Pattern Recognition. Hilton Head Island, USA.
Yoo, J., Nixon, M., & Harris, C. (2002). Extracting human gait signatures by body segment properties. 5th IEEE Southwest Symp. on Image Analysis and Interpretation. Santa Fe, USA.

Terms and Definitions

Human Body Modeling: digital model generally describing the shape and motion of a human body.
Articulated Object: structure composed of two or more rigid bodies interconnected by means of joints. The degrees of freedom associated with each joint define the different structure configurations.

Virtual Reality: 3D digital world, simulating the real one, allowing a user to interact with objects as if inside it.

VRML: Virtual Reality Modeling Language, a platform-independent language for virtual reality scene description.

H-anim: VRML Consortium Charter for Humanoid Animation Working Group. This group has recently produced the International Standard “Information technology — Computer graphics and image processing — Humanoid animation (H-Anim)”, i.e. an abstract representation for modeling three-dimensional human figures.

Rotation matrix: A linear operator rotating a vector in a given space. A rotation matrix has only three degrees of freedom in 3D and one in 2D. It can be parameterized in various ways, usually through Euler angles, yaw-pitch-roll angles, rotation angles around the coordinate axes, etc.

MPEG: Moving Picture Experts Group. A group developing standards for coding digital audio and video, as used in e.g. video CD, DVD and digital television. The term is also often used to refer to media stored in the MPEG-1 format.

MPEG-2: A standard formulated by the ISO Moving Picture Experts Group (MPEG), a subset of ISO Recommendation 13818, meant for transmission of studio-quality audio and video. It covers 4 levels of video resolution.

MPEG-4: A standard formulated by the ISO Moving Picture Experts Group (MPEG), originally concerned with applications similar to those of H.263 (very low bit rate channels, up to 64 kbps). It was subsequently extended to encompass a large set of multimedia applications, including applications over the Internet.

MPEG-4 AFX: MPEG-4 extension that aims to define high-level components and a framework to describe realistic animations and 2D/3D objects.

MPEG-7: A standard formulated by the ISO Moving Picture Experts Group (MPEG). Unlike MPEG-2 and MPEG-4, which deal with compressing multimedia contents within specific applications, it specifies the structure and features of the compressed multimedia content produced by the different standards, for instance to be used in search engines.