A Contextual Ontology Based Describing Framework for Multimedia Metadata Application Systems

1. Introduction

As Mark Weiser pointed out [1], computing is becoming pervasive. In the pervasive computing domain, multimedia data is widely used and plays an increasingly important role in many systems, especially in surveillance. However, because of the huge semantic gap between multimedia data and its understanding, when an intelligent surveillance system faces multimedia data captured by various sensors, it cannot derive the correct context or understand the user's intention, and therefore cannot provide automatic and accurate services. As a result, the system often needs human assistance. It is widely accepted that this gap between data and semantic meaning limits the development of surveillance systems. We believe that the metadata describing schema plays a key role in bridging the gap. However, because current multimedia metadata describing schemas are isolated from the system's information and architecture, their usefulness is limited. In addition, it is difficult to analyze media content automatically, and current metadata schemas offer no effective way to express the results of feature extraction. The lack of precise models and formats for object and system representation, together with the high complexity of multimedia processing algorithms, makes the development of fully automatic semantic multimedia analysis and management systems a challenging task [2]. Many research groups are actively proposing solutions and standards for the management and exchange of multimedia data. Within the MUSCLE NoE, research focuses on standards, technologies and techniques for integrating, exchanging and enhancing the use of multimedia within a variety of research areas. At CNR ISTI, Patrizia Asirelli et al. are developing an infrastructure for MultiMedia Metadata Management (4M) to support the integration of media from different sources.
This infrastructure enables the collection, analysis and integration of media for semantic annotation, search and retrieval [3]. Jeffrey E. Boyd et al. embed the low-level functions performed by a video surveillance system into cameras. They build video information servers that are conceptually similar to MPEG-7 cameras, but differ in that they interact with client applications and can be configured dynamically to do more than describe video content [4]. Trivedi et al. present an overall system architecture to support the design and development of intelligent environments. Their system captures data with omni-directional and PTZ cameras. Data from various sensors are encoded in XML and stored in a Knowledge Base through the system's Server Core. The system contains modules for multi-camera multi-person tracking, event detection, event-based servoing for selective attention, voxelization, and streaming face recognition [5]. In this paper we mainly discuss an ontology-based framework for describing multimedia metadata in application systems. The framework addresses the representation and communication of data, metadata and other information. By defining ontologies, we can implement exchange and interaction not only among the modules of one system but also among different systems, and we can support context-aware services through a dynamic context model.

2. Related work on visual data description

Because XML is readable, common and extensible, it is widely accepted that XML-based languages are a good approach for describing multimedia data, e.g. MPEG-7, CVML, and VERL & VEML. Thor List and Robert B. Fisher proposed the XML-based Computer Vision Markup Language (CVML) for use in cognitive vision, to enable separate research groups to collaborate with each other and to make their results more available to other areas of science and industry.
[6] CVML mainly emphasizes low-level features but lacks support for high-level semantics. In the "Challenge Project on Video Event Taxonomy" sponsored by the Advanced Research and Development Activity (ARDA) of the U.S., more than 30 researchers in computer vision and knowledge representation, together with representatives of the user community, proposed the formal language VERL for describing event taxonomies. They also proposed VEML (Video Event Markup Language) for annotating event instances of VERL. VERL & VEML are strongly object-oriented and highly abstract, but lack low-level feature expression (Fig. 1) [7].

Fig. 1. Diagram of the relationship between VERL and VEML

MPEG-7, the Multimedia Content Description Interface [8], was proposed by the MPEG organization in 1996 and became an international standard in 2001. It can describe both low-level features and high-level semantics. However, although its descriptive power is strong, the semantics of its elements have no formal grounding, and the resulting interoperability problems prevent an effective use of MPEG-7 as a language for describing multimedia [9].

Fig. 2. Scope of MPEG-7

3. Requirements for designing a multimedia metadata describing schema

The original purpose of designing a multimedia metadata schema was to annotate multimedia data so that its content could be searched and retrieved. As we enter the pervasive computing era, multimedia metadata should play a more fundamental and important role. Recognizing that metadata will be the main hinge among all modules in a system, we argue that the metadata describing schema should be based on the information system itself. Different modules of the system exchange information with each other via metadata. For example, the storage module indexes and stores the outputs and intermediate results of other modules as metadata, and context information is also derived from metadata.
From this we can derive the following requirements for multimedia metadata:

Extensibility. In multimedia applications, metadata expression will not be static. For example, an application system may add new modules or modify existing ones, which requires the metadata to be extensible enough to adapt to the system's changes. In addition, the objects described by metadata may change or grow. Currently the media types are mainly video, audio, etc., but as technology evolves new media may appear. The multimedia metadata schema should not have to be redesigned for every new medium.

Interoperability (readability). When designing the metadata describing schema, we should consider information fusion both among different modules and among different systems. A multimedia application system consists of various modules, and many tasks require their cooperation, which demands a common understanding of the subject. Here interoperability means that a processing unit can understand the results of other units and adapt itself accordingly. In addition, different multimedia application systems often use different information formats, which prevents fusion across systems. Metadata with good readability eases maintenance and transformation between formats, and also helps developers maintain the system.

Ease of parsing and network transport. Research on multimedia application systems is moving toward multi-sensor, heterogeneous and distributed architectures. Different computing units have to share information and communicate over the network. A distributed system demands that the metadata describing various kinds of information be easy to transport and parse, so that information can be exchanged quickly and processed efficiently.

Ability to describe and reason about context information.
Because the same action may convey different meanings in different contexts, a multimedia application system needs context information to detect and understand events. Enabling system intelligence should be a starting point and target of metadata schema design; otherwise the metadata will limit its own usage. The design should support the expression of context information, ranging from low-level features to high-level semantic descriptions. In addition, the metadata should support more complex expressions that combine different kinds of information.

Ease of error checking and tracing. Because low-level feature extraction involves uncertainty, errors sometimes occur that are only discovered later; this requires an error-checking mechanism. Without such a mechanism, erroneous decisions will affect later processing and lead to wrong results. With it, the system should record the state of error checking without slowing down the processing of updated metadata.

4. The design of the ontology-based metadata describing framework

When designing a multimedia metadata describing schema, we should not limit ourselves to media metadata alone. We propose a framework that takes the system's processing platform into account. In our framework we define not only media data but also other information, including context and system settings. As a whole, we call the describing schema MeSysONT (Multimedia System Ontology). Because our framework is based on the system's architecture, this paper briefly introduces the Software Platform for Distributed Visual Information Processing we proposed in 2007 [10].

4.1 Brief introduction to the Software Platform for Distributed Visual Information Processing

A Distributed Visual Information Processing (DiVI) system is a distributed intelligent system that employs distributed camera arrays. The platform adopts a multi-level system organization and a multi-server architecture.
This division between common services and applications simplifies application development and deployment and improves the flexibility of the whole system (Fig. 3).

Fig. 3. An overview of the system architecture

With this architecture, a new processing unit can be added to the platform conveniently without further interference with the system. For transparent communication between the platform and applications, a set of XML read-write classes is also provided. These classes define the message format and encapsulate the serialization and deserialization of messages. This use of XML also gives the platform agility for future demands [10].

4.2 Why use ontology?

New computing technology will play an increasingly important role in assisting people's lives. However, the main obstacle in developing application systems is the huge gap between data and understanding. A system cannot grasp a person's meaning merely by capturing his actions; it has to know the scenario and the dynamic context, just as people have common-sense knowledge. Taking this into account, we introduce the concept of ontology into our describing framework. The term "ontology" was borrowed from philosophy and introduced into knowledge engineering as a means of abstracting and representing knowledge. Ontologies are used to build consensual terminologies for domain knowledge in a formal way so that it can be more easily shared and reused. More recently, ontologies have been applied in many fields of computer science, such as the Semantic Web, e-commerce, and information systems [11]. Moreover, ontologies have been used in context-aware systems in the pervasive computing domain, e.g. CoBrA, Gaia, SOCAM and so on. Clearly organizing a domain with an ontology also makes correction easy and convenient.
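The XML read-write classes of the platform are not detailed in [10]; the following Python sketch only shows the general idea of such a message wrapper. Class, tag and field names here are our own illustration, not the platform's actual API.

```python
import xml.etree.ElementTree as ET

class PlatformMessage:
    """Sketch of an XML message wrapper in the spirit of the platform's
    read-write classes. All names are illustrative, not the real API."""

    def __init__(self, msg_type, fields):
        self.msg_type = msg_type    # e.g. "TrackingResult"
        self.fields = dict(fields)  # payload as key/value pairs

    def serialize(self):
        """Pack the message into an XML string for network transport."""
        root = ET.Element("Message", type=self.msg_type)
        for key, value in self.fields.items():
            ET.SubElement(root, key).text = str(value)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def deserialize(cls, xml_text):
        """Rebuild a message object from its XML wire form."""
        root = ET.fromstring(xml_text)
        fields = {child.tag: child.text for child in root}
        return cls(root.get("type"), fields)

# Round trip: one module packs a result, another module unpacks it.
msg = PlatformMessage("TrackingResult", {"BlobId": "Blob_1", "X": 120, "Y": 45})
wire = msg.serialize()
copy = PlatformMessage.deserialize(wire)
```

A real implementation would additionally carry routing information (e.g. the destination module or group), so that the Host Server can forward messages between modules that do not know each other's deployment.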
Ontology can supply the information needed to detect the general context, or what the user wants, more exactly. In particular, intelligent agents need a well-defined ontology to understand situational information and to reason about vague situations (context understanding). Moreover, designing a context base brings an immediate benefit: the context that is needed can be described explicitly, and the system can process it through the procedures that compose context, as set down in the organizational model. Furthermore, ontology can help computer vision algorithms obtain better results and make it possible to apply them to assist people's lives. For example, a big problem in applying CV algorithms is that the environment often changes. If the system has an ontology of the physical environment and can reason about the dynamic context, it can help an algorithm adapt its parameters and thus address its robustness and suitability problems. We can also define a sensor information ontology that enables the system to select suitable sensors or combine them; we believe this can mitigate the occlusion problem.

4.3 Ontology-based describing schema

For multimedia applications in pervasive computing, we propose an ontology-based framework to describe information, including multimedia metadata. Its basic idea is that the multimedia metadata is grounded in an ontology that also describes the system's information, including its architecture, components and so on. Strengthening the power of semantic expression is the other main idea. The ontology describes information ranging from raw data to the system's overall structure. Additionally, we introduce the description of context information and a reasoning mechanism to support context-aware services. Fig. 4 shows an overview of the ontology framework. The ontology covers various kinds of information, which we divide into three types by function: media data related, information system related, and context information ontology.
Additionally, we define a task ontology so that the system's functions can be described and services provided conveniently. In our opinion, defining the application system's overall structure and the understanding of multimedia data as an ontology brings the following benefits:

- It makes multimedia data easier to understand.
- It is convenient to analyze and reason on the data.
- It is useful for fusing different modules and systems.
- It helps developers understand and evolve the system.

Media data related ontology includes two levels: low and high. The low level mainly expresses the results of feature extraction and carries no semantic meaning. When designing the low-level media metadata ontology, we should consider how to express changes in its fundamental elements; this helps the feature extraction algorithms. The high-level description has simple semantic meaning and provides basic material for context reasoning.

Fig. 4. Overview of the contextual ontology framework

Information system related ontology mainly describes the application system's structure and its functional modules. This design enables different modules to cooperate with each other more conveniently, which helps the system implement distributed computing.

Context information related ontology includes static and dynamic ontology. The static context mainly describes stable or rarely changing information, e.g. physical environment information, network parameters, and sensor parameters. The dynamic context mainly describes how the state information of entities (including human users) changes within a given static context. With the task ontology, the system knows what service should be provided in a given context. It should be noted that in real situations the ontology is not strictly layered: its components have various relationships with each other. Fig. 5 shows an example of a partial contextual ontology.
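To make such cross-links concrete, the sketch below holds a partial contextual ontology as subject-predicate-object triples and queries the entities that meet at a shared Location concept. All concept and relation names are hypothetical, loosely following a home scenario.

```python
# A partial contextual ontology as subject-predicate-object triples.
# All concept and relation names are illustrative only.
triples = [
    ("Camera_1", "monitors",  "Location_LivingRoom"),  # sensor ontology
    ("Person_A", "locatedIn", "Location_LivingRoom"),  # dynamic context
    ("TV",       "placedIn",  "Location_LivingRoom"),  # static environment
    ("Camera_2", "monitors",  "Location_Kitchen"),
    ("Person_B", "locatedIn", "Location_Kitchen"),
]

def related_via_location(location):
    """Return every entity linked to the given Location concept."""
    return sorted({s for (s, _p, o) in triples if o == location})

# Sensors, users and environment objects from different parts of the
# ontology meet through the shared Location concept:
print(related_via_location("Location_LivingRoom"))  # -> ['Camera_1', 'Person_A', 'TV']
```

A practical system would express such knowledge in a formal ontology language rather than in Python lists, but the cross-cutting role of the Location concept is the same.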
We can see that many components relate to each other through the Location ontology.

Fig. 5. An example of a home scenario ontology

5. Overview of the information processing architecture in the framework

A target of the framework is to enable the system to provide context-aware services. With such services, the system not only obtains the current context to reason about the user's intention, but can also guide the low-level feature extraction algorithms. Under our framework, different processing units 'know' each other better, so they cooperate more efficiently and conveniently.

Fig. 6. An overview of the information processing architecture

Fig. 6 shows a brief overview of the information processing architecture. Data capturing modules capture raw data from various sensors and pass it to the data processing modules. The data processing modules generate metadata according to the framework. The analysis & reasoning module obtains the current context using the Dynamic Context Model [12]. Combined with the tasks defined by the ontology, the system can understand the user's intention and provide active services. Additionally, the current context can guide the low-level feature extraction algorithms.

6. An example implementation in the intelligent surveillance domain

Based on the framework, we have implemented a prototype system named the MPEG-7 Based Video Surveillance Information System [13]. The system is used in a hall scenario. It can archive multimedia raw data and metadata, and retrieve data by content. One of its functions is to warn a person who enters a restricted space that can be set by the system manager.

Fig. 7. Architecture of the software platform

The system has a distributed structure. From Fig. 7 we can see that the whole system consists of two layers. The Host Server is responsible for network processing; the Application Modules do not need to know how the other Application Modules are deployed in the network.
They only need to connect to the Host Server, which enhances the flexibility of the system. Encoding and decoding of the multimedia data is performed in the Host Server. Since each computer runs only one Host Server, the system compresses or decompresses the same multimedia data only once, which saves communication and computing resources.

Fig. 8 is a diagram of the describing schema in a video surveillance scenario. Context describes manager information, system settings, camera parameters and so on. LLevel describes the blob information captured by the motion detection module. HLevel describes the high-level semantic information, including the relationships among motion blobs and the types of moving objects. The relation between LLevel and HLevel is expressed by L2HGraph and H2LGraph.

Fig. 8. Diagram of the describing schema

The following is an example of system settings in the context information:

<Host id="Host_1" ip="166.111.139.121">            <!-- host location -->
  <Port>
    <VideoListen>5001</VideoListen>                <!-- video data monitoring -->
    <MessageSendListen>6000</MessageSendListen>    <!-- message sending -->
    <MessageRevListen>6006</MessageRevListen>      <!-- message receiving -->
    <RemoteListen>7000</RemoteListen>              <!-- monitors connections from other hosts -->
  </Port>
  <Modules>                                        <!-- modules of this host -->
    <Module id="Module_1">
      <Name>MotionTracking_1</Name>                <!-- module name -->
      <Group>TrackingGroup</Group>                 <!-- belongs to TrackingGroup -->
      <Property>Vision Process</Property>          <!-- module's property -->
    </Module>
  </Modules>
  <Connections>                                    <!-- designated connections for transmitting messages -->
    <Sock>166.111.250.105</Sock>                   <!-- connect to host 166.111.250.105 -->
  </Connections>
</Host>

The following example shows the relation between low-level features and high-level semantic meaning.
<Relation xsi:type="SegmentSemanticBaseRelationType" name="hasMediaPerceptionOf" source="#Body_1" target="#Blob_1"/>
<Relation xsi:type="SegmentSemanticBaseRelationType" name="hasMediaPerceptionOf" source="#Body_1" target="#Blob_3"/>
<Relation xsi:type="SegmentSemanticBaseRelationType" name="hasMediaPerceptionOf" source="#Body_1" target="#Blob_4"/>

The example above denotes that the high-level semantic entity Body_1 consists of the three blobs Blob_1, Blob_3 and Blob_4 in the video picture. This approach also improves error checking. For example, when after some processing we find that Body_2 and Body_5 are the same entity, we can add a description as follows:

<Relation xsi:type="SemanticBaseSemanticBaseRelation" name="equivalentTo" source="#Body_2" target="#Body_5"/>

7. Conclusions

We have proposed a framework that supports not only multimedia metadata but also context-aware services. Multimedia metadata is integrated into the system's description so that the data can be understood more deeply. The use of ontology enables module fusion and context reasoning. The ontology-based framework supports a high-level abstraction of metadata and contextual information with the power of a formal describing schema, allowing context inference to provide more precise context information adapted to changing, heterogeneous smart-space environments. Further research will investigate probabilistic description logic approaches with more inference power to make the system more robust and extensible.

References:
[1] Weiser, M.: The Computer for the 21st Century. Mobile Computing and Communications Review, Vol. 3, No. 3.
[2] Dasiopoulou, S., Papastathis, V.K., Mezaris, V., Kompatsiaris, I., Strintzis, M.G.: An ontology framework for knowledge-assisted semantic video analysis and annotation.
In: Proceedings of the 4th International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2004) at the 3rd International Semantic Web Conference (ISWC 2004), 2004.
[3] Asirelli, P., Little, S., Martinelli, M., Salvetti, O.: MultiMedia Metadata Management: a Proposal for an Infrastructure. In: SWAP 2006, Semantic Web Technologies and Applications, December 18-20, Pisa, Italy, 2006.
[4] Boyd, J.E., Sayles, M., Olsen, L., Tarjan, P.: Content description servers for networked video surveillance. In: International Conference on Information Technology: Coding and Computing (ITCC 2004), 2004, pp. 798-803.
[5] Trivedi, M.M., Huang, K.S., Mikic, I.: Dynamic context capture and distributed video arrays for intelligent spaces. IEEE Transactions on Systems, Man and Cybernetics, Part A, Vol. 35, No. 1, Jan. 2005, pp. 145-163.
[6] List, T., Fisher, R.B.: CVML - an XML-based Computer Vision Markup Language. In: Proc. International Conference on Pattern Recognition, Cambridge, Vol. 1, pp. 789-792, 2004.
[7] Nevatia, R., Hobbs, J., Bolles, B.: An Ontology for Video Event Representation. In: 2004 International Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'04), Vol. 7, 2004.
[8] Martinez, J.M.: MPEG-7 Overview (version 10). ISO/IEC JTC1/SC29/WG11 N6828, Palma de Mallorca, October 2004.
[9] Troncy, R., Bailer, W., Hausenblas, M., Hofmair, P., Schlatte, R.: Enabling Multimedia Metadata Interoperability by Defining Formal Semantics of MPEG-7 Profiles. In: Avrithis, Y., et al. (eds.): SAMT 2006, LNCS 4306, pp. 41-55, 2006.
[10] Wang, Y., Tao, L., Liu, Q., Zhao, Y., Xu, G.: A flexible multi-server platform for distributed video information processing. In: The 5th International Conference on Computer Vision Systems, 2007.
[11] Ye, J., Coyle, L., Dobson, S., Nixon, P.: Ontology-based models in pervasive computing systems. The Knowledge Engineering Review, Vol. 22, No. 4, pp. 315-347, 2007.
[12] Dai, P., Xu, G.:
Event Driven Dynamic Context Model for Group Interaction Analysis. In: Proc. International Conference on Soft Computing and Human Sciences (SCHS'07), Kitakyushu, Japan, 2-5 Aug. 2007.
[13] Zhao, Y.: MPEG-7 Based Video Surveillance Information System. Master's thesis, Tsinghua University, Beijing, 2007 (in Chinese).