A Scalable and Extensible Segment-Event-Object based Sports Video Retrieval System

DIAN TJONDRONEGORO, Queensland University of Technology, Australia
YI-PING PHOEBE CHEN, Deakin University, Australia
ADRIEN JOLY, Queensland University of Technology, Australia
________________________________________________________________________

Sports video data is growing rapidly as a result of the maturing digital technologies that support digital video capture, faster data processing, and larger storage. However, (1) semi-automatic content extraction and annotation, (2) scalable indexing models, and (3) effective retrieval and browsing still pose the most challenging problems for maximizing the usage of large video databases. This paper presents the findings from a comprehensive work that proposes a scalable and extensible sports video retrieval system with two major contributions in the area of sports video indexing and retrieval. The first contribution is a new sports video indexing model that utilizes a semi-schema-based indexing scheme on top of an Object-Relationship approach. This indexing model is scalable and extensible as it enables gradual index construction, supported by the ongoing development of future content extraction algorithms. The second contribution is a set of novel queries, based on XQuery, that generate dynamic and user-oriented summaries and event structures. The proposed sports video retrieval system has been fully implemented and populated with soccer, tennis, swimming, and diving videos. The system has been evaluated by 20 users to demonstrate and confirm its feasibility and benefits. The experimental sports genres were specifically selected to represent the four main categories of the sports domain: period-, set-point-, time (race)-, and performance-based sports. Thus, the proposed system should be generic and robust for all types of sports.

Categories and Subject Descriptors: H3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – Abstracting methods, Indexing methods; Information Storage and Retrieval – Query formulation; H5.1 [Information Interfaces and Applications]: Multimedia Information Systems – Video

General Terms: Design, Experimentation, Languages

Additional Key Words and Phrases: Video database system, sports video retrieval, automatic content extraction, indexing, XML, XQuery, MPEG-7, mobile video interaction
________________________________________________________________________

1. INTRODUCTION

In the midst of a digital media era, we have witnessed rapid development of the technologies that support digital video broadcast and capture, faster data processing, and large storage. As a result of these maturing technologies, sports video is growing vastly, as it attracts a wide range of audiences and is generally broadcast for long hours. To maximize the potential usage of sports video, we have identified three major requirements.

Authors' addresses: Dian Tjondronegoro, School of Information Systems, Queensland University of Technology, Australia; Yi-Ping Phoebe Chen, School of Information Technology, Deakin University, Australia; Adrien Joly, Faculty of Information Technology, Queensland University of Technology, Australia. Correspondence should be addressed to dian@qut.edu.au.
First is a system that combines powerful techniques to automatically extract key audio-visual and semantic content. The content extraction system also summarizes the lengthy content into a compact, meaningful, and more enjoyable presentation. Second is a robust, expressive, and flexible video data indexing model that supports gradual index construction. Indexing is required to support effective browsing and retrieval. In fact, how users can access the video data depends significantly upon how the videos are indexed. Third is a powerful language that can dynamically generate user/application-oriented video summaries for smart browsing and support various search strategies. Queries should be designed based on potential usage, not simply because they are possible or easy to achieve. For example, users can hardly benefit from queries that are based on low-level features, such as audio volume and visual shape.

For most viewers (including sports fans), a compact and summarized version often appears more attractive than the full-length video. Especially now that most handheld devices, such as PDAs and 3G mobile phones, can play video, users expect to access sports video anywhere and at any time. Mobile video interaction increases the necessity for more effective and content-based retrieval for the following reasons. Mobile devices have limited capabilities for supporting users in watching the full contents of a sports video due to their small screen size and restricted battery life. Since local storage in mobile devices is relatively small, the cost of downloading or streaming content is also a major issue. Thus, users need to selectively watch particular segments to reduce the time and cost of downloading full-video content. Current solutions for desktop or mobile streaming of video content have not fully exploited the power of content-based indexing to enhance users' experience. For example, UEFA.com via RealPlayer 10 only allows users to watch a fixed set of soccer video segments which are compiled as 15-minute highlights. To improve the flexibility of content access, the system must be more adaptive to users' requirements. For instance, users should be able to select which key events they want to watch as a highlight. Mobility also limits users' ability to type complex queries. Thus, content-based and personalized summarization that creates browseable dynamic event structures will make users feel in total control.

This paper presents the findings from a comprehensive work that proposes a scalable and extensible sports video retrieval system with two major contributions in the area of sports video indexing and retrieval. The first contribution is an extensible sports video indexing model that utilizes a semi-schema-based indexing scheme on top of an Object-Relationship approach to support gradual content extraction.
Schema-based matching ensures that the video indexes are valid during data operations such as insertion and retrieval, thereby minimizing the need for manual checking. However, the model is also schema-less as it allows instantiated objects to declare additional elements beyond those in the schema definition. Moreover, not all elements in an object need to be instantiated at one time, because video content extraction often requires several passes due to the complexity and lengthiness of processing. Thus, scalability and extensibility are supported in terms of gradual index construction (the number of concepts/relationships) as a result of future developments of new algorithms to extract semantics automatically from video, and not in terms of the number of objects to be indexed. For this purpose, ORA-SS notation will be used for schema representation and to describe the specific features of XML Schema that can be used to support the indexing model. Within this paper, it will be demonstrated that the model can be easily leveraged to be MPEG-7 compliant.

The second contribution is a set of novel queries based on XQuery 1.0 (an XML query language) to generate dynamic and user-oriented summaries and event structures. The support for complex queries will demonstrate the benefits of the proposed video model. The contribution primarily revolves around how we have utilized the power of XQuery to construct dynamic summaries, as opposed to storing them as static XML components. This approach reduces the number of nested components in the video model. For example, MPEG-7 stores summaries as navigation and summary components, whereas we show that summaries do not need to be stored. Moreover, any proposal of a new video model needs to show how the model can support queries. XQuery is still a new language and yet to be finalized; therefore, this paper will showcase its capabilities to support the proposed video model. Thus, it can serve as a scenario-based use case and tutorial for the emerging technologies of XML Schema and XQuery.

The proposed sports video system has been fully implemented and evaluated with real users for mobile video access. The evaluation is used to measure the performance of the system and to confirm the effectiveness of the presented content for users' purposes.

The remainder of this paper is structured as follows. Section 2 presents the architecture of the proposed sports video retrieval system. Section 3 discusses the previous work to describe the achievements and gaps in the current technologies that are needed for our system architecture. Section 4 describes the indexing scheme in detail, while Section 5 focuses on the utilization of XQuery to generate dynamic summaries. Section 6 presents results from the system evaluation (including users' feedback) before, finally, conclusions and future work are presented in Section 7.

2. ARCHITECTURE OF PROPOSED SPORTS VIDEO RETRIEVAL SYSTEM

The architecture of the sports video retrieval system comprises the typical components of a video database system (as depicted in Figure 1). User/application requirements determine the retrieval and browsing. The success of retrieval depends on the completeness and effectiveness of the indexes. Indexing techniques are determined by the extractable information through automatic or semi-automatic content extraction.
Since video contains rich and multidimensional information, it needs to be modeled and summarized to obtain the most compact and effective representation of the video data. Prior to video content analysis, the structure of a video sequence needs to be analyzed and separated into different layers such as shot and scene.

Since the architecture relies on W3C standards to ensure its extensibility, we have adopted a web-based architecture that consists of "light" clients using a web browser to interact with an application server. The architecture of our implementation is described in Figure 2. This model supports different kinds of client devices since the application server can adapt the user interface while generating and delivering web pages to the viewer. Because the metadata library is heavily solicited by this system, access to this metadata has to be handled by an XML database server (meta-content library server) that ensures reliable and efficient retrieval using XQuery. For that purpose, we have chosen to use eXist, a popular open-source solution. For delivering the content, a content delivery server is used to ensure quality of service. Since the application mainly accesses videos, a streaming-type server is needed.

3. PREVIOUS WORK

This section positions the study presented in this paper within the areas of video content extraction, indexing, and retrieval. The main purpose of the discussion is to review the current state of play (including our prior work) and emphasize our major contributions. Moreover, the process of content extraction confirms the feasibility of our indexed contents since manual interventions have been minimized.

Figure 1. High-level Architecture of the Video Retrieval System

Figure 2. The Architecture of the Implemented System

3.1. Previous Work on Automatic Content Extraction

The major benefit of a domain-specific video database is that users better understand what they can retrieve, such as specific events and objects. Moreover, feature extraction tools can be designed for a specific purpose, such as excitement rather than loud audio, playing-field instead of dominant color with typical lines, or player and ball instead of foreground/moving objects. The following sub-sections describe the process of sports video content extraction.

3.1.1. Segmentation. Instead of using traditional color-based shot segmentation [Hanjalic 2002], as shown in Figure 3, we have identified that broadcast soccer videos usually use transitions of typical shot types to emphasize story boundaries of the match, including global-view (GV), zoom-in view (ZV), and close-up view (CV) shots [Xie et al. 2002]. We can use the grass (or dominant color) ratio, which measures the amount of grass pixels in a frame, to classify the main shots in soccer video [Ekin et al. 2003, Xu et al. 1998]. Global shots contain the highest grass-ratio, while zoom-in shots contain less and close-up shots contain the lowest (or none). Thus, close-up shots generalize other shots, such as crowd, substitute, or coach close-ups which contain no grass. Analysis of camera-view transitions in a sports video has been used successfully for play-break segmentation [Ekin et al. 2003]. The start of a play scene can simply be marked as the first frame of a long global shot (e.g. > 5 sec) which can be interleaved by very short zoom-in or close-up shots (e.g. < 2 sec) [Ekin et al. 2003]. A play scene (P) typically describes an attacking action which could end because of a goal or other reasons that result in a break, such as a foul. Likewise, a break scene is started by either a long zoom-in shot or a zoom shot of medium length, which can be interleaved by short global shots. During a break scene (B), such as after a goal is scored, zoom-in and close-up shots are dominantly used to capture players' and supporters' celebration during the break. Subsequently, some slow-motion replay shots and artificial texts are usually inserted to add additional content to the goal highlight.
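As a concrete illustration of these rules, the following minimal sketch classifies pre-extracted shots by grass-ratio and marks candidate play starts. It assumes that per-shot grass-ratios and durations have already been extracted and stored as XML; the names shot, grassRatio, duration, the file name, and the thresholds are illustrative assumptions, not the paper's actual values.

    (: Minimal sketch: classify shots by grass-ratio and mark play starts.
       shot/@grassRatio, shot/@duration and the thresholds are assumptions. :)
    declare function local:view($shot as element(shot)) as xs:string {
      if (xs:double($shot/@grassRatio) ge 0.6) then "GV"      (: global view :)
      else if (xs:double($shot/@grassRatio) ge 0.2) then "ZV" (: zoom-in view :)
      else "CV"                                               (: close-up view :)
    };

    for $s in doc("shots.xml")//shot
    return
      <shot view="{local:view($s)}"
            startsPlay="{local:view($s) = 'GV' and xs:double($s/@duration) gt 5}">
        { $s/@* }
      </shot>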
3.1.2. Low-level to Mid-level or Domain-level Features. As a video document is composed of image sequences (frames) and audio tracks, video content analysis can utilize techniques from audio and image content analysis. Mid-level features are slightly more complex than low-level features, and they can also be used to detect sports features. For example, as shown in Figure 4, a skin-color locator (using a specific color histogram and shape characteristics) can be used to detect players' faces, allowing the system to zoom into a player's face, which can describe the actor(s) of an event. To use low-level features for extracting sports features, we need to apply some additional domain-knowledge interpretation. For example, to detect excitement, we observed the following changes in a sports audio track when an exciting event occurs: (1) the crowd's cheer and the commentator's speech become louder, (2) the commentator's voice has a slight (or significant) raise of pitch, and (3) the commentator's talk becomes more rapid and has fewer pauses. Based on this concept, the essence of the excitement detection algorithm is the use of three main features: lower pause-rate, higher pitch-rate, and louder volume, using dynamic thresholds [Tjondronegoro et al. 2003].

Figure 3. Example of Main Camera-based Views in Soccer, AFL, and Basketball (Global, Zoom-In, Close-up)

Some audio and visual features can be directly used to annotate some semantics. An example of the audio features is speech recognition [Meng et al. 2001, Ponceleon et al. 1998], which detects specific features from a spectrogram. An example of the visual features is optical character recognition (OCR), which uses specific features from the typical patterns of strong edges [Chairsorn et al. 2002, Sato et al. 1998]. However, current solutions for these techniques are still complex, time consuming, need comprehensive training, and are not yet fully reliable.

Figure 4. Example of Face Detection Results: (a)-(c) original images and labelled face objects

3.1.3. Generic and Domain-specific Semantics. Highlights are generically constructed by gathering the interesting events that may capture users' attention. Most sports broadcasters distinguish them by inserting editing effects such as slow-motion replay scene(s) and artificial text displays. For most sports, highlights can be detected based on generalized knowledge of the common features shared by most sports videos. For example, interesting events in sports videos are generally detectable using generic domain-level features such as whistle, excited crowd/commentator, replay scene, and text display [Tjondronegoro et al. 2004]. Other features such as slow-motion replay [Assfalg et al. 2002, Babaguchi et al. 2000], play position (midfield, goal-area) [Assfalg et al. 2003, Babaguchi et al. 2003], and tempo (fast and slow play) [Adams et al. 2002] have also been widely used to detect exciting events.
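Once such generic features are indexed, their co-occurrence can be queried directly. The minimal sketch below, written against the segment collection introduced later in Section 4.4.1, flags a play-break sequence as a candidate highlight when it temporally contains both a replay scene and an excitement segment; the element names replayScene and excitement, the helper function, and the file name are illustrative assumptions.

    (: Minimal sketch: mark play-break sequences containing both a replay
       and an excitement segment as candidate generic highlights. :)
    declare function local:timetoint($t as xs:string) as xs:integer {
      (: converts an "mm:ss.ff" time stamp to seconds (simplified) :)
      xs:integer(substring-before($t, ":")) * 60
      + xs:integer(substring-before(substring-after($t, ":"), "."))
    };
    declare function local:within($seg as element(), $pb as element()) as xs:boolean {
      local:timetoint($seg/mediaLocation/frameStart) ge local:timetoint($pb/mediaLocation/frameStart)
        and local:timetoint($seg/mediaLocation/frameEnd) le local:timetoint($pb/mediaLocation/frameEnd)
    };

    for $pb in doc("library.xml")//segmentCollection/pbSequence
    where exists(doc("library.xml")//segmentCollection/replayScene[local:within(., $pb)])
      and exists(doc("library.xml")//segmentCollection/excitement[local:within(., $pb)])
    return <candidateHighlight>{ $pb/segmentId }</candidateHighlight>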
While generic highlights are good for casual video skimming, domain-specific (or classified) highlights support more useful browsing and query applications. For example, users may prefer to watch only the soccer goals. To achieve this, it has become a well-known theory that the high-level semantics in sports videos can be detected based on the occurrences of specific audio and visual features which can be extracted automatically. The pattern of occurrence can be captured by manual heuristics [Babaguchi et al. 2002, Nepal et al. 2001] or automatic machine learning [Wu et al. 2002]. Another alternative is object-motion-based analysis, which offers a higher level of analysis but requires expensive computations. For example, the definition of a goal in soccer is when the ball passes the goal line inside the goal-mouth. While object-based features such as ball- and player-tracking are capable of detecting these semantics, specific features like slow-motion replay, excitement, and text display should be able to detect the goal event more efficiently, or at least help in narrowing down the scope of the analysis.

3.1.4. Further-tactical Semantics, Customized Semantics, and Annotations. The further-tactical semantic layer can be detected by focusing on further-specific features and characteristics of a particular sport. Thus, tactical semantics need to be separated from specific semantics due to their requirement for more complex and less generic audio-visual features. For example, corner kicks and free kicks in soccer need to be detected using specific analysis rules for the soccer field and player-ball motion interpretation [Han et al. 2003]. Playing-field analysis and motion interpretation are less generic than excitement and break-shot detection. For instance, the excitement detection algorithm can be shared among different sports by adjusting the values of some parameters or thresholds. However, an algorithm that detects corners in soccer will not be applicable to tennis due to the large difference between soccer and tennis fields.

The customized semantic layer can be formed by selecting particular semantics using user or application usage logs and preferences [Babaguchi et al. 2003]. For example, a sports summary can be structured by integrating highlights and play-breaks using some text-alternative annotations [Tjondronegoro et al. 2004]. In Section 5, we cover more customized semantics as a means of reducing the need for users to enter complex queries, which would also require the user to understand the query syntax and data schema.

After all features and semantics are extracted from the video using automatic algorithms, a sports video database system will usually still require some amount of manual annotations and metadata, such as the sports venue and the details and appearance of teams and players. Fully automatic content extraction and annotation is yet to be possible. Moreover, annotation techniques using textual analysis of 'super-imposed' texts [Sato et al. 1998] and closed captions [Babaguchi et al. 2002] cannot always be 100% accurate. In fact, closed captions do not always accompany every sports video, and super-imposed texts can be occluded by noise and low resolution, while their appearance can be random (i.e. not necessarily during or near the event itself). For the proposed video database system, we developed user-friendly interfaces to enable efficient data annotation that complements the automatic content extraction process.
3.2. Previous Work on Video Indexing

An appropriate model for video metadata is important for the provision of adequate retrieval support. Object-Oriented (OO) modeling is recognized for its ability to support complex data definitions [Blaha et al. 1998]. We have identified two main alternatives when using OO for modelling, namely schema-based [Adali et al. 1996] and schema-less [Oomoto et al. 1997]. The main benefit of using a schema-based model is its capability to support easy insertions and deletions on a video database. This is due to the strict components that have to be followed exactly for each entity. However, schema-based models have limitations and present difficulties during data retrieval because users must know the class or attribute structures before they can retrieve the desired objects. Another major limitation is the difficulty of including new descriptions during the instantiation of video models due to the static schema; therefore, the model is not flexible. In contrast, schema-less modelling is designed based on the fact that each video interval (i.e. frame sequence) can be regarded as a video object, in which the attributes can be (semantic) objects, events, or other video objects. The content of a video object is more flexible because it allows dynamic calculation of inheritance, overlap, merge, and projection of intervals to satisfy user queries. However, two major problems are created by schema-less modelling. The first significant problem is query difficulty: because each object has its own attribute structure, users/developers must inspect the attribute definition of each object in order to develop a query. The second problem is the total dependency on users or applications to supervise the instantiation of video objects, caused by the absence of a schema.

3.2.1. Segment-based Indexing. During the process of indexing texts, a document is divided into smaller components such as sections, paragraphs, sentences, phrases, words, letters, and numerals. Based on such divisions, indices can be built upon these components [Zhang 1999]. Using a similar approach, video can also be decomposed into segment hierarchies that are similar to the storyboards used in filmmaking [Ponceleon et al. 1998, Zhang 1999]. Researchers have commonly indexed video shots, which are the video segments that group sequential frames (usually short) with similar characteristics [Heng et al. 2002, Oh et al. 2000].

3.2.2. Object-based Indexing. Object-based indexing is achieved by attaching video segments to the semantic objects (i.e. the actors). Thus, it needs to distinguish particular objects throughout a video sequence in an attempt to capture content changes. In particular, a video scene is defined as a complex collection of objects, the location and physical attributes of each object, and the relationships between them. The object extraction process for video uses the fact that an object's region usually moves as a whole within a sequence of video frames [Dimitrova et al. 2001, Lu 1999].

3.2.3. Event-based Indexing. Event-based indexing is potentially the most suitable indexing technique for sports videos since sports highlights on TV, in magazines, or on the internet are commonly described using a set of events, particularly the important or exciting events [Zelnik-Manor et al. 2001].
The main benefits of event-based indexing are: (1) a sports match can be naturally decomposed into specific events; (2) sports viewers remember and recall a sports match based on the events; (3) events can serve as an effective bridge between low-level and high-level features in sports videos, since event occurrences can be predicted automatically using specific domain knowledge. Compared to these existing approaches, our indexing approach puts more emphasis on object-relational and semi-schema modeling techniques, which will be described in Section 4. The proposed indexing is segment- and event-based, and these can be linked to semantic objects (such as players, teams, and stadiums).

3.3. Previous Work on Video Retrieval

For retrieval purposes, play-breaks and highlights have been widely accepted as the semantically meaningful segments for sports videos [Ekin et al. 2003, Rui et al. 2000, Xu et al. 1998, Yu 2003]. A play is when the game is flowing, such as when the ball is being played in soccer and basketball. A break is when the game is stopped or paused for specific reasons. Play-based summaries are most effective for browsing purposes as most highlights are contained within plays. However, break sequences should still be retained. They are just as important as plays, especially if they contain highlights which can be useful for certain users and applications. For example, a player preparing for a direct free kick or penalty kick in soccer videos shows the strategy and positioning of the offensive and defensive teams. A break can also contain slow-motion replays and full-screen texts which are usually inserted by the broadcaster when the game becomes less intense or at the end of a playing period. However, play-break segments are not necessarily short enough for users to keep watching until they find interesting events. For example, a match may contain only a few breaks due to the rarity of goals, fouls, or the ball going out of play. In this case, play segments can become too long to serve as a summary. Play-breaks also cannot support users who need a precise highlight summary. In particular, sports fans often need to browse or search for a particular highlight in which their favorite team and/or players appear. Based on these reasons, we have demonstrated the importance of integrating highlights into their corresponding plays or breaks to construct a more complete sports video summary [Tjondronegoro et al. 2004].

4. INDEXING SCHEME

We designed a sports indexing model using two main abstraction classes, segment and event, which can be linked with re-usable semantic objects. A segment can be instantiated as a video, audio, or visual segment, which is extracted from a raw video track when mid-level features can be detected. An event can be instantiated into generic, domain-specific, or further-tactical semantics. Events and segments were chosen because they can provide an effective description for most sports games. For example, most users will agree that a goal is the most celebrated and exciting event during a soccer game. Segments can be used as text-alternative annotations to describe a goal, as shown in Figure 5. The final near-goal segment in a play-break sequence which contains a goal describes how the goal was scored. Face and text displays are used to inform the viewer of who scored the goal (i.e. the actor of the event) and the updated score. The replay scene shows the goal from different angles (to further emphasize how the goal was scored). In most cases, when the replay scene is associated with excitement, the content is considered to be more important. Excitement during the last play shot in a goal is usually associated with descriptive narration. In fact, we (as humans) can often hear a goal without actually seeing it.

Figure 5. Goal Event with Segment-based Annotations: (a) last near-goal segment; (b) face and inserted texts; (c) slow-motion replay scene (highlighted portion associated with excitement); (d) excitement during play (highlighted portion associated with the last near-goal before the goal).
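To make this concrete, such a goal and its annotating segments could be recorded with the relation collections described later in Section 4.4.1. The sketch below is illustrative only; the segment ids and the comments are assumptions, not taken from the paper's data.

    <syntacticRelation>
      <relationCategory>composedOf</relationCategory>
      <sourceSegmentId>E1</sourceSegmentId>             <!-- the goal event -->
      <destinationSegmentId>S7</destinationSegmentId>   <!-- last near-goal play -->
      <destinationSegmentId>S8</destinationSegmentId>   <!-- face and inserted texts -->
      <destinationSegmentId>S9</destinationSegmentId>   <!-- slow-motion replay -->
      <destinationSegmentId>S10</destinationSegmentId>  <!-- excitement (audio) -->
    </syntacticRelation>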
4.1. Using ORA-SS to Design the XML Video Model

We have utilized several benefits of XML to store and index the information extracted from sports videos. XML is scalable; it allows for additional information without affecting the rest [Ngai et al. 2002]. This is important for supporting gradual developments of feature extraction techniques that allow incremental additions to the extractable segments. XML is internally descriptive and can be displayed in various ways. This is important for users who browse the XML data directly; in addition, search results can be returned as XML. XML fully supports semi-structured aspects, which match video database characteristics: an object can be described using attributes (properties) and other objects, such as nested objects and heterogeneous elements, and object instantiations from the same class may not have the same number of attributes because not all attributes are compulsory. XML supports two types of relationships, namely nesting and referencing. However, to reduce redundancy, we have used referencing instead of nested object classes [Dobbie et al. 2000].

We have used XML Schema to construct a video schema as it has replaced DTD as the most descriptive language. Due to its expressive power, XML Schema has also been used as the basis for the MPEG-7 Data Definition Language (DDL) and the XQuery data model. We should therefore be able to easily leverage our proposed model to support MPEG-7 standard multimedia descriptions and XQuery implementation. For a more compact representation of XML Schema, we will demonstrate the use of ORA-SS (Object-Relationship-Attribute notation for Semi-Structured data) to design the video model, as shown in Figures 6-8. ORA-SS notation is chosen for its ability to represent most of XML Schema's features. Appendix 1 provides a summary of ORA-SS notation. This diagram extends the original ORA-SS notation [Dobbie et al. 2000] by demonstrating a more complex sample which integrates the inheritance diagram with the schema diagram. Moreover, we have introduced two additional notations: 1) italics for an abstract object, 2) for a repeated object in the diagram (to avoid crossing lines).

We have employed a bottom-up approach (i.e. from Figure 8 to Figure 6) to describe the proposed video model. Segment is the basic semantic unit in a sports video; it describes the abstraction levels (such as mid-level features and generic semantics), the media location, and the media description. Each segment is instantiated, with a unique segment Id as its key, into a video, visual, or audio segment.
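As a sketch of how the Segment hierarchy of Figure 7 could be rendered in XML Schema (the type names follow the diagram and the fragments in Section 4.3, but the details, including MediaLocationType, are illustrative assumptions rather than the paper's full schema):

    <complexType name="SegmentType" abstract="true">
      <sequence>
        <element name="segmentId" type="string"/>
        <element name="mediaLocation" type="vid:MediaLocationType" minOccurs="0"/>
        <element name="mediaDescription" type="vid:MediaDescription" minOccurs="0"/>
      </sequence>
    </complexType>

    <complexType name="VideoSegmentType">
      <complexContent>
        <extension base="vid:SegmentType">
          <sequence>
            <element name="videoSegExtractionAlgorithm" type="string" minOccurs="0"/>
          </sequence>
        </extension>
      </complexContent>
    </complexType>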
A sport video (SV) is a type of video segment which consists of SV components, an overall summary, and a hierarchical summary. The SV components are composed of: a) a segment collection, which stores a flat list of the visual and audio segments that can be extracted from the sports video; b) a syntactic relation collection, which stores all the syntactic relations, such as 'composed of' and 'starts after', between one source segment and one or more destination segments; and c) a semantic relation collection, which records all the semantic relations, such as 'is actor of' and 'appears in', between one source segment or semantic object and one or more destination segments or semantic objects.

The overall summary describes the sports video game as a whole; it includes where (stadium), when (date, time), the teams that compete, the final result, and match statistics. Match statistics can be stored as XML tags or as a visual frame (e.g. text displays that depict the number of goals, shots, fouls, red/yellow cards, and counter attacks in a soccer game). The hierarchical summary is composed of a comprehensive summary and a highlights summary. The comprehensive summary describes a sports video in terms of play-break sequences, which exist as the main story decomposition unit in most sports videos. For example, an attacking attempt during a play is stopped when there is a goal or foul. Each play-break can contain zero or one (key) event and can be decomposed into one or more play and break shots. A more complete description of play-break-event-based summarization has been given previously, and the first query (Q1) is described in Section 5. On the other hand, the highlights summary organizes highlight events into common summary themes such as soccer goals and basketball free-throws.

Using the proposed video model, we have demonstrated a sports video indexing scheme that supports: (1) Scalable video indexes that allow gradual extraction of segments and events (such as the introduction of new techniques) without affecting those that pre-exist. For example, we can incrementally introduce additional segments and events without affecting the hierarchical summary. Similarly, more semantic objects, such as stadiums and referees, can be introduced at a later stage when many sports videos share the same stadium and referee. (2) An Object-Relationship modeling scheme. In particular, we have demonstrated that inheritance and referencing are important features in video database modeling. (3) A semi-schema-based modeling scheme. As shown in Figure 7, we allow users (or applications) to add ANY additional elements (or attributes) into a segment description as long as the element has been declared somewhere else in the proposed schema, or within other schemas that are within a particular scope; a concrete instance sketch follows below. In fact, we may attach ANY additional elements to other elements in our data model to allow more flexibility. This stems from the fact that users often know best what they want to describe, and they should therefore be able to add components without the need to check the pre-defined schema. However, we aim to gradually modify the schema with new components, especially when the extra information provided by users can be used to enrich the current video model.
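For instance, under the MediaDescription type shown later in Section 4.3.4, an annotator could attach an element that the schema does not declare, and the <any> wildcard keeps the instance valid. The crowdDensity element below is a hypothetical example:

    <mediaDescription>
      <author>dian</author>
      <creationDate>2005-08-14</creationDate>
      <!-- user-defined element permitted by the <any> wildcard;
           "crowdDensity" is a hypothetical example -->
      <crowdDensity>high</crowdDensity>
    </mediaDescription>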
4.2. Sports Categorization

The proposed indexing scheme is robust for all types of sports since all segments and events are instantiated at the same hierarchical level, while semantic objects are recorded as a separate entity, independent of any game. Most audio-visual segments, such as whistle, text display, and excitement, can be instantiated in various sports. For a sports genre, the temporal and logical hierarchy structures can be constructed according to its sports category, which defines the play-break structure and the typical events. Based on the temporal and event structures, we identified four sports categories, namely period-, set-point-, time-, and performance-based. Table I presents a list of sports that belong to the different categories to demonstrate that almost all sports genres can be classified into the proposed categories.

Figure 6. ORA-SS Schema for Sport Video Database (Overall) – see Appendix 2 for the full-sized diagram.

Figure 7. ORA-SS Schema for Sport Video Database (Segments).

Figure 8. ORA-SS Schema for Sport Video Database (Syntactic-Semantic Relations).

Period-based sports are typically structured into playing periods, such as a 'half' in soccer or a 'round' in boxing. We can predict the number of playing periods for each class of sports. For example, soccer usually has two 45-minute (normal) playing periods. For this sports category, winners are usually decided based on the final score at the end of the playing periods; thus the playing periods will be stopped regardless of the score. The typical events during period-based sports are: goal, foul, and good offensive/defensive play.

Set-point-based sports such as tennis and volleyball are composed of sets which end each time a player reaches a certain score. For example, a set in tennis ends as soon as a player reaches six games with a sufficient advantage, and the winner of the match is usually decided when a player has won the majority of sets. Examples of the typical events are: set- and match-point, long rally, and service ace.

Time-based sports usually involve racing and are structured around laps.
Unlike period-based sports, which are usually broadcast as individual matches, this sports class is mostly broadcast as a portion of a championship or competition. For example, in day eight of the Australian Swimming National Championship live broadcast, viewers are presented with multiple race events, such as "men's semi-final freestyle 50m". Each race can be decomposed into one or more laps, each of which is equivalent to a play. The number of laps in each race can be predicted. For example, a 200m swimming race is usually composed of 4 laps. Predicting the number of laps in a race can help during highlight detection as most exciting events, such as overtaking the lead and breaking a record, happen during the last lap. For this type of sports, the winner is the player who takes the minimum time to complete the race. Typical events in this category are: record breaking, and the start and end of a race.

Table I. Sports Categorization Based on Temporal Structure Similarity
Period-based: Soccer (football), basketball, rugby, Australian/American football, hockey, boxing, water polo, handball
Set-point-based: Tennis, table tennis, badminton, volleyball, beach volleyball, baseball, cricket, bowling, fencing, softball, tae-kwon-do, judo, wrestling
Time-based: Swimming, athletics (e.g. running, marathon), car/motor/horse/bike racing, track cycling, sailing, triathlon, canoe- and kayak- flatwater/slalom racing, rowing
Performance-based: Rhythmic gymnastics, track and field (e.g. long/high jump, javelin throw), weight lifting, diving, shooting, synchronized swimming, equestrian

Performance-based sports have a temporal structure which is similar to that of time-based sports. For example, in day 21 of the Olympics' gymnastics, viewers will see different competitions such as men's and women's acrobatic, artistic, or rhythmic semi-finals. Each competition will have one or more performances by each competitor. One performance is equal to a play. We can consider each performance as a key event because there are many breaks between performances, such as players waiting for the results. The winner of this type of sport is determined by the highest average points awarded by judges. Some typical events in this category are: good/bad performance and winning moves.

4.3. Utilizing XML Schema to Construct the Video Model

This section discusses how XML Schema is used to construct the proposed video model.

4.3.1. Scalable Modeling Scheme. A schema can be easily extended by introducing more elements. Similarly, more types can be defined to refine the schema rather than using generic data types. In other words, the schema does not need to be complete at the beginning. For example, matchStatistics can first be declared loosely as a string and later refined into a dedicated type:

    <complexType name="OverallSummary">
      ...
      <element name="matchStatistics" type="string"/>
      ...
    </complexType>

    <complexType name="MatchStatisticsType">
      <sequence>
        <element name="numOfGoals" type="integer" minOccurs="0" maxOccurs="1"/>
        <element name="numOfFouls" type="integer" minOccurs="0" maxOccurs="1"/>
        <element name="numOfShots" type="integer" minOccurs="0" maxOccurs="1"/>
      </sequence>
    </complexType>

    <complexType name="OverallSummary">
      ...
      <element name="matchStatistics" type="vid:MatchStatisticsType"/>
      ...
    </complexType>

4.3.2. Object-Oriented Modeling Scheme. An object/class can be defined as an element or complexType, which can be inherited using extension or restriction. A class can be defined as abstract and substituted with a concrete class. Substitution groups are used to maximize extensibility.
For example, a soccer video contains a PeriodBasedDomainEventCollection, which includes zero or more PeriodBasedDomainEvent elements. PeriodBasedDomainEvent is an abstract class which has to be instantiated by an actual event such as Goal.

    <complexType name="PeriodBasedDomainEventCollectionType">
      <sequence>
        <element ref="vid:PeriodBasedDomainEvent" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
    </complexType>

    <element name="PeriodBasedDomainEvent" abstract="true" type="vid:PeriodBasedDomainEventType"/>

    <complexType name="PeriodBasedDomainEventType" abstract="true">
      <complexContent>
        <extension base="vid:VideoSegmentType">
          <sequence>
            <element name="teamBenefitedId" type="vid:SemObjIdType" minOccurs="0"/>
            <element name="playerBenefitedId" type="vid:SemObjIdType" minOccurs="0"/>
          </sequence>
        </extension>
      </complexContent>
    </complexType>

    <element name="Goal" substitutionGroup="vid:PeriodBasedDomainEvent" type="vid:GoalType"/>

4.3.3. Relational Modeling Scheme. Referential integrity can be achieved by introducing a primary key (PK), which is defined using key, and a foreign key (FK), which is enforced using keyref.

    <key name="sequencePK">
      <selector xpath="vid:sportVideos/vid:sportVideo/vid:sportVideoComponent/vid:segmentCollection/vid:pbSequence"/>
      <field xpath="vid:segmentId"/>
    </key>

    <keyref name="sequenceInCompSummaryRef" refer="vid:sequencePK">
      <selector xpath="vid:sportVideos/vid:sportVideo/vid:hierarchicalSummary/vid:comprehensiveSummary/vid:pbSequence"/>
      <field xpath="@pbId"/>
    </keyref>

4.3.4. Schema-less Elements. Instances of the media description can include more elements than the schema declares (while maintaining a well-formed structure) to avoid static and fixed elements. The main benefit is that developers may know a lot, but not everything that the user knows. Therefore, a class can simultaneously be both schema-based and schema-less. XQuery can easily produce the undeclared element names by matching the element names between the schema and the instance to find the elements which have not been defined by the schema.

    <complexType name="MediaDescription">
      <sequence>
        <element name="author" type="string"/>
        <element name="creationDate" type="date"/>
        <element name="lastUpdate" type="date" minOccurs="0"/>
        <any minOccurs="0"/>
      </sequence>
    </complexType>

4.3.5. Semi-structured Modeling Scheme. Not all elements need to be instantiated at one time (video indexing requires gradual extraction due to its complexity and processing time), without creating problems for the tree structure.

    <complexType name="SportVideoComponent">
      <sequence>
        <element ref="vid:segmentCollection" minOccurs="0" maxOccurs="1"/>
        <element ref="vid:semanticObjectCollection" minOccurs="0" maxOccurs="1"/>
        <element ref="vid:syntacticRelationCollection" minOccurs="0" maxOccurs="1"/>
        <element ref="vid:semanticRelationCollection" minOccurs="0" maxOccurs="1"/>
      </sequence>
    </complexType>
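A minimal sketch of the capability noted in Section 4.3.4, listing instance elements that the schema does not declare. The file names and the instance element name mediaDescription are illustrative assumptions, and only globally declared schema elements are considered:

    (: list elements used in media descriptions but not declared in the schema :)
    let $declared := doc("sportVideo.xsd")//*:element/@name
    for $extra in doc("library.xml")//mediaDescription/*[not(local-name(.) = $declared)]
    return <undeclaredElement>{ local-name($extra) }</undeclaredElement>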
4.4. Using XML to Store Video Indexes and Capture Users' Preferences

In this section, we demonstrate the use of XML to construct a video database and user preferences.

4.4.1. Sport Video Database

    <sportVideoLibrary>
      <semanticObjectCollection>
        <team>
          <teamId>Tm1</teamId>
          <teamType>clubTeam</teamType>
          <teamShortName>Madrid</teamShortName>
          <teamFullName>Real Madrid</teamFullName>
        </team>
        ... (other club and national teams appearing in the library)
        <player>
          <playerId>Pl1</playerId>
          <playerShortName>Ronaldo</playerShortName>
          <playerFullName>Luis Nazario de Lima</playerFullName>
          <clubTeam>Tm1</clubTeam>
          <position>forward</position>
        </player>
        ... (other players appearing in the library)
      </semanticObjectCollection>
      <PeriodBasedVideo eventName="soccer">
        <segmentId>M1</segmentId>
        <mediaLocation>
          <filePath>D:\mpeg7fromdian\Bra_Ger2-2.mpg</filePath>
          <fileName>Bra_Ger2-2</fileName>
          <fileExtension>mpg</fileExtension>
          <frameStart>00:00.00</frameStart>
          <frameEnd>21:44.00</frameEnd>
        </mediaLocation>
        ... (media description can be added here)
        <name>Soccer:worldCup_Bra_Ger</name>
        <sportVideoComponent>
          <segmentCollection>
            <pbSequence>
              <segmentId>S1</segmentId>
              <mediaLocation>
                <frameStart>00:02.00</frameStart>
                <frameEnd>00:17.00</frameEnd>
              </mediaLocation>
            </pbSequence>
            ... (other segments, such as play, break, face frame, text frame, replay scene)
          </segmentCollection>
          <syntacticRelationCollection>
            <syntacticRelation>
              <relationCategory>composedOf</relationCategory>
              <sourceSegmentId>M1</sourceSegmentId>
              <destinationSegmentId>S1</destinationSegmentId>
              <destinationSegmentId>S2</destinationSegmentId>
              <destinationSegmentId>S3</destinationSegmentId>
              <destinationSegmentId>S4</destinationSegmentId>
              <destinationSegmentId>S5</destinationSegmentId>
              <destinationSegmentId>S6</destinationSegmentId>
              ...
            </syntacticRelation>
            ... (other syntactic relations)
          </syntacticRelationCollection>
          <semanticRelationCollection>
            <semanticRelation>
              <relationCategory>appearsIn</relationCategory>
              <sourceSemObjId>Pl19</sourceSemObjId>
              <destinationSegmentId>E3</destinationSegmentId>
            </semanticRelation>
            ... (other semantic relations)
          </semanticRelationCollection>
        </sportVideoComponent>
        <overallSummary>
          <whatAction>soccer-international-worldcup</whatAction>
          <team>TM2</team>
          <team>TM1</team>
          <where>Unknown</where>
          <when>Unknown</when>
          <matchStatistics>
            <numOfGoals>0</numOfGoals>
            <numOfFouls>2</numOfFouls>
            ... (other match statistics, such as num of shots)
          </matchStatistics>
        </overallSummary>
        <hierarchicalSummary>
          <comprehensiveSummary>
            <pbSequence pbId="S1">
              <highlightEvent eventId="E1"/>
              <break breakId="B1"/>
              <break breakId="B2"/>
              <break breakId="B3"/>
            </pbSequence>
            ... (other play-break sequences)
          </comprehensiveSummary>
          <highlightSummary>
            <summaryThemeList>
              <summaryTheme themeId="T0">
                <themeContent>soccer</themeContent>
              </summaryTheme>
              <summaryTheme themeId="T01" parentThemeId="T0">
                <themeContent>Foul</themeContent>
              </summaryTheme>
              ... (other summary themes)
            </summaryThemeList>
            <highlightCollectionList>
              <highlightCollection themeId="T01">
                <highlightEventId>E1</highlightEventId>
                ... (other highlight events within the same summary theme)
              </highlightCollection>
              ... (other highlight collections, such as soccer goals)
            </highlightCollectionList>
          </highlightSummary>
        </hierarchicalSummary>
      </PeriodBasedVideo>
      ... (other sports videos)
    </sportVideoLibrary>

It should be noted that the "hierarchical summary" is generated dynamically using the XQueries that will be described in Section 5.
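Because the instance above uses referencing rather than nesting, a query resolves ids explicitly. The following minimal sketch finds the segments or events in which a given player appears by joining the semantic relations with the player collection (the file name is an illustrative assumption):

    (: resolve 'appearsIn' semantic relations for one player :)
    let $lib := doc("library.xml")/sportVideoLibrary
    let $pl := $lib/semanticObjectCollection/player[playerShortName = "Ronaldo"]
    for $rel in $lib//semanticRelationCollection/semanticRelation
                [relationCategory = "appearsIn"][sourceSemObjId = $pl/playerId]
    return $lib//*[segmentId = $rel/destinationSegmentId]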
4.4.2. User Preference

    <userPreference>
      <user userName="dian">
        <sportGenres>
          <sportGenre priorityLevel="1" Name="soccer">
            <events>
              <event priorityLevel="1">Goal</event>
              <event priorityLevel="2">Shot</event>
              <event priorityLevel="3">Goodplay</event>
              <event priorityLevel="4">Foul</event>
              ... (other event preferences)
            </events>
            <players>
              <player priorityLevel="1">Pl11</player>
              <player priorityLevel="2">Pl8</player>
              <player priorityLevel="3">Pl10</player>
              ... (other player preferences)
            </players>
            <segmentPref>
              <segment priorityLevel="1">excitement</segment>
              <segment priorityLevel="2">playScene</segment>
              <segment priorityLevel="3">text</segment>
              <segment priorityLevel="4">breakScene</segment>
              ... (other segment preferences)
            </segmentPref>
          </sportGenre>
          ... (other sports genre preferences)
        </sportGenres>
        <outputPref>
          <maxOutputDuration>1000</maxOutputDuration>
          <maxWaitDuration>PT1H30M</maxWaitDuration>
        </outputPref>
      </user>
      ... (other users)
    </userPreference>

4.5. Accommodating MPEG-7 Standard Descriptions

The MPEG-7 standard [Chang et al. 2001, Manjunath et al. 2002, Pereira 2001] creates the possibility of having a standardized method of describing multimedia content. To support our video indexing framework, MPEG-7 provides tools that describe the structure and semantics of a single multimedia document. Content structure tools describe the segments, the hierarchical decompositions, and the structural relationships among the segments. They include segment entity, segment attribute, segment decomposition, and structural relation tools. Video media (an audiovisual segment) can be decomposed into audio and video (i.e. image sequence) segments and moving (object) regions. Content semantics tools describe semantic entities including object, event, concept, semantic-state, semantic-place, and semantic-time, as well as the semantic attributes and relations. For rapid navigation and browsing capabilities and flexible access, MPEG-7 provides hierarchical- and sequential-summary description schemes, as well as user views and variations. To describe user interaction, MPEG-7 provides user preference and usage history description schemes.

We have extended the current MPEG-7 description framework by: (1) emphasizing the use of referenced instead of nested elements; (2) allowing users to add any elements on top of the existing schema (i.e. semi-schema based); (3) introducing a comprehensive summary as an example of a customized hierarchical summary. Compared to nested elements, referencing can promote more scalable and flexible instantiations of the description: as the current automatic content extraction techniques are yet to be fully completed, we should allow multiple passes without the complexity of nested elements. Allowing 'any elements' gives users more independence in customizing their own descriptions, without the necessity of having to ask administrators to update the schema. Since users know about their own added elements, they can write queries to suit the customized elements. The benefits derived from the comprehensive summary have been discussed in Section 3.3. In addition to these three major extensions, we have also introduced multi-semantic layers to classify events into generic, specialized, and customized events to signify the fact that these semantic layers should be detected, indexed, and queried in different ways. For example, an exciting event (generic) in a soccer game cannot be directly associated with a goal scorer; it has to be classified into a soccer goal to suit the context.
In MPEG-7, a VideoSegment can contain: a SpatialDecomposition to MovingRegion; a TemporalDecomposition to VideoSegment and StillRegion; a SpatioTemporalDecomposition to Moving/Still Region; and a MediaSourceDecomposition to VideoSegment. Each of these segments can be identified and referenced by an ID, an XPath, or an HREF (any URI). Within any segment, we can specify a Relation (which identifies the source and target) with another segment or semantic entity. The MPEG-7 semantic entities AgentObjectType and EventType can be used to describe semantic objects, such as a soccer player, and events, such as a soccer goal. These entities can be entered and reused, while SemanticRelations can be graphically described by connecting the respective semantic entities. Unlike the MPEG-7 approach, we separate the segments (which are an example of object declarations) from the relationships with other segments and semantics. This keeps every element simple, atomic, self-contained, and independent, thereby promoting the extensibility of the model. Moreover, in our model, we have distinguished the segment collection from the collections of semantic objects, syntactic relations, and semantic relations so that users can easily browse the database.

The current MPEG-7 schema model allows referencing in HierarchicalSummary as it may contain a SourceID to identify the original content, a SourceLocator to locate the original content, and SourceInformation to reference elements of the description of the original content. We have expanded this referencing scheme by referencing the elements of each SummarySegment to the original segment. The main benefit of this approach is to retain the existing attributes and relationships between the original referenced segment and the (linked) segments and semantic objects. In this way, simpler queries can be written if we want to produce customized summaries based on the hierarchical summary, while showing all of the related segments and semantic objects. In MPEG-7, HierarchicalSummary can be used to implement sports video summarization such as highlights, synopses, and previews. Users can also generate their own summaries by bookmarking segments, while virtual programs can be constructed by combining segments from different programs. Unlike MPEG-7, we do not wish to store summaries; thus, we demonstrate in Sections 5 and 6 the use of XQuery to generate dynamic summaries. Previously, in Section 4.4.2, we also introduced a customized user preference with some suggested extensions to MPEG-7's UserPreference. For dynamic user-oriented summaries, we emphasize the need to capture users' preferences for summaries and duration of viewing, while our usage log captures users' viewing behaviours such as browsing and viewing. For example, we can generate automated summary preferences based on the most 'viewed' and 'replayed' segments.

MPEG-21 defines an open multimedia framework which is composed of two essential concepts, namely: (1) the definition of a fundamental unit of distribution and transaction, the Digital Item (DI), such as a video collection; and (2) the concept of users interacting with DIs. MPEG-21 can complement MPEG-7 in personalizing the selection and delivery of multimedia content for individual users because MPEG-21 Digital Item Adaptation (DIA) provides tools to support resource and descriptor adaptation and quality-of-service management [Tseng et al. 2004]. As the proposed indexing scheme is based on XML Schema, it can easily benefit from some of the existing standard frameworks.
For this purpose, the MPEG-21 DID (DI Declaration) can be used as a means of integrating different descriptive schemas into one descriptive framework [Kosch 2004]. As an example, since XML Schema was not designed specifically for AV data, certain MPEG-7 extensions have been added to describe array and matrix data types, and built-in primitive time data types. Arrays and matrices are suitable for describing sound and image content, while time is the unique dimension of video content. In general, we can always use standard MPEG-7 descriptors to replace the XML-based descriptors and make the indices more MPEG-7-compliant for when the standard becomes widely used. For our current database, we have chosen not to apply any MPEG-7 descriptions as they are strongly typed and schema-based. Users need to trace the pre-defined schemas and descriptors to become familiar with the descriptions, whereas our indexing framework emphasises the flexibility of schema-less models. However, as shown in the (truncated) example below, the proposed indexing framework can utilise MPEG-21 to combine MPEG-7 descriptors as a part of the "Any" elements.

    <?xml version="1.0" encoding="UTF-8"?>
    <DIDL xmlns="urn:mpeg:mpeg-21:2002:01-DIDL-NS" xmlns:mpeg7="urn:mpeg:mpeg7:schema:2001">
      <Container>
        <Descriptor>
          <Statement mimeType="text/plain">My Sports Video Collections</Statement>
        </Descriptor>
        <Item id="SportVideo1">
          <SoccerVideo>
            ...
            <OverallSummary>
              ...
              <mpeg7:Mpeg7>
                <mpeg7:DescriptionMetadata xsi:type="mpeg7:DescriptionMetadataType">
                  <mpeg7:Comment xsi:type="mpeg7:TextAnnotationType">
                    <mpeg7:KeywordAnnotation>Australian Soccer</mpeg7:KeywordAnnotation>
                  </mpeg7:Comment>
                  ...
                </mpeg7:DescriptionMetadata>
              </mpeg7:Mpeg7>
              ...

5. DYNAMIC SUMMARIES AND USER-ORIENTED QUERIES

Users would need to understand the query language's syntax and the data indexing schema before they could write queries for their retrieval requirements. A better alternative is to compress the video's long sequence into a more compact representation through a summarization process. Using the summaries, users can browse and navigate through the segments that they want to watch, as well as specify some parameters to refine the search. We can assist users in selecting what they want to watch by constructing 'query-generated' dynamic summaries, event structures, and customized summaries. Summaries should be constructed dynamically, based on extracted segments and features, instead of being stored as a static index. This allows gradual content extraction, as not all segments and semantic objects can be extracted or defined at one time. In this section, we demonstrate the use of XQuery to construct dynamic summaries and user-oriented queries. In total, there are 6 sample queries which have been fully tested. All queries have been designed to demonstrate the benefits of the proposed (segment-event based) video data model while focusing on potential user/application requirements.

XQuery 1.0 [Boag et al. 2004, Brundage 2004] has been actively developed by the W3C (World Wide Web Consortium) XML Query and XSL Working Groups since 1998. XQuery supports functionalities that are both decades old and brand new. XQuery provides some powerful features for XML-based video retrieval, which will be demonstrated alongside examples in this section:

Composition: constructing temporary XML results in the middle of a query and being able to navigate into them.
5. DYNAMIC SUMMARIES AND USER-ORIENTED QUERIES

Users would need to understand the syntax of the query language and the data indexing schema before they could write queries for their retrieval requirements. A better alternative is to compress the video's long sequence into a more compact representation through a summarization process. Using the summaries, users can browse and navigate through the segments that they want to watch, as well as specify some parameters to refine the search. We can assist users in selecting what they want to watch by constructing query-generated dynamic summaries, event structures, and customized summaries. Summaries should be constructed dynamically, based on the extracted segments and features, instead of being stored as a static index. This allows gradual content extraction, as not all segments and semantic objects can be extracted or defined at one time. In this section, we demonstrate the use of XQuery to construct dynamic summaries and user-oriented queries. In total, there are six sample queries, all of which have been fully tested. All queries have been designed to demonstrate the benefits of the proposed (segment-event based) video data model while focusing on potential user/application requirements.

XQuery 1.0 [Boag et al. 2004; Brundage 2004] has been actively developed by the W3C (World Wide Web Consortium) XML Query and XSL Working Groups since 1998, and it supports functionalities that are both decades old and brand new. XQuery provides several powerful features for XML-based video retrieval, which will be demonstrated alongside the examples in this section:

Composition: constructing temporary XML results in the middle of a query and navigating into them. This is particularly useful for constructing dynamic summaries and views for browsing.

Procedural approach: constructing user-defined functions (or modules), including recursive ones, as well as using built-in functions. This allows queries to be designed in a style similar to procedural programming. Writing functions also improves readability (repetitive query portions are written as a function), share-ability (user-defined functions can be reused by other users/applications), and extensibility (a partial or whole query can be encapsulated into a function and then used to build more complex queries).

Tree and sequence: constructing queries that retain the original tree (hierarchy) structure, as well as queries that take into account the order (sequence) of appearance of elements. This is particularly important for supporting video hierarchy views and the temporal ordering of video segments.

This section mainly aims to demonstrate the benefits of our model in terms of generating user-oriented dynamic summaries. Most of the queries are dynamically-calculated summaries (or views) that users do not have to write themselves; Q5 and Q6, for example, require very little user interaction. The first two XQuery features are illustrated by the short sketch below.
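The following is a minimal, self-contained sketch of composition and of a user-defined function (the local:trackTitle function and the <toc> wrapper are invented for illustration only): it builds a temporary <toc> element in the middle of the query, navigates back into it, and factors the node label out into a function.

declare function local:trackTitle($pb as element()) as xs:string
{
  concat("Track ", string($pb/@pbId))
};
let $toc :=
  <toc> {
    for $pb in /sportVideoLibrary//comprehensiveSummary/pbSequence
    return <track title="{local:trackTitle($pb)}"/>
  } </toc>
return $toc/track[position() <= 5]   (: navigate into the constructed element :)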
5.1. Summary 1 (Q1): Comprehensive Summary. Build the comprehensive summary of a particular sport match with all the details. The graphical interface of the browser is depicted in Figure 9.

We utilize XQuery to automatically generate (and store) the structure of a video's comprehensive summary. The comprehensive summary encapsulates the relationships between different segments. A play-break (PB) sequence may contain several play and break segments. The query automatically collects all segments within each PB sequence and presents them in hierarchical order. By modifying this query, we can also generate other hierarchical structures, such as a highlight summary. Since the generated comprehensive summary comprises objects which are referenced from the segment collections, we can obtain all the segments' details and construct a summary which supports useful browsing. Using the browsing structure, users can peruse a sport video by PB sequences (like CD audio tracks). For each track, a user can check whether it contains a highlight event. Users can watch the entire track, or its play and break shots, or watch the event as a shorter alternative.

Figure 9. (From left to right) Screenshots of Q1 and Q2

5.1.1. Algorithm 1: obtain all details of the comprehensive summary
Declare variable $filterMatchId in order to filter out unwanted matches (1.1)
Find the specified match in the database; return all matches if the user did not specify a matchId (1.2)
Loop over all PB sequences in the comprehensive summary (CS) of the match
  Based on the CS, get the SegmentID of all events and segments contained within the PB (1.3)
  Display the elements of the sequence (1.4)
  Loop over all events in the sequence, by matching SegmentID (1.5)
    Display the file name of the match
    Display the elements of the event, ordered by frame start time stamp
  Loop over all plays in the sequence, by matching SegmentID (1.6)
    Display the file name of the match
    Display the elements of the play, ordered by frame start time stamp
  … Apply a modified 1.5 for key frames and other segments within the play
  … Apply a modified 1.5 for breaks and other segments within the sequence, such as face segments

1.1 let $filterMatchId := request:request-parameter("filterMatchId", "")
1.2 let $vidLib := /sportVideoLibrary,
        $matches := $vidLib//*[ends-with(lower-case(name()), "video")],
        $matches := if ($filterMatchId != "") then $matches[segmentId = $filterMatchId] else $matches
1.3 for $PB in $vid//comprehensiveSummary/pbSequence
    let $EVENT := $PB/highlightEvent, $PLAY := $PB/play, $BREAK := $PB/break
1.4 for $sequence in $vid//segmentCollection/pbSequence[segmentId = $PB/@pbId]
    return $sequence/*
1.5 for $event in $vid//*[ends-with(lower-case(name()), "domaineventcollection")]//*[segmentId = $EVENT/@eventId]
    order by $event/mediaLocation/frameStart
    return <eventInPb>{$filename}{$event/*}</eventInPb>
1.6 for $play in $vid//segmentCollection/playScene[segmentId = $PLAY/@playId]
    order by $play/mediaLocation/frameStart
    return <playInPb>{$filename}{$play/*}</playInPb>

XQuery supports composition to combine the query results into a structure. As an example, the following shows the composition skeleton for Q1, in which each number stands for the corresponding fragment above, and $vid ranges over the matches selected in 1.2 (the enclosing for clause is elided):

<results> {
  1.1  1.2
  return
    <match id="{$vid/segmentId}"> {
      1.3
      return <pb> {1.4} {1.5} {1.6} </pb>
    } </match>
} </results>

5.2. Summary 2 (Q2): Event Summary. For each event in a particular sport match, produce an event summary (its details and its play and break ratios). The graphical interface of the browser is depicted in Figure 9.

As shown in Figure 10, segment-based statistics, such as the excitement ratio, can be used effectively to describe a highlight, as well as to compare two or more events, which may help viewers decide which events are more interesting. For example, by comparing the statistics of Goal 4 and Goal 10, users can predict that Goal 10 is potentially more interesting, since it contains more (audio) excitement and has a longer slow-motion replay. However, Goal 4 may have more exciting play in the goal area, due to the longer duration of goal-area views found during the play.

Figure 10. Statistics-based Annotation on Events

5.2.1. Algorithm 2
Declare variable $filterMatchId in order to filter out unwanted matches (1.1)
Loop over all events in the eventCollection within the video
  Calculate the event duration (2.1)
  Display the event details and the name of the player related to the event (2.2)
  Calculate the play ratio (2.3)
  … Apply a modified 2.3 to calculate the break ratio and other statistics, such as the excitement ratio

The queries below cover the portions of the algorithm that have not been described previously (refer to Q1 for 1.1):

2.1 for $event in $vid/*[ends-with(lower-case(name()), "eventcollection")]/*
    let $eStart := local:timetoint($event//frameStart),
        $eEnd := local:timetoint($event//frameEnd),
        $eventDuration := $eEnd - $eStart + 1
    order by $event//frameStart
2.2 for $player in $event/*[ends-with(lower-case(node-name(.)), 'playerid')]
    return element {replace(node-name($player), "Id", "ShortName")}
                   {local:idToPlayerShortName(data($player))}
2.3 for $play in $vid//playScene
    let $pStart := local:timetoint($play//frameStart),
        $pEnd := local:timetoint($play//frameEnd),
        $pDur := $pEnd - $pStart
    where ($pStart >= $eStart) and ($pEnd <= $eEnd)
    return if (count($play) >= 1)
           then (<playRatio>{100 * ($pDur cast as xs:float) div ($eventDuration cast as xs:float)}</playRatio>)
           else ()
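Fragments 2.1 and 2.2 rely on two helper functions, local:timetoint and local:idToPlayerShortName, whose bodies are not listed here. Purely as a sketch, and assuming that frameStart/frameEnd hold hh:mm:ss timestamps (they could equally hold frame numbers, in which case the conversion would differ), they could be written as:

declare function local:timetoint($t as xs:string) as xs:integer
{
  (: assumption: "00:21:05" is tokenized into (0, 21, 5) and converted to seconds :)
  let $p := tokenize($t, ":")
  return 3600 * xs:integer($p[1]) + 60 * xs:integer($p[2]) + xs:integer($p[3])
};
declare function local:idToPlayerShortName($id as xs:string) as xs:string
{
  (: look the player ID up in the semantic object collection :)
  string((/sportVideoLibrary//player[playerId = $id]/playerShortName)[1])
};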
5.3. Summary 3 (Q3): Customized (User-Preference Based) Summary. Construct a customized summary according to the user preference, processing the first preference, then the second preference, and so on, until the maximum output duration is reached. The GUI of the browser is depicted in Figure 11.

As shown in Section 4.2, users can choose their favorite types of events, players and segments via a stored user preference. This summarization is particularly useful when users want to specify the total duration of the summary while retaining full control over the segments, events and players that they will watch. This query can be supported by XQuery, and the resulting query illustrates how a complex query can be developed in XQuery. Due to this complexity, the following outlines the design rationale. First, obtain the user preference (i.e. the priority levels) for the types of summary, event, player, sport and segment. In the example discussed in Section 4.2, the user's first preference is soccer, and the user prefers to watch events in the order of soccer goal, soccer shot, soccer good play and soccer foul. Besides that, the user prefers to watch certain players' performances and specific segments, such as excitement and text. While the maximum output duration has not been reached, the system displays each item. The user's second preferred sport is basketball, to which a similar structure applies. Segments are displayed until the maximum output duration is reached.

5.3.1. Algorithm 3
Get the user preference for the specified user name
  * the example shows the query returning results according to the preferences of user 'dian'
Determine the start and end of the summaries and the initial time duration; find maxOutputDuration (3.1)
Loop from the start to the end of the preferred sportGenres (e.g. soccer, basketball) (3.2)
  Determine the start and end of each type of user preference: segment, player, event, and the maximum output duration
  Loop over all videos in the video database which match the preferred sportGenre
  Loop from the start to the end of the preferred segments (3.3)
    Loop over all segments in the video database
    Display the event details according to the preference order (3.4)
  … Apply modified 3.3 and 3.4 for the event summary and the player summary

Figure 11. Screenshots of Q3: a) Form to specify sports preference, b) Form to specify events preference in a particular sport, c) The returned summary

3.1 let $vidLib := /sportVideoLibrary
    let $pref := local:getUserPref($userName),
        $summStart := 1, $summEnd := count($pref//sportGenre),
        $total := 0,
        $outputPref := $pref/outputPref//maxOutputDuration cast as xs:integer
3.2 for $summLevel in $summStart to $summEnd
    let $summaryPref := $pref//sportGenre[@priorityLevel = $summLevel],
        $outputPref := $pref/outputPref//maxOutputDuration cast as xs:integer,
        $sportsGenre := $summaryPref/@Name,
        $playerStart := 1, $playerEnd := count($summaryPref//player),
        $segmentStart := 1, $segmentEnd := count($summaryPref//segment),
        $eventStart := 1, $eventEnd := count($summaryPref//event),
        $videos := $vidLib//*[ends-with(lower-case(name()), "video")][@eventName = $sportsGenre]
3.3 for $segmentLevel in $segmentStart to $segmentEnd
    let $segmentPref := $summaryPref//segment[@priorityLevel = $segmentLevel]
3.4 for $video in $vidLib//*[ends-with(lower-case(name()), "video")][@eventName = $sportsGenre]
    let $filename := $video/mediaLocation/filename,
        $segments := $video//*[name() = $segmentPref]
    where count($segments) > 0
    return <match id="{$video/segmentId}" filename="{$filename}"> {
      for $segment in $segments
      let $duration := local:timetoint($segment//frameEnd) - local:timetoint($segment//frameStart) + 1,
          $total := local:calcTotalDur($total cast as xs:integer, $duration cast as xs:integer)
      return if ($total <= $outputPref) then ($segment) else ()
    } </match>
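Two further helpers are assumed by 3.1 and 3.4; a minimal sketch follows, in which the userPreferenceCollection path is hypothetical. Note that standard XQuery variables are immutable, so the re-binding of $total in 3.4 is scoped to each iteration; a strictly standard formulation would accumulate the running total with a recursive function or fn:sum, and the sketch below shows only the intended arithmetic.

declare function local:getUserPref($userName as xs:string) as element()?
{
  (: assumption: preferences are stored per user under a top-level collection :)
  (/userPreferenceCollection/userPreference[userName = $userName])[1]
};
declare function local:calcTotalDur($total as xs:integer, $dur as xs:integer) as xs:integer
{
  $total + $dur   (: running total of the summary duration, in seconds :)
};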
5.4. Summary 4 (Q4): Favorite Players and Matches Summary. Construct a customized summary showing a user's favorite players and matches (according to the user's parameters). The GUI of the browser is depicted in Figure 12.

In this query, users can specify the types of their favorite matches and/or players. For example, some users only like to watch a 'goal feast' match, that is, a match with more than N goals. The following demonstrates examples of the summary requirements that can be specified via a user preference or user inputs. It is to be noted that the scope of best/most is the specified sportVideoLibrary(s).

<Favorite player> "actors of"
  <bestStriker> [>N goals] [player position = striker/forward]
  <mostUnluckyStriker> [>N shots in 1 match] [player position = striker/forward]
  <mostDangerousTackler> [>N fouls in 1 match]
  <mostDisciplinedPlayer> [<N fouls in 5 matches] [player position = defender/centre back]
<Favorite match> "contains"
  <goalFeast> [>N goals]
  <flowingGame> [<N fouls] or [<N whistles]
  <crowdPleaser> [>N excitement] and [not exist in not(FlowingGame)]
  <nonQuietGame> [<N goals] and [<N shots] and [>N non]

5.4.1. Algorithm 4
Declare and initialize variables in order to receive the parameters from user input (4.1)
  * in this example, the user defined matches with more than 2 goals as goalFeast matches, and matches with fewer than 10 fouls and whistles as flowing matches
A. Process goalFeast matches (4.2)
  Loop over the sportVideoLibrary
    Count all goal events in the video
    If the number of goals is greater than the parameter input by the user
      Display the details of the video as a goalFeast match
B. Process flowing matches (4.3)
  Loop over the sportVideoLibrary
    Count all foul events and whistle segments in the video
    If the number of fouls and the number of whistles are less than the parameters input by the user
      Display the details of the video as a flowingMatch
… Using a similar approach to A and B, process the other favourite matches and favourite players

4.1 let $thresholdGoals := request:request-parameter("thresholdGoals", 3),
        $thresholdFouls := request:request-parameter("thresholdFouls", 10),
        $thresholdWhistle := request:request-parameter("thresholdWhistle", 10)
4.2 let $vidLib := /sportVideoLibrary
    for $vid in $vidLib//*[ends-with(lower-case(name()), "video")]
    return
      let $gT := $thresholdGoals,
          $goal := $vid//*[ends-with(lower-case(name()), "domaineventcollection")]//*[name() = "Goal"],
          $numGoals := count($goal)
      return if ($numGoals >= $gT)
             then (<match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}">
                     {<numOfGoals>{$numGoals}</numOfGoals>}
                     {$vid/mediaLocation}
                   </match>)
             else ()
4.3 let $vidLib := /sportVideoLibrary
    for $vid in $vidLib//*[ends-with(lower-case(name()), "video")]
    return
      let $fT := $thresholdFouls,
          $wT := $thresholdWhistle,
          $foul := $vid//*[ends-with(lower-case(name()), "eventcollection")]//*[name() = "Foul"],
          $whistle := $vid//segmentCollection//*[name() = "whistle"],
          $numFouls := count($foul),
          $numWhistle := count($whistle)
      return if ($numFouls <= $fT and $numWhistle < $wT)
             then (<match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}">
                     {<numOfFouls>{$numFouls}</numOfFouls>}
                     {<numOfWhistle>{$numWhistle}</numOfWhistle>}
                     {$vid/mediaLocation}
                   </match>)
             else ()

Figure 12. Screenshots of Q4: a) Form to enter the user's parameters; b) The generated summary
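A remark on portability: the request:request-parameter function used in 4.1 (and in 5.1 below) is, to our knowledge, an extension provided by XML database engines such as eXist rather than part of XQuery 1.0 itself. On other engines, the same queries can receive their parameters through external variables bound by the host application, for example:

declare variable $thresholdGoals external;
(: the calling application binds this variable; the default value of 3 must then be supplied by the caller :)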
5.5. Summary 5 (Q5): Player's Event Summary. This query lists a certain player's (e.g. Ronaldo's) details, such as full name, short name and position, together with any event segments that are related to him, regardless of the match. It determines all specified event segments in all matches in which the selected player appeared and also displays a count of that event type (e.g. goals). This query enables a user to track a certain player's performance, such as how many goals he has scored or how many fouls he has committed, across all matches in the video database. The GUI of the browser is depicted in Figure 13.

5.5.1. Algorithm 5
Declare and initialize variables in order to receive the parameters from user input (5.1)
  * in this example, the user would like to watch all 'goals' scored by 'Ronaldinho' (i.e. Pl10)
Locate the player details based on the parameter provided by the user (5.2)
Display the player's details, including fullName, shortName and position (5.3)
Loop over all goal events which are scored by the player
  Calculate and display the total number of goals (5.4)
  Display the details of the goal events

Figure 13. Screenshots of Q5: a) Form to enter the preferred player and event, b) The summary showing the player details (in future, a close-up image of each player will be provided)

5.1 let $player := request:request-parameter("player", "Pl10"),
        $event := request:request-parameter("event", "Goal")
5.2 let $vidLib := /sportVideoLibrary,
        $playerElement := $vidLib//player[playerId = $player]
5.3 <Details>
      {$playerElement//playerFullName}
      {$playerElement//playerShortName}
      {$playerElement//position}
    </Details>
5.4 for $vid in $vidLib/*[ends-with(lower-case(name()), "video")]
    let $events := $vid//*[ends-with(lower-case(name()), "domaineventcollection")]//*[name() = $event],
        $playerEvents := $events[*[ends-with(lower-case(name()), "playerid")] = $playerElement//playerId]
    where count($playerEvents) > 0
    return <match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}"> {
      for $playerEvent in $playerEvents return $playerEvent
    } </match>

5.6. Summary 6 (Q6): Players by Team Summary. Find all video segments in which each player in all club or country teams appears. The appearance results are grouped by video, segment and then event; within each group, the results are sorted by location (from frameStart to frameEnd). The GUI of the browser is depicted in Figure 14. This query is specifically designed to show that Q5 can easily be extended into a more complex query; that is, XQuery scales well thanks to its procedural approach.

Figure 14. Screenshot of Q6
5.6.1. Algorithm 6
Loop over all players who belong to the club teams or country teams within the video
  Store all elements of the players and their clubs or countries in variables
  Display the player's details: fullName, shortName, position (6.1)
  Display the name of the team to which the player belongs
  Loop over all segments and events in which the player appears
    Calculate and display the total number of events
    Display the details of the events, ordered by frameStart time stamp (6.2)

6.1 is modified from 5.1 and 5.2 – PlayersByClubTeamSummary
for $clubPlayer in $vidLib//player[(countryTeam = $club/teamId) or (clubTeam = $club/teamId)]
let $playerId := $clubPlayer//playerId
return <player>
         {$clubPlayer/*}
         {for $team in $clubPlayer/countryTeam
          return <countryTeamName>{local:idToTeamName(data($clubPlayer/countryTeam))}</countryTeamName>}
         {for $team in $clubPlayer/clubTeam
          return <clubTeamName>{local:idToTeamName(data($clubPlayer/clubTeam))}</clubTeamName>}
       </player>

6.2 is modified from 5.3 – AppearancesListedByMatch
for $vid in $vidLib/*[ends-with(lower-case(name()), "video")]
let $eventCollection := $vid//*[ends-with(lower-case(name()), "domaineventcollection")],
    $appS := $vid//semanticRelation[sourceSemObjId = $clubPlayer//playerId]/destinationSegmentId,
    $appE := $eventCollection//*[ends-with(lower-case(local-name(.)), "playerid")][data(.) = $playerId]
where count($appS) + count($appE) > 0
return <Match id="{$vid/segmentId}" filename="{$vid//fileName}">
         {if (count($appS) > 0)
          then (<Segments> {
                  for $segAppId in $appS
                  let $sequence := $vid//segmentCollection/*[segmentId = $segAppId]
                  order by local:timetoint($sequence//frameStart)
                  return $sequence
                } </Segments>)
          else ()}
         {if (count($appE) > 0)
          then (<Events> {
                  for $event in $eventCollection/*
                  where $event//*[ends-with(lower-case(local-name(.)), "playerid")] = $clubPlayer//playerId
                  order by local:timetoint($event//frameStart)
                  return $event
                } </Events>)
          else ()}
       </Match>
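As with the earlier helpers, local:idToTeamName in 6.1 is not defined in the paper; a minimal sketch, assuming that team elements carry teamId and teamName children (the teamName element is an assumption):

declare function local:idToTeamName($id as xs:string) as xs:string
{
  (: resolve a team ID to its display name via the semantic object collection :)
  string((/sportVideoLibrary//team[teamId = $id]/teamName)[1])
};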
6. SYSTEM IMPLEMENTATION AND EVALUATION

6.1. Experimental Data
The current implementation of the system has used soccer (3 x 20 minutes), tennis (20 minutes), swimming (2 x 5 minutes), and diving (10 minutes) sports data to ensure that the indexing and retrieval schemes are generic and robust. As described in Section 4.2, these sports were chosen to represent the four main categories of sports: period-, set-point-, time (race)-, and performance-based sports. Thus, we have demonstrated that the proposed sports video retrieval system should generalize to most, if not all, sport genres, such as basketball, badminton, running, and gymnastics.

6.2. Semi-automatic Content Annotation for Indexing
After all features and semantics are extracted using the automated algorithms, an XML file containing segment, event and object information is generated for each sport video according to our schema. Some information, such as player names, is entered manually by an annotator or downloaded from related websites (such as goal.com). To generate the list of relationships between players and events (i.e. to annotate each event with its actors), the index is parsed and visualized in a graphical user interface that assists annotators in determining the relationships. For each (or a specific) event and segment, users can simply drag and drop the names of the players who are involved in (or appear in) it. For future work, we will investigate the use of speech recognition and video optical character recognition to automatically detect the players' names mentioned during a particular video event. However, until these techniques are fully reliable, some manual intervention will still be needed. Thus, the main challenge is to reduce the human workload in the indexing process as much as possible by obtaining external information and taking advantage of community contributions.

6.3. Content Delivery for Retrieval
The proposed sports video retrieval system has been implemented as a Web-based application, allowing the user to browse the information by jumping from page to page using hyperlinks. This paradigm is already well known and ports easily to handheld devices that have a small screen, only a few keys and no mouse. The interface (in the form of web pages) can be designed to allow browsing on PDAs and mobile phones using WAP technology, the mobile/wireless equivalent of HTML. With the intention of using interoperable standards, the application server is designed to run on Tomcat 5.5 using the Java 1.5 SDK platform. When a user connects to the server, the main page lists generic summaries that apply to the whole library and a selection of (recently added) videos. For each video, some information and key frame(s) are displayed, and the queries are presented as links. The information for each graphical user interface (GUI) is retrieved by XQuery from the database, and the query results are rendered by the Content Delivery Server into an HTML document using XSL transformations (explained below). When the user selects a query (or clicks on a link in general), the request is transmitted to and handled by a servlet of the Content Delivery Server; this likewise leads to executing an XQuery and rendering the results back to the user. Some of the queries are filtered by parameters, making it possible for users to personalize the results. Such queries use forms that invite the user to fill in the parameters before showing the actual results (for example, Q4 and Q5 in Section 5). Each form is populated by an XQuery in order to offer predefined (and existing) drop-down options/lists to the user instead of requiring him/her to type the information. When the user submits the parameter form, the parameters are passed to the actual XQuery that generates the expected result. The result is then transformed into an HTML page and returned to the user. When the user clicks on a video link, a specialized servlet of the Content Delivery Server generates an ASX play-list file that contains the URL of the actual video segment. This play-list is downloaded and opened by the associated video player (depending on the user's operating platform).
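For illustration, a generated play-list might look as follows (the URL, title and timing values are invented; the ASX format allows a start time and duration per entry, which is how a single segment of a longer video file can be addressed):

<asx version="3.0">
  <entry>
    <title>Match 1 - Goal (Track 12)</title>
    <ref href="mms://mediaserver/sportvideos/match1.wmv"/>
    <starttime value="00:21:05"/>
    <duration value="00:00:35"/>
  </entry>
</asx>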
Summary browsers (Q1 to Q6) form the most important part of the GUI. A query usually returns information structured as a tree, and the nodes of that tree contain more detailed content. To allow the user to browse this content efficiently, each of the folder nodes can be opened to show its children nodes, and so on. Clicking on a node shows more detailed information and content about it in the bottom part of the screen. For example, if a query returns a set of players grouped by match, clicking on a match will show the location and date of that match, plus some key frames, links to match-specific queries, and a link to the whole match video. To avoid slow interaction with the server while browsing, the results of a query are self-contained in the summary browser page. That way, the page changes in real time in response to the user's actions, as long as he/she does not click on a link that leads to a video or another query. This is made possible by developing the summary browser in Dynamic HTML, driven by JavaScript. Moreover, the content of all the nodes is rendered and embedded in the page, waiting for the user to view it on demand.

The Content Delivery Server is a session-aware Tomcat web application (developed as Java servlets) which keeps the context for each connected user and adapts the navigation accordingly. For example, some queries are influenced by the user's preferences, and the representation of information and content is adapted to the user's platform/device. Figure 15 describes the typical interaction scenario. For the requirements of this scenario, the following servlets and classes form the content-delivery server:

The Config class is included in all servlets, providing necessary configuration information such as the paths to data and common functions. It also defines the list of queries.
The PageRenderer class is also included in all servlets and is used as an abstraction layer for the final device-adapted output. It provides functions that process the XSL transformations of XML data in order to render final HTML pages from query results.
The SelectVideo servlet drives the rendering of the main page.
The QueryForm servlet drives the rendering of the parameters page for parameterized queries.
The ViewResult servlet drives the rendering of the summary browser after the execution of a query.
The ViewKeyFrame servlet returns image files for a given timestamp and a given video. It is used by the summary browser.
The View servlet generates the ASX play-list file that gives access to a video segment.
The UserPref servlet drives the rendering, interaction and data manipulation for the User Preferences GUI. It generates XUpdate documents and submits them to the database server to update the preferences.

6.3.1. Dynamic GUI generation. In order to allow simple extensibility and re-usability of the system and the library, the GUI has been designed modularly. Figure 16 describes the process of generating the dynamic GUI, and Figure 17 describes the processing data flow that takes the query results to the final visually-rich HTML pages. The steps consist of XML transformations processed by XSL templates, which can be described as follows:

Step 1: High-level GUI rendering.
Summary Browser Tree Rendering (BR). This query-specific transformation (an XSL-T file) builds up the tree structure that will be shown on the final HTML page. Each node of the tree is denoted by a title, an icon and a link, but the tree is still coded in high-level XML, not in HTML. This transformation includes the "content location" transformation.
Content Location (CL). This transformation defines a template that may be called by the query-specific XSL-T file when a node contains rich information to show besides the tree. This template embeds the content (defined by the result of the query) into a uniquely identified element for later rendering. The parent node has to link to that identifier.

Step 2: Final GUI rendering (platform-specific).
Platform-specific Page Rendering (PR). This transformation renders the DHTML (HTML + JavaScript) code of the summary browser containing the tree structure obtained from the BR transformation.
Content Rendering (CR). This transformation is a second pass on the output of the BR transformation.
First, it extracts the embedded content elements generated by the Content Locator in order to delegate the rendering to content-specific templates (such as an XSL file for each sport genre, defining the representation of the corresponding events and other types of objects). Second, it resolves the links between the rendered tree nodes and the rendered content within the final HTML page.

The strengths of this dynamic GUI rendering design are as follows:
The final (client-platform specific) layout is abstracted away from the first rendering step. Because the first-step transformations are query-specific (they only define how to represent the results of a given query), they can be used for any platform. Moreover, it is easy to add support for new client platforms, or to change the GUI layout, by re-implementing only the second rendering step.
The rendering of content is delegated to sport-specific templates, so it is easy to add support for new sports without affecting the main transformation engine.
The separation of the tree and the linked content makes it possible to implement a "PULL" version of the Summary Browser that would download content on demand from the server. This strategy is faster and consumes less bandwidth than the current "PUSH" approach, in which everything is downloaded from the server at once.
Using XSL-T for this process makes the application more interoperable, as it does not rely on a particular programming language or platform.
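As an illustration of the first rendering step, the following is a minimal sketch of a query-specific BR stylesheet for Q1 (the <node> vocabulary, titles and icon names are invented here; only the two-step principle is taken from the system):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- one high-level tree node per match; still abstract XML, not HTML -->
  <xsl:template match="match">
    <node title="Match {@id}" icon="match">
      <xsl:apply-templates select="pb"/>
    </node>
  </xsl:template>
  <!-- one child node per play-break track -->
  <xsl:template match="pb">
    <node title="Track" icon="track"/>
  </xsl:template>
</xsl:stylesheet>

The platform-specific PR step would then transform this abstract <node> tree into DHTML for desktop browsers, or into lighter mark-up for mobile devices.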
Figure 15. Typical Interaction Scenario

6.4. User Evaluation
The system has been evaluated by 20 users, using observation and a questionnaire, to verify whether the retrieval scheme is effective and powerful in meeting users' requirements. The users' profiles are: (1) 5 sports fans who are keen to spend time watching a whole match; (2) 3 sports fans who just like to watch highlights or a summarized version; (3) 4 sports viewers who hardly watch any games; and (4) 2 TV viewers who dislike sport (to get as much useful feedback as possible on how to make sport/soccer interesting to watch). To quantify users' answers and to enable statistics, we used closed questions that can be answered with ratings on a 1-3 scale (1: Disagree, 2: Agree, 3: Strongly agree). When users answered 1 or 2, they provided suggestions for improvement. Table II presents the results of our questionnaire, which demonstrate that users generally agree that our browsing scheme is useful and meets their requirements.

Questionnaire:
Summary 1
Q1. Each track contains sufficient information to understand what is going on, without having to refer to previous/next tracks (i.e. self-contained, like a CD track)
Q2. The tracks and play-break segmentation are useful and enhance my viewing experience
Q3. The key frames sufficiently describe the event
Q4. If I could browse key frames belonging to the whole match, it would help me decide which match I would be keener to watch
Q5. The tracks and play-breaks are also useful for the other sports
Q6. By browsing tracks and play-breaks, I can still get the same details as if I watched the whole match, except that I now have the choice of selecting/skipping scenes, like selecting tracks on a CD
Q7. The track-based browsing scheme is effective for getting the essence of a match without missing any details
Summary 2
Q8. The event summaries give me enough highlights of the match
Q9. The statistics, such as the break/excitement ratios, are descriptive and help me choose the more interesting events
Summary 3
Q10. The customized summary is better or more useful than the other summaries so far, since it allows me to choose what I want to watch
Q11. The amount of customization (i.e. from the user preference form) is sufficient and suitable for minimum user intervention
Summary 4
Q12. This summary is useful since I can specify my favorite matches (whether I want to watch them as a whole or use Summary 1 or Summary 2 to browse them)
Q13. (Although it is not yet supported by the current prototype) I would find this summary useful if I could choose:
Summary 5
Q14. This summary is useful since I can watch my favorite players' events
Summary 6
Q15. This summary is useful since I can watch all the events of my favorite teams, grouped by player
For Profile 4:
Q16. Even though I am not a fan of sports, the browsing scheme (Q1-Q6 summaries) supported in this system would make me more interested in watching sports video

Profile   1   1   1   1   1   2   2   3   4   4   Average
Q1        3   2   2   2   2   2   1   2   3   2   2.1
Q2        3   1   3   2   2   2   2   3   3   1   2.2
Q3        3   2   3   1   2   3   1   3   3   2   2.3
Q4        3   3   3   1   2   3   2   3   3   2   2.5
Q5        3   2   3   1   3   1   2   3   3   2   2.3
Q6        3   2   3   2   2   3   2   2   3   2   2.4
Q7        3   2   3   2   2   1   2   3   3   2   2.3
Q8        3   2   3   3   3   3   3   2   3   3   2.8
Q9        3   1   3   3   2   3   1   2   3   1   2.2
Q10       3   3   3   3   2   3   3   3   3   3   2.9
Q11       3   3   3   2   1   3   2   2   3   3   2.5
Q12       3   2   3   3   3   2   2   3   3   2   2.6
Q13       3   2   3   3   3   2   3   2   3   2   2.6
Q14       3   3   3   3   3   3   3   3   3   3   3.0
Q15       3   2   3   3   2   2   3   2   3   3   2.6
Q16       -   -   -   -   -   -   -   -   2   3   2.5
Table II. Results from the user evaluation (each column is one respondent, labelled by profile number; Q16 was answered by Profile 4 respondents only)

Based on users' feedback, Summary 3 (the user-preference based customized summary), Summary 5 (the player's event summary), Summary 4 (favorite matches and players) and Summary 2 (a match's event summaries) were rated as particularly useful. It is worth noting that even users who do not like sports agreed that the browsing scheme could make them more interested in watching sports video. Some users suggested further summaries, such as browsing by year, big (legendary) events, and past or recent games. They would also like to see closer-view key frames and brief descriptions of each of the available summaries.

Figure 16. XSL transformation and rendering for dynamic GUI generation
Figure 17. Data flow from the XML query result to the final HTML page

7. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel sports video retrieval system which has been fully implemented and evaluated. The system uses a segment-event-object based video indexing model designed with semi-schema-based and object-relationship modeling schemes. The schema is implemented as an XML schema (represented in ORA-SS notation), populated with XML, and utilized by XQuery to construct dynamic summaries and user-oriented retrievals. The proposed schema is scalable since it enables incremental development of algorithms for automatic feature-semantic extraction. Moreover, the schema does not need to be complete at one time, and users can easily add additional elements without needing to know the parent objects. The indexing model emphasizes referencing relationships rather than nested objects in order to avoid redundant data and to achieve more efficient data updates. Referencing also allows the system to add segments, events and objects without concern for the hierarchy, thereby achieving more straightforward and faster data insertion. We have also shown that the proposed indexing scheme can easily be leveraged to integrate more MPEG-7 standard multimedia descriptions. The implementation strategy, which includes a modular GUI design and a multi-level architecture, has demonstrated that the proposed system is scalable and extensible.
Moreover, a web-based service enables multi-platform access, as long as the client platforms support web browsing and video player software. The user evaluation has confirmed the effectiveness of the currently implemented browsing scheme, which is generated dynamically from the indexes. The current implementation, which uses soccer, swimming, tennis and diving data to represent the four main categories of sports, has demonstrated that the proposed sports video retrieval framework is robust across sports genres. The graphical interface of the system has been designed in a modular style to enable future extensions. Moreover, the interface modules can be scaled to different screen sizes on desktop and mobile platforms. For future work, we aim to extract more features and to improve the performance results to support more reliable automatic semantic interpretation. As a result, the database will be rapidly expanded to include more data from other sports. We will continue to extend the retrieval module to include a wider variety of search strategies, while adding more dynamic summaries that will benefit users and applications. The aesthetics of the user interface will be improved to enhance its effectiveness. Since XQuery is still a working draft, we will need to revisit the proposed queries when the specification is finalized, to ensure that all the functionalities we applied remain available and that potential improvements to query effectiveness can be made. Last, but most importantly, we aim to conduct a performance test on a large dataset (e.g. more than 500 videos) using the current XML database.

REFERENCES
ADALI, S., CANDAN, K.S., CHEN, S.-S., EROL, K. and SUBRAHMANIAN, V.S. 1996. The Advanced Video Information System: Data Structures and Query Processing. Multimedia Systems, 4 (4). 172-186.
ADAMS, B., DORAI, C. and VENKATESH, S. 2002. Toward automatic extraction of expressive elements from motion pictures: tempo. Multimedia, IEEE Transactions on, 4 (4). 472-481.
ASSFALG, J., BERTINI, M., COLOMBO, C., DEL BIMBO, A. and NUNZIATI, W. 2003. Automatic extraction and annotation of soccer video highlights. in Image Processing, 2003. Proceedings. 2003 International Conference on, (2003), II-527-530.
ASSFALG, J., BERTINI, M., DEL BIMBO, A., NUNZIATI, W. and PALA, P. 2002. Detection and recognition of football highlights using HMM. in Electronics, Circuits and Systems, 2002. 9th International Conference on, (2002), 1059-1062.
ASSFALG, J., BERTINI, M., DEL BIMBO, A., NUNZIATI, W. and PALA, P. 2002. Soccer highlights detection and recognition using HMMs. in Multimedia and Expo, 2002. ICME '02. Proceedings. 2002 IEEE International Conference on, (2002), 825-828.
BABAGUCHI, N., KAWAI, Y. and KITAHASHI, T. 2002. Event based indexing of broadcasted sports video by intermodal collaboration. Multimedia, IEEE Transactions on, 4 (1). 68-75.
BABAGUCHI, N., KAWAI, Y., YASUGI, Y. and KITAHASHI, T. 2000. Linking live and replay scenes in broadcasted sports video. in ACM Workshop on Multimedia, (Los Angeles, California, United States, 2000), ACM Press, 205-208.
BABAGUCHI, N. and NITTA, N. 2003. Intermodal collaboration: a strategy for semantic content analysis for broadcasted sports video. in Image Processing, 2003. Proceedings. 2003 International Conference on, (2003), 13-16.
BABAGUCHI, N., OHARA, K. and OGURA, T. 2003. Effect of personalization on retrieval and summarization of sports video.
in Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference on, (2003), 940-944.
BLAHA, M. and PREMERLANI, W. 1998. Object-oriented modeling and design for database applications. Prentice Hall, Upper Saddle River, N.J.
BOAG, S., CHAMBERLIN, D., FERNANDEZ, M.F., FLORESCU, D., ROBIE, J. and SIMÉON, J. 2004. XQuery 1.0: An XML Query Language. W3C Working Draft, 29 October 2004, W3C.
BRUNDAGE, M. 2004. XQuery: the XML query language. Addison Wesley.
CHAISORN, L. and CHUA, T.-S. 2002. The Segmentation and Classification of Story Boundaries in News Video. in 6th IFIP Working Conference on Visual Database Systems, (Brisbane, 2002), Kluwer, 94-109.
CHANG, S.-F., SIKORA, T. and PURI, A. 2001. Overview of the MPEG-7 standard. Circuits and Systems for Video Technology, IEEE Transactions on, 11 (6). 688-695.
DIMITROVA, N., RUI, Y. and SETHI, I. 2001. Media Content Management. In Design and Management of Multimedia Information Systems: Opportunities and Challenges, S.M. Rahman ed. Idea Group Publishing.
DOBBIE, G., WU, X., LING, T.W. and LEE, M.L. 2000. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report, Department of Computer Science, National University of Singapore.
EKIN, A. and TEKALP, A.M. 2003. Generic play-break event detection for summarization and hierarchical sports video analysis. in International Conference on Multimedia and Expo 2003 (ICME03), (6-9 July 2003), IEEE.
EKIN, A. and TEKALP, M. 2003. Automatic Soccer Video Analysis and Summarization. IEEE Transactions on Image Processing, 12 (7). 796-807.
HAN, M., HUA, W., CHEN, T. and GONG, Y. 2003. Feature design in soccer video indexing. in Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference on, (2003), 950-954.
HANJALIC, A. 2002. Shot-boundary detection: unraveled and resolved? Circuits and Systems for Video Technology, IEEE Transactions on, 12 (2). 90-105.
HENG, W.J. and NGAN, K.N. 2002. Shot boundary refinement for long transition in digital video sequence. Multimedia, IEEE Transactions on, 4. 434-445.
KOSCH, H. 2004. Distributed Multimedia Database Technologies supported by MPEG-7 and MPEG-21. CRC Press, Florida, USA.
LU, G.J. 1999. Multimedia database management systems. Artech House, Boston; London.
MANJUNATH, B.S., SALEMBIER, P. and SIKORA, T. 2002. Introduction to MPEG-7. Wiley, England.
MENG, H.M., TANG, X., HUI, P.Y., GAO, X. and LI, Y.C. 2001. Speech retrieval with video parsing for television news programs. in Acoustics, Speech, and Signal Processing, 2001. Proceedings. 2001 IEEE International Conference on, (2001), 1401-1404.
NEPAL, S., SRINIVASAN, U. and REYNOLDS, G. 2001. Automatic detection of 'Goal' segments in basketball videos. in ACM International Conference on Multimedia, (Ottawa, Canada, 2001), ACM, 261-269.
NGAI, C.H., CHAN, P.W., YAU, E. and LYU, M.R. 2002. XVIP: an XML-based video information processing system. in Computer Software and Applications Conference, 2002. COMPSAC 2002. Proceedings. 26th Annual International, (2002), 173-178.
OH, J. and HUA, K.A. 2000. Efficient and cost-effective techniques for browsing and indexing large video databases. in ACM SIGMOD Intl. Conf. on Management of Data, (Dallas, TX, 2000), ACM, 415-426.
OOMOTO, E. and TANAKA, K. 1997.
Video Database Systems - Recent Trends in Research and Development Activities. In The Handbook of Multimedia Information Management, W.I. Grosky, R. Jain and R. Mehrotra eds. Prentice Hall, Upper Saddle River, NJ, 405-448.
PEREIRA, F. 2001. MPEG-7 Requirements Document V.14. International Organisation for Standardisation, Coding of Moving Pictures and Audio, ISO/IEC JTC 1/SC 29/WG 11/N4035, Singapore.
PONCELEON, D., SRINIVASAN, S., AMIR, A., PETKOVIC, D. and DIKLIC, D. 1998. Key to effective video retrieval: effective cataloging and browsing. in IEEE International Workshop on Content-based Access of Image and Video Databases, (Bombay, India, 1998), IEEE Computer Society, 99-107.
RUI, Y., GUPTA, A. and ACERO, A. 2000. Automatically extracting highlights for TV baseball programs. in ACM International Conference on Multimedia, (Marina del Rey, California, United States, 2000), ACM, 105-115.
SATO, T., KANADE, T., HUGHES, E.K. and SMITH, M.A. 1998. Video OCR for digital news archive. in Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop on, (1998), 52-60.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2004. Integrating Highlights to Play-break Sequences for More Complete Sport Video Summarization. IEEE Multimedia, Oct-Dec 2004. 22-37.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2004. The Power of Play-Break for Automatic Detection and Browsing of Self-consumable Sport Video Highlights. in the 6th International ACM Multimedia Information Retrieval Workshop, (New York, USA, 2004), ACM.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2003. Sports video summarization using highlights and play-breaks. in The Fifth ACM SIGMM International Workshop on Multimedia Information Retrieval (ACM MIR'03), (Berkeley, USA, 2003), ACM, 201-208.
TSENG, B.L., LIN, C.-Y. and SMITH, J.R. 2004. Using MPEG-7 and MPEG-21 for personalizing video. Multimedia, IEEE, 11 (1). 42-52.
WU, C., MA, Y.-F., ZHANG, H.-J. and ZHONG, Y.-Z. 2002. Events recognition by semantic inference for sports video. in Multimedia and Expo, 2002. Proceedings. 2002 IEEE International Conference on, (2002), 805-808.
XIE, L., CHANG, S.-F., DIVAKARAN, A. and SUN, H. 2002. Structure analysis of soccer video with hidden Markov models. in Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on, (2002), 4096-4099.
XU, P., XIE, L. and CHANG, S.-F. 2001. Algorithms and System for Segmentation and Structure Analysis in Soccer Video. in IEEE International Conference on Multimedia and Expo, (Tokyo, Japan, 2001), IEEE.
YU, X. 2003. Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast soccer video. in ACM MM 2003, (Berkeley, CA, USA, 2003), ACM, 11-20.
ZELNIK-MANOR, L. and IRANI, M. 2001. Event-based analysis of video. in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, (2001), 123-130.
ZHANG, H.J. (ed.) 1999. Content-based video browsing and retrieval. CRC Press LLC.

APPENDIX 1: ORA-SS NOTATION

APPENDIX 2: FULL-SCALE DIAGRAM OF FIGURE 6
[This appendix reproduces Figure 6 at full scale: the ORA-SS schema diagram of the sport video library, covering the video segment, semantic-object, syntactic-relation and semantic-relation collections, domain events (e.g. Goal, with Goal Scorer, Assist Giver and Team Benefited), key frames, play, break and replay segments, and the hierarchical, highlight and comprehensive summaries. The diagram cannot be reproduced meaningfully in text form.]