A Scalable and Extensible Segment-Event-Object
based Sports Video Retrieval System
DIAN TJONDRONEGORO
Queensland University of Technology, Australia
YI-PING PHOEBE CHEN
Deakin University, Australia
and
ADRIEN JOLY
Queensland University of Technology, Australia
________________________________________________________________________
Sport video data is growing rapidly as a result of the maturing digital technologies that support digital video
capture, faster data processing, and large storage. However, (1) semi-automatic content extraction and
annotation, (2) scalable indexing models, and (3) effective retrieval and browsing still pose the most challenging
problems for maximizing the usage of large video databases. This paper will present the findings from a
comprehensive work that proposes a scalable and extensible sports video retrieval system with two major
contributions in the area of sports video indexing and retrieval. The first contribution is a new sports video
indexing model that utilizes a semi-schema-based indexing scheme on top of an Object-Relationship approach.
This indexing model is scalable and extensible as it enables gradual index construction which is supported by
ongoing development of future content extraction algorithms. The second contribution is a set of novel queries
which are based on XQuery to generate dynamic and user-oriented summaries and event structures. The
proposed sports video retrieval system has been fully implemented and populated with soccer, tennis,
swimming, and diving videos. The system has been evaluated with 20 users to demonstrate and confirm its
feasibility and benefits. The experimental sports genres were specifically selected to represent the four main
categories of sports domain: period-, set-point-, time (race)-, and performance- based sports. Thus, the proposed
system should be generic and robust for all types of sports.
Categories and Subject Descriptors: H3.1 [Information Storage and Retrieval]: Content Analysis and
Indexing – Abstracting methods, Indexing method; Information Storage and Retrieval – Query formulation;
H5.1 [Information Interfaces and Applications]: Multimedia Information Systems – Video
General Terms: Design, Experimentation, Languages
Additional Key Words and Phrases: Video database system, sports video retrieval, automatic content extraction,
indexing, XML, XQuery, MPEG-7, mobile video interaction
________________________________________________________________________
1. INTRODUCTION
In the midst of a digital media era, we have witnessed rapid development of technologies
that support digital video broadcast and capture, faster data processing, and large storage.
As a result of these maturing technologies, sports video content is growing rapidly, as it can attract a
wide range of audiences and is generally broadcast for long hours. To maximize the
potential usage of sports video, we have identified three major requirements.
Authors' addresses: Dian Tjondronegoro, School of Information Systems, Queensland University of
Technology, Australia; Yi-Ping Phoebe Chen, School of Information Technology, Deakin University, Australia;
Adrien Joly is a student in the Faculty of Information Technology, Queensland University of Technology,
Australia. Correspondence should be addressed to dian@qut.edu.au.
Permission to make digital/hard copy of part of this work for personal or classroom use is granted without fee
provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice,
the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM,
Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific
permission and/or a fee.
© 2006 ACM 1073-0516/01/0300-0034 $5.00
First is a system that combines powerful techniques to automatically extract key
audio-visual and semantic content. The content extraction system also summarizes the
lengthy content into a compact, meaningful and more enjoyable presentation. Second is a
robust, expressive and flexible video data indexing model that supports gradual index
construction. Indexing is required to support effective browsing and retrieval. In fact,
how users can access the video data depends significantly upon how the videos are
indexed. Third is a powerful language that can dynamically generate user/application-oriented video summaries for smart browsing and support various search strategies.
Queries should be designed based on potential usage, not simply because they are
possible or easy to achieve. For example, users can hardly benefit from queries which
are based on low-level features, such as audio volume and visual shape.
For most viewers (including sports fans), a compact and summarized version often
appears to be more attractive than the full-length video. Especially now that most handheld devices, such as PDAs and 3G mobile phones, can play video, users expect to access
sports video anywhere and at any time. Mobile video interaction increases the necessity
for more effective and content-based retrieval due to the following reasons. Mobile
devices have limited capabilities in supporting users to watch the full contents of a sports
video due to their small screen size and restricted battery life. Since local storage in
mobile devices is relatively small, the cost of downloading or streaming content is also a major
issue. Thus, users need to selectively watch particular segments to reduce the time and
costs of downloading full-video content. Current solutions for desktop or mobile
streaming of video content have not fully exploited the power of content-based indexing
to enhance users’ experience. For example, UEFA.com via RealPlayer 10 only allows
users to watch a fixed set of soccer video segments which are compiled as 15-minute
highlights. To improve the flexibility of content access, the system must be more
adaptive to users' requirements. For instance, users should be able to select which key
events they want to watch as a highlight. Mobility also limits users' ability to type
complex queries. Thus, content-based and personalized summarization that creates
browsable dynamic event structures will make users feel in control.
This paper will present the findings from a comprehensive work that proposes a
scalable and extensible sports video retrieval system with two major contributions in the
area of sports video indexing and retrieval. The first contribution is an extensible sports
video indexing model that utilizes a semi-schema-based indexing scheme on top of an
Object-Relationship approach to support gradual content-extraction. Schema-based
matching ensures that the video indexes are valid during data operations such as insertion
and retrieval; and thereby minimizes the need for manual checking. However, the model
is also schema-less as it allows instantiated objects to contain additional (declared) elements beyond
their own schema definition. Moreover, not all elements in an object need to be
instantiated at one time because video content extraction often requires several passes due
to the complexity and lengthiness of processing. Thus, scalability and extensibility are
supported in terms of gradual index construction (number of concepts / relationships) as a
result of future developments of new algorithms to extract semantics automatically from
video; and not in terms of the number of objects to be indexed. For this purpose, ORA-SS
notation will be used for schema representation and to describe the specific features from
XML schema that can be used to support the indexing model. Within this paper, it will be
demonstrated that the model can be easily leveraged to be MPEG-7 compliant. The
second contribution is a set of novel queries which are based on XQuery 1.0 (an XML
query language) to generate dynamic and user-oriented summaries and event structures.
The support for complex queries will demonstrate the benefits of the proposed video
model. The contribution primarily revolves around how we have utilized the power of
XQuery to construct dynamic summaries, as opposed to storing them as static XML
components. This approach will reduce the number of nesting components in the video
model. For example, MPEG-7 stores this as navigation and summary, whereas we show
that summaries do not need to be stored. Moreover, in any proposal of a new video
model, one needs to show how the model can support queries. XQuery is still a new
language that is yet to be finalized; therefore, this paper will also demonstrate its capabilities to
support the proposed video model. Thus, it can serve as a scenario-based use case and
tutorial for the emerging technologies of XML schema and XQuery.
The proposed sports video system has been fully implemented and evaluated with
real users for mobile video access. The evaluation is used to measure the performance of
the system and to confirm the effectiveness of the presented content for users’ purposes.
The remainder of this paper will be structured as follows. Section 2 presents the
architecture of the proposed sports video retrieval system. Section 3 thoroughly discusses
the previous work to describe the achievements and gaps in current technologies (that are
needed for our system architecture). Section 4 describes the indexing scheme in detail,
while Section 5 focuses on the utilization of XQuery to generate dynamic summaries.
Section 6 presents results from system evaluation (including users’ feedback) before
finally some conclusions and future work are presented in Section 7.
2. ARCHITECTURE OF PROPOSED SPORTS VIDEO RETRIEVAL SYSTEM
The architecture of the sports video retrieval system comprises the typical components in
a video database system (as depicted in Figure 1). User/application requirements
determine the retrieval and browsing. The success of retrieval depends on the
completeness and effectiveness of the indexes. Indexing techniques are determined by the
extractable information through automatic or semi-automatic content extraction. Since
video contains rich and multidimensional information, it needs to be modeled and
summarized to get the most compact and effective representation of video data. Prior to
video content analysis, the structure of a video sequence needs to be analyzed and
separated into different layers such as shot and scene.
Since the architecture relies on W3C standards to ensure its extensibility, we have
adopted a web-based architecture that consists of “light” clients using a web browser to
interact with an applicative server. The architecture of our implementation is described in
Figure 2. This model supports different kinds of client devices since this applicative
server can adapt to the user interface while generating and delivering web pages to the
viewer. Because the metadata library is heavily used by this system, access to this
metadata has to be handled by an XML database server (meta-content library server) that
ensures reliable and efficient retrieval using XQuery. For that purpose, we have chosen to
use eXist, a popular open-source solution. For delivering the content, a content delivery
server is used to ensure quality of service. Since the application mainly accesses videos, a
streaming-type server is needed.
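To illustrate this interaction, the following is a minimal sketch of the kind of XQuery the applicative server could send to the metadata library server; the document name is hypothetical, and the element names follow the library instance shown later in Section 4.4.1.

(: retrieve every indexed soccer video and return a compact listing of its
   name, competing teams, and match statistics for rendering as a web page :)
let $library := doc("sportVideoLibrary.xml")
for $sv in $library//PeriodBasedVideo[@eventName = "soccer"]
return
  <matchListing>
    { $sv/name }
    { $sv/overallSummary/team }
    { $sv/overallSummary/matchStatistics }
  </matchListing>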
3. PREVIOUS WORK
This section will position the study presented in this paper towards the area of: video
content extraction, indexing, and retrieval. The main purpose of the discussion is to
review the current state-of-play (including our prior work) and emphasize our major
contributions. Moreover, the process of content extraction will confirm the feasibility of
our indexed contents since manual interventions have been minimized.
Figure 1. High-level Architecture of the Video Retrieval System (raw video, content extraction, structure analysis, abstraction/modelling, indexing, summarization, retrieval and browsing, user/application)
Figure 2. The Architecture of the Implemented System
3.1. Previous Work on Automatic Content Extraction
The major benefit of the domain-specific video database is that users better understand
what they can retrieve, such as specific events and objects. Moreover, feature extraction
tools can be designed for a specific purpose, such as excitement rather than loud audio,
playing-field instead of dominant color with typical lines, or player and ball instead of
foreground/moving objects. The following sub-sections will describe the process of
sports video content extraction.
3.1.1. Segmentation. Instead of using traditional color-based shot segmentation
[Hanjalic 2002], as shown in Figure 3, we have identified that broadcasted soccer videos
usually use transitions of typical shot types to emphasize story boundaries of the match,
including global-view (GV), zoom-in view (ZV), and close-up view (CV) shots [Xie et al.
2002]. We can use grass (or dominant color)-ratio, which measures the amount of grass
pixels in a frame, to classify the main shots in soccer video [Ekin et al. 2003, Xu et al.
1998]. Global shots contain the highest grass-ratio, while zoom-in contains less and
close-up contains the lowest (or none). Thus, close-up shots generalize other shots, such
as crowd, substitutes, or coach close-ups which contain no grass. Analysis of camera-view transitions in a sports video has been used successfully for play-break segmentation
[Ekin et al. 2003]. The start of a play scene can be simply marked as the first frame of a
long global shot (e.g. > 5 sec) which can be interleaved by very short zoom-in or close-up shots (e.g. < 2 sec) [Ekin et al. 2003]. A play scene (P) typically describes an
attacking action which could end because of a goal or other reasons that result in a break
such as foul. Likewise, a break scene is started by either a long zoom-in shot or a zoom
shot of medium length, which can be interleaved by short global shots. During break
scene (B), such as after a goal is scored, zoom-in and close-up shots will be dominantly
used to capture players’ and supporters’ celebration during the break. Subsequently, some
slow-motion replay shots and artificial texts are usually inserted to add additional content
to the goal highlight.
3.1.2. Low-level to Mid-level or Domain-Level Features. As a video document is
composed of image-sequences (frames) and audio tracks, video content analysis can
utilize the techniques from audio and image content analysis. Mid-level features are
slightly more complex than low-level features, and they can also be used to detect sports
features. For example, as shown in Figure 4, skin-color locator (using specific color
histogram and shape characteristics) can be used to detect players' faces, allowing the
system to zoom into a player’s face which can describe the actor(s) of an event.
To use low-level features for extracting sports features, we need to apply some
additional domain-knowledge interpretation. For example, to detect excitement, we
observed the following changes in sports audio track when an exciting event occurs: (1)
Crowd’s cheer and commentator’s speech become louder, (2) Commentator’s voice has a
slight (or significant) raise of pitch and (3) Commentator’s talk becomes more rapid and
has fewer pauses. Based on this concept, the essence of the excitement detection algorithm is
the use of three main features: lower pause-rate, higher pitch-rate, and louder volume,
using dynamic thresholds [Tjondronegoro et al. 2003].
Figure 3. Example of Main Camera-based Views in Soccer, AFL, and Basketball (Global, Zoom-In, Close-up)
Some audio and visual features can be directly used to annotate semantics. An
example of the audio features is speech recognition [Meng et al. 2001, Ponceleon et al.
1998] which detects specific features from a spectrogram. An example of visual features
is optical character recognition (OCR) which uses specific features from the typical
patterns of strong edges [Chairsorn et al. 2002, Sato et al. 1998]. However, current
solutions for these techniques are still complex and time-consuming, need comprehensive
training, and are not yet fully reliable.
Figure 4. Example of Face Detection Results: (a)-(c) original images and labelled face objects
3.1.3. Generic and Domain-specific Semantic. Highlights are generically constructed
by gathering the interesting events that may capture users' attention. Most sports
broadcasters distinguish them by inserting some editing effects such as slow-motion
replay scene(s) and artificial text display. For most sports, highlights can be detected
based on generalized knowledge of the common features shared by most sports videos.
For example, interesting events in sport videos are generally detectable using generic
domain-level features such as whistle, excited crowd/commentator, replay scene, and text
display [Tjondronegoro et al. 2004]. Other features such as slow-motion replay [Assfalg
et al. 2002, 2002, Babaguchi et al. 2000], play-position (midfield, goal-area) [Assfalg et
al. 2003, Babaguchi et al. 2003] and tempo (fast and slow play) [Adams et al. 2002] have
also been widely used to detect exciting events.
While generic highlights are good for casual video skimming, domain-specific (or
classified) highlights will support more useful browsing and query applications. For
example, users may prefer to watch only the soccer goals. To achieve this, it has become
a well-known theory that the high-level semantics in sport videos can be detected based
on the occurrences of specific audio and visual features which can be extracted
automatically. The pattern of occurrence can be captured by manual heuristics
[Babaguchi et al. 2002, Nepal et al. 2001] or automatic machine learning [Wu et al.
2002]. Another alternative is an object-motion-based approach, which offers a higher level of analysis
but requires expensive computations. For example, the definition of a goal in soccer is
when the ball passes the goal line inside of the goal-mouth. While object-based features
such as ball- and players-tracking are capable of detecting these semantics, specific
features like slow-motion replay, excitement, and text display should be able to detect the
goal event more efficiently or at least help in narrowing down the scope of the analysis.
3.1.4. Further-tactical Semantic, Customized Semantic, and Annotations. The further-tactical semantic layer can be detected by focusing on more specific features and
characteristics of a particular sport. Thus, tactical semantics need to be separated from
specific semantics due to their requirement for more complex and less-generic audio-visual features. For example, corner kicks and free kicks in soccer need to be detected
using specific analysis rules of soccer-field and player-ball motion interpretation [Han et
al. 2003]. Playing-field analysis and motion interpretation are less generic than
excitement and break-shot detection. For instance, the excitement detection algorithm can be
shared for different sports by adjusting the value of some parameters or thresholds.
However, the algorithm that detects corners in soccer will not be applicable to tennis due
to a large difference between soccer and tennis fields.
Customized semantic layer can be formed by selecting particular semantics using user
or application usage-log and preferences [Babaguchi et al. 2003]. For example, a sport
summary can be structured using integrated highlights and play-breaks with some text-alternative annotations [Tjondronegoro et al. 2004]. In Section 5, we will cover more
customized semantics as a means of reducing the need for users to enter complex
queries, which would also require the user to understand the query syntax and data
schema.
After all features and semantics are extracted from the video using automatic
algorithms, a sports video database system will usually still require some amount of
manual annotations and metadata, such as sports-venue, details and appearance of teams
and players. Fully automatic content extraction and annotation is not yet possible.
Moreover, annotation techniques using textual analysis from ‘super-imposed’ texts [Sato
et al. 1998] and closed-caption [Babaguchi et al. 2002] cannot always be 100% accurate.
In fact, closed-captions do not always accompany every sports video, and super-imposed texts can be occluded by noise and low resolution, while their appearance can
be random (i.e. not necessarily during or near the event itself). For the proposed video
database system, we developed user-friendly interfaces to enable efficient data annotation
that complement the automatic content extraction process.
3.2. Previous Work on Video Indexing
An appropriate model for video metadata is important for the provision of adequate
retrieval support. Object-Oriented (OO) modeling is recognized for its ability to support
complex data definitions [Blaha et al. 1998]. We have identified two main alternatives
when using OO for modelling, namely; schema-based [Adali et al. 1996] and schema-less
[Oomoto et al. 1997]. The main benefit of using a schema-based model is its capability to
support easy insertions and deletions on a video database. This is due to the strict
components that have to be followed exactly for each entity. However, schema-based
models have limitations and present difficulties during data retrieval because users must
know the class or attribute structures before they can retrieve the desired objects. Another
major limitation is the difficulty of including new descriptions during the instantiation of video
models due to the static schema; therefore, the model is not flexible. In contrast, schema-less modelling is designed based on the fact that each video interval (i.e. frame sequence)
can be regarded as a video object, in which the attributes can be (semantic) objects,
events, or other video objects. The content of a video object is more flexible because it
allows dynamic calculation of inheritance, overlap, merge and projection of intervals to
satisfy user queries. However, two major problems are created by schema-less modelling.
The first significant problem is query difficulties because users/developers must inspect
the attribute definition of each object in order to develop a query because each object has
its own attribute structure. The second problem refers to the total dependency on users or
applications for supervising the instantiation of video objects. Such a problem is caused
by the fact that a schema is not present.
3.2.1. Segment-based Indexing. During the process of indexing texts, a document is
divided into smaller components such as sections, paragraphs, sentences, phrases, words,
letters and numerals. Based on such divisions, indices can be built upon these
components [Zhang 1999]. Using a similar approach, video can also be decomposed into
a segment hierarchy that is similar to the storyboards used in filmmaking [Ponceleon
et al. 1998, Zhang 1999]. Researchers have commonly indexed video shots which are the
video segments that group sequential frames (usually short) with similar characteristics
[Heng et al. 2002, Oh et al. 2000].
3.2.2. Object-based Indexing. Object-based indexing is achieved by attaching video
segments to the semantic objects (i.e. the actors). Thus, it needs to distinguish particular
objects throughout a video sequence in an attempt to capture content changes. In
particular, a video scene is defined as a complex collection of objects, the location and
physical attributes of each object, and the relationships between them. The object
extraction process for video uses the fact that an object's region usually moves as a whole
within a sequence of video frames [Dimitrova et al. 2001, Lu 1999].
3.2.3. Event-based Indexing. Event-based indexing is potentially the most suitable
indexing technique for sport videos, since sport highlights on TV, in magazines,
or on the internet are commonly described using a set of events, particularly the important or
exciting events [Zelnik-Manor et al. 2001]. The main benefits of event-based indexing
are: (1) a sport match can be naturally decomposed into specific events; (2) sport viewers
remember and recall a sport match based on the events; and (3) events can serve as an
effective bridge between low-level features and high-level semantics in sport videos, since
event occurrences can be predicted automatically using specific domain knowledge.
Compared to these existing approaches, our indexing approach has put more emphasis
on object-relational and semi-schema modeling techniques which will be described in
section 4. The proposed indexing is segment- and event- based, which can be linked to
semantic objects (such as players, team, and stadium).
3.3. Previous Work on Video Retrieval
For retrieval purposes, play-break and highlights have been widely accepted as the
semantically-meaningful segments for sport videos [Ekin et al. 2003, Rui et al. 2000, Xu
et al. 1998, Yu 2003]. A play is when the game is flowing such as when the ball is being
played in soccer and basketball. A break is when the game is stopped or paused due to
specific reasons. Play-based summaries are most effective for browsing purposes as most
highlights are contained within plays. However, break sequences should still be retained.
They are just as important as plays, especially if they contain highlights which can be
useful for certain users and applications. For example, a player preparing for a direct free
kick or penalty kick in soccer videos shows the strategy and positioning of the
offensive and defensive teams. A break can also contain slow-motion replays and full-screen texts which are usually inserted by the broadcaster when the game becomes less
intense or at the end of a playing period. However, play-break segments are not necessarily
short enough for users to continue watching until they can find interesting events. For
example, a match may only contain a few breaks due to the rarity of goal, foul, or ball out
of play. In this case, play segments can become too long to be a summary. Play-breaks
also cannot support users who need a precise highlight summary. In particular, sport fans
often need to browse or search a particular highlight in which their favorite team and/or
players appear. Based on these reasons, we have demonstrated the importance of
integrating highlights in their corresponding plays or breaks to construct a more complete
sport video summary [Tjondronegoro et al. 2004].
4. INDEXING SCHEME
We designed a sport indexing model using two main abstraction classes: segment and
event which can be linked with the re-usable semantic objects. Segment can be
instantiated as video-, audio- or visual- segment which are extracted from a raw video
track when mid-level features can be detected. An event can be instantiated into generic,
domain-specific, or further-tactical semantics. Events and segments are chosen because
they can provide effective description for most sport games. For example, most users
will agree that a goal is the most celebrated and exciting event during a
soccer game. Segments can be used as text-alternative annotations to describe a goal, as
shown in Figure 5. The final near-goal segment in a play-break sequence which
contains a goal describes how the goal was scored. Face and text displays are used to
inform the viewer of who scored the goal (i.e. the actor of the event) and the updated
score. The replay scene shows the goal from different angles (to further emphasize how
the goal was scored). In most cases, when the replay scene is associated with excitement,
the content is considered to be more important. Excitement during the last play shot in a
goal is usually associated with descriptive narration. In fact, we (as humans) can often
hear a goal without actually seeing it.
Figure 5. Goal Event with Segment-based Annotations: a play-break sequence containing a goal highlight, with (a) the last near-goal segment, (b) face and inserted texts, (c) the slow-motion replay scene (highlighted portion is associated with excitement), and (d) excitement during play (highlighted portion is associated with the last near-goal before the goal).
4.1. Using ORA-SS to Design the XML Video Model
We have utilized some of the benefits from XML to store and index the extracted
information from sport videos:
 XML is scalable; it allows for additional information without affecting others [Ngai
et al. 2002]. This is important for supporting gradual developments of feature
extraction techniques that can allow an incremental addition to the extractable
segments.

XML is internally descriptive and can be displayed in various ways. This is
important for users who are browsing the XML data directly; also, search results
can be returned as XML.
 XML fully supports semi-structured aspects which match video database
characteristics:
 An object can be described using attributes (properties) and other objects such as:
nested objects and heterogeneous elements.
 Object instantiation from the same class may not have the same number of attributes
because not all attributes are compulsory.
 XML supports two types of relationships which are namely; nesting and referencing.
However, to reduce redundancy, we have used referencing instead of nested object
class [Dobbie et al. 2000].
We have used XML Schema to construct a video schema as it has replaced DTD as
the most descriptive language. Due to its expressive power, XML schema has also been
used as the basis for MPEG-7 Data Definition Language (DDL) and the XQuery data
model. We should therefore be able to easily leverage our proposed model to support
MPEG-7 standard multimedia descriptions and XQuery implementation. For a more
compact representation of XML schema, we will demonstrate the use of ORA-SS (Object-Relationship-Attribute notation for Semi-Structured data) to design the video model as
shown in Figures 6-8. ORA-SS notation is chosen for its ability to represent most of XML
schema’s features. Appendix 1 provides a summary of ORA-SS notation. This diagram
extends the original ORA-SS notation [Dobbie et al. 2000] by demonstrating a more
complex sample which integrates the inheritance diagram with the schema diagram.
Moreover, we have introduced two additional notations: 1) italic for an abstract object, 2)
for a repeated object in the diagram (to avoid crossing lines).
We have employed a bottom-up approach (i.e. from Figure 8 to Figure 6) to describe
the proposed video model. Segment is the basic semantic unit in a sport video; it
describes the abstraction levels such as mid-level features and generic semantic, media
location and media description. Each segment is instantiated, with a unique segment Id as
its key, into a video, visual, or audio segment. A sport video (SV) is a type of video
segment which consists of SV components, overall summary, and hierarchical summary.
SV components are composed of: a) segment collection which stores a flat-list of visual
and audio segments that can be extracted from the sport video, b) syntactic relation
collection which stores all the syntactic relations such as ‘composed of’ and ‘starts after’,
between one source segment and one or more destination segments, and c) semantic
relation collection which records all the semantic relations such as ‘is actor of’ and
‘appears in’, between one source segment or semantic object and one or more destination
segments or semantic objects. Overall summary describes the sport video game as a
whole; it includes where (stadium), when (date, time), teams (that compete), final result,
and match statistics. Match statistics can be stored as XML tags or a visual frame (e.g.
text displays that depict the number of goals, shots, fouls, red/yellow cards, and counter
attacks in a soccer game). The Hierarchical Summary is composed of a Comprehensive
Summary and Highlights Summary. The Comprehensive Summary describes a sport
video in terms of play-break sequences which exist as the main story decomposition unit
in most sport videos. For example, an attacking attempt during a play is stopped when
there is a goal or foul. Each play-break can contain zero or one (key) event and can be
decomposed into one or more play and break shots. A more complete description of play-break-event based summarization has been given previously, and the first query (Q1) is described
in Section 5. On the other hand, the Highlights Summary organizes highlight events into
common summary themes such as soccer goals and basketball free-throws.
Using the proposed video model, we have demonstrated a sport video indexing
scheme that supports:

Scalable video indexes that allow gradual extraction of segments and events (such as
the introduction of new techniques) without affecting others that pre-exist. For
example, we can incrementally introduce additional segments and events without
affecting the hierarchical summary. Similarly, more semantic objects, such as
stadium and referee, can be introduced at a later stage when there are many sport
videos that share the same stadium and referee.
 Object-Relationship modeling scheme. In particular, we have demonstrated that
inheritance and referencing are important features in video database modeling.
 Semi schema based modeling scheme. As shown in Figure 7, we allow users (or
applications) to add ANY additional elements (or attributes) into a segment
description as long as the element has been declared somewhere else in the proposed
schema, or within other schemas that are within a particular scope. In fact, we may
attach ANY additional elements into other elements in our data model to allow more
flexibility. This stems from the fact that users often know best what they want to
describe and they should therefore be able to add components without the need to
check the pre-defined schema. However, we aim to gradually modify the schema
with new components, especially when the extra information provided by users can
be used to enrich the current video model.
4.2. Sports Categorization
The proposed indexing scheme is robust for all types of sports since all segments and
events are instantiated in the same hierarchical level, while semantic objects are recorded
as a separate entity – independent from any game. Most audiovisual segments such as
whistle, text display, and excitement can be instantiated in various sports. For a sports
genre, the temporal and logical hierarchy structures can be constructed according to its
sports category, which defines the play-break structure and typical events. Based on the
temporal and event structures, we identified four sports categories, namely, period-, set-point-, time-, and performance-based. Table I presents a list of sports that belong to
different categories to demonstrate that almost all sports genres can be classified into the
proposed categories.
Figure 6. ORA-SS Schema for Sport Video Database (Overall) – see Appendix 2 for the full-sized diagram.
Figure 7. ORA-SS Schema for Sport Video Database (Segments).
Figure 8. ORA-SS Schema for Sport Video Database (Syntactic-Semantic Relations).
Period-based sports are typically structured into playing periods such as a ‘half’ in
soccer or a ‘round’ in boxing. We can predict the number of playing periods for each
class of sports. For example, soccer usually has two 45-minute (normal) playing periods.
For this sports category, winners are usually decided based on the final score at the end of
the playing periods; thus the playing periods will be stopped regardless of the score. The
typical events during period-based sports are: goal, foul, and good offensive/defensive.
Set-point based sports such as tennis and volleyball are composed of sets which are
ended each time a player reaches a certain score. For example, a set in tennis is ended as
soon as a player reaches 6 games or more, and the winner of the match is usually decided
when a player has won the required number of sets. Examples of the typical events are: set- and
match- point, long rally, and service ace.
Time-based sports usually involve racing and are structured around laps. Unlike period-based sports, which are usually broadcast as an individual match, this sport class is mostly
broadcast as a portion of a championship or competition. For example, in day eight of
the Australian Swimming National Championship live broadcast, viewers are presented
with multiple race events, such as “men’s semi-final freestyle 50m”. Each race can be
decomposed into one or more laps which are equivalent to a play. The number of laps
during each race can be predicted. For example, a 200m swimming race is usually
composed of 4 laps. Predicting the number of laps in a race can help during highlight
detection as most exciting events, such as overtaking the lead and breaking a record,
happen during the last lap. For this type of sports, the winner is decided based on the
player who has the minimum time to complete the race. Typical events in this category
are: record breaking, start- and end- of race.
Period-based: Soccer (football), basketball, rugby, Australian/American football, hockey, boxing, water polo, handball
Set-point-based: Tennis, table tennis, badminton, volleyball, beach volleyball, baseball, cricket, bowling, fencing, softball, tae-kwon-do, judo, wrestling
Time-based: Swimming, athletics (e.g. running, marathon), car/motor/horse/bike racing, track cycling, sailing, triathlon, canoe- and kayak- flatwater/slalom racing, rowing
Performance-based: Rhythmic gymnastics, track and field (e.g. long/high jump, javelin throw), weight lifting, diving, shooting, synchronized swimming, equestrian
Table I. Sports Categorization which is based on Temporal Structure Similarity
Performance-based sports have a temporal structure which is similar to time-based
sports. For example, in day 21 of the Olympics’ gymnastics, viewers will see different
competitions such as men’s and women’s acrobatic artistic or rhythmic semi-finals. Each
competition will have one or more performances by each competitor. One performance is
equal to a play. We can consider each performance as a key event because there are many
breaks between each performance, such as players waiting for the results. The winner of
this type of sport is determined by the highest average points awarded by judges. Some
typical events in this category are: good-/bad- performance and winning moves.
4.3. Utilizing XML Schema to Construct the Video Model
This section discusses how XML Schema is used to construct the proposed video model.
4.3.1. Scalable Modeling Scheme. Schema can be easily extended by introducing
more elements. Similarly, more types can be defined to refine the schema rather than
using generic data types. In other words, the schema does not need to be complete at the
beginning.
<!-- Initial schema: match statistics are stored as a plain string -->
<complexType name="OverallSummary">
…
<element name="matchStatistics" type="string"/>
…
</complexType>
<!-- Later refinement: a dedicated type replaces the generic string without affecting other parts of the schema -->
<complexType name="MatchStatisticsType">
<sequence>
<element name="numOfGoals" type="integer" minOccurs="0" maxOccurs="1"/>
<element name="numOfFouls" type="integer" minOccurs="0" maxOccurs="1"/>
<element name="numOfShots" type="integer" minOccurs="0" maxOccurs="1"/>
</sequence>
</complexType>
<complexType name="OverallSummary">
…
<element name="matchStatistics" type="vid:MatchStatisticsType"/>
…
</complexType>
4.3.2. Object-Oriented Modeling Scheme. Object/class can be defined as element or
complexType which can be inherited using extension or restriction. A class can be
defined as abstract and substituted with a concrete class. Substitution group is used to
maximize extensibility. For example, a soccer video contains a
PeriodBasedDomainEventCollection, which includes zero or more
PeriodBasedDomainEvent. PeriodBasedDomainEvent is an abstract class which has to be
instantiated by the actual event such as Goal.
<complexType name="PeriodBasedDomainEventCollectionType">
<sequence>
<element ref="vid:PeriodBasedDomainEvent" minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
<element name="PeriodBasedDomainEvent" abstract="true" type="vid:PeriodBasedDomainEventType" />
<complexType name="PeriodBasedDomainEventType" abstract="true">
<complexContent>
<extension base="vid:VideoSegmentType">
<sequence>
<element name="teamBenefitedId" type="vid:SemObjIdType" minOccurs="0" />
<element name="playerBenefitedId" type="vid:SemObjIdType" minOccurs="0" />
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Goal" substitutionGroup="vid:PeriodBasedDomainEvent" type="vid:GoalType" />
4.3.3. Relational Modeling Scheme. Referential integrity can be achieved by
defining a primary key (PK) using key, and defining a foreign key (FK) that is
enforced using keyref.
<key name="sequencePK">
<selector
xpath="vid:sportVideos/vid:sportVideo/vid:sportVideoComponent/vid:segmentCollection/vid:pbSequence"/>
<field xpath="vid:segmentId"/>
</key>
<keyref name="sequenceInCompSummaryRef" refer="vid:sequencePK">
<selector
xpath="vid:sportVideos/sportVideo/vid:hierarchicalSummary/vid:comprehensiveSummary/vid:pbSequence"/>
<field xpath="vid:pbId"/>
</keyref>
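The following sketch (with a hypothetical document name, and element names taken from the instance in Section 4.4.1) shows how a query can dereference this key/keyref pair, resolving each pbId reference in a comprehensive summary back to the full pbSequence segment it points to:

(: join the comprehensive summary references to the segment collection
   via the primary key (segmentId) defined above :)
let $lib := doc("sportVideoLibrary.xml")
for $ref in $lib//comprehensiveSummary/pbSequence
return $lib//segmentCollection/pbSequence[segmentId = $ref/@pbId]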
4.3.4. Schema-Less Element. Instances of Media Description can include more
elements than the schema defines (while maintaining a well-formed structure) to avoid static
and fixed elements. The main benefit is that developers may know a lot, but not
everything that the user knows. Therefore, a class can simultaneously be both schema-based and
schema-less. By matching the element names in an instance against those in the schema,
XQuery can easily list the elements which have not been defined by the schema (a sketch
is given after the schema fragment below).
<complexType name="MediaDescription">
<sequence>
<element name="author" type="string" />
<element name="creationDate" type="date" />
<element name="lastUpdate" type="date" minOccurs=”0” />
<any minOccurs="0"/>
</sequence>
</complexType>
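A minimal sketch of such a comparison is given below; the schema and library document names are hypothetical, and the query simply reports instance elements whose names are not declared under MediaDescription in the schema:

(: list the 'ANY' elements of each media description, i.e. elements present
   in the instance but not declared in the MediaDescription complex type :)
declare namespace xs = "http://www.w3.org/2001/XMLSchema";

let $declared := doc("sportVideoSchema.xsd")
                 //xs:complexType[@name = "MediaDescription"]//xs:element/@name/string()
for $desc in doc("sportVideoLibrary.xml")//mediaDescription
for $extra in $desc/*[not(local-name(.) = $declared)]
return <undeclaredElement>{ local-name($extra) }</undeclaredElement>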
4.3.5. Semi-structured Modeling Scheme. Not all elements need to be instantiated at
one time (video indexing requires gradual extraction due to its complexity and processing
time), and this does not create problems for the tree structure.
<complexType name="SportVideoComponent">
<sequence>
<element ref="vid:segmentCollection" minOccurs="0" maxOccurs="1"/>
<element ref="vid:semanticObjectCollection" minOccurs="0" maxOccurs="1"/>
<element ref="vid:syntacticRelationCollection" minOccurs="0" maxOccurs="1"/>
<element ref="vid:semanticRelationCollection" minOccurs="0" maxOccurs="1"/>
</sequence>
</complexType>
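For example, the following hedged sketch (hypothetical document name) reports which of the optional collections have already been instantiated for each indexed video, which is useful while content extraction is still in progress:

(: report the gradual-extraction progress of each sport video by checking
   which optional collections are present in its sportVideoComponent :)
for $sv in doc("sportVideoLibrary.xml")//PeriodBasedVideo
return
  <extractionStatus video="{ $sv/segmentId }">
  {
    for $name in ("segmentCollection", "semanticObjectCollection",
                  "syntacticRelationCollection", "semanticRelationCollection")
    return <collection name="{ $name }"
                       present="{ exists($sv/sportVideoComponent/*[local-name() = $name]) }"/>
  }
  </extractionStatus>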
4.4. Using XML to Store Video Indexes and Capture Users’ Preferences
In this section, we will demonstrate the use of XML to construct a video database and
user preference.
4.4.1. Sport Video Database
<sportVideoLibrary>
<semanticObjectCollection>
<team>
<teamId>Tm1</teamId>
<teamType>clubTeam</teamType>
<teamShortName>Madrid</teamShortName>
<teamFullName>Real Madrid</teamFullName>
</team>
… (other club and national teams appearing in the library)
<player>
<playerId>Pl1</playerId>
<playerShortName>Ronaldo</playerShortName>
<playerFullName>Luis Nazario de Lima</playerFullName>
<clubTeam>Tm1</clubTeam>
<position>forward</position>
</player>
…(other players appearing in the library)
</semanticObjectCollection>
<PeriodBasedVideo eventName="soccer">
<segmentId>M1</segmentId>
<mediaLocation>
<filePath>D:\mpeg7fromdian\Bra_Ger2-2.mpg</filePath>
<fileName>Bra_Ger2-2</fileName>
<fileExtension>mpg</fileExtension>
<frameStart>00:00.00</frameStart>
<frameEnd>21:44.00</frameEnd>
</mediaLocation>
... (Media description can be added here)
<name>Soccer:worldCup_Bra_Ger</name>
<sportVideoComponent>
<segmentCollection>
<pbSequence>
<segmentId>S1</segmentId>
<mediaLocation>
<frameStart>00:02.00</frameStart>
<frameEnd>00:17.00</frameEnd>
</mediaLocation>
</pbSequence>
… (other segments, such as play, break, face frame, text frame, replay scene)
</segmentCollection>
<syntacticRelationCollection>
<syntacticRelation>
<relationCategory>composedOf</relationCategory>
<sourceSegmentId>M1</sourceSegmentId>
<destinationSegmentId>S1</destinationSegmentId>
<destinationSegmentId>S2</destinationSegmentId>
<destinationSegmentId>S3</destinationSegmentId>
<destinationSegmentId>S4</destinationSegmentId>
<destinationSegmentId>S5</destinationSegmentId>
<destinationSegmentId>S6</destinationSegmentId> …
</syntacticRelation>
… (other syntactic relations)
</syntacticRelationCollection>
<semanticRelationCollection>
<semanticRelation>
<relationCategory>appearsIn</relationCategory>
<sourceSemObjId>Pl19</sourceSemObjId>
<destinationSegmentId>E3</destinationSegmentId>
</semanticRelation>
… (other semantic relations)
</semanticRelationCollection>
</sportVideoComponent>
<overallSummary>
<whatAction>soccer-international-worldcup</whatAction>
<team>TM2</team>
<team>TM1</team>
<where>Unknown</where>
<when>Unknown</when>
<matchStatistics>
<numOfGoals>0</numOfGoals>
<numOfFouls>2</numOfFouls>
… (other match statistics, such as num of shots)
</matchStatistics>
</overallSummary>
<hierarchicalSummary>
<comprehensiveSummary>
<pbSequence pbId="S1">
<highlightEvent eventId="E1"/>
<break breakId="B1"/>
<break breakId="B2"/>
<break breakId="B3"/>
</pbSequence>
… (other play-break sequences)
</comprehensiveSummary>
<highlightSummary>
<summaryThemeList>
<summaryTheme themeId="T0">
<themeContent>soccer</themeContent>
</summaryTheme>
<summaryTheme themeId="T01" parentThemeId="T0">
<themeContent>Foul</themeContent>
</summaryTheme>
… (other summary themes)
</summaryThemeList>
<highlightCollectionList>
<highlightCollection themeId="T01">
<highlightEventId>E1</highlightEventId>
… (other highlight events within the same summary theme)
</highlightCollection>
… (other highlight collections, such as soccer goals)
</highlightCollectionList>
</highlightSummary>
</hierarchicalSummary>
</PeriodBasedVideo>
… (other sports videos)
</sportVideoLibrary>
It should be noted that “hierarchical summary” is generated dynamically using
XQueries which will be described in Section 5.
4.4.2. User Preference
<userPreference>
<user userName="dian">
<sportGenres>
<sportGenre priorityLevel="1" Name="soccer">
<events>
<event priorityLevel="1">Goal</event>
<event priorityLevel="2">Shot</event>
<event priorityLevel="3">Goodplay</event>
<event priorityLevel="4">Foul</event>
… (Other events preference)
</events>
<players>
<player priorityLevel="1">Pl11</player>
<player priorityLevel="2">Pl8</player>
<player priorityLevel="3">Pl10</player>
… (Other players preference)
</players>
<segmentPref>
<segment priorityLevel="1">excitement</segment>
<segment priorityLevel="2">playScene</segment>
<segment priorityLevel="3">text</segment>
<segment priorityLevel="4">breakScene</segment>
… (Other segments preference)
</segmentPref>
</sportGenre>
… (Other sports genre preference)
<outputPref>
<maxOutputDuration>1000</maxOutputDuration>
<maxWaitDuration>P1H30M</maxWaitDuration>
</outputPref>
</user>
… (other users)
</userPreference>
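As a hedged sketch of how such preferences can drive retrieval without the user typing a query (document names are hypothetical, and events are assumed to be stored under the elided period-based event collection of each soccer video), the query below reads a user's top-priority soccer event type and returns the matching events from the library:

(: select the user's highest-priority soccer event type and retrieve all
   indexed events of that type :)
let $genre := doc("userPreference.xml")
              //user[@userName = "dian"]/sportGenres/sportGenre[@Name = "soccer"]
let $topEvent := $genre/events/event[@priorityLevel = "1"]/text()
for $sv in doc("sportVideoLibrary.xml")//PeriodBasedVideo[@eventName = "soccer"]
return $sv//PeriodBasedDomainEventCollection/*[local-name() = $topEvent]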
4.5. Accommodating MPEG-7 Standard Descriptions
MPEG-7 standard [Chang et al. 2001, Manjunath et al. 2002, Pereira 2001] creates the
possibility of having a standardized method of describing multimedia content. To support
our video indexing framework, MPEG-7 provides tools that describe the structure and
semantic of a single multimedia document. Content structure tools describe the segments,
the hierarchical decompositions and structural relationships among the segments. They
include: segment entity, segment attribute, segment decomposition and structural relation
tools. Video media (audiovisual segment) can be decomposed into audio- and video (i.e.
image sequence) - segment and moving (object) region. Content semantics tools describe
semantic entities including object, event, concept, semantic-state, semantic-place, and
semantic-time, as well as the semantic attributes and relations. For rapid navigation and
browsing capabilities and flexible access, MPEG-7 provides hierarchical- and
sequential- summary description schemes, as well as user views and variations. To
describe user interaction, MPEG-7 provides user preference and usage history
description schemes.
We have extended the current MPEG-7 description framework by: (1) emphasizing
the use of referenced- instead of nested- elements; (2) allowing users to add any elements
on top of the existing schema (i.e. semi-schema based); (3) introducing a comprehensive
summary as an example of a customized hierarchical summary. Compared to nested
elements, referencing can promote more scalable and flexible instantiations of the
description. As current automatic content extraction techniques are not yet fully
mature, we should allow multiple passes without the complexity of nesting
elements. Allowing ‘any elements’ gives users more independence for customizing their
own descriptions, without the necessity of having to ask administrators to update the
schema. Since users know which elements they have added, they can write their own
queries to suit the customized elements. The benefits derived from the comprehensive
summary have been discussed in section 3.3. In addition to these three major extensions,
we have also introduced multi-semantic layers to classify events into generic, specialized,
and customized events to signify the fact that these semantic layers should be detected,
indexed, and queried in different ways. For example, an exciting event (generic) in a
soccer game cannot be directly associated with a goal scorer as it has to be classified into
a soccer goal to suit the context.
In MPEG-7, a VideoSegment can contain: SpatialDecomposition to MovingRegion;
TemporalDecomposition to VideoSegment and StillRegion; SpatioTemporalDecomposition
to Moving/Still Region; and MediaSourceDecomposition to VideoSegment. Each of these
segments can be identified and referenced by an ID, XPath
or HREF (any URI). Within any segment, we can specify a Relation (which identifies the
source and target) with another segment or semantic. The MPEG-7 semantic entities,
AgentObjectType and EventType, can be used to describe semantic objects such as soccer
player and events such as soccer goal. These entities can be entered and reused while
SemanticRelations can be graphically described by connecting the respective semantic
entities. Unlike the MPEG-7 approach, we separate the segments (which is an example of
object declaration) from the relationships with other segments and semantics. This is to
keep every element simple, atomic, self-contained, and independent; thereby promoting
the extensibility of the model. Moreover, in our model, we have distinguished a
segment’s collection from collections of semantic-object, syntactic-relations and
semantic-relations so that users can easily browse over the database.
The current MPEG-7 schema model allows referencing in HierarchicalSummary as it
may contain SourceID to identify the original content, SourceLocator to locate the
original content, and SourceInformation to reference elements of the description of the
original content. We have expanded this referencing scheme by referencing the elements
of each SummarySegment to the original segment. The main benefit of this approach is to
retain the existing attributes and relationships between the original-referenced segment
and the (linked) segments and semantic objects. This is so that simpler queries can be
written if we want to produce customized summaries which are based on the hierarchical
summary, while showing all of the related segments and semantic objects. In MPEG-7,
HierarchicalSummary can be used to implement sports video summarization such as
highlights, synopses, and previews. Users can also generate their summaries by
bookmarking segments while virtual programs can be constructed by combining
segments from different programs. Unlike MPEG-7, we do not wish to store summaries;
thus, we will demonstrate in section 5 and 6, the use of XQuery to generate dynamic
summaries.
Previously in Section 4.4.2, we also introduced a customized user preference with some
suggested extensions to MPEG-7’s UserPreference. For dynamic user-oriented
summaries, we emphasize the need to capture users’ preferences on the summaries and
duration-of-viewing while our usage log captures users’ viewing behaviours such as
browsing and viewing. For example, we can generate automated summary preferences
based on the most ‘viewed’ and ‘replayed’ segments.
MPEG-21 defines an open multimedia framework which is composed of two essential
concepts, namely: (1) the definition of a fundamental unit of distribution and transaction,
being the Digital Item (DI) such as a video collection; and (2) the concept of users
interacting with DI. MPEG-21 can complement MPEG-7 to personalize the selection and
delivery of multimedia content for the individual users because MPEG-21 Digital Items
Adaptation (DIA) provides tools to support resource and descriptor adaptation and
quality-of-service management [Tseng et al. 2004]. As the proposed indexing scheme is
based on XML schema, it can easily benefit from some of the existing standard
frameworks. For this purpose, the MPEG-21 DID (DI declaration) can be used as a
means for integrating different descriptive schema into one descriptive framework
[Kosch 2004]. As an example, since XML schema was not designed specifically for AV
data, certain MPEG-7 extensions have been added to describe: array and matrix data
types, and built-in primitive time data types. Array and matrix are suitable to describe
sound and image content, while time is the unique dimension of video content. In
general, we can always use standard MPEG-7 descriptors to replace the XML-based
descriptors and make the indices more MPEG-7 compliant for when the standard becomes
widely used.
For our current database, we choose not to apply any MPEG-7 descriptions as they
are strongly-typed and schema-based. Users need to trace the pre-defined schemas and
descriptors to become familiar with the descriptions; whereas our indexing framework
emphasises the flexibility of schema-less models. However, as shown in the example
below, we will demonstrate how the proposed indexing framework can utilise MPEG-21
to combine MPEG-7 descriptors as a part of the “Any” elements.
<?xml version="1.0" encoding="UTF-8"?>
<DIDL xmlns="urn:mpeg:mpeg-21:2002:01-DIDL-NS"
xmlns:mpeg7="urn:mpeg:mpeg7:schema:2001">
<Container>
<Descriptor>
<Statement mimeType="text/plain">My Sports Video Collections</Statement>
</Descriptor>
<Item id="SportVideo1">
<SoccerVideo>
…
<OverallSummary>
…
<!-- MPEG-7 description embedded as part of the "ANY" elements -->
<mpeg7:Mpeg7>
<mpeg7:DescriptionMetadata xsi:type="mpeg7:DescriptionMetadataType">
<mpeg7:Comment xsi:type="mpeg7:TextAnnotationType">
<mpeg7:KeywordAnnotation>Australian Soccer</mpeg7:KeywordAnnotation>
…
</mpeg7:DescriptionMetadata>
5. DYNAMIC SUMMARIES AND USER-ORIENTED QUERIES
Users would need to understand the query language's syntax and the data indexing
schema before they can write queries for their retrieval requirements. A better alternative
is to compress the video’s long sequence into a more compact representation through a
summarization process. Using the summaries, users can browse and navigate through the
segments that they want to watch as well as specifying some parameters to refine the
search. We can assist users in selecting what they want to watch by constructing ‘queriesgenerated’ dynamic-summaries, event-structures, and customized summaries. Summaries
should be constructed dynamically and based on extracted segments and features instead
of storing them as a static index. This is to allow gradual content extraction as not all
segments and semantic objects can be extracted or defined at one time. In this section, we
will demonstrate the use of XQuery to construct dynamic summaries and user-oriented
queries. In total, there are 6 sample queries which have been fully tested. All queries have
been designed to demonstrate the benefits of the proposed video data model (segmentevent based) while focusing on potential user/application requirements.
XQuery 1.0 [Boag et al. 2004, Brundage 2004] has been actively developed by the W3C (World Wide Web Consortium) XML Query and XSL Working Groups since 1998. XQuery blends long-established query-language constructs with features designed specifically for XML. It provides several features that are particularly powerful for XML-based video retrieval, and these are demonstrated alongside the examples in this section:
• Composition: temporary XML results can be constructed in the middle of a query and then navigated into. This is particularly useful for constructing dynamic summaries and views for browsing.
• Procedural approach: user-defined functions (or modules), including recursive ones, can be written and combined with built-in functions, allowing queries to be designed in a style similar to procedural programming. Writing functions also improves readability (repetitive query portions are written once as a function), share-ability (user-defined functions can be reused by other users/applications), and extensibility (a partial or whole query can be encapsulated into a function and then used to build more complex queries). A sketch of such a helper function is given at the end of this introduction.
• Tree and sequence: queries can retain the original tree (hierarchy) structure, and can take into account the order (sequence) of appearance of elements. This is particularly important to support video hierarchy views and the temporal ordering of video segments.
This section mainly aims to demonstrate the benefits of our model in terms of generating user-oriented dynamic summaries. Most of the queries are dynamically calculated summaries (or views) that users do not have to write themselves; for example, Q5 and Q6 require very little user interaction.
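To illustrate the procedural approach, the following is a minimal sketch of the kind of user-defined helper that the query fragments in this section rely on. It is an assumption given for illustration only: the actual definition of local:timetoint is not reproduced in this paper, and the "hh:mm:ss" frame-stamp format assumed here may differ from the format stored in our index.
(: Illustrative sketch only: convert an assumed "hh:mm:ss" frame stamp into a number of
   seconds so that durations can be computed by subtraction, as fragments 2.1 and 3.4 do.
   The real local:timetoint used by our system is not listed in this paper. :)
declare function local:timetoint($t as xs:string) as xs:integer {
  let $p := tokenize($t, ":")
  return (xs:integer($p[1]) * 3600) + (xs:integer($p[2]) * 60) + xs:integer($p[3])
};
A query module can then call, for example, local:timetoint($event//frameStart) and subtract two such values to obtain a duration.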
5.1. Summary 1 (Q1): Comprehensive Summary.
Build the Comprehensive Summary of a particular sport match with all the details. The
graphical interface of the browser is depicted in Figure 9. We utilize XQuery to
automatically generate (and store) the structure of a video’s comprehensive summary.
The comprehensive summary encapsulates the relationships between different segments.
A Play-Break (PB) sequence may contain several play- and break- segments. The query
automatically collects all segments within each PB sequence and presents them in a
hierarchical order. By modifying this query, we can also generate other hierarchical structures such as a highlight summary.
Since the generated comprehensive summary comprises objects which are referenced from the segment collections, we can obtain all the segments' details and construct a summary that supports useful browsing. Using the browsing structure, users can peruse a sport video by PB sequences (like CD audio tracks). For each track, a user can check whether it contains a highlight event, and then watch the entire track, its play and break shots, or just the event as a shorter alternative.
Figure 9. (From left to right) Screenshots of Q1 and Q2
5.1.1. Algorithm 1: obtain all details of the comprehensive summary
Declare variable $filterMatchId in order to filter out unwanted matches (fragment 1.1)
Find the specified match in the database; return all matches if the user did not specify a matchId (fragment 1.2)
Loop over all PB sequences in the Comprehensive Summary (CS) of the match
    Based on the CS, get the SegmentID of all events and segments contained within the PB (fragment 1.3)
    Display the elements of the sequence (fragment 1.4)
    Loop over all events in the sequence, by matching Segment ID (fragment 1.5)
        Display the file name of the match
        Display the elements of the event ordered by frame start time stamp
    Loop over all plays in the sequence, by matching Segment ID (fragment 1.6)
        Display the file name of the match
        Display the elements of the play ordered by frame start time stamp
    … Apply modified fragment 1.5 for key frames and other segments within a play
    … Apply modified fragment 1.5 for breaks and other segments within a sequence, such as face
Fragment 1.1:
let $filterMatchId := request:request-parameter("filterMatchId", "")
Fragment 1.2:
let $vidLib := /sportVideoLibrary,
$matches := $vidLib//*[ ends-with(lower-case(name()), "video")],
$matches := if ($filterMatchId != "") then $matches[segmentId = $filterMatchId] else
$matches
Fragment 1.3:
for $PB in $vid//comprehensiveSummary/pbSequence
let $EVENT := $PB/highlightEvent,
$PLAY := $PB/play,
$BREAK := $PB/break
Fragment 1.4:
for $sequence in $vid//segmentCollection/pbSequence[segmentId = $PB/@pbId]
return $sequence/*
Fragment 1.5:
for $event in $vid//*[ ends-with(lower-case(name()), "domaineventcollection")]//*[segmentId=
$EVENT/@eventId]
order by $event/mediaLocation/frameStart
return <eventInPb>{$filename}{$event/*}</eventInPb>
Fragment 1.6:
for $play in $vid//segmentCollection/playScene[segmentId = $PLAY/@playId]
order by $play/mediaLocation/frameStart
return <playInPb>{$filename}{$play/*}</playInPb>
XQuery supports composition to combine query results into a structure. As an example, the following shows the composition skeleton for Q1, into which the fragments above are inserted.
<results>
{
  fragment 1.1
  fragment 1.2
  return
  <match id="{$vid/segmentId}">
  {
    fragment 1.3
    return
    <pb>
      { fragment 1.4 }
      { fragment 1.5 }
      { fragment 1.6 }
    </pb>
  }
  </match>
}
</results>
5.2. Summary 2 (Q2): Event Summary
For each event in a particular sport match, produce an Event Summary (its details and its play- and break- ratios). The graphical interface of the browser is depicted in Figure 9.
As shown in Figure 10, segment-based statistics, such as the excitement ratio, can be used effectively to describe a highlight as well as to compare two or more events, which may help viewers decide which events are more interesting. For example, by comparing the statistics of Goal 4 and Goal 10, users can predict that Goal 10 is potentially more interesting since it contains more excitement (audio) and has a longer slow-motion replay. However, Goal 4 may have more exciting plays in the goal area due to the longer duration of goal-area segments found during the play.
Figure 10. Statistics-based Annotation on Events.
5.2.1. Algorithm 2
Declare variable $filterMatchId in order to filter out unwanted matches (fragment 1.1)
Loop over all events in the eventCollection within the video
    Calculate the event duration (fragment 2.1)
    Display the event details and the player name related to the event (fragment 2.2)
    Calculate the play ratio (fragment 2.3)
    … Apply modified fragment 2.3 to calculate the break ratio and other statistics such as the excitement ratio (a sketch is given after fragment 2.3)
The following fragments cover the portions of the algorithm that have not been described previously (refer to Q1 for fragment 1.1).
Fragment 2.1:
for $event in $vid/*[ ends-with(lower-case(name()), "eventcollection")]/*
let $eStart := local:timetoint($event//frameStart),
$eEnd := local:timetoint($event//frameEnd),
$eventDuration := $eEnd - $eStart + 1
order by $event//frameStart
Fragment 2.2:
for $player in $event/*[ends-with(lower-case(node-name(.)),'playerid')]
return
element {replace(node-name($player), "Id", "ShortName")}
{ local:idToPlayerShortName(data($player)) }
Fragment 2.3:
for $play in $vid//playScene
let $pStart := local:timetoint($play//frameStart), $pEnd := local:timetoint($play//frameEnd),
$pDur := $pEnd - $pStart
where ($pStart >= $eStart) and ($pEnd <= $eEnd)
return
if (count($play) >= 1)
then (<playRatio>{100*($pDur cast as xs:float) div ($eventDuration cast as
xs:float)}</playRatio>)
else()
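As noted in Algorithm 2, fragment 2.3 can be adapted to the break ratio and other statistics. The following is a minimal sketch of such an adaptation; the breakScene element name is an assumption made by analogy with playScene and is not taken from the fragments above.
(: Hedged sketch of a modified fragment 2.3 for the break ratio; breakScene is an
   assumed element name, chosen by analogy with playScene. :)
for $break in $vid//breakScene
let $bStart := local:timetoint($break//frameStart),
    $bEnd := local:timetoint($break//frameEnd),
    $bDur := $bEnd - $bStart
where ($bStart >= $eStart) and ($bEnd <= $eEnd)
return <breakRatio>{100*($bDur cast as xs:float) div ($eventDuration cast as xs:float)}</breakRatio>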
5.3. Summary 3 (Q3): Customized (User-preference based) Summary.
Construct a customized summary (according to the user preference), processing the first preference, second preference, and so on, until the maximum output duration is reached. The GUI of the browser is depicted in Figure 11.
As shown in Section 4.2, users can choose their favorite types of events, players and segments via a stored user preference. This summarization is particularly useful when users want to specify the total duration of the summary while retaining full control over the segments, events and players that they will watch. This query can be supported by XQuery, and the resulting query serves as an example of how XQuery can express a complex query. Due to this complexity, the following provides the design rationale. First, obtain the user preference (i.e. priority level) for the types of summary, event, player, sport and segment. In the example discussed in Section 4.2, the user's first preference is the soccer game, and the user prefers to watch events in the order of soccer goal, soccer shot, soccer good play and soccer foul. Besides that, the user prefers to watch certain players' performances and specific segments such as excitement and text. While the maximum output duration has not been reached, the system displays each item. The second preferred sport is basketball, to which a similar structure applies. Segments are displayed until the maximum output duration is reached.
5.3.1. Algorithm 3
Get the user preference for the specified user name
    * the example shows that the query returns results according to the preference of user 'dian'
Determine the start and end of the summaries and the initial total duration, and find maxOutputDuration (fragment 3.1)
Loop from the start to the end of the preferred sportGenre list (e.g. soccer, basketball)
    Determine the start and end of each type of user preference (segment, player and event) and the maximum output duration
    Loop over all videos in the video database which match the preferred sportGenre (fragment 3.2)
        Loop from the start to the end of the preferred segments (fragment 3.3)
            Loop over all matching segments in the video database
                Display the segment details according to preference order (fragment 3.4)
    … Apply modified fragments 3.3 and 3.4 for the event summary and the player summary
Figure 11. Screenshots of Q3: a) Form to specify sports preference, b) Form to
specify events preference in a particular sport, c) The returned summary
Fragment 3.1:
let $vidLib := /sportVideoLibrary
let $pref := local:getUserPref($userName),
$summStart := 1, $summEnd := count($pref//sportGenre),
$total := 0,
$outputPref := $pref/outputPref//maxOutputDuration cast as xs:integer
Fragment 3.2:
for $summLevel in $summStart to $summEnd
let $summaryPref := $pref//sportGenre[@priorityLevel=$summLevel],
$outputPref := $pref/outputPref//maxOutputDuration cast as xs:integer,
$sportsGenre := $summaryPref/@Name,
$playerStart := 1, $playerEnd := count($summaryPref//player),
$segmentStart := 1, $segmentEnd := count($summaryPref//segment),
$eventStart := 1, $eventEnd := count($summaryPref//event),
$videos := $vidLib//*[ends-with(lower-case(name()), "video")][@eventName=$sportsGenre]
Fragment 3.3:
for $segmentLevel in $segmentStart to $segmentEnd
let $segmentPref := $summaryPref//segment[@priorityLevel=$segmentLevel]
Fragment 3.4:
for $video in $vidLib//*[ ends-with(lower-case(name()),"video")][@eventName=$sportsGenre]
let $filename := $video/mediaLocation/filename,
$segments := $video//*[name() = $segmentPref ]
where count($segments) > 0
return
<match id="{$video/segmentId}" filename="{$filename}">
{
for $segment in $segments
let $duration := local:timetoint($segment//frameEnd) - local:timetoint($segment//frameStart) + 1,
$total := local:calcTotalDur($total cast as xs:integer, $duration cast as xs:integer)
return
if ($total <= $outputPref)
then ($segment)
else()
}
</match>
5.4. Summary 4 (Q4): Favorite Players and Matches Summary
Construct a customized summary showing a user's favorite players and matches (specified by the user's parameters). The GUI of the browser is depicted in Figure 12.
In this query, users can specify the types of their favorite matches and/or players. For example, some users only like to watch a 'goal feast' match, which is a match with more than N goals. The following demonstrates examples of summary requirements that can be specified by a user preference or user inputs. Note that the scope of best/most is the specified sportVideoLibrary(s).
<Favorite player> "actors of"
<bestStriker> [>N goals] [player position = striker/forward]
<mostUnluckyStriker> [>N shots in 1 match] [player position = striker/forward]
<mostDangerousTackler> [>N fouls in 1 match]
<mostDisciplinedPlayer> [<N fouls in 5 matches] [player position = defender/centre back]
<Favorite match> "contains"
<goalFeast> [>N goals]
<flowingGame> [<N fouls] or [<N whistles]
<crowdPleaser> [>N excitement] and [not exist in not(FlowingGame)]
<nonQuietGame> [<N goals] and [<N shots] and [>N non]
5.4.1. Algorithm 4
Declare and initialize variables in order to receive parameters from user input (fragment 4.1)
    * the example shows that the user defined matches with more than 2 goals as goalFeast matches, and matches with fewer than 10 whistles and fouls as flowing matches
A. Process goalFeast matches (fragment 4.2)
    Loop over the sportVideoLibrary
        Count all goal events in the video
        If the number of goals is greater than the parameter input by the user
            Display the details of the video as a goalFeast match
B. Process flowing matches (fragment 4.3)
    Loop over the sportVideoLibrary
        Count all foul events and whistle segments in the video
        If the number of fouls and the number of whistles are less than the parameters input by the user
            Display the details of the video as a flowingMatch
… Using a similar approach to A and B, process the other favourite matches and favourite players (a hedged sketch for bestStriker is given after fragment 4.3)
Fragment 4.1:
let $thresholdGoals := request:request-parameter("thresholdGoals", 3),
$thresholdFouls := request:request-parameter("thresholdFouls", 10),
$thresholdWhistle := request:request-parameter("thresholdWhistle", 10)
Fragment 4.2:
let $vidLib := /sportVideoLibrary
for $vid in $vidLib//*[ ends-with(lower-case(name()), "video")]
return
let $gT := $thresholdGoals,
$goal := $vid//*[ ends-with(lower-case(name()), "domaineventcollection")]//*[name() =
"Goal"],
$numGoals := count($goal)
return
if ($numGoals >= $gT)
then (<match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}">
{<numOfGoals>{$numGoals}</numOfGoals>}{$vid/mediaLocation}</match>)
else ()
Fragment 4.3:
let $vidLib := /sportVideoLibrary
for $vid in $vidLib//*[ ends-with(lower-case(name()), "video")]
return
let $fT := $thresholdFouls,
    $wT := $thresholdWhistle,
    $foul := $vid//*[ ends-with(lower-case(name()), "eventcollection")]//*[name() = "Foul"],
    $whistle := $vid//segmentCollection//*[name() = "whistle"],
    $numFouls := count($foul),
    $numWhistle := count($whistle)
return
if ($numFouls <= $fT and $numWhistle < $wT)
then (<match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}">
{<numOfFouls>{$numFouls}</numOfFouls>}
{<numOfWhistle>{$numWhistle}</numOfWhistle>}
{$vid/mediaLocation}</match>)
else ()
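As foreshadowed in Algorithm 4, the remaining favourite-player and favourite-match categories follow the same counting pattern. The fragment below is a hedged sketch for bestStriker, using the $thresholdGoals parameter from fragment 4.1; it assumes that goal events carry a goalScorerId child and that player objects carry a position child, neither of which is confirmed by the fragments above.
(: Hedged sketch for <bestStriker>: strikers/forwards who scored more than $thresholdGoals
   goals across the library. goalScorerId and position are assumed element names. :)
let $vidLib := /sportVideoLibrary
for $player in $vidLib//player[contains(lower-case(position), "striker") or
                               contains(lower-case(position), "forward")]
let $goals := $vidLib//*[ends-with(lower-case(name()), "domaineventcollection")]
              //*[name() = "Goal"][goalScorerId = $player/playerId]
where count($goals) > $thresholdGoals
return <bestStriker playerId="{$player/playerId}" goals="{count($goals)}"/>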
Figure 12. Screenshots of Q4: a) Form to enter user’s parameters; b) The generated Summary
5.5. Summary 5 (Q5): Player's Event Summary.
This query lists a certain player's details (e.g. Ronaldo's), such as full name, short name and position, together with any event segments related to him, regardless of the match they occur in. It determines all specified event segments in all matches in which the selected player appeared, and also displays a count of that event type (e.g. goals). This query lets a user track a certain player's performance, such as how many goals he scored or how many fouls he made across all matches in the video database. The GUI of the browser is depicted in Figure 13.
5.5.1. Algorithm 5
Declare and initialize variables in order to receive parameters from user input (fragment 5.1)
    * the example shows that the user would like to watch all 'goals' scored by 'Ronaldinho' (or Pl10)
Locate the player details based on the parameter provided by the user (fragment 5.2)
Display the player's details, including fullName, shortName and position (fragment 5.3)
Loop over all goal events which are scored by the player
    Calculate and display the total number of goals (fragment 5.4)
    Display the details of the goal events
Figure 13. Screenshots of Q5: a) Form to enter the preferred player and event, b) The Summary showing the
player details (in future, a close-up image for each player will be provided)
Fragment 5.1:
let $player := request:request-parameter("player", "Pl10"),
$event := request:request-parameter("event", "Goal")
Fragment 5.2:
let $vidLib := /sportVideoLibrary,
$playerElement := $vidLib//player[playerId=$player]
Fragment 5.3:
<Details>
{$playerElement//playerFullName}
{$playerElement//playerShortName}
{$playerElement//position}
</Details>
Fragment 5.4:
for $vid in $vidLib/*[ ends-with(lower-case(name()), "video")]
let $events := $vid//*[ ends-with(lower-case(name()), "domaineventcollection")]//*[name() =
$event],
$playerEvents := $events[*[ ends-with(lower-case(name()), "playerid")] =
$playerElement//playerId]
where count($playerEvents) > 0
return
<match id="{$vid/segmentId}" filename="{$vid/mediaLocation/fileName}">
{
for $playerEvent in $playerEvents
return
$playerEvent
}
</match>
5.6. Summary 6 (Q6): Players by Team Summary.
Find all video segments in which each player in all club or country teams appears. The appearance results are grouped by video, segment and then event; within each group, results are sorted by location (from frameStart to frameEnd). The GUI of the browser is depicted in Figure 14. This query is particularly designed to show that Q5 can easily be extended into a more complex query; that is, XQuery is scalable thanks to the procedural approach.
Figure 14. Screenshot of Q6
5.6.1. Algorithm 6
Loop over all players who belong to the club teams or country teams within the video
    Store all elements of the players and clubs or countries into variables
    Display the player's details: fullName, shortName, position (fragment 6.1)
    Display the name of the team to which the player belongs
    Loop over all segments and events in which the player appears
        Calculate and display the total number of events
        Display the details of the events ordered by frameStart time stamp (fragment 6.2)
Fragment 6.1 (modified from fragments 5.1 and 5.2) - PlayersByClubTeamSummary:
for $clubPlayer in $vidLib//player[(countryTeam = $club/teamId) or (clubTeam =
$club/teamId)]
let $playerId := $clubPlayer//playerId
return
<player>
{$clubPlayer/*}
{for $team in $clubPlayer/countryTeam
return
<countryTeamName>{local:idToTeamName(data($clubPlayer/countryTeam))}
</countryTeamName>}
{for $team in $clubPlayer/clubTeam
return
<clubTeamName>{local:idToTeamName(data($clubPlayer/clubTeam))}
</clubTeamName>}
</player>
Fragment 6.2 (modified from fragment 5.3) - AppearancesListedByMatch:
for $vid in $vidLib/*[ends-with(lower-case(name()), "video")]
let $eventCollection := $vid//*[ ends-with(lower-case(name()), "domaineventcollection")],
$appS :=$vid//semanticRelation[sourceSemObjId =
$clubPlayer//playerId]/destinationSegmentId,
$appE := $eventCollection//*[ends-with(lower-case(local-name(.)),"playerid") ][data(.) =
$playerId]
where count($appS) + count($appE) > 0
return
<Match id="{$vid/segmentId}" filename="{$vid//fileName}">
{if (count($appS) > 0)
then
(<Segments>
{for $segAppId in $appS
let $sequence := $vid//segmentCollection/*[segmentId = $segAppId]
order by local:timetoint($sequence//frameStart)
return $sequence}
</Segments>)
else ()}
{if (count($appE) > 0)
then
(<Events>
{for $event in $eventCollection/*
where $event//*[ends-with(lower-case(local-name(.)),"playerid") ] =
$clubPlayer//playerId
order by local:timetoint($event//frameStart)
return $event}
</Events>)
else ()}
</Match>
6. SYSTEM IMPLEMENTATION AND EVALUATION
6.1. Experimental Data
The current implementation of the system has used soccer (3 x 20 minutes), tennis (20
minutes), swimming (2 x 5 minutes), and diving (10 minutes) sports data to ensure that
the indexing and retrieval schemes are generic and robust. As described in section 4.2,
these sports are chosen to represent the four main categories of sports: period-, set-point-,
time (race)-, and performance-based sports. Thus, the proposed sports video retrieval system should be generic for most, if not all, sport genres, such as basketball, badminton, running, and gymnastics.
6.2. Semi-automatic Content Annotation for Indexing
After all features and semantics are extracted using the automated algorithms, an XML file containing segment, event and object information is generated for each sport video according to our schema. Some information, such as player names, is entered manually by an annotator or downloaded from related websites (such as goal.com). To generate the list of relationships between players and events (i.e. annotating each event with its actors), the index is parsed and visualized in a graphical user interface that assists annotators in determining the relationships. For each (or a specific) event and segment, users can simply drag-and-drop the names of the players who are involved in (or appear in) it; a sketch of the resulting index entry is shown at the end of this subsection. For future work, we will investigate the use of speech recognition and video optical character recognition to automatically detect the players' names mentioned during a particular video event. However, until these techniques are fully reliable, some manual intervention will still be needed. Thus, the main challenge is to reduce the human workload in the indexing process as much as possible by obtaining external information and taking advantage of community contributions.
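For concreteness, the following is a minimal sketch of the kind of relationship entry that the drag-and-drop step adds to the index, using the semanticRelation, sourceSemObjId and destinationSegmentId names that fragment 6.2 later queries; the relationName element and the identifier values are illustrative assumptions.
(: Sketch of an annotation result: link player Pl10 to a segment in which he appears.
   The relationName element and the identifier values are illustrative only; the other
   element names match those queried in fragment 6.2. :)
let $playerId := "Pl10", $segmentId := "Seg42"
return
  <semanticRelation>
    <relationName>appearsIn</relationName>
    <sourceSemObjId>{$playerId}</sourceSemObjId>
    <destinationSegmentId>{$segmentId}</destinationSegmentId>
  </semanticRelation>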
6.3. Content Delivery for Retrieval
The proposed sports video retrieval system has been implemented as a Web-based application, allowing the user to browse the information by jumping from page to page using hyperlinks. This paradigm is already well known and transfers easily to portable devices that have a small screen, only a few keys and no mouse. The interface (in the form of web pages) can be designed to allow browsing on PDAs and mobile phones using WAP technology, the equivalent of HTML for mobile/wireless devices. With the intention of using interoperable standards, the Application Server is designed to run on Tomcat 5.5 using the Java 1.5 SDK platform.
When a user connects to the server, the main page lists generic summaries that apply
to the whole library and a selection of (recently added) videos. For each video, some
information and key frame(s) are displayed, and the queries are presented as links. The
information for each Graphical User Interface (GUI) is retrieved by XQuery from the
database and the query results are rendered by the Content Delivery Server using XSL
transformations into an HTML document (to be explained below). When the user selects
a query (or clicks on a link in general), the request is transmitted and handled by a servlet
of the Content Delivery Server. Likewise, this will lead to executing an XQuery and
rendering the results back to the user. Some of the queries are filtered by parameters, making it possible for users to personalize the results. Such queries use forms that invite the user to fill in the parameters before showing the actual results (for example, Q4 and Q5 in Section 5). Each form is populated by an XQuery in order to offer predefined (and existing) drop-down options/lists to the user instead of requiring him/her to type the information. When the user submits the parameter form, the parameters are passed to the actual XQuery that generates the expected result. The result is then transformed into an HTML page and returned to the user. When the user clicks on a video link, a specialized servlet of the Content Delivery Server generates an ASX play-list file that contains the URL of the actual video segment. This play-list is downloaded and opened by the associated video player (depending on the user's operating platform).
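To make the last step concrete, the following XQuery sketch shows how such a play-list entry could be assembled from a segment's media location. It is illustrative only: the http://server/media/ URL prefix and the segmentId request parameter are assumptions, and the actual View servlet builds this file in Java rather than in XQuery.
(: Illustrative sketch: build a minimal ASX play-list entry for one segment.
   The URL prefix and the segmentId parameter are assumptions; the real View
   servlet generates this file in Java. :)
let $segId := request:request-parameter("segmentId", ""),
    $seg := (/sportVideoLibrary//*[segmentId = $segId])[1]
return
  <asx version="3.0">
    <entry>
      <title>{string($seg/segmentId)}</title>
      <ref href="http://server/media/{string($seg/mediaLocation/fileName)}"/>
    </entry>
  </asx>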
Summary browsers (Q1 to Q6) form the most important parts of the GUI. A query
usually returns information structured as a tree, and the nodes on that tree contain more
detailed content. To allow the user to browse this content efficiently, each of the folder
nodes can be opened to show their children nodes, and so on. Clicking on a node can
show more detailed information and content about it in the bottom part of the screen. For
example, if a query returns a set of players grouped by match, clicking on a match will
show the location and date of that match, plus some keyframes, links to match-specific
queries and to the whole match video. In order to avoid slow interaction with the server
while browsing, the results of a query are self-contained in the summary browser page.
That way, the page changes in real time, responding to the user's actions, as long as
he/she does not click on a link that leads to a video or another query. This is made
possible by developing the summary browser in Dynamic HTML, which is driven by
JavaScript. Moreover, the content of all the nodes is rendered and embedded in the page,
waiting for the user to view them when he/she wants to.
The Content Delivery Server is a session-aware Tomcat web application (developed
as Servlets in Java) which keeps the context for each connected user and adapts the
navigation according to it. For example, some queries will be influenced by the user's
preferences, and the representation of information and content will be adapted to the
user's platform/device. Figure 15 describes the typical interaction scenario. For the
requirements of this scenario, the following describes the servlets that form the content delivery server. The Config class is included in all Servlets, providing the necessary
configuration information such as the paths to data and common functions. It also defines
the list of queries. The PageRenderer class is also included in all Servlets and used as an
abstraction layer for final device-adapted output. It provides functions that process the
XSL transformations from XML data in order to render final HTML pages from query
results. The SelectVideo servlet drives the rendering of the main page. The QueryForm
servlet drives the rendering of parameters' page for parameterized queries. The
ViewResult servlet drives the rendering of the summary browser after execution of a
query. The ViewKeyFrame servlet returns image files for a given timestamp and a given
video. It is used by the summary browser. The View servlet generates the ASX playlist
file that gives access to a video segment. The UserPref servlet drives the rendering,
interaction and data manipulation for the User Preferences GUI. It generates XUpdate
documents and submits them to the database server to update the preferences.
6.3.1. Dynamic GUI generation. In order to allow simple extensibility and re-usability
of the system and the library, the GUI has been designed modularly. Figure 16 describes
the process of generating the dynamic GUI. Figure 17 describes the processing data flow
that takes the query results to the final visually-rich HTML pages. The steps consist of
XML transformations processed by XSL templates which can be described as follows:
Step 1: High-level GUI rendering
• Summary Browser tree Rendering (BR). This query-specific transformation (XSL-T file) builds up the tree structure that will be shown on the final HTML page. Each node of the tree is denoted by a title, an icon and a link, but the tree is still coded in high-level XML, not in HTML. This transformation includes the "content location" transformation.
• Content Location (CL). This transformation defines a template that may be called by the query-specific XSL-T file when a node contains rich information to show besides the tree. The template embeds the content (defined by the result of the query) into a uniquely identified element for later rendering; the parent node has to link to that identifier.
Step 2: Final GUI rendering (platform-specific)
• Platform-specific Page Rendering (PR). This transformation renders the DHTML (HTML + JavaScript) code of the summary browser containing the tree structure obtained from the BR transformation.
• Content Rendering (CR). This transformation is a second pass over the output of the BR transformation. First, it extracts the embedded content elements generated by the Content Locator in order to delegate the rendering to specific templates (such as an XSL file for each sport genre, defining the representation of the corresponding events and other types of objects). Second, it resolves the links between the rendered tree nodes and the rendered content within the final HTML page.
The strengths of this dynamic GUI rendering design are:
• The final (client-platform specific) layout is abstracted away from the first step of rendering. Because the first-step transformations are query-specific, defining only how the results of a given query are represented, they can be reused on any platform; support for new client platforms, or changes to the GUI layout, only require different implementations of the second rendering step.
• The rendering of content is delegated to sport-specific templates, so it is easy to add support for new sports without affecting the main transformation engine.
• The separation of the tree and the linked content makes it possible to implement a "PULL" version of the Summary Browser that downloads content on demand from the server. This strategy is faster and consumes less bandwidth than "PUSH"-ing content from the server, which requires everything to be downloaded at once.
• Using XSL-T for this process makes the application more interoperable as it does not rely on a particular programming language or platform.
Figure 15. Typical Interaction Scenario
6.4. Users Evaluation
The system has been evaluated against 20 users, using observation and a questionnaire, to verify whether the retrieval scheme is effective and powerful in meeting users' requirements. The users' profiles are: 5 sports fans who are keen to spend time watching the whole match (profile 1), 3 sports fans who just like to watch highlights or a summarized version (profile 2), 4 sports viewers who hardly watch any games (profile 3), and 2 TV viewers who dislike sport (profile 4, included to obtain feedback on how to make sport/soccer more interesting to watch). To quantify users' answers and to enable statistics, we used closed questions that can be answered with 1-3 scale ratings (1: Disagree, 2: Agree, 3: Strongly agree). When users answered 1 or 2, they provided suggestions for improvements. Table II presents the results of the questionnaire, which show that, on average, users agree our browsing scheme is useful and meets their requirements.
Questionnaire:
Summary 1
Q1. Each Track contains sufficient information to understand what is going on – without
having to refer to previous/next tracks (i.e. self-contained like a CD track)
Q2. The tracks and play-break segmentation is useful and enhancing my viewing
experience
Q3. The key frames sufficiently describe the event
Q4. If I can browse key frames which belong to the whole match, it will help me to
decide which match I’d be keener to watch
Q5. The track and play-breaks are also useful for the other sports
Q6. By browsing tracks and play-breaks, I can still get the same details as if I watch the
whole match, except now I have the choice of selecting/skipping scenes - like selecting
tracks in a CD audio
Q7. The tracks-based browsing scheme is effective to get the essence of a match without
missing any details
Summary 2
Q8. The event summaries give me enough highlights about the match.
Q9. The statistics, such as the break/excitement ratios, are descriptive and help me to choose the more interesting events.
Summary 3
Q10. The customized summary is better or more useful than the other summaries so far
since it allows me to choose what I want to watch.
Q11. The amount of customization (i.e. from the user preference form) is sufficient and
suitable for minimum user intervention.
Summary 4
Q12. This summary is useful since I can specify my favorite matches (if I want to watch
them as a whole or use summary 1 or Summary 2 to browse)
Q13. (Although it is yet to be supported by the current prototype) I will find this summary useful if I can choose:
Summary 5
Q14. This summary is useful since I can watch my favorite players’ events
Summary 6
Q15. This summary is useful since I can watch all the events of my favorite teams,
grouped by all the players.
For Profile 4: Q16. Even though I am not a fan of sports, the browsing scheme (the Summary 1-6 queries) supported in this system would make me more interested in watching sports video
Profile     1    1    1    1    1    2    2    3    4    4    Average
Q1          3    2    2    2    2    2    1    2    3    2    2.1
Q2          3    1    3    2    2    2    2    3    3    1    2.2
Q3          3    2    3    1    2    3    1    3    3    2    2.3
Q4          3    3    3    1    2    3    2    3    3    2    2.5
Q5          3    2    3    1    3    1    2    3    3    2    2.3
Q6          3    2    3    2    2    3    2    2    3    2    2.4
Q7          3    2    3    2    2    1    2    3    3    2    2.3
Q8          3    2    3    3    3    3    3    2    3    3    2.8
Q9          3    1    3    3    2    3    1    2    3    1    2.2
Q10         3    3    3    3    2    3    3    3    3    3    2.9
Q11         3    3    3    2    1    3    2    2    3    3    2.5
Q12         3    2    3    3    3    2    2    3    3    2    2.6
Q13         3    2    3    3    3    2    3    2    3    2    2.6
Q14         3    3    3    3    3    3    3    3    3    3    3.0
Q15         3    2    3    3    2    2    3    2    3    3    2.6
Q16         -    -    -    -    -    -    -    -    2    3    2.5
Table II. Results from Users Evaluation
Based on users' feedback, Summary 3 (the user-preference based summary), Summary 5 (a player's event summary), Summary 4 (favorite matches and players) and Summary 2 (a match's event summaries) are particularly useful. It is worth noting that even users who do not like sports agree that the browsing scheme could make them more interested in watching sports video. Some users suggested additional summaries, such as browsing by year, by big (legendary) events, and by past or recent games. They would also like to see closer-view key frames and brief descriptions of each of the available summaries.
Figure 16. XSL transformation and rendering for dynamic GUI generation
Figure 17. Data flow from the XML query result to the final HTML page
7. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel sports video retrieval system which has been
fully implemented and evaluated. The system essentially uses a segment-event-object based video indexing model which is designed using semi-schema-based and object-relationship modeling schemes. The schema is implemented as an XML schema (represented by ORA-SS notation), populated by XML, and utilized by XQuery to construct dynamic summaries and user-oriented retrievals. The proposed schema is scalable since it enables incremental development of algorithms for automatic feature-semantic extraction. Moreover, the schema does not need to be complete at one time
while allowing users to easily add additional elements without the need to know the
parent objects. The indexing model has emphasized the use of referencing relationships rather than nested objects in order to avoid redundant data and to achieve more efficient data updates. Referencing also allows the system to add segments, events, and objects without worrying about the hierarchy, thereby achieving more straightforward and
easily leveraged to integrate more MPEG-7 standard multimedia descriptions. The
implementation strategy that includes modular-based GUI design and multi-level
architecture has demonstrated that the proposed system is scalable and extensible.
Moreover, a web-based service enables multi-platform access, as long as the client platforms
support web browsing and video player software. The user evaluation has confirmed the
effectiveness of the current implemented browsing scheme which is generated
dynamically from the indexes.
The current implementation, which uses soccer, swimming, tennis and diving data to represent the four main categories of sports, has demonstrated that the proposed sports video retrieval framework should be robust for all sport genres. The graphical interface of the
system has been designed in a modular style to further enable future extensions.
Moreover, the interface modules can be scaled for different screen sizes in desktop and
mobile platforms.
For future work, we aim to extract more features and to improve the performance
results to support a more reliable automatic semantic interpretation. As a result, the
database will be rapidly expanded to include more data from other sports. We will
continually extend the retrieval module to include a wider variety of search strategies
while adding more dynamic summaries that will benefit users and applications. The
aesthetic of the user interface will be improved to enhance its effectiveness. Since
XQuery is still a working draft, we will need to revisit the proposed queries when it
becomes a final version. This is to ensure that all functionalities that we applied can be
used and potential improvements on the query effectiveness can be made. Last, but most
importantly, we aim to conduct a performance test on a large dataset (such as >500
videos) using the current XML database.
REFERENCES
ADALI, S., CANDAN, K.S., CHEN, S.-S., EROL, K. and SUBRAHMANIAN, V.S. 1996. The Advanced
Video Information System: Data Structures and Query Processing. Multimedia Systems, 4 (4). 172-186.
ADAMS, B., DORAI, C. and VENKATESH, S. 2002. Toward automatic extraction of expressive elements
from motion pictures: tempo. Multimedia, IEEE Transactions on, 4 (4). 472-481.
ASSFALG, J., BERTINI, M., COLOMBO, C., BIMBO, A.D. and NUNZIATI, W. 2003. Automatic extraction
and annotation of soccer video highlights. in Image Processing, 2003. Proceedings. 2003 International
Conference on, (2003), II-527-530 vol.523.
ASSFALG, J., BERTINI, M., DEL BIMBO, A., NUNZIATI, W. and PALA, P. 2002. Detection and recognition
of football highlights using HMM. in Electronics, Circuits and Systems, 2002. 9th International Conference on,
(2002), 1059-1062 vol.1053.
ASSFALG, J., BERTINI, M., DEL BIMBO, A., NUNZIATI, W. and PALA, P. 2002. Soccer highlights
detection and recognition using HMMs. in Multimedia and Expo, 2002. ICME '02. Proceedings. 2002 IEEE
International Conference on, (2002), 825-828 vol.821.
BABAGUCHI, N., KAWAI, Y. and KITAHASHI, T. 2002. Event based indexing of broadcasted sports video
by intermodal collaboration. Multimedia, IEEE Transactions on, 4 (1). 68-75.
BABAGUCHI, N., KAWAI, Y., YASUGI, Y. and KITAHASHI, T. 2000. Linking live and replay scenes in
broadcasted sports video. in ACM Workshop on Multimedia, (Los Angeles, California, United States, 2000),
ACM Press, 205-208.
BABAGUCHI, N. and NITTO, N. 2003. Intermodal collaboration: a strategy for semantic content analysis for
broadcasted sports video. in Image Processing, 2003. Proceedings. 2003 International Conference on, (2003),
13-16.
BABAGUCHI, N., OHARA, K. and OGURA, T. 2003. Effect of personalization on retrieval and summarization
of sports video. in Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim
Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference
on, (2003), 940-944.
BLAHA, M. and PREMERLANI, W. 1998. Object-oriented modeling and design for database applications.
Prentice Hall, Upper Saddle River, N.J.
BOAG, S., CHAMBERLIN, D., FERNANDEZ, M.F., FLORESCU, D., ROBIE, J. and SIMÉON, J. XQuery
1.0: An XML Query Language. W3C Working Draft, 29 October 2004, W3C.
BRUNDAGE, M. 2004. XQuery: the XML query language. Addison Wesley.
CHAIRSORN, L. and CHUA, T.-S. 2002. The Segmentation and Classification of Story Boundaries in News
Video. in 6th IFIP Working Conference on Visual Database Systems, (Brisbane, 2002), Kluwer, 94-109.
CHANG, S.-F., SIKORA, T. and PURI, A. 2001. Overview of the MPEG-7 standard. Circuits and Systems for
Video Technology, IEEE Transactions on, 11 (6). 688-695.
DIMITROVA, N., RUI, Y. and SETHI, I. 2001. Media Content Management. In Design and Management of Multimedia Information Systems: Opportunities and Challenges. S.M. Rahman ed. Idea Group Publishing.
DOBBIE, G., XIAOYING, W., LING, T.W. and LEE, M.L. 2000. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. Technical Report, Department of Computer Science, National University of Singapore.
EKIN, A. and TEKALP, A.M. 2003. Generic play-break event detection for summarization and hierarchical
sports video analysis. in International Conference on Mulmedia and Expo 2003 (ICME03), (2003), IEEE, 6-9
July 2003.
EKIN, A. and TEKALP, M. 2003. Automatic Soccer Video Analysis and Summarization. IEEE Transaction on
Image Processing, 12 (7). 796-807.
HAN, M., HUA, W., CHEN, T. and GONG, Y. 2003. Feature design in soccer video indexing. in Information,
Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia.
Proceedings of the 2003 Joint Conference of the Fourth International Conference on, (2003), 950-954.
HANJALIC, A. 2002. Shot-boundary detection: unraveled and resolved? Circuits and Systems for Video
Technology, IEEE Transactions on, 12 (2). 90-105.
HENG, W.J. and NGAN, K.N. 2002. Shot boundary refinement for long transition in digital video sequence.
Multimedia, IEEE Transactions on, 4 (1520-9210). 434-445.
KOSCH, H. 2004. Distributed Multimedia Database Technologies supported by MPEG-7 and MPEG-21. CRC
Press, Florida, USA.
LU, G.J. 1999. Multimedia database management systems. Artech House, Boston; London.
MANJUNATH, B.S., SALEMBIER, P. and SIKORA, T. 2002. Introduction to MPEG-7. Wiley, England.
MENG, H.M., TANG, X., HUI, P.Y., GAO, X. and LI, Y.C. 2001. Speech retrieval with video parsing for
television news programs. in Acoustics, Speech, and Signal Processing, 2001. Proceedings. 2001 IEEE
International Conference on, (2001), 1401-1404 vol.1403.
NEPAL, S., SRINIVASAN, U. and REYNOLDS, G. 2001. Automatic detection of 'Goal' segments in basketball
videos. in ACM International Conference on Multimedia, (Ottawa; Canada, 2001), ACM, 261-269.
NGAI, C.H., CHAN, P.W., YAU, E. and LYU, M.R. 2002. XVIP: an XML-based video information processing
system. in Computer Software and Applications Conference, 2002. COMPSAC 2002. Proceedings. 26th Annual
International, (2002), 173-178.
OH, J. and HUA, K.A. 2000. Efficient and cost-effective techniques for browsing and indexing large video
databases. in ACM SIGMOD Intl. Conf. on Management of Data, (Dallas, TX, 2000), ACM, 415-426.
OOMOTO, E. and TANAKA, K. 1997. Video Database Systems - Recent Trends in Research and Development
Activities. In The Handbook of Multimedia Information Management. W.I. Grosky, R. Jain and R. Mehrotra eds. Prentice Hall, Upper Saddle River, NJ, 405-448.
PEREIRA, F. 2001. MPEG-7 Requirements Document V.14. International Organisation for Standardisation, Coding of Moving Pictures and Audio, ISO/IEC JTC 1/SC 29/WG 11/N4035, Singapore.
PONCELEON, D., SRINIVASAN, S., AMIR, A., PETKOVIC, D. and DIKLIC, D. 1998. Key to effective
video retrieval: Effective cataloging and browsing. in IEEE International Workshop on Content-Based Image and Video Databases, (Bombay, India, 1998), IEEE Computer Society, 99-107.
RUI, Y., GUPTA, A. and ACERO, A. 2000. Automatically extracting highlights for TV Baseball programs. in
ACM International Conference on Multimedia, (Marina del Rey, California, United States, 2000), ACM, 105-115.
SATO, T., KANADE, T., HUGHES, E.K. and SMITH, M.A. 1998. Video OCR for digital news archive. in
Content-Based Access of Image and Video Database, 1998. Proceedings., 1998 IEEE International Workshop
on, (1998), 52-60.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2004. Integrating Highlights to Play-break Sequences
for More Complete Sport Video Summarization. IEEE Multimedia, Oct-Dec 2004. 22-37.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2004. The Power of Play-Break for Automatic
Detection and Browsing of Self-consumable Sport Video Highlights. in To appear in the 6th International ACM
Multimedia Information Retrieval Workshop, (New York, USA., 2004), ACM.
TJONDRONEGORO, D., CHEN, Y.-P.P. and PHAM, B. 2003. Sports video summarization using highlights
and play-breaks. in The Fifth ACM SIGMM International Workshop on Multimedia Information Retrieval (ACM MIR'03), (Berkeley, USA, 2003), ACM, 201-208.
TSENG, B.L., LIN, C.-Y. and SMITH, J.R. 2004. Using MPEG-7 and MPEG-21 for personalizing video.
Multimedia, IEEE, 11 (1). 42-52.
WU, C., MA, Y.-F., ZHANG, H.-J. and ZHONG, Y.-Z. 2002. Events recognition by semantic inference for
sports video. in Multimedia and Expo, 2002. Proceedings. 2002 IEEE International Conference on, (2002), 805-808.
XIE, L., CHANG, S.-F., DIVAKARAN, A. and SUN, H. 2002. Structure analysis of soccer video with hidden
Markov models. in Acoustics, Speech, and Signal Processing, 2002 IEEE International Conference on,
(Columbia University, 2002), 4096-4099.
XU, P., XIE, L. and CHANG, S.-F. 1998. Algorithms and System for Segmentation and Structure Analysis in
Soccer Video. in IEEE International Conference on Multimedia and Expo, (Tokyo, Japan, 1998), IEEE.
YU, X. 2003. Trajectory-based ball detection and tracking with applications to semantic analysis of broadcast
soccer video. in ACM MM 2003, (Berkeley, CA, USA, 2003), ACM, 11-20.
ZELNIK-MANOR, L. and IRANI, M. 2001. Event-based analysis of video. in Computer Vision and Pattern
Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, (The
Weizmann Institute of Science, 2001), 123-130.
ZHANG, H.J. (ed.), Content-based video browsing and retrieval. CRC press LLC, 1999.
APPENDIX 1: ORA-SS NOTATION
APPENDIX 2. FULL-SCALE DIAGRAM OF FIGURE 6
[Full-scale ORA-SS diagram of Figure 6: the Sport Video Library (database) schema, covering period-, set-point-, time- and performance-based sport videos, their segment collections (play, break, replay, excitement, face, goal-area, text, key frame and key audio segments), period-based event collections (e.g. Goal, with goal scorer, assist giver and team benefited), the semantic object collection (players, teams), the semantic and syntactic relation collections, and the overall (match), highlights, hierarchical and comprehensive (play-break sequence) summaries, annotated with ORA-SS relationship cardinalities.]